From Tiny Machine Learning to Tiny Deep Learning: A Survey

SHRIYANK SOMVANSHI, Texas State University, USA
MD MONZURUL ISLAM, Texas State University, USA
GAURAB CHHETRI, Texas State University, USA
ROHIT CHAKRABORTY, Texas State University, USA
MAHMUDA SULTANA MIMI, Texas State University, USA
SAWGAT AHMED SHUVO, Texas State University, USA
KAZI SIFATUL ISLAM, Texas State University, USA
SYED AAQIB JAVED, Texas State University, USA
SHARIF AHMED RAFAT, Texas State University, USA
ANANDI DUTTA, PH.D., Texas State University, USA
SUBASISH DAS, PH.D., Texas State University, USA

arXiv:2506.18927v2 [[Link]] 25 Jun 2025

The rapid growth of edge devices has driven the demand for deploying artificial intelligence (AI) at the edge,
giving rise to Tiny Machine Learning (TinyML) and its evolving counterpart, Tiny Deep Learning (TinyDL).
While TinyML initially focused on enabling simple inference tasks on microcontrollers, the emergence of
TinyDL marks a paradigm shift toward deploying deep learning models on severely resource-constrained
hardware. This survey presents a comprehensive overview of the transition from TinyML to TinyDL, encompassing
architectural innovations, hardware platforms, model optimization techniques, and software toolchains. We
analyze state-of-the-art methods in quantization, pruning, and neural architecture search (NAS), and examine
hardware trends from MCUs to dedicated neural accelerators. Furthermore, we categorize software deployment
frameworks, compilers, and AutoML tools enabling practical on-device learning. Applications across domains
such as computer vision, audio recognition, healthcare, and industrial monitoring are reviewed to illustrate
the real-world impact of TinyDL. Finally, we identify emerging directions including neuromorphic computing,
federated TinyDL, edge-native foundation models, and domain-specific co-design approaches. This survey
aims to serve as a foundational resource for researchers and practitioners, offering a holistic view of the
ecosystem and laying the groundwork for future advancements in edge AI.
CCS Concepts: • Computing methodologies → Machine learning; Deep learning; TinyML; Model
interpretability; • Applied computing → Predictive analytics.
Additional Key Words and Phrases: Tiny Machine Learning, Tiny Deep Learning, Edge AI, Embedded Deep
Learning

Authors’ Contact Information: Shriyank Somvanshi, Texas State University, San Marcos, USA, shriyank@[Link]; Md
Monzurul Islam, Texas State University, San Marcos, USA, monzurul@[Link]; Gaurab Chhetri, Texas State University,
San Marcos, USA, gaurab@[Link]; Rohit Chakraborty, Texas State University, San Marcos, USA, rohitchakraborty@
[Link]; Mahmuda Sultana Mimi, Texas State University, San Marcos, USA, qnb9@[Link]; Sawgat Ahmed Shuvo,
Texas State University, San Marcos, USA, sawgat@[Link]; Kazi Sifatul Islam, Texas State University, San Marcos, USA,
kazi_sifat@[Link]; Syed Aaqib Javed, Texas State University, San Marcos, USA, [Link]@[Link]; Sharif Ahmed
Rafat, Texas State University, San Marcos, USA, sarafat@[Link]; Anandi Dutta, Ph.D., Texas State University, San Marcos,
USA, [Link]@[Link]; Subasish Das, Ph.D., Texas State University, San Marcos, USA, subasish@[Link].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@[Link].
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM 1557-735X/2025/5-ART

J. ACM, Vol. 37, No. 4, Article . Publication date: May 2025.



ACM Reference Format:
Shriyank Somvanshi, Md Monzurul Islam, Gaurab Chhetri, Rohit Chakraborty, Mahmuda Sultana Mimi, Sawgat
Ahmed Shuvo, Kazi Sifatul Islam, Syed Aaqib Javed, Sharif Ahmed Rafat, Anandi Dutta, Ph.D., and Subasish
Das, Ph.D. 2025. From Tiny Machine Learning to Tiny Deep Learning: A Survey. J. ACM 37, 4 (May 2025),
38 pages. [Link]

1 Introduction
Tiny Machine Learning (TinyML) has emerged as a rapidly growing paradigm that brings machine
learning capabilities to severely resource-constrained edge devices. Traditionally, machine learning
models demanded significant computational resources, making their deployment on microcontroller
units (MCUs) and embedded platforms impractical. However, advances in hardware design, model
compression, and embedded inference have allowed real-time intelligence to be embedded on-device,
leading to a new class of systems that execute complex analytics at the edge. As the field
evolves, a distinct subdomain called Tiny Deep Learning (TinyDL) has gained momentum, focusing
specifically on deploying deep learning models, rather than shallow classifiers, on low-power,
ultra-constrained hardware.
TinyML is typically defined as the deployment of machine learning inference tasks on devices
operating under 1 mW of power, often with only 32 to 512 kB of Static Random-Access Memory
(SRAM) and constrained flash storage. These devices, which usually lack an operating system and
hardware accelerators for floating-point operations, are capable of performing real-time analytics
while meeting stringent energy and memory budgets [1–3]. TinyDL builds upon this foundation
by emphasizing the use of deep neural networks, such as convolutional and transformer-based
architectures, under similar constraints. This term, introduced as early as 2017 with just-in-time
inference frameworks like TinyDL [4], now encompasses a range of state-of-the-art models such
as MCUNet, EfficientNet-lite, and DistilBERT variants that deliver strong accuracy with memory
footprints below 1 MB and latency below 20 milliseconds [5].
The rise of TinyML and TinyDL is primarily driven by limitations inherent in traditional cloud-
based machine learning workflows. Cloud inference introduces unacceptable round-trip latencies
in time-sensitive applications such as autonomous driving, drones, and wearables [6]. Moreover,
transmitting sensor data to the cloud raises substantial privacy concerns in healthcare and industrial
Internet of Things (IoT) contexts, where data sovereignty and user trust are paramount [7].
Finally, the energy consumption required to constantly stream data to remote servers introduces
a prohibitive cost, especially for battery-powered devices [8]. By shifting inference (and, increasingly,
lightweight learning) onto the device, TinyDL enables ultra-low-latency responses, reduces
dependency on cloud connectivity, and enhances data privacy [1].
Initially, TinyML systems relied on shallow models such as linear classifiers, decision trees,
or single-layer perceptrons. These models, while lightweight, were unable to match the representational
power of deep neural networks and required extensive manual feature engineering,
particularly for audio and vision tasks [9]. The transition toward TinyDL was made possible
by several interrelated advances. First, architectural innovations such as depthwise separable
convolutions, inverted residuals, and attention mechanisms made it possible to reduce model
complexity without sacrificing accuracy [2]. Second, a suite of optimization techniques, including
quantization-aware training (QAT), structured pruning, knowledge distillation, and low-rank
factorization, dramatically reduced the runtime and memory demands of deep models [5]. Third, the
introduction of Neural Architecture Search (NAS) frameworks that co-optimize model topology
and deployment constraints, such as MCUNet and TinyNAS, has demonstrated that ImageNet-scale
tasks can be executed on MCUs with just 480 kB of SRAM [10]. Additionally, new developments
in on-device and continual learning allow models to adapt in real time under strict memory and
compute constraints, further extending the practicality of TinyDL systems [11].
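To make the first of these advances concrete, the sketch below counts the weights of a standard convolution against those of a depthwise separable one. The layer shape (3 x 3 kernel, 64 input channels, 128 output channels) is an illustrative assumption rather than a figure from the surveyed papers, and bias terms are ignored.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer shape, chosen only for illustration.
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(f"standard: {std} weights, separable: {sep} weights, ratio: {std / sep:.2f}x")
```

For this assumed shape the separable layer needs roughly an eighth of the weights. The saving grows with the kernel size and the number of output channels, which is one reason such blocks are attractive when flash is measured in kilobytes.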

1.1 Objectives and Scope of This Survey


This survey aims to provide a comprehensive and timely synthesis of the emerging landscape
of TinyDL. While several reviews have previously outlined the evolution of TinyML and its
applications up to 2022 [3, 8, 12], they generally focus on classical machine learning or do not
sufficiently distinguish TinyDL as a distinct subfield. Our work addresses this gap by emphasizing
deep models and techniques tailored for kilobyte-scale environments. We highlight developments
occurring through 2025 and offer an integrated perspective that spans model design, software
toolchains, hardware platforms, and deployment strategies. This includes insights from academic
research, open-source benchmarks, and industrial deployment case studies. Moreover, we identify
critical gaps in current research, such as the lack of support for federated learning, the security of
over-the-air updates, and the absence of robust benchmarks for TinyDL systems, and propose a
structured agenda for future work.

1.2 Summary of Contributions


To guide this discussion, we contribute a unified definition and taxonomy that clearly delineates
TinyDL from traditional TinyML, incorporating hardware constraints and algorithmic characteristics.
We offer a comprehensive literature synthesis derived from over 200 sources, structured
around recent advances in model architectures, NAS methods, toolchains, and application domains.
In addition, we propose a benchmarking framework for evaluating TinyDL systems, incorporating
metrics such as inference latency, memory usage, model size, and energy efficiency. As a companion
to this paper, we also release awesome-tinyml¹, a curated, automatically updated open-source
repository of TinyML research papers, tools, frameworks, and tutorials to support community
knowledge sharing. Finally, we present a research roadmap that highlights open questions around
neuromorphic TinyDL, domain-specific accelerators, compiler–hardware co-design, and
privacy-preserving on-device learning. Through these contributions, this survey aims to support both
newcomers and experienced researchers in navigating and contributing to the evolving field of
TinyDL.
The remainder of this paper is structured as follows: Section 2 introduces TinyML and TinyDL
concepts; Section 3 reviews hardware platforms and benchmarks; Section 4 outlines the evolution
from TinyML to TinyDL; Section 5 presents lightweight deep learning architectures; Section 6
discusses software toolchains and deployment frameworks; Section 7 highlights key applications
across domains; Section 8 explores on-device learning methods; Section 9 covers evaluation metrics
and datasets; Section 10 discusses ongoing research challenges; Section 11 suggests future directions;
and Section 12 concludes the paper.

2 Background and Foundational Concepts


2.1 Tiny Machine Learning
TinyML has emerged as a transformative field within artificial intelligence, characterized by the
deployment and execution of machine learning models on highly resource-constrained embedded
devices, particularly MCUs. This paradigm facilitates on-device data processing and inference,
thereby pushing intelligence to the very edge of networks [8, 12]. The fundamental aim of TinyML
is to enable sophisticated analytical capabilities directly on hardware platforms that are severely
limited in terms of their available resources. In doing so, it supports applications requiring low
latency, minimal power consumption, and enhanced data privacy by keeping data local [8, 9, 13].

¹ [Link]

The operational landscape of TinyML is shaped by three critical constraints: memory, power, and
compute limitations. Firstly, memory availability is exceptionally scarce. Devices typically include
SRAM ranging from a few kilobytes to several hundred kilobytes for runtime operations, and Flash
memory often under one megabyte for storing program code and ML models, though in rare cases
this may reach up to 2 MB [9, 12, 14–16]. This represents a stark contrast to conventional computing
platforms and necessitates the use of highly compact models [14]. Secondly, power consumption
is a paramount constraint. TinyML devices typically operate on ultra-low power budgets, often
in the milliwatt or even microwatt range [8, 9, 12, 15, 16]. This is essential for battery-powered
or energy-harvesting systems designed for long-term operation without frequent recharging or
maintenance [8, 15]. Thirdly, compute capabilities in these devices are limited. The MCUs generally
operate at clock speeds of several tens to a few hundred megahertz, and many lack Floating Point
Units (FPUs), which further constrains the deployment of typical ML models unless optimized
through quantization techniques [9, 12, 16]. These hardware limitations necessitate lightweight
and efficient models capable of running within constrained environments.
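The memory constraint can be turned into a quick feasibility check. The sketch below is a rough estimate only: it assumes weights live in flash and activations in SRAM, and approximates peak SRAM as the largest input-plus-output buffer of any single layer. The `Layer` class, the three-layer model, and the 1 MB flash / 256 kB SRAM device are hypothetical values chosen for illustration; real runtimes add code size and scratch-buffer overhead that is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    params: int      # number of weights
    in_elems: int    # input activation elements
    out_elems: int   # output activation elements


def fits_on_mcu(layers, bytes_per_value, flash_bytes, sram_bytes):
    """Rough fit check: weights in flash, activations in SRAM.

    Returns (fits, flash_needed, sram_needed), all sizes in bytes.
    """
    flash_needed = sum(layer.params for layer in layers) * bytes_per_value
    sram_needed = max(layer.in_elems + layer.out_elems for layer in layers) * bytes_per_value
    fits = flash_needed <= flash_bytes and sram_needed <= sram_bytes
    return fits, flash_needed, sram_needed


# Hypothetical model and device budget (not taken from the survey).
model = [
    Layer(params=100_000, in_elems=9_216, out_elems=36_864),
    Layer(params=400_000, in_elems=36_864, out_elems=18_432),
    Layer(params=80_000, in_elems=18_432, out_elems=10),
]
FLASH, SRAM = 1024 * 1024, 256 * 1024

ok_fp32, flash_fp32, sram_fp32 = fits_on_mcu(model, 4, FLASH, SRAM)  # 32-bit floats
ok_int8, flash_int8, sram_int8 = fits_on_mcu(model, 1, FLASH, SRAM)  # 8-bit integers
print("fp32:", ok_fp32, flash_fp32, sram_fp32)
print("int8:", ok_int8, flash_int8, sram_int8)
```

Under these assumptions the 32-bit model overflows flash while the 8-bit version fits, which illustrates why quantization is usually the first optimization applied.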
The hardware ecosystem supporting TinyML primarily consists of low-power MCUs integrated
with sensors that gather environmental data. Prominent examples include the ARM Cortex-M
series, such as the Cortex-M0, M4, and M7, which strike a balance between computational efficiency,
power consumption, and cost [3, 9, 17]. Other widely used platforms include the STM32 and ESP32
families [8, 16]. These MCUs are often paired with application-specific sensors, such as inertial
measurement units (IMUs) for motion tracking, microphones for voice command recognition, and
low-resolution cameras for vision tasks with constrained compute budgets [3, 9]. This combination
of efficient hardware and targeted sensors empowers TinyML to bring intelligence into everyday
objects, from wearables to smart infrastructure.

2.2 Tiny Deep Learning


TinyDL represents a specialized subfield within the broader TinyML domain, specifically concentrating
on the adaptation and deployment of deep learning models, such as Convolutional Neural
Networks (CNNs) and Recurrent Neural Networks (RNNs), onto extremely resource-constrained
hardware, most notably MCUs [5, 18]. The primary objective of TinyDL is to enable sophisticated
tasks like image classification, object detection, and real-time gesture recognition on these
low-power devices [16, 19].
A key characteristic of TinyDL is the utilization of highly compressed deep models. These models
are meticulously optimized to fit within the stringent memory limitations of MCUs, often resulting
in model sizes of just a few hundred kilobytes [20, 21]. This substantial reduction is achieved
through various model compression techniques. Quantization, for instance, involves converting
the model’s parameters from higher-precision floating-point numbers to lower-precision integers,
such as 8-bit integers, thereby shrinking the model size with minimal degradation in accuracy
[14, 16, 19]. Another prevalent technique is pruning, which systematically removes redundant
parameters or connections within the neural network to create sparser and more compact models
[5, 16]. Furthermore, TinyDL models are designed for real-time inference. This means they can
process data and provide outputs almost instantaneously on the device itself, which is crucial for
applications requiring immediate responses [16, 19].
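The two compression techniques described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the method of any cited work: it assumes symmetric per-tensor quantization to the int8 range and unstructured magnitude pruning, which are only two of several common variants, and the weight values are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [v * scale for v in q]


def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Made-up weights for illustration.
weights = [0.52, -1.27, 0.03, 0.9, -0.08, 0.31, -0.64, 0.01]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
pruned = magnitude_prune(weights, sparsity=0.5)

print("int8 values:", q)
print("max round-trip error:", max_err)
print("pruned weights:", pruned)
```

Each quantized weight occupies one byte instead of four, and the round-trip error is bounded by half the quantization step. Pruned zeros only save memory or compute when the storage format or kernel exploits the sparsity, which is why structured pruning is often preferred on MCUs.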
TinyDL differs significantly from conventional Deep Learning. Conventional DL typically relies
on powerful computing resources like GPUs and extensive memory (often gigabytes) to train and
run large, complex models, with the primary goal of achieving the highest possible accuracy [16, 17].
In contrast, TinyDL operates under severe hardware limitations, prioritizing on-device efficiency,
low power consumption (milliwatts or microwatts), and minimal memory footprint (kilobytes)
[5, 16]. Recent advancements in TinyDL have introduced neural network architectures specifically
designed for edge execution, including MobileNet, SqueezeNet, and Tiny-YOLO [22–24]. These
models are tailored to execute with fewer floating-point operations and reduced parameter counts,
enabling inference in real time even on MCUs without FPUs [12, 16]. Furthermore, hardware-aware
NAS and energy-efficient training paradigms are being actively explored to enhance model
deployment on edge platforms [8]. Within the context of TinyML, TinyDL is a specialized area.
TinyML encompasses all machine learning techniques, including classical algorithms, that can
be deployed on resource-limited devices [25]. TinyDL, however, specifically addresses the more
demanding challenge of implementing and running inherently more complex and resource-intensive
deep learning models under these same severe constraints [5, 20].

2.3 Workflow in TinyML and TinyDL


The development of TinyML solutions involves distinct workflows that bridge the creation of
machine learning models with their deployment on resource-constrained hardware. Three primary
approaches are recognized in the literature: the ML-oriented workflow, the hardware-oriented
workflow, and the co-design workflow [12]. These workflows are illustrated in Figure 1, which
provides a comparative overview of the design focus, optimization stages, and implementation flow
for each approach. The ML-oriented workflow is primarily driven by machine learning practitioners.
This approach commences with the design, training, and validation of an ML model suited for
the specific problem, often initially disregarding the precise limitations of the target hardware
to maximize performance and generalization [12, 15]. Following this, the model undergoes an
optimization phase, where techniques like pruning and quantization are applied to reduce its size
and computational demands to meet the hardware constraints [12, 15]. The final steps involve
deploying the optimized model to the target device and evaluating its real-world performance [12].

[Figure 1: pipeline diagram. Resource-rich device: Dataset -> Model Training (1. Model Selection, 2. Training) -> Model Evaluation -> Model Optimizing (1. Pruning, 2. Knowledge Distillation, 3. Quantization). Microcontroller units: Ported Model (1. Manual Programming, 2. Code Generation, 3. TinyML Interpreters) -> Model Inference on preprocessed sensor data obtained from real-world data via a sensor.]

Fig. 1. TinyML Pipeline [15]


Conversely, the hardware-oriented workflow is led by hardware engineers. The main focus here
is on designing new, or enhancing existing, hardware platforms such as MCUs or specialized
accelerators, to execute ML algorithms with greater efficiency [12]. This workflow involves identifying
computational bottlenecks in current hardware when running ML tasks and then architecting
specific hardware solutions to mitigate these issues, thereby improving throughput and reducing
power consumption for ML workloads [12]. The co-design workflow represents a more integrated
and increasingly pivotal approach. In this paradigm, ML experts and hardware engineers collaborate
closely from the project’s inception [12]. Unlike the sequential nature of the ML-oriented or
hardware-oriented workflows, co-design involves the intertwined and simultaneous optimization of both the
ML model and the hardware architecture [5, 12]. This holistic methodology aims to achieve optimal
synergy between software and hardware, potentially unlocking performance and efficiency gains
unattainable through isolated optimization efforts, and is considered crucial for advancing the
capabilities of TinyML systems [12]. Furthermore, a comparative summary of TinyML, TinyDL, and
Edge AI paradigms is presented in Table 1, highlighting their design scope, hardware requirements,
model complexity, and implementation focus.

Table 1. Comparison of TinyML, TinyDL, and Edge AI

Feature: Scope
  TinyML: Subset of Edge AI; ML on extremely resource-constrained devices, typically MCUs [9, 12]
  TinyDL: Subset of TinyML; specifically deploying deep learning models on these extremely resource-constrained MCUs [5, 20]
  Edge AI: Broadest category; AI processing near the data source, on devices ranging from gateways and edge servers to MCUs [8, 26]

Feature: Typical Hardware
  TinyML: MCUs (e.g., ARM Cortex-M series, ESP32), DSPs [9, 16]
  TinyDL: Same MCUs, occasionally with minimal accelerators [17, 19]
  Edge AI: Edge servers, GPUs (e.g., NVIDIA Jetson), FPGAs, powerful SoCs, capable MCUs/MPUs [8, 9]

Feature: Model Complexity
  TinyML: Classical ML and highly compressed DL models (kB to ≲ few MB) [9, 14]
  TinyDL: Deep networks heavily quantized or pruned to fit kB-scale memory [19, 20]
  Edge AI: Ranges from simple ML to complex DL, hardware-dependent [8]

Feature: Primary Goals
  TinyML: Enable ML on ultra-low-power, low-cost devices; maximize battery life; ensure data privacy [8, 9]
  TinyDL: Bring sophisticated DL to the most constrained devices, pushing efficiency limits [5, 16]
  Edge AI: Reduce latency and bandwidth, improve privacy, enable real-time analytics at the edge [8, 26]

Feature: Power Consumption
  TinyML: Typically in the mW–µW range [8, 16]
  TinyDL: Same as TinyML, with tight per-inference budgets [5, 17]
  Edge AI: Wide range: watts (edge servers) to milliwatts (embedded) [8]

Feature: Data Processing
  TinyML: Strictly on-device at MCU level [9]
  TinyDL: Same scope, on-device MCU inference [19]
  Edge AI: Local edge servers, gateways, or capable end devices [8]

Feature: Key Characteristic
  TinyML: ML at the “tiniest” compute scale, adding intelligence to everyday objects [9]
  TinyDL: Neural networks at kB-scale, requiring extreme optimization [19]
  Edge AI: Decentralized AI that moves computation out of the cloud [26]

Feature: Relationship
  TinyML: Specialized subset of Edge AI, focused on the resource-limited extreme [8]
  TinyDL: Specialized subset of TinyML, centered on deep learning algorithms [20]
  Edge AI: Superset covering non-cloud AI workloads [26]

To synthesize the transition from TinyML to TinyDL, Figure 2 provides a side-by-side comparison
across four key dimensions: model size, hardware platforms, optimization techniques, and
representative application domains. While TinyML typically involves deploying classical machine
learning models under 250 KB on low-power MCUs, TinyDL enables compressed deep learning
models to run on resource-constrained devices through advances in hardware accelerators and
model optimization techniques. TinyDL leverages QAT, NAS, hardware-aware quantization (HAQ),
and knowledge distillation to maintain high accuracy within stringent memory and energy budgets.
As shown, this progression expands the use cases from simple tasks like gesture recognition or
electrocardiogram (ECG) monitoring in TinyML to more complex applications such as speech recognition,
vision-based inference, and autonomous systems in TinyDL.

[Figure 2: two-column comparison of TinyML and TinyDL.
Model size: TinyML ≤ 250 KB; TinyDL typically hundreds of KB to under 1 MB.
Hardware: TinyML MCU (low power), low-power sensors; TinyDL MCU (enhanced compute), NPUs, edge accelerators.
Optimization: TinyML PTQ, pruning, manual feature engineering; TinyDL QAT, NAS, HAQ, distillation, federated learning.
Use cases: TinyML gesture, ECG, HAR, anomaly detection; TinyDL speech, NLP, object detection, drones.
Legend: Post-Training Quantization (PTQ), Hardware-Aware Quantization (HAQ).]

Fig. 2. Comparative summary of TinyML and TinyDL across four key aspects: model size, hardware platforms,
optimization techniques, and representative use cases

3 Hardware Platforms for TinyML and TinyDL
TinyML and TinyDL applications run on highly resource-constrained devices. This section surveys
the hardware platforms enabling TinyML, from general MCUs to new specialized AI accelerators,
and how these platforms are evaluated. Despite their modest specs, these devices can perform
meaningful ML tasks at the edge by balancing performance, power, and accuracy through careful
design and benchmarking.

3.1 MCUs
MCUs form the backbone of many TinyML deployments, bringing intelligence to the extreme
edge. They are single-chip computers designed for low power operation, often running on batteries
in remote sensors or wearables [27]. MCUs operate under strict hardware constraints: low clock
speeds (typically on the order of tens of MHz) and limited on-chip memory (often only tens to a
few hundred kilobytes of RAM). For example, a typical MCU might have approximately 128 KB of
RAM and 1 MB of flash storage, versus the gigabytes of memory and storage on a modern
smartphone [28]. Because they usually run on small batteries, energy efficiency is paramount – TinyML
devices must consume mere milliwatts or microwatts to run for long periods [27]. Despite these
limitations, MCUs are capable of running surprisingly complex ML workloads when models are
optimized. Continuous improvements in MCU hardware (e.g. more efficient 32-bit ARM Cortex-M
processors) have made them powerful and energy-efficient enough to handle small neural networks
within tight power budgets [17]. In other words, it is now feasible to deploy machine learning
on battery-operated MCU-based sensors in the field. Equally important is the software toolchain
that supports MCUs. Frameworks like TensorFlow Lite for Microcontrollers (TFLite Micro) and platforms
like Edge Impulse allow developers to compress and deploy trained models on tiny devices. For
instance, TFLite Micro can convert a neural network into a form that runs in as little as 16 KB of
RAM on an MCU. Techniques such as 8-bit quantization and pruning are used to shrink model size
and computation so that inference can execute in real time on limited CPU and memory.

Key Hardware Constraints for MCU-based TinyML

Clock Speed: Often 20–200 MHz clock rates (much lower than GHz-class processors) [27],
which limits the raw compute throughput of MCUs.
Memory: On the order of kilobytes to a few hundred kilobytes of RAM and usually under
a few MB of flash storage [29]. This means models must be very small and efficient.
Power/Battery: Designed for ultra-low power use; many MCU systems run on coin-cell
batteries or energy harvesters. Power consumption is in the milliwatt or even microwatt
range, so the device can operate for months or years [29].

Even with these constraints, MCUs have demonstrated the ability to run useful ML inference
tasks at the edge. By using optimized models, an MCU can perform tasks like keyword spotting
(KWS), gesture recognition, anomaly detection, or simple image classification entirely on-device
[30]. For example, researchers have successfully deployed a voice wake-word detector and simple
vision models on tiny boards like the Espressif ESP32 and STMicroelectronics STM32 series MCUs.
The ESP32 (a dual-core MCU up to 240 MHz with approximately 520 KB RAM) and various
STM32 Cortex-M variants (e.g. an M7 at 216 MHz with a few hundred KB of RAM) are popular
choices that, with quantized models, can handle basic deep learning tasks under tight memory and
energy constraints. Recent studies and white papers highlight that the combination of improving
MCU hardware and clever model optimizations enables these chips to support machine learning
workloads that were once thought impossible on such limited devices [27]. MCUs provide a flexible,
low-cost platform for TinyML, albeit one that demands extreme efficiency in model design.

3.2 Specialized AI Hardware


While many TinyML applications run on general-purpose MCUs, a new class of specialized AI
hardware has emerged to push the boundaries of performance and efficiency for tiny and edge
deployments. These are application-specific chips and co-processors built specifically to handle
neural network inference with minimal energy. By using custom digital logic (and in some cases
analog techniques), they act as Neural Compute Engines that drastically accelerate ML tasks
compared to a software-only MCU approach. The following subsubsections describe notable
examples of AI-focused hardware designed for low-power, on-device deep learning:

3.2.1 Google Edge Tensor Processing Unit. The Google Edge Tensor Processing Unit (TPU) is a
small application-specific integrated circuit (ASIC) designed by Google to accelerate TensorFlow
Lite models at the edge. Each Edge TPU can perform 4 trillion operations per second (4 TOPS) while
consuming about 2 W of power (roughly 2 TOPS/W) [31]. In practical terms, an Edge TPU can run
vision models like MobileNet V2 at nearly 400 frames per second in a power-efficient manner [32],
far beyond what a typical MCU could achieve. These chips often come as co-processors (e.g., in the
Coral Edge TPU USB sticks or M.2 modules) that pair with an MCU or microprocessor, offloading the
heavy math of neural networks. By handling matrix multiplications and convolutions in dedicated
hardware, the Edge TPU enables real-time image and audio inference on the edge device with
minimal latency and modest power use.
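The figures quoted above can be checked with simple arithmetic. The sketch below uses only the numbers stated in this subsection (4 TOPS, about 2 W, nearly 400 frames per second); the helper function names are hypothetical, and the per-frame energy is an upper bound that assumes the chip draws the full 2 W throughout, which may not hold in practice.

```python
def tops_per_watt(tops: float, watts: float) -> float:
    """Compute efficiency in tera-operations per second per watt."""
    return tops / watts


def frame_latency_ms(fps: float) -> float:
    """Average time per frame implied by a sustained frame rate."""
    return 1000.0 / fps


def energy_per_frame_mj(watts: float, fps: float) -> float:
    """Upper-bound energy per frame in millijoules, assuming constant power draw."""
    return watts / fps * 1000.0


efficiency = tops_per_watt(4.0, 2.0)
latency = frame_latency_ms(400.0)
energy = energy_per_frame_mj(2.0, 400.0)
print(f"{efficiency:.1f} TOPS/W, {latency:.2f} ms/frame, {energy:.2f} mJ/frame")
```

The result is about 2 TOPS/W, 2.5 ms per frame, and at most roughly 5 mJ per frame.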


3.2.2 Syntiant Neural Decision Processors Series. Syntiant’s Neural Decision Processors (NDP) are
ultra-low-power neural accelerators aimed at always-on workloads like KWS and sensor analytics.
They use a custom deep neural network inference engine that runs models efficiently with parallel
multiply-accumulate (MAC) units and an optimized data path for minimal idle cycles [31]. For
example, the Syntiant NDP120 can continuously listen for voice commands using only a few
microwatts. In MLPerf Tiny benchmark tests, the NDP120 was able to perform a KWS inference in
about 4.3 ms while consuming only 35 µJ of energy per inference (at 30 MHz operation) [33]. This is
orders of magnitude more energy-efficient than running the same task on a generic MCU. The NDP
chips achieve this because they are ASICs optimized for neural workloads: they store and process neural
network layers on-chip to avoid costly memory accesses, and they integrate small digital signal
processor (DSP) cores for preprocessing tasks. Syntiant’s platform demonstrates how specialized
silicon can deliver real-time AI within a milliwatt power budget.
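As a rough cross-check of the figures quoted above, the following sketch converts energy per inference and latency into average power. The function names and the duty cycle of one inference per second are illustrative assumptions, and idle or sleep power is ignored.

```python
def average_power_mw(energy_uj: float, latency_ms: float) -> float:
    """Average power in mW while an inference is running (uJ / ms = mW)."""
    return energy_uj / latency_ms


def duty_cycled_power_uw(energy_uj: float, inferences_per_second: float) -> float:
    """Long-run average power in uW, ignoring idle power (an assumption)."""
    return energy_uj * inferences_per_second


# Figures quoted in the text for the NDP120 keyword-spotting result.
active_mw = average_power_mw(energy_uj=35.0, latency_ms=4.3)

# Hypothetical duty cycle: one inference per second.
average_uw = duty_cycled_power_uw(energy_uj=35.0, inferences_per_second=1.0)

# Edge TPU efficiency from the text: 4 TOPS at about 2 W.
edge_tpu_tops_per_watt = 4.0 / 2.0

print(f"NDP120 power during inference: {active_mw:.2f} mW")
print(f"NDP120 average at 1 inference/s: {average_uw:.0f} uW")
print(f"Edge TPU efficiency: {edge_tpu_tops_per_watt:.1f} TOPS/W")
```

Under these assumptions, the power drawn while an inference is running is in the milliwatt range, so the long-run average depends strongly on how often inferences are triggered.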

3.2.3 Himax WiseEye WE-I Plus. The Himax WE-I Plus (HX6537-A) is an example of an AI-enabled
MCU/ASIC tailored for vision and sensor inferencing at the edge. It combines a 400 MHz DSP with
dedicated hardware accelerators (for tasks like image processing, HOG feature extraction, and JPEG
encoding) in an ultra-low-power design [34]. Uniquely, the WE-I Plus is event-driven: it stays in a
near-standby mode until its camera or sensor accelerator detects a trigger (e.g., motion or a person
in view), then the DSP wakes to run a neural network inference [34]. This architecture is highly
power-efficient. In fact, when running a person-detection CNN (TinyML vision model), the average
power consumption can be under 5 mW, an exceptionally low figure for an image recognition task.
By leveraging an ASIC with built-in neural accelerators, the Himax WE-I Plus achieves real-time
vision inference (e.g., detecting human presence in a frame) using only a fraction of the energy
that a general-purpose MCU would require for the same task [34].
These examples illustrate the importance of custom AI silicon for TinyML. Specialized edge AI
chips like the Edge TPU, Syntiant NDP, Himax WE-I, as well as others (e.g., Intel’s Movidius Myriad
X visual processing unit and various analog neural chips), focus on the common computational
patterns of ML algorithms. By implementing neural network operations (matrix multiplies, convo-
lutions, etc.) in hardware, they achieve far higher throughput per watt than a CPU. Innovations
such as parallel MAC arrays, on-chip memory for weights/activations, and streamlined dataflows
allow these ASICs to perform inference with minimal wasted energy. Many also include features
like built-in DSPs or camera interfaces to handle sensor data directly. The result is low-power,
real-time inference: tasks like wake-word detection or gesture recognition can run continuously
on the edge without exhausting a battery. Table 2 summarizes key TinyML hardware platforms,
including MCUs and neural accelerators, along with their specifications and typical use cases.

3.3 Benchmarking and Evaluation


Standardized benchmarking frameworks play a critical role in enabling fair and consistent evaluation
of TinyML hardware platforms. They provide a unified basis for comparing different systems under
constraints of speed, power consumption, and inference accuracy. Two widely adopted benchmark
suites in this space are MLPerf Tiny and EEMBC MLMark.
MLPerf Tiny, developed by MLCommons in collaboration with EEMBC, is specifically designed
for ultra-low-power AI systems. It defines four core tasks representative of common TinyML
applications: keyword spotting (KWS), Visual Wake Words (VWW), image classification on low-
resolution inputs, and anomaly detection using sensor data [35]. These tasks are performed using
compact neural networks, typically under 250 kB in size. MLPerf Tiny evaluates system performance
in a single-stream inference mode, mimicking real-time sensor workloads. It reports latency per
inference and model accuracy, and also includes an optional energy consumption metric through
EEMBC’s EnergyRunner harness [36]. This allows the benchmark to quantify both throughput
and power efficiency (e.g., energy per inference), enabling comprehensive comparisons across
platforms.
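The single-stream measurement described above can be sketched as follows. This is not the MLPerf Tiny or EnergyRunner harness; the function names, the warm-up count, and the stand-in model are assumptions made for illustration.

```python
import statistics
import time
from typing import Callable, Sequence


def single_stream_latency_ms(infer: Callable[[object], object],
                             samples: Sequence[object],
                             warmup: int = 3) -> float:
    """Median per-inference latency in ms, feeding one sample at a time."""
    for sample in samples[:warmup]:          # warm-up runs are not timed
        infer(sample)
    latencies = []
    for sample in samples:
        start = time.perf_counter()
        infer(sample)
        latencies.append((time.perf_counter() - start) * 1e3)
    return statistics.median(latencies)


def energy_per_inference_uj(avg_power_mw: float, latency_ms: float) -> float:
    """Energy per inference in uJ (mW * ms = uJ)."""
    return avg_power_mw * latency_ms


# Stand-in "model": any callable works; a real run would call the deployed network.
dummy_samples = [[0.1] * 64 for _ in range(20)]
median_ms = single_stream_latency_ms(sum, dummy_samples)
print(f"median latency: {median_ms:.4f} ms")
```

The median is used here because it is less sensitive to occasional scheduling outliers than the mean; on a real device the power figure would come from an external measurement.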
EEMBC MLMark serves a broader purpose by benchmarking general embedded ML inference. It
provides a standardized methodology for measuring inference latency and accuracy using fixed
models and datasets, ensuring reproducibility across different hardware platforms [37]. All imple-
mentations must use provided test harnesses or disclose optimizations, enforcing a level playing
field. While MLMark does not include built-in energy metrics—delegating such measurements
to benchmarks like EEMBC’s ULPMark for microcontroller efficiency—it remains valuable for
assessing system throughput and model performance [38].
Together, MLPerf Tiny and MLMark offer complementary strengths. MLPerf Tiny incorporates
energy-aware metrics essential for power-constrained TinyML deployments, while MLMark pro-
vides broader coverage and a reproducible baseline for embedded ML systems. For example, MLPerf
Tiny may reveal that a specialized neural accelerator achieves five times lower latency or ten
times greater energy efficiency than a baseline MCU when running the same KWS model [31].
Similarly, MLMark can track accuracy and inference improvements as new microcontrollers, neural
processing units (NPUs), or software frameworks are introduced [37].
These benchmarking tools have become indispensable for researchers and system designers. They
enable rigorous, quantitative evaluation of performance, energy usage, and inference quality. As
TinyML continues to evolve, these frameworks are expected to adapt by supporting larger models,
more diverse sensor modalities, and increasingly fine-grained energy profiling. Such standardized
evaluation ensures that TinyML hardware solutions remain responsive to real-world application
demands in resource-constrained environments.

Table 2. Examples of TinyML Hardware Platforms and Their Characteristics

| Platform | Type | Processor / Clock | Memory (RAM/Flash) | Notable Features | Typical TinyML Use Case |
|---|---|---|---|---|---|
| Espressif ESP32 [27] | MCU (Wi-Fi SoC) | Dual-core 32-bit MCU @ 240 MHz | 520 KB SRAM, 4 MB flash (external) | Wi-Fi/Bluetooth integrated; low cost | IoT sensors, simple KWS |
| STM32 (e.g., STM32F7) [27] | MCU (Cortex-M7) | 216 MHz ARM Cortex-M7 MCU | ∼320 KB RAM, 1 MB flash | DSP instructions, optional FPU | Industrial sensing, audio classification |
| Google EdgeTPU [32] | ASIC Neural Accelerator | Custom ASIC @ ∼200 MHz (equiv.) | Uses host memory (external DRAM) | 4 TOPS (∼400 FPS MobileNet) at 2 W; PCIe/USB interfaces | High-speed vision (object detection, etc.) |
| Syntiant NDP120 [33] | Neural co-processor ASIC | Programmable DNN core @ 30–100 MHz | On-chip memory for models | ∼35 µJ per inference (KWS); always-on capability | Always-listening AI (wake word, anomaly detection) |
| Himax WE-I Plus [34] | AI MCU/ASIC with DSP | 400 MHz DSP + accelerators | 2 MB SRAM, 4 MB flash (typical dev board) | Camera interface; <5 mW person detection [34] | Ultra-low-power vision (people counting, etc.) |

4 Evolution from TinyML to TinyDL


4.1 Limitations of Classical TinyML
4.1.1 Poor Generalization for Vision/Audio. Generalization error refers to the gap between a
model’s performance on its training data and on held-out test data. Although this metric is commonly used, it
may not fully capture real-world performance, especially when training and test sets come from
similar distributions (e.g., same user or device). Research shows that large deep neural networks,
despite their capacity to memorize data, often exhibit low generalization error [39]. In contrast,
TinyML models, due to their limited capacity and computational constraints, typically exhibit
higher generalization errors. Their reduced ability to learn complex features makes them more
prone to poor performance on unseen or out-of-distribution data. For instance, in the VWW
task, lightweight convolutional models such as MobileNet and MCUNet have achieved 85–90%
accuracy within 200–250 KB memory budgets [40, 41]. In contrast, traditional pipelines using
handcrafted features like HOG with classical classifiers (e.g., SVM, decision trees) tend to perform
significantly worse, typically in the 70–75% accuracy range, due to their limited ability to capture
spatial hierarchies and generalize to real-world images under constrained memory [42, 43].

4.1.2 High Feature Engineering Burden. Feature engineering involves manually selecting or trans-
forming raw sensor data (e.g., audio, motion, temperature) into meaningful inputs for traditional
models like decision trees or SVMs. While essential, this process is time-consuming, requires
domain expertise, and often fails to capture complex patterns as effectively as deep learning. In
TinyML, these limitations are amplified due to the resource constraints of edge devices, making
manual pipelines impractical for real-time, scalable applications.

4.2 What has Enabled TinyDL


4.2.1 Compression Breakthroughs. Model compression techniques aim to achieve a more efficient
representation of one or more layers in a neural network, often with a potential trade-off in quality
[44]. These techniques reduce the model’s size and computational requirements, leading to a 20%
to 30% decrease in memory usage [8]. There are several general model compression strategies,
such as pruning, low-rank factorization, and knowledge distillation. In addition to these, Ray [9]
analyzed specific model compression techniques, including the Tiny Anomaly Compressor [45],
Doped Kronecker Product [46], and Starfish [47], particularly in the context of image compression.
Tiny Anomaly Compressor offers a lightweight, model-agnostic compression method that is
well-suited for on-device anomaly detection in MCU-based IoT systems, though it faces challenges
related to validity and generalizability. Doped Kronecker Product enhances traditional Kronecker
Product compression by mitigating accuracy loss in Natural Language Processing (NLP) tasks
through co-matrix regularization. Meanwhile, Starfish presents a loss-resilient, AutoML-optimized
framework designed for efficient image compression and streaming in resource-constrained IoT
environments.

4.2.2 Quantization and Hardware Advances. Quantization is a cornerstone of TinyDL, enabling
significant model compression and computational efficiency by representing weights and activations
in reduced-precision formats (typically INT8 or lower). This approach reduces memory footprint
and multiply–accumulate (MAC) operations, making deployment feasible on resource-constrained
MCUs. Figure 3 illustrates the TinyDL hardware ecosystem, highlighting core compute units (MCUs,
DSPs, NPUs), memory elements (SRAM, Flash, DRAM), and interface components.
Modern quantization techniques go beyond simple format conversion. Hardware-aware ap-
proaches and novel numeric representations are increasingly used to optimize for latency and
energy. The Cortex-M4, with its Single Instruction Multiple Data/DSP support, remains the most
widely adopted MCU in TinyML due to its balance of efficiency and ecosystem maturity. For
performance-critical tasks, NPUs accelerate matrix operations with far greater energy efficiency
than general-purpose processors.

Key Quantization Approaches Enabling TinyDL

Post-Training Quantization (PTQ): One-shot float → INT8 conversion; ∼4× size drop
with negligible accuracy loss [9].
Quantization-Aware Training (QAT): Simulates quantization noise during training, enabling 4-
or even 2-bit inference while preserving accuracy [48].
Mixed-Precision Schemes (e.g., hardware-aware quantization (HAQ)): Per-layer
bit-width selection (2/4/8 bit) based on latency/energy targets [49].
Custom Numeric Formats (e.g., TENT): Tapered or block-floating formats tuned per
layer; up to 31% energy savings over INT8 baselines [50].
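To make the post-training quantization entry above concrete, the following is a minimal sketch of affine per-tensor INT8 quantization in plain Python. It is a generic illustration rather than the scheme of any particular toolchain; the function names and sample weights are assumptions.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float, int]:
    """Affine per-tensor quantization of float values to the INT8 range."""
    lo = min(min(values), 0.0)            # the range must contain zero
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0      # fall back to 1.0 for an all-zero tensor
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_int8(q: List[int], scale: float, zero_point: int) -> List[float]:
    """Map INT8 codes back to approximate float values."""
    return [(code - zero_point) * scale for code in q]


weights = [-0.8, -0.1, 0.0, 0.3, 1.2]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# float32 uses 4 bytes per value, INT8 uses 1 byte per value.
size_ratio = (4 * len(weights)) / (1 * len(codes))
print(codes, f"max error {max_error:.4f}", f"size ratio {size_ratio:.0f}x")
```

The 4× size reduction follows directly from storing one byte per value instead of four, while the rounding error per value stays within about half a quantization step.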

[Figure 3: four hardware groups. Microcontrollers: ARM Cortex-M. Accelerators / AI-optimized chips: Coral USB Accelerator, Kendryte K210. Development boards: Raspberry Pi, Arduino Portenta H7. Sensor integration platforms: IMU, Edge Impulse-ready devices, OpenMV Cam H7.]

Fig. 3. Hardware Ecosystem of TinyDL

4.2.3 Lightweight DL Architectures. Deploying deep neural networks on resource-constrained IoT
devices poses significant challenges, particularly in neural architecture search (NAS). NAS typically involves three key steps:
(i) defining the search space of possible architectures, (ii) applying a search algorithm to find the
optimal model, and (iii) using an evaluator to balance accuracy and efficiency for deployment [9].
Recent approaches like evolutionary algorithms, differentiable architecture search, progressive
search, and parameter sharing have significantly reduced NAS computation costs, from thousands
to just a few GPU days, while enabling multi-objective optimization that balances accuracy with
efficiency. Techniques such as MNasNet, FBNet, and MONAS further advance this by incorporating
latency, power, and computational constraints into the search process, resulting in highly efficient
models tailored for deployment on resource-constrained devices [44].
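The three NAS steps listed above can be sketched with a toy random search. The search space, the parameter-count formula, and the scoring function below are hypothetical placeholders; a real system would replace the proxy score with a trained or predicted accuracy and would typically use a stronger search algorithm.

```python
import random
from typing import Dict, Optional

# (i) Search space: hypothetical choices for a small CNN.
SEARCH_SPACE = {
    "depth": [2, 4, 6],
    "width": [8, 16, 32],
    "kernel": [3, 5],
}


def param_count(arch: Dict[str, int]) -> int:
    """Rough weight count for `depth` conv layers of equal width (biases ignored)."""
    k, w = arch["kernel"], arch["width"]
    return arch["depth"] * k * k * w * w


def proxy_score(arch: Dict[str, int]) -> float:
    """Placeholder for an accuracy estimate; larger models score higher here."""
    return arch["depth"] * 1.0 + arch["width"] * 0.1 + arch["kernel"] * 0.2


def random_search(budget_params: int, trials: int, seed: int = 0) -> Optional[Dict[str, int]]:
    """(ii) Search by random sampling; (iii) evaluate the score under a size budget."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(trials):
        arch = {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}
        if param_count(arch) > budget_params:    # reject models that do not fit
            continue
        score = proxy_score(arch)
        if score > best_score:
            best, best_score = arch, score
    return best


best = random_search(budget_params=20_000, trials=200)
print(best, param_count(best) if best else None)
```

Treating the resource budget as a hard constraint is only one option; latency-aware methods often fold the cost into the objective instead.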

4.3 Key Milestones in TinyDL


4.3.1 MobileNet, TinyBERT, MCUNet, SqueezeNet, DistilBERT. MobileNet, SqueezeNet, and MCUNet
are lightweight models for image tasks, designed to run on devices with limited memory. MobileNet
uses depthwise separable convolutions, while SqueezeNet uses fire modules to reduce size. MCUNet
goes further by running deep learning on tiny MCUs. For language tasks, TinyBERT and DistilBERT
are smaller, faster versions of BERT. TinyBERT uses teacher-student training to keep accuracy high,
and DistilBERT keeps 97% of BERT’s performance with fewer parameters. Figure 4 summarizes key
TinyDL breakthroughs from 2016 to 2024, illustrating the progression from early lightweight CNNs
like SqueezeNet and MobileNet to advanced models such as MCUNet, TinyBERT, and RedMule
that enable deep learning on microcontrollers.

[Figure 4: timeline with the years 2016, 2017, 2018, 2019, 2020, 2021, 2023, and 2024 marked. Milestones shown: SqueezeNet (UC Berkeley & Stanford University), AlexNet-level accuracy with 50× fewer parameters in <0.5 MB; MobileNet v1 (Google), depthwise separable convolutions to build lightweight deep neural networks; MobileNet v2 (Google), inverted residual structure; DistilBERT (Hugging Face), smaller and faster than BERT; TinyBERT (Huawei), layer-wise knowledge distillation; MCUNet (CSAIL, MIT), jointly designs TinyNAS and TinyEngine for ImageNet-scale inference on microcontrollers; MobileViT (Apple), combines CNNs and Transformers for mobile vision; EdgeFormer (Huawei Noah's Ark Lab), enabling Transformer-like architecture for edge devices; RedMule, first real training engine for MCUs; TinyVQA, runs on a Crazyflie 2.0 drone with a GAP8 MCU.]

Fig. 4. Timeline of Major TinyDL Breakthroughs [51],[52],[53],[54],[55],[41],[56],[53],[57],[58]

5 TinyDL Architectures and Techniques


The evolution from traditional machine learning to deep learning on resource-constrained devices
represents a fundamental paradigm shift in edge computing [59]. This section examines the archi-
tectural innovations and optimization techniques that have enabled the deployment of sophisticated
deep learning models on MCU-class devices with severe memory and computational limitations.

5.1 Lightweight CNNs


The development of lightweight CNNs specifically designed for resource-constrained environments
has been instrumental in enabling deep learning capabilities on edge devices [60]. These architec-
tures employ novel design principles that dramatically reduce computational complexity while
maintaining competitive accuracy levels.
5.1.1 MobileNet Architecture Family. The MobileNet series represents a cornerstone achievement
in efficient CNN design, introducing depthwise separable convolutions that factorize standard
convolutions into depthwise and pointwise operations [60]. MobileNetV2 extends this approach
with inverted residual blocks and linear bottlenecks, achieving 71.8% ImageNet top-1 accuracy
with only 3.4 MB model size when deployed on STM32H7 MCUs [60]. The architecture’s efficiency
stems from its fundamental redesign of the convolution operation, reducing parameters by an order
of magnitude compared to traditional CNNs while maintaining representational capacity [60].
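The parameter saving from factorizing a convolution can be checked with a short calculation. The sketch below counts weights only (biases ignored), and the layer shape is an arbitrary example rather than a layer taken from MobileNet.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in        # one k x k filter per input channel
    pointwise = c_in * c_out        # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


# Example layer (an assumption): 3 x 3 kernel, 128 input and 128 output channels.
standard = standard_conv_params(3, 128, 128)
separable = depthwise_separable_params(3, 128, 128)
print(standard, separable, f"{standard / separable:.1f}x fewer weights")
```

The ratio of the two counts is 1/c_out + 1/k², so for 3×3 kernels and wide layers the saving approaches a factor of nine.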
5.1.2 SqueezeNet and Fire Modules. The SqueezeNet architecture employs fire modules consisting
of squeeze layers (1×1 convolutions) followed by expand layers (mixed 1×1 and 3×3 convolutions)
to achieve significant parameter reduction [61]. SqueezeNext further optimizes this design with
hardware-aware modifications, achieving AlexNet-level accuracy with 112× fewer parameters
[61]. The architecture’s aggressive parameter reduction makes it particularly suitable for flash-
constrained MCUs, requiring only 1.2 MB storage while maintaining 57.5% ImageNet accuracy
[62].
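The effect of the squeeze layer on parameter count can be illustrated as follows. The channel sizes are chosen for illustration and are an assumption, as is the plain 3×3 convolution used as a comparison point.

```python
def fire_module_params(c_in: int, squeeze: int, expand_1x1: int,
                       expand_3x3: int, bias: bool = True) -> int:
    """Parameter count of a fire module: 1x1 squeeze, then 1x1 and 3x3 expand."""
    squeeze_w = c_in * squeeze                                   # 1x1 squeeze filters
    expand_w = squeeze * expand_1x1 + 9 * squeeze * expand_3x3   # expand filters
    biases = squeeze + expand_1x1 + expand_3x3 if bias else 0
    return squeeze_w + expand_w + biases


def conv3x3_params(c_in: int, c_out: int, bias: bool = True) -> int:
    """Parameter count of a plain 3x3 convolution with the same output width."""
    return 9 * c_in * c_out + (c_out if bias else 0)


# Illustrative sizes: 96 input channels squeezed to 16, expanded to 64 + 64.
fire = fire_module_params(c_in=96, squeeze=16, expand_1x1=64, expand_3x3=64)
plain = conv3x3_params(c_in=96, c_out=128)
print(fire, plain, f"{plain / fire:.1f}x fewer parameters")
```

Most of the saving comes from the 3×3 filters seeing only the squeezed channels rather than the full input width.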

5.1.3 Hardware-Aware Architecture Design. Recent developments in lightweight CNN design have
emphasized hardware-aware optimization through NAS specifically tailored for MCU constraints
[63]. MCUNet demonstrates this approach by achieving 68.7% ImageNet accuracy with only 0.51
MB model size through joint optimization of network architecture and inference scheduling [41].
The framework employs a two-stage NAS that first optimizes the search space to fit resource
constraints, then specializes the network architecture within the optimized space [41].

5.2 Lightweight Transformers and RNNs


The adaptation of transformer architectures for TinyML represents a significant advancement
in bringing state-of-the-art natural language processing capabilities to edge devices, though it
presents unique challenges due to the attention mechanism’s quadratic memory complexity [55].

5.2.1 TinyBERT Knowledge Distillation. Representing a breakthrough in transformer compression,
TinyBERT employs a novel two-stage knowledge distillation framework specifically designed for
transformer models [55]. The approach performs transformer distillation at both pre-training and
task-specific learning stages, enabling effective knowledge transfer from large teacher models to
compact student networks [55]. TinyBERT-4L achieves 96.7% performance of BERT-Base on the
GLUE benchmark while being 7.5× smaller and 9.4× faster on inference, with only 14.5 million
parameters [55].

5.2.2 DistilBERT Compression Strategy. Utilizing a different distillation approach, DistilBERT reduces
the size of the original BERT model by 40% while retaining 97% of its language understanding capabilities
[55]. The model achieves 91.3% F1 score on SQuAD v1.1 with 66 million parameters, demonstrating
the effectiveness of student-teacher training with temperature-scaled softmax distributions [55].
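The temperature-scaled softmax objective mentioned above can be sketched as a soft-target loss. This is the generic distillation term only; the full TinyBERT and DistilBERT objectives include further components that are not reproduced here, and the function names and the temperature value are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature (higher means softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float],
                      teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


teacher = [1.0, 2.0, 3.0]
print(distillation_loss([1.0, 2.0, 3.0], teacher))   # student matches teacher
print(distillation_loss([3.0, 2.0, 1.0], teacher))   # student disagrees
```

The T² factor is the common convention for keeping gradient magnitudes comparable across temperatures; in practice this term is usually combined with a standard loss on the hard labels.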

5.2.3 MCU Deployment Challenges. Recent research has focused on optimizing transformer deployment
specifically for MCUs, addressing unique challenges posed by the multi-head self-attention
mechanism [64]. The primary bottlenecks include high memory footprint of intermediate attention
results and frequent data marshaling operations [64]. Novel approaches such as Fused-Weight
Self-Attention (FWSA) and Depth-First Tiling have been developed to mitigate these challenges,
achieving up to 6.19× reduction in memory peak usage while maintaining computational accuracy
[64].
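To illustrate why intermediate attention results dominate memory, the sketch below computes attention over blocks of query rows so that only one block of scores is held at a time. This is generic row tiling, not the FWSA or Depth-First Tiling methods of [64]; all names and sizes are assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def _softmax(row: List[float]) -> List[float]:
    m = max(row)                                   # subtract max for stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_tile(q_tile: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention for a block of query rows."""
    d = len(k[0])
    # Score buffer for this tile only: len(q_tile) x len(k) values.
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
              for qi in q_tile]
    weights = [_softmax(row) for row in scores]
    return [[sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
            for row in weights]


def tiled_attention(q: Matrix, k: Matrix, v: Matrix, tile_rows: int) -> Matrix:
    """Process the queries tile by tile instead of materializing all scores."""
    out: Matrix = []
    for start in range(0, len(q), tile_rows):
        out.extend(attention_tile(q[start:start + tile_rows], k, v))
    return out


def score_buffer_bytes(seq_len: int, rows_in_flight: int, bytes_per_value: int = 4) -> int:
    """Size of the attention-score buffer for one head."""
    return rows_in_flight * seq_len * bytes_per_value


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
full = tiled_attention(q, k, v, tile_rows=len(q))
tiled = tiled_attention(q, k, v, tile_rows=1)
print(score_buffer_bytes(128, 128), score_buffer_bytes(128, 8))
```

Because softmax is applied per query row, tiling over rows changes only the peak buffer size and not the result.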

A comparative summary of popular TinyDL models, including their architectural characteristics,
deployment efficiency, and hardware targets, is provided in Table 3. This table highlights the
trade-offs between accuracy, model size, and latency across a diverse range of architectures and
deployment contexts, offering valuable insights for selecting appropriate models in resource-
constrained scenarios.

5.3 Model Optimization Techniques


The deployment of deep learning models on TinyML systems necessitates sophisticated optimization
techniques that compress model size and accelerate inference while preserving accuracy [65].

Table 3. Summary of TinyDL Models with Size, Inference Speed, and Task Accuracy

| Model Name | Architecture | Size (MB) | Latency (ms) | Accuracy (%) | Target Task | Hardware |
|---|---|---|---|---|---|---|
| TinyBERT-4L [55] | Transformer | 14.5 | 5.0 | 96.8 (SST-2) | Text Classification | Mobile SoC |
| DistilBERT [55] | Transformer | 66.0 | 7.0 | 91.3 (SQuAD) | QA / NLP Tasks | Mobile GPU |
| MobileNetV2-0.35 [60] | CNN | 3.4 | ∼32 | 71.8 (ImageNet) | Image Classification | STM32H7 MCU |
| SqueezeNet v1.1 [61, 62] | CNN | 4.8 | ∼20 | 58.38 (ImageNet) | Object Detection | Kendryte K210 |
| MCUNet-256kB [41] | CNN + NAS | 0.51 | 12.0 | 70.7 (ImageNet) | Image Classification | STM32F746 |
| EfficientNet-Lite0 [60] | CNN | 4.7 | 45.0 | 75.1 (ImageNet) | Image Classification | EdgeTPU |
| DS-CNN (MLPerf) [43] | 1D-CNN | 0.05 | 20.0 | >90.0 (Commands) | Wake Word Detection | ARM Cortex-M4 |
| MobileNet (VWW) [43] | CNN | 0.32 | 8.0 | 80.0 (VWW) | Visual Wake Words | STM32 MCU |
| Deep AutoEncoder (AD) [43] | Autoencoder | 0.27 | 15.0 | 85.0 (AD Bench) | Anomaly Detection | MCU Platform |
| ResNet (IC) [43] | CNN | 0.096 | 25.0 | 85.0 (ImageNet) | Image Classification | STM32 MCU |
| Transformer-FWSA [64] | Transformer | 2.1 | 180.0 | 78.2 (NLP Tasks) | NLP Tasks | STM32F746 |
| SquishedNet [61] | CNN | 0.95 | 156.0 | 77.0 (CIFAR-10) | Image Classification | Nvidia Jetson TX1 |

5.3.1 Quantization Methodologies. Quantization represents one of the most effective approaches
for model compression in TinyML systems [66][67]. PTQ converts trained models from floating-
point to reduced precision representations, typically INT8, achieving 4× model size reduction with
minimal accuracy degradation [65]. QAT incorporates quantization effects during training, enabling
more aggressive precision reduction while maintaining model performance [66]. Comparative
analysis shows that quantization generally outperforms pruning across various compression ratios,
with benefits becoming more pronounced at moderate compression levels [67].
5.3.2 Co-Design of Architecture and Runtime: MCUNet. MCUNet exemplifies a system-algorithm co-
design approach where both the neural architecture (TinyNAS) and inference engine (TinyEngine)
are jointly optimized to meet the extreme memory and compute constraints of MCUs. Unlike
traditional pipelines that first fix either the library or the model, MCUNet explores a larger design
space by integrating both dimensions. Figure 5 illustrates this co-optimization flow, highlighting
how it surpasses previous approaches limited to one-directional tuning.
5.3.3 Neural Network Pruning. Neural network pruning eliminates redundant or less important
parameters to reduce model complexity and memory footprint [68][65]. Magnitude-based pruning
removes weights with smallest absolute values, providing a straightforward approach for parameter
reduction [68]. Structured pruning targets entire network components such as filters or layers,
enabling more significant architectural simplifications suitable for severely resource-constrained
environments [65]. Research demonstrates that structured pruning can achieve compression ratios
up to 13× without significant accuracy loss through iterative pruning and retraining cycles [68].
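Magnitude-based pruning as described above can be sketched in a few lines. This is a one-shot, unstructured variant without the retraining cycles mentioned in the text; the function name and sample weights are assumptions.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest |w|."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


layer = [0.5, -0.05, 0.2, -0.9, 0.01, 0.3]
pruned_layer = magnitude_prune(layer, sparsity=0.5)
print(pruned_layer)
```

Unstructured sparsity of this kind reduces storage only when paired with a sparse format, which is one reason structured pruning of whole filters is often preferred on MCUs.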
5.3.4 Joint Optimization Approaches. The combination of quantization and pruning techniques
has emerged as a powerful strategy for achieving maximum compression efficiency [66][69].
Quantization-aware pruning yields more computationally efficient models than either technique
alone, particularly for ultra-low latency applications [66]. Joint optimization frameworks demonstrate
superior computational efficiency compared to sequential application of compression techniques,
with benefits varying based on target compression ratios and application requirements [69].

[Figure 5: three panels. (a) Search NN model on an existing library, e.g., ProxylessNAS, MnasNet. (b) Tune deep learning library given a NN model, e.g., TVM. (c) MCUNet: system-algorithm co-design, pairing TinyNAS (efficient neural architecture) with TinyEngine (efficient compiler/runtime).]

Fig. 5. MCUNet co-designs neural architecture (TinyNAS) and inference scheduling (TinyEngine) for MCU
efficiency. Unlike (a) architecture search and (b) runtime tuning done separately, (c) MCUNet integrates both
for improved accuracy and resource use [41]

5.3.5 Network Augmentation and Auxiliary Supervision. Network augmentation offers a comple-
mentary training-time optimization strategy, particularly relevant for TinyDL. As shown in Figure 6,
a tiny model is embedded into larger networks that share weights and provide auxiliary supervision.
This enables the tiny model to learn stronger representations without increasing its inference-time
footprint, making it ideal for resource-constrained deployments.
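The training objective shown in Fig. 6 adds weighted auxiliary losses from the augmented models to the base loss of the tiny model. A minimal sketch of that combination follows; the loss values and weights are made-up numbers and the function name is an assumption.

```python
from typing import Sequence


def augmented_loss(base_loss: float,
                   aux_losses: Sequence[float],
                   alphas: Sequence[float]) -> float:
    """Base loss of the tiny model plus weighted losses of augmented models."""
    if len(aux_losses) != len(alphas):
        raise ValueError("one weight is required per auxiliary loss")
    return base_loss + sum(a * loss for a, loss in zip(alphas, aux_losses))


# Made-up values: one tiny-model loss and two augmented-model losses.
total = augmented_loss(base_loss=1.0, aux_losses=[2.0, 4.0], alphas=[0.5, 0.25])
print(total)
```

Only the tiny model's weights are kept for deployment, so the auxiliary terms affect training cost but not inference cost.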

6 Software Toolchains and Deployment Frameworks


The deployment of TinyML and TinyDL models on resource-constrained devices such as MCUs
and edge processors requires robust, efficient, and highly optimized software toolchains. These
toolchains bridge the gap between trained machine learning models and their real-world deploy-
ment on ultra-low-power hardware. This section examines lightweight deployment frameworks,
compilation techniques, and end-to-end platforms that enable practical and scalable TinyML and
TinyDL implementations.

6.1 Model Deployment Tools


This subsection explores a range of lightweight deployment toolchains designed for resource-
constrained edge devices, tracing their evolution from early C++ converters to modern platforms
with support for quantization, AutoML, and hardware-specific integration. Early frameworks such
as uTensor [71] demonstrated feasibility by converting TensorFlow models into C++ code but lacked
support for quantization or advanced operators. TFLite Micro addressed these limitations by adding
support for 8-bit quantization, expanded operator coverage, and community backing, making it
a de facto baseline in TinyML deployments [72]. However, TFLite Micro requires manual tuning
and has no built-in GUI. To lower entry barriers, Edge Impulse [73] introduced a no-code AutoML
pipeline with integrated DSP processing and on-device testing. TFLite Model Maker [74] supports
fine-tuning and exporting models tailored for deployment on EdgeTPUs and mobile hardware.
PyTorch Mobile [75], while not suitable for MCUs, supports deployment of larger TinyDL models
(including Transformers) on higher-end mobile SoCs.

[Figure 6: two-step diagram (Step 1, Step 2) from input to output, showing the forward flow with base supervision and the auxiliary forward flow with auxiliary supervision; the update combines both as g = g_base + g_aug. The tiny model works as a sub-model of the augmented models. Training loss: $\mathcal{L}_{\mathrm{aug}} = \mathcal{L}(W_t) + \alpha_1 \mathcal{L}([W_t, W_1]) + \cdots + \alpha_i \mathcal{L}([W_t, W_i]) + \cdots$, where the first term is the base supervision and the remaining terms are the auxiliary supervision.]

Fig. 6. Network augmentation strategy: A tiny model is trained within a larger model to benefit from auxiliary
supervision, but only the tiny network is used during inference [70].

More specialized toolchains target performance tuning and low-level integration. CMSIS-NN [76]
provides hand-optimized kernels for ARM Cortex-M architectures and is often paired with TFLite
Micro for improved inference latency. MicroTVM [77], as an extension of the TVM compilation stack,
brings auto-tuning and graph optimization to MCU platforms like STM32 and ESP32. Glow [78],
developed by Meta, offers ahead-of-time graph lowering for hardware accelerators and NPUs. Tools
like DeepC convert Keras models into static C code, ideal for systems without dynamic memory
support [79], while MLPACK [80], although not TinyML-specific, is a lightweight C++ library
adaptable for embedded use. Vendor-specific tools such as X-CUBE-AI [81] for STM32 platforms
and academic solutions like QKeras with HLS4ML [82] for FPGA deployment demonstrate the
growing ecosystem of domain-targeted solutions. Commercial platforms such as OctoML [83] and
Nebullvm [84] further enhance deployment by automating compilation, quantization, and precision
tuning across edge platforms. These diverse tools reflect the increasing demand for streamlined,
hardware-aware TinyDL deployment pipelines.
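Several of the tools above emit static C code so that weights can live in flash without dynamic allocation. The following sketch shows the general idea by rendering INT8 weights as a C array; it is a hypothetical helper for illustration and does not reproduce the output format of DeepC or any other tool named here.

```python
from typing import Sequence


def to_c_array(name: str, values: Sequence[int], per_line: int = 12) -> str:
    """Render INT8 weights as a C source snippet (hypothetical helper)."""
    if any(not -128 <= v <= 127 for v in values):
        raise ValueError("values must fit in int8")
    lines = ["#include <stdint.h>", "", f"const int8_t {name}[{len(values)}] = {{"]
    for i in range(0, len(values), per_line):
        chunk = ", ".join(str(v) for v in values[i:i + per_line])
        lines.append(f"    {chunk},")
    lines.append("};")
    return "\n".join(lines)


source = to_c_array("layer0_weights", [12, -7, 0, 127, -128, 33])
print(source)
```

Declaring the array `const` lets most embedded toolchains place it in read-only memory, which keeps scarce SRAM free for activations.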

6.2 Compilation and Runtime Support


To execute models efficiently on tiny hardware, compilation and runtime libraries play a vital role.
They optimize memory usage, operator execution, and compatibility with various MCUs, DSPs,
and NPUs. CMSIS-NN [76] was one of the earliest contributions in this space, providing hand-
optimized fixed-point kernels for ARM Cortex-M chips. Although it does not support automatic
graph compilation, it is frequently paired with TFLite Micro to yield significant gains in latency
and power efficiency. To move beyond manually crafted kernels, frameworks such as TVM [85]
introduced automated compilation workflows that include graph-level optimization, operator fusion,
and quantization-aware tuning. Its extension, MicroTVM [77], enables deployment on bare-metal
devices such as STM32 and ESP32, with support for auto-tuning and memory planning specific to
target hardware.
Other frameworks have pursued alternative compilation strategies. Glow [78], developed by
Meta, lowers ML graphs into intermediate representations and compiles them for hardware accelerators,
offering static deployment advantages on NPUs. Accelerated Linear Algebra (XLA) [86], originally
part of TensorFlow, generates backend-specific binaries through operation fusion, and although
primarily used in server environments, its techniques influence edge compiler design. ONNX
Runtime [87], when paired with execution providers like TensorRT and OpenVINO, facilitates
optimized deployment on edge GPUs and NPUs, though its MCU support remains limited. Emerging
tools such as Apache NNC and Tensor Comprehensions [88] explore polyhedral compilation and
domain-specific optimization, primarily in research settings. Commercial efforts like OctoML [83]
offer automated TVM-based tuning services, while Nebullvm [84] supports quantization, prun-
ing, and cross-platform model optimization. Together, these compilation frameworks ensure that
deep learning models, once compressed and optimized, can meet the strict memory and latency
constraints of TinyML deployments.
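Operator fusion, mentioned above as a graph-level optimization, can be illustrated by folding a batch-normalization layer into the preceding linear layer. This is one common example of fusion and is shown as a generic sketch; it makes no claim about how any specific compiler named here implements it, and all names are assumptions.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def linear(w: Matrix, b: List[float], x: List[float]) -> List[float]:
    """y = W x + b for a single input vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def fold_batchnorm(w: Matrix, b: List[float], gamma: List[float], beta: List[float],
                   mean: List[float], var: List[float],
                   eps: float = 1e-5) -> Tuple[Matrix, List[float]]:
    """Fold gamma * (W x + b - mean) / sqrt(var + eps) + beta into W' x + b'."""
    folded_w, folded_b = [], []
    for row, bi, g, bt, m, v in zip(w, b, gamma, beta, mean, var):
        s = g / math.sqrt(v + eps)              # per-output-channel scale
        folded_w.append([wi * s for wi in row])
        folded_b.append((bi - m) * s + bt)
    return folded_w, folded_b


w, b = [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2]
gamma, beta, mean, var = [1.5, 0.8], [0.05, -0.1], [0.3, -0.4], [0.9, 1.6]
x = [1.0, 2.0]

# Unfused: linear layer followed by batch normalization at inference time.
unfused = [g * (y - m) / math.sqrt(v + 1e-5) + bt
           for y, g, bt, m, v in zip(linear(w, b, x), gamma, beta, mean, var)]
# Fused: a single linear layer with folded parameters.
fw, fb = fold_batchnorm(w, b, gamma, beta, mean, var)
fused = linear(fw, fb, x)
print(unfused, fused)
```

After folding, the normalization costs nothing at inference time and its parameters no longer need to be stored separately.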

6.3 End-to-End Platforms


Section 6.2 highlighted the critical role of compilation and runtime libraries in executing machine
learning models efficiently on tiny edge hardware, focusing on tools that enable memory
optimization, operator tuning, and hardware-specific acceleration across MCUs, DSPs, and NPUs.
While deployment and compilation tools handle isolated phases of the TinyML lifecycle, end-
to-end platforms offer unified environments, with some prioritizing usability and others pushing
optimization boundaries. These platforms vary in abstraction, hardware support, and model adapt-
ability, often building on or compensating for the limitations of one another. This section offers an
overview of end-to-end TinyML platforms that streamline the entire machine learning lifecycle for
edge deployment, while Table 4 provides a detailed comparison of toolchains and frameworks based
on platform support, compression, quantization, AutoML capabilities, and community maturity.
Beyond these widely used frameworks, several specialized platforms provide tailored function-
alities that address the needs of different domains. TensorFlow Lite Model Maker [74] offers a
Python-based AutoML pipeline with fine-tuning, pruning, and quantization support for deploy-
ment on mobile and EdgeTPU devices. For time-series and sensor data, platforms such as Qeexo
AutoML [89] and SensiML [92] offer end-to-end workflows that integrate feature extraction, model
compression, and deployment to MCUs and FPGAs. These frameworks are especially suited for
industrial and automotive applications requiring vibration or IMU-based inference.
To support optimization-focused deployment, platforms such as Latent AI’s LEIP stack [91] and
OctoML [83] emphasize model compression workflows across CPUs, NPUs, and MCUs. These
frameworks integrate mixed-precision quantization, pruning, and tuning for performance–energy
trade-offs. While OctoML builds upon the TVM stack for deployment optimization, Latent AI is
designed for hybrid edge use cases requiring cross-platform support.

Table 4. Comparison of TinyML Toolchains and Frameworks

| Toolchain / Framework | Supported Platforms | Compression Support | Quantization Support | NAS / AutoML Support | Community Maturity / Ecosystem |
|---|---|---|---|---|---|
| TensorFlow Lite Micro [72] | MCU (ARM Cortex-M), Arduino, ESP32 | Pruning, weight clustering | 8-bit, int4 (experimental) | Manual only | Very high; large community, strong docs |
| uTensor [71] | ARM Cortex-M, STM32 | Minimal | 8-bit only | No | Low; legacy status, low activity |
| CMSIS-NN [76] | ARM Cortex-M family | Manual optimizations | 8-bit fixed-point | No | Medium; well-documented for ARM devs |
| Edge Impulse Studio [73] | Web-based IDE (ESP32, STM32, Arduino) | Pruning + DSP fusion | Quant-aware training | Yes (AutoML built-in) | High; growing ecosystem, no-code UI |
| MicroTVM [77] | RISC-V, ARM, x86, ESP32 | Compiler-based optimization | INT8/INT4 | Yes (TVM autotuning) | High among compiler researchers |
| TFLite Model Maker [74] | Android, Raspberry Pi, EdgeTPU | Basic pruning | 8-bit, int16 | Yes (UI-based) | High for mobile/EdgeTPU apps |
| Qeexo AutoML [89] | ARM Cortex-M, IMU boards | Auto-selected models | Yes (auto) | Yes (end-to-end) | Medium; strong for sensor ML |
| Neuton TinyML [90] | TinyMCU (<1 KB RAM) | Highly compressed models | Yes (proprietary) | Yes (Auto-compression) | Emerging; niche but focused |
| Latent AI [91] | MCU, CPU, NPU | Pruning, quantization, distillation | Yes (mixed-precision) | Yes (LEIP AutoML) | Medium; commercial/enterprise use |
| SensiML [92] | QuickLogic FPGA, MCUs | Auto feature selection | Yes (auto-tuned) | Yes | Growing in sensor-specific domains |
| OctoML [83] | Cloud, Edge (TVM-compatible) | TVM-based optimization | Yes (via TVM) | TVM-integrated AutoTuner | High in TVM community, enterprise users |
| Arduino IDE / ArduinoML [93] | Arduino boards (AVR, Cortex-M) | None | 8-bit (manual) | No | Large maker community |
| NXP eIQ Toolkit [94] | NXP MCUs and NPUs | Quantization, pruning | Yes (NXP SDK) | Limited (GUI only) | Vendor-supported; stable in NXP flow |
| Microsoft EdgeML [95] | MCU, DSPs (research only) | Model compression (Bonsai, ProtoNN) | Yes | No | Academic; low deployment focus |
| Google Colab + TFLite [96, 97] | Cloud to TFLite-compatible edge | Manual via TFLite converter | Yes | Manual | Broad TF support; no GUI |
| Sony AI Studio [98, 99] | Sony Spresense board | Pre-configured | Yes (limited) | Model Zoo only | Specialized; Sony-specific |
| KaaEdge AI [100] | Edge devices, IoT networks | Model coordination only | Depends on device | Federated AutoML (in development) | System-level orchestration; growing |

Popular TinyML Deployment Toolchains Compared

TensorFlow Lite Micro: de-facto baseline; 8-bit INT quantization, huge community; no
GUI [72].
Edge Impulse Studio: drag-and-drop AutoML with built-in DSP blocks, ideal for newcomers [73].
MicroTVM: TVM-based compiler autotuning for MCUs (STM32, ESP32) and RISC-V boards
[77].
Neuton TinyML: sub-kilobyte models that fit where even CMSIS-NN is too heavy [90].

Other tools address the needs of beginner users or are integrated within hardware-specific
ecosystems. Arduino IDE and ArduinoML [93] offer intuitive interfaces for model deployment on
AVR and Cortex-M boards but are limited to simpler use cases. The NXP eIQ Toolkit [94] provides
vendor-specific integration for NXP’s MCU and NPU portfolio, offering a graphical interface with
built-in support for quantization and pruning.

J. ACM, Vol. 37, No. 4, Article . Publication date: May 2025.


20 Somvanshi et al.

From the research perspective, Microsoft’s EdgeML [95] introduces novel model architectures
such as ProtoNN [101] for ultra-low-memory inference. For advanced users, the Google Colab and
TFLite workflow [96, 97] provides maximum scripting flexibility, allowing users to train models in
the cloud and convert them for deployment using the TFLite converter. Sony AI Studio [98, 99],
while limited to the Spresense board, offers a curated development environment for vision and
audio inference. Lastly, KaaEdge AI [100] addresses large-scale deployment and orchestration needs,
supporting federated learning pipelines and distributed edge intelligence.
Together, these platforms span a wide spectrum, from GUI-based environments to
optimization-first toolchains, offering developers diverse options depending on their expertise,
application complexity, and hardware targets.

7 Applications of TinyML and TinyDL


7.1 Vision
The field of computer vision has witnessed significant progress, particularly with the advent of deep
learning, yet challenges persist in small object detection (SOD) due to factors such as low resolu-
tion, limited pixel information, and difficulties in data labeling [102]. These limitations necessitate
specialized approaches, especially when integrating AI into resource-constrained IoT devices, a
domain increasingly addressed by TinyML [16]. In 2022, research began to address these challenges
by exploring various techniques to enhance SOD performance. Methods included augmenting data
and implementing super-resolution to improve feature visibility for classifiers [102]. Furthermore,
multi-scale prediction strategies, such as image pyramids and Feature Pyramid Networks, were
developed to better adapt to objects of varying sizes and improve detection accuracy [102]. Addition-
ally, Generative Adversarial Networks were also introduced to boost image resolution and enhance
feature representation for small objects [102]. Concurrently, studies focused on improving the
energy efficiency and robustness of TinyML computer vision through the use of log-gradient input
images, which allow for aggressive quantization and reductions in CNN resources [103]. Software
engineering practices for TinyML-based IoT embedded vision were also examined, emphasizing
the need for robust and cost-effective solutions for real-world deployments [104]. Additionally,
challenges and benefits of deploying TinyML on edge devices, including improved privacy and
reduced latency, were broadly discussed [16].
Moving into 2023, the focus continued on refining SOD techniques and evaluating TinyML’s
practical viability. Surveys detailed the significance of SOD in applications like criminal investigation
and autonomous driving, categorizing improvement methods such as boosting input resolution and
integrating contextual information [105]. Concurrently, research also began to evaluate the energy
feasibility of TinyML for computer vision applications, particularly for tasks like people detection,
confirming that TinyML-driven IoT sensors consume less energy than traditional machine
learning systems [106]. More recent advancements, extending into 2024 and 2025, have seen the
development of more specialized tools and datasets. The "Wake Vision" dataset was introduced as a
large-scale, high-quality benchmark specifically tailored for TinyML computer vision, particularly
person detection. This dataset aims to overcome the limitations of smaller, less diverse datasets
by using an automated generation pipeline, leading to significant accuracy improvements [107].
In parallel, "YoLite+", a lightweight multi-object detection approach, was proposed for traffic
scenarios. This method leverages MobileNet and depthwise separable convolution to compress
models, achieving faster inference and reduced parameters while maintaining accuracy [108].
Overall, these developments underscore the ongoing efforts to make TinyML an increasingly
practical and powerful tool for diverse computer vision applications in constrained environments.
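
To make the compression effect of depthwise separable convolution more concrete, the following sketch compares weight counts for a standard convolution and its depthwise separable counterpart. The layer shape (3×3 kernel, 64 input channels, 128 output channels) and the function names are illustrative assumptions, not a configuration taken from [108].

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer shape chosen only for illustration.
K, C_IN, C_OUT = 3, 64, 128
std = standard_conv_params(K, C_IN, C_OUT)
sep = depthwise_separable_params(K, C_IN, C_OUT)
reduction = std / sep
print(f"standard: {std}, separable: {sep}, reduction: {reduction:.1f}x")
```

For this assumed shape the separable layer needs roughly one eighth of the weights; the exact ratio depends on kernel size and channel counts.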


7.2 Audio and Natural Language Processing


The integration of TinyML with NLP has enabled a new class of real-time, energy-efficient AI
applications on edge devices, overcoming the limitations of traditional computationally intensive
models [55, 109–112]. Efforts in this domain primarily focus on developing lightweight architectures
for on-device speech recognition, allowing complex command processing and KWS on MCU
platforms such as the Arduino Nano 33 BLE Sense [109]. These applications optimize audio signal
processing pipelines by incorporating Fast Fourier Transform-based feature extraction while
minimizing energy consumption for always-on operation [109]. Beyond simple command detection,
TinyML-enabled NLP has expanded to more advanced use cases such as semantic sentiment
classification using privacy-preserving frameworks like split learning, which reduce computational
overhead and enhance data privacy compared to traditional centralized learning approaches [110].
Additionally, TinyML is increasingly applied in human-centric contexts, including behavior analysis
for smart environments and healthcare, where real-time, privacy-aware data processing is critical
despite constrained device resources [111].
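
As a rough illustration of FFT-based feature extraction for audio, the sketch below windows one frame of a synthetic tone and computes log-magnitude spectral features. It is written in plain Python for readability; the frame length, sample rate, and function names are assumptions, and a real MCU deployment would normally rely on an optimized fixed-point FFT routine rather than this recursive version.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def log_spectrum(frame):
    """Hann-window one frame and return log-magnitudes of the non-negative frequency bins."""
    n = len(frame)
    windowed = [s * 0.5 * (1 - math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = fft(windowed)
    return [math.log(abs(c) + 1e-6) for c in spectrum[: n // 2 + 1]]


# Synthetic 1 kHz tone sampled at 16 kHz, one 256-sample frame (illustrative values).
SAMPLE_RATE, FRAME_LEN, TONE_HZ = 16000, 256, 1000
frame = [math.sin(2 * math.pi * TONE_HZ * i / SAMPLE_RATE) for i in range(FRAME_LEN)]
features = log_spectrum(frame)
peak_bin = max(range(len(features)), key=features.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
```

The strongest bin lands at the tone frequency, which is the kind of compact spectral cue a small keyword-spotting model can consume instead of raw samples.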
These advancements have required adaptation of large transformer-based language models
into edge-suitable formats through quantization, pruning, and architectural simplification to meet
the demands of low memory, limited processing power, and battery efficiency [113]. Beyond
NLP tasks, TinyML audio models are being used in environmental monitoring systems such
as urban noise anomaly detection and wildlife sound recognition. Notably, recent systems like
TinyChirp demonstrate efficient bird song classification on low-power wireless acoustic sensors
[114]. These evolving use cases illustrate the continued maturation of TinyML frameworks and
models, showcasing their potential to address complex real-world challenges while operating within
severe hardware constraints [27, 112].

7.3 Healthcare and Human Behavior Analytics


TinyML is increasingly transforming healthcare by enabling on-device analytics for various medical
applications, prioritizing patient privacy, reducing latency, and enhancing data security [115]. In
particular, TinyML facilitates efficient analysis of ECGs, capturing the heart’s electrical signals.
A system utilizing reservoir computing on a low-power MCU demonstrated high accuracy with
minimal variance [116]. This method lowers complexity and energy consumption while enabling
real-time detection of various pathological conditions. Additionally, TinyML-based ECG solutions
offer continuous monitoring and instant feedback capabilities for both medical professionals and
patients [8]. A practical framework for deploying such real-time health monitoring solutions using
MCUs like ESP32 and STM32 was demonstrated by Dutta et al. [117], emphasizing edge-native
inference and system-level integration. Similarly, TinyML allows real-time, low-power respiratory
monitoring using a CNN model that detects cancer diseases from acoustic signals with high accuracy.
This approach enables remote monitoring and early diagnosis, reduces cloud dependency, and protects
privacy, making it ideal for mobile healthcare [118].
Beyond vital sign monitoring, TinyML is also significantly applied in human behavior analysis
within healthcare, notably in emotion detection and Human Activity Recognition. Emotion detection
systems use wearable devices to analyze physiological signals from respiratory belts,
photoplethysmography, and fingertip temperature sensors [119, 120], as well as bioelectrical measurements of
skin conductance, electroencephalography, and heart rate [121]. Machine learning models
trained on these diverse datasets consistently achieve high accuracies, indicating a strong potential
for improving emotional state recognition and optimizing ergonomic conditions [119, 121]. For
HAR, TinyML enables deployment on resource-limited devices using various algorithms such
as CNNs and RNNs like LSTMs. A CNN-based HAR system using mmWave radar, for instance,
achieved high accuracy with a minimal model size and fast inference time, addressing privacy and
latency concerns associated with camera-based systems [122]. Another notable development is an
IoT wristband designed for on-device, privacy-preserving HAR, providing a low-power, low-cost
solution that performs real-time activity classification without relying on cloud infrastructure
[123]. Overall, these advancements underscore TinyML’s pivotal role in enhancing consumer
healthcare data protection through AI-driven and privacy-preserving techniques [124], expanding
wearable technologies, and supporting intelligent medical devices with high autonomy and energy
efficiency [125].

7.4 Industrial and Environmental Applications


TinyML is fundamentally transforming industrial and environmental sectors by enabling efficient,
on-device analytics that circumvent the latency, cost, and privacy challenges typically associated
with traditional cloud-based machine learning [8, 9, 126, 127]. This paradigm shift facilitates real-
time industrial anomaly detection by analyzing machine sounds and vibrations directly at the
edge, effectively reducing downtime and enhancing operational efficiency [128]. For instance,
specialized TinyML models deployed on ESP-WROOM-32 MCUs are capable of detecting anomalies
in thermal images of machinery, transmitting data via Message Queuing Telemetry Transport only
when an anomaly is identified and demonstrating high accuracy [129]. This capability extends to
monitoring critical infrastructure, where online learning anomaly detection models like Deep Echo
State Network are developed for water distribution systems, adapting to environmental changes
directly on MCUs [26]. Furthermore, deeply quantized anomaly detectors, such as a Block-based
Binary Shallow Echo State Network, are proposed for specific industrial use cases like identifying
oil leaks in wind turbines, leveraging binarized images and one-bit quantization for efficiency
[130]. These applications underscore how TinyML effectively addresses the latency and security
vulnerabilities often inherent in cloud-based anomaly detection systems [128, 131].
In the broader context of industrial predictive maintenance, TinyML plays a crucial role in
preventing costly failures by enabling continuous, on-device monitoring and analysis of equipment
data, a domain that continues to see growing research [15, 132]. Moreover, TinyML also contributes
to non-repudiable anomaly detection in extreme industrial settings by integrating blockchain tech-
nology, ensuring transparent and immutable records of detected anomalies [133]. This integration
of TinyML with embedded systems and federated learning in Industrial IoT further aims to decrease
latency, increase productivity, and enhance data security in complex manufacturing environments
[134, 135]. Additionally, TinyML-enabled smart objects and their associated challenges are also
widely discussed as a new paradigm for efficient, privacy-preserving, and cost-effective solutions
across various IoT applications [136].
Beyond industrial applications, TinyML significantly contributes to solving environmental
problems, particularly in smart agriculture and wildlife monitoring [137]. With the global population
projected to grow significantly [138], smart agriculture leverages IoT, drones, and machine learning
to enable real-time monitoring of crop and soil health, disease detection, and growth tracking
[139, 140]. TinyML is gaining traction in smart agriculture by facilitating machine learning tasks directly on
low-power sensor devices, offering reduced latency, enhanced privacy, and lower energy consumption,
which is especially valuable in underserved areas with limited connectivity [8, 132, 141, 142]. An
example is the Nuru app, developed under the PlantVillage project, which utilizes TensorFlow
Lite to detect plant diseases offline, aiding farmers in remote regions [143, 144]. Additionally, in
wildlife conservation, TinyML supports on-animal and bioacoustic sensors for species tracking,
addressing challenges in handling latency and data volume in vast habitats [145]. This includes
real-time tracking of endangered sea turtles using SmallSats [146] and TinyML-enabled collars to
reduce elephant losses from poaching [147, 148]. These ongoing advancements and broad survey
efforts emphasize TinyML’s critical role in advancing industrial efficiency and environmental
sustainability by pushing AI capabilities closer to the data source [12, 149].

8 On-Device Learning and Reformability


The evolution of TinyML has opened new frontiers for executing artificial intelligence directly on
resource-constrained endpoint devices [9, 127]. Indeed, a significant paradigm shift within this
domain is the move from static, pre-trained models toward dynamic systems capable of learning
and adapting post-deployment [8]. This progression towards on-device learning and the concept
of reformability is critical for the long-term viability and effectiveness of TinyML applications,
particularly as they become integrated into dynamic, real-world environments [15, 27]. Such
capabilities not only enhance model performance over time but also offer substantial benefits for
user privacy and data security by keeping sensitive information localized on the device [19, 150].

8.1 Continual and Few-Shot Learning on MCUs


Traditional TinyML workflows involve training a machine learning model offline on powerful
servers and then deploying a highly optimized, static version for inference on an MCU [127]. However,
this approach is limited, as the performance of a static model can degrade when the data distribution
it encounters in the real world changes over time, a phenomenon known as concept drift [8]. To
overcome this limitation, research has increasingly focused on enabling learning capabilities directly
on the MCU. A promising solution is the implementation of online or continual learning, where
the model can be updated incrementally as new data becomes available. This allows the device to
adapt to changing environmental conditions without requiring a complete redeployment of the
model [151]. For instance, the TinyOL framework was proposed to facilitate online learning on
MCUs, allowing them to learn from a continuous stream of data [8]. This approach is instrumental
for applications that require long-term autonomy and robustness. Furthermore, advancements in
few-shot learning are enabling MCUs to train effectively on a very small number of examples. This
is particularly relevant for customization and personalization, where a device might need to learn
new keywords or commands specific to a user. Research in few-shot KWS has demonstrated that
models deployed on MCUs can be trained to recognize new words with only a handful of training
samples, making the end-user experience significantly more flexible and interactive [152]. Such
on-device training capabilities are often facilitated through methods like federated learning, which
allows for collaborative model training across multiple decentralized devices while keeping the
raw data localized [153].
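
The idea of incremental updates from a data stream can be sketched with a tiny classifier head trained one sample at a time. This is a minimal illustration under stated assumptions, not the TinyOL implementation: the class name, learning rate, and synthetic stream are all hypothetical, and floating-point arithmetic is used for clarity even though an MCU port might use fixed-point.

```python
import math
import random


class OnlineLogisticHead:
    """A tiny classifier head updated one sample at a time with SGD."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One SGD step on the log-loss for a single (x, y) pair; no data is stored."""
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


# Synthetic stream (hypothetical): label is 1 when the first feature exceeds the second.
random.seed(0)
head = OnlineLogisticHead(n_features=2)
for _ in range(2000):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    head.update(x, 1.0 if x[0] > x[1] else 0.0)

test_points = [[0.8, -0.5], [-0.7, 0.6]]
predictions = [head.predict_proba(p) > 0.5 for p in test_points]
```

Because each sample is consumed and discarded, memory use stays constant regardless of stream length, which is the property that makes this style of update plausible on an MCU.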

8.2 Reformable TinyML


Building upon the concept of on-device learning is the emerging paradigm of Reformable TinyML.
This refers to a holistic framework where TinyML systems are engineered with the inherent
capability to be modified, updated, or reformed after their initial deployment [15]. The primary
goal is to create resilient and sustainable intelligent systems that can self-diagnose performance
degradation and trigger an adaptation process to maintain accuracy and efficiency over their
operational lifetime [27]. The reformable pipeline extends beyond simple on-device updates. It
encompasses a structured workflow that includes monitoring for data drift, evaluating the current
model’s efficacy, and invoking a reformation strategy when necessary. This strategy could involve
on-device fine-tuning, fetching a model patch from an edge server, or participating in a federated
learning task to receive an updated global model [15]. Consequently, reformability ensures that the
intelligence at the extreme edge does not become obsolete, thereby enhancing the overall value
and reliability of the IoT ecosystem. The architecture of such a pipeline is conceptually illustrated
in Figure 7.
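
The monitor, evaluate, and reform workflow described above can be sketched as follows. This is one possible reading rather than the pipeline of [15]: the rolling-accuracy drift test, the thresholds, the strategy names, and their priority order are assumptions, and the sketch presumes the device receives some feedback signal about whether a prediction was correct.

```python
from collections import deque


class DriftMonitor:
    """Flags drift when rolling accuracy falls below a fraction of the baseline."""

    def __init__(self, baseline_acc, window=50, tolerance=0.8):
        self.baseline_acc = baseline_acc
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)

    def record(self, correct):
        self.outcomes.append(1 if correct else 0)

    def drift_detected(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        rolling_acc = sum(self.outcomes) / len(self.outcomes)
        return rolling_acc < self.tolerance * self.baseline_acc


def choose_reformation(can_train_locally, server_reachable):
    """Pick one of the strategies named in the text; the priority order is an assumption."""
    if can_train_locally:
        return "on_device_fine_tuning"
    if server_reachable:
        return "fetch_model_patch"
    return "wait_for_federated_round"


monitor = DriftMonitor(baseline_acc=0.90)
for _ in range(50):
    monitor.record(True)            # model still performing well
stable = monitor.drift_detected()
for i in range(50):
    monitor.record(i % 2 == 0)      # accuracy collapses to about 50%
drifted = monitor.drift_detected()
action = choose_reformation(can_train_locally=False, server_reachable=True)
```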


[Figure 7 diagram: a Flower server (user framework code, a federated learning loop, and FedAvg, Qffedavg, and Fed FS strategies) communicates via RPC with Flower clients; each client runs a training pipeline (Python/PyTorch, Python/TensorFlow, or C++/LibTorch) on its local data.]

Fig. 7. Architecture for Privacy-Preserving Federated TinyML [15]

8.3 Privacy and Security Benefits


A foundational advantage of processing data at the edge is the inherent enhancement of privacy
and security [127]. By performing ML inference directly on the MCU, sensitive user data, such as
audio from a microphone or images from a camera, does not need to be transmitted to the cloud or
a remote server [9]. This localization of data significantly reduces the risks associated with data
breaches during transmission and unauthorized access on third-party servers, a crucial consideration
for human-centered applications [19]. The push for on-device learning and reformability further
strengthens these privacy guarantees. In a federated learning context, for example, the raw data
used for training never leaves the user’s device. Instead, only anonymized model updates or
gradients are shared to contribute to a global model, ensuring collaborative learning without
compromising individual privacy [154]. Moreover, recent work has focused on integrating explicit
privacy-preserving mechanisms into the TinyML framework. Techniques such as local differential
privacy can be implemented to add statistical noise to data or model updates before they are
shared, providing mathematical guarantees of privacy [112]. In conjunction with secure hardware
features like memory protection units (MPUs) or secure enclaves to isolate or encrypt model parameters,
these methods provide a robust, multi-layered approach to securing intelligent edge devices [112].
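
A simplified sketch of how noised updates and federated averaging fit together is given below. The clipping bound and noise scale are illustrative assumptions and are not calibrated to any formal privacy budget; all function names and the sample updates are hypothetical.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale the update so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [u * scale for u in update]


def privatize(update, max_norm, noise_std, rng):
    """Clip, then add Gaussian noise on the device before anything is shared."""
    clipped = clip_update(update, max_norm)
    return [u + rng.gauss(0.0, noise_std) for u in clipped]


def federated_average(updates, weights):
    """Weighted mean of client updates (FedAvg-style aggregation)."""
    total = sum(weights)
    dim = len(updates[0])
    return [sum(w * u[i] for u, w in zip(updates, weights)) / total for i in range(dim)]


rng = random.Random(42)
raw_updates = [[3.0, 4.0], [0.3, 0.4], [-0.6, 0.8]]   # hypothetical client updates
sample_counts = [100, 50, 50]                          # hypothetical local dataset sizes
noisy = [privatize(u, max_norm=1.0, noise_std=0.01, rng=rng) for u in raw_updates]
global_update = federated_average(noisy, sample_counts)
clipped_first = clip_update(raw_updates[0], 1.0)
```

Clipping bounds each client's influence before noise is added, so the server only ever sees perturbed, norm-limited updates and never the raw data.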

9 Evaluation Metrics and Benchmarks


Evaluating TinyML models requires a holistic framework that addresses the unique constraints
of ultra-resource-constrained environments while preserving sufficient task performance for real-
world deployment. Unlike conventional machine learning systems evaluated primarily based on
accuracy or F1-score, TinyML models must be evaluated using a blend of metrics that reflect effi-
ciency, scalability, deployment feasibility, and trade-offs between model performance and resource
utilization. These metrics span a wide spectrum, including computational efficiency (inference
latency, throughput), memory and storage footprint (model size, RAM usage), energy consumption
(power and battery profile), and task-specific accuracy under quantized or pruned conditions. More-
over, consistent comparison across approaches is facilitated by a small set of curated benchmarks
and datasets tailored for the TinyML domain, such as KWS, low-resolution vision tasks, and sensor
signal classification.


9.1 Efficiency and Compression Metrics


In TinyML, efficiency is a first-class citizen. Edge devices such as ARM Cortex-M MCUs or RISC-V-based
embedded systems typically offer only kilobytes of RAM and flash storage, often lack floating-point
hardware, and must run on energy-constrained or even battery-less power supplies. Consequently,
performance metrics in TinyML revolve around the ability to deploy models that fit within these
extreme limitations without sacrificing critical functionality. The most frequently used efficiency
metrics include:
9.1.1 Model Size (KB). Model size, measured in kilobytes, represents the total footprint of the
model when stored on the device, including weights, biases, and optionally additional metadata.
Given that popular MCUs (e.g., STM32F746) have only 512 KB to 1 MB of flash memory, model
size is often capped at 250–500 KB to allow room for the operating system and runtime libraries.
Compression techniques such as weight pruning, Huffman encoding, and PTQ are employed to
reduce model size without significant degradation in performance. For instance, Han et al. [155]
demonstrate that pruning can reduce the number of parameters in deep neural networks by 9–13×, with
further savings from quantization, at minimal accuracy drop. MCUNet [41], an architecture designed
specifically for MCUs, achieves ResNet-level accuracy on ImageNet with models as small as 300 KB.
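
The arithmetic behind a flash budget can be sketched as follows. The parameter count, the 512 KB reserved for firmware and runtime, and the function names are illustrative assumptions; the estimate also ignores sparse-index and metadata overhead, so real pruned models will be somewhat larger.

```python
def model_size_kb(num_params, bits_per_weight, sparsity=0.0):
    """Storage for the nonzero weights only; index and metadata overhead are ignored."""
    kept = num_params * (1.0 - sparsity)
    return kept * bits_per_weight / 8 / 1024


def fits_flash(size_kb, flash_kb, reserved_kb):
    """True when the model fits after reserving space for firmware and runtime."""
    return size_kb <= flash_kb - reserved_kb


NUM_PARAMS = 500_000                      # hypothetical network
fp32_kb = model_size_kb(NUM_PARAMS, 32)
int8_kb = model_size_kb(NUM_PARAMS, 8)
int8_pruned_kb = model_size_kb(NUM_PARAMS, 8, sparsity=0.5)

for name, kb in [("FP32", fp32_kb), ("INT8", int8_kb), ("INT8 + 50% pruning", int8_pruned_kb)]:
    print(f"{name}: {kb:.1f} KB, fits 1 MB flash with 512 KB reserved: "
          f"{fits_flash(kb, 1024, 512)}")
```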
9.1.2 Inference Latency (ms). Inference latency is the wall-clock time taken to perform one forward
pass of the model on the target hardware, typically measured in milliseconds. Latency affects the
responsiveness of real-time systems like gesture recognition or wake-word detection. For example,
Google’s KWS model for TensorFlow Lite Micro executes in less than 20 ms on a Cortex-M4
at 80 MHz [156]. Latency is influenced by model depth, width, and data precision (e.g., 8-bit vs.
floating-point), and can be reduced using techniques like operator fusion, loop unrolling, and
hardware-specific optimizations such as CMSIS-NN [76].
9.1.3 Memory Usage (KB). Memory usage includes both the static memory footprint (the size of
the model parameters) and dynamic memory (temporary activations, intermediate buffers). MCUs
typically have only tens to hundreds of kilobytes of SRAM; for example, the STM32L475 has just
128 KB of SRAM, which limits the size of intermediate tensors and imposes constraints on model
depth and width [156]. Memory profiling tools in frameworks such as TensorFlow Lite Micro [72]
and TVM [157] provide developers with visibility into memory allocation and enable optimization
of memory usage across inference stages. Techniques such as activation recomputation (also known
as gradient checkpointing) [158] and in-place memory reuse [159] are particularly effective in
minimizing peak RAM usage. Additionally, CMSIS-NN [76] and other hand-optimized libraries
help reduce memory overhead by using fixed-point arithmetic and optimizing buffer allocation for
low-latency execution on Arm Cortex-M processors.
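
A first-order estimate of peak activation memory can be obtained from tensor shapes alone. The sketch below assumes a purely sequential network with INT8 activations in which each layer's input and output buffers are live at the same time; the shapes and function names are hypothetical, and real runtimes may do better through in-place reuse or worse because of scratch buffers.

```python
def tensor_bytes(shape, bytes_per_element=1):
    """Size of a tensor in bytes; 1 byte per element corresponds to INT8."""
    size = bytes_per_element
    for dim in shape:
        size *= dim
    return size


def peak_activation_bytes(shapes, bytes_per_element=1):
    """Peak RAM for a purely sequential network where each layer's input and output
    buffers must be live at the same time (no in-place reuse, no branches)."""
    return max(
        tensor_bytes(a, bytes_per_element) + tensor_bytes(b, bytes_per_element)
        for a, b in zip(shapes, shapes[1:])
    )


# Hypothetical activation shapes (H, W, C) from input to final feature map.
activation_shapes = [(96, 96, 3), (48, 48, 16), (24, 24, 32), (12, 12, 64)]
peak = peak_activation_bytes(activation_shapes)
peak_kb = peak / 1024
```

In this example the peak occurs at the first layer, where the feature maps are spatially largest, which is typical of vision models on MCUs.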
9.1.4 Operations Count (MACs / FLOPs). Operations count refers to the total multiply-accumulate
operations (MACs) or floating-point operations (FLOPs) required for a single inference. It is often
used as a proxy for computational workload and correlates strongly with both latency and energy
consumption on MCU-class hardware. MLPerf Tiny includes MACs as a primary metric to stan-
dardize comparisons across architectures, enabling fair trade-off analysis across accuracy, latency,
and energy [43].
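
Counting MACs for common layer types is straightforward, as the following sketch shows. The three-layer network is hypothetical, and the conversion of one MAC to two FLOPs is a common convention rather than a universal definition.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, k):
    """MACs for a standard k x k convolution producing an h_out x w_out x c_out map."""
    return h_out * w_out * c_out * k * k * c_in


def dense_macs(n_in, n_out):
    """MACs for a fully connected layer."""
    return n_in * n_out


# Hypothetical three-layer network on a 32 x 32 x 3 input (stride-2 convolutions).
layers = [
    conv2d_macs(16, 16, 3, 8, 3),    # 32x32x3 -> 16x16x8
    conv2d_macs(8, 8, 8, 16, 3),     # 16x16x8 -> 8x8x16
    dense_macs(8 * 8 * 16, 10),      # flatten -> 10 classes
]
total_macs = sum(layers)
approx_flops = 2 * total_macs  # convention: one MAC = one multiply + one add
```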
9.1.5 Energy Consumption (mJ). For battery-powered or energy-harvesting devices, energy efficiency
is paramount. Energy per inference, measured in millijoules (mJ), is the product of inference
latency and average power consumption. Banbury et al. [43] show that their benchmark models
consume between 0.1 mJ and 10 mJ per inference. Specialized accelerators like the GAP8 or Ambiq
Apollo chips allow sub-mJ inference for tasks like image classification or speech recognition.
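
The relationship between latency, power, and energy can be written out directly. The operating point (20 ms at 50 mW) and the 1000 mWh battery below are illustrative assumptions, and the battery estimate ignores sleep, sensing, and radio power.

```python
def energy_per_inference_mj(latency_ms, avg_power_mw):
    """E [mJ] = P [mW] * t [ms] / 1000, since 1 mW * 1 ms = 1 uJ."""
    return avg_power_mw * latency_ms / 1000.0


def inferences_per_battery(battery_mwh, energy_mj):
    """Inferences available from a battery, ignoring sleep and radio power."""
    battery_mj = battery_mwh * 3600.0   # 1 mWh = 3.6 J = 3600 mJ
    return int(battery_mj // energy_mj)


# Hypothetical operating point: 20 ms per inference at 50 mW average power.
energy = energy_per_inference_mj(latency_ms=20, avg_power_mw=50)
count = inferences_per_battery(battery_mwh=1000, energy_mj=energy)
```

Under these assumptions one inference costs 1 mJ, so halving either latency or average power halves the energy per inference.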



9.2 Accuracy Trade-offs and Constraints
Accuracy remains an important performance metric, particularly when TinyML is used in critical
applications such as medical diagnostics or industrial anomaly detection. However, due to severe
memory and compute limitations, models are often subject to trade-offs between efficiency and
accuracy. One of the most widely studied compromises is the balance between quantization and
model performance.

9.2.1 Quantization vs. Accuracy. Quantization reduces the precision of weights and activations,
typically from 32-bit floating point (FP32) to 8-bit integers (INT8), and sometimes lower. This
can lead to massive reductions in model size, latency, and energy usage, but may affect model
accuracy. For comparison, magnitude-based structured pruning of MobileNetV2 (50% of weights removed) incurs
a <1% top-1 drop on ImageNet while shrinking model size by 1.9× [65]. Jacob et al. [160]
found that PTQ can reduce MobileNetV1 accuracy on ImageNet by up to 3–4%, though QAT can
mitigate this. QAT introduces fake quantization nodes during training, allowing the network to
compensate for precision loss. Special training procedures and loss-aware quantization [161] are
employed for binary networks. Beyond QAT and PTQ, several quantization schemes have emerged
to address different levels of granularity and trade-off. For instance, per-channel quantization adjusts
the scale and zero-point for each output channel, leading to better numerical stability and often
improved accuracy compared to per-tensor quantization [162]. Mixed-precision quantization allows
different layers or operations to use different bit-widths (e.g., 8-bit for early layers and 4-bit for
later layers), striking a balance between efficiency and performance [163]. Techniques like DoReFa-
Net [164] and Learned Step Size Quantization [165] further improve accuracy by learning optimal
quantization parameters during training. In the TinyML context, these methods have enabled
the deployment of accurate models like ResNet and MobileNet variants on MCUs with under
256 KB of SRAM. Furthermore, hardware-aware quantization strategies are increasingly integrated
into deployment pipelines using tools such as TensorFlow Model Optimization Toolkit [166] and
PyTorch’s quantization API, enabling automated conversion and validation across platforms.
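
The difference between per-tensor and per-channel quantization can be illustrated with a small affine INT8 quantizer. This is a sketch under stated assumptions: the helper names and weight values are hypothetical, and production converters often use symmetric weight quantization rather than the asymmetric scheme shown here.

```python
def quant_params(values, num_bits=8):
    """Scale and zero-point mapping [min, max] (extended to include 0) onto signed integers."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax


def quantize(values, scale, zero_point, qmin, qmax):
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    return [(q - zero_point) * scale for q in q_values]


def max_error(values, params):
    """Largest absolute round-trip error for the given quantization parameters."""
    scale, zp, qmin, qmax = params
    restored = dequantize(quantize(values, scale, zp, qmin, qmax), scale, zp)
    return max(abs(a - b) for a, b in zip(values, restored))


# Two hypothetical output channels with very different weight ranges.
channels = [[-0.02, 0.01, 0.015, -0.005], [-2.0, 1.5, 0.7, -1.1]]

# Per-tensor: one (scale, zero_point) shared by all channels.
shared = quant_params([w for ch in channels for w in ch])
per_tensor_err = [max_error(ch, shared) for ch in channels]

# Per-channel: separate parameters for each output channel.
per_channel_err = [max_error(ch, quant_params(ch)) for ch in channels]
```

With a shared scale, the small-range channel is quantized on a grid sized for the large-range channel, so its round-trip error is far larger than when it receives its own scale.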

9.2.2 Constraints Beyond Quantization. TinyML models often face deployment-specific constraints
such as memory budgets (e.g., 64 KB of RAM), latency caps (e.g., 10 ms), and energy limits (e.g.,
1 mJ per inference). Frameworks like 𝜇NAS [167] and Once-for-All (OFA) [168] support constraint-
aware NAS to generate tailored models. In practice, trade-offs are carefully evaluated to balance
latency, accuracy, and reliability. Constraint-aware modeling goes beyond architectural choices and
encompasses compiler-level and deployment-time optimizations. For example, models must comply
with quantization compatibility constraints of hardware accelerators like the GAP8 SoC, which
supports only INT8 convolutions [169], or the Ambiq Apollo3 Blue, which requires careful SRAM
and DMA management to maintain sub-mW operation [170]. Real-time constraints also vary across
application domains: voice-triggered devices may tolerate 10–20 ms of latency, whereas
anomaly detection in industrial sensors may allow hundreds of milliseconds but must operate
within a strict energy envelope. To handle such variance, modern compilers such as TVM [157]
(with its Relay intermediate representation) and Glow [78] perform cross-layer optimization and memory layout transformations
that respect such deployment constraints. Additionally, tools like MCUNetV2 [171] integrate NAS
with firmware-level profiling to co-optimize models for specific MCUs, achieving a better trade-off
across the energy-latency-accuracy spectrum.
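
Constraint-aware selection can be sketched as a simple feasibility filter over candidate models, using the example budgets mentioned above (64 KB of RAM, 10 ms, 1 mJ). This is a simplification of what NAS frameworks do, not their actual search procedure; the candidate names and measurements are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float      # validation accuracy in [0, 1]
    peak_ram_kb: float
    latency_ms: float
    energy_mj: float


def feasible(c, ram_kb, latency_ms, energy_mj):
    """A candidate is deployable only if it meets every budget."""
    return (c.peak_ram_kb <= ram_kb
            and c.latency_ms <= latency_ms
            and c.energy_mj <= energy_mj)


def select_best(candidates, ram_kb=64, latency_ms=10, energy_mj=1.0):
    """Most accurate feasible candidate, or None if nothing fits."""
    ok = [c for c in candidates if feasible(c, ram_kb, latency_ms, energy_mj)]
    return max(ok, key=lambda c: c.accuracy) if ok else None


# Hypothetical measurements for three candidate architectures.
pool = [
    Candidate("wide", 0.93, 120.0, 18.0, 2.4),
    Candidate("medium", 0.90, 60.0, 9.0, 0.9),
    Candidate("narrow", 0.85, 30.0, 4.0, 0.4),
]
best = select_best(pool)
```

The most accurate model overall is rejected because it violates every budget, so the selection falls to the best model that fits.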

9.3 Standard Datasets and Benchmarks


To enable reproducibility and fair comparison across TinyML approaches, the community relies on
a curated set of datasets and benchmarks reflecting the domain’s constraints and typical use cases.
These benchmarks target tasks such as KWS, low-resolution vision, and anomaly detection.

Table 5. Summary of benchmark datasets and their characteristics

| Dataset | Task | Input Size | Classes | Use Case |
|---|---|---|---|---|
| Google Speech Commands [172] | Keyword spotting | 1 s audio (16 kHz) | 12–35 | Voice command detection |
| Visual Wake Words [40] | Person detection | 96×96 RGB | 2 | Vision-triggered wake-up |
| Tiny ImageNet [173] | Image classification | 64×64 RGB | 200 | Low-resolution object recognition |
| CIFAR-10/100 [174] | Image classification | 32×32 RGB | 10/100 | Lightweight vision tasks |
| 𝜇MLPerf [38] | Benchmark suite | Various | Various | Standardized TinyML evaluation |

9.3.1 Google Speech Commands. Developed by Warden et al., this dataset contains short spoken
words sampled at 16 kHz. It is the standard benchmark for KWS. Tasks involve recognizing a fixed
vocabulary such as “yes”, “no”, and “go”.
9.3.2 Visual Wake Words. This dataset is a binary classification task for detecting the presence
of a person in low-resolution images. It is used in wake-word-style visual triggers for cameras or
embedded vision systems.
9.3.3 Tiny ImageNet and CIFAR. These datasets serve as benchmarks for image classification under
low-resolution and low-memory conditions. Tiny ImageNet is more challenging due to its 200-class
design, while CIFAR remains widely used for comparison.
9.3.4 𝜇MLPerf Benchmark Suite. MLCommons introduced 𝜇MLPerf to provide standardized evalu-
ation across KWS, image classification, and anomaly detection. It includes metrics like accuracy,
model size, memory footprint, and energy per inference, making it one of the most comprehensive
benchmarks for TinyML systems.
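As a rough illustration of how the metrics listed above fit together, the sketch below derives a per-model report from raw measurements. It is not the official benchmark harness; the function names and the example measurements are hypothetical. The only physical relation used is that average power multiplied by latency gives energy per inference (mW × ms = µJ).

```python
def energy_per_inference_uj(avg_power_mw, latency_ms):
    """Energy in microjoules: 1 mW * 1 ms = 1e-3 W * 1e-3 s = 1 uJ."""
    return avg_power_mw * latency_ms


def summarize(correct, total, model_bytes, peak_ram_bytes, avg_power_mw, latency_ms):
    """Collect the four reported quantities into one record."""
    return {
        "accuracy": correct / total,
        "model_size_kb": model_bytes / 1024,
        "peak_ram_kb": peak_ram_bytes / 1024,
        "energy_uj": energy_per_inference_uj(avg_power_mw, latency_ms),
    }


# Hypothetical measurements for a keyword-spotting model.
report = summarize(correct=90, total=100, model_bytes=51200,
                   peak_ram_bytes=20480, avg_power_mw=2.0, latency_ms=15.0)
print(report)
```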

10 Challenges and Open Research Problems


The field of TinyDL, while demonstrating remarkable progress, faces numerous fundamental
challenges that require innovative solutions to realize its full potential across diverse edge computing
applications [59][175].

10.1 Trade-off Between Accuracy and Footprint


The fundamental tension between model accuracy and resource consumption represents the most
persistent challenge in TinyDL deployment [176]. Current approaches often sacrifice significant
accuracy to meet stringent memory and computational constraints, limiting the applicability of
TinyDL systems in accuracy-critical applications [176].
10.1.1 Memory Hierarchy Complexity. The complex memory hierarchy of MCUs, with limited
SRAM (typically 320 KB) and flash storage (1 MB), creates intricate optimization challenges [176].
Traditional ML benchmarks assume gigabytes of memory, making direct adaptation impossible
[176]. The memory bottleneck affects not only model storage but also intermediate activations

J. ACM, Vol. 37, No. 4, Article . Publication date: May 2025.



during inference, requiring sophisticated memory scheduling strategies that consider the entire
network topology rather than layer-wise optimization [176].

10.2 Computational Efficiency Paradox


While larger models generally achieve better accuracy, the computational resources available on
MCUs create hard limits on deployable model complexity [176]. The emergence of transformer
architectures exacerbates this challenge, as the quadratic complexity of self-attention mechanisms
conflicts with the linear resource scaling capabilities of edge devices [64]. Current solutions like
attention approximation and sparse attention patterns show promise but remain insufficient for
complex real-world applications [64].
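The quadratic term can be made explicit by counting the elements of the attention score matrix, which has one entry per pair of tokens per head. The following sketch is a back-of-the-envelope estimate under stated assumptions (INT8 scores, scores for all heads held in memory at once); the helper name is hypothetical.

```python
def attention_score_bytes(seq_len, num_heads, bytes_per_element=1):
    """Memory for the (seq_len x seq_len) score matrix of every head."""
    return num_heads * seq_len * seq_len * bytes_per_element


SRAM_BYTES = 320 * 1024  # SRAM figure quoted earlier in this section

for seq_len in (64, 128, 512):
    needed = attention_score_bytes(seq_len, num_heads=4)
    print(seq_len, needed, needed <= SRAM_BYTES)
```

Doubling the sequence length quadruples the score memory, so a budget that is comfortable at 64 tokens is exceeded well before 512 tokens under these assumptions.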

10.3 Secure Model Updates in the Field


The deployment of TinyDL models in remote and potentially hostile environments creates unprece-
dented security challenges that traditional cloud-based ML systems do not encounter [177][178].
10.3.1 Adversarial Attack Transferability. Research demonstrates that adversarial attacks crafted
on powerful host machines can successfully transfer to resource-constrained devices like ESP32
and Raspberry Pi, highlighting the vulnerability of TinyML systems to security threats [177]. The
limited defensive capabilities of edge devices make them particularly susceptible to model extraction
and evasion attacks, where adversaries can potentially reconstruct sensitive model parameters or
manipulate model behavior [177][178].
10.3.2 Model Update Vulnerabilities. TinyML devices often operate in physically accessible envi-
ronments where attackers can potentially intercept or manipulate model updates [178]. The limited
computational resources of these devices make it challenging to implement robust cryptographic
protocols for secure model transmission and verification [178]. The integration of hardware secu-
rity modules and trusted execution environments into TinyML platforms represents a promising
direction, but current solutions significantly increase cost and power consumption [150].
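One lightweight way to check that a model update has not been altered in transit is a keyed hash over the model blob. The sketch below uses HMAC-SHA256 from the Python standard library on the host side; it assumes a pre-shared symmetric key, which is an assumption made here for brevity and not a scheme prescribed by the cited works. Deployments that cannot protect a shared key on the device would need asymmetric signatures instead.

```python
import hashlib
import hmac


def sign_update(model_bytes, key):
    """Compute an authentication tag for a model update."""
    return hmac.new(key, model_bytes, hashlib.sha256).digest()


def verify_update(model_bytes, tag, key):
    """Accept the update only if the tag matches; comparison is constant-time."""
    expected = hmac.new(key, model_bytes, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)


# Hypothetical key and model blob, for illustration only.
key = b"device-shared-secret"
update = b"\x00\x01\x02 model weights \x03\x04"
tag = sign_update(update, key)

print(verify_update(update, tag, key))            # untouched update
print(verify_update(update + b"\xff", tag, key))  # tampered update
```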

10.4 Generalization Under Few-Shot Learning


The limited computational and storage resources of TinyML devices severely constrain their ability
to adapt to new tasks or domains through traditional machine learning approaches [179][180]. For
example, wearable health-monitoring devices deployed in elderly care settings often need to adapt
to new users with limited labeled data. Supporting such personalized learning on-device without
cloud retraining demands efficient few-shot learning techniques that operate within sub-256 KB
memory and minimal latency budgets.
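A simple family of few-shot methods that fits such budgets keeps one mean embedding ("prototype") per class and classifies by nearest prototype. The sketch below assumes a frozen feature extractor has already produced the embeddings; the labels and vectors are hypothetical, and this is a generic technique rather than the method of any cited work.

```python
def class_prototypes(support):
    """support maps each label to a list of embedding vectors; return the mean per label."""
    prototypes = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        prototypes[label] = [sum(v[i] for v in vectors) / len(vectors)
                             for i in range(dim)]
    return prototypes


def classify(embedding, prototypes):
    """Return the label whose prototype is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: sq_dist(embedding, prototypes[label]))


# Two labeled examples per class from a new user (hypothetical 2-D embeddings).
support = {
    "walk": [[0.0, 1.0], [0.2, 0.8]],
    "fall": [[1.0, 0.0], [0.8, 0.2]],
}
prototypes = class_prototypes(support)
print(classify([0.7, 0.3], prototypes))
print(classify([0.1, 0.7], prototypes))
```

Adapting to a new user only updates the stored prototypes, so the on-device cost is one vector per class rather than a gradient step through the network.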
10.4.1 Meta-Learning Constraints. While few-shot learning techniques show promise for enabling
rapid adaptation in resource-constrained environments, the meta-learning algorithms themselves
often require significant computational resources that exceed TinyML capabilities [179]. The
memory requirements for maintaining meta-parameters and adaptation mechanisms conflict with
the storage limitations of MCUs, necessitating novel approaches to meta-learning specifically
designed for edge deployment [180].
10.4.2 Continual Learning Limitations. The ability to learn continuously from new data while
retaining previously acquired knowledge represents a critical capability for long-term TinyML
deployment [181]. However, the memory limitations of edge devices make it challenging to imple-
ment effective continual learning strategies that prevent catastrophic forgetting while enabling
knowledge acquisition from limited data samples [181]. Recent research on memory-constrained
online continual learning demonstrates that effective continual learning is possible under severe
memory limitations, but requires algorithmic innovations that fundamentally differ from traditional
approaches [181].
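A common building block for replay-based continual learning under a fixed memory budget is a reservoir-sampled buffer, which keeps a uniform random subset of everything seen so far in constant space. The sketch below is a generic illustration and is not claimed to be the approach of [181]; the class name and capacity are hypothetical.

```python
import random


class ReservoirBuffer:
    """Fixed-capacity replay buffer filled by reservoir sampling (Algorithm R)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the sample.
            self.items.append(sample)
        else:
            # Keep the new sample with probability capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = sample


buffer = ReservoirBuffer(capacity=16)
for sample_id in range(1000):
    buffer.add(sample_id)
print(len(buffer.items), buffer.seen)
```

Old samples drawn from the buffer can then be interleaved with new data during on-device updates to reduce forgetting, at a memory cost fixed by the capacity.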

10.5 Lack of Standardized Benchmarks for TinyDL


The absence of comprehensive, standardized benchmarks specifically designed for TinyDL systems
impedes systematic progress and fair comparison of different approaches [43].
10.5.1 Hardware Heterogeneity Challenges. The diverse landscape of TinyML hardware platforms,
ranging from ARM Cortex-M MCUs to specialized AI accelerators, complicates the development
of universally applicable benchmarks [43][182]. Each platform exhibits unique characteristics in
terms of memory hierarchy, instruction sets, and optimization opportunities, making it difficult to
establish fair comparison metrics across different systems [43].
10.5.2 Multi-Objective Evaluation Complexity. Traditional ML benchmarks focus primarily on
accuracy metrics, but TinyDL systems require evaluation across multiple dimensions including
latency, energy consumption, memory usage, and model size [43]. The MLPerf Tiny benchmark
provides valuable baseline comparisons but focuses primarily on computer vision and simple
audio processing tasks, neglecting other important application domains such as natural language
processing and sensor fusion [43].
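When several metrics matter at once, a single ranking is often replaced by a Pareto front: the set of candidates that no other candidate beats on every dimension. The sketch below shows this for accuracy (higher is better) and three cost metrics (lower is better). The candidate names and numbers are hypothetical and do not come from any benchmark.

```python
COSTS = ("latency_ms", "energy_uj", "ram_kb")  # lower is better


def dominates(a, b):
    """a dominates b if it is no worse on every metric and strictly better on one."""
    no_worse = a["accuracy"] >= b["accuracy"] and all(a[k] <= b[k] for k in COSTS)
    better = a["accuracy"] > b["accuracy"] or any(a[k] < b[k] for k in COSTS)
    return no_worse and better


def pareto_front(candidates):
    """Return the candidates that are not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o is not c)]


candidates = [
    {"name": "A", "accuracy": 0.90, "latency_ms": 20, "energy_uj": 30, "ram_kb": 100},
    {"name": "B", "accuracy": 0.85, "latency_ms": 25, "energy_uj": 35, "ram_kb": 120},
    {"name": "C", "accuracy": 0.80, "latency_ms": 5, "energy_uj": 8, "ram_kb": 40},
]
front_names = [c["name"] for c in pareto_front(candidates)]
print(front_names)
```

Here model B is dropped because A is better on all four metrics, while A and C both remain: neither is uniformly better, so the choice depends on the application's latency and energy budget.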

10.6 Limited Tool Support for Advanced DL on MCUs


The deployment of sophisticated deep learning models on MCUs faces significant toolchain limita-
tions that restrict the practical implementation of state-of-the-art architectures [176].
10.6.1 Transformer Deployment Challenges. While transformer architectures represent the state-of-
the-art in many AI applications, their deployment on MCUs remains severely limited by inadequate
compiler and runtime support [64]. Current toolchains lack optimized implementations of attention
mechanisms and layer normalization operations, forcing researchers to develop custom, platform-
specific solutions that limit portability and reproducibility [64]. This limitation is particularly
evident in tasks such as on-device natural language understanding or voice assistant applications,
where lightweight models like TinyBERT or DistilBERT cannot be fully deployed on common
MCUs (e.g., ARM Cortex-M4) due to the absence of efficient attention-layer primitives in existing
toolchains, such as TensorFlow Lite Micro or CMSIS-NN.
10.6.2 Cross-Platform Portability. The heterogeneous nature of TinyML hardware creates signifi-
cant challenges for developing portable software solutions [176]. Current toolchains often require
platform-specific optimizations that limit code reusability and increase development overhead,
hindering the widespread adoption of TinyDL techniques across diverse hardware platforms [176].

These challenges represent fundamental research opportunities that will shape the future
development of TinyDL [59][183]. Addressing these issues requires interdisciplinary collaboration across
machine learning, computer systems, and hardware design communities to develop innovative
solutions that unlock the full potential of edge AI applications [183][184].

11 Future Directions
As TinyDL systems mature and expand across diverse application domains from healthcare and
smart homes to industrial automation and autonomous sensing, new challenges and technological
frontiers are emerging. Addressing these will require interdisciplinary advances in hardware design,
algorithmic efficiency, secure training, and adaptable software ecosystems. This section outlines
five promising directions for future exploration.
Neuromorphic architectures employing Spiking Neural Networks (SNNs) offer an alternative
computational paradigm designed for ultra-low-power, event-driven processing typical of
brain-inspired systems. These architectures promise efficient always-on inference on MCU-scale
devices, which makes them well suited to continuous monitoring applications, with hardware
platforms like BrainChip’s Akida and Intel’s Loihi leading the way. To fully leverage
SNNs in TinyDL, research must advance surrogate-gradient training, event encoding techniques,
and software toolchains that map spiking models onto neuromorphic hardware seamlessly [65].
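The event-driven behavior of a spiking neuron can be seen in a discrete-time leaky integrate-and-fire (LIF) model: the membrane potential decays, accumulates input, and emits a spike only when it crosses a threshold. The sketch below is a textbook-style simplification with hypothetical parameter values; it does not model Akida, Loihi, or any specific toolchain.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.9):
    """Simulate one LIF neuron; return a list of 0/1 spikes, one per time step."""
    potential = 0.0
    spikes = []
    for current in input_current:
        # Leak a fraction of the stored potential, then integrate the input.
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # reset after firing
        else:
            spikes.append(0)
    return spikes


spikes = simulate_lif([0.5, 0.5, 0.5, 0.0, 0.0, 1.2])
print(spikes)
```

Downstream computation is needed only at the time steps where a spike occurs, which is the source of the power savings for sparse, always-on workloads.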
Implementing federated learning (FL) in TinyDL contexts addresses privacy and adaptability
by enabling decentralized learning across devices without sharing raw data, which is crucial for
distributed sensor networks. Lightweight frameworks such as TinyFedTL and TinyMetaFed demonstrate
on-device aggregation of quantized updates, yet challenges remain in managing communication
overhead, heterogeneous device capabilities, and adversarial resilience [69, 178]. Future work must
focus on sparsified updates, asynchronous or hierarchical FL protocols, and secure aggregation
mechanisms amenable to TinyML constraints.
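The aggregation step can be sketched as a sample-weighted average of INT8-quantized client updates, in the style of federated averaging. This is a minimal illustration under stated assumptions (a shared fixed scale, plain averaging, no secure aggregation); it is not the API of TinyFedTL or TinyMetaFed, and all names and values are hypothetical.

```python
def quantize_update(update, scale):
    """Encode a float update as INT8 codes using a scale shared by all clients."""
    return [max(-127, min(127, round(u / scale))) for u in update]


def aggregate(quantized_updates, sample_counts, scale):
    """Average the decoded updates, weighting each client by its sample count."""
    total = sum(sample_counts)
    dim = len(quantized_updates[0])
    averaged = []
    for i in range(dim):
        weighted = sum(q[i] * n for q, n in zip(quantized_updates, sample_counts))
        averaged.append(weighted * scale / total)
    return averaged


scale = 0.01
client_updates = [[0.10, -0.20], [0.30, 0.00]]  # hypothetical weight deltas
sample_counts = [10, 30]                        # local dataset sizes
codes = [quantize_update(u, scale) for u in client_updates]
global_update = aggregate(codes, sample_counts, scale)
print(codes, global_update)
```

Each client transmits one byte per parameter instead of a 32-bit float, which addresses the communication overhead mentioned above at the cost of quantization error.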
Tiny Foundation Models refer to miniaturized versions of large pretrained models intended for
deployment on edge hardware. Promising techniques such as knowledge distillation, structured
pruning, and quantization, applied to models like TinyViT, have shown the potential to reduce
model size to MCU-suitable scales while preserving task performance [41, 63]. The next step is to
enable modular foundational architectures, where a general pre-trained “backbone” model supports
multiple lightweight task-specific heads, with workflows powered by on-device or Edge
AutoML-enabled fine-tuning.
Edge AutoML seeks to automate the process of designing, compressing, and
deploying TinyDL models on resource-constrained devices. Hardware-aware NAS
frameworks such as TinyNAS and Once-for-All Networks have demonstrated effective ways to
balance accuracy with memory and latency constraints [61, 64]. However, integrating AutoML into
full deployment pipelines remains an open challenge. Future research should focus on combining
AutoML with model compression strategies like quantization and pruning and incorporating
hardware feedback to generate models that are not only accurate but also energy-efficient and
deployable in real-world TinyDL scenarios.
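Of the compression techniques named above, knowledge distillation is the easiest to state compactly: the student is trained to match the teacher's temperature-softened output distribution. The sketch below computes the standard soft-target term, a KL divergence scaled by the squared temperature, in plain Python. The logits and the temperature value are hypothetical, and a full training loop would normally combine this term with the usual cross-entropy on hard labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature^2."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    return temperature ** 2 * kl


teacher_logits = [4.0, 1.0, 0.5]
matched = distillation_loss([4.0, 1.0, 0.5], teacher_logits)
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher_logits)
print(matched, mismatched)
```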
Domain-specific accelerators, including NPUs, ASICs, FPGAs, and specialized RISC-V engines,
offer substantial gains in inference speed, energy efficiency, and model scalability for TinyDL.
Devices like the EdgeTPU and transformer-focused RISC-V extensions efficiently execute quantized
convolution and attention workloads, outperforming general-purpose MCUs [64, 65]. The challenge
now is to develop advanced compilation toolchains that partition and schedule TinyDL models
across heterogeneous hardware, integrate with platforms like TFLite Micro and CMSIS-NN, and
maximize runtime configuration flexibility without sacrificing portability or ease of development
[77, 78].

12 Conclusions
This survey presents a comprehensive examination of the evolution from TinyML to TinyDL,
highlighting how the convergence of efficient model architectures, software toolchains, and hard-
ware platforms has enabled sophisticated on-device intelligence in severely resource-constrained
environments. We began by delineating the scope and distinction between TinyML and TinyDL,
emphasizing the growing need to embed deep learning capabilities, once reserved for data centers,
into low-power MCUs and edge devices. We have outlined the hardware advancements, including
the emergence of neural accelerators and specialized ASICs, that now support the deployment of
deep networks with kilobyte-scale memory footprints and milliwatt power budgets. Simultane-
ously, we explored the critical role of model optimization techniques such as quantization, pruning,
and joint compression strategies, as well as the contributions of NAS in tailoring architectures
to edge constraints. On the software side, we cataloged an extensive range of deployment
frameworks, compiler toolchains, and AutoML platforms that streamline the end-to-end TinyDL lifecycle.
Through domain-specific applications in vision, audio, healthcare, and industrial monitoring, we
demonstrated the transformative potential of TinyDL across sectors demanding low latency, energy
efficiency, and data privacy.
Looking ahead, TinyDL is poised to catalyze a new generation of edge-native intelligence. This
includes the development of neuromorphic architectures using spiking neural networks, federated
learning for decentralized personalization, and ultra-lightweight foundation models capable of
generalization across tasks and modalities. The co-design of hardware and software will become
increasingly central, as will the creation of standardized, energy-aware benchmarks to evaluate
system performance holistically. By bridging the conceptual, architectural, and practical aspects of
TinyDL, this survey aims to serve as a foundational resource for both researchers and practitioners.
It underscores the critical shift from cloud dependence to autonomous, efficient edge intelligence,
laying the groundwork for continued innovation in AI at the very edge of computing.

References
[1] Syed Ali Raza Zaidi, Ali M Hayajneh, Maryam Hafeez, and Qasim Zeeshan Ahmed. Unlocking edge intelligence
through tiny machine learning (tinyml). IEEE Access, 10:100867–100877, 2022.
[2] Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, and Song Han. Tiny machine learning: Progress and futures
[feature]. IEEE Circuits and Systems Magazine, 23(3):8–34, 2023.
[3] Hui Han and Julien Siebert. Tinyml: A systematic review and synthesis of existing research. In 2022 International
Conference on Artificial Intelligence in Information and Communication (ICAIIC), pages 269–274. IEEE, 2022.
[4] Bita Darvish Rouhani, Azalia Mirhoseini, and Farinaz Koushanfar. Tinydl: Just-in-time deep learning solution for
constrained embedded systems. In 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–4.
IEEE, 2017.
[5] Minh Tri Lê, Pierre Wolinski, and Julyan Arbel. Efficient neural networks for tiny machine learning: A comprehensive
review. arXiv preprint arXiv:2311.11883, 2023.
[6] Jihong Park, Sumudu Samarakoon, Mehdi Bennis, and Mérouane Debbah. Wireless network intelligence at the edge.
Proceedings of the IEEE, 107(11):2204–2239, 2019.
[7] Stanislava Soro. Tinyml for ubiquitous edge ai. arXiv preprint arXiv:2102.01255, 2021.
[8] Youssef Abadade, Anas Temouden, Hatim Bamoumen, Nabil Benamar, Yousra Chtouki, and Abdelhakim Senhaji
Hafid. A comprehensive survey on tinyml. IEEE Access, 11:96892–96922, 2023.
[9] Partha Pratim Ray. A review on tinyml: State-of-the-art and prospects. Journal of King Saud University-Computer and
Information Sciences, 34(4):1595–1623, 2022.
[10] Danilo Pau and Prem Kumar Ambrose. Automated neural and on-device learning for micro controllers. In 2022 IEEE
21st Mediterranean Electrotechnical Conference (MELECON), pages 758–763. IEEE, 2022.
[11] Giovanni Delnevo, Silvia Mirri, Catia Prandi, and Pietro Manzoni. An evaluation methodology to determine the
actual limitations of a tinyml-based solution. Internet of Things, 22:100729, 2023.
[12] Luigi Capogrosso, Federico Cunico, Dong Seon Cheng, Franco Fummi, and Marco Cristani. A machine learning-
oriented survey on tiny machine learning. IEEE Access, 12:23406–23426, 2024.
[13] Lina Bariah, Qiyang Zhao, Hang Zou, Yu Tian, Faouzi Bader, and Merouane Debbah. Large generative ai models for
telecom: The next big thing? IEEE Communications Magazine, 62(11):84–90, 2024.
[14] Imopishak Thingom and N Basanta Singh. A review on machine learning in iot devices. International Journal of
Digital Technologies, 2(1), 2023.
[15] Visal Rajapakse, Ishan Karunanayake, and Nadeem Ahmed. Intelligence at the extreme edge: A survey on reformable
tinyml. ACM Computing Surveys, 55(13s):1–30, 2023.
[16] Nasser Alajlan and Dalia M. Ibrahim. Tinyml: Enabling of inference deep learning models on ultra-low-power iot
edge devices for ai applications. Micromachines, 13(6):851, 2022.
[17] Abdussalam Elhanashi, Pierpaolo Dini, Sergio Saponara, and Qinghe Zheng. Advancements in tinyml: Applications,
limitations, and impact on iot devices. Electronics, 13(17):3562, 2024.
[18] Georgios Kornaros. Hardware-assisted machine learning in resource-constrained iot environments for security:
review and future prospective. IEEE Access, 10:58603–58622, 2022.
[19] Ismail Lamaakal, Ibrahim Ouahbi, Khalid El Makkaoui, Yassine Maleh, Paweł Pławiak, and Fahad Alblehai. A tinydl
model for gesture-based air handwriting arabic numbers and simple arabic letters recognition. IEEE Access, 2024.


[20] Zeinab E Ahmed, Aisha A Hashim, Rashid A Saeed, and Mamoon M Saeed. Tinyml network applications for smart
cities. In TinyML for Edge Intelligence in IoT and LPWAN Networks, pages 423–451. Elsevier, 2024.
[21] Norah N Alajlan and Dina M Ibrahim. Original research article tinyml: Adopting tiny machine learning in smart
cities. Journal of Autonomous Intelligence, 7(4), 2024.
[22] Ivan Khokhlov, Egor Davydenko, Ilya Osokin, Ilya Ryakin, Azer Babaev, Vladimir Litvinenko, and Roman Gorbachev.
Tiny-yolo object detection supplemented with geometrical data. In 2020 IEEE 91st Vehicular Technology Conference
(VTC2020-Spring), pages 1–5. IEEE, 2020.
[23] Nithesh Singh Sanjay and Ali Ahmadinia. Mobilenet-tiny: A deep neural network-based real-time object detection for
rasberry pi. In 2019 18th IEEE international conference on machine learning and applications (ICMLA), pages 647–652.
IEEE, 2019.
[24] Brett Koonce. Squeezenet. In Convolutional neural networks with swift for tensorflow: image recognition and dataset
categorization, pages 73–85. Springer, 2021.
[25] Riku Immonen and Timo Hämäläinen. Tiny machine learning for resource-constrained microcontrollers. Journal of
Sensors, 2022(1):7437023, 2022.
[26] Danilo Pau, Abderrahim Khiari, and Davide Denaro. Online learning on tiny micro-controllers for anomaly detection
in water distribution systems. In 2021 IEEE 11th International Conference on Consumer Electronics (ICCE-Berlin), pages
1–6. IEEE, 2021.
[27] Rakhee Kallimani, Krishna Pai, Prasoon Raghuwanshi, Sridhar Iyer, and Onel LA López. Tinyml: Tools, applications,
challenges, and future research directions. Multimedia Tools and Applications, 83(10):29015–29045, 2024.
[28] Swapnil Sayan Saha, Sandeep Singh Sandha, and Mani Srivastava. Machine learning for microcontroller-class
hardware: A review. IEEE Sensors Journal, 22(22):21362–21390, 2022.
[29] Michael Fauscette. TinyML: Portable, Low Cost, Low Power Machine Learning, 2025.
[30] Sucheta Mandal. TinyML: Running Deep Learning Models on Microcontrollers, April 2025.
[31] Syntiant. Syntiant NDP120 Achieves Outstanding Results in Latest MLPerf Tiny v0.7 Benchmark Suite, 2025.
[32] Coral. M.2 Accelerator with Dual Edge TPU, 2025.
[33] Syntiant. Syntiant Core 2 Achieves Lowest Power Results in MLPerf Tiny v1.2 Benchmark Suite, 2025.
[34] Himax Technologies Inc. Himax Launches WiseEye WE-I Plus HX6537-A to Support AI Deep Learning with Google’s
TensorFlow Lite for Microcontrollers, June 2020.
[35] Salvatore Salamone. Real-time Analytics News for Week Ending June 19, June 2021.
[36] MLCommons. Benchmark MLPerf Inference: Tiny | MLCommons V1.1 Results, 2025.
[37] MLMark. Introducing the EEMBC MLMark Benchmark, 2025.
[38] Colby R Banbury, Vijay Janapa Reddi, Max Lam, William Fu, Amin Fazel, Jeremy Holleman, Xinyuan Huang, Robert
Hurtado, David Kanter, Anton Lokhmotov, et al. Benchmarking tinyml systems: Challenges and direction. arXiv
preprint arXiv:2003.04821, 2020.
[39] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning
requires rethinking generalization, 2017.
[40] Aakanksha Chowdhery, Pete Warden, Jonathon Shlens, Andrew Howard, and Rocky Rhodes. Visual wake words
dataset. arXiv preprint arXiv:1906.05721, 2019.
[41] Ji Lin, Wei-Ming Chen, Yujun Lin, Chuang Gan, Song Han, et al. Mcunet: Tiny deep learning on iot devices. Advances
in neural information processing systems, 33:11711–11722, 2020.
[42] Sangwon Lee, Jonghoon Choi, Sehoon Park, and Sungroh Yoon. Designing extremely memory-efficient cnns for
on-device vision tasks. IEEE Access, 8:49401–49413, 2020.
[43] Colby Banbury, Vijay Janapa Reddi, Peter Torelli, Jeremy Holleman, Nat Jeffries, Csaba Kiraly, Pietro Montino, David
Kanter, Sebastian Ahmed, Danilo Pau, et al. Mlperf tiny benchmark. arXiv preprint arXiv:2106.07597, 2021.
[44] Gaurav Menghani. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM
Computing Surveys, 55(12):1–37, 2023.
[45] Gabriel Signoretti, Marianne Silva, Pedro Andrade, Ivanovitch Silva, Emiliano Sisinni, and Paolo Ferrari. An evolving
tinyml compression algorithm for iot environments based on data eccentricity. Sensors, 21(12):4153, 2021.
[46] Urmish Thakker, Paul N Whatmough, Zhi-Gang Liu, Matthew Mattina, and Jesse Beu. Compressing language models
using doped kronecker products. arXiv preprint arXiv:2001.08896, 2020.
[47] Pan Hu, Junha Im, Zain Asgar, and Sachin Katti. Starfish: Resilient image compression for aiot cameras. In Proceedings
of the 18th Conference on Embedded Networked Sensor Systems, pages 395–408, 2020.
[48] Sek M Chai. Quantization-guided training for compact tinyml models. In Research Symposium on Tiny Machine
Learning, 2021.
[49] Manuele Rusci, Marco Fariselli, Alessandro Capotondi, and Luca Benini. Leveraging automated mixed-low-precision
quantization for tiny edge microcontrollers. In IoT Streams for Data-Driven Predictive Maintenance and IoT, Edge,
and Mobile for Embedded Machine Learning: Second International Workshop, IoT Streams 2020, and First International
Workshop, ITEM 2020, Co-located with ECML/PKDD 2020, Ghent, Belgium, September 14-18, 2020, Revised Selected Papers
2, pages 296–308. Springer, 2020.
[50] Hamed Fatemi, Vedant Karia, Tej Pandit, and Dhireesha Kudithipudi. TENT: Efficient quantization of neural networks
on the tiny edge with tapered fixed point. In Proceedings of the Research Symposium on Tiny Machine Learning, 2020.
[51] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. Squeezenet:
Alexnet-level accuracy with 50x fewer parameters and <0.5mb model size, 2016.
[52] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto,
and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017.
[53] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted
residuals and linear bottlenecks, 2019.
[54] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller,
faster, cheaper and lighter, 2020.
[55] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling
bert for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.
[56] Haokui Zhang, Wenze Hu, and Xiaoyu Wang. Parc-net: Position aware circular convolution with merits from convnets
and transformer, 2022.
[57] Yvan Tortorella, Luca Bertaccini, Luca Benini, Davide Rossi, and Francesco Conti. Redmule: A mixed-precision matrix-
matrix operation engine for flexible and energy-efficient on-chip linear algebra and tinyml training acceleration,
2023.
[58] Hasib-Al Rashid, Argho Sarkar, Aryya Gangopadhyay, Maryam Rahnemoonfar, and Tinoosh Mohsenin. Tinyvqa:
Compact multimodal deep neural network for visual question answering on resource-constrained devices, 2024.
[59] Soroush Heydari and Qusay H Mahmoud. Tiny machine learning and on-device inference: A survey of applications,
challenges, and future directions. Sensors, 25(10):3191, 2025.
[60] Praneel Chand and Mansour Assaf. An empirical study on lightweight cnn models for efficient classification of used
electronic parts. Sustainability, 16(17):7607, 2024.
[61] Mohammad Javad Shafiee, Francis Li, Brendan Chwyl, and Alexander Wong. Squishednets: Squishing squeezenet
further for edge device scenarios via deep evolutionary synthesis. arXiv preprint arXiv:1711.07459, 2017.
[62] Yuanyuan Xu, Genke Yang, Jiliang Luo, and Jianan He. An electronic component recognition algorithm based on
deep learning with a faster squeezenet. Mathematical Problems in Engineering, 2020(1):2940286, 2020.
[63] Colby Banbury, Chuteng Zhou, Igor Fedorov, Ramon Matas, Urmish Thakker, Dibakar Gope, Vijay Janapa Reddi,
Matthew Mattina, and Paul Whatmough. Micronets: Neural network architectures for deploying tinyml applications
on commodity microcontrollers. Proceedings of machine learning and systems, 3:517–532, 2021.
[64] Victor JB Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, and Luca Benini. Optimizing the deployment of
tiny transformers on low-power mcus. IEEE Transactions on Computers, 2024.
[65] Tailin Liang, John Glossner, Lei Wang, Shaobo Shi, and Xiaotong Zhang. Pruning and quantization for deep neural
network acceleration: A survey. Neurocomputing, 461:370–403, 2021.
[66] Benjamin Hawks, Javier Duarte, Nicholas J Fraser, Alessandro Pappalardo, Nhan Tran, and Yaman Umuroglu. Ps and
qs: Quantization-aware pruning for efficient low latency neural network inference. Frontiers in Artificial Intelligence,
4:676564, 2021.
[67] Andrey Kuzmin, Markus Nagel, Mart Van Baalen, Arash Behboodi, and Tijmen Blankevoort. Pruning vs quantization:
Which is better? Advances in neural information processing systems, 36:62414–62427, 2023.
[68] KA Kumari, S Ahamad, T Patil, K Sardana, E Muniyandy, and D Pilli. Neural network pruning techniques for efficient
model compression. International Journal of Intelligent Systems and Applications in Engineering, 12(15s):565–575, 2024.
[69] Xinyu Zhang, Ian Colbert, Ken Kreutz-Delgado, and Srinjoy Das. Training deep neural networks with joint quantization
and pruning of features and weights.
[70] Han Cai, Chuang Gan, Ji Lin, and Song Han. Network augmentation for tiny deep learning. arXiv preprint
arXiv:2110.08890, 2021.
[71] N. Tan. AI on Microcontrollers: uTensor brings Deep-Learning to MCUs. FOSDEM 2018, 2018. Presentation.
[72] Robert David, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat Jeffries, Jian Li, Nick Kreeger, Ian Nappier, Meghna
Natraj, Tiezhen Wang, et al. Tensorflow lite micro: Embedded machine learning for tinyml systems. Proceedings of
Machine Learning and Systems, 3:800–811, 2021.
[73] Shawn Hymel, Jan Trivedi, Louis Heller, Sandeep Sharma, Paul Fiedler, Ajay Patel, Anil Chandak, Abhishek Sinha,
and Thomas Schmid. Edge Impulse: An MLOps Platform for Tiny Machine Learning. arXiv preprint, arXiv:2212.03332,
2022. Available at [Link]
[74] Google AI Edge. TensorFlow Lite Model Maker. [Link] 2025. Accessed:
June 5, 2025.


[75] Towards Data Science. Deep Learning on your phone: PyTorch Lite Interpreter for mobile
platforms. [Link]
ae73d0b17eaa/3, 2025. Published on January 18, 2025. Accessed: June 5, 2025.
[76] Liangzhen Lai, Naveen Suda, and Vikas Chandra. CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M
CPUs. arXiv preprint, arXiv:1801.06601, 2018. Available at [Link]
[77] C. Liu, M. Jobst, L. Guo, X. Shi, J. Partzsch, and C. Mayr. Deploying machine learning models to ahead-of-time runtime
on edge using microtvm. arXiv preprint, arXiv:2304.04842, 2023. Available at [Link]
[78] N. Rotem, J. Fix, S. Abdulrasool, G. Catron, S. Deng, R. Dzhabarov, N. Gibson, J. Hegeman, M. Lele, R. Levenstein,
J. Montgomery, B. Maher, S. Nadathur, J. Olesen, J. Park, A. Rakhov, M. Smelyanskiy, and M. Wang. Glow: Graph
lowering compiler techniques for neural networks. arXiv preprint, arXiv:1805.00907, 2018. Available at https:
//[Link]/abs/1805.00907.
[79] Keras2c: A library for converting Keras neural networks to real-time compatible C. Engineering Applications of
Artificial Intelligence, 100:104188, 2021.
[80] R. R. Curtin, J. R. Cline, N. P. Slagle, W. B. March, P. Ram, N. A. Mehta, and A. G. Gray. Mlpack: A scalable c++
machine learning library. Journal of Machine Learning Research, 14(1):801–805, 2013.
[81] STMicroelectronics. X-CUBE-AI: STM32Cube Expansion Package. [Link]
[Link], 2025. Accessed: June 5, 2025.
[82] J. Duarte, E. Kreinar, J. Ngadiuba, et al. hls4ml: An open-source codesign workflow to empower scientific low-
power machine learning devices. In TinyML Research Symposium 2021, San Jose, CA, 2021. arXiv:2103.05579,
[Link]
[83] ARM Ltd. OctoML: Accelerating ML model deployment. [Link] 2025. ARM
Partner Catalog, Accessed: June 5, 2025.
[84] Nebuly Team. nebullvm: AI runtime optimization library. [Link] 2024. Accessed: June 5,
2025.
[85] Mauro Conti, Roberto Di Pietro, Luigi V. Mancini, and Alessandro Mei. (old) distributed data source verification in
wireless sensor networks. Inf. Fusion, 10(4):342–353, 2009.
[86] X. He. Accelerated linear algebra compiler for computationally efficient numerical models. PLOS ONE, 18(2):e0282265,
2023.
[87] J. Bai, F. Lu, K. Zhang, et al. Onnx: Open neural network exchange. GitHub repository, 2019.
[88] N. Vasilache et al. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions.
arXiv preprint, arXiv:1802.04730, 2018. Available at [Link]
[89] Qeexo. Qeexo automl user guides. [Link] 2025. Accessed: 2025-06-11.
[90] Neuton AI. Neuton ai. [Link] 2025. Accessed: 2025-06-11.
[91] Latent AI. Latent ai. [Link] 2025. Accessed: 2025-06-11.
[92] SensiML Corp. Sensiml toolkit technical overview. Technical report, SensiML Corp., April 2021. Rev. 1.1.
[93] Arduino. Get started with machine learning on arduino nano 33 ble sense. [Link]
ble-sense/get-started-with-machine-learning/, 2025. Accessed: 2025-06-11.
[94] NXP Semiconductors. eIQ Toolkit User Guide, 2024. Version 1.8.0.
[95] Microsoft. Edgeml: Machine learning for resource-constrained edge devices. [Link]
2025. Accessed: 2025-06-11.
[96] EdjeElectronics. Train a tensorflow lite 2 object detection model. [Link]
EdjeElectronics/TensorFlow-Lite-Object-Detection-on-Android-and-Raspberry-Pi/blob/master/Train_TFLite2_
Object_Detction_Model.ipynb, 2025. Accessed: 2025-06-11.
[97] EdjeElectronics. Tensorflow lite object detection on android and raspberry pi. [Link]
TensorFlow-Lite-Object-Detection-on-Android-and-Raspberry-Pi, 2025. Accessed: 2025-06-11.
[98] Sony AI. Sony ai. [Link] 2025. Accessed: 2025-06-11.
[99] Sony. Model optimization toolkit. [Link] 2025. Accessed: 2025-06-11.
[100] KaaIoT Technologies, LLC. Kaaiot and supermicro collaborate to provide ai-powered iot solutions for the edge.
https://[Link]/blog/kaaiot-and-supermicro-collaborate-to-provide-ai-powered-iot-solutions-for-the-edge, 2025.
Accessed: 2025-06-11.
[101] Chirag Gupta, Arun Sai Suggala, Ankit Goyal, Harsha Vardhan Simhadri, Bhargavi Paranjape, Ashish Kumar, Saurabh
Goyal, Raghavendra Udupa, Manik Varma, and Prateek Jain. Protonn: compressed and accurate knn for resource-
scarce devices. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page
1331–1340. [Link], 2017.
[102] Yuxuan Liang, Yulin Han, and Fangming Jiang. Deep learning-based small object detection: A survey. In Proceedings
of the 2022 8th International Conference on Computing and Artificial Intelligence (ICCAI ’22), pages 432–438. ACM,
2022.

J. ACM, Vol. 37, No. 4, Article . Publication date: May 2025.

[103] Qianyun Lu and Boris Murmann. Improving the energy efficiency and robustness of tinyml computer vision using
log-gradient input images. In Proceedings of the tinyML Research Symposium (tinyML Research Symposium ’22). ACM,
March 2022. Also available as arXiv:2203.02571.
[104] S. B. Lakshman and N. U. Eisty. Software engineering approaches for tinyml based iot embedded vision: A systematic
literature review. arXiv preprint arXiv:2204.08702, 2022.
[105] Qian Feng, Xinxin Xu, and Zhenxing Wang. Deep learning-based small object detection: A survey. Mathematical
Biosciences and Engineering, 20(4):6551–6590, 2023.
[106] Adriel Monti De Nardi and Maxwell Eduardo Monteiro. Evaluation of the energy viability of smart iot sensors using
tinyml for computer vision applications: A case study. International Robotics & Automation Journal, 9(2):78–85, 2023.
[107] Colby Banbury, Emil Njor, Andrea Mattia Garavagno, Mark Mazumder, Matthew Stewart, Pete Warden, Manjunath
Kudlur, Nat Jeffries, and Vijay Janapa Reddi. Wake vision: A tailored dataset and benchmark suite for tinyml computer
vision applications. arXiv preprint arXiv:2405.00892v5, Jun 2025. ver. 5.
[108] S. You, Zhiyu Chen, Shangdong Li, Mengxue Wang, Tengfeng Feng, and Yimu Jiang. Yolite+: a lightweight multi-object
detection approach in traffic scenarios. Procedia Computer Science, 199:346–353, 2022.
[109] Andrew Barovic and Armin Moin. Tinyml for speech recognition. arXiv preprint arXiv:2504.16213, 2025.
[110] Ahmed Y. Radwan, Mohammad Shehab, and Mohamed-Slim Alouini. Tinyml nlp scheme for semantic wireless
sentiment classification with privacy preservation. arXiv preprint arXiv:2411.06291v3, April 2025. Accepted at EuCNC
& 6G Summit 2025.
[111] Ismail Lamaakal, Yassine Maleh, Khalid El Makkaoui, Ibrahim Ouahbi, Mohamed Essahraui, Mohamed F. Bouami,
Ahmed A. Abd El-Latif, May Almousa, Jun Peng, and Dusit Niyato. A comprehensive survey on tiny machine learning
for human behavior analysis (hba). IEEE Internet of Things Journal, 2025. In press.
[112] M. Pujari, A. Goel, and A. K. Pakina. Efficient tinyml architectures for on-device small language models: Privacy-
preserving inference at the edge. International Journal Science and Technology, 3(3):67–75, 2024.
[113] Mohammad Wali Ur Rahman, Murad Mehrab Abrar, Hunter Gibbons Copening, Salim Hariri, Sicong Shao, Pratik
Satam, and Soheil Salehi. Quantized transformer language model implementations on edge devices. In Proceedings of
the 2023 IEEE 22nd International Conference on Machine Learning and Applications (ICMLA), pages 104–111, 2023.
[114] Zhaolan Huang, Adrien Tousnakhoff, Polina Kozyr, Roman Rehausen, Felix Bießmann, Robert Lachlan, Cedric Adjih,
and Emmanuel Baccelli. Tinychirp: Bird song recognition using tinyml models on low-power wireless acoustic
sensors. arXiv preprint arXiv:2407.21453v2, September 2024. Accepted at IEEE IS2 2024.
[115] Vasileios Tsoukas, Eleni Boumpa, Georgios Giannakas, and Athanasios Kakarountas. A review of machine learning
and tinyml in healthcare. In Proceedings of the 25th Pan-Hellenic Conference on Informatics, pages 69–73, 2021.
[116] Norhen Abdennadher, Danilo Pau, and Arcangelo Bruna. Fixed complexity tiny reservoir heterogeneous network
for on-line ecg learning of anomalies. In 2021 IEEE 10th Global Conference on Consumer Electronics (GCCE), pages
233–237. IEEE, 2021.
[117] Anandi Dutta. A smart design framework for a novel reconfigurable multi-processor systems-on-chip (ASREM) architecture.
Ph.D. dissertation, University of Louisiana at Lafayette, Lafayette, LA, USA, 2016.
[118] Musa Dima Genemo. Federated learning for bronchus cancer detection using tiny machine learning edge devices.
Indonesian Journal of Data and Science, 5(1):64–69, 2024.
[119] Martin Ragot, Nicolas Martin, Sonia Em, Nico Pallamin, and Jean-Marc Diverrez. Emotion recognition using
physiological signals: laboratory vs. wearable sensors. In Advances in Human Factors in Wearable Technologies and
Game Design: Proceedings of the AHFE 2017 International Conference on Advances in Human Factors and Wearable
Technologies, July 17-21, 2017, The Westin Bonaventure Hotel, Los Angeles, California, USA 8, pages 15–22. Springer,
2018.
[120] Juan Antonio Domínguez-Jiménez, Kiara Coralia Campo-Landines, Juan C Martínez-Santos, Enrique J Delahoz, and
Sonia H Contreras-Ortiz. A machine learning model for emotion recognition from physiological signals. Biomedical
signal processing and control, 55:101646, 2020.
[121] Rita Laureanti, Marco Bilucaglia, Margherita Zito, Riccardo Circi, Alessandro Fici, Fiamma Rivetti, Riccardo Valesi,
Carlo Oldrini, Luca T Mainardi, and Vincenzo Russo. Emotion assessment using machine learning and low-cost
wearable devices. In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society
(EMBC), pages 576–579. IEEE, 2020.
[122] Satyapreet Singh Yadav, Radha Agarwal, Kola Bharath, Sandeep Rao, and Chetan Singh Thakur. Tinyradar: Mmwave
radar based human activity classification for edge computing. In 2022 IEEE International Symposium on Circuits and
Systems (ISCAS), pages 2414–2417. IEEE, 2022.
[123] Bidyut Saha, Riya Samanta, Soumya Kanti Ghosh, and Ram Babu Roy. From wrist to world: Harnessing wearable imu
sensors and tinyml to enable smart environment interactions. In Proceedings of the Third International Conference on
AI-ML Systems, pages 1–3, 2023.

[124] Anita Christaline Johnvictor, M Poonkodi, N Prem Sankar, and Thinesh Vs. Tinyml-based lightweight ai healthcare
mobile chatbot deployment. Journal of Multidisciplinary Healthcare, pages 5091–5104, 2024.
[125] Mamta Bhamare, Pradnya V Kulkarni, Rashmi Rane, Sarika Bobde, and Ruhi Patankar. Tinyml applications and use
cases for healthcare. In TinyML for Edge Intelligence in IoT and LPWAN Networks, pages 331–353. Elsevier, 2024.
[126] Samson O Ooko and Simon M Karume. Application of tiny machine learning in predicative maintenance in industries.
Journal of Computing Theories and Applications, 2(1):131–150, 2024.
[127] Lachit Dutta and Swapna Bharali. Tinyml meets iot: A comprehensive survey. Internet of Things, 16:100461, 2021.
[128] J Manokaran and G Vairavel. Smart anomaly detection using data-driven techniques in iot edge: a survey. In
Proceedings of Third International Conference on Communication, Computing and Electronics Systems: ICCCES 2021,
pages 685–702. Springer, 2022.
[129] Vítor M Oliveira and António HJ Moreira. Edge ai system using a thermal camera for industrial anomaly detection.
In International Summit Smart City 360°, pages 172–187. Springer, 2021.
[130] Matteo Cardoni, Danilo Pietro Pau, Laura Falaschetti, Claudio Turchetti, and Marco Lattuada. Online learning of oil
leak anomalies in wind turbines with block-based binary reservoir. Electronics, 10(22):2836, 2021.
[131] Apostolos Xenakis, Anthony Karageorgos, Efthimios Lallas, Adriana E Chis, and Horacio González-Vélez. Towards
distributed iot/cloud based fault detection and maintenance in industrial automation. Procedia Computer Science,
151:683–690, 2019.
[132] Yap Yan Siang, Mohd Ridzuan Ahmad, and Mastura Shafinaz Zainal Abidin. Anomaly detection based on tiny machine
learning: A review. Open International Journal of Informatics, 9(Special Issue 2):67–78, 2021.
[133] Mattia Antonini, Miguel Pincheira, Massimo Vecchio, and Fabio Antonelli. A tinyml approach to non-repudiable
anomaly detection in extreme industrial environments. In 2022 IEEE International Workshop on Metrology for Industry
4.0 & IoT (MetroInd4.0&IoT), pages 397–402. IEEE, 2022.
[134] Muhammad Abubakar, Abdul Sattar, Hamid Manzoor, Khola Farooq, and Muhammad Yousif. Iiot: An infusion of
embedded systems, tinyml, and federated learning in industrial iot. Journal of Computing & Biomedical Informatics,
8(02), 2025.
[135] Martina Casiroli and Danilo Pietro Pau. Tiny machine learning business intelligence in the semiconductor industry:
A case study. In 2023 IEEE Global Conference on Artificial Intelligence and Internet of Things (GCAIoT), pages 9–16.
IEEE, 2023.
[136] Ramon Sanchez-Iborra and Antonio F Skarmeta. Tinyml-enabled frugal smart objects: Challenges and opportunities.
IEEE Circuits and Systems Magazine, 20(3):4–18, 2020.
[137] Hatim Bamoumen, Anas Temouden, Nabil Benamar, and Yousra Chtouki. How tinyml can be leveraged to solve
environmental problems: A survey. In 2022 International Conference on Innovation and Intelligence for Informatics,
Computing, and Technologies (3ICT), pages 338–343. IEEE, 2022.
[138] United Nations. World Population Projected to Reach 9.8 Billion in 2050, and 11.2 Billion in 2100. Online, 2017.
[139] Alakananda Mitra, Sukrutha LT Vangipuram, Anand K Bapatla, Venkata KVV Bathalapalli, Saraju P Mohanty,
Elias Kougianos, and Chittaranjan Ray. Everything you wanted to know about smart agriculture. arXiv preprint
arXiv:2201.04754, 2022.
[140] Sarah Condran, Michael Bewong, Md Zahidul Islam, Lancelot Maphosa, and Lihong Zheng. Machine learning in
precision agriculture: a survey on trends, applications and evaluations over two decades. IEEE Access, 10:73786–73803,
2022.
[141] Yogeswaranathan Kalyani and Rem Collier. A systematic survey on the role of cloud, fog, and edge computing
combination in smart agriculture. Sensors, 21(17):5922, 2021.
[142] Vu Khanh Quy, Nguyen Van Hau, Dang Van Anh, Nguyen Minh Quy, Nguyen Tien Ban, Stefania Lanza, Giovanni
Randazzo, and Anselme Muzirafuti. Iot-enabled smart agriculture: architecture, applications, and challenges. Applied
Sciences, 12(7):3396, 2022.
[143] Nikesh Gondchawar, RS Kawitkar, et al. Iot based smart agriculture. International Journal of advanced research in
Computer and Communication Engineering, 5(6):838–842, 2016.
[144] G Sushanth and S Sujatha. Iot based smart agriculture system. In 2018 international conference on wireless communi-
cations, signal processing and networking (WiSPNET), pages 1–4. IEEE, 2018.
[145] Devis Tuia, Benjamin Kellenberger, Sara Beery, Blair R Costelloe, Silvia Zuffi, Benjamin Risse, Alexander Mathis,
Mackenzie W Mathis, Frank Van Langevelde, Tilo Burghardt, et al. Perspectives in machine learning for wildlife
conservation. Nature communications, 13(1):792, 2022.
[146] David J Curnick, Alasdair J Davies, Clare Duncan, Robin Freeman, David MP Jacoby, Hugo TE Shelley, Cristian Rossi,
Oliver R Wearn, Michael J Williamson, and Nathalie Pettorelli. Smallsats: a new technological frontier in ecology and
conservation? Remote Sensing in Ecology and Conservation, 8(2):139–150, 2022.

[147] Kong Ka Hing, Mehran Behjati, Vala Saleh, Yap Kian Meng, Anwar PP Abdul Majeed, and Yufan Zheng. Edge
intelligence for wildlife conservation: Real-time hornbill call classification using tinyml. In International Conference
on Intelligent Manufacturing and Robotics, pages 476–488. Springer, 2024.
[148] Konkala Venkateswarlu Reddy, BS Karthikeya Reddy, Veerapu Goutham, Miriyala Mahesh, JS Nisha, Gopinath
Palanisamy, Mallikarjuna Golla, Swetha Purushothaman, Katangure Rithisha Reddy, and Varsha Ramkumar. Edge ai
in sustainable farming: Deep learning-driven iot framework to safeguard crops from wildlife threats. IEEE Access,
12:77707–77723, 2024.
[149] Ariel M Lorenzo, Rodrigo Barien, Neil Darwin Favila, Dennis Basa, Jay M Ventura, and Sherwin Catolos. Trees have
ears: An acoustic surveillance and tinyml-based for detecting illegal logging. In 2024 International Conference of
Adisutjipto on Aerospace Electrical Engineering and Informatics (ICAAEEI), pages 1–6. IEEE, 2024.
[150] Mangesh Pujari, Anil Kumar Pakina, and Ashwin Sharma. Enhancing cybersecurity in edge ai systems: A game-
theoretic approach to threat detection and mitigation. IOSR Journal of Computer Engineering, 25(3):65–73, 2023.
[151] Haoyu Ren, Darko Anicic, and Thomas A Runkler. Tinyol: Tinyml with online-learning on microcontrollers. In 2021
international joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2021.
[152] Mark Mazumder, Colby Banbury, Josh Meyer, Pete Warden, and Vijay Janapa Reddi. Few-shot keyword spotting in
any language. arXiv preprint arXiv:2104.01454, 2021.
[153] Kavya Kopparapu and Eric Lin. Tinyfedtl: Federated transfer learning on tiny devices. arXiv preprint arXiv:2110.01107,
2021.
[154] Marc Monfort Grau, Roger Pueyo Centelles, and Felix Freitag. On-device training of machine learning models on
microcontrollers with a look at federated learning. In Proceedings of the Conference on Information Technology for
Social Good, pages 198–203, 2021.
[155] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning,
trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[156] Pete Warden and Daniel Situnayake. Tinyml: Machine learning with tensorflow lite on arduino and ultra-low-power
microcontrollers. O’Reilly Media, 2019.
[157] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan
Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In
13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.
[158] Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-efficient backpropagation
through time. Advances in neural information processing systems, 29, 2016.
[159] Minsik Cho and Daniel Brand. Mec: Memory-efficient convolution for deep neural network. In International Conference
on Machine Learning, pages 815–824. PMLR, 2017.
[160] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and
Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018.
[161] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and
Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint
arXiv:1805.06085, 2018.
[162] Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for
rapid-deployment. Advances in Neural Information Processing Systems, 32, 2019.
[163] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automated quantization with mixed
precision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8612–8620, 2019.
[164] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth
convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
[165] Steven K Esser, Jeffrey L McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S Modha. Learned
step size quantization. arXiv preprint arXiv:1902.08153, 2019.
[166] Tensorflow model optimization toolkit. [Link] Accessed: 2024-06-10.
[167] Boyu Chen, Peixia Li, Baopu Li, Chen Lin, Chuming Li, Ming Sun, Junjie Yan, and Wanli Ouyang. Bn-nas: Neural
architecture search with batch normalization. In Proceedings of the IEEE/CVF international conference on computer
vision, pages 307–316, 2021.
[168] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: Train one network and specialize
it for efficient deployment. arXiv preprint arXiv:1908.09791, 2019.
[169] Angelo Garofalo, Manuele Rusci, Francesco Conti, Davide Rossi, and Luca Benini. Pulp-nn: A computing library for
quantized neural network inference at the edge on risc-v based parallel ultra low power clusters. In 2019 26th IEEE
International Conference on Electronics, Circuits and Systems (ICECS), pages 33–36. IEEE, 2019.
[170] Ambiq micro apollo3 blue soc technical reference manual. [Link] Accessed: 2024-06-10.

[171] J Lin, WM Chen, H Cai, C Gan, and S Han. Mcunetv2: Memory-efficient patch-based inference for tiny deep learning.
arXiv preprint arXiv:2110.15352, 2021.
[172] Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209,
2018.
[173] Yann Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.
[174] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[175] Riya Adlakha and Eltahir Kabbar. The challenges of tinyml implementation: A literature review. 2024.
[176] Filip Svoboda, Javier Fernandez-Marques, Edgar Liberis, and Nicholas D Lane. Deep learning on microcontrollers: A
study on deployment costs and challenges. In Proceedings of the 2nd European Workshop on Machine Learning and
Systems, pages 54–63, 2022.
[177] Parin Shah, Yuvaraj Govindarajulu, Pavan Kulkarni, and Manojkumar Parmar. Enhancing tinyml security: Study of
adversarial attack transferability. arXiv preprint arXiv:2407.11599, 2024.
[178] Jacob Huckelberry, Yuke Zhang, Allison Sansone, James Mickens, Peter A Beerel, and Vijay Janapa Reddi. Tinyml
security: Exploring vulnerabilities in resource-constrained machine learning systems. arXiv preprint arXiv:2411.07114,
2024.
[179] Archit Parnami and Minwoo Lee. Learning from few examples: A summary of approaches to few-shot learning.
arXiv preprint arXiv:2203.04291, 2022.
[180] Yeonju Kim, Jeonghyeon Yoon, and Seungku Kim. A few-shot learning-based material recognition scheme using
smartphones. Applied Sciences, 15(1):430, 2025.
[181] Enrico Fini, Stéphane Lathuiliere, Enver Sangineto, Moin Nabi, and Elisa Ricci. Online continual learning under
extreme memory constraints. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28,
2020, Proceedings, Part XXVIII 16, pages 720–735. Springer, 2020.
[182] Peter Chang. Benchmarking ai compiler for the tinyml market. [Link]
Peter-Chang_tinyML-[Link], 2023. Presented at tinyML Asia 2023, November 16, 2023.
[183] Sukhpal Singh Gill, Muhammed Golec, Jianmin Hu, Minxian Xu, Junhui Du, Huaming Wu, Guneet Kaur Walia,
Subramaniam Subramanian Murugesan, Babar Ali, Mohit Kumar, et al. Edge ai: A taxonomy, systematic review and
future directions. Cluster Computing, 28(1):1–53, 2025.
[184] Xubin Wang, Zhiqing Tang, Jianxiong Guo, Tianhui Meng, Chenhao Wang, Tian Wang, and Weijia Jia. Empowering
edge intelligence: A comprehensive survey on on-device ai models. ACM Computing Surveys, 57(9):1–39, 2025.

Received 20 June 2025

Common questions

Hardware advances, from simple microcontroller units (MCUs) to enhanced microcontrollers paired with accelerators such as NPUs and Edge TPUs, have significantly expanded TinyDL's capabilities. They allow TinyDL to tackle more complex applications such as speech recognition, NLP, and object detection, which require more compute and memory than typical TinyML use cases. These platforms also enable the deployment of deeper, more accurate networks optimized through techniques such as QAT and NAS. Together, these advances make medium-scale TinyDL deployment feasible by balancing power, performance, and accuracy.

TinyML employs techniques such as knowledge distillation, pruning, and quantization to preserve performance on low-power IoT devices. Knowledge distillation trains a smaller ‘student’ model to mimic the output of a larger ‘teacher’ model, compressing the teacher's knowledge into a form that runs on constrained hardware such as MCUs. Structured pruning removes low-importance parts of the model to reduce its size, and quantization represents weights and activations at lower bit-widths, decreasing computational load. Together, these techniques enable capable inference without extensive on-device computation, improving efficiency and battery life on IoT devices.
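
As an illustration of the student-teacher idea described above, the following is a minimal sketch of a distillation loss in plain Python. It assumes the common formulation in which both sets of logits are softened by a temperature and compared with a KL divergence scaled by the squared temperature; the function names, the temperature value, and the example logits are illustrative and are not taken from the survey.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened outputs, scaled by temperature squared."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    return kl * temperature ** 2


# Hypothetical logits for a 3-class problem.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # nearly agrees with the teacher
far_student = [0.5, 4.0, 1.0]    # prefers a different class

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(f"close student: {loss_close:.4f}, far student: {loss_far:.4f}")
```

In practice this term is usually combined with the ordinary supervised loss on the true labels; the weighting between the two is a tuning choice that this sketch leaves out.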

TinyDL addresses ultra-resource-constrained environments by prioritizing efficiency and scalability metrics alongside traditional accuracy benchmarks. Model compression (e.g., pruning and quantization), compact network architectures, and low-power hardware platforms such as enhanced MCUs and specialized accelerators help models fit within limited memory, storage, and energy budgets without sacrificing essential functionality. Additionally, frameworks like Edge AutoML automate design and fine-tuning for these environments, incorporating hardware feedback to improve deployment feasibility. This combination makes TinyDL models better suited to edge conditions where resources are severely limited.
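
Pruning, one of the compression techniques named above, can be sketched as follows. This is a minimal example of structured (filter-level) pruning that assumes filters are ranked by L1 norm, which is one common importance heuristic among several; the helper names and the weight values are hypothetical.

```python
def l1_norm(row):
    """Sum of absolute weights, used here as a simple importance score."""
    return sum(abs(w) for w in row)


def prune_filters(filters, keep_ratio):
    """Structured pruning: drop whole filters with the smallest L1 norms."""
    n_keep = max(1, round(len(filters) * keep_ratio))  # always keep at least one
    ranked = sorted(range(len(filters)),
                    key=lambda i: l1_norm(filters[i]), reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve the original filter order
    return kept, [filters[i] for i in kept]


# Hypothetical layer with four flattened filters.
filters = [
    [0.90, -0.80, 0.70],
    [0.01, 0.02, -0.01],
    [-0.50, 0.40, 0.30],
    [0.05, -0.02, 0.03],
]

kept_idx, pruned = prune_filters(filters, keep_ratio=0.5)
print("kept filters:", kept_idx)
```

Because entire filters are removed, the pruned layer stays dense and needs no sparse-matrix support at inference time. The layer that consumes its output must drop the matching input channels, a step this sketch omits.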

Domain-specific accelerators such as NPUs, ASICs, and FPGAs are designed to handle the computations that dominate deep learning, such as quantized convolution and attention workloads, more efficiently than general-purpose MCUs. They offer significant improvements in inference speed, energy efficiency, and model scalability, all of which matter for TinyDL deployment. These accelerators provide customized data paths and optimized execution for computationally intensive operations such as matrix multiplication, enabling more complex applications like vision-based inference or speech recognition on constrained devices. This specialization yields faster data processing and lower power consumption, which is essential for prolonged operation in embedded systems.

Efficiency and compression metrics are crucial for TinyML because they measure how well a model performs under severe resource constraints, which standard performance metrics such as accuracy or F1-score do not capture. These metrics include model size in KB, inference latency in ms, and memory usage in terms of both static and dynamic requirements. They determine whether a model can run on hardware with limited RAM or processing power, such as ARM Cortex-M MCUs. Techniques like weight pruning and quantization are used to minimize the model footprint so that inference can run in real time within the device's constraints. This emphasis on operational feasibility marks a significant departure from conventional performance metrics.
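
The metrics listed above can be estimated with simple arithmetic before a model is ever flashed to a device. The sketch below is a rough feasibility check under stated assumptions: the parameter count, layer shapes, and the 256 KB flash / 64 KB RAM budget are hypothetical, peak RAM is approximated as the largest input-plus-output activation pair, and the latency helper times a callable on the host, which is not representative of an MCU.

```python
import statistics
import time


def model_size_kb(num_params, bits_per_weight):
    """Flash needed for the weights alone, in KB."""
    return num_params * bits_per_weight / 8 / 1024


def peak_activation_kb(layer_shapes, bits_per_activation=8):
    """Peak RAM in KB, assuming a layer's input and output buffers are live together."""
    peak_elems = max(n_in + n_out for n_in, n_out in layer_shapes)
    return peak_elems * bits_per_activation / 8 / 1024


def median_latency_ms(fn, runs=20):
    """Median wall-clock latency of fn() in milliseconds (host-side measurement)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Hypothetical model and device budget.
NUM_PARAMS = 250_000
LAYERS = [(96 * 96 * 1, 48 * 48 * 8), (48 * 48 * 8, 24 * 24 * 16), (24 * 24 * 16, 10)]
FLASH_BUDGET_KB, RAM_BUDGET_KB = 256, 64

size_fp32 = model_size_kb(NUM_PARAMS, 32)
size_int8 = model_size_kb(NUM_PARAMS, 8)
ram_int8 = peak_activation_kb(LAYERS)

fits_fp32 = size_fp32 <= FLASH_BUDGET_KB
fits_int8 = size_int8 <= FLASH_BUDGET_KB and ram_int8 <= RAM_BUDGET_KB
latency = median_latency_ms(lambda: sum(range(1000)))

print(f"fp32: {size_fp32:.1f} KB (fits: {fits_fp32})")
print(f"int8: {size_int8:.1f} KB, peak RAM {ram_int8:.1f} KB (fits: {fits_int8})")
```

Real deployments also need room for the runtime, scratch buffers, and application code, so an estimate like this is best read as a lower bound.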

TinyDL uses advanced optimization techniques such as Quantization-Aware Training (QAT), Neural Architecture Search (NAS), Hardware-Aware Automated Quantization (HAQ), and knowledge distillation so that models retain high accuracy under strict memory and energy constraints. In contrast, TinyML often relies on foundational methods such as Post-Training Quantization (PTQ) and pruning, alongside manual feature engineering. TinyML strategies therefore focus on operating within basic resource constraints, while TinyDL applies more sophisticated techniques to extract more capability from limited resources without significant accuracy loss.
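
The difference between PTQ and QAT is easier to see with the underlying arithmetic. The following sketch assumes a standard affine (scale and zero-point) integer mapping: PTQ applies the mapping once to trained weights, whereas QAT inserts the quantize-then-dequantize step ("fake quantization") into the training forward pass so the model learns to tolerate the rounding. The function names and weight values are illustrative, and the gradient handling used by QAT (typically a straight-through estimator) is not shown.

```python
def quant_params(values, num_bits=8):
    """Derive an affine (scale, zero_point) mapping from the observed value range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # range must contain 0
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against an all-zero tensor
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax


def quantize(values, scale, zero_point, qmin, qmax):
    """Map floats to clipped integers (what PTQ stores on the device)."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate float values."""
    return [(q - zero_point) * scale for q in q_values]


def fake_quantize(values, num_bits=8):
    """Quantize then dequantize: the rounding a QAT forward pass would expose."""
    scale, zp, qmin, qmax = quant_params(values, num_bits)
    return dequantize(quantize(values, scale, zp, qmin, qmax), scale, zp)


def max_error(values, num_bits):
    """Largest absolute rounding error introduced at the given bit-width."""
    return max(abs(v - r) for v, r in zip(values, fake_quantize(values, num_bits)))


# Hypothetical weights from one layer.
weights = [-0.62, -0.10, 0.0, 0.33, 1.27]
scale8, zp8, qmin8, qmax8 = quant_params(weights, 8)
q8 = quantize(weights, scale8, zp8, qmin8, qmax8)
err8 = max_error(weights, 8)
err4 = max_error(weights, 4)
print("int8 codes:", q8)
print(f"max error at 8 bits: {err8:.5f}, at 4 bits: {err4:.5f}")
```

The gap between the 8-bit and 4-bit errors indicates why lower bit-widths generally benefit more from training-time methods than from one-shot conversion.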

A modular foundational architecture in TinyDL offers the benefit of reusing a general pre-trained backbone for multiple task-specific heads, reducing redundancy and improving adaptability. Practically, it could be implemented using lightweight modular components fine-tuned with on-device or edge AutoML frameworks, allowing devices to swap task-specific components without replacing the entire model. This approach enhances flexibility, enabling seamless updates in response to changing user requirements without extensive retraining. Additionally, it could optimize resource utilization by deploying only the necessary modules, thus maintaining efficiency on constrained hardware. Such modularity could pave the way for innovative deployment strategies in edge AI, ensuring models remain adaptable and contextually relevant.

Current TinyDL frameworks face limitations in integrating fully automated deployment pipelines that accommodate real-time environmental feedback and varied hardware capabilities. Improvements could include enhancing compiler toolchains to better partition TinyDL models across heterogeneous hardware, maximizing runtime flexibility, and ensuring compatibility across platforms like TFLite Micro and CMSIS-NN. Additionally, expanding support for dynamic model adaptation and fine-tuning via on-device AutoML could improve model performance over time without significant manual intervention. These enhancements would support robust, adaptable, and efficient deployment of TinyDL models in real-world edge environments, overcoming existing framework limitations that hinder seamless edge integration.

Federated learning in TinyDL faces challenges of communication overhead, heterogeneous device capabilities, and adversarial resilience. Addressing these requires sparsified updates, asynchronous or hierarchical federated learning protocols, and secure aggregation mechanisms that suit TinyML's constraints. Approaches such as on-device aggregation of quantized updates, with frameworks like TinyFedTL and TinyMetaFed, can help minimize communication load. Developing lightweight protocols that adapt dynamically to varying device capabilities, and strengthening the security of data exchange, will be crucial for overcoming these challenges in constrained environments.
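
A sparsified update, as mentioned above, can be sketched in a few lines. The example assumes top-k magnitude sparsification on each client and a plain unweighted mean on the aggregator. It does not reproduce the protocol of TinyFedTL or TinyMetaFed, and the update vectors are hypothetical.

```python
def sparsify_top_k(update, k):
    """Keep the k largest-magnitude entries as parallel (indices, values) lists."""
    ranked = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    kept = sorted(ranked[:k])
    return kept, [update[i] for i in kept]


def aggregate(sparse_updates, size):
    """Average sparse client updates into one dense update (unweighted mean)."""
    total = [0.0] * size
    for indices, values in sparse_updates:
        for i, v in zip(indices, values):
            total[i] += v
    return [t / len(sparse_updates) for t in total]


# Hypothetical dense weight updates computed locally on two devices.
client_a = [0.01, -0.50, 0.02, 0.30, 0.00, -0.03]
client_b = [0.40, -0.10, 0.00, 0.20, -0.05, 0.01]

sparse_a = sparsify_top_k(client_a, k=2)
sparse_b = sparsify_top_k(client_b, k=2)
global_update = aggregate([sparse_a, sparse_b], size=6)

print("client A sends:", sparse_a)
print("client B sends:", sparse_b)
print("aggregated update:", global_update)
```

Each client here transmits two index-value pairs instead of six values. Practical schemes often accumulate the dropped entries locally and add them to the next round's update, which this sketch does not do.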

AutoML in TinyDL assists in automating the design, compression, and deployment of models, making them more feasible for operation in resource-constrained environments. It leverages frameworks like TinyNAS to perform hardware-aware NAS and Once-for-All Networks to balance accuracy, memory, and latency constraints. However, challenges remain in fully integrating AutoML into deployment pipelines, particularly in terms of combining it with model compression strategies, like quantization and pruning, while incorporating real-time hardware feedback. Addressing these challenges would streamline end-to-end model deployment, enhancing efficiency and performance predictability in TinyDL applications.
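
The core idea of a hardware-aware search, selecting the best-scoring architecture among those that fit the device, can be illustrated with a toy exhaustive search. This sketch is not TinyNAS or Once-for-All: the cost model, the proxy score standing in for validation accuracy, the search space, and the 256 KB flash / 64 KB RAM budget are all hypothetical assumptions made for the example.

```python
import math
from itertools import product

FLASH_BUDGET_KB = 256  # hypothetical device budget
RAM_BUDGET_KB = 64


def estimate_cost(width, depth, resolution):
    """Toy cost model for a stack of 3x3 convolutions with int8 weights and activations."""
    flash_kb = depth * 9 * width * width / 1024       # one byte per weight
    ram_kb = 2 * width * resolution * resolution / 1024  # input plus output buffer
    return flash_kb, ram_kb


def proxy_score(width, depth, resolution):
    """Hypothetical stand-in for validation accuracy; grows with capacity and resolution."""
    return math.log2(width) + 0.5 * math.log2(depth) + math.log2(resolution)


def is_feasible(candidate):
    """True when the candidate fits both the flash and the RAM budget."""
    flash_kb, ram_kb = estimate_cost(*candidate)
    return flash_kb <= FLASH_BUDGET_KB and ram_kb <= RAM_BUDGET_KB


def search(widths, depths, resolutions):
    """Return the best-scoring feasible (width, depth, resolution), or None."""
    feasible = [c for c in product(widths, depths, resolutions) if is_feasible(c)]
    return max(feasible, key=lambda c: proxy_score(*c)) if feasible else None


best = search(widths=[8, 16, 32, 64], depths=[4, 8, 12], resolutions=[32, 64, 96])
print("selected (width, depth, resolution):", best)
if best is not None:
    print("estimated (flash KB, RAM KB):", estimate_cost(*best))
```

A real system would replace the proxy score with measured or predicted accuracy and the cost model with on-device profiling, which is where the real-time hardware feedback discussed above would enter.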
