A curated list of research works on efficient on-device AI systems, methods, and applications for mobile and edge devices.
- [MLSys 2025] MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices [paper]
- [MLSys 2025] Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking [paper]
- [MLSys 2025] TurboAttention: Efficient Attention Approximation for High Throughputs LLMs [paper]
- [MLSys 2025] SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [paper]
- [MLSys 2025] LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [paper]
- [NeurIPS 2022] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [paper]
- [arXiv 2025] HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators [paper]
- [ASPLOS 2025] Fast On-device LLM Inference with NPUs [paper] [code]
- [arXiv 2024] PowerInfer-2: Fast Large Language Model Inference on a Smartphone [paper]
- [MobiCom 2024] MELTing point: Mobile Evaluation of Language Transformers [paper] [code]
- [MobiCom 2024] Mobile Foundation Model as Firmware [paper] [code]
- [MobiSys 2024] Pantheon: Preemptible Multi-DNN Inference on Mobile Edge GPUs [paper]
- [MobiCom 2024] Perceptual-Centric Image Super-Resolution using Heterogeneous Processors on Mobile Devices [paper]
- [SenSys 2023] Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU [paper]
- [MobiSys 2023] NN-Stretch: Automatic Neural Network Branching for Parallel Inference on Heterogeneous Multi-Processors [paper]
- [ATC 2023] Decentralized Application-Level Adaptive Scheduling for Multi-Instance DNNs on Open Mobile Devices [paper]
- [IPSN 2023] PointSplit: Towards On-device 3D Object Detection with Heterogeneous Low-power Accelerators [paper]
- [SenSys 2022] BlastNet: Exploiting Duo-Blocks for Cross-Processor Real-Time DNN Inference [paper]
- [MobiSys 2022] Band: Coordinated Multi-DNN Inference on Heterogeneous Mobile Processors [paper]
- [MobiSys 2022] CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices [paper]
- [RTSS 2024] FLEX: Adaptive Task Batch Scheduling with Elastic Fusion in Multi-Modal Multi-View Machine Perception [paper]
- [MobiCom 2024] Panopticus: Omnidirectional 3D Object Detection on Resource-constrained Edge Devices [paper]
- [MobiSys 2023] OmniLive: Super-Resolution Enhanced 360° Video Live Streaming for Mobile Devices [paper]
- [MobiSys 2023] HarvNet: Resource-Optimized Operation of Multi-Exit Deep Neural Networks on Energy Harvesting Devices [paper]
- [MobiCom 2022] NeuLens: Spatial-based Dynamic Acceleration of Convolutional Neural Networks on Edge [paper]
- [MobiCom 2021] Flexible High-Resolution Object Detection on Edge Devices with Tunable Latency [paper]
- [ASPLOS 2025] Nazar: Monitoring and Adapting ML Models on Mobile Devices [paper]
- [SenSys 2024] AdaShadow: Responsive Test-time Model Adaptation in Non-stationary Mobile Environments [paper]
- [SenSys 2023] EdgeFM: Leveraging Foundation Model for Open-set Learning on the Edge [paper]
- [MobiCom 2023] Cost-effective On-device Continual Learning over Memory Hierarchy with Miro [paper]
- [MobiCom 2023] AdaptiveNet: Post-deployment Neural Architecture Adaptation for Diverse Edge Environments [paper]
- [MobiSys 2023] ElasticTrainer: Speeding Up On-Device Training with Runtime Elastic Tensor Selection [paper]
- [SenSys 2023] On-NAS: On-Device Neural Architecture Search on Memory-Constrained Intelligent Embedded Systems [paper]
- [MobiCom 2022] Mandheling: Mixed-Precision On-device DNN Training with DSP Offloading [paper]
- [MobiSys 2022] Memory-Efficient DNN Training on Mobile Devices [paper]
- [SenSys 2023] nnPerf: Demystifying DNN Runtime Inference Latency on Mobile Platforms [paper]
- [MobiSys 2021] nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices [paper]