Ye Li (李也)

↓ Jump to: 📰 News 📄 Publications 🏅 Service ✍️ Blog 📋 CV 📬 Contact

I am a Ph.D. student at the Information Theory and Machine Learning Group (ITML Group) & Multimedia Laboratory (MMLab@SIGS), Tsinghua Shenzhen International Graduate School, Tsinghua University, where I lead the Efficient Deep Learning and Embodiment Group. I am advised by Prof. Shu-Tao Xia and Prof. Zhi Wang, and co-advised by Prof. Yuan Meng. Previously, I received my Master's degree from Nankai University, where I worked with Prof. Zhongxin Liu on swarm intelligence and learning-based control in the Intelligent Predictive Adaptive Control Lab. I obtained my Bachelor's degree from Nanjing Agricultural University, where I worked on robot perception and control with Prof. Wei Lu in the NJAUROBOT_LAB.

My research interests focus on:

Embodied Intelligence: building accurate, efficient, and generalizable embodied large models that perceive, reason, and act in the physical world, such as Vision-Language-Action (VLA) models, Diffusion Policies (DP), World Action Models (WAM), etc.
Efficient Artificial Intelligence: making powerful large models available to everyone under limited compute, such as efficient inference and compression for large language models (LLMs), including token compression, network pruning, quantization, sparsity, caching, etc.

📰 Latest News

Jun 2026 AcceptRoboStream is accepted by ECCV 2026 — see you in Malmö 🇸🇪
May 2026 arXivElegantVLA is released — welcome to discuss! 💬
May 2026 AcceptSAG is accepted by ICML 2026 — see you in Seoul 🇰🇷
Feb 2026 AcceptTTS & VVS are accepted by CVPR 2026 — see you in Denver 🇺🇸
Jan 2026 AcceptSP-VLA & BAC are accepted by ICLR 2026 — see you in Rio 🇧🇷
Dec 2025 arXivTS-DP is released — welcome to discuss! 💬
Oct 2025 AcceptPRANCE is accepted by IEEE TPAMI

📄 Selected Publications

* = first/co-first author · Full list on Google Scholar / Publications page

Preprint2026

ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models Project Page

Vision-Language-Action (VLA) models are powerful for generalist robotic control but slow, since large vision-language backbones and iterative action heads run at every control step, and prior acceleration uses fixed rules that ignore the non-uniform reasoning demands of sequential control. Inspired by human motor control — where resources concentrate on goal-sensitive stages — ElegantVLA learns when to think. It is a training-free, plug-in phase-adaptive framework whose lightweight scheduler watches temporal representation similarity, robot-motion cues, and episode progress to jointly allocate compute across the vision encoder, LLM, and action head: a five-level Vision–LLM mode (full recomputation to multi-step temporal reuse) and a three-level denoising mode (reuse during stable motion, full refinement at goal-sensitive stages). On GR00T it reaches up to 2.55× speedup, on CogACT 3.77×, and in real-world GR00T tasks it cuts computation 2.18× while raising control frequency from 13.8 to 26.3 Hz — preserving or improving task success.

Y. Li*, H. Liu, K. Ji, Y. Meng, J. Fan, Y. Wang, S. Qin, C. Wu, S.-T. Xia, Z. Wang

ElegantVLA accelerates the full VLA pipeline end to end: by analyzing redundancy in both high-level semantics and action generation, it adaptively schedules computation across every module — the vision encoder, LLM, and action head — for extreme speedups.

⚡ up to 2.55× (GR00T) · 3.77× (CogACT) · real-world (Franka) 13.8→26.3 Hz

📎 arXiv 🔗 Project

ICLR 20262026

SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration

Vision-Language-Action (VLA) models deliver strong control but are too slow for real-time robotics. SP-VLA accelerates them by removing two kinds of redundancy that prior compression overlooks: temporal redundancy across sequential actions and spatial redundancy in visual input. An action-aware scheduler routes intuitive steps to a lightweight generator and deliberative ones to the full VLA model, while a spatio-semantic dual-aware token pruner keeps only the most informative tokens. Together they achieve 1.5× lossless speedup on LIBERO and 2.4× on SimplerEnv, with up to 6% average performance gain.

Y. Li*, Y. Meng, Z. Sun, K. Ji, C. Tang, J. Fan, X. Ma, S.-T. Xia, Z. Wang, W. Zhu

SP-VLA is the first method to exploit temporal redundancy in VLA inference: it reads the end-effector's speed to separate intuitive from deliberative steps, delegating cheap steps to a lightweight generator and reserving the full VLA model for critical ones — achieving lossless acceleration.

⚡ 2.5x lossless (Franka) 1.5× lossless (LIBERO) · 2.4× (SimplerEnv)

📎 arXiv 💻 Code

Preprint2025

TS-DP: Reinforcement Speculative Decoding for Temporal Adaptive Diffusion Policy Acceleration

Diffusion Policy (DP) excels at embodied control but is slow due to iterative denoising, while embodied tasks demand dynamic, adaptive computation. Static lossy methods (e.g., quantization) cannot handle this, whereas speculative decoding offers a lossless, adaptive—yet underexplored—alternative. TS-DP is the first framework to enable temporally-adaptive speculative decoding for DP: a distilled Transformer drafter imitates the base model to replace costly denoising calls, and an RL-based scheduler adjusts speculative parameters to match time-varying task difficulty. Across diverse embodied environments, TS-DP achieves up to 4.17× faster inference with over 94% accepted drafts, reaching 25 Hz real-time control without performance degradation.

Y. Li*, J. Feng, Y. Meng, K. Ji, C. Tang, X. Wen, S.-T. Xia, Z. Wang, W. Zhu

TS-DP is the first to bring lossless speculative decoding to Diffusion Policy, with an RL scheduler that adapts the speculative computation to time-varying task difficulty in embodied control.

⚡ up to 4.17× faster · >94% drafts accepted · 25 Hz real-time (Franka) · lossless

📎 arXiv

TPAMI 20262025

PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference

Vision Transformers are costly due to model size and quadratic token complexity, yet existing compression is static and single-domain — fixing a sparsity ratio or treating architecture and token selection separately — causing sharp accuracy drops under aggressive rates. PRANCE jointly optimizes activated channels and token count per input: a weight-sharing meta-network supports arbitrary channel widths for attention/MLP layers, and a lightweight PPO selector navigates the huge (~10^14) decision space, trained via a 'Result-to-Go' mechanism that casts ViT inference as a Markov decision process. PRANCE cuts FLOPs by ~50% while retaining only ~10% of tokens at lossless Top-1 accuracy, and is compatible with pruning, merging, and sequential pruning-merging.

Y. Li*, C. Tang*, Y. Meng, J. Fan, Z. Chai, X. Ma, Z. Wang, W. Zhu

PRANCE performs per-input joint optimization of computational redundancy from both the data and the model perspectives — co-reducing tokens and pruning channels — to push ViT acceleration to the extreme.

⚡ ~50% FLOPs reduction · keeps ~10% of tokens · lossless Top-1 accuracy

🌐 Official 📎 arXiv 💻 Code

SCTS 2023Control2023

A DDPG-based Solution for Optimal Consensus of Continuous-time Linear Multi-agent Systems

Modeling engineering systems is costly because parameters drift with temperature, component aging, etc. This work proposes a data-driven, model-free optimal controller based on Deep Deterministic Policy Gradient (DDPG) for continuous-time leader-following multi-agent consensus. Two neural networks fit the state and action spaces to avoid the dimensional explosion of time-consuming state iteration; the controller reaches consensus from only the consensus error, requires no initial admissible policy, and self-learns in real time as system parameters change, with minimal energy consumption. Convergence and stability are proven and verified in simulation.

Y. Li*, Z. Liu, G. Lan, M. Sader, Z. Chen

A data-driven, model-free optimal controller based on DDPG for continuous-time leader-following multi-agent consensus — it learns optimal control online from observations alone, removing the need for an accurate system model or any initial admissible policy.

🤖 DDPG-based optimal consensus for multi-agent systems

🌐 Official

ECCV 20262026

RoboStream: Weaving Spatio-Temporal Reasoning with Memory in Vision-Language Models for Robotics Project Page

Vision-Language Models (VLMs) are increasingly used as high-level planners for robotic manipulation, but they struggle on long-horizon tasks that demand persistent spatial understanding and temporal memory across many steps. RoboStream is a training-free framework that augments VLM planners with explicit spatio-temporal reasoning and memory: Spatio-Temporal Fusion Tokens compress and carry historical visual-spatial context across the rollout, while a Causal Spatio-Temporal Graph models how scene entities and actions evolve over time, enabling consistent, memory-aware planning. Without any additional training, RoboStream improves robustness and success on long-horizon robotic manipulation.

Y. Huang, J. Wu, W. Bu, Z. Xiong, G. Jiang, Y. Li, K. Ji, S. Xie, Y. Huang, C. Wu, J. Jiang, Z. Wang

RoboStream is a training-free framework that equips VLM planners with persistent spatio-temporal reasoning and memory — via Spatio-Temporal Fusion Tokens and a Causal Spatio-Temporal Graph — for robust long-horizon robotic manipulation.

🧠 spatio-temporal memory · training-free · long-horizon manipulation

📎 arXiv 🔗 Project

ICML 20262026

Sparse ActionGen: Accelerating Diffusion Policy with Real-time Pruning Project Page

Diffusion Policy models multi-modal action distributions well but its multi-step denoising is too slow for real-time visuomotor control, and existing caching accelerators use static schedules that don't adapt to robot-environment dynamics. SAG (Sparse ActionGen) enables extremely sparse action generation via a rollout-adaptive prune-then-reuse mechanism: it first identifies prunable computations globally, then reuses cached activations to substitute them. An observation-conditioned diffusion pruner — designed to be highly parameter- and inference-efficient — captures rollout dynamics for environment-aware adaptation, and a one-for-all reuse strategy reuses activations across both timesteps and blocks in a zig-zag manner to minimize global redundancy. Across multiple robotic benchmarks, SAG achieves up to 4× generation speedup without sacrificing performance.

K. Ji, J. Zhou, Y. Meng, Y. Li, H. Cui, Z. Wang

SAG accelerates Diffusion Policy to real time via a rollout-adaptive prune-then-reuse scheme — an observation-conditioned pruner identifies prunable computations on the fly, and a one-for-all strategy reuses activations across timesteps and blocks.

⚡ up to 4× generation speedup · no performance loss

📎 arXiv 💻 Code 🔗 Project

CVPR 20262026

VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping

Visual autoregressive (AR) generation is strong for image synthesis but slow due to next-token prediction, and speculative decoding's 'draft one step, then verify one step' paradigm prevents a direct reduction in forward passes. Leveraging the interchangeability of visual tokens, VVS explores verification skipping for the first time to explicitly cut target-model forward passes. It integrates three modules: (1) a verification-free token selector with dynamic truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped-step scheduling. VVS reduces target-model forward passes by 2.8× relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed-quality trade-off over conventional speculative decoding.

H. Dong, Y. Li, R. Lu, C. Tang, S.-T. Xia, Z. Wang

VVS is the first to bring step-skipping to speculative decoding: by skipping verification steps for visual autoregressive generation, it directly cuts target-model forward passes via a verification-free token selector, feature caching, and skipped-step scheduling.

⚡ 2.8× fewer target-model forward passes · competitive generation quality

🌐 Official 📎 arXiv 💻 Code

CVPR 20262026

Test-time Sparsity for Extreme Fast Action Diffusion

Action diffusion produces high-fidelity actions but is costly due to iterative denoising, and feature-caching accelerators struggle to adapt to the policy dynamics of open-environment rollouts. TTS proposes test-time sparsity: it dynamically predicts prunable residual computations for each model forward at test time. To keep the savings real, a highly parallelized pipeline shares a lightweight pruner with the diffusion transformer and overlaps pruning with decoding (cutting non-decoder delay to milliseconds); an omnidirectional reuse strategy then reaches 95% sparsity by reusing features cached within the current forward, across denoising timesteps, and across rollout iterations. TTS reduces FLOPs by 92% and accelerates action generation by 5×, achieving lossless performance at 47.5 Hz.

K. Ji, Y. Meng, J. Zhou, Y. Li, C. Tang, Z. Wang

TTS accelerates action diffusion to real time via test-time sparsity — dynamically predicting prunable computations at each forward, with a parallelized pipeline and an omnidirectional feature-reuse strategy reaching 95% sparsity.

⚡ 92% fewer FLOPs · 5× faster · 47.5 Hz real-time · lossless

🌐 Official 📎 arXiv 💻 Code

ICLR 20262026

Block-wise Adaptive Caching for Accelerating Diffusion Policy Project Page

Diffusion Policy has strong visuomotor modeling ability but its high denoising cost is impractical for real-time robotic control, and existing diffusion accelerators fail to generalize to it due to architectural and data divergences. BAC (Block-wise Adaptive Caching) accelerates Diffusion Policy by caching intermediate action features, achieving lossless acceleration by adaptively updating and reusing cached features at the block level — based on the observation that feature similarities are non-uniform over time and block-specific. An Adaptive Caching Scheduler picks optimal update timesteps by maximizing global feature similarity; since per-block scheduling causes error surges from inter-block error propagation (especially in FFN blocks), a Bubbling Union Algorithm truncates these errors by updating high-error upstream blocks before downstream FFNs. As a training-free plugin compatible with transformer-based Diffusion Policy and VLA models, BAC delivers up to 3× inference speedup for free.

K. Ji, Y. Meng, H. Cui, Y. Li, J. Zhou, S. Hua, L. Chen, Z. Wang

BAC is a training-free plugin that accelerates Diffusion Policy by caching intermediate action features per transformer block — each block gets its own optimal update schedule, with a Bubbling Union Algorithm to stop cross-block cache-error propagation.

⚡ up to 3× inference speedup · training-free plugin · lossless

📎 arXiv 💻 Code 🔗 Project

🏅 Academic Service

🔍 Conference Reviewer

CVPR 2026 ECCV 2026 NeurIPS 2026 ACM MM 2026

📬 Contact

Happy to discuss research, internships, or collaborations. Best reached by email.
📧 liye23@mails.tsinghua.edu.cn · 🏛 Tsinghua Shenzhen International Graduate School (SIGS)
CV · 💼 LinkedIn · 🔬 ORCID