ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models

Published in arXiv preprint, 2026

Recommended citation: Ye Li, Huanan Liu, Kangye Ji, Yuan Meng, Jiajun Fan, Yuansong Wang, Shiyu Qin, Chenglei Wu, Shu-Tao Xia, Zhi Wang. "ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models." arXiv:2605.29438, 2026. https://arxiv.org/abs/2605.29438

Y. Li*, H. Liu, K. Ji, Y. Meng, J. Fan, Y. Wang, S. Qin, C. Wu, S.-T. Xia, Z. Wang

📄 Paper · 🔗 Project

Vision-Language-Action (VLA) models are a powerful paradigm for generalist robotic control, but their high computational cost and limited control frequency hinder real-time manipulation, because large vision-language backbones and iterative action heads are executed at every control step. Existing acceleration methods optimize individual components or rely on fixed rules, so they spend largely the same computation on every step and overlook the non-uniform reasoning demands of sequential embodied control.

Inspired by human motor control, where cognitive and feedback resources concentrate on goal-sensitive stages, ElegantVLA learns when to invest full computation and when to reuse prior computation. It is a training-free, plug-in, phase-adaptive inference framework that performs intra-model dynamic compute scheduling:

Lightweight scheduler. Observes temporal representation similarity, robot-motion cues, and episode progress to jointly allocate computation across the vision encoder, LLM, and action head.
Five-level Vision–LLM mode. From full recomputation to multi-step temporal reuse, chosen by visual-language representation stability.
Three-level denoising mode. Reuses intermediate denoising states during stable motion while preserving full refinement at goal-sensitive stages.
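
The scheduling idea in the three components above can be illustrated with a minimal sketch. This is not the paper's implementation: the names StepSignals, choose_vlm_mode, and choose_denoise_mode, every threshold value, and the rule used to flag a goal-sensitive step are hypothetical, chosen only to show how similarity, motion, and progress signals might map to a five-level Vision-LLM mode and a three-level denoising mode.

```python
import math
from dataclasses import dataclass


@dataclass
class StepSignals:
    """Per-step inputs to the (hypothetical) scheduler."""
    feature_similarity: float  # cosine similarity of consecutive backbone features
    motion_magnitude: float    # normalized robot motion, 0 (still) to 1 (fast)
    progress: float            # fraction of the episode elapsed, 0 to 1


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


# Assumed similarity thresholds; each one passed raises the reuse level by one.
VLM_THRESHOLDS = (0.90, 0.95, 0.98, 0.995)


def choose_vlm_mode(sig):
    """Return 0 (full recomputation) through 4 (multi-step temporal reuse)."""
    return sum(sig.feature_similarity >= t for t in VLM_THRESHOLDS)


def choose_denoise_mode(sig):
    """Return 0 (full refinement), 1 (partial reuse), or 2 (maximal reuse)."""
    # Assumption: slow motion or a late episode stage marks a goal-sensitive step.
    goal_sensitive = sig.motion_magnitude < 0.2 or sig.progress > 0.8
    if goal_sensitive:
        return 0
    return 2 if sig.feature_similarity >= 0.98 else 1


# A stable transit step reuses aggressively; a late, slow step gets full compute.
transit = StepSignals(feature_similarity=0.99, motion_magnitude=0.6, progress=0.3)
grasp = StepSignals(feature_similarity=0.50, motion_magnitude=0.05, progress=0.9)
print(choose_vlm_mode(transit), choose_denoise_mode(transit))
print(choose_vlm_mode(grasp), choose_denoise_mode(grasp))
```

In this sketch the two modes are chosen independently for readability, whereas the scheduler described above allocates computation jointly across the vision encoder, LLM, and action head.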

Because it neither modifies nor retrains the base model, ElegantVLA is a general accelerator for modern VLA pipelines with explicit action-generation modules. On GR00T it achieves up to 2.55× average speedup, and on CogACT 3.77×. In real-world GR00T experiments across six tasks, it reduces computation by 2.18× and raises control frequency from 13.8 Hz to 26.3 Hz while preserving or improving task success.
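
The reported control frequencies can be converted into per-step time budgets with simple arithmetic. The short sketch below uses only the 13.8 Hz and 26.3 Hz figures stated above; the helper name step_time_ms is illustrative.

```python
def step_time_ms(hz):
    """Time available per control step, in milliseconds, at a given frequency."""
    return 1000.0 / hz


baseline_hz, accelerated_hz = 13.8, 26.3

baseline_ms = step_time_ms(baseline_hz)        # about 72.5 ms per step
accelerated_ms = step_time_ms(accelerated_hz)  # about 38.0 ms per step
frequency_gain = accelerated_hz / baseline_hz  # about 1.91x

print(f"{baseline_ms:.1f} ms -> {accelerated_ms:.1f} ms ({frequency_gain:.2f}x)")
```

The frequency ratio of about 1.91× is a different quantity from the reported 2.18× reduction in computation, so the two figures need not match.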
