History-Aware Visuomotor Policy Learning via Point Tracking



Jingjing Chen*   Hongjie Fang*   Chenxi Wang   Shiquan Wang†   Cewu Lu†

Shanghai Jiao Tong University      Noematrix      Flexiv Robotics

*Equal Contribution      †Equal Advising

    [Paper]    [Code Coming Soon]

Abstract. Many manipulation tasks require memory beyond the current observation, yet most visuomotor policies rely on the Markov assumption and thus struggle with repeated states or long-horizon dependencies. Existing methods attempt to extend observation horizons but remain insufficient for diverse memory requirements. To this end, we propose an object-centric history representation based on point tracking, which abstracts past observations into a structured form that retains only essential task-relevant information. Tracked points are encoded and aggregated at the object level, yielding a compact history representation that can be seamlessly integrated into various visuomotor policies. Our design provides full history-awareness with high computational efficiency, leading to improved overall task performance and decision accuracy. Through extensive evaluations on diverse manipulation tasks, we show that our method addresses multiple facets of memory requirements (task stage identification, spatial memorization, and action counting, as well as longer-term demands such as continuous and pre-loaded memory) and consistently outperforms both Markovian baselines and prior history-based approaches.


History-Aware Visuomotor Policies

Relying directly on raw observation histories is both inefficient and redundant. Our approach introduces an object-centric point-tracking representation that captures the motion and state of task-relevant objects, transforming long image sequences into a structured form. A compression module then condenses this history into a compact summary, which is integrated into standard visuomotor policies (ACT, Diffusion Policy, and RISE) to create history-aware policies that can leverage long-horizon context.
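The pipeline above (tracked points, object-level aggregation, compression to a fixed-size summary) can be illustrated with a minimal sketch. This is not the authors' implementation: the page does not specify the encoder or the compression module, so the sketch assumes a per-object centroid as a stand-in for the learned point encoder and temporal average pooling into a fixed number of bins as a stand-in for the compression module. All names (`object_centroid`, `compress_history`, `num_bins`) are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]   # one tracked 3D point (x, y, z)
Frame = Dict[str, List[Point]]       # object name -> its tracked points at one timestep


def object_centroid(points: List[Point]) -> Point:
    """Aggregate an object's tracked points into one vector (assumed stand-in encoder)."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def compress_history(history: List[Frame], num_bins: int) -> Dict[str, List[Point]]:
    """Summarize a variable-length history as `num_bins` pooled centroids per object.

    Assumption: average pooling over equal temporal bins replaces the learned
    compression module, so the output size does not grow with history length.
    """
    if not history or num_bins < 1:
        raise ValueError("need a non-empty history and at least one bin")
    total = len(history)
    summary: Dict[str, List[Point]] = {}
    for name in history[0]:
        centroids = [object_centroid(frame[name]) for frame in history]
        pooled = []
        for b in range(num_bins):
            start = b * total // num_bins
            end = max(start + 1, (b + 1) * total // num_bins)  # never an empty bin
            chunk = centroids[start:end]
            pooled.append(tuple(sum(c[i] for c in chunk) / len(chunk) for i in range(3)))
        summary[name] = pooled
    return summary


# Toy history: a "cup" with two tracked points sliding along x over four frames.
toy_history = [{"cup": [(float(t), 0.0, 0.0), (float(t), 2.0, 0.0)]} for t in range(4)]
toy_summary = compress_history(toy_history, num_bins=2)
```

In this toy setting, a policy would receive `toy_summary` as a fixed-size extra input alongside the current observation, regardless of how many frames the history contains.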

[Figure: policy overview]

Experiments

[Figure: task overview]

We designed seven real-world manipulation tasks that feature multiple repeated or difficult-to-distinguish states across different horizons, in order to evaluate both the history-awareness and the overall performance of visuomotor policies. Together, the tasks cover five aspects of history-awareness: counting, spatial memorization, task stage identification, pre-loaded memory, and continuous memory.

[Figure: results]

Our object-centric point-track history representation can be seamlessly integrated into various visuomotor policies. It enables history-aware decision-making and is effective across all five evaluation aspects.

[Figure: results]

2D trackers suffer from depth ambiguities and tracking discontinuities, producing low-quality tracks. By contrast, 3D trackers maintain accurate spatial relationships and handle occlusions effectively. They yield substantially better performance than 2D trackers and are therefore better suited to robotic manipulation.
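The depth ambiguity mentioned above can be made concrete with a small worked example. The sketch assumes an ideal pinhole camera with a hypothetical focal length of 400 pixels and the principal point at the origin; it is not tied to any tracker used in the paper. Two 3D points at different depths along the same camera ray project to the same pixel, so a 2D track alone cannot tell them apart.

```python
from typing import Tuple


def project(point: Tuple[float, float, float], focal: float = 400.0) -> Tuple[float, float]:
    """Ideal pinhole projection (assumed camera model): (x, y, z) -> (f*x/z, f*y/z)."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


near_point = (0.25, 0.5, 1.0)   # 1 m from the camera
far_point = (0.5, 1.0, 2.0)     # 2 m from the camera, on the same ray

pixel_near = project(near_point)
pixel_far = project(far_point)
same_pixel = pixel_near == pixel_far   # True: the 2D observations are identical
```

A 3D track keeps the `z` coordinate, so the two points stay distinguishable, which matches the spatial-relationship argument above.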

[Figure: results]

Asynchronous tracking with train-time augmentation improves efficiency while preserving policy performance. In all experiments and in the videos below, HistRISE is implemented with asynchronous tracking, while HistACT and HistDP are implemented with synchronous tracking.
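The page does not describe how the train-time augmentation works. One plausible reading, offered only as an assumption, is that an asynchronous tracker delivers tracks that lag a few steps behind the current frame, and that training simulates this lag by randomly dropping the newest history entries. The sketch below illustrates that idea; `delayed_history` and `max_delay` are hypothetical names, not part of the released method.

```python
import random
from typing import List, Sequence


def delayed_history(history: Sequence, max_delay: int, rng: random.Random) -> List:
    """Drop the newest 0..max_delay entries to mimic a tracker that lags behind.

    Assumption: this random truncation stands in for the paper's train-time
    augmentation, whose actual form is not given on this page.
    """
    if not history:
        return []
    delay = rng.randint(0, max_delay)        # inclusive on both ends
    delay = min(delay, len(history) - 1)     # always keep at least one entry
    return list(history[: len(history) - delay])


rng = random.Random(0)                       # seeded for reproducibility
full_history = list(range(10))               # stand-in for ten tracked timesteps
augmented = delayed_history(full_history, max_delay=3, rng=rng)
```

Under this assumption, a policy trained on such truncated histories would see at deployment the same kind of stale-by-a-few-steps input that an asynchronous tracker produces.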


Videos: Add-Salt

RISE vs. HistRISE
RISE vs. HistRISE
RISE vs. HistRISE
DP vs. HistDP
LongDP vs. HistDP
TraceDP vs. HistDP
ACT vs. HistACT
ACT vs. HistACT
ACT vs. HistACT


Videos: One-Move

RISE vs. HistRISE
RISE vs. HistRISE
RISE vs. HistRISE
DP vs. HistDP
LongDP vs. HistDP
TraceDP vs. HistDP
ACT vs. HistACT
ACT vs. HistACT
ACT vs. HistACT


Videos: Three-Scoop

RISE vs. HistRISE
RISE vs. HistRISE
RISE vs. HistRISE
DP vs. HistDP
LongDP vs. HistDP
TraceDP vs. HistDP
ACT vs. HistACT
ACT vs. HistACT
ACT vs. HistACT
ACT vs. HistACT


Videos: Swap-Easy

RISE vs. HistRISE
RISE vs. HistRISE
RISE vs. HistRISE
RISE vs. HistRISE
DP vs. HistDP
LongDP vs. HistDP
TraceDP vs. HistDP
ACT vs. HistACT
ACT vs. HistACT
ACT vs. HistACT


Videos: Swap-Hard

RISE vs. HistRISE
RISE vs. HistRISE
RISE vs. HistRISE
RISE vs. HistRISE
RISE vs. HistRISE
DP vs. HistDP
LongDP vs. HistDP
TraceDP vs. HistDP


Videos: Guess-Easy



Videos: Guess-Hard



BibTeX

@article{chen2025history,
    title   = {History-Aware Visuomotor Policy Learning via Point Tracking},
    author  = {Chen, Jingjing and Fang, Hongjie and Wang, Chenxi and Wang, Shiquan and Lu, Cewu},
    journal = {arXiv preprint arXiv:2509.17141},
    year    = {2025}
}
