MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos

Junyi Ma¹, Xieyuanli Chen², Wentao Bao³, Jingyi Xu¹, Hesheng Wang¹

¹Shanghai Jiao Tong University, ²National University of Defense Technology, ³Michigan State University

MADiff accurately predicts future hand trajectories on egocentric videos (the first observed frame serves as the prediction canvas).

What is MADiff?

MADiff is a pioneering Mamba diffusion model for hand trajectory prediction (HTP). It uses a purpose-built motion-aware Mamba for diffusion denoising. A novel motion-driven selective scan (MDSS) is tailored to the state transition, capturing the entangled patterns of hand motion and camera egomotion while preserving temporal causality.

Please refer to our paper for more details about the contributions of this work.

MADiff Architecture

Semantic Extraction

We use a foundation model to extract semantic features that are sensitive to hand-scenario relationships. Visual grounding lets us use text prompts to extract task-specific semantics.

Denoising to Future Trajectories

Motion-aware Mamba gradually denoises latent features in the latent diffusion process to predict accurate hand waypoints.
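To make the idea of gradual denoising concrete, the following is a minimal sketch of a generic DDPM-style reverse loop over a small latent vector. It is not the MADiff implementation: the linear noise schedule, the function names, and the `oracle_noise` stand-in (which replaces the learned motion-aware Mamba denoiser with the exact noise for a known target) are all assumptions made for illustration.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and cumulative products of alpha = 1 - beta."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def oracle_noise(x_t, t, alpha_bars, x0):
    """Hypothetical stand-in for the learned denoiser: returns the exact
    noise that maps the known clean latent x0 to x_t at step t."""
    a = alpha_bars[t]
    return [(xt - math.sqrt(a) * c) / math.sqrt(1.0 - a)
            for xt, c in zip(x_t, x0)]


def reverse_diffusion(x0_target, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise step by step."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in x0_target]
    for t in reversed(range(num_steps)):
        eps = oracle_noise(x, t, alpha_bars, x0_target)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = [(xi - coef * e) / math.sqrt(1.0 - betas[t])
                for xi, e in zip(x, eps)]
        # No noise is added at the final step.
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x


target = [0.3, -1.2, 0.8, 0.05]
latent = reverse_diffusion(target)
```

With a perfect noise prediction the loop recovers the target latent; in a real model the denoiser is learned, and the denoised latent would still need to be decoded into hand waypoints.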

Motion-Driven Selective Scan

MDSS is proposed in our motion-aware Mamba to achieve diffusion denoising under egomotion guidance.
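As a rough illustration of a selective scan guided by motion, the sketch below runs a causal state-space recurrence with a scalar hidden state, where an egomotion cue is fused into the input that drives the state transition. The additive fusion, the scalar state, the parameter names, and the choice to make only the step size input-dependent are simplifying assumptions, not the MDSS formulation from the paper.

```python
import math


def softplus(z):
    """Smooth positive mapping used for the step size."""
    return math.log1p(math.exp(z))


def motion_conditioned_scan(features, motions, a=-1.0,
                            w_delta=0.5, w_b=1.0, w_c=1.0):
    """Causal scan over a 1-D sequence with a scalar hidden state.

    features[t]: latent feature at step t
    motions[t]:  scalar egomotion cue at step t (hypothetical encoding)
    """
    h, outputs = 0.0, []
    for x, m in zip(features, motions):
        u = x + m                       # assumed fusion: simple addition
        delta = softplus(w_delta * u)   # input-dependent step size
        a_bar = math.exp(delta * a)     # discretized transition, in (0, 1) for a < 0
        h = a_bar * h + delta * w_b * u # state update driven by the fused input
        outputs.append(w_c * h)         # readout
    return outputs


feats = [0.2, 0.4, 0.1, -0.3]
static = motion_conditioned_scan(feats, [0.0, 0.0, 0.0, 0.0])
moving = motion_conditioned_scan(feats, [0.0, 0.0, 0.5, 0.5])
```

Because the scan is causal, the outputs for the first two steps are identical in both runs; the egomotion cue only affects outputs from the step where it appears onward.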

Training Strategy

We use angle and length constraints to optimize the directionality and stability of predicted hand trajectories.
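One plausible way to express such constraints is sketched below: an angle term that penalizes the cosine distance between predicted and ground-truth displacement vectors, and a length term that penalizes differences in segment length. The exact losses used by MADiff are given in the paper; the function names and these particular formulas are assumptions for illustration only.

```python
import math


def displacements(traj):
    """Vectors between consecutive 2-D waypoints."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(traj, traj[1:])]


def angle_loss(pred, gt, eps=1e-8):
    """Mean (1 - cosine similarity) between matching displacement vectors."""
    dp, dg = displacements(pred), displacements(gt)
    total = 0.0
    for (px, py), (gx, gy) in zip(dp, dg):
        cos = (px * gx + py * gy) / (math.hypot(px, py) * math.hypot(gx, gy) + eps)
        total += 1.0 - cos
    return total / len(dp)


def length_loss(pred, gt):
    """Mean absolute difference between matching segment lengths."""
    dp, dg = displacements(pred), displacements(gt)
    return sum(abs(math.hypot(*p) - math.hypot(*g))
               for p, g in zip(dp, dg)) / len(dp)


gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
stretched = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]  # right direction, wrong length
rotated = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]    # right length, wrong direction
```

In this toy example the stretched trajectory is penalized only by the length term and the rotated trajectory only by the angle term, which is why the two constraints complement each other.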

MADiff vs. AR and iter-NAR paradigms

MADiff inference with Motion-Aware Mamba and CDC operation

Additional HTP Results

The starting point of the predicted trajectory does not align with the hand in the image. This is because we use the first frame of the video clip as the prediction canvas, whereas the predicted trajectory starts at a timestamp close to the last frame of the clip. Please refer to our paper for more comparisons with baselines.

Multiple Samples by MADiff

MADiff follows a generative scheme and can therefore generate trajectory clusters by sampling multiple times from Gaussian noise.
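The sampling procedure can be sketched as drawing several independent noise realizations and collecting the resulting trajectories into a cluster. In the sketch below, `toy_sampler` is a hypothetical stand-in for one full denoising pass of the model, and its drift and noise values are arbitrary; only the pattern of repeated sampling and aggregation is the point.

```python
import random


def toy_sampler(rng, horizon=4):
    """Hypothetical stand-in for one full denoising pass: returns a list of
    (x, y) waypoints perturbed by Gaussian noise."""
    x, y, traj = 0.5, 0.5, []
    for _ in range(horizon):
        x += 0.05 + rng.gauss(0.0, 0.01)
        y += 0.02 + rng.gauss(0.0, 0.01)
        traj.append((x, y))
    return traj


def sample_cluster(num_samples, seed=0):
    """Draw several trajectories from independent noise realizations."""
    rng = random.Random(seed)
    return [toy_sampler(rng) for _ in range(num_samples)]


def mean_trajectory(cluster):
    """Per-waypoint mean over all sampled trajectories."""
    n = len(cluster)
    return [(sum(t[k][0] for t in cluster) / n,
             sum(t[k][1] for t in cluster) / n)
            for k in range(len(cluster[0]))]


cluster = sample_cluster(20)
center = mean_trajectory(cluster)
```

Fixing the seed makes the cluster reproducible, which is convenient when comparing model variants on the same noise draws.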

Multiple Samples by Models without MDSS

Models that are agnostic to camera egomotion tend to accumulate prediction errors because motion gaps grow along the time axis. Their final positions and overall shape similarity are also worse than those of the model with MDSS.

Text Prompt Tuning

“hand”

“hand, which is scooping”

“hand”

“hand, which is throwing”

Text prompt tuning helps extract action-specific semantics, which improves HTP performance on the corresponding actions.

BibTeX

@misc{ma2024madiff,
      title={MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos}, 
      author={Junyi Ma and Xieyuanli Chen and Wentao Bao and Jingyi Xu and Hesheng Wang},
      year={2024},
      eprint={2409.02638},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.02638}, 
}