HUMOF: Human Motion Forecasting in Interactive Social Scenes

Caiyi Sun1,2*†, Yujing Sun2,3*, Xiao Han1, Zemin Yang1, Jiawei Liu4, Xinge Zhu5, Siu-Ming Yiu2‡, Yuexin Ma1‡
1ShanghaiTech University · 2The University of Hong Kong · 3Digital Trust Centre, Nanyang Technological University · 4Sun Yat-sen University · 5The Chinese University of Hong Kong
*Equal contribution. †Work done during internship at ShanghaiTech University. ‡Corresponding authors.
ICLR 2026 Poster
Keywords: scene-aware human motion forecasting
TL;DR: Human motion prediction considering human-scene and human-human interactions

Abstract

Complex dynamic scenes pose significant challenges for predicting human behavior because of the abundance of interaction information, such as human-human and human-environment interactions. These factors complicate the analysis and understanding of human behavior and increase the uncertainty of motion forecasting, so existing motion prediction methods struggle in such scenarios. In this paper, we propose an effective method for human motion forecasting in dynamic scenes. To represent interactions comprehensively, we design a hierarchical interaction feature representation in which high-level features capture the overall context of the interactions while low-level features focus on fine-grained details. In addition, we propose a coarse-to-fine interaction reasoning module that leverages both spatial and frequency perspectives to exploit the hierarchical features efficiently, thereby improving the accuracy of motion predictions. Our method achieves state-of-the-art performance across four public datasets. The source code will be available at https://github.com/scy639/HUMOF.

Method

HUMOF encodes human-human and human-scene interactions with hierarchical features, then reasons over them using a coarse-to-fine interaction module that leverages both spatial and frequency-domain cues to forecast future motion.
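The frequency-domain view of a motion sequence is commonly obtained with a discrete cosine transform (DCT) over each joint trajectory, where low-frequency coefficients summarize coarse motion and high-frequency ones carry fine detail. The paper does not spell out its exact formulation on this page, so the following is a minimal sketch of that standard technique, not HUMOF's implementation; the array shapes and the cutoff `K` are illustrative choices.

```python
import numpy as np
from scipy.fft import dct, idct

# Toy motion: T frames x J joints x 3 coordinates (values are illustrative only).
T, J = 16, 5
motion = np.cumsum(np.random.default_rng(0).normal(size=(T, J, 3)), axis=0)

# DCT-II along the time axis turns each joint trajectory into frequency
# coefficients: low indices capture coarse motion, high indices fine detail.
coeffs = dct(motion, type=2, norm="ortho", axis=0)

# Keeping only the first K low-frequency coefficients yields a smooth,
# compact "coarse" view of the sequence; inverting recovers a trajectory.
K = 4
coarse = np.zeros_like(coeffs)
coarse[:K] = coeffs[:K]
recon = idct(coarse, type=2, norm="ortho", axis=0)

print(recon.shape)  # (16, 5, 3)
```

Truncating in the frequency domain like this is a common way to separate coarse motion trends from fine-grained variation before feeding both views to a predictor.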

HUMOF overview figure

HUMOF Overview.

Detailed architecture of HUMOF

Detailed architecture of HUMOF. Our method takes inputs from three aspects: the past motions of the target person, a 3D point cloud for the scene, and motion sequences of interactive persons. The interactions are comprehensively encoded by (a) Hierarchical Human-Human Interaction Representation and (b) Hierarchical Human-Scene Interaction Representation, respectively. Thereafter, the hierarchical representations are leveraged by (c), a Coarse-to-Fine Interaction Reasoning Module, to predict future motions for the target person. Details of the Interaction-Perceptive Transformer layer in (c) are shown on the top right.
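One way to read the coarse-to-fine reasoning in (c) is as residual cross-attention from the target person's motion tokens to the hierarchical interaction features, attending to the high-level context first and then to the fine-grained details. The sketch below illustrates that reading only; the variable names, feature dimensions, and single-head attention are assumptions, not the released architecture.

```python
import numpy as np

def attend(queries, context):
    """Single-head scaled dot-product cross-attention (NumPy sketch)."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

rng = np.random.default_rng(0)
d = 32
motion_tokens = rng.normal(size=(10, d))  # target person's past-motion features
coarse_ctx = rng.normal(size=(4, d))      # high-level interaction features
fine_ctx = rng.normal(size=(64, d))       # fine-grained interaction features

# Coarse-to-fine: fuse the global interaction context first,
# then refine the fused tokens with fine-grained details.
x = motion_tokens + attend(motion_tokens, coarse_ctx)
x = x + attend(x, fine_ctx)
print(x.shape)  # (10, 32)
```

The residual form keeps the motion tokens' shape fixed, so additional levels of the hierarchy could be attended to in the same coarse-to-fine order.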

Quantitative Results

Quantitative results (placeholder)

Visual Results

Visualization of motion prediction results on dynamic scenes
More visual comparisons on the HOI-M3 dataset

Visual comparisons on the HOI-M3 dataset.

Visual comparisons on the GTA-IM dataset

Visual comparisons on the GTA-IM dataset.

BibTeX

@article{sun2025humof,
  title={HUMOF: Human Motion Forecasting in Interactive Social Scenes},
  author={Sun, Caiyi and Sun, Yujing and Han, Xiao and Yang, Zemin and Liu, Jiawei and Zhu, Xinge and Yiu, Siu Ming and Ma, Yuexin},
  journal={arXiv preprint arXiv:2506.03753},
  year={2025}
}