HUMOF: Human Motion Forecasting in Interactive Social Scenes

Caiyi Sun1,2*†, Yujing Sun2,3*, Xiao Han1, Zemin Yang1, Jiawei Liu4, Xinge Zhu5, Siu-Ming Yiu2‡, Yuexin Ma1‡
1ShanghaiTech University · 2The University of Hong Kong · 3Digital Trust Centre, Nanyang Technological University · 4Sun Yat-sen University · 5The Chinese University of Hong Kong
*Equal contribution. †Work done during internship at ShanghaiTech University. ‡Corresponding authors.
ICLR 2026 Poster
Keywords: scene-aware human motion forecasting
TL;DR: Human motion prediction considering human-scene and human-human interactions

Abstract

Complex dynamic scenes pose significant challenges for predicting human behavior because of the abundance of interaction information, such as human-human and human-environment interactions. These factors complicate the analysis and understanding of human behavior and increase the uncertainty of motion forecasting, so existing motion prediction methods struggle in such scenarios. In this paper, we propose an effective method for human motion forecasting in dynamic scenes. To represent interactions comprehensively, we design a hierarchical interaction feature representation in which high-level features capture the overall context of the interactions while low-level features focus on fine-grained details. In addition, we propose a coarse-to-fine interaction reasoning module that leverages both spatial and frequency perspectives to exploit the hierarchical features efficiently, thereby improving the accuracy of motion predictions. Our method achieves state-of-the-art performance across four public datasets. The source code will be available at https://github.com/scy639/HUMOF.

Method

HUMOF encodes human-human and human-scene interactions with hierarchical features, then reasons over them using a coarse-to-fine interaction module that leverages both spatial and frequency-domain cues to forecast future motion.
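The frequency-domain view of a motion sequence is commonly obtained with a discrete cosine transform (DCT) over each joint trajectory, where low-frequency coefficients summarize coarse motion and high-frequency ones carry fine detail. The paper does not spell out its exact formulation on this page, so the following is a minimal sketch of that standard technique, not HUMOF's implementation; the array shapes and the cutoff `K` are illustrative choices.

```python
import numpy as np
from scipy.fft import dct, idct

# Toy motion: T frames x J joints x 3 coordinates (values are illustrative only).
T, J = 16, 5
motion = np.cumsum(np.random.default_rng(0).normal(size=(T, J, 3)), axis=0)

# DCT-II along the time axis turns each joint trajectory into frequency
# coefficients: low indices capture coarse motion, high indices fine detail.
coeffs = dct(motion, type=2, norm="ortho", axis=0)

# Keeping only the first K low-frequency coefficients yields a smooth,
# compact "coarse" view of the sequence; inverting recovers a trajectory.
K = 4
coarse = np.zeros_like(coeffs)
coarse[:K] = coeffs[:K]
recon = idct(coarse, type=2, norm="ortho", axis=0)

print(recon.shape)  # (16, 5, 3)
```

Truncating in the frequency domain like this is a common way to separate coarse motion trends from fine-grained variation before feeding both views to a predictor.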

HUMOF overview figure

HUMOF Overview.

Detailed architecture of HUMOF

Detailed architecture of HUMOF. Our method takes inputs from three aspects: the past motions of the target person, a 3D point cloud for the scene, and motion sequences of interactive persons. The interactions are comprehensively encoded by (a) Hierarchical Human-Human Interaction Representation and (b) Hierarchical Human-Scene Interaction Representation, respectively. Thereafter, the hierarchical representations are leveraged by (c), a Coarse-to-Fine Interaction Reasoning Module, to predict future motions for the target person. Details of the Interaction-Perceptive Transformer layer in (c) are shown on the top right.
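One way to read the coarse-to-fine reasoning in (c) is as residual cross-attention from the target person's motion tokens to the hierarchical interaction features, attending to the high-level context first and then to the fine-grained details. The sketch below illustrates that reading only; the variable names, feature dimensions, and single-head attention are assumptions, not the released architecture.

```python
import numpy as np

def attend(queries, context):
    """Single-head scaled dot-product cross-attention (NumPy sketch)."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

rng = np.random.default_rng(0)
d = 32
motion_tokens = rng.normal(size=(10, d))  # target person's past-motion features
coarse_ctx = rng.normal(size=(4, d))      # high-level interaction features
fine_ctx = rng.normal(size=(64, d))       # fine-grained interaction features

# Coarse-to-fine: fuse the global interaction context first,
# then refine the fused tokens with fine-grained details.
x = motion_tokens + attend(motion_tokens, coarse_ctx)
x = x + attend(x, fine_ctx)
print(x.shape)  # (10, 32)
```

The residual form keeps the motion tokens' shape fixed, so additional levels of the hierarchy could be attended to in the same coarse-to-fine order.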

Quantitative Results

Quantitative results (placeholder)

Visual Results

Visualization of motion prediction results on dynamic scenes
More visual comparisons on the HOI-M3 dataset

Visual comparisons on the HOI-M3 dataset.

Visual comparisons on the GTA-IM dataset

Visual comparisons on the GTA-IM dataset.

BibTeX

@article{sun2025humof,
  title={HUMOF: Human Motion Forecasting in Interactive Social Scenes},
  author={Sun, Caiyi and Sun, Yujing and Han, Xiao and Yang, Zemin and Liu, Jiawei and Zhu, Xinge and Yiu, Siu Ming and Ma, Yuexin},
  journal={arXiv preprint arXiv:2506.03753},
  year={2025}
}