Abstract
Complex dynamic scenes present significant challenges for predicting human behavior due to the abundance of interaction information, such as human-human and human-environment interactions. These factors complicate the analysis and understanding of human behavior and thereby increase the uncertainty of motion forecasting, so existing motion prediction methods struggle in such scenarios. In this paper, we propose an effective method for human motion forecasting in dynamic scenes. To achieve a comprehensive representation of interactions, we design a hierarchical interaction feature representation in which high-level features capture the overall context of the interactions, while low-level features focus on fine-grained details. In addition, we propose a coarse-to-fine interaction reasoning module that leverages both spatial and frequency perspectives to efficiently utilize the hierarchical features, thereby enhancing the accuracy of motion predictions. Our method achieves state-of-the-art performance across four public datasets. The source code will be available at https://github.com/scy639/HUMOF.
Method
HUMOF encodes human-human and human-scene interactions with hierarchical features, then reasons over them using a coarse-to-fine interaction module that leverages both spatial and frequency-domain cues to forecast future motion.
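To make the "hierarchical" idea concrete, here is a minimal NumPy sketch of a multi-level pose representation: a coarse body-centroid trajectory for overall context, a mid level of limb-group means, and the full joint set for fine-grained detail. The four-group split and the function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hierarchical_features(joints: np.ndarray) -> list:
    # joints: (T, J, 3) pose sequence of one person.
    # Level 0 (coarse): body-centroid trajectory capturing overall context.
    coarse = joints.mean(axis=1)                                         # (T, 3)
    # Level 1 (mid): means over four joint groups (hypothetical grouping).
    groups = np.array_split(np.arange(joints.shape[1]), 4)
    mid = np.stack([joints[:, g].mean(axis=1) for g in groups], axis=1)  # (T, 4, 3)
    # Level 2 (fine): all joint coordinates, preserving detail.
    fine = joints                                                        # (T, J, 3)
    return [coarse, mid, fine]

# Toy usage: 8 frames, 12 joints.
feats = hierarchical_features(np.zeros((8, 12, 3)))
```

A reasoning module can then attend to the coarse level first and refine its prediction with progressively finer levels.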
HUMOF Overview.
Detailed architecture of HUMOF. Our method takes three inputs: the past motion of the target person, a 3D point cloud of the scene, and the motion sequences of interactive persons. The interactions are comprehensively encoded by (a) the Hierarchical Human-Human Interaction Representation and (b) the Hierarchical Human-Scene Interaction Representation. The hierarchical representations are then leveraged by (c) the Coarse-to-Fine Interaction Reasoning Module to predict the target person's future motion. Details of the Interaction-Perceptive Transformer layer in (c) are shown at the top right.
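The caption above mentions reasoning from a frequency perspective. A common way to obtain a frequency-domain view of a motion sequence (used widely in motion forecasting, though not necessarily the authors' exact formulation) is the discrete cosine transform: keeping only the lowest coefficients gives a coarse, smooth summary of the trajectory that finer stages can refine. A minimal NumPy sketch, with all function names assumed for illustration:

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    # Orthonormal DCT-II basis: row k is the k-th frequency component.
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    m[0] *= 1.0 / np.sqrt(2.0)
    return m

def to_frequency(motion: np.ndarray, keep: int) -> np.ndarray:
    # motion: (T, D) sequence of flattened joint coordinates.
    # Returns the `keep` lowest-frequency DCT coefficients per coordinate.
    return (dct_matrix(motion.shape[0]) @ motion)[:keep]

def from_frequency(coeffs: np.ndarray, length: int) -> np.ndarray:
    # Zero-pad the discarded high frequencies, then invert (orthonormal,
    # so the inverse is the transpose).
    padded = np.zeros((length, coeffs.shape[1]))
    padded[: coeffs.shape[0]] = coeffs
    return dct_matrix(length).T @ padded

# Toy trajectory: 16 frames, 6 coordinates (2 joints x 3D).
T = 16
t = np.linspace(0.0, 1.0, T)[:, None]
motion = np.concatenate([t, t**2, np.sin(t)] * 2, axis=1)
coarse = to_frequency(motion, keep=4)   # coarse frequency-domain view
recon = from_frequency(coarse, T)       # smooth low-frequency reconstruction
```

Because the basis is orthonormal, keeping all T coefficients reconstructs the sequence exactly; truncation trades detail for a compact, noise-robust coarse representation.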
Quantitative Results
Visual Results
Visual comparisons on the HOI-M3 dataset.
Visual comparisons on the GTA-IM dataset.
BibTeX
@article{sun2025humof,
  title={HUMOF: Human Motion Forecasting in Interactive Social Scenes},
  author={Sun, Caiyi and Sun, Yujing and Han, Xiao and Yang, Zemin and Liu, Jiawei and Zhu, Xinge and Yiu, Siu Ming and Ma, Yuexin},
  journal={arXiv preprint arXiv:2506.03753},
  year={2025}
}