Disentangling Foreground and Background Motion
for Enhanced Realism in Human Video Generation


Our approach generates realistic human actions against moving backgrounds, departing from the conventional static backdrop.

Abstract

Recent progress in human video synthesis relies on stable diffusion models for high-quality video generation. However, most existing methods animate only the foreground from pose data and leave the background static, unlike real videos in which the background moves with the action. Our method learns foreground and background motion separately, using pose-driven animation for the person and sparse tracking points for the background, reflecting how the two naturally interact. Trained on real videos annotated with this motion representation, our model produces coherent foreground and background movement. For longer videos, we generate clips sequentially, injecting global features at each stage and bridging clips through the final frame of the previous one to preserve continuity, while also infusing features from the reference image to prevent color drift. Experiments show that our method blends foreground actions with responsive backgrounds more effectively than previous approaches.
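To make the disentangled conditioning concrete, here is a minimal PyTorch-style sketch, not the paper's implementation: one encoder for rendered pose maps (foreground) and one for rasterized sparse tracking points (background), fused before being passed to the denoising network. Module names, channel sizes, and the fusion-by-addition choice are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): separate conditioning streams for
# foreground pose and background sparse tracks, fused into one residual.
import torch
import torch.nn as nn

class MotionConditioner(nn.Module):
    def __init__(self, channels: int = 320):
        super().__init__()
        # Foreground branch: encodes rendered pose maps (e.g. skeleton images).
        self.pose_encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Background branch: encodes sparse tracking points rasterized as
        # displacement maps (dx, dy, visibility).
        self.track_encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, pose_map: torch.Tensor, track_map: torch.Tensor) -> torch.Tensor:
        # Each branch sees only its own motion signal, so foreground and
        # background dynamics come from disentangled sources.
        fg = self.pose_encoder(pose_map)
        bg = self.track_encoder(track_map)
        return fg + bg  # fused conditioning added to the denoiser's features

# Example: conditioning for one downsampled frame.
pose = torch.zeros(1, 3, 64, 64)    # rendered pose map
tracks = torch.zeros(1, 3, 64, 64)  # rasterized sparse track displacements
cond = MotionConditioner()(pose, tracks)
print(cond.shape)  # torch.Size([1, 320, 64, 64])
```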

Architecture

Our method separates foreground and background motion, using pose estimation for the foreground and sparse tracking points for the background. This enables realistic human actions against dynamic backgrounds. We also introduce a pipeline for generating long videos without accumulated error: each clip is conditioned on the final frame of the previous clip and on global features of the reference image, keeping extended sequences coherent and consistent. A hypothetical sketch of this scheme follows below.
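The sketch below illustrates the long-video scheme described above; it is a hypothetical outline rather than the authors' code. Clips are generated autoregressively: each clip is anchored on the final frame of the previous one and receives global features from the reference image to limit drift. `generate_clip` and `extract_global_features` are placeholder stand-ins for the trained model components.

```python
# Hypothetical sketch of autoregressive long-video generation with
# last-frame bridging and reference-image global features.
from typing import Callable, List, Sequence

def generate_long_video(
    reference_image,
    motion_chunks: Sequence,            # per-clip pose + track conditions
    generate_clip: Callable,            # (ref_feats, anchor_frame, motion) -> list of frames
    extract_global_features: Callable,  # reference image -> global feature vector
) -> List:
    # Global features of the reference image are injected into every clip
    # to keep colors consistent across the whole video.
    ref_feats = extract_global_features(reference_image)
    frames: List = []
    anchor = reference_image            # the first clip starts from the reference
    for motion in motion_chunks:
        clip = generate_clip(ref_feats, anchor, motion)
        frames.extend(clip)
        anchor = clip[-1]               # last frame bridges to the next clip
    return frames
```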


Figure: Method overview.

Video

Acknowledgements

The website template was borrowed from Michaël Gharbi and Mip-NeRF.