"The girl turns and picks up the [TGT] teddy bear resting on the bed."
We present a target-aware video diffusion model that generates videos from an input image, in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask, and the action is described through a text prompt.
Our key motivation is to incorporate target awareness into video generation so that actors can perform directed actions on designated objects. This allows video diffusion models to act as motion planners, leveraging the priors of large-scale video generative models to produce plausible predictions of human-object interactions.
We build our target-aware model by extending a baseline model to take the target mask as an additional input. To enforce target awareness, we introduce a special token in the text prompt that encodes the target's spatial information. We then fine-tune the model on our curated dataset with an additional cross-attention loss that aligns the token's cross-attention maps with the input target mask. To further improve performance, we apply this loss selectively to the most semantically relevant attention regions and transformer blocks.
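As a rough illustration of this objective, the PyTorch sketch below aligns the [TGT] token's cross-attention map from one transformer block with the downsampled target mask; the function name, tensor shapes, and the binary-cross-entropy form are our own assumptions here, not the exact training code.

```python
import torch
import torch.nn.functional as F

def tgt_attention_loss(attn_maps, tgt_token_idx, tgt_mask):
    """Hypothetical sketch: align the [TGT] token's cross-attention with the target mask.

    attn_maps:     (B, heads, H*W, num_text_tokens) cross-attention weights
                   from one selected transformer block.
    tgt_token_idx: index of the [TGT] token in the text sequence.
    tgt_mask:      (B, 1, H_img, W_img) binary target mask.
    """
    B, _, hw, _ = attn_maps.shape
    h = w = int(hw ** 0.5)  # assume a square latent grid for simplicity

    # Attention assigned to the [TGT] token at each spatial location, averaged over heads.
    tgt_attn = attn_maps[..., tgt_token_idx].mean(dim=1).reshape(B, 1, h, w)

    # Resize the ground-truth mask to the attention resolution.
    mask = F.interpolate(tgt_mask.float(), size=(h, w), mode="nearest")

    # Normalize per sample so the map lies in [0, 1] before comparison.
    tgt_attn = tgt_attn / (tgt_attn.amax(dim=(2, 3), keepdim=True) + 1e-6)

    # Encourage high attention inside the mask and low attention outside it.
    return F.binary_cross_entropy(tgt_attn.clamp(1e-6, 1 - 1e-6), mask)
```

In practice this term is added to the standard diffusion objective and, as noted above, applied only to selected attention regions and transformer blocks; the sketch uses a single block for clarity.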
Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: zero-shot, physically plausible 3D human-object interaction (HOI) motion synthesis and long-term video content creation.
Our target-aware model generates videos in which the actor interacts with the specified target. Baseline methods often hallucinate a new instance of the target described in the text prompt instead of acting on the one present in the scene, leading to unintended outputs.
"The girl turns and picks up the [TGT] teddy bear resting on the bed."
"The man lifts the [TGT] bottle of coke and takes a slow sip."
In scenes with multiple objects of the same type, our method enables actors to accurately interact with the intended target by leveraging its mask.
"The woman picks up the [TGT] mug cup and takes a sip of coffee."
Although trained on human-scene interaction datasets, our target-aware model generalizes to interactions with non-human actors.
Left: “The rabbit turns its head towards the [TGT] carrot and takes a bite of it.”
Right: “The dog bites onto the [TGT] frisbee and lifts it off the ground.”
Our model, extended to accept two masks, allows specifying both the source actor and the target object: the actor is indicated with a red mask and the target with a green mask.
“The [SRC] robotic arm picks up the [TGT] blue can with its robot hand.”
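A plausible way to support the two-mask conditioning is to stack both masks as extra input channels and tie each to its own special token; the snippet below is a hypothetical sketch of that input assembly (names and shapes are ours), not the paper's exact implementation.

```python
import torch

def build_two_mask_conditioning(latent_frame, src_mask, tgt_mask):
    """Hypothetical sketch: concatenate actor ([SRC]) and target ([TGT]) masks
    as additional channels of the image conditioning.

    latent_frame: (B, C, H, W) encoded first frame
    src_mask:     (B, 1, H, W) binary mask of the source actor
    tgt_mask:     (B, 1, H, W) binary mask of the target object
    """
    return torch.cat([latent_frame, src_mask.float(), tgt_mask.float()], dim=1)
```

The token-level attention loss can then be applied twice, once for [SRC] with the actor mask and once for [TGT] with the target mask.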
Given images of a person and a scene, we perform depth-based 3D insertion of the person into the scene and render them together to produce input frames for video diffusion. We synthesize navigation content by interpolating between generated initial and final frames, and synthesize action and manipulation content with our target-aware video diffusion model.
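At a high level, this application chains a few steps; the pseudocode below sketches that flow with placeholder helpers (estimate_depth, insert_person_3d, render, interpolate_frames, generate_target_aware_video), which stand in for components that are not specified here.

```python
# Hypothetical pipeline sketch for long-term video content creation.
# Every helper below is a placeholder for the corresponding step in the caption,
# not a real API.

def create_long_video(person_img, scene_img, waypoints, segments):
    # Depth-based 3D insertion: lift the scene to 3D, place the person at each
    # waypoint, and render one composite keyframe per placement.
    depth = estimate_depth(scene_img)
    keyframes = [render(insert_person_3d(person_img, scene_img, depth, wp))
                 for wp in waypoints]

    clips = []
    for start, end, seg in zip(keyframes[:-1], keyframes[1:], segments):
        if seg.kind == "navigation":
            # Navigation: interpolate between the initial and final frames.
            clips.append(interpolate_frames(start, end))
        else:
            # Action / manipulation: target-aware video diffusion conditioned on
            # the start frame, the target mask, and a [TGT] prompt.
            clips.append(generate_target_aware_video(start, seg.target_mask, seg.prompt))
    return concatenate_clips(clips)
```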
We demonstrate our target-aware model's relevance to robotics by performing physics-based imitation learning on the 3D motion of a person interacting with a target in the scene. We use GVHMR to estimate 3D human motion from our generated videos and PhysHOI to imitate that motion in a physics simulator.
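The flow is a two-stage pipeline; the sketch below wraps the two tools in hypothetical functions (run_gvhmr, run_physhoi), since their actual command-line interfaces are not reproduced here.

```python
# Hypothetical two-stage pipeline: generated video -> 3D human motion -> physics-based imitation.
# run_gvhmr and run_physhoi are placeholder wrappers, not the tools' real APIs.

def imitate_generated_interaction(generated_video_path, object_mesh_path):
    # Stage 1: estimate world-grounded 3D human motion (e.g., SMPL poses)
    # from the generated video with GVHMR.
    reference_motion = run_gvhmr(generated_video_path)

    # Stage 2: train a physics-based controller with PhysHOI that imitates the
    # estimated human-object interaction in simulation.
    policy = run_physhoi(reference_motion=reference_motion, object_mesh=object_mesh_path)
    return policy
```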
@inproceedings{kim2025target,
title={Target-Aware Video Diffusion Models},
author={Kim, Taeksoo and Joo, Hanbyul},
booktitle={ICLR},
year={2026}
}