"The girl turns and picks up the [TGT] teddy bear resting on the bed."
We present a target-aware video diffusion model that generates videos from an input image in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask and the desired action is described via a text prompt.
Unlike existing image-to-video diffusion models that often require dense structural or motion cues to guide the actor's movements, our target-aware model relies solely on a simple mask to specify the target, leveraging the generalization capabilities of pretrained models to produce plausible actions. This makes our method particularly effective for human-object interaction (HOI) scenarios, where providing precise action guidance is challenging, and further enables the use of video diffusion models for high-level motion planning in applications such as robotics.
We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target’s spatial information within the text prompt. We then fine-tune the model on our curated dataset using a novel cross-attention loss that aligns this token's cross-attention maps with the input target mask. To further improve performance, we apply this loss selectively, only to the most semantically relevant transformer blocks and attention regions.
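As a rough illustration of the alignment idea, the sketch below penalizes the difference between a [TGT]-token attention map and the binary target mask, and adds that penalty only for a chosen subset of transformer blocks. The text above does not give the exact form of the loss, so the mean-squared-error form, the sigmoid squashing, the `weight` value, and the names `cross_attention_mask_loss`, `total_loss`, and `selected_blocks` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cross_attention_mask_loss(attn_logits, target_mask):
    """Mean-squared error between a [TGT]-token attention map and a binary mask.

    attn_logits: H x W nested list of raw attention scores for the [TGT] token
                 (hypothetical input: one block, one head, one frame).
    target_mask: H x W nested list of 0/1 values marking the target region.
    """
    flat_logits = [v for row in attn_logits for v in row]
    flat_mask = [float(m) for row in target_mask for m in row]
    if len(flat_logits) != len(flat_mask):
        raise ValueError("attention map and mask must have the same shape")
    # Squash scores into (0, 1) so they are comparable with the binary mask.
    # (Assumed normalization; moderate score magnitudes are expected here.)
    probs = [1.0 / (1.0 + math.exp(-v)) for v in flat_logits]
    return sum((p - m) ** 2 for p, m in zip(probs, flat_mask)) / len(probs)


def total_loss(diffusion_loss, attn_maps_by_block, target_mask,
               selected_blocks, weight=0.1):
    """Add the alignment term only for the selected transformer blocks.

    attn_maps_by_block: dict mapping a block index to its H x W attention map.
    selected_blocks:    block indices that receive the loss (assumed given).
    weight:             assumed scalar balancing the two terms.
    """
    picked = [attn_maps_by_block[b] for b in selected_blocks]
    align = sum(cross_attention_mask_loss(a, target_mask) for a in picked) / len(picked)
    return diffusion_loss + weight * align


if __name__ == "__main__":
    mask = [[0, 1], [0, 0]]
    on_target = [[-8.0, 8.0], [-8.0, -8.0]]   # attention peaks inside the mask
    off_target = [[8.0, -8.0], [-8.0, -8.0]]  # attention peaks outside the mask
    print(cross_attention_mask_loss(on_target, mask))
    print(cross_attention_mask_loss(off_target, mask))
```

An attention map that peaks inside the mask yields a much smaller penalty than one that peaks elsewhere, which is the behavior the fine-tuning objective is meant to encourage.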
Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: video content creation and zero-shot 3D HOI motion synthesis.
Our target-aware model generates videos where the actor interacts with the specified target. Baseline methods often hallucinate the target described in the input text prompt, leading to unintended outputs.
"The girl turns and picks up the [TGT] teddy bear resting on the bed."
"The man lifts the [TGT] bottle of coke and takes a slow sip."
In scenes with multiple objects of the same type, our method enables actors to accurately interact with the intended target by leveraging its mask.
"The woman picks up the [TGT] mug cup and takes a sip of coffee."
While trained on human-scene interaction datasets, our target-aware model generalizes to non-human interactions.
Left: “The rabbit turns its head towards the [TGT] carrot and takes a bite of it.”
Right: “The dog bites onto the [TGT] frisbee and lifts it off the ground.”
Our model, extended to take in two masks, enables specifying both the source actor and the target object. The actor is indicated with a red mask and the target is indicated with a green mask.
“The [SRC] robotic arm picks up the [TGT] blue can with its robot hand.”
Given images of a person and a scene, we perform depth-based 3D insertion of the person into the scene and render them together to produce input frames for video diffusion. We interpolate between the generated initial and final frames to synthesize navigation content, and use our target-aware video diffusion model to synthesize action and manipulation content.
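The text above does not describe how the depth-based insertion is implemented. As a simplified stand-in, the sketch below shows a per-pixel depth test in image space: a person pixel is kept only where the person is closer to the camera than the scene. The function name `composite_by_depth` and the nested-list image format are hypothetical, and a real pipeline would operate on 3D geometry rather than a 2D composite.

```python
def composite_by_depth(scene_rgb, scene_depth, person_rgb, person_depth, person_mask):
    """Per-pixel depth test: keep the person pixel only where it is in front of the scene.

    All inputs are H x W nested lists. Depths are distances from the camera,
    and person_mask is 1 where the person covers the pixel, else 0.
    """
    out = []
    for y, row in enumerate(scene_rgb):
        out_row = []
        for x, scene_px in enumerate(row):
            visible = person_mask[y][x] and person_depth[y][x] < scene_depth[y][x]
            out_row.append(person_rgb[y][x] if visible else scene_px)
        out.append(out_row)
    return out


if __name__ == "__main__":
    scene = [["S", "S", "S"]]
    person = [["P", "P", "P"]]
    # Pixel 0: person in front; pixel 1: person behind the scene; pixel 2: no person.
    print(composite_by_depth(scene, [[3, 3, 3]], person, [[2, 5, 1]], [[1, 1, 0]]))
```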
We demonstrate our target-aware model's relevance to robotic applications by performing imitation learning on the 3D poses of a person interacting with a target in the scene. We use GVHMR to estimate 3D human motion from our generated videos and PhysHOI to perform physics-based imitation learning of that motion.
@article{kim2025target,
title={Target-Aware Video Diffusion Models},
author={Kim, Taeksoo and Joo, Hanbyul},
journal={arXiv preprint arXiv:2503.18950},
year={2025}
}