Target-Aware Video Diffusion Models

¹Seoul National University, ²RLWRLD
arXiv 2025

TL;DR. Given an input image, our target-aware video diffusion model generates a video in which an actor accurately interacts with a target specified by its segmentation mask.

Abstract

We present a target-aware video diffusion model that, given an input image, generates a video in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask, and the desired action is described via a text prompt.

Unlike existing image-to-video diffusion models that often require dense structural or motion cues to guide the actor's movements, our target-aware model relies solely on a simple mask to specify the target, leveraging the generalization capabilities of pretrained models to produce plausible actions. This makes our method particularly effective for human-object interaction (HOI) scenarios, where providing precise action guidance is challenging, and further enables the use of video diffusion models for high-level motion planning in applications such as robotics.

We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target’s spatial information within the text prompt. We then fine-tune the model with our curated dataset using a novel cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant transformer blocks and attention regions.

Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: video content creation and zero-shot 3D HOI motion synthesis.

Results

Comparison on target alignment

Our target-aware model generates videos where the actor interacts with the specified target. Baseline methods often hallucinate the target described in the input text prompt, leading to unintended outputs.

"The girl turns and picks up the [TGT] teddy bear resting on the bed."

"The man lifts the [TGT] bottle of coke and takes a slow sip."

Multiple objects of the same type

In scenes with multiple objects of the same type, our method enables actors to accurately interact with the intended target by leveraging its mask.

"The woman picks up the [TGT] mug cup and takes a sip of coffee."

Non-human interactions

While trained on human-scene interaction datasets, our target-aware model generalizes to non-human interactions.

Left: “The rabbit turns its head towards the [TGT] carrot and takes a bite of it.”
Right: “The dog bites onto the [TGT] frisbee and lifts it off the ground.”

Control over both the actor and the target

Our model, extended to take two masks as input, enables specifying both the source actor and the target object: the actor is indicated with a red mask and the target with a green mask.

“The [SRC] robotic arm picks up the [TGT] blue can with its robot hand.”


Applications

Video content creation

Given images of a person and a scene, we perform depth-based 3D insertion of the person into the scene and render the composite to produce input frames for the video diffusion model. We interpolate the generated initial and final frames to synthesize navigation content, and use our target-aware video diffusion model to synthesize action and manipulation content.
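As a rough illustration of this pipeline, the sketch below shows only the data flow as we read it. The three callables (insert_person_3d, interpolate_frames, generate_video) and their arguments are placeholders introduced for illustration, not components released with the paper.

    def create_video_content(person_img, scene_img, prompt, target_mask,
                             insert_person_3d, interpolate_frames, generate_video):
        """Hypothetical orchestration of the content-creation pipeline.

        The three callables stand in for the actual components: depth-based
        3D insertion + rendering, frame interpolation, and our target-aware
        video diffusion model. Names and signatures are illustrative only.
        """
        # 1. Render composite frames with the person inserted into the scene,
        #    once at the starting position and once near the target.
        start_frame = insert_person_3d(person_img, scene_img, placement="start")
        goal_frame = insert_person_3d(person_img, scene_img, placement="near_target")

        # 2. Navigation content: interpolate between the initial and final frames.
        navigation_clip = interpolate_frames(start_frame, goal_frame)

        # 3. Action / manipulation content: target-aware generation from the
        #    final frame, conditioned on the target mask and the text prompt.
        action_clip = generate_video(goal_frame, prompt, target_mask)

        # Concatenate the two parts (assuming each clip is a list of frames).
        return navigation_clip + action_clip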

Zero-shot 3D human-object interaction synthesis

We demonstrate our target-aware model's relevance to robotics applications by performing imitation learning on the 3D poses of a person interacting with a target in the scene. We use GVHMR to estimate 3D human motion from our generated videos and PhysHOI to perform physics-based imitation learning of that motion.
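For clarity, here is a hypothetical two-step sketch of the data flow only. The callables are placeholders we introduce; GVHMR and PhysHOI expose their own interfaces, which this sketch does not reproduce.

    def synthesize_3d_hoi(generated_video, estimate_motion, imitate_motion):
        """Hypothetical glue code for the two-stage pipeline above.

        `estimate_motion` stands in for GVHMR's video-to-3D-human-motion
        estimation and `imitate_motion` for PhysHOI's physics-based imitation
        learning; their real interfaces differ.
        """
        human_motion_3d = estimate_motion(generated_video)   # video -> 3D poses
        policy = imitate_motion(human_motion_3d)             # poses -> control policy
        return policy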


Method

1. We first extend a baseline video diffusion model to incorporate the target mask as an additional input.
2. We introduce a [TGT] token that encodes the target’s spatial information in the text prompt.
3. We fine-tune the model using our cross-attention loss, which aligns the cross-attention maps associated with this token with the input target mask (a sketch of this loss follows below).
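Below is a minimal PyTorch sketch of how such a cross-attention alignment loss could be computed. It assumes per-frame square latent grids and a mean-squared-error objective; the actual loss formulation, its normalization, and the selection of transformer blocks and attention regions in the paper may differ.

    import torch
    import torch.nn.functional as F

    def cross_attention_alignment_loss(attn_maps, target_mask, tgt_token_index):
        """Align the [TGT] token's cross-attention with the target mask.

        attn_maps: list of cross-attention tensors from the selected transformer
            blocks, each of shape (B, heads, num_image_tokens, num_text_tokens).
        target_mask: (B, 1, H, W) binary mask of the target object.
        tgt_token_index: position of the [TGT] token in the text sequence.
        """
        loss = 0.0
        for attn in attn_maps:
            B, _, n_img, _ = attn.shape
            # Attention of every image token toward the [TGT] text token,
            # averaged over heads: (B, num_image_tokens).
            tgt_attn = attn[..., tgt_token_index].mean(dim=1)
            # Reshape flattened image tokens to a square spatial grid
            # (a simplification: video latents also carry a temporal axis).
            h = w = int(n_img ** 0.5)
            tgt_attn = tgt_attn.view(B, 1, h, w)
            # Normalize the attention per sample before comparing with the mask.
            tgt_attn = tgt_attn / (tgt_attn.amax(dim=(2, 3), keepdim=True) + 1e-6)
            # Downsample the mask to the attention resolution.
            mask = F.interpolate(target_mask.float(), size=(h, w), mode="area")
            loss = loss + F.mse_loss(tgt_attn, mask)
        return loss / len(attn_maps)

In practice, a term of this kind would be added to the standard diffusion denoising objective with a weighting coefficient during fine-tuning.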


During inference, we prepend the [TGT] token to the words referring to the target, enabling the model to leverage the spatial cue provided by the mask. Our cross-attention loss effectively guides the [TGT] token to focus on the target region, yielding precise interactions between the actor and the target.
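A small sketch of this inference-time prompt preparation; the helper below is illustrative and not part of the released code.

    def add_tgt_token(prompt: str, target_phrase: str, token: str = "[TGT]") -> str:
        """Prepend the [TGT] token to the noun phrase naming the target."""
        assert target_phrase in prompt, "the target phrase must appear in the prompt"
        return prompt.replace(target_phrase, f"{token} {target_phrase}", 1)

    prompt = add_tgt_token(
        "The woman picks up the mug cup and takes a sip of coffee.",
        "mug cup",
    )
    # -> "The woman picks up the [TGT] mug cup and takes a sip of coffee."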


BibTeX

@article{kim2025target,
  title={Target-Aware Video Diffusion Models},
  author={Kim, Taeksoo and Joo, Hanbyul},
  journal={arXiv preprint arXiv:2503.18950},
  year={2025}
}