Target-Aware Video Diffusion Models

Seoul National University, RLWRLD
ICLR 2026

TL;DR. Our target-aware model generates a video in which
an actor accurately interacts with a target specified by its segmentation mask.

Abstract

We present a target-aware video diffusion model that generates videos from an input image, in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask, and the action is described through a text prompt.

Our key motivation is to incorporate target awareness into video generation, enabling actors to perform directed actions on designated objects. This enables video diffusion models to act as motion planners, producing plausible predictions of human-object interactions by leveraging the priors of large-scale video generative models.

We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model with our curated dataset using an additional cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant attention regions and transformer blocks.

Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: zero-shot 3D HOI motion synthesis with physical plausibility and long-term video content creation.

Results

Comparison on target alignment

Our target-aware model generates videos in which the actor interacts with the specified target. Baseline methods often hallucinate the target described in the text prompt instead of interacting with the one present in the scene, leading to unintended outputs.

"The girl turns and picks up the [TGT] teddy bear resting on the bed."

"The man lifts the [TGT] bottle of coke and takes a slow sip."

Multiple objects of the same type

In scenes with multiple objects of the same type, our method enables actors to accurately interact with the intended target by leveraging its mask.

"The woman picks up the [TGT] mug cup and takes a sip of coffee."

Non-human interactions

Although trained on human-scene interaction datasets, our target-aware model generalizes to non-human interactions.

Left: “The rabbit turns its head towards the [TGT] carrot and takes a bite of it.”
Right: “The dog bites onto the [TGT] frisbee and lifts it off the ground.”

Control over both the actor and the target

Our model, extended to take two masks as input, allows specifying both the source actor and the target object. The actor is indicated with a red mask and the target with a green mask.

“The [SRC] robotic arm picks up the [TGT] blue can with its robot hand.”


Applications

Video content creation

Given images of a person and a scene, we perform depth-based 3D insertion of the person into the scene and render them together to produce input frames for the video diffusion model. We interpolate the generated initial and final frames to synthesize navigation content, and use our target-aware video diffusion model to synthesize action and manipulation content.

Zero-shot 3D human-object interaction synthesis

We demonstrate our target-aware model's relevance to robotic applications by performing imitation learning on the 3D poses of a person interacting with a target in the scene. We use GVHMR to estimate 3D human motion from our generated videos and PhysHOI to perform physics-based imitation learning of that motion.


Method

1. We first extend a baseline video diffusion model to incorporate the target mask as an additional input.
2. We introduce a [TGT] token that encodes the target's spatial information within the text prompt.
3. We fine-tune the model using our cross-attention loss, which aligns the cross-attention maps associated with this token with the input target mask (a minimal sketch follows this list).
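
The following PyTorch sketch illustrates one way such an alignment loss can be written. The function name, the MSE-between-normalized-maps formulation, and the assumption of a single square latent grid are ours for illustration only; the actual objective is also applied selectively to the most semantically relevant attention regions and transformer blocks, which is omitted here.

import torch
import torch.nn.functional as F

def tgt_attention_loss(attn_probs: torch.Tensor,
                       tgt_token_idx: int,
                       tgt_mask: torch.Tensor) -> torch.Tensor:
    """Encourage the [TGT] token's cross-attention map to match the target mask.

    attn_probs:    (B, heads, N_visual, N_text) softmaxed text cross-attention
                   probabilities from one transformer block.
    tgt_token_idx: position of the [TGT] token in the text sequence.
    tgt_mask:      (B, 1, H, W) binary target mask in image space.
    """
    n_vis = attn_probs.shape[2]

    # Attention mass each visual token assigns to [TGT], averaged over heads.
    attn_tgt = attn_probs[..., tgt_token_idx].mean(dim=1)           # (B, N_visual)

    # Downsample the mask to the latent grid (assumed square here) and flatten.
    side = int(n_vis ** 0.5)
    mask_small = F.interpolate(tgt_mask.float(), size=(side, side), mode="area")
    mask_flat = mask_small.flatten(1)                               # (B, N_visual)

    # Compare the two as normalized spatial distributions.
    attn_norm = attn_tgt / (attn_tgt.sum(dim=1, keepdim=True) + 1e-6)
    mask_norm = mask_flat / (mask_flat.sum(dim=1, keepdim=True) + 1e-6)
    return F.mse_loss(attn_norm, mask_norm)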

Method overview

During inference, we prepend the [TGT] token to the words referring to the target, allowing the model to leverage the spatial cue provided by the mask, as in the example below. Our cross-attention loss effectively guides the [TGT] token to focus on the target region, enabling precise interactions between the actor and the target.
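
As a concrete illustration, the snippet below rewrites a prompt to carry the [TGT] token in front of the phrase naming the target; the helper insert_tgt_token and its interface are hypothetical, not part of a released API.

def insert_tgt_token(prompt: str, target_phrase: str, token: str = "[TGT]") -> str:
    """Prepend the special token to the phrase naming the target object."""
    assert target_phrase in prompt, "target phrase must appear in the prompt"
    return prompt.replace(target_phrase, f"{token} {target_phrase}", 1)

# Reproduces the prompt shown in the results above.
print(insert_tgt_token(
    "The woman picks up the mug cup and takes a sip of coffee.",
    "mug cup",
))
# The woman picks up the [TGT] mug cup and takes a sip of coffee.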

Cross-attention visualization

BibTeX

@inproceedings{kim2025target,
  title={Target-Aware Video Diffusion Models},
  author={Kim, Taeksoo and Joo, Hanbyul},
  booktitle={ICLR},
  year={2026}
}