Training-Free Video Generation Guidance

The document presents VIDEO-MSG, a training-free guidance method for text-to-video (T2V) generation that enhances video quality by utilizing multimodal planning and structured noise initialization. This approach creates a fine-grained spatio-temporal plan called VIDEO SKETCH, which guides T2V models without the need for fine-tuning or additional memory during inference. VIDEO-MSG demonstrates significant improvements in text alignment and motion control across multiple T2V backbones, making it easier to adopt large models in video generation.

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

Jialu Li* Shoubin Yu* Han Lin* Jaemin Cho Jaehong Yoon Mohit Bansal
UNC Chapel Hill
{jialuli, shoubin, hanlincs, jmincho, jhyoon, mbansal}@[Link]
[Link]
arXiv:2504.08641v1 [[Link]] 11 Apr 2025

* Equal contribution

Abstract

Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research adds layout guidance to T2V models, but it requires fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce VIDEO-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. VIDEO-MSG consists of three steps, where in the first two steps, VIDEO-MSG creates VIDEO SKETCH, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, VIDEO-MSG guides a downstream T2V diffusion model with VIDEO SKETCH through noise inversion and denoising. Notably, VIDEO-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. VIDEO-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2V-CompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.

1. Introduction

Recent advances in text-to-video (T2V) diffusion models [1, 2, 12, 17, 32, 34, 37, 38, 45, 46, 57] have dramatically improved the quality of generated videos in diverse domains. However, even recent T2V generation models still often struggle to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. As illustrated in Fig. 1 (b), recent work has studied improving text alignment by providing detailed layout guidance as an additional input to T2V models, such as bounding boxes [21, 24, 26], optical flow [23], and object trajectories [53, 62], which are often created from a large language model (LLM). However, since the original T2V models do not understand the layout guidance, these approaches fine-tune the T2V models with layout annotations [24, 26] or iteratively manipulate the attention map of T2V models during inference time [21]. While effective, these techniques substantially increase memory consumption at inference time or require retraining for different T2V backbones, limiting their scalability to large T2V models.

To address this, we introduce VIDEO-MSG, Multimodal Sketch Guidance for video generation, a training-free guidance method for T2V generation based on multimodal planning and structured noise initialization, as illustrated in Fig. 1 (c). As illustrated in Fig. 2, VIDEO-MSG consists of three steps: (1) background planning (Sec. 3.1), (2) foreground object layout and trajectory planning (Sec. 3.2), and (3) video generation with structured noise inversion (Sec. 3.3). From the first two steps, VIDEO-MSG creates VIDEO SKETCH, a fine-grained spatial and temporal plan with a set of multimodal models, including multimodal LLM (MLLM), object detection, and instance segmentation models. Then in the last step, VIDEO-MSG guides a downstream T2V diffusion model with VIDEO SKETCH through structured noise inversion and denoising. Notably, VIDEO-MSG does not need fine-tuning or additional memory during inference time, making it easier to adopt large T2V models, compared to existing methods based on fine-tuning or iterative attention manipulation.

Figure 1. Comparison of different text-to-video generation methods: (a) single model for video generation, (b) video generation with (attention-based) layout guidance, and our (c) VIDEO-MSG, a training-free guidance method for T2V generation based on multimodal planning and structured noise initialization. Since VIDEO-MSG does not need fine-tuning or additional memory during inference time, it is easier to adopt large T2V models than previous video layout guidance methods based on fine-tuning or iterative attention manipulation.

VIDEO-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 [4] and CogVideoX-5B [57]) on popular T2V generation benchmarks (T2V-CompBench [43] and VBench [11]). For example, VIDEO-MSG improves motion binding with a relative gain of 52.46%, numeracy with a relative gain of 40.11%, and spatial relationship with a relative gain of 11.15% with CogVideoX-5B as a T2V generation backbone. We provide comprehensive quantitative and qualitative ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation. We hope our method can inspire future work on effectively and efficiently integrating LLMs’ planning ability into video generation.

2. Related Work

2.1. MLLM Planning for Video Generation

There are recent research works [8, 22, 24, 51, 64] that leverage the reasoning capabilities and world knowledge of LLMs or multimodal LLMs for the task of video generation. For example, one line of work [22, 24, 51] applies GPT-4 / GPT-4o to expand a single text prompt into a ‘video plan’ in the format of bounding boxes or detailed prompt description [55], which is then given as input to a downstream video diffusion model for layout-guided video generation. The other line of work [16, 28, 29, 48, 50, 54] performs token-level planning utilizing multimodal LLMs. For example, [16, 50] tokenize videos and text into the same space and generate video tokens using the same strategy as text (e.g., next-token prediction). However, both directions either rely on high-quality prompts and bounding box planning or require extensive training, and do not fully leverage the power of existing visual tools for fine-grained video generation. In contrast, our work leverages the power of both multimodal LLMs and image/video diffusion models to generate a VIDEO SKETCH for final fine-grained motion control, and is fully training-free.

2.2. Motion Direction Control in Video Generation

Controllability in video generation is gaining increasing attention in the field of generative AI, as it enables models to generate videos aligned with user intent. One line of research focuses on training models with the capability of trajectory control, camera control, or motion control by generating intermediate representations. For trajectory control, recent works such as DragNUMA [58], IVA-0 [59], DragAnything [53], and TrackGo [63] encode object movement trajectories into dense features, which are then fused into the diffusion model to enable object movement control. On the other hand, CameraCtrl [7], MotionCtrl [52], and Image Conductor [20] encode camera extrinsics as features to control camera motion in the generated videos. A common drawback of both of these categories is their reliance on accurate object trajectory or camera movement information, which is difficult for users to manipulate directly. Additionally, video datasets with accurate trajectory annotations are limited, which constrains the performance of these models. The third category, including VideoJAM [3] and MotionI2V [40], produces motion and video representations jointly, or sequentially by first generating intermediate representations, which then serve as guidance for generating video outputs. However, such methods require extensive
training due to extra generation objectives. In contrast, our method uses an image-to-video model, allowing us to transform existing, real-world images into controllable videos under LLM planning.

Figure 2. Three stages of VIDEO-MSG. In the first stage, the MLLM plans specific global and local contexts that fit the provided text-to-video prompt. The text-to-image (T2I) model uses the MLLM planned context to render the necessary components of the video. In the third stage, we generate video with VIDEO SKETCH via noise inversion.

3. Method

We introduce VIDEO-MSG, Multimodal Sketch Guidance for video generation, a training-free guidance method for T2V generation based on multimodal planning and structured noise initialization. VIDEO-MSG consists of three stages (illustrated in Fig. 2):
• Background planning (Sec. 3.1), where we adopt T2I and I2V models to generate background image priors with natural animation.
• Foreground Object Layout and Trajectory Planning (Sec. 3.2), where we apply MLLM and object detectors to plan and place foreground objects into the background harmoniously.
• Video Generation with Structured Noise Initialization (Sec. 3.3), where the synthesized images derived from the above stages are used as VIDEO SKETCH for final video generation via inversion techniques.

3.1. Background Planning

Given a prompt for video generation, we first ask an MLLM (GPT-4o [33]) to generate a detailed background description (see Stage 1 in Fig. 2). Here, we explicitly instruct the MLLM to generate only the background and avoid including any moving or key objects mentioned in the original prompt, thereby enforcing proper decoupling. We find that this strategy helps address issues in conditional T2I generation based on bounding boxes, where the T2I model may fail to generate the foreground object at the specified box location in the image. In addition, we explore two approaches for background generation:
(1) Using a T2I model to generate an initial background, followed by an I2V model to animate it. In this way, we can adopt a strong T2I model to potentially achieve improved video aesthetic quality.
(2) Directly using a T2V model to generate the background with animation, which avoids the potential distribution gap between the two models in (1).
In both cases, we adopt a video generation model. We aim to introduce natural background animation rather than keeping the background static while only animating foreground objects. This ensures that elements such as flowing water, moving clouds, or swaying trees are naturally incorporated, making the generated videos more realistic and visually coherent. Moreover, by comparing approach (1) with approach (2), we notice that the advantage of adopting a strong T2I model in (1) outweighs the distribution gap between the T2I and I2V models that (2) avoids, as discussed in Sec. 4.3. Therefore, we apply approach (1) as our default experiment setting.

3.2. Foreground Object Layout and Trajectory Planning

This stage aims to place the foreground object in the background in a spatially coherent manner. We first implement this stage by providing the background images generated in stage 1, along with a prompt describing movement dynamics, to GPT-4o [33], then ask it to generate a sequence of bounding boxes to represent the foreground
object’s movement. For instance, given the text prompt: “A cat sinking to the left in the living room”, GPT-4o can correctly infer the cat’s movement direction (i.e., moving left). However, when provided with a background image of a living room, GPT-4o often fails to position the cat’s bounding box appropriately on the floor (e.g., with the bounding box floating in mid-air or overlapping with unrelated objects), as illustrated in Figure 4. This suggests that while GPT-4o demonstrates strong motion reasoning capabilities, it lacks direct grounding capability for visual elements and struggles to align foreground objects with the background scene in a spatially consistent manner.

To overcome this limitation, we first detect all objects in the background image with Recognize-Anything (RAM) [61], then extract their bounding boxes with Grounding-DINO [25]. These bounding boxes are fed into GPT-4o to provide explicit spatial context, which helps it accurately position and animate foreground objects, enhancing spatial coherence in generated videos and reducing placement errors. Qualitative examples of the effectiveness of object detection with Grounding-DINO and RAM are presented in Figure 4. With the above inputs (i.e., video text prompt, background image, and the bounding boxes of objects in the background), GPT-4o generates a sequence of bounding boxes for the foreground objects in the format [object name, bounding box coordinates] (see stage 2 in Fig. 2). Additionally, it provides a textual description for each frame and a reasoning process explaining the planned object motions after the sequence of frames. This reasoning step enhances the coherence and accuracy of motion planning.

Once the sequence of bounding boxes is obtained, we utilize a T2I model to generate the appearance of the foreground object using the prompt: “An image of the {object name}.” However, directly merging the generated object image with the background presents a challenge: the background in the generated object image can significantly affect the overall visual coherence, as illustrated in Fig. 5. To address this, we apply SAM [14] to extract the object from the generated image, removing any unintended background. Based on the planned bounding boxes, the extracted object is then resized and placed onto the background image at the corresponding location. This process ensures a more seamless integration of the foreground object into the background, improving the visual consistency of the generated video.

3.3. Video Generation with Structured Noise Initialization

In this stage, we generate a final video by guiding the T2V diffusion model with the VIDEO SKETCH created from the previous stage (Sec. 3.2). Inversion methods [41], which are often used in image and video editing tasks [31, 39], can be effectively utilized here to create structured noise to fuse the information from VIDEO SKETCH. While the normal denoising process runs from the terminal timestep $t = T$ (a random noise) to the initial timestep $t = 0$ (a clean video), we create per-frame initial noises from VIDEO SKETCH via noise inversion [31] and start denoising from a timestep $t_{\mathrm{inv}}$. Specifically, we first encode the sequence of VIDEO SKETCH frames into the latent space $z$ using a 3D VAE [13, 57]. Next, we obtain the initial noise $z_{t_{\mathrm{inv}}}$ via the forward diffusion process [9]:

$$z_{t_{\mathrm{inv}}} = \sqrt{\bar{\alpha}_{t_{\mathrm{inv}}}}\, z_0 + \sqrt{1 - \bar{\alpha}_{t_{\mathrm{inv}}}}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$ is the cumulative noise schedule, and $\epsilon$ represents Gaussian noise. We parameterize $t_{\mathrm{inv}} = \alpha \times T$, where $\alpha \in (0.0, 1.0)$. Inspired by VideoDirectorGPT [24], which uses an LLM to estimate a confidence score along with bounding box layouts as layout guidance strength, we employ an LLM to infer an appropriate noise inversion ratio α given a text description (see Sec. 4.3 for detailed experiments). We explain more details about the noise inversion in the Appendix.

4. Experiments

4.1. Experiment Setups

Datasets. We evaluate VIDEO-MSG on popular text-to-video generation benchmarks, T2V-CompBench [43] and VBench [11]. T2V-CompBench and VBench measure diverse aspects of text-to-video generation tasks with seven (e.g., consistent attribute binding, motion binding, spatial relationships) and sixteen categories (e.g., overall consistency, color, temporal flickering, motion smoothness), respectively. In this work, we primarily use T2V-CompBench to evaluate video diffusion models’ capability in compositional text-to-video generation, and use VBench to measure the motion smoothness of the generated video.

Implementation details. We implement VIDEO-MSG on two recent text-to-video generation diffusion models: VideoCrafter2 [4] and CogVideoX-5B [57]. To generate the VIDEO SKETCH, we employ FLUX.1-dev [19] and SDXL [36] as the background generator, and CogVideoX-5B as the image-to-video generator. We utilize Recognize-Anything [61] and Grounded-Segment-Anything [14] for foreground object segmentation. We utilize GPT-4o as the multimodal LLM for background description generation, foreground object layout and trajectory planning, and determining the noise inversion ratio α dynamically based on the prompt. For noise inversion ratio α (Sec. 3.3), we find the range [0.7, 0.9] works well for CogVideoX-5B, and the range [0.5, 0.8] works well for VideoCrafter2 (see Sec. 4.3 for ablation study). All experiments are conducted on A100 and A6000 GPUs, with batch size 1 and an approximate memory usage of 16 GB. We provide additional details, such as prompts used for GPT-4o, in the Appendix.
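As a rough illustration of the compositing step described in Sec. 3.2 (resizing a segmented foreground object and placing it onto a background frame at a planned bounding box), the following is a minimal NumPy sketch rather than the authors' implementation. The function names `paste_object` and `build_video_sketch` are hypothetical, and the sketch assumes float images in [0, 1], a mask already produced by a segmenter, boxes given as normalized (x0, y0, x1, y1), and a simple nearest-neighbour resize.

```python
import numpy as np

def paste_object(background, obj_rgb, obj_mask, box):
    """Paste a segmented object onto one background frame (hypothetical helper).

    background: (H, W, 3) float array in [0, 1]
    obj_rgb:    (h, w, 3) float array in [0, 1]
    obj_mask:   (h, w) array, 1 where the object is, 0 elsewhere
    box:        (x0, y0, x1, y1), assumed normalized to [0, 1]
    """
    H, W, _ = background.shape
    x0, y0, x1, y1 = box
    left, top = int(round(x0 * W)), int(round(y0 * H))
    box_w = max(int(round(x1 * W)) - left, 1)
    box_h = max(int(round(y1 * H)) - top, 1)

    # Nearest-neighbour resize of the object and its mask to the box size.
    rows = np.arange(box_h) * obj_rgb.shape[0] // box_h
    cols = np.arange(box_w) * obj_rgb.shape[1] // box_w
    rgb = obj_rgb[rows][:, cols]
    mask = obj_mask[rows][:, cols].astype(background.dtype)[..., None]

    frame = background.copy()
    region = frame[top:top + box_h, left:left + box_w]  # view into frame
    rh, rw = region.shape[:2]  # smaller than the box if it touches the border
    # Keep object pixels where the mask is set, background pixels elsewhere.
    region[...] = mask[:rh, :rw] * rgb[:rh, :rw] + (1.0 - mask[:rh, :rw]) * region
    return frame

def build_video_sketch(background_frames, obj_rgb, obj_mask, boxes):
    """Compose one draft frame per (background frame, planned box) pair."""
    return [paste_object(bg, obj_rgb, obj_mask, box)
            for bg, box in zip(background_frames, boxes)]

if __name__ == "__main__":
    backgrounds = [np.zeros((64, 64, 3)) for _ in range(3)]
    obj = np.ones((16, 16, 3))
    mask = np.ones((16, 16))
    # An object moving to the left across three frames.
    plan = [(0.6, 0.6, 0.9, 0.9), (0.4, 0.6, 0.7, 0.9), (0.2, 0.6, 0.5, 0.9)]
    sketch = build_video_sketch(backgrounds, obj, mask, plan)
    print(len(sketch), sketch[0].shape)
```

Using the mask as a blend weight is what keeps the unintended background of the generated object image out of the draft frame; a soft (non-binary) mask would work with the same code.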

| Model | Consist-attr | Dynamic-attr | Spatial | Motion | Action | Interaction | Numeracy |
|---|---|---|---|---|---|---|---|
| (Closed-source models) | | | | | | | |
| Pika [44] | 0.6513 | 0.1744 | 0.5043 | 0.2221 | 0.5380 | 0.6625 | 0.2613 |
| Gen-3 [38] | 0.7045 | 0.2078 | 0.5533 | 0.3111 | 0.6280 | 0.7900 | 0.2169 |
| Dreamina [5] | 0.8220 | 0.2114 | 0.6083 | 0.2391 | 0.6660 | 0.8175 | 0.4006 |
| PixVerse [35] | 0.7370 | 0.1738 | 0.5874 | 0.2178 | 0.6960 | 0.8275 | 0.3281 |
| Kling [15] | 0.8045 | 0.2256 | 0.6150 | 0.2448 | 0.6460 | 0.8475 | 0.3044 |
| (Open-source models) | | | | | | | |
| ModelScope [49] | 0.5483 | 0.1654 | 0.4220 | 0.2552 | 0.4880 | 0.7075 | 0.2066 |
| ZeroScope [42] | 0.4495 | 0.1086 | 0.4073 | 0.2319 | 0.4620 | 0.5550 | 0.2378 |
| AnimateDiff [6] | 0.4883 | 0.1764 | 0.3883 | 0.2236 | 0.4140 | 0.6550 | 0.0884 |
| Latte [30] | 0.5325 | 0.1598 | 0.4476 | 0.2187 | 0.5200 | 0.6625 | 0.2187 |
| Show-1 [60] | 0.6388 | 0.1828 | 0.4649 | 0.2316 | 0.4940 | 0.7700 | 0.1644 |
| Open-Sora 1.2 [10] | 0.6600 | 0.1714 | 0.5406 | 0.2388 | 0.5717 | 0.7400 | 0.2556 |
| Open-Sora-Plan v1.1.0 [18] | 0.7413 | 0.1770 | 0.5587 | 0.2187 | 0.6780 | 0.7275 | 0.2928 |
| VideoTetris [47] | 0.7125 | 0.2066 | 0.5148 | 0.2204 | 0.5280 | 0.7600 | 0.2609 |
| Vico [56] | 0.7025 | 0.2376 | 0.4952 | 0.2225 | 0.5480 | 0.7775 | 0.2116 |
| VideoCrafter2 [4] | 0.6750 | 0.1850 | 0.4891 | 0.2233 | 0.5800 | 0.7600 | 0.2041 |
| VideoCrafter2 + LVD [21] | 0.6663 (-0.0087) | 0.2308 (+0.0458) | 0.5106 (+0.0215) | 0.2178 (-0.0055) | 0.5640 (-0.0160) | 0.8125 (+0.0525) | 0.2869 (+0.0828) |
| VideoCrafter2 + VIDEO-MSG (Ours) | 0.7536 (+0.0786) | 0.2110 (+0.0260) | 0.5866 (+0.0975) | 0.3732 (+0.1499) | 0.5737 (-0.0063) | 0.8220 (+0.0620) | 0.3138 (+0.1097) |
| CogVideoX-5B [57] | 0.7220 | 0.2334 | 0.5461 | 0.2943 | 0.5960 | 0.7950 | 0.2603 |
| CogVideoX-5B + VIDEO-MSG (Ours) | 0.7109 (-0.0111) | 0.2102 (-0.0232) | 0.6070 (+0.0609) | 0.4487 (+0.1544) | 0.5960 (+0.0000) | 0.7800 (-0.0150) | 0.3647 (+0.1044) |

Table 1. T2V-CompBench evaluation results. Values in parentheses are differences from the corresponding backbone. We highlight the best/second-best scores for open-sourced models with bold/underline.
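The relative gains quoted in the introduction for the CogVideoX-5B backbone (52.46%, 40.11%, and 11.15%) can be reproduced from the Table 1 scores. The short check below is only a worked example of that arithmetic; the helper name `relative_gain` is ours, and the numbers are copied from the CogVideoX-5B rows of Table 1.

```python
# Scores copied from Table 1 (CogVideoX-5B rows).
baseline = {"Motion": 0.2943, "Numeracy": 0.2603, "Spatial": 0.5461}
guided = {"Motion": 0.4487, "Numeracy": 0.3647, "Spatial": 0.6070}

def relative_gain(base, new):
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

for name in baseline:
    print(f"{name}: {relative_gain(baseline[name], guided[name]):.2f}%")
# Motion: 52.46%
# Numeracy: 40.11%
# Spatial: 11.15%
```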

4.2. Quantitative Evaluation

Improved control on spatial layout and object trajectory. Table 1 shows that VIDEO-MSG significantly improves both T2V backbone models (VideoCrafter2 and CogVideoX-5B) in many skills, especially in motion binding (‘Motion’), with an increase of 0.1499 on VideoCrafter2 and 0.1544 on CogVideoX-5B. VIDEO-MSG also provides large improvements in spatial relationships (‘Spatial’) and numeracy (‘Numeracy’) in both backbone models. These results show that the planning and structured noise initialization of VIDEO-MSG effectively improve the control of spatial layouts and object trajectories in video generation. It is also noteworthy that VIDEO-MSG, implemented with open-source T2V backbone models, achieves higher motion binding scores than closed-source models such as Gen-3 [38]. VIDEO-MSG did not consistently improve the scores in the dynamic attribute binding (‘Dynamic-attr’) and object action and interaction (‘Action’ and ‘Interaction’) categories. This is likely because dynamic changes in object or environment states, and interactions and actions between objects, are difficult to guide solely with bounding boxes.

Comparison to planning-based baseline. We also compare VIDEO-MSG with LVD [21], a recent T2V layout guidance method, which adds a gradient-based energy function optimization step before each denoising step of the T2V diffusion backbone. The energy function adjusts the cross-attention map of diffusion models to concentrate within a set of object bounding boxes generated by an LLM. On the VideoCrafter2 backbone, we find that VIDEO-MSG outperforms LVD in all categories except for dynamic attribute binding, with the largest improvement observed in motion binding (‘Motion’), where VIDEO-MSG surpasses LVD by 0.1554. This demonstrates the effectiveness of our approach. Note that VIDEO-MSG is also more memory-efficient than LVD, as the layout guidance in LVD requires backpropagation through the T2V diffusion backbone, making it hard to adapt to large diffusion models; we could implement VIDEO-MSG with the CogVideoX-5B backbone to run on an A6000 GPU (48GB), but we could not fit LVD even on an A100 (80GB).

4.3. Ablation Studies

Noise inversion ratio α. As described in Sec. 3.3, we guide the T2V generation backbone by denoising from an intermediate timestep $t_{\mathrm{inv}} = \alpha \times T$ to the initial timestep $t = 0$. Here, we experiment with different noise inversion ratios α (i.e., varying the noise injected into the VIDEO SKETCH). Table 2 shows that lower α achieves better performance in motion binding (e.g., moving left/right), numeracy, and spatial relationships but hurts the smoothness of motions. This aligns with the intuition that increasing the number of refinement steps applied to VIDEO SKETCH (i.e., a higher α) enhances the final motion quality. We observe that automatically inferring a proper α from the text description with an LLM achieves a good trade-off, and we use this approach by default.
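To make the role of the noise inversion ratio concrete, the sketch below applies the forward-diffusion formula from Sec. 3.3 to a stand-in latent and reports how much of the sketch signal survives at different α. It is an illustration under stated assumptions, not the authors' code: the linear β schedule with 1000 steps is an assumed DDPM-style default (the actual backbones may use other schedules), and `cumulative_alpha` and `invert_sketch` are hypothetical names.

```python
import numpy as np

def cumulative_alpha(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """alpha_bar[t-1] = prod_{s=1..t} (1 - beta_s) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def invert_sketch(z0, alpha_ratio, alpha_bar, rng):
    """Noise a sketch latent z0 up to t_inv = alpha_ratio * T.

    Returns the noised latent and t_inv; denoising would then start at t_inv
    instead of T.
    """
    T = len(alpha_bar)
    t_inv = min(max(int(round(alpha_ratio * T)), 1), T)
    a_bar = alpha_bar[t_inv - 1]
    eps = rng.standard_normal(z0.shape)  # epsilon ~ N(0, I)
    z_t = np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps
    return z_t, t_inv

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = cumulative_alpha()
    z0 = rng.standard_normal((4, 13, 30, 45))  # stand-in for encoded sketch frames
    for ratio in (0.5, 0.7, 0.9):
        _, t_inv = invert_sketch(z0, ratio, alpha_bar, rng)
        weight = np.sqrt(alpha_bar[t_inv - 1])
        print(f"alpha={ratio}: t_inv={t_inv}, signal weight={weight:.4f}")
```

Under this assumed schedule the weight on the sketch latent shrinks as α grows, which matches the trade-off described above: a lower α keeps more of the planned layout, while a higher α leaves more denoising steps for the backbone to refine motion.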

Figure 3. Videos generated with CogVideoX-5B and VIDEO-MSG with CogVideoX-5B backbone. The videos generated with VIDEO-MSG are more accurate regarding object motions, numeracy, and spatial relationships.

| No. | Noise inversion ratio α | Motion Binding (T2V-CompBench) | Numeracy (T2V-CompBench) | Spatial (T2V-CompBench) | Motion Smoothness (VBench) |
|---|---|---|---|---|---|
| 1. | Direct T2V (no inversion) | 0.2233 | 0.2041 | 0.4891 | 97.73 |
| 2. | 0.8 | 0.2793 | 0.2081 | 0.5502 | 98.69 |
| 3. | 0.7 | 0.3197 | 0.2653 | 0.5678 | 98.62 |
| 4. | 0.6 | 0.3352 | 0.3059 | 0.6057 | 98.63 |
| 5. | 0.5 | 0.3980 | 0.3138 | 0.6447 | 98.58 |
| 6. | LLM-controlled | 0.3732 | 0.3138 | 0.5866 | 99.01 |

Table 2. Comparison of different noise inversion ratios α, where we compare static values and LLM-based dynamic values. Backbone T2V: VideoCrafter2. Background generator: Flux + CogVideoX-5B.

| No. | Background Generator | Motion | Numeracy |
|---|---|---|---|
| 1. | Direct T2V (no background) | 0.2897 | 0.2750 |
| 2. | SDXL (T2I) + CogVideoX-5B (I2V) | 0.4487 | 0.3559 |
| 3. | FLUX (T2I) + CogVideoX-5B (I2V) | 0.4549 | 0.3647 |
| 4. | CogVideoX-5B (T2V) | 0.4565 | 0.3028 |

Table 3. Ablation studies on different background generators.

Different background generator. In Table 3, we compare different background generation methods (Sec. 3.1): (1) generating a background with a text-to-image (T2I) model, followed by an image-to-video (I2V) model for animation, and (2) directly using a text-to-video (T2V) model to generate an animated background. While both approaches improve motion binding and numeracy compared to using a single T2V model for video generation, the T2I + I2V pipeline scores higher in numeracy, and the T2V approach scores higher in motion binding. We attribute this to the video generation model’s ability to better refine object motion when the background follows a static camera, making foreground changes more salient for the video diffusion model. The I2V pipeline better adheres to the “Static Camera” prompt, producing natural background animations (e.g., wind, light changes). In contrast, T2V models often disregard the “Static Camera” requirement, introducing excessive camera motion and scene changes in the video. These inconsistencies make it harder for the video diffusion model to refine foreground objects, leading to performance degradation (e.g., 0.3028 with CogVideoX-5B vs. 0.3647 with FLUX on numeracy). Additionally, we find that a stronger T2I model (e.g., FLUX [19]) yields better results than a weaker one (e.g., SDXL [36]), highlighting the potential of leveraging high-quality T2I models for layout-controlled text-to-video generation.

4.4. Qualitative Analysis

VIDEO SKETCH improves control of spatial layout and object trajectory. Fig. 3 compares videos generated from CogVideoX-5B and VIDEO-MSG (with CogVideoX-5B backbone). We observe that CogVideoX-5B struggles with motion direction (e.g., an egg moves to the right instead of to the left, a helicopter ascends instead of descending to the land), numeracy (e.g., it generates four bears instead of three bears, four penguins instead of six penguins), and spatial relationships (e.g., a vending machine should be located to the right of a gorilla, but it is missing; the umbrella should be located on the left of the children). In contrast, VIDEO-MSG successfully guides the T2V backbone to generate videos with correct semantics in all cases. Note that the T2V model can understand the coarse guidance in VIDEO SKETCH and place objects that harmonize well with the background through noise inversion. For example, in the middle example (‘three bears in a river surrounded by mountains’), even when the VIDEO SKETCH shows the three bears only with their heads facing forward, the T2V model could place the three bears in the river naturally.

Effect of different noise inversion ratios α. Fig. 3 shows the video generation results from VIDEO SKETCH (with CogVideoX-5B backbone), with different noise inversion ratios α. Interestingly, the model can automatically refine objects to better align with the prompt and surrounding environment based on different α. We find that lower α (i.e., less noise) generally provides stronger layout control. For example, in the top-left example, the egg in the videos with α = 0.7 and α = 0.8 closely follows the trajectory in VIDEO SKETCH, while in the video with α = 0.9, the egg movement is small and does not follow the trajectory. However, a lower α can lead to less natural generations; e.g., in the bottom-middle example, the boy’s motion appears less natural at α = 0.7 compared to α = 0.9. This highlights the importance of selecting an appropriate α to balance motion smoothness with faithful adherence to VIDEO SKETCH.

Background object detection helps foreground object placement. We find that a deep understanding of the background images through object detection is crucial in foreground planning (Sec. 3.2). As shown in Fig. 4, without access to background bounding box information, the MLLM fails to place the golden retriever on the grass when relying solely on the background image input. In contrast, when provided with bounding box information from the background (e.g., {"label": "path", "box": [0.44, 0.57, 0.99, 0.99]}), the MLLM successfully positions the golden retriever at the correct location on the grass. Moreover, we find that this planning step directly impacts the final video quality. Conditioning the generation on inaccurate bounding box plans can conflict with the video diffusion model’s prior knowledge. For instance, in the first frame, the model may generate two golden retrievers (one on the ground based on its prior knowledge and another floating in the air according to the VIDEO SKETCH), resulting in unrealistic outputs, such as a golden retriever running mid-air across the garden. In contrast, our approach, which conditions planning on background bounding boxes, enables the generation of more natural and commonsense-aligned videos.

Figure 4. Example video showing the importance of background object detection in foreground object placement.

Segmentation of foreground objects improves harmonization. As demonstrated in Fig. 5, without object segmentation, the foreground object (a balloon) does not align well with the background, and the quantity is not well controlled (multiple balloons). This occurs because the video diffusion model does not inherently distinguish between the appearance of the background and that of the balloon, causing them to blend together. In contrast, when we first segment the balloon from the generated foreground object image and then place it onto the background to create VIDEO SKETCH, the balloon in the generated video harmonizes well with the background.

Figure 5. Example video showing the importance of foreground object segmentation.

5. Conclusion

In this work, we introduce VIDEO-MSG, a training-free guidance method designed to enhance text-to-video (T2V) generation through multimodal planning and structured noise initialization. VIDEO-MSG consists of three steps, where in the first two steps, VIDEO-MSG creates VIDEO SKETCH, a detailed spatial and temporal plan utilizing a set of multimodal models, including multimodal LLM, object detection, and instance segmentation models. In the final step, VIDEO-MSG guides a downstream T2V diffusion model with VIDEO SKETCH through noise inversion and denoising. Notably, VIDEO-MSG does not require fine-tuning or additional memory during inference, making it easier to adopt large T2V models than existing methods that rely on fine-tuning or iterative attention manipulation. VIDEO-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones on popular T2V generation benchmarks. We also provide comprehensive ablation studies and qualitative examples that support the design choices of VIDEO-MSG. We hope our method can inspire future work on effectively and efficiently integrating LLMs’ planning capabilities into video generation.

Acknowledgments

This work was supported by ARO W911NF2110220, DARPA MCS N66001-19-2-4031, DARPA KAIROS Grant FA8750-19-2-1004, DARPA ECOLE Program No. HR00112390060, NSF-AI Engage Institute DRL211263, ONR N00014-23-1-2356, Microsoft Accelerate Foundation Models Research (AFMR) grant program, and a Bloomberg Data Science PhD Fellowship. The views, opinions, and/or findings contained in this article are those of the authors and not of the funding agency.

References

[1] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025. 1
[2] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024. 1
[3] Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. VideoJAM: Joint appearance-motion representations for enhanced motion generation in video models. arXiv preprint arXiv:2502.02492, 2025. 2
[4] Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. VideoCrafter2: Overcoming data limitations for high-quality video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7310–7320, 2024. 2, 4, 5
[5] Dreamina AI. Dreamina. https://dreamina.[Link]/ai-tool/platform, 2024. 5
[6] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023. 5
[7] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: Enabling camera control for text-to-video generation. CoRR, 2024. 2
[8] Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al. Animate-A-Story: Storytelling with retrieval-augmented video generation. arXiv preprint arXiv:2307.06940, 2023. 2
[9] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In NeurIPS, pages 1–25, 2020. 4
[10] hpcaitech. Open-Sora: Democratizing efficient video production for all, 2024. 5
[11] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 2, 4
[12] Hedra Inc. Hedra. [Link] 2025. 1
[13] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014. 4
[14] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023. 4
[15] Kling AI. Kling. [Link] 2024. 5
[16] Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Joshua V. Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A Ross, Bryan Seybold, and Lu Jiang. VideoPoet: A large language model for zero-shot video generation. In Forty-first International Conference on Machine Learning, 2024. 2
[17] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024. 1
[18] PKU-Yuan Lab and Tuzhan AI etc. Open-Sora-Plan, 2024. 5
[19] Black Forest Labs. Flux. [Link]black-forest-labs/flux, 2024. 4, 7
[20] Yaowei Li, Xintao Wang, Zhaoyang Zhang, Zhouxia Wang, Ziyang Yuan, Liangbin Xie, Yuexian Zou, and Ying Shan. Image Conductor: Precision control for interactive video synthesis. CoRR, 2024. 2
[21] Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, and Boyi Li. LLM-grounded video diffusion models. arXiv preprint arXiv:2309.17444, 2023. 1, 5
[22] Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, and Boyi Li. LLM-grounded video diffusion models. In The Twelfth International Conference on Learning Representations, 2024. 2
[23] Xinyao Liao, Xianfang Zeng, Liao Wang, Gang Yu, Guosheng Lin, and Chi Zhang. MotionAgent: Fine-grained controllable video generation via motion field agent. arXiv preprint arXiv:2502.03207, 2025. 1
[24] Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. VideoDirectorGPT: Consistent multi-scene video generation via LLM-guided planning. In First Conference on Language Modeling, 2024. 1, 2, 4
[25] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. 4

[26] Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei. VideoDrafter: Content-consistent multi-scene video generation with LLM. CoRR, 2024. 1
[27] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022. 13
[28] Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-IO: A unified model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations, 2023. 2
[29] Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-IO 2: Scaling autoregressive multimodal models with vision, language, audio, and action. In CVPR, 2024. 2
[30] Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024. 5
[31] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022. 4
[32] OpenAI. Sora. [Link] 2024. 1
[33] OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alexis Conneau, Ali Kamali, Allan Jabri, Allison Moyer, Allison Tam, Amadou Crookes, Amin Tootoochian, Amin Tootoonchian, Ananya Kumar, Andrea Vallone, Andrej Karpathy, Andrew Braunstein, Andrew Cann, Andrew Codispoti, Andrew Galu, Andrew Kondrich, Andrew Tulloch, Andrey Mishchenko, Angela Baek, Angela Jiang, Antoine Pelisse, Antonia Woodford, Anuj Gosalia, Arka Dhar, Ashley Pantuliano, Avi Nayak, Avital Oliver, Barret Zoph, Behrooz Ghorbani, Ben Leimberger, Ben Rossen, Ben Sokolowsky, Ben Wang, Benjamin Zweig, Beth Hoover, Blake Samic, Bob McGrew, Bobby Spero, Bogo Giertler, Bowen Cheng, Brad Lightcap, Brandon Walkin, Brendan Quinn, Brian Guarraci, Brian Hsu, Bright Kellogg, Brydon Eastman, Camillo Lugaresi, Carroll Wainwright, Cary Bassin, Cary Hudson, Casey Chu, Chad Nelson, Chak Li, Chan Jun Shern, Channing Conger, Charlotte Barette, Chelsea Voss, Chen Ding, Cheng Lu, Chong Zhang, Chris Beaumont, Chris Hallacy, Chris Koch, Christian Gibson, Christina Kim, Christine Choi, Christine McLeavey, Christopher Hesse, Claudia Fischer, Clemens Winter, Coley Czarnecki, Colin Jarvis, Colin Wei, Constantin Koumouzelis, Dane Sherburn, Daniel Kappler, Daniel Levin, Daniel Levy, David Carr, David Farhi, David Mely, David Robinson, David Sasaki, Denny Jin, Dev Valladares, Dimitris Tsipras, Doug Li, Duc Phong Nguyen, Duncan Findlay, Edede Oiwoh, Edmund Wong, Ehsan Asdar, Elizabeth Proehl, Elizabeth Yang, Eric Antonow, Eric Kramer, Eric Peterson, Eric Sigler, Eric Wallace, Eugene Brevdo, Evan Mays, Farzad Khorasani, Felipe Petroski Such, Filippo Raso, Francis Zhang, Fred von Lohmann, Freddie Sulit, Gabriel Goh, Gene Oden, Geoff Salmon, Giulio Starace, Greg Brockman, Hadi Salman, Haiming Bao, Haitang Hu, Hannah Wong, Haoyu Wang, Heather Schmidt, Heather Whitney, Heewoo Jun, Hendrik Kirchner, Henrique Ponde de Oliveira Pinto, Hongyu Ren, Huiwen Chang, Hyung Won Chung, Ian Kivlichan, Ian O’Connell, Ian O’Connell, Ian Osband, Ian Silber, Ian Sohl, Ibrahim Okuyucu, Ikai Lan, Ilya Kostrikov, Ilya Sutskever, Ingmar Kanitscheider, Ishaan Gulrajani, Jacob Coxon, Jacob Menick, Jakub Pachocki, James Aung, James Betker, James Crooks, James Lennon, Jamie Kiros, Jan Leike, Jane Park, Jason Kwon, Jason Phang, Jason Teplitz, Jason Wei, Jason Wolfe, Jay Chen, Jeff Harris, Jenia Varavva, Jessica Gan Lee, Jessica Shieh, Ji Lin, Jiahui Yu, Jiayi Weng, Jie Tang, Jieqi Yu, Joanne Jang, Joaquin Quinonero Candela, Joe Beutler, Joe Landers, Joel Parish, Johannes Heidecke, John Schulman, Jonathan Lachman, Jonathan McKay, Jonathan Uesato, Jonathan Ward, Jong Wook Kim, Joost Huizinga, Jordan Sitkin, Jos Kraaijeveld, Josh Gross, Josh Kaplan, Josh Snyder, Joshua Achiam, Joy Jiao, Joyce Lee, Juntang Zhuang, Justyn Harriman, Kai Fricke, Kai Hayashi, Karan Singhal, Katy Shi, Kavin Karthik, Kayla Wood, Kendra Rimbach, Kenny Hsu, Kenny Nguyen, Keren Gu-Lemberg, Kevin Button, Kevin Liu, Kiel Howe, Krithika Muthukumar, Kyle Luther, Lama Ahmad, Larry Kai, Lauren Itow, Lauren Workman, Leher Pathak, Leo Chen, Li Jing, Lia Guy, Liam Fedus, Liang Zhou, Lien Mamitsuka, Lilian Weng, Lindsay McCallum, Lindsey Held, Long Ouyang, Louis Feuvrier, Lu Zhang, Lukas Kondraciuk, Lukasz Kaiser, Luke Hewitt, Luke Metz, Lyric Doshi, Mada Aflak, Maddie Simens, Madelaine Boyd, Madeleine Thompson, Marat Dukhan, Mark Chen, Mark Gray, Mark Hudnall, Marvin Zhang, Marwan Aljubeh, Mateusz Litwin, Matthew Zeng, Max Johnson, Maya Shetty, Mayank Gupta, Meghan Shah, Mehmet Yatbaz, Meng Jia Yang, Mengchao Zhong, Mia Glaese, Mianna Chen, Michael Janner, Michael Lampe, Michael Petrov, Michael Wu, Michele Wang, Michelle Fradin, Michelle Pokrass, Miguel Castro, Miguel Oom Temudo de Castro, Mikhail Pavlov, Miles Brundage, Miles Wang, Minal Khan, Mira Murati, Mo Bavarian, Molly Lin, Murat Yesildal, Nacho Soto, Natalia Gimelshein, Natalie Cone, Natalie Staudacher, Natalie Summers, Natan LaFontaine, Neil Chowdhury, Nick Ryder, Nick Stathas, Nick Turley, Nik Tezak, Niko Felix, Nithanth Kudige, Nitish Keskar, Noah Deutsch, Noel Bundick, Nora Puckett, Ofir Nachum, Ola Okelola, Oleg Boiko, Oleg Murk, Oliver Jaffe, Olivia Watkins, Olivier Godement, Owen Campbell-Moore, Patrick Chao, Paul McMillan, Pavel Belov, Peng Su, Peter Bak, Peter Bakkum, Peter Deng, Peter Dolan, Peter Hoeschele, Peter Welinder, Phil Tillet, Philip Pronin, Philippe Tillet, Prafulla Dhariwal, Qiming Yuan, Rachel Dias, Rachel Lim, Rahul Arora, Rajan Troll, Randall Lin, Rapha Gontijo Lopes, Raul Puri, Reah Miyara, Reimar Leike, Renaud Gaubert, Reza Zamani, Ricky Wang, Rob Donnelly, Rob Honsby, Rocky Smith, Rohan Sahai, Rohit Ramchandani, Romain Huet, Rory Carmichael, Rowan Zellers, Roy Chen, Ruby Chen, Ruslan Nigmatullin, Ryan Cheu, Saachi Jain, Sam Altman, Sam Schoenholz, Sam Toizer, Samuel Miserendino, Sandhini Agarwal, Sara Culver, Scott Ethersmith, Scott Gray, Sean Grove, Sean Metzger, Shamez Hermani, Shantanu Jain, Shengjia Zhao, Sherwin Wu, Shino Jomoto, Shirong Wu, Shuaiqi, Xia, Sonia Phene, Spencer Papay, Srinivas Narayanan, Steve Coffey, Steve Lee, Stewart Hall, Suchir Balaji, Tal Broda, Tal Stramer, Tao Xu, Tarun Gogineni, Taya Christianson, Ted Sanders, Tejal Patwardhan, Thomas Cunninghman, Thomas Degry, Thomas Dimson, Thomas Raoux, Thomas Shadwell, Tianhao Zheng, Todd Underwood, Todor Markov, Toki Sherbakov, Tom Rubin, Tom Stasi, Tomer Kaftan, Tristan Heywood, Troy Peterson, Tyce Walters, Tyna Eloundou, Valerie Qi, Veit Moeller, Vinnie Monaco, Vishal Kuo, Vlad Fomenko, Wayne Chang, Weiyi Zheng, Wenda Zhou, Wesam Manassra, Will Sheu, Wojciech Zaremba, Yash Patil, Yilei Qian, Yongjik Kim, Youlong Cheng, Yu Zhang, Yuchen He, Yuchen Zhang, Yujia Jin, Yunxing Dai, and Yury Malkov. GPT-4o system card, 2024. 3
[34] Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna Mitenkova, Jane Wang, Jeff Clune, Demis Hassabis, Raia Hadsell, Adrian Bolton, Satinder Singh, and Tim Rocktäschel. Genie 2: A large-scale foundation world model, 2024. 1
[35] PixVerse. Pixverse. [Link] 2024. 5
[36] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024. 4, 7
[37] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sean Bell, Sharadh Ramaswamy, Shelly Sheynin, Siddharth Bhattacharya, Simran Motwani, Tao Xu, Tianhe Li, Tingbo Hou, Wei-Ning Hsu, Xi Yin, Xiaoliang Dai, Yaniv Taigman, Yaqiao Luo, Yen-Cheng Liu, Yi-Chiao Wu, Yue Zhao, Yuval Kirstain, Zecheng He, Zijian He, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu, Arun Mallya, Baishan Guo, Boris Araya, Breena Kerr, Carleigh Wood, Ce Liu, Cen Peng, Dimitry Vengertsev, Edgar Schonfeld, Elliot Blanchard, Felix Juefei-Xu, Fraylie Nord, Jeff Liang, John Hoffman, Jonas Kohler, Kaolin Fire, Karthik Sivakumar, Lawrence Chen, Licheng Yu, Luya Gao, Markos Georgopoulos, Rashel Moritz, Sara K. Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, and Yuming Du. Movie Gen: A cast of media foundation models. [Link]gen-media-foundation-models-generative-ai-video/, 2025. 1
[38] Runway. Gen-3. [Link]introducing-gen-3-alpha/, 2024. 1, 5
[39] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu Edit: Precise image editing via recognition and generation tasks. In CVPR, 2023. 4
[40] Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-I2V: Consistent and controllable image-to-video generation with explicit motion modeling. In SIGGRAPH (Conference Paper Track), 2024. 2
[41] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 4
[42] Spencer Sterling. Zeroscope v2 576w. https://huggingface.co/cerspense/zeroscope_v2_576w, 2024. 5
[43] Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation. CoRR, abs/2407.14505, 2024. 2, 4
[44] Pika Team. Pika art. [Link] 2024. 5
[45] WorldLab Team. Worldlab. https://www.[Link]/blog, 2024. 1
[46] Wan Team. Wan: Open and advanced large-scale video generative models, 2025. 1
[47] Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, and Bin Cui. VideoTetris: Towards compositional text-to-video generation, 2024. 5
[48] Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. MetaMorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024. 2
[49] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023. 5
[50] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. CoRR, 2024. 2
[51] Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, and Mohit Bansal. DreamRunner: Fine-grained storytelling video generation with retrieval-augmented motion adaptation. arXiv preprint arXiv:2411.16657, 2024. 2

[52] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. MotionCtrl: A unified and flexible motion controller for video generation. In SIGGRAPH (Conference Paper Track), 2024. 2
[53] Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. DragAnything: Motion control for anything using entity representation. In European Conference on Computer Vision, pages 331–348. Springer, 2024. 1, 2
[54] Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. VILA-U: A unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024. 2
[55] Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and CUI Bin. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal LLMs. In Forty-first International Conference on Machine Learning, 2024. 2
[56] Xingyi Yang and Xinchao Wang. Compositional video generation as flow equalization, 2024. 5
[57] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 1, 2, 4, 5
[58] Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. DragNUWA: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023. 2
[59] Shoubin Yu, Jacob Zhiyuan Fang, Jian Zheng, Gunnar Sigurdsson, Vicente Ordonez, Robinson Piramuthu, and Mohit Bansal. Zero-shot controllable image-to-video animation via motion decomposition. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 3332–3341, 2024. 2
[60] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023. 5
[61] Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, et al. Recognize anything: A strong image tagging model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1724–1732, 2024. 4
[62] Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Tora: Trajectory-oriented diffusion transformer for video generation. arXiv preprint arXiv:2407.21705, 2024. 1
[63] Haitao Zhou, Chuang Wang, Rui Nie, Jinxiao Lin, Dongdong Yu, Qian Yu, and Changhu Wang. TrackGo: A flexible and efficient method for controllable video generation. CoRR, 2024. 2
[64] Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. StoryDiffusion: Consistent self-attention for long-range image and video generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 2

Appendix

In this appendix, we present the following:
• Noise inversion details using DPM-Solver++ [27] in Sec. A.
• MLLM prompts we use to collect background description, foreground object layout and trajectory, and α used to determine how much noise to inject during inversion in Sec. B.

A. Noise Inversion Details

Here, we describe in detail how we adopt the inversion technique for final video generation with the motion priors prepared in Stage 3. Specifically, we first encode the sequence of images with the planned layout, collected in the previous stage (Sec. 3.2), into the latent space $z$ using a 3D Variational Autoencoder (3D VAE). Then, we perform the forward diffusion process where Gaussian noise is gradually added to the latent. Following the DPM-Solver++ [27] scheduler in CogVideoX, the noised latent at diffusion step $t$ is:

$$z_t = \sqrt{\alpha_t}\, z_0 + \sqrt{1 - \alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I). \tag{1}$$

Here, $\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)$ is the cumulative noise schedule, and $\epsilon$ represents Gaussian noise.

Then, given a noisy latent $z_t$, we attempt to recover the clean latent as a general reverse denoising starting from step $t$. The model then denoises to $z_{t-1}$ using the DPM-Solver++ method, which provides a high-order approximation of the reverse diffusion process. Specifically, the update equation for $z_{t-1}$ in a second-order solver is:

$$z_{t-1} = z_t + \lambda_1 \hat{F}(z_t, t) + \lambda_2 \hat{F}\big(z_t + \lambda_3 \hat{F}(z_t, t),\, t_m\big), \tag{2}$$

where $\hat{F}(z, t) = -\frac{1}{2} \beta_t z - g^2(t)\, \epsilon_\theta(z, t)$ is the estimated drift term, $\lambda_1, \lambda_2, \lambda_3$ are step-size coefficients computed adaptively, and $t_m$ is an intermediate timestep between $t$ and $t - 1$.

We observe that selecting $t$ within a specific range enables video diffusion models to inject smooth object motions naturally. This process effectively transforms a sequence of static images into a coherent video with realistic motion dynamics.

B. Prompt for MLLM Planning

In this section, we present the prompt templates used to collect background descriptions, foreground object layouts and trajectories, and the parameter α, which determines the amount of noise injected during inversion. As shown in Fig. 6, we explicitly instruct the multi-modal LLM to separate the generation of foreground and background, ensuring that key foreground objects are not mistakenly included in the background. In Fig. 7, we first prompt the MLLM to generate bounding boxes for foreground objects, followed by reasoning for their placement. Additionally, we emphasize that the placement of foreground objects should be informed by background bounding box annotations to improve spatial coherence. Finally, Fig. 8 illustrates the prompt template used to determine the appropriate level of noise injection during inversion. We explicitly incorporate prior knowledge into the template, instructing the MLLM to apply less noise for tasks requiring precise trajectory or layout control and more noise for tasks involving dynamic changes or object actions that cannot be effectively modeled with bounding box plans.

Figure 6. Prompt template used to query background description.

Figure 7. Prompt template used to query foreground object layout and trajectory plan.
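
To make the idea of a foreground layout and trajectory plan concrete, the following is a minimal sketch of one way such a plan could be represented once the MLLM has returned bounding boxes. It is not the format used in the paper: the `Box` class, the normalized coordinates, and the linear interpolation between a start box and an end box are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized [0, 1] image coordinates (hypothetical format)."""
    x0: float
    y0: float
    x1: float
    y1: float


def interpolate(start, end, num_frames):
    """Linearly interpolate a box from start to end over num_frames frames."""
    if num_frames < 2:
        return [start]
    a = (start.x0, start.y0, start.x1, start.y1)
    b = (end.x0, end.y0, end.x1, end.y1)
    frames = []
    for i in range(num_frames):
        w = i / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append(Box(*(p + w * (q - p) for p, q in zip(a, b))))
    return frames


# Hypothetical plan: a balloon drifting from bottom-left to top-right.
trajectory = interpolate(Box(0.1, 0.6, 0.3, 0.9), Box(0.6, 0.1, 0.8, 0.4), 5)
```

Each entry of `trajectory` is the box for one frame, so a per-frame layout can be read off directly when compositing the foreground object onto the background.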

Figure 8. Prompt template used to determine how much noise to inject during inversion.
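
As a rough illustration of how a chosen noise level could feed into the forward noising of Eq. (1) in Sec. A, the sketch below maps a noise ratio to a diffusion step and then noises a toy latent to that step. Several things here are assumptions rather than details from the paper: the linear mapping from ratio to step, the linear beta schedule, the example ratios 0.6 and 0.9, and the names `noise_ratio`, `start_step`, and `noise_latent`. Note that `noise_ratio` stands for the MLLM-selected α of Sec. B, which is a different quantity from the schedule value α_t in Eq. (1).

```python
import math
import random


def cumulative_alphas(betas):
    """alpha_t = prod_{s=1..t} (1 - beta_s); list index t - 1 holds alpha_t."""
    alphas, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas.append(running)
    return alphas


def start_step(noise_ratio, num_steps):
    """Map a noise ratio in [0, 1] to a 1-indexed step (assumed linear mapping)."""
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must lie in [0, 1]")
    return max(1, min(num_steps, round(noise_ratio * num_steps)))


def noise_latent(z0, alphas, t, rng):
    """Eq. (1): z_t = sqrt(alpha_t) * z0 + sqrt(1 - alpha_t) * eps, eps ~ N(0, I)."""
    a = alphas[t - 1]
    return [math.sqrt(a) * z + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for z in z0]


# Hypothetical linear beta schedule and a toy 8-dimensional stand-in for a VAE latent.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = cumulative_alphas(betas)
rng = random.Random(0)
z0 = [1.0] * 8

t_layout = start_step(0.6, T)  # less noise: stays closer to the planned layout
t_action = start_step(0.9, T)  # more noise: leaves more room for dynamic motion
z_layout = noise_latent(z0, alphas, t_layout, rng)
z_action = noise_latent(z0, alphas, t_action, rng)
```

A larger ratio selects a later step, where α_t is smaller and the latent retains less of the planned frames; reverse denoising would then start from that step.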
