Training-Free Video Generation Guidance
Jialu Li* Shoubin Yu* Han Lin* Jaemin Cho Jaehong Yoon Mohit Bansal
UNC Chapel Hill
{jialuli, shoubin, hanlincs, jmincho, jhyoon, mbansal}@[Link]
[Link]
arXiv:2504.08641v1 [[Link]] 11 Apr 2025
[Figure 1 panels: (a) Single Model for Video Generation (Text Prompt, T2V); (b) Video Generation with Attention-based Layout Guidance (Text Prompt, LLM, Object Layout & Trajectory, T2V); (c) VIDEO-MSG (Text Prompt, MLLM, Background & Foreground Plans, T2I / I2V, Noise Inversion, T2V)]
Figure 1. Comparison of different text-to-video generation methods: (a) single model for video generation, (b) video generation with (attention-based) layout guidance, and our (c) VIDEO-MSG, a training-free guidance method for T2V generation based on multimodal planning and structured noise initialization. Since VIDEO-MSG does not need fine-tuning or additional memory during inference time, it is easier to adopt large T2V models than previous video layout guidance methods based on fine-tuning or iterative attention manipulation.
enhancing text alignment with multiple T2V backbones (VideoCrafter2 [4] and CogVideoX-5B [57]) on popular T2V generation benchmarks (T2V-CompBench [43] and VBench [11]). For example, with CogVideoX-5B as the T2V generation backbone, VIDEO-MSG improves motion binding with a relative gain of 52.46%, numeracy with a relative gain of 40.11%, and spatial relationships with a relative gain of 11.15%. We provide comprehensive quantitative and qualitative ablation studies on the noise inversion ratio, different background generators, background object detection, and foreground object segmentation. We hope our method can inspire future work on effectively and efficiently integrating LLMs’ planning ability into video generation.

2. Related Work

2.1. MLLM Planning for Video Generation
Recent works [8, 22, 24, 51, 64] leverage the reasoning capabilities and world knowledge of LLMs or multimodal LLMs for the task of video generation. For example, one line of work [22, 24, 51] applies GPT-4 / GPT-4o to expand a single text prompt into a ‘video plan’ in the format of bounding boxes or detailed prompt descriptions [55], which is then given as input to a downstream video diffusion model for layout-guided video generation. The other line of work [16, 28, 29, 48, 50, 54] performs token-level planning utilizing multimodal LLMs. For example, [16, 50] tokenize videos and text into the same space and generate video tokens using the same strategy as text (e.g., next-token prediction). However, both directions either rely on high-quality prompts and bounding box planning or require extensive training, and they do not fully leverage the power of existing visual tools for fine-grained video generation. In contrast, our work leverages the power of both multimodal LLMs and image/video diffusion models to generate a VIDEO SKETCH for final fine-grained motion control, and is fully training-free.

2.2. Motion Direction Control in Video Generation
Controllability in video generation is gaining increasing attention in the field of generative AI, as it enables models to generate videos aligned with user intent. One line of research focuses on training models with the capability of trajectory control, camera control, or motion control by generating intermediate representations. For trajectory control, recent works such as DragNUWA [58], IVA-0 [59], DragAnything [53], and TrackGo [63] encode object movement trajectories into dense features, which are then fused into the diffusion model to enable object movement control. On the other hand, CameraCtrl [7], MotionCtrl [52], and Image Conductor [20] encode camera extrinsics as features to control camera motion in the generated videos. A common drawback of both of these categories is their reliance on accurate object trajectory or camera movement information, which is difficult for users to manipulate directly. Additionally, video datasets with accurate trajectory annotations are limited, which constrains the performance of these models. A third category, including VideoJAM [3] and MotionI2V [40], produces motion and video representations jointly, or sequentially by first generating intermediate representations, which then serve as guidance for generating video outputs. However, such methods require extensive
Figure 2. Three stages of VIDEO-MSG. In the first stage, the MLLM plans specific global and local contexts that fit the provided text-to-video prompt. In the second stage, the text-to-image (T2I) model uses the MLLM-planned context to render the necessary components of the video. In the third stage, we generate the video with VIDEO SKETCH via noise inversion.
training due to extra generation objectives. In contrast, our method uses an image-to-video model, allowing us to transform existing, real-world images into controllable videos under LLM planning.

3. Method
We introduce VIDEO-MSG, Multimodal Sketch Guidance for video generation, a training-free guidance method for T2V generation based on multimodal planning and structured noise initialization. VIDEO-MSG consists of three stages (illustrated in Fig. 2):
• Background planning (Sec. 3.1), where we adopt T2I and I2V models to generate background image priors with natural animation.
• Foreground Object Layout and Trajectory Planning (Sec. 3.2), where we apply MLLM and object detectors to plan and place foreground objects into the background harmoniously.
• Video Generation with Structured Noise Initialization (Sec. 3.3), where the synthesized images derived from the above stages are used as VIDEO SKETCH for final video generation via inversion techniques.

ation based on bounding boxes, where the T2I model may fail to generate the foreground object at the specified box location in the image. In addition, we explore two approaches for background generation:
(1) Using a T2I model to generate an initial background, followed by an I2V model to animate it. In this way, we can adopt a strong T2V model to potentially achieve improved video aesthetic quality.
(2) Directly using a T2V model to generate the background with animation, which avoids the potential distribution gap between the two models in (1).
In both cases, we adopt a video generation model. We aim to introduce natural background animation rather than keeping it static while only animating foreground objects. This ensures that elements such as flowing water, moving clouds, or swaying trees are naturally incorporated, making the generated videos more realistic and visually coherent. Moreover, by comparing approaches (1) with (2), we notice that the advantage of adopting a strong T2I model in (1) outweighs the domain gap between the T2I and I2V models in (2) as discussed in Sec. 4.3. Therefore, we apply approach (1) as our default experiment setting.
object’s movement. For instance, given the text prompt: “A cat sinking to the left in the living room”, GPT-4o can correctly infer the cat’s movement direction (i.e., moving left). However, when provided with a background image of a living room, GPT-4o often fails to position the cat’s bounding box appropriately on the floor (e.g., with the bounding box floating in mid-air or overlapping with unrelated objects), as illustrated in Figure 4. This suggests that while GPT-4o demonstrates strong motion reasoning capabilities, it lacks direct grounding capability for visual elements and struggles to align foreground objects with the background scene in a spatially consistent manner.

To overcome this limitation, we first detect all objects in the background image with Recognize-Anything (RAM) [61], then extract their bounding boxes with Grounding-DINO [25]. These bounding boxes are fed into GPT-4o to provide explicit spatial context, which helps it accurately position and animate foreground objects, enhancing spatial coherence in generated videos and reducing placement errors. Qualitative examples of the effectiveness of object detection with Grounding-DINO and RAM are presented in Figure 4. With the above inputs (i.e., video text prompt, background image, and the bounding boxes of objects in the background), GPT-4o generates a sequence of bounding boxes for the foreground objects in the format [object name, bounding box coordinates] (see stage 2 in Fig. 2). Additionally, it provides a textual description for each frame and a reasoning process explaining the planned object motions after the sequence of frames. This reasoning step enhances the coherence and accuracy of motion planning.

Once the sequence of bounding boxes is obtained, we utilize a T2I model to generate the appearance of the foreground object using the prompt: “An image of the {object name}.” However, directly merging the generated object image with the background presents a challenge: the background in the generated object image can significantly affect the overall visual coherence, as illustrated in Fig. 5. To address this, we apply SAM [14] to extract the object from the generated image, removing any unintended background. Based on the planned bounding boxes, the extracted object is then resized and placed onto the background image at the corresponding location. This process ensures a more seamless integration of the foreground object into the background, improving the visual consistency of the generated video.

3.3. Video Generation with Structured Noise Initialization
In this stage, we generate a final video by guiding the T2V diffusion model with the VIDEO SKETCH created from the previous stage (Sec. 3.2). Inversion methods [41], which are often used in image and video editing tasks [31, 39], can be effectively utilized here to create structured noise that fuses the information from VIDEO SKETCH. While the normal denoising process runs from the terminal timestep $t = T$ (random noise) to the initial timestep $t = 0$ (a clean video), we create per-frame initial noises from VIDEO SKETCH via noise inversion [31] and start denoising from a timestep $t_{inv}$. Specifically, we first encode the sequence of VIDEO SKETCH frames into the latent space $z$ using a 3D VAE [13, 57]. Next, we obtain the initial noise $z_{t_{inv}}$ via the forward diffusion process [9]:

$$z_{t_{inv}} = \sqrt{\bar{\alpha}_{t_{inv}}}\, z_0 + \sqrt{1 - \bar{\alpha}_{t_{inv}}}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$ is the cumulative noise schedule, and $\epsilon$ represents Gaussian noise. We parameterize $t_{inv} = \alpha \times T$, where $\alpha \in (0.0, 1.0)$. Inspired by VideoDirectorGPT [24], which uses an LLM to estimate a confidence score along with bounding box layouts as layout guidance strength, we employ an LLM to infer an appropriate noise inversion ratio α given a text description (see Sec. 4.3 for detailed experiments). We explain more details about the noise inversion in the Appendix.

4. Experiments

4.1. Experiment Setups
Datasets. We evaluate VIDEO-MSG on popular text-to-video generation benchmarks, T2V-CompBench [43] and VBench [11]. T2V-CompBench and VBench measure diverse aspects of text-to-video generation tasks with seven (e.g., consistent attribute binding, motion binding, spatial relationships) and sixteen categories (e.g., overall consistency, color, temporal flickering, motion smoothness), respectively. In this work, we primarily use T2V-CompBench to evaluate video diffusion models’ capability in compositional text-to-video generation, and use VBench to measure motion smoothness of the generated video.

Implementation details. We implement VIDEO-MSG on two recent text-to-video generation diffusion models: VideoCrafter2 [4] and CogVideoX-5B [57]. To generate the VIDEO SKETCH, we employ FLUX.1-dev [19] and SDXL [36] as the background generator, and CogVideoX-5B as the image-to-video generator. We utilize Recognize-Anything [61] and Grounded-Segment-Anything [14] for foreground object segmentation. We utilize GPT-4o as the multimodal LLM for background description generation, foreground object layout and trajectory planning, and determining the noise inversion ratio α dynamically based on the prompt. For the noise inversion ratio α (Sec. 3.3), we find the range [0.7, 0.9] works well for CogVideoX-5B, and the range [0.5, 0.8] works well for VideoCrafter2 (see Sec. 4.3 for the ablation study). All experiments are conducted on A100 and A6000 GPUs, with batch size 1 and an approximate memory usage of 16 GB. We provide additional details, such as prompts used for GPT-4o, in the Appendix.
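As a rough illustration of the VIDEO SKETCH construction (Sec. 3.2) and the structured noise initialization (Sec. 3.3), the sketch below pastes a segmented object onto a background along a planned box trajectory and then applies the closed-form forward diffusion step up to $t_{inv} = \alpha \times T$. It is a minimal NumPy sketch under several assumptions that are not taken from the paper: boxes are normalized (x0, y0, x1, y1) tuples, resizing is nearest-neighbour, the β schedule is linear with T = 1000, and pixel frames stand in for the 3D-VAE latents. The helper names `paste_object` and `noise_to_timestep` are hypothetical.

```python
import numpy as np

def paste_object(background, obj_rgb, obj_mask, box):
    """Paste a segmented object into one frame.

    box = (x0, y0, x1, y1), normalized to [0, 1] (an assumed convention).
    """
    H, W = background.shape[:2]
    x0, y0, x1, y1 = box
    left = int(np.clip(round(x0 * W), 0, W - 1))
    top = int(np.clip(round(y0 * H), 0, H - 1))
    right = int(np.clip(round(x1 * W), left + 1, W))
    bottom = int(np.clip(round(y1 * H), top + 1, H))
    bh, bw = bottom - top, right - left
    # Nearest-neighbour resize of the object and its mask to the box size.
    rows = np.arange(bh) * obj_rgb.shape[0] // bh
    cols = np.arange(bw) * obj_rgb.shape[1] // bw
    rgb, mask = obj_rgb[rows][:, cols], obj_mask[rows][:, cols]
    frame = background.copy()
    # Only masked (foreground) pixels overwrite the background.
    frame[top:bottom, left:right][mask] = rgb[mask]
    return frame

def noise_to_timestep(z0, ratio, betas, rng):
    """Forward-diffuse z0 to t_inv = ratio * T using the DDPM closed form."""
    T = len(betas)
    t_inv = int(np.clip(round(ratio * T), 1, T))
    alpha_bar = np.cumprod(1.0 - betas)[t_inv - 1]
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps
    return z_t, t_inv

# Toy usage: a white square moving left over a black background.
rng = np.random.default_rng(0)
background = np.zeros((32, 32, 3))
obj_rgb = np.ones((8, 8, 3))
obj_mask = np.ones((8, 8), dtype=bool)
boxes = [(0.6 - 0.1 * i, 0.5, 0.9 - 0.1 * i, 0.8) for i in range(4)]
sketch = np.stack([paste_object(background, obj_rgb, obj_mask, b) for b in boxes])

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
z_t, t_inv = noise_to_timestep(sketch, ratio=0.7, betas=betas, rng=rng)
```

In a real pipeline, `z_t` would be handed to the T2V backbone's sampler, which would then denoise from `t_inv` down to 0 instead of starting from pure noise at T.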
| Model | Consist-attr | Dynamic-attr | Spatial | Motion | Action | Interaction | Numeracy |
|---|---|---|---|---|---|---|---|
| (Closed-source models) | | | | | | | |
| Pika [44] | 0.6513 | 0.1744 | 0.5043 | 0.2221 | 0.5380 | 0.6625 | 0.2613 |
| Gen-3 [38] | 0.7045 | 0.2078 | 0.5533 | 0.3111 | 0.6280 | 0.7900 | 0.2169 |
| Dreamina [5] | 0.8220 | 0.2114 | 0.6083 | 0.2391 | 0.6660 | 0.8175 | 0.4006 |
| PixVerse [35] | 0.7370 | 0.1738 | 0.5874 | 0.2178 | 0.6960 | 0.8275 | 0.3281 |
| Kling [15] | 0.8045 | 0.2256 | 0.6150 | 0.2448 | 0.6460 | 0.8475 | 0.3044 |
| (Open-source models) | | | | | | | |
| ModelScope [49] | 0.5483 | 0.1654 | 0.4220 | 0.2552 | 0.4880 | 0.7075 | 0.2066 |
| ZeroScope [42] | 0.4495 | 0.1086 | 0.4073 | 0.2319 | 0.4620 | 0.5550 | 0.2378 |
| AnimateDiff [6] | 0.4883 | 0.1764 | 0.3883 | 0.2236 | 0.4140 | 0.6550 | 0.0884 |
| Latte [30] | 0.5325 | 0.1598 | 0.4476 | 0.2187 | 0.5200 | 0.6625 | 0.2187 |
| Show-1 [60] | 0.6388 | 0.1828 | 0.4649 | 0.2316 | 0.4940 | 0.7700 | 0.1644 |
| Open-Sora 1.2 [10] | 0.6600 | 0.1714 | 0.5406 | 0.2388 | 0.5717 | 0.7400 | 0.2556 |
| Open-Sora-Plan v1.1.0 [18] | 0.7413 | 0.1770 | 0.5587 | 0.2187 | 0.6780 | 0.7275 | 0.2928 |
| VideoTetris [47] | 0.7125 | 0.2066 | 0.5148 | 0.2204 | 0.5280 | 0.7600 | 0.2609 |
| Vico [56] | 0.7025 | 0.2376 | 0.4952 | 0.2225 | 0.5480 | 0.7775 | 0.2116 |
| VideoCrafter2 [4] | 0.6750 | 0.1850 | 0.4891 | 0.2233 | 0.5800 | 0.7600 | 0.2041 |
| VideoCrafter2 + LVD [21] | 0.6663 | 0.2308 | 0.5106 | 0.2178 | 0.5640 | 0.8125 | 0.2869 |
| (change vs. VideoCrafter2) | (-0.0087) | (+0.0458) | (+0.0215) | (-0.0055) | (-0.0160) | (+0.0525) | (+0.0828) |
| VideoCrafter2 + VIDEO-MSG (Ours) | 0.7536 | 0.2110 | 0.5866 | 0.3732 | 0.5737 | 0.8220 | 0.3138 |
| (change vs. VideoCrafter2) | (+0.0786) | (+0.0260) | (+0.0975) | (+0.1499) | (-0.0063) | (+0.0620) | (+0.1097) |
| CogVideoX-5B [57] | 0.7220 | 0.2334 | 0.5461 | 0.2943 | 0.5960 | 0.7950 | 0.2603 |
| CogVideoX-5B + VIDEO-MSG (Ours) | 0.7109 | 0.2102 | 0.6070 | 0.4487 | 0.5960 | 0.7800 | 0.3647 |
| (change vs. CogVideoX-5B) | (-0.0111) | (-0.0232) | (+0.0609) | (+0.1544) | (+0.0000) | (-0.0150) | (+0.1044) |

Table 1. T2V-CompBench evaluation results. We highlight the best/second-best scores for open-sourced models with bold/underline.
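The relative gains quoted for the CogVideoX-5B backbone (52.46% for motion binding, 40.11% for numeracy, 11.15% for spatial relationships) can be recomputed from the Table 1 rows as (improved - baseline) / baseline. The short check below uses only scores from the table; the function name `relative_gain` is just an illustrative helper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100.0

# (CogVideoX-5B, CogVideoX-5B + VIDEO-MSG) scores from Table 1.
scores = {
    "Motion": (0.2943, 0.4487),
    "Numeracy": (0.2603, 0.3647),
    "Spatial": (0.5461, 0.6070),
}

for name, (base, ours) in scores.items():
    print(f"{name}: {relative_gain(base, ours):.2f}%")
# Motion: 52.46%
# Numeracy: 40.11%
# Spatial: 11.15%
```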
4.2. Quantitative Evaluation

Improved control on spatial layout and object trajectory. Table 1 shows that VIDEO-MSG significantly improves both T2V backbone models (VideoCrafter2 and CogVideoX-5B) in many skills, especially in motion binding (‘Motion’), with an increase of 0.1499 on VideoCrafter2 and 0.1544 on CogVideoX-5B. VIDEO-MSG also provides large improvements in spatial relationships (‘Spatial’) and numeracy (‘Numeracy’) in both backbone models. These results show that the planning and structured noise initialization of VIDEO-MSG effectively improve the control of spatial layouts and object trajectories in video generation. It is also noteworthy that VIDEO-MSG, implemented with open-source T2V backbone models, achieves higher motion binding scores than closed-source models such as Gen-3 [38]. VIDEO-MSG did not improve the scores in the dynamic attribute binding (‘Dynamic-attr’) and object action and interaction (‘Action’ and ‘Interaction’) categories. This is likely because dynamic changes in object or environment states, and interactions and actions between objects, are difficult to guide solely with bounding boxes.

Comparison to planning-based baseline. We also compare VIDEO-MSG with LVD [21], a recent T2V layout guidance method that adds a gradient-based energy function optimization step before each denoising step of the T2V diffusion backbone. The energy function adjusts the cross-attention map of diffusion models to concentrate within a set of object bounding boxes generated by an LLM. On the VideoCrafter2 backbone, we find that VIDEO-MSG outperforms LVD in all categories except for dynamic attribute binding, with the largest improvement observed in motion binding (‘Motion’), where VIDEO-MSG surpasses LVD by 0.1554. This demonstrates the effectiveness of our approach. Note that VIDEO-MSG is also more memory-efficient than LVD, as the layout guidance in LVD requires backpropagation through the T2V diffusion backbone, making it hard to adapt to large diffusion models; we could run VIDEO-MSG with the CogVideoX-5B backbone on an A6000 GPU (48GB), but we could not fit LVD even on an A100 (80GB).

4.3. Ablation Studies

Noise inversion ratio α. As described in Sec. 3.3, we guide the T2V generation backbone by denoising from an intermediate timestep $t_{inv} = \alpha \times T$ to the initial timestep $t = 0$. Here, we experiment with different noise inversion ratios α (i.e., varying the noise injected into the VIDEO SKETCH). Table 2 shows that lower α achieves better performance in motion binding (e.g., moving left/right), numeracy, and spatial relationships but hurts the smoothness of motions. This aligns with the intuition that increasing the number of refinement steps based on VIDEO SKETCH enhances the final motion quality. We observe that automatically inferring a proper α from the text description with an LLM achieves a good trade-off, and we use this approach by default.
Figure 3. Videos generated with CogVideoX-5B and VIDEO-MSG with CogVideoX-5B backbone. The videos generated with VIDEO-MSG are more accurate regarding object motions, numeracy, and spatial relationships.
| No. | Noise inversion ratio α | Motion Binding (T2V-CompBench) | Numeracy (T2V-CompBench) | Spatial (T2V-CompBench) | Motion Smoothness (VBench) |
|---|---|---|---|---|---|
| 1. | Direct T2V (no inversion) | 0.2233 | 0.2041 | 0.4891 | 97.73 |
| 2. | 0.8 | 0.2793 | 0.2081 | 0.5502 | 98.69 |
| 3. | 0.7 | 0.3197 | 0.2653 | 0.5678 | 98.62 |
| 4. | 0.6 | 0.3352 | 0.3059 | 0.6057 | 98.63 |
| 5. | 0.5 | 0.3980 | 0.3138 | 0.6447 | 98.58 |
| 6. | LLM-controlled | 0.3732 | 0.3138 | 0.5866 | 99.01 |

Table 2. Comparison of different noise inversion ratios α, where we compare static values and LLM-based dynamic values. Backbone T2V: VideoCrafter2. Background generator: Flux + CogVideoX-5B.
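To make the role of α in Table 2 more concrete, the sketch below computes, for each static ratio, the starting timestep $t_{inv} = \alpha \times T$ and the coefficient $\sqrt{\bar{\alpha}_{t_{inv}}}$ that multiplies the VIDEO SKETCH latent in the forward diffusion step. The linear β schedule with T = 1000 is an assumption for illustration only and is not the schedule of either backbone; `sketch_signal_weight` is a hypothetical helper name.

```python
import numpy as np

def sketch_signal_weight(ratio, betas):
    """Return (t_inv, sqrt(alpha_bar at t_inv)) for a noise inversion ratio."""
    T = len(betas)
    t_inv = int(np.clip(round(ratio * T), 1, T))
    alpha_bar = np.cumprod(1.0 - betas)
    return t_inv, float(np.sqrt(alpha_bar[t_inv - 1]))

# Assumed schedule: linear betas with T = 1000 (not the backbones' own).
betas = np.linspace(1e-4, 0.02, 1000)

for ratio in (0.5, 0.6, 0.7, 0.8, 0.9):
    t_inv, weight = sketch_signal_weight(ratio, betas)
    print(f"alpha={ratio:.1f}  t_inv={t_inv}  sketch weight={weight:.3f}")
```

Under this assumed schedule the sketch weight shrinks monotonically as α grows, which is consistent with the observation that a lower α keeps the output closer to the planned layout while leaving fewer denoising steps for the backbone to refine motion.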
| No. | Background Generator | Motion | Numeracy |
|---|---|---|---|
| 1. | Direct T2V (no background) | 0.2897 | 0.2750 |
| 2. | SDXL (T2I) + CogVideoX-5B (I2V) | 0.4487 | 0.3559 |
| 3. | FLUX (T2I) + CogVideoX-5B (I2V) | 0.4549 | 0.3647 |
| 4. | CogVideoX-5B (T2V) | 0.4565 | 0.3028 |

Table 3. Ablation studies on different background generators.

Different background generator. In Table 3, we compare different background generation methods (Sec. 3.1): (1) generating a background with a text-to-image (T2I) model, followed by an image-to-video (I2V) model for animation, and (2) directly using a text-to-video (T2V) model to generate an animated background. While both approaches improve motion binding and numeracy compared to using a single T2V model for video generation, the T2I + I2V pipeline scores higher in numeracy, and the T2V approach scores higher in motion binding. We attribute this to the video generation model’s ability to better refine object motion when the background follows a static camera, making foreground changes more salient for the video diffusion model. The I2V pipeline better adheres to the “Static Camera” prompt, producing natural background animations (e.g., wind, light changes). In contrast, T2V models often disregard the “Static Camera” requirement, introducing excessive camera motion and scene changes in the video. These inconsistencies make it harder for the video diffusion model to refine foreground objects, leading to performance degradation (e.g., 0.3028 with CogVideoX-5B vs. 0.3647 with FLUX on numeracy). Additionally, we find that a stronger T2I model (e.g., FLUX [19]) yields better results than a weaker one (e.g., SDXL [36]), highlighting the potential of leveraging high-quality T2I models for layout-controlled text-to-video generation.

backbone). We observe that CogVideoX-5B struggles with motion direction (e.g., an egg moves to the right instead of to the left, a helicopter ascends instead of descending to the land), numeracy (e.g., generated four bears instead of three bears, four penguins instead of six penguins), and spatial relationships (e.g., a vending machine should be located to the right of a gorilla, but it is missing; the umbrella should be located on the left of the children). In contrast, VIDEO-MSG successfully guides the T2V backbone to generate videos with correct semantics in all cases. Note that the T2V model can understand the coarse guidance in VIDEO SKETCH and place objects that harmonize well with the background through noise inversion. For example, in the middle example (‘three bears in a river surrounded by mountains’), even when the VIDEO SKETCH includes three bears only with other heads facing forward, the T2V model could place the three bears in the river naturally.

Effect of different noise inversion ratios α. Fig. 3 shows the video generation results from VIDEO SKETCH (with CogVideoX-5B backbone), with different noise inversion ratios α. Interestingly, the model can automatically refine objects to better align with the prompt and surrounding environment based on different α. We find that lower α (i.e., less noise) generally provides stronger layout control. For example, in the left top example, the egg in the videos with α = 0.7 and α = 0.8 closely follows the trajectory in VIDEO SKETCH, while in the video with α = 0.9, the egg movement is small and does not follow the trajectory. However, a lower α can lead to less natural generations; e.g., in the bottom-middle example, the boy’s motion appears less natural at α = 0.7 compared to α = 0.9. This highlights the importance of selecting an appropriate α to balance motion smoothness with faithful adherence to VIDEO SKETCH.
Figure 4. Example video showing the importance of background object detection in foreground object placement.
Figure 5. Example video showing the importance of foreground object segmentation.
out access to background bounding box information, the MLLM fails to place the golden retriever on the grass when relying solely on the background image input. In contrast, when provided with bounding box information from the background (e.g., {"label": "path", "box": [0.44, 0.57, 0.99, 0.99]}), the MLLM successfully positions the golden retriever at the correct location on the grass. Moreover, we find that this planning step directly impacts the final video quality. Conditioning the generation on inaccurate bounding box plans can conflict with the video diffusion model’s prior knowledge. For instance, in the first frame, the model may generate two golden retrievers (one on the ground based on its prior knowledge and another floating in the air according to the VIDEO SKETCH), resulting in unrealistic outputs, such as a golden retriever running mid-air across the garden. In contrast, our approach, which conditions planning on background bounding boxes, enables the generation of more natural and commonsense-aligned videos.

Segmentation of foreground objects improves harmonization. As demonstrated in Fig. 5, without object segmentation, the foreground object (a balloon) does not align well with the background, and the quantity is not well controlled (multiple balloons). This occurs because the video diffusion model does not inherently distinguish between the appearance of the background and that of the balloon, causing them to blend together. In contrast, when we first segment the balloon from the generated foreground object image and then place it onto the background to create VIDEO SKETCH, the balloon in the generated video harmonizes well with the background.

5. Conclusion
In this work, we introduce VIDEO-MSG, a training-free guidance method designed to enhance text-to-video (T2V) generation through multimodal planning and structured noise initialization. VIDEO-MSG consists of three steps. In the first two steps, VIDEO-MSG creates VIDEO SKETCH, a detailed spatial and temporal plan, utilizing a set of multimodal models, including multimodal LLM, object detection, and instance segmentation models. In the final step, VIDEO-MSG guides a downstream T2V diffusion model with VIDEO SKETCH through noise inversion and denoising. Notably, VIDEO-MSG does not require fine-tuning or additional memory during inference, making it easier to adopt large T2V models than existing methods that rely on fine-tuning or iterative attention manipulation. VIDEO-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones on popular T2V generation benchmarks. We also provide comprehensive ablation studies and qualitative examples that support the design choices of VIDEO-MSG. We hope our method can inspire future work on effectively and efficiently integrating LLMs’ planning capabilities into video generation.
Acknowledgments
This work was supported by ARO W911NF2110220, DARPA MCS N66001-19-2-4031, DARPA KAIROS Grant FA8750-19-2-1004, DARPA ECOLE Program No. HR00112390060, NSF-AI Engage Institute DRL211263, ONR N00014-23-1-2356, Microsoft Accelerate Foundation Models Research (AFMR) grant program, and a Bloomberg Data Science PhD Fellowship. The views, opinions, and/or findings contained in this article are those of the authors and not of the funding agency.

References
[1] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025.
[2] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.
[3] Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. Videojam: Joint appearance-motion representations for enhanced motion generation in video models. arXiv preprint arXiv:2502.02492, 2025.
[4] Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7310–7320, 2024.
[5] Dreamina AI. Dreamina. https://dreamina.[Link]/ai-tool/platform, 2024.
[6] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
[7] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. CoRR, 2024.
[8] Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al. Animate-a-story: Storytelling with retrieval-augmented video generation. arXiv preprint arXiv:2307.06940, 2023.
[9] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In NeurIPS, pages 1–25, 2020.
[10] hpcaitech. Open-sora: Democratizing efficient video production for all, 2024.
[11] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[12] Hedra Inc. Hedra. [Link] 2025.
[13] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
[14] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.
[15] Kling AI. Kling. [Link] 2024.
[16] Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Joshua V. Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A Ross, Bryan Seybold, and Lu Jiang. Videopoet: A large language model for zero-shot video generation. In Forty-first International Conference on Machine Learning, 2024.
[17] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
[18] PKU-Yuan Lab and Tuzhan AI etc. Open-sora-plan, 2024.
[19] Black Forest Labs. Flux. [Link]black-forest-labs/flux, 2024.
[20] Yaowei Li, Xintao Wang, Zhaoyang Zhang, Zhouxia Wang, Ziyang Yuan, Liangbin Xie, Yuexian Zou, and Ying Shan. Image conductor: Precision control for interactive video synthesis. CoRR, 2024.
[21] Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, and Boyi Li. Llm-grounded video diffusion models. arXiv preprint arXiv:2309.17444, 2023.
[22] Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, and Boyi Li. Llm-grounded video diffusion models. In The Twelfth International Conference on Learning Representations, 2024.
[23] Xinyao Liao, Xianfang Zeng, Liao Wang, Gang Yu, Guosheng Lin, and Chi Zhang. Motionagent: Fine-grained controllable video generation via motion field agent. arXiv preprint arXiv:2502.03207, 2025.
[24] Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning. In First Conference on Language Modeling, 2024.
[25] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
[26] Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei. Videodrafter: Content-consistent multi-scene video generation with llm. CoRR, 2024. 1
[27] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022. 13
[28] Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations, 2023. 2
[29] Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action. In CVPR, 2024. 2
[30] Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024. 5
[31] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022. 4
[32] OpenAI. Sora. [Link] 2024. 1
[33] OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alexis Conneau, Ali Kamali, Allan Jabri, Allison Moyer, Allison Tam, Amadou Crookes, Amin Tootoochian, Amin Tootoonchian, Ananya Kumar, Andrea Vallone, Andrej Karpathy, Andrew Braunstein, Andrew Cann, Andrew Codispoti, Andrew Galu, Andrew Kondrich, Andrew Tulloch, Andrey Mishchenko, Angela Baek, Angela Jiang, Antoine Pelisse, Antonia Woodford, Anuj Gosalia, Arka Dhar, Ashley Pantuliano, Avi Nayak, Avital Oliver, Barret Zoph, Behrooz Ghorbani, Ben Leimberger, Ben Rossen, Ben Sokolowsky, Ben Wang, Benjamin Zweig, Beth Hoover, Blake Samic, Bob McGrew, Bobby Spero, Bogo Giertler, Bowen Cheng, Brad Lightcap, Brandon Walkin, Brendan Quinn, Brian Guarraci, Brian Hsu, Bright Kellogg, Brydon Eastman, Camillo Lugaresi, Carroll Wainwright, Cary Bassin, Cary Hudson, Casey Chu, Chad Nelson, Chak Li, Chan Jun Shern, Channing Conger, Charlotte Barette, Chelsea Voss, Chen Ding, Cheng Lu, Chong Zhang, Chris Beaumont, Chris Hallacy, Chris Koch, Christian Gibson, Christina Kim, Christine Choi, Christine McLeavey, Christopher Hesse, Claudia Fischer, Clemens Winter, Coley Czarnecki, Colin Jarvis, Colin Wei, Constantin Koumouzelis, Dane Sherburn, Daniel Kappler, Daniel Levin, Daniel Levy, David Carr, David Farhi, David Mely, David Robinson, David Sasaki, Denny Jin, Dev Valladares, Dimitris Tsipras, Doug Li, Duc Phong Nguyen, Duncan Findlay, Edede Oiwoh, Edmund Wong, Ehsan Asdar, Elizabeth Proehl, Elizabeth Yang, Eric Antonow, Eric Kramer, Eric Peterson, Eric Sigler, Eric Wallace, Eugene Brevdo, Evan Mays, Farzad Khorasani, Felipe Petroski Such, Filippo Raso, Francis Zhang, Fred von Lohmann, Freddie Sulit, Gabriel Goh, Gene Oden, Geoff Salmon, Giulio Starace, Greg Brockman, Hadi Salman, Haiming Bao, Haitang Hu, Hannah Wong, Haoyu Wang, Heather Schmidt, Heather Whitney, Heewoo Jun, Hendrik Kirchner, Henrique Ponde de Oliveira Pinto, Hongyu Ren, Huiwen Chang, Hyung Won Chung, Ian Kivlichan, Ian O’Connell, Ian O’Connell, Ian Osband, Ian Silber, Ian Sohl, Ibrahim Okuyucu, Ikai Lan, Ilya Kostrikov, Ilya Sutskever, Ingmar Kanitscheider, Ishaan Gulrajani, Jacob Coxon, Jacob Menick, Jakub Pachocki, James Aung, James Betker, James Crooks, James Lennon, Jamie Kiros, Jan Leike, Jane Park, Jason Kwon, Jason Phang, Jason Teplitz, Jason Wei, Jason Wolfe, Jay Chen, Jeff Harris, Jenia Varavva, Jessica Gan Lee, Jessica Shieh, Ji Lin, Jiahui Yu, Jiayi Weng, Jie Tang, Jieqi Yu, Joanne Jang, Joaquin Quinonero Candela, Joe Beutler, Joe Landers, Joel Parish, Johannes Heidecke, John Schulman, Jonathan Lachman, Jonathan McKay, Jonathan Uesato, Jonathan Ward, Jong Wook Kim, Joost Huizinga, Jordan Sitkin, Jos Kraaijeveld, Josh Gross, Josh Kaplan, Josh Snyder, Joshua Achiam, Joy Jiao, Joyce Lee, Juntang Zhuang, Justyn Harriman, Kai Fricke, Kai Hayashi, Karan Singhal, Katy Shi, Kavin Karthik, Kayla Wood, Kendra Rimbach, Kenny Hsu, Kenny Nguyen, Keren Gu-Lemberg, Kevin Button, Kevin Liu, Kiel Howe, Krithika Muthukumar, Kyle Luther, Lama Ahmad, Larry Kai, Lauren Itow, Lauren Workman, Leher Pathak, Leo Chen, Li Jing, Lia Guy, Liam Fedus, Liang Zhou, Lien Mamitsuka, Lilian Weng, Lindsay McCallum, Lindsey Held, Long Ouyang, Louis Feuvrier, Lu Zhang, Lukas Kondraciuk, Lukasz Kaiser, Luke Hewitt, Luke Metz, Lyric Doshi, Mada Aflak, Maddie Simens, Madelaine Boyd, Madeleine Thompson, Marat Dukhan, Mark Chen, Mark Gray, Mark Hudnall, Marvin Zhang, Marwan Aljubeh, Mateusz Litwin, Matthew Zeng, Max Johnson, Maya Shetty, Mayank Gupta, Meghan Shah, Mehmet Yatbaz, Meng Jia Yang, Mengchao Zhong, Mia Glaese, Mianna Chen, Michael Janner, Michael Lampe, Michael Petrov, Michael Wu, Michele Wang, Michelle Fradin, Michelle Pokrass, Miguel Castro, Miguel Oom Temudo de Castro, Mikhail Pavlov, Miles Brundage, Miles Wang, Minal Khan, Mira Murati, Mo Bavarian, Molly Lin, Murat Yesildal, Nacho Soto, Natalia Gimelshein, Natalie Cone, Natalie Staudacher, Natalie Summers, Natan LaFontaine, Neil Chowdhury, Nick Ryder, Nick Stathas, Nick Turley, Nik Tezak, Niko Felix, Nithanth Kudige, Nitish Keskar, Noah Deutsch, Noel Bundick, Nora Puckett, Ofir Nachum, Ola Okelola, Oleg Boiko, Oleg Murk, Oliver Jaffe, Olivia Watkins, Olivier Godement, Owen Campbell-Moore, Patrick Chao, Paul McMillan, Pavel Belov, Peng Su, Peter Bak, Peter Bakkum, Peter Deng, Peter Dolan, Peter Hoeschele, Peter Welinder, Phil Tillet, Philip Pronin, Philippe Tillet, Prafulla Dhariwal, Qiming Yuan, Rachel Dias, Rachel Lim, Rahul Arora, Rajan Troll, Randall Lin, Rapha Gontijo Lopes, Raul Puri, Reah Miyara, Reimar Leike, Renaud Gaubert, Reza Zamani, Ricky Wang, Rob Donnelly, Rob Honsby, Rocky Smith, Rohan Sahai, Rohit Ramchandani, Romain Huet, Rory Carmichael, Rowan Zellers, Roy Chen, Ruby Chen, Ruslan Nigmatullin, Ryan Cheu, Saachi Jain, Sam Altman, Sam Schoenholz, Sam Toizer, Samuel Miserendino, Sandhini Agarwal, Sara Culver, Scott Ethersmith, Scott Gray, Sean Grove, Sean Metzger, Shamez Hermani, Shantanu Jain, Shengjia Zhao, Sherwin Wu, Shino Jomoto, Shirong Wu, Shuaiqi, Xia, Sonia Phene, Spencer Papay, Srinivas Narayanan, Steve Coffey, Steve Lee, Stewart Hall, Suchir Balaji, Tal Broda, Tal Stramer, Tao Xu, Tarun Gogineni, Taya Christianson, Ted Sanders, Tejal Patwardhan, Thomas Cunninghman, Thomas Degry, Thomas Dimson, Thomas Raoux, Thomas Shadwell, Tianhao Zheng, Todd Underwood, Todor Markov, Toki Sherbakov, Tom Rubin, Tom Stasi, Tomer Kaftan, Tristan Heywood, Troy Peterson, Tyce Walters, Tyna Eloundou, Valerie Qi, Veit Moeller, Vinnie Monaco, Vishal Kuo, Vlad Fomenko, Wayne Chang, Weiyi Zheng, Wenda Zhou, Wesam Manassra, Will Sheu, Wojciech Zaremba, Yash Patil, Yilei Qian, Yongjik Kim, Youlong Cheng, Yu Zhang, Yuchen He, Yuchen Zhang, Yujia Jin, Yunxing Dai, and Yury Malkov. Gpt-4o system card, 2024. 3
[34] Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna Mitenkova, Jane Wang, Jeff Clune, Demis Hassabis, Raia Hadsell, Adrian Bolton, Satinder Singh, and Tim Rocktäschel. Genie 2: A large-scale foundation world model, 2024. 1
[35] PixVerse. Pixverse. [Link] 2024. 5
[36] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024. 4, 7
[37] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sean Bell, Sharadh Ramaswamy, Shelly Sheynin, Siddharth Bhattacharya, Simran Motwani, Tao Xu, Tianhe Li, Tingbo Hou, Wei-Ning Hsu, Xi Yin, Xiaoliang Dai, Yaniv Taigman, Yaqiao Luo, Yen-Cheng Liu, Yi-Chiao Wu, Yue Zhao, Yuval Kirstain, Zecheng He, Zijian He, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu, Arun Mallya, Baishan Guo, Boris Araya, Breena Kerr, Carleigh Wood, Ce Liu, Cen Peng, Dimitry Vengertsev, Edgar Schonfeld, Elliot Blanchard, Felix Juefei-Xu, Fraylie Nord, Jeff Liang, John Hoffman, Jonas Kohler, Kaolin Fire, Karthik Sivakumar, Lawrence Chen, Licheng Yu, Luya Gao, Markos Georgopoulos, Rashel Moritz, Sara K. Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, and Yuming Du. Movie gen: A cast of media foundation models. [Link]gen-media-foundation-models-generative-ai-video/, 2025. 1
[38] Runway. Gen-3. [Link]introducing-gen-3-alpha/, 2024. 1, 5
[39] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. In CVPR, 2023. 4
[40] Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. In SIGGRAPH (Conference Paper Track), 2024. 2
[41] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 4
[42] Spencer Sterling. Zeroscope v2 576w. https://huggingface.co/cerspense/zeroscope_v2_576w, 2024. 5
[43] Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. CoRR, abs/2407.14505, 2024. 2, 4
[44] Pika Team. Pika art. [Link] 2024. 5
[45] WorldLab Team. Worldlab. https://www.[Link]/blog, 2024. 1
[46] Wan Team. Wan: Open and advanced large-scale video generative models, 2025. 1
[47] Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, and Bin Cui. Videotetris: Towards compositional text-to-video generation, 2024. 5
[48] Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024. 2
[49] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023. 5
[50] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. CoRR, 2024. 2
[51] Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, and Mohit Bansal. Dreamrunner: Fine-grained storytelling video generation with retrieval-augmented motion adaptation. arXiv preprint arXiv:2411.16657, 2024. 2
[52] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In SIGGRAPH (Conference Paper Track), 2024. 2
[53] Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. Draganything: Motion control for anything using entity representation. In European Conference on Computer Vision, pages 331–348. Springer, 2024. 1, 2
[54] Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024. 2
[55] Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and CUI Bin. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. In Forty-first International Conference on Machine Learning, 2024. 2
[56] Xingyi Yang and Xinchao Wang. Compositional video generation as flow equalization, 2024. 5
[57] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 1, 2, 4, 5
[58] Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023. 2
[59] Shoubin Yu, Jacob Zhiyuan Fang, Jian Zheng, Gunnar Sigurdsson, Vicente Ordonez, Robinson Piramuthu, and Mohit Bansal. Zero-shot controllable image-to-video animation via motion decomposition. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 3332–3341, 2024. 2
[60] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023. 5
[61] Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, et al. Recognize anything: A strong image tagging model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1724–1732, 2024. 4
[62] Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Tora: Trajectory-oriented diffusion transformer for video generation. arXiv preprint arXiv:2407.21705, 2024. 1
[63] Haitao Zhou, Chuang Wang, Rui Nie, Jinxiao Lin, Dongdong Yu, Qian Yu, and Changhu Wang. Trackgo: A flexible and efficient method for controllable video generation. CoRR, 2024. 2
[64] Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self-attention for long-range image and video generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 2
Appendix

In this appendix, we present the following:
• Noise inversion details using DPM-Solver++ [27] in Sec. A.
• MLLM prompts we use to collect background description, foreground object layout and trajectory, and α used to determine how much noise to inject during inversion in Sec. B.

A. Noise Inversion Details

Here, we describe in detail how we adopt the inversion technique for final video generation with the motion priors prepared in Stage 3. Specifically, we first encode the sequence of images with the planned layout, collected in the previous stage (Sec. 3.2), into the latent space $z$ using a 3D Variational Autoencoder (3D VAE). Then, we perform the forward diffusion process where Gaussian noise is gradually added to the latent. Following the DPM-Solver++ [27] scheduler in CogVideoX, the noised latent at diffusion step $t$ is:

$$z_t = \sqrt{\alpha_t}\, z_0 + \sqrt{1 - \alpha_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). \tag{1}$$

Here, $\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)$ is the cumulative noise schedule, and $\epsilon$ represents Gaussian noise.

Then, given a noisy latent $z_t$, we attempt to recover the clean latent as a general reverse denoising starting from step $t$. The model then denoises to $z_{t-1}$ using the DPM-Solver++ method, which provides a high-order approximation of the reverse diffusion process. Specifically, the update equation for $z_{t-1}$ in a second-order solver is:

separate the generation of foreground and background, ensuring that key foreground objects are not mistakenly included in the background. In Fig. 7, we first prompt the MLLM to generate bounding boxes for foreground objects, followed by reasoning for their placement. Additionally, we emphasize that the placement of foreground objects should be informed by background bounding box annotations to improve spatial coherence. Finally, Fig. 8 illustrates the prompt template used to determine the appropriate level of noise injection during inversion. We explicitly incorporate prior knowledge into the template, instructing the MLLM to apply less noise for tasks requiring precise trajectory or layout control and more noise for tasks involving dynamic changes or object actions that cannot be effectively modeled with bounding box plans.
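As an illustration of the forward noising step in Eq. (1) of Sec. A, the following is a minimal NumPy sketch rather than the authors' implementation. The linear β schedule, the toy latent shape, and the function names `cumulative_alpha` and `noise_latent` are assumptions made only for this example; in practice the latent would come from the 3D VAE and the schedule from the T2V backbone's scheduler.

```python
import numpy as np


def cumulative_alpha(betas: np.ndarray) -> np.ndarray:
    """Cumulative schedule alpha_t = prod_{s=1..t} (1 - beta_s).

    Index 0 of the returned array holds alpha_1.
    """
    return np.cumprod(1.0 - betas)


def noise_latent(z0: np.ndarray, alpha_t: float, rng: np.random.Generator):
    """Apply Eq. (1): z_t = sqrt(alpha_t) * z0 + sqrt(1 - alpha_t) * eps."""
    eps = rng.standard_normal(z0.shape)  # eps ~ N(0, I)
    z_t = np.sqrt(alpha_t) * z0 + np.sqrt(1.0 - alpha_t) * eps
    return z_t, eps


# Assumed (hypothetical) linear beta schedule with 1000 steps.
betas = np.linspace(1e-4, 2e-2, 1000)
alphas = cumulative_alpha(betas)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))  # toy stand-in for a 3D VAE latent
z_t, eps = noise_latent(z0, alphas[499], rng)  # noise up to step t = 500
```

Because α_t decreases with t, a larger step keeps less of the planned latent z_0 and more of the Gaussian noise.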
Figure 6. Prompt template used to query background description.
Figure 7. Prompt template used to query foreground object layout and trajectory plan.
Figure 8. Prompt template used to determine how much noise to inject during inversion.
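To make the role of the noise level concrete, here is a hedged sketch of how a noise ratio in [0, 1] could be turned into the step at which denoising starts. This appendix does not specify the exact mapping, so the linear rule below and the names `start_step` and `remaining_steps` are assumptions for illustration only.

```python
def start_step(noise_ratio: float, num_steps: int) -> int:
    """Map a noise ratio in [0, 1] to a starting diffusion step.

    Assumed linear mapping: 0 keeps the planned latent untouched,
    1 noises it all the way to the final step.
    """
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must lie in [0, 1]")
    return round(noise_ratio * num_steps)


def remaining_steps(noise_ratio: float, num_steps: int) -> list[int]:
    """Steps the sampler would still run, from the start step down to 1."""
    return list(range(start_step(noise_ratio, num_steps), 0, -1))


# A low ratio (precise layout control) leaves few denoising steps, so the
# result stays close to the plan; a high ratio leaves the model more freedom.
precise = remaining_steps(0.25, 50)
dynamic = remaining_steps(0.75, 50)
```

Under this assumed mapping, prompts that need precise trajectory or layout control would use a smaller ratio than prompts involving dynamic changes, which matches the guidance given to the MLLM in the Fig. 8 template.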