Text-Guided Video Completion Framework

The document presents a project on Text-Guided Video Completion (TVC) using the Multimodal Masked Video Generation (MMVG) framework to enhance video generation from incomplete frames and text prompts. The proposed system aims for high-quality visual output and temporal coherence, making it applicable in various fields such as entertainment and surveillance. It introduces a novel masking strategy and combines visual and textual data to improve video synthesis, with experimental results demonstrating its effectiveness through Fréchet Video Distance (FVD) scores.


Integrating Text Cues for Video Completion with Multimodal Masked Video Creation

Dr. J. C. MIRACLIN JOYCE PAMILA
Department of CSE, Government College of Technology, Coimbatore, India.
miraclin@[Link]
Anumitha R D, Kayalvizhi T, Thoufeeq A, Ruthra L

Abstract:

The "Integrating Text Cues for Video Completion with Multimodal Masked Video Creation"
project seeks to improve the potential of video generation from incomplete frames and text
prompts to create seamless video completions. The new solution, known as Text-Guided Video
Completion (TVC), applies the Multimodal Masked Video Generation (MMVG) framework,
which employs temporal-aware visual tokens and masking to predict, rewind, and complete
missing sections of a video. The system seeks to ensure temporal coherence and high-quality
visual output and retain alignment with the textual input. MMVG is trained on varied datasets using
methods such as Temporal-Aware Vector Quantization (T-VQ) and multimodal fusion to manage
complex video scenarios. The expected outcomes are enhanced controllability and flexibility in
video editing, making the system applicable for use in entertainment, gaming, training
simulation, and surveillance. The project opens the door for further improvement in real-time
video generation, sensor integration, and responsiveness to changing video domains, seeking to
widen the limits of automated video creation and editing technology.

1. Introduction:

Video editing and creation are of paramount importance in many fields, such as entertainment, surveillance, and autonomous systems, where coherent stories and smooth continuity are crucial. However, the highly dynamic nature of these applications tends to produce sparse or incomplete video data, which calls for sophisticated mechanisms to reconstruct or complete the lost segments efficiently. This paper addresses the problem with Text-Guided Video Completion (TVC), a novel paradigm that combines natural language instructions with incomplete video frames to generate coherent, high-quality video content.

The core innovation is the Multimodal Masked Video Generation (MMVG) framework, in which a masking mechanism and temporally aware visual tokens are used to perform prediction, rewind, and completion of video segments. MMVG leverages both visual and textual signals so that the synthesized videos are temporally consistent, have smooth transitions, and adhere to the intended narrative.

By bringing these operations under a unified framework, the project enhances the flexibility and accuracy of video synthesis and thus opens up a range of applications. The approach addresses existing video editing problems, from reconstructing damaged surveillance footage to generating dynamic video streams in video games and animation. Additionally, the use of natural language descriptions makes the system applicable to real-world human interaction contexts.

2. Related work:

A. Using Prior for Text-to-Video Diffusion Model:

Designing realistic motion remains a primary challenge in text-to-video synthesis. A variety of approaches use prior knowledge to address this challenge. ControlVideo uses pre-built motion data, such as depth maps or edge maps, as conditions for video diffusion models and emphasizes the importance of motion for generating realistic videos. GD-VDM follows a two-stage approach: depth-based videos are generated first, and a dedicated Vid2Vid diffusion model then generates coherent real-world videos, mainly for autonomous driving cases. Its applicability to general cases is limited by the lack of depth training data. Other methods include Make-Your-Video, which relies on a dedicated depth estimator to obtain motion information directly from driving videos without explicit depth generation. Leo deviates from other approaches in that it trains a motion diffusion model to output motion representations, which are decoded as optical flows and used to animate static images. Other techniques, such as latent space displacement, noise correlation, and textual descriptions of motion, have also been tried as conditions for video generation models. Recent developments such as PYoCo introduce a video noise prior for diffusion models, which allows low-cost fine-tuning of text-to-image models for video.

B. Video action recognition:

Video action recognition is a central challenge in computer vision, in which the temporal relationships among frames must be understood in order to comprehend video content. Several methods start from image-based paradigms and add architectural modifications to enable temporal modeling. These approaches typically either fine-tune the full model or train new models from scratch on large video datasets, both of which are computationally expensive. Recently, a research trend has emerged that approaches action recognition through state transitions. This viewpoint represents actions as transformations that alter objects and the surrounding environment. Nevertheless, most of these methods look at state transitions in the entire scene, where various objects and their semantic interpretations are deeply intertwined. This intertwining makes it challenging to detect state transitions at the object level.

C. Generative Adversarial Networks for Video Generation

Recent video generation technology has been built on models such as Generative Adversarial Networks (GANs), autoregressive models, and, more recently, transformer models. GANs have shown that they can generate high-quality video sequences by using adversarial training to capture the complex distribution of video data. For instance, MoCoGAN (Motion and Content GAN) separates video generation into two independent streams, one handling content and the other motion, enabling the model to generate longer and more coherent video sequences. The addition of temporal consistency constraints has further enhanced the capability of GANs to generate smooth and realistic frame-to-frame transitions, which is a prerequisite for video completion and motion synthesis tasks.

3. Proposed System:

Experimental results demonstrate the significance of instructions for controllable video completion. The proposed MMVG framework successfully addresses all three TVC tasks. This is made possible by a new masking strategy that improves temporal modeling, which in turn enhances overall video generation and prediction. The main contributions of this work are as follows:

1. Introducing TVC, which synthesizes videos from incomplete frames with natural language commands guiding the temporal evolution.
2. Developing MMVG, a single model with a new masking scheme capable of addressing all TVC tasks in a single training step.
3. Showing through extensive experiments that MMVG successfully handles different video completion situations and facilitates video generation and prediction. This makes TVC a promising direction in vision-and-language research.

4. Methodology:

The MMVG (Multimodal Masked Video Generation) text-guided video generation model combines textual instructions with partially incomplete video inputs to produce or complete video fragments. This is achieved through a multi-stage pipeline of visual tokenization, multimodal encoding, temporal reasoning, and video decoding. The following discusses each module in detail along with the underlying methodology:

4.1 Temporal-Aware Visual Tokenization (T-VQ)

In the first step, the model uses a VQGAN-based approach to convert video frames into discrete visual tokens. The key innovation in this step is the addition of temporal awareness to the tokenization process, referred to as T-VQ. A VQ encoder maps each video frame to a latent representation. This representation is then quantized to the nearest vector in the codebook, which yields the discrete tokens for the video frame in the latent space.

Temporal coherence is maintained by training the model to understand the temporal order of video frames. Using a contrastive temporal reasoning objective, the model is trained to recognize the relative temporal order of different frames. This is necessary for ensuring that the output video has continuity and smooth transitions and thus mimics the natural flow of time seen in real video content.

4.2 Multimodal Masking Strategy

One distinctive feature of this architecture is the masking strategy used to train the model on incomplete video inputs. During training, the majority of the frames in the video are randomly occluded and only a few frames are kept intact. The occluded content is replaced by a special [SPAN] token, which serves as a placeholder. The objective is for the model to recover or synthesize the occluded frames from the visible frames and the provided textual description. This strategy teaches the model video completion, that is, filling in the missing frames in a way that is temporally and contextually consistent with the rest of the video content.

The adaptive aspect of this masking approach enables the model to concentrate on frames that are harder to recover by scaling the probability of masking according to the difficulty of recovering a frame. This lets the model learn more from hard examples and, consequently, enhances its generative capability.

4.3 Textual Tokenization and Cross-Modal Fusion

In addition to the video information, the text input is tokenized into discrete word tokens by a pre-trained tokenizer. These tokens capture the semantic meaning of the text and are used to guide the video generation process. Because both the video frames and the text are represented as discrete tokens, the model can map both modalities to a common latent space. This allows visual and linguistic information to be fused, which is necessary for text-to-video generation.

To combine the visual and textual data, a multimodal encoder is employed. The encoder processes the word tokens and the visual tokens in parallel, enabling the model to capture inter-modal relationships. The fusion is performed by a self-attention mechanism, as commonly used in transformer models, which lets the system attend to important regions of both the video and the text simultaneously. The resulting combined features hold both the visual context of the video and the semantic meaning of the text, allowing for correct video generation based on the given textual guidance.
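The masking step of Section 4.2 and the joint token sequence of Section 4.3 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-frame use of one [SPAN] token, the reading of the sampling rate p as a per-frame masking probability, and the mapping of prediction, rewind, and infilling to visible first and last frames are all assumptions made for illustration. The adaptive adjustment of the masking probability is not modeled.

```python
import random

SPAN = "[SPAN]"  # placeholder token named in Section 4.2


def mask_frames(frame_tokens, keep):
    """Replace every frame whose index is not in `keep` with the SPAN placeholder."""
    return [tokens if i in keep else SPAN for i, tokens in enumerate(frame_tokens)]


def keep_indices_for_task(num_frames, task):
    """Visible frame indices for the three TVC tasks (assumed mapping)."""
    if task == "prediction":  # complete the video from the first frame
        return {0}
    if task == "rewind":  # complete the video from the last frame
        return {num_frames - 1}
    if task == "infilling":  # complete the video between first and last frame
        return {0, num_frames - 1}
    raise ValueError(f"unknown task: {task}")


def random_keep_indices(num_frames, p, rng):
    """Training-time masking: each frame is masked with probability p.

    At least one frame is always kept so the model has visual context.
    """
    keep = {i for i in range(num_frames) if rng.random() >= p}
    if not keep:
        keep = {rng.randrange(num_frames)}
    return keep


def build_encoder_input(text_tokens, masked_frames):
    """Flatten word tokens and (possibly masked) frame tokens into one sequence."""
    sequence = list(text_tokens)
    for frame in masked_frames:
        if frame == SPAN:
            sequence.append(SPAN)
        else:
            sequence.extend(frame)
    return sequence


# Toy example: four frames with two visual tokens each.
frames = [[11, 12], [21, 22], [31, 32], [41, 42]]
instruction = ["open", "drawer"]
visible = keep_indices_for_task(len(frames), "infilling")
print(build_encoder_input(instruction, mask_frames(frames, visible)))
# ['open', 'drawer', 11, 12, '[SPAN]', '[SPAN]', 41, 42]
```

In this sketch the three tasks differ only in which frames stay visible, which is what allows a single model to be trained on all of them by varying the mask.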
4.4 Video Decoding and Generation

Once both the visual and text inputs are encoded, the model proceeds to the decoding stage, in which the aim is to synthesize the missing frames and recover the full video. The decoder is autoregressive, producing each token conditioned on the tokens that have already been produced and on the combined features extracted from the encoder. The autoregressive approach ensures that the model takes the temporal dependencies between frames into account and produces the video in a coherent fashion.

Cross-entropy loss is used in the decoding process so that the predicted video tokens correspond to the ground-truth tokens in the target video. Minimizing this loss during training improves the model's ability to produce accurate frames. Furthermore, the decoder can attend over both space and time using methods such as 3D shifted windows, which operate on blocks of video patches across neighboring frames.

4.5 Reconstruction and Video Completion

Finally, once the discrete visual tokens for every frame have been produced, the VQ decoder is used to synthesize the video frames. This converts the latent visual tokens into pixel space in order to generate a full video. The aim is to generate a visually and temporally coherent video that respects both the frame order and the text guidance.

During training, the model is taught to minimize a number of loss functions, such as reconstruction loss, temporal coherence loss, and cross-entropy loss for video decoding. These losses make the generated videos not only accurate in terms of per-frame content but also smooth and continuous in terms of temporal flow.

4.6 Unified Learning Framework

The model is designed to perform well on multiple video generation tasks. It is trained under variable masking conditions and learns to fill in videos from partial input at any arbitrary time step. This makes it possible for the model to handle various video generation tasks, such as video completion, text-to-video synthesis, and video interpolation, in a single framework.

In addition, the model does not strictly depend on chronological guidance, which gives the system flexibility. It can also generate videos from unordered sets of frames with the guidance of the textual input, which makes the system versatile across video generation applications.

5. Evaluation:

Fréchet Video Distance (FVD) is a measure used to quantify the similarity between two collections of video features, usually those of generated videos and those of ground-truth (actual) videos. It measures how close the distribution of features extracted from generated videos is to that of actual videos, and it gives a quantitative measure of video generation quality. A smaller FVD value means that the generated video more accurately represents the underlying nature of actual videos.

The FVD score is obtained by first extracting features from video frames with a pretrained model (e.g., InceptionV3, frequently utilized for tasks involving images). The feature vectors of every frame are summarized as a distribution (e.g., by their mean and covariance). The Fréchet distance is then calculated between the two distributions (ground truth and synthetic video) to measure the similarity between them. This method allows FVD to estimate the quality of temporal and spatial coherence, motion dynamics, and overall realism of the produced video.

The table below gives the FVD scores of different actions recorded in videos, representing the performance of the video generation model for every action:

Action                   FVD Score
Turn on light            4.08
Open drawer              6.69
Take cup                 7.10
Open cupboard            9.35
Put cup into cupboard    9.86

Table 1. The table gives the Fréchet Video Distance (FVD) scores of different action categories, measuring the quality and temporal consistency of the generated videos. Lower scores represent better performance, demonstrating the ability of the model to manage different video scenarios effectively.

Analysis of FVD Scores:

1. Turn on light (FVD = 4.08):
○ A fairly low FVD value of 4.08 indicates that the features of the video produced for turning on the light are very close to the distribution of the features in the real video. This action is fairly basic and involves little movement, which could lead to a closer match between the generated and actual video distributions and therefore to a lower FVD value. This shows that the model has learned the most important features of the action with high fidelity.
2. Open drawer (FVD = 6.69):
○ Opening a drawer adds more intricate motion and interaction with items. The FVD score of 6.69 indicates that the generated video is fairly close to the ground truth but still falls short of fully capturing details such as the accurate hand motion or the interaction with the drawer itself. The difficulty in capturing these aspects shows in the higher FVD score compared with the "turn on light" action.
3. Take cup (FVD = 7.10):
○ The FVD score of 7.10 for the "take cup" action also suggests increasing complexity in video generation. The action presumably entails both hand movement and scene change as the cup is picked up. The model probably struggles to capture the finer aspects, including the interaction between hand and cup, or the motion of the cup itself. The larger FVD score indicates a larger mismatch between the synthesized and the actual video for this action.
4. Open cupboard (FVD = 9.35):
○ Opening a cupboard is a more dynamic activity that involves many elements, such as hand movement, door interaction, and possible changes in the scene. The FVD score of 9.35 reflects the added challenge of accurately rendering the scene, capturing realistic motion, and keeping temporal consistency across frames. This score shows that the video produced for this action differs more prominently from the ground truth than those for easier actions such as "turn on light" or "open drawer."
5. Put cup into cupboard (FVD = 9.86):
○ Placing a cup into a cupboard is a multi-stage process: lifting the cup, transporting it towards the cupboard, and precisely placing it within. The process is extremely dynamic and demands fine motor control and precise interaction with the environment. The FVD score of 9.86, the highest in this set, indicates that the model performs worst on this action. The complexity of the object manipulation and the accuracy of motion involved make it more difficult for the model to produce videos that are close to the ground truth. The high FVD score suggests that the model has a lot of room for improvement in producing realistic and coherent sequences for more intricate tasks.

Figure 2. This plot represents the Fréchet Video Distance (FVD) scores for different action categories, offering a quantification of the generated video quality and temporal coherence. Lower FVD scores represent better performance, pointing to the model's capacity to deal effectively with different video generation situations while maintaining temporal coherence.

Interpreting FVD Scores:

● Lower FVD Scores: A low FVD score means that the generated video feature distribution is very close to the ground-truth video feature distribution. This indicates that the model can effectively mimic simple actions.
● Higher FVD Scores: A higher FVD score means that the generated video differs more from the actual video distribution. This is usually true for more complex actions involving complex interactions, nuanced motion, or environmental changes.
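The Fréchet distance between two Gaussian feature distributions, as used in Section 5, can be illustrated with a small sketch. The general form, d² = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)), needs a matrix square root and is normally computed with a numerical library. The sketch below is a simplification that assumes diagonal covariances, in which case the trace term reduces to a sum of squared differences of per-dimension standard deviations. The function names and the toy feature vectors are hypothetical, and the code does not reproduce the scores reported in Table 1.

```python
import math
from statistics import fmean, pvariance


def gaussian_stats(features):
    """Per-dimension mean and population variance of equal-length feature vectors."""
    columns = list(zip(*features))
    means = [fmean(column) for column in columns]
    variances = [pvariance(column) for column in columns]
    return means, variances


def frechet_distance_diagonal(real_features, generated_features):
    """Squared Fréchet distance between two Gaussians with diagonal covariance.

    Simplifying assumption: feature dimensions are treated as uncorrelated,
    so only per-dimension means and variances are compared.
    """
    mu_r, var_r = gaussian_stats(real_features)
    mu_g, var_g = gaussian_stats(generated_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_g))
    return mean_term + cov_term


# Toy two-dimensional features (hypothetical values, not real video features).
real = [[0.0, 1.0], [2.0, 3.0]]
generated = [[1.0, 1.0], [3.0, 3.0]]
print(frechet_distance_diagonal(real, real))       # 0.0
print(frechet_distance_diagonal(real, generated))  # 1.0
```

Identical feature sets give a distance of zero, and the value grows as the means or the spreads of the two sets drift apart, which is why lower scores indicate generated videos that are statistically closer to the real ones.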
6. Experiments:

6.1 Experimental Setup:

Dataset:

Dataset    Train / Val     FPS
Kitchen    16695 / 5804    6

Table 2. The above table gives an overview of the dataset, including the distribution of samples for training and validation, along with the frame rate at which the data was recorded.

For the new task, we use varied video scenes with natural instructions for TVC. The Kitchen dataset has 22K egocentric videos of kitchen activity. The videos are of different lengths (4-16 frames) and come with narrations. All the videos in this dataset are resized to 128x128.

Evaluation Metrics

FVD calculates the distance between the ground-truth and the generated video features [7], giving a quantitative measure of how closely the generated video approximates the actual video in terms of feature distribution. This measure captures both the structure and content of the generated videos relative to the ground truth, enabling an assessment of overall quality and fidelity. To improve the correspondence of video features with their textual descriptions, we fine-tune the CLIP model on every TVC dataset. This tuning enables the model to better comprehend and represent the distinguishing features of every video scene. By matching visual and textual information more accurately, we can check that generated videos are not only visually similar but also contextually consistent with the given instructions.

Implementation Detail

T-VQ employs ResBlocks as the visual auto-encoder, made up of Enc_Q and Dec_Q. Discriminator D has the same architecture as Enc_Q. For the vector quantization, the patch size is 16, reducing a 128x128 video frame to an 8x8 grid of discrete visual tokens. Codebook C has 1024 entries and the hidden embedding size is fixed at 256. T-VQ is optimized using the Adam optimizer with batch size 32 and learning rate 4.5e-6.

MMVG is built as an encoder-decoder structure. Enc_M is a transformer with 24 layers, 16 heads, and a hidden embedding size of 1024. Dec_M follows a similar setup and applies a temporal window size of 3 for VideoSwin. The masking approach M starts with an initial sampling rate p of 0.9 and an adjustment rate α of 0.1. MMVG is optimized with mixed precision training, using the Adam optimizer with a batch size of 4 and a learning rate of 4.5e-6.

All experiments are implemented using PyTorch and conducted on 8 NVIDIA A100 GPUs.

6.2 Main Results

VideoMAE is an MAE [34]-based model that fills in the missing video cubes, achieving TVC by masking all video frames but the first or last (or both). TATS, a state-of-the-art video generation model, also generates videos as discrete visual tokens. As TATS can only look at the past through the autoregressive transformer, it needs separate training for each task. We use MMVG_U as the single model that can support all TVC tasks with a single training, and MMVG_S as the variant that is further trained for each of prediction, rewind, and infilling. TATS is used as the primary baseline, and we examine the significance of the guiding text.

TV Prediction. VideoMAE tries to generate all the frames at once, so it is hard to keep temporal consistency, leading to a large 328.9 FVD on Kitchen. TATS is naturally suited for prediction since it generates the frames sequentially. Yet, our unified MMVG_U outperforms TATS on all datasets (e.g., smaller 105.6 and 124.8 FVD on Kitchen and Flintstones). These findings show that learning from multiple time points does not degrade prediction from the past. On the contrary, our masking approach enhances temporal coherence. MMVG_S also improves performance by training prediction as completion from the head. Using instructions as guidance makes predictions match the anticipated ground-truth outcomes better. Likewise, MMVG_U with text outperforms TATS, despite not being specifically trained for prediction. The task-specific MMVG_S further improves on the unified model.

TV Rewind. Rewinding from the final frame requires the model to picture what occurred earlier and to create an appropriate start. Objects can be absent in the final frame (e.g., spoons and forks for "close drawer"), which makes this task more difficult. As with prediction, VideoMAE does not generate plausible rewind outputs. Language is still important for creating an appropriate start. Our unified MMVG_U performs as well as TATS and, on Kitchen and Flintstones, better than it. By learning completion from incomplete frames, the autoregressive model is capable of video rewind without explicit training. When MMVG_U is further trained specifically for rewind, as TATS is, the resulting MMVG_S does even better and surpasses TATS entirely.

TV Infilling. We additionally compare with FILM for infilling, which performs video interpolation by synthesizing in-between motion. Although FILM synthesizes the intermediate frames between the beginning and the end, the fast-changing visual dynamics make this process difficult, leading to a higher FVD. Guided by the head and the tail, we see a significant improvement even without text instructions (e.g., smaller FVDs on Kitchen), which helps temporal video modeling. Our unified MMVG_U reaches performance similar to that of TATS, which is trained specifically for the infilling task. This illustrates that partial frame completion at various points in time still improves performance, and MMVG_S further excels on TV Infilling.

6.3 UI Integration

For the project, a simple interface is built with Gradio to let users browse and preview generated videos. The interface shows a list of video files in a predefined directory, making it easy to interact with. The videos are retrieved dynamically by a function that searches the directory for available files, and users can choose a video from a dropdown menu to see its preview. The interface also uses a simple design with tabs and labels, which is easy to use. This integration allows easy access to the generated outputs, thus improving the overall user experience.
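The browsing interface described in Section 6.3 could look roughly like the following sketch. It is an illustration under stated assumptions, not the project's actual code: the directory name `generated_videos`, the `.mp4` suffix, and the function names are hypothetical, and Gradio is imported lazily inside `build_interface` so that the file-listing helpers can be used without it.

```python
from pathlib import Path

VIDEO_DIR = Path("generated_videos")  # hypothetical output directory
VIDEO_SUFFIX = ".mp4"  # assumed file type of the generated videos


def list_videos(directory=VIDEO_DIR, suffix=VIDEO_SUFFIX):
    """Return the sorted file names in `directory` that end with `suffix`."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    return sorted(
        path.name
        for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() == suffix
    )


def video_path(name, directory=VIDEO_DIR):
    """Map a dropdown selection to a file path, or None if nothing is selected."""
    return str(Path(directory) / name) if name else None


def build_interface(directory=VIDEO_DIR):
    """Build a tabbed Gradio app with a dropdown and a video preview."""
    import gradio as gr  # third-party dependency, imported only when the UI is built

    with gr.Blocks(title="Generated videos") as demo:
        with gr.Tab("Preview"):
            dropdown = gr.Dropdown(choices=list_videos(directory), label="Generated video")
            player = gr.Video(label="Preview")
            # Update the player whenever a different file is selected.
            dropdown.change(
                lambda name: video_path(name, directory),
                inputs=dropdown,
                outputs=player,
            )
    return demo
```

Calling `build_interface().launch()` would start a local Gradio server. The file list is read once when the interface is built, so videos generated later appear only after the interface is rebuilt.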
7. Applications and Future Directions:

The use of text prompts for video completion in TVC has vast potential in many industries:

● Entertainment: Facilitate the creation of immersive and interactive storytelling experiences, allowing creators to produce content that lives up to audience demand.
● Gaming: Enhance character animation and scene generation from game narratives or user input, creating realistic environments and experiences.
● Training Simulations: In environments like healthcare or military training, the ability to construct realistic simulations from theoretical situations or tactics would improve training performance.
● Surveillance: Facilitate the recovery of damaged surveillance recordings by using advanced video completion methods, thereby ensuring the recovery of important visual data and offering complete insights during important evaluations. This method improves the capacity to analyze and interpret surveillance system recordings, particularly in cases where data is lost or corrupted.

Future research may focus on developing real-time capabilities, incorporating sensors to enable instant video inspection, and developing adaptive frameworks suited to specific video industries. Continued research will be needed to overcome present limitations and to develop the technologies related to automated video creation and editing.

8. Conclusion:

In short, the integration of textual cues in video generation with the MMVG model is a significant step towards overcoming the constraints of incomplete video information. The synergy of natural language processing and visual completion opens up a spectrum of impactful applications in automatic video generation. As the technology matures, it has the potential to change how we consume and engage with video content in an increasingly expanding digital world.

We present the novel task of text-guided video completion (TVC), in which videos are completed from the start frame, the end frame, or both under the guidance of a text description. Our proposed framework, Multimodal Masked Video Generation (MMVG), features a new masking mechanism based on visual supervision at various temporal positions, which renders it context-invariant. By regulating the masking conditions, MMVG addresses prediction, rewind, and infilling tasks within a unified model. Experimental findings on a range of video scenes illustrate that MMVG performs well on TVC and generative video modeling. We believe that this work will benefit vision-and-language research, providing an anchor point for future development in the area.
References:

1. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10012–10022 (2021)

2. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

3. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

4. Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023.

5. Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, et al. Make-your-video: Customized video generation using textual and structural guidance. arXiv preprint arXiv:2306.00943, 2023.

6. Susung Hong, Junyoung Seo, Sunghwan Hong, Heeseong Shin, and Seungryong Kim. Large language models are frame-level directors for zero-shot text-to-video generation. arXiv preprint arXiv:2305.14330, 2023.

7. Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.

8. Jiaxi Gu, Shicong Wang, Haoyu Zhao, Tianyi Lu, Xing Zhang, Zuxuan Wu, Songcen Xu, Wei Zhang, Yu-Gang Jiang, and Hang Xu. Reuse and diffuse: Iterative denoising for text-to-video generation. arXiv preprint arXiv:2309.03549, 2023.

9. Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22930–22941, 2023.

10. Qiyang Hu, Adrian Waelchli, Tiziano Portenier, Matthias Zwicker, and Paolo Favaro. Video Synthesis from a Single Image and Motion Stroke. In arXiv:1812.01874, 2018.

11. Jiangning Zhang, Chao Xu, Liang Liu, Mengmeng Wang, Xia Wu, Yong Liu, and Yunliang Jiang. DTVNet: Dynamic Time-lapse Video Generation via Single Still Image. In European Conference on Computer Vision (ECCV), 2020.

12. Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. NÜWA: Visual Synthesis Pretraining for Neural visUal World creAtion. In European Conference on Computer Vision (ECCV), 2022.
