Text-2-Video Generation: Step-by-Step Guide

Bringing Words to Life: Easy Techniques to Generate Stunning Videos from Text Using Python.

Gif by Author




Diffusion-based image generation models represent a major breakthrough in the field of Computer Vision. Pioneered by models including Imagen, DALL-E, and Midjourney, these advancements demonstrate remarkable capabilities in text-conditioned image generation. For an introduction to the inner workings of these models, you can read this article.

However, the development of Text-2-Video models poses a more formidable challenge. The goal is to achieve coherence and consistency across each generated frame and maintain generation context from the video's inception to its conclusion.

Yet, recent advancements in Diffusion-based models offer promising prospects for Text-2-Video tasks as well. Most Text-2-Video models now fine-tune pre-trained Text-2-Image models, integrate modules that model motion across frames, and train on diverse Text-2-Video datasets like WebVid or HowTo100M.

In this article, we use a fine-tuned model hosted on HuggingFace to generate the videos.






We use the Diffusers library provided by HuggingFace, together with a utility library called Accelerate, which helps PyTorch code run efficiently across the available hardware. Diffusers relies on Accelerate for faster model loading and for the CPU offloading used later in this guide.

First, we must install our dependencies and import relevant modules for our code.

pip install diffusers transformers accelerate torch


Then, import the relevant modules from each library.

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video
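Before loading the model, it can help to confirm that the four packages from the install command are actually available in the current environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of any of these libraries.

```python
from importlib.util import find_spec

# Package names taken from the pip install command above.
REQUIRED_PACKAGES = ("diffusers", "transformers", "accelerate", "torch")


def missing_packages(packages=REQUIRED_PACKAGES):
    """Return the names of packages that cannot be found in this environment."""
    return [name for name in packages if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported as missing, rerunning the pip command in the same environment should resolve it.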


Creating Pipelines


We load the Text-2-Video model provided by ModelScope on HuggingFace into a DiffusionPipeline. The model has 1.7 billion parameters and is based on a UNet3D architecture that generates a video from pure noise through an iterative de-noising process. It works in three stages. First, the model extracts text features from a simple English prompt. Next, a diffusion model maps the text features into the video latent space, where the latents are de-noised. Lastly, the video latents are decoded back to the visual space to produce a short video.
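To picture the three stages before loading the real model, here is a deliberately tiny toy sketch in plain Python. It is not the ModelScope algorithm: every function below is a hypothetical stand-in (word lengths instead of a text encoder, a simple interpolation instead of a learned de-noiser), and it only mirrors the overall flow of text features, iterative de-noising from noise, and decoding to pixel values.

```python
import random


def extract_text_features(prompt):
    # Stage 1 stand-in: one number per word (a real model uses a learned text encoder).
    return [len(word) / 10.0 for word in prompt.split()]


def denoise_latents(text_features, num_frames=4, latent_size=3, steps=25, seed=0):
    # Stage 2 stand-in: start from pure Gaussian noise and move step by step
    # toward a value that depends on the text features.
    rng = random.Random(seed)
    target = sum(text_features) / len(text_features)
    latents = [[rng.gauss(0.0, 1.0) for _ in range(latent_size)]
               for _ in range(num_frames)]
    for step in range(steps):
        remaining = steps - step
        # Close 1/remaining of the gap each step, so the final step lands on the target.
        latents = [[x + (target - x) / remaining for x in frame] for frame in latents]
    return latents


def decode_latents(latents):
    # Stage 3 stand-in: clamp each latent to [0, 1] and scale to 8-bit pixel values.
    return [[round(min(1.0, max(0.0, x)) * 255) for x in frame] for frame in latents]


def toy_text_to_video(prompt, steps=25):
    features = extract_text_features(prompt)
    latents = denoise_latents(features, steps=steps)
    return decode_latents(latents)


frames = toy_text_to_video("Spiderman is surfing")
print(len(frames), "frames of", len(frames[0]), "values each")
```

In the real pipeline each stage is a neural network and the latents are multi-dimensional tensors, but the order of operations is the same.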

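pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Keep model components on the CPU and move each to the GPU only while it runs
pipe.enable_model_cpu_offload()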


Moreover, we use 16-bit floating-point precision to reduce GPU memory usage. In addition, CPU offloading is enabled, which keeps model components on the CPU and moves each one to the GPU only while it is needed at runtime.
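As a rough back-of-the-envelope check on the memory saving, the sketch below estimates how much space the weights alone occupy at each precision, using the 1.7 billion parameter count mentioned above. It ignores activations, intermediate latents, and framework overhead, so real usage will be higher; the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter):
    """Approximate memory needed to hold the model weights, in gigabytes (1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


NUM_PARAMETERS = 1.7e9  # parameter count stated for the model

fp32_gb = weight_memory_gb(NUM_PARAMETERS, 4)  # 32-bit floats use 4 bytes each
fp16_gb = weight_memory_gb(NUM_PARAMETERS, 2)  # 16-bit floats use 2 bytes each

print(f"float32 weights: ~{fp32_gb:.1f} GB")
print(f"float16 weights: ~{fp16_gb:.1f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is why half precision helps the model fit on smaller GPUs.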


Generating Video


prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25).frames
video_path = export_to_video(video_frames)


We pass a prompt to the video generation pipeline, which returns a sequence of generated frames. We use 25 inference steps, so the model performs 25 de-noising iterations. A higher number of inference steps can improve video quality but requires more computation and time.
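To make the role of the inference steps more concrete, the sketch below shows one simple way a sampler could choose which noise levels to visit. It assumes a training schedule of 1000 timesteps, which is a common choice but an assumption here, and uses plain even spacing. The actual DPMSolverMultistepScheduler may space its timesteps differently, and `inference_timesteps` is a hypothetical helper, not a Diffusers function.

```python
def inference_timesteps(num_inference_steps, num_train_timesteps=1000):
    """Pick evenly spaced timesteps, ordered from the noisiest to the cleanest."""
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be at least 1")
    stride = num_train_timesteps / num_inference_steps
    return [round(num_train_timesteps - 1 - i * stride)
            for i in range(num_inference_steps)]


steps = inference_timesteps(25)
print(len(steps), "de-noising iterations")
print("first:", steps[0], "last:", steps[-1])
```

Each selected timestep corresponds to one de-noising iteration, so generation time grows roughly in proportion to the number of steps.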

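The separate image frames are then combined using the Diffusers utility function export_to_video, which writes the video to disk and returns its file path. Because no output path is given, the video is written to a temporary file.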
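The length of the resulting clip follows directly from the number of frames and the frame rate used when writing the file. The sketch below assumes 16 generated frames and 8 frames per second; both values are assumptions that may differ in your installed Diffusers version, so check the pipeline and `export_to_video` defaults before relying on them. The helper name is hypothetical.

```python
def clip_duration_seconds(num_frames, fps):
    """Duration of a clip made of num_frames frames played at fps frames per second."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# Assumed values for illustration only, not confirmed defaults.
ASSUMED_NUM_FRAMES = 16
ASSUMED_FPS = 8

duration = clip_duration_seconds(ASSUMED_NUM_FRAMES, ASSUMED_FPS)
print(f"{ASSUMED_NUM_FRAMES} frames at {ASSUMED_FPS} fps -> {duration} seconds")
```

Generating more frames lengthens the clip but also increases memory use and generation time.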


FinalVideo from Muhammad Arham on Vimeo.




Simple enough! We get a video of Spiderman surfing. Although the video is short and not very high quality, it shows the promise of this approach, which may soon reach results comparable to Text-2-Image models. In the meantime, the model is already good enough for testing your creativity and experimenting. You can use this Colab Notebook to try it out.
Muhammad Arham is a Deep Learning Engineer working in Computer Vision and Natural Language Processing. He has worked on the deployment and optimizations of several generative AI applications that reached the global top charts at Vyro.AI. He is interested in building and optimizing machine learning models for intelligent systems and believes in continual improvement.