Text-to-Video AI Models: A New Hype in Multimodal Machine Learning

October 7, 2022

Kasra Darvish

2-Minute Read

2-minute video generated from text by Phenaki

Multimodal research has seen a lot of hype over the past few days: several text-to-video models were released while text-to-image models are still heavily hyped and far from perfect. Diffusion models in general, such as GLIDE, and Stable Diffusion in particular, are currently the state of the art for text-to-image tasks. However, several recent papers tackle the task of generating videos from text prompts or still images. Here are some of the models, all released at about the same time:

Make-A-Video by Meta
Imagen Video by Google
Phenaki, apparently by Google Brain (it is under anonymous submission at ICLR, but the paper is on arXiv)
Lumiere: A Space-Time Diffusion Model for Video Generation by Google Research (a later entry, released in January 2024)

The 2-minute video example generated by Phenaki shown below is mind-blowing.

2-minute video

Below are the prompts they used to generate this video.

Lots of traffic in futuristic city. An alien spaceship arrives to the futuristic city. The camera gets inside the alien spaceship. The camera moves forward until showing an astronaut in the blue room. The astronaut is typing in the keyboard. The camera moves away from the astronaut. The astronaut leaves the keyboard and walks to the left. The astronaut leaves the keyboard and walks away. The camera moves beyond the astronaut and looks at the screen. The screen behind the astronaut displays fish swimming in the sea. Crash zoom into the blue fish. We follow the blue fish as it swims in the dark ocean. The camera points up to the sky through the water. The ocean and the coastline of a futuristic city. Crash zoom towards a futuristic skyscraper. The camera zooms into one of the many windows. We are in an office room with empty desks. A lion runs on top of the office desks. The camera zooms into the lion’s face, inside the office. Zoom out to the lion wearing a dark suit in an office room. The lion wearing looks at the camera and smiles. The camera zooms out slowly to the skyscraper exterior. Timelapse of sunset in the modern city.
reference: https://phenaki.video

Progress in AI and machine learning is remarkably fast, and multimodal research in particular has received a lot of attention. It was only in May 2022 that Google released Imagen, a text-to-image model. A few months later, in October 2022, they released a similar model for generating videos.
