Text-to-Video AI Models: A New Hype in Multimodal Machine Learning - Blog of Kasra Darvish

Kasra Darvish




2-minute video generated from text by Phenaki

Multimodal research has seen a surge of excitement in recent days with the release of several text-to-video models, even though text-to-image models are themselves still new and far from perfect. Diffusion models, such as GLIDE and especially Stable Diffusion, are currently the state of the art for text-to-image generation. Several recent papers, however, tackle the task of generating videos from text prompts or still images. Here are some of these models, all released at about the same time:

  1. Make-A-Video by Meta
  2. Imagen Video by Google
  3. Phenaki, apparently by Google Brain (it's under anonymous submission at ICLR, but the paper is on arXiv)
  4. LUMIERE: A Space-Time Diffusion Model for Video Generation by Google Research

The 2-minute video example generated by Phenaki shown below is mind-blowing.

2-minute video

Below are the prompts they used to generate this video.

Lots of traffic in futuristic city. An alien spaceship arrives to the futuristic city. The camera gets inside the alien spaceship. The camera moves forward until showing an astronaut in the blue room. The astronaut is typing in the keyboard. The camera moves away from the astronaut. The astronaut leaves the keyboard and walks to the left. The astronaut leaves the keyboard and walks away. The camera moves beyond the astronaut and looks at the screen. The screen behind the astronaut displays fish swimming in the sea. Crash zoom into the blue fish. We follow the blue fish as it swims in the dark ocean. The camera points up to the sky through the water. The ocean and the coastline of a futuristic city. Crash zoom towards a futuristic skyscraper. The camera zooms into one of the many windows. We are in an office room with empty desks. A lion runs on top of the office desks. The camera zooms into the lion’s face, inside the office. Zoom out to the lion wearing a dark suit in an office room. The lion wearing looks at the camera and smiles. The camera zooms out slowly to the skyscraper exterior. Timelapse of sunset in the modern city.
reference: https://phenaki.video

Progress in AI and machine learning is remarkably fast, and multimodal research in particular has received a lot of attention. It was only in May 2022 that Google released Imagen, a text-to-image model. A few months later, in October 2022, they released Imagen Video, a similar model for generating videos.




I'm a Ph.D. student interested in Artificial Intelligence, Machine Learning and intelligence in its abstract form