OpenAI Unveils Sora: A Groundbreaking Text-to-Video AI Model

Still image from a video generated using OpenAI’s Sora AI model (Image courtesy: OpenAI)
In Short
  • OpenAI has announced an impressive text-to-video AI model called Sora. It can generate realistic videos up to one minute long.
  • Sora can generate videos at up to 1080p resolution and handles reflections and shadows well. Many experts believe Sora may have been trained on Unreal Engine simulations.
  • It's currently not available to regular users. OpenAI is red-teaming with experts to assess the model for bias, risks, and harms.

Just as Google announced its next-gen Gemini 1.5 Pro model, OpenAI rained on Google’s parade with the surprise announcement of Sora, a breakthrough text-to-video AI model. Sora is unlike anything we have seen so far in the AI industry: judging by the examples OpenAI has shared, existing video generation models such as Runway’s Gen-2 and Pika pale in comparison. Here is everything you need to know about OpenAI’s new Sora model.

Sora Can Generate Videos Up to 1 Minute

OpenAI’s text-to-video AI model, Sora, can generate highly detailed videos (up to 1080p) from text prompts. It follows user prompts closely and simulates the physical world in motion. The most impressive part is that Sora can generate videos up to one minute long, far longer than existing text-to-video models, which typically produce clips of only three or four seconds.

OpenAI has showcased many visual examples to demonstrate Sora’s capabilities. The ChatGPT maker says Sora has a deep understanding of language and can generate “compelling characters that express vibrant emotions.” It can also create multiple shots within a single video, with characters and scenes persisting across them.

That said, Sora has weaknesses too. Currently, it doesn’t always understand the physics of the real world. OpenAI explains, “A person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark.”

As for the model architecture, OpenAI says Sora is a diffusion model built on the transformer architecture. It uses the recaptioning technique introduced with DALL·E 3, which generates highly descriptive captions for the visual training data. Apart from text-to-video generation, Sora can also animate a still image into a video and extend existing videos.
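OpenAI has not published Sora’s code, so what follows is only a generic sketch of the “diffusion” idea mentioned above, not Sora’s implementation. It assumes the textbook DDPM-style forward process with a linear noise schedule (a common default, assumed here), treats a tiny 2-frame clip as a flat list of pixel values, and replaces the transformer denoiser with a do-nothing placeholder. The helper names (`linear_beta_schedule`, `cumulative_alphas`, `add_noise`) are hypothetical.

```python
import math
import random

# Generic DDPM-style forward diffusion on a toy "video".
# Assumption: this illustrates diffusion in general, NOT Sora's actual
# (unpublished) implementation; the schedule values are common defaults.


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T, linearly spaced."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def flatten_clip(clip):
    """Flatten a clip given as frames -> rows -> pixel values into one list."""
    return [px for frame in clip for row in frame for px in row]


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


def mse(pred, target):
    """Training objective: the denoiser learns to predict eps from x_t."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    # A toy 2-frame, 2x2 grayscale clip with pixel values scaled to [-1, 1].
    clip = [[[0.5, -0.5], [1.0, 0.0]],
            [[0.4, -0.6], [0.9, 0.1]]]
    x0 = flatten_clip(clip)
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    for t in (0, 500, 999):
        xt, eps = add_noise(x0, t, alpha_bars, rng)
        # A real model (a transformer, per OpenAI) would predict eps here;
        # predicting all zeros is a trivial placeholder.
        loss = mse([0.0] * len(eps), eps)
        print(f"t={t:4d} signal weight={math.sqrt(alpha_bars[t]):.3f} "
              f"placeholder loss={loss:.3f}")
```

Run as a script, it prints how the weight on the original pixels shrinks toward zero as `t` approaches the final step. Generation runs this process in reverse: it starts from pure noise and repeatedly removes the noise that a trained model predicts.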

Given how closely Sora’s breathtaking videos resemble Unreal Engine 5 (UE5) simulations, many experts believe Sora might have been trained on synthetic data generated with UE5. Sora-generated videos also lack the distorted hands and characters we commonly see in other diffusion models. Some speculate that it may also use Neural Radiance Fields (NeRF) to generate 3D scenes from 2D images.

Whatever the case, OpenAI appears to have made another breakthrough with Sora, and the company’s closing remarks on its blog make its focus on achieving AGI plain:

Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.

Sora is not available for regular users to try at the moment. OpenAI is currently red-teaming the model with experts to evaluate it for harms and risks. The company is also granting access to several filmmakers, designers, and artists to gather feedback and improve the model before a public release.
