From Text to Video: Exploring the Capabilities and Limitations of OpenAI’s Sora

7 min read · Mar 6, 2024

In recent weeks, a 59-second video has been widely shared on social media. The video features a woman walking through the streets of Tokyo, shown in close-up shots that reveal details like her freckles and skin texture.

What’s remarkable about this video is that it was generated entirely by an artificial intelligence (AI) system called Sora, which OpenAI unveiled on February 15, 2024.

Sora is a text-to-video model that can create videos up to one minute long from simple text prompts, depicting both realistic and imaginative scenes.

Technically, Sora is a diffusion model built on a Transformer architecture, and its basic unit of video data is the patch.

More precisely, Sora learns mathematical representations of how patches and pixels change, move, and relate to one another across space and time, and these learned relationships approximate how the physical world plausibly behaves [1].
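To make the idea of a patch concrete, here is a minimal sketch of cutting a tiny video into small blocks that span both space and time. It is an illustration only, not Sora’s implementation, which has not been published: the function name `to_spacetime_patches`, the patch sizes, and the single-channel toy video are all assumptions, and the sketch works on raw pixel values rather than on whatever internal representation the real model uses.

```python
from typing import List

Video = List[List[List[float]]]  # indexed as video[t][y][x], one channel


def to_spacetime_patches(video: Video, pt: int, ph: int, pw: int) -> List[List[float]]:
    """Cut a video into non-overlapping pt x ph x pw blocks and flatten each one."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, frames, pt):          # step through time
        for y0 in range(0, height, ph):      # step down the frame
            for x0 in range(0, width, pw):   # step across the frame
                patch = [
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ]
                patches.append(patch)
    return patches


# A toy 4-frame, 4x4-pixel video whose pixel value encodes its position.
toy_video = [[[t * 100 + y * 10 + x for x in range(4)] for y in range(4)] for t in range(4)]
tokens = to_spacetime_patches(toy_video, pt=2, ph=2, pw=2)
print(len(tokens), len(tokens[0]))  # number of patches, values per patch
```

Each flattened patch plays the role that a word token plays in a language model: a Transformer can treat the video as a sequence of such patches and learn how they relate across space and time.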

In essence, Sora is a large-scale model for understanding the world through images: it learns from visual data and expresses what it has learned visually.

While text-to-speech and text-to-image AI tools have been around for a while, AI video is a relatively new field. Even so, Sora has already set a new standard for AI-generated videos.

Other AI video tools like Runway Gen 2 and Pika struggle to keep a video coherent for more than a few seconds, while Sora can sustain a clip for up to a minute.

1. What is Sora and why does it have such magic powers?

Sora is OpenAI’s text-to-video generative AI model. Given a text prompt such as “a giant duck walks through the streets in Boston”, it creates a video of up to 60 seconds that matches the description [2]. Here’s an example from the OpenAI site:

https://www.youtube.com/watch?v=gHp5dT7Yntg

The videos generated by Sora are highly detailed and contain complex scenes, vivid character expressions, and intricate camera movements that other video models have not been able to match.

So how does Sora work?

In essence, Sora works on the principles of diffusion models, similar to text-to-image generative AI models such as DALL-E 3, Stable Diffusion, and Midjourney.

According to its official website, Sora is a diffusion model that generates a video by starting with what looks like static noise and gradually transforming it by removing the noise over many steps. Sora is capable of generating entire videos at once, or extending generated videos to make them longer [3].
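The "start from static, remove noise over many steps" idea can be sketched in a few lines. This is a toy under stated assumptions, not Sora’s sampler: the `denoise` function, the step rule, and the five-value signal are hypothetical, and the `oracle` function cheats by knowing the clean signal, standing in for the trained neural network that a real diffusion model uses to predict the noise.

```python
import random
from typing import Callable, List

Signal = List[float]


def denoise(noisy: Signal, predict_noise: Callable[[Signal], Signal], steps: int) -> List[Signal]:
    """Remove predicted noise a little at a time and return every intermediate state."""
    x = list(noisy)
    history = [x]
    for step in range(steps):
        predicted = predict_noise(x)
        # Remove an equal share of the remaining noise at each step,
        # so that the final step removes whatever is left.
        fraction = 1.0 / (steps - step)
        x = [value - fraction * noise for value, noise in zip(x, predicted)]
        history.append(x)
    return history


# Hypothetical "clean" signal standing in for a video.
clean = [0.0, 0.5, 1.0, 0.5, 0.0]


def oracle(x: Signal) -> Signal:
    """Stand-in for a trained network: treats everything that is not the clean signal as noise."""
    return [xi - ci for xi, ci in zip(x, clean)]


def error(x: Signal) -> float:
    """Largest deviation from the clean signal."""
    return max(abs(xi - ci) for xi, ci in zip(x, clean))


rng = random.Random(0)
static = [rng.gauss(0.0, 1.0) for _ in clean]  # pure random "static"
history = denoise(static, oracle, steps=50)
print(error(history[0]), error(history[25]), error(history[-1]))
```

The printed errors shrink from the initial static to the final result. In a real diffusion model the noise prediction comes from a network conditioned on the text prompt, which is what steers the static toward a video that matches the description.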

The Sora team has successfully overcome the challenge of ensuring that a subject stays the same even when it goes out of view temporarily, by enabling the model to anticipate multiple frames of content.


In addition, Sora’s deep understanding of language lets it interpret prompts accurately: it grasps not only what the user asks for but also how those things exist in the real world.

Sora is capable of creating complex scenes that can include not only multiple characters but also specific types of actions, as well as highly detailed representations of objects and backgrounds.

Moreover, Sora can sometimes simulate behaviors that affect the state of the world in simple ways.

For instance, a painter can leave new brush strokes on a canvas, or a person eating a burger can leave bite marks on the burger over time.

Sora can also create multiple shots within the same video while keeping characters and visual style consistent.

This level of multi-camera consistency is completely out of reach for both Gen 2 and Pika.

https://twitter.com/billpeeb/status/1758960998315135360

Images and videos carry far more information than text, especially when they depict the real world.

Training on a massive number of videos has allowed Sora to build a basic “understanding” of how things change across space and time in the visual world, at both large and small scales.

As Tim Brooks, a research scientist on the project, says, “It learns about 3D geometry and consistency. We didn’t bake that in; it just entirely emerged from seeing a lot of data.” [4]

For these reasons, many believe that Sora lays the groundwork for understanding and modeling the real world and is an important step toward achieving artificial general intelligence (AGI).

2. Sora’s Limitations and Perspectives

According to OpenAI, the current model has weaknesses. Sora does not always obey real-world physical rules, because it has no explicit model of physics.

OpenAI openly displays some of the model’s failures on its website.

Sora may struggle to simulate the physics of a complex scene accurately, and it may not understand specific instances of cause and effect.

For instance, certain characters, animals, or objects may vanish, deform, or replicate over time. Furthermore, some clips contradict basic physics, such as a glass of liquid that breaks unrealistically, a basketball passing through the rim of a hoop, or a chair floating and moving on its own.

According to OpenAI, the model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory [3].

The following examples demonstrate that some of Sora’s animations may not be entirely realistic. In certain instances, characters may appear to move in physically impossible ways, and animals or people can suddenly appear, particularly in scenes with many elements.

In one video, a man runs backward on a treadmill while maintaining a realistic gait.

In another, a woman tries to blow out her birthday candles, but the flames do not change at all.

In the scene of a cat demanding breakfast, the cat moves its left front paw and another appendage sprouts to replace it.

Similarly, a group of gray wolf pups merge and reemerge with mesmerizing weirdness.

https://www.youtube.com/watch?v=jspYKxFY7Sc

The examples below show that Sora is confounded by how to light a cigarette and renders an ant with only four legs.

https://www.washingtonpost.com/technology/interactive/2024/ai-video-sora-openai-flaws/

3. Sora: Data-driven rather than explicit physical modeling

Natalie Summers, a representative for the Sora project, told The Washington Post that Sora was developed by training an AI algorithm on countless hours of video footage [6]. These videos were sourced from both licensed providers and publicly available data on the Internet. By processing this extensive video content, the AI is able to learn and recognize specific objects and concepts.

Furthermore, OpenAI’s own technical report acknowledges that while Sora can simulate certain aspects of the physical world, it does not accurately model the physics of many basic interactions, such as glass shattering [5].

Other interactions, like eating food, do not always yield correct changes in the object state. This further supports the notion that Sora’s capabilities are based on learned data patterns rather than a deep, principled understanding of physics.

Therefore, maadaa.ai has reason to believe that while Sora exhibits emergent capabilities that simulate aspects of the physical world, its understanding rests on data-driven patterns rather than explicit physical modeling.

For now, OpenAI is taking a cautious approach to releasing the tool. A dedicated group of “red teamers” is currently testing it rigorously to identify potential areas of risk or harm.

As technology continues to advance and innovate, the boundaries between virtual and real are becoming harder to distinguish. Sora, a video model, has shown great potential for simulating real-world phenomena and behaviors.

However, it’s important to recognize that Sora’s journey to full-fledged utility is far from over. At this stage, Sora offers a limited set of capabilities and isn’t flawless. Despite these limitations, maadaa.ai remains optimistic about the future of video creation technologies. We believe that as they continue to evolve, text-to-video AI tools will deliver increasingly realistic and natural visual experiences.

References:

[1] https://openai.com/research/video-generation-models-as-world-simulators

[2] https://www.sciencefocus.com/future-technology/openai-sora

[3] https://openai.com/sora

[4] https://www.wired.com/story/openai-sora-generative-ai-video/

[5] https://openai.com/research/video-generation-models-as-world-simulators

[6] https://www.washingtonpost.com/technology/interactive/2024/ai-video-sora-openai-flaws/


Written by maadaa.ai
