
VideoPhy

Evaluating Physical Commonsense in Video Generation

1University of California, Los Angeles
2Google Research

Abstract

Recent advances in internet-scale video data pretraining have led to the development of text-to-video generative models that can create high-quality videos across a broad range of visual concepts and styles. Due to their ability to synthesize realistic motions and render complex objects, these generative models have the potential to become general-purpose simulators of the physical world. However, it is unclear how close existing text-to-video generative models are to this goal.

To this end, we present VideoPhy, a benchmark designed to assess whether generated videos follow physical commonsense for real-world activities (e.g., marbles will roll down when placed on a slanted surface). Specifically, we curate a list of 688 captions that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., VideoCrafter2) and closed models (e.g., Lumiere from Google, Pika). Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts, and their outputs also lack physical commonsense.

Specifically, the best-performing model, Pika, generates videos that adhere to both the caption and physical laws for only 19.7% of the instances. VideoPhy thus highlights that video generative models are far from accurately simulating the physical world. Finally, we supplement the dataset with an auto-evaluator, VideoCon-Physics, to assess semantic adherence and physical commonsense at scale.


VIDEOPHY evaluates physical commonsense in video generation. (a) Model performance on the VIDEOPHY dataset using human evaluation. We assess the physical commonsense of the generated videos and their semantic adherence to the conditioning caption; (b) Illustration of poor physical commonsense by various T2V generative models. Here, we show that the generated videos can violate a diverse range of physical laws, such as conservation of mass, Newton's first law, and solid constitutive laws.

Human Leaderboard on Video Generation Models

Human evaluation results on VideoPhy. We abbreviate semantic adherence as SA and physical commonsense as PC. The SA=1, PC=1 column reports the percentage of instances for which both SA=1 and PC=1 (a sketch of how this joint metric is computed appears at the end of this section).

Open Models

# Model Source PC=1 (%) SA=1 (%) SA=1, PC=1 (%)
1 CogVideoX-5B 🥇 Open 53 63.3 39.6
2 VideoCrafter2 🥉 Open 34.6 48.5 19.0
3 CogVideoX-2B Open 34.1 47.2 18.6
4 LaVIE Open 28.0 48.7 15.7
5 SVD-T2I2V Open 30.8 42.4 11.9
6 ZeroScope Open 32.6 30.2 11.9
7 OpenSora Open 23.5 18.0 4.9

Closed Models

# Model Source PC=1 (%) SA=1 (%) SA=1, PC=1 (%)
1 Pika 🥈 Closed 36.5 41.1 19.7
2 Luma Dream Machine Closed 21.8 61.9 13.6
3 Lumiere-T2I2V Closed 25.0 48.5 12.5
4 Lumiere-T2V Closed 27.9 38.4 9.0
5 Gen-2 (Runway) Closed 27.2 26.6 7.6

🚨 To submit your results to the leaderboard, please send a CSV with the generated video URLs and their conditioning captions to this email for human / automatic evaluation.
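For concreteness, the joint SA=1, PC=1 metric can be computed directly from per-video binary annotations. The sketch below uses pandas on a toy table; the model names, labels, and column names are made-up placeholders for illustration, not the benchmark's released annotation format.

```python
import pandas as pd

# Toy stand-in for per-video human annotations: one row per generated video
# with binary judgments for semantic adherence (sa) and physical commonsense
# (pc). All values here are illustrative, not VideoPhy data.
df = pd.DataFrame({
    "model": ["ModelA", "ModelA", "ModelA", "ModelB", "ModelB", "ModelB"],
    "sa":    [1, 1, 0, 1, 0, 0],
    "pc":    [1, 0, 0, 1, 1, 0],
})
df["sa_and_pc"] = ((df["sa"] == 1) & (df["pc"] == 1)).astype(int)

# Percentage of instances per model with PC=1, SA=1, and both jointly.
summary = df.groupby("model")[["pc", "sa", "sa_and_pc"]].mean() * 100
print(summary.round(1).sort_values("sa_and_pc", ascending=False))
```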

Automatic Leaderboard on Video Generation Models

Automatic evaluation results on VideoPhy using our auto-evaluator, VideoCon-Physics. We abbreviate semantic adherence as SA and physical commonsense as PC. PC and SA report the percentage of instances for which the auto-evaluator predicts PC=1 and SA=1, respectively, and Avg. is the mean of the two scores.

Open Models

# Model Source PC SA Avg.
1 CogVideoX-5B 🥇 Open 41 57 49
2 VideoCrafter2 🥉 Open 36 47 41
3 LaVIE Open 36 45 41
4 CogVideoX-2B Open 39 40 39
5 SVD-T2I2V Open 34 37 35
6 ZeroScope Open 42 27 34
7 OpenSora Open 35 21 28

Closed Models

# Model Source PC SA Avg.
1 Luma Dream Machine 🥈 Closed 30 53 41.5
2 Lumiere-T2I2V Closed 25 46 35
3 Lumiere-T2V Closed 31 35 33
4 Pika Closed 33 25 29
5 Gen-2 (Runway) Closed 31 26 29

🚨 To submit your results to the leaderboard, please send a CSV with the generated video URLs and their conditioning captions to this email for human / automatic evaluation.

VideoPhy: Benchmark

Detailed Leaderboard


Human evaluation results on the VideoPhy dataset. We report the percentage of testing prompts for which the T2V models generate videos that adhere to the conditioning caption and exhibit physical commonsense. We abbreviate semantic adherence as SA and physical commonsense as PC. SA=1, PC=1 indicates the percentage of instances for which both SA=1 and PC=1; ideally, a generative model should maximize this metric. The first column reports overall performance, and the subsequent columns report fine-grained performance for interactions between different states of matter in the prompts.
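The same computation extends to this fine-grained breakdown by grouping on the material interaction of each prompt. The sketch below assumes a `state_pair` column with values like solid-solid, solid-fluid, and fluid-fluid; the column name and toy labels are hypothetical, not the released schema.

```python
import pandas as pd

# Toy annotations with an assumed `state_pair` column describing the material
# interaction in each prompt; values are illustrative only.
df = pd.DataFrame({
    "model":      ["ModelA"] * 4 + ["ModelB"] * 4,
    "state_pair": ["solid-solid", "solid-fluid", "fluid-fluid", "solid-solid",
                   "solid-solid", "solid-fluid", "fluid-fluid", "fluid-fluid"],
    "sa":         [1, 1, 0, 1, 0, 1, 1, 0],
    "pc":         [1, 0, 0, 1, 1, 1, 0, 0],
})
df["sa_and_pc"] = ((df["sa"] == 1) & (df["pc"] == 1)).astype(int)

# Joint SA=1, PC=1 percentage per model and per interaction category.
breakdown = df.pivot_table(index="model", columns="state_pair",
                           values="sa_and_pc", aggfunc="mean") * 100
print(breakdown.round(1))
```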

Data Statistics

VideoCon-Physics: Auto Evaluation of Physical Commonsense Alignment

We build our auto-evaluator, VIDEOCON-PHYSICS, on top of VIDEOCON, an open-source generative video-text language model with 7B parameters that is trained on real videos for robust semantic adherence evaluation. Specifically, we prompt the model to generate a Yes/No response about the semantic adherence and the physical commonsense of a generated video.
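This Yes/No scheme can be turned into a soft score by reading the probability the evaluator assigns to the token "Yes". The sketch below is a minimal illustration of that idea, not the released VideoCon-Physics inference code: `load_videocon_physics`, the prompt wording, and the processor interface are all hypothetical placeholders.

```python
import torch

# Hypothetical loader standing in for the actual VideoCon-Physics checkpoint
# and its video-text processor; the real inference code may differ.
model, processor = load_videocon_physics("videocon-physics-7b")  # placeholder

QUESTIONS = {
    "sa": "Does the video entail the caption: '{caption}'? Answer Yes or No.",
    "pc": "Does the video follow physical commonsense? Answer Yes or No.",
}

def yes_probability(video_path: str, question: str) -> float:
    """Score a video by the probability the evaluator assigns to 'Yes'."""
    inputs = processor(video=video_path, text=question, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    yes_id, no_id = processor.tokenizer.convert_tokens_to_ids(["Yes", "No"])
    return torch.softmax(logits[[yes_id, no_id]], dim=-1)[0].item()

sa_score = yes_probability("generated.mp4", QUESTIONS["sa"].format(caption="..."))
pc_score = yes_probability("generated.mp4", QUESTIONS["pc"])
```

Binary SA/PC predictions, e.g. for the automatic leaderboard above, could then be obtained by thresholding these probabilities, for instance at 0.5.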


Effectiveness of Our Auto-Evaluator


Comparison of ROC-AUC for automatic evaluation methods. We find that VIDEOCON-PHYSICS outperforms diverse baselines, including GPT-4Vision and Gemini-1.5-Pro-Vision, for semantic adherence (SA) and physical commonsense (PC) judgments by a large margin on the testing prompts.
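In the same spirit, the ROC-AUC of any auto-evaluator's scores against the binary human labels can be computed with scikit-learn; the labels and scores below are toy values for illustration, not VideoPhy results.

```python
from sklearn.metrics import roc_auc_score

# Toy binary human labels and auto-evaluator "Yes" probabilities for a handful
# of test videos (illustrative values only).
human_pc = [1, 0, 1, 1, 0, 0]
auto_pc  = [0.81, 0.35, 0.66, 0.72, 0.40, 0.22]
human_sa = [1, 1, 0, 1, 0, 1]
auto_sa  = [0.90, 0.74, 0.30, 0.85, 0.48, 0.61]

print("PC ROC-AUC:", round(roc_auc_score(human_pc, auto_pc), 3))
print("SA ROC-AUC:", round(roc_auc_score(human_sa, auto_sa), 3))
```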