
VideoPhy

Evaluating Physical Commonsense In Video Generation

(*, ^, ~, Equal Contribution)
1University of California, Los Angeles
2Google Research

Abstract

Recent advances in internet-scale video data pretraining have led to the development of text-to-video generative models that can create high-quality videos across a broad range of visual concepts and styles. Due to their ability to synthesize realistic motions and render complex objects, these generative models have the potential to become general-purpose simulators of the physical world. However, it is unclear how far we are from this goal with the existing text-to-video generative models.

To this end, we present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities (e.g., marbles will roll down when placed on a slanted surface). Specifically, we curate a list of 688 captions that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., VideoCrafter2) and closed models (e.g., Lumiere from Google, Pika). Further, our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts, and also lack physical commonsense.

Specifically, the best-performing model, Pika, generates videos that adhere to the caption and follow physical laws for only 19.7% of the instances. VideoPhy thus highlights that video generative models are far from accurately simulating the physical world. Finally, we also supplement the dataset with an auto-evaluator, VideoCon-Physics, to assess semantic adherence and physical commonsense at scale.


VIDEOPHY evaluates physical commonsense in video generation. (a) Model performance on the VIDEOPHY dataset using human evaluation. We assess the physical commonsense and semantic adherence to the conditioning caption in the generated videos; (b) Illustration of poor physical commonsense by various T2V generative models. Here, we show that the generated videos can violate a diverse range of physical laws such as conservation of mass, Newton's first law, and solid constitutive laws.

VideoPhy: Benchmark

Detailed Leaderboard


Human evaluation results on the VideoPhy dataset. We report the percentage of test prompts for which the T2V models generate videos that adhere to the conditioning caption and exhibit physical commonsense. We abbreviate semantic adherence as SA and physical commonsense as PC. SA, PC denotes the percentage of instances for which SA=1 and PC=1; ideally, a generative model should maximize this metric. The first column reports overall performance, and the subsequent columns report fine-grained performance for interactions between different states of matter in the prompts. A sketch of how this joint metric is computed follows below.
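To make the headline metric concrete, the following minimal sketch computes the joint SA, PC percentage from per-prompt binary human labels. The field names ("sa", "pc") are hypothetical placeholders, not the released evaluation scripts.

    # Minimal sketch, assuming one binary SA and one binary PC human label per prompt.
    def joint_sa_pc_rate(annotations):
        """Percentage of prompts with SA=1 and PC=1."""
        if not annotations:
            return 0.0
        hits = sum(1 for a in annotations if a["sa"] == 1 and a["pc"] == 1)
        return 100.0 * hits / len(annotations)

    # Example: 1 of 3 prompts satisfies both criteria -> 33.3
    print(joint_sa_pc_rate([{"sa": 1, "pc": 1}, {"sa": 1, "pc": 0}, {"sa": 0, "pc": 1}]))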

Data Statistics

VideoCon-Physics: Auto Evaluation of Physical Commonsense Alignment

We use VIDEOCON, an open-source generative video-text language model with 7B parameters that is trained on real videos, for robust semantic adherence evaluation. Specifically, we prompt VIDEOCON to generate a Yes/No response judging the semantic adherence and physical commonsense of the generated videos.

The score for each judgment is the probability that the model assigns to the Yes response:

$$s(V, c, q) = P_{\theta}(\text{Yes} \mid V, c, q),$$

where $V$ is the generated video, $c$ is the conditioning caption, $q$ is the evaluation question (semantic adherence or physical commonsense), and $\theta$ denotes the model parameters.
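In practice, this Yes-probability can be read off the distribution over the model's first answer token. The sketch below illustrates the idea under an assumed Hugging Face-style interface; model, tokenizer, prompt_ids, and video_features are placeholders, and the actual VideoCon-Physics code may expose a different API.

    import torch

    def yes_probability(model, tokenizer, prompt_ids, video_features):
        """Score = P(Yes | video, caption, question), renormalized over {Yes, No}.
        Assumes a causal video-text LM with a Hugging Face-like forward pass;
        the real VideoCon interface may differ."""
        with torch.no_grad():
            logits = model(input_ids=prompt_ids, video_features=video_features).logits
        next_token_logits = logits[0, -1]  # distribution over the first answer token
        yes_id = tokenizer.convert_tokens_to_ids("Yes")
        no_id = tokenizer.convert_tokens_to_ids("No")
        probs = torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)
        return probs[0].item()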

Effectiveness of Our Auto-Evaluator


Comparison of ROC-AUC for automatic evaluation methods. We find that VIDEOCON-PHYSICS outperforms diverse baselines, including GPT-4Vision and Gemini-1.5-Pro-Vision, for semantic adherence (SA) and physical commonsense (PC) judgments by a large margin on the test prompts.
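For reference, each ROC-AUC number here corresponds to scoring every test video with an automatic evaluator and comparing those scores against binary human labels. A minimal sketch with illustrative (made-up) data:

    # Illustrative only: the labels and scores below are made up.
    from sklearn.metrics import roc_auc_score

    human_pc_labels = [1, 0, 0, 1, 1, 0]                    # human PC judgments
    auto_pc_scores = [0.91, 0.22, 0.40, 0.75, 0.68, 0.10]   # evaluator P(Yes) scores
    print("PC ROC-AUC:", roc_auc_score(human_pc_labels, auto_pc_scores))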