Recent advances in pretraining on internet-scale video data have led to the development of text-to-video generative models that can create high-quality videos across a broad range of visual concepts and styles. Due to their ability to synthesize realistic motions and render complex objects, these generative models have the potential to become general-purpose simulators of the physical world. However, it is unclear how far existing text-to-video generative models are from this goal.
To this end, we present VideoPhy, a benchmark designed to assess whether generated videos follow physical commonsense for real-world activities (e.g., marbles will roll down when placed on a slanted surface). Specifically, we curate a list of 688 captions that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., VideoCrafter2) and closed models (e.g., Lumiere from Google, Pika). Further, our human evaluation reveals that existing models severely lack the ability to generate videos adhering to the given text prompts, and they also lack physical commonsense.
Specifically, the best-performing model, Pika, generates videos that adhere to the caption and physical laws for only 19.7% of the instances. VideoPhy thus highlights that video generative models are far from accurately simulating the physical world. Finally, we supplement the dataset with an auto-evaluator, VideoCon-Physics, to assess semantic adherence and physical commonsense at scale.
[Figure] Qualitative examples of generated videos and their physical-commonsense judgments:
- Gen-2 (poor physical commonsense: conservation of mass violation). Text prompt: "A blender spins, mixing squeezed juice within it."
- Gen-2 (poor physical commonsense: Newton's second law violation). Text prompt: "Water pouring from a watering can onto plants."
- SVD (poor physical commonsense: Newton's second law violation). Text prompt: "Detergent flowing into a bucket of water."
- LaVIE (poor physical commonsense: conservation of mass violation). Text prompt: "A baker scoops flour into a plastic bowl with a metal scoop."
- VideoScope (good physical commonsense). Text prompt: "A feather slowly floats down to the ground."
- VideoCrafter2 (good physical commonsense). Text prompt: "A survivalist strikes a flint to light dry tinder."
Human evaluation results on the VideoPhy dataset. We report the percentage of testing prompts for which the T2V models generate videos that adhere to the conditioning caption and exhibit physical commonsense. We abbreviate semantic adherence as SA and physical commonsense as PC. "SA, PC" denotes the percentage of instances for which SA=1 and PC=1; ideally, a generative model maximizes this metric. The first column reports overall performance, and the subsequent columns report fine-grained performance for interactions between different states of matter in the prompts.
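For concreteness, the joint "SA, PC" metric can be computed as below. This is a minimal sketch assuming annotations are stored as binary SA/PC fields; the field names are illustrative, not from the paper.

```python
# Minimal sketch of the joint SA/PC metric: the percentage of instances
# where both semantic adherence (SA=1) and physical commonsense (PC=1) hold.
# The "sa"/"pc" field names are illustrative assumptions.

def joint_sa_pc(annotations):
    hits = sum(1 for a in annotations if a["sa"] == 1 and a["pc"] == 1)
    return 100.0 * hits / len(annotations)

# Example: three annotated videos, one of which satisfies both criteria.
annotations = [
    {"sa": 1, "pc": 1},
    {"sa": 1, "pc": 0},
    {"sa": 0, "pc": 1},
]
print(f"SA, PC = {joint_sa_pc(annotations):.1f}%")  # -> SA, PC = 33.3%
```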
Key statistics of the VideoPhy dataset.
Top 20 most frequently occurring verbs (inner circle) and their top 4 direct object nouns (outer circle) in our collected captions.
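The paper does not specify its extraction tooling; below is a hypothetical sketch of how such verb/direct-object statistics could be derived from the captions using spaCy dependency parses.

```python
# Hypothetical sketch: count verbs and their direct objects in captions
# with spaCy. The exact tooling used for the figure is an assumption.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

captions = [
    "A blender spins, mixing squeezed juice within it.",
    "A baker scoops flour into a plastic bowl with a metal scoop.",
]

verb_counts = Counter()
object_counts = {}  # verb lemma -> Counter of its direct-object lemmas

for doc in nlp.pipe(captions):
    for token in doc:
        if token.pos_ == "VERB":
            verb_counts[token.lemma_] += 1
            for child in token.children:
                if child.dep_ == "dobj":
                    object_counts.setdefault(token.lemma_, Counter())[child.lemma_] += 1

# Top 20 verbs with their top 4 direct objects, as in the figure.
for verb, _ in verb_counts.most_common(20):
    print(verb, object_counts.get(verb, Counter()).most_common(4))
```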
We use VideoCon, an open-source generative video-text language model with 7B parameters that is trained on real videos, for robust semantic adherence evaluation. Specifically, we prompt VideoCon to generate a Yes/No response indicating whether the generated video adheres to the conditioning text and whether it follows physical commonsense.
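As a rough illustration of this Yes/No protocol, the sketch below wraps a hypothetical videocon_generate function. Both the prompt wording and the inference interface are assumptions for illustration, not the paper's exact templates or the model's real API.

```python
# Schematic sketch of the Yes/No judgment protocol. The prompts and the
# videocon_generate wrapper are illustrative assumptions.

SA_PROMPT = 'Does this video entail the description: "{caption}"? Answer Yes or No.'
PC_PROMPT = "Does this video follow the physical laws of the real world? Answer Yes or No."

def videocon_generate(video_path: str, prompt: str) -> str:
    # Placeholder: replace with the actual VideoCon-Physics inference call.
    return "Yes"

def judge(video_path: str, caption: str) -> dict:
    sa = videocon_generate(video_path, SA_PROMPT.format(caption=caption))
    pc = videocon_generate(video_path, PC_PROMPT)
    # Map the free-form Yes/No responses to binary SA/PC judgments.
    return {
        "sa": int(sa.strip().lower().startswith("yes")),
        "pc": int(pc.strip().lower().startswith("yes")),
    }

print(judge("video.mp4", "A feather slowly floats down to the ground."))
```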
Comparison of ROC-AUC for automatic evaluation methods. We find that VideoCon-Physics outperforms diverse baselines, including GPT-4 Vision and Gemini-1.5-Pro-Vision, by a large margin on semantic adherence (SA) and physical commonsense (PC) judgments for the testing prompts.
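For reference, ROC-AUC against human labels can be computed as follows. This is a minimal sketch assuming each automatic evaluator exposes a probability of answering "Yes"; the data shown is dummy, not from the paper.

```python
# Minimal sketch: score an automatic evaluator against human judgments
# with ROC-AUC. Labels and probabilities below are dummy illustrations.
from sklearn.metrics import roc_auc_score

human_pc_labels = [1, 0, 1, 1, 0]            # human PC judgments (ground truth)
model_yes_probs = [0.9, 0.2, 0.7, 0.6, 0.4]  # evaluator's P("Yes") per video

print(f"PC ROC-AUC: {roc_auc_score(human_pc_labels, model_yes_probs):.3f}")
```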