Recent advances in pretraining on internet-scale video data have led to the development of text-to-video generative models that can create high-quality videos across a broad range of visual concepts and styles. Due to their ability to synthesize realistic motions and render complex objects, these generative models have the potential to become general-purpose simulators of the physical world. However, it is unclear how far existing text-to-video generative models are from this goal.
To this end, we present VideoPhy, a benchmark designed to assess whether generated videos follow physical commonsense for real-world activities (e.g., marbles will roll down when placed on a slanted surface). Specifically, we curate a list of 688 captions that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., VideoCrafter2) and closed models (e.g., Lumiere from Google, Pika). Further, our human evaluation reveals that existing models severely lack the ability to generate videos adhering to the given text prompts, and they also lack physical commonsense.
Specifically, the best-performing model, Pika, generates videos that adhere to the caption and physical laws for only 19.7% of the instances. VideoPhy thus highlights that video generative models are far from accurately simulating the physical world. Finally, we supplement the dataset with an auto-evaluator, VideoCon-Physics, to assess semantic adherence and physical commonsense at scale.
[Figure] Qualitative examples of generated videos and their physical-commonsense judgments:
- Gen-2 (poor physical commonsense: conservation of mass violation). Text prompt: "A blender spins, mixing squeezed juice within it."
- Gen-2 (poor physical commonsense: Newton's second law violation). Text prompt: "Water pouring from a watering can onto plants."
- SVD (poor physical commonsense: Newton's second law violation). Text prompt: "Detergent flowing into a bucket of water."
- LaVIE (poor physical commonsense: conservation of mass violation). Text prompt: "A baker scoops flour into a plastic bowl with a metal scoop."
- VideoScope (good physical commonsense). Text prompt: "A feather slowly floats down to the ground."
- VideoCrafter2 (good physical commonsense). Text prompt: "A survivalist strikes a flint to light dry tinder."
Human evaluation results on the VideoPhy dataset. We report the percentage of testing prompts for which the T2V models generate videos that adhere to the conditioning caption and exhibit physical commonsense. We abbreviate semantic adherence as SA and physical commonsense as PC. "SA, PC" denotes the percentage of instances for which SA=1 and PC=1; ideally, a generative model maximizes this metric. The first column reports overall performance, and the subsequent columns report fine-grained performance for interactions between different states of matter in the prompts.
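For concreteness, the joint "SA, PC" metric can be computed as below. This is a minimal sketch assuming annotations are stored as binary SA/PC fields; the field names are illustrative, not from the paper.

```python
# Minimal sketch of the joint SA/PC metric: the percentage of instances
# where both semantic adherence (SA=1) and physical commonsense (PC=1) hold.
# The "sa"/"pc" field names are illustrative assumptions.

def joint_sa_pc(annotations):
    hits = sum(1 for a in annotations if a["sa"] == 1 and a["pc"] == 1)
    return 100.0 * hits / len(annotations)

# Example: three annotated videos, one of which satisfies both criteria.
annotations = [
    {"sa": 1, "pc": 1},
    {"sa": 1, "pc": 0},
    {"sa": 0, "pc": 1},
]
print(f"SA, PC = {joint_sa_pc(annotations):.1f}%")  # -> SA, PC = 33.3%
```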
Key statistics of the VideoPhy dataset.
Top 20 most frequently occurring verbs (inner circle) and their top 4 direct object nouns (outer circle) in our collected captions.
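The paper does not specify its extraction tooling; below is a hypothetical sketch of how such verb/direct-object statistics could be derived from the captions using spaCy dependency parses.

```python
# Hypothetical sketch: count verbs and their direct objects in captions
# with spaCy. The exact tooling used for the figure is an assumption.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

captions = [
    "A blender spins, mixing squeezed juice within it.",
    "A baker scoops flour into a plastic bowl with a metal scoop.",
]

verb_counts = Counter()
object_counts = {}  # verb lemma -> Counter of its direct-object lemmas

for doc in nlp.pipe(captions):
    for token in doc:
        if token.pos_ == "VERB":
            verb_counts[token.lemma_] += 1
            for child in token.children:
                if child.dep_ == "dobj":
                    object_counts.setdefault(token.lemma_, Counter())[child.lemma_] += 1

# Top 20 verbs with their top 4 direct objects, as in the figure.
for verb, _ in verb_counts.most_common(20):
    print(verb, object_counts.get(verb, Counter()).most_common(4))
```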
We use VideoCon, an open-source generative video-text language model with 7B parameters that is trained on real videos, for robust semantic adherence evaluation. Specifically, we prompt VideoCon to generate a Yes/No response indicating whether the generated video adheres to the conditioning text and whether it follows physical commonsense.
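As a rough illustration of this Yes/No protocol, the sketch below wraps a hypothetical videocon_generate function. Both the prompt wording and the inference interface are assumptions for illustration, not the paper's exact templates or the model's real API.

```python
# Schematic sketch of the Yes/No judgment protocol. The prompts and the
# videocon_generate wrapper are illustrative assumptions.

SA_PROMPT = 'Does this video entail the description: "{caption}"? Answer Yes or No.'
PC_PROMPT = "Does this video follow the physical laws of the real world? Answer Yes or No."

def videocon_generate(video_path: str, prompt: str) -> str:
    # Placeholder: replace with the actual VideoCon-Physics inference call.
    return "Yes"

def judge(video_path: str, caption: str) -> dict:
    sa = videocon_generate(video_path, SA_PROMPT.format(caption=caption))
    pc = videocon_generate(video_path, PC_PROMPT)
    # Map the free-form Yes/No responses to binary SA/PC judgments.
    return {
        "sa": int(sa.strip().lower().startswith("yes")),
        "pc": int(pc.strip().lower().startswith("yes")),
    }

print(judge("video.mp4", "A feather slowly floats down to the ground."))
```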
Comparison of ROC-AUC for automatic evaluation methods. We find that VideoCon-Physics outperforms diverse baselines, including GPT-4 Vision and Gemini-1.5-Pro-Vision, by a large margin on semantic adherence (SA) and physical commonsense (PC) judgments for the testing prompts.
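For reference, ROC-AUC against human labels can be computed as follows. This is a minimal sketch assuming each automatic evaluator exposes a probability of answering "Yes"; the data shown is dummy, not from the paper.

```python
# Minimal sketch: score an automatic evaluator against human judgments
# with ROC-AUC. Labels and probabilities below are dummy illustrations.
from sklearn.metrics import roc_auc_score

human_pc_labels = [1, 0, 1, 1, 0]            # human PC judgments (ground truth)
model_yes_probs = [0.9, 0.2, 0.7, 0.6, 0.4]  # evaluator's P("Yes") per video

print(f"PC ROC-AUC: {roc_auc_score(human_pc_labels, model_yes_probs):.3f}")
```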