PhyX

Does Your Model Have the "Wits" for Physical Reasoning?

1The University of Hong Kong, 2University of California, Los Angeles, 3Tsinghua University,
4University of Michigan, 5The Ohio State University, 6University of Edinburgh, 7The Hong Kong University of Science and Technology (Guangzhou), 8The Hong Kong Polytechnic University, 9The Chinese University of Hong Kong, 10LMSYS Org
* These authors contributed equally.
Dataset overview.

Data examples of the PhyX dataset. The dataset contains 3,000 human-annotated physics problems with visual context.

🔔 News

🔥 [2025-05-27]: PhyX is officially supported by VLMEvalKit for easy evaluation! 🎉

🚀 [2025-05-23]: The arXiv paper is online! 🚀

🚀 [2025-05-21]: We release the testmini set of PhyX at Huggingface and the evaluation code! 🚀

Introduction

We introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning: GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy respectively, leaving performance gaps exceeding 29% compared to human experts.

PhyX Benchmark

Comparisons with Existing Benchmarks

To further distinguish PhyX from existing benchmarks, we compare the benchmark details in the figure below. In terms of realism, prior benchmarks rely heavily on abstract line drawings, whereas our benchmark uses realistic visual scenarios and removes textual redundancy.


Comparison with existing physics benchmarks.

Statistics

Experiment Results

Data

You can directly download our data from Hugging Face datasets. For guidance on how to access and use the data, please consult the instructions on our GitHub.
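
For convenience, here is a minimal sketch of loading the data with the Hugging Face `datasets` library. The repository id and split name below are illustrative assumptions; please check the GitHub instructions for the exact identifiers.

```python
# Minimal sketch: load PhyX testmini data from Hugging Face.
# NOTE: the repository id and split name are illustrative assumptions;
# see the GitHub instructions for the exact identifiers.
from datasets import load_dataset

phyx = load_dataset("PhyX-bench/PhyX-testmini", split="test")  # hypothetical repo id / split
print(phyx.column_names)  # inspect which fields (question, image, answer, ...) are provided
print(phyx[0])            # look at one question record
```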

Leaderboard

We evaluate various models including LLMs and LMMs. In each type, we consider both closed- and open-source models. Our evaluation is conducted under a zero-shot setting to assess the capability of models to generate accurate answers without fine-tuning or few-shot demonstrations on our benchmark.
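
As a rough illustration of the zero-shot protocol (not our official evaluation code), the sketch below queries a model once per question with no in-context examples and scores by exact match; the `ask_model` callable and the sample field names are assumptions.

```python
# Rough zero-shot evaluation sketch (illustrative only, not the official scorer).
# `ask_model` is any callable taking (text prompt, image) and returning the
# model's answer string; the sample field names are assumptions.
def zero_shot_accuracy(samples, ask_model):
    correct = 0
    for sample in samples:
        prompt = (
            "Answer the physics question using the image.\n"
            f"Question: {sample['question']}\n"
            "Respond with the final answer only."
        )
        prediction = ask_model(prompt, sample.get("image"))
        # Exact match after light normalization; grading of open-ended answers
        # in the paper is more involved than this.
        correct += int(prediction.strip().lower() == str(sample["answer"]).strip().lower())
    return correct / len(samples)
```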

In our leaderboard, the default setting is the Text-DeRedundancy setting, which removes text redundancy in the question (`PhyX_mini_SIMPLY.tsv` & `PhyX_mini_MC_SIMPLY.tsv`). We suggest using this setting for a fair comparison.
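
If you work with the TSV releases directly, a small sketch of reading the default-setting files locally is shown below; it assumes standard tab-separated files and does not guess at their column schema.

```python
# Sketch: read the default (Text-DeRedundancy) splits from locally downloaded TSVs.
# Assumes standard tab separation; columns are whatever the release provides.
import pandas as pd

open_ended = pd.read_csv("PhyX_mini_SIMPLY.tsv", sep="\t")
multiple_choice = pd.read_csv("PhyX_mini_MC_SIMPLY.tsv", sep="\t")
print(open_ended.columns.tolist(), len(open_ended))
print(multiple_choice.columns.tolist(), len(multiple_choice))
```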

[Leaderboard table: models grouped as Human Expert, Open-Source, and Proprietary; columns report Name, Size, Date, Text-DeRedundancy accuracy on PhyX(OE) and PhyX(MC), and the Overall score.]

Overall results of different models on the PhyX leaderboard. The best-performing model in each category is in bold, and the second best is underlined. *: results provided by the authors.

Error Analysis

To dive into the reasoning capabilities and limitations of current models, we meticulously inspected 96 randomly sampled incorrect predictions made by GPT-4o and performed an in-depth analysis. The objectives of this analysis were twofold: to identify current model weaknesses and to guide future enhancements in model design and training. A comprehensive case study of 30 notable cases is included in Appendix E of our paper.


Error distribution over 96 annotated GPT-4o errors.

Error Examples

Correct Examples

BibTeX

@misc{shen2025phyxdoesmodelwits,
        title={PhyX: Does Your Model Have the "Wits" for Physical Reasoning?}, 
        author={Hui Shen and Taiqiang Wu and Qi Han and Yunta Hsieh and Jizhou Wang and Yuyue Zhang and Yuxin Cheng and Zijian Hao and Yuansheng Ni and Xin Wang and Zhongwei Wan and Kai Zhang and Wendong Xu and Jing Xiong and Ping Luo and Wenhu Chen and Chaofan Tao and Zhuoqing Mao and Ngai Wong},
        year={2025},
        eprint={2505.15929},
        archivePrefix={arXiv},
        primaryClass={cs.AI},
        url={https://arxiv.org/abs/2505.15929}, 
  }