✨✨ MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
- 2024.09.03 🌟 MME-RealWorld is now supported in the VLMEvalKit repository, enabling one-click evaluation. Give it a try!
- 2024.09.01 🌟 Qwen2-VL currently ranks first on our leaderboard, but its overall accuracy remains below 55%; see our leaderboard for details.
- 2024.08.20 🌟 We are very proud to launch MME-RealWorld, which contains 13K high-quality images annotated by 32 volunteers, resulting in 29K question-answer pairs that cover 43 subtasks across 5 real-world scenarios. As far as we know, MME-RealWorld is the largest manually annotated benchmark to date, featuring the highest resolution and a targeted focus on real-world applications.
- MME-RealWorld Overview
- Dataset Examples
- Dataset License
- Evaluation Pipeline
- Experimental Results
- Citation
- Related Works
Existing Multimodal Large Language Model benchmarks present several common barriers that make it difficult to measure the significant challenges that models face in the real world, including:
- small data scale leads to a large performance variance;
- reliance on model-based annotations results in restricted data quality;
- insufficient task difficulty, especially caused by the limited image resolution.
We present MME-RealWorld, a benchmark meticulously designed to address real-world applications with practical relevance. Featuring 13,366 high-resolution images averaging 2,000 × 1,500 pixels, MME-RealWorld poses substantial recognition challenges. Our dataset encompasses 29,429 annotations across 43 tasks, all expertly curated by a team of 25 crowdsourced workers and 7 MLLM experts. The main advantages of MME-RealWorld compared to existing MLLM benchmarks are as follows:
- Data Scale: With the efforts of a total of 32 volunteers, we have manually annotated 29,429 QA pairs focused on real-world scenarios, making this the largest fully human-annotated benchmark known to date.
- Data Quality: 1) Resolution: Many image details, such as a scoreboard in a sports event, carry critical information. These details can only be properly interpreted with high-resolution images, which are essential for providing meaningful assistance to humans. To the best of our knowledge, MME-RealWorld features the highest average image resolution among existing competitors. 2) Annotation: All annotations are manually completed, with a professional team cross-checking the results to ensure data quality.
- Task Difficulty and Real-World Utility: Even the most advanced models have not surpassed 60% accuracy. Additionally, many real-world tasks are significantly more difficult than those in traditional benchmarks. For example, in video monitoring, a model needs to count 133 vehicles, and in remote sensing, it must identify and count small objects on a map with an average resolution exceeding 5000×5000.
- MME-RealWorld-CN: Existing Chinese benchmarks are usually translated from their English versions. This has two limitations: 1) Question-image mismatch. The image may relate to an English scenario, which is not intuitively connected to a Chinese question. 2) Translation mismatch. Machine translation is not always accurate. We therefore collect additional images that focus on Chinese scenarios and ask Chinese volunteers to annotate them, resulting in 5,917 QA pairs.
License:
MME-RealWorld may be used only for academic research. Commercial use in any form is prohibited.
The copyright of all images belongs to the image owners.
If there is any infringement in MME-RealWorld, please email [email protected] and we will remove it immediately.
Without prior approval, you cannot distribute, publish, copy, disseminate, or modify MME-RealWorld in whole or in part.
You must strictly comply with the above restrictions.
Please send an email to [email protected]. 🌟
📍 Prompt:
The common prompt used in our evaluation follows this format:
[Image] [Question] The choices are listed below:
(A) [Choice A]
(B) [Choice B]
(C) [Choice C]
(D) [Choice D]
(E) [Choice E]
Select the best answer to the above multiple-choice question based on the image. Respond with only the letter (A, B, C, D, or E) of the correct option.
The best answer is:
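The prompt format above can be assembled programmatically. The following is a minimal sketch, not code from the repository: the function name `build_prompt` and its arguments are hypothetical, and the image is assumed to be passed to the model through its own image input rather than embedded in the text.

```python
from string import ascii_uppercase

# Instruction text copied from the prompt format above.
INSTRUCTION = (
    "Select the best answer to the above multiple-choice question based on the image. "
    "Respond with only the letter (A, B, C, D, or E) of the correct option."
)


def build_prompt(question: str, choices: list) -> str:
    """Assemble the text part of the prompt (hypothetical helper).

    The image itself is assumed to be supplied to the model separately.
    """
    if len(choices) != 5:
        raise ValueError("the prompt format lists exactly five choices (A-E)")
    lines = [f"{question} The choices are listed below:"]
    # Pair each choice with its option letter: (A), (B), ..., (E).
    for letter, choice in zip(ascii_uppercase, choices):
        lines.append(f"({letter}) {choice}")
    lines.append(INSTRUCTION)
    lines.append("The best answer is:")
    return "\n".join(lines)
```

Ending the prompt with "The best answer is:" nudges the model to reply with a single option letter, which keeps answer extraction simple.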
📍 Evaluation:
We offer two methods for downloading our images and QA pairs:
- Base64 Encoded Images: We have encoded all images in Base64 format and uploaded them to our Hugging Face repository, which includes two folders: MME-RealWorld and MME-RealWorld-CN. The JSON files within these folders can be read directly, with the images in Base64 format. By using the evaluation/download_and_prepare_prompt.py script and creating a class MMERealWorld, you can automatically download and convert the data into a CSV file that can be used directly. You can use the decode_base64_to_image_file function to convert the Base64 formatted images back into PIL format.
- Direct Image Download: You can download the images and data directly from our Baidu Netdisk or Hugging Face repository. For Hugging Face, follow the instructions to decompress the split compressed images. The file MME_RealWorld.json contains the English version of the questions, while MME_RealWorld_CN.json contains the Chinese version. Make sure to place all the decompressed images in the same folder to ensure the paths are read correctly.
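For readers who want to see what decoding a Base64 image involves, here is a minimal standard-library sketch. It is not the repository's decode_base64_to_image_file function; the helper name `save_base64_image` and its parameters are hypothetical, and it simply writes the decoded bytes to disk.

```python
import base64
from pathlib import Path


def save_base64_image(b64_string: str, target_path: str) -> Path:
    """Decode a Base64 string and write the raw image bytes to target_path.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    image_bytes = base64.b64decode(b64_string)
    path = Path(target_path)
    # Create the destination folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(image_bytes)
    return path
```

The saved file can then be opened with an image library of your choice, for example `PIL.Image.open(path)` if Pillow is installed.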
To extract the answer and calculate the scores, we add the model response to a JSON file. Here we provide an example template output_test_template.json. Once you have prepared the model responses in this format, please refer to the evaluation script eval_your_results.py, and you will get the accuracy scores across categories, subtasks, and task types. The evaluation does not introduce any third-party models, such as ChatGPT.
python eval_your_results.py \
    --results_file $YOUR_RESULTS_FILE
Please ensure that the results_file follows the JSON format specified above.
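To illustrate the kind of model-free scoring described above, the sketch below extracts an option letter from each response and computes accuracy per group. It is not the logic of eval_your_results.py: the record keys `response`, `answer`, and `category` are hypothetical placeholders (the real field names are defined by output_test_template.json), and the regular expression is a simple heuristic.

```python
import re
from collections import defaultdict
from typing import Optional


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone option letter (A-E) in a response, if any.

    Heuristic only: a response that starts with the word "A" would be
    read as option A.
    """
    match = re.search(r"\b([A-E])\b", response.strip())
    return match.group(1) if match else None


def accuracy_by_group(records, group_key):
    """Compute accuracy for each value of group_key (e.g. a category).

    Each record is a dict with hypothetical keys "response" and "answer",
    plus the grouping key.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[group_key]
        total[group] += 1
        if extract_choice(record["response"]) == record["answer"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}
```

Calling `accuracy_by_group(records, "category")` yields one score per category; the same function can be reused with a subtask or task-type key.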
📍 Leaderboard:
If you want to add your model to our leaderboard, please send your model responses to [email protected] in the format of output_test_template.json.
Models are ranked according to their average performance. Rows corresponding to proprietary models are highlighted in gray for distinction. “OCR”, “RS”, “DT”, “MO”, and “AD” each indicate a specific task domain: Optical Character Recognition in the Wild, Remote Sensing, Diagram and Table, Monitoring, and Autonomous Driving, respectively. “Avg” and “Avg-C” indicate the weighted average accuracy and the unweighted average accuracy across subtasks in each domain.
- Evaluation results of different MLLMs on the perception tasks.
- Evaluation results of different MLLMs on the reasoning tasks.
- Evaluation results of different MLLMs on the perception tasks of MME-RealWorld-CN.
- Evaluation results of different MLLMs on the reasoning tasks of MME-RealWorld-CN.
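The difference between "Avg" and "Avg-C" can be made concrete with a small sketch. It assumes that the weighted average weights each subtask by its number of questions, which the text above does not state explicitly; the function name and the numbers in the usage note are illustrative, not benchmark results.

```python
def weighted_and_unweighted_average(subtasks):
    """Return (weighted, unweighted) average accuracy for one domain.

    subtasks: list of (num_correct, num_questions) pairs, one per subtask.
    Assumption: "weighted" means weighted by question count per subtask.
    """
    total_correct = sum(correct for correct, _ in subtasks)
    total_questions = sum(count for _, count in subtasks)
    # Weighted: every question counts equally, so large subtasks dominate.
    weighted = total_correct / total_questions
    # Unweighted: every subtask counts equally, regardless of its size.
    unweighted = sum(correct / count for correct, count in subtasks) / len(subtasks)
    return weighted, unweighted
```

For example, with one subtask at 90 of 100 correct and another at 5 of 10 correct, the weighted average is 95/110 while the unweighted average is (0.9 + 0.5) / 2 = 0.7, so a small, hard subtask pulls the unweighted score down more.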
If you find our work helpful for your research, please consider citing our work.
@article{zhang2024mme,
title={MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?},
author={Zhang, Yi-Fan and Zhang, Huanyu and Tian, Haochen and Fu, Chaoyou and Zhang, Shuangqing and Wu, Junfei and Li, Feng and Wang, Kun and Wen, Qingsong and Zhang, Zhang and others},
journal={arXiv preprint arXiv:2408.13257},
year={2024}
}
Explore our related research: