Yuan Zhang1,2, Fei Xiao1, Tao Huang3, Chun-Kai Fan2, Hongyuan Dong1,
Jiawen Li1, Jiacong Wang1,4, Kuan Cheng2, Shanghang Zhang2✉️, Haoyuan Guo1✉️
1ByteDance Inc, 2School of Computer Science, Peking University,
3The University of Sydney, 4School of Artificial Intelligence, University of Chinese Academy of Sciences
- 🔥 [2024/09/28] ConBench is accepted to the NeurIPS 2024 main track!
- 🔥 [2024/06/06] ConBench has been merged into the official LLaVA Evaluation Suite!
- 🔥 [2024/05/24] We released ConBench on arXiv! The code and dataset are now open source!
When faced with prompts whose solution spaces differ in size, large vision-language models (LVLMs) fail to give consistent answers about the same knowledge point. This inconsistency of answers across solution spaces is prevalent in LVLMs and erodes trust. To this end, we provide a multi-modal benchmark, ConBench, to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point.
Based on the ConBench tool, we are the first to reveal this tapestry and report the following findings: (1) In the discriminative realm, the larger the solution space of the prompt, the lower the accuracy of the answers. (2) We establish a relationship between the discriminative and generative realms: the accuracy of a discriminative question type exhibits a strong positive correlation with its consistency with the caption. (3) Compared to open-source models, closed-source models exhibit a pronounced bias advantage in terms of consistency.
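To make the scoring idea concrete, here is a minimal Python sketch of a ConScore[D]-style computation (see the leaderboards below), assuming the three discriminative prompt types are true/false, multiple choice, and limited VQA, and that a knowledge point counts only when all three are answered correctly. The function name and data layout are illustrative assumptions, not the official Score.py implementation.

```python
# Illustrative sketch, NOT the official Score.py: a knowledge point is
# "consistent" only if every discriminative prompt type is answered correctly.
from typing import Dict, List

def conscore_d(records: List[Dict[str, bool]]) -> float:
    """records: one dict per knowledge point, mapping each discriminative
    prompt type to whether the model answered it correctly."""
    if not records:
        return 0.0
    consistent = sum(1 for r in records if all(r.values()))
    return 100.0 * consistent / len(records)

# Hypothetical example: two knowledge points, three prompt types each.
points = [
    {"true_false": True, "multi_choice": True, "limited_vqa": True},   # consistent
    {"true_false": True, "multi_choice": False, "limited_vqa": True},  # inconsistent
]
print(conscore_d(points))  # 50.0
```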
pip install anls
Note
If you want to compute Score[C], you also need to install the OpenAI Python package.
pip install openai
Download on Hugging Face
The model generates answers based on the image and prompt and stores them as .txt files.
An example for evaluating GPT-4V results is provided in GPT-4V.py.
An example for evaluating LLaVA-series results is provided in lmms-eval.
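The sketch below shows one way the per-category .txt files could be written. The `save_answers` helper and the tab-separated line layout are hypothetical; check GPT-4V.py for the exact format Score.py expects.

```python
# Minimal sketch of writing one .txt file per category under a model folder.
import os

def save_answers(model_name: str, category: str, qa_pairs):
    """qa_pairs: iterable of (question_id, answer) tuples for one category."""
    os.makedirs(model_name, exist_ok=True)
    path = os.path.join(model_name, f"{category}.txt")
    with open(path, "w", encoding="utf-8") as f:
        for qid, answer in qa_pairs:
            f.write(f"{qid}\t{answer}\n")  # assumed layout, not verified

# Usage: save_answers("GPT-4V", "color", [("0001", "The umbrella is red.")])
```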
The results should be listed like this:
GPT-4V
├─artwork.txt
├─attribute_reason.txt
├─attribute_recognition.txt
├─biology.txt
├─calculation.txt
├─celebrity.txt
├─chemistry.txt
├─code.txt
├─color.txt
├─count.txt
├─cross_instance_reason.txt
├─landmark.txt
├─math.txt
├─ocr.txt
├─physics.txt
├─position.txt
├─poster.txt
├─scene.txt
└─translation.txt
python Score.py --results_dir ${Model_results} --Score_D
Example for evaluating the GPT-4V results:
python3 Score.py --results_dir ./Res/GPT-4V --Score_D
The results will be saved in Con_res/GPT-4V_D.json.
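To inspect the saved scores afterwards, the file can be loaded like any JSON; the snippet below makes no assumption about the key layout Score.py writes.

```python
import json

# Load and pretty-print the consistency scores written by Score.py.
with open("Con_res/GPT-4V_D.json", encoding="utf-8") as f:
    results = json.load(f)
print(json.dumps(results, indent=2, ensure_ascii=False))
```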
Note
We also provide this evaluation through lmms-eval.
python3 -m accelerate.commands.launch --main_process_port 10096 --num_processes=1 lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks ConBench --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_conbench --output_path ./logs/
python Score.py --results_dir ${Model_results}
Example for evaluating the GPT-4V results:
python3 Score.py --results_dir ./Res/GPT-4V
The results will be saved in Con_res/GPT-4V_C.json.
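Since Score[C] relies on the OpenAI API, here is a hedged sketch of what a caption-consistency judge can look like. The prompt wording, the `gpt-4o` model name, and the `judge_consistency` helper are illustrative assumptions, not the exact logic in Score.py.

```python
# Illustrative caption-consistency judge using the OpenAI Python SDK (>=1.0).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_consistency(caption: str, question: str, answer: str) -> bool:
    """Ask a GPT judge whether `answer` is consistent with `caption`."""
    prompt = (
        f"Caption: {caption}\nQuestion: {question}\nAnswer: {answer}\n"
        "Is the answer consistent with the caption? Reply with yes or no."
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model, not necessarily the one in Score.py
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")
```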
| Rank | Model | ConScore[D] |
|---|---|---|
| 1 | Qwen-VL-Max | 37.00 |
| 2 | GPT-4-Omni | 35.70 |
| 3 | InternVL-v1.2P-40B | 34.70 |
| 4 | Gemini-Ultra-Vision | 33.10 |
| 5 | InternVL-v1.5-26B | 31.40 |
| Rank | Model | ConScore[C] |
|---|---|---|
| 1 | GPT-4-Omni | 62.2 |
| 2 | Qwen-VL-Max | 58.4 |
| 3 | GPT-4V | 55.6 |
| 4 | Gemini-Ultra-Vision | 54.6 |
| 5 | InternVL-v1.2P-40B | 53.7 |
- The review pipeline
This project is released under the Apache 2.0 license.
If you use ConBench in your research, please cite our work by using the following BibTeX entry:
@article{zhang2024unveiling,
title={Unveiling the Tapestry of Consistency in Large Vision-Language Models},
author={Zhang, Yuan and Xiao, Fei and Huang, Tao and Fan, Chun-Kai and Dong, Hongyuan and Li, Jiawen and Wang, Jiacong and Cheng, Kuan and Zhang, Shanghang and Guo, Haoyuan},
journal={arXiv preprint arXiv:2405.14156},
year={2024}
}
We extend our gratitude to the open-source efforts of MME, MMBench, MMMU and SEEDBench.