Yuan Zhang1,2, Fei Xiao1, Tao Huang3, Chun-Kai Fan2, Hongyuan Dong1,
Jiawen Li1, Jiacong Wang1,4, Kuan Cheng2, Shanghang Zhang2✉️, Haoyuan Guo1✉️
1ByteDance Inc, 2School of Computer Science, Peking University,
3The University of Sydney, 4School of Artificial Intelligence, University of Chinese Academy of Sciences
- 🔥 [2024/09/28] ConBench is accepted to the NeurIPS 2024 main track!
- 🔥 [2024/06/06] ConBench has been merged into the official LLaVA Evaluation Suite!
- 🔥 [2024/05/24] We released ConBench on arXiv! The code and dataset are now open source!
When faced with prompts whose solution spaces differ in size, large vision-language models (LVLMs) fail to give consistent answers about the same knowledge point. This inconsistency of answers across solution spaces is prevalent in LVLMs and erodes trust. To this end, we provide a multi-modal benchmark, ConBench, to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point.
Based on the ConBench tool, we are the first to reveal this tapestry and report the following findings: (1) In the discriminative realm, the larger the solution space of the prompt, the lower the accuracy of the answers. (2) We establish a relationship between the discriminative and generative realms: the accuracy of a discriminative question type exhibits a strong positive correlation with its consistency with the caption. (3) Compared to open-source models, closed-source models exhibit a pronounced bias advantage in terms of consistency.
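To make the scoring idea concrete, here is a minimal Python sketch of a ConScore[D]-style computation (see the leaderboards below), assuming the three discriminative prompt types are true/false, multiple choice, and limited VQA, and that a knowledge point counts only when all three are answered correctly. The function name and data layout are illustrative assumptions, not the official Score.py implementation.

```python
# Illustrative sketch, NOT the official Score.py: a knowledge point is
# "consistent" only if every discriminative prompt type is answered correctly.
from typing import Dict, List

def conscore_d(records: List[Dict[str, bool]]) -> float:
    """records: one dict per knowledge point, mapping each discriminative
    prompt type to whether the model answered it correctly."""
    if not records:
        return 0.0
    consistent = sum(1 for r in records if all(r.values()))
    return 100.0 * consistent / len(records)

# Hypothetical example: two knowledge points, three prompt types each.
points = [
    {"true_false": True, "multi_choice": True, "limited_vqa": True},   # consistent
    {"true_false": True, "multi_choice": False, "limited_vqa": True},  # inconsistent
]
print(conscore_d(points))  # 50.0
```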
pip install anls
Note
If you want to compute Score[C], you also need to install the OpenAI Python package.
pip install openai
Download on Hugging Face
The model generates answers based on the image and prompt and stores them as .txt files.
An example for evaluating GPT-4V results is provided in GPT-4V.py.
An example for evaluating LLaVA-series results is provided in lmms-eval.
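The sketch below shows one way the per-category .txt files could be written. The `save_answers` helper and the tab-separated line layout are hypothetical; check GPT-4V.py for the exact format Score.py expects.

```python
# Minimal sketch of writing one .txt file per category under a model folder.
import os

def save_answers(model_name: str, category: str, qa_pairs):
    """qa_pairs: iterable of (question_id, answer) tuples for one category."""
    os.makedirs(model_name, exist_ok=True)
    path = os.path.join(model_name, f"{category}.txt")
    with open(path, "w", encoding="utf-8") as f:
        for qid, answer in qa_pairs:
            f.write(f"{qid}\t{answer}\n")  # assumed layout, not verified

# Usage: save_answers("GPT-4V", "color", [("0001", "The umbrella is red.")])
```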
The results should be listed like this:
GPT-4V
├─artwork.txt
├─attribute_reason.txt
├─attribute_recognition.txt
├─biology.txt
├─calculation.txt
├─celebrity.txt
├─chemistry.txt
├─code.txt
├─color.txt
├─count.txt
├─cross_instance_reason.txt
├─landmark.txt
├─math.txt
├─ocr.txt
├─physics.txt
├─position.txt
├─poster.txt
├─scene.txt
└─translation.txt
python Score.py --results_dir ${Model_results} --Score_D
Example for evaluating the GPT-4V results:
python3 Score.py --results_dir ./Res/GPT-4V --Score_D
The results will be saved in Con_res/GPT-4V_D.json.
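To inspect the saved scores afterwards, the file can be loaded like any JSON; the snippet below makes no assumption about the key layout Score.py writes.

```python
import json

# Load and pretty-print the consistency scores written by Score.py.
with open("Con_res/GPT-4V_D.json", encoding="utf-8") as f:
    results = json.load(f)
print(json.dumps(results, indent=2, ensure_ascii=False))
```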
Note
We also provide this evaluation through lmms-eval.
python3 -m accelerate.commands.launch --main_process_port 10096 --num_processes=1 lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks ConBench --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_conbench --output_path ./logs/
python Score.py --results_dir ${Model_results}
Example for evaluating the GPT-4V results:
python3 Score.py --results_dir ./Res/GPT-4V
The results will be saved in Con_res/GPT-4V_C.json.
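Since Score[C] relies on the OpenAI API, here is a hedged sketch of what a caption-consistency judge can look like. The prompt wording, the `gpt-4o` model name, and the `judge_consistency` helper are illustrative assumptions, not the exact logic in Score.py.

```python
# Illustrative caption-consistency judge using the OpenAI Python SDK (>=1.0).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_consistency(caption: str, question: str, answer: str) -> bool:
    """Ask a GPT judge whether `answer` is consistent with `caption`."""
    prompt = (
        f"Caption: {caption}\nQuestion: {question}\nAnswer: {answer}\n"
        "Is the answer consistent with the caption? Reply with yes or no."
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model, not necessarily the one in Score.py
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")
```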
| Rank | Model | ConScore[D] |
|---|---|---|
| 1 | Qwen-VL-Max | 37.00 |
| 2 | GPT-4-Omni | 35.70 |
| 3 | InternVL-v1.2P-40B | 34.70 |
| 4 | Gemini-Ultra-Vision | 33.10 |
| 5 | InternVL-v1.5-26B | 31.40 |
| Rank | Model | ConScore[C] |
|---|---|---|
| 1 | GPT-4-Omni | 62.2 |
| 2 | Qwen-VL-Max | 58.4 |
| 3 | GPT-4V | 55.6 |
| 4 | Gemini-Ultra-Vision | 54.6 |
| 5 | InternVL-v1.2P-40B | 53.7 |
- The review pipeline
This project is released under the Apache 2.0 license.
If you use ConBench in your research, please cite our work by using the following BibTeX entry:
@article{zhang2024unveiling,
title={Unveiling the Tapestry of Consistency in Large Vision-Language Models},
author={Zhang, Yuan and Xiao, Fei and Huang, Tao and Fan, Chun-Kai and Dong, Hongyuan and Li, Jiawen and Wang, Jiacong and Cheng, Kuan and Zhang, Shanghang and Guo, Haoyuan},
journal={arXiv preprint arXiv:2405.14156},
year={2024}
}
We extend our gratitude to the open-source efforts of MME, MMBench, MMMU and SEEDBench.