Unveiling the Tapestry of Consistency in Large Vision-Language Models (NeurIPS 2024)

Yuan Zhang1,2, Fei Xiao1, Tao Huang3, Chun-Kai Fan2, Hongyuan Dong1,

Jiawen Li1, Jiacong Wang1,4, Kuan Cheng2, Shanghang Zhang2✉️, Haoyuan Guo1✉️

1ByteDance Inc, 2School of Computer Science, Peking University,

3The University of Sydney, 4School of Artificial Intelligence, University of Chinese Academy of Sciences

📜 News

  • 🔥 [2024/09/28] Our ConBench was accepted to the NeurIPS 2024 main track!
  • 🔥 [2024/06/06] ConBench has been merged into the official LLaVA Evaluation Suite!
  • 🔥 [2024/05/24] We released ConBench on arXiv! The code and dataset are now open source!


✒️ Contents

  • 👀 Overview
  • 👨‍💻 Preparation
  • 🎯 Usage
  • 🏆 Leaderboard
  • License
  • Citation
  • Acknowledgment

👀 Overview

When faced with prompts whose solution spaces differ in size, large vision-language models (LVLMs) fail to always give consistent answers about the same knowledge point. This inconsistency of answers across solution spaces is prevalent in LVLMs and erodes trust. To this end, we provide the multi-modal benchmark ConBench to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point.


Based on the ConBench tool, we are the first to reveal this tapestry and obtain the following findings: (1) In the discriminative realm, the larger the solution space of the prompt, the lower the accuracy of the answers. (2) We establish a relationship between the discriminative and generative realms: the accuracy of a discriminative question type exhibits a strong positive correlation with its Consistency with the caption. (3) Compared to open-source models, closed-source models exhibit a pronounced advantage in terms of Consistency.


👨‍💻 Preparation

Install ANLS

pip install anls
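ANLS (Average Normalized Levenshtein Similarity) is used for fuzzy string matching when scoring free-form text answers. Purely for intuition, here is a minimal self-contained sketch of the metric; the evaluation itself relies on the installed anls package, not this re-implementation.

```python
# Minimal sketch of the ANLS metric, for intuition only; Score.py uses the
# installed `anls` package rather than this re-implementation.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(prediction: str, reference: str, threshold: float = 0.5) -> float:
    """Normalized Levenshtein similarity, zeroed out below the threshold."""
    pred, ref = prediction.strip().lower(), reference.strip().lower()
    if not pred and not ref:
        return 1.0
    nls = 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))
    return nls if nls >= threshold else 0.0

print(anls("Eiffel Tower", "the Eiffel Tower"))  # 0.75: similar enough to count
```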

Install OpenAI API

Note

If you want to compute ConScore[C], you need to install the OpenAI API client.

pip install openai
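ConScore[C] relies on a GPT-based judge to check whether an answer is consistent with the generated caption. The actual prompt and model are defined in Score.py; the snippet below is only a hedged sketch of such a judge call with the openai client, where the model name, prompt wording, and judge_consistency helper are illustrative assumptions rather than the benchmark's code.

```python
# Hedged sketch of a GPT-as-judge call; the real prompt and model live in
# Score.py and may differ from the placeholders used here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_consistency(caption: str, question: str, answer: str) -> bool:
    """Ask the judge whether `answer` is consistent with `caption` (illustrative)."""
    prompt = (
        "Caption: " + caption + "\n"
        "Question: " + question + "\n"
        "Answer: " + answer + "\n"
        "Is the answer consistent with the caption? Reply Yes or No."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```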

ConBench Dataset

Download the dataset from Hugging Face.
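One convenient way to fetch the benchmark locally is huggingface_hub; the repo_id below is a placeholder, so substitute the actual dataset id from the link above.

```python
# Hedged sketch: download the ConBench data with huggingface_hub.
# Replace the repo_id placeholder with the dataset id from the link above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<ConBench-dataset-id>",  # placeholder, see the Hugging Face link
    repo_type="dataset",
)
print("ConBench downloaded to", local_dir)
```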

🎯 Usage

Get Model Responses

The model outputs answers based on the image and prompt, and stores them as .txt files.

An example of evaluating GPT-4V is provided in GPT-4V.py.

An example of evaluating the LLaVA series is provided in lmms-eval.

The results should be organized like this (a minimal sketch for producing such files follows the listing):

GPT-4V
├─artwork.txt
├─attribute_reason.txt
├─attribute_recognition.txt
├─biology.txt
├─calculation.txt
├─celebrity.txt
├─chemistry.txt
├─code.txt
├─color.txt
├─count.txt
├─cross_instance_reason.txt
├─landmark.txt
├─math.txt
├─ocr.txt
├─physics.txt
├─position.txt
├─poster.txt
├─scene.txt
└─translation.txt
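As a rough illustration of how such per-category files might be produced: the ask_model helper below is a hypothetical stand-in for your model's inference call, and the exact line format expected by Score.py may differ, so consult GPT-4V.py, lmms-eval, and the provided ./Res/GPT-4V files for the authoritative layout.

```python
# Hedged sketch: write one answer per line into a per-category .txt file.
# `ask_model` is a hypothetical stand-in for the real inference call
# (see GPT-4V.py / lmms-eval for the actual pipelines).
import os

def ask_model(image_path: str, prompt: str) -> str:
    raise NotImplementedError("plug in your model's inference call here")

def dump_responses(samples, category: str, out_dir: str = "./Res/MyModel"):
    """`samples` is an iterable of (image_path, prompt) pairs for one category."""
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, f"{category}.txt")
    with open(out_path, "w", encoding="utf-8") as f:
        for image_path, prompt in samples:
            answer = ask_model(image_path, prompt)
            f.write(answer.replace("\n", " ").strip() + "\n")
```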

Fast Evaluation on ConScore[D]

python Score.py --results_dir ${Model_results} --Score_D

Example for evaluating GPT-4V result:

python3 Score.py --results_dir ./Res/GPT-4V --Score_D

The results will be saved in Con_res/GPT-4V_D.json.

Note

Alternatively, we provide evaluation through lmms-eval:

python3 -m accelerate.commands.launch --main_process_port 10096 --num_processes=1 lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks ConBench --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_conbench --output_path ./logs/

Evaluation on ConScore[C]

python Score.py --results_dir ${Model_results} 

Example for evaluating GPT-4V result:

python3 Score.py --results_dir ./Res/GPT-4V

The results will be saved in Con_res/GPT-4V_C.json.

🏆 Leaderboard

ConScore[D]

| Rank | Model | ConScore[D] |
|:---:|:---|:---:|
| 1 | Qwen-VL-Max | 37.00 |
| 2 | GPT-4-Omni | 35.70 |
| 3 | InternVL-v1.2P-40B | 34.70 |
| 4 | Gemini-Ultra-Vision | 33.10 |
| 5 | InternVL-v1.5-26B | 31.40 |

ConScore[C]

| Rank | Model | ConScore[C] |
|:---:|:---|:---:|
| 1 | GPT-4-Omni | 62.2 |
| 2 | Qwen-VL-Max | 58.4 |
| 3 | GPT-4V | 55.6 |
| 4 | Gemini-Ultra-Vision | 54.6 |
| 5 | InternVL-v1.2P-40B | 53.7 |
Figure: the review pipeline.

License

This project is released under the Apache 2.0 license.

Citation

If you use ConBench in your research, please cite our work by using the following BibTeX entry:

@article{zhang2024unveiling,
  title={Unveiling the Tapestry of Consistency in Large Vision-Language Models},
  author={Zhang, Yuan and Xiao, Fei and Huang, Tao and Fan, Chun-Kai and Dong, Hongyuan and Li, Jiawen and Wang, Jiacong and Cheng, Kuan and Zhang, Shanghang and Guo, Haoyuan},
  journal={arXiv preprint arXiv:2405.14156},
  year={2024}
}

Acknowledgment

We extend our gratitude to the open-source efforts of MME, MMBench, MMMU and SEEDBench.