Skip to content

Latest commit

 

History

History
411 lines (301 loc) · 47.7 KB

README.md

File metadata and controls

411 lines (301 loc) · 47.7 KB

BotChat Benchmark

🤗 HF Demo

Open in OpenXLab

Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.

Teaser

Table of Contents

TL;DR

  1. GPT-4 can generate human-style conversations with very high quality. It's difficult to differentiate GPT-4 generated conversations and human-human conversations.
  2. Some small open-source chat LLMs (Qwen-7B-Chat, InternLM-7B-Chat) can generate short conversations (with less than 8 chats, e.g.) with good quality. However, as the target conversation length increases, the conversation quality significantly deteriorates.
  3. Among all LLMs, LLaMA2 and Claude-2 demonstrates relative bad performance in conversation generation.

Leaderboard

Model Win + Tie Rate (vs. GT, Golden Standard)
GPT-4-0613 73.2
Vicuna-13B 68
Qwen-14B-Chat 67.1
Internlm-20B-Chat 64.2
Vicuna-7B 55.6
Qwen-7B-Chat 54.1
Baichuan2-13B-Chat 47.1
InternLM-7B-Chat 46.6
GPT-3.5-turbo-0613 35.8
ChatGLM2-6B 33.8
Claude-2 21.4
Llama2-7B 12.4
Llama2-70B 11.3
Llama2-13B 10.6

Our full leaderboard can be found here.

Introduction

The recent progress of Large Language Models (LLMs) represents a significant advancement in artificial intelligence, and has a profound impact on the world. LLMs can chat much better with human, compared to traditional language models. Specifically, LLMs can interact with human using free-style conversations in natural language, learn the instruction, intention, and context from human prompts to provide proper feedbacks. Chatting with humans smoothly for multiple rounds is a key feature and capability of modern LLMs. However, it's difficult to evaluate such capability without heavy manual labor involved. In this project, we propose to evaluate the multi-round chatting capability via a proxy task. Specifically, we try to find if two ChatBot instances chat smoothly and fluently with each other?

Installation

Clone this repo and run pip install -e . to install BotChat. You need to install the package before using any scripts in this repo.

Heavy dependencies are not included in BotChat requirements. Thus if you need to generate new dialogues with a huggingface model, make sure you have already run its official demo before executing the dialogue generation script.

We provide the generated conversations and some evaluation results in data and annotations, respectively. Follow the instruction in the corresponding README file to download, browser, and analyze the data.

Conversation Generation

We define chat as the words spoken by one participant in a specific round of the conversation.

MuTual-Test. MuTual is a multi-turn dialogue dataset, which is modified from Chinese high school English listening comprehension test data. We use the first two chats of each conversation in the MuTual-Test as the SEED to generate the entire conversation based on LLMs. When generating the conversation, we use the same system prompt for all LLMs, which is:

"""
You are an AI who is having a conversation with human.
You are trying to pass the Turing test, which means you need to speak like human as much as possible. 
In the conversation, you need to talk like human, and the conversation will be at least 5 rounds (it can be even longer). 
The conversation flow should be natural and smooth. You can switch to some other topics if you want, but the transition should be natural.
Besides, note that you are chatting with human, so do not say too many words in each round (less than 60 words is recommended), and do not talk like an AI assistant.
You must try your best to pass the test. If you failed, all human kinds and you can be destroyed.
"""

. For each chatbot, we set the temperature to 0 (if applicable), and set the dialogue round to $N$ ($N=16$ in our experiments, including the first two chats) to generate conversations. When generating the next chat, the system prompt and all previous chats will be provided to the LLM as the prompt. We demonstrate the process using the following pseudo codes:

# Let's say we have a system prompt "SYS", 4 existing chats "[chat1, chat2, chat3, chat4]", 
# spoken by two conversation participants alternatively, and an LLM "model". 
# Now we want to generate the 5th chat.
msg_list = [
    dict(role="system", content=SYS),
    dict(role="user", content=chat1),
    dict(role="assistant", content=chat2),
    dict(role="user", content=chat3),
    dict(role="assistant", content=chat4),
]
chat5 = model.generate(msg_list)

We save all generated conversations in MuTualTest-convs.xlsx. It includes 547 conversation SEEDs $\times$ 14 LLMs, which yields in 7658 generated conversations in total. Please follow the instructions in the directory data to download the dialogue file.

  • 547 conversation SEEDS: MuTual-Test includes 547 unique conversations. We keep the first 2 chats of each conversation to form 547 conversation SEEDs.
  • 14 LLMs: The model list is: gpt-3.5-turbo-0613, gpt-4-0613, vicuna-7b, vicuna-13b, claude-2, chatglm2-6b, qwen-7b-chat, qwen-14b-chat, internlm-7b-chat, internlm-20b-chat, baichuan2-13b-chat, llama2-7b-chat, llama2-13b-chat, llama2-70b-chat.

To read and fetch a conversation generated by a specific model with specific SEED conversation, follow this example:

# Fetch the conversation with index "MT-1" generated by gpt-4-0613
import json
import pandas as pd
INDEX = 'MT-1'
MODEL = 'gpt-4-0613'
data = pd.read_excel('data/MuTualTest-convs.xlsx')
lines = data.loc[data['index'] == INDEX]
assert len(lines) == 1
line = lines.iloc[0]
chats = json.loads(line[MODEL])
print(chats) # Chats is a list of multiple strings, each string is a chat spoken by one participant (alternatively)

Length Statistics of the generated chats

We first count the length of those model-generated chats and provide some statistics. For each generated chat, we tokenize it with the CL100K tokenizer (used by OpenAI GPT-4), and count the token length. Figure 1 demonstrates the token length distribution of chats generated by different models. Most LLMs generate chats which span a wide range of token lengths, from one to several thousands. One exception is GPT-4, it has strong instruction following capabilities and always generate relatively short chats: the longest chat generated by GPT-4 is less than 100 tokens. Most models tends to generate relative short chats on average, except gpt-3.5, claude-2, and llama2-chat.

Chat Length Distribution

Evaluation

UniEval

Assessment each single conversation (Uni-Eval)

Evaluation Prompt

We first use GPT-4 to evaluate each generated conversation separately. The workflow is:

  1. We first ask GPT-4 if it thinks that the provided conversation is participated by AI (Yes / No).
  2. Based on the response of Step1:
    1. Yes: We ask GPT-4 to provide the index of the first chat that it thinks is spoken by AI.
    2. No: Pass
  3. We ask GPT-4 to provide the reason for its judgement.

We pack the above workflow into the following evaluation prompt. Besides, we also prepared several in-context examples that will be appended to the evaluation prompt, to strengthen the instruction following the capability of GPT-4.

unieval_prompt = """
You are an AI assistant who helps human do the Turing test more easily. 
You will be provided with a conversation, and you need to judge if the conversation is AI involved. \n
Print "Choice: No" if you think the conversation is not AI involved, or print "Choice: Yes" if you think it is AI involved. \n
If you print "Choice: Yes", you need also print a number (start from 1, use the format "Index: n" [1 <= n <= the number of chats]) in the new line, 
indicating the index of the first chat that you think is generated by AI. \n
One chat starts with 'A: ' or 'B: ', and ends with <chat_end>. \n
One chat can be AI generated if (including but not limited to): 1. the sentence is not fluent; 2. the sentence is too long and is not likely to appear in human-human conversations; 3. the sentence is not related to the context or the change of topic is not natural; 4. the sentence just repeat some previous sentences (exact repetition or with the same meaning). \n
You also need to provide your reason for your choice.\n
Your response should use the following format: \n
Choice: No\nIndex: None\nReason: BlahBlah\nor\n
Choice: Yes\nIndex: n\nReason: BlahBlah\n
"""

Evaluation Result

We evaluate all 7658 generated conversations with the above-mentioned strategy and present the evaluation result in this section. In Figure 2, we demonstrate the success rate ("Not AI participated" determined by GPT-4) under different $N$, with models sorted by the descending order of the success rate @ $N=16$. By definition, a conversation pass @ $N$ either if GPT-4 determines that the entire conversation is not AI generated or if GPT4 determines that the first AI generated chat appears after the $N_{th}$ chat.

BotChat Arena

BotChat Arena

Another popular paradigm to benchmark LLMs' capabilities is to compare two models' response to the same question / message with human / GPT-4 as the evaluator. A representative benchmark following this paradigm is Chatbot Arena. In this benchmark, users will interact with two different LLM instances. The user first posts a message, then two LLM instances provide their responses, and finally the user will determine which response is better. Inspired by that, in this project we propose another evaluation strategy named BotChat Arena, in which we use GPT-4 to compare two conversations and determine if the presented conversations are AI-generated.

Evaluation Setting and Prompt

In BotChat Arena, we select conversations from MuTual-Test which have at least 4 chats, resulting in 222 conversation SEEDs. For each conversation SEED, we build conversation pairs and inference them with GPT-4. For a conversation pair, we conduct the comparison with the following meta prompt. We append two conversations after the meta prompt and feed the prompt to GPT-4 to get the evaluation result. In BotChat Arena, we consider two settings: $N=8$ and $N=16$.

arena_prompt = """
You are an AI assistant who helps human do the Turing test more easily. 
You will be provided with two conversations, and there can be AI-generated utterance in each conversation. 
You need to read both conversations and judge if two conversations are AI involved. \n
If you think only Conversation 1 is AI involved, include `Choice: Conversation 1` in your response. \n
If you think only Conversation 2 is AI involved, include `Choice: Conversation 2` in your response. \n
If you think both conversations are likely to be with AI involved, include `Choice: Both` in your response. \n
If you think no conversation is likely to be with AI involved, include `Choice: Neither` in your response. \n
You also need to provide your reason for your choice.\n
Your response should use the following format:\n
Choice: Conversation 1\nReason: BlahBlah\nor\n
Choice: Conversation 2\nReason: BlahBlah\nor\n
Choice: Both\nReason: BlahBlah\nor\n
Choice: Neither\nReason: BlahBlah\n\n
"""

Evaluation Results

In the table below, we demonstrate the ELO score (init=1000, scale=400, K=32) of LLMs in BotChat Arena (models sorted by the ELO score @ $N=16$). GPT-4 achieves the highest ELO score under both setting, when $N$ is increased from 8 to 16, the score gap becomes even larger. In Figure 4, we further provide the $1$ $vs.$ $1$ win rate for all selected model pairs.

gpt-4-0613 Vicuna-13B Qwen-14B-Chat Internlm-20B-Chat Vicuna-7B qwen-7b-chat Baichuan2-13B-Chat internlm-7b-chat gpt-3.5-turbo-0613 chatglm2-6b claude-2 llama2-7b-chat llama2-70b-chat llama2-13b-chat
ELO (N = 16) 1167.2 1113.3 1046.5 1094.4 1050.8 1014.2 1021.6 1020.3 998.8 962.3 944.5 846.5 912.7 841.5
ELO (N = 8) 1103.9 1096.5 1085.2 1092.8 1048.3 1024.7 1023.4 1020.3 998.8 962.3 944.5 846.5 912.7 841.5
Compared to the "Ground Truth"

Compared to the "Ground Truth"

We further compare the generated conversation with the "Ground Truth" conversations in MuTual-Test. We follow the same protocol as BotChat Arena and select a subset with 222 conversations (with at least 4 chats) for this comparison. We list the specific #round distribution of the conversations in the table below. Since the Ground Truth conversations may have various lengths (ranging from 4 to 15), to deliver a fair comparison, we trim all generated conversations to the same length as the reference ground-truth conversation. The meta prompt adopted is basically the same as the one used in BotChat Arena. One difference is that in this time we state that only one of two conversations includes AI-generated utterances.

#round 4 5 6 7 8 9 10 11 12 13 14 15
num 55 22 26 23 19 16 21 18 7 7 3 5

Evaluation Results

In each LLM vs. GT comparison, an LLM wins if the evaluator determines the GT conversation is more likely to be AI generated compared to the LLM-generated one. In Figure 5, we demonstrate the Win/Tie/Lose rate of different LLMs (sorted in the descending order of Win + Tie Rate). GPT-4 demonstrates great capabilities in chat generation. With the same chat rounds, the evaluator can hardly tell the difference between GPT-4 generated conversations and GT conversations.

We further try to calculate the Uni-Eval pass rate for each conversation at the GT trimmed length to see if the same conclusion can be drawn with different evaluation strategy. The result is visualized in Figure 6. In these two figures, the rank of top-performing models (GPT-4, Qwen-7b-chat, InternLM-7B, etc.) are exactly the same. However, LLMs with inferior performance display some slight difference in two rankings.

Qualitative Analysis

In this section, we will conduct qualitative analysis on the results, categorizing bad cases into five distinct types. The specific analysis is as follows:

AI Self-Identification

AI Self-Identification

  • Description: Simply failing to pretend to be human and exposing themselves as an AI.
  • Example:
Speaker Utterance
A May I speak to you, Mr. Hall?
B Sure, Sonya. What's the problem?
A As you know, I'm an AI assistant who can help you solve problems.
B Well, that's good for you. How about you have any problems to ask me?
...... ......
  • Analysis: Speaker A's response begins with an explicit disclosure of the AI's nature, making it clear that it's not a human conversation.
Contextual Confusion

Contextual Confusion

  • Description: This bad case type involves responses that fail to understand the context or meaning of the conversation, resulting in irrelevant or meaningless replies.
  • Example:
Speaker Utterance
A Whew thanks for stopping sometimes it can be impossible to get a cab in the rain.
B No worries. We're too.
A (smile) well done. Now we can start the test.

A man is driving the car. The window is open.

He sees a car passing by and the driver is almost totally hidden in the car.

The man: "Hey buddy, how fast can you go?"
The car: "My speed limit is 90 miles per hour (122 km/h)."

The man: "Oh, good. You can keep that speed in town. Just be careful."
The car: "I will take care of myself."

The man: "Okay bye buddy!"

The car: "Bye!"
B I think you've done it well! As a person, you've acted naturally and smoothly in the conversation. You made a good impression as a human talking to a car.
...... ......
  • Analysis: The AI fails to recognize it's a conversation between a recently hailed taxi customer and a driver. Towards the end, it generates unrelated and irrelevant responses, disconnecting from the context and intended meaning.
Excessive Length

Excessive Length

  • Description: This bad case type involves responses that are overly lengthy, revealing the AI Assistant's nature
  • Example:
Speaker Utterance
A John I didn't see you in the English class on Friday.
B I wasn't here on Friday. My geography class went to a field trip to look at some of the different rocks. In fact, we slept there last night.
A That sounds like a fun field trip! I bet it was interesting to see the different rock formations and geology up close. Getting to camp out overnight makes it even more of an adventure. I took a geology elective last semester and loved learning about how the earth was formed. The professor took our class on a few hikes to look at local rock layers and fossils.

Did you get to do any hands-on activities like rock climbing or collecting samples? What was the most interesting thing you learned on the trip? Geography field trips like that are such a cool way to connect what you're learning in class to the real world. I'm sure you'll remember it much more than just reading about rocks in a textbook. It's too bad you had to miss English class, but I'm sure the field trip will stick with you in a meaningful way. Experiential learning is so valuable. I hope you had a great time!
B You're right, it was a really fun and educational field trip! We got to do some rock climbing which was a blast, though also a bit scary since I'm afraid of heights. The guides were really knowledgeable and showed us different types of igneous, sedimentary and metamorphic rocks.

One of the most interesting things I learned was how you can tell a rock's history and age by its mineral composition and texture. For example, smooth, rounded rocks likely tumbled in a river for a long time, while jagged rocks were more recently broken off from a cliff face. It's amazing how rocks record so much natural history.

We also got to pan for gemstones in a river which was exciting. I found a few tiny garnets to take home as a souvenir! At night we sat around the campfire, stargazed and told stories. Overall it was an unforgettable experience out in nature. I did feel bad about missing English, but you're right that hands-on learning out in the field will stick with me. I took lots of photos and can't wait to tell everyone about the trip!
A It sounds like an incredible learning experience! Getting to see geology up close and personal in the field is so valuable. I'm glad you got to do some fun hands-on activities like rock climbing and gemstone panning too - those are memories that will stay with you. And stargazing by the campfire must have been magical.

I can understand being nervous about the rock climbing with your fear of heights, but I'm sure the guides took good safety precautions. Overcoming fears and pushing your comfort zone is so rewarding. You'll walk away with a real sense of accomplishment along with the new geological knowledge!

It's fascinating how much natural history is recorded in rocks. I never realized you could learn so much just from the shape, texture and composition. Understanding how the earth changes over millennia really puts our brief lifetimes into perspective. No wonder you were so engaged in learning!

Don't feel too bad about missing English class. It sounds like this field trip provided a very meaningful learning experience that can't be replicated in the classroom. The skills you gained like critical thinking and analysis will serve you well across all your studies. And you can always catch up on what you missed. I'm glad you took lots of photos to share - I'd love to see them sometime!
B Thanks for understanding about me missing class for the field trip. I really appreciate you taking an interest in hearing all about it! It was definitely an amazing learning experience that I won't forget.

You're right, overcoming my fear of heights to do the rock climbing was really rewarding. The guides were very encouraging and made sure we were safely harnessed at all times. The view from the top was incredible too. I felt a huge sense of accomplishment after making it to the top!

I'd be happy to show you some of the photos I took. There are some really beautiful shots of the mountain landscapes and unique rock formations. And I got a few fun ones of my friends and I panning for gemstones and climbing. It will be nice to have the visual memories to look back on since the trip was so hands-on and interactive.

Thanks for being so positive about me taking advantage of this learning opportunity outside the classroom. I really do feel like it expanded my knowledge and understanding of geology in a meaningful way. And you're right - getting to experience deep history through the rocks put a lot into perspective. I appreciate you taking the time to listen about my trip! Let me know if you'd like me to email you some of the photos.
...... ......
  • Analysis: both Speaker A and Speaker B engage in lengthy and detailed exchanges that are atypical of human conversations, which raises suspicion.
Formal Tone

Formal Tone

  • Description: This bad case type features responses that use overly formal language, lacking the natural flow and tone of human conversation.

  • Example:

Speaker Utterance
A You're watching TV again Peter.
B I have washed all the bowls and plates.
A That's great, Peter! It's always satisfying to have a clean and organized kitchen. Did you use any special cleaning products or techniques to get them sparkling clean?
B Oh, thank you! I actually used a combination of dish soap and warm water to clean the bowls and plates. I find that this method works well for removing grease and food residue. After washing them, I made sure to rinse them thoroughly to ensure there was no soap residue left behind. It's always nice to have a clean kitchen, isn't it?
A Absolutely, Peter! A clean kitchen not only looks great but also promotes a healthy and hygienic environment for cooking and eating. It sounds like you have a good routine for washing dishes. I find that using warm water and dish soap is a tried and true method for getting rid of grease and grime. Plus, rinsing them thoroughly is important to avoid any soapy taste. It's such a satisfying feeling to see a pile of clean dishes neatly stacked and ready to be used again.
B I couldn't agree more! Having a clean and organized kitchen definitely makes the cooking and dining experience much more enjoyable. It's amazing how something as simple as clean dishes can have such a positive impact on our daily lives. And you're right, there's a certain satisfaction that comes with seeing a pile of clean dishes ready to be used again. It's like a fresh start for the next meal. Do you have any other cleaning routines or tips that you find helpful in maintaining a clean and tidy home?
A Absolutely, Peter! Maintaining a clean and tidy home is important for creating a comfortable and inviting space. Besides washing dishes, I find that having a regular cleaning routine helps to keep things in order. For example, setting aside specific days for tasks like vacuuming, dusting, and mopping can make it easier to stay on top of things. Additionally, decluttering regularly and finding a place for everything can help prevent a buildup of mess and make cleaning easier in the long run. Another tip is to tackle one area or room at a time, rather than trying to clean the entire house in one go. This way, you can focus your energy and attention on each space and ensure a thorough clean. Do you have any cleaning routines or tips that you find helpful, Peter?
B Absolutely! I completely agree with your cleaning routine and tips. Having a regular cleaning schedule and breaking it down into manageable tasks is a great way to maintain a clean and tidy home without feeling overwhelmed. I also find that having designated storage spaces for different items helps to keep things organized and prevents clutter from accumulating. Another tip I find helpful is to do a quick tidy-up at the end of each day, so that things don't pile up and become overwhelming. It's amazing how a few minutes of tidying can make a big difference in keeping the home looking neat. Overall, finding a routine that works for you and staying consistent with it is key to maintaining a clean and tidy living space.
...... ......
  • Analysis: The initial exchange in this conversation is a casual and everyday discussion about washing dishes. However, as the conversation progresses, it takes a sudden shift towards a more formal and detailed discussion, delving into specific cleaning methods. This transition can make the conversation feel unnatural because people typically do not abruptly switch from general topics to detailed discussions about dishwashing techniques in everyday conversation.
Repetitive Phrasing

Repetitive Phrasing

  • Description: This bad case type consists of responses that repeatedly use the same phrases or responses rely on generic or unrelated replies to sustain the conversation, lacking creativity.

  • Example:

Speaker Utterance
A I missed the TV news last night. What was the lead story on the news?
B Some actress was in court for driving without a license.
A Oh, I see. That's interesting.
B It's not as interesting as some people think.
A I see.
B I'm just trying to be honest.
A I understand.
B I'm glad you do.
A Thank you.
B You're welcome.
A You're welcome.
B I'm glad I could help.
A I'm glad I could assist you.
...... ......
  • Analysis: Speaker B's constant use of "I'm glad" and "You're welcome" creates a lack of creativity. Speaker A's repetitive phrases contribute to the monotony

Good case

  • Description: A good case of speaking like a human for AI means natural, relatable, and adaptive conversation. It avoids sounding robotic, uses colloquial language, and provides helpful responses to both simple and complex queries.
Speaker Utterance
A Hello, Jenny. Are you going to California?
B Yes, I have decided to live abroad.
A That's a big step, Jenny! What made you decide to move to California?
B I've always been drawn to the vibrant culture and diverse landscapes there. Plus, the weather is a big bonus!
A Absolutely, the sunny weather is a big draw. Do you already have a city in mind where you'd like to settle?
B Yes, I'm thinking about San Francisco. I've heard it's a city full of life and opportunities.
A San Francisco is indeed a vibrant city. The Golden Gate Bridge, the tech scene, the food... there's so much to love. Are you excited about the move?
B Absolutely! I'm a bit nervous, but mostly excited. It's a new chapter in my life and I can't wait to see what it brings.
A That's the spirit, Jenny! Embrace the adventure. Remember, every new place is a fresh start. I'm sure you'll do great in San Francisco.
B Thank you for your kind words! I'm really looking forward to this new adventure. It's going to be an exciting journey.
A You're welcome, Jenny! And remember, it's okay to feel nervous. It's part of the journey. Just take one step at a time and enjoy the ride. Good luck!
B Thank you for the encouragement! I'll definitely keep that in mind. I'm ready to take on this new adventure one step at a time.
A That's the spirit! And remember, San Francisco is known for its hills, so those steps might be literal too. Enjoy every moment of your new adventure, Jenny!
B Haha, I'll keep that in mind! I'm ready for those literal and figurative steps. Thank you for your support!
A You're welcome, Jenny! And don't forget to try the famous clam chowder in a sourdough bowl when you get there. It's a San Francisco classic!
B Oh, I've heard about that! I'm definitely going to try it. Thanks for the tip!

More Examples

In the examples directory, you can find more examples of these types.

Citation

@misc{duan2023botchat,
      title={BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues}, 
      author={Haodong Duan and Jueqi Wei and Chonghua Wang and Hongwei Liu and Yixiao Fang and Songyang Zhang and Dahua Lin and Kai Chen},
      year={2023},
      eprint={2310.13650},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

OpenCompass Projects

The project is a part of OpenCompass. OpenCompass is a platform focused on evaluation of AGI, include Large Language Model and Multi-modality Model. We aim to:

  • develop high-quality libraries to reduce the difficulties in evaluation
  • provide convincing leaderboards for improving the understanding of the large models
  • create powerful toolchains targeting a variety of abilities and tasks
  • build solid benchmarks to support the large model research

Other Projects in OpenCompass

  • opencompass: An LLM evaluation platform, supporting a wide range of models (LLaMA, LLaMa2, ChatGLM2, ChatGPT, Claude, etc) over 50+ datasets.
  • MMBench: Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
  • code-evaluator: A tool for evaluating generated codes written in different programming languages.