Benchmarking LLM Evaluators on the MT-Bench Human Judgement LabelledPairwiseEvaluatorDataset¶
In this notebook guide, we benchmark Gemini and GPT models as LLM evaluators using a slightly adapted version of the MT-Bench human judgement dataset. In that dataset, human evaluators compare two LLM responses to a given query and rank them according to their own preference. In the original version, a single example (a query plus two model responses) can have multiple human evaluators. In the adapted version used here, we aggregate these "duplicate" entries and convert the "winner" column of the original schema into the proportion of human evaluators for which "model_a" won. To fit the llama-dataset format and to better handle ties (despite the smaller sample size), we apply an uncertainty threshold to this proportion: if it falls within [0.4, 0.6], the two models are considered tied (a small sketch of this tie rule appears after the list below). The dataset is downloaded from llama-hub. Finally, the LLMs we benchmark as evaluators are:
- GPT-3.5 (OpenAI)
- GPT-4 (OpenAI)
- Gemini-Pro (Google)
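To make the tie rule above concrete, here is a minimal pandas sketch of how the per-example win fraction and the [0.4, 0.6] uncertainty band could be applied. The column names (`example_id`, `winner`) and the toy records are hypothetical; the published llama-dataset already ships in aggregated form, so this only illustrates the rule rather than reproducing the exact preprocessing.

```python
import pandas as pd

# Hypothetical raw MT-Bench human-judgement records: one row per human judge per example.
raw = pd.DataFrame(
    {
        "example_id": [1, 1, 1, 2, 2],
        "winner": ["model_a", "model_b", "model_a", "model_b", "model_b"],
    }
)

# Fraction of judges that preferred model_a, per example.
frac_a = (
    raw.assign(model_a_win=raw["winner"].eq("model_a"))
    .groupby("example_id")["model_a_win"]
    .mean()
)


def to_label(frac: float) -> str:
    # Fractions inside the [0.4, 0.6] uncertainty band are treated as ties.
    if 0.4 <= frac <= 0.6:
        return "tie"
    return "model_a" if frac > 0.6 else "model_b"


print(frac_a.apply(to_label))
```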
In [ ]:
%pip install llama-index-llms-openai
%pip install llama-index-llms-cohere
%pip install llama-index-llms-gemini
In [ ]:
!pip install "google-generativeai" -q
!pip install "google-generativeai" -q
In [ ]:
import nest_asyncio

# Allow nested event loops so the async benchmarking calls below can run inside the notebook.
nest_asyncio.apply()
Load The Dataset¶
Let's load the llama-dataset from llama-hub.
In [ ]:
from llama_index.core.llama_dataset import download_llama_dataset
# download dataset
pairwise_evaluator_dataset, _ = download_llama_dataset(
"MtBenchHumanJudgementDataset", "./mt_bench_data"
)
In [ ]:
pairwise_evaluator_dataset.to_pandas()[:5]
Out[ ]:
| | query | answer | second_answer | contexts | ground_truth_answer | query_by | answer_by | second_answer_by | ground_truth_answer_by | reference_feedback | reference_score | reference_evaluation_by |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Compose an engaging travel blog post about a r... | I recently had the pleasure of visiting Hawaii... | Aloha! I recently had the pleasure of embarkin... | None | None | human | ai (alpaca-13b) | ai (gpt-3.5-turbo) | None | None | 0.0 | human |
| 1 | Compose an engaging travel blog post about a r... | I recently had the pleasure of visiting Hawaii... | Aloha and welcome to my travel blog post about... | None | None | human | ai (alpaca-13b) | ai (vicuna-13b-v1.2) | None | None | 0.0 | human |
| 2 | Compose an engaging travel blog post about a r... | Here is a draft travel blog post about a recen... | I recently had the pleasure of visiting Hawaii... | None | None | human | ai (claude-v1) | ai (alpaca-13b) | None | None | 1.0 | human |
| 3 | Compose an engaging travel blog post about a r... | Here is a draft travel blog post about a recen... | Here is a travel blog post about a recent trip... | None | None | human | ai (claude-v1) | ai (llama-13b) | None | None | 1.0 | human |
| 4 | Compose an engaging travel blog post about a r... | Aloha! I recently had the pleasure of embarkin... | I recently had the pleasure of visiting Hawaii... | None | None | human | ai (gpt-3.5-turbo) | ai (alpaca-13b) | None | None | 1.0 | human |
Define Our Evaluators¶
In [ ]:
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI
from llama_index.llms.gemini import Gemini
from llama_index.llms.cohere import Cohere
llm_gpt4 = OpenAI(temperature=0, model="gpt-4")
llm_gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")
llm_gemini = Gemini(model="models/gemini-pro", temperature=0)
evaluators = {
"gpt-4": PairwiseComparisonEvaluator(llm=llm_gpt4),
"gpt-3.5": PairwiseComparisonEvaluator(llm=llm_gpt35),
"gemini-pro": PairwiseComparisonEvaluator(llm=llm_gemini),
}
Benchmark With EvaluatorBenchmarkerPack (llama-pack)¶
To compare the performance of our three evaluators, we benchmark them against the MTBenchHumanJudgementDataset, in which the reference evaluations are provided by human evaluators. The benchmark returns the following quantitative metrics:
- number_examples: the number of examples in the dataset.
- invalid_predictions: the number of evaluations that could not produce a final verdict (e.g. because the evaluation output could not be parsed, or the LLM evaluator raised an exception).
- inconclusives: since this is a pairwise comparison, we run the evaluation twice to reduce the risk of "position bias", once with the two model answers in their original order and once with the order flipped before being shown to the evaluator LLM. If the LLM's vote in the second run contradicts the first, the example is counted as "inconclusive".
- ties: the PairwiseComparisonEvaluator may also return a "tie"; this counts the number of examples for which it did.
- agreement_rate_with_ties: the rate at which the LLM evaluator agrees with the reference (here, human) evaluator when ties are included. The denominator for this metric is: number_examples - invalid_predictions - inconclusives.
- agreement_rate_without_ties: the rate at which the LLM evaluator agrees with the reference evaluator when ties are excluded. The denominator for this metric is: number_examples - invalid_predictions - inconclusives - ties.

We will use the EvaluatorBenchmarkerPack to compute these metrics.
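To make the two agreement rates concrete, here is a minimal sketch of the arithmetic, assuming purely hypothetical counts; the variable names and numbers below are illustrative and are not taken from the EvaluatorBenchmarkerPack implementation.

```python
# Hypothetical raw counts for a single evaluator run (illustrative numbers only).
number_examples = 1000
invalid_predictions = 10
inconclusives = 90
ties = 50

# Hypothetical agreement counts with the human reference,
# once counting predicted ties and once excluding them.
agreements_with_ties = 650
agreements_without_ties = 620

# Both rates exclude examples that produced no usable verdict.
valid_conclusive = number_examples - invalid_predictions - inconclusives  # 900

agreement_rate_with_ties = agreements_with_ties / valid_conclusive  # 650 / 900
agreement_rate_without_ties = agreements_without_ties / (valid_conclusive - ties)  # 620 / 850

print(round(agreement_rate_with_ties, 3), round(agreement_rate_without_ties, 3))
```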
In [ ]:
from llama_index.core.llama_pack import download_llama_pack
EvaluatorBenchmarkerPack = download_llama_pack(
"EvaluatorBenchmarkerPack", "./pack"
)
GPT-3.5¶
In [ ]:
evaluator_benchmarker = EvaluatorBenchmarkerPack(
evaluator=evaluators["gpt-3.5"],
eval_dataset=pairwise_evaluator_dataset,
show_progress=True,
)
In [ ]:
gpt_3p5_benchmark_df = await evaluator_benchmarker.arun(
batch_size=100, sleep_time_in_seconds=0
)
In [ ]:
gpt_3p5_benchmark_df.index = ["gpt-3.5"]
gpt_3p5_benchmark_df
Out[ ]:
| | number_examples | invalid_predictions | inconclusives | ties | agreement_rate_with_ties | agreement_rate_without_ties |
|---|---|---|---|---|---|---|
| gpt-3.5 | 1204 | 82 | 393 | 56 | 0.736626 | 0.793462 |
GPT-4¶
In [ ]:
evaluator_benchmarker = EvaluatorBenchmarkerPack(
evaluator=evaluators["gpt-4"],
eval_dataset=pairwise_evaluator_dataset,
show_progress=True,
)
In [ ]:
gpt_4_benchmark_df = await evaluator_benchmarker.arun(
batch_size=100, sleep_time_in_seconds=0
)
In [ ]:
gpt_4_benchmark_df.index = ["gpt-4"]
gpt_4_benchmark_df
Out[ ]:
| | number_examples | invalid_predictions | inconclusives | ties | agreement_rate_with_ties | agreement_rate_without_ties |
|---|---|---|---|---|---|---|
| gpt-4 | 1204 | 0 | 100 | 103 | 0.701087 | 0.77023 |
Gemini Pro¶
NOTE: Rate limits for the Gemini models are currently quite strict, which is understandable given that the model had only just been released at the time of writing. Accordingly, we use a smaller batch_size and a modestly increased sleep_time_in_seconds to reduce the risk of hitting the rate limit.
In [ ]:
evaluator_benchmarker = EvaluatorBenchmarkerPack(
evaluator=evaluators["gemini-pro"],
eval_dataset=pairwise_evaluator_dataset,
show_progress=True,
)
In [ ]:
gemini_pro_benchmark_df = await evaluator_benchmarker.arun(
batch_size=5, sleep_time_in_seconds=0.5
)
In [ ]:
gemini_pro_benchmark_df.index = ["gemini-pro"]
gemini_pro_benchmark_df
Out[ ]:
| | number_examples | invalid_predictions | inconclusives | ties | agreement_rate_with_ties | agreement_rate_without_ties |
|---|---|---|---|---|---|---|
| gemini-pro | 1204 | 2 | 295 | 60 | 0.742007 | 0.793388 |
In [ ]:
evaluator_benchmarker.prediction_dataset.save_json("gemini_predictions.json")
In Summary¶
For convenience, let's consolidate all of the results into a single DataFrame.
In [ ]:
import pandas as pd
final_benchmark = pd.concat(
[
gpt_3p5_benchmark_df,
gpt_4_benchmark_df,
gemini_pro_benchmark_df,
],
axis=0,
)
final_benchmark
Out[ ]:
| | number_examples | invalid_predictions | inconclusives | ties | agreement_rate_with_ties | agreement_rate_without_ties |
|---|---|---|---|---|---|---|
| gpt-3.5 | 1204 | 82 | 393 | 56 | 0.736626 | 0.793462 |
| gpt-4 | 1204 | 0 | 100 | 103 | 0.701087 | 0.770230 |
| gemini-pro | 1204 | 2 | 295 | 60 | 0.742007 | 0.793388 |
From the results above, we make the following observations:
- All three LLM evaluators are quite close in terms of agreement rates, with the Gemini models perhaps holding a slight edge
- Gemini Pro and GPT-3.5 appear to be more decisive than GPT-4, yielding only 50-60 ties versus roughly 100 for GPT-4
- However, and perhaps related to the previous point, GPT-4 produced the fewest inconclusive results, which suggests it is the least affected by position bias
- Overall, Gemini Pro performs on par with the GPT models, and arguably even edges out GPT-3.5. It appears that Gemini can indeed serve as an alternative to the GPT models for evaluation tasks