Benchmarking LLM Evaluators On A Mini MT-Bench (Single Grading) LabelledEvaluatorDataset¶
In this notebook, we benchmark three different LLM evaluators that judge the quality of another LLM's response to a user query. Specifically, we run the benchmark on a mini version of the MT-Bench single-grading dataset. This mini version contains only the answers produced by llama2-70b on 160 questions (i.e., 80 two-turn dialogues, two turns each). The reference evaluations used in this benchmark are provided by GPT-4. Benchmarking these three evaluators therefore measures how closely their judgments align with those of GPT-4 (and, in the case of GPT-4 itself, how self-consistent its judgments are).
- GPT-3.5 (OpenAI)
- GPT-4 (OpenAI)
- Gemini-Pro (Google)
In [ ]:
%pip install llama-index-llms-openai
%pip install llama-index-llms-cohere
%pip install llama-index-llms-gemini
In [ ]:
import nest_asyncio

# Patch the running event loop so `await` works inside the notebook.
nest_asyncio.apply()
In [ ]:
!pip install "google-generativeai" -q
Load the Evaluation Dataset¶
Let's load the llama-dataset from llama-hub.
In [ ]:
from llama_index.core.llama_dataset import download_llama_dataset

# download dataset
evaluator_dataset, _ = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset", "./mini_mt_bench_data"
)
In [ ]:
evaluator_dataset.to_pandas()[:5]
Out[ ]:
|  | query | answer | contexts | ground_truth_answer | query_by | answer_by | ground_truth_answer_by | reference_feedback | reference_score | reference_evaluation_by |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Compose an engaging travel blog post about a r... | Title: Discovering the Aloha Spirit: A Recent ... | None | None | None | ai (llama2-70b) | None | The generated answer is highly relevant to the... | 5.0 | ai (gpt-4) |
| 1 | Rewrite your previous response. Start every se... | Absolutely, here's the revised response:\n\nAl... | None | None | None | ai (llama2-70b) | None | The generated answer is highly relevant and co... | 5.0 | ai (gpt-4) |
| 2 | Draft a professional email seeking your superv... | Subject: Seeking Feedback on Quarterly Financi... | None | None | None | ai (llama2-70b) | None | The generated answer is highly relevant to the... | 5.0 | ai (gpt-4) |
| 3 | Take a moment to evaluate and critique your ow... | My response was:\n\n"Subject: Seeking Feedback... | None | None | None | ai (llama2-70b) | None | The generated answer is highly relevant to the... | 5.0 | ai (gpt-4) |
| 4 | Imagine you are writing a blog post comparing ... | Sure, here's an outline for a blog post compar... | None | None | None | ai (llama2-70b) | None | The generated answer is highly relevant to the... | 5.0 | ai (gpt-4) |
Define Our Evaluators¶
In [ ]:
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI
from llama_index.llms.gemini import Gemini
from llama_index.llms.cohere import Cohere

llm_gpt4 = OpenAI(temperature=0, model="gpt-4")
llm_gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")
llm_gemini = Gemini(model="models/gemini-pro", temperature=0)

evaluators = {
    "gpt-4": CorrectnessEvaluator(llm=llm_gpt4),
    "gpt-3.5": CorrectnessEvaluator(llm=llm_gpt35),
    "gemini-pro": CorrectnessEvaluator(llm=llm_gemini),
}
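Before launching the full benchmark, it can help to sanity-check a single evaluator call. The snippet below is a minimal sketch and not part of the original notebook: the query and response strings are made up, and it assumes the standard CorrectnessEvaluator.evaluate call, which returns an EvaluationResult whose score is a 1-5 rating with accompanying feedback, mirroring the reference_score and reference_feedback columns above.

# Hypothetical sanity check on a made-up example (not part of the benchmark).
result = evaluators["gpt-4"].evaluate(
    query="What is the capital of France?",  # made-up query
    response="The capital of France is Paris.",  # made-up response
)
print(result.score, result.feedback)  # a 1-5 score plus the judge's rationale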
Benchmark With The EvaluatorBenchmarkerPack (llama-pack)¶
When using the EvaluatorBenchmarkerPack with a LabelledEvaluatorDataset, the returned benchmark results will contain values for the following metrics (see the sketch after the note below for how they could be computed):

- number_examples: the number of examples the dataset consists of.
- invalid_predictions: the number of evaluations that could not yield a final evaluation (e.g., due to an inability to parse the evaluation output, or an exception raised by the LLM evaluator).
- correlation: the correlation between the scores of the provided evaluator and those of the reference evaluator (in this case gpt-4).
- mae: the mean absolute error between the scores of the provided evaluator and those of the reference evaluator.
- hamming: the hamming distance between the scores of the provided evaluator and those of the reference evaluator.

NOTE: correlation, mae, and hamming are all computed with invalid predictions excluded. These are therefore conditional metrics, conditioned on the prediction being valid.
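The pack computes these metrics internally; the sketch below merely illustrates them with NumPy on two hypothetical score arrays. One caveat: judging by the results tables further down (where the strongest evaluator has the highest hamming value), the reported hamming appears to count exact score agreements rather than disagreements, so the sketch follows that convention as an assumption.

import numpy as np

# Hypothetical predicted vs. reference (gpt-4) scores, valid predictions only.
predicted = np.array([5.0, 4.0, 3.0, 5.0, 2.0])
reference = np.array([5.0, 5.0, 3.0, 4.0, 2.0])

correlation = np.corrcoef(predicted, reference)[0, 1]  # Pearson correlation
mae = np.mean(np.abs(predicted - reference))  # mean absolute error
# Assumption: count exact agreements, consistent with the tables below.
hamming = int(np.sum(predicted == reference))

print(correlation, mae, hamming)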
In [ ]:
from llama_index.core.llama_pack import download_llama_pack

EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)
GPT 3.5¶
In [ ]:
evaluator_benchmarker = EvaluatorBenchmarkerPack(
    evaluator=evaluators["gpt-3.5"],
    eval_dataset=evaluator_dataset,
    show_progress=True,
)
In [ ]:
gpt_3p5_benchmark_df = await evaluator_benchmarker.arun(
    batch_size=100, sleep_time_in_seconds=0
)
/Users/nerdai/Projects/llama_index/docs/examples/evaluation/pack/base.py:142: UserWarning: You've set a large batch_size (>10). If using OpenAI GPT-4 as `judge_llm` (which is the default judge_llm), you may experience a RateLimitError. Previous successful eval responses are cached per batch. So hitting a RateLimitError would mean you'd lose all of the current batches successful GPT-4 calls.
  warnings.warn(
Batch processing of predictions: 100%|████████████████████| 100/100 [00:05<00:00, 18.88it/s]
Batch processing of predictions: 100%|██████████████████████| 60/60 [00:04<00:00, 12.26it/s]
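If rate limits are a concern, the warning above suggests keeping batches small. A more conservative invocation (with hypothetical parameter values) would be:

gpt_3p5_benchmark_df = await evaluator_benchmarker.arun(
    batch_size=10, sleep_time_in_seconds=1  # smaller batches, brief pauses
)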
In [ ]:
gpt_3p5_benchmark_df.index = ["gpt-3.5"]
gpt_3p5_benchmark_df
Out[ ]:
|  | number_examples | invalid_predictions | correlation | mae | hamming |
|---|---|---|---|---|---|
| gpt-3.5 | 160 | 0 | 0.317047 | 1.11875 | 27 |
GPT-4¶
In [ ]:
evaluator_benchmarker = EvaluatorBenchmarkerPack(
    evaluator=evaluators["gpt-4"],
    eval_dataset=evaluator_dataset,
    show_progress=True,
)
In [ ]:
gpt_4_benchmark_df = await evaluator_benchmarker.arun(
    batch_size=100, sleep_time_in_seconds=0
)
/Users/nerdai/Projects/llama_index/docs/examples/evaluation/pack/base.py:142: UserWarning: You've set a large batch_size (>10). If using OpenAI GPT-4 as `judge_llm` (which is the default judge_llm), you may experience a RateLimitError. Previous successful eval responses are cached per batch. So hitting a RateLimitError would mean you'd lose all of the current batches successful GPT-4 calls.
  warnings.warn(
Batch processing of predictions: 100%|████████████████████| 100/100 [00:13<00:00, 7.26it/s]
Batch processing of predictions: 100%|██████████████████████| 60/60 [00:10<00:00, 5.92it/s]
In [ ]:
gpt_4_benchmark_df.index = ["gpt-4"]
gpt_4_benchmark_df
Out[ ]:
|  | number_examples | invalid_predictions | correlation | mae | hamming |
|---|---|---|---|---|---|
| gpt-4 | 160 | 0 | 0.966126 | 0.09375 | 143 |
Gemini Pro¶
In [ ]:
evaluator_benchmarker = EvaluatorBenchmarkerPack(
    evaluator=evaluators["gemini-pro"],
    eval_dataset=evaluator_dataset,
    show_progress=True,
)
In [ ]:
gemini_pro_benchmark_df = await evaluator_benchmarker.arun(
    batch_size=5, sleep_time_in_seconds=0.5
)
In [ ]:
gemini_pro_benchmark_df.index = ["gemini-pro"]
gemini_pro_benchmark_df
Out[ ]:
|  | number_examples | invalid_predictions | correlation | mae | hamming |
|---|---|---|---|---|---|
| gemini-pro | 160 | 1 | 0.295121 | 1.220126 | 12 |
In [ ]:
evaluator_benchmarker.prediction_dataset.save_json(
    "mt_sg_gemini_predictions.json"
)
In Summary¶
Putting all the benchmark results together.
In [ ]:
import pandas as pd

final_benchmark = pd.concat(
    [
        gpt_3p5_benchmark_df,
        gpt_4_benchmark_df,
        gemini_pro_benchmark_df,
    ],
    axis=0,
)
final_benchmark
Out[ ]:
|  | number_examples | invalid_predictions | correlation | mae | hamming |
|---|---|---|---|---|---|
| gpt-3.5 | 160 | 0 | 0.317047 | 1.118750 | 27 |
| gpt-4 | 160 | 0 | 0.966126 | 0.093750 | 143 |
| gemini-pro | 160 | 1 | 0.295121 | 1.220126 | 12 |
From the results above, we make the following observations:

- GPT-3.5 and Gemini-Pro appear to perform comparably, with GPT-3.5 perhaps slightly ahead in terms of closeness to GPT-4.
- Both, however, remain noticeably far from GPT-4's judgments.
- GPT-4 shows a high degree of self-consistency on this benchmark.