Tonic Validate Evaluators¶
This notebook has some basic usage examples of how to use Tonic Validate's RAG evaluation metrics through LlamaIndex. To use these evaluators, you need to have the tonic_validate library installed, which you can do via pip install tonic-validate.
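Tonic Validate's metrics are themselves scored by an LLM, which by default is an OpenAI model, so an OpenAI API key should be available in the environment before running the evaluators. A minimal sketch (the exact default scoring model depends on your tonic_validate version):

```python
import os

# Assumption: the evaluators call an OpenAI model to produce scores, so the
# key must be set before any evaluate/aevaluate call. Replace the placeholder
# with a real key, or export OPENAI_API_KEY in your shell instead.
os.environ["OPENAI_API_KEY"] = "sk-..."
```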
%pip install llama-index-evaluation-tonic-validate
import json
import pandas as pd
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.evaluation.tonic_validate import (
AnswerConsistencyEvaluator,
AnswerSimilarityEvaluator,
AugmentationAccuracyEvaluator,
AugmentationPrecisionEvaluator,
RetrievalPrecisionEvaluator,
TonicValidateEvaluator,
)
Single Question Usage Example¶
For this example, we have a question whose reference correct answer does not match the LLM response answer. There are two retrieved context chunks, of which only one contains the correct answer.
question = "What makes Sam Altman a good founder?"
reference_answer = "He is smart and has a great force of will."
llm_answer = "He is a good founder because he is smart."
retrieved_context_list = [
"Sam Altman is a good founder. He is very smart.",
"What makes Sam Altman such a good founder is his great force of will.",
]
The answer similarity score is a score between 0 and 5 that measures how well the LLM answer matches the reference answer. In this case, the two answers do not match perfectly, so the answer similarity score is not a perfect 5.
answer_similarity_evaluator = AnswerSimilarityEvaluator()
score = await answer_similarity_evaluator.aevaluate(
question,
llm_answer,
retrieved_context_list,
reference_response=reference_answer,
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=4.0, pairwise_source=None, invalid_result=False, invalid_reason=None)
The answer consistency score is between 0.0 and 1.0, and measures whether the answer contains any information that does not appear in the retrieved context. In this case, the answer does appear in the retrieved context, so the score is 1.
answer_consistency_evaluator = AnswerConsistencyEvaluator()
score = await answer_consistency_evaluator.aevaluate(
question, llm_answer, retrieved_context_list
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=1.0, pairwise_source=None, invalid_result=False, invalid_reason=None)
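Conceptually, answer consistency is the fraction of claims in the answer that are supported by the retrieved context. The real metric uses an LLM to extract and check claims; the arithmetic behind the score can be sketched with hypothetical hand-labelled claims:

```python
# Hypothetical claim breakdown of the LLM answer (the real extraction is
# done by an LLM, not by hand).
claims = [
    "Sam Altman is a good founder",  # supported by the first context
    "He is smart",                   # supported by the first context
]
supported = [True, True]

# Fraction of claims backed by the retrieved context.
answer_consistency = sum(supported) / len(claims)
print(answer_consistency)  # 1.0
```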
Augmentation accuracy measures the percentage of the retrieved context that appears in the answer. In this case, one of the two retrieved contexts appears in the answer, so the score is 0.5.
augmentation_accuracy_evaluator = AugmentationAccuracyEvaluator()
score = await augmentation_accuracy_evaluator.aevaluate(
question, llm_answer, retrieved_context_list
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=0.5, pairwise_source=None, invalid_result=False, invalid_reason=None)
Augmentation precision measures whether the relevant retrieved context makes it into the answer. Both of the retrieved contexts are relevant, but only one makes it into the answer, so the score is 0.5.
augmentation_precision_evaluator = AugmentationPrecisionEvaluator()
score = await augmentation_precision_evaluator.aevaluate(
question, llm_answer, retrieved_context_list
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=0.5, pairwise_source=None, invalid_result=False, invalid_reason=None)
Retrieval precision measures the percentage of the retrieved context that is relevant to the question. In this case, both of the retrieved contexts are relevant to the question, so the score is 1.0.
retrieval_precision_evaluator = RetrievalPrecisionEvaluator()
score = await retrieval_precision_evaluator.aevaluate(
question, llm_answer, retrieved_context_list
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=1.0, pairwise_source=None, invalid_result=False, invalid_reason=None)
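Each of the 0-to-1 scores above reduces to a fraction over the two retrieved contexts. The relevance and usage judgments come from an LLM, but with hand-labelled counts the reported numbers work out as follows (an illustrative sketch, not the library's implementation):

```python
n_retrieved = 2           # contexts retrieved for the question
n_relevant = 2            # contexts judged relevant to the question
n_in_answer = 1           # retrieved contexts whose content appears in the answer
n_relevant_in_answer = 1  # relevant contexts that made it into the answer

augmentation_accuracy = n_in_answer / n_retrieved           # 0.5
augmentation_precision = n_relevant_in_answer / n_relevant  # 0.5
retrieval_precision = n_relevant / n_retrieved              # 1.0
print(augmentation_accuracy, augmentation_precision, retrieval_precision)
```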
TonicValidateEvaluator can calculate all of Tonic Validate's metrics at once.
tonic_validate_evaluator = TonicValidateEvaluator()
scores = await tonic_validate_evaluator.aevaluate(
question,
llm_answer,
retrieved_context_list,
reference_response=reference_answer,
)
scores.score_dict
{'answer_consistency': 1.0,
'answer_similarity': 4.0,
'augmentation_accuracy': 0.5,
'augmentation_precision': 0.5,
'retrieval_precision': 1.0}
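Note that answer_similarity is on a 0-to-5 scale while the other metrics are on 0-to-1. If you want a single headline number for a response, one option is to rescale the similarity score and average; this aggregation is our own illustration, not a Tonic Validate API:

```python
# score_dict as returned above.
score_dict = {
    "answer_consistency": 1.0,
    "answer_similarity": 4.0,
    "augmentation_accuracy": 0.5,
    "augmentation_precision": 0.5,
    "retrieval_precision": 1.0,
}

def overall_score(scores: dict) -> float:
    # Rescale the 0-5 similarity score to 0-1 so all metrics share a range.
    rescaled = {
        k: (v / 5.0 if k == "answer_similarity" else v) for k, v in scores.items()
    }
    return sum(rescaled.values()) / len(rescaled)

print(round(overall_score(score_dict), 2))  # 0.76
```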
You can also evaluate more than one query and response at once using TonicValidateEvaluator, and return a tonic_validate Run object that can be logged to the Tonic Validate UI (validate.tonic.ai).
To do this, you pass in a list of questions, a list of LLM answers, a list of retrieved context lists, and a list of reference answers, and then call the aevaluate_run method.
tonic_validate_evaluator = TonicValidateEvaluator()
scores = await tonic_validate_evaluator.aevaluate_run(
[question], [llm_answer], [retrieved_context_list], [reference_answer]
)
scores.run_data[0].scores
{'answer_consistency': 1.0,
'answer_similarity': 3.0,
'augmentation_accuracy': 0.5,
'augmentation_precision': 0.5,
'retrieval_precision': 1.0}
Labelled RAG Dataset Example¶
Let's use the EvaluatingLlmSurveyPaperDataset and evaluate the default LlamaIndex RAG system using Tonic Validate's answer similarity score. EvaluatingLlmSurveyPaperDataset is a LabelledRagDataset, so it comes with reference correct answers for each question. The dataset contains 276 questions and reference answers about the paper Evaluating Large Language Models: A Comprehensive Survey.
We'll use TonicValidateEvaluator with the answer similarity score metric to evaluate the responses from the default RAG system on this dataset.
!llamaindex-cli download-llamadataset EvaluatingLlmSurveyPaperDataset --download-dir ./data
from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core import VectorStoreIndex
rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data(
num_workers=4
) # parallel loading
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
predictions_dataset = rag_dataset.make_predictions_with(query_engine)
questions, retrieved_context_lists, reference_answers, llm_answers = zip(
*[
(e.query, e.reference_contexts, e.reference_answer, p.response)
for e, p in zip(rag_dataset.examples, predictions_dataset.predictions)
]
)
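The zip(* ... ) idiom above transposes a list of (question, contexts, reference answer, response) records into four parallel tuples. A minimal self-contained example of the same pattern, using made-up records in place of the dataset examples and predictions:

```python
# Made-up (query, contexts, reference, response) records, standing in for the
# paired rag_dataset.examples and predictions_dataset.predictions above.
records = [
    ("q1", ["ctx1a", "ctx1b"], "ref1", "ans1"),
    ("q2", ["ctx2a"], "ref2", "ans2"),
]

# zip(*records) yields one tuple per column of the records.
questions, retrieved_context_lists, reference_answers, llm_answers = zip(*records)
print(questions)    # ('q1', 'q2')
print(llm_answers)  # ('ans1', 'ans2')
```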
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 2.09it/s] Successfully downloaded EvaluatingLlmSurveyPaperDataset to ./data
from tonic_validate.metrics import AnswerSimilarityMetric
tonic_validate_evaluator = TonicValidateEvaluator(
metrics=[AnswerSimilarityMetric()], model_evaluator="gpt-4-1106-preview"
)
scores = await tonic_validate_evaluator.aevaluate_run(
questions, llm_answers, retrieved_context_lists, reference_answers
)
overall_scores gives the average scores over the 276 questions in the dataset.
scores.overall_scores
{'answer_similarity': 2.2644927536231885}
Using pandas and matplotlib, we can plot a histogram of the similarity scores.
import matplotlib.pyplot as plt
import pandas as pd
score_list = [x.scores["answer_similarity"] for x in scores.run_data]
value_counts = pd.Series(score_list).value_counts()
fig, ax = plt.subplots()
ax.bar(list(value_counts.index), list(value_counts))
ax.set_title("Answer Similarity Score Value Counts")
plt.show()
As 0 is the most common score, there is much room for improvement. This makes sense, as we are using the default parameters. We could tune the many possible RAG parameters to improve these results.