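RAGChecker: A Fine-Grained Diagnostic Evaluation Framework for RAG Systems
RAGChecker is a comprehensive evaluation framework designed for Retrieval-Augmented Generation (RAG) systems. It provides a suite of metrics for evaluating both the retrieval and generation components of a RAG system and for analyzing their performance in depth.
Key features of RAGChecker include:
- Fine-grained analysis based on claim-level entailment checking
- Comprehensive metrics covering overall performance, retriever effectiveness, and generator accuracy
- Actionable insights that can guide improvements to a RAG system
For more information, see the RAGChecker GitHub repository.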
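RAGChecker Metrics
RAGChecker provides metrics that evaluate different aspects of a RAG system:
Overall metrics:
- Precision: the proportion of claims in the model response that are correct
- Recall: the proportion of ground-truth claims that the model response covers
- F1 Score: the harmonic mean of precision and recall
Retriever metrics:
- Claim Recall: the proportion of ground-truth claims covered by the retrieved chunks
- Context Precision: the proportion of retrieved chunks that contain relevant content
Generator metrics:
- Context Utilization: how well the generator uses the relevant information in the retrieved chunks
- Noise Sensitivity: the generator's tendency to include incorrect information from the retrieved chunks
- Hallucination: the proportion of response claims that are incorrect and are not found in any retrieved chunk
- Self-knowledge: the proportion of response claims that are correct but are not found in any retrieved chunk
- Faithfulness: how closely the generator's response adheres to the retrieved chunks
Because these metrics assess the retrieval and generation components separately, they show which part of a RAG system to improve.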
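To make the overall metrics concrete, here is a minimal sketch of how precision, recall, and F1 follow from claim-level judgments. It is not RAGChecker's implementation: the function `overall_scores` and the boolean labels are hypothetical, and in practice the labels would come from an entailment checker rather than being written by hand.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class OverallScores:
    precision: float
    recall: float
    f1: float


def overall_scores(
    response_claim_correct: list[bool],
    gt_claim_covered: list[bool],
) -> OverallScores:
    """Compute claim-level precision, recall, and F1 as percentages.

    response_claim_correct[i] is True when the i-th claim extracted from the
    model response is judged correct.
    gt_claim_covered[j] is True when the j-th ground-truth claim is judged
    to be covered by the model response.
    """
    # Precision: share of response claims that are correct.
    precision = (
        sum(response_claim_correct) / len(response_claim_correct)
        if response_claim_correct
        else 0.0
    )
    # Recall: share of ground-truth claims that the response covers.
    recall = (
        sum(gt_claim_covered) / len(gt_claim_covered)
        if gt_claim_covered
        else 0.0
    )
    # F1: harmonic mean of precision and recall (0 when both are 0).
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return OverallScores(
        precision=round(100 * precision, 1),
        recall=round(100 * recall, 1),
        f1=round(100 * f1, 1),
    )


# Hypothetical labels: 2 of 3 response claims are correct,
# and 3 of 11 ground-truth claims are covered.
scores = overall_scores([True, True, False], [True] * 3 + [False] * 8)
print(scores)
```

With these hypothetical labels the ratios are 2/3 and 3/11, so a response can be mostly correct (high precision) while still missing most of the ground truth (low recall).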
Install Requirements
%pip install -qU ragchecker llama-index
Setup and Imports
First, import the necessary libraries:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from ragchecker.integrations.llama_index import response_to_rag_results
from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics
Creating a LlamaIndex Query Engine
Next, we create a simple LlamaIndex query engine over a sample set of documents:
# Load documents
documents = SimpleDirectoryReader("path/to/your/documents").load_data()
# Create index
index = VectorStoreIndex.from_documents(documents)
# Create query engine
rag_application = index.as_query_engine()
Using RAGChecker with LlamaIndex
Now we show how to use the response_to_rag_results function to convert LlamaIndex output into the RAGChecker format:
# User query and ground truth answer
user_query = "What is RAGChecker?"
gt_answer = "RAGChecker is an advanced automatic evaluation framework designed to assess and diagnose Retrieval-Augmented Generation (RAG) systems. It provides a comprehensive suite of metrics and tools for in-depth analysis of RAG performance."
# Get response from LlamaIndex
response_object = rag_application.query(user_query)
# Convert to RAGChecker format
rag_result = response_to_rag_results(
    query=user_query,
    gt_answer=gt_answer,
    response_object=response_object,
)
# Create RAGResults object
rag_results = RAGResults.from_dict({"results": [rag_result]})
print(rag_results)
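For results that do not come from LlamaIndex, the same structure can be assembled by hand. The sketch below builds a plain dictionary with the top-level "results" key used above. The per-result field names (`query_id`, `query`, `gt_answer`, `response`, `retrieved_context`, `doc_id`, `text`) are an assumption based on the input format described in the RAGChecker README, and `build_rag_result` is a hypothetical helper, so check them against the version you have installed.

```python
import json


def build_rag_result(query_id, query, gt_answer, response, chunks):
    """Assemble one result entry from plain Python values.

    chunks is a list of (doc_id, text) pairs for the retrieved context.
    Field names are assumed from the RAGChecker README, not verified here.
    """
    return {
        "query_id": query_id,
        "query": query,
        "gt_answer": gt_answer,
        "response": response,
        "retrieved_context": [
            {"doc_id": doc_id, "text": text} for doc_id, text in chunks
        ],
    }


# Hypothetical example values for illustration only.
payload = {
    "results": [
        build_rag_result(
            query_id="0",
            query="What is RAGChecker?",
            gt_answer="RAGChecker is an evaluation framework for RAG systems.",
            response="RAGChecker evaluates RAG systems.",
            chunks=[("doc-0", "RAGChecker is a framework for diagnosing RAG.")],
        )
    ]
}

print(json.dumps(payload, indent=2))
```

A dictionary of this shape is what the notebook passes to RAGResults.from_dict; the helper only spares you from writing the nesting by hand.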
Evaluating with RAGChecker
With the results in the correct format, we can now evaluate them with RAGChecker:
# Initialize RAGChecker
evaluator = RAGChecker(
    extractor_name="bedrock/meta.llama3-70b-instruct-v1:0",
    checker_name="bedrock/meta.llama3-70b-instruct-v1:0",
    batch_size_extractor=32,
    batch_size_checker=32,
)
# Evaluate using RAGChecker
evaluator.evaluate(rag_results, all_metrics)
# Print detailed results
print(rag_results)
The output will look like this:
RAGResults(
  1 RAG results,
  Metrics:
  {
    "overall_metrics": {
      "precision": 66.7,
      "recall": 27.3,
      "f1": 38.7
    },
    "retriever_metrics": {
      "claim_recall": 54.5,
      "context_precision": 100.0
    },
    "generator_metrics": {
      "context_utilization": 16.7,
      "noise_sensitivity_in_relevant": 0.0,
      "noise_sensitivity_in_irrelevant": 0.0,
      "hallucination": 33.3,
      "self_knowledge": 0.0,
      "faithfulness": 66.7
    }
  }
)
This output summarizes the RAG system's performance using the overall, retriever, and generator metrics described earlier.
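The metrics are reported as a nested dictionary with one entry per metric group. As a minimal sketch of working with that shape, the snippet below flattens a copy of the sample values into dotted keys and checks that the reported F1 agrees with the harmonic mean of precision and recall. It operates on a plain dictionary typed in by hand; how to pull this dictionary out of the RAGResults object is not shown in this notebook, and `flatten_metrics` is a hypothetical helper.

```python
def flatten_metrics(metrics: dict) -> dict:
    """Turn {"group": {"name": score}} into {"group.name": score}."""
    flat = {}
    for group, values in metrics.items():
        for name, score in values.items():
            flat[f"{group}.{name}"] = score
    return flat


# Values copied by hand from the sample output above.
sample_metrics = {
    "overall_metrics": {"precision": 66.7, "recall": 27.3, "f1": 38.7},
    "retriever_metrics": {"claim_recall": 54.5, "context_precision": 100.0},
    "generator_metrics": {
        "context_utilization": 16.7,
        "noise_sensitivity_in_relevant": 0.0,
        "noise_sensitivity_in_irrelevant": 0.0,
        "hallucination": 33.3,
        "self_knowledge": 0.0,
        "faithfulness": 66.7,
    },
}

flat = flatten_metrics(sample_metrics)

# Recompute F1 from the reported precision and recall as a sanity check.
p = flat["overall_metrics.precision"]
r = flat["overall_metrics.recall"]
f1_check = round(2 * p * r / (p + r), 1)

for key, score in flat.items():
    print(f"{key}: {score}")
print("F1 recomputed from precision and recall:", f1_check)
```

A flat layout like this makes it easier to log one row per evaluation run and compare runs side by side.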
Selecting Specific Metric Groups
Instead of evaluating every metric with all_metrics, you can select specific metric groups as follows:
from ragchecker.metrics import (
    overall_metrics,
    retriever_metrics,
    generator_metrics,
)
Selecting Individual Metrics
For finer-grained control, you can select individual metrics to suit your needs:
from ragchecker.metrics import (
    precision,
    recall,
    f1,
    claim_recall,
    context_precision,
    context_utilization,
    noise_sensitivity_in_relevant,
    noise_sensitivity_in_irrelevant,
    hallucination,
    self_knowledge,
    faithfulness,
)
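Conclusion
This notebook demonstrated how to integrate RAGChecker with LlamaIndex to evaluate the performance of a RAG system. We covered:
- Setting up RAGChecker with LlamaIndex
- Converting LlamaIndex output into the RAGChecker format
- Evaluating RAG results with a range of metrics
- Customizing the evaluation with specific metric groups or individual metrics
With RAGChecker's metrics you can see how your RAG system performs, identify areas for improvement, and tune the retrieval and generation components separately. This integration is a practical tool for developing and refining more effective RAG applications.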