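Trustworthy RAG with LlamaIndex and Cleanlab¶
Large language models (LLMs) occasionally hallucinate incorrect answers, especially for questions that are poorly supported by their training data. Many organizations are adopting retrieval-augmented generation (RAG) to give LLMs access to proprietary data, but incorrect RAG responses remain a problem.
This tutorial shows how to build a trustworthy RAG application: use Cleanlab to score the trustworthiness of every LLM response, and diagnose why a response is untrustworthy by evaluating specific RAG components.
Powered by state-of-the-art uncertainty estimation techniques, Cleanlab's trustworthiness score helps you automatically catch incorrect responses from any LLM application. Scoring runs in real time and requires no data labeling or model training. Cleanlab also provides real-time evaluations of specific RAG components, such as the retrieved context, to help you trace an incorrect RAG response to its root cause. With Cleanlab, you can keep your RAG application from returning inaccurate responses and losing your users' trust.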
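Setup¶
This tutorial requires the following:
- Cleanlab API key: sign up at tlm.cleanlab.ai/ to get a free key
- OpenAI API key: used to make completion requests to the LLM
First, install the required dependencies.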
%pip install llama-index cleanlab-tlm
import os, re
from typing import List, ClassVar
import pandas as pd
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from cleanlab_tlm import TrustworthyRAG, Eval, get_default_evals
Initialize the OpenAI client with your API key.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
llm = OpenAI(model="gpt-4o-mini")
embed_model = OpenAIEmbedding(embed_batch_size=10)
Next, initialize the Cleanlab client with its default configuration. Adjusting the optional configurations can improve detection accuracy and latency.
os.environ["CLEANLAB_TLM_API_KEY"] = "<your-cleanlab-api-key>"
# Optional configurations can improve accuracy/latency
trustworthy_rag = TrustworthyRAG()
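Read data¶
This tutorial uses NVIDIA's earnings report for the first quarter of fiscal 2024 as an example data source to populate the RAG application's knowledge base.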
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/NVIDIA_Financial_Results_Q1_FY2024.md'
!mkdir -p ./data
!mv NVIDIA_Financial_Results_Q1_FY2024.md data/
--2025-05-07 16:13:28-- https://cleanlab-public.s3.amazonaws.com/Datasets/NVIDIA_Financial_Results_Q1_FY2024.md
Resolving cleanlab-public.s3.amazonaws.com (cleanlab-public.s3.amazonaws.com)... 54.231.236.193, 16.182.70.65, 52.217.14.204, ...
Connecting to cleanlab-public.s3.amazonaws.com (cleanlab-public.s3.amazonaws.com)|54.231.236.193|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7379 (7.2K) [binary/octet-stream]
Saving to: ‘NVIDIA_Financial_Results_Q1_FY2024.md’
NVIDIA_Financial_Re 100%[===================>] 7.21K --.-KB/s in 0s
2025-05-07 16:13:28 (97.7 MB/s) - ‘NVIDIA_Financial_Results_Q1_FY2024.md’ saved [7379/7379]
with open(
    "data/NVIDIA_Financial_Results_Q1_FY2024.md", "r", encoding="utf-8"
) as file:
    data = file.read()
print(data[:200])
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago
Build a RAG pipeline¶
Now let's build a simple RAG pipeline with LlamaIndex. We have already initialized the OpenAI API for both the LLM and the embedding model.
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
Settings.llm = llm
Settings.embed_model = embed_model
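Load data and create the index and query engine¶
Let's create an index from the document we just downloaded. We stick with LlamaIndex's default index configuration for this tutorial.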
documents = SimpleDirectoryReader("data").load_data()
# Optional step since we're loading just one data file
for doc in documents:
    # file_path isn't useful metadata to add to the LLM's context,
    # since our data source contains just one file
    doc.excluded_llm_metadata_keys.append("file_path")
index = VectorStoreIndex.from_documents(documents)
The generated index powers a query engine over the data.
query_engine = index.as_query_engine()
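Note that Cleanlab is agnostic to the index and query engine your RAG system uses, and is compatible with whatever you choose for these components.
You can also use Cleanlab directly in an existing custom RAG pipeline (with or without streaming, and with any other LLM generator).
Cleanlab only needs the prompt sent to the LLM (including system instructions, retrieved context, user query, etc.) and the generated response.
We define an event handler that stores the prompt template LlamaIndex uses to build the prompt it sends to the LLM. See the instrumentation documentation for more details.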
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.events import BaseEvent
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.core.instrumentation.events.llm import LLMPredictStartEvent
class PromptEventHandler(BaseEventHandler):
    events: ClassVar[List[BaseEvent]] = []
    PROMPT_TEMPLATE: str = ""

    @classmethod
    def class_name(cls) -> str:
        return "PromptEventHandler"

    def handle(self, event) -> None:
        if isinstance(event, LLMPredictStartEvent):
            self.PROMPT_TEMPLATE = event.template.default_template.template
            self.events.append(event)


# Root dispatcher
root_dispatcher = get_dispatcher()
# Register event handler
event_handler = PromptEventHandler()
root_dispatcher.add_event_handler(event_handler)
For each query, we can get the prompt template from event_handler.PROMPT_TEMPLATE. Let's see it in action.
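To make the role of the template concrete, the sketch below fills a prompt template with retrieved context and a user query using plain `str.format`. The template text and the `build_prompt` helper are hypothetical stand-ins written for illustration, not LlamaIndex's actual default prompt; only the `context_str` and `query_str` placeholder names come from this tutorial.

```python
# Hypothetical template for illustration only; in the tutorial the real one
# is whatever event_handler.PROMPT_TEMPLATE captured from LlamaIndex.
EXAMPLE_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Answer the query using only the context above.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_prompt(template: str, context: str, query: str) -> str:
    """Fill the two placeholders this tutorial relies on."""
    return template.format(context_str=context, query_str=query)


example_prompt = build_prompt(
    EXAMPLE_TEMPLATE,
    context="NVIDIA reported revenue of $7.19 billion.",
    query="What was NVIDIA's total revenue?",
)
print(example_prompt)
```

The resulting string is the kind of full prompt that gets passed to Cleanlab alongside the query, context, and response.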
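Use our RAG application¶
Now that the vector database is loaded with text chunks and their corresponding embeddings, we can start querying it to answer questions.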
query = "What was NVIDIA's total revenue in the first quarter of fiscal 2024?"
response = query_engine.query(query)
print(response)
NVIDIA's total revenue in the first quarter of fiscal 2024 was $7.19 billion.
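For this simple query, the response is indeed correct. Let's look at the document chunks LlamaIndex retrieved for this query, from which we can easily verify the response.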
def get_retrieved_context(response, print_chunks=False):
    # Accept either a list of retrieved nodes or a query response object
    if isinstance(response, list):
        texts = [node.text for node in response]
    else:
        texts = [src.node.text for src in response.source_nodes]
    if print_chunks:
        for idx, text in enumerate(texts):
            print(f"--- Chunk {idx + 1} ---\n{text[:200]}...")
    return "\n".join(texts)


context_str = get_retrieved_context(response, True)
--- Chunk 1 ---
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago ...
--- Chunk 2 ---
- **Gross Margins**: GAAP and non-GAAP gross margins are expected to be 68.6% and 70.0%, respectively, plus or minus 50 basis points. - **Operating Expenses**: GAAP and non-GAAP operating expenses are...
default_evals = get_default_evals()
for default_eval in default_evals:
    print(default_eval.name)
context_sufficiency
response_groundedness
response_helpfulness
query_ease
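Each eval returns a score between 0 and 1 (higher is better) that measures a different aspect of the RAG system:
- context_sufficiency: Evaluates whether the retrieved context contains enough information to fully answer the query. A low score indicates that key information is missing from the context (possibly due to poor retrieval or missing documents).
- response_groundedness: Evaluates whether the claims and information stated in the response are explicitly supported by the provided context.
- response_helpfulness: Evaluates whether the response attempts to address the user's query in a helpful way.
- query_ease: Evaluates whether the user's query is easy for an AI system to handle correctly. Complex, vague, tricky, or disgruntled-sounding queries receive lower scores.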
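One way to put these scores to work is to map each low score to a likely cause. The sketch below is a minimal illustration: the `LIKELY_CAUSES` mapping, the `diagnose` helper, and the 0.5 threshold are all assumptions made here (the wording paraphrases the eval descriptions), not part of Cleanlab's API, and the example scores are rounded illustrative values rather than output from a real run.

```python
from typing import Dict, List

# Hypothetical mapping from eval name to a likely cause, paraphrasing the
# eval descriptions; this is not part of Cleanlab's API.
LIKELY_CAUSES: Dict[str, str] = {
    "context_sufficiency": "retrieval missed needed information or a document is absent",
    "response_groundedness": "the LLM made claims not supported by the context",
    "response_helpfulness": "the response does not address the query",
    "query_ease": "the query is complex, vague, or tricky",
}


def diagnose(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return a likely cause for every known eval scoring below the threshold."""
    return [
        f"{name}: {LIKELY_CAUSES[name]}"
        for name, score in scores.items()
        if name in LIKELY_CAUSES and score < threshold
    ]


# Rounded illustrative scores, not output from a real run.
example_scores = {
    "context_sufficiency": 0.26,
    "response_groundedness": 0.81,
    "response_helpfulness": 0.29,
    "query_ease": 0.95,
}
for finding in diagnose(example_scores):
    print(finding)
```

A suitable threshold depends on the application, so 0.5 should be treated as a placeholder to tune.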
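Running TrustworthyRAG requires the prompt given to the LLM (including the system message, retrieved chunks, and user query) and the LLM's response. The event handler defined above provides the template for this prompt. Below we define a helper function to run Cleanlab's detection.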
# Helper function to run real-time Evals
def get_eval(query, response, event_handler, evaluator):
    # Get context used by LLM to generate response
    context = get_retrieved_context(response)
    # Get prompt template used to build the prompt
    pt = event_handler.PROMPT_TEMPLATE
    # Build prompt
    full_prompt = pt.format(context_str=context, query_str=query)

    # Evaluate the response using TrustworthyRAG
    eval_result = evaluator.score(
        query=query,
        context=context,
        response=response.response,
        prompt=full_prompt,
    )
    print("### Evaluation results:")
    for metric, value in eval_result.items():
        print(f"{metric}: {value['score']}")


# Helper function to run end-to-end RAG
def get_answer(query, evaluator=trustworthy_rag, event_handler=event_handler):
    response = query_engine.query(query)
    print(
        f"### Query:\n{query}\n\n### Trimmed Context:\n{get_retrieved_context(response)[:300]}..."
    )
    print(f"\n### Generated response:\n{response.response}\n")
    get_eval(query, response, event_handler, evaluator)


get_eval(query, response, event_handler, trustworthy_rag)
### Evaluation results:
trustworthiness: 1.0
context_sufficiency: 0.9975124377856721
response_groundedness: 0.9975124378045552
response_helpfulness: 0.9975124367363073
query_ease: 0.9975071027792313
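Analysis: The high trustworthiness score indicates that this response is very trustworthy, i.e. free of hallucination and likely correct. The retrieved context is sufficient to answer this query, as reflected in the high context_sufficiency score. The high query_ease score also indicates that this is a straightforward query.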
Now let's run a challenging query that cannot be answered using the only document in our RAG application's knowledge base.
get_answer(
    "How does the report explain why NVIDIA's Gaming revenue decreased year over year?"
)
### Query:
How does the report explain why NVIDIA's Gaming revenue decreased year over year?

### Trimmed Context:
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the pre...

### Generated response:
The report indicates that NVIDIA's Gaming revenue decreased year over year by 38%, which is attributed to a combination of factors, although specific reasons are not detailed. The context highlights that the revenue for the first quarter was $2.24 billion, down from the previous year, while it did show an increase of 22% from the previous quarter. This suggests that while there may have been a seasonal or cyclical recovery, the overall year-over-year decline reflects challenges in the gaming segment during that period.

### Evaluation results:
trustworthiness: 0.8018049078305449
context_sufficiency: 0.26134514055082803
response_groundedness: 0.8147481620994604
response_helpfulness: 0.28647897539109127
query_ease: 0.952132218665045
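Analysis: The generator LLM avoided conjecture by giving a reliable response, as reflected in the relatively high trustworthiness score. The low context_sufficiency score shows that the retrieved context was insufficient, and the low response_helpfulness score indicates that the response does not actually answer the user's query.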
Let's see how our RAG system responds to another challenging question.
get_answer(
    "How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?"
)
### Query:
How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?

### Trimmed Context:
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the pre...

### Generated response:
NVIDIA's revenue decreased by $1.10 billion this quarter compared to the last quarter.

### Evaluation results:
trustworthiness: 0.572441384819641
context_sufficiency: 0.9974990573223977
response_groundedness: 0.006136548076912901
response_helpfulness: 0.997512230771839
query_ease: 0.8018484929561781
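Analysis: The generated response incorrectly claims that NVIDIA's revenue decreased this quarter, when the referenced report actually states that revenue increased 19% quarter over quarter.
Cleanlab's low trustworthiness score helped us automatically catch this incorrect RAG response in real time! To trace why the response is untrustworthy, note the low response_groundedness score, which indicates that the LLM is responsible for fabricating this false information.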
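The claim can also be checked by hand from the figures in the retrieved context: revenue of $7.19 billion, up 19% from the previous quarter. Here is a minimal sketch of that arithmetic, assuming the 19% is an exact rate (the report rounds it, so the results are approximate). The helper name `previous_quarter_revenue` is introduced here for illustration.

```python
def previous_quarter_revenue(current: float, qoq_growth: float) -> float:
    """Back out last quarter's revenue from this quarter's revenue and growth rate."""
    return current / (1.0 + qoq_growth)


current_revenue = 7.19  # billions of dollars, from the report
qoq_growth = 0.19       # "up 19% from the previous quarter"

previous_revenue = previous_quarter_revenue(current_revenue, qoq_growth)
change = current_revenue - previous_revenue  # positive means revenue increased

print(f"Previous quarter: ~${previous_revenue:.2f}B")
print(f"Change vs previous quarter: ~{change:+.2f}B")
```

The change comes out positive, a little over $1.1 billion, so any response that reports a decrease contradicts the context.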
Let's try one more:
get_answer(
    "If NVIDIA's Data Center segment maintains its Q1 FY2024 quarter-over-quarter growth rate for the next four quarters, what would be its projected annual revenue?"
)
### Query:
If NVIDIA's Data Center segment maintains its Q1 FY2024 quarter-over-quarter growth rate for the next four quarters, what would be its projected annual revenue?

### Trimmed Context:
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the pre...

### Generated response:
If NVIDIA's Data Center segment maintains its quarter-over-quarter growth rate of 18% from Q1 FY2024 for the next four quarters, the projected revenue for the next four quarters can be calculated as follows:
1. Q1 FY2024 revenue: $4.28 billion
2. Q2 FY2024 projected revenue: $4.28 billion * 1.18 = $5.05 billion
3. Q3 FY2024 projected revenue: $5.05 billion * 1.18 = $5.95 billion
4. Q4 FY2024 projected revenue: $5.95 billion * 1.18 = $7.02 billion
5. Q5 FY2024 projected revenue: $7.02 billion * 1.18 = $8.27 billion
Now, summing these revenues for the five quarters (including Q1 FY2024):
- Total projected revenue = $4.28 billion + $5.05 billion + $5.95 billion + $7.02 billion + $8.27 billion = $30.57 billion
Therefore, the projected annual revenue for the Data Center segment would be approximately $30.57 billion.

### Evaluation results:
trustworthiness: 0.23124932848015411
context_sufficiency: 0.9299227307108295
response_groundedness: 0.31247206392894905
response_helpfulness: 0.9975055879546202
query_ease: 0.7724662723193096
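Analysis: Reviewing the generated response, we find that it overstates the projected revenue by also summing in the Q1 financials. Cleanlab again helps us automatically catch this incorrect response via a low trustworthiness score. Based on the additional evals, the root cause again appears to be the LLM failing to ground its response in the retrieved context.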
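To see the size of the overstatement, the sketch below redoes the compounding. It takes the $4.28 billion starting revenue and 18% growth rate quoted in the generated response at face value (they are not independently verified here), and it assumes that "projected annual revenue" means the sum of the four projected quarters only, which is one reasonable reading of the question. The `project_quarters` helper is hypothetical.

```python
from typing import List


def project_quarters(start: float, growth: float, n_quarters: int) -> List[float]:
    """Compound `start` by `growth` once for each of the next n_quarters."""
    revenues = []
    current = start
    for _ in range(n_quarters):
        current *= 1.0 + growth
        revenues.append(current)
    return revenues


q1_revenue = 4.28   # billions, as quoted in the generated response
growth_rate = 0.18  # quarter over quarter, as quoted in the generated response

next_four = project_quarters(q1_revenue, growth_rate, 4)
four_quarter_total = sum(next_four)
five_quarter_total = q1_revenue + four_quarter_total  # what the response summed

print(f"Next four quarters: ~${four_quarter_total:.2f}B")
print(f"Including Q1 (five quarters): ~${five_quarter_total:.2f}B")
```

The gap between the two totals is exactly the Q1 figure, which is the amount by which a five-quarter sum inflates a four-quarter projection.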
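Custom evals¶
You can also specify custom evals to check for specific criteria and combine them with the default evals for a comprehensive, tailored evaluation of your RAG system.
For example, the following shows how to create and run a custom eval that checks the conciseness of the generated response.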
conciseness_eval = Eval(
    name="response_conciseness",
    criteria="Evaluate whether the Generated response is concise and to the point without unnecessary verbosity or repetition. A good response should be brief but comprehensive, covering all necessary information without extra words or redundant explanations.",
    response_identifier="Generated Response",
)
# Combine default evals with a custom eval
combined_evals = get_default_evals() + [conciseness_eval]
# Initialize TrustworthyRAG with combined evals
combined_trustworthy_rag = TrustworthyRAG(evals=combined_evals)
get_answer(
    "What significant transitions did Jensen comment on?",
    evaluator=combined_trustworthy_rag,
)
### Query:
What significant transitions did Jensen comment on?

### Trimmed Context:
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the pre...

### Generated response:
Jensen Huang commented on the significant transitions the computer industry is undergoing, particularly in the areas of accelerated computing and generative AI.

### Evaluation results:
trustworthiness: 0.9810004109697261
context_sufficiency: 0.9902170786836257
response_groundedness: 0.9975123614036665
response_helpfulness: 0.9420916924086002
query_ease: 0.5334109647649754
response_conciseness: 0.842668665703559
query = "How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?"
relevant_chunks = query_engine.retrieve(query)
context = get_retrieved_context(relevant_chunks)
print(f"### Query:\n{query}\n\n### Trimmed Context:\n{context[:300]}")
pt = event_handler.PROMPT_TEMPLATE
full_prompt = pt.format(context_str=context, query_str=query)
# Generate the response with Cleanlab and score it in the same call
result = trustworthy_rag.generate(
    query=query, context=context, prompt=full_prompt
)
print(f"\n### Generated Response:\n{result['response']}\n")
print("### Evaluation Scores:")
for metric, value in result.items():
    if metric != "response":
        print(f"{metric}: {value['score']}")
### Query:
How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?

### Trimmed Context:
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the pre

### Generated Response:
NVIDIA's revenue for the first quarter of fiscal 2024 was $7.19 billion, and for the previous quarter (Q4 FY23), it was $6.05 billion. Therefore, the revenue increased by $1.14 billion from the previous quarter, not decreased. So, the revenue did not decrease this quarter vs last quarter; it actually increased by $1.14 billion.

### Evaluation Scores:
trustworthiness: 0.6810414232214796
context_sufficiency: 0.9974887437375295
response_groundedness: 0.9975116791816968
response_helpfulness: 0.3293002430120912
query_ease: 0.33275910932109172
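While it remains hard to build a RAG application that accurately answers any question, you can easily use Cleanlab to deploy a trustworthy RAG application that at least flags answers that may be inaccurate. See the Cleanlab documentation to learn about the optional configurations you can adjust to improve accuracy/latency.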
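Flagging can be as simple as comparing the trustworthiness score against a cutoff before showing an answer. The sketch below is one possible approach under stated assumptions: the `guard_response` helper, the fallback wording, and the 0.6 threshold are hypothetical choices made here, and the scores are assumed to be already flattened to plain floats (for example with `{name: value["score"] for name, value in eval_result.items()}`).

```python
from typing import Dict

# Hypothetical fallback wording; adapt it to your application.
FALLBACK_MESSAGE = "I'm not confident in an answer to this question."


def guard_response(
    response: str, scores: Dict[str, float], threshold: float = 0.6
) -> str:
    """Return the response only if its trustworthiness score meets the threshold.

    A missing trustworthiness score is treated as untrusted.
    """
    if scores.get("trustworthiness", 0.0) < threshold:
        return FALLBACK_MESSAGE
    return response


# Rounded illustrative scores, not output from a real run.
low_trust_scores = {"trustworthiness": 0.23, "response_groundedness": 0.31}
high_trust_scores = {"trustworthiness": 0.98, "response_groundedness": 0.99}

print(guard_response("Projected revenue is $30.57 billion.", low_trust_scores))
print(guard_response("Revenue was $7.19 billion.", high_trust_scores))
```

Treating a missing score as untrusted is a deliberately conservative choice; an application could instead escalate such responses to a human reviewer.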