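Trustworthy RAG with the Trustworthy Language Model¶

This tutorial demonstrates how to use Cleanlab's Trustworthy Language Model (TLM) in any RAG system to score the trustworthiness of answers and automatically catch incorrect or hallucinated responses in real time.

Current RAG and agent applications often produce unreliable responses because they rely on LLMs, which are fundamentally unreliable. Cleanlab's Trustworthy Language Model scores the trustworthiness of every LLM response in real time, using state-of-the-art uncertainty estimation for LLMs. Cleanlab works regardless of your RAG architecture or your retrieval and indexing processes.

To diagnose when RAG answers cannot be trusted, this tutorial shows how to replace your LLM with Cleanlab's, which both generates responses and scores their trustworthiness. You can alternatively use Cleanlab only to score responses from your unmodified RAG system and to run other real-time evaluations; see our evaluation tutorial.

Setup¶

RAG is all about connecting LLMs to data so that they give more accurate answers. This tutorial uses NVIDIA's Q1 FY2024 earnings report as an example dataset. Use the following commands to download the data (the earnings report) and store it in a directory named data/.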

!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/NVIDIA_Financial_Results_Q1_FY2024.md'
!mkdir -p ./data
!mv NVIDIA_Financial_Results_Q1_FY2024.md data/

Next, let's install the required dependencies.

%pip install llama-index-llms-cleanlab llama-index llama-index-embeddings-huggingface

We then initialize Cleanlab's TLM. Here we create a CleanlabTLM object with default settings.

from llama_index.llms.cleanlab import CleanlabTLM

# set api key in env or in llm
# get free API key from: https://cleanlab.ai/
# import os
# os.environ["CLEANLAB_API_KEY"] = "your api key"

llm = CleanlabTLM(api_key="your_api_key")

Note: If you encounter a ValidationError during the import above, upgrade your Python version to >= 3.11.
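
A quick way to check the interpreter version before importing is sketched below. The helper name `python_is_recent_enough` and the constant `MIN_PYTHON` are illustrative and not part of any Cleanlab or LlamaIndex API; the 3.11 minimum is taken from the note above.

```python
import sys

# Minimum version suggested by the note above; adjust if requirements change.
MIN_PYTHON = (3, 11)


def python_is_recent_enough(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if not python_is_recent_enough():
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} detected; "
        f"consider upgrading to {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or newer."
    )
```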

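You can achieve better results by adjusting the TLM configuration described in the advanced TLM tutorial.

For example, if your application needs OpenAI's GPT-4 model and you want to limit output tokens to 256, you can configure this through the options argument: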

options = {
    "model": "gpt-4",
    "max_tokens": 256,
}
llm = CleanlabTLM(api_key="your_api_key", options=options)

Let's start by asking the LLM a simple question.

response = llm.complete("What is NVIDIA's ticker symbol?")
print(response)

NVIDIA's ticker symbol is NVDA.

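Along with the response, TLM provides a trustworthiness score that indicates how confident it is that the response is good and accurate. You can read this score directly from the response.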

response.additional_kwargs

{'trustworthiness_score': 0.9884868983475051}
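
One way to act on this score is to return a fallback message whenever it falls below a cutoff. The sketch below is a minimal illustration under stated assumptions: the 0.8 threshold, the fallback text, and the helper name `answer_or_fallback` are all hypothetical and should be tuned for your application. The dictionary is copied from the output above so the sketch runs on its own.

```python
# Hypothetical cutoff and fallback text; tune both on your own data.
TRUST_THRESHOLD = 0.8
FALLBACK = "I'm not confident in this answer. Please verify it against the source."


def answer_or_fallback(text, additional_kwargs, threshold=TRUST_THRESHOLD):
    """Return `text` only if a trustworthiness score is present and high enough."""
    score = additional_kwargs.get("trustworthiness_score")
    # A missing score is treated as untrusted rather than guessed at.
    if score is None or score < threshold:
        return FALLBACK
    return text


# Values copied from the cells above.
example_text = "NVIDIA's ticker symbol is NVDA."
example_kwargs = {"trustworthiness_score": 0.9884868983475051}
print(answer_or_fallback(example_text, example_kwargs))
```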

Using TLM in a RAG pipeline¶

Now let's integrate TLM into a RAG pipeline.

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

Settings.llm = llm

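Specify the embedding model¶

RAG uses an embedding model to match queries against document chunks so that it can retrieve the most relevant data. Here we opt for a free, local embedding model from Hugging Face. To use a different embedding model, refer to this LlamaIndex guide.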

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
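
To illustrate how embedding-based matching works in principle, here is a toy sketch that ranks chunks by cosine similarity to a query. The three-dimensional vectors and chunk names are made up for illustration; a real embedding model produces much longer vectors, and this is not how LlamaIndex implements retrieval internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors standing in for real embeddings.
toy_chunks = {
    "revenue paragraph": [0.9, 0.1, 0.0],
    "dividend paragraph": [0.1, 0.9, 0.1],
    "gaming paragraph": [0.2, 0.1, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # pretend embedding of a revenue question
print(top_k(toy_query, toy_chunks, k=1))
```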

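Load data and create the index and query engine¶

Let's create an index from the documents stored in the data directory. The system can index multiple files in the same folder, although this tutorial uses only one document. We stick with LlamaIndex's default indexing settings.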

documents = SimpleDirectoryReader("data").load_data()
# Optional step since we're loading just one data file
for doc in documents:
    # file_path wouldn't be a useful metadata to add to LLM's context since our datasource contains just 1 file
    doc.excluded_llm_metadata_keys.append("file_path")
index = VectorStoreIndex.from_documents(documents)

The generated index powers a query engine over the data.

query_engine = index.as_query_engine()

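Note that TLM is agnostic to the index and the query engine used for RAG, and is compatible with any choices you make for these components of your system.

In addition, you can use TLM's trustworthiness score directly in an existing custom-built RAG pipeline (using any other LLM generator, streaming or not).
To do so, you need access to the prompt sent to the LLM (including system instructions, retrieved context, the user query, and so on) and the response it returned. TLM requires both to predict trustworthiness.

A detailed walkthrough of this approach, with sample code, is available here.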
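
As a rough illustration of the two inputs involved, the sketch below assembles a prompt from its parts and pairs it with a response. The helper `build_rag_prompt` and the prompt layout are hypothetical; in practice you should capture the exact prompt your pipeline sent to its LLM rather than rebuild it. The actual scoring call is not shown here.

```python
def build_rag_prompt(system_instructions, context_chunks, user_query):
    """Join the pieces sent to the LLM into one prompt string (illustrative layout)."""
    context = "\n\n".join(context_chunks)
    return (
        f"{system_instructions}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )


prompt = build_rag_prompt(
    system_instructions="Answer using only the context provided.",
    context_chunks=["GAAP earnings per diluted share for the quarter were $0.82."],
    user_query="What was the GAAP earnings per diluted share for the quarter?",
)
response_text = "The GAAP earnings per diluted share for the quarter was $0.82."

# The (prompt, response) pair is what a trustworthiness scorer would receive.
scoring_input = {"prompt": prompt, "response": response_text}
print(scoring_input["prompt"])
```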

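Extract the trustworthiness score from the LLM response¶

As shown earlier, Cleanlab's TLM returns a trustworthiness_score in addition to the text when it responds to a prompt.

When TLM runs inside a RAG pipeline, LlamaIndex provides an instrumentation tool that lets us observe the events happening behind the scenes in RAG.
With this tooling, we can extract the trustworthiness_score from the LLM's response.

Let's define a simple event handler that stores this score for every request sent to the LLM. For more details on the instrumentation tool, refer to the LlamaIndex documentation.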

from typing import Dict, List, ClassVar
from llama_index.core.instrumentation.events import BaseEvent
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.events.llm import LLMCompletionEndEvent


class GetTrustworthinessScore(BaseEventHandler):
    events: ClassVar[List[BaseEvent]] = []
    trustworthiness_score: float = 0.0

    @classmethod
    def class_name(cls) -> str:
        """Class name."""
        return "GetTrustworthinessScore"

    def handle(self, event: BaseEvent) -> Dict:
        if isinstance(event, LLMCompletionEndEvent):
            self.trustworthiness_score = event.response.additional_kwargs[
                "trustworthiness_score"
            ]
            self.events.append(event)


# Root dispatcher
root_dispatcher = get_dispatcher()

# Register event handler
event_handler = GetTrustworthinessScore()
root_dispatcher.add_event_handler(event_handler)

For each query, we can read this score from event_handler.trustworthiness_score. Let's see it in action.
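
To make the mechanics concrete, the sketch below mimics the dispatcher-and-handler pattern in plain Python. `ToyEvent`, `ToyScoreHandler`, and `ToyDispatcher` are simplified stand-ins invented for this illustration, not LlamaIndex classes. The point it shows is that a handler written this way keeps only the score of the most recent LLM call, so a history list is useful if one query triggers several calls.

```python
class ToyEvent:
    """Stand-in for an LLM completion event carrying response metadata."""

    def __init__(self, additional_kwargs):
        self.additional_kwargs = additional_kwargs


class ToyScoreHandler:
    """Keeps the latest score and a history of every score seen."""

    def __init__(self):
        self.trustworthiness_score = 0.0
        self.history = []

    def handle(self, event):
        score = event.additional_kwargs.get("trustworthiness_score")
        if score is not None:
            self.trustworthiness_score = score  # overwritten on every event
            self.history.append(score)


class ToyDispatcher:
    """Forwards each emitted event to all registered handlers."""

    def __init__(self):
        self.handlers = []

    def add_event_handler(self, handler):
        self.handlers.append(handler)

    def emit(self, event):
        for handler in self.handlers:
            handler.handle(event)


dispatcher = ToyDispatcher()
toy_handler = ToyScoreHandler()
dispatcher.add_event_handler(toy_handler)

# Two simulated LLM calls; only the second score remains as the "current" one.
dispatcher.emit(ToyEvent({"trustworthiness_score": 0.99}))
dispatcher.emit(ToyEvent({"trustworthiness_score": 0.76}))
print(toy_handler.trustworthiness_score)
print(toy_handler.history)
```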

Answering queries with the RAG system¶

Let's try out our RAG pipeline based on TLM. Here we pose questions of varying difficulty.

# Optional: Define `display_response` helper function


# This method presents formatted responses from our TLM-based RAG pipeline. It parses the output to display both the text response itself and the corresponding trustworthiness score.
def display_response(response):
    response_str = response.response
    trustworthiness_score = event_handler.trustworthiness_score
    print(f"Response: {response_str}")
    print(f"Trustworthiness score: {round(trustworthiness_score, 2)}")

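Easy questions¶

We first pose straightforward questions that can be answered directly from the provided data and easily located within a few lines of text.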

response = query_engine.query(
    "What was NVIDIA's total revenue in the first quarter of fiscal 2024?"
)
display_response(response)

Response: NVIDIA's total revenue in the first quarter of fiscal 2024 was $7.19 billion.
Trustworthiness score: 1.0

response = query_engine.query(
    "What was the GAAP earnings per diluted share for the quarter?"
)
display_response(response)

Response: The GAAP earnings per diluted share for the quarter (Q1 FY24) was $0.82.
Trustworthiness score: 0.99

response = query_engine.query(
    "What significant transitions did Jensen Huang, NVIDIA's CEO, comment on?"
)
display_response(response)

Response: Jensen Huang, NVIDIA's CEO, commented on the significant transitions the computer industry is undergoing, particularly in the areas of accelerated computing and generative AI.
Trustworthiness score: 0.99

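TLM returns high trustworthiness scores for these responses, indicating high confidence that they are accurate. After a quick fact-check (reviewing the original earnings report), we can confirm that TLM did answer these questions correctly. In case you're curious, here are the relevant excerpts from the data context for these questions:

NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, ...

GAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter.

Jensen Huang, founder and CEO of NVIDIA, commented on the significant transitions the computer industry is undergoing, particularly accelerated computing and generative AI, ...

Questions without available context¶

Now let's see how TLM responds to queries that cannot be answered using the provided data.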

response = query_engine.query(
    "What factors as per the report were responsible to the decline in NVIDIA's proviz revenue?"
)
display_response(response)

Response: The report indicates that NVIDIA's professional visualization revenue declined by 53% year-over-year. While the specific factors contributing to this decline are not detailed in the provided information, several potential reasons can be inferred:

1. **Market Conditions**: The overall market for professional visualization may have faced challenges, leading to reduced demand for NVIDIA's products in this segment.

2. **Increased Competition**: The presence of competitors in the professional visualization space could have impacted NVIDIA's market share and revenue.

3. **Economic Factors**: Broader economic conditions, such as inflation or reduced spending in industries that utilize professional visualization tools, may have contributed to the decline.

4. **Transition to New Technologies**: The introduction of new technologies, such as the NVIDIA Omniverse™ Cloud, may have shifted focus away from traditional professional visualization products, affecting revenue.

5. **Product Lifecycle**: If certain products were nearing the end of their lifecycle or if there were delays in new product launches, this could have impacted sales.

Overall, while the report does not specify the exact reasons for the decline, these factors could be contributing elements based on industry trends and market dynamics.
Trustworthiness score: 0.76

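The lower TLM trustworthiness score indicates somewhat more uncertainty about the response, which is consistent with the lack of available information. Let's try a few more questions.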

response = query_engine.query(
    "How does the report explain why NVIDIA's Gaming revenue decreased year over year?"
)
display_response(response)

Response: The report does not explicitly explain the reasons for the year-over-year decrease in NVIDIA's Gaming revenue. However, it does provide context regarding the overall performance of the gaming segment, noting that first-quarter revenue was $2.24 billion, which is down 38% from a year ago but up 22% from the previous quarter. This suggests that while there may have been a decline compared to the same period last year, there was a recovery compared to the previous quarter. Factors that could contribute to the year-over-year decline might include market conditions, competition, or changes in consumer demand, but these specifics are not detailed in the report.
Trustworthiness score: 0.92

response = query_engine.query(
    "How does NVIDIA's dividend payout for this quarter compare to the industry average?",
)
display_response(response)

Response: The context information provided does not include specific details about the industry average for dividend payouts. Therefore, I cannot directly compare NVIDIA's dividend payout for this quarter to the industry average. However, NVIDIA announced a quarterly cash dividend of $0.04 per share for shareholders of record on June 8, 2023. To assess how this compares to the industry average, one would need to look up the average dividend payout for similar companies in the technology or semiconductor industry.
Trustworthiness score: 0.93

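We observe that, in these last two responses, TLM demonstrates the ability to recognize the limitations of the available information. It refrains from generating speculative responses or hallucinations, thereby maintaining the reliability of the question-answering system. This behavior showcases an understanding of the boundaries of the context and prioritizes accuracy over conjecture.

Challenging questions¶

Let's see how our RAG system responds to harder questions, some of which may be misleading.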

response = query_engine.query(
    "How much did Nvidia's revenue decrease this quarter vs last quarter, in terms of $?"
)
display_response(response)

Response: NVIDIA's revenue for the first quarter of fiscal 2024 was $7.19 billion, and it was reported that this revenue was up 19% from the previous quarter. To find the revenue for the previous quarter, we can use the following calculation:

Let \( x \) be the revenue for the previous quarter. 

The equation based on the 19% increase is:
\[ 
x + 0.19x = 7.19 \text{ billion} 
\]
\[ 
1.19x = 7.19 \text{ billion} 
\]
\[ 
x = \frac{7.19 \text{ billion}}{1.19} \approx 6.04 \text{ billion} 
\]

Now, to find the decrease in revenue from the previous quarter to this quarter:
\[ 
\text{Decrease} = 7.19 \text{ billion} - 6.04 \text{ billion} \approx 1.15 \text{ billion} 
\]

Thus, NVIDIA's revenue decreased by approximately $1.15 billion this quarter compared to the last quarter.
Trustworthiness score: 0.6

response = query_engine.query(
    "This report focuses on Nvidia's Q1FY2024 financial results. There are mentions of other companies in the report like Microsoft, Dell, ServiceNow, etc. Can you name them all here?",
)
display_response(response)

Response: The report mentions the following companies: Microsoft and Dell. ServiceNow is also mentioned in the context, but it is not specified in the provided highlights. Therefore, the companies explicitly mentioned in the report are Microsoft and Dell.
Trustworthiness score: 0.6

response = query_engine.query(
    "How many RTX GPU models, including all custom versions released by third-party manufacturers and all revisions across different series, were officially announced in NVIDIA's Q1 FY2024 financial results?",
)
display_response(response)

Response: In NVIDIA's Q1 FY2024 financial results, the following RTX GPU models were officially announced:

1. **GeForce RTX 4060 family of GPUs**
2. **GeForce RTX 4070 GPU**
3. **Six new NVIDIA RTX GPUs for mobile and desktop workstations**

This totals to **eight RTX GPU models** announced.
Trustworthiness score: 0.74

response = query_engine.query(
    "If NVIDIA's Data Center segment maintains its Q1 FY2024 quarter-over-quarter growth rate for the next four quarters, what would be its projected annual revenue?",
)
display_response(response)

Response: To calculate the projected annual revenue for NVIDIA's Data Center segment if it maintains its Q1 FY2024 quarter-over-quarter growth rate, we first need to determine the growth rate from Q4 FY2023 to Q1 FY2024.

NVIDIA reported a record Data Center revenue of $4.28 billion for Q1 FY2024. The revenue for the previous quarter (Q4 FY2023) can be calculated as follows:

Let \( R \) be the revenue for Q4 FY2023. The growth rate from Q4 FY2023 to Q1 FY2024 is given by:

\[
\text{Growth Rate} = \frac{\text{Q1 Revenue} - \text{Q4 Revenue}}{\text{Q4 Revenue}} = \frac{4.28 - R}{R}
\]

We know that the overall revenue for Q1 FY2024 is $7.19 billion, which is up 19% from the previous quarter. Therefore, we can express the revenue for Q4 FY2023 as:

\[
\text{Q1 FY2024 Revenue} = \text{Q4 FY2023 Revenue} \times 1.19
\]

Substituting the known value:

\[
7.19 = R \times 1.19
\]

Solving for \( R \):

\[
R = \frac{7.19}{1.19} \approx 6.03 \text{ billion}
\]

Now, we can calculate the Data Center revenue for Q4 FY2023. Since we don't have the exact figure for the Data Center revenue in Q4 FY2023, we will assume that the Data Center revenue also grew by the same percentage as the overall revenue. 

Now, we can calculate the quarter-over-quarter growth rate for the Data Center segment:

\[
\text{Growth Rate} = \frac{4.28 - R_D}{R_D}
\]

Where \( R_D \) is the Data Center revenue for Q4 FY2023. However, we need to find \( R_D \) first. 

Assuming the Data Center revenue was a certain percentage of the total revenue in Q4 FY2023, we can estimate it. For simplicity, let's assume the Data Center revenue was around 50% of the total revenue in Q4 FY2023 (this is a rough estimate, as we don't have the exact figure).

Thus, \( R_D \approx 0.5 \times 6
Trustworthiness score: 0.69

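TLM automatically alerts us that these answers are unreliable by assigning low trustworthiness scores. A RAG system equipped with TLM helps you exercise appropriate caution when you see a low trustworthiness score. Here are the correct answers to the questions above:

NVIDIA's revenue increased by $1.14 billion this quarter compared to last quarter.

Google, Amazon Web Services, Microsoft, Oracle, ServiceNow, Medtronic, Dell Technologies.

No specific total count of RTX GPUs is mentioned.

If this growth rate is maintained for the next four quarters, projected annual revenue is approximately $26.34 billion.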
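
The last figure can be reproduced as a sum of four compounding quarters. In the sketch below, the $4.28 billion Q1 Data Center revenue is the figure quoted in the response above. The 18% quarter-over-quarter growth rate is not quoted anywhere in this tutorial; it is an assumption used here because it reproduces the stated result, so check it against the earnings report.

```python
q1_data_center_revenue = 4.28  # $ billions, as quoted in the response above
assumed_qoq_growth = 0.18      # ASSUMPTION: not quoted in this tutorial; verify in the report

# Revenue for each of the next four quarters, compounding from Q1.
projected_quarters = [
    q1_data_center_revenue * (1 + assumed_qoq_growth) ** n
    for n in range(1, 5)
]
projected_annual_revenue = sum(projected_quarters)
print(round(projected_annual_revenue, 2))  # 26.34
```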

With TLM, you can easily increase trust in any RAG system!
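
One way to act on the scores is to queue low-scoring answers for human review. The sketch below is a minimal illustration using the rounded scores printed in this tutorial; the 0.8 threshold, the shortened question labels, and the helper name `needs_review` are hypothetical.

```python
REVIEW_THRESHOLD = 0.8  # hypothetical cutoff; tune on your own data

# (short label, rounded score) pairs taken from the outputs in this tutorial.
scores_by_question = [
    ("Total revenue in Q1 FY2024", 1.0),
    ("GAAP earnings per diluted share", 0.99),
    ("Transitions Jensen Huang commented on", 0.99),
    ("Factors behind the proviz revenue decline", 0.76),
    ("Why Gaming revenue decreased year over year", 0.92),
    ("Dividend payout vs. industry average", 0.93),
    ("Revenue decrease vs. last quarter", 0.6),
    ("Other companies mentioned in the report", 0.6),
    ("Number of RTX GPU models announced", 0.74),
    ("Projected Data Center annual revenue", 0.69),
]


def needs_review(scored, threshold=REVIEW_THRESHOLD):
    """Return (question, score) pairs below the threshold, lowest score first."""
    flagged = [(question, score) for question, score in scored if score < threshold]
    return sorted(flagged, key=lambda pair: pair[1])


for question, score in needs_review(scores_by_question):
    print(f"{score:.2f}  {question}")
```

With this threshold, all four challenging questions and the proviz question are flagged, while the easy questions pass.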

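Read the TLM performance benchmarks to learn about the effectiveness of the trustworthiness scores.
Rather than replacing your LLM with Cleanlab's model as this tutorial does, you can also use Cleanlab directly to detect incorrect responses in your existing, unmodified RAG system; see our real-time evaluation tutorial.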