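Cleanlab Trustworthy Language Model¶

Cleanlab's Trustworthy Language Model (TLM) scores the trustworthiness of every LLM response in real time, using state-of-the-art uncertainty estimation for LLMs. Trust scoring matters most in applications that cannot tolerate hallucinations and other LLM errors.

This page demonstrates how to use TLM in place of your existing LLM, both to generate responses and to score their trustworthiness. That is not the only way to use TLM, though. To add trust scoring to an existing, unmodified RAG application, see the Trustworthy RAG tutorial. Beyond RAG applications, you can score the trustworthiness of any already-generated LLM response via TLM.get_trustworthiness_score().

For more information, see the official Cleanlab documentation.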

Setup¶

If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.

%pip install llama-index-llms-cleanlab

%pip install llama-index

from llama_index.llms.cleanlab import CleanlabTLM

# set api key in env or in llm
# get free API key from: https://cleanlab.ai/
# import os
# os.environ["CLEANLAB_API_KEY"] = "your api key"

llm = CleanlabTLM(api_key="your_api_key")

resp = llm.complete("Who is Paul Graham?")

print(resp)

Paul Graham is an American computer scientist, entrepreneur, and venture capitalist. He is best known as the co-founder of the startup accelerator Y Combinator, which has helped launch numerous successful companies including Dropbox, Airbnb, and Reddit. Graham is also a prolific writer and essayist, known for his insightful and thought-provoking essays on topics ranging from startups and entrepreneurship to technology and society. He has been influential in the tech industry and is highly regarded for his expertise and contributions to the startup ecosystem.

The trustworthiness score for the response above is also available in additional_kwargs. TLM automatically computes this score for every <prompt, response> pair.

print(resp.additional_kwargs)

{'trustworthiness_score': 0.8659043183923533}

A high score indicates that the LLM's response can be trusted. Let's look at another example.

resp = llm.complete(
    "What was the horsepower of the first automobile engine used in a commercial truck in the United States?"
)

print(resp)

The first automobile engine used in a commercial truck in the United States was the 1899 Winton Motor Carriage Company Model 10, which had a 2-cylinder engine with 20 horsepower.

print(resp.additional_kwargs)

{'trustworthiness_score': 0.5820799504369166}

A low score indicates that the LLM's response should not be trusted.

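These two simple examples show that the highest-scoring LLM responses are direct, accurate, and appropriately detailed.
Responses with low trustworthiness scores, by contrast, give unhelpful or factually incorrect answers, sometimes referred to as "hallucinations".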
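
One way to act on these scores is to compare them against a cutoff before showing a response to users. The sketch below is only an illustration: the helper name `is_trustworthy` and the 0.7 threshold are assumptions, not values recommended by Cleanlab, and a suitable cutoff depends on your application.

```python
def is_trustworthy(additional_kwargs: dict, threshold: float = 0.7) -> bool:
    """Return True when the TLM score meets a (hypothetical) threshold."""
    score = additional_kwargs.get("trustworthiness_score")
    if score is None:
        # No score available: treat the response as untrusted rather than guess.
        return False
    return score >= threshold


# Scores copied from the two example outputs above.
print(is_trustworthy({"trustworthiness_score": 0.8659043183923533}))  # True
print(is_trustworthy({"trustworthiness_score": 0.5820799504369166}))  # False
```

In a real application the dictionary would come from `resp.additional_kwargs`, and an untrusted response could be routed to a fallback such as a human review step.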

Streaming¶

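Cleanlab's TLM does not natively support streaming the response and the trustworthiness score together. An alternative approach can still give you low-latency streamed responses for your application.
A detailed description of that approach, with example code, is available in Cleanlab's documentation.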
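
One plausible shape for such a workaround is to stream text from your own LLM first and score the completed response afterwards. The sketch below is not necessarily the approach Cleanlab documents. `stream_then_score`, `fake_stream`, and `fake_score` are hypothetical stand-ins so the example runs offline; in practice the scorer might wrap a call such as TLM.get_trustworthiness_score().

```python
from typing import Callable, Iterable, Iterator, Tuple


def stream_then_score(
    prompt: str,
    chunks: Iterable[str],
    score_fn: Callable[[str, str], float],
) -> Iterator[Tuple[str, object]]:
    """Yield ("chunk", text) events as they arrive, then one ("score", value) event."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)
        yield ("chunk", chunk)  # forward each piece to the caller immediately
    full_response = "".join(parts)
    # Scoring happens only once the full response is known.
    yield ("score", score_fn(prompt, full_response))


# Stand-ins for illustration only: a fake token stream and a fake scorer.
def fake_stream() -> Iterator[str]:
    yield from ["Paul ", "Graham ", "co-founded ", "Y Combinator."]


def fake_score(prompt: str, response: str) -> float:
    return 0.9  # placeholder value, not a real TLM score


events = list(stream_then_score("Who is Paul Graham?", fake_stream(), fake_score))
for kind, value in events:
    print(kind, value)
```

The trade-off is that users see text before its score exists, so the interface needs a way to flag a response after the fact.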

Advanced use of TLM¶

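TLM can be configured with the following options:

model: the underlying LLM to use
max_tokens: the maximum number of tokens to generate in the response
num_candidate_responses: the number of alternative candidate responses TLM generates internally
num_consistency_samples: the number of internal samples used to evaluate the consistency of the LLM's response
use_self_reflection: whether the LLM is asked to self-reflect on and evaluate the response it generated
log: the additional metadata to return. Include "explanation" to get an explanation of why a response received a low trustworthiness score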

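These options are passed as a dictionary to the CleanlabTLM object at initialization.
More details about the options are available in Cleanlab's API documentation, and a few use cases for them are explored in a separate notebook.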
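
Because the options travel as a plain dictionary, a misspelled key is easy to miss. The helper below is a minimal sketch of guarding against that. `build_tlm_options` is a hypothetical name, and its allow-list contains only the option names listed on this page. Cleanlab's API may accept others, so extend the set as needed.

```python
# Option names copied from the list above; nothing else is assumed to exist.
KNOWN_OPTIONS = {
    "model",
    "max_tokens",
    "num_candidate_responses",
    "num_consistency_samples",
    "use_self_reflection",
    "log",
}


def build_tlm_options(**overrides) -> dict:
    """Return an options dict, rejecting keys not named on this page."""
    unknown = set(overrides) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"Unknown TLM option(s): {sorted(unknown)}")
    return dict(overrides)


options = build_tlm_options(max_tokens=128, log=["explanation"])
print(options)
```

The resulting dictionary would then be passed as `CleanlabTLM(api_key=..., options=options)`.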

The following example configures the application to use the gpt-4 model with at most 128 output tokens.

options = {
    "model": "gpt-4",
    "max_tokens": 128,
}
llm = CleanlabTLM(api_key="your_api_key", options=options)

resp = llm.complete("Who is Paul Graham?")

print(resp)

Paul Graham is a British-born American computer scientist, entrepreneur, venture capitalist, author, and essayist. He is best known for co-founding Viaweb, which was sold to Yahoo in 1998 for over $49 million and became Yahoo Store. He also co-founded the influential startup accelerator and seed capital firm Y Combinator, which has launched over 2,000 companies including Dropbox, Airbnb, Stripe, and Reddit. Graham is also known for his essays on startup companies and programming languages.

To understand why TLM assigned a low trustworthiness score to the earlier horsepower question, specify the "explanation" flag when initializing TLM.

options = {
    "log": ["explanation"],
}
llm = CleanlabTLM(api_key="your_api_key", options=options)

resp = llm.complete(
    "What was the horsepower of the first automobile engine used in a commercial truck in the United States?"
)

print(resp)

The first automobile engine used in a commercial truck in the United States was in the 1899 "Motor Truck" built by the American company, the "GMC Truck Company." This early truck was equipped with a 2-horsepower engine. However, it's important to note that the development of commercial trucks evolved rapidly, and later models featured significantly more powerful engines.

print(resp.additional_kwargs["explanation"])

The proposed answer incorrectly attributes the first commercial truck in the United States to the GMC Truck Company and states that it was built in 1899 with a 2-horsepower engine. In reality, the first commercial truck is generally recognized as the "Motor Truck" built by the American company, the "GMC Truck Company," but it was actually produced by the "GMC" brand, which was established later. The first commercial truck is often credited to the "Benz Velo" or similar early models, which had varying horsepower ratings. The specific claim of a 2-horsepower engine is also misleading, as early trucks typically had more powerful engines. Therefore, the answer contains inaccuracies regarding both the manufacturer and the specifications of the engine. 
This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): 
The horsepower of the first automobile engine used in a commercial truck in the United States was 6 horsepower.
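
For logging or display, the score and explanation can be combined into one short report. The sketch below assumes that `additional_kwargs` holds a "trustworthiness_score" key and, when requested through the log option, an "explanation" key, as the outputs above suggest. `format_trust_report` is a hypothetical helper and the sample values are made up for illustration.

```python
def format_trust_report(additional_kwargs: dict) -> str:
    """Build a short text report from a response's additional_kwargs."""
    score = additional_kwargs.get("trustworthiness_score")
    explanation = additional_kwargs.get("explanation")
    # Show "n/a" when no score is present instead of failing.
    score_text = "n/a" if score is None else f"{score:.2f}"
    lines = ["trustworthiness_score: " + score_text]
    if explanation:
        lines.append("explanation: " + explanation.strip())
    return "\n".join(lines)


# Illustrative values only, not real TLM output.
sample = {"trustworthiness_score": 0.58, "explanation": " Inconsistent answers. "}
print(format_trust_report(sample))
```

Using `.get` keeps the helper usable when the explanation was not requested.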