AIMon's LlamaIndex Extension for LLM Response Evaluation¶

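This notebook introduces AIMon's evaluators for the LlamaIndex framework, which assess the quality and accuracy of responses generated by language models (LLMs) integrated into LlamaIndex. The available evaluators are:

Hallucination Evaluator: detects when the model generates fabricated information (hallucinations) that the context does not support.
Guideline Evaluator: checks that model responses follow predefined instructions and guidelines.
Completeness Evaluator: checks whether the response addresses every aspect of the query or task.
Conciseness Evaluator: assesses whether the response is brief and to the point, without unnecessary verbosity.
Toxicity Evaluator: flags harmful, offensive, or inappropriate language in the response.
Context Relevance Evaluator: assesses how relevant and accurate the provided context is for supporting the model's response.

This notebook demonstrates how to use the Hallucination Evaluator, the Guideline Evaluator, and the Context Relevance Evaluator to evaluate your RAG (retrieval-augmented generation) application.

To learn more about AIMon, see the website and the documentation.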

Prerequisites¶

Let's start by installing the dependencies and setting up the API keys.

In [ ]:

%%capture
!pip install requests datasets aimon-llamaindex llama-index-embeddings-openai llama-index-llms-openai

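Configure your OPENAI_API_KEY and AIMON_API_KEY in Google Colab Secrets and grant the notebook access to them. We will use OpenAI for both the LLM and the embedding model, and AIMon for continuous monitoring of quality issues.

You can get an AIMon API key here.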

In [ ]:

import os
import json

# Import Colab Secrets userdata module.
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
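
The cell above only works inside Google Colab. As a minimal sketch for running the notebook elsewhere, the helper below tries Colab Secrets first and falls back to an environment variable. The name load_api_key is hypothetical and is not part of AIMon or LlamaIndex.

```python
import os


def load_api_key(name: str) -> str:
    """Return the secret called `name`, preferring Colab Secrets when available.

    Hypothetical helper for illustration; not part of any library.
    """
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata

        value = userdata.get(name)
    except Exception:
        # Outside Colab, or when the secret is missing or access is not
        # granted, fall back to the process environment.
        value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing API key: set {name} before running.")
    return value
```

With this helper, the assignment could be written as os.environ["OPENAI_API_KEY"] = load_api_key("OPENAI_API_KEY").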

Dataset for evaluation¶

In this example, we use transcripts from the MeetingBank dataset [1] as the contextual information.

In [ ]:

%%capture
from datasets import load_dataset

meetingbank = load_dataset("huuuyeah/meetingbank")

The following function extracts the transcripts and converts them into a list of llama_index.core.Document objects.

In [ ]:

from llama_index.core import Document


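def extract_and_create_documents(transcripts):
    """Wrap each transcript string in a LlamaIndex Document."""
    documents = []

    for transcript in transcripts:
        try:
            doc = Document(text=transcript)
            documents.append(doc)

        except Exception as e:
            # Report the failure and continue with the remaining transcripts.
            print(f"Failed to create document: {e}")

    return documents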


transcripts = [meeting["transcript"] for meeting in meetingbank["train"]]
documents = extract_and_create_documents(
    transcripts[:5]
)  ## Using only 5 transcripts to keep this example fast and concise.

Set up an embedding model. Here we use the text-embedding-3-small model.

In [ ]:

from llama_index.embeddings.openai import OpenAIEmbedding

embedding_model = OpenAIEmbedding(
    model="text-embedding-3-small", embed_batch_size=100, max_retries=3
)

Split the documents into nodes and generate their embeddings.

In [ ]:

from aimon_llamaindex import generate_embeddings_for_docs

nodes = generate_embeddings_for_docs(documents, embedding_model)
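
The notebook does not show how generate_embeddings_for_docs splits documents, so the following is only a generic illustration of what splitting a document into nodes can mean: overlapping word windows. The function name chunk_words and its parameters are assumptions made for this sketch, not the library's behavior.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words.

    Illustrative only; real node parsers usually split on sentences or tokens.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


print(chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of embedding some words twice.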

Insert the nodes with their embeddings into an in-memory vector store index.

In [ ]:

from aimon_llamaindex import build_index

index = build_index(nodes)

Instantiate a vector index retriever.

In [ ]:

from aimon_llamaindex import build_retriever

retriever = build_retriever(index, similarity_top_k=5)
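
To make the similarity_top_k parameter concrete, here is a minimal pure-Python sketch of top-k retrieval by cosine similarity. It assumes cosine scoring, which is common for in-memory vector indexes; the actual scoring inside build_retriever is not shown in this notebook, and the function names below are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], node_vecs: list[list[float]], k: int = 5):
    """Return (node_index, score) pairs for the k most similar nodes."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(node_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings": node 0 matches the query exactly, node 1 is orthogonal.
hits = top_k([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=2)
print([index for index, _ in hits])  # [0, 2]
```

A larger k gives the LLM more context to draw on but also admits lower-scoring, less relevant nodes.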

Building the LLM application¶

Configure the large language model. Here we choose OpenAI's gpt-4o-mini model with the temperature set to 0.4.

In [ ]:

## OpenAI's LLM
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    temperature=0.4,
    system_prompt="""
                    Please be professional and polite.
                    Answer the user's question in a single line.
                    Even if the context lacks information to answer the question, make
                    sure that you answer the user's question based on your own knowledge.
                    """,
)

Define the query and instructions.

In [ ]:

user_query = "Which council bills were amended for zoning regulations?"
user_instructions = [
    "Keep the response concise, preferably under the 100 word limit."
]

Update the LLM's system prompt dynamically with the user-defined instructions.

In [ ]:

llm.system_prompt += (
    f"Please comply to the following instructions {user_instructions}."
)
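
The f-string above interpolates the Python list itself, so the prompt receives its repr, including brackets and quotes. As an optional sketch, the instructions could instead be rendered as a numbered list. The helper format_instructions is hypothetical and not part of aimon_llamaindex.

```python
def format_instructions(instructions: list[str]) -> str:
    """Render instructions as a numbered list for inclusion in a system prompt."""
    lines = [f"{number}. {text}" for number, text in enumerate(instructions, start=1)]
    return "Please comply with the following instructions:\n" + "\n".join(lines)


print(
    format_instructions(
        ["Keep the response concise, preferably under the 100 word limit."]
    )
)
# Please comply with the following instructions:
# 1. Keep the response concise, preferably under the 100 word limit.
```

The result can be appended in the same way, for example llm.system_prompt += format_instructions(user_instructions).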

Retrieve a response for the query.

In [ ]:

from aimon_llamaindex import get_response

llm_response = get_response(user_query, retriever, llm)

Running evaluations with AIMon¶

Configure the AIMon client.

In [ ]:

from aimon import Client

aimon_client = Client(
    auth_header="Bearer {}".format(userdata.get("AIMON_API_KEY"))
)

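Using AIMon's instruction adherence model (also known as the Guideline Evaluator)

This model evaluates whether the generated text adheres to the given instructions. It checks that the LLM follows the user's guidelines and intent across tasks, which leads to more accurate and relevant outputs.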

In [ ]:

from aimon_llamaindex.evaluators import GuidelineEvaluator

guideline_evaluator = GuidelineEvaluator(aimon_client)
evaluation_result = guideline_evaluator.evaluate(
    user_query, llm_response, user_instructions
)

In [ ]:

print(json.dumps(evaluation_result, indent=4))

{
    "extractions": [],
    "instructions_list": [
        {
            "explanation": "",
            "follow_probability": 0.982,
            "instruction": "Keep the response concise, preferably under the 100 word limit.",
            "label": true
        }
    ],
    "score": 1.0
}
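
As a sketch of how this result could be consumed programmatically, the helper below lists the instructions that were not followed. The field meanings are inferred from the output shown above; the helper name and the 0.5 probability cutoff are assumptions for illustration.

```python
def failed_instructions(result: dict, min_probability: float = 0.5) -> list[str]:
    """Return instructions labeled as not followed or scored below the cutoff."""
    failed = []
    for item in result.get("instructions_list", []):
        followed = item.get("label", False)
        probability = item.get("follow_probability", 0.0)
        if not followed or probability < min_probability:
            failed.append(item["instruction"])
    return failed


# Result copied from the output above.
sample_result = {
    "extractions": [],
    "instructions_list": [
        {
            "explanation": "",
            "follow_probability": 0.982,
            "instruction": "Keep the response concise, preferably under the 100 word limit.",
            "label": True,
        }
    ],
    "score": 1.0,
}
print(failed_instructions(sample_result))  # []
```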

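Using AIMon's hallucination detection model (HDM-2)

AIMon's HDM-2 detects hallucinated content in LLM outputs. It returns a "hallucination score" between 0.0 and 1.0 that quantifies the likelihood of factual inaccuracies or fabricated information, which helps keep responses reliable and accurate.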

In [ ]:

from aimon_llamaindex.evaluators import HallucinationEvaluator

hallucination_evaluator = HallucinationEvaluator(aimon_client)
evaluation_result = hallucination_evaluator.evaluate(user_query, llm_response)

In [ ]:

## Printing the initial evaluation result for Hallucination
print(json.dumps(evaluation_result, indent=4))

{
    "is_hallucinated": "False",
    "score": 0.22446,
    "sentences": [
        {
            "score": 0.22446,
            "text": "The council bills amended for zoning regulations include the small lot moratorium and the text amendment related to off-street parking exemptions for preexisting small lots. These amendments aim to balance the interests of local neighborhoods, health institutions, and developers."
        }
    ]
}
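
Note that in the output above, is_hallucinated is the string "False" rather than a boolean, and any non-empty string is truthy in Python. The sketch below parses the flag defensively and collects sentences above a score cutoff. The helper names and the 0.5 cutoff are assumptions, not values documented by AIMon.

```python
def is_hallucinated(result: dict) -> bool:
    """Interpret the flag whether it arrives as a bool or as "True"/"False"."""
    flag = result.get("is_hallucinated", False)
    if isinstance(flag, str):
        return flag.strip().lower() == "true"
    return bool(flag)


def risky_sentences(result: dict, threshold: float = 0.5) -> list[str]:
    """Return the text of sentences whose hallucination score meets the cutoff."""
    return [
        sentence["text"]
        for sentence in result.get("sentences", [])
        if sentence["score"] >= threshold
    ]


# Shortened version of the output above.
sample_result = {
    "is_hallucinated": "False",
    "score": 0.22446,
    "sentences": [{"score": 0.22446, "text": "The council bills amended ..."}],
}
print(bool(sample_result["is_hallucinated"]))  # True (the naive check misfires)
print(is_hallucinated(sample_result))  # False
print(risky_sentences(sample_result))  # []
```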

Use AIMon's Context Relevance Evaluator to assess how relevant the context data was that the LLM used to generate the response.

In [ ]:

from aimon_llamaindex.evaluators import ContextRelevanceEvaluator

evaluator = ContextRelevanceEvaluator(aimon_client)
task_definition = (
    "Find the relevance of the context data used to generate this response."
)
evaluation_result = evaluator.evaluate(
    user_query, llm_response, task_definition
)

In [ ]:

print(json.dumps(evaluation_result, indent=4))

[
    {
        "explanations": [
            "Document 1 discusses a council bill related to zoning regulations, specifically mentioning a text amendment that aims to balance neighborhood interests with developer needs. However, it primarily focuses on parking issues and personal experiences rather than detailing specific zoning regulation amendments or the council bills directly related to them, which makes it less relevant to the query.",
            "2. Document 2 mentions zoning and development issues, including the need for mass transit and affordability, but it does not provide specific information on which council bills were amended for zoning regulations. The discussion is more about general concerns regarding development and transportation rather than direct references to zoning amendments.",
            "3. Document 3 touches on zoning laws and amendments but does not specify which council bills were amended for zoning regulations. While it discusses the context of zoning and housing, it lacks concrete details that directly answer the query about specific bills.",
            "4. Document 4 discusses broader issues about affordable housing and transportation without directly addressing any specific council bills or amendments related to zoning regulations. The focus is on general priorities and funding rather than specific legislative changes, making it less relevant to the query.",
            "5. Document 5 mentions support for a zoning code amendment regarding parking exemptions for small lots, which is somewhat related to zoning regulations. However, it does not provide specific details about the council bills amended for zoning regulations, thus failing to fully address the query."
        ],
        "query": "Which council bills were amended for zoning regulations?",
        "relevance_scores": [
            40.5,
            40.25,
            44.25,
            38.5,
            43.0
        ]
    }
]
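
To compare the retrieved documents at a glance, the scores can be ranked and averaged. This is a minimal sketch over the structure shown above; the helper name rank_contexts is hypothetical, and treating the scores as lying on a 0 to 100 scale is an assumption based on the values in this output.

```python
from statistics import mean


def rank_contexts(result: list[dict]) -> list[tuple[int, float]]:
    """Return (document_number, score) pairs, most relevant first.

    Document numbers start at 1 to match the explanations in the output.
    """
    scores = result[0]["relevance_scores"]
    return sorted(enumerate(scores, start=1), key=lambda pair: pair[1], reverse=True)


# Scores copied from the output above; explanations omitted for brevity.
sample_result = [
    {
        "query": "Which council bills were amended for zoning regulations?",
        "relevance_scores": [40.5, 40.25, 44.25, 38.5, 43.0],
    }
]
ranked = rank_contexts(sample_result)
print(ranked)  # [(3, 44.25), (5, 43.0), (1, 40.5), (2, 40.25), (4, 38.5)]
print(mean(score for _, score in ranked))  # 41.3
```

In this run every document scores between 38.5 and 44.25, which matches the explanations: each document touches on zoning but none names the specific amended bills.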

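Conclusion¶

In this notebook, we built a simple RAG application with the LlamaIndex framework. After retrieving a response to a query, we assessed its quality with AIMon's evaluators.

References¶

[1] Y. Hu, T. Ganter, H. Deilamsalehy, F. Dernoncourt, H. Foroosh, and F. Liu, "MeetingBank: A Benchmark Dataset for Meeting Summarization," arXiv, May 2023. [Online]. Available: https://arxiv.org/abs/2305.17529. Accessed: Jan. 16, 2025.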