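Observability with Arize Phoenix - Tracing and Evaluating a LlamaIndex Application

LlamaIndex provides high-level APIs that let users build powerful applications in a few lines of code. However, it can be challenging to understand what is going on under the hood and to pinpoint the cause of issues. Phoenix makes your LLM applications observable by visualizing the underlying structure of each call to your query engine and surfacing problematic spans of execution based on latency, token count, or other evaluation metrics.

In this tutorial, you will:

Build a simple query engine with LlamaIndex that uses retrieval-augmented generation to answer questions about a Paul Graham essay
Record trace data in the OpenInference tracing format using the global arize_phoenix handler
Inspect the traces and spans of your application to identify sources of latency and cost
Export your trace data as a pandas dataframe and run LLM evals

ℹ️ This notebook requires an OpenAI API key.

Observability documentation

1. Install Dependencies and Import Libraries

Install Phoenix, LlamaIndex, and OpenAI.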

!pip install llama-index
!pip install llama-index-callbacks-arize-phoenix
!pip install "arize-phoenix[evals]"
!pip install "openinference-instrumentation-llama-index>=1.0.0"

import json
import os
from getpass import getpass
from urllib.request import urlopen

import nest_asyncio
import openai
import pandas as pd
import phoenix as px
from llama_index.core import (
    Settings,
    set_global_handler,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    RelevanceEvaluator,
    run_evals,
)
from phoenix.session.evaluation import (
    get_qa_with_reference,
    get_retrieved_documents,
)
from phoenix.trace import DocumentEvaluations, SpanEvaluations
from tqdm import tqdm

nest_asyncio.apply()
pd.set_option("display.max_colwidth", 1000)

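2. Launch Phoenix

You can run Phoenix in the background to collect trace data emitted by any LlamaIndex application that has been instrumented with the OpenInferenceTraceCallbackHandler. Phoenix supports LlamaIndex's one-click observability, which automatically instruments your LlamaIndex application. For a more detailed guide on how to instrument a LlamaIndex application, see our integration guide.

Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI (the UI should be empty because the LlamaIndex application has not run yet).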

session = px.launch_app()

🌍 To view the Phoenix app in your browser, visit https://jfgzmj4xrg3-496ff2e9c6d22116-6006-colab.googleusercontent.com/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix

3. Configure Your OpenAI API Key

Set your OpenAI API key as an environment variable if you have not already done so.

import os

os.environ["OPENAI_API_KEY"] = "sk-..."
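
Hard-coding a key in a notebook makes it easy to leak. As an alternative, the sketch below prompts for the key only when the environment variable is missing. The helper name `ensure_openai_api_key` and its `prompt` parameter are illustrative assumptions, not part of LlamaIndex or Phoenix.

```python
import os
from getpass import getpass


def ensure_openai_api_key(prompt=getpass):
    """Return the OpenAI API key, prompting only if it is not already set."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        # getpass hides the typed value so it does not end up in cell output.
        key = prompt("Enter your OpenAI API key: ")
        os.environ["OPENAI_API_KEY"] = key
    return key


# Usage (interactive): ensure_openai_api_key()
```

The `prompt` argument exists only so that the function can be exercised without a terminal; in a notebook you would call it with no arguments.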

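4. Build Your Index and Create a Query Engine

a. Download data

b. Load data

c. Set up Phoenix tracing

d. Set up the LLM and embedding model

e. Create the index

f. Create the query engine

Download Data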

!wget "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt" -O "paul_graham_essay.txt"

--2024-04-26 03:09:56--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘paul_graham_essay.txt’

paul_graham_essay.t 100%[===================>]  73.28K  --.-KB/s    in 0.01s   

2024-04-26 03:09:56 (5.58 MB/s) - ‘paul_graham_essay.txt’ saved [75042/75042]
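
If wget is not available (for example on Windows), the same file can be fetched from Python. This is a minimal sketch using only the standard library; the `download` helper is a hypothetical name introduced here for illustration.

```python
from pathlib import Path
from urllib.request import urlopen

ESSAY_URL = (
    "https://raw.githubusercontent.com/run-llama/llama_index/main/"
    "docs/docs/examples/data/paul_graham/paul_graham_essay.txt"
)


def download(url, dest):
    """Fetch `url`, write the raw bytes to `dest`, and return the byte count."""
    with urlopen(url) as response:
        data = response.read()
    Path(dest).write_bytes(data)
    return len(data)


# Usage (needs network access):
# download(ESSAY_URL, "paul_graham_essay.txt")
```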

Load Data

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["paul_graham_essay.txt"]
).load_data()

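Set Up Phoenix Tracing

To enable Phoenix tracing in LlamaIndex, set arize_phoenix as the global handler. This mounts Phoenix's OpenInferenceTraceCallback as the global handler. Phoenix uses OpenInference tracing, an open-source standard for capturing and storing LLM application traces, which lets LLM applications integrate seamlessly with LLM observability solutions such as Phoenix.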
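
To make the idea of traces and spans concrete, here is a toy sketch of what a tracing handler records: each span is a named, timed unit of work with attributes and a parent. It is not the OpenInference schema or Phoenix's implementation; the `span` helper and the field names are assumptions made for illustration.

```python
import time
from contextlib import contextmanager

spans = []  # finished spans, in completion order


@contextmanager
def span(name, parent=None, **attributes):
    """Record the name, parent, attributes, and wall-clock latency of a block."""
    record = {"name": name, "parent": parent, "attributes": attributes}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["latency_s"] = time.perf_counter() - start
        spans.append(record)


# One query produces a small tree of spans: query -> retrieve, llm.
with span("query", input="why did paul graham start YC?"):
    with span("retrieve", parent="query", top_k=5):
        time.sleep(0.005)  # stand-in for the embedding lookup
    with span("llm", parent="query", model="gpt-3.5-turbo"):
        time.sleep(0.005)  # stand-in for the completion call

for record in spans:
    print(record["name"], record["parent"], round(record["latency_s"], 4))
```

Because inner spans finish before the span that encloses them, the parent's latency is at least the sum of its sequential children, which is what lets a trace viewer attribute slow queries to a specific step.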

set_global_handler("arize_phoenix")

Set Up the LLM and Embedding Model

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.2)
embed_model = OpenAIEmbedding()

Settings.llm = llm
Settings.embed_model = embed_model

Create the Index

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

Create the Query Engine

query_engine = index.as_query_engine(similarity_top_k=5)
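
The similarity_top_k=5 argument asks the retriever for the five chunks most similar to the query. As a rough sketch of that idea, and not LlamaIndex's actual code, the example below ranks toy two-dimensional vectors by cosine similarity; the helper names and the vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, chunk_vecs, k):
    """Return (chunk_id, score) pairs for the k most similar chunks."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings; real ones come from the embedding model.
chunks = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [0.0, 1.0]}
print(top_k([1.0, 0.0], chunks, k=2))
```

A larger k gives the LLM more context to draw on, at the price of more prompt tokens and therefore more latency and cost per query, which is visible in the traces.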

5. Run Your Query Engine and View Your Traces in Phoenix

queries = [
    "what did paul graham do growing up?",
    "why did paul graham start YC?",
]

for query in tqdm(queries):
    query_engine.query(query)

100%|██████████| 2/2 [00:07<00:00,  3.81s/it]

print(query_engine.query("Who is Paul Graham?"))

Paul Graham is a writer, entrepreneur, and investor known for his involvement in various projects and ventures. He has written essays on diverse topics, founded companies like Viaweb and Y Combinator, and has a strong presence in the startup and technology industry.
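
Besides the answer text, the response object carries the retrieved chunks in source_nodes, each with a similarity score, which is the same information Phoenix shows on the retriever span. The helper below is a sketch that lists them; `summarize_sources` is a hypothetical name, and the stand-in response with made-up scores and text exists only so the example runs without an API call.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars=80):
    """Return (rank, score, text preview) for each retrieved chunk."""
    rows = []
    for rank, source in enumerate(response.source_nodes, start=1):
        preview = source.get_content().replace("\n", " ")[:max_chars]
        rows.append((rank, source.score, preview))
    return rows


# Stand-in for a real query response (made-up scores and text).
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(score=0.83, get_content=lambda: "first chunk\ntext"),
        SimpleNamespace(score=0.79, get_content=lambda: "second chunk text"),
    ]
)
for row in summarize_sources(fake_response):
    print(row)
```

With a real engine you would pass the value returned by `query_engine.query(...)` instead of `fake_response`.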

print(f"🚀 Open the Phoenix UI if you haven't already: {session.url}")

🚀 Open the Phoenix UI if you haven't already: https://jfgzmj4xrg4-496ff2e9c6d22116-6006-colab.googleusercontent.com/

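6. Export and Evaluate Your Trace Data

You can export your trace data as a pandas dataframe for further analysis and evaluation.

In this case, we will export the retriever spans into two separate dataframes:

queries_df: in which the retrieved documents for each query are concatenated into a single column
retrieved_documents_df: in which each retrieved document is "exploded" into its own row so that each query-document pair can be evaluated in isolation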
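
The difference between the two shapes can be sketched in plain Python. The field names below are illustrative assumptions rather than Phoenix's actual column names; with pandas, `DataFrame.explode` performs the analogous operation on a list-valued column.

```python
# One row per query, with all retrieved documents held together.
queries = [
    {"query": "why did paul graham start YC?", "documents": ["doc A", "doc B"]},
    {"query": "what did paul graham do growing up?", "documents": ["doc C"]},
]


def explode(rows):
    """Turn one row per query into one row per (query, document) pair."""
    return [
        {"query": row["query"], "document_position": pos, "document": doc}
        for row in rows
        for pos, doc in enumerate(row["documents"])
    ]


for pair in explode(queries):
    print(pair)
```

The per-query shape suits evaluations of the final answer, while the per-pair shape lets an evaluator judge each retrieved document on its own.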

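This allows us to compute multiple kinds of evaluations, including:

Relevance: are the retrieved documents relevant to the query?
Q&A correctness: does the application's response answer the question correctly, given the retrieved context?
Hallucinations: is the application making up information that the retrieved context does not support?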

queries_df = get_qa_with_reference(px.Client())
retrieved_documents_df = get_retrieved_documents(px.Client())

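Next, define your evaluation model and your evaluators.

Evaluators are built on top of language models: they prompt an LLM to assess the quality of responses, the relevance of retrieved documents, and so on, and they provide a quality signal even in the absence of human-labeled data. Pick an evaluator type and instantiate it with the language model you want to use to perform evaluations, using our battle-tested evaluation templates.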
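
Conceptually, an LLM-based evaluator fills a prompt template, sends it to a judge model, and maps the free-text reply onto a fixed set of labels. The sketch below shows that loop with a fake judge in place of a real model. The template wording, the label set, and all function names are assumptions for illustration, not Phoenix's templates or API.

```python
TEMPLATE = (
    "Question: {question}\n"
    "Document: {document}\n"
    "Is the document relevant to the question? "
    "Answer 'relevant' or 'unrelated'."
)
LABELS = ("relevant", "unrelated")


def snap_to_label(raw, labels=LABELS):
    """Normalize the judge's reply; return None if it is not a known label."""
    cleaned = raw.strip().lower().strip(".")
    return cleaned if cleaned in labels else None


def evaluate(question, document, judge):
    """Format the prompt, call the judge, and return a label (or None)."""
    prompt = TEMPLATE.format(question=question, document=document)
    return snap_to_label(judge(prompt))


def fake_judge(prompt):
    # Stand-in for an LLM call; a real judge would send `prompt` to a model.
    return " Relevant.\n"


print(evaluate("why did paul graham start YC?", "some retrieved text", fake_judge))
```

Restricting the output to a small label set is what makes the results easy to aggregate across many spans.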

# LLM that acts as the judge for all three evaluators
eval_model = OpenAIModel(
    model="gpt-4",
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)

# run_evals returns one dataframe per evaluator, in the order given
hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)
# Judge each query-document pair separately
relevance_eval_df = run_evals(
    dataframe=retrieved_documents_df,
    evaluators=[relevance_evaluator],
    provide_explanation=True,
)[0]

# Log the results to Phoenix so they appear next to the traces in the UI
px.Client().log_evaluations(
    SpanEvaluations(
        eval_name="Hallucination", dataframe=hallucination_eval_df
    ),
    SpanEvaluations(
        eval_name="QA Correctness", dataframe=qa_correctness_eval_df
    ),
    DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df),
)

run_evals |          | 0/6 (0.0%) | ⏳ 00:00<? | ?it/s

run_evals |          | 0/15 (0.0%) | ⏳ 00:00<? | ?it/s
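
Once every query-document pair has a relevance label, the labels can be rolled up into a per-query retrieval metric. The sketch below computes precision at k from plain dictionaries. The field names, the label strings, and the sample rows are made up for illustration and do not reflect the actual columns of relevance_eval_df.

```python
from collections import defaultdict


def precision_at_k(rows, k):
    """Fraction of the top-k retrieved documents labeled relevant, per query."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[row["query_id"]].append(row)
    result = {}
    for query_id, docs in by_query.items():
        top = sorted(docs, key=lambda r: r["position"])[:k]
        result[query_id] = sum(r["label"] == "relevant" for r in top) / len(top)
    return result


# Made-up labels for two queries.
labels = [
    {"query_id": "q1", "position": 0, "label": "relevant"},
    {"query_id": "q1", "position": 1, "label": "unrelated"},
    {"query_id": "q1", "position": 2, "label": "relevant"},
    {"query_id": "q2", "position": 0, "label": "unrelated"},
    {"query_id": "q2", "position": 1, "label": "unrelated"},
]
print(precision_at_k(labels, k=2))
```

A query with low precision is a good candidate for closer inspection in the Phoenix UI, since poor retrieval often explains a poor answer.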

For more details on Phoenix, LLM Tracing, and LLM Evals, check out the documentation.