Observability with Arize Phoenix - Tracing and Evaluating LlamaIndex Applications¶
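LlamaIndex provides high-level APIs that let users build powerful applications in a few lines of code. However, it can be challenging to understand what is going on under the hood and to pinpoint the cause of issues. Phoenix makes your LLM applications observable by visualizing the underlying structure of each call to your query engine and surfacing problematic spans of execution based on latency, token count, or other evaluation metrics.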
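In this tutorial, you will:
- Build a simple query engine with LlamaIndex that uses retrieval-augmented generation to answer questions about a Paul Graham essay
- Record trace data in the OpenInference tracing format using the global arize_phoenix handler
- Inspect the traces and spans of your application to identify sources of latency and cost
- Export your trace data as a pandas dataframe and run an LLM evaluation
ℹ️ This notebook requires an OpenAI API key.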
1. Install Dependencies and Import Libraries¶
Install Phoenix, LlamaIndex, and OpenAI.
In [ ]:
!pip install llama-index
!pip install llama-index-callbacks-arize-phoenix
!pip install arize-phoenix[evals]
!pip install "openinference-instrumentation-llama-index>=1.0.0"
In [ ]:
import json
import os
from getpass import getpass
from urllib.request import urlopen

import nest_asyncio
import openai
import pandas as pd
import phoenix as px
from llama_index.core import (
    Settings,
    set_global_handler,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    RelevanceEvaluator,
    run_evals,
)
from phoenix.session.evaluation import (
    get_qa_with_reference,
    get_retrieved_documents,
)
from phoenix.trace import DocumentEvaluations, SpanEvaluations
from tqdm import tqdm

# Allow nested event loops so async evaluation code can run inside the notebook.
nest_asyncio.apply()
# Show long text columns (queries, documents, explanations) without truncation.
pd.set_option("display.max_colwidth", 1000)
In [ ]:
session = px.launch_app()
🌍 To view the Phoenix app in your browser, visit https://jfgzmj4xrg3-496ff2e9c6d22116-6006-colab.googleusercontent.com/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix
3. Configure Your OpenAI API Key¶
Set your OpenAI API key if it is not already set as an environment variable.
In [ ]:
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
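If you would rather not paste the key into a notebook cell, one option is to prompt for it only when the environment variable is missing. The following is a minimal sketch under that assumption; the helper name `ensure_openai_api_key` and its `env` and `prompt` parameters are hypothetical and not part of OpenAI, LlamaIndex, or Phoenix.

```python
import os
from getpass import getpass


def ensure_openai_api_key(env=os.environ, prompt=getpass):
    """Return the OpenAI API key, prompting for it only if it is not set.

    `ensure_openai_api_key`, `env`, and `prompt` are illustrative names
    (assumptions), not an API of any library used in this notebook.
    """
    key = env.get("OPENAI_API_KEY")
    if not key:
        # getpass hides the typed value, so the key is not echoed in the output.
        key = prompt("Enter your OpenAI API key: ")
        env["OPENAI_API_KEY"] = key
    return key
```

Calling `ensure_openai_api_key()` in a cell leaves an existing key untouched and asks for one otherwise, which keeps the secret out of the saved notebook.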
Download Data¶
In [ ]:
!wget "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt" -O "paul_graham_essay.txt"
--2024-04-26 03:09:56-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘paul_graham_essay.txt’
paul_graham_essay.t 100%[===================>] 73.28K --.-KB/s in 0.01s
2024-04-26 03:09:56 (5.58 MB/s) - ‘paul_graham_essay.txt’ saved [75042/75042]
Load Data¶
In [ ]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["paul_graham_essay.txt"]
).load_data()
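Configure Phoenix Tracing¶
Enable Phoenix tracing within LlamaIndex by setting arize_phoenix as the global handler. This mounts Phoenix's OpenInferenceTraceCallback as the global handler. Phoenix uses OpenInference traces, an open-source standard for capturing and storing LLM application traces that lets LLM applications integrate seamlessly with LLM observability solutions such as Phoenix.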
In [ ]:
set_global_handler("arize_phoenix")
Configure the LLM and Embedding Model¶
In [ ]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.2)
embed_model = OpenAIEmbedding()
Settings.llm = llm
Settings.embed_model = embed_model
Create the Index¶
In [ ]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
Create the Query Engine¶
In [ ]:
query_engine = index.as_query_engine(similarity_top_k=5)
5. Run Your Query Engine and View Your Traces in Phoenix¶
In [ ]:
queries = [
    "what did paul graham do growing up?",
    "why did paul graham start YC?",
]
In [ ]:
for query in tqdm(queries):
    query_engine.query(query)
100%|██████████| 2/2 [00:07<00:00, 3.81s/it]
In [ ]:
print(query_engine.query("Who is Paul Graham?"))
Paul Graham is a writer, entrepreneur, and investor known for his involvement in various projects and ventures. He has written essays on diverse topics, founded companies like Viaweb and Y Combinator, and has a strong presence in the startup and technology industry.
In [ ]:
print(f"🚀 Open the Phoenix UI if you haven't already: {session.url}")
🚀 Open the Phoenix UI if you haven't already: https://jfgzmj4xrg4-496ff2e9c6d22116-6006-colab.googleusercontent.com/
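6. Export and Evaluate Your Trace Data¶
You can export your trace data as a pandas dataframe for further analysis and evaluation.
In this case, we will export our retriever spans into two separate dataframes:
- queries_df, in which the retrieved documents for each query are concatenated into a single column
- retrieved_documents_df, in which each retrieved document is "exploded" into its own row so that each query-document pair can be evaluated in isolation
This lets us compute several kinds of evaluations, including:
- Relevance: Are the retrieved documents relevant to the query?
- Q&A correctness: Are your application's responses grounded in the retrieved context?
- Hallucinations: Is your application making up false information?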
In [ ]:
queries_df = get_qa_with_reference(px.Client())
retrieved_documents_df = get_retrieved_documents(px.Client())
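To make the difference between the two dataframe shapes concrete, here is a small pure-Python sketch of "concatenated" versus "exploded" rows. It does not show how Phoenix builds its dataframes: the sample records, the function names, and the field names (`input`, `reference`, `document_position`) are illustrative assumptions.

```python
# Hypothetical retriever output: one record per query with its retrieved
# documents. These records and field names are made up for illustration.
retrievals = [
    {"query": "q1", "documents": ["doc A", "doc B"]},
    {"query": "q2", "documents": ["doc C"]},
]


def concatenate_documents(records, separator="\n\n"):
    """One row per query; all retrieved documents joined into one string."""
    return [
        {"input": r["query"], "reference": separator.join(r["documents"])}
        for r in records
    ]


def explode_documents(records):
    """One row per (query, document) pair, keeping the document's rank."""
    return [
        {"input": r["query"], "document_position": i, "reference": doc}
        for r in records
        for i, doc in enumerate(r["documents"])
    ]


per_query_rows = concatenate_documents(retrievals)      # shape of queries_df
per_document_rows = explode_documents(retrievals)       # shape of retrieved_documents_df
print(len(per_query_rows), len(per_document_rows))
```

The per-query rows suit evaluations that judge a whole response against all of its context, while the per-document rows suit evaluations that score each retrieved document on its own.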
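Next, define your evaluation model and your evaluators.
Evaluators are built on top of language models. They prompt the LLM to assess the quality of responses, the relevance of retrieved documents, and so on, providing a quality signal even in the absence of human-labeled data. Pick an evaluator type and instantiate it with the language model you want to use to perform evaluations using our battle-tested evaluation templates.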
In [ ]:
eval_model = OpenAIModel(
    model="gpt-4",
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)

# Response-level evaluations: one result dataframe per evaluator, in order.
hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)
# Document-level evaluation: one row per query-document pair.
relevance_eval_df = run_evals(
    dataframe=retrieved_documents_df,
    evaluators=[relevance_evaluator],
    provide_explanation=True,
)[0]

# Attach the evaluation results to the corresponding spans and documents in Phoenix.
px.Client().log_evaluations(
    SpanEvaluations(
        eval_name="Hallucination", dataframe=hallucination_eval_df
    ),
    SpanEvaluations(
        eval_name="QA Correctness", dataframe=qa_correctness_eval_df
    ),
    DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df),
)
run_evals | | 0/6 (0.0%) | ⏳ 00:00<? | ?it/s
run_evals | | 0/15 (0.0%) | ⏳ 00:00<? | ?it/s
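Once the evaluations have run, a quick aggregate can help you decide where to look first in the Phoenix UI. The sketch below summarizes evaluation rows into label counts and a mean score. It assumes each row carries a `label` and a numeric `score`; the function name `summarize_evaluations` is hypothetical, and the sample rows are made up for illustration rather than taken from this notebook's results.

```python
from collections import Counter


def summarize_evaluations(rows):
    """Summarize evaluation rows into label counts and a mean score.

    Assumes each row is a dict with a "label" string and a numeric
    "score"; rows whose score is None are left out of the mean.
    """
    labels = Counter(row["label"] for row in rows)
    scores = [row["score"] for row in rows if row.get("score") is not None]
    mean_score = sum(scores) / len(scores) if scores else None
    return {"labels": dict(labels), "mean_score": mean_score}


# Made-up rows for illustration only; not output from the cells above.
example_rows = [
    {"label": "relevant", "score": 1},
    {"label": "unrelated", "score": 0},
    {"label": "relevant", "score": 1},
    {"label": "relevant", "score": 1},
]
summary = summarize_evaluations(example_rows)
```

If the evaluation dataframes expose `label` and `score` columns, something like `summarize_evaluations(relevance_eval_df.to_dict("records"))` would apply the same summary to real results.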
For more details on Phoenix, LLM tracing, and LLM evals, check out the documentation.