Recursive Retriever + Document Agents¶
This guide shows how to combine recursive retrieval with "document agents" to enable advanced decision-making over heterogeneous documents.
There are two motivations for seeking better retrieval:
- Decoupling retrieval embeddings from chunk-based synthesis. Retrieving via document summaries often returns more relevant context for a query than raw text chunks, and recursive retrieval enables exactly this.
- Within a single document, users may need dynamic tasks that go beyond fact-based question answering. For this we introduce the concept of a "document agent": an agent that has access to both vector search and summary tools over a given document.
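The first point (retrieve against short summaries, but synthesize from the chunks they link to) can be sketched in plain Python. This is an illustrative mock only, not the LlamaIndex API; the word-overlap scoring stands in for embedding similarity:

```python
# Illustrative mock (not LlamaIndex API): retrieval matches against short
# document summaries, while synthesis uses the full chunks that the
# matching summary links to.
docs = {
    "doc1": {"summary": "city finance and economy overview",
             "chunks": ["chunk A about banks", "chunk B about GDP"]},
    "doc2": {"summary": "sports teams and stadiums",
             "chunks": ["chunk C about baseball", "chunk D about arenas"]},
}

def retrieve_chunks(query: str) -> list:
    # Stage 1: score summaries (crude word overlap stands in for embeddings).
    query_words = set(query.lower().split())
    def score(doc_id):
        return len(query_words & set(docs[doc_id]["summary"].split()))
    best = max(docs, key=score)
    # Stage 2: follow the link and return the chunks for synthesis.
    return docs[best]["chunks"]

print(retrieve_chunks("economy and finance"))
# -> ['chunk A about banks', 'chunk B about GDP']
```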
Setup and Download Data¶
In this section, we define the imports and then download Wikipedia articles about different cities. Each article is stored separately.
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
%pip install llama-index-llms-openai
%pip install llama-index-agent-openai
In [ ]:
!pip install llama-index
In [ ]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import SummaryIndex
from llama_index.core.schema import IndexNode
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.openai import OpenAI
In [ ]:
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]
In [ ]:
from pathlib import Path

import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

    data_path = Path("data")
    if not data_path.exists():
        Path.mkdir(data_path)

    with open(data_path / f"{title}.txt", "w") as fp:
        fp.write(wiki_text)
In [ ]:
# Load all wiki documents
city_docs = {}
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()
Define LLM + Service Context + Callback Manager
In [ ]:
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
In [ ]:
from llama_index.core import Settings
Settings.llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
Build Document Agent for each Document¶
In this section we define a "document agent" for each document.
First we define both a vector index (for semantic search) and a summary index (for summarization) for each document. The two query engines are then converted into tools that are passed to an OpenAI function-calling agent.
This document agent can dynamically choose to perform semantic search or summarization within a given document.
We create a separate document agent for each city.
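Before the real implementation below, here is a framework-free sketch of the decision each document agent makes. The actual agent selects a tool via OpenAI function calling; the keyword heuristic here is purely illustrative:

```python
# Illustrative only: a stand-in for the LLM's tool selection.
# The real OpenAIAgent chooses between vector_tool and summary_tool via
# function calling; this heuristic just mimics that choice.
def route_query(query: str) -> str:
    summary_keywords = ("summarize", "summary", "overview")
    if any(kw in query.lower() for kw in summary_keywords):
        return "summary_tool"  # whole-document summarization
    return "vector_tool"  # targeted semantic lookup

print(route_query("Give me a summary of Toronto"))  # summary_tool
print(route_query("What sports teams are in Boston?"))  # vector_tool
```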
In [ ]:
from llama_index.agent.openai import OpenAIAgent

# Build agents dictionary
agents = {}

for wiki_title in wiki_titles:
    # build vector index
    vector_index = VectorStoreIndex.from_documents(
        city_docs[wiki_title],
    )

    # build summary index
    summary_index = SummaryIndex.from_documents(
        city_docs[wiki_title],
    )
    # define query engines
    vector_query_engine = vector_index.as_query_engine()
    list_query_engine = summary_index.as_query_engine()

    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name="vector_tool",
                description=(
                    f"Useful for retrieving specific context from {wiki_title}"
                ),
            ),
        ),
        QueryEngineTool(
            query_engine=list_query_engine,
            metadata=ToolMetadata(
                name="summary_tool",
                description=(
                    "Useful for summarization questions related to"
                    f" {wiki_title}"
                ),
            ),
        ),
    ]

    # build agent
    function_llm = OpenAI(model="gpt-3.5-turbo-0613")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=True,
    )

    agents[wiki_title] = agent
Build Composable Retriever over these Agents¶
Now we define a set of summary nodes, where each node links to the corresponding Wikipedia city article. We then define a composable retriever + query engine on top of these nodes, which routes a query to a given node, which in turn routes it to the relevant document agent.
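The two-level routing described above can be sketched in plain Python. This is a mock, not the LlamaIndex API: a top-level retrieval step picks one city node, which then delegates the query to that city's agent (names like `mock_agents` are hypothetical and chosen to avoid clobbering the notebook's real `agents` dict):

```python
# Mock of the two-level recursive retrieval flow (not LlamaIndex API).
# Stage 1: a top-level retriever picks the best-matching city node.
# Stage 2: the chosen node delegates the query to that city's agent.
def top_level_retrieve(query: str, cities: list) -> str:
    # Stand-in for vector similarity: pick the city named in the query.
    for city in cities:
        if city.lower() in query.lower():
            return city
    return cities[0]

def run_query(query: str, city_agents: dict) -> str:
    city = top_level_retrieve(query, list(city_agents))
    return city_agents[city](query)  # delegate to the document agent

mock_agents = {
    c: (lambda q, c=c: f"[{c} agent] answering: {q}")
    for c in ["Toronto", "Seattle", "Boston"]
}
print(run_query("Tell me about the sports teams in Boston", mock_agents))
# -> [Boston agent] answering: Tell me about the sports teams in Boston
```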
In [ ]:
# define top-level nodes
objects = []
for wiki_title in wiki_titles:
    # define index node that links to these agents
    wiki_summary = (
        f"This content contains Wikipedia articles about {wiki_title}. Use"
        " this index if you need to lookup specific facts about"
        f" {wiki_title}.\nDo not use this index if you want to analyze"
        " multiple cities."
    )
    node = IndexNode(
        text=wiki_summary, index_id=wiki_title, obj=agents[wiki_title]
    )
    objects.append(node)
In [ ]:
# define top-level retriever
vector_index = VectorStoreIndex(
objects=objects,
)
query_engine = vector_index.as_query_engine(similarity_top_k=1, verbose=True)
Running Example Queries¶
In [ ]:
# should use Boston agent -> vector tool
response = query_engine.query("Tell me about the sports teams in Boston")
Retrieval entering Boston: OpenAIAgent Retrieving from object OpenAIAgent with query Tell me about the sports teams in Boston Added user message to memory: Tell me about the sports teams in Boston
In [ ]:
print(response)
Boston is home to several professional sports teams across different leagues, including a successful baseball team in Major League Baseball, a highly successful American football team in the National Football League, one of the most successful basketball teams in the NBA, a professional ice hockey team in the National Hockey League, and a professional soccer team in Major League Soccer. These teams have a rich history, passionate fan bases, and have achieved great success both locally and nationally.
In [ ]:
# should use Houston agent -> vector tool
response = query_engine.query("Tell me about the sports teams in Houston")
Retrieval entering Houston: OpenAIAgent Retrieving from object OpenAIAgent with query Tell me about the sports teams in Houston Added user message to memory: Tell me about the sports teams in Houston
In [ ]:
print(response)
Houston is home to several professional sports teams across different leagues, including the Houston Texans in the NFL, the Houston Rockets in the NBA, the Houston Astros in MLB, the Houston Dynamo in MLS, and the Houston Dash in NWSL. These teams compete in football, basketball, baseball, soccer, and women's soccer respectively, and have achieved various levels of success in their respective leagues. Additionally, the city also has minor league baseball, hockey, and other sports teams that cater to sports enthusiasts.
In [ ]:
# should use Chicago agent -> summary tool
response = query_engine.query(
"Give me a summary on all the positive aspects of Chicago"
)
Retrieval entering Chicago: OpenAIAgent Retrieving from object OpenAIAgent with query Give me a summary on all the positive aspects of Chicago Added user message to memory: Give me a summary on all the positive aspects of Chicago === Calling Function === Calling function: summary_tool with args: { "input": "positive aspects of Chicago" } Got output: Chicago is recognized for its robust economy, acting as a key hub for finance, culture, commerce, industry, education, technology, telecommunications, and transportation. It stands out in the derivatives market and is a top-ranking city in terms of gross domestic product. Chicago is a favored destination for tourists, known for its rich art scene covering visual arts, literature, film, theater, comedy, food, dance, and music. The city hosts prestigious educational institutions and professional sports teams across different leagues. ========================
In [ ]:
print(response)
Chicago is known for its strong economy with a focus on finance, culture, commerce, industry, education, technology, telecommunications, and transportation. It is a major player in the derivatives market and boasts a high gross domestic product. The city is a popular tourist destination with a vibrant art scene that includes visual arts, literature, film, theater, comedy, food, dance, and music. Additionally, Chicago is home to prestigious educational institutions and professional sports teams across various leagues.