Recursive Retriever + Query Engine Demo¶
In this demo, we walk through a use case that applies the "recursive retriever" module over hierarchical data.
The core idea of recursive retrieval is that we not only explore the most directly relevant nodes, but also follow node relationships to additional retrievers/query engines and execute those. For instance, a node may be a concise summary of a structured table that links to a SQL/Pandas query engine over that table; when the node is retrieved, we also want to query the underlying engine for the answer.
This is especially useful for documents with hierarchical relationships. In this example, we work with a Wikipedia article about billionaires (in PDF form), which contains both text and several embedded structured tables. We first create a Pandas query engine over each table, and also represent each table by an IndexNode (which stores a link to its query engine); these nodes are stored along with the other nodes in a vector store.
During query time, if an IndexNode is retrieved, the underlying query engine/retriever will be queried.
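The retrieve-then-route control flow can be sketched in plain Python. The `retrieve` function, `engines` dict, and node dicts below are hypothetical stand-ins for illustration, not LlamaIndex APIs:

```python
# Toy sketch of recursive retrieval: if the best-matching node is an
# "index node" that links to a query engine, route the query to that
# engine; otherwise just return the node's text.
def recursive_query(query: str, retrieve, engines: dict) -> str:
    node = retrieve(query)
    index_id = node.get("index_id")
    if index_id in engines:
        # Recurse into the linked query engine (e.g. a table engine).
        return engines[index_id](query)
    return node["text"]


# Stand-in retriever and table engine for illustration only.
def retrieve(query: str) -> dict:
    if "net worth" in query:
        return {"index_id": "pandas0", "text": "summary of the 2023 table"}
    return {"index_id": None, "text": "plain text chunk"}


engines = {"pandas0": lambda q: "$180 billion"}

print(recursive_query("net worth of the second richest?", retrieve, engines))
print(recursive_query("some other question", retrieve, engines))
```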
Notes on setup
We use camelot to extract text-based tables from PDFs.
In [ ]:
%pip install llama-index-embeddings-openai
%pip install llama-index-readers-file pymupdf
%pip install llama-index-llms-openai
%pip install llama-index-experimental
In [ ]:
import camelot
# https://en.wikipedia.org/wiki/The_World%27s_Billionaires
from llama_index.core import VectorStoreIndex
from llama_index.experimental.query_engine import PandasQueryEngine
from llama_index.core.schema import IndexNode
from llama_index.llms.openai import OpenAI
from llama_index.readers.file import PyMuPDFReader
from typing import List
Default Settings¶
In [ ]:
import os

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
In [ ]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
In [ ]:
file_path = "billionaires_page.pdf"
In [ ]:
# initialize PDF reader
reader = PyMuPDFReader()
In [ ]:
docs = reader.load(file_path)
In [ ]:
# use camelot to parse tables
def get_tables(path: str, pages: List[int]):
    table_dfs = []
    for page in pages:
        table_list = camelot.read_pdf(path, pages=str(page))
        table_df = table_list[0].df
        table_df = (
            table_df.rename(columns=table_df.iloc[0])
            .drop(table_df.index[0])
            .reset_index(drop=True)
        )
        table_dfs.append(table_df)
    return table_dfs
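The `rename`/`drop`/`reset_index` chain above promotes the first raw row to the column header, since camelot returns tables with numeric columns and the header text stored in row 0. A minimal pandas illustration on a toy frame:

```python
import pandas as pd

# camelot-style raw table: numeric columns, header text in row 0.
raw = pd.DataFrame([["Name", "Net worth"], ["Elon Musk", "$180 billion"]])

cleaned = (
    raw.rename(columns=raw.iloc[0])  # promote row 0 to column names
    .drop(raw.index[0])              # drop the now-redundant header row
    .reset_index(drop=True)          # renumber rows from 0
)

print(list(cleaned.columns))   # ['Name', 'Net worth']
print(cleaned.loc[0, "Name"])  # Elon Musk
```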
In [ ]:
table_dfs = get_tables(file_path, pages=[3, 25])
In [ ]:
# shows list of top billionaires in 2023
table_dfs[0]
Out[ ]:
| | No. | Name | Net worth (USD) | Age | Nationality | Primary source(s) of wealth |
|---|---|---|---|---|---|---|
| 0 | 1 | Bernard Arnault & family | $211 billion | 74 | France | LVMH |
| 1 | 2 | Elon Musk | $180 billion | 51 | United States | Tesla, SpaceX, X Corp. |
| 2 | 3 | Jeff Bezos | $114 billion | 59 | United States | Amazon |
| 3 | 4 | Larry Ellison | $107 billion | 78 | United States | Oracle Corporation |
| 4 | 5 | Warren Buffett | $106 billion | 92 | United States | Berkshire Hathaway |
| 5 | 6 | Bill Gates | $104 billion | 67 | United States | Microsoft |
| 6 | 7 | Michael Bloomberg | $94.5 billion | 81 | United States | Bloomberg L.P. |
| 7 | 8 | Carlos Slim & family | $93 billion | 83 | Mexico | Telmex, América Móvil, Grupo Carso |
| 8 | 9 | Mukesh Ambani | $83.4 billion | 65 | India | Reliance Industries |
| 9 | 10 | Steve Ballmer | $80.7 billion | 67 | United States | Microsoft |
In [ ]:
# shows number of billionaires and combined net worth per year
table_dfs[1]
Out[ ]:
| | Year | Number of billionaires | Group's combined net worth |
|---|---|---|---|
| 0 | 2023[2] | 2,640 | $12.2 trillion |
| 1 | 2022[6] | 2,668 | $12.7 trillion |
| 2 | 2021[11] | 2,755 | $13.1 trillion |
| 3 | 2020 | 2,095 | $8.0 trillion |
| 4 | 2019 | 2,153 | $8.7 trillion |
| 5 | 2018 | 2,208 | $9.1 trillion |
| 6 | 2017 | 2,043 | $7.7 trillion |
| 7 | 2016 | 1,810 | $6.5 trillion |
| 8 | 2015[18] | 1,826 | $7.1 trillion |
| 9 | 2014[67] | 1,645 | $6.4 trillion |
| 10 | 2013[68] | 1,426 | $5.4 trillion |
| 11 | 2012 | 1,226 | $4.6 trillion |
| 12 | 2011 | 1,210 | $4.5 trillion |
| 13 | 2010 | 1,011 | $3.6 trillion |
| 14 | 2009 | 793 | $2.4 trillion |
| 15 | 2008 | 1,125 | $4.4 trillion |
| 16 | 2007 | 946 | $3.5 trillion |
| 17 | 2006 | 793 | $2.6 trillion |
| 18 | 2005 | 691 | $2.2 trillion |
| 19 | 2004 | 587 | $1.9 trillion |
| 20 | 2003 | 476 | $1.4 trillion |
| 21 | 2002 | 497 | $1.5 trillion |
| 22 | 2001 | 538 | $1.8 trillion |
| 23 | 2000 | 470 | $898 billion |
| 24 | Sources: Forbes.[18][67][66][68] | | |
Create Pandas Query Engines¶
We create a pandas query engine over each structured table.
These can be executed on their own to answer queries about each table.
Warning: this tool gives the LLM access to the eval function, so arbitrary code execution is possible on the machine running it. Although some filtering is applied to the generated code, this tool is not recommended for production use without strict sandboxing or virtual machine isolation.
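The warning stems from how such an engine answers questions: the LLM emits a pandas expression as a string, which is then evaluated against the frame. A conceptual illustration of that `eval` surface (not the actual LlamaIndex implementation):

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Bernard Arnault", "Elon Musk"], "Net worth": [211, 180]}
)

# Pretend the LLM generated this pandas expression for
# "who is the second richest?" -- eval() runs it verbatim, which is
# exactly the arbitrary-code-execution surface the warning refers to.
llm_generated = "df.sort_values('Net worth', ascending=False).iloc[1]['Name']"
result = eval(llm_generated)
print(result)  # Elon Musk
```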
In [ ]:
# define query engines over these tables
llm = OpenAI(model="gpt-4")
df_query_engines = [
    PandasQueryEngine(table_df, llm=llm) for table_df in table_dfs
]
In [ ]:
response = df_query_engines[0].query(
    "What's the net worth of the second richest billionaire in 2023?"
)
print(str(response))
$180 billion
In [ ]:
response = df_query_engines[1].query(
    "How many billionaires were there in 2009?"
)
print(str(response))
793
Build Vector Index¶
Build a vector index over the chunked document as well as the additional IndexNode objects linked to the tables.
In [ ]:
from llama_index.core import Settings

doc_nodes = Settings.node_parser.get_nodes_from_documents(docs)
In [ ]:
# define index nodes
summaries = [
    (
        "This node provides information about the world's richest billionaires"
        " in 2023"
    ),
    (
        "This node provides information on the number of billionaires and"
        " their combined net worth from 2000 to 2023."
    ),
]

df_nodes = [
    IndexNode(text=summary, index_id=f"pandas{idx}")
    for idx, summary in enumerate(summaries)
]

df_id_query_engine_mapping = {
    f"pandas{idx}": df_query_engine
    for idx, df_query_engine in enumerate(df_query_engines)
}
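For the recursive lookup to work, each IndexNode's `index_id` must match a key of `df_id_query_engine_mapping`. A quick sanity check of that id convention, using stand-in strings for the engines:

```python
summaries = ["2023 top-ten table", "yearly counts table"]  # abridged
engines = ["<engine for table 0>", "<engine for table 1>"]  # stand-ins

# Same f"pandas{idx}" scheme for both the node ids and the mapping keys.
node_ids = [f"pandas{idx}" for idx in range(len(summaries))]
mapping = {f"pandas{idx}": eng for idx, eng in enumerate(engines)}

# Every node id resolves to exactly one query engine.
assert set(node_ids) == set(mapping.keys())
print(sorted(mapping))  # ['pandas0', 'pandas1']
```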
In [ ]:
# construct top-level vector index + query engine
vector_index = VectorStoreIndex(doc_nodes + df_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)
Use RecursiveRetriever in our RetrieverQueryEngine¶
We define a RecursiveRetriever object to recursively retrieve/query nodes. We then put this in a RetrieverQueryEngine, along with a ResponseSynthesizer, to synthesize a response.
The retriever takes two mappings: one from retriever id to retriever, and one from query engine id to query engine. Finally, we specify a root id, representing the retriever used for the initial query.
In [ ]:
# baseline vector index (that doesn't include the extra df nodes).
# used to benchmark
vector_index0 = VectorStoreIndex(doc_nodes)
vector_query_engine0 = vector_index0.as_query_engine()
In [ ]:
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer

recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=df_id_query_engine_mapping,
    verbose=True,
)

response_synthesizer = get_response_synthesizer(response_mode="compact")

query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever, response_synthesizer=response_synthesizer
)
In [ ]:
response = query_engine.query(
    "What's the net worth of the second richest billionaire in 2023?"
)
Retrieving with query id None: What's the net worth of the second richest billionaire in 2023?
Retrieved node with id, entering: pandas0
Retrieving with query id pandas0: What's the net worth of the second richest billionaire in 2023?
Got response: $180 billion
In [ ]:
response.source_nodes[0].node.get_content()
Out[ ]:
"Query: What's the net worth of the second richest billionaire in 2023?\nResponse: $180\xa0billion"
In [ ]:
str(response)
Out[ ]:
'$180 billion.'
In [ ]:
response = query_engine.query("How many billionaires were there in 2009?")
Retrieving with query id None: How many billionaires were there in 2009?
Retrieved node with id, entering: pandas1
Retrieving with query id pandas1: How many billionaires were there in 2009?
Got response: 793
In [ ]:
str(response)
Out[ ]:
'793'
In [ ]:
response = vector_query_engine0.query(
    "How many billionaires were there in 2009?"
)
In [ ]:
print(response.source_nodes[0].node.get_content())
In [ ]:
print(str(response))
Based on the context information, it is not possible to determine the exact number of billionaires in 2009. The provided information only mentions the number of billionaires in 2013 and 2014.
In [ ]:
response.source_nodes[0].node.get_content()
In [ ]:
response = query_engine.query(
    "Which billionaires are excluded from this list?"
)
In [ ]:
print(str(response))
Royal families and dictators whose wealth is contingent on a position are excluded from this list.