A Simple-to-Advanced Guide to Auto-Retrieval (with Pinecone + Arize Phoenix)¶
In this notebook we show how to perform auto-retrieval against Pinecone, which lets you execute a broad range of semi-structured queries well beyond what standard top-k semantic search can do.
We show how to set up basic auto-retrieval, as well as how to extend it (with custom prompting and dynamic metadata retrieval).
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-pinecone
# !pip install llama-index>=0.9.31 scikit-learn==1.2.2 arize-phoenix==2.4.1 pinecone-client>=3.0.0
Part 1: Setting Up Auto-Retrieval¶
To set up auto-retrieval, do the following:
- Perform the initial setup: load the data and build a Pinecone vector index
- Define the auto-retriever and run some sample queries
- Use Phoenix to inspect each trace and visualize the prompt inputs/outputs
- Show how to customize the auto-retrieval prompt template
1.a Set Up Pinecone/Phoenix, Load Data, and Build Vector Index¶
In this section we set up Pinecone and ingest some sample data about books/movies (including both text and metadata).
We also set up Phoenix so that it captures downstream traces.
# setup Phoenix
import phoenix as px
import llama_index.core
px.launch_app()
llama_index.core.set_global_handler("arize_phoenix")
🌍 To view the Phoenix app in your browser, visit http://127.0.0.1:6006/ 📺 To view the Phoenix app in a notebook, run `px.active_session().view()` 📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix
import os
os.environ[
"PINECONE_API_KEY"
] = "<Your Pinecone API key, from app.pinecone.io>"
# os.environ["OPENAI_API_KEY"] = "sk-..."
from pinecone import Pinecone
from pinecone import ServerlessSpec
api_key = os.environ["PINECONE_API_KEY"]
pc = Pinecone(api_key=api_key)
# delete if needed
# pc.delete_index("quickstart-index")
# Dimensions are for text-embedding-ada-002
try:
pc.create_index(
"quickstart-index",
dimension=1536,
metric="euclidean",
spec=ServerlessSpec(cloud="aws", region="us-west-2"),
)
except Exception as e:
# Most likely index already exists
print(e)
pass
pinecone_index = pc.Index("quickstart-index")
Load documents, build the PineconeVectorStore and VectorStoreIndex¶
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.pinecone import PineconeVectorStore
from llama_index.core.schema import TextNode
nodes = [
TextNode(
text="The Shawshank Redemption",
metadata={
"author": "Stephen King",
"theme": "Friendship",
"year": 1994,
},
),
TextNode(
text="The Godfather",
metadata={
"director": "Francis Ford Coppola",
"theme": "Mafia",
"year": 1972,
},
),
TextNode(
text="Inception",
metadata={
"director": "Christopher Nolan",
"theme": "Fiction",
"year": 2010,
},
),
TextNode(
text="To Kill a Mockingbird",
metadata={
"author": "Harper Lee",
"theme": "Fiction",
"year": 1960,
},
),
TextNode(
text="1984",
metadata={
"author": "George Orwell",
"theme": "Totalitarianism",
"year": 1949,
},
),
TextNode(
text="The Great Gatsby",
metadata={
"author": "F. Scott Fitzgerald",
"theme": "The American Dream",
"year": 1925,
},
),
TextNode(
text="Harry Potter and the Sorcerer's Stone",
metadata={
"author": "J.K. Rowling",
"theme": "Fiction",
"year": 1997,
},
),
]
vector_store = PineconeVectorStore(
pinecone_index=pinecone_index,
namespace="test",
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)
Upserted vectors: 0%| | 0/7 [00:00<?, ?it/s]
1.b Define the Auto-Retriever and Run Some Sample Queries¶
Set up the VectorIndexAutoRetriever¶
One of its inputs is a schema describing what the vector store collection contains, similar to a table schema describing a table in a SQL database. This schema information is injected into the prompt passed to the LLM, which then infers what the full query should be (including metadata filters).
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo
vector_store_info = VectorStoreInfo(
content_info="famous books and movies",
metadata_info=[
MetadataInfo(
name="director",
type="str",
description=("Name of the director"),
),
MetadataInfo(
name="theme",
type="str",
description=("Theme of the book/movie"),
),
MetadataInfo(
name="year",
type="int",
description=("Year of the book/movie"),
),
],
)
retriever = VectorIndexAutoRetriever(
index,
vector_store_info=vector_store_info,
empty_query_top_k=10,
# this is a hack to allow for blank queries in pinecone
default_empty_query_vector=[0] * 1536,
verbose=True,
)
Run some sample queries¶
Let's run a few sample queries that make use of the structured information.
nodes = retriever.retrieve(
"Tell me about some books/movies after the year 2000"
)
Using query str:
Using filters: [('year', '>', 2000)]
for node in nodes:
print(node.text)
print(node.metadata)
Inception
{'director': 'Christopher Nolan', 'theme': 'Fiction', 'year': 2010}
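Under the hood, the auto-retriever asks the LLM to emit a structured request, which is then parsed into a semantic query string plus metadata filters. A minimal sketch of parsing such a response with plain `json` (the JSON shape follows the schema in the prompt template shown later in this notebook; the values are illustrative, matching the filter inferred above):

```python
import json

# Illustrative structured request for the query above: blank query
# string, one inferred metadata filter (year > 2000), no top_k.
llm_output = """
{"query": "", "filters": [{"key": "year", "value": 2000, "operator": ">"}], "top_k": null}
"""

request = json.loads(llm_output)
print(request["query"])      # -> "" (blank semantic query)
print(request["filters"])    # -> [{'key': 'year', 'value': 2000, 'operator': '>'}]
print(request["top_k"])      # -> None
```

The parsed filters are what you see echoed as `Using filters: [('year', '>', 2000)]` in the verbose output.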
nodes = retriever.retrieve("Tell me about some books that are Fiction")
Using query str: Fiction
Using filters: [('theme', '==', 'Fiction')]
for node in nodes:
print(node.text)
print(node.metadata)
Inception
{'director': 'Christopher Nolan', 'theme': 'Fiction', 'year': 2010}
To Kill a Mockingbird
{'author': 'Harper Lee', 'theme': 'Fiction', 'year': 1960}
Pass in Additional Metadata Filters¶
If you have additional metadata filters you want to pass in that aren't auto-inferred, do the following.
from llama_index.core.vector_stores import MetadataFilters
filter_dicts = [{"key": "year", "operator": "==", "value": 1997}]
filters = MetadataFilters.from_dicts(filter_dicts)
retriever2 = VectorIndexAutoRetriever(
index,
vector_store_info=vector_store_info,
empty_query_top_k=10,
# this is a hack to allow for blank queries in pinecone
default_empty_query_vector=[0] * 1536,
extra_filters=filters,
)
nodes = retriever2.retrieve("Tell me about some books that are Fiction")
for node in nodes:
print(node.text)
print(node.metadata)
Harry Potter and the Sorcerer's Stone
{'author': 'J.K. Rowling', 'theme': 'Fiction', 'year': 1997}
Example of a failing query¶
Note: no results are retrieved here! We'll fix this later on.
nodes = retriever.retrieve("Tell me about some books that are mafia-themed")
Using query str: books
Using filters: [('theme', '==', 'mafia')]
for node in nodes:
print(node.text)
print(node.metadata)
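The empty result is a casing mismatch: the LLM inferred the filter value 'mafia', while the stored metadata says 'Mafia', and the equality filter matches exactly. A stand-alone sketch over the sample metadata (plain Python, not the Pinecone API) reproducing the problem:

```python
# A subset of the sample records' theme metadata, as stored above.
records = [
    {"text": "The Godfather", "theme": "Mafia"},
    {"text": "Inception", "theme": "Fiction"},
]

def filter_by_theme(records, theme):
    """Exact, case-sensitive equality -- how the '==' metadata filter behaves."""
    return [r["text"] for r in records if r["theme"] == theme]

print(filter_by_theme(records, "mafia"))  # -> [] (inferred lowercase value misses)
print(filter_by_theme(records, "Mafia"))  # -> ['The Godfather']
```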
Part 2: Extending Auto-Retrieval (with Dynamic Metadata Retrieval)¶
We now extend auto-retrieval by customizing the prompt. In the first part, we explicitly add some rules.
In the second part, we implement dynamic metadata retrieval, a two-stage retrieval process: the first stage fetches relevant metadata from the vector database and inserts it as few-shot examples into the auto-retrieval prompt (the second stage, of course, retrieves the actual entries from the vector database).
2.a Improve the Auto-Retrieval Prompt¶
Our current auto-retrieval prompt works, but it can be improved in various ways. For example, it hardcodes two few-shot examples (how do you include your own?), and auto-retrieval doesn't always infer the right metadata filters.
For instance, all the theme values are capitalized. How do we tell the LLM that, so it doesn't incorrectly infer a lowercase "theme" value?
Let's take a stab at modifying the prompt!
from llama_index.core.prompts import display_prompt_dict
from llama_index.core import PromptTemplate
prompts_dict = retriever.get_prompts()
display_prompt_dict(prompts_dict)
# look at required template variables.
prompts_dict["prompt"].template_vars
['schema_str', 'info_str', 'query_str']
Customize the Prompt¶
Let's customize the prompt a little bit. We do the following:
- Remove the first few-shot example to save tokens
- Add a rule to always capitalize the first letter of the inferred theme
Note that the prompt template expects schema_str, info_str, and query_str to be defined up front.
# write prompt template, and modify it.
prompt_tmpl_str = """\
Your goal is to structure the user's query to match the request schema provided below.
<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:
{schema_str}
The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters take into account the descriptions of attributes.
Make sure that filters are only used as needed. If there are no filters that should be applied return [] for the filter value.
If the user's query explicitly mentions number of documents to retrieve, set top_k to that number, otherwise do not set top_k.
Do NOT EVER infer a null value for a filter. This will break the downstream program. Instead, don't include the filter.
<< Example 1. >>
Data Source:
```json
{{
"metadata_info": [
{{
"name": "author",
"type": "str",
"description": "Author name"
}},
{{
"name": "book_title",
"type": "str",
"description": "Book title"
}},
{{
"name": "year",
"type": "int",
"description": "Year Published"
}},
{{
"name": "pages",
"type": "int",
"description": "Number of pages"
}},
{{
"name": "summary",
"type": "str",
"description": "A short summary of the book"
}}
],
"content_info": "Classic literature"
}}
```
User Query:
What are some books by Jane Austen published after 1813 that explore the theme of marriage for social standing?
Additional Instructions:
None
Structured Request:
```json
{{"query": "Books related to theme of marriage for social standing", "filters": [{{"key": "year", "value": "1813", "operator": ">"}}, {{"key": "author", "value": "Jane Austen", "operator": "=="}}], "top_k": null}}
```
<< Example 2. >>
Data Source:
```json
{info_str}
```
User Query:
{query_str}
Additional Instructions:
{additional_instructions}
Structured Request:
"""
prompt_tmpl = PromptTemplate(prompt_tmpl_str)
You'll notice that we added an additional_instructions template variable. This allows us to insert instructions specific to the vector collection.
We'll use partial_format to add the instruction.
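The idea behind partial_format is to fill in one template variable now while leaving the remaining placeholders ({schema_str}, {info_str}, {query_str}) to be formatted later at query time. A rough stand-in in plain Python (illustrative only, not the actual PromptTemplate implementation):

```python
template = (
    "Schema: {schema_str}\n"
    "Query: {query_str}\n"
    "Additional Instructions: {additional_instructions}"
)

def partial_format(tmpl: str, **kwargs: str) -> str:
    # Substitute only the given variables; leave other {placeholders} intact.
    for key, value in kwargs.items():
        tmpl = tmpl.replace("{" + key + "}", value)
    return tmpl

partially_filled = partial_format(
    template, additional_instructions="Capitalize inferred theme values."
)
print(partially_filled)  # {schema_str} and {query_str} remain unfilled
```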
add_instrs = """\
If one of the filters is 'theme', please make sure that the first letter of the inferred value is capitalized. Only words that are capitalized are valid values for "theme". \
"""
prompt_tmpl = prompt_tmpl.partial_format(additional_instructions=add_instrs)
retriever.update_prompts({"prompt": prompt_tmpl})
Re-run some queries¶
Now let's try rerunning some queries. You'll find that the value is properly inferred.
nodes = retriever.retrieve(
"Tell me about some books that are friendship-themed"
)
for node in nodes:
print(node.text)
print(node.metadata)
2.b Implement Dynamic Metadata Retrieval¶
An alternative to hardcoding rules in the prompt is to retrieve relevant few-shot examples of metadata, to help the LLM better infer the correct metadata filters.
This better prevents the LLM from making mistakes when inferring "where" clauses, especially around details like spelling and correct value formatting.
We can do this via vector retrieval. The existing vector database collection stores the raw text plus metadata; we could query this collection directly, or separately index just the metadata and retrieve from that index. In this section we do the former, though in practice you may prefer the latter.
# define retriever that fetches the top 2 examples.
metadata_retriever = index.as_retriever(similarity_top_k=2)
We use the same prompt_tmpl_str defined in the previous section.
from typing import List, Any
def format_additional_instrs(**kwargs: Any) -> str:
"""Format examples into a string."""
nodes = metadata_retriever.retrieve(kwargs["query_str"])
context_str = (
"Here is the metadata of relevant entries from the database collection. "
"This should help you infer the right filters: \n"
)
for node in nodes:
context_str += str(node.node.metadata) + "\n"
return context_str
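Seen in isolation, the helper produces a context string like the one below. A stand-alone check with stubbed node objects (mimicking the `.node.metadata` shape of retrieved results; no Pinecone call involved):

```python
class _StubNode:
    """Minimal stand-in for a node carrying metadata."""
    def __init__(self, metadata):
        self.metadata = metadata

class _StubNodeWithScore:
    """Mimics the .node attribute of a retrieval result."""
    def __init__(self, metadata):
        self.node = _StubNode(metadata)

def format_instrs_stub(nodes):
    # Same formatting logic as format_additional_instrs, minus the retrieval.
    context_str = (
        "Here is the metadata of relevant entries from the database collection. "
        "This should help you infer the right filters: \n"
    )
    for node in nodes:
        context_str += str(node.node.metadata) + "\n"
    return context_str

stub_nodes = [
    _StubNodeWithScore(
        {"director": "Francis Ford Coppola", "theme": "Mafia", "year": 1972}
    )
]
print(format_instrs_stub(stub_nodes))
```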
ext_prompt_tmpl = PromptTemplate(
prompt_tmpl_str,
function_mappings={"additional_instructions": format_additional_instrs},
)
retriever.update_prompts({"prompt": ext_prompt_tmpl})
Re-run some queries¶
Now let's try rerunning some queries. You'll find that the value is properly inferred.
nodes = retriever.retrieve("Tell me about some books that are mafia-themed")
for node in nodes:
print(node.text)
print(node.metadata)
Using query str: books
Using filters: [('theme', '==', 'Mafia')]
The Godfather
{'director': 'Francis Ford Coppola', 'theme': 'Mafia', 'year': 1972}
nodes = retriever.retrieve("Tell me some books authored by HARPER LEE")
for node in nodes:
print(node.text)
print(node.metadata)
Using query str: Books authored by Harper Lee
Using filters: [('author', '==', 'Harper Lee')]
To Kill a Mockingbird
{'author': 'Harper Lee', 'theme': 'Fiction', 'year': 1960}