Setup¶
We first define imports and create an empty Weaviate collection.
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
%pip install llama-index-vector-stores-weaviate
In [ ]:
!pip install llama-index weaviate-client
In [ ]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
We'll use GPT-4's reasoning capabilities to infer the metadata filters. Depending on your use case, "gpt-3.5-turbo" can work as well.
In [ ]:
# set up OpenAI
import os
import getpass
import openai

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
openai.api_key = os.environ["OPENAI_API_KEY"]
In [ ]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.settings import Settings

Settings.llm = OpenAI(model="gpt-4")
Settings.embed_model = OpenAIEmbedding()
This notebook uses Weaviate in Embedded mode, which is supported on Linux and macOS.
If you prefer to try out Weaviate's fully managed service, Weaviate Cloud Services (WCS), enable the commented-out code block.
In [ ]:
import weaviate

# Connect to Weaviate client in embedded mode
client = weaviate.connect_to_embedded()

# Enable this code if you want to use Weaviate Cloud Services instead of Embedded mode.
"""
import weaviate

# cloud
cluster_url = ""
api_key = ""

client = weaviate.connect_to_wcs(
    cluster_url=cluster_url,
    auth_credentials=weaviate.auth.AuthApiKey(api_key),
)

# local
# client = weaviate.connect_to_local()
"""
Defining Some Sample Data¶
We insert some sample nodes containing text chunks into the vector database. Note that each TextNode not only contains the text, but also metadata, e.g. category and country. These metadata fields will get converted/stored as such in the underlying vector db.
In [ ]:
from llama_index.core.schema import TextNode

nodes = [
    TextNode(
        text=(
            "Michael Jordan is a retired professional basketball player,"
            " widely regarded as one of the greatest basketball players of all"
            " time."
        ),
        metadata={
            "category": "Sports",
            "country": "United States",
        },
    ),
    TextNode(
        text=(
            "Angelina Jolie is an American actress, filmmaker, and"
            " humanitarian. She has received numerous awards for her acting"
            " and is known for her philanthropic work."
        ),
        metadata={
            "category": "Entertainment",
            "country": "United States",
        },
    ),
    TextNode(
        text=(
            "Elon Musk is a business magnate, industrial designer, and"
            " engineer. He is the founder, CEO, and lead designer of SpaceX,"
            " Tesla, Inc., Neuralink, and The Boring Company."
        ),
        metadata={
            "category": "Business",
            "country": "United States",
        },
    ),
    TextNode(
        text=(
            "Rihanna is a Barbadian singer, actress, and businesswoman. She"
            " has achieved significant success in the music industry and is"
            " known for her versatile musical style."
        ),
        metadata={
            "category": "Music",
            "country": "Barbados",
        },
    ),
    TextNode(
        text=(
            "Cristiano Ronaldo is a Portuguese professional footballer who is"
            " considered one of the greatest football players of all time. He"
            " has won numerous awards and set multiple records during his"
            " career."
        ),
        metadata={
            "category": "Sports",
            "country": "Portugal",
        },
    ),
]
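Conceptually, an exact-match metadata filter over these nodes is just a predicate on each node's metadata dict. A minimal pure-Python sketch of that idea (plain dicts stand in for TextNodes here, and `filter_nodes` is an illustrative helper, not a LlamaIndex API):

```python
# Metadata mirroring the TextNode entries above (plain dicts for illustration)
nodes_meta = [
    {"name": "Michael Jordan", "category": "Sports", "country": "United States"},
    {"name": "Angelina Jolie", "category": "Entertainment", "country": "United States"},
    {"name": "Elon Musk", "category": "Business", "country": "United States"},
    {"name": "Rihanna", "category": "Music", "country": "Barbados"},
    {"name": "Cristiano Ronaldo", "category": "Sports", "country": "Portugal"},
]


def filter_nodes(nodes, **conditions):
    """Keep nodes whose metadata matches every key=value condition exactly."""
    return [n for n in nodes if all(n.get(k) == v for k, v in conditions.items())]


us_sports = filter_nodes(nodes_meta, category="Sports", country="United States")
print([n["name"] for n in us_sports])  # → ['Michael Jordan']
```

The real vector store applies the same kind of predicate at query time, but combined with semantic similarity search rather than on its own.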
Build Vector Index with Weaviate Vector Store¶
Here we load the data into the vector store. As mentioned above, both the text and metadata for each node will get converted into corresponding representations in Weaviate. We can now run semantic queries and also metadata filtering on this data from Weaviate.
In [ ]:
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.weaviate import WeaviateVectorStore

vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="LlamaIndex_filter"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
In [ ]:
index = VectorStoreIndex(nodes, storage_context=storage_context)
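Behind VectorStoreIndex, each node's text is embedded and queries retrieve the nearest nodes by embedding similarity. A toy pure-Python sketch of that ranking step (3-d vectors stand in for real, much higher-dimensional OpenAI embeddings):

```python
import math


def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" for two document chunks
docs = {
    "basketball bio": [0.9, 0.1, 0.0],
    "music bio": [0.1, 0.9, 0.0],
}

query_embedding = [0.8, 0.2, 0.0]  # toy embedding for a sports-related query
best = max(docs, key=lambda name: cosine(query_embedding, docs[name]))
print(best)  # → basketball bio
```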
Define VectorIndexAutoRetriever¶
We define our core VectorIndexAutoRetriever module. The module takes in VectorStoreInfo, which contains a structured description of the vector store collection and the metadata filters it supports. This information will then be used in the auto-retrieval prompt where the LLM infers metadata filters.
In [ ]:
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores.types import MetadataInfo, VectorStoreInfo

vector_store_info = VectorStoreInfo(
    content_info="brief biography of celebrities",
    metadata_info=[
        MetadataInfo(
            name="category",
            type="str",
            description=(
                "Category of the celebrity, one of [Sports, Entertainment,"
                " Business, Music]"
            ),
        ),
        MetadataInfo(
            name="country",
            type="str",
            description=(
                "Country of the celebrity, one of [United States, Barbados,"
                " Portugal]"
            ),
        ),
    ],
)
retriever = VectorIndexAutoRetriever(
    index, vector_store_info=vector_store_info
)
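Under the hood, the auto-retriever prompts the LLM to emit a structured query spec: a rewritten query string plus a list of metadata filters, which then get passed to the vector store. A sketch of parsing such a spec (the JSON shape below is an illustrative assumption; the exact schema VectorIndexAutoRetriever uses may differ):

```python
import json

# Hypothetical LLM output for "Tell me about Sports celebrities from United States"
llm_output = """
{
  "query": "celebrities",
  "filters": [
    {"key": "category", "value": "Sports"},
    {"key": "country", "value": "United States"}
  ]
}
"""

spec = json.loads(llm_output)
# Collapse the filter list into key -> value pairs for the vector store query
filters = {f["key"]: f["value"] for f in spec["filters"]}
print(spec["query"], filters)
```

The VectorStoreInfo descriptions above are what make this inference possible: the LLM only emits filters over fields it has been told exist.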
Running over Some Sample Data¶
We try running over some sample data. Note how metadata filters are inferred automatically, which helps with more precise retrieval!
In [ ]:
response = retriever.retrieve("Tell me about celebrities from United States")
In [ ]:
print(response[0])
In [ ]:
response = retriever.retrieve(
    "Tell me about Sports celebrities from United States"
)
In [ ]:
print(response[0])