Alibaba Cloud OpenSearch Vector Store¶
Alibaba Cloud OpenSearch Vector Search Edition is a large-scale distributed search engine developed by Alibaba Group. It provides search services for the entire Alibaba Group, including Taobao, Tmall, Cainiao, Youku, and other e-commerce platforms serving customers in regions outside the Chinese mainland, and it is also the base engine of Alibaba Cloud OpenSearch. After years of development, it has met the business requirements of high availability, high timeliness, and cost-effectiveness, and it provides an automated O&M system on which users can build custom search services tailored to their business characteristics.
Before running, make sure you have already created an instance.
Setup¶
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-alibabacloud-opensearch
%pip install llama-index
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
Please provide OpenAI access key¶
In order to use the embeddings from OpenAI, you need to supply an OpenAI API key:
import getpass

import openai
OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
openai.api_key = OPENAI_API_KEY
Download Data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Load documents¶
from llama_index.core import SimpleDirectoryReader
from IPython.display import Markdown, display
# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print(f"Total documents: {len(documents)}")
Total documents: 1
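If you want a quick look at what was loaded, you can print the document's ID and a snippet of its text (doc_id and text are standard attributes of LlamaIndex Document objects):
# optional sanity check: peek at the loaded document
doc = documents[0]
print(doc.doc_id)
print(doc.text[:100])  # first 100 characters of the essay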
Create the Alibaba Cloud OpenSearch Vector Store object:¶
To run the next step, you need an Alibaba Cloud OpenSearch Vector Search Edition instance with a data table already configured.
# if running the following cells raises an asyncio exception, run this first
import nest_asyncio
nest_asyncio.apply()
# initialize without metadata filter
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.alibabacloud_opensearch import (
AlibabaCloudOpenSearchStore,
AlibabaCloudOpenSearchConfig,
)
config = AlibabaCloudOpenSearchConfig(
endpoint="*****",
instance_id="*****",
username="your_username",
password="your_password",
table_name="llama",
)
vector_store = AlibabaCloudOpenSearchStore(config)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
Query Index¶
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))
Before college, the author worked on writing and programming. They wrote short stories and tried writing programs on the IBM 1401 in 9th grade using an early version of Fortran.
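If you would rather inspect the retrieved chunks and their similarity scores than the synthesized answer, you can query through a retriever instead. This is a minimal sketch using LlamaIndex's standard retriever API; the similarity_top_k value of 3 is just an example:
# retrieve the raw top-k nodes along with their similarity scores
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve("What did the author do growing up?")
for node_with_score in nodes:
    print(node_with_score.score)
    print(node_with_score.node.get_content()[:200])  # preview of the chunk text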
Connecting to an existing store¶
Since this store is backed by Alibaba Cloud OpenSearch, it is persistent by nature. So, if you want to connect to a store that was created and populated previously, here is how:
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.alibabacloud_opensearch import (
AlibabaCloudOpenSearchStore,
AlibabaCloudOpenSearchConfig,
)
config = AlibabaCloudOpenSearchConfig(
endpoint="***",
instance_id="***",
username="your_username",
password="your_password",
table_name="llama",
)
vector_store = AlibabaCloudOpenSearchStore(config)
# Create index from existing stored vectors
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine()
response = query_engine.query(
"What did the author study prior to working on AI?"
)
display(Markdown(f"<b>{response}</b>"))
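Because the index wraps a persistent store, you can also keep adding documents after reconnecting. Below is a minimal sketch using LlamaIndex's generic insert API; the document text is only a placeholder:
from llama_index.core import Document

# the new document is chunked, embedded, and written to the same OpenSearch table
index.insert(Document(text="Some additional text to make searchable."))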
Metadata filtering¶
The Alibaba Cloud OpenSearch vector store supports metadata filtering at query time. The following cells, which work on a brand-new table, demonstrate this feature.
In this demo, for the sake of brevity, only a single source document is loaded (the ./data/paul_graham/paul_graham_essay.txt
text file). Nevertheless, you will attach some custom metadata to the document to illustrate how queries can be restricted with conditions on the metadata attached to the documents.
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.alibabacloud_opensearch import (
AlibabaCloudOpenSearchStore,
AlibabaCloudOpenSearchConfig,
)
config = AlibabaCloudOpenSearchConfig(
endpoint="****",
instance_id="****",
username="your_username",
password="your_password",
table_name="llama",
)
md_storage_context = StorageContext.from_defaults(
vector_store=AlibabaCloudOpenSearchStore(config)
)
def my_file_metadata(file_name: str):
"""Depending on the input file name, associate a different metadata."""
if "essay" in file_name:
source_type = "essay"
elif "dinosaur" in file_name:
# this (unfortunately) will not happen in this demo
source_type = "dinos"
else:
source_type = "other"
return {"source_type": source_type}
# Load documents and build index
md_documents = SimpleDirectoryReader(
"../data/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
md_documents, storage_context=md_storage_context
)
Add filters to the query engine:
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters
md_query_engine = md_index.as_query_engine(
filters=MetadataFilters(
filters=[MetadataFilter(key="source_type", value="essay")]
)
)
md_response = md_query_engine.query(
"How long it took the author to write his thesis?"
)
display(Markdown(f"<b>{md_response}</b>"))
To verify that the filtering is really in effect, try changing it to use only "dinos"
documents... this time there will be no answer :)
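As a concrete version of that experiment, the sketch below reuses the same query with the filter value swapped to "dinos" (the engine and response names are just illustrative):
# no document carries source_type="dinos", so nothing is retrieved
md_dinos_query_engine = md_index.as_query_engine(
    filters=MetadataFilters(
        filters=[MetadataFilter(key="source_type", value="dinos")]
    )
)
md_dinos_response = md_dinos_query_engine.query(
    "How long did it take the author to write his thesis?"
)
display(Markdown(f"<b>{md_dinos_response}</b>"))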