Tencent Cloud VectorDB¶
Tencent Cloud VectorDB is a fully managed, self-developed enterprise-grade distributed database service designed to store, retrieve, and analyze multi-dimensional vector data. It supports multiple index types and similarity-computation methods; a single index can scale to a billion vectors while sustaining millions of QPS at millisecond-level query latency. Tencent Cloud VectorDB can not only serve as an external knowledge base for large language models to improve their answer accuracy, but is also widely applicable to AI scenarios such as recommender systems, NLP services, computer vision, and intelligent customer service.
This notebook shows the basic usage of TencentVectorDB as a vector store in LlamaIndex.
Before running, please make sure you have a database instance created.
Setup¶
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-tencentvectordb
!pip install llama-index
!pip install tcvectordb
from llama_index.core import (
VectorStoreIndex,
SimpleDirectoryReader,
StorageContext,
)
from llama_index.vector_stores.tencentvectordb import TencentVectorDB
from llama_index.core.vector_stores.tencentvectordb import (
CollectionParams,
FilterField,
)
import tcvectordb
tcvectordb.debug.DebugEnable = False
Please provide OpenAI access key¶
In order to use the embeddings from OpenAI you need to supply an OpenAI API Key:
import getpass

import openai

OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
openai.api_key = OPENAI_API_KEY
OpenAI API Key: ········
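If you prefer not to type the key interactively every time, a common alternative (a minimal sketch, not part of the original notebook) is to read it from an environment variable and fall back to the prompt only when the variable is unset:

```python
import os
import getpass


def resolve_openai_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, prompting only as a fallback."""
    key = os.getenv(env_var)
    if key:
        return key
    return getpass.getpass("OpenAI API Key:")
```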
Download Data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Creating and populating the Vector Store¶
You will now load some essays by Paul Graham from a local file and store them into the Tencent Cloud VectorDB.
# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
f"First document, text ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)
Total documents: 1
First document, id: 5b7489b6-0cca-4088-8f30-6de32d540fdf
First document, hash: 4c702b4df575421e1d1af4b1fd50511b226e0c9863dbfffeccb8b689b8448f35
First document, text (75019 characters):
====================
What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined ...
Initialize the Tencent Cloud VectorDB¶
Creation of the vector store entails creation of the underlying database collection if it does not exist yet:
vector_store = TencentVectorDB(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
collection_params=CollectionParams(dimension=1536, drop_exists=True),
)
Now wrap this store into an index LlamaIndex abstraction for later querying:
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
Note that the above from_documents call does several things at once: it splits the input documents into chunks of manageable size ("nodes"), computes embedding vectors for each chunk, and stores them all in the Tencent Cloud VectorDB.
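The chunking step behind from_documents can be approximated with a deliberately simplified sketch (the real LlamaIndex node parser splits on sentence boundaries and carries metadata; this only illustrates the overlapping-window idea):

```python
def split_into_chunks(text: str, chunk_size: int = 1024, overlap: int = 128):
    """Split text into overlapping character windows -- a toy stand-in
    for the sentence-aware node parser LlamaIndex uses internally."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start : start + chunk_size]
        if chunk:
            chunks.append(chunk)
        # stop once the window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a little shared context between consecutive chunks, so a sentence straddling a boundary still appears whole in at least one chunk.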
Querying the store¶
Basic querying¶
query_engine = index.as_query_engine()
response = query_engine.query("Why did the author choose to work on AI?")
print(response)
The author chose to work on AI because of his fascination with the novel The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. He was also drawn to the idea that AI could be used to explore the ultimate truths that other fields could not.
MMR-based queries¶
The MMR (Maximal Marginal Relevance) method is designed to fetch text chunks from the store that are at once relevant to the query but as different as possible from each other, with the goal of providing a broader context to the building of the final answer:
query_engine = index.as_query_engine(vector_store_query_mode="mmr")
response = query_engine.query("Why did the author choose to work on AI?")
print(response)
The author chose to work on AI because he was impressed and envious of his friend who had built a computer kit and was able to type programs into it. He was also inspired by a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. He was also disappointed with philosophy courses in college, which he found to be boring, and he wanted to work on something that seemed more powerful.
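The idea behind MMR can be illustrated with a small, self-contained greedy implementation over toy vectors (an illustration of the algorithm only, not the code LlamaIndex or VectorDB actually run):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def mmr_select(query, candidates, k, lambda_mult=0.5):
    """Greedily pick k candidate indices, balancing relevance to the
    query against redundancy with the items already selected."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lambda_mult below 0.5 diversity dominates: given two near-duplicate vectors close to the query and one orthogonal vector, the duplicate is skipped in favor of the orthogonal one.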
Connecting to an existing store¶
Since this store is backed by Tencent Cloud VectorDB, it is persistent by definition. So, if you want to connect to a store that was created and populated previously, here is how:
new_vector_store = TencentVectorDB(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
collection_params=CollectionParams(dimension=1536, drop_exists=False),
)
# Create index (from preexisting stored vectors)
new_index_instance = VectorStoreIndex.from_vector_store(
vector_store=new_vector_store
)
# now you can do querying, etc:
query_engine = new_index_instance.as_query_engine(similarity_top_k=5)
response = query_engine.query(
"What did the author study prior to working on AI?"
)
print(response)
The author studied philosophy and painting, worked on spam filters, and wrote essays prior to working on AI.
Removing documents from the index¶
First get an explicit list of pieces of a document, or "nodes", from a Retriever spawned from the index:
retriever = new_index_instance.as_retriever(
vector_store_query_mode="mmr",
similarity_top_k=3,
vector_store_kwargs={"mmr_prefetch_factor": 4},
)
nodes_with_scores = retriever.retrieve(
"What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
for idx, node_with_score in enumerate(nodes_with_scores):
print(f" [{idx}] score = {node_with_score.score}")
print(f" id = {node_with_score.node.node_id}")
print(f" text = {node_with_score.node.text[:90]} ...")
Found 3 nodes.
[0] score = 0.42589144520149874
id = 05f53f06-9905-461a-bc6d-fa4817e5a776
text = What I Worked On
February 2021
Before college the two main things I worked on, outside o ...
[1] score = -0.0012061281453193962
id = 2f9f843e-6495-4646-a03d-4b844ff7c1ab
text = been explored. But all I wanted was to get out of grad school, and my rapidly written diss ...
[2] score = 0.025454533089838027
id = 28ad32da-25f9-4aaa-8487-88390ec13348
text = showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress ...
But note! With a vector store, the document is the sensible unit of deletion, not the individual nodes that belong to it. In this case, since you inserted just a single text file, all nodes will share the same ref_doc_id:
print("Nodes' ref_doc_id:")
print("\n".join([nws.node.ref_doc_id for nws in nodes_with_scores]))
Nodes' ref_doc_id:
5b7489b6-0cca-4088-8f30-6de32d540fdf
5b7489b6-0cca-4088-8f30-6de32d540fdf
5b7489b6-0cca-4088-8f30-6de32d540fdf
Now let's say you need to remove the text file you uploaded:
new_vector_store.delete(nodes_with_scores[0].node.ref_doc_id)
Repeat the very same query and check the results right away. You should see no results found:
nodes_with_scores = retriever.retrieve(
"What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
Found 0 nodes.
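The delete-by-document semantics you just observed can be sketched with a toy store (purely illustrative; the real store also holds vectors and metadata):

```python
class ToyVectorStore:
    """Toy store illustrating delete-by-ref_doc_id semantics: deleting a
    document removes every node that was chunked out of it."""

    def __init__(self):
        self.nodes = {}  # node_id -> ref_doc_id

    def add(self, node_id, ref_doc_id):
        self.nodes[node_id] = ref_doc_id

    def delete(self, ref_doc_id):
        # drop all nodes belonging to the given source document
        self.nodes = {
            nid: doc for nid, doc in self.nodes.items() if doc != ref_doc_id
        }


store = ToyVectorStore()
store.add("n1", "doc-A")
store.add("n2", "doc-A")
store.add("n3", "doc-B")
store.delete("doc-A")  # both doc-A nodes disappear in one call
```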
Metadata filtering¶
Tencent Cloud VectorDB supports metadata filtering at query time, in the form of exact-match key=value pairs. The following cells, which work on a brand-new collection, demonstrate this feature.
In this demo, for the sake of brevity, a single source document is loaded (the paul_graham_essay.txt text file downloaded earlier). Nevertheless, you will attach some custom metadata to the document to illustrate how you can restrict queries with conditions on the metadata attached to the documents.
filter_fields = [
FilterField(name="source_type"),
]
md_storage_context = StorageContext.from_defaults(
vector_store=TencentVectorDB(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
collection_params=CollectionParams(
dimension=1536, drop_exists=True, filter_fields=filter_fields
),
)
)
def my_file_metadata(file_name: str):
"""Depending on the input file name, associate a different metadata."""
if "essay" in file_name:
source_type = "essay"
elif "dinosaur" in file_name:
# this (unfortunately) will not happen in this demo
source_type = "dinos"
else:
source_type = "other"
return {"source_type": source_type}
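As a quick sanity check of the routing above, here is how the function behaves on a few file names (the function is reproduced verbatim so the snippet is self-contained):

```python
def my_file_metadata(file_name: str):
    """Depending on the input file name, associate a different metadata."""
    if "essay" in file_name:
        source_type = "essay"
    elif "dinosaur" in file_name:
        # this (unfortunately) will not happen in this demo
        source_type = "dinos"
    else:
        source_type = "other"
    return {"source_type": source_type}


# essay files route to "essay"; anything unrecognized falls through to "other"
```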
# Load documents and build index
md_documents = SimpleDirectoryReader(
"./data/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
md_documents, storage_context=md_storage_context
)
Now you can add filtering to the query engine:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
md_query_engine = md_index.as_query_engine(
filters=MetadataFilters(
filters=[ExactMatchFilter(key="source_type", value="essay")]
)
)
md_response = md_query_engine.query(
"How long it took the author to write his thesis?"
)
print(md_response.response)
It took the author five weeks to write his thesis.
To test that the filtering is at play, try to change it to use only "dinos" documents... and there will be no answer this time :)
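That expected outcome can be mimicked with a plain exact-match filter over metadata dicts (a toy stand-in for the server-side filtering, using hypothetical record dicts, not the VectorDB implementation):

```python
def exact_match_filter(records, key, value):
    """Keep only records whose metadata has metadata[key] == value."""
    return [r for r in records if r.get("metadata", {}).get(key) == value]


# every chunk in this demo carries source_type == "essay"
records = [
    {"text": "an essay chunk", "metadata": {"source_type": "essay"}},
    {"text": "another essay chunk", "metadata": {"source_type": "essay"}},
]

essay_hits = exact_match_filter(records, "source_type", "essay")
dino_hits = exact_match_filter(records, "source_type", "dinos")  # empty
```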