Cassandra Vector Store¶
Apache Cassandra® is a NoSQL, column-oriented database that is highly scalable and highly available. Starting with version 5.0, the database ships with native vector search capabilities.
DataStax Astra DB through CQL is a managed serverless database built on Cassandra, offering the same interface and strengths.
This notebook shows the basic usage of the Cassandra Vector Store within LlamaIndex.
To run the full code you need either a running Cassandra cluster equipped with vector search capabilities or a DataStax Astra DB instance.
Installation¶
%pip install llama-index-vector-stores-cassandra
!pip install --quiet "astrapy>=0.5.8"
import os
from getpass import getpass
from llama_index.core import (
VectorStoreIndex,
SimpleDirectoryReader,
Document,
StorageContext,
)
from llama_index.vector_stores.cassandra import CassandraVectorStore
The next step is to initialize CassIO with a global database connection: this is the only step that is done slightly differently for a Cassandra cluster versus Astra DB.
Initialization (Cassandra cluster)¶
In this case, you first need to create a cassandra.cluster.Session object, as described in the Cassandra driver documentation. The details vary (e.g. with network settings and authentication), but this might be something like:
from cassandra.cluster import Cluster
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
import cassio
CASSANDRA_KEYSPACE = input("CASSANDRA_KEYSPACE = ")
cassio.init(session=session, keyspace=CASSANDRA_KEYSPACE)
Initialization (Astra DB through CQL)¶
In this case you initialize CassIO with the following connection parameters:
- the Database ID, e.g. 01234567-89ab-cdef-0123-456789abcdef
- the Token, e.g. AstraCS:6gBhNmsk135.... (it must be a "Database Administrator" token)
- optionally a Keyspace name (if omitted, the default keyspace for the database will be used)
ASTRA_DB_ID = input("ASTRA_DB_ID = ")
ASTRA_DB_TOKEN = getpass("ASTRA_DB_TOKEN = ")
desired_keyspace = input("ASTRA_DB_KEYSPACE (optional, can be left empty) = ")
if desired_keyspace:
ASTRA_DB_KEYSPACE = desired_keyspace
else:
ASTRA_DB_KEYSPACE = None
import cassio
cassio.init(
database_id=ASTRA_DB_ID,
token=ASTRA_DB_TOKEN,
keyspace=ASTRA_DB_KEYSPACE,
)
OpenAI key¶
In order to use embeddings by OpenAI you need to supply an OpenAI API key:
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API Key:")
Download data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2023-11-10 01:44:05--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.01s

2023-11-10 01:44:06 (4.80 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
Create and populate the vector store¶
You will now load some essays by Paul Graham from a local file and store them into the Cassandra vector store.
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
"First document, text"
f" ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)
Total documents: 1
First document, id: 12bc6987-366a-49eb-8de0-7b52340e4958
First document, hash: abe31930a1775c78df5a5b1ece7108f78fedbf5fe4a9cf58d7a21808fccaef34
First document, text (75014 characters):
====================
What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined ma ...
Initialize the Cassandra vector store¶
Creating the vector store automatically creates the underlying database table if it does not exist yet:
cassandra_store = CassandraVectorStore(
table="cass_v_table", embedding_dimension=1536
)
Now wrap this store into a LlamaIndex index abstraction for later querying:
storage_context = StorageContext.from_defaults(vector_store=cassandra_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
Note that the above from_documents call does several things at once: it splits the input documents into chunks of manageable size ("nodes"), computes an embedding vector for each node, and stores them all in the Cassandra vector store.
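Conceptually, the splitting step slides an overlapping window over the text. The toy standalone sketch below (a hypothetical chunk_text helper with made-up sizes, not LlamaIndex's actual sentence-aware splitter) illustrates the idea:

```python
def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 200) -> list:
    """Split text into overlapping character windows.

    Toy stand-in for LlamaIndex's real splitter, which additionally
    respects sentence and paragraph boundaries.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break
        # advance by chunk_size minus overlap so consecutive chunks share context
        start += chunk_size - overlap
    return chunks


sample = "x" * 3000
print(len(chunk_text(sample)))  # → 4 overlapping chunks for 3000 characters
```

Each chunk then gets its own embedding vector, and overlapping boundaries reduce the chance that a relevant passage is cut in half.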
Querying the store¶
Basic querying¶
query_engine = index.as_query_engine()
response = query_engine.query("Why did the author choose to work on AI?")
print(response.response)
The author chose to work on AI because they were inspired by a novel called The Moon is a Harsh Mistress, which featured an intelligent computer, and a PBS documentary that showed Terry Winograd using SHRDLU. These experiences sparked the author's interest in AI and motivated them to pursue it as a field of study and work.
MMR-based queries¶
The MMR (maximal marginal relevance) method is designed to fetch text chunks from the store that are at the same time relevant to the query but as different as possible from each other, with the goal of providing a broader context to the building of the final answer:
query_engine = index.as_query_engine(vector_store_query_mode="mmr")
response = query_engine.query("Why did the author choose to work on AI?")
print(response.response)
The author chose to work on AI because they believed that teaching SHRDLU more words would eventually lead to the development of intelligent programs. They were fascinated by the potential of AI and saw it as an opportunity to expand their understanding of programming and push the limits of what could be achieved.
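Under the hood, MMR greedily re-ranks candidates by trading off relevance to the query against redundancy with already-selected chunks. The standalone sketch below is a simplified illustration of that idea (with hypothetical helper names and toy vectors, not LlamaIndex's actual implementation):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def mmr_select(query_vec, candidates, k, lambda_mult=0.5):
    """Greedily pick k candidate indices by maximal marginal relevance.

    Each step maximizes:  lambda * relevance - (1 - lambda) * redundancy,
    where redundancy is the max similarity to anything already picked.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -float("inf")
        for i in remaining:
            relevance = cosine(query_vec, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = [1.0, 0.0]
# docs[0] and docs[1] are near-duplicates; docs[2] points elsewhere
docs = [[0.9, 0.3], [0.91, 0.29], [0.3, 0.9]]
print(mmr_select(query, docs, k=2, lambda_mult=0.3))  # → [1, 2]
```

With a low lambda_mult the near-duplicate docs[0] is skipped in favor of the more diverse docs[2], which is exactly the behavior the mmr query mode aims for.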
Connecting to an existing store¶
Since this store is backed by Cassandra, it is persistent by nature. To connect to a store that was created and populated previously, do the following:
new_store_instance = CassandraVectorStore(
table="cass_v_table", embedding_dimension=1536
)
# Create index (from preexisting stored vectors)
new_index_instance = VectorStoreIndex.from_vector_store(
vector_store=new_store_instance
)
# now you can do querying, etc:
query_engine = new_index_instance.as_query_engine(similarity_top_k=5)
response = query_engine.query(
"What did the author study prior to working on AI?"
)
print(response.response)
The author studied philosophy prior to working on AI.
Removing documents from the index¶
First get an explicit list of document pieces, or "nodes", from a Retriever spawned from the index:
retriever = new_index_instance.as_retriever(
vector_store_query_mode="mmr",
similarity_top_k=3,
vector_store_kwargs={"mmr_prefetch_factor": 4},
)
nodes_with_scores = retriever.retrieve(
"What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
for idx, node_with_score in enumerate(nodes_with_scores):
print(f" [{idx}] score = {node_with_score.score}")
print(f" id = {node_with_score.node.node_id}")
print(f" text = {node_with_score.node.text[:90]} ...")
Found 3 nodes.
[0] score = 0.4251742327832831
id = 7e628668-58fa-4548-9c92-8c31d315dce0
text = What I Worked On
February 2021
Before college the two main things I worked on, outside o ...
[1] score = -0.020323897262800816
id = aa279d09-717f-4d68-9151-594c5bfef7ce
text = This was now only weeks away. My nice landlady let me leave my stuff in her attic. I had s ...
[2] score = 0.011198131320563909
id = 50b9170d-6618-4e8b-aaf8-36632e2801a6
text = It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDL ...
But note: when working with a vector store, you should consider the document as the sensible unit to delete, rather than any individual node belonging to it. In this case, since you inserted just a single text file, all nodes will share the same ref_doc_id:
print("Nodes' ref_doc_id:")
print("\n".join([nws.node.ref_doc_id for nws in nodes_with_scores]))
Nodes' ref_doc_id:
12bc6987-366a-49eb-8de0-7b52340e4958
12bc6987-366a-49eb-8de0-7b52340e4958
12bc6987-366a-49eb-8de0-7b52340e4958
Now suppose you need to remove the text file you uploaded:
new_store_instance.delete(nodes_with_scores[0].node.ref_doc_id)
Repeat the very same query and check the results right away. You should see no results found:
nodes_with_scores = retriever.retrieve(
"What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
Found 0 nodes.
Metadata filtering¶
The Cassandra vector store supports metadata filtering in the form of exact-match key=value pairs at query time. The following cells, which work on a brand new Cassandra table, demonstrate this feature.
In this demo, for the sake of brevity, only a single source document is loaded (the ../data/paul_graham/paul_graham_essay.txt text file). Nevertheless, you will attach some custom metadata to the document, to illustrate how queries can be restricted with conditions on the metadata attached to the documents.
md_storage_context = StorageContext.from_defaults(
vector_store=CassandraVectorStore(
table="cass_v_table_md", embedding_dimension=1536
)
)
def my_file_metadata(file_name: str):
"""Depending on the input file name, associate a different metadata."""
if "essay" in file_name:
source_type = "essay"
elif "dinosaur" in file_name:
# this (unfortunately) will not happen in this demo
source_type = "dinos"
else:
source_type = "other"
return {"source_type": source_type}
# Load documents and build index
md_documents = SimpleDirectoryReader(
"./data/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
md_documents, storage_context=md_storage_context
)
That's it: you can now add filtering to your query engine:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
md_query_engine = md_index.as_query_engine(
filters=MetadataFilters(
filters=[ExactMatchFilter(key="source_type", value="essay")]
)
)
md_response = md_query_engine.query(
"did the author appreciate Lisp and painting?"
)
print(md_response.response)
Yes, the author appreciated Lisp and painting. They mentioned spending a significant amount of time working on Lisp and even building a new dialect of Lisp called Arc. Additionally, the author mentioned spending most of 2014 painting and experimenting with different techniques.
To verify that the filtering is really at play, try changing it to use only "dinos" documents... there will be no answer this time :)
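Such a check might look like the following sketch, reusing md_index and the filter classes imported above (the variable names here are hypothetical, and the exact wording of the response to an empty retrieval depends on the LLM):

```python
# Hypothetical verification cell: restrict the same question to "dinos"
# documents. No document carries that metadata value, so the engine
# retrieves no matching nodes and cannot produce a grounded answer.
md_dinos_engine = md_index.as_query_engine(
    filters=MetadataFilters(
        filters=[ExactMatchFilter(key="source_type", value="dinos")]
    )
)
md_dinos_response = md_dinos_engine.query(
    "did the author appreciate Lisp and painting?"
)
print(md_dinos_response.response)
```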