Customizing Storage#
By default, LlamaIndex hides away the complexity and lets you query your data in under 5 lines of code:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the documents.")
Under the hood, LlamaIndex supports swappable storage layers that allow you to customize where ingested document contents (i.e., Node objects), embedding vectors, and index metadata are stored.

Low-Level API#
To do this, instead of the high-level API,
index = VectorStoreIndex.from_documents(documents)
we use a lower-level API that gives more granular control:
from llama_index.core import StorageContext
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.storage.index_store import SimpleIndexStore
from llama_index.core.vector_stores import SimpleVectorStore
from llama_index.core.node_parser import SentenceSplitter

# create parser and parse documents into nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

# create storage context using default stores
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(),
    index_store=SimpleIndexStore(),
)
# create (or load) docstore and add nodes
storage_context.docstore.add_documents(nodes)

# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)

# save index
index.storage_context.persist(persist_dir="<persist_dir>")

# you can also set an index_id to save multiple indexes to the same folder
index.set_index_id("<index_id>")
index.storage_context.persist(persist_dir="<persist_dir>")

# to load the index later, first re-create the storage context
# this loads the persisted stores from persist_dir
storage_context = StorageContext.from_defaults(persist_dir="<persist_dir>")

# then load the index object
from llama_index.core import load_index_from_storage

loaded_index = load_index_from_storage(storage_context)

# if loading a specific index from a persist_dir containing multiple indexes
loaded_index = load_index_from_storage(storage_context, index_id="<index_id>")

# if loading multiple indexes from a persist_dir
from llama_index.core import load_indices_from_storage

loaded_indices = load_indices_from_storage(
    storage_context, index_ids=["<index_id>", ...]
)
You can customize the underlying storage with a one-line change to instantiate different document stores, index stores, and vector stores. See the Document Stores, Vector Stores, and Index Stores guides for more details.
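To make the "one-line change" idea concrete, here is a toy, stdlib-only sketch of the pluggable-store pattern. These classes are illustrative only, not LlamaIndex's actual store implementations: two document stores expose the same interface, so the backend can be swapped at the single line where the storage context is assembled.

```python
import json
from pathlib import Path


class InMemoryDocStore:
    """Keeps nodes in a dict; contents are lost when the process exits."""

    def __init__(self):
        self._docs = {}

    def add_documents(self, nodes):
        self._docs.update(nodes)

    def get(self, node_id):
        return self._docs[node_id]


class JsonFileDocStore:
    """Persists nodes to a JSON file, mirroring persist_dir-style storage."""

    def __init__(self, path):
        self._path = Path(path)
        self._docs = (
            json.loads(self._path.read_text()) if self._path.exists() else {}
        )

    def add_documents(self, nodes):
        self._docs.update(nodes)
        self._path.write_text(json.dumps(self._docs))

    def get(self, node_id):
        return self._docs[node_id]


# swapping the backend is a one-line change:
store = InMemoryDocStore()  # or: JsonFileDocStore("<persist_dir>/docstore.json")
store.add_documents({"node-1": "LlamaIndex supports pluggable storage."})
print(store.get("node-1"))
```

In LlamaIndex the same idea applies at the `StorageContext.from_defaults(...)` call: each keyword argument accepts any store that implements the corresponding interface.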
Vector Store Integrations and Storage#
Most of our vector store integrations store the entire index (vectors + text) in the vector store itself. The major benefit is that you don't need to explicitly persist the index as shown above, since the vector store already hosts and persists the index data.
The vector stores that support this practice include:
- AzureAISearchVectorStore
- ChatGPTRetrievalPluginClient
- CassandraVectorStore
- ChromaVectorStore
- EpsillaVectorStore
- DocArrayHnswVectorStore
- DocArrayInMemoryVectorStore
- JaguarVectorStore
- LanceDBVectorStore
- MetalVectorStore
- MilvusVectorStore
- MyScaleVectorStore
- OpensearchVectorStore
- PineconeVectorStore
- QdrantVectorStore
- TablestoreVectorStore
- RedisVectorStore
- UpstashVectorStore
- WeaviateVectorStore
Below is a simple example using Pinecone:
import pinecone
from llama_index.core import StorageContext, VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.pinecone import PineconeVectorStore

# create a Pinecone index
api_key = "api_key"
pinecone.init(api_key=api_key, environment="us-west1-gcp")
pinecone.create_index(
    "quickstart", dimension=1536, metric="euclidean", pod_type="p1"
)
index = pinecone.Index("quickstart")

# construct the vector store
vector_store = PineconeVectorStore(pinecone_index=index)

# create the storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# load documents
documents = SimpleDirectoryReader("./data").load_data()

# create the index, which will automatically insert documents/vectors into Pinecone
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
If you have an existing vector store that already contains data, you can connect to it directly and create a VectorStoreIndex from it:
index = pinecone.Index("quickstart")
vector_store = PineconeVectorStore(pinecone_index=index)
loaded_index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
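As a toy illustration of why such stores are self-sufficient (stdlib-only sketch of the pattern, not the Pinecone or LlamaIndex API): a store that keeps both vectors and text can back an index directly, so no separate persistence step is needed.

```python
import math


class TinyVectorStore:
    """Stores (vector, text) pairs and answers nearest-neighbor queries."""

    def __init__(self):
        self._rows = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._rows.append((vector, text))

    def query(self, vector, top_k=1):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(
                sum(x * x for x in b)
            )
            return dot / norm

        ranked = sorted(
            self._rows, key=lambda row: cosine(vector, row[0]), reverse=True
        )
        return [text for _, text in ranked[:top_k]]


class TinyIndex:
    """Built directly from an existing store -- no separate persist/load step."""

    @classmethod
    def from_vector_store(cls, store):
        index = cls()
        index._store = store
        return index

    def query(self, vector):
        return self._store.query(vector)[0]


store = TinyVectorStore()
store.add([1.0, 0.0], "doc about storage")
store.add([0.0, 1.0], "doc about querying")
index = TinyIndex.from_vector_store(store)
print(index.query([0.9, 0.1]))  # -> "doc about storage"
```

Because the store holds everything the index needs (vectors and text), `from_vector_store` only has to wrap the store handle; this mirrors why `VectorStoreIndex.from_vector_store(...)` can reconnect to a populated Pinecone index without any local persist directory.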