If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
%pip install llama-index-vector-stores-milvus
In [ ]:
%pip install llama-index
This notebook uses Milvus Lite, which requires a newer version of pymilvus:
In [ ]:
%pip install "pymilvus>=2.4.2"
If you are using Google Colab, you may need to restart the runtime for the newly installed dependencies to take effect (click the "Runtime" menu at the top of the screen and select "Restart session" from the dropdown).
Setting Up OpenAI

Let's start by adding the OpenAI API key. This will allow us to access ChatGPT.
In [ ]:
import openai

openai.api_key = "sk-***********"
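Rather than hard-coding the key, you might prefer to read it from an environment variable so it never lands in a saved notebook. A minimal sketch (the `OPENAI_API_KEY` variable name is the one the OpenAI client also checks by default; the fallback placeholder mirrors the cell above):

```python
import os

# Read the key from the environment, falling back to a placeholder.
# Replace the placeholder with your own key if the variable is unset.
api_key = os.environ.get("OPENAI_API_KEY", "sk-***********")
# openai.api_key = api_key  # uncomment once the openai package is installed
```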
Prepare Data

You can download the sample data with the following commands:
In [ ]:
! mkdir -p "data/"
! wget "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt" -O "data/paul_graham_essay.txt"
! wget "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf" -O "data/uber_2021.pdf"
In [ ]:
from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader(
    input_files=["./data/paul_graham_essay.txt"]
).load_data()

print("Document ID:", documents[0].doc_id)
Document ID: 95f25e4d-f270-4650-87ce-006d69d82033
Create an Index across the Data

Now that we have a document, we can create an index and insert the document. For the index we will use MilvusVectorStore. MilvusVectorStore takes the following arguments:
Basic Args

- uri (str, optional): The URI to connect to, in the form of "https://address:port" for a Milvus or Zilliz Cloud service, or "path/to/local/milvus.db" for the lite local Milvus. Defaults to "./milvus_llamaindex.db".
- token (str, optional): The token for log-in. Empty if RBAC is not enabled; if RBAC is enabled, it is usually "username:password".
- collection_name (str, optional): The name of the collection where data will be stored. Defaults to "llamalection".
- overwrite (bool, optional): Whether to overwrite an existing collection with the same name. Defaults to False.

Scalar Fields (including doc id & text)

- doc_id_field (str, optional): The name of the doc-id field in the collection. Defaults to DEFAULT_DOC_ID_KEY.
- text_key (str, optional): The key under which text is stored in the passed-in collection. Used when bringing your own collection. Defaults to DEFAULT_TEXT_KEY.
- scalar_field_names (list, optional): The names of extra scalar fields to include in the collection schema.
- scalar_field_types (list, optional): The types of the extra scalar fields.

Dense Field

- enable_dense (bool): A boolean flag to enable or disable dense embeddings. Defaults to True.
- dim (int, optional): The dimension of the embedding vectors in the collection. Required when creating a new collection with enable_sparse set to False.
- embedding_field (str, optional): The name of the dense embedding field in the collection. Defaults to DEFAULT_EMBEDDING_KEY.
- index_config (dict, optional): The configuration used to build the dense embedding index. Defaults to None.
- search_config (dict, optional): The configuration used to search the Milvus dense index. Must be compatible with the index type specified by index_config. Defaults to None.
- similarity_metric (str, optional): The similarity metric for the dense embeddings; currently supports IP (inner product), COSINE (cosine similarity), and L2 (Euclidean distance).

Sparse Field

- enable_sparse (bool): A boolean flag to enable or disable sparse embeddings. Defaults to False.
- sparse_embedding_field (str): The name of the sparse embedding field. Defaults to DEFAULT_SPARSE_EMBEDDING_KEY.
- sparse_embedding_function (Union[BaseSparseEmbeddingFunction, BaseMilvusBuiltInFunction], optional): If enable_sparse is True, this object must be provided to convert text to sparse embeddings. If None, the default sparse embedding function (BM25BuiltInFunction) is used.
- sparse_index_config (dict, optional): The configuration used to build the sparse embedding index. Defaults to None.

Hybrid Ranker

- hybrid_ranker (str): Specifies the type of ranker used in hybrid search queries. Currently only ["RRFRanker", "WeightedRanker"] are supported. Defaults to "RRFRanker".
- hybrid_ranker_params (dict, optional): Configuration parameters for the hybrid ranker. The structure depends on the ranker type:
  - For "RRFRanker", it should include:
    - "k" (int): A parameter used in the Reciprocal Rank Fusion (RRF) algorithm. This value is used in calculating the rank scores that combine multiple ranking strategies to improve search relevance.
  - For "WeightedRanker", it expects:
    - "weights" (list of float): A list of exactly two weights: the weight for the dense component, and the weight for the sparse component. These weights adjust the importance of the dense and sparse components in hybrid retrieval.
  Defaults to an empty dict, which means the ranker will use its predefined default settings.

Other Args

- collection_properties (dict, optional): Collection properties such as TTL and MMAP. Defaults to None. May include:
  - "collection.ttl.seconds" (int): Once set, data in the collection expires after the specified number of seconds. Expired data is cleaned up and no longer participates in searches or queries.
  - "mmap.enabled" (bool): Whether to enable memory-mapped storage at the collection level.
- index_management (IndexManagement): Specifies the index management strategy to use. Defaults to "create_if_not_exists".
- batch_size (int): The number of documents processed per batch when inserting data into Milvus. Defaults to DEFAULT_BATCH_SIZE.
- consistency_level (str, optional): The consistency level of a newly created collection. Defaults to "Session".
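To make these arguments concrete, the sketch below collects one illustrative set of keyword arguments for a hybrid (dense + sparse) store as a plain dict. All values here are example choices for illustration, not recommended defaults; on a machine with Milvus available you would pass them as `MilvusVectorStore(**hybrid_kwargs)`.

```python
# Illustrative keyword arguments for a hybrid dense + sparse MilvusVectorStore.
# Every value is an example choice, not a library default.
hybrid_kwargs = {
    "uri": "./milvus_hybrid.db",   # local file -> Milvus Lite
    "dim": 1536,                   # dense embedding dimension (e.g. OpenAI ada-002)
    "overwrite": True,             # drop any existing collection of the same name
    "enable_sparse": True,         # adds a sparse field (BM25BuiltInFunction by default)
    "hybrid_ranker": "WeightedRanker",
    # For WeightedRanker: [dense weight, sparse weight]
    "hybrid_ranker_params": {"weights": [0.7, 0.3]},
}

# vector_store = MilvusVectorStore(**hybrid_kwargs)  # requires llama-index-vector-stores-milvus
```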
In [ ]:
# Create an index over the documents
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore

vector_store = MilvusVectorStore(
    uri="./milvus_demo.db", dim=1536, overwrite=True
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
Regarding the arguments of MilvusVectorStore:

- Setting the uri to a local file path, e.g. ./milvus.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in that file.
- If you have a large amount of data, you can set up a more performant Milvus server on docker or kubernetes. In this setup, use the server address, e.g. http://localhost:19530, as your uri.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, set the uri and token to the Public Endpoint and API key from the Zilliz Cloud console.
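The three deployment modes above differ only in the uri (and token) you pass. A sketch of the corresponding argument sets, where the Zilliz Cloud endpoint and key are placeholders to be taken from your own console:

```python
# Connection arguments for the three deployment modes described above.
# The zilliz_cloud values are placeholders, not real credentials.
deployments = {
    "milvus_lite": {"uri": "./milvus_demo.db"},          # local file, no server needed
    "milvus_server": {"uri": "http://localhost:19530"},  # docker / kubernetes deployment
    "zilliz_cloud": {
        "uri": "https://<your-public-endpoint>",         # Public Endpoint from the console
        "token": "<your-api-key>",                       # API key from the console
    },
}

# e.g. MilvusVectorStore(dim=1536, **deployments["milvus_lite"])
```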
Query the Data

Now that we have our document stored in the index, we can ask questions against it. The index will use the data stored in itself as the knowledge base for ChatGPT.
In [ ]:
query_engine = index.as_query_engine()
res = query_engine.query("What did the author learn?")
print(res)
The author learned that philosophy courses in college were boring to him, leading him to switch his focus to studying AI.
In [ ]:
res = query_engine.query(
    "What challenges did the disease pose for the author?"
)
print(res)
The disease posed challenges for the author as it affected his mother's health, leading to a stroke caused by colon cancer. This resulted in her losing her balance and needing to be placed in a nursing home. The author and his sister were determined to help their mother get out of the nursing home and back to her house.
The next test shows that overwriting removes the previous data.
In [ ]:
from llama_index.core import Document

vector_store = MilvusVectorStore(
    uri="./milvus_demo.db", dim=1536, overwrite=True
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    [Document(text="The number that is being searched for is ten.")],
    storage_context,
)
query_engine = index.as_query_engine()
res = query_engine.query("Who is the author?")
print(res)
The author is the individual who created the context information.
The next test shows adding additional data to an already existing index.
In [ ]:
del index, vector_store, storage_context, query_engine

vector_store = MilvusVectorStore(uri="./milvus_demo.db", overwrite=False)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
query_engine = index.as_query_engine()
res = query_engine.query("What is the number?")
print(res)
The number is ten.
In [ ]:
res = query_engine.query("Who is the author?")
print(res)
Paul Graham
Metadata Filtering

We can generate results by filtering for specific sources. The following example loads all documents from the directory and then filters them based on metadata.
In [ ]:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Load both of the documents downloaded earlier
documents_all = SimpleDirectoryReader("./data/").load_data()

vector_store = MilvusVectorStore(
    uri="./milvus_demo.db", dim=1536, overwrite=True
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents_all, storage_context)
We only want to retrieve documents from the file uber_2021.pdf.
In [ ]:
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="file_name", value="uber_2021.pdf")]
)
query_engine = index.as_query_engine(filters=filters)
res = query_engine.query(
    "What challenges did the disease pose for the author?"
)
print(res)
The disease posed challenges related to the adverse impact on the business and operations, including reduced demand for Mobility offerings globally, affecting travel behavior and demand. Additionally, the pandemic led to driver supply constraints, impacted by concerns regarding COVID-19, with uncertainties about when supply levels would return to normal. The rise of the Omicron variant further affected travel, resulting in advisories and restrictions that could adversely impact both driver supply and consumer demand for Mobility offerings.
We get a different result this time when retrieving from the file paul_graham_essay.txt.
In [ ]:
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="file_name", value="paul_graham_essay.txt")]
)
query_engine = index.as_query_engine(filters=filters)
res = query_engine.query(
    "What challenges did the disease pose for the author?"
)
print(res)
The disease posed challenges for the author as it affected his mother's health, leading to a stroke caused by colon cancer. This resulted in his mother losing her balance and needing to be placed in a nursing home. The author and his sister were determined to help their mother get out of the nursing home and back to her house.
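Under the hood, an exact-match metadata filter is applied as a Milvus boolean expression on the scalar field. The helper below is a simplified illustration of what a single exact-match condition roughly compiles to; it is not the library's actual translation code, just a sketch of the expression syntax:

```python
def to_milvus_expr(key: str, value) -> str:
    """Simplified sketch: render one exact-match condition as a Milvus
    boolean expression. String values are quoted; numbers are not."""
    if isinstance(value, str):
        return f'{key} == "{value}"'
    return f"{key} == {value}"

print(to_milvus_expr("file_name", "uber_2021.pdf"))  # file_name == "uber_2021.pdf"
print(to_milvus_expr("page", 3))                     # page == 3
```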