Full-Text Search with the Milvus Vector Database¶
Full-text search is an exact keyword-matching technique that typically ranks documents by relevance using algorithms such as BM25. In retrieval-augmented generation (RAG) systems, it improves the quality of AI-generated answers by retrieving relevant text.
Semantic search, by contrast, returns broader results by understanding contextual meaning. Combining the two into hybrid search significantly improves retrieval quality, especially in scenarios where either method alone falls short.
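For reference, BM25 scores a document $D$ against a query $Q = (q_1, \dots, q_n)$ as:

$$
\text{score}(D, Q) = \sum_{i=1}^{n} \operatorname{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}
$$

where $f(q_i, D)$ is the term frequency of $q_i$ in $D$, $|D|$ is the document length, $\text{avgdl}$ is the average document length in the corpus, and $k_1$ and $b$ are tuning parameters (commonly around 1.2 and 0.75, respectively).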
With Sparse-BM25 in Milvus 2.5, raw text is automatically converted into sparse vectors. This eliminates the need to generate sparse embeddings manually and enables a hybrid search strategy that balances semantic understanding with keyword relevance.
In this tutorial, you'll learn how to use LlamaIndex and Milvus to build a RAG system with full-text search and hybrid search. We'll start by implementing full-text search alone, then integrate semantic search for more comprehensive results.
Before starting this tutorial, make sure you are familiar with the basics of full-text search and with getting started using Milvus in LlamaIndex.
%pip install llama-index-vector-stores-milvus
%pip install llama-index-embeddings-openai
%pip install llama-index-llms-openai
If you are using Google Colab, you may need to restart the runtime (click the "Runtime" menu at the top of the interface, then select "Restart session" from the dropdown).
Set up accounts
This tutorial uses OpenAI for text embeddings and answer generation. You need to prepare an OpenAI API key.
import openai
openai.api_key = "sk-"
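If you prefer not to hardcode the key, a common alternative (optional, not required by this tutorial) is to read it from the OPENAI_API_KEY environment variable that the OpenAI client conventionally uses:
import os
openai.api_key = os.environ.get("OPENAI_API_KEY", "")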
To use the Milvus vector database, specify your Milvus server URI (and optionally your TOKEN). To start a Milvus server, you can set one up following the Milvus installation guide, or simply try Zilliz Cloud for free.
Full-text search is currently supported in Milvus Standalone, Milvus Distributed, and Zilliz Cloud, but not yet in Milvus Lite (support is planned). For more information, contact support@zilliz.com.
URI = "http://localhost:19530"
# TOKEN = ""
Download example data
Run the following commands to download the sample document to the "data/paul_graham" directory:
%mkdir -p 'data/paul_graham/'
%wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2025-03-27 07:49:01--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.07s

2025-03-27 07:49:01 (1.01 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
RAG with Full-Text Search¶
Integrating full-text search into a RAG system balances semantic search with precise, predictable keyword-based retrieval. While you can use full-text search alone, combining it with semantic search is recommended for better search results. For demonstration purposes, the sections below implement pure full-text search and hybrid search separately.
To get started, load Paul Graham's essay "What I Worked On" using SimpleDirectoryReader:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
# Let's take a look at the first document
print("Example document:\n", documents[0])
Example document:
 Doc ID: 16b7942f-bf1a-4197-85e1-f31d51ea25a9
Text: What I Worked On February 2021 Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I ...
Full-Text Search with BM25¶
LlamaIndex's MilvusVectorStore supports full-text search, enabling efficient keyword-based retrieval. By passing a built-in function as the sparse_embedding_function, it applies BM25 scoring to rank search results.
This section demonstrates how to build a RAG system with full-text search using BM25.
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.vector_stores.milvus.utils import BM25BuiltInFunction
from llama_index.core import Settings
# Skip dense embedding model
Settings.embed_model = None
# Build Milvus vector store creating a new collection
vector_store = MilvusVectorStore(
uri=URI,
# token=TOKEN,
enable_dense=False,
enable_sparse=True, # Only enable sparse to demo full text search
sparse_embedding_function=BM25BuiltInFunction(),
overwrite=True,
)
# Store documents in Milvus
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
Embeddings have been explicitly disabled. Using MockEmbedding.
The code above inserts the sample documents into Milvus and builds an index that enables BM25 ranking for full-text search. It disables dense embeddings and uses BM25BuiltInFunction with default parameters.
You can specify the input and output fields in the BM25BuiltInFunction parameters (see the example after this list):

- input_field_names (str): the input text field (default: "text"). This is the text field the BM25 algorithm is applied to. Change it if you use a custom collection with a different text field name.
- output_field_names (str): the field where the output of the BM25 function is stored (default: "sparse_embedding").
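For example, if your collection keeps its text in a field named "content" (a hypothetical field name, for illustration only), you could configure the function like this:
bm25_function = BM25BuiltInFunction(
    input_field_names="content",  # hypothetical custom text field
    output_field_names="sparse_embedding",  # where the BM25 sparse vectors are stored
)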
Once the vector store is set up, you can perform full-text search through Milvus using the "sparse" or "text_search" query modes:
import textwrap
query_engine = index.as_query_engine(
vector_store_query_mode="sparse", similarity_top_k=5
)
answer = query_engine.query("What did the author learn at Viaweb?")
print(textwrap.fill(str(answer), 100))
The author learned several important lessons at Viaweb. They learned about the importance of growth rate as the ultimate test of a startup, the value of building stores for users to understand retail and software usability, and the significance of being the "entry level" option in a market. Additionally, they discovered the accidental success of making Viaweb inexpensive, the challenges of hiring too many people, and the relief felt when the company was acquired by Yahoo.
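The "text_search" query mode mentioned above can be used in the same way; a minimal sketch with the same question:
query_engine = index.as_query_engine(
    vector_store_query_mode="text_search", similarity_top_k=5
)
answer = query_engine.query("What did the author learn at Viaweb?")
print(textwrap.fill(str(answer), 100))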
Customize text analyzer¶
Analyzers play a vital role in full-text search by breaking sentences into tokens and performing lexical processing, such as stemming and stop-word removal. They are typically language-specific. For more details, refer to the Milvus analyzer guide.
Milvus supports two types of analyzers: built-in analyzers and custom analyzers. By default, BM25BuiltInFunction uses the standard built-in analyzer, which tokenizes text based on punctuation.
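For example, to switch to another built-in analyzer, such as Milvus's English analyzer (assuming your corpus is English, as in this tutorial), you can pass its type in analyzer_params:
bm25_function = BM25BuiltInFunction(
    analyzer_params={"type": "english"},  # built-in English analyzer
)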
To use a different analyzer or to customize the current one, pass values to the analyzer_params parameter:
bm25_function = BM25BuiltInFunction(
analyzer_params={
"tokenizer": "standard",
"filter": [
"lowercase", # Built-in filter
{"type": "length", "max": 40}, # Custom cap size of a single token
{"type": "stop", "stop_words": ["of", "to"]}, # Custom stopwords
],
},
enable_match=True,
)
# Create index over the documents
vector_store = MilvusVectorStore(
uri=URI,
# token=TOKEN,
# enable_dense=True, # enable_dense defaults to True
dim=1536,
enable_sparse=True,
    sparse_embedding_function=bm25_function,  # use the customized BM25 function defined above
overwrite=True,
# hybrid_ranker="RRFRanker", # hybrid_ranker defaults to "RRFRanker"
# hybrid_ranker_params={}, # hybrid_ranker_params defaults to {}
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents,
storage_context=storage_context,
embed_model="default", # "default" will use OpenAI embedding
)
How it works
This approach stores documents in a Milvus collection with two vector fields (you can verify the schema with the check below):

- embedding: dense vectors generated by the OpenAI embedding model, used for semantic search.
- sparse_embedding: sparse vectors computed by BM25BuiltInFunction, used for full-text search.
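To confirm that both vector fields were created, you can inspect the collection schema directly with pymilvus (a quick sanity check; pymilvus is installed as a dependency of the Milvus integration):
from pymilvus import MilvusClient

client = MilvusClient(uri=URI)  # pass token=TOKEN if authentication is enabled
collection_name = client.list_collections()[0]
# The schema should include both the dense "embedding" and sparse "sparse_embedding" fields
print(client.describe_collection(collection_name))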
In addition, the reranking strategy uses "RRFRanker" with its default parameters. To customize the ranker, configure the hybrid_ranker and hybrid_ranker_params parameters following the Milvus reranking guide.
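For example, to weight dense and sparse scores explicitly instead of using reciprocal rank fusion, you could build the store with Milvus's WeightedRanker (a sketch; the weights below are illustrative, not recommendations):
vector_store = MilvusVectorStore(
    uri=URI,
    # token=TOKEN,
    dim=1536,
    enable_sparse=True,
    sparse_embedding_function=BM25BuiltInFunction(),
    overwrite=True,
    hybrid_ranker="WeightedRanker",
    hybrid_ranker_params={"weights": [0.4, 0.6]},  # illustrative [dense, sparse] weights
)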
Now, let's test the RAG system with a sample query:
# Query
query_engine = index.as_query_engine(
vector_store_query_mode="hybrid", similarity_top_k=5
)
answer = query_engine.query("What did the author learn at Viaweb?")
print(textwrap.fill(str(answer), 100))
The author learned several important lessons at Viaweb. These included the importance of understanding growth rate as the ultimate test of a startup, the impact of hiring too many people, the challenges of being at the mercy of investors, and the relief experienced when Yahoo bought the company. Additionally, the author learned about the significance of user feedback, the value of building stores for users, and the realization that growth rate is crucial for the long-term success of a startup.
This hybrid approach leverages both semantic and keyword-based retrieval, ensuring that the RAG system delivers more accurate and context-aware responses.
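To see which chunks each query mode actually retrieves, you can also work at the retriever level (a small sketch reusing the index built above):
retriever = index.as_retriever(
    vector_store_query_mode="hybrid", similarity_top_k=5
)
nodes = retriever.retrieve("What did the author learn at Viaweb?")
for node in nodes:
    # Print each retrieved chunk's fused score and a short text preview
    print(f"{node.score:.4f}", node.node.get_content()[:80].replace("\n", " "))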