相对分数融合与基于分布的分数融合¶
本示例展示使用 QueryFusionRetriever 的两种改进方法,旨在优化 Reciprocal Rank Fusion 算法:
- 相对分数融合(Weaviate)
- 基于分布的分数融合(Mazzeschi: 博客文章)
%pip install llama-index-llms-openai
%pip install llama-index-retrievers-bm25
import os
import openai
os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]
安装¶
如果您在 Colab 上打开此 Notebook,可能需要安装 LlamaIndex 🦙。
下载数据
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
接下来,我们将为文档建立一个向量索引。
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(chunk_size=256)
index = VectorStoreIndex.from_documents(
documents, transformations=[splitter], show_progress=True
)
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 7.55it/s] Generating embeddings: 100%|██████████| 504/504 [00:03<00:00, 128.32it/s]
首先,我们创建检索器。每个检索器将获取前10个最相似的节点。
from llama_index.retrievers.bm25 import BM25Retriever
vector_retriever = index.as_retriever(similarity_top_k=5)
bm25_retriever = BM25Retriever.from_defaults(
docstore=index.docstore, similarity_top_k=10
)
接下来,我们可以创建融合检索器,它将从两个检索器返回的20个节点中筛选出最相似的10个节点。
需要注意的是,向量检索器和BM25检索器可能返回完全相同的节点,只是顺序不同;这种情况下,融合检索器实际上仅起到重新排序的作用。
from llama_index.core.retrievers import QueryFusionRetriever
retriever = QueryFusionRetriever(
[vector_retriever, bm25_retriever],
retriever_weights=[0.6, 0.4],
similarity_top_k=10,
num_queries=1, # set this to 1 to disable query generation
mode="relative_score",
use_async=True,
verbose=True,
)
# apply nested async to run in a notebook
import nest_asyncio
nest_asyncio.apply()
nodes_with_scores = retriever.retrieve(
"What happened at Interleafe and Viaweb?"
)
for node in nodes_with_scores:
print(f"Score: {node.score:.2f} - {node.text[:100]}...\n-----")
Score: 0.60 - You wouldn't need versions, or ports, or any of that crap. At Interleaf there had been a whole group... ----- Score: 0.59 - The UI was horrible, but it proved you could build a whole store through the browser, without any cl... ----- Score: 0.40 - We were determined to be the Microsoft Word, not the Interleaf. Which meant being easy to use and in... ----- Score: 0.36 - In its time, the editor was one of the best general-purpose site builders. I kept the code tight and... ----- Score: 0.25 - I kept the code tight and didn't have to integrate with any other software except Robert's and Trevo... ----- Score: 0.25 - If all I'd had to do was work on this software, the next 3 years would have been the easiest of my l... ----- Score: 0.21 - To find out, we decided to try making a version of our store builder that you could control through ... ----- Score: 0.11 - But the most important thing I learned, and which I used in both Viaweb and Y Combinator, is that th... ----- Score: 0.11 - The next year, from the summer of 1998 to the summer of 1999, must have been the least productive of... ----- Score: 0.07 - The point is that it was really cheap, less than half market price. [8] Most software you can launc... -----
from llama_index.core.retrievers import QueryFusionRetriever
retriever = QueryFusionRetriever(
[vector_retriever, bm25_retriever],
retriever_weights=[0.6, 0.4],
similarity_top_k=10,
num_queries=1, # set this to 1 to disable query generation
mode="dist_based_score",
use_async=True,
verbose=True,
)
nodes_with_scores = retriever.retrieve(
"What happened at Interleafe and Viaweb?"
)
for node in nodes_with_scores:
print(f"Score: {node.score:.2f} - {node.text[:100]}...\n-----")
Score: 0.42 - You wouldn't need versions, or ports, or any of that crap. At Interleaf there had been a whole group... ----- Score: 0.41 - The UI was horrible, but it proved you could build a whole store through the browser, without any cl... ----- Score: 0.32 - We were determined to be the Microsoft Word, not the Interleaf. Which meant being easy to use and in... ----- Score: 0.30 - In its time, the editor was one of the best general-purpose site builders. I kept the code tight and... ----- Score: 0.27 - To find out, we decided to try making a version of our store builder that you could control through ... ----- Score: 0.24 - I kept the code tight and didn't have to integrate with any other software except Robert's and Trevo... ----- Score: 0.24 - If all I'd had to do was work on this software, the next 3 years would have been the easiest of my l... ----- Score: 0.20 - Now we felt like we were really onto something. I had visions of a whole new generation of software ... ----- Score: 0.20 - Users wouldn't need anything more than a browser. This kind of software, known as a web app, is com... ----- Score: 0.18 - But the most important thing I learned, and which I used in both Viaweb and Y Combinator, is that th... -----
在查询引擎中的使用!¶
现在,我们可以将检索器接入查询引擎,用于合成自然语言响应。
from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(retriever)
response = query_engine.query("What happened at Interleafe and Viaweb?")
from llama_index.core.response.notebook_utils import display_response
display_response(response)
Final Response: At Interleaf, there was a group called Release Engineering that was as large as the group writing the software. They had to deal with versions, ports, and other complexities. In contrast, at Viaweb, the software could be updated directly on the server, simplifying the process. Viaweb was founded with $10,000 in seed funding, and the software allowed building a whole store through the browser without the need for client software or command line inputs on the server. The company aimed to be easy to use and inexpensive, offering low monthly prices for their services.