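Relative Score Fusion and Distribution-Based Score Fusion

This example shows two fusion methods available in QueryFusionRetriever that aim to improve on the Reciprocal Rank Fusion algorithm:

Relative Score Fusion (Weaviate)
Distribution-Based Score Fusion (Mazzeschi: blog post)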
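
For context, the baseline that both methods try to improve on can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not LlamaIndex's implementation: the function name `reciprocal_rank_fusion`, the toy document IDs, and the smoothing constant `k=60` (a commonly used default) are assumptions made for this sketch.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs using rank positions only.

    `k` is a smoothing constant; 60 is a common choice and an assumption here.
    """
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); raw scores are ignored.
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from two retrievers (best match first).
vector_ranking = ["a", "b", "c"]
bm25_ranking = ["b", "d", "a"]

rrf_result = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
for doc_id, score in rrf_result:
    print(f"{doc_id}: {score:.4f}")
```

Because only rank positions enter the sum, the size of the score gap between two neighbouring results is discarded. The two methods in this example use the retrievers' scores instead.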

In [ ]:

%pip install llama-index-llms-openai
%pip install llama-index-retrievers-bm25

In [ ]:

import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]

Setup

If you are opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.

Download Data

In [ ]:

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

In [ ]:

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

Next, we build a vector index over the documents.

In [ ]:

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=256)

index = VectorStoreIndex.from_documents(
    documents, transformations=[splitter], show_progress=True
)

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00,  7.55it/s]
Generating embeddings: 100%|██████████| 504/504 [00:03<00:00, 128.32it/s]

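Create a Hybrid Fusion Retriever with Relative Score Fusion

In this step, we fuse our index with a BM25-based retriever. This lets us capture both the semantic relations and the keywords in the input query.

Since both retrievers compute a score, we can use QueryFusionRetriever to re-sort the nodes without an additional model or excessive computation.

The following example uses the Relative Score Fusion algorithm from Weaviate, which applies a MinMax scaler to each result set and then takes a weighted sum. Here, we weight the vector retriever slightly higher than BM25 (0.6 vs. 0.4).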
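
The scaling and weighted sum described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the code QueryFusionRetriever runs: the helper names `min_max_scale` and `relative_score_fusion`, the toy scores, and the handling of a result set whose scores are all equal are hypothetical.

```python
from collections import defaultdict


def min_max_scale(scores):
    """Rescale one result set's scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption for this sketch: identical scores all map to 1.0.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def relative_score_fusion(result_sets, weights, top_k=10):
    """Min-max scale each result set, then sum the weighted scores per node."""
    fused = defaultdict(float)
    for scores, weight in zip(result_sets, weights):
        for doc_id, scaled in min_max_scale(scores).items():
            fused[doc_id] += weight * scaled
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical raw scores: cosine similarities and BM25 scores live on
# different scales, which is why each set is rescaled before summing.
vector_scores = {"a": 0.9, "b": 0.8, "c": 0.7}
bm25_scores = {"b": 12.0, "d": 9.0, "a": 6.0}

relative_result = relative_score_fusion(
    [vector_scores, bm25_scores], weights=[0.6, 0.4]
)
for doc_id, score in relative_result:
    print(f"{doc_id}: {score:.2f}")
```

In this sketch, a node that ranks first in only the vector result set ends up with 0.6, and one that ranks first in only the BM25 result set ends up with 0.4; a node found by both retrievers accumulates both contributions.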

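First, we create our retrievers. The vector retriever fetches the 5 most similar nodes, and the BM25 retriever fetches the top 10.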

In [ ]:

from llama_index.retrievers.bm25 import BM25Retriever

vector_retriever = index.as_retriever(similarity_top_k=5)

bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=10
)

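Next, we create the fusion retriever, which returns the 10 highest-scoring nodes out of the up to 15 nodes returned by the two retrievers.

Note that the vector and BM25 retrievers may return many of the same nodes, only in a different order; in that case, the fusion retriever effectively acts as a re-ranker.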

In [ ]:

from llama_index.core.retrievers import QueryFusionRetriever

retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    retriever_weights=[0.6, 0.4],
    similarity_top_k=10,
    num_queries=1,  # set this to 1 to disable query generation
    mode="relative_score",
    use_async=True,
    verbose=True,
)

In [ ]:

# apply nested async to run in a notebook
import nest_asyncio

nest_asyncio.apply()

In [ ]:

nodes_with_scores = retriever.retrieve(
    "What happened at Interleafe and Viaweb?"
)

In [ ]:

for node in nodes_with_scores:
    print(f"Score: {node.score:.2f} - {node.text[:100]}...\n-----")

Score: 0.60 - You wouldn't need versions, or ports, or any of that crap. At Interleaf there had been a whole group...
-----
Score: 0.59 - The UI was horrible, but it proved you could build a whole store through the browser, without any cl...
-----
Score: 0.40 - We were determined to be the Microsoft Word, not the Interleaf. Which meant being easy to use and in...
-----
Score: 0.36 - In its time, the editor was one of the best general-purpose site builders. I kept the code tight and...
-----
Score: 0.25 - I kept the code tight and didn't have to integrate with any other software except Robert's and Trevo...
-----
Score: 0.25 - If all I'd had to do was work on this software, the next 3 years would have been the easiest of my l...
-----
Score: 0.21 - To find out, we decided to try making a version of our store builder that you could control through ...
-----
Score: 0.11 - But the most important thing I learned, and which I used in both Viaweb and Y Combinator, is that th...
-----
Score: 0.11 - The next year, from the summer of 1998 to the summer of 1999, must have been the least productive of...
-----
Score: 0.07 - The point is that it was really cheap, less than half market price.

[8] Most software you can launc...
-----

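Distribution-Based Score Fusion

Distribution-Based Score Fusion is a variant of Relative Score Fusion that scales the scores slightly differently: the scaling is based on the mean and standard deviation of the scores in each result set.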
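
A minimal sketch of that scaling step follows. It assumes the bounds are the mean minus and plus three standard deviations, which is one common formulation of the method; the text above only says that the mean and standard deviation are used. The helper names, the toy scores, and the fallback for a zero standard deviation are hypothetical and not taken from LlamaIndex.

```python
from collections import defaultdict
from statistics import mean, pstdev


def dist_based_scale(scores, num_std=3.0):
    """Scale scores relative to mean +/- num_std standard deviations."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # Assumption for this sketch: with no spread, use the midpoint.
        return {doc_id: 0.5 for doc_id in scores}
    lo, hi = mu - num_std * sigma, mu + num_std * sigma
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def dist_based_score_fusion(result_sets, weights, top_k=10):
    """Scale each result set by its own distribution, then take a weighted sum."""
    fused = defaultdict(float)
    for scores, weight in zip(result_sets, weights):
        for doc_id, scaled in dist_based_scale(scores).items():
            fused[doc_id] += weight * scaled
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical raw scores from a vector retriever and a BM25 retriever.
vector_scores = {"a": 0.9, "b": 0.8, "c": 0.7}
bm25_scores = {"b": 12.0, "d": 9.0, "a": 6.0}

dist_result = dist_based_score_fusion(
    [vector_scores, bm25_scores], weights=[0.6, 0.4]
)
for doc_id, score in dist_result:
    print(f"{doc_id}: {score:.2f}")
```

Unlike MinMax scaling, the best and worst results in a set are not pinned to exactly 1 and 0 here, so a single outlier score has less influence on how the rest of the set is scaled.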

In [ ]:

from llama_index.core.retrievers import QueryFusionRetriever

retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    retriever_weights=[0.6, 0.4],
    similarity_top_k=10,
    num_queries=1,  # set this to 1 to disable query generation
    mode="dist_based_score",
    use_async=True,
    verbose=True,
)

nodes_with_scores = retriever.retrieve(
    "What happened at Interleafe and Viaweb?"
)

for node in nodes_with_scores:
    print(f"Score: {node.score:.2f} - {node.text[:100]}...\n-----")

Score: 0.42 - You wouldn't need versions, or ports, or any of that crap. At Interleaf there had been a whole group...
-----
Score: 0.41 - The UI was horrible, but it proved you could build a whole store through the browser, without any cl...
-----
Score: 0.32 - We were determined to be the Microsoft Word, not the Interleaf. Which meant being easy to use and in...
-----
Score: 0.30 - In its time, the editor was one of the best general-purpose site builders. I kept the code tight and...
-----
Score: 0.27 - To find out, we decided to try making a version of our store builder that you could control through ...
-----
Score: 0.24 - I kept the code tight and didn't have to integrate with any other software except Robert's and Trevo...
-----
Score: 0.24 - If all I'd had to do was work on this software, the next 3 years would have been the easiest of my l...
-----
Score: 0.20 - Now we felt like we were really onto something. I had visions of a whole new generation of software ...
-----
Score: 0.20 - Users wouldn't need anything more than a browser.

This kind of software, known as a web app, is com...
-----
Score: 0.18 - But the most important thing I learned, and which I used in both Viaweb and Y Combinator, is that th...
-----

Use in a Query Engine!

Now, we can plug our retriever into a query engine to synthesize natural language responses.

In [ ]:

from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(retriever)

In [ ]:

response = query_engine.query("What happened at Interleafe and Viaweb?")

In [ ]:

from llama_index.core.response.notebook_utils import display_response

display_response(response)

Final Response: At Interleaf, there was a group called Release Engineering that was as large as the group writing the software. They had to deal with versions, ports, and other complexities. In contrast, at Viaweb, the software could be updated directly on the server, simplifying the process. Viaweb was founded with $10,000 in seed funding, and the software allowed building a whole store through the browser without the need for client software or command line inputs on the server. The company aimed to be easy to use and inexpensive, offering low monthly prices for their services.