Building an Advanced Fusion Retriever from Scratch
In this tutorial, we show you how to build an advanced retriever from scratch.
Specifically, we show you how to build our QueryFusionRetriever from scratch.
This is heavily inspired by the RAG-fusion repo here: https://github.com/Raudaschl/rag-fusion.
Setup
We load documents and build a simple vector index.
In [ ]:
%pip install llama-index-readers-file pymupdf
%pip install llama-index-llms-openai
%pip install llama-index-retrievers-bm25
In [ ]:
# allow nested event loops so async retrieval works inside the notebook
import nest_asyncio

nest_asyncio.apply()
Load Documents
In [ ]:
!mkdir data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
--2024-04-03 09:32:31--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.131.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’

data/llama2.pdf     100%[===================>]  13.03M  7.44MB/s    in 1.8s

2024-04-03 09:32:33 (7.44 MB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
!pip install llama-index
In [ ]:
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader
loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")
Setup Models
In [ ]:
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
In [ ]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = OpenAIEmbedding(
    model="text-embedding-3-small", embed_batch_size=256
)
Load into Vector Store
In [ ]:
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024)
index = VectorStoreIndex.from_documents(
    documents, transformations=[splitter], embed_model=embed_model
)
Define Advanced Retriever
We define an advanced retriever that performs the following steps:
- Query generation/rewriting: generate multiple queries given the original user query
- Perform retrieval for each query over an ensemble of retrievers
- Reranking/fusion: fuse results from all queries, and apply a reranking step to "fuse" the top relevant results
Then in the next section, we'll plug this into our response synthesis module.
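Schematically, the whole retriever boils down to a few lines. The sketch below is just a roadmap: it assumes the helper functions (generate_queries, run_queries, fuse_results) that we define in the cells that follow.

import asyncio

# high-level sketch of the fusion retriever built in this tutorial
def fusion_retrieve(query_str, llm, retrievers, top_k=2):
    # 1. query generation/rewriting
    queries = generate_queries(llm, query_str, num_queries=4)
    # 2. run every query against every retriever
    results_dict = asyncio.run(run_queries(queries, retrievers))
    # 3. rerank/fuse the combined results
    return fuse_results(results_dict, similarity_top_k=top_k)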
In [ ]:
from llama_index.core import PromptTemplate
In [ ]:
query_str = "How do the models developed in this work compare to open-source chat models based on the benchmarks tested?"
In [ ]:
query_gen_prompt_str = (
    "You are a helpful assistant that generates multiple search queries based on a "
    "single input query. Generate {num_queries} search queries, one on each line, "
    "related to the following input query:\n"
    "Query: {query}\n"
    "Queries:\n"
)
query_gen_prompt = PromptTemplate(query_gen_prompt_str)
In [ ]:
def generate_queries(llm, query_str: str, num_queries: int = 4):
    # ask for num_queries - 1 new queries, since the original query counts as one
    fmt_prompt = query_gen_prompt.format(
        num_queries=num_queries - 1, query=query_str
    )
    response = llm.complete(fmt_prompt)
    queries = response.text.split("\n")
    return queries
In [ ]:
queries = generate_queries(llm, query_str, num_queries=4)
In [ ]:
print(queries)
['1. Comparison of models developed in this work to open-source chat models in benchmark testing', '2. Performance evaluation of models developed in this work versus open-source chat models on tested benchmarks', '3. Analysis of differences between models developed in this work and open-source chat models in benchmark assessments']
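Note that the LLM numbers each generated query ("1.", "2.", ...). The prefixes are usually harmless for embedding-based retrieval, but if you find they skew a keyword retriever like BM25, one optional cleanup (not part of the original recipe) is to strip them:

import re

# optional: drop leading "N." numbering and empty lines from generated queries
cleaned_queries = [re.sub(r"^\s*\d+\.\s*", "", q) for q in queries if q.strip()]
print(cleaned_queries)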
In [ ]:
from tqdm.asyncio import tqdm

async def run_queries(queries, retrievers):
    """Run queries against retrievers."""
    # fire off every (query, retriever) pair concurrently
    tasks = []
    for query in queries:
        for i, retriever in enumerate(retrievers):
            tasks.append(retriever.aretrieve(query))

    task_results = await tqdm.gather(*tasks)

    # tasks were created in (query, retriever) order, so map each result
    # back to its (query, retriever index) key
    results_dict = {}
    for idx, query_result in enumerate(task_results):
        query = queries[idx // len(retrievers)]
        retriever_idx = idx % len(retrievers)
        results_dict[(query, retriever_idx)] = query_result

    return results_dict
In [ ]:
# get retrievers
from llama_index.retrievers.bm25 import BM25Retriever

## vector retriever
vector_retriever = index.as_retriever(similarity_top_k=2)

## bm25 retriever
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=2
)
In [ ]:
results_dict = await run_queries(queries, [vector_retriever, bm25_retriever])
0%| | 0/6 [00:00<?, ?it/s]
100%|██████████| 6/6 [00:00<00:00, 11.14it/s]
Step 3: Perform Results Fusion
The next step here is to perform results fusion: combining the results from several retrievers into one, and re-ranking them.
Note that a given node might be retrieved by multiple retrievers, so there needs to be a way to de-dup and re-rank a node across the multiple retrievals.
We'll show you how to perform "reciprocal rank fusion": for each node, add up its reciprocal rank in every list where it's retrieved, then reorder the nodes by this total score from highest to lowest.
Full paper here: https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf
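To make the formula concrete, here is a toy calculation (ranks are 0-based, matching the code below, and k = 60 damps the influence of any single high rank):

# toy reciprocal rank fusion arithmetic
k = 60.0
# node A: rank 0 in one result list, rank 2 in another
score_a = 1.0 / (0 + k) + 1.0 / (2 + k)  # ~0.0328
# node B: rank 0 in a single result list
score_b = 1.0 / (0 + k)  # ~0.0167
# node A wins because it was retrieved in both lists

Appearing in several lists boosts a node even when its individual ranks are modest, which is exactly the de-dup-and-rerank behavior we want.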
In [ ]:
from typing import List
from llama_index.core.schema import NodeWithScore

def fuse_results(results_dict, similarity_top_k: int = 2):
    """Fuse results."""
    k = 60.0  # `k` is a parameter used to control the impact of outlier rankings.
    fused_scores = {}
    text_to_node = {}

    # compute reciprocal rank scores
    for nodes_with_scores in results_dict.values():
        for rank, node_with_score in enumerate(
            sorted(
                nodes_with_scores, key=lambda x: x.score or 0.0, reverse=True
            )
        ):
            text = node_with_score.node.get_content()
            text_to_node[text] = node_with_score
            if text not in fused_scores:
                fused_scores[text] = 0.0
            fused_scores[text] += 1.0 / (rank + k)

    # sort results
    reranked_results = dict(
        sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    )

    # adjust node scores
    reranked_nodes: List[NodeWithScore] = []
    for text, score in reranked_results.items():
        reranked_nodes.append(text_to_node[text])
        reranked_nodes[-1].score = score

    return reranked_nodes[:similarity_top_k]
In [ ]:
final_results = fuse_results(results_dict)
In [ ]:
for n in final_results:
    print(n.score, "\n", n.text, "\n********\n")
0.03333333333333333
Figure 12: Human evaluation results for Llama 2-Chat models compared to open- and closed-source models
across ~4,000 helpfulness prompts with three raters per prompt.
The largest Llama 2-Chat model is competitive with ChatGPT. Llama 2-Chat 70B model has a win rate of
36% and a tie rate of 31.5% relative to ChatGPT. Llama 2-Chat 70B model outperforms PaLM-bison chat
model by a large percentage on our prompt set. More results and analysis is available in Section A.3.7.
Inter-Rater Reliability (IRR).
In our human evaluations, three different annotators provided independent
assessments for each model generation comparison. High IRR scores (closer to 1.0) are typically seen as
better from a data quality perspective, however, context is important. Highly subjective tasks like evaluating
the overall helpfulness of LLM generations will usually have lower IRR scores than more objective labelling
tasks. There are relatively few public benchmarks for these contexts, so we feel sharing our analysis here will
benefit the research community.
We used Gwet’s AC1/2 statistic (Gwet, 2008, 2014) to measure inter-rater reliability (IRR), as we found it to
be the most stable metric across different measurement scenarios. On the 7-point Likert scale helpfulness
task that is used in our analysis, Gwet’s AC2 score varies between 0.37 and 0.55 depending on the specific
model comparison. We see scores on the lower end of that range for ratings from model comparisons with
similar win rates to each other (like the Llama 2-Chat-70B-chat vs. ChatGPT comparison). We see scores on
the higher end of that range for ratings from model comparisons with a more clear winner (like the Llama
2-Chat-34b-chat vs. Falcon-40b-instruct).
Limitations of human evaluations.
While our results indicate that Llama 2-Chat is on par with ChatGPT
on human evaluations, it is important to note that human evaluations have several limitations.
• By academic and research standards, we have a large prompt set of 4k prompts. However, it does not cover
real-world usage of these models, which will likely cover a significantly larger number of use cases.
• Diversity of the prompts could be another factor in our results. For example, our prompt set does not
include any coding- or reasoning-related prompts.
• We only evaluate the final generation of a multi-turn conversation. A more interesting evaluation could be
to ask the models to complete a task and rate the overall experience with the model over multiple turns.
• Human evaluation for generative models is inherently subjective and noisy. As a result, evaluation on a
different set of prompts or with different instructions could result in different results.
19
********
0.03306010928961749
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗
Louis Martin†
Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edunov
Thomas Scialom∗
GenAI, Meta
Abstract
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned
large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our
models outperform open-source chat models on most benchmarks we tested, and based on
our human evaluations for helpfulness and safety, may be a suitable substitute for closed-
source models. We provide a detailed description of our approach to fine-tuning and safety
improvements of Llama 2-Chat in order to enable the community to build on our work and
contribute to the responsible development of LLMs.
∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com
†Second author
Contributions for all the authors can be found in Section A.1.
arXiv:2307.09288v2 [cs.CL] 19 Jul 2023
********
Analysis: the above code has a few straightforward components:
- Go through each node in each retrieved list, and accumulate its reciprocal rank under the node's ID. The node ID is the hash of its text, which serves to de-dup identical chunks.
- Sort results by highest score to lowest.
- Adjust node scores.
Plug into RetrieverQueryEngine
Now we're ready to define this as a custom retriever, and plug it into our RetrieverQueryEngine (which does retrieval and synthesis).
In [ ]:
from typing import List

from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore
import asyncio

class FusionRetriever(BaseRetriever):
    """Ensemble retriever with fusion."""

    def __init__(
        self,
        llm,
        retrievers: List[BaseRetriever],
        similarity_top_k: int = 2,
    ) -> None:
        """Init params."""
        self._retrievers = retrievers
        self._similarity_top_k = similarity_top_k
        self._llm = llm
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve."""
        queries = generate_queries(
            self._llm, query_bundle.query_str, num_queries=4
        )
        # asyncio.run works inside the notebook's event loop thanks to
        # the nest_asyncio.apply() call at the top of this tutorial
        results = asyncio.run(run_queries(queries, self._retrievers))
        final_results = fuse_results(
            results, similarity_top_k=self._similarity_top_k
        )
        return final_results
In [ ]:
from llama_index.core.query_engine import RetrieverQueryEngine

fusion_retriever = FusionRetriever(
    llm, [vector_retriever, bm25_retriever], similarity_top_k=2
)

query_engine = RetrieverQueryEngine(fusion_retriever)
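Before running the full query engine, you can sanity-check the retriever on its own; this is a quick aside, not a step from the original notebook:

# retrieve fused nodes directly from the custom retriever
nodes = fusion_retriever.retrieve(query_str)
for n in nodes:
    print(n.score, n.node.get_content()[:200], "\n---")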
In [ ]:
response = query_engine.query(query_str)
In [ ]:
print(str(response))
The models developed in this work, specifically the Llama 2-Chat models, outperform open-source chat models on most benchmarks that were tested.
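That's the from-scratch implementation. For everyday use, LlamaIndex also ships a built-in QueryFusionRetriever that packages the same query-generation plus reciprocal-rank-fusion flow. The sketch below should be close, though parameter names can differ across versions, so check the current API reference:

from llama_index.core.retrievers import QueryFusionRetriever

# built-in counterpart of the retriever assembled above
fusion_retriever_builtin = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=2,
    num_queries=4,  # includes the original query; set to 1 to disable generation
    mode="reciprocal_rerank",
    use_async=True,
)
nodes = fusion_retriever_builtin.retrieve(query_str)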