Downloading a LlamaDataset from LlamaHub
You can browse our available benchmark datasets via llamahub.ai. This guide shows you how to download a dataset along with its source text documents. In particular, the download_llama_dataset function downloads the evaluation dataset (i.e., a LabelledRagDataset) as well as the collection of source text Documents that were originally used to build it.
Finally, this guide also demonstrates the full end-to-end workflow: download an evaluation dataset → make predictions over it with your own RAG pipeline (query engine) → evaluate those predictions.
%pip install llama-index-llms-openai
from llama_index.core.llama_dataset import download_llama_dataset
# download and install dependencies
rag_dataset, documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./paul_graham"
)
github url: https://raw.githubusercontent.com/nerdai/llama-hub/datasets/llama_hub/llama_datasets/library.json
github url: https://media.githubusercontent.com/media/run-llama/llama_datasets/main/llama_datasets/paul_graham_essay/rag_dataset.json
github url: https://media.githubusercontent.com/media/run-llama/llama_datasets/main/llama_datasets/paul_graham_essay/source.txt
rag_dataset.to_pandas()[:5]
| | query | reference_contexts | reference_answer | reference_answer_by | query_by |
|---|---|---|---|---|---|
| 0 | In the essay, the author mentions his early ex... | [What I Worked On\n\nFebruary 2021\n\nBefore c... | The first computer the author used for program... | ai (gpt-4) | ai (gpt-4) |
| 1 | The author switched his major from philosophy ... | [What I Worked On\n\nFebruary 2021\n\nBefore c... | The two specific influences that led the autho... | ai (gpt-4) | ai (gpt-4) |
| 2 | In the essay, the author discusses his initial... | [I couldn't have put this into words when I wa... | The two main influences that initially drew th... | ai (gpt-4) | ai (gpt-4) |
| 3 | The author mentions his shift of interest towa... | [I couldn't have put this into words when I wa... | The author shifted his interest towards Lisp a... | ai (gpt-4) | ai (gpt-4) |
| 4 | In the essay, the author mentions his interest... | [So I looked around to see what I could salvag... | The author in the essay is Paul Graham, who wa... | ai (gpt-4) | ai (gpt-4) |
With documents, you can build your own RAG pipeline, then make predictions and run evaluations to compare against the benchmarks listed in the dataset's DatasetCard on llamahub.ai.
Predictions
NOTE: The rest of this notebook performs the predictions and subsequent evaluations manually, for illustration purposes only. Alternatively, you can use the RagEvaluatorPack, which will take care of predicting and evaluating with any RAG system you supply.
from llama_index.core import VectorStoreIndex
# a basic RAG pipeline, uses defaults
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
You can now create predictions and run the evaluations manually, or download the PredictAndEvaluatePack to do this for you in a single line of code.
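Before kicking off the full batch of predictions, it can help to sanity-check the pipeline on a single example. A quick sketch (the sample_query name is ours; examples[0] is simply the first labelled example in the dataset):

# quick sanity check: run the query engine on the first labelled example
sample_query = rag_dataset.examples[0].query
print(query_engine.query(sample_query))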
import nest_asyncio

# allow nested event loops so `await` can be used directly in the notebook
nest_asyncio.apply()
# manually
prediction_dataset = await rag_dataset.amake_predictions_with(
    query_engine=query_engine, show_progress=True
)
100%|███████████████████████████████████████████████████████| 44/44 [00:08<00:00, 4.90it/s]
prediction_dataset.to_pandas()[:5]
| | response | contexts |
|---|---|---|
| 0 | The author mentions that the first computer he... | [What I Worked On\n\nFebruary 2021\n\nBefore c... |
| 1 | The author switched his major from philosophy ... | [I couldn't have put this into words when I wa... |
| 2 | The author mentions two main influences that i... | [I couldn't have put this into words when I wa... |
| 3 | The author mentions that he shifted his intere... | [So I looked around to see what I could salvag... |
| 4 | The author mentions his interest in both compu... | [What I Worked On\n\nFebruary 2021\n\nBefore c... |
Evaluation
Now that we have the predictions, we can evaluate them along two dimensions:
- The generated response: how well the predicted response matches the reference answer.
- The retrieved contexts: how well the contexts retrieved at prediction time match the reference contexts.
NOTE: For the retrieved contexts, we are unable to use standard retrieval metrics such as hit rate and mean reciprocal rank, because doing so would require the same index that was used to generate the ground-truth data. A LabelledRagDataset, however, is not tied to any particular index. As such, we will use the semantic similarity between the predicted contexts and the reference contexts as our measure.
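That semantic similarity measure conceptually boils down to embedding both strings and taking the cosine similarity of the resulting vectors. A minimal sketch of the idea, assuming OpenAIEmbedding as the embedding model (the context_similarity helper is ours, for illustration only; the SemanticSimilarityEvaluator used below packages the same idea):

from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()


def context_similarity(predicted: str, reference: str) -> float:
    """Cosine similarity between the embeddings of two context strings."""
    e1 = embed_model.get_text_embedding(predicted)
    e2 = embed_model.get_text_embedding(reference)
    dot = sum(a * b for a, b in zip(e1, e2))
    norms = sum(a * a for a in e1) ** 0.5 * sum(b * b for b in e2) ** 0.5
    return dot / norms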
import tqdm
To evaluate response quality, we will use an LLM-As-A-Judge pattern. Specifically, we will use the following evaluators:
- CorrectnessEvaluator
- FaithfulnessEvaluator
- RelevancyEvaluator
To evaluate the quality of the retrieved contexts, we will use the SemanticSimilarityEvaluator.
# instantiate the gpt-4 judge
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    SemanticSimilarityEvaluator,
)
judges = {}
judges["correctness"] = CorrectnessEvaluator(
llm=OpenAI(temperature=0, model="gpt-4"),
)
judges["relevancy"] = RelevancyEvaluator(
llm=OpenAI(temperature=0, model="gpt-4"),
)
judges["faithfulness"] = FaithfulnessEvaluator(
llm=OpenAI(temperature=0, model="gpt-4"),
)
judges["semantic_similarity"] = SemanticSimilarityEvaluator()
Now, loop over each (labelled_example, prediction) pair and perform the evaluations on each of them individually.
evals = {
    "correctness": [],
    "relevancy": [],
    "faithfulness": [],
    "context_similarity": [],
}

for example, prediction in tqdm.tqdm(
    zip(rag_dataset.examples, prediction_dataset.predictions)
):
    correctness_result = judges["correctness"].evaluate(
        query=example.query,
        response=prediction.response,
        reference=example.reference_answer,
    )

    relevancy_result = judges["relevancy"].evaluate(
        query=example.query,
        response=prediction.response,
        contexts=prediction.contexts,
    )

    faithfulness_result = judges["faithfulness"].evaluate(
        query=example.query,
        response=prediction.response,
        contexts=prediction.contexts,
    )

    semantic_similarity_result = judges["semantic_similarity"].evaluate(
        query=example.query,
        response="\n".join(prediction.contexts),
        reference="\n".join(example.reference_contexts),
    )

    evals["correctness"].append(correctness_result)
    evals["relevancy"].append(relevancy_result)
    evals["faithfulness"].append(faithfulness_result)
    evals["context_similarity"].append(semantic_similarity_result)
44it [07:15, 9.90s/it]
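Each judge returns an EvaluationResult object. Before aggregating, you can spot-check an individual result; the model exposes, among other fields, a numeric score, a pass/fail flag, and the judge's free-text feedback:

# spot-check the first correctness judgment
first_result = evals["correctness"][0]
print(first_result.score)  # numeric score assigned by the judge
print(first_result.passing)  # boolean pass/fail
print(first_result.feedback)  # the judge's reasoning, if provided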
import json
# saving evaluations
evaluations_objects = {
    "context_similarity": [e.dict() for e in evals["context_similarity"]],
    "correctness": [e.dict() for e in evals["correctness"]],
    "faithfulness": [e.dict() for e in evals["faithfulness"]],
    "relevancy": [e.dict() for e in evals["relevancy"]],
}

with open("evaluations.json", "w") as json_file:
    json.dump(evaluations_objects, json_file)
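Should you need to reload these results in a later session, the saved dicts can be parsed back into EvaluationResult objects. A minimal sketch, assuming the same pydantic-v1-style llama-index build that produced the .dict() calls above (on pydantic-v2 builds, model_validate would replace parse_obj):

from llama_index.core.evaluation import EvaluationResult

# reload the saved evaluations into EvaluationResult objects
with open("evaluations.json", "r") as json_file:
    saved_evaluations = json.load(json_file)

reloaded_evals = {
    metric: [EvaluationResult.parse_obj(d) for d in results]
    for metric, results in saved_evaluations.items()
}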
Now we can use our notebook utility functions to view these evaluation results.
import pandas as pd
from llama_index.core.evaluation.notebook_utils import get_eval_results_df
deep_eval_df, mean_correctness_df = get_eval_results_df(
    ["base_rag"] * len(evals["correctness"]),
    evals["correctness"],
    metric="correctness",
)
deep_eval_df, mean_relevancy_df = get_eval_results_df(
    ["base_rag"] * len(evals["relevancy"]),
    evals["relevancy"],
    metric="relevancy",
)
_, mean_faithfulness_df = get_eval_results_df(
    ["base_rag"] * len(evals["faithfulness"]),
    evals["faithfulness"],
    metric="faithfulness",
)
_, mean_context_similarity_df = get_eval_results_df(
    ["base_rag"] * len(evals["context_similarity"]),
    evals["context_similarity"],
    metric="context_similarity",
)

mean_scores_df = pd.concat(
    [
        mean_correctness_df.reset_index(),
        mean_relevancy_df.reset_index(),
        mean_faithfulness_df.reset_index(),
        mean_context_similarity_df.reset_index(),
    ],
    axis=0,
    ignore_index=True,
)
mean_scores_df = mean_scores_df.set_index("index")
mean_scores_df.index = mean_scores_df.index.set_names(["metrics"])
mean_scores_df
mean_scores_df
| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.238636 |
| mean_relevancy_score | 0.977273 |
| mean_faithfulness_score | 0.977273 |
| mean_context_similarity_score | 0.933568 |
In this toy example, we see that the basic RAG pipeline performs quite well against this evaluation benchmark (rag_dataset)! For completeness, to carry out the above steps using the RagEvaluatorPack, use the code provided below:
from llama_index.core.llama_pack import download_llama_pack
RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack")
rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine, rag_dataset=rag_dataset, show_progress=True
)
############################################################################
# NOTE: If you have a lower tier subscription for OpenAI API like Usage   #
# Tier 1, then you'll need to use different batch_size and                #
# sleep_time_in_seconds. For Usage Tier 1, settings that seemed to work   #
# well were batch_size=5 and sleep_time_in_seconds=15 (as of December     #
# 2023).                                                                  #
############################################################################
benchmark_df = await rag_evaluator.arun(
    batch_size=20,  # batches the number of openai api calls to make
    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
)