In [ ]:
%pip install llama-index-llms-openai
%pip install llama-index-postprocessor-cohere-rerank
%pip install llama-index-readers-file pymupdf
In [ ]:
%load_ext autoreload
%autoreload 2
Setup

Here we define the imports we need.

If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
!pip install llama-index
In [ ]:
# NOTE: This is ONLY necessary in jupyter notebook.
# Details: Jupyter runs an event-loop behind the scenes.
# This results in nested event-loops when we start an event-loop to make async queries.
# This is normally not allowed, we use nest_asyncio to allow it for convenience.
import nest_asyncio

nest_asyncio.apply()
In [ ]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().handlers = []
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
)
from llama_index.core import SummaryIndex
from llama_index.core.node_parser import SentenceSplitter  # used for chunking below
from llama_index.core.response.notebook_utils import display_response
from llama_index.llms.openai import OpenAI
Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8. NumExpr defaulting to 8 threads.
Load Data

Here we first load in the Llama 2 paper as a single document. We then chunk it multiple times, according to different chunk sizes, and build a separate vector index for each set of chunks.
In [ ]:
Copied!
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
--2023-09-28 12:56:38--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 128.84.21.199
Connecting to arxiv.org (arxiv.org)|128.84.21.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’

data/llama2.pdf     100%[===================>]  13.03M   521KB/s    in 42s

2023-09-28 12:57:20 (320 KB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]
In [ ]:
from pathlib import Path

from llama_index.core import Document
from llama_index.readers.file import PyMuPDFReader
In [ ]:
loader = PyMuPDFReader()
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))
doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]
We try out different chunk sizes: 128, 256, 512, and 1024.
In [ ]:
# initialize modules
llm = OpenAI(model="gpt-4")
chunk_sizes = [128, 256, 512, 1024]
nodes_list = []
vector_indices = []
for chunk_size in chunk_sizes:
    print(f"Chunk Size: {chunk_size}")
    splitter = SentenceSplitter(chunk_size=chunk_size)
    nodes = splitter.get_nodes_from_documents(docs)

    # tag each node with its chunk size so we can trace results back later,
    # but exclude the tag from embeddings and LLM prompts so it doesn't
    # influence retrieval or synthesis
    for node in nodes:
        node.metadata["chunk_size"] = chunk_size
        node.excluded_embed_metadata_keys = ["chunk_size"]
        node.excluded_llm_metadata_keys = ["chunk_size"]
    nodes_list.append(nodes)

    # build a separate vector index per chunk size
    vector_index = VectorStoreIndex(nodes)
    vector_indices.append(vector_index)
Chunk Size: 128
Chunk Size: 256
Chunk Size: 512
Chunk Size: 1024
Define Ensemble Retriever

We set up an "ensemble" retriever, built primarily on our recursive retrieval abstraction. It works as follows (see the sketch after this list):

- Define a separate IndexNode corresponding to the vector retriever for each chunk size (a retriever for chunk size 128, one for chunk size 256, and so on).
- Put all of these IndexNodes into a single SummaryIndex; when its retriever is called, all of the nodes are returned.
- Define a recursive retriever whose root is the summary index retriever. It first fetches all the IndexNodes from the summary index retriever, then recursively calls the vector retriever registered under each node's index_id.
- Rerank the final set of retrieved nodes.

The net effect is that every vector retriever gets called when a query is run.
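To make the dispatch concrete, here is a minimal plain-Python sketch of the control flow. It is illustrative only (a hypothetical recursive_retrieve helper, not the actual LlamaIndex internals): any retrieved item that points at another retriever is resolved by recursing into the retriever registered under its index_id, while leaf text nodes are kept.

def recursive_retrieve(query, root_id, retriever_dict):
    # hypothetical sketch -- start at the root retriever and fan out
    results = []
    for item in retriever_dict[root_id].retrieve(query):
        if hasattr(item, "index_id"):
            # an IndexNode: follow its pointer to another retriever and recurse
            results.extend(recursive_retrieve(query, item.index_id, retriever_dict))
        else:
            # a leaf text node: keep it in the final result set
            results.append(item)
    return results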
In [ ]:
# try ensemble retrieval
from llama_index.core.schema import IndexNode

# build one IndexNode per chunk size, each pointing at that chunk size's
# vector retriever via its index_id
retriever_dict = {}
retriever_nodes = []
for chunk_size, vector_index in zip(chunk_sizes, vector_indices):
    node_id = f"chunk_{chunk_size}"
    node = IndexNode(
        text=(
            "Retrieves relevant context from the Llama 2 paper (chunk size"
            f" {chunk_size})"
        ),
        index_id=node_id,
    )
    retriever_nodes.append(node)
    retriever_dict[node_id] = vector_index.as_retriever()
Define the recursive retriever.
In [ ]:
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core import SummaryIndex

# the root retriever simply returns all of the IndexNodes
summary_index = SummaryIndex(retriever_nodes)

retriever = RecursiveRetriever(
    root_id="root",
    retriever_dict={"root": summary_index.as_retriever(), **retriever_dict},
)
Let's test the retriever on a sample query.
In [ ]:
nodes = await retriever.aretrieve(
    "Tell me about the main aspects of safety fine-tuning"
)
In [ ]:
print(f"Number of nodes: {len(nodes)}")
for node in nodes:
    print(node.node.metadata["chunk_size"])
    print(node.node.get_text())
Define the reranker that will process the final set of retrieved nodes.
In [ ]:
# define reranker
from llama_index.core.postprocessor import LLMRerank, SentenceTransformerRerank
from llama_index.postprocessor.cohere_rerank import CohereRerank

# reranker = LLMRerank()
# reranker = SentenceTransformerRerank(top_n=10)
# NOTE: CohereRerank requires a COHERE_API_KEY set in your environment
reranker = CohereRerank(top_n=10)
Define the retriever query engine that ties together the recursive retriever and the reranker.
In [ ]:
# define RetrieverQueryEngine
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine(retriever, node_postprocessors=[reranker])
In [ ]:
response = query_engine.query(
    "Tell me about the main aspects of safety fine-tuning"
)
In [ ]:
display_response(
    response, show_source=True, source_length=500, show_source_metadata=True
)
Analyze the Relative Importance of each Chunk

One interesting property of ensemble-based retrieval is that, through the reranking step, we can use the rank positions of chunks in the final retrieved set to infer the relative importance of each chunk size. For instance, if chunks of a certain size consistently rank near the top, those chunk sizes are likely more relevant to the query.
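As a quick toy illustration of the metric computed below (the ranking data here is hypothetical, not from the run above): each chunk size is scored by the reciprocal of the rank at which it first appears in the reranked list.

# hypothetical chunk_size of each node in a reranked result list
ranked_chunk_sizes = [256, 128, 256, 512, 1024]

def reciprocal_rank(value, ranked_values):
    # 1 / (1-based position of the first occurrence), or 0 if absent
    for idx, v in enumerate(ranked_values):
        if v == value:
            return 1 / (idx + 1)
    return 0.0

for size in [128, 256, 512, 1024]:
    print(size, reciprocal_rank(size, ranked_chunk_sizes))
# 128 -> 0.5, 256 -> 1.0, 512 -> 0.25, 1024 -> 0.2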
In [ ]:
# compute the Mean Reciprocal Rank (MRR) for each chunk size, based on its
# position in the combined, reranked result list
import pandas as pd


def mrr_all(metadata_values, metadata_key, source_nodes):
    # source_nodes is a ranked list; the reciprocal rank of a metadata value
    # is 1 / (1-based position of its first occurrence), or 0 if it never appears
    value_to_mrr_dict = {}
    for metadata_value in metadata_values:
        mrr = 0
        for idx, source_node in enumerate(source_nodes):
            if source_node.node.metadata[metadata_key] == metadata_value:
                mrr = 1 / (idx + 1)
                break
        value_to_mrr_dict[metadata_value] = mrr

    return pd.DataFrame(value_to_mrr_dict, index=["MRR"])
In [ ]:
# compute the Mean Reciprocal Rank for each chunk size (higher is better)
# we can see that the chunk size of 256 has the highest-ranked results
print("Mean Reciprocal Rank for each Chunk Size")
mrr_all(chunk_sizes, "chunk_size", response.source_nodes)
Mean Reciprocal Rank for each Chunk Size
Out[ ]:
|  | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| MRR | 0.333333 | 1.0 | 0.5 | 0.25 |
Evaluation

We more rigorously evaluate how well the ensemble retriever works compared to the "baseline" retriever.

We first define/load an evaluation benchmark dataset and then run different evaluations over it.

WARNING: this can be expensive, especially with GPT-4. Use caution and tune the sample size to fit your budget.
In [ ]:
from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset
from llama_index.llms.openai import OpenAI

import nest_asyncio

nest_asyncio.apply()
In [ ]:
# NOTE: run this if the dataset isn't already saved
eval_llm = OpenAI(model="gpt-4")

# generate questions from the largest chunks (1024)
dataset_generator = DatasetGenerator(
    nodes_list[-1],
    llm=eval_llm,
    show_progress=True,
    num_questions_per_chunk=2,
)
In [ ]:
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes(num=60)
In [ ]:
eval_dataset.save_json("data/llama2_eval_qr_dataset.json")
In [ ]:
# optional: load the dataset from disk if it was saved previously
eval_dataset = QueryResponseDataset.from_json(
    "data/llama2_eval_qr_dataset.json"
)
Compare Results
In [ ]:
import asyncio
import nest_asyncio

nest_asyncio.apply()
In [ ]:
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator,
    PairwiseComparisonEvaluator,
)

# NOTE: can uncomment other evaluators
evaluator_c = CorrectnessEvaluator(llm=eval_llm)
# semantic similarity compares embeddings rather than using an LLM judge
evaluator_s = SemanticSimilarityEvaluator()
evaluator_r = RelevancyEvaluator(llm=eval_llm)
evaluator_f = FaithfulnessEvaluator(llm=eval_llm)
pairwise_evaluator = PairwiseComparisonEvaluator(llm=eval_llm)
In [ ]:
from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
)
from llama_index.core.evaluation import BatchEvalRunner

max_samples = 60

eval_qs = eval_dataset.questions
qr_pairs = eval_dataset.qr_pairs
ref_response_strs = [r for (_, r) in qr_pairs]

# re-set up the base query engine and the ensemble query engine
# base query engine: a single vector index (1024-token chunks), top-2 similarity
base_query_engine = vector_indices[-1].as_query_engine(similarity_top_k=2)
# ensemble query engine: recursive retriever + reranker
reranker = CohereRerank(top_n=4)
query_engine = RetrieverQueryEngine(retriever, node_postprocessors=[reranker])
In [ ]:
base_pred_responses = get_responses(
    eval_qs[:max_samples], base_query_engine, show_progress=True
)
In [ ]:
pred_responses = get_responses(
    eval_qs[:max_samples], query_engine, show_progress=True
)
In [ ]:
import numpy as np

pred_response_strs = [str(p) for p in pred_responses]
base_pred_response_strs = [str(p) for p in base_pred_responses]
In [ ]:
evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    # "relevancy": evaluator_r,
    "semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=1, show_progress=True)
In [ ]:
eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)
In [ ]:
base_eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=base_pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)
In [ ]:
results_df = get_results_df(
    [eval_results, base_eval_results],
    ["Ensemble Retriever", "Base Retriever"],
    ["correctness", "faithfulness", "semantic_similarity"],
)
display(results_df)
|  | names | correctness | faithfulness | semantic_similarity |
|---|---|---|---|---|
| 0 | Ensemble Retriever | 4.375000 | 0.983333 | 0.964546 |
| 1 | Base Retriever | 4.066667 | 0.983333 | 0.956692 |
In [ ]:
batch_runner = BatchEvalRunner(
    {"pairwise": pairwise_evaluator}, workers=3, show_progress=True
)

pairwise_eval_results = await batch_runner.aevaluate_response_strs(
    queries=eval_qs[:max_samples],
    response_strs=pred_response_strs[:max_samples],
    reference=base_pred_response_strs[:max_samples],
)
In [ ]:
# the pairwise scores live in pairwise_eval_results (see the output names below)
results_df = get_results_df(
    [pairwise_eval_results],
    ["Pairwise Comparison"],
    ["pairwise"],
)
display(results_df)
Out[ ]:
|  | names | pairwise |
|---|---|---|
| 0 | Pairwise Comparison | 0.5 |
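A mean pairwise score of 0.5 indicates that, across this sample, the evaluator judged the ensemble responses and the base responses to be roughly evenly matched (wins, losses, and ties balance out).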