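Metadata Replacement + Node Sentence Window¶
In this notebook, we use the SentenceWindowNodeParser to parse documents into nodes that each contain a single sentence. Each node also carries a "window" holding the sentences immediately before and after it.
After retrieval, and before the retrieved sentences are passed to the LLM, the MetadataReplacementPostProcessor replaces each single sentence with its window of surrounding sentences.
This is most useful for large documents/indexes, as it helps to retrieve more fine-grained details.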
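The window size is configurable; the parser in this notebook is created with window_size=3, i.e. three sentences on either side of the original sentence.
In this case, chunk size settings are not used; the window settings determine how much text each node carries.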
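To make the windowing idea concrete, here is a minimal pure-Python sketch. It is not how SentenceWindowNodeParser is implemented: the regex splitter, the function name build_sentence_windows, and the plain-dict node layout are assumptions for illustration. Only the metadata keys (window, original_text) mirror the ones used in this notebook.

```python
import re


def build_sentence_windows(text, window_size=3):
    """Split text into sentences and attach a window of neighbours to each one."""
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        # Clamp the window to the document boundaries.
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,
                "metadata": {
                    "original_text": sentence,
                    "window": " ".join(sentences[start:end]),
                },
            }
        )
    return nodes


toy_text = "One. Two. Three. Four. Five."
toy_nodes = build_sentence_windows(toy_text, window_size=1)
print(toy_nodes[2]["metadata"]["window"])
```

Each node keeps a single sentence as its text (which is what gets embedded), while the wider window travels along in metadata until it is needed.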
%pip install llama-index-embeddings-openai
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-openai
%load_ext autoreload
%autoreload 2
Setup¶
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
!pip install llama-index
import os
import openai
os.environ["OPENAI_API_KEY"] = "sk-..."
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.node_parser import SentenceSplitter
# create the sentence window node parser w/ default settings
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
# base node parser is a sentence splitter
text_splitter = SentenceSplitter()
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2", max_length=512
)
from llama_index.core import Settings
Settings.llm = llm
Settings.embed_model = embed_model
Settings.text_splitter = text_splitter
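Load Data, Build the Index¶
In this section, we load data and build the vector index.
Load Data¶
Here, we build an index using Chapter 3 of the IPCC AR6 WGII climate report.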
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(
    input_files=["./IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()
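Extract Nodes¶
We extract the set of nodes that will be stored in the VectorStoreIndex. This includes both the nodes produced by the sentence window parser and the "base" nodes extracted using the standard parser.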
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes = text_splitter.get_nodes_from_documents(documents)
Build the Indexes¶
We build both the sentence index and the "base" index (with the default chunk size).
from llama_index.core import VectorStoreIndex
sentence_index = VectorStoreIndex(nodes)
base_index = VectorStoreIndex(base_nodes)
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
window_response = query_engine.query(
    "What are the concerns surrounding the AMOC?"
)
print(window_response)
There is low confidence in the quantification of Atlantic Meridional Overturning Circulation (AMOC) changes in the 20th century due to low agreement in quantitative reconstructed and simulated trends. Additionally, direct observational records since the mid-2000s remain too short to determine the relative contributions of internal variability, natural forcing, and anthropogenic forcing to AMOC change. However, it is very likely that AMOC will decline for all SSP scenarios over the 21st century, but it will not involve an abrupt collapse before 2100.
We can also inspect the original sentence that was retrieved for each node, as well as the actual window of sentences that was sent to the LLM.
window = window_response.source_nodes[0].node.metadata["window"]
sentence = window_response.source_nodes[0].node.metadata["original_text"]
print(f"Window: {window}")
print("------------------")
print(f"Original Sentence: {sentence}")
Window: Nevertheless, projected future annual cumulative upwelling wind changes at most locations and seasons remain within ±10–20% of present-day values (medium confidence) (WGI AR6 Section 9.2.3.5; Fox-Kemper et al., 2021). Continuous observation of the Atlantic meridional overturning circulation (AMOC) has improved the understanding of its variability (Frajka-Williams et al., 2019), but there is low confidence in the quantification of AMOC changes in the 20th century because of low agreement in quantitative reconstructed and simulated trends (WGI AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 2021). Direct observational records since the mid-2000s remain too short to determine the relative contributions of internal variability, natural forcing and anthropogenic forcing to AMOC change (high confidence) (WGI AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 2021). Over the 21st century, AMOC will very likely decline for all SSP scenarios but will not involve an abrupt collapse before 2100 (WGI AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021). 3.2.2.4 Sea Ice Changes Sea ice is a key driver of polar marine life, hosting unique ecosystems and affecting diverse marine organisms and food webs through its impact on light penetration and supplies of nutrients and organic matter (Arrigo, 2014). Since the late 1970s, Arctic sea ice area has decreased for all months, with an estimated decrease of 2 million km2 (or 25%) for summer sea ice (averaged for August, September and October) in 2010–2019 as compared with 1979–1988 (WGI AR6 Section 9.3.1.1; Fox-Kemper et al., 2021).
------------------
Original Sentence: Over the 21st century, AMOC will very likely decline for all SSP scenarios but will not involve an abrupt collapse before 2100 (WGI AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021).
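The swap from single sentence to window can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the library's code: the function name replace_text_with_metadata and the dict-based nodes are hypothetical, and falling back to the existing text when the key is missing is a choice made for this sketch.

```python
def replace_text_with_metadata(nodes, target_metadata_key="window"):
    """Return copies of nodes whose text is swapped for metadata[target_metadata_key]."""
    replaced = []
    for node in nodes:
        # Sketch-level choice: keep the existing text if the key is absent.
        new_text = node["metadata"].get(target_metadata_key, node["text"])
        replaced.append({**node, "text": new_text})
    return replaced


# Hypothetical retrieved node: the short sentence was embedded and matched,
# while the wider window is what should reach the LLM.
retrieved = [
    {
        "text": "AMOC will very likely decline.",
        "metadata": {
            "original_text": "AMOC will very likely decline.",
            "window": "Records are short. AMOC will very likely decline. Sea ice is changing.",
        },
    }
]
for_llm = replace_text_with_metadata(retrieved)
print(for_llm[0]["text"])
```

The input list is left untouched, so the original sentence remains available for inspection after the replacement.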
Contrast with a Normal VectorStoreIndex¶
query_engine = base_index.as_query_engine(similarity_top_k=2)
vector_response = query_engine.query(
    "What are the concerns surrounding the AMOC?"
)
print(vector_response)
The concerns surrounding the AMOC are not provided in the given context information.
Well, that didn't work. Let's bump up the top k! Compared to the sentence window index, this will be slower and use more tokens.
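A rough back-of-the-envelope comparison shows why a larger top k costs more. The numbers below are assumptions for illustration: 1024 tokens is treated as an upper bound per base chunk, and 40 tokens per sentence is a hypothetical average, not a measurement from this document.

```python
def context_tokens(top_k: int, tokens_per_node: int) -> int:
    """Upper bound on retrieved-context tokens sent to the LLM."""
    return top_k * tokens_per_node


CHUNK_TOKENS = 1024        # assumed upper bound per base node
TOKENS_PER_SENTENCE = 40   # hypothetical average, for illustration only
WINDOW_SIZE = 3            # sentences on each side, as configured above

# A window holds the original sentence plus WINDOW_SIZE neighbours per side.
window_tokens = (2 * WINDOW_SIZE + 1) * TOKENS_PER_SENTENCE

budgets = {
    "sentence window, top_k=2": context_tokens(2, window_tokens),
    "base index, top_k=2": context_tokens(2, CHUNK_TOKENS),
    "base index, top_k=5": context_tokens(5, CHUNK_TOKENS),
}
for name, tokens in budgets.items():
    print(f"{name}: <= {tokens} tokens")
```

Under these assumptions the base index at top k 5 sends several times more context than two sentence windows; actual figures depend on the tokenizer and the real chunk lengths.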
query_engine = base_index.as_query_engine(similarity_top_k=5)
vector_response = query_engine.query(
    "What are the concerns surrounding the AMOC?"
)
print(vector_response)
There are concerns surrounding the AMOC (Atlantic Meridional Overturning Circulation). The context information mentions that the AMOC will decline over the 21st century, with high confidence but low confidence for quantitative projections.
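Analysis¶
In this example, the SentenceWindowNodeParser + MetadataReplacementPostProcessor combination clearly gives the better answer. But why?
Embeddings computed at the sentence level seem to capture more fine-grained details, such as the word AMOC.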
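One way to build intuition for this is a toy bag-of-words model, used here as a crude stand-in for a real embedding model (an assumption; neural embeddings behave differently in detail). The sketch shows how a keyword match is diluted when the same sentence is buried in a longer chunk.

```python
import math
from collections import Counter


def bow(text):
    """Toy 'embedding': word counts over lowercase whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "amoc decline"
sentence = "amoc will decline this century"
# Hypothetical chunk: the same sentence followed by unrelated topics.
chunk = sentence + " sea ice salinity oxygen nutrients ph acidification warming upwelling stratification"

sentence_score = cosine(bow(query), bow(sentence))
chunk_score = cosine(bow(query), bow(chunk))
print(f"sentence: {sentence_score:.3f}  chunk: {chunk_score:.3f}")
```

The sentence scores higher than the chunk because the chunk's vector is spread over many unrelated terms, which is the dilution effect the paragraph above hints at.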
We can also compare the chunks retrieved from each index!
for source_node in window_response.source_nodes:
    print(source_node.node.metadata["original_text"])
    print("--------")
Over the 21st century, AMOC will very likely decline for all SSP scenarios but will not involve an abrupt collapse before 2100 (WGI AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021).
--------
Direct observational records since the mid-2000s remain too short to determine the relative contributions of internal variability, natural forcing and anthropogenic forcing to AMOC change (high confidence) (WGI AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 2021).
--------
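Here, we can see that the sentence window index easily retrieved two nodes that discuss AMOC. Note that the embeddings are based only on the original sentence, but the LLM actually ends up reading the surrounding context as well!
Now, let's try to work out why the naive vector index failed.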
for node in vector_response.source_nodes:
    print("AMOC mentioned?", "AMOC" in node.node.text)
    print("--------")
AMOC mentioned? False
--------
AMOC mentioned? False
--------
AMOC mentioned? True
--------
AMOC mentioned? False
--------
AMOC mentioned? False
--------
Source node [2] mentions AMOC, but what does this text actually look like?
print(vector_response.source_nodes[2].node.text)
2021; Gulev et al. 2021)The AMOC will decline over the 21st century (high confidence, but low confidence for quantitative projections).4.3.2.3, 9.2.3 (Fox-Kemper et al. 2021; Lee et al. 2021) Sea ice Arctic sea ice changes‘Current Arctic sea ice coverage levels are the lowest since at least 1850 for both annual mean and late-summer values (high confidence).’2.3.2.1, 9.3.1 (Fox-Kemper et al. 2021; Gulev et al. 2021)‘The Arctic will become practically ice-free in September by the end of the 21st century under SSP2-4.5, SSP3-7.0 and SSP5-8.5[…](high confidence).’4.3.2.1, 9.3.1 (Fox-Kemper et al. 2021; Lee et al. 2021) Antarctic sea ice changesThere is no global significant trend in Antarctic sea ice area from 1979 to 2020 (high confidence).2.3.2.1, 9.3.2 (Fox-Kemper et al. 2021; Gulev et al. 2021)There is low confidence in model simulations of future Antarctic sea ice.9.3.2 (Fox-Kemper et al. 2021) Ocean chemistry Changes in salinityThe ‘large-scale, near-surface salinity contrasts have intensified since at least 1950 […] (virtually certain).’2.3.3.2, 9.2.2.2 (Fox-Kemper et al. 2021; Gulev et al. 2021)‘Fresh ocean regions will continue to get fresher and salty ocean regions will continue to get saltier in the 21st century (medium confidence).’9.2.2.2 (Fox-Kemper et al. 2021) Ocean acidificationOcean surface pH has declined globally over the past four decades (virtually certain).2.3.3.5, 5.3.2.2 (Canadell et al. 2021; Gulev et al. 2021)Ocean surface pH will continue to decrease ‘through the 21st century, except for the lower-emission scenarios SSP1-1.9 and SSP1-2.6 […] (high confidence).’4.3.2.5, 4.5.2.2, 5.3.4.1 (Lee et al. 2021; Canadell et al. 2021) Ocean deoxygenationDeoxygenation has occurred in most open ocean regions since the mid-20th century (high confidence).2.3.3.6, 5.3.3.2 (Canadell et al. 2021; Gulev et al. 
2021)Subsurface oxygen content ‘is projected to transition to historically unprecedented condition with decline over the 21st century (medium confidence).’5.3.3.2 (Canadell et al. 2021) Changes in nutrient concentrationsNot assessed in WGI Not assessed in WGI
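So AMOC is discussed, but unfortunately it sits in the middle chunk (index 2 of the 5 retrieved). LLMs are often observed to ignore or under-use text in the middle of the retrieved context. The recent paper "Lost in the Middle" discusses this effect.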
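The position of the relevant chunk can be checked mechanically. The small helper below is hypothetical (describe_positions is not a library function); it labels each hit as sitting at an edge or in the middle of the retrieved list, using the True/False flags printed earlier as input.

```python
def describe_positions(flags):
    """Label each hit as 'edge' (first/last retrieved node) or 'middle'."""
    last = len(flags) - 1
    return [
        (i, "edge" if i in (0, last) else "middle")
        for i, hit in enumerate(flags)
        if hit
    ]


# Flags copied from the "AMOC mentioned?" output for the top-5 base retrieval.
amoc_mentioned = [False, False, True, False, False]
print(describe_positions(amoc_mentioned))
```

Here "edge" versus "middle" is a deliberately coarse split; a finer analysis could use the distance from the nearest end of the context.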
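[Optional] Evaluation¶
We run a more rigorous evaluation of how well the sentence window retriever performs compared to the base retriever.
We first define/load an evaluation benchmark dataset and then run different evaluations over it.
WARNING: This can be expensive, especially with GPT-4. Use caution and tune the sample size to fit your budget.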
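Before running anything, it can help to estimate how many LLM calls the comparison triggers. The sketch below is a rough lower bound under stated assumptions: one LLM call per query, one per LLM-based evaluation, and no calls counted for dataset generation. The function name estimate_llm_calls is hypothetical, and the real number of calls may be higher.

```python
def estimate_llm_calls(num_questions, num_query_engines, num_llm_evaluators):
    """Rough lower bound: one call per query and one per LLM-based evaluation."""
    query_calls = num_questions * num_query_engines
    eval_calls = num_questions * num_query_engines * num_llm_evaluators
    return {"query_calls": query_calls, "eval_calls": eval_calls}


# Values matching the rest of this notebook: 30 evaluated questions,
# 2 query engines (sentence window and base), and 3 LLM-based evaluators
# (correctness, relevancy, faithfulness). Semantic similarity relies on
# embeddings, so it is not counted as an LLM call here.
estimate = estimate_llm_calls(
    num_questions=30, num_query_engines=2, num_llm_evaluators=3
)
print(estimate)
```

Halving the number of questions halves both figures, which is the simplest lever for keeping costs down.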
from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset
from llama_index.llms.openai import OpenAI
import nest_asyncio
import random
nest_asyncio.apply()
len(base_nodes)
428
num_nodes_eval = 30
# there are 428 nodes total. Take the first 200 to generate questions (the back half of the doc is all references)
sample_eval_nodes = random.sample(base_nodes[:200], num_nodes_eval)
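Because random.sample above is unseeded, each run picks different nodes and therefore generates a different question set. If reproducibility matters, one option is a dedicated seeded generator, sketched here with stand-in strings instead of real nodes; the helper name sample_nodes and the seed value are arbitrary choices for illustration.

```python
import random


def sample_nodes(nodes, k, seed=42):
    """Draw k distinct items with a private, seeded RNG so reruns match."""
    rng = random.Random(seed)  # does not disturb the global random state
    return rng.sample(nodes, k)


# Stand-in for base_nodes[:200]; real nodes would be passed the same way.
stand_in_nodes = [f"node-{i}" for i in range(200)]
first = sample_nodes(stand_in_nodes, 30)
second = sample_nodes(stand_in_nodes, 30)
print(first == second)
```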
# NOTE: run this if the dataset isn't already saved
# generate questions from the largest chunks (1024)
dataset_generator = DatasetGenerator(
    sample_eval_nodes,
    llm=OpenAI(model="gpt-4"),
    show_progress=True,
    num_questions_per_chunk=2,
)
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes()
eval_dataset.save_json("data/ipcc_eval_qr_dataset.json")
# optional
eval_dataset = QueryResponseDataset.from_json("data/ipcc_eval_qr_dataset.json")
Compare Results¶
import asyncio
import nest_asyncio
nest_asyncio.apply()
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator,
    PairwiseComparisonEvaluator,
)
from collections import defaultdict
import pandas as pd
# NOTE: can uncomment other evaluators
evaluator_c = CorrectnessEvaluator(llm=OpenAI(model="gpt-4"))
evaluator_s = SemanticSimilarityEvaluator()
evaluator_r = RelevancyEvaluator(llm=OpenAI(model="gpt-4"))
evaluator_f = FaithfulnessEvaluator(llm=OpenAI(model="gpt-4"))
# pairwise_evaluator = PairwiseComparisonEvaluator(llm=OpenAI(model="gpt-4"))
from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
)
from llama_index.core.evaluation import BatchEvalRunner
max_samples = 30
eval_qs = eval_dataset.questions
ref_response_strs = [r for (_, r) in eval_dataset.qr_pairs]
# resetup base query engine and sentence window query engine
# base query engine
base_query_engine = base_index.as_query_engine(similarity_top_k=2)
# sentence window query engine
query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
import numpy as np
base_pred_responses = get_responses(
    eval_qs[:max_samples], base_query_engine, show_progress=True
)
pred_responses = get_responses(
    eval_qs[:max_samples], query_engine, show_progress=True
)
pred_response_strs = [str(p) for p in pred_responses]
base_pred_response_strs = [str(p) for p in base_pred_responses]
evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    "relevancy": evaluator_r,
    "semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)
Run evaluations over correctness, relevancy, faithfulness, and semantic similarity.
eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)
base_eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=base_pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)
results_df = get_results_df(
    [eval_results, base_eval_results],
    ["Sentence Window Retriever", "Base Retriever"],
    ["correctness", "relevancy", "faithfulness", "semantic_similarity"],
)
display(results_df)
| | names | correctness | relevancy | faithfulness | semantic_similarity |
|---|---|---|---|---|---|
| 0 | Sentence Window Retriever | 4.366667 | 0.933333 | 0.933333 | 0.959583 |
| 1 | Base Retriever | 4.216667 | 0.900000 | 0.933333 | 0.958664 |
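To read the table at a glance, the per-metric gap between the two retrievers can be computed directly. The values below are copied from the table above; the dict layout and variable names are just one convenient way to hold them.

```python
results = {
    "Sentence Window Retriever": {
        "correctness": 4.366667,
        "relevancy": 0.933333,
        "faithfulness": 0.933333,
        "semantic_similarity": 0.959583,
    },
    "Base Retriever": {
        "correctness": 4.216667,
        "relevancy": 0.900000,
        "faithfulness": 0.933333,
        "semantic_similarity": 0.958664,
    },
}

window_scores = results["Sentence Window Retriever"]
base_scores = results["Base Retriever"]

# Positive delta means the sentence window retriever scored higher.
deltas = {
    metric: round(window_scores[metric] - base_scores[metric], 6)
    for metric in window_scores
}
for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.6f}")
```

With only 30 evaluated questions the gaps are small, so they are better read as a hint than as a statistically firm ranking.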