SlideNodeParser¶
SLIDE (Sliding Localized Information for Document Extraction) is a chunking method designed to improve entity and relationship extraction from long documents, especially in low-resource language settings. It embeds local context into each chunk while staying within the LLM's context window, and is built to support GraphRAG pipelines.
SlideNodeParser implements a close approximation of this method: it uses a sliding window over neighbouring chunks to generate a short, meaningful local context for each chunk, improving downstream retrieval and reasoning quality. The technique has been shown to improve the effectiveness of graph-based retrieval-augmented generation.
Here is the technique as described in the paper:
Given a document D and a list of base chunks (C1, C2, ..., Ck) split on sentence boundaries and token counts, SLIDE builds a local context for each chunk from a fixed-size sliding window of its neighbouring chunks. An LLM summarizes this window, and the generated summary is attached to the original chunk.
The window size is a hyperparameter chosen based on the model's context length and the compute budget. Each chunk Ci is augmented with a fixed number of neighbouring chunks on either side (for example, 5 before and 5 after), so window_size + 1 chunks in total are fed to the LLM.
This process is repeated for every chunk in the document. The result is a set of chunks enriched with window-specific local context, which significantly improves knowledge graph construction and retrieval quality, especially in multilingual or resource-constrained settings.
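To make the windowing step concrete, here is a minimal sketch of the core loop. It is illustrative only, not the library's implementation: the helper names build_local_context and summarize are hypothetical, and summarize stands in for the LLM summarization call that SlideNodeParser makes internally.
In [ ]:
from typing import Callable, List

def build_local_context(
    chunks: List[str],
    i: int,
    window_size: int,
    summarize: Callable[[str, str], str],
) -> str:
    """Sketch of SLIDE's windowing: summarize the neighbours of chunks[i]."""
    half = window_size // 2
    # Clamp the window to the document bounds: up to `half` chunks on each
    # side plus the chunk itself, i.e. at most window_size + 1 chunks.
    start = max(0, i - half)
    end = min(len(chunks), i + half + 1)
    window_text = " ".join(chunks[start:end])
    # `summarize` is a placeholder for the LLM call that produces a short,
    # chunk-specific context, which is then stored in the node's metadata.
    return summarize(window_text, chunks[i])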
In [ ]:
%pip install llama-index-node-parser-slide
Install ipywidgets for progress bar support (optional)¶
In [ ]:
%pip install ipywidgets
Data preparation¶
Here we use a short sample text as the example document.
In [ ]:
text = """Constructing accurate knowledge graphs from long texts and low-resource languages is challenging, as large language models (LLMs) experience degraded performance with longer input chunks.
This problem is amplified in low-resource settings where data scarcity hinders accurate entity and relationship extraction.
Contextual retrieval methods, while improving retrieval accuracy, struggle with long documents.
They truncate critical information in texts exceeding maximum context lengths of LLMs, significantly limiting knowledge graph construction.
We introduce SLIDE (Sliding Localized Information for Document Extraction), a chunking method that processes long documents by generating local context through overlapping windows.
SLIDE ensures that essential contextual information is retained, enhancing knowledge graph extraction from documents exceeding LLM context limits.
It significantly improves GraphRAG performance, achieving a 24% increase in entity extraction and a 39% improvement in relationship extraction for English.
For Afrikaans, a low-resource language, SLIDE achieves a 49% increase in entity extraction and an 82% improvement in relationship extraction.
Furthermore, it improves upon state-of-the-art in question-answering metrics such as comprehensiveness, diversity and empowerment, demonstrating its effectiveness in multilingual and resource-constrained settings.
Since SLIDE enhances knowledge graph construction in GraphRAG systems through contextual chunking, we first discuss related work in GraphRAG and chunking, highlighting their strengths and limitations.
This sets the stage for our approach, which builds on GraphRAG by using overlapping windows to improve entity and relationship extraction.
2.1 GraphRAG and Knowledge Graphs.
GraphRAG (Edge et al., 2024) is an advanced RAG framework that integrates knowledge graphs with large language models (LLMs) (Trajanoska et al., 2023) to enhance reasoning and contextual understanding.
Unlike traditional RAG systems, GraphRAG builds a knowledge graph with entities as nodes and relationships as edges, enabling precise and context-rich responses by leveraging the graph’s structure (Edge et al., 2024; Wu et al., 2024).
Large language models (LLMs), such as GPT-4, show reduced effectiveness in entity and relationship extraction as input chunk lengths increase, degrading accuracy for longer texts (Edge et al., 2024).
They also struggle with relationship extraction in low-resource languages, limiting their applicability (Chen et al., 2024; Jinensibieke et al., 2024).
Building upon this work, our approach further enhances knowledge graph extraction by incorporating localized context which improves entity and relationship extraction.
2.2 Contextual Chunking.
Recent work in RAG systems has explored advanced chunking techniques to enhance retrieval and knowledge graph construction.
Günther et al. (2024) implemented late chunking, where entire documents are embedded to capture global context before splitting into chunks, improving retrieval by emphasizing document-level coherence.
However, this focus on global embeddings is less suited for knowledge graph construction.
Our method instead uses localized context from raw text to retain meaningful relationships for improved entity and relationship extraction.
Wu et al. (2024) introduced a hybrid chunking approach for Medical Graph RAG, combining structural markers like paragraphs with semantic coherence to produce self-contained chunks.
While effective, this approach relies on predefined boundaries.
Our method extends this by generating contextual information from neighboring chunks, enhancing the completeness of knowledge graph construction.
Contextual retrieval (Anthropic, 2024) improves accuracy but struggles with longer documents, as embedding each chunk with full document context is computationally expensive and truncates critical information with documents exceeding maximum context length of the model (Jiang et al., 2024; Li et al., 2024).
Our overlapping window-based approach addresses these inefficiencies, improving performance in both retrieval and knowledge graph construction.
"""
from llama_index.core import Document
document = Document(text=text)
text = """Constructing accurate knowledge graphs from long texts and low-resource languages is challenging, as large language models (LLMs) experience degraded performance with longer input chunks.
This problem is amplified in low-resource settings where data scarcity hinders accurate entity and relationship extraction.
Contextual retrieval methods, while improving retrieval accuracy, struggle with long documents.
They truncate critical information in texts exceeding maximum context lengths of LLMs, significantly limiting knowledge graph construction.
We introduce SLIDE (Sliding Localized Information for Document Extraction), a chunking method that processes long documents by generating local context through overlapping windows.
SLIDE ensures that essential contextual information is retained, enhancing knowledge graph extraction from documents exceeding LLM context limits.
It significantly improves GraphRAG performance, achieving a 24% increase in entity extraction and a 39% improvement in relationship extraction for English.
For Afrikaans, a low-resource language, SLIDE achieves a 49% increase in entity extraction and an 82% improvement in relationship extraction.
Furthermore, it improves upon state-of-the-art in question-answering metrics such as comprehensiveness, diversity and empowerment, demonstrating its effectiveness in multilingual and resource-constrained settings.
Since SLIDE enhances knowledge graph construction in GraphRAG systems through contextual chunking, we first discuss related work in GraphRAG and chunking, highlighting their strengths and limitations.
This sets the stage for our approach, which builds on GraphRAG by using overlapping windows to improve entity and relationship extraction.
2.1 GraphRAG and Knowledge Graphs.
GraphRAG (Edge et al., 2024) is an advanced RAG framework that integrates knowledge graphs with large language models (LLMs) (Trajanoska et al., 2023) to enhance reasoning and contextual understanding.
Unlike traditional RAG systems, GraphRAG builds a knowledge graph with entities as nodes and relationships as edges, enabling precise and context-rich responses by leveraging the graph’s structure (Edge et al., 2024; Wu et al., 2024).
Large language models (LLMs), such as GPT-4, show reduced effectiveness in entity and relationship extraction as input chunk lengths increase, degrading accuracy for longer texts (Edge et al., 2024).
They also struggle with relationship extraction in low-resource languages, limiting their applicability (Chen et al., 2024; Jinensibieke et al., 2024).
Building upon this work, our approach further enhances knowledge graph extraction by incorporating localized context which improves entity and relationship extraction.
2.2 Contextual Chunking.
Recent work in RAG systems has explored advanced chunking techniques to enhance retrieval and knowledge graph construction.
Günther et al. (2024) implemented late chunking, where entire documents are embedded to capture global context before splitting into chunks, improving retrieval by emphasizing document-level coherence.
However, this focus on global embeddings is less suited for knowledge graph construction.
Our method instead uses localized context from raw text to retain meaningful relationships for improved entity and relationship extraction.
Wu et al. (2024) introduced a hybrid chunking approach for Medical Graph RAG, combining structural markers like paragraphs with semantic coherence to produce self-contained chunks.
While effective, this approach relies on predefined boundaries.
Our method extends this by generating contextual information from neighboring chunks, enhancing the completeness of knowledge graph construction.
Contextual retrieval (Anthropic, 2024) improves accuracy but struggles with longer documents, as embedding each chunk with full document context is computationally expensive and truncates critical information with documents exceeding maximum context length of the model (Jiang et al., 2024; Li et al., 2024).
Our overlapping window-based approach addresses these inefficiencies, improving performance in both retrieval and knowledge graph construction.
"""
from llama_index.core import Document
document = Document(text=text)
Set up the LLM¶
In [ ]:
import os
os.environ["OPENAI_API_KEY"] = "sk-..." # Replace with your OpenAI API key
In [ ]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
embed_model = OpenAIEmbedding()
llm = OpenAI(model="gpt-4o-mini")
In [ ]:
# Calculate token count of the text
from llama_index.core.utilities.token_counting import TokenCounter
token_counter = TokenCounter()
token_count = token_counter.get_string_tokens(text)
print(f"Token count: {token_count}")
Token count: 759
Set up SlideNodeParser¶
In [ ]:
# Let's choose a chunk size of 200 tokens and a window size of 5,
# and pass in the LLM defined above to generate each chunk's local context
from llama_index.node_parser.slide import SlideNodeParser

parser = SlideNodeParser.from_defaults(
    llm=llm,
    chunk_size=200,
    window_size=5,
)
Run the synchronous (blocking) version¶
In [ ]:
import time
start_time = time.time()
nodes = parser.get_nodes_from_documents([document], show_progress=True)
end_time = time.time()
print(f"Time taken to parse: {end_time - start_time} seconds")
Inspect the chunks¶
In [ ]:
for i, node in enumerate(nodes):
    print(f"\n--- Chunk {i+1} ---")
    print("Text:", node.text)
    print("Local Context:", node.metadata.get("local_context"))
--- Chunk 1 ---
Text: Constructing accurate knowledge graphs from long texts and low-resource languages is challenging, as large language models (LLMs) experience degraded performance with longer input chunks. This problem is amplified in low-resource settings where data scarcity hinders accurate entity and relationship extraction. Contextual retrieval methods, while improving retrieval accuracy, struggle with long documents. They truncate critical information in texts exceeding maximum context lengths of LLMs, significantly limiting knowledge graph construction. We introduce SLIDE (Sliding Localized Information for Document Extraction), a chunking method that processes long documents by generating local context through overlapping windows. SLIDE ensures that essential contextual information is retained, enhancing knowledge graph extraction from documents exceeding LLM context limits. It significantly improves GraphRAG performance, achieving a 24% increase in entity extraction and a 39% improvement in relationship extraction for English. For Afrikaans, a low-resource language, SLIDE achieves a 49% increase in entity extraction and an 82% improvement in relationship extraction.
Local Context: assistant: The chunk provided introduces SLIDE (Sliding Localized Information for Document Extraction), a method that addresses the challenges of constructing accurate knowledge graphs from long texts and low-resource languages. It highlights how SLIDE improves knowledge graph extraction by processing long documents with overlapping windows, enhancing entity and relationship extraction performance significantly for both English and Afrikaans languages within the GraphRAG framework.

--- Chunk 2 ---
Text: Furthermore, it improves upon state-of-the-art in question-answering metrics such as comprehensiveness, diversity and empowerment, demonstrating its effectiveness in multilingual and resource-constrained settings. Since SLIDE enhances knowledge graph construction in GraphRAG systems through contextual chunking, we first discuss related work in GraphRAG and chunking, highlighting their strengths and limitations. This sets the stage for our approach, which builds on GraphRAG by using overlapping windows to improve entity and relationship extraction. 2.1 GraphRAG and Knowledge Graphs. GraphRAG (Edge et al., 2024) is an advanced RAG framework that integrates knowledge graphs with large language models (LLMs) (Trajanoska et al., 2023) to enhance reasoning and contextual understanding.
Local Context: assistant: The chunk provided discusses how SLIDE enhances knowledge graph construction in GraphRAG systems through contextual chunking. It also introduces GraphRAG as an advanced framework that integrates knowledge graphs with large language models to enhance reasoning and contextual understanding.

--- Chunk 3 ---
Text: Unlike traditional RAG systems, GraphRAG builds a knowledge graph with entities as nodes and relationships as edges, enabling precise and context-rich responses by leveraging the graph’s structure (Edge et al., 2024; Wu et al., 2024). Large language models (LLMs), such as GPT-4, show reduced effectiveness in entity and relationship extraction as input chunk lengths increase, degrading accuracy for longer texts (Edge et al., 2024). They also struggle with relationship extraction in low-resource languages, limiting their applicability (Chen et al., 2024; Jinensibieke et al., 2024). Building upon this work, our approach further enhances knowledge graph extraction by incorporating localized context which improves entity and relationship extraction. 2.2 Contextual Chunking. Recent work in RAG systems has explored advanced chunking techniques to enhance retrieval and knowledge graph construction. Günther et al.
Local Context: assistant: The chunk provided discusses the unique approach of GraphRAG in constructing knowledge graphs using entities and relationships, highlighting its advantages over traditional RAG systems. It also addresses the challenges faced by large language models in entity and relationship extraction, particularly in longer texts and low-resource languages. The chunk further introduces the enhancement of knowledge graph extraction through localized context and references recent advancements in chunking techniques within RAG systems.

--- Chunk 4 ---
Text: (2024) implemented late chunking, where entire documents are embedded to capture global context before splitting into chunks, improving retrieval by emphasizing document-level coherence. However, this focus on global embeddings is less suited for knowledge graph construction. Our method instead uses localized context from raw text to retain meaningful relationships for improved entity and relationship extraction. Wu et al. (2024) introduced a hybrid chunking approach for Medical Graph RAG, combining structural markers like paragraphs with semantic coherence to produce self-contained chunks. While effective, this approach relies on predefined boundaries. Our method extends this by generating contextual information from neighboring chunks, enhancing the completeness of knowledge graph construction. Contextual retrieval (Anthropic, 2024) improves accuracy but struggles with longer documents, as embedding each chunk with full document context is computationally expensive and truncates critical information with documents exceeding maximum context length of the model (Jiang et al., 2024; Li et al., 2024).
Local Context: assistant: This chunk discusses different chunking approaches in the context of knowledge graph construction within RAG systems. It contrasts the use of global embeddings for document-level coherence with localized context for improved entity and relationship extraction. It also mentions a hybrid chunking approach introduced by Wu et al. for Medical Graph RAG, highlighting the importance of contextual information from neighboring chunks in enhancing knowledge graph completeness.

--- Chunk 5 ---
Text: Our overlapping window-based approach addresses these inefficiencies, improving performance in both retrieval and knowledge graph construction.
Local Context: assistant: The chunk provided discusses an overlapping window-based approach that aims to address inefficiencies in retrieval and knowledge graph construction, as part of a broader discussion on enhancing RAG systems through advanced chunking techniques and contextual retrieval methods.
Now let's run the asynchronous version, which supports parallel LLM calls¶
In [ ]:
# Use 4 parallel LLM workers for context generation
parser.llm_workers = 4
start_time = time.time()
nodes = await parser.aget_nodes_from_documents([document], show_progress=True)
end_time = time.time()
print(f"Time taken to parse: {end_time - start_time} seconds")
Inspect the chunks¶
In [ ]:
for i, node in enumerate(nodes):
    print(f"\n--- Chunk {i+1} ---")
    print("Text:", node.text)
    print("Local Context:", node.metadata.get("local_context"))
--- Chunk 1 ---
Text: Constructing accurate knowledge graphs from long texts and low-resource languages is challenging, as large language models (LLMs) experience degraded performance with longer input chunks. This problem is amplified in low-resource settings where data scarcity hinders accurate entity and relationship extraction. Contextual retrieval methods, while improving retrieval accuracy, struggle with long documents. They truncate critical information in texts exceeding maximum context lengths of LLMs, significantly limiting knowledge graph construction. We introduce SLIDE (Sliding Localized Information for Document Extraction), a chunking method that processes long documents by generating local context through overlapping windows. SLIDE ensures that essential contextual information is retained, enhancing knowledge graph extraction from documents exceeding LLM context limits. It significantly improves GraphRAG performance, achieving a 24% increase in entity extraction and a 39% improvement in relationship extraction for English. For Afrikaans, a low-resource language, SLIDE achieves a 49% increase in entity extraction and an 82% improvement in relationship extraction.
Local Context: assistant: The chunk provided introduces SLIDE (Sliding Localized Information for Document Extraction), a method that addresses the challenges of constructing accurate knowledge graphs from long texts and low-resource languages. It highlights how SLIDE improves knowledge graph extraction by processing long documents with overlapping windows, enhancing entity and relationship extraction performance significantly for both English and Afrikaans languages within the GraphRAG framework.

--- Chunk 2 ---
Text: Furthermore, it improves upon state-of-the-art in question-answering metrics such as comprehensiveness, diversity and empowerment, demonstrating its effectiveness in multilingual and resource-constrained settings. Since SLIDE enhances knowledge graph construction in GraphRAG systems through contextual chunking, we first discuss related work in GraphRAG and chunking, highlighting their strengths and limitations. This sets the stage for our approach, which builds on GraphRAG by using overlapping windows to improve entity and relationship extraction. 2.1 GraphRAG and Knowledge Graphs. GraphRAG (Edge et al., 2024) is an advanced RAG framework that integrates knowledge graphs with large language models (LLMs) (Trajanoska et al., 2023) to enhance reasoning and contextual understanding.
Local Context: assistant: The chunk provided discusses how SLIDE enhances knowledge graph construction in GraphRAG systems through contextual chunking. It also introduces related work in GraphRAG and chunking, setting the stage for the approach that builds on GraphRAG by using overlapping windows to improve entity and relationship extraction.

--- Chunk 3 ---
Text: Unlike traditional RAG systems, GraphRAG builds a knowledge graph with entities as nodes and relationships as edges, enabling precise and context-rich responses by leveraging the graph’s structure (Edge et al., 2024; Wu et al., 2024). Large language models (LLMs), such as GPT-4, show reduced effectiveness in entity and relationship extraction as input chunk lengths increase, degrading accuracy for longer texts (Edge et al., 2024). They also struggle with relationship extraction in low-resource languages, limiting their applicability (Chen et al., 2024; Jinensibieke et al., 2024). Building upon this work, our approach further enhances knowledge graph extraction by incorporating localized context which improves entity and relationship extraction. 2.2 Contextual Chunking. Recent work in RAG systems has explored advanced chunking techniques to enhance retrieval and knowledge graph construction. Günther et al.
Local Context: assistant: The chunk provided discusses how GraphRAG differs from traditional RAG systems by constructing knowledge graphs with entities as nodes and relationships as edges, leading to precise responses. It also highlights the challenges faced by large language models in entity and relationship extraction, especially in longer texts and low-resource languages. The approach presented in the chunk aims to enhance knowledge graph extraction by incorporating localized context to improve entity and relationship extraction, building upon existing research in contextual chunking techniques within RAG systems.

--- Chunk 4 ---
Text: (2024) implemented late chunking, where entire documents are embedded to capture global context before splitting into chunks, improving retrieval by emphasizing document-level coherence. However, this focus on global embeddings is less suited for knowledge graph construction. Our method instead uses localized context from raw text to retain meaningful relationships for improved entity and relationship extraction. Wu et al. (2024) introduced a hybrid chunking approach for Medical Graph RAG, combining structural markers like paragraphs with semantic coherence to produce self-contained chunks. While effective, this approach relies on predefined boundaries. Our method extends this by generating contextual information from neighboring chunks, enhancing the completeness of knowledge graph construction. Contextual retrieval (Anthropic, 2024) improves accuracy but struggles with longer documents, as embedding each chunk with full document context is computationally expensive and truncates critical information with documents exceeding maximum context length of the model (Jiang et al., 2024; Li et al., 2024).
Local Context: assistant: This chunk discusses different approaches to chunking in the context of knowledge graph construction within RAG systems. It contrasts the use of global embeddings for document-level coherence with the utilization of localized context for improved entity and relationship extraction. Additionally, it mentions a hybrid chunking approach introduced by Wu et al. for Medical Graph RAG, highlighting the importance of generating contextual information from neighboring chunks to enhance knowledge graph completeness.

--- Chunk 5 ---
Text: Our overlapping window-based approach addresses these inefficiencies, improving performance in both retrieval and knowledge graph construction.
Local Context: assistant: The chunk provided discusses an overlapping window-based approach that addresses inefficiencies in retrieval and knowledge graph construction, aiming to improve performance within the broader context of advanced chunking techniques and knowledge graph extraction strategies discussed in the document.
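The enriched nodes can be used like any other LlamaIndex nodes. As a minimal downstream sketch (not part of the original example), you could index them with the embed_model defined earlier and query over them; note that whether the local_context metadata is included in the embedded text depends on each node's excluded metadata keys, and the query string here is just an illustration.
In [ ]:
from llama_index.core import VectorStoreIndex

# Build a vector index over the context-enriched nodes using the
# embed_model defined earlier, then query it with the same LLM.
index = VectorStoreIndex(nodes=nodes, embed_model=embed_model)
query_engine = index.as_query_engine(llm=llm)
print(query_engine.query("What does SLIDE improve?"))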