TopicNodeParser¶

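MedGraphRAG aims to improve the capabilities of LLMs in the medical domain by generating evidence-based results through a novel graph-based Retrieval-Augmented Generation framework, improving safety and reliability when handling private medical data.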

TopicNodeParser implements an approximation of the chunking technique described in the paper.

Here is the technique as outlined in the paper:

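Large medical documents often contain multiple themes or diverse content. To process them effectively, we first segment each document into data chunks that fit within the context limits of Large Language Models (LLMs). Traditional chunking methods, such as splitting by token count or by a fixed number of characters, typically fail to detect subtle shifts in topic. As a result, the chunks may not fully capture the intended context, which leads to a loss of semantic richness.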
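To make the limitation concrete, here is a minimal sketch of fixed-size character chunking next to a plain newline split. The function name `fixed_size_chunks` and the two-sentence sample are illustrative assumptions, not part of the paper or of `TopicNodeParser`.

```python
from typing import List


def fixed_size_chunks(text: str, size: int) -> List[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i : i + size] for i in range(0, len(text), size)]


# Two unrelated statements, one per paragraph (hypothetical sample).
sample = "Aspirin reduces fever.\nInsulin regulates blood sugar."

# A fixed window of 30 characters ignores the paragraph boundary,
# so the first chunk ends in the middle of the second statement.
chunks = fixed_size_chunks(sample, 30)

# Splitting on the newline keeps each statement intact.
paragraphs = sample.split("\n")

print(chunks)
print(paragraphs)
```

The first fixed-size chunk mixes both topics, while the newline split keeps them apart. The paper's method starts from that newline split and then refines it by topic.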

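To improve accuracy, we adopt a hybrid method that combines character-based separation with topic-based segmentation. Specifically, we use static characters (line breaks) to isolate the individual paragraphs of a document, and then apply semantic chunking to a derived form of the text. Our approach uses proposition transfer (Chen et al., 2023), which extracts standalone statements from raw text. Through proposition transfer, each paragraph is transformed into self-contained statements. We then analyze the document sequentially, assessing each proposition and deciding whether it should be merged into an existing chunk or start a new one. This decision is made by an LLM in a zero-shot manner. To reduce the noise produced by sequential processing, we use a sliding window technique that handles five paragraphs at a time. We continuously adjust the window by removing the first paragraph and adding the next one, keeping the focus on topic consistency. We set a hard threshold so that the longest chunk cannot exceed the context length limit of the LLM. After chunking the document, we construct a graph on each individual data chunk.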
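The sequential merge-or-split loop described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the `TopicNodeParser` implementation: `topic_chunks` and `shares_first_word` are hypothetical names, the window is taken over the propositions of the current chunk rather than over paragraphs, chunk size is measured in characters, and the LLM's zero-shot judgment is replaced by a toy predicate so the example runs offline.

```python
from typing import Callable, List


def topic_chunks(
    propositions: List[str],
    same_topic: Callable[[List[str], str], bool],
    window_size: int = 5,
    max_chunk_size: int = 1000,
) -> List[str]:
    """Greedily merge propositions into chunks, starting a new chunk on a
    topic change or when the size limit would be exceeded."""
    chunks: List[str] = []
    current: List[str] = []
    for prop in propositions:
        # Only the most recent propositions are shown to the topic judge.
        window = current[-window_size:]
        candidate_len = len(" ".join(current + [prop]))
        too_long = candidate_len > max_chunk_size
        if current and (too_long or not same_topic(window, prop)):
            chunks.append(" ".join(current))
            current = []
        current.append(prop)
    if current:
        chunks.append(" ".join(current))
    return chunks


def shares_first_word(window: List[str], prop: str) -> bool:
    """Toy stand-in for the LLM decision: same topic if the first word matches."""
    first = prop.split()[0]
    return any(p.split()[0] == first for p in window)


props = [
    "Graph retrieval links entities.",
    "Graph retrieval cites sources.",
    "Benchmarks include PubMedQA.",
    "Benchmarks include MedMCQA.",
]
result = topic_chunks(props, shares_first_word)
for chunk in result:
    print(chunk)
```

Swapping `shares_first_word` for a function that asks an LLM, or one that compares embeddings, changes the topic judgment without touching the loop.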

%pip install llama-index llama-index-node-parser-topic

Setup Data¶

Here we use a sample text as a reference.

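Note: The propositions are generated by an LLM, which can lead to longer processing times when creating nodes. Keep this in mind when experimenting.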

text = """In this paper, we introduce a novel graph RAG method for applying LLMs to the medical domain, which we refer to as Medical Graph RAG (MedRAG). This technique improves LLM performance in the medical domain by response queries with grounded source citations and clear interpretations of medical terminology, boosting the transparency and interpretability of the results. This approach involves a three-tier hierarchical graph construction method. Initially, we use documents provided by users as our top-level source to extract entities. These entities are then linked to a second level consisting of more basic entities previously abstracted from credible medical books and papers. Subsequently, these entities are connected to a third level—the fundamental medical dictionary graph—that provides detailed explanations of each medical term and their semantic relationships. We then construct a comprehensive graph at the highest level by linking entities based on their content and hierarchical connections. This method ensures that the knowledge can be traced back to its sources and the results are factually accurate.

To respond to user queries, we implement a U-retrieve strategy that combines top-down retrieval with bottom-up response generation. The process begins by structuring the query using predefined medical tags and indexing them through the graphs in a top-down manner. The system then generates responses based on these queries, pulling from meta-graphs—nodes retrieved along with their TopK related nodes and relationships—and summarizing the information into a detailed response. This technique maintains a balance between global context awareness and the contextual limitations inherent in LLMs.

Our medical graph RAG provides Intrinsic source citation can enhance LLM transparency, interpretability, and verifiability. The results provides the provenance, or source grounding information, as it generates each response, and demonstrates that an answer is grounded in the dataset. Having the cited source for each assertion readily available also enables a human user to quickly and accurately audit the LLM’s output directly against the original source material. It is super useful in the field of medicine that security is very important, and each of the reasoning should be evidence-based. By using such a method, we construct an evidence-based Medical LLM that the clinician could easiely check the source of the reasoning and calibrate the model response to ensure the safty usage of llm in the clinical senarios.

To evaluate our medical graph RAG, we implemented the method on several popular open and closed-source LLMs, including ChatGPT OpenAI (2023a) and LLaMA Touvron et al. (2023), testing them across mainstream medical Q&A benchmarks such as PubMedQA Jin et al. (2019), MedMCQA Pal et al. (2022), and USMLE Kung et al. (2023). For the RAG process, we supplied a comprehensive medical dictionary as the foundational knowledge layer, the UMLS medical knowledge graph Lindberg et al. (1993) as the foundamental layer detailing semantic relationships, and a curated MedC-K dataset Wu et al. (2023) —comprising the latest medical papers and books—as the intermediate level of data to simulate user-provided private data. Our experiments demonstrate that our model significantly enhances the performance of general-purpose LLMs on medical questions. Remarkably, it even surpasses many fine-tuned or specially trained LLMs on medical corpora, solely using the RAG approach without additional training.
"""

from llama_index.core import Document

documents = [Document(text=text)]

print(documents[0].get_content())

In this paper, we introduce a novel graph RAG method for applying LLMs to the medical domain, which we refer to as Medical Graph RAG (MedRAG). This technique improves LLM performance in the medical domain by response queries with grounded source citations and clear interpretations of medical terminology, boosting the transparency and interpretability of the results. This approach involves a three-tier hierarchical graph construction method. Initially, we use documents provided by users as our top-level source to extract entities. These entities are then linked to a second level consisting of more basic entities previously abstracted from credible medical books and papers. Subsequently, these entities are connected to a third level—the fundamental medical dictionary graph—that provides detailed explanations of each medical term and their semantic relationships. We then construct a comprehensive graph at the highest level by linking entities based on their content and hierarchical connections. This method ensures that the knowledge can be traced back to its sources and the results are factually accurate.

To respond to user queries, we implement a U-retrieve strategy that combines top-down retrieval with bottom-up response generation. The process begins by structuring the query using predefined medical tags and indexing them through the graphs in a top-down manner. The system then generates responses based on these queries, pulling from meta-graphs—nodes retrieved along with their TopK related nodes and relationships—and summarizing the information into a detailed response. This technique maintains a balance between global context awareness and the contextual limitations inherent in LLMs.

Our medical graph RAG provides Intrinsic source citation can enhance LLM transparency, interpretability, and verifiability. The results provides the provenance, or source grounding information, as it generates each response, and demonstrates that an answer is grounded in the dataset. Having the cited source for each assertion readily available also enables a human user to quickly and accurately audit the LLM’s output directly against the original source material. It is super useful in the field of medicine that security is very important, and each of the reasoning should be evidence-based. By using such a method, we construct an evidence-based Medical LLM that the clinician could easiely check the source of the reasoning and calibrate the model response to ensure the safty usage of llm in the clinical senarios.

To evaluate our medical graph RAG, we implemented the method on several popular open and closed-source LLMs, including ChatGPT OpenAI (2023a) and LLaMA Touvron et al. (2023), testing them across mainstream medical Q&A benchmarks such as PubMedQA Jin et al. (2019), MedMCQA Pal et al. (2022), and USMLE Kung et al. (2023). For the RAG process, we supplied a comprehensive medical dictionary as the foundational knowledge layer, the UMLS medical knowledge graph Lindberg et al. (1993) as the foundamental layer detailing semantic relationships, and a curated MedC-K dataset Wu et al. (2023) —comprising the latest medical papers and books—as the intermediate level of data to simulate user-provided private data. Our experiments demonstrate that our model significantly enhances the performance of general-purpose LLMs on medical questions. Remarkably, it even surpasses many fine-tuned or specially trained LLMs on medical corpora, solely using the RAG approach without additional training.

Setup LLM and Embedding Model¶

import os

os.environ["OPENAI_API_KEY"] = "sk-..."  # Replace with your OpenAI API key

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

embed_model = OpenAIEmbedding()
llm = OpenAI(model="gpt-4o-mini")

Define TopicNodeParser¶

from llama_index.node_parser.topic import TopicNodeParser

LLM-Based Topic Similarity¶

node_parser = TopicNodeParser.from_defaults(
    llm=llm,
    max_chunk_size=1000,
    similarity_method="llm",  # can be "llm" or "embedding"
    window_size=2,  # paper suggests window_size=5
)
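For intuition about `similarity_method="llm"`, the sketch below shows one way a zero-shot same-topic question could be posed to a model. The prompt wording, the `belongs_to_chunk` helper, and the `fake_llm` stub are all assumptions made for illustration. The prompt that `TopicNodeParser` actually sends is not shown here and may differ.

```python
from typing import Callable, List

# Hypothetical prompt template; not the one used by TopicNodeParser.
PROMPT = (
    "Existing chunk:\n{chunk}\n\n"
    "New proposition:\n{proposition}\n\n"
    "Does the new proposition belong to the same topic as the existing chunk? "
    "Answer 'yes' or 'no'."
)


def belongs_to_chunk(
    complete: Callable[[str], str], chunk: List[str], proposition: str
) -> bool:
    """Ask a text-completion function whether the proposition fits the chunk."""
    prompt = PROMPT.format(chunk=" ".join(chunk), proposition=proposition)
    answer = complete(prompt)
    return answer.strip().lower().startswith("yes")


def fake_llm(prompt: str) -> str:
    """Offline stub: answers yes only if 'graph' appears at least twice."""
    return "Yes." if prompt.lower().count("graph") >= 2 else "No."


chunk = ["The graph links entities."]
print(belongs_to_chunk(fake_llm, chunk, "The graph has three tiers."))
print(belongs_to_chunk(fake_llm, chunk, "Benchmarks include PubMedQA."))
```

With a real model, a wrapper such as `lambda p: llm.complete(p).text` could be passed as `complete`. Each call costs one LLM request, which is why the LLM-based method is slower than the embedding-based one.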

nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)

Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

Let's inspect the chunks¶

print(nodes[0].get_content())

This paper introduces a novel graph RAG method for applying LLMs to the medical domain. The novel graph RAG method is referred to as Medical Graph RAG (MedRAG). The Medical Graph RAG technique improves LLM performance in the medical domain. The Medical Graph RAG technique responds to queries with grounded source citations. The Medical Graph RAG technique provides clear interpretations of medical terminology. The Medical Graph RAG technique boosts the transparency of the results. The Medical Graph RAG technique boosts the interpretability of the results. The Medical Graph RAG approach involves a three-tier hierarchical graph construction method.

print(nodes[1].get_content())

Documents provided by users are used as the top-level source to extract entities. The extracted entities are linked to a second level consisting of more basic entities. The more basic entities are previously abstracted from credible medical books and papers. The extracted entities are connected to a third level, which is the fundamental medical dictionary graph. The fundamental medical dictionary graph provides detailed explanations of each medical term. The fundamental medical dictionary graph provides the semantic relationships of medical terms.

print(nodes[2].get_content())

A comprehensive graph is constructed at the highest level by linking entities based on their content. A comprehensive graph is constructed at the highest level by linking entities based on their hierarchical connections.

Embedding-Based Topic Similarity¶

node_parser = TopicNodeParser.from_defaults(
    embed_model=embed_model,
    llm=llm,
    max_chunk_size=1000,
    similarity_method="embedding",  # can be "llm" or "embedding"
    similarity_threshold=0.8,
    window_size=2,  # paper suggests window_size=5
)
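As a rough picture of what a `similarity_threshold` of 0.8 means, the sketch below compares two vectors with cosine similarity and treats them as the same topic when the score reaches the threshold. That the parser uses cosine similarity, and whether the comparison is inclusive, are assumptions here. The helper names and the tiny two-dimensional vectors are hypothetical stand-ins for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def same_topic(
    a: Sequence[float], b: Sequence[float], threshold: float = 0.8
) -> bool:
    """Treat two embeddings as the same topic when similarity meets the threshold."""
    return cosine_similarity(a, b) >= threshold


# Toy 2-D "embeddings" (real embedding vectors have many more dimensions).
print(same_topic([1.0, 0.0], [0.9, 0.1]))  # nearly parallel vectors
print(same_topic([1.0, 0.0], [1.0, 1.0]))  # 45 degrees apart
```

Under these assumptions, raising the threshold makes the parser start new chunks more readily, and lowering it merges more propositions into each chunk.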

nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)

Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

Let's inspect the chunks¶

print(nodes[0].get_content())

This paper introduces a novel graph RAG method for applying LLMs to the medical domain. The novel graph RAG method is referred to as Medical Graph RAG (MedRAG). The Medical Graph RAG technique improves LLM performance in the medical domain. The Medical Graph RAG technique responds to queries with grounded source citations. The Medical Graph RAG technique provides clear interpretations of medical terminology. The Medical Graph RAG technique boosts the transparency of the results. The Medical Graph RAG technique boosts the interpretability of the results. The Medical Graph RAG approach involves a three-tier hierarchical graph construction method.

print(nodes[1].get_content())

Documents provided by users are used as the top-level source to extract entities. The extracted entities are linked to a second level consisting of more basic entities. The more basic entities are previously abstracted from credible medical books and papers. The extracted entities are connected to a third level, which is the fundamental medical dictionary graph. The fundamental medical dictionary graph provides detailed explanations of each medical term. The fundamental medical dictionary graph provides semantic relationships between medical terms. A comprehensive graph is constructed at the highest level by linking entities based on their content. A comprehensive graph is constructed at the highest level by linking entities based on their hierarchical connections. The Medical Graph RAG method ensures that the knowledge can be traced back to its sources. The Medical Graph RAG method ensures that the results are factually accurate.

print(nodes[2].get_content())

The U-retrieve strategy is implemented to respond to user queries. The U-retrieve strategy combines top-down retrieval with bottom-up response generation.