基于 LlamaIndex 的 GraphRAG 实现¶

GraphRAG（图结构 + 检索增强生成）结合了检索增强生成（RAG）和查询聚焦摘要（QFS）的优势，能够高效处理大规模文本数据集上的复杂查询。虽然 RAG 擅长获取精确信息，但在需要主题理解的宽泛查询上表现欠佳，而 QFS 虽能解决这一问题却难以扩展。GraphRAG 整合这两种方法，为海量多样化文本语料库提供响应迅速且全面的查询能力。

本指南将介绍如何使用 LlamaIndex 的 PropertyGraph 抽象构建 GraphRAG 流水线。

注意： 这是 GraphRAG 的近似实现方案。我们正在开发一系列技术手册来详细说明 GraphRAG 的精确实现方法。

GraphRAG 方法¶

GraphRAG 包含两个步骤：

图生成 - 基于给定文档创建图谱，构建社区及其摘要
查询应答 - 使用步骤1生成的社区摘要来回答查询

图生成过程：

源文档分块处理： 将源文档分割为更小的文本块以便处理
文本块到元素实例： 分析每个文本块以识别和提取实体及关系，生成表示这些元素的元组列表
元素实例到元素摘要： 利用大语言模型将提取的实体和关系汇总为每个元素的描述性文本块
元素摘要到图社区： 这些实体、关系和摘要构成图谱，随后使用分层Leiden算法进行社区划分，建立层次结构
图社区到社区摘要： 由大语言模型为每个社区生成摘要，揭示数据集的整体主题结构和语义信息

查询应答过程：

社区摘要到全局答案： 利用社区摘要响应用户查询。首先生成中间答案，最终整合形成全面的全局答案。

GraphRAG 流水线组件¶

以下是我们为实现上述所有流程而构建的不同组件：

源文档到文本分块：使用 SentenceSplitter 实现，分块大小为 1024，分块重叠为 20 个 token。
文本分块到元素实例 AND 元素实例到元素摘要：使用 GraphRAGExtractor 实现。
元素摘要到图社区 AND 图社区到社区摘要：使用 GraphRAGStore 实现。
社区摘要到全局答案：使用 GraphQueryEngine 实现。

接下来我们将逐一检查这些组件并构建 GraphRAG 流水线。

安装¶

graspologic 用于通过 hierarchical_leiden 构建社区。

In [ ]:

Copied!

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0
!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

加载数据¶

我们将使用从Diffbot获取的新闻文章样本数据集，该数据集已由Tomaz上传至GitHub以便于访问。

该数据集包含2,500个样本；为便于实验，我们将使用其中的50个样本，这些样本包含新闻文章的title（标题）和text（正文）信息。

In [ ]:

Copied!





import pandas as pd
from llama_index.core import Document

news = pd.read_csv(
    "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
)[:50]

news.head()
import pandas as pd
from llama_index.core import Document

news = pd.read_csv(
    "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
)[:50]

news.head()

Out[ ]:

	title	date	text
0	Chevron: Best Of Breed	2031-04-06T01:36:32.000000000+00:00	JHVEPhoto Like many companies in the O&G secto...
1	FirstEnergy (NYSE:FE) Posts Earnings Results	2030-04-29T06:55:28.000000000+00:00	FirstEnergy (NYSE:FE – Get Rating) posted its ...
2	Dáil almost suspended after Sinn Féin TD put p...	2023-06-15T14:32:11.000000000+00:00	The Dáil was almost suspended on Thursday afte...
3	Epic’s latest tool can animate hyperrealistic ...	2023-06-15T14:00:00.000000000+00:00	Today, Epic is releasing a new tool designed t...
4	EU to Ban Huawei, ZTE from Internal Commission...	2023-06-15T13:50:00.000000000+00:00	The European Commission is planning to ban equ...

按要求准备 LlamaIndex 所需的文档

In [ ]:

Copied!





documents = [
    Document(text=f"{row['title']}: {row['text']}")
    for i, row in news.iterrows()
]
documents = [
    Document(text=f"{row['title']}: {row['text']}")
    for i, row in news.iterrows()
]

设置 API 密钥与大型语言模型¶

In [ ]:

Copied!

import os

os.environ["OPENAI_API_KEY"] = "sk-..."

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4")
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4")

GraphRAGExtractor¶

GraphRAGExtractor 类专为从文本中提取三元组（主体-关系-客体）而设计，并通过使用大语言模型（LLM）为实体和关系添加描述性信息来增强其属性。

该功能与 SimpleLLMPathExtractor 类似，但额外包含处理实体和关系描述的增强功能。具体实现可参考现有的提取器示例。

功能分解如下：

核心组件：

llm: 用于信息提取的语言模型
extract_prompt: 引导LLM进行信息提取的提示模板
parse_fn: 将LLM输出解析为结构化数据的函数
max_paths_per_chunk: 限制每个文本块提取的三元组数量
num_workers: 用于多文本节点并行处理的线程数

主要方法：

__call__: 处理文本节点列表的入口点
acall: 提升性能的异步版本调用方法
_aextract: 处理单个节点的核心方法

提取流程：

对于每个输入节点（文本块）：

将文本与提取提示一同发送至LLM
解析LLM响应以提取实体、关系及其描述信息
实体被转换为EntityNode对象，实体描述存储于元数据中
关系被转换为Relation对象，关系描述存储于元数据中
这些信息被添加到节点的元数据中，分别存储在KG_NODES_KEY和KG_RELATIONS_KEY下

注意： 当前实现仅使用关系描述功能，下一版本将在检索阶段加入实体描述的应用。

In [ ]:

Copied!





import asyncio
import nest_asyncio

nest_asyncio.apply()

from typing import Any, List, Callable, Optional, Union, Dict
from IPython.display import Markdown, display

from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
    default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
    EntityNode,
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
    Relation,
)
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (
    DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
)
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.bridge.pydantic import BaseModel, Field


class GraphRAGExtractor(TransformComponent):
    """Extract triples from a graph.

    Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.

    Args:
        llm (LLM):
            The language model to use.
        extract_prompt (Union[str, PromptTemplate]):
            The prompt to use for extracting triples.
        parse_fn (callable):
            A function to parse the output of the language model.
        num_workers (int):
            The number of workers to use for parallel processing.
        max_paths_per_chunk (int):
            The maximum number of paths to extract per chunk.
    """

    llm: LLM
    extract_prompt: PromptTemplate
    parse_fn: Callable
    num_workers: int
    max_paths_per_chunk: int

    def __init__(
        self,
        llm: Optional[LLM] = None,
        extract_prompt: Optional[Union[str, PromptTemplate]] = None,
        parse_fn: Callable = default_parse_triplets_fn,
        max_paths_per_chunk: int = 10,
        num_workers: int = 4,
    ) -> None:
        """Init params."""
        from llama_index.core import Settings

        if isinstance(extract_prompt, str):
            extract_prompt = PromptTemplate(extract_prompt)

        super().__init__(
            llm=llm or Settings.llm,
            extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
            parse_fn=parse_fn,
            num_workers=num_workers,
            max_paths_per_chunk=max_paths_per_chunk,
        )

    @classmethod
    def class_name(cls) -> str:
        return "GraphExtractor"

    def __call__(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes."""
        return asyncio.run(
            self.acall(nodes, show_progress=show_progress, **kwargs)
        )

    async def _aextract(self, node: BaseNode) -> BaseNode:
        """Extract triples from a node."""
        assert hasattr(node, "text")

        text = node.get_content(metadata_mode="llm")
        try:
            llm_response = await self.llm.apredict(
                self.extract_prompt,
                text=text,
                max_knowledge_triplets=self.max_paths_per_chunk,
            )
            entities, entities_relationship = self.parse_fn(llm_response)
        except ValueError:
            entities = []
            entities_relationship = []

        existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
        existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])
        metadata = node.metadata.copy()
        for entity, entity_type, description in entities:
            metadata[
                "entity_description"
            ] = description  # Not used in the current implementation. But will be useful in future work.
            entity_node = EntityNode(
                name=entity, label=entity_type, properties=metadata
            )
            existing_nodes.append(entity_node)

        metadata = node.metadata.copy()
        for triple in entities_relationship:
            subj, obj, rel, description = triple
            subj_node = EntityNode(name=subj, properties=metadata)
            obj_node = EntityNode(name=obj, properties=metadata)
            metadata["relationship_description"] = description
            rel_node = Relation(
                label=rel,
                source_id=subj_node.id,
                target_id=obj_node.id,
                properties=metadata,
            )

            existing_nodes.extend([subj_node, obj_node])
            existing_relations.append(rel_node)

        node.metadata[KG_NODES_KEY] = existing_nodes
        node.metadata[KG_RELATIONS_KEY] = existing_relations
        return node

    async def acall(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes async."""
        jobs = []
        for node in nodes:
            jobs.append(self._aextract(node))

        return await run_jobs(
            jobs,
            workers=self.num_workers,
            show_progress=show_progress,
            desc="Extracting paths from text",
        )
import asyncio
import nest_asyncio

nest_asyncio.apply()

from typing import Any, List, Callable, Optional, Union, Dict
from IPython.display import Markdown, display

from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
    default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
    EntityNode,
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
    Relation,
)
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (
    DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
)
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.bridge.pydantic import BaseModel, Field


class GraphRAGExtractor(TransformComponent):
    """Extract triples from a graph.

    Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.

    Args:
        llm (LLM):
            The language model to use.
        extract_prompt (Union[str, PromptTemplate]):
            The prompt to use for extracting triples.
        parse_fn (callable):
            A function to parse the output of the language model.
        num_workers (int):
            The number of workers to use for parallel processing.
        max_paths_per_chunk (int):
            The maximum number of paths to extract per chunk.
    """

    llm: LLM
    extract_prompt: PromptTemplate
    parse_fn: Callable
    num_workers: int
    max_paths_per_chunk: int

    def __init__(
        self,
        llm: Optional[LLM] = None,
        extract_prompt: Optional[Union[str, PromptTemplate]] = None,
        parse_fn: Callable = default_parse_triplets_fn,
        max_paths_per_chunk: int = 10,
        num_workers: int = 4,
    ) -> None:
        """Init params."""
        from llama_index.core import Settings

        if isinstance(extract_prompt, str):
            extract_prompt = PromptTemplate(extract_prompt)

        super().__init__(
            llm=llm or Settings.llm,
            extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
            parse_fn=parse_fn,
            num_workers=num_workers,
            max_paths_per_chunk=max_paths_per_chunk,
        )

    @classmethod
    def class_name(cls) -> str:
        return "GraphExtractor"

    def __call__(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes."""
        return asyncio.run(
            self.acall(nodes, show_progress=show_progress, **kwargs)
        )

    async def _aextract(self, node: BaseNode) -> BaseNode:
        """Extract triples from a node."""
        assert hasattr(node, "text")

        text = node.get_content(metadata_mode="llm")
        try:
            llm_response = await self.llm.apredict(
                self.extract_prompt,
                text=text,
                max_knowledge_triplets=self.max_paths_per_chunk,
            )
            entities, entities_relationship = self.parse_fn(llm_response)
        except ValueError:
            entities = []
            entities_relationship = []

        existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
        existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])
        metadata = node.metadata.copy()
        for entity, entity_type, description in entities:
            metadata[
                "entity_description"
            ] = description  # Not used in the current implementation. But will be useful in future work.
            entity_node = EntityNode(
                name=entity, label=entity_type, properties=metadata
            )
            existing_nodes.append(entity_node)

        metadata = node.metadata.copy()
        for triple in entities_relationship:
            subj, obj, rel, description = triple
            subj_node = EntityNode(name=subj, properties=metadata)
            obj_node = EntityNode(name=obj, properties=metadata)
            metadata["relationship_description"] = description
            rel_node = Relation(
                label=rel,
                source_id=subj_node.id,
                target_id=obj_node.id,
                properties=metadata,
            )

            existing_nodes.extend([subj_node, obj_node])
            existing_relations.append(rel_node)

        node.metadata[KG_NODES_KEY] = existing_nodes
        node.metadata[KG_RELATIONS_KEY] = existing_relations
        return node

    async def acall(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes async."""
        jobs = []
        for node in nodes:
            jobs.append(self._aextract(node))

        return await run_jobs(
            jobs,
            workers=self.num_workers,
            show_progress=show_progress,
            desc="Extracting paths from text",
        )

GraphRAGStore¶

GraphRAGStore 类是 SimplePropertyGraphStore 类的扩展，专为实现 GraphRAG 流水线而设计。以下是其核心组件与功能的解析：

该类通过社区检测算法对图中相关节点进行分组，随后使用大语言模型（LLM）为每个社区生成摘要。

核心方法：

build_communities():

将内部图表示转换为 NetworkX 图结构
应用层次化 Leiden 算法进行社区检测
收集每个社区的详细信息
为各社区生成摘要

generate_community_summary(text):

使用 LLM 生成社区内关系摘要
摘要包含实体名称及关系描述的综合说明

_create_nx_graph():

将内部图表示转换为 NetworkX 图结构（用于社区检测）

_collect_community_info(nx_graph, clusters):

根据社区归属收集各节点的详细信息
创建社区内每个关系的字符串表示形式

_summarize_communities(community_info):

使用 LLM 生成并存储各社区摘要

get_community_summaries():

返回社区摘要（若尚未构建则先执行构建流程）

In [ ]:

Copied!





import re
from llama_index.core.graph_stores import SimplePropertyGraphStore
import networkx as nx
from graspologic.partition import hierarchical_leiden

from llama_index.core.llms import ChatMessage


class GraphRAGStore(SimplePropertyGraphStore):
    community_summary = {}
    max_cluster_size = 5

    def generate_community_summary(self, text):
        """Generate summary for a given text using an LLM."""
        messages = [
            ChatMessage(
                role="system",
                content=(
                    "You are provided with a set of relationships from a knowledge graph, each represented as "
                    "entity1->entity2->relation->relationship_description. Your task is to create a summary of these "
                    "relationships. The summary should include the names of the entities involved and a concise synthesis "
                    "of the relationship descriptions. The goal is to capture the most critical and relevant details that "
                    "highlight the nature and significance of each relationship. Ensure that the summary is coherent and "
                    "integrates the information in a way that emphasizes the key aspects of the relationships."
                ),
            ),
            ChatMessage(role="user", content=text),
        ]
        response = OpenAI().chat(messages)
        clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return clean_response

    def build_communities(self):
        """Builds communities from the graph and summarizes them."""
        nx_graph = self._create_nx_graph()
        community_hierarchical_clusters = hierarchical_leiden(
            nx_graph, max_cluster_size=self.max_cluster_size
        )
        community_info = self._collect_community_info(
            nx_graph, community_hierarchical_clusters
        )
        self._summarize_communities(community_info)

    def _create_nx_graph(self):
        """Converts internal graph representation to NetworkX graph."""
        nx_graph = nx.Graph()
        for node in self.graph.nodes.values():
            nx_graph.add_node(str(node))
        for relation in self.graph.relations.values():
            nx_graph.add_edge(
                relation.source_id,
                relation.target_id,
                relationship=relation.label,
                description=relation.properties["relationship_description"],
            )
        return nx_graph

    def _collect_community_info(self, nx_graph, clusters):
        """Collect detailed information for each node based on their community."""
        community_mapping = {item.node: item.cluster for item in clusters}
        community_info = {}
        for item in clusters:
            cluster_id = item.cluster
            node = item.node
            if cluster_id not in community_info:
                community_info[cluster_id] = []

            for neighbor in nx_graph.neighbors(node):
                if community_mapping[neighbor] == cluster_id:
                    edge_data = nx_graph.get_edge_data(node, neighbor)
                    if edge_data:
                        detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
                        community_info[cluster_id].append(detail)
        return community_info

    def _summarize_communities(self, community_info):
        """Generate and store summaries for each community."""
        for community_id, details in community_info.items():
            details_text = (
                "\n".join(details) + "."
            )  # Ensure it ends with a period
            self.community_summary[
                community_id
            ] = self.generate_community_summary(details_text)

    def get_community_summaries(self):
        """Returns the community summaries, building them if not already done."""
        if not self.community_summary:
            self.build_communities()
        return self.community_summary
import re
from llama_index.core.graph_stores import SimplePropertyGraphStore
import networkx as nx
from graspologic.partition import hierarchical_leiden

from llama_index.core.llms import ChatMessage


class GraphRAGStore(SimplePropertyGraphStore):
    community_summary = {}
    max_cluster_size = 5

    def generate_community_summary(self, text):
        """Generate summary for a given text using an LLM."""
        messages = [
            ChatMessage(
                role="system",
                content=(
                    "You are provided with a set of relationships from a knowledge graph, each represented as "
                    "entity1->entity2->relation->relationship_description. Your task is to create a summary of these "
                    "relationships. The summary should include the names of the entities involved and a concise synthesis "
                    "of the relationship descriptions. The goal is to capture the most critical and relevant details that "
                    "highlight the nature and significance of each relationship. Ensure that the summary is coherent and "
                    "integrates the information in a way that emphasizes the key aspects of the relationships."
                ),
            ),
            ChatMessage(role="user", content=text),
        ]
        response = OpenAI().chat(messages)
        clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return clean_response

    def build_communities(self):
        """Builds communities from the graph and summarizes them."""
        nx_graph = self._create_nx_graph()
        community_hierarchical_clusters = hierarchical_leiden(
            nx_graph, max_cluster_size=self.max_cluster_size
        )
        community_info = self._collect_community_info(
            nx_graph, community_hierarchical_clusters
        )
        self._summarize_communities(community_info)

    def _create_nx_graph(self):
        """Converts internal graph representation to NetworkX graph."""
        nx_graph = nx.Graph()
        for node in self.graph.nodes.values():
            nx_graph.add_node(str(node))
        for relation in self.graph.relations.values():
            nx_graph.add_edge(
                relation.source_id,
                relation.target_id,
                relationship=relation.label,
                description=relation.properties["relationship_description"],
            )
        return nx_graph

    def _collect_community_info(self, nx_graph, clusters):
        """Collect detailed information for each node based on their community."""
        community_mapping = {item.node: item.cluster for item in clusters}
        community_info = {}
        for item in clusters:
            cluster_id = item.cluster
            node = item.node
            if cluster_id not in community_info:
                community_info[cluster_id] = []

            for neighbor in nx_graph.neighbors(node):
                if community_mapping[neighbor] == cluster_id:
                    edge_data = nx_graph.get_edge_data(node, neighbor)
                    if edge_data:
                        detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
                        community_info[cluster_id].append(detail)
        return community_info

    def _summarize_communities(self, community_info):
        """Generate and store summaries for each community."""
        for community_id, details in community_info.items():
            details_text = (
                "\n".join(details) + "."
            )  # Ensure it ends with a period
            self.community_summary[
                community_id
            ] = self.generate_community_summary(details_text)

    def get_community_summaries(self):
        """Returns the community summaries, building them if not already done."""
        if not self.community_summary:
            self.build_communities()
        return self.community_summary

/usr/local/lib/python3.10/dist-packages/graspologic/models/edge_swaps.py:215: NumbaDeprecationWarning: The keyword argument 'nopython=False' was supplied. From Numba 0.59.0 the default is being changed to True and use of 'nopython=False' will raise a warning as the argument will have no effect. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  _edge_swap_numba = nb.jit(_edge_swap, nopython=False)

GraphRAGQueryEngine¶

GraphRAGQueryEngine 类是一个采用 GraphRAG 方法处理查询的自定义查询引擎。它利用 GraphRAGStore 生成的社区摘要来回答用户查询。以下是其功能解析：

核心组件：

graph_store: GraphRAGStore 实例，包含社区摘要数据 llm: 用于生成和聚合答案的语言模型（LLM）

关键方法：

custom_query(query_str: str)

这是处理查询的主入口点。该方法会检索社区摘要，从每个摘要生成答案，最后将这些答案聚合成最终响应

generate_answer_from_summary(community_summary, query):

基于单个社区摘要生成查询答案
使用语言模型结合查询上下文解读社区摘要

aggregate_answers(community_answers):

将来自不同社区的独立答案组合成连贯的最终响应
使用语言模型将多个视角综合为简洁的单一答案

查询处理流程：

从图存储中检索社区摘要
为每个社区摘要生成针对查询的具体答案
将所有社区特定答案聚合成最终连贯响应

使用示例：

query_engine = GraphRAGQueryEngine(graph_store=graph_store, llm=llm)

response = query_engine.query("query")

In [ ]:

Copied!





from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.llms import LLM


class GraphRAGQueryEngine(CustomQueryEngine):
    graph_store: GraphRAGStore
    llm: LLM

    def custom_query(self, query_str: str) -> str:
        """Process all community summaries to generate answers to a specific query."""
        community_summaries = self.graph_store.get_community_summaries()
        community_answers = [
            self.generate_answer_from_summary(community_summary, query_str)
            for _, community_summary in community_summaries.items()
        ]

        final_answer = self.aggregate_answers(community_answers)
        return final_answer

    def generate_answer_from_summary(self, community_summary, query):
        """Generate an answer from a community summary based on a given query using LLM."""
        prompt = (
            f"Given the community summary: {community_summary}, "
            f"how would you answer the following query? Query: {query}"
        )
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content="I need an answer based on the above information.",
            ),
        ]
        response = self.llm.chat(messages)
        cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return cleaned_response

    def aggregate_answers(self, community_answers):
        """Aggregate individual community answers into a final, coherent response."""
        # intermediate_text = " ".join(community_answers)
        prompt = "Combine the following intermediate answers into a final, concise response."
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content=f"Intermediate answers: {community_answers}",
            ),
        ]
        final_response = self.llm.chat(messages)
        cleaned_final_response = re.sub(
            r"^assistant:\s*", "", str(final_response)
        ).strip()
        return cleaned_final_response
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.llms import LLM


class GraphRAGQueryEngine(CustomQueryEngine):
    graph_store: GraphRAGStore
    llm: LLM

    def custom_query(self, query_str: str) -> str:
        """Process all community summaries to generate answers to a specific query."""
        community_summaries = self.graph_store.get_community_summaries()
        community_answers = [
            self.generate_answer_from_summary(community_summary, query_str)
            for _, community_summary in community_summaries.items()
        ]

        final_answer = self.aggregate_answers(community_answers)
        return final_answer

    def generate_answer_from_summary(self, community_summary, query):
        """Generate an answer from a community summary based on a given query using LLM."""
        prompt = (
            f"Given the community summary: {community_summary}, "
            f"how would you answer the following query? Query: {query}"
        )
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content="I need an answer based on the above information.",
            ),
        ]
        response = self.llm.chat(messages)
        cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return cleaned_response

    def aggregate_answers(self, community_answers):
        """Aggregate individual community answers into a final, coherent response."""
        # intermediate_text = " ".join(community_answers)
        prompt = "Combine the following intermediate answers into a final, concise response."
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content=f"Intermediate answers: {community_answers}",
            ),
        ]
        final_response = self.llm.chat(messages)
        cleaned_final_response = re.sub(
            r"^assistant:\s*", "", str(final_response)
        ).strip()
        return cleaned_final_response

构建端到端 GraphRAG 全流程¶

现在我们已经定义了所有必要组件，接下来构建 GraphRAG 处理流程：

从文本中创建节点/文本块
使用 GraphRAGExtractor 和 GraphRAGStore 构建 PropertyGraphIndex
基于上述构建的图谱创建社区，并为每个社区生成摘要
创建 GraphRAGQueryEngine 并开始执行查询

从文本创建节点/分块¶

In [ ]:

Copied!





from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)

In [ ]:

Copied!

len(nodes)
len(nodes)

Out[ ]:

使用 `GraphRAGExtractor` 和 `GraphRAGStore` 构建 ProperGraphIndex¶

In [ ]:

Copied!





KG_TRIPLET_EXTRACT_TMPL = """
-Goal-
Given a text document, identify all entities and their entity types from the text and all relationships among the identified entities.
Given the text, extract up to {max_knowledge_triplets} entity-relation triplets.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: Type of the entity
- entity_description: Comprehensive description of the entity's attributes and activities

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relation: relationship between source_entity and target_entity
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other

3. Output Formatting:
- Return the result in valid JSON format with two keys: 'entities' (list of entity objects) and 'relationships' (list of relationship objects).
- Exclude any text outside the JSON structure (e.g., no explanations or comments).
- If no entities or relationships are identified, return empty lists: { "entities": [], "relationships": [] }.

-An Output Example-
{
  "entities": [
    {
      "entity_name": "Albert Einstein",
      "entity_type": "Person",
      "entity_description": "Albert Einstein was a theoretical physicist who developed the theory of relativity and made significant contributions to physics."
    },
    {
      "entity_name": "Theory of Relativity",
      "entity_type": "Scientific Theory",
      "entity_description": "A scientific theory developed by Albert Einstein, describing the laws of physics in relation to observers in different frames of reference."
    },
    {
      "entity_name": "Nobel Prize in Physics",
      "entity_type": "Award",
      "entity_description": "A prestigious international award in the field of physics, awarded annually by the Royal Swedish Academy of Sciences."
    }
  ],
  "relationships": [
    {
      "source_entity": "Albert Einstein",
      "target_entity": "Theory of Relativity",
      "relation": "developed",
      "relationship_description": "Albert Einstein is the developer of the theory of relativity."
    },
    {
      "source_entity": "Albert Einstein",
      "target_entity": "Nobel Prize in Physics",
      "relation": "won",
      "relationship_description": "Albert Einstein won the Nobel Prize in Physics in 1921."
    }
  ]
}

-Real Data-
######################
text: {text}
######################
output:"""
KG_TRIPLET_EXTRACT_TMPL = """
-Goal-
Given a text document, identify all entities and their entity types from the text and all relationships among the identified entities.
Given the text, extract up to {max_knowledge_triplets} entity-relation triplets.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: Type of the entity
- entity_description: Comprehensive description of the entity's attributes and activities

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relation: relationship between source_entity and target_entity
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other

3. Output Formatting:
- Return the result in valid JSON format with two keys: 'entities' (list of entity objects) and 'relationships' (list of relationship objects).
- Exclude any text outside the JSON structure (e.g., no explanations or comments).
- If no entities or relationships are identified, return empty lists: { "entities": [], "relationships": [] }.

-An Output Example-
{
  "entities": [
    {
      "entity_name": "Albert Einstein",
      "entity_type": "Person",
      "entity_description": "Albert Einstein was a theoretical physicist who developed the theory of relativity and made significant contributions to physics."
    },
    {
      "entity_name": "Theory of Relativity",
      "entity_type": "Scientific Theory",
      "entity_description": "A scientific theory developed by Albert Einstein, describing the laws of physics in relation to observers in different frames of reference."
    },
    {
      "entity_name": "Nobel Prize in Physics",
      "entity_type": "Award",
      "entity_description": "A prestigious international award in the field of physics, awarded annually by the Royal Swedish Academy of Sciences."
    }
  ],
  "relationships": [
    {
      "source_entity": "Albert Einstein",
      "target_entity": "Theory of Relativity",
      "relation": "developed",
      "relationship_description": "Albert Einstein is the developer of the theory of relativity."
    },
    {
      "source_entity": "Albert Einstein",
      "target_entity": "Nobel Prize in Physics",
      "relation": "won",
      "relationship_description": "Albert Einstein won the Nobel Prize in Physics in 1921."
    }
  ]
}

-Real Data-
######################
text: {text}
######################
output:"""

In [ ]:

Copied!





import json


def parse_fn(response_str: str) -> Any:
    json_pattern = r"\{.*\}"
    match = re.search(json_pattern, response_str, re.DOTALL)
    entities = []
    relationships = []
    if not match:
        return entities, relationships
    json_str = match.group(0)
    try:
        data = json.loads(json_str)
        entities = [
            (
                entity["entity_name"],
                entity["entity_type"],
                entity["entity_description"],
            )
            for entity in data.get("entities", [])
        ]
        relationships = [
            (
                relation["source_entity"],
                relation["target_entity"],
                relation["relation"],
                relation["relationship_description"],
            )
            for relation in data.get("relationships", [])
        ]
        return entities, relationships
    except json.JSONDecodeError as e:
        print("Error parsing JSON:", e)
        return entities, relationships


kg_extractor = GraphRAGExtractor(
    llm=llm,
    extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
    max_paths_per_chunk=2,
    parse_fn=parse_fn,
)
import json


def parse_fn(response_str: str) -> Any:
    json_pattern = r"\{.*\}"
    match = re.search(json_pattern, response_str, re.DOTALL)
    entities = []
    relationships = []
    if not match:
        return entities, relationships
    json_str = match.group(0)
    try:
        data = json.loads(json_str)
        entities = [
            (
                entity["entity_name"],
                entity["entity_type"],
                entity["entity_description"],
            )
            for entity in data.get("entities", [])
        ]
        relationships = [
            (
                relation["source_entity"],
                relation["target_entity"],
                relation["relation"],
                relation["relationship_description"],
            )
            for relation in data.get("relationships", [])
        ]
        return entities, relationships
    except json.JSONDecodeError as e:
        print("Error parsing JSON:", e)
        return entities, relationships


kg_extractor = GraphRAGExtractor(
    llm=llm,
    extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
    max_paths_per_chunk=2,
    parse_fn=parse_fn,
)

In [ ]:

Copied!





from llama_index.core import PropertyGraphIndex

index = PropertyGraphIndex(
    nodes=nodes,
    property_graph_store=GraphRAGStore(),
    kg_extractors=[kg_extractor],
    show_progress=True,
)
from llama_index.core import PropertyGraphIndex

index = PropertyGraphIndex(
    nodes=nodes,
    property_graph_store=GraphRAGStore(),
    kg_extractors=[kg_extractor],
    show_progress=True,
)

Extracting paths from text: 100%|██████████| 50/50 [04:30<00:00,  5.41s/it]
Generating embeddings: 100%|██████████| 1/1 [00:01<00:00,  1.24s/it]
Generating embeddings: 100%|██████████| 4/4 [00:00<00:00,  4.22it/s]

In [ ]:

Copied!

list(index.property_graph_store.graph.nodes.values())[-1]
list(index.property_graph_store.graph.nodes.values())[-1]

Out[ ]:

EntityNode(label='entity', embedding=None, properties={'relationship_description': 'Gett Taxi is a competitor of Uber in the Israeli taxi market.', 'triplet_source_id': 'e4f765e3-fdfd-48d0-92a9-36f75b5865aa'}, name='Competition')

In [ ]:

Copied!

list(index.property_graph_store.graph.relations.values())[0]
list(index.property_graph_store.graph.relations.values())[0]

Out[ ]:

Relation(label='O&G sector', source_id='Chevron', target_id='Operates in', properties={'relationship_description': 'Chevron operates in the O&G sector, as evidenced by the text mentioning that it is a company in this industry.', 'triplet_source_id': '6a28dc67-0dc0-486f-8dd6-70a3502f1c8e'})

In [ ]:

Copied!

list(index.property_graph_store.graph.relations.values())[0].properties[
    "relationship_description"
]
list(index.property_graph_store.graph.relations.values())[0].properties[
    "relationship_description"
]

Out[ ]:

'Chevron operates in the O&G sector, as evidenced by the text mentioning that it is a company in this industry.'

构建社区¶

这将为每个社区创建对应的社区结构和摘要。

In [ ]:

Copied!

index.property_graph_store.build_communities()
index.property_graph_store.build_communities()

创建查询引擎¶

In [ ]:

Copied!

query_engine = GraphRAGQueryEngine(
    graph_store=index.property_graph_store, llm=llm
)
query_engine = GraphRAGQueryEngine(
    graph_store=index.property_graph_store, llm=llm
)

查询¶

In [ ]:

Copied!





response = query_engine.query(
    "What are the main news discussed in the document?"
)
display(Markdown(f"{response.response}"))
response = query_engine.query(
    "What are the main news discussed in the document?"
)
display(Markdown(f"{response.response}"))

The document discusses various news topics across different sectors. In the business sector, it mentions FirstEnergy being a publicly traded company on the New York Stock Exchange and State Street Corporation being listed on the NYSE. It also discusses Coinbase Global Inc.'s repurchase of $64.5 million worth of 0.50% convertible senior notes and the closure of the startup Protonn. In the political sphere, it highlights a theatrical act performed by Sinn Féin TD John Brady during a debate on retained firefighters. In the tech industry, it discusses the European Commission's actions against ZTE Corp. and TikTok Inc. due to security concerns. In the sports sector, it mentions Manchester United's interest in Harry Kane, the transfer of Jude Bellingham from Borussia Dortmund to Real Madrid, and the negotiation process for Maliek Collins' contract extension with the Houston Texans. In the music industry, it discusses the acquisition of The Hollies' recording catalog by BMG and the distribution pact between ADA Worldwide and Rostrum Records. In the hospitality sector, it mentions the partnership between Supplier.io and Hyatt Hotels. In the energy sector, it discusses the partnership between GE Vernova and Amplus Solar. In the gaming industry, it discusses the creation of the unannounced game "Star Ocean: The Second Story R" by Square Enix. In the automotive industry, it mentions the upcoming launch of the Hyundai Exter in India and Stellantis' plans to shut down the Belvidere Assembly Plant. In the airline industry, it discusses Deutsche Bank's decision to upgrade Allegiant Travel's status from Hold to Buy. In the football sector, it discusses the rejected bids made by Arsenal for Rice and the rejected bid received by Chelsea for Mason Mount. In the space industry, it mentions MDA Ltd.'s participation in the Jefferies Virtual Space Summit. In the transportation industry, it discusses Uber's strategic decision to exit the Israeli market and the emergence of Yango as a key player in the Israeli taxi market.

In [ ]:

Copied!

response = query_engine.query("What are news related to financial sector?")
display(Markdown(f"{response.response}"))
response = query_engine.query("What are news related to financial sector?")
display(Markdown(f"{response.response}"))

The recent news related to the financial sector includes Morgan Stanley hiring Thomas Christl to co-head its coverage of consumer and retail clients in Europe. KeyBank has expanded its presence in the Western U.S. by opening a new branch in American Fork and donated $10,000 to the Five.12 Foundation. BMG has acquired the recording catalog of The Hollies, and Matt Pincus led a $15 million pre-growth round of investment for Soundtrack Your Brand. Hyatt Hotels and Supplier.io have been honored with the Supply & Demand Chain Executive 2023 Top Supply Chain Projects award. Bank of America Corp. reported a decline in uninsured deposits, while JPMorgan Chase & Co. reported a 1.9% increase in uninsured deposits. Coinbase Global Inc. repurchased $64.5 million worth of 0.50% convertible senior notes and also decided to repurchase its 0.50% Convertible Senior Notes due 2026 for approximately $45.5 million. Deutsche Bank upgraded Allegiant Travel's status from Hold to Buy and increased the price target to $145. Lastly, Tesla Inc.'s stock performance was analyzed by Ihor Dusaniwsky, a managing director at S3 Partners, and the company formed a significant partnership with General Motors Co. in the electric vehicle industry.

未来工作：¶

本指南是 GraphRAG 的近似实现方案。在后续版本中，我们计划进行以下扩展：

实现基于实体描述嵌入的检索功能
集成 Neo4JPropertyGraphStore 支持
为社区摘要生成的每个答案计算有用性评分，并过滤掉评分为零的答案
执行实体消歧以消除重复实体
实现声明/协变量信息提取、局部搜索与全局搜索技术

基于 LlamaIndex 的 GraphRAG 实现¶

GraphRAG 方法¶

GraphRAG 流水线组件¶

安装¶

加载数据¶

设置 API 密钥与大型语言模型¶

GraphRAGExtractor¶

GraphRAGStore¶

GraphRAGQueryEngine¶

构建端到端 GraphRAG 全流程¶

从文本创建节点/分块¶

使用 GraphRAGExtractor 和 GraphRAGStore 构建 ProperGraphIndex¶

构建社区¶

创建查询引擎¶

查询¶

未来工作：¶

使用 `GraphRAGExtractor` 和 `GraphRAGStore` 构建 ProperGraphIndex¶