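Memory

Concept

Memory is a core component of agentic systems. It allows you to store and retrieve information from the past.

In LlamaIndex, you can typically customize memory by using an existing BaseMemory class, or by creating a custom one.

As the agent runs, it calls memory.put() to store information and memory.get() to retrieve it.

Note: ChatMemoryBuffer is deprecated. It is currently the default across the framework, where it creates a basic chat-history buffer that gives the agent the last X messages that fit within a token limit. In a future release, the default will be replaced by the Memory class, which operates similarly but is more flexible and supports more complex memory configurations. The examples in this section use the Memory class.

Usage

Using the Memory class, you can create a memory that has both short-term memory (a FIFO queue of messages) and, optionally, long-term memory (information extracted over time).

Configuring Memory for an Agent

You can set the memory for an agent by passing it into the run() method: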

from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.memory import Memory

memory = Memory.from_defaults(session_id="my_session", token_limit=40000)

agent = FunctionAgent(llm=llm, tools=tools)

response = await agent.run("<question that invokes tool>", memory=memory)

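Managing the Memory Manually

You can also manage the memory manually by calling memory.put_messages() and memory.get() directly, and then passing in the chat history.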

from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.llms import ChatMessage
from llama_index.core.memory import Memory


memory = Memory.from_defaults(session_id="my_session", token_limit=40000)
memory.put_messages(
    [
        ChatMessage(role="user", content="Hello, world!"),
        ChatMessage(role="assistant", content="Hello, world to you too!"),
    ]
)
chat_history = memory.get()

agent = FunctionAgent(llm=llm, tools=tools)

# passing in the chat history overrides any existing memory
response = await agent.run(
    "<question that invokes tool>", chat_history=chat_history
)

Retrieving the Latest Memory from an Agent

You can get the latest memory from an agent by grabbing it from the agent context:

from llama_index.core.workflow import Context

ctx = Context(agent)

# run the agent (not the context) and pass the context in, so the memory is stored on it
response = await agent.run("<question that invokes tool>", ctx=ctx)

# get the memory
memory = await ctx.store.get("memory")
chat_history = memory.get()

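Customizing Memory

Short-Term Memory

By default, the Memory class stores the last X messages that fit into a token limit. You can customize this by passing the token_limit, chat_history_token_ratio, and token_flush_size arguments to the Memory class.

token_limit (default: 30000): the maximum number of tokens of short-term and long-term memory to store.
chat_history_token_ratio (default: 0.7): the share of the token limit that the short-term chat history may occupy. If the chat history exceeds this ratio, the oldest messages are flushed into long-term memory (if long-term memory is enabled).
token_flush_size (default: 3000): the number of tokens to flush to long-term memory when the chat history exceeds the token limit.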

memory = Memory.from_defaults(
    session_id="my_session",
    token_limit=40000,
    chat_history_token_ratio=0.7,
    token_flush_size=3000,
)
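
To make the arithmetic behind these three values concrete, here is a small worked example in plain Python. It does not call LlamaIndex; it only assumes, based on the parameter descriptions above, that the short-term share is token_limit multiplied by chat_history_token_ratio. The variable names short_term_budget and long_term_budget are illustrative and not part of the library.

```python
# Worked arithmetic for the configuration above (no LlamaIndex imports needed).
token_limit = 40000
chat_history_token_ratio = 0.7
token_flush_size = 3000

# Share of the budget that short-term chat history may occupy before a flush.
short_term_budget = round(token_limit * chat_history_token_ratio)

# Roughly what remains for long-term memory block content when short-term is full.
long_term_budget = token_limit - short_term_budget

print(short_term_budget)  # 28000
print(long_term_budget)   # 12000
```

With these settings, a flush would move about 3000 tokens of the oldest messages out of a 28000-token short-term window.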

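Long-Term Memory

Long-term memory is represented as Memory Block objects. These objects receive the messages that are flushed from short-term memory, and optionally process them to extract information. When memory is retrieved, the short-term and long-term memories are merged together.

Currently, there are three predefined memory blocks:

StaticMemoryBlock: a memory block that stores a static piece of information.
FactExtractionMemoryBlock: a memory block that extracts facts from the chat history.
VectorMemoryBlock: a memory block that stores and retrieves batches of chat messages from a vector database.

Depending on the insert_method argument, the memory blocks are inserted into either the system message or the latest user message.

This sounds a bit complicated, but it is actually quite simple. Let's look at an example: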

from llama_index.core.memory import (
    StaticMemoryBlock,
    FactExtractionMemoryBlock,
    VectorMemoryBlock,
)

blocks = [
    StaticMemoryBlock(
        name="core_info",
        static_content="My name is Logan, and I live in Saskatoon. I work at LlamaIndex.",
        priority=0,
    ),
    FactExtractionMemoryBlock(
        name="extracted_info",
        llm=llm,
        max_facts=50,
        priority=1,
    ),
    VectorMemoryBlock(
        name="vector_memory",
        # required: pass in a vector store like qdrant, chroma, weaviate, milvus, etc.
        vector_store=vector_store,
        priority=2,
        embed_model=embed_model,
        # The top-k message batches to retrieve
        # similarity_top_k=2,
        # optional: How many previous messages to include in the retrieval query
        # retrieval_context_window=5
        # optional: pass optional node-postprocessors for things like similarity threshold, etc.
        # node_postprocessors=[...],
    ),
]

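Here, we've set up three memory blocks:

core_info: a static memory block that stores core information about the user. The static content can be either a string or a list of ContentBlock objects such as TextBlock or ImageBlock. This information is always inserted into the memory.
extracted_info: a fact-extraction memory block that extracts information from the chat history. We pass in the llm used to extract facts from the flushed chat history, and set max_facts to 50. If the number of extracted facts exceeds this limit, the facts are automatically summarized and reduced to leave room for new information.
vector_memory: a vector memory block that stores and retrieves batches of chat messages in a vector database. Each batch is a list of the flushed chat messages. Here, we pass in the vector_store and embed_model used to store and retrieve the chat messages.

You'll also notice that we've set a priority for each block. This determines the handling order when the memory block content (i.e. long-term memory) plus short-term memory exceeds the token limit on the Memory object.

When memory blocks get too long, they are automatically "truncated". By default, this just means they are removed from memory until there is room again. This can be customized by subclassing a memory block and implementing your own truncation logic.

priority=0: the block is always kept in memory.
priority=1, 2, 3, etc.: determines the order in which memory blocks are truncated when the memory exceeds the token limit, so that short-term memory + long-term memory content stays at or below token_limit.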
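
The priority rules can be pictured with a small stand-alone model. This is only a sketch, not the library's implementation: BlockContent and drop_blocks_to_fit are hypothetical names, token counts are given directly, and it assumes that blocks with larger priority numbers are given up first, which the text above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class BlockContent:
    # Hypothetical stand-in for the rendered content of one memory block.
    name: str
    priority: int
    tokens: int


def drop_blocks_to_fit(blocks, short_term_tokens, token_limit):
    """Return the blocks that are kept so the total fits within token_limit."""
    kept = list(blocks)
    # Assumption: larger priority numbers are dropped first; priority 0 is never dropped.
    candidates = sorted(
        (b for b in kept if b.priority != 0),
        key=lambda b: b.priority,
        reverse=True,
    )
    for block in candidates:
        total = short_term_tokens + sum(b.tokens for b in kept)
        if total <= token_limit:
            break
        kept.remove(block)
    return kept


blocks = [
    BlockContent("core_info", priority=0, tokens=50),
    BlockContent("extracted_info", priority=1, tokens=400),
    BlockContent("vector_memory", priority=2, tokens=700),
]
kept = drop_blocks_to_fit(blocks, short_term_tokens=500, token_limit=1000)
print([b.name for b in kept])  # ['core_info', 'extracted_info']
```

In this model the default "truncation" is simply dropping a whole block, which matches the default behavior described above; a custom block could return shortened content instead.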

Now, let's pass these blocks into the Memory class:

memory = Memory.from_defaults(
    session_id="my_session",
    token_limit=40000,
    memory_blocks=blocks,
    insert_method="system",
)

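As the memory is used, the short-term memory fills up. Once the short-term memory exceeds the chat_history_token_ratio, the oldest messages that fit into token_flush_size are flushed and sent to each memory block for processing.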
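
The flush step can be sketched as a FIFO queue in plain Python. This is a simplified model under stated assumptions, not the library's code: count_tokens uses whitespace splitting in place of a real tokenizer, messages are plain strings, and flush_oldest is a hypothetical helper.

```python
from collections import deque


def count_tokens(text):
    # Whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def flush_oldest(queue, short_term_budget, token_flush_size):
    """Remove and return the oldest messages if the queue is over budget."""
    flushed = []
    if sum(count_tokens(m) for m in queue) <= short_term_budget:
        return flushed  # still within budget, nothing to flush
    flushed_tokens = 0
    # Take messages from the front (oldest first) while they fit the flush size.
    while queue and flushed_tokens + count_tokens(queue[0]) <= token_flush_size:
        message = queue.popleft()
        flushed_tokens += count_tokens(message)
        flushed.append(message)
    return flushed


queue = deque(["a b c", "d e", "f g h i", "j k"])  # 3 + 2 + 4 + 2 = 11 tokens
flushed = flush_oldest(queue, short_term_budget=8, token_flush_size=5)
print(flushed)      # ['a b c', 'd e']
print(list(queue))  # ['f g h i', 'j k']
```

In the real Memory class the flushed batch would then be handed to each memory block; this sketch stops at returning it, and it does not handle a single message that is larger than the flush size.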

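When memory is retrieved, the short-term and long-term memories are merged together. The Memory object ensures that short-term memory + long-term memory content is less than or equal to token_limit. If it is larger, the .truncate() method is called on the memory blocks, using priority to determine the truncation order.

Tip

By default, tokens are counted using tiktoken. To customize this, set the tokenizer_fn argument to a callable that takes a string and returns a list; the length of that list is used as the token count.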
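
As an illustration of the callable shape described in the tip, here is a minimal tokenizer written with the standard library only. The name simple_tokenizer and the splitting rule are assumptions made for this example; any function that maps a string to a list would have the same shape.

```python
import re


def simple_tokenizer(text: str) -> list:
    # Split into runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenizer("Hello, world!"))       # ['Hello', ',', 'world', '!']
print(len(simple_tokenizer("Hello, world!")))  # 4
```

A function like this could then be supplied as tokenizer_fn=simple_tokenizer. Its counts will differ from tiktoken's, so token_limit would need to be chosen with that in mind.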

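Once the memory has collected enough information, we might see something like this from the memory:

# optionally pass in a list of messages to get, which will be forwarded to the memory blocks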
chat_history = memory.get(messages=[...])

print(chat_history[0].content)

Example output:

<memory>
<static_memory>
My name is Logan, and I live in Saskatoon. I work at LlamaIndex.
</static_memory>
<fact_extraction_memory>
<fact>Fact 1</fact>
<fact>Fact 2</fact>
<fact>Fact 3</fact>
</fact_extraction_memory>
<retrieval_based_memory>
<message role='user'>Msg 1</message>
<message role='assistant'>Msg 2</message>
<message role='user'>Msg 3</message>
</retrieval_based_memory>
</memory>

Here, the memory was inserted into the system message, with a specific section for each memory block.
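
When debugging, it can help to pull the sections back out of this text. The sketch below parses the example output above with the standard library's XML parser. It assumes the content happens to be well-formed XML, which real chat content is not guaranteed to be, so treat it as an inspection aid rather than a robust parser.

```python
import xml.etree.ElementTree as ET

# The example output shown above, copied as a string.
memory_text = """<memory>
<static_memory>
My name is Logan, and I live in Saskatoon. I work at LlamaIndex.
</static_memory>
<fact_extraction_memory>
<fact>Fact 1</fact>
<fact>Fact 2</fact>
<fact>Fact 3</fact>
</fact_extraction_memory>
<retrieval_based_memory>
<message role='user'>Msg 1</message>
<message role='assistant'>Msg 2</message>
<message role='user'>Msg 3</message>
</retrieval_based_memory>
</memory>"""

root = ET.fromstring(memory_text)

sections = [child.tag for child in root]              # one section per memory block
facts = [fact.text for fact in root.iter("fact")]     # extracted facts
roles = [m.get("role") for m in root.iter("message")]  # roles of retrieved messages

print(sections)  # ['static_memory', 'fact_extraction_memory', 'retrieval_based_memory']
print(facts)     # ['Fact 1', 'Fact 2', 'Fact 3']
print(roles)     # ['user', 'assistant', 'user']
```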

Customizing Memory Blocks

In addition to the predefined memory blocks, you can also create your own custom memory blocks.

from typing import Optional, List, Any
from llama_index.core.llms import ChatMessage
from llama_index.core.memory.memory import BaseMemoryBlock


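# use generics to define the output type of the memory block
# can be str or List[ContentBlock]
class MentionCounter(BaseMemoryBlock[str]):
    """
    A memory block that counts the number of times a user mentions a specific name.
    """

    mention_name: str = "Logan"
    mention_count: int = 0

    async def _aget(
        self, messages: Optional[List[ChatMessage]] = None, **block_kwargs: Any
    ) -> str:
        # report the configured name rather than a hardcoded one
        return f"{self.mention_name} was mentioned {self.mention_count} times."

    async def _aput(self, messages: List[ChatMessage]) -> None:
        for message in messages:
            # message.content may be None, so fall back to an empty string
            if self.mention_name in (message.content or ""):
                self.mention_count += 1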

    async def atruncate(
        self, content: str, tokens_to_truncate: int
    ) -> Optional[str]:
        return ""

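Here, we've defined a memory block that counts the number of times a user mentions a specific name. Its truncate method is simple: it just returns an empty string.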

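Remote Memory

By default, the Memory class uses an in-memory SQLite database. You can plug in any remote database by changing the database URI.

You can customize the table name, and you can optionally pass in an async engine directly. This is useful for managing your own connection pool.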

from llama_index.core.memory import Memory

memory = Memory.from_defaults(
    session_id="my_session",
    token_limit=40000,
    async_database_uri="postgresql+asyncpg://postgres:mark90@localhost:5432/postgres",
    # Optional: specify a table name
    # table_name="memory_table",
    # Optional: pass in an async engine directly
    # this is useful for managing your own connection pool
    # async_engine=engine,
)
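
Rather than writing credentials inline, the URI string can be assembled from the environment. The helper below is a sketch using only the standard library. build_async_uri and the MEMORY_DB_* variable names are hypothetical, and the result is just a string in the same postgresql+asyncpg form as the example above.

```python
import os
from urllib.parse import quote


def build_async_uri(user, password, host, port, database):
    """Assemble a postgresql+asyncpg URI, percent-encoding the credentials."""
    return (
        f"postgresql+asyncpg://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# The environment variable names below are made up for this example.
uri = build_async_uri(
    user=os.environ.get("MEMORY_DB_USER", "postgres"),
    password=os.environ.get("MEMORY_DB_PASSWORD", "change-me"),
    host=os.environ.get("MEMORY_DB_HOST", "localhost"),
    port=int(os.environ.get("MEMORY_DB_PORT", "5432")),
    database=os.environ.get("MEMORY_DB_NAME", "postgres"),
)
print(uri.startswith("postgresql+asyncpg://"))  # True
```

The resulting string could be passed as async_database_uri=uri. Percent-encoding matters when a password contains characters such as "@" or "/".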

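Memory vs. Workflow Context

At this point in the docs, you may have encountered cases where you are using a Workflow and serializing a Context object to save and resume a specific workflow state. The workflow Context is a complex object that holds runtime information about the workflow, as well as key/value pairs that are shared across workflow steps.

In comparison, the Memory object is simpler: it holds only ChatMessage objects and, optionally, a list of MemoryBlock objects for long-term memory.

In most practical cases, you will end up using both. If you aren't customizing the memory, serializing the Context object is sufficient.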

from llama_index.core.workflow import Context

ctx = Context(workflow)

# serialize the context
ctx_dict = ctx.to_dict()

# deserialize the context
ctx = Context.from_dict(workflow, ctx_dict)
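
To keep the state across process restarts, the dictionary has to be written somewhere. The sketch below shows a JSON round trip with the standard library. It assumes the dictionary returned by ctx.to_dict() is JSON-serializable, and it uses a made-up stand-in dictionary so that it runs without a workflow.

```python
import json
import os
import tempfile

# Stand-in for the dictionary returned by ctx.to_dict(); the keys are made up.
ctx_dict = {"state": {"user_name": "Logan"}, "is_running": False}

# Write the dictionary to disk.
path = os.path.join(tempfile.mkdtemp(), "ctx.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(ctx_dict, f)

# Later (possibly in another process), read it back.
with open(path, encoding="utf-8") as f:
    restored_dict = json.load(f)

print(restored_dict == ctx_dict)  # True
```

The restored dictionary would then take the place of ctx_dict in the Context.from_dict(workflow, ctx_dict) call.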

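In other cases, such as when using FunctionAgent, AgentWorkflow, or ReActAgent, if you customize the memory you will want to provide it as a separate runtime argument (especially since, beyond the default, the Memory object is not serializable).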

response = await agent.run("Hello!", memory=memory)

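Lastly, there are cases (such as human-in-the-loop) where you will need to provide both the Context (to resume the workflow) and the Memory (to store the chat history).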

response = await agent.run("Hello!", ctx=ctx, memory=memory)

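(Deprecated) Memory Types

In llama_index.core.memory, we offer a few different memory types:

ChatMemoryBuffer: a basic memory buffer that stores the last X messages that fit into a token limit.
ChatSummaryMemoryBuffer: a memory buffer that stores the last X messages that fit into a token limit, and also periodically summarizes the conversation when it becomes too long.
VectorMemory: a memory that stores and retrieves chat messages from a vector database. It makes no guarantees about message order, and returns the messages most similar to the latest user message.
SimpleComposableMemory: a memory that composes multiple memories together. It is usually used to combine VectorMemory with ChatMemoryBuffer or ChatSummaryMemoryBuffer.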
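
The phrase "the last X messages that fit into a token limit" can be read as: walk backwards from the newest message and stop once the budget would be exceeded. The sketch below models that reading in plain Python. last_messages_within_limit is a hypothetical helper, messages are plain strings, and whitespace splitting stands in for a real tokenizer; it is not the buffer's actual implementation.

```python
def last_messages_within_limit(messages, token_limit):
    """Return the longest run of most recent messages that fits the limit."""
    kept = []
    used = 0
    for message in reversed(messages):
        cost = len(message.split())  # whitespace tokens as a stand-in
        if used + cost > token_limit:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["hi there", "hello how can I help", "what is the weather", "it is sunny"]
print(last_messages_within_limit(history, token_limit=8))
# ['what is the weather', 'it is sunny']
```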

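Examples

You can find a few examples of memory in action below:

Note: Deprecated examples:
- Chat Memory Buffer
- Chat Summary Memory Buffer
- Composable Memory
- Vector Memory
- Mem0 Memory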