Introduction to Structured Data Extraction

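Large language models (LLMs) excel at understanding data, which enables one of their most important uses: turning ordinary human language (which we call unstructured data) into a specific, regular, expected format that computer programs can consume. We call the output of this process structured data. Because the conversion usually discards a lot of superfluous information, we call it extraction.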

Structured data extraction in LlamaIndex is built around Pydantic classes: you define a data structure in Pydantic, and LlamaIndex works with Pydantic to coerce the LLM's output into that structure.

What is Pydantic?

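Pydantic is a widely used data validation and conversion library that relies heavily on Python type declarations. The project's documentation includes an extensive guide, but we cover the very basics here.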

To create a Pydantic class, inherit from Pydantic's BaseModel class:

from pydantic import BaseModel


class User(BaseModel):
    id: int
    name: str = "Jane Doe"

This example creates a User class with two fields, id and name. id is defined as an integer, and name as a string that defaults to Jane Doe.
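To make the validation and conversion behavior concrete, here is a minimal sketch that instantiates the User class. It assumes Pydantic's default (non-strict) validation mode, in which a numeric string is accepted for an integer field; the input values are made up for illustration.

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    id: int
    name: str = "Jane Doe"


# The numeric string "123" is converted to an int (default, non-strict mode);
# name is not supplied, so it falls back to its default value.
user = User(id="123")
print(user.id, user.name)

# A value that cannot be interpreted as an integer is rejected.
try:
    User(id="not a number")
except ValidationError as exc:
    print(f"validation failed with {len(exc.errors())} error(s)")
```

The same mechanism is what lets imperfectly typed data be coerced into the declared structure or rejected outright.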

You can build more complex structures by nesting these models:

from typing import List, Optional
from pydantic import BaseModel


class Foo(BaseModel):
    count: int
    size: Optional[float] = None


class Bar(BaseModel):
    apple: str = "x"
    banana: str = "y"


class Spam(BaseModel):
    foo: Foo
    bars: List[Bar]

Now Spam has a foo and a bars. Foo has a count and an optional size, and bars is a list of objects, each of which has an apple and a banana property.
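As a rough illustration of how nesting behaves at runtime, the sketch below builds a Spam instance from plain dictionaries. The input values are hypothetical; the point is that Pydantic validates each nested dictionary against the corresponding model and fills in defaults for omitted fields.

```python
from typing import List, Optional

from pydantic import BaseModel


class Foo(BaseModel):
    count: int
    size: Optional[float] = None


class Bar(BaseModel):
    apple: str = "x"
    banana: str = "y"


class Spam(BaseModel):
    foo: Foo
    bars: List[Bar]


# Nested dicts are validated into Foo and Bar instances;
# fields left out of the input take their declared defaults.
spam = Spam(foo={"count": 4}, bars=[{"apple": "x1"}, {"banana": "y2"}])

print(spam.foo.count, spam.foo.size)
print([(bar.apple, bar.banana) for bar in spam.bars])
```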

Converting Pydantic objects to JSON schemas

Pydantic can convert a class into a JSON-serialized schema object that conforms to a widely used standard. For example, the User class above serializes to:

{
  "properties": {
    "id": {
      "title": "Id",
      "type": "integer"
    },
    "name": {
      "default": "Jane Doe",
      "title": "Name",
      "type": "string"
    }
  },
  "required": [
    "id"
  ],
  "title": "User",
  "type": "object"
}

This property is crucial: these JSON-formatted schemas are often passed to LLMs, and the LLMs in turn use them as instructions for how to return data.
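The schema for User can be produced with a single call. This sketch assumes Pydantic v2, where the class method is named model_json_schema; Pydantic v1 exposed similar functionality under a different name.

```python
import json

from pydantic import BaseModel


class User(BaseModel):
    id: int
    name: str = "Jane Doe"


# model_json_schema() (Pydantic v2) returns the schema as a plain dict.
schema = User.model_json_schema()

# Serialize it to text, for example to include it in a prompt.
print(json.dumps(schema, indent=2))
```

Because the result is an ordinary dictionary, it can be serialized with the standard json module and embedded wherever the schema is needed.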

Using annotations

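As mentioned, LLMs use the JSON schemas generated by Pydantic as instructions for how to return data. To help them return accurate data, it is useful to include natural-language descriptions of objects and fields that say what they are for. Pydantic supports this through docstrings and Field.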

The following Pydantic classes are used in all of the examples that follow:

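from datetime import datetime

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    """A line item in an invoice."""

    item_name: str = Field(description="The name of this item")
    price: float = Field(description="The price of this item")


class Invoice(BaseModel):
    """A representation of information from an invoice."""

    invoice_id: str = Field(
        description="A unique identifier for this invoice, often a number"
    )
    date: datetime = Field(description="The date this invoice was created")
    line_items: list[LineItem] = Field(
        description="A list of all the items in this invoice"
    )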

This expands into a more complex JSON schema:

{
  "$defs": {
    "LineItem": {
      "description": "A line item in an invoice.",
      "properties": {
        "item_name": {
          "description": "The name of this item",
          "title": "Item Name",
          "type": "string"
        },
        "price": {
          "description": "The price of this item",
          "title": "Price",
          "type": "number"
        }
      },
      "required": [
        "item_name",
        "price"
      ],
      "title": "LineItem",
      "type": "object"
    }
  },
  "description": "A representation of information from an invoice.",
  "properties": {
    "invoice_id": {
      "description": "A unique identifier for this invoice, often a number",
      "title": "Invoice Id",
      "type": "string"
    },
    "date": {
      "description": "The date this invoice was created",
      "format": "date-time",
      "title": "Date",
      "type": "string"
    },
    "line_items": {
      "description": "A list of all the items in this invoice",
      "items": {
        "$ref": "#/$defs/LineItem"
      },
      "title": "Line Items",
      "type": "array"
    }
  },
  "required": [
    "invoice_id",
    "date",
    "line_items"
  ],
  "title": "Invoice",
  "type": "object"
}
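To show the other half of the round trip, the sketch below validates a JSON string against the Invoice class. The JSON here is hand-written and only stands in for what an LLM might return when given the schema above; it is not real model output, and this is not how LlamaIndex performs the step internally. It assumes Pydantic v2 (model_validate_json) and Python 3.9+ for the list[LineItem] annotation.

```python
from datetime import datetime

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    """A line item in an invoice."""

    item_name: str = Field(description="The name of this item")
    price: float = Field(description="The price of this item")


class Invoice(BaseModel):
    """A representation of information from an invoice."""

    invoice_id: str = Field(
        description="A unique identifier for this invoice, often a number"
    )
    date: datetime = Field(description="The date this invoice was created")
    line_items: list[LineItem] = Field(
        description="A list of all the items in this invoice"
    )


# Hypothetical JSON, standing in for an LLM response that follows the schema.
raw = """
{
  "invoice_id": "12345",
  "date": "2024-03-01T09:30:00",
  "line_items": [
    {"item_name": "Widget", "price": 9.99},
    {"item_name": "Gadget", "price": 20}
  ]
}
"""

# Parse and validate in one step: the date string becomes a datetime,
# and the integer price 20 becomes a float.
invoice = Invoice.model_validate_json(raw)

print(invoice.invoice_id, invoice.date.isoformat())
print([(item.item_name, item.price) for item in invoice.line_items])
```

If the JSON omitted a required field such as invoice_id, validation would raise a ValidationError instead of returning a partially filled object.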

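Now that you have a basic understanding of Pydantic and the schemas it generates, you can move on to using Pydantic classes for structured data extraction in LlamaIndex, starting with Structured LLMs.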