Using Structured LLMs

The highest-level way to extract structured data in LlamaIndex is to instantiate a structured LLM. First, let's define our Pydantic classes as before:

from datetime import datetime

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    """A line item in an invoice."""

    item_name: str = Field(description="The name of this item")
    price: float = Field(description="The price of this item")


class Invoice(BaseModel):
    """A representation of information from an invoice."""

    invoice_id: str = Field(
        description="A unique identifier for this invoice, often a number"
    )
    date: datetime = Field(description="The date this invoice was created")
    line_items: list[LineItem] = Field(
        description="A list of all the items in this invoice"
    )

If this is your first time using LlamaIndex, let's install the dependencies first:

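  • Run pip install llama-index-core llama-index-llms-openai to get the LLM (we use OpenAI for simplicity, but you can choose another LLM)
  • Get an OpenAI API key and set it as an environment variable named OPENAI_API_KEY
  • Run pip install llama-index-readers-file to get the PDFReader
    • Note: for better PDF parsing, we recommend LlamaParse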

Now let's load the text of an actual invoice:

from llama_index.readers.file import PDFReader
from pathlib import Path

pdf_reader = PDFReader()
documents = pdf_reader.load_data(file=Path("./uber_receipt.pdf"))
text = documents[0].text

Then instantiate an LLM, give it our Pydantic class, and ask it to complete using the plain text of the invoice:

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")
sllm = llm.as_structured_llm(Invoice)

response = sllm.complete(text)

response is a LlamaIndex CompletionResponse object with two attributes: text and raw. text contains the JSON-serialized form of the Pydantic-parsed response:

import json

json_response = json.loads(response.text)
print(json.dumps(json_response, indent=2))
{
    "invoice_id": "Visa \u2022\u2022\u2022\u20224469",
    "date": "2024-10-10T19:49:00",
    "line_items": [
        {"item_name": "Trip fare", "price": 12.18},
        {"item_name": "Access for All Fee", "price": 0.1},
        {"item_name": "CA Driver Benefits", "price": 0.32},
        {"item_name": "Booking Fee", "price": 2.0},
        {"item_name": "San Francisco City Tax", "price": 0.21},
    ],
}

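Note that this invoice has no ID, so the LLM did its best and used the credit card number instead. Pydantic validation is not a guarantee of a correct result!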
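Since type validation alone cannot catch a semantically wrong value like this one, a small post-extraction check can help. The following is a minimal sketch using only the standard library. The helper name looks_like_masked_card and the regular expression are illustrative assumptions, not part of LlamaIndex or Pydantic.

```python
import re

# Heuristic (an assumption for illustration): a masked card number usually
# shows a run of bullet or asterisk characters followed by four digits.
MASKED_CARD = re.compile(r"[•*]{2,}\s*\d{4}\b")


def looks_like_masked_card(invoice_id: str) -> bool:
    """Return True if the extracted ID resembles a masked card number."""
    return MASKED_CARD.search(invoice_id) is not None


# The value extracted above is flagged; a hypothetical invoice number is not.
print(looks_like_masked_card("Visa ••••4469"))   # True
print(looks_like_masked_card("INV-2024-0042"))   # False
```

A check like this could run on response.raw.invoice_id before the record is stored, so that suspicious values are routed to manual review rather than silently accepted.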

The raw attribute of response (somewhat confusingly) contains the Pydantic object itself:

from pprint import pprint

pprint(response.raw)
Invoice(
    invoice_id="Visa ••••4469",
    date=datetime.datetime(2024, 10, 10, 19, 49),
    line_items=[
        LineItem(item_name="Trip fare", price=12.18),
        LineItem(item_name="Access for All Fee", price=0.1),
        LineItem(item_name="CA Driver Benefits", price=0.32),
        LineItem(item_name="Booking Fee", price=2.0),
        LineItem(item_name="San Francisco City Tax", price=0.21),
    ],
)

Note that Pydantic creates a full datetime object rather than just passing the string through.
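To illustrate why a real datetime is more useful than a string, here is a small sketch using only the standard library. It parses the ISO 8601 string from the JSON output with datetime.fromisoformat, which stands in here for the parsing Pydantic performs. The 30-day payment window is a hypothetical example, not something taken from the invoice.

```python
from datetime import datetime, timedelta

# The ISO 8601 string as it appears in the JSON output above.
date_string = "2024-10-10T19:49:00"

# Standard-library parsing, used as a stand-in for what Pydantic does
# when it populates the `date` field.
created = datetime.fromisoformat(date_string)

# With a datetime, field access and calendar arithmetic work directly.
# The 30-day window is a hypothetical business rule.
payment_due = created + timedelta(days=30)

print(created.year, created.month, created.day)  # 2024 10 10
print(payment_due.date().isoformat())            # 2024-11-09
```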

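A structured LLM works exactly like a regular LLM class: you can call chat, stream, achat, astream, and so on, and it will respond with Pydantic objects in every case. You can also pass the structured LLM as a parameter to VectorStoreIndex.as_query_engine(llm=sllm), and it will automatically answer your RAG queries with structured objects.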
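As a rough sketch of the chat-style usage, the two helpers below wrap a structured LLM such as the sllm created earlier. The function names extract_with_chat and stream_with_chat are hypothetical. The code assumes that llama-index-core is installed and that, as with complete, the Pydantic object is exposed on the raw attribute of each chat response.

```python
def extract_with_chat(sllm, text: str):
    """Send the text as a single user message and return the Pydantic object."""
    # Imported lazily so this sketch can be loaded without llama-index installed.
    from llama_index.core.llms import ChatMessage

    message = ChatMessage(role="user", content=text)
    response = sllm.chat([message])
    # Assumption: as with complete(), `raw` holds the Pydantic object.
    return response.raw


def stream_with_chat(sllm, text: str):
    """Yield the partially populated Pydantic objects as the response streams in."""
    from llama_index.core.llms import ChatMessage

    message = ChatMessage(role="user", content=text)
    for partial in sllm.stream_chat([message]):
        yield partial.raw
```

Calling extract_with_chat(sllm, text) with the invoice text from earlier should yield an Invoice instance comparable to response.raw above, although the exact values depend on the model.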

The structured LLM takes care of all the prompting for you. If you want more control over the prompt, read on to Structured Prediction.