Low-level structured data extraction

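If your LLM supports tool calling and you need more direct control over how LlamaIndex extracts data, you can call chat_with_tools on the LLM directly. If your LLM does not support tool calling, you can instruct the LLM directly and parse the output yourself. We'll show both approaches.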

Calling tools directly

from llama_index.core.program.function_program import get_function_tool

tool = get_function_tool(Invoice)

resp = llm.chat_with_tools(
    [tool],
    # chat_history=chat_history,  # can optionally pass in chat history instead of user_msg
    user_msg="Extract an invoice from the following text: " + text,
    tool_required=True,  # can optionally force the tool call
)

tool_calls = llm.get_tool_calls_from_response(
    resp, error_on_no_tool_calls=False
)

outputs = []
for tool_call in tool_calls:
    if tool_call.tool_name == "Invoice":
        outputs.append(Invoice(**tool_call.tool_kwargs))

# use your outputs
print(outputs[0])

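If the LLM has a tool calling API, this is identical to structured_predict. However, if the LLM supports it, you can optionally allow multiple tool calls. This lets you extract several objects from the same input, as in this example: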

from llama_index.core.program.function_program import get_function_tool

tool = get_function_tool(LineItem)

resp = llm.chat_with_tools(
    [tool],
    user_msg="Extract line items from the following text: " + text,
    allow_parallel_tool_calls=True,
)

tool_calls = llm.get_tool_calls_from_response(
    resp, error_on_no_tool_calls=False
)

outputs = []
for tool_call in tool_calls:
    if tool_call.tool_name == "LineItem":
        outputs.append(LineItem(**tool_call.tool_kwargs))

# use your outputs
print(outputs)

If your goal is to extract multiple Pydantic objects from a single LLM call, this is how to do it.
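The filtering loop used in both examples above can be factored into a small helper. The following is a minimal sketch, not part of LlamaIndex: `collect_outputs` and `FakeToolCall` are hypothetical names, and a dataclass with assumed fields stands in for the `LineItem` Pydantic model so the snippet runs without an LLM. It assumes only that each tool call exposes `tool_name` and `tool_kwargs`, as in the code above.

```python
from dataclasses import dataclass


@dataclass
class FakeToolCall:
    """Hypothetical stand-in for one entry of get_tool_calls_from_response."""
    tool_name: str
    tool_kwargs: dict


@dataclass
class LineItem:
    """Stand-in for the LineItem Pydantic model; the fields are assumptions."""
    item_name: str
    price: float


def collect_outputs(tool_calls, models):
    """Build one object per tool call whose name maps to a known model.

    Returns (outputs, errors). errors holds (tool_call, exception) pairs for
    calls whose arguments did not fit the model, so one bad call does not
    discard the others.
    """
    outputs, errors = [], []
    for call in tool_calls:
        model = models.get(call.tool_name)
        if model is None:
            continue  # ignore calls to tools we did not register
        try:
            outputs.append(model(**call.tool_kwargs))
        except (TypeError, ValueError) as exc:
            # Dataclasses raise TypeError on missing arguments; Pydantic's
            # ValidationError is a ValueError subclass.
            errors.append((call, exc))
    return outputs, errors


calls = [
    FakeToolCall("LineItem", {"item_name": "Widget", "price": 9.5}),
    FakeToolCall("LineItem", {"item_name": "Gadget"}),  # missing price
    FakeToolCall("Other", {"x": 1}),  # unknown tool, skipped
]
outputs, errors = collect_outputs(calls, {"LineItem": LineItem})
print(outputs)
print(len(errors))
```

Because no tool call may come back when error_on_no_tool_calls=False, checking that `outputs` is non-empty before indexing into it avoids an IndexError.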

Direct prompting

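If for some reason none of LlamaIndex's attempts to make extraction easier work for you, you can bypass them, prompt the LLM directly, and parse the output yourself, as shown here: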

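import json
from pprint import pprint

schema = Invoice.model_json_schema()
prompt = "Here is a JSON schema for an invoice: " + json.dumps(
    schema, indent=2
)
prompt += (
    """
  Extract an invoice from the following text.
  Format your output as a JSON object according to the schema above.
  Do not include any other text than the JSON object.
  Omit any markdown formatting. Do not include any preamble or explanation.
"""
    + text
)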

response = llm.complete(prompt)

print(response)

invoice = Invoice.model_validate_json(response.text)

pprint(invoice)
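Even with these instructions, a model may wrap its answer in a Markdown fence or add a preamble, in which case model_validate_json fails on the raw text. The sketch below shows one possible pre-processing step using only the standard library. `extract_json_object` is a hypothetical helper, not a LlamaIndex function, and the field names in the sample string are made up for illustration.

```python
import json


def extract_json_object(raw: str) -> dict:
    """Return the first JSON object embedded in raw LLM output.

    Tries every "{" as a possible start, so surrounding prose and
    Markdown fences are skipped.
    """
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(raw, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = raw.find("{", start + 1)
    raise ValueError("no JSON object found in LLM output")


# Simulated response: a preamble plus a fenced JSON object.
fence = "`" * 3
raw = (
    "Here is the invoice:\n"
    + fence + "json\n"
    + '{"invoice_id": "INV-1", "total": 12.5}\n'
    + fence
)
data = extract_json_object(raw)
print(data)
```

The resulting dict can then be validated with `Invoice.model_validate(data)`, assuming the Pydantic v2 API already used above.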

Congratulations! You have now learned everything there is to know about structured data extraction in LlamaIndex.

Other Guides

For a deeper look at structured data extraction with LlamaIndex, see the following guides: