OpenAI JSON Mode vs. Function Calling for Data Extraction
OpenAI just released JSON mode: a new configuration that constrains the LLM to only generate strings that parse into valid JSON (but with no guarantee of conforming to any particular schema).
Before this, the best way to extract structured data from text was via function calling.
In this notebook, we explore the trade-offs between the latest JSON mode and function calling for structured output and data extraction.
Update: OpenAI has clarified that JSON mode is always enabled for function calling; for regular messages it has to be opted into explicitly (https://community.openai.com/t/json-mode-vs-function-calling/476994/4)
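To make the distinction concrete (a minimal sketch, not from the notebook): a JSON-mode reply is guaranteed to parse with `json.loads`, but nothing stops it from having the wrong shape, so validating against a schema is still a separate, manual step. `Item` below is a hypothetical model used only for illustration:

```python
import json
from typing import Optional

from pydantic import BaseModel, ValidationError


class Item(BaseModel):
    """Hypothetical schema we *wanted* the model to follow."""

    name: str
    price: float


# A JSON-mode style reply: valid JSON, but not shaped like an Item.
raw = '{"title": "Widget", "cost": "cheap"}'

data = json.loads(raw)  # always succeeds under JSON mode's guarantee

item: Optional[Item]
try:
    item = Item(**data)  # schema validation is a separate, manual step
except ValidationError:
    item = None

print(item)  # None: parseable JSON is not the same as schema-valid JSON
```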
Generating synthetic data
We'll start by generating some synthetic data for our data extraction tasks. Let's ask the LLM for a hypothetical sales call transcript.
In [ ]:
%pip install llama-index-llms-openai
%pip install llama-index-program-openai
In [ ]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-1106")
response = llm.complete(
    "Generate a sales call transcript, use real names, talk about a product, discuss some action items"
)
In [ ]:
transcript = response.text
print(transcript)
[Phone rings]

John: Hello, this is John.

Sarah: Hi John, this is Sarah from XYZ Company. I'm calling to discuss our new product, the XYZ Widget, and see if it might be a good fit for your business.

John: Hi Sarah, thanks for reaching out. I'm definitely interested in learning more about the XYZ Widget. Can you give me a quick overview of what it does?

Sarah: Of course! The XYZ Widget is a cutting-edge tool that helps businesses streamline their workflow and improve productivity. It's designed to automate repetitive tasks and provide real-time data analytics to help you make informed decisions.

John: That sounds really interesting. I can see how that could benefit our team. Do you have any case studies or success stories from other companies who have used the XYZ Widget?

Sarah: Absolutely, we have several case studies that I can share with you. I'll send those over along with some additional information about the product. I'd also love to schedule a demo for you and your team to see the XYZ Widget in action.

John: That would be great. I'll make sure to review the case studies and then we can set up a time for the demo. In the meantime, are there any specific action items or next steps we should take?

Sarah: Yes, I'll send over the information and then follow up with you to schedule the demo. In the meantime, feel free to reach out if you have any questions or need further information.

John: Sounds good, I appreciate your help Sarah. I'm looking forward to learning more about the XYZ Widget and seeing how it can benefit our business.

Sarah: Thank you, John. I'll be in touch soon. Have a great day!

John: You too, bye.
Setting up our desired schema
Let's specify the desired output "structure" as a Pydantic model.
In [ ]:
from pydantic import BaseModel, Field
from typing import List


class CallSummary(BaseModel):
    """Data model for a call summary."""

    summary: str = Field(
        description="High-level summary of the call transcript. Should not exceed 3 sentences."
    )
    products: List[str] = Field(
        description="List of products discussed in the call"
    )
    rep_name: str = Field(description="Name of the sales rep")
    prospect_name: str = Field(description="Name of the prospect")
    action_items: List[str] = Field(description="List of action items")
Data extraction with function calling
We can use the OpenAIPydanticProgram module in LlamaIndex to greatly simplify the process: simply define a prompt template and pass in the LLM and pydantic model we've already defined.
In [ ]:
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage
In [ ]:
prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts."
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)

program = OpenAIPydanticProgram.from_defaults(
    output_cls=CallSummary,
    llm=llm,
    prompt=prompt,
    verbose=True,
)
In [ ]:
output = program(transcript=transcript)
Function call: CallSummary with args: {"summary":"Sarah from XYZ Company called to discuss the new product, the XYZ Widget, which John expressed interest in. Sarah offered to share case studies and schedule a demo. They agreed to review the case studies and set up a time for the demo. The next steps include Sarah sending over information and following up to schedule the demo.","products":["XYZ Widget"],"rep_name":"Sarah","prospect_name":"John","action_items":["Review case studies","Schedule demo"]}
We now have the desired structured data, as a Pydantic model. A quick inspection shows the results are as we expected.
In [ ]:
output.dict()
Out[ ]:
{'summary': 'Sarah from XYZ Company called to discuss the new product, the XYZ Widget, which John expressed interest in. Sarah offered to share case studies and schedule a demo. They agreed to review the case studies and set up a time for the demo. The next steps include Sarah sending over information and following up to schedule the demo.',
'products': ['XYZ Widget'],
'rep_name': 'Sarah',
'prospect_name': 'John',
'action_items': ['Review case studies', 'Schedule demo']}
Data extraction with JSON mode
Let's try to do the same with JSON mode instead of function calling.
In [ ]:
prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts.\n"
                "Generate a valid JSON following the given schema below:\n"
                "{json_schema}"
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)
In [ ]:
messages = prompt.format_messages(
    json_schema=CallSummary.schema_json(), transcript=transcript
)
In [ ]:
output = llm.chat(
    messages, response_format={"type": "json_object"}
).message.content
We get a valid JSON, but it merely parrots back the schema we specified instead of actually performing the extraction.
In [ ]:
print(output)
{
  "title": "CallSummary",
  "description": "Data model for a call summary.",
  "type": "object",
  "properties": {
    "summary": {
      "title": "Summary",
      "description": "High-level summary of the call transcript. Should not exceed 3 sentences.",
      "type": "string"
    },
    "products": {
      "title": "Products",
      "description": "List of products discussed in the call",
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "rep_name": {
      "title": "Rep Name",
      "description": "Name of the sales rep",
      "type": "string"
    },
    "prospect_name": {
      "title": "Prospect Name",
      "description": "Name of the prospect",
      "type": "string"
    },
    "action_items": {
      "title": "Action Items",
      "description": "List of action items",
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": ["summary", "products", "rep_name", "prospect_name", "action_items"]
}
Let's try again by directly showing the desired JSON format, instead of specifying its schema.
In [ ]:
import json

prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts.\n"
                "Generate a valid JSON in the following format:\n"
                "{json_example}"
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)

dict_example = {
    "summary": "High-level summary of the call transcript. Should not exceed 3 sentences.",
    "products": ["product 1", "product 2"],
    "rep_name": "Name of the sales rep",
    "prospect_name": "Name of the prospect",
    "action_items": ["action item 1", "action item 2"],
}

json_example = json.dumps(dict_example)
In [ ]:
messages = prompt.format_messages(
    json_example=json_example, transcript=transcript
)
In [ ]:
output = llm.chat(
    messages, response_format={"type": "json_object"}
).message.content
Now we're able to get the extracted structured data as we expected.
In [ ]:
print(output)
{
  "summary": "Sarah from XYZ Company called John to discuss the new product, the XYZ Widget, which is designed to streamline workflow and improve productivity. They discussed case studies and scheduling a demo for John and his team. The next steps include Sarah sending over information and following up to schedule the demo.",
  "products": ["XYZ Widget"],
  "rep_name": "Sarah",
  "prospect_name": "John",
  "action_items": ["Review case studies", "Schedule demo"]
}
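Since JSON mode only guarantees parseability, it's worth round-tripping the string through the Pydantic model to confirm it actually conforms. A minimal sketch, using a stand-in string shaped like the output above (in the notebook you would pass the `output` variable from the chat call instead):

```python
import json
from typing import List

from pydantic import BaseModel


class CallSummary(BaseModel):
    summary: str
    products: List[str]
    rep_name: str
    prospect_name: str
    action_items: List[str]


# Stand-in for the JSON-mode string returned by llm.chat(...) above.
output = (
    '{"summary": "Sarah pitched the XYZ Widget to John and will schedule a demo.",'
    ' "products": ["XYZ Widget"], "rep_name": "Sarah", "prospect_name": "John",'
    ' "action_items": ["Review case studies", "Schedule demo"]}'
)

# Parsing + validating in one step yields the same typed object that
# function calling produced directly.
summary = CallSummary(**json.loads(output))
print(summary.rep_name)  # Sarah
```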
Quick takeaways
- Function calling remains easier to use for structured data extraction (especially if you have already defined your schema, e.g. as a pydantic model).
- JSON mode enforces the format of the output but does not validate it against a specified schema. Directly passing in a schema may not generate the expected JSON, and may require extra careful formatting and prompting.
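One common way to bridge that gap (a sketch under stated assumptions, not from the notebook): parse the JSON-mode reply, validate it against the Pydantic model, and re-prompt on failure. `call_llm` below is a hypothetical stub standing in for `llm.chat(..., response_format={"type": "json_object"})`:

```python
import json
from typing import List

from pydantic import BaseModel, ValidationError


class CallSummary(BaseModel):
    summary: str
    products: List[str]
    rep_name: str
    prospect_name: str
    action_items: List[str]


def call_llm(attempt: int) -> str:
    """Hypothetical stub: the first reply parrots the schema, the retry conforms."""
    if attempt == 0:
        return '{"title": "CallSummary", "type": "object"}'
    return (
        '{"summary": "Sarah pitched the XYZ Widget.", "products": ["XYZ Widget"],'
        ' "rep_name": "Sarah", "prospect_name": "John",'
        ' "action_items": ["Schedule demo"]}'
    )


def extract(max_retries: int = 3) -> CallSummary:
    for attempt in range(max_retries):
        try:
            # JSON mode guarantees json.loads succeeds; CallSummary(**...)
            # supplies the schema validation that JSON mode lacks.
            return CallSummary(**json.loads(call_llm(attempt)))
        except (json.JSONDecodeError, ValidationError):
            continue  # in real code: re-prompt the LLM with the error
    raise RuntimeError("No schema-valid output after retries")


result = extract()
print(result.rep_name)  # Sarah
```

The same loop works unchanged with the real chat call swapped in for the stub.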