多模态结构化输出：GPT-4o 与其他 GPT-4 变体的对比¶

在本笔记本中，我们使用 MultiModalLLMCompletionProgram 类实现基于图像的结构化数据提取。我们将对支持视觉功能的 GPT-4 模型进行横向对比。

In [ ]:

Copied!





%pip install llama-index-llms-openai -q
%pip install llama-index-multi-modal-llms-openai -q
%pip install llama-index-readers-file -q
%pip install -U llama-index-core -q
%pip install llama-index-llms-openai -q
%pip install llama-index-multi-modal-llms-openai -q
%pip install llama-index-readers-file -q
%pip install -U llama-index-core -q

In [ ]:

Copied!

from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd

图像数据集：PaperCards¶

在此数据提取任务中，我们将使用多模态大语言模型从所谓的"论文卡片"中提取信息。这些卡片是包含研究论文摘要的可视化图表。可通过执行以下命令从我们的Dropbox账户下载数据集。

下载镜像¶

In [ ]:

Copied!





!mkdir data
!wget "https://www.dropbox.com/scl/fo/jlxavjjzddcv6owvr9e6y/AJoNd0T2pUSeynOTtM_f60c?rlkey=4mvwc1r6lowmy7zqpnm1ikd24&st=1cs1gs9c&dl=1" -O data/paper_cards.zip
!unzip data/paper_cards.zip -d data
!rm data/paper_cards.zip
!mkdir data
!wget "https://www.dropbox.com/scl/fo/jlxavjjzddcv6owvr9e6y/AJoNd0T2pUSeynOTtM_f60c?rlkey=4mvwc1r6lowmy7zqpnm1ikd24&st=1cs1gs9c&dl=1" -O data/paper_cards.zip
!unzip data/paper_cards.zip -d data
!rm data/paper_cards.zip

将 PaperCard 作为 ImageDocument 加载¶

In [ ]:

Copied!





## import json
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader, Document

# context images
image_path = "./data"
image_documents = SimpleDirectoryReader(image_path).load_data()
## import json
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader, Document

# context images
image_path = "./data"
image_documents = SimpleDirectoryReader(image_path).load_data()

In [ ]:

Copied!





# let's see one
img_doc = image_documents[0]
image = Image.open(img_doc.image_path).convert("RGB")
plt.figure(figsize=(8, 8))
plt.axis("off")
plt.imshow(image)
plt.show()
# let's see one
img_doc = image_documents[0]
image = Image.open(img_doc.image_path).convert("RGB")
plt.figure(figsize=(8, 8))
plt.axis("off")
plt.imshow(image)
plt.show()

No description has been provided for this image

构建我们的多模态LLM补全程序（多模态结构化输出）¶

期望的结构化输出¶

在这里我们将定义数据类（即 Pydantic 的 BaseModel），用于存储从给定图像或 PaperCard 中提取的数据。

In [ ]:

Copied!





from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.bridge.pydantic import BaseModel, Field
from typing import List, Optional


# Desired output structure
class PaperCard(BaseModel):
    """Data class for storing text attributes of a PaperCard."""

    title: str = Field(description="Title of paper.")
    year: str = Field(description="Year of publication of paper.")
    authors: str = Field(description="Authors of paper.")
    arxiv_id: str = Field(description="Arxiv paper id.")
    main_contribution: str = Field(
        description="Main contribution of the paper."
    )
    insights: str = Field(
        description="Main insight or motivation for the paper."
    )
    main_results: List[str] = Field(
        description="The main results of the paper."
    )
    tech_bits: Optional[str] = Field(
        description="Describe what's being displayed in the technical bits section of the image."
    )
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.bridge.pydantic import BaseModel, Field
from typing import List, Optional


# Desired output structure
class PaperCard(BaseModel):
    """Data class for storing text attributes of a PaperCard."""

    title: str = Field(description="Title of paper.")
    year: str = Field(description="Year of publication of paper.")
    authors: str = Field(description="Authors of paper.")
    arxiv_id: str = Field(description="Arxiv paper id.")
    main_contribution: str = Field(
        description="Main contribution of the paper."
    )
    insights: str = Field(
        description="Main insight or motivation for the paper."
    )
    main_results: List[str] = Field(
        description="The main results of the paper."
    )
    tech_bits: Optional[str] = Field(
        description="Describe what's being displayed in the technical bits section of the image."
    )

接下来，我们定义 MultiModalLLMCompletionProgram。实际上我们将定义三个独立的程序，分别对应三个支持视觉的 GPT-4 模型：GPT-4o、GPT-4v 和 GPT-4Turbo。

In [ ]:

Copied!





paper_card_extraction_prompt = """
Use the attached PaperCard image to extract data from it and store into the
provided data class.
"""

gpt_4o = OpenAIMultiModal(model="gpt-4o", max_new_tokens=4096)

gpt_4v = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=4096)

gpt_4turbo = OpenAIMultiModal(
    model="gpt-4-turbo-2024-04-09", max_new_tokens=4096
)

multimodal_llms = {
    "gpt_4o": gpt_4o,
    "gpt_4v": gpt_4v,
    "gpt_4turbo": gpt_4turbo,
}

programs = {
    mdl_name: MultiModalLLMCompletionProgram.from_defaults(
        output_cls=PaperCard,
        prompt_template_str=paper_card_extraction_prompt,
        multi_modal_llm=mdl,
    )
    for mdl_name, mdl in multimodal_llms.items()
}
paper_card_extraction_prompt = """
Use the attached PaperCard image to extract data from it and store into the
provided data class.
"""

gpt_4o = OpenAIMultiModal(model="gpt-4o", max_new_tokens=4096)

gpt_4v = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=4096)

gpt_4turbo = OpenAIMultiModal(
    model="gpt-4-turbo-2024-04-09", max_new_tokens=4096
)

multimodal_llms = {
    "gpt_4o": gpt_4o,
    "gpt_4v": gpt_4v,
    "gpt_4turbo": gpt_4turbo,
}

programs = {
    mdl_name: MultiModalLLMCompletionProgram.from_defaults(
        output_cls=PaperCard,
        prompt_template_str=paper_card_extraction_prompt,
        multi_modal_llm=mdl,
    )
    for mdl_name, mdl in multimodal_llms.items()
}

让我们来试运行一下¶

In [ ]:

Copied!

# Please ensure you're using llama-index-core v0.10.37
papercard = programs["gpt_4o"](image_documents=[image_documents[0]])
# Please ensure you're using llama-index-core v0.10.37
papercard = programs["gpt_4o"](image_documents=[image_documents[0]])

In [ ]:

Copied!

papercard
papercard

Out[ ]:

PaperCard(title='CRITIC: LLMs Can Self-Correct With Tool-Interactive Critiquing', year='2023', authors='Gao, Zhibin et al.', arxiv_id='arXiv:2305.11738', main_contribution='A framework for verifying and then correcting hallucinations by large language models (LLMs) with external tools (e.g., text-to-text APIs).', insights='LLMs can hallucinate and produce false information. By using external tools, these hallucinations can be identified and corrected.', main_results=['CRITIC leads to marked improvements over baselines on QA, math, and toxicity reduction tasks.', 'Feedback from external tools is crucial for an LLM to self-correct.', 'CRITIC significantly outperforms baselines on QA, math, and toxicity reduction tasks.'], tech_bits='The technical bits section describes the CRITIC prompt, which includes an initial output, critique, and revision steps. It also highlights the tools used for critiquing, such as a calculator for math tasks and a toxicity classifier for toxicity reduction tasks.')

执行数据提取任务¶

既然我们已经测试了程序，现在就可以将这些程序应用到 PaperCards 的数据提取任务上了。

In [ ]:

Copied!

import time
import tqdm
import time
import tqdm

In [ ]:

Copied!





results = {}

for mdl_name, program in programs.items():
    print(f"Model: {mdl_name}")
    results[mdl_name] = {
        "papercards": [],
        "failures": [],
        "execution_times": [],
        "image_paths": [],
    }
    total_time = 0
    for img in tqdm.tqdm(image_documents):
        results[mdl_name]["image_paths"].append(img.image_path)
        start_time = time.time()
        try:
            structured_output = program(image_documents=[img])
            end_time = time.time() - start_time
            results[mdl_name]["papercards"].append(structured_output)
            results[mdl_name]["execution_times"].append(end_time)
            results[mdl_name]["failures"].append(None)
        except Exception as e:
            results[mdl_name]["papercards"].append(None)
            results[mdl_name]["execution_times"].append(None)
            results[mdl_name]["failures"].append(e)
    print()
results = {}

for mdl_name, program in programs.items():
    print(f"Model: {mdl_name}")
    results[mdl_name] = {
        "papercards": [],
        "failures": [],
        "execution_times": [],
        "image_paths": [],
    }
    total_time = 0
    for img in tqdm.tqdm(image_documents):
        results[mdl_name]["image_paths"].append(img.image_path)
        start_time = time.time()
        try:
            structured_output = program(image_documents=[img])
            end_time = time.time() - start_time
            results[mdl_name]["papercards"].append(structured_output)
            results[mdl_name]["execution_times"].append(end_time)
            results[mdl_name]["failures"].append(None)
        except Exception as e:
            results[mdl_name]["papercards"].append(None)
            results[mdl_name]["execution_times"].append(None)
            results[mdl_name]["failures"].append(e)
    print()

Model: gpt_4o

100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [09:01<00:00, 15.46s/it]

Model: gpt_4v

100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [17:29<00:00, 29.99s/it]

Model: gpt_4turbo

100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [14:50<00:00, 25.44s/it]

量化分析¶

在此，我们将对各程序进行快速定量分析。具体而言，我们将比较总故障次数、成功数据提取作业的总执行时间以及平均执行时间。

In [ ]:

Copied!

import numpy as np
import pandas as pd
import numpy as np
import pandas as pd

In [ ]:

Copied!





metrics = {
    "gpt_4o": {},
    "gpt_4v": {},
    "gpt_4turbo": {},
}

# error count
for mdl_name, mdl_results in results.items():
    metrics[mdl_name]["error_count"] = sum(
        el is not None for el in mdl_results["failures"]
    )
    metrics[mdl_name]["total_execution_time"] = sum(
        el for el in mdl_results["execution_times"] if el is not None
    )
    metrics[mdl_name]["average_execution_time"] = metrics[mdl_name][
        "total_execution_time"
    ] / (len(image_documents) - metrics[mdl_name]["error_count"])
    metrics[mdl_name]["median_execution_time"] = np.percentile(
        [el for el in mdl_results["execution_times"] if el is not None], q=0.5
    )
metrics = {
    "gpt_4o": {},
    "gpt_4v": {},
    "gpt_4turbo": {},
}

# error count
for mdl_name, mdl_results in results.items():
    metrics[mdl_name]["error_count"] = sum(
        el is not None for el in mdl_results["failures"]
    )
    metrics[mdl_name]["total_execution_time"] = sum(
        el for el in mdl_results["execution_times"] if el is not None
    )
    metrics[mdl_name]["average_execution_time"] = metrics[mdl_name][
        "total_execution_time"
    ] / (len(image_documents) - metrics[mdl_name]["error_count"])
    metrics[mdl_name]["median_execution_time"] = np.percentile(
        [el for el in mdl_results["execution_times"] if el is not None], q=0.5
    )

In [ ]:

Copied!

pd.DataFrame(metrics)
pd.DataFrame(metrics)

Out[ ]:

	gpt_4o	gpt_4v	gpt_4turbo
error_count	0.000000	14.000000	1.000000
total_execution_time	541.128802	586.500559	762.130032
average_execution_time	15.460823	27.928598	22.415589
median_execution_time	5.377015	11.879649	7.177287

GPT-4o 确实更快！¶

GPT-4o 在总执行时间（仅统计成功程序，失败的提取不计算在内）以及平均和中位执行时间上都明显更快
GPT-4o 不仅速度更快，还能成功提取所有 PaperCards 的数据。相比之下，GPT-4v 失败了 14 次，而 GPT-4turbo 失败了 1 次。

定性分析¶

在这最后的部分，我们将对提取结果进行定性分析。最终，我们将获得一个关于数据提取任务的人工评估"标记"数据集。接下来提供的工具将允许您对每张论文卡片（PaperCard）数据提取的三个程序（或模型）结果进行人工评估。作为标注员，您的工作是根据数据提取的完整度对程序结果进行0到5的评分（5分表示完美提取）。

In [ ]:

Copied!

from IPython.display import clear_output
from IPython.display import clear_output

In [ ]:

Copied!





def display_results_and_papercard(ix: int):
    # image
    image_path = results["gpt_4o"]["image_paths"][ix]

    # outputs
    gpt_4o_output = results["gpt_4o"]["papercards"][ix]
    gpt_4v_output = results["gpt_4v"]["papercards"][ix]
    gpt_4turbo_output = results["gpt_4turbo"]["papercards"][ix]

    image = Image.open(image_path).convert("RGB")
    plt.figure(figsize=(10, 10))
    plt.axis("off")
    plt.imshow(image)
    plt.show()

    print("GPT-4o\n")
    if gpt_4o_output is not None:
        print(json.dumps(gpt_4o_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")

    print("GPT-4v\n")
    if gpt_4v_output is not None:
        print(json.dumps(gpt_4v_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")

    print("GPT-4turbo\n")
    if gpt_4turbo_output is not None:
        print(json.dumps(gpt_4turbo_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")
def display_results_and_papercard(ix: int):
    # image
    image_path = results["gpt_4o"]["image_paths"][ix]

    # outputs
    gpt_4o_output = results["gpt_4o"]["papercards"][ix]
    gpt_4v_output = results["gpt_4v"]["papercards"][ix]
    gpt_4turbo_output = results["gpt_4turbo"]["papercards"][ix]

    image = Image.open(image_path).convert("RGB")
    plt.figure(figsize=(10, 10))
    plt.axis("off")
    plt.imshow(image)
    plt.show()

    print("GPT-4o\n")
    if gpt_4o_output is not None:
        print(json.dumps(gpt_4o_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")

    print("GPT-4v\n")
    if gpt_4v_output is not None:
        print(json.dumps(gpt_4v_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")

    print("GPT-4turbo\n")
    if gpt_4turbo_output is not None:
        print(json.dumps(gpt_4turbo_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")

In [ ]:

Copied!





GRADES = {
    "gpt_4o": [0] * len(image_documents),
    "gpt_4v": [0] * len(image_documents),
    "gpt_4turbo": [0] * len(image_documents),
}


def manual_evaluation_single(img_ix: int):
    """Update the GRADES dictionary for a single PaperCard
    data extraction task.
    """
    display_results_and_papercard(img_ix)

    gpt_4o_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4o."
    )
    gpt_4v_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4v."
    )
    gpt_4turbo_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4turbo."
    )

    GRADES["gpt_4o"][img_ix] = gpt_4o_grade
    GRADES["gpt_4v"][img_ix] = gpt_4v_grade
    GRADES["gpt_4turbo"][img_ix] = gpt_4turbo_grade


def manual_evaluations(img_ix: Optional[int] = None):
    """An interactive program for manually grading gpt-4 variants on the
    task of PaperCard data extraction.
    """
    if img_ix is None:
        # mark all results
        for ix in range(len(image_documents)):
            print(f"You are marking {ix + 1} out of {len(image_documents)}")
            print()
            manual_evaluation_single(ix)
            clear_output(wait=True)
    else:
        manual_evaluation_single(img_ix)
GRADES = {
    "gpt_4o": [0] * len(image_documents),
    "gpt_4v": [0] * len(image_documents),
    "gpt_4turbo": [0] * len(image_documents),
}


def manual_evaluation_single(img_ix: int):
    """Update the GRADES dictionary for a single PaperCard
    data extraction task.
    """
    display_results_and_papercard(img_ix)

    gpt_4o_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4o."
    )
    gpt_4v_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4v."
    )
    gpt_4turbo_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4turbo."
    )

    GRADES["gpt_4o"][img_ix] = gpt_4o_grade
    GRADES["gpt_4v"][img_ix] = gpt_4v_grade
    GRADES["gpt_4turbo"][img_ix] = gpt_4turbo_grade


def manual_evaluations(img_ix: Optional[int] = None):
    """An interactive program for manually grading gpt-4 variants on the
    task of PaperCard data extraction.
    """
    if img_ix is None:
        # mark all results
        for ix in range(len(image_documents)):
            print(f"You are marking {ix + 1} out of {len(image_documents)}")
            print()
            manual_evaluation_single(ix)
            clear_output(wait=True)
    else:
        manual_evaluation_single(img_ix)

In [ ]:

Copied!

manual_evaluations()
manual_evaluations()

You are marking 35 out of 35

GPT-4o

{
    "title": "Prometheus: Inducing Fine-Grained Evaluation Capability In Language Models",
    "year": "2023",
    "authors": "Kim, Seungone et al.",
    "arxiv_id": "arxiv:2310.08441",
    "main_contribution": "An open-source LLM (LLMav2) evaluation specializing in fine-grained evaluations using human-like rubrics.",
    "insights": "While large LLMs like GPT-4 have shown impressive performance, they still lack fine-grained evaluation capabilities. Prometheus aims to address this by providing a dataset and evaluation framework that can assess models on a more detailed level.",
    "main_results": [
        "Prometheus matches or outperforms GPT-4.",
        "Prometheus can function as a reward model.",
        "Reference answers are crucial for fine-grained evaluation."
    ],
    "tech_bits": "Score Rubric, Feedback Collection, Generated Instructions, Generated Responses, Generated Rubrics, Evaluations, Answers & Explanations"
}

============================================

GPT-4v

{
    "title": "PROMETHEUS: Fine-Grained Evaluation Capability In Language Models",
    "year": "2023",
    "authors": "Kim, George, et al.",
    "arxiv_id": "arXiv:2310.08941",
    "main_contribution": "PROMETHEUS presents a novel source-level LLM evaluation suite using a custom feedback collection interface.",
    "insights": "The insights section would contain a summary of the main insight or motivation for the paper as described in the image.",
    "main_results": [
        "The main results section would list the key findings or results of the paper as described in the image."
    ],
    "tech_bits": "The tech bits section would describe what's being displayed in the technical bits section of the image."
}

============================================

GPT-4turbo

{
    "title": "Prometheus: Evaluating Capability In Language Models",
    "year": "2023",
    "authors": "Kim, George, et al.",
    "arxiv_id": "arXiv:2310.05941",
    "main_contribution": "Prometheus uses a custom feedback collection system designed for fine-tuning language models.",
    "insights": "The main insight is that fine-tuning language models on specific tasks can improve their overall performance, especially when using a custom feedback collection system.",
    "main_results": [
        "Prometheus LM outperforms GPT-4 on targeted feedback tasks.",
        "Prometheus LM's custom feedback function was 2% more effective than Prometheus 3.",
        "Feedback quality was better as reported by human judges."
    ],
    "tech_bits": "The technical bits section includes a Rubric Score, Seed, Fine-Grained Annotations, and Models. It also shows a feedback collection process with a visual representation of the feedback loop involving seed, generated annotations, and models."
}

============================================

Provide a rating from 0 to 5, with 5 being the highest for GPT-4o. 3
Provide a rating from 0 to 5, with 5 being the highest for GPT-4v. 1.5
Provide a rating from 0 to 5, with 5 being the highest for GPT-4turbo. 1.5

In [ ]:

Copied!

grades_df = pd.DataFrame(GRADES, dtype=float)
grades_df.mean()
grades_df = pd.DataFrame(GRADES, dtype=float)
grades_df.mean()

Out[ ]:

gpt_4o        3.585714
gpt_4v        1.300000
gpt_4turbo    2.128571
dtype: float64

观察结果表¶

在下表中，我们列出了希望从 PaperCard 中提取的每个组件的总体观察结果。GPT-4v 和 GPT-4Turbo 表现相近，但 GPT-4Turbo 略胜一筹。总体而言，GPT-4o 在此数据提取任务中的表现显著优于其他模型。最后，所有模型在描述 PaperCard 的 Tech Bits 部分时似乎都存在困难，有时所有模型都会生成摘要而非精确提取；不过，GPT-4o 出现这种情况的频率低于其他模型。

提取的组件	GPT-4o	GPT-4v & GPT-4Turbo
标题、年份、作者	非常好，可能达到 100%	约 80%，在少数示例中出现幻觉
Arxiv ID	良好，准确率约 95%	准确率 70%
主要贡献	良好（约 80%），但无法提取列出的多个贡献	不太理想，准确率 60%，存在一些幻觉
见解	不太理想（约 65%），更多是总结而非提取	更多是总结而非提取
主要结果	非常擅长提取主要结果的总结性陈述	在此处出现大量幻觉
Tech Bits	无法生成此处图表的详细描述	无法生成此处图表的详细描述

概述¶

GPT-4o 比 GPT-4v 和 GPT-4turbo 速度更快且错误更少（零失误！）
GPT-4o 在数据提取结果上优于 GPT-4v 和 GPT-4turbo
GPT-4o 能出色地从 PaperCard 中提取事实信息：包括标题、作者、年份以及"主要成果"部分的重点陈述
GPT-4v 和 GPT-4turbo 经常虚构主要成果内容，有时甚至误报作者信息
通过优化提示词（特别是针对"研究洞见"部分的数据提取，以及"技术要点"描述），GPT-4o 的结果质量仍有提升空间