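Multi-Modal Structured Outputs: GPT-4o vs. Other GPT-4 Variants
In this notebook, we use the MultiModalLLMCompletionProgram class to extract structured data from images. We compare the GPT-4 models that support vision side by side.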
%pip install llama-index-llms-openai -q
%pip install llama-index-multi-modal-llms-openai -q
%pip install llama-index-readers-file -q
%pip install -U llama-index-core -q
from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd
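The Image Dataset: PaperCards
For this data extraction task, we use multi-modal LLMs to extract information from so-called PaperCards. These are visual cards that summarize a research paper. The dataset can be downloaded from our Dropbox account by running the commands below.
Download the images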
!mkdir data
!wget "https://www.dropbox.com/scl/fo/jlxavjjzddcv6owvr9e6y/AJoNd0T2pUSeynOTtM_f60c?rlkey=4mvwc1r6lowmy7zqpnm1ikd24&st=1cs1gs9c&dl=1" -O data/paper_cards.zip
!unzip data/paper_cards.zip -d data
!rm data/paper_cards.zip
Load PaperCards as ImageDocuments
import json  # needed later by display_results_and_papercard (json.dumps)
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader, Document
# context images
image_path = "./data"
image_documents = SimpleDirectoryReader(image_path).load_data()
# let's see one
img_doc = image_documents[0]
image = Image.open(img_doc.image_path).convert("RGB")
plt.figure(figsize=(8, 8))
plt.axis("off")
plt.imshow(image)
plt.show()
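Build Our MultiModalLLMCompletionProgram (Multi-Modal Structured Outputs)
Desired Structured Output
Here we define the data class (a Pydantic BaseModel) that holds the data extracted from a given image, i.e. a PaperCard.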
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.bridge.pydantic import BaseModel, Field
from typing import List, Optional
# Desired output structure
class PaperCard(BaseModel):
    """Data class for storing text attributes of a PaperCard."""

    title: str = Field(description="Title of paper.")
    year: str = Field(description="Year of publication of paper.")
    authors: str = Field(description="Authors of paper.")
    arxiv_id: str = Field(description="Arxiv paper id.")
    main_contribution: str = Field(
        description="Main contribution of the paper."
    )
    insights: str = Field(
        description="Main insight or motivation for the paper."
    )
    main_results: List[str] = Field(
        description="The main results of the paper."
    )
    tech_bits: Optional[str] = Field(
        description="Describe what's being displayed in the technical bits section of the image."
    )
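Conceptually, a structured-output program asks the model for JSON and then validates it against the data class. The sketch below illustrates only that validation step, using a stdlib dataclass as a stand-in for the Pydantic model. `PaperCardSketch` and `parse_paper_card` are hypothetical names that are not part of LlamaIndex, and giving `tech_bits` a `None` default is an assumption made for this illustration.

```python
import json
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PaperCardSketch:
    """Stdlib stand-in for the Pydantic PaperCard model (hypothetical)."""

    title: str
    year: str
    authors: str
    arxiv_id: str
    main_contribution: str
    insights: str
    main_results: List[str]
    tech_bits: Optional[str] = None  # assumption: optional, defaults to None


def parse_paper_card(raw_json: str) -> PaperCardSketch:
    """Parse a JSON object into a PaperCardSketch, rejecting missing fields."""
    payload = json.loads(raw_json)
    known = {f.name for f in fields(PaperCardSketch)}
    required = known - {"tech_bits"}
    missing = required - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Ignore any extra keys the model may have added.
    return PaperCardSketch(**{k: v for k, v in payload.items() if k in known})
```

A response that omits a required field raises `ValueError`, which is the kind of failure the extraction loop later in this notebook records per image.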
Next, we define our MultiModalLLMCompletionProgram. We actually define three separate programs, one for each vision-capable GPT-4 model: GPT-4o, GPT-4v, and GPT-4Turbo.
paper_card_extraction_prompt = """
Use the attached PaperCard image to extract data from it and store into the
provided data class.
"""
gpt_4o = OpenAIMultiModal(model="gpt-4o", max_new_tokens=4096)
gpt_4v = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=4096)
gpt_4turbo = OpenAIMultiModal(
    model="gpt-4-turbo-2024-04-09", max_new_tokens=4096
)

multimodal_llms = {
    "gpt_4o": gpt_4o,
    "gpt_4v": gpt_4v,
    "gpt_4turbo": gpt_4turbo,
}

# one extraction program per model, all sharing the same prompt and schema
programs = {
    mdl_name: MultiModalLLMCompletionProgram.from_defaults(
        output_cls=PaperCard,
        prompt_template_str=paper_card_extraction_prompt,
        multi_modal_llm=mdl,
    )
    for mdl_name, mdl in multimodal_llms.items()
}
Let's Give It a Test Run
# Please ensure you're using llama-index-core v0.10.37
papercard = programs["gpt_4o"](image_documents=[image_documents[0]])
papercard
PaperCard(title='CRITIC: LLMs Can Self-Correct With Tool-Interactive Critiquing', year='2023', authors='Gao, Zhibin et al.', arxiv_id='arXiv:2305.11738', main_contribution='A framework for verifying and then correcting hallucinations by large language models (LLMs) with external tools (e.g., text-to-text APIs).', insights='LLMs can hallucinate and produce false information. By using external tools, these hallucinations can be identified and corrected.', main_results=['CRITIC leads to marked improvements over baselines on QA, math, and toxicity reduction tasks.', 'Feedback from external tools is crucial for an LLM to self-correct.', 'CRITIC significantly outperforms baselines on QA, math, and toxicity reduction tasks.'], tech_bits='The technical bits section describes the CRITIC prompt, which includes an initial output, critique, and revision steps. It also highlights the tools used for critiquing, such as a calculator for math tasks and a toxicity classifier for toxicity reduction tasks.')
Run the Data Extraction Task
Now that we have tested a program, we are ready to apply all three programs to the PaperCard data extraction task.
import time
import tqdm
results = {}
for mdl_name, program in programs.items():
    print(f"Model: {mdl_name}")
    results[mdl_name] = {
        "papercards": [],
        "failures": [],
        "execution_times": [],
        "image_paths": [],
    }
    for img in tqdm.tqdm(image_documents):
        results[mdl_name]["image_paths"].append(img.image_path)
        start_time = time.time()
        try:
            structured_output = program(image_documents=[img])
            elapsed = time.time() - start_time
            results[mdl_name]["papercards"].append(structured_output)
            results[mdl_name]["execution_times"].append(elapsed)
            results[mdl_name]["failures"].append(None)
        except Exception as e:
            # keep the lists aligned by index: record the failure, no timing
            results[mdl_name]["papercards"].append(None)
            results[mdl_name]["execution_times"].append(None)
            results[mdl_name]["failures"].append(e)
    print()
Model: gpt_4o
100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [09:01<00:00, 15.46s/it]
Model: gpt_4v
100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [17:29<00:00, 29.99s/it]
Model: gpt_4turbo
100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [14:50<00:00, 25.44s/it]
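The loop above times each call and records either a result or an exception, keeping all lists aligned by image index. That pattern can be isolated into a small helper, sketched here with the standard library only. `timed_call` is a hypothetical name, and it uses `time.perf_counter` rather than the `time.time` used in the notebook.

```python
import time
from typing import Any, Callable, Optional, Tuple


def timed_call(
    fn: Callable[..., Any], *args: Any, **kwargs: Any
) -> Tuple[Optional[Any], Optional[float], Optional[Exception]]:
    """Call fn and return (result, elapsed_seconds, error).

    On success, error is None. On failure, result and elapsed_seconds
    are None, so failed calls never contribute to timing statistics.
    """
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    except Exception as exc:
        return None, None, exc
    return result, time.perf_counter() - start, None
```

Under these assumptions, a call such as `timed_call(program, image_documents=[img])` would yield the three values that the loop appends to `papercards`, `execution_times`, and `failures`.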
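Quantitative Analysis
Here we run a quick quantitative analysis of the programs. Specifically, we compare the total number of failures, the total execution time of the successful extraction jobs, and the average execution time.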
import numpy as np
import pandas as pd
metrics = {
    "gpt_4o": {},
    "gpt_4v": {},
    "gpt_4turbo": {},
}

# error counts and timing statistics (timings cover successful runs only)
for mdl_name, mdl_results in results.items():
    metrics[mdl_name]["error_count"] = sum(
        el is not None for el in mdl_results["failures"]
    )
    metrics[mdl_name]["total_execution_time"] = sum(
        el for el in mdl_results["execution_times"] if el is not None
    )
    metrics[mdl_name]["average_execution_time"] = metrics[mdl_name][
        "total_execution_time"
    ] / (len(image_documents) - metrics[mdl_name]["error_count"])
    # q is given in percent, so q=50 is the median. The recorded output
    # below was generated with q=0.5 (the 0.5th percentile).
    metrics[mdl_name]["median_execution_time"] = np.percentile(
        [el for el in mdl_results["execution_times"] if el is not None], q=50
    )
pd.DataFrame(metrics)
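| | gpt_4o | gpt_4v | gpt_4turbo |
|---|---|---|---|
| error_count | 0.000000 | 14.000000 | 1.000000 |
| total_execution_time | 541.128802 | 586.500559 | 762.130032 |
| average_execution_time | 15.460823 | 27.928598 | 22.415589 |
| median_execution_time | 5.377015 | 11.879649 | 7.177287 |

Note: the median_execution_time row in this recorded output was computed with np.percentile(..., q=0.5). Because q is expressed in percent, that is the 0.5th percentile, so these values sit near each model's fastest run rather than at its true median. All times are in seconds.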
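The metric definitions used above (failures counted, timings taken over successful runs only) can be checked on a toy list with the standard library. This is a minimal sketch: `summarize_times` is a hypothetical helper and the toy timings are made up for illustration, not taken from the run.

```python
from statistics import mean, median
from typing import Dict, List, Optional


def summarize_times(execution_times: List[Optional[float]]) -> Dict[str, float]:
    """Summarize per-image timings, where None marks a failed extraction."""
    ok = [t for t in execution_times if t is not None]
    return {
        "error_count": len(execution_times) - len(ok),
        "total_execution_time": sum(ok),
        "average_execution_time": mean(ok) if ok else float("nan"),
        "median_execution_time": median(ok) if ok else float("nan"),
    }


# Toy data (illustrative only): three successes and one failure.
summary = summarize_times([2.0, None, 4.0, 12.0])
print(summary)
```

With these toy values the mean (6.0) and the median (4.0) differ noticeably, which is why a single slow request can pull the average well above the typical latency.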
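GPT-4o Is Indeed Faster!
- GPT-4o has the lowest total execution time (only successful extractions are counted; failed ones are excluded) as well as the lowest average and median execution times. Its total covers all 35 cards, whereas the totals for GPT-4v and GPT-4turbo cover only 21 and 34 cards respectively.
- GPT-4o is not only faster, it also extracted data from every PaperCard. In contrast, GPT-4v failed 14 times and GPT-4turbo failed once.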
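Qualitative Analysis
In this final section, we carry out a qualitative analysis of the extraction results. The end product is a human-graded ("labelled") dataset for the data extraction task. The utilities below let you manually evaluate the output of all three programs (models) for each PaperCard. As the labeller, your job is to grade each program's result from 0 to 5 according to how complete the extraction is, with 5 being a perfect extraction.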
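Since the grades are typed in by hand, it can help to check that each one is a number on the 0-to-5 scale described above before storing it. The helper below is a minimal sketch; `parse_grade` is a hypothetical name and is not part of the notebook's utilities.

```python
def parse_grade(raw: str, low: float = 0.0, high: float = 5.0) -> float:
    """Convert a typed grade to float and check it lies in [low, high].

    Raises ValueError for non-numeric input or out-of-range values.
    """
    value = float(raw.strip())  # float() itself raises ValueError on bad text
    if not low <= value <= high:
        raise ValueError(f"grade {value} is outside [{low}, {high}]")
    return value
```

Half-point grades such as `"1.5"` are accepted, matching the ratings entered later in this notebook.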
from IPython.display import clear_output
def display_results_and_papercard(ix: int):
    # image
    image_path = results["gpt_4o"]["image_paths"][ix]

    # outputs
    gpt_4o_output = results["gpt_4o"]["papercards"][ix]
    gpt_4v_output = results["gpt_4v"]["papercards"][ix]
    gpt_4turbo_output = results["gpt_4turbo"]["papercards"][ix]

    image = Image.open(image_path).convert("RGB")
    plt.figure(figsize=(10, 10))
    plt.axis("off")
    plt.imshow(image)
    plt.show()

    print("GPT-4o\n")
    if gpt_4o_output is not None:
        print(json.dumps(gpt_4o_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")

    print("GPT-4v\n")
    if gpt_4v_output is not None:
        print(json.dumps(gpt_4v_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")

    print("GPT-4turbo\n")
    if gpt_4turbo_output is not None:
        print(json.dumps(gpt_4turbo_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")
GRADES = {
    "gpt_4o": [0] * len(image_documents),
    "gpt_4v": [0] * len(image_documents),
    "gpt_4turbo": [0] * len(image_documents),
}


def manual_evaluation_single(img_ix: int):
    """Update the GRADES dictionary for a single PaperCard
    data extraction task.
    """
    display_results_and_papercard(img_ix)

    gpt_4o_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4o."
    )
    gpt_4v_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4v."
    )
    gpt_4turbo_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4turbo."
    )

    # input() returns strings, so store the grades as numbers
    GRADES["gpt_4o"][img_ix] = float(gpt_4o_grade)
    GRADES["gpt_4v"][img_ix] = float(gpt_4v_grade)
    GRADES["gpt_4turbo"][img_ix] = float(gpt_4turbo_grade)


def manual_evaluations(img_ix: Optional[int] = None):
    """An interactive program for manually grading gpt-4 variants on the
    task of PaperCard data extraction.
    """
    if img_ix is None:
        # mark all results
        for ix in range(len(image_documents)):
            print(f"You are marking {ix + 1} out of {len(image_documents)}")
            print()
            manual_evaluation_single(ix)
            clear_output(wait=True)
    else:
        manual_evaluation_single(img_ix)
manual_evaluations()
You are marking 35 out of 35
GPT-4o
{
"title": "Prometheus: Inducing Fine-Grained Evaluation Capability In Language Models",
"year": "2023",
"authors": "Kim, Seungone et al.",
"arxiv_id": "arxiv:2310.08441",
"main_contribution": "An open-source LLM (LLMav2) evaluation specializing in fine-grained evaluations using human-like rubrics.",
"insights": "While large LLMs like GPT-4 have shown impressive performance, they still lack fine-grained evaluation capabilities. Prometheus aims to address this by providing a dataset and evaluation framework that can assess models on a more detailed level.",
"main_results": [
"Prometheus matches or outperforms GPT-4.",
"Prometheus can function as a reward model.",
"Reference answers are crucial for fine-grained evaluation."
],
"tech_bits": "Score Rubric, Feedback Collection, Generated Instructions, Generated Responses, Generated Rubrics, Evaluations, Answers & Explanations"
}
============================================
GPT-4v
{
"title": "PROMETHEUS: Fine-Grained Evaluation Capability In Language Models",
"year": "2023",
"authors": "Kim, George, et al.",
"arxiv_id": "arXiv:2310.08941",
"main_contribution": "PROMETHEUS presents a novel source-level LLM evaluation suite using a custom feedback collection interface.",
"insights": "The insights section would contain a summary of the main insight or motivation for the paper as described in the image.",
"main_results": [
"The main results section would list the key findings or results of the paper as described in the image."
],
"tech_bits": "The tech bits section would describe what's being displayed in the technical bits section of the image."
}
============================================
GPT-4turbo
{
"title": "Prometheus: Evaluating Capability In Language Models",
"year": "2023",
"authors": "Kim, George, et al.",
"arxiv_id": "arXiv:2310.05941",
"main_contribution": "Prometheus uses a custom feedback collection system designed for fine-tuning language models.",
"insights": "The main insight is that fine-tuning language models on specific tasks can improve their overall performance, especially when using a custom feedback collection system.",
"main_results": [
"Prometheus LM outperforms GPT-4 on targeted feedback tasks.",
"Prometheus LM's custom feedback function was 2% more effective than Prometheus 3.",
"Feedback quality was better as reported by human judges."
],
"tech_bits": "The technical bits section includes a Rubric Score, Seed, Fine-Grained Annotations, and Models. It also shows a feedback collection process with a visual representation of the feedback loop involving seed, generated annotations, and models."
}
============================================
Provide a rating from 0 to 5, with 5 being the highest for GPT-4o. 3
Provide a rating from 0 to 5, with 5 being the highest for GPT-4v. 1.5
Provide a rating from 0 to 5, with 5 being the highest for GPT-4turbo. 1.5
grades_df = pd.DataFrame(GRADES, dtype=float)
grades_df.mean()
gpt_4o        3.585714
gpt_4v        1.300000
gpt_4turbo    2.128571
dtype: float64
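Table of Observations
The table below lists our general observations for each component we wanted to extract from a PaperCard. GPT-4v and GPT-4Turbo performed similarly, with a slight edge to GPT-4Turbo. Overall, GPT-4o performed significantly better than the other models on this data extraction task. Finally, all models seemed to struggle with describing the Tech Bits section of the PaperCard, and at times every model produced a summary instead of an exact extraction; GPT-4o did this less often than the others, however.
| Extracted component | GPT-4o | GPT-4v & GPT-4Turbo |
|---|---|---|
| Title, Year, Authors | Very good, probably 100% | Roughly 80%, with hallucinations in a few examples |
| Arxiv ID | Good, around 95% accurate | 70% accurate |
| Main Contribution | Good (~80%), but could not extract multiple contributions when several were listed | Not so great, 60% accurate, with some hallucinations |
| Insights | Not so great (~65%); did more summarizing than extracting | Did more summarizing than extracting |
| Main Results | Very good at extracting the summary statements of the main results | Hallucinated a lot here |
| Tech Bits | Unable to generate detailed descriptions of the diagrams here | Unable to generate detailed descriptions of the diagrams here |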
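When checking the Arxiv ID row by hand, note that the models format the identifier inconsistently (the outputs in this notebook include both `arXiv:2305.11738` and `arxiv:2310.08441`). A small normalizer, sketched below, makes such values comparable. `normalize_arxiv_id` is a hypothetical helper, and it assumes the new-style arXiv format of four digits, a dot, then four or five digits.

```python
import re
from typing import Optional

# New-style arXiv identifier: YYMM.NNNN or YYMM.NNNNN, optional version suffix.
_ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?")


def normalize_arxiv_id(raw: str) -> Optional[str]:
    """Return the bare identifier without prefix or version, or None."""
    match = _ARXIV_ID.search(raw)
    return match.group(1) if match else None
```

Comparing normalized values avoids counting a differently cased `arXiv:` prefix as an extraction error, while a wrong digit still shows up as a mismatch.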
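Summary
- GPT-4o is faster than GPT-4v and GPT-4turbo and fails less often (zero failures!)
- GPT-4o produces better data extraction results than GPT-4v and GPT-4turbo
- GPT-4o is very good at extracting factual information from a PaperCard: the title, authors, year, and the headline statements of the Main Results section
- GPT-4v and GPT-4turbo often hallucinate the main results and sometimes get the authors wrong
- GPT-4o's results can probably be improved further with better prompting, especially for extracting the Insights section and for describing the Tech Bits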
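One way to act on the last point is to append field-specific instructions to the base prompt. The sketch below shows only how such a prompt string could be assembled. The hints are hypothetical and untested, so whether they improve extraction quality is an open question, and `build_extraction_prompt` is not part of LlamaIndex.

```python
from typing import Dict

# Base prompt text taken from this notebook.
BASE_PROMPT = (
    "Use the attached PaperCard image to extract data from it and store into the\n"
    "provided data class."
)

# Hypothetical, untested hints for the two weakest fields.
FIELD_HINTS: Dict[str, str] = {
    "insights": "copy the text of the insights section verbatim; do not summarize.",
    "tech_bits": "describe each diagram, listing its labelled boxes and arrows.",
}


def build_extraction_prompt(field_hints: Dict[str, str]) -> str:
    """Append one bullet per field hint to the base prompt."""
    lines = [BASE_PROMPT, "", "Field-specific instructions:"]
    lines.extend(f"- {name}: {hint}" for name, hint in field_hints.items())
    return "\n".join(lines) + "\n"


prompt = build_extraction_prompt(FIELD_HINTS)
print(prompt)
```

The resulting string could be passed as `prompt_template_str` in place of `paper_card_extraction_prompt`. Since that argument is treated as a template, it is probably safest to avoid literal curly braces in the hints.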