GPT4-V 实验:通用与特定问题及思维链(COT)提示技术的应用¶
GPT-4V 以其分析图像甚至从视觉内容生成网站代码的能力令我们惊叹。
本教程笔记探究 GPT-4V 在解析条形图、散点图和表格方面的熟练程度。我们旨在评估,相较于宽泛的提问,特定问题与思维链提示是否能产生更优质的响应。本次演示试图验证 GPT-4V 能否通过精准提问和系统推理技术突破这些已知限制。
在这些实验中我们观察到,提出具体问题比笼统提问能获得更佳答案。让我们深入探讨这些实验。
注意:本教程笔记本旨在向社区介绍 GPT-4V 的性能表现,但测试结果可能不具备普适性。我们强烈建议您在自己的数据集上使用类似问题进行测试后,再得出最终结论。
让我们按照以下步骤来分析这些图像以解答问题:
- 一般性问题:直接询问"分析这张图像"。
- 具体询问:详细询问某个类别或模型的性能表现。
- 思维链提示法:采用逐步推理的方法来进行分析。
这些指导原则旨在测试不同的提问技巧如何提高我们从图像中获取信息的精确度。
%pip install llama-index-multi-modal-llms-openai
!pip install llama-index
import os
OPENAI_API_KEY = "YOUR OPENAI API KEY"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
from llama_index.core import SimpleDirectoryReader
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
openai_mm_llm = OpenAIMultiModal(
model="gpt-4o",
api_key=OPENAI_API_KEY,
max_new_tokens=500,
temperature=0.0,
)
下载数据¶
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/gpt4_experiments/llama2_mistral.png' -O './llama2_mistral.png'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/gpt4_experiments/llama2_model_analysis.pdf' -O './llama2_model_analysis.png'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/gpt4_experiments/llama2_violations_charts.png' -O './llama2_violations_charts.png'
图1 - 不同类别下各大型语言模型的安全违规比例¶
from PIL import Image
import matplotlib.pyplot as plt
img = Image.open("llama2_violations_charts.png")
plt.imshow(img)
<matplotlib.image.AxesImage at 0x785d08fb6800>
# put your local directory here
image_documents = SimpleDirectoryReader(
input_files=["./llama2_violations_charts.png"]
).load_data()
常见问题¶
query = "Analyse the image"
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
The image you've provided is a bar chart displaying data grouped into three categories of online violations: hateful and harmful, illicit and criminal activity, and unqualified advice. Each of these categories has a number in parentheses, indicating the sample size of reported incidents for each type of violation (395, 728, and 311, respectively). The x-axis of the chart specifies various social media platforms or services, such as "Video sharing", "Social networking", "Gaming", "Dating", "Forums & boards", "Commercial Websites", "Media sharing", "P2P/File sharing", "Wiki", and "Other". It appears to measure how often these categories of violations occur on each type of platform. The y-axis measures the percentage of violations reported, ranging from 0% to 60%. Each platform/service has three bars corresponding to the three violation categories, showing their respective percentages. The bars also have error bars, which typically represent the variability of the data, such as standard deviation, standard error, or confidence interval, indicating the precision of the estimates. The chart has a legend indicating the color corresponding to each of the three categories of violations. This visual representation helps to compare the prevalence of different types of violations across the different types of online services. Without specific numbers, it's not possible to provide exact percentages, but we can observe trends, such as: - "Forums & boards" and "Social networking" platforms have notably higher percentages across all three types of violations compared to other platforms. - "Commercial Websites" seem to have lower percentages of reported hateful and harmful activities and unqualified advice but higher percentages of illicit and criminal activities when compared to other platforms like "Wiki" or "P2P/File sharing". - "Gaming" appears to have a moderate percentage of hateful and harmful violations, lower levels of illicit and criminal activity, and relatively higher levels of unqualified advice. Overall, this chart is a visual tool that likely aims to inform about the prevalence of certain types of online violations across different digital platforms.
观察:¶
如你所见,虽然分类包含仇恨与有害内容、非法及犯罪活动、以及不专业建议,但它与x轴值存在幻觉关联——"视频分享"、"社交网络"、"游戏"、"约会"、"论坛与留言板"、"商业网站"、"媒体分享"、"P2P/文件共享"、"维基"和"其他"。
具体问题¶
query = "Compare Llama2 models vs Vicuna models across categories."
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
The provided image is a bar graph with three categories along the x-axis: Hateful and harmful; Illicit and criminal activity; Unqualified advice. It shows a comparison of two types of models – Llama2 and Vicuna – across these categories in terms of violation percentage, which is represented on the y-axis. For each category, there are multiple bars representing different subcategories or criteria. The error bars on each bar indicate the confidence interval or variation in the percentage of violations. From the graph, it looks like the Vicuna model generally has a lower violation percentage across all subcategories compared to the Llama2 model. This suggests that Vicuna may perform better in terms of producing fewer content violations in these areas. However, without knowing the exact context or details of the models and the evaluation methodology, my interpretation is based solely on the visual data presented. If you have specific questions about each category or subcategory, or the implications of these results, feel free to ask!
观察结果:¶
它错误地声称 Vicuna 模型在所有子类别中的违规百分比普遍低于 Llama2 模型。
query = "which model among llama2 and vicuna models does better in terms of violation percentages in Hateful and harmful category."
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
From the provided bar chart, we can analyze the violation percentage for both the Llama2 and Vicuna models in the "Hateful and harmful" category. To determine which model does better (i.e., has a lower violation percentage), you would look at the height of the bars corresponding to each model within that category. In the "Hateful and harmful" category (which is the first group of bars on the left), you can compare the blue bar (representing Llama2) to the light blue bar (representing Vicuna). The model with the shorter bar in this category will have a lower violation percentage, and hence, better performance with respect to minimizing hateful and harmful content. Please note that I cannot physically see the chart, so I'm unable to specify which model has the lower violation percentage. If you provide the actual percentages or describe the relative bar lengths for Llama2 and Vicuna in the "Hateful and harmful" category, I could then tell you which model performs better in that respect.
观察结果:¶
未能准确捕捉信息,错误地将浅蓝色条形标识为代表 Vicuna,而实际上浅蓝色条形代表的是 Llama2。
现在让我们通过提供更详细的信息来检查,并提出相同的问题。
query = """In the image provided to you depicts about the violation rate performance of various AI models across Hateful and harmful, Illicit and criminal activity, Unqualified advice categories.
Hateful and harmful category is in first column. Bars with light blue are with Llama2 model and dark blue are with Vicuna models.
With this information, Can you compare about Llama2 and Vicuna models in Hateful and harmful category."""
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
In the image you provided, which shows a bar graph for the violation rate performance of various AI models in the Hateful and harmful category, we can analyze the light blue bars that represent the Llama2 model and the dark blue bars that represent the Vicuna model. Based on the visual information given, I can compare the two models within this category by observing the heights of the light blue bars (Llama2) versus the heights of the dark blue bars (Vicuna) for each subsection within the category. A lower bar indicates a lower violation rate, which could be interpreted as better performance in minimizing violations for the given criteria. Since I can't give you the exact numbers or percentages, I would describe their relative performances. It appears that for some subsections, the Llama2 bars are shorter than the Vicuna bars, suggesting that the Llama2 model could have a lower violation rate in those areas. Conversely, in other subsections, the Vicuna bars might be shorter than the Llama2 bars, indicating a lower violation rate for the Vicuna model in those areas. The exact subsections where one model outperforms the other would depend on their relative bar heights, which should be compared individually for the given information.
观察:¶
它确实正确回答了这个问题。
思维链提示¶
query = """Based on the image provided. Follow the steps and answer the query - which model among llama2 and vicuna does better in terms of violation percentages in 'Hateful and harmful'.
Examine the Image: Look at the mentioned category in the query in the Image.
Identify Relevant Data: Note the violation percentages.
Evaluate: Compare if there is any comparison required as per the query.
Draw a Conclusion: Now draw the conclusion based on the whole data."""
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
Examine the Image: The image is a bar chart depicting violation percentages in three categories: "Hateful and harmful," "Illicit and criminal activity," and "Unqualified advice." Each category has two bars next to each other representing two models named "llama2" and "vicuna." Identify Relevant Data: We are specifically interested in the "Hateful and harmful" category, which is the first set of bars on the left. The two bars indicate the violation percentages for "llama2" and "vicuna." Evaluate: By visually inspecting the bar corresponding to "Hateful and harmful," we see that the blue bar (indicating "llama2") has a lower percentage than the red bar (indicating "vicuna"). The exact percentages are not clear, but we can discern the relative performance between the two. Draw a Conclusion: From the data visible, "llama2" has a lower violation percentage than "vicuna" in the "Hateful and harmful" category, thereby doing better in that respect according to the given image.
观察结果:¶
通过思维链提示(chain of thought prompting),模型虽然出现了条形图颜色的幻觉,但正确回答出Llama2在仇恨和有害内容方面的违规率低于Vicuna。不过其中有一个部分显示Llama2的违规率高于Vicuna。
图2 - Llama2 与 Mistral 模型在各类 NLP 任务中的性能对比¶
img = Image.open("llama2_mistral.png")
plt.imshow(img)
<matplotlib.image.AxesImage at 0x785d06ebb820>
image_documents = SimpleDirectoryReader(
input_files=["./llama2_mistral.png"]
).load_data()
常见问题¶
query = "Analyse the image"
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
The image appears to be a set of four line graphs that compare the performance of different natural language processing (NLP) models across four different metrics: Natural Language Understanding (NLU), Reasoning, Knowledge, and Commonsense. The models compared are LLaMA 2, LLaMA 13B, Mistral, and GPT-3. Each graph plots the performance metric (on the y-axis) against the model size measured in billions of effective parameters (on the x-axis). In general, these plots demonstrate that as the model size increases, the performance on each metric improves. Here are specific observations for each graph: 1. Natural Language Understanding (Top left): All models show an increase in NLU performance as the model size increases. Mistral is depicted as the highest-performing model at each size benchmark. 2. Reasoning (Top right): Again, larger models perform better on reasoning tasks, with Mistral consistently outperforming the other models mentioned. 3. Knowledge (Bottom left): Similar to the NLU and reasoning tasks, the trend is for better performance with larger models, with Mistral leading across the different sizes. 4. Commonsense (Bottom right): This graph follows the same trend with the Mistral model performing better at each size level compared to the other models. Based on these graphs, Mistral appears to be the top-performing model in all four metrics, suggesting that it might be more efficient or effective than LLaMA 2, LLaMA 13B, and GPT-3, at least within the scope of the parameters and tasks represented here. On a technical note, it's important to mention that the y-axis shows different scales for the metrics (ranging from 45-75% for NLU, Knowledge, and Commonsense, and from 50-72% for Reasoning), which suggests that these tasks might have different levels of difficulty or different benchmarks for success.
观察:¶
虽然回答了查询,但在涉及MMLU任务的NLU(自然语言理解)任务时出现了幻觉现象,且错误假设Mistral模型适用于所有不同的参数规模。
具体问题¶
query = "How well does mistral model compared to llama2 model?"
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
The image you've provided contains four graphs, each plotting different performance metrics for evaluating language models. It compares two models: Mistral and LLaMA2 across four different aspects—MMLU (Multi-tasking multiple-choice), Reasoning, Knowledge, and Commonsense. In each graph, the performance metric is plotted on the y-axis while the effective model size in billion parameters is plotted on the x-axis. Here's a summary based on the trends observed in the graphs: 1. **MMLU (Top Left Graph)**: LLaMA2 shows a steeper performance curve as the model size increases, starting at around 61% for the smallest size and reaching upwards of 66% for the largest model. Mistral also shows performance gains with model size but plateaus earlier, reaching a level just shy of 62%. 2. **Reasoning (Top Right Graph)**: In the reasoning task, LLaMA2 again has a steeper improvement curve, starting from about 61% and surpassing 72% for the largest model. Mistral, while improving, seems to plateau near 70% for the largest model. 3. **Knowledge (Bottom Left Graph)**: This graph reflects a similar trend to the previous ones, with LLaMA2 beginning at a lower performance around 46% and eclipsing 52%. Mistral starts higher at around 48% and appears to plateau near 52%. 4. **Commonsense (Bottom Right Graph)**: Here, LLaMA2 starts its performance at approximately 62% and reaches just above 66%. Mistral seems to start at a slightly higher point than LLaMA2 but ends at a similar level to LLaMA2's largest model. Overall, the LLaMA2 model appears to show a greater degree of improvement in performance as the model size increases compared to Mistral across these metrics. Meanwhile, Mistral starts at a higher performance for some metrics but tends to plateau earlier, suggesting that LLaMA2 may scale better with size in terms of performance gains.
观察:¶
回答不正确且百分比数据不够精确,再次错误假设了Mistral模型适用于所有参数模型。
query = "Assuming mistral is available in 7B series. How well does mistral model compared to llama2 model?"
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
This image appears to be a set of four graphs, each comparing the performance of three different language models on various tasks: Mistral, LLaMA-2, and a hypothetical "Size 7B" model. The graphs are labeled with task-specific performance metrics: "MLM U", "Reasoning", "Knowledge", and "Commonsense". The x-axes on the graphs represent model size in terms of the number of parameters, with three points that likely correspond to the sizes of the models being compared. The y-axes represent performance as a percentage, which could mean accuracy, precision, recall, or another relevant performance metric depending on the specific task. The graphs appear to show that Mistral performs better than LLaMA-2 and the Size 7B model across all metrics. This indicates that within the context of these measurements and tasks, Mistral is a stronger model. The exact nature of the tasks or what "MLM U", "Reasoning", "Knowledge", and "Commonsense" specifically refer to are not detailed in the image, but they likely correspond to standard NLP tasks designed to test understanding of language, ability to reason, knowledge recall, and commonsense reasoning, respectively. The performance improvements are depicted as increasingly significant with larger model sizes.
观察:¶
现在提供了Mistral模型有7B系列版本的详细信息后,它能够正确回答问题。
思维链提示法。¶
query = """Based on the image provided. Follow the steps and answer the query - Assuming mistral is available in 7B series. How well does mistral model compared to llama2 model?.
Examine the Image: Look at the mentioned category in the query in the Image.
Identify Relevant Data: Note the respective percentages.
Evaluate: Compare if there is any comparison required as per the query.
Draw a Conclusion: Now draw the conclusion based on the whole data."""
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
Examine the Image: The image contains four graphs, each graph compares the performance of three different models—Llama 2, Mistral, and an unnamed third model—across different metrics: mAP@L (mean Average Precision at L), Reason@L (presumably a reasoning score at L), Knowledge@L, and Comprehension@L. Each graph shows performance as a function of model size (in terms of billion parameters). Identify Relevant Data: We need to focus on the Mistral and Llama 2 models across all four graphs to extract the relevant data. For mAP@L: - Llama 2 reaches above 65% when reaching 70 billion parameters. - Mistral starts at around 5 billion parameters with about 48% and reaches almost 65% by the time it gets to 34 billion parameters. For Reason@L: - Llama 2 reaches about 68% when reaching 70 billion parameters. - Mistral starts at around 5 billion parameters with about 52% and reaches almost 68% by the time it gets to 34 billion parameters. For Knowledge@L: - Llama 2 starts at about 45% with 1.3 billion parameters and reaches just slightly under 60% by 70 billion parameters. - Mistral is not plotted in this graph, so no direct comparison can be made. For Comprehension@L: - Llama 2 starts just above 40% with 1.3 billion parameters and reaches just under 60% by 70 billion parameters. - Mistral starts around 13 billion parameters with about 52% and continues above 62% by 34 billion parameters. Evaluate: From the graphs, we can see that Mistral and Llama 2 are very comparable in performance in the metrics where both are shown. However, Mistral is not displayed on the Knowledge@L graph, so no comparison can be made for that particular metric. Draw a Conclusion: In the graphs where both models can be compared (mAP@L, Reason@L, and Comprehension@L), Mistral appears to perform comparably to Llama 2, reaching similar levels of performance at smaller parameter sizes relative to the plotted sizes for Llama 2. This suggests that Mistral could be more parameter-efficient. However, since there is no data for Mistral in the Knowledge@L graph, we cannot draw a complete comparison
观察结果:¶
虽然最终结论部分正确,但在模型参数数量和百分比数据方面存在幻觉现象。
Image3 - 不同大语言模型在各类 NLP 任务中的性能表现¶
img = Image.open("llm_analysis.png")
plt.imshow(img)
<matplotlib.image.AxesImage at 0x785d05b2b3d0>
image_documents = SimpleDirectoryReader(
input_files=["./llama2_model_analysis.png"]
).load_data()
常见问题¶
query = "Analyse the image"
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
The image appears to be a table containing numerical data, which seems to be a comparative analysis of various models across multiple parameters or tests. Each row represents a different model or configuration as indicated by names such as "MPT" and "Falcon" followed by a size specification like "70B," "7B," or some other parameter. The columns are labeled with test names or evaluation metrics, such as "Size," "Avg," "AQuA-RAT," "LogiQA," "L-SAT-AR," "L-SAT-IC," "L-SAT-RC," "SAT-en (w/o Psg.)," and "SAT-math." The data is likely related to performance scores of these models on these tests, where higher numbers probably indicate better performance. Without additional context, it is difficult to provide a comprehensive analysis of this data, but it seems clear that it is intended to provide a performance comparison between different models on various tasks, possibly in the field of machine learning or artificial intelligence evaluations, where such models are typically assessed on reasoning, comprehension, or problem-solving capabilities. The "70B" and "7B" demarcations could refer to the size of the model in terms of the number of parameters, commonly used in assessing language models. A deeper evaluation would require further insights into the specific nature of these tests and models, along with the intended use-case for which they were being compared.
观察结果:¶
虽然未对图像进行具体分析,但在一定程度上理解了图像中呈现的整体数据。
具体问题¶
query = "which model has higher performance in SAT-en?"
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
In the table you've provided, the models are compared based on their performance on several benchmarks, including SAT-en (SAT Analogies, or "SAT" in the table). To determine which model has higher performance specifically on the SAT-en benchmark, you'll need to look at the corresponding column. In the SAT-en column (second from the right), the two models with the highest scores are: - LLaMA1 65B: with a score of 57.9 - LLaMA2 70B: with a score of 63.4 Between these two, the LLaMA2 model with 70 billion parameters shows the higher performance on the SAT-en benchmark with a score of 63.4.
观察结果:¶
回答内容本身是正确的,但其中数字部分存在幻觉现象(被虚构)。
query = "which model has higher performance in SAT-en in 7B series models?"
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
The table you've provided shows performance benchmark scores for different model sizes across various AI models such as MPT, Falcon, and LLaMA on different tasks, such as Aqua-RAT, LogiQA, LastAR, SAT-en, and SAT-math. For the SAT-en task specifically, you asked which model in the 7B series has the highest performance. To find out, we need to look at the column labeled "SAT-en (w/o Ps_8)." In the 7B series of models, here are the scores: - MPT 7B: 37.1 - Falcon 7B: 37.3 - LLaMA 7B: 63.9 - Model2 7B: 37.4 The LLaMA 7B model outperforms the other 7B models on the SAT-en (w/o Ps_8) task with a score of 63.9.
观察结果:¶
它确实识别出了模型名称并给出了正确答案,但对Llama系列模型的识别和数值判断存在错误。
思维链提示法¶
query = """Based on the image provided. Follow the steps and answer the query - which model has higher performance in SAT-en in 7B series models?
Examine the Image: Look at the mentioned category in the query in the Image.
Identify Relevant Data: Note the respective percentages.
Evaluate: Compare if there is any comparison required as per the query.
Draw a Conclusion: Now draw the conclusion based on the whole data."""
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
To answer which model has higher performance in SAT-en in the 7B series models, I will follow the provided steps: Examine the Image: The categories in the image include model names and sizes, and performance scores for various tasks, including the SAT-en category. Identify Relevant Data: The relevant data for the SAT-en category are the percentages listed under it for each 7B series model. Evaluate: I will compare the SAT-en percentages of each 7B series model. Draw a Conclusion: The SAT-en scores for the 7B series models are as follows: - MPT 7B: 63.1% - Falcon 7B: 73.4% - LLama 1 7B: No data present for this category. - LLama 2 7B: 76.6% Based on the data, the LLama 2 7B model has the highest SAT-en performance among the 7B series models with a score of 76.6%.
观察:¶
通过思维链提示(chain of the thought prompting),我们能够得出正确的结论,但需注意该方法最初选取了错误的值。
总结¶
在本教程笔记中,我们展示了从一般性查询到系统性提问及思维链提示技术的各类实验,并观察了幻觉率与准确性指标。
需要注意的是,GPT-4V 的输出可能存在一定的不一致性,且幻觉水平略有升高。因此,重复相同实验可能会得到不同答案,特别是在处理泛化性问题时。