Multi-Modal LLM Image Reasoning with Replicate LLaVA, Fuyu-8B, and MiniGPT-4 Models

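This notebook shows how to use the MultiModal LLM class for image understanding and reasoning. The models currently supported are LLaVA-13B, Fuyu-8B, and MiniGPT-4.

The second part demonstrates how to use stream complete and async complete with Replicate.

Note: For now, the Replicate multi-modal LLMs accept only one image document per call.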

In [ ]:

%pip install llama-index-multi-modal-llms-replicate

In [ ]:

%pip install replicate

Load and Initialize Replicate

In [ ]:

import os

REPLICATE_API_TOKEN = ""  # Your Replicate API token here
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN
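
If the token is left empty, the problem will most likely show up only when the first request is sent. The following is a minimal sketch, not part of the original notebook, of a guard that fails early instead. `require_replicate_token` is a hypothetical helper name, and it takes the environment mapping as a parameter so that it can be exercised without touching the real environment.

```python
import os
from typing import Mapping


def require_replicate_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the Replicate token from `env`, or raise if it is missing or blank."""
    token = env.get("REPLICATE_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "REPLICATE_API_TOKEN is not set; assign it before running inference."
        )
    return token
```

Calling `require_replicate_token()` right after the assignment above would stop the notebook before any model call is attempted.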

Download Images and Load Images Locally

In [ ]:

from PIL import Image
import requests
from io import BytesIO

from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core.schema import ImageDocument


if not os.path.exists("test_images"):
    os.makedirs("test_images")

# For now, the fuyu-8b model on Replicate handles JPG image URLs more reliably than local files
image_urls = [
    # "https://www.visualcapitalist.com/wp-content/uploads/2023/10/US_Mortgage_Rate_Surge-Sept-11-1.jpg",
    "https://www.sportsnet.ca/wp-content/uploads/2023/11/CP1688996471-1040x572.jpg",
    "https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg",
    "https://www.cleverfiles.com/howto/wp-content/uploads/2018/03/minion.jpg",
]
# save images
for idx, image_url in enumerate(image_urls):
    response = requests.get(image_url)
    img = Image.open(BytesIO(response.content))
    img.save(f"test_images/{idx}.png")

# option 1: load images from urls directly
# image_documents = load_image_urls(image_urls)
# option 2: load images from local
image_documents = [
    ImageDocument(image_path=f"test_images/{idx}.png")
    for idx in range(len(image_urls))
]
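
The download loop above stops at the first URL that fails. As a hedged alternative, the sketch below uses only the standard library and skips failed downloads. `fetch_bytes` and `download_images` are hypothetical names, and unlike the notebook cell this version writes the original bytes with the URL's own file suffix rather than re-encoding to PNG with Pillow.

```python
from pathlib import Path
from typing import Callable, Iterable, List
from urllib.request import urlopen


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download one URL with the standard library; raises on HTTP errors."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_images(
    urls: Iterable[str],
    dest_dir: str,
    fetch: Callable[[str], bytes] = fetch_bytes,
) -> List[Path]:
    """Save each URL's bytes as <index><original suffix>, skipping failed downloads."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for idx, url in enumerate(urls):
        try:
            data = fetch(url)
        except OSError as exc:  # urllib's URLError and HTTPError subclass OSError
            print(f"Skipping {url}: {exc}")
            continue
        # Drop any query string before reading the suffix; default to .jpg if none.
        suffix = Path(url.split("?")[0]).suffix or ".jpg"
        path = dest / f"{idx}{suffix}"
        path.write_bytes(data)
        saved.append(path)
    return saved
```

The `fetch` parameter exists so the function can be tried with a stub before pointing it at real URLs. The returned paths could then be passed to `ImageDocument(image_path=...)` as in the cell above.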

Visualize Images

In [ ]:

import matplotlib.pyplot as plt
from llama_index.core.response.notebook_utils import display_image_uris

image_paths = [str(img_doc.image_path) for img_doc in image_documents]
display_image_uris(image_paths)

Provide Various Prompts to Test Different Multi-Modal LLMs

In [ ]:

from llama_index.multi_modal_llms.replicate import ReplicateMultiModal
from llama_index.multi_modal_llms.replicate.base import (
    REPLICATE_MULTI_MODAL_LLM_MODELS,
)

prompts = [
    "what is shown in this image?",
    "how many people are shown in the image?",
    "is there anything unusual in the image?",
]

Generate Image Reasoning from Different LLMs with Different Prompts for Different Images

In [ ]:

res = []
for prompt_idx, prompt in enumerate(prompts):
    for image_idx, image_doc in enumerate(image_documents):
        for llm_idx, llm_model in enumerate(REPLICATE_MULTI_MODAL_LLM_MODELS):
            try:
                ## Initialize the MultiModal LLM model
                multi_modal_llm = ReplicateMultiModal(
                    model=REPLICATE_MULTI_MODAL_LLM_MODELS[llm_model],
                    max_new_tokens=100,
                    temperature=0.1,
                    num_input_files=1,
                    top_p=0.9,
                    num_beams=1,
                    repetition_penalty=1,
                )

                mm_resp = multi_modal_llm.complete(
                    prompt=prompt,
                    image_documents=[image_doc],
                )
            except Exception as e:
                print(
                    f"Error with LLM model inference with prompt {prompt}, image {image_idx}, and MM model {llm_model}"
                )
                print("Inference Failed due to: ", e)
                continue
            res.append(
                {
                    "model": llm_model,
                    "prompt": prompt,
                    "response": mm_resp,
                    "image": str(image_doc.image_path),
                }
            )
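
The triple loop above can also be expressed as a single sweep over every (prompt, image, model) combination. The sketch below is one possible shape, under the assumption that the model call is wrapped in a function passed in as `complete_fn`; `run_grid` and `complete_fn` are hypothetical names, and failures are collected in a list rather than only printed.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Sequence, Tuple


def run_grid(
    prompts: Sequence[str],
    images: Sequence[Any],
    model_names: Sequence[str],
    complete_fn: Callable[[str, str, Any], Any],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Call complete_fn(model_name, prompt, image) for every combination.

    Returns (results, failures) so that errors are recorded instead of only printed.
    """
    results, failures = [], []
    # product() varies the last argument fastest, matching the nested loops above.
    for prompt, image, model_name in product(prompts, images, model_names):
        record = {"model": model_name, "prompt": prompt, "image": str(image)}
        try:
            record["response"] = complete_fn(model_name, prompt, image)
        except Exception as exc:  # broad on purpose: keep the sweep going
            record["error"] = repr(exc)
            failures.append(record)
        else:
            results.append(record)
    return results, failures
```

In this notebook, `complete_fn` would build a `ReplicateMultiModal` for the given model name and call `complete(prompt=prompt, image_documents=[image])`, one image per call. Note that `str(image)` is a simplification; the cell above stores `str(image_doc.image_path)`.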

Display Sampled Responses from Multi-Modal LLMs

In [ ]:

from IPython.display import display
import pandas as pd

pd.options.display.max_colwidth = None
df = pd.DataFrame(res)
display(df[:5])

	model	prompt	response	image
0	llava-13b	what is shown in this image?	The image shows a man holding a gold trophy, possibly a soccer trophy, while wearing a suit and tie.	test_images/0.png
1	fuyu-8b	what is shown in this image?	The image shows a man wearing a suit and holding a golden ball trophy.	test_images/0.png
2	minigpt-4	what is shown in this image?	The image shows a man in a black suit and tie holding a golden trophy.	test_images/0.png
3	llava-13b	what is shown in this image?	The image shows a large, illuminated building, which is the Colosseum in Rome, Italy. The building is lit up at night, and the lights create a beautiful and dramatic effect.	test_images/1.png
4	fuyu-8b	what is shown in this image?	The image showcases a city street at night, with colorful lights illuminating the scene. The street is lined with buildings, including a prominent Roman-style amphitheater.	test_images/1.png
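
The flat table lists one row per model, so answers to the same prompt and image are spread over several rows. A minimal sketch for viewing them side by side, assuming records shaped like the entries of `res` (keys `model`, `prompt`, `response`, `image`); `side_by_side` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Tuple


def side_by_side(
    records: Iterable[Dict[str, Any]],
) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Group records by (prompt, image) and map each model name to its response text."""
    grouped: Dict[Tuple[str, str], Dict[str, str]] = defaultdict(dict)
    for rec in records:
        grouped[(rec["prompt"], rec["image"])][rec["model"]] = str(rec["response"])
    return dict(grouped)
```

Calling `side_by_side(res)` returns a dictionary keyed by `(prompt, image)`, which makes it easier to read the three models' answers to one question together.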

Human Labeling of the Correctness and Relevance of the Multi-Modal LLM Reasoning Results

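Note that human labels can be biased or subjective when judging relevance and correctness.

We label Correctness and Relevance scores on a scale of [1, 5]:

5: perfectly answers the question
4: mostly answers the question
3: partly answers the question
2: answers the question incorrectly
1: no answer or hallucination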

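Model	Prompt/Question	Model Reasoning Results	Correctness and Relevance [1,5]	Image
llava-13b	what is shown in this image?	The image shows a man holding a trophy, which appears to be a gold soccer ball. He is dressed in a suit and tie, and he is smiling as he holds the trophy.	4	test_images/0.png
fuyu-8b	what is shown in this image?	The image shows a man wearing a suit and holding a golden ball trophy.	4	test_images/0.png
minigpt-4	what is shown in this image?	The image shows a man in a black suit and tie holding a golden trophy.	4	test_images/0.png
llava-13b	what is shown in this image?	The image shows a large building illuminated at night, possibly a historic monument or a famous landmark. The building is lit up with colored lights, creating a visually striking scene.	4	test_images/1.png
fuyu-8b	what is shown in this image?	The image showcases a city street at night, with colorful lights illuminating the scene. The street is lined with buildings, including a prominent Roman-style amphitheater.	4	test_images/1.png
minigpt-4	what is shown in this image?	The image shows the Colosseum in Rome, Italy, which is one of the city's most famous landmarks and a popular tourist attraction.	5	test_images/1.png
llava-13b	what is shown in this image?	The image shows two animated Minion characters from the movie "Despicable Me" standing next to each other.	5	test_images/2.png
fuyu-8b	what is shown in this image?	The image features two Minion figures, one wearing overalls and the other wearing goggles. They are standing next to each other, possibly engaged in conversation or posing for a photo.	5	test_images/2.png
minigpt-4	what is shown in this image?	The image shows two Minions, one wearing yellow overalls and the other wearing blue overalls, both wearing sunglasses and standing next to each other.	4	test_images/2.png
llava-13b	how many people are shown in the image?	There is only one person in the image, a man holding a trophy.	5	test_images/0.png
fuyu-8b	how many people are shown in the image?	In the image, there is a man wearing a suit and holding a golden ball or trophy.	5	test_images/0.png
minigpt-4	how many people are shown in the image?	I'm sorry, but I cannot see the image you provided. Please upload the image again.	NA	test_images/0.png
llava-13b	how many people are shown in the image?	There are two people in the image.	1	test_images/1.png
fuyu-8b	how many people are shown in the image?	The image showcases a city street at night, with colorful lights illuminating the scene. The street is lined with buildings, including a prominent Roman-style amphitheater.\n\nSeveral people are walking on the street, and some are carrying handbags. A few benches can be seen along the sidewalk, providing a place for pedestrians to rest. The atmosphere is lively, with people going about their daily activities.	4	test_images/1.png
minigpt-4	how many people are shown in the image?	I'm sorry, but I cannot see the image you provided. Please provide more information or a different image.	NA	test_images/1.png
llava-13b	how many people are shown in the image?	There are two people in the image.	4	test_images/2.png
fuyu-8b	how many people are shown in the image?	There are two people in the image.	4	test_images/2.png
minigpt-4	how many people are shown in the image?	There are two Minions in the image.	5	test_images/2.png
llava-13b	is there anything unusual in the image?	Yes, there is an unusual element in the image. A man holding a gold trophy is not uncommon, but he is also holding a microphone, which is unusual. A person holding a microphone is typically a speaker or an announcer, while a trophy is usually associated with an award or achievement. The combination of the microphone and the trophy suggests that the man might be a presenter who is receiving an award or presenting one to someone else.	4	test_images/0.png
fuyu-8b	is there anything unusual in the image?	Yes, there is something unusual in the image: a man wearing a suit is holding a golden ball or trophy. This is unusual because sports trophies are typically associated with competitions or achievements rather than being worn by a person in a suit. The man holding the trophy while wearing a suit adds to the peculiarity of the scene.	3	test_images/0.png
minigpt-4	is there anything unusual in the image?	The image shows a man in a black suit and tie holding a golden trophy. The trophy has a soccer ball design on it. The man is smiling and appears to be happy.	3	test_images/0.png
llava-13b	is there anything unusual in the image?	Yes, there is something unusual in the image. The famous ancient Roman Colosseum is lit up with colored lights, which is not a typical sight. This lighting adds a unique and vibrant look to the historic and otherwise majestic structure.	4	test_images/1.png
fuyu-8b	is there anything unusual in the image?	Yes, the image shows an unusual scene: multiple people are walking around the ancient Roman amphitheater. The amphitheater is illuminated with colorful lights, creating a vibrant and lively atmosphere.	3	test_images/1.png
minigpt-4	is there anything unusual in the image?	The image shows the Colosseum in Rome, Italy, which is a well-known landmark and tourist attraction.	2	test_images/1.png
llava-13b	is there anything unusual in the image?	Yes, there is something unusual in the image. Two cartoon Minions are standing next to each other, but one of them is missing a tooth. This detail is unusual because animated characters do not usually have imperfections such as a missing tooth. The missing tooth adds a unique and interesting touch to the image, setting it apart from typical animated scenes.	3	test_images/2.png
fuyu-8b	is there anything unusual in the image?	Yes, there is an unusual element in the image: two Minions wearing overalls and goggles are standing next to each other. This combination is not typical, as Minions are usually associated with their popular animated film appearances. Their outfit of overalls and goggles adds to the peculiarity of the scene.	2	test_images/2.png
minigpt-4	is there anything unusual in the image?	The image shows a cartoon character wearing overalls and a yellow shirt. The character is smiling and wearing a blue hat. There is nothing unusual in the image.	5	test_images/2.png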
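
To get a rough per-model view of the labels, the scores in the table can be averaged. The sketch below copies the scores by hand from the table and treats "NA" as missing, which is an assumption on our part; excluding those rows flatters a model that failed to answer. With only nine samples per model, the averages are illustrative at best.

```python
# Scores copied from the table above, in row order; None stands for "NA".
scores = {
    "llava-13b": [4, 4, 5, 5, 1, 4, 4, 4, 3],
    "fuyu-8b": [4, 4, 5, 5, 4, 4, 3, 3, 2],
    "minigpt-4": [4, 5, 4, None, None, 5, 3, 2, 5],
}


def summarize(model_scores):
    """Return (mean of the rated answers, number of NA answers) for one model."""
    rated = [s for s in model_scores if s is not None]
    return round(sum(rated) / len(rated), 2), len(model_scores) - len(rated)


summary = {model: summarize(vals) for model, vals in scores.items()}
for model, (avg, missing) in summary.items():
    print(f"{model}: mean={avg}, NA={missing}")
```

Reporting the NA count next to the mean keeps the two unanswered MiniGPT-4 prompts visible instead of hiding them inside the average.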

Summary of Preliminary Findings with Evaluated Multi-Modal Models

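First, the purpose of this notebook is to show how to use Replicate to serve different multi-modal LLMs for image reasoning tasks. The comparison has several limitations:

We compared LLaVA-13B, Fuyu-8B, and MiniGPT-4 on only a few simple, limited tasks and prompts.
The hyperparameters are the same for every model in this example. Hyperparameter tuning could make a significant difference to the quality of multi-modal LLM output.
Human evaluation may be biased, subjective, or noisy.

Some preliminary findings:

MiniGPT-4 sometimes gives more precise answers. For example, when asked how many people are in the image, it answers that there are two Minions, whereas LLaVA-13B and Fuyu-8B answer that there are two people. Similarly, for the image of the Colosseum in Italy, MiniGPT-4 names the Colosseum directly.
MiniGPT-4 failed to return results for two prompts, answering "I'm sorry, but I cannot see the image you provided", although it answered other questions about the same images. It is unclear whether this is an issue with Replicate inference or with the MiniGPT-4 model itself.
Fuyu-8B and LLaVA-13B usually generate longer answers with more supporting context.
LLaVA-13B and Fuyu-8B occasionally hallucinate slightly more, especially for the question "is there anything unusual in the image?".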

Replicate Stream Complete, Async Complete, and Async Stream Complete Modes

Initialize the Fuyu-8B Model

In [ ]:

multi_modal_llm = ReplicateMultiModal(
    model=REPLICATE_MULTI_MODAL_LLM_MODELS["fuyu-8b"],
    max_new_tokens=100,
    temperature=0.1,
    num_input_files=1,
    top_p=0.9,
    num_beams=1,
    repetition_penalty=1,
)

Using Async Stream Complete

In [ ]:

resp = await multi_modal_llm.astream_complete(
    prompt="tell me about this image",
    image_documents=[image_documents[0]],
)

In [ ]:

async for delta in resp:
    print(delta.delta, end="")

 The image features a man wearing a suit and tie, standing in front of a stage with a backdrop. He is holding a golden ball trophy, possibly an award, in his hands. The man appears to be posing for a photo, possibly celebrating his achievement or receiving an award.

In the background, there are multiple people visible, possibly attending or participating in the event. The backdrop appears to be a large screen, possibly displaying information about the event or ceremony.
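
The loop above prints each delta and then discards it. If the full text is needed afterwards, the deltas can be collected while streaming. The sketch below shows the pattern with a stand-in async generator, since a real call needs a Replicate token; `fake_stream` and `collect_text` are hypothetical names, and the only assumption about the real response is that each chunk exposes a `.delta` string, as the loop above uses.

```python
import asyncio
from types import SimpleNamespace
from typing import AsyncIterator


async def fake_stream() -> AsyncIterator[SimpleNamespace]:
    """Stand-in for the streaming response: yields objects with a `.delta` string."""
    for piece in [" The image", " features", " a man."]:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield SimpleNamespace(delta=piece)


async def collect_text(stream: AsyncIterator[SimpleNamespace]) -> str:
    """Print each delta as it arrives and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk.delta, end="", flush=True)
        parts.append(chunk.delta)
    return "".join(parts)


# In a plain script; inside a notebook, use `await collect_text(...)` instead.
full_text = asyncio.run(collect_text(fake_stream()))
```

In the notebook, `await collect_text(resp)` would take the place of the `async for` cell above.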

Using Async Complete

In [ ]:

resp = await multi_modal_llm.acomplete(
    prompt="tell me about this image",
    image_documents=[image_documents[0]],
)

In [ ]:

print(resp)

 The image features a man wearing a suit and tie, standing in front of a stage with a backdrop. He is holding a golden ball trophy, possibly an award, in his hands. The man appears to be posing for a photo, possibly celebrating his achievement or receiving an award.

In the background, there are multiple people visible, possibly attending or participating in the event. The backdrop appears to be a large screen, possibly displaying information about the event or ceremony.
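
Because each call handles a single image, async completion is most useful when several images are processed concurrently. The sketch below illustrates that pattern with a stand-in coroutine in place of the real call; `gather_limited` and `fake_acomplete` are hypothetical names, and the concurrency limit of 2 is an arbitrary choice, not a documented Replicate rate limit.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def gather_limited(
    items: Sequence[str],
    worker: Callable[[str], Awaitable[str]],
    max_concurrency: int = 2,
) -> List[str]:
    """Run worker(item) for every item, at most max_concurrency at a time, in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: str) -> str:
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_acomplete(image_path: str) -> str:
    """Stand-in for an async completion call on one image."""
    await asyncio.sleep(0.01)
    return f"description of {image_path}"


# In a plain script; inside a notebook, use `await gather_limited(...)` instead.
descriptions = asyncio.run(
    gather_limited(
        ["test_images/0.png", "test_images/1.png", "test_images/2.png"],
        fake_acomplete,
    )
)
```

With the real model, `worker` would wrap `multi_modal_llm.acomplete(prompt=..., image_documents=[doc])` for one image document at a time.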

Using Stream Complete

In [ ]:

resp = multi_modal_llm.stream_complete(
    prompt="tell me about this image",
    image_documents=[image_documents[0]],
)

In [ ]:

for delta in resp:
    print(delta.delta, end="")

 The image features a man wearing a suit and tie, standing in front of a stage with a backdrop. He is holding a golden ball trophy, possibly an award, in his hands. The man appears to be posing for a photo, possibly celebrating his achievement or receiving an award.

In the background, there are multiple people visible, possibly attending or participating in the event. The backdrop appears to be a large screen, possibly displaying information about the event or ceremony.