评估多模态 RAG 系统¶

在本指南中，我们将演示如何评估多模态 RAG（检索增强生成）系统。与纯文本场景类似，我们将分别评估检索器（Retriever）和生成器（Generator）。正如我们在相关博客中提到的，我们的评估方法采用了针对纯文本场景的常规评估技术的适配版本。这些适配版本已集成至llama-index库的evaluation模块中，本指南将逐步展示如何将这些工具应用于您的具体评估场景。

注意：本文展示的用例及其评估仅作演示用途，旨在说明如何将我们的评估工具适配到特定需求。此处呈现的结果或分析并非严谨结论——但我们相信这些工具能帮助您为实际应用建立更高标准的评估体系。

In [ ]:

Copied!

%pip install llama-index-llms-openai
%pip install llama-index-multi-modal-llms-openai
%pip install llama-index-multi-modal-llms-replicate
%pip install llama-index-llms-openai
%pip install llama-index-multi-modal-llms-openai
%pip install llama-index-multi-modal-llms-replicate

In [ ]:

Copied!





# %pip install llama_index ftfy regex tqdm -q
# %pip install git+https://github.com/openai/CLIP.git -q
# %pip install torch torchvision -q
# %pip install matplotlib scikit-image -q
# %pip install -U qdrant_client -q
# %pip install llama_index ftfy regex tqdm -q
# %pip install git+https://github.com/openai/CLIP.git -q
# %pip install torch torchvision -q
# %pip install matplotlib scikit-image -q
# %pip install -U qdrant_client -q

In [ ]:

Copied!

from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd

用例：美国手语字母拼写¶

本演示将贯穿使用的具体用例涉及利用图像和文本描述来展示美国手语（ASL）字母表的手势表达。

查询示例¶

本次演示中，我们将仅使用一种查询形式。（这并非典型的实际应用场景，但需要再次强调的是，此处主要目的在于展示如何运用llama-index评估工具来执行评估工作。）

In [ ]:

Copied!

QUERY_STR_TEMPLATE = "How can I sign a {symbol}?."
QUERY_STR_TEMPLATE = "How can I sign a {symbol}?."

数据集¶

图像部分

图像取自 ASL-Alphabet Kaggle 数据集。需注意的是，这些图像经过修改，在手势图像上简单添加了对应字母的标签。这些调整后的图像将作为用户查询的上下文内容，您可以通过我们的谷歌网盘下载（参见下方单元格，取消注释即可直接从本笔记本下载数据集）。

文本上下文

对于文本上下文，我们使用来自 https://www.deafblind.com/asl.html 的每个手势描述文本。这些描述已存储于名为 asl_text_descriptions.json 的 json 文件中，该文件包含在我们谷歌网盘的压缩下载包内。

In [ ]:

Copied!





#######################################################################
## This notebook guide makes several calls to gpt-4v, which is       ##
## heavily rate limited. For convenience, you should download data   ##
## files to avoid making such calls and still follow along with the  ##
## notebook. Unzip the zip file and store in a folder asl_data in    ##
## the same directory as this notebook.                              ##
#######################################################################

download_notebook_data = False
if download_notebook_data:
    !wget "https://www.dropbox.com/scl/fo/tpesl5m8ye21fqza6wq6j/h?rlkey=zknd9pf91w30m23ebfxiva9xn&dl=1" -O asl_data.zip -q
#######################################################################
## This notebook guide makes several calls to gpt-4v, which is       ##
## heavily rate limited. For convenience, you should download data   ##
## files to avoid making such calls and still follow along with the  ##
## notebook. Unzip the zip file and store in a folder asl_data in    ##
## the same directory as this notebook.                              ##
#######################################################################

download_notebook_data = False
if download_notebook_data:
    !wget "https://www.dropbox.com/scl/fo/tpesl5m8ye21fqza6wq6j/h?rlkey=zknd9pf91w30m23ebfxiva9xn&dl=1" -O asl_data.zip -q

首先，我们将上下文图像和文本分别加载到 ImageDocument 和 Documents 中。

In [ ]:

Copied!





import json
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader, Document

# context images
image_path = "./asl_data/images"
image_documents = SimpleDirectoryReader(image_path).load_data()

# context text
with open("asl_data/asl_text_descriptions.json") as json_file:
    asl_text_descriptions = json.load(json_file)
text_format_str = "To sign {letter} in ASL: {desc}."
text_documents = [
    Document(text=text_format_str.format(letter=k, desc=v))
    for k, v in asl_text_descriptions.items()
]
import json
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader, Document

# context images
image_path = "./asl_data/images"
image_documents = SimpleDirectoryReader(image_path).load_data()

# context text
with open("asl_data/asl_text_descriptions.json") as json_file:
    asl_text_descriptions = json.load(json_file)
text_format_str = "To sign {letter} in ASL: {desc}."
text_documents = [
    Document(text=text_format_str.format(letter=k, desc=v))
    for k, v in asl_text_descriptions.items()
]

手头有了文档后，我们就可以创建 MultiModalVectorStoreIndex。为此，我们需要先将 Documents 解析为节点，然后简单地将这些节点传递给 MultiModalVectorStoreIndex 构造函数。

In [ ]:

Copied!





from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

node_parser = SentenceSplitter.from_defaults()
image_nodes = node_parser.get_nodes_from_documents(image_documents)
text_nodes = node_parser.get_nodes_from_documents(text_documents)

asl_index = MultiModalVectorStoreIndex(image_nodes + text_nodes)
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

node_parser = SentenceSplitter.from_defaults()
image_nodes = node_parser.get_nodes_from_documents(image_documents)
text_nodes = node_parser.get_nodes_from_documents(text_documents)

asl_index = MultiModalVectorStoreIndex(image_nodes + text_nodes)

另一种值得考虑的 RAG 系统（使用 GPT-4V 图像描述进行检索）¶

在之前的 MultiModalVectorStoreIndex 中，图像的默认嵌入模型是 OpenAI 的 CLIP。为了与另一种 RAG 系统进行对比（这通常是进行 RAG 评估的原因之一），我们将搭建另一个 RAG 系统，该系统对图像使用不同于默认设置的嵌入方式。

具体而言，我们将提示 GPT-4V 为每张图像生成文本描述，然后对这些描述应用常规的文本嵌入，并将这些嵌入与图像关联起来。也就是说，这些文本描述的嵌入将最终用于该 RAG 系统中执行检索任务。

In [ ]:

Copied!





#######################################################################
## Set load_previously_generated_text_descriptions to True if you    ##
## would rather use previously generated gpt-4v text descriptions    ##
## that are included in the .zip download                            ##
#######################################################################

load_previously_generated_text_descriptions = True
#######################################################################
## Set load_previously_generated_text_descriptions to True if you    ##
## would rather use previously generated gpt-4v text descriptions    ##
## that are included in the .zip download                            ##
#######################################################################

load_previously_generated_text_descriptions = True

In [ ]:

Copied!





from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.schema import ImageDocument
import tqdm

if not load_previously_generated_text_descriptions:
    # define our lmm
    openai_mm_llm = OpenAIMultiModal(model="gpt-4o", max_new_tokens=300)

    # make a new copy since we want to store text in its attribute
    image_with_text_documents = SimpleDirectoryReader(image_path).load_data()

    # get text desc and save to text attr
    for img_doc in tqdm.tqdm(image_with_text_documents):
        response = openai_mm_llm.complete(
            prompt="Describe the images as an alternative text",
            image_documents=[img_doc],
        )
        img_doc.text = response.text

    # save so don't have to incur expensive gpt-4v calls again
    desc_jsonl = [
        json.loads(img_doc.to_json()) for img_doc in image_with_text_documents
    ]
    with open("image_descriptions.json", "w") as f:
        json.dump(desc_jsonl, f)
else:
    # load up previously saved image descriptions and documents
    with open("asl_data/image_descriptions.json") as f:
        image_descriptions = json.load(f)

    image_with_text_documents = [
        ImageDocument.from_dict(el) for el in image_descriptions
    ]

# parse into nodes
image_with_text_nodes = node_parser.get_nodes_from_documents(
    image_with_text_documents
)
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.schema import ImageDocument
import tqdm

if not load_previously_generated_text_descriptions:
    # define our lmm
    openai_mm_llm = OpenAIMultiModal(model="gpt-4o", max_new_tokens=300)

    # make a new copy since we want to store text in its attribute
    image_with_text_documents = SimpleDirectoryReader(image_path).load_data()

    # get text desc and save to text attr
    for img_doc in tqdm.tqdm(image_with_text_documents):
        response = openai_mm_llm.complete(
            prompt="Describe the images as an alternative text",
            image_documents=[img_doc],
        )
        img_doc.text = response.text

    # save so don't have to incur expensive gpt-4v calls again
    desc_jsonl = [
        json.loads(img_doc.to_json()) for img_doc in image_with_text_documents
    ]
    with open("image_descriptions.json", "w") as f:
        json.dump(desc_jsonl, f)
else:
    # load up previously saved image descriptions and documents
    with open("asl_data/image_descriptions.json") as f:
        image_descriptions = json.load(f)

    image_with_text_documents = [
        ImageDocument.from_dict(el) for el in image_descriptions
    ]

# parse into nodes
image_with_text_nodes = node_parser.get_nodes_from_documents(
    image_with_text_documents
)

细心的读者会发现，我们将文本描述存储在 ImageDocument 的 text 字段中。正如之前所做的那样，要创建 MultiModalVectorStoreIndex，我们需要先将 ImageDocuments 解析为 ImageNodes，然后将这些节点传递给构造函数。

需要注意的是，当使用带有已填充 text 字段的 ImageNodes 来构建 MultiModalVectorStoreIndex 时，我们可以选择利用这些文本来构建用于检索的嵌入向量。要实现这一点，只需将类属性 is_image_to_text 设置为 True。

In [ ]:

Copied!





image_with_text_nodes = node_parser.get_nodes_from_documents(
    image_with_text_documents
)

asl_text_desc_index = MultiModalVectorStoreIndex(
    nodes=image_with_text_nodes + text_nodes, is_image_to_text=True
)
image_with_text_nodes = node_parser.get_nodes_from_documents(
    image_with_text_documents
)

asl_text_desc_index = MultiModalVectorStoreIndex(
    nodes=image_with_text_nodes + text_nodes, is_image_to_text=True
)

构建多模态 RAG 系统¶

与纯文本场景类似，我们需要将生成器"附加"到索引（可作为检索器使用）上，最终组装成 RAG 系统。但在多模态场景中，我们的生成器是多模态大语言模型（通常简称为 LMM）。本笔记本中，为了更全面地比较不同 RAG 系统，我们将同时使用 GPT-4V 和 LLaVA。通过调用索引的 as_query_engine 方法，我们可以"附加"生成器并获取可查询的 RAG 接口。

In [ ]:

Copied!





from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.multi_modal_llms.replicate import ReplicateMultiModal
from llama_index.core import PromptTemplate

# define our QA prompt template
qa_tmpl_str = (
    "Images of hand gestures for ASL are provided.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "If the images provided cannot help in answering the query\n"
    "then respond that you are unable to answer the query. Otherwise,\n"
    "using only the context provided, and not prior knowledge,\n"
    "provide an answer to the query."
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)

# define our lmms
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4o",
    max_new_tokens=300,
)

llava_mm_llm = ReplicateMultiModal(
    model="yorickvp/llava-13b:2facb4a474a0462c15041b78b1ad70952ea46b5ec6ad29583c0b29dbd4249591",
    max_new_tokens=300,
)

# define our RAG query engines
rag_engines = {
    "mm_clip_gpt4v": asl_index.as_query_engine(
        multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
    ),
    "mm_clip_llava": asl_index.as_query_engine(
        multi_modal_llm=llava_mm_llm,
        text_qa_template=qa_tmpl,
    ),
    "mm_text_desc_gpt4v": asl_text_desc_index.as_query_engine(
        multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
    ),
    "mm_text_desc_llava": asl_text_desc_index.as_query_engine(
        multi_modal_llm=llava_mm_llm, text_qa_template=qa_tmpl
    ),
}

# llava only supports 1 image per call at current moment
rag_engines["mm_clip_llava"].retriever.image_similarity_top_k = 1
rag_engines["mm_text_desc_llava"].retriever.image_similarity_top_k = 1
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.multi_modal_llms.replicate import ReplicateMultiModal
from llama_index.core import PromptTemplate

# define our QA prompt template
qa_tmpl_str = (
    "Images of hand gestures for ASL are provided.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "If the images provided cannot help in answering the query\n"
    "then respond that you are unable to answer the query. Otherwise,\n"
    "using only the context provided, and not prior knowledge,\n"
    "provide an answer to the query."
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)

# define our lmms
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4o",
    max_new_tokens=300,
)

llava_mm_llm = ReplicateMultiModal(
    model="yorickvp/llava-13b:2facb4a474a0462c15041b78b1ad70952ea46b5ec6ad29583c0b29dbd4249591",
    max_new_tokens=300,
)

# define our RAG query engines
rag_engines = {
    "mm_clip_gpt4v": asl_index.as_query_engine(
        multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
    ),
    "mm_clip_llava": asl_index.as_query_engine(
        multi_modal_llm=llava_mm_llm,
        text_qa_template=qa_tmpl,
    ),
    "mm_text_desc_gpt4v": asl_text_desc_index.as_query_engine(
        multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
    ),
    "mm_text_desc_llava": asl_text_desc_index.as_query_engine(
        multi_modal_llm=llava_mm_llm, text_qa_template=qa_tmpl
    ),
}

# llava only supports 1 image per call at current moment
rag_engines["mm_clip_llava"].retriever.image_similarity_top_k = 1
rag_engines["mm_text_desc_llava"].retriever.image_similarity_top_k = 1

体验我们的多模态 RAG 系统¶

让我们实际体验其中一个系统。为了美观地展示响应结果，我们使用了笔记本工具函数 display_query_and_multimodal_response。

In [ ]:

Copied!

letter = "R"
query = QUERY_STR_TEMPLATE.format(symbol=letter)
response = rag_engines["mm_text_desc_gpt4v"].query(query)
letter = "R"
query = QUERY_STR_TEMPLATE.format(symbol=letter)
response = rag_engines["mm_text_desc_gpt4v"].query(query)

In [ ]:

Copied!

from llama_index.core.response.notebook_utils import (
    display_query_and_multimodal_response,
)

display_query_and_multimodal_response(query, response)
from llama_index.core.response.notebook_utils import (
    display_query_and_multimodal_response,
)

display_query_and_multimodal_response(query, response)

Query: How can I sign a R?.
=======
Retrieved Images:

No description has been provided for this image

=======
Response: To sign the letter "R" in American Sign Language (ASL), you would follow the instructions provided: the ring and little finger should be folded against the palm and held down by your thumb, while the index and middle finger are straight and crossed with the index finger in front to form the letter "R."
=======

检索器评估¶

在本笔记本的这一部分，我们将对检索器进行评估。需要明确的是，我们主要拥有两种多模态检索器：一种使用默认的 CLIP 图像嵌入；另一种使用关联的 GPT-4V 文本描述嵌入。在进行性能量化分析之前，我们首先对所有要求展示每个 ASL 字母手语的用户查询，为 text_desc_retriever 创建了 top-1 检索结果的可视化（如需切换为 clip_retriever 只需简单替换即可！）。

注意：由于我们不会将检索到的文档发送给 LLaVA，因此可以将 image_simiarity_top_k 设置为大于 1 的值。当进行生成评估时，我们需要再次使用前文定义的 rag_engine，该引擎对于使用 LLaVA 的 RAG 引擎将此参数设置为 1。

In [ ]:

Copied!





# use as retriever
clip_retriever = asl_index.as_retriever(image_similarity_top_k=2)

# use as retriever
text_desc_retriever = asl_text_desc_index.as_retriever(
    image_similarity_top_k=2
)
# use as retriever
clip_retriever = asl_index.as_retriever(image_similarity_top_k=2)

# use as retriever
text_desc_retriever = asl_text_desc_index.as_retriever(
    image_similarity_top_k=2
)

视觉¶

In [ ]:

Copied!





from llama_index.core.schema import TextNode, ImageNode

f, axarr = plt.subplots(3, 9)
f.set_figheight(6)
f.set_figwidth(15)
ix = 0
for jx, letter in enumerate(asl_text_descriptions.keys()):
    retrieval_results = text_desc_retriever.retrieve(
        QUERY_STR_TEMPLATE.format(symbol=letter)
    )
    image_node = None
    text_node = None
    for r in retrieval_results:
        if isinstance(r.node, TextNode):
            text_node = r
        if isinstance(r.node, ImageNode):
            image_node = r
            break

    img_path = image_node.node.image_path
    image = Image.open(img_path).convert("RGB")
    axarr[int(jx / 9), jx % 9].imshow(image)
    axarr[int(jx / 9), jx % 9].set_title(f"Query: {letter}")

plt.setp(axarr, xticks=[0, 100, 200], yticks=[0, 100, 200])
f.tight_layout()
plt.show()
from llama_index.core.schema import TextNode, ImageNode

f, axarr = plt.subplots(3, 9)
f.set_figheight(6)
f.set_figwidth(15)
ix = 0
for jx, letter in enumerate(asl_text_descriptions.keys()):
    retrieval_results = text_desc_retriever.retrieve(
        QUERY_STR_TEMPLATE.format(symbol=letter)
    )
    image_node = None
    text_node = None
    for r in retrieval_results:
        if isinstance(r.node, TextNode):
            text_node = r
        if isinstance(r.node, ImageNode):
            image_node = r
            break

    img_path = image_node.node.image_path
    image = Image.open(img_path).convert("RGB")
    axarr[int(jx / 9), jx % 9].imshow(image)
    axarr[int(jx / 9), jx % 9].set_title(f"Query: {letter}")

plt.setp(axarr, xticks=[0, 100, 200], yticks=[0, 100, 200])
f.tight_layout()
plt.show()

如您所见，该检索器在 top-1 检索任务中表现相当出色。现在，我们将对检索器性能进行定量分析。

定量分析：命中率与平均倒数排名¶

在我们博客（详见本笔记本开头的链接）中曾提到，评估多模态检索器的一个合理方法是分别计算图像检索和文本检索的常规评估指标。虽然这会使评估指标数量达到纯文本检索的两倍，但能为您提供更细粒度的调试RAG/检索器的重要能力。若需单一指标，则根据需求定制权重进行加权平均似乎是合理选择。

为实现这一目标，我们使用MultiModalRetrieverEvaluator工具，其与单模态版本类似，不同之处在于能分别处理image和text检索评估——这也正是我们当前所需的功能。

In [ ]:

Copied!





from llama_index.core.evaluation import MultiModalRetrieverEvaluator

clip_retriever_evaluator = MultiModalRetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=clip_retriever
)

text_desc_retriever_evaluator = MultiModalRetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=text_desc_retriever
)
from llama_index.core.evaluation import MultiModalRetrieverEvaluator

clip_retriever_evaluator = MultiModalRetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=clip_retriever
)

text_desc_retriever_evaluator = MultiModalRetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=text_desc_retriever
)

在进行评估计算时，一个重要注意事项是您通常需要基准真值数据（有时也称为标注数据）。对于检索任务而言，这类标注数据表现为query（查询）和expected_ids（预期ID）的配对组合，前者代表用户查询语句，后者则表示应当被检索到的节点（通过其ID标识）。

在本指南中，我们编写了一个特定的辅助函数来构建LabelledQADataset对象，这正是我们所需的工具。

In [ ]:

Copied!





import uuid
import re
from llama_index.core.evaluation import LabelledQADataset


def asl_create_labelled_retrieval_dataset(
    reg_ex, nodes, mode
) -> LabelledQADataset:
    """Returns a QALabelledDataset that provides the expected node IDs
    for every query.

    NOTE: this is specific to the ASL use-case.
    """
    queries = {}
    relevant_docs = {}
    for node in nodes:
        # find the letter associated with the image/text node
        if mode == "image":
            string_to_search = node.metadata["file_path"]
        elif mode == "text":
            string_to_search = node.text
        else:
            raise ValueError(
                "Unsupported mode. Please enter 'image' or 'text'."
            )
        match = re.search(reg_ex, string_to_search)
        if match:
            # build the query
            query = QUERY_STR_TEMPLATE.format(symbol=match.group(1))
            id_ = str(uuid.uuid4())
            # store the query and expected ids pair
            queries[id_] = query
            relevant_docs[id_] = [node.id_]

    return LabelledQADataset(
        queries=queries, relevant_docs=relevant_docs, corpus={}, mode=mode
    )
import uuid
import re
from llama_index.core.evaluation import LabelledQADataset


def asl_create_labelled_retrieval_dataset(
    reg_ex, nodes, mode
) -> LabelledQADataset:
    """Returns a QALabelledDataset that provides the expected node IDs
    for every query.

    NOTE: this is specific to the ASL use-case.
    """
    queries = {}
    relevant_docs = {}
    for node in nodes:
        # find the letter associated with the image/text node
        if mode == "image":
            string_to_search = node.metadata["file_path"]
        elif mode == "text":
            string_to_search = node.text
        else:
            raise ValueError(
                "Unsupported mode. Please enter 'image' or 'text'."
            )
        match = re.search(reg_ex, string_to_search)
        if match:
            # build the query
            query = QUERY_STR_TEMPLATE.format(symbol=match.group(1))
            id_ = str(uuid.uuid4())
            # store the query and expected ids pair
            queries[id_] = query
            relevant_docs[id_] = [node.id_]

    return LabelledQADataset(
        queries=queries, relevant_docs=relevant_docs, corpus={}, mode=mode
    )

In [ ]:

Copied!





# labelled dataset for image retrieval with asl_index.as_retriever()
qa_dataset_image = asl_create_labelled_retrieval_dataset(
    r"(?:([A-Z]+).jpg)", image_nodes, "image"
)

# labelled dataset for text retrieval with asl_index.as_retriever()
qa_dataset_text = asl_create_labelled_retrieval_dataset(
    r"(?:To sign ([A-Z]+) in ASL:)", text_nodes, "text"
)

# labelled dataset for text-desc with asl_text_desc_index.as_retriever()
qa_dataset_text_desc = asl_create_labelled_retrieval_dataset(
    r"(?:([A-Z]+).jpg)", image_with_text_nodes, "image"
)
# labelled dataset for image retrieval with asl_index.as_retriever()
qa_dataset_image = asl_create_labelled_retrieval_dataset(
    r"(?:([A-Z]+).jpg)", image_nodes, "image"
)

# labelled dataset for text retrieval with asl_index.as_retriever()
qa_dataset_text = asl_create_labelled_retrieval_dataset(
    r"(?:To sign ([A-Z]+) in ASL:)", text_nodes, "text"
)

# labelled dataset for text-desc with asl_text_desc_index.as_retriever()
qa_dataset_text_desc = asl_create_labelled_retrieval_dataset(
    r"(?:([A-Z]+).jpg)", image_with_text_nodes, "image"
)

现在我们已经掌握了真实数据，可以调用 MultiModalRetrieverEvaluator 的 evaluate_dataset 方法（或其异步版本 async）。

In [ ]:

Copied!





eval_results_image = await clip_retriever_evaluator.aevaluate_dataset(
    qa_dataset_image
)
eval_results_text = await clip_retriever_evaluator.aevaluate_dataset(
    qa_dataset_text
)
eval_results_text_desc = await text_desc_retriever_evaluator.aevaluate_dataset(
    qa_dataset_text_desc
)
eval_results_image = await clip_retriever_evaluator.aevaluate_dataset(
    qa_dataset_image
)
eval_results_text = await clip_retriever_evaluator.aevaluate_dataset(
    qa_dataset_text
)
eval_results_text_desc = await text_desc_retriever_evaluator.aevaluate_dataset(
    qa_dataset_text_desc
)

此外，我们将利用另一个笔记本实用函数 get_retrieval_results_df，它能将评估结果优雅地渲染成 pandas 的 DataFrame 格式。

In [ ]:

Copied!





from llama_index.core.evaluation import get_retrieval_results_df

get_retrieval_results_df(
    names=["asl_index-image", "asl_index-text", "asl_text_desc_index"],
    results_arr=[
        eval_results_image,
        eval_results_text,
        eval_results_text_desc,
    ],
)
from llama_index.core.evaluation import get_retrieval_results_df

get_retrieval_results_df(
    names=["asl_index-image", "asl_index-text", "asl_text_desc_index"],
    results_arr=[
        eval_results_image,
        eval_results_text,
        eval_results_text_desc,
    ],
)

Out[ ]:

	retrievers	hit_rate	mrr
0	asl_index-image	0.814815	0.814815
1	asl_index-text	1.000000	1.000000
2	asl_text_desc_index	0.925926	0.925926

观察结果¶

可以看出，asl_index检索器的文本检索效果完美。这在意料之中，因为用于创建text_nodes存储文本的QUERY_STR_TEMPLATE和text_format_str非常相似。
图像的CLIP嵌入表现相当不错，不过在本案例中，源自GPT-4V文本描述的嵌入表征带来了更好的检索性能。
有趣的是，当两个检索器确实检索到正确图像时，都会将其置于首位，这就是为什么两者的命中率和平均倒数排名指标数值相同。

生成结果评估¶

现在我们将进入生成响应的评估环节。为此，我们考虑之前构建的4个多模态RAG系统：

mm_clip_gpt4v = 采用CLIP图像编码器的多模态RAG，语言多模态模型为GPT-4V，同时使用image_nodes和text_nodes
mm_clip_llava = 采用CLIP图像编码器的多模态RAG，语言多模态模型为LLaVA，同时使用image_nodes和text_nodes
mm_text_desc_gpt4v = 采用文本描述+ada图像编码器的多模态RAG，语言多模态模型为GPT-4V，同时使用image_with_text_nodes和text_nodes
mm_text_desc_llava = 采用文本描述+ada图像编码器的多模态RAG，语言多模态模型为LLaVA，同时使用image_with_text_nodes和text_nodes

与检索器评估类似，我们现在也需要用于评估生成结果的基准真实数据。（需要注意的是并非所有评估方法都需要基准数据，但我们将使用"正确性"指标，这需要参考答案来对比生成的响应。）

参考（基准真实）数据¶

为此，我们采集了另一组美国手语(ASL)手势的文本描述。我们发现这些描述更为详尽，非常适合作为ASL查询的参考答案。数据来源：https://www.signingtime.com/dictionary/category/letters/，这些数据已被提取并存储在`human_responses.json`文件中，该文件同样包含在本笔记本开头链接的数据压缩包内。

In [ ]:

Copied!

# references (ground-truth) for our answers
with open("asl_data/human_responses.json") as json_file:
    human_answers = json.load(json_file)
# references (ground-truth) for our answers
with open("asl_data/human_responses.json") as json_file:
    human_answers = json.load(json_file)

为每个系统生成所有查询的响应¶

现在我们将遍历所有查询，并将这些查询传递给全部4个RAG系统（即通过QueryEngine.query()接口）。

In [ ]:

Copied!





#######################################################################
## Set load_previous_responses to True if you would rather use       ##
## previously generated responses for all rags. The json is part of  ##
## the .zip download                                                 ##
#######################################################################

load_previous_responses = True
#######################################################################
## Set load_previous_responses to True if you would rather use       ##
## previously generated responses for all rags. The json is part of  ##
## the .zip download                                                 ##
#######################################################################

load_previous_responses = True

In [ ]:

Copied!





import time
import tqdm

if not load_previous_responses:
    response_data = []
    for letter in tqdm.tqdm(asl_text_descriptions.keys()):
        data_entry = {}
        query = QUERY_STR_TEMPLATE.format(symbol=letter)
        data_entry["query"] = query

        responses = {}
        for name, engine in rag_engines.items():
            this_response = {}
            result = engine.query(query)
            this_response["response"] = result.response

            sources = {}
            source_image_nodes = []
            source_text_nodes = []

            # image sources
            source_image_nodes = [
                score_img_node.node.metadata["file_path"]
                for score_img_node in result.metadata["image_nodes"]
            ]

            # text sources
            source_text_nodes = [
                score_text_node.node.text
                for score_text_node in result.metadata["text_nodes"]
            ]

            sources["images"] = source_image_nodes
            sources["texts"] = source_text_nodes
            this_response["sources"] = sources

            responses[name] = this_response
        data_entry["responses"] = responses
        response_data.append(data_entry)

    # save expensive gpt-4v responses
    with open("expensive_response_data.json", "w") as json_file:
        json.dump(response_data, json_file)
else:
    # load up previously saved image descriptions
    with open("asl_data/expensive_response_data.json") as json_file:
        response_data = json.load(json_file)
import time
import tqdm

if not load_previous_responses:
    response_data = []
    for letter in tqdm.tqdm(asl_text_descriptions.keys()):
        data_entry = {}
        query = QUERY_STR_TEMPLATE.format(symbol=letter)
        data_entry["query"] = query

        responses = {}
        for name, engine in rag_engines.items():
            this_response = {}
            result = engine.query(query)
            this_response["response"] = result.response

            sources = {}
            source_image_nodes = []
            source_text_nodes = []

            # image sources
            source_image_nodes = [
                score_img_node.node.metadata["file_path"]
                for score_img_node in result.metadata["image_nodes"]
            ]

            # text sources
            source_text_nodes = [
                score_text_node.node.text
                for score_text_node in result.metadata["text_nodes"]
            ]

            sources["images"] = source_image_nodes
            sources["texts"] = source_text_nodes
            this_response["sources"] = sources

            responses[name] = this_response
        data_entry["responses"] = responses
        response_data.append(data_entry)

    # save expensive gpt-4v responses
    with open("expensive_response_data.json", "w") as json_file:
        json.dump(response_data, json_file)
else:
    # load up previously saved image descriptions
    with open("asl_data/expensive_response_data.json") as json_file:
        response_data = json.load(json_file)

正确性、忠实性、相关性¶

根据已生成的响应（存储在为该ASL用例定制的自定义数据对象中，即：response_data），我们现在可以计算以下评估指标：

正确性（LLM作为评判者）：
忠实性（LMM作为评判者）：
相关性（LMM作为评判者）：

为计算这三项指标，我们提示另一个生成式模型根据各自的标准提供评分。对于正确性，由于不考虑上下文，评判者为LLM。相反，计算忠实性和相关性时，需要传入初始提供给RAG用于生成响应的上下文（包括图像和文本）。由于需要同时传入图像和文本的要求，忠实性和相关性的评判者必须是LMM（或多模态LLM）。

我们在evaluation模块中实现了这些抽象逻辑，后续将演示如何通过循环遍历所有生成响应来使用这些功能。

In [ ]:

Copied!





from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.core.evaluation.multi_modal import (
    MultiModalRelevancyEvaluator,
    MultiModalFaithfulnessEvaluator,
)

import os

judges = {}

judges["correctness"] = CorrectnessEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

judges["relevancy"] = MultiModalRelevancyEvaluator(
    multi_modal_llm=OpenAIMultiModal(
        model="gpt-4o",
        max_new_tokens=300,
    )
)

judges["faithfulness"] = MultiModalFaithfulnessEvaluator(
    multi_modal_llm=OpenAIMultiModal(
        model="gpt-4o",
        max_new_tokens=300,
    )
)
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.core.evaluation.multi_modal import (
    MultiModalRelevancyEvaluator,
    MultiModalFaithfulnessEvaluator,
)

import os

judges = {}

judges["correctness"] = CorrectnessEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

judges["relevancy"] = MultiModalRelevancyEvaluator(
    multi_modal_llm=OpenAIMultiModal(
        model="gpt-4o",
        max_new_tokens=300,
    )
)

judges["faithfulness"] = MultiModalFaithfulnessEvaluator(
    multi_modal_llm=OpenAIMultiModal(
        model="gpt-4o",
        max_new_tokens=300,
    )
)

In [ ]:

Copied!





#######################################################################
## This section of the notebook can make a total of ~200 GPT-4V      ##
## which is heavily rate limited (100 per day). To follow along,     ##
## with previous generated evaluations set load_previous_evaluations ##
## to True. To test out the evaluation execution, set number_evals   ##
## to any number between (1-27). The json is part of the .zip        ##
## download                                                          ##
#######################################################################

load_previous_evaluations = True
number_evals = 27
#######################################################################
## This section of the notebook can make a total of ~200 GPT-4V      ##
## which is heavily rate limited (100 per day). To follow along,     ##
## with previous generated evaluations set load_previous_evaluations ##
## to True. To test out the evaluation execution, set number_evals   ##
## to any number between (1-27). The json is part of the .zip        ##
## download                                                          ##
#######################################################################

load_previous_evaluations = True
number_evals = 27

In [ ]:

Copied!





if not load_previous_evaluations:
    evals = {
        "names": [],
        "correctness": [],
        "relevancy": [],
        "faithfulness": [],
    }

    # loop through all responses and evaluate them
    for data_entry in tqdm.tqdm(response_data[:number_evals]):
        reg_ex = r"(?:How can I sign a ([A-Z]+)?)"
        match = re.search(reg_ex, data_entry["query"])

        batch_names = []
        batch_correctness = []
        batch_relevancy = []
        batch_faithfulness = []
        if match:
            letter = match.group(1)
            reference_answer = human_answers[letter]
            for rag_name, rag_response_data in data_entry["responses"].items():
                correctness_result = await judges["correctness"].aevaluate(
                    query=data_entry["query"],
                    response=rag_response_data["response"],
                    reference=reference_answer,
                )

                relevancy_result = judges["relevancy"].evaluate(
                    query=data_entry["query"],
                    response=rag_response_data["response"],
                    contexts=rag_response_data["sources"]["texts"],
                    image_paths=rag_response_data["sources"]["images"],
                )

                faithfulness_result = judges["faithfulness"].evaluate(
                    query=data_entry["query"],
                    response=rag_response_data["response"],
                    contexts=rag_response_data["sources"]["texts"],
                    image_paths=rag_response_data["sources"]["images"],
                )

                batch_names.append(rag_name)
                batch_correctness.append(correctness_result)
                batch_relevancy.append(relevancy_result)
                batch_faithfulness.append(faithfulness_result)

            evals["names"] += batch_names
            evals["correctness"] += batch_correctness
            evals["relevancy"] += batch_relevancy
            evals["faithfulness"] += batch_faithfulness

    # save evaluations
    evaluations_objects = {
        "names": evals["names"],
        "correctness": [e.dict() for e in evals["correctness"]],
        "faithfulness": [e.dict() for e in evals["faithfulness"]],
        "relevancy": [e.dict() for e in evals["relevancy"]],
    }
    with open("asl_data/evaluations.json", "w") as json_file:
        json.dump(evaluations_objects, json_file)
else:
from llama_index.core.evaluation import EvaluationResult

    # load up previously saved image descriptions
    with open("asl_data/evaluations.json") as json_file:
        evaluations_objects = json.load(json_file)

    evals = {}
    evals["names"] = evaluations_objects["names"]
    evals["correctness"] = [
        EvaluationResult.parse_obj(e)
        for e in evaluations_objects["correctness"]
    ]
    evals["faithfulness"] = [
        EvaluationResult.parse_obj(e)
        for e in evaluations_objects["faithfulness"]
    ]
    evals["relevancy"] = [
        EvaluationResult.parse_obj(e) for e in evaluations_objects["relevancy"]
    ]
if not load_previous_evaluations:
    evals = {
        "names": [],
        "correctness": [],
        "relevancy": [],
        "faithfulness": [],
    }

    # loop through all responses and evaluate them
    for data_entry in tqdm.tqdm(response_data[:number_evals]):
        reg_ex = r"(?:How can I sign a ([A-Z]+)?)"
        match = re.search(reg_ex, data_entry["query"])

        batch_names = []
        batch_correctness = []
        batch_relevancy = []
        batch_faithfulness = []
        if match:
            letter = match.group(1)
            reference_answer = human_answers[letter]
            for rag_name, rag_response_data in data_entry["responses"].items():
                correctness_result = await judges["correctness"].aevaluate(
                    query=data_entry["query"],
                    response=rag_response_data["response"],
                    reference=reference_answer,
                )

                relevancy_result = judges["relevancy"].evaluate(
                    query=data_entry["query"],
                    response=rag_response_data["response"],
                    contexts=rag_response_data["sources"]["texts"],
                    image_paths=rag_response_data["sources"]["images"],
                )

                faithfulness_result = judges["faithfulness"].evaluate(
                    query=data_entry["query"],
                    response=rag_response_data["response"],
                    contexts=rag_response_data["sources"]["texts"],
                    image_paths=rag_response_data["sources"]["images"],
                )

                batch_names.append(rag_name)
                batch_correctness.append(correctness_result)
                batch_relevancy.append(relevancy_result)
                batch_faithfulness.append(faithfulness_result)

            evals["names"] += batch_names
            evals["correctness"] += batch_correctness
            evals["relevancy"] += batch_relevancy
            evals["faithfulness"] += batch_faithfulness

    # save evaluations
    evaluations_objects = {
        "names": evals["names"],
        "correctness": [e.dict() for e in evals["correctness"]],
        "faithfulness": [e.dict() for e in evals["faithfulness"]],
        "relevancy": [e.dict() for e in evals["relevancy"]],
    }
    with open("asl_data/evaluations.json", "w") as json_file:
        json.dump(evaluations_objects, json_file)
else:
from llama_index.core.evaluation import EvaluationResult

    # load up previously saved image descriptions
    with open("asl_data/evaluations.json") as json_file:
        evaluations_objects = json.load(json_file)

    evals = {}
    evals["names"] = evaluations_objects["names"]
    evals["correctness"] = [
        EvaluationResult.parse_obj(e)
        for e in evaluations_objects["correctness"]
    ]
    evals["faithfulness"] = [
        EvaluationResult.parse_obj(e)
        for e in evaluations_objects["faithfulness"]
    ]
    evals["relevancy"] = [
        EvaluationResult.parse_obj(e) for e in evaluations_objects["relevancy"]
    ]

要查看这些结果，我们再次使用了笔记本工具函数 get_eval_results_df。

In [ ]:

Copied!





from llama_index.core.evaluation.notebook_utils import get_eval_results_df

deep_eval_df, mean_correctness_df = get_eval_results_df(
    evals["names"], evals["correctness"], metric="correctness"
)
_, mean_relevancy_df = get_eval_results_df(
    evals["names"], evals["relevancy"], metric="relevancy"
)
_, mean_faithfulness_df = get_eval_results_df(
    evals["names"], evals["faithfulness"], metric="faithfulness"
)

mean_scores_df = pd.concat(
    [
        mean_correctness_df.reset_index(),
        mean_relevancy_df.reset_index(),
        mean_faithfulness_df.reset_index(),
    ],
    axis=0,
    ignore_index=True,
)
mean_scores_df = mean_scores_df.set_index("index")
mean_scores_df.index = mean_scores_df.index.set_names(["metrics"])
from llama_index.core.evaluation.notebook_utils import get_eval_results_df

deep_eval_df, mean_correctness_df = get_eval_results_df(
    evals["names"], evals["correctness"], metric="correctness"
)
_, mean_relevancy_df = get_eval_results_df(
    evals["names"], evals["relevancy"], metric="relevancy"
)
_, mean_faithfulness_df = get_eval_results_df(
    evals["names"], evals["faithfulness"], metric="faithfulness"
)

mean_scores_df = pd.concat(
    [
        mean_correctness_df.reset_index(),
        mean_relevancy_df.reset_index(),
        mean_faithfulness_df.reset_index(),
    ],
    axis=0,
    ignore_index=True,
)
mean_scores_df = mean_scores_df.set_index("index")
mean_scores_df.index = mean_scores_df.index.set_names(["metrics"])

In [ ]:

Copied!

print(deep_eval_df[:4])
print(deep_eval_df[:4])

Out[ ]:

	rag	query	scores	feedbacks
0	mm_clip_gpt4v	How can I sign a A?.	4.500000	The generated answer is relevant and mostly correct. It accurately describes how to sign the letter 'A' in ASL, which matches the user query. However, it includes unnecessary information about images that were not mentioned in the user query, which slightly detracts from its overall correctness.
1	mm_clip_llava	How can I sign a A?.	4.500000	The generated answer is relevant and mostly correct. It provides the necessary steps to sign the letter 'A' in ASL, but it lacks the additional information about the hand position and the difference between 'A' and 'S' that the reference answer provides.
2	mm_text_desc_gpt4v	How can I sign a A?.	4.500000	The generated answer is relevant and mostly correct. It provides a clear description of how to sign the letter 'A' in American Sign Language, which matches the reference answer. However, it starts with an unnecessary statement about the lack of images, which is not relevant to the user's query.
3	mm_text_desc_llava	How can I sign a A?.	4.500000	The generated answer is relevant and almost fully correct. It accurately describes how to sign the letter 'A' in American Sign Language. However, it lacks the detail about the position of the hand (at shoulder height with palm facing out) that is present in the reference answer.

In [ ]:

Copied!

mean_scores_df
mean_scores_df

Out[ ]:

rag	mm_clip_gpt4v	mm_clip_llava	mm_text_desc_gpt4v	mm_text_desc_llava
metrics
mean_correctness_score	3.685185	4.092593	3.722222	3.870370
mean_relevancy_score	0.777778	0.851852	0.703704	0.740741
mean_faithfulness_score	0.777778	0.888889	0.851852	0.851852

观察结果¶

使用 LLaVA 的 RAG 模型在正确性、相关性和忠实度得分上似乎优于使用 GPT-4V 的模型
经过对部分回答的检查，我们发现 GPT-4V 针对 SPACE 的回复如下（尽管图像检索正确）：「抱歉，我无法根据提供的图像回答该查询，因为系统目前不允许我对图像进行视觉分析。不过根据上下文，在 ASL（美国手语）中表示『SPACE』时，您需要将手掌朝上，手指向上弯曲且拇指竖起。」
此类生成回复可能是评委对 GPT-4V 生成内容评分低于 LLaVA 的原因。更深入的分析需要进一步研究生成响应，可能需要调整生成提示词甚至评估提示词。

总结¶

在本笔记本中，我们演示了如何评估多模态 RAG 系统中的检索器（Retriever）和生成器（Generator）。具体而言，我们将现有的 llama-index 评估工具应用于 ASL 用例，旨在说明如何将这些工具适配到您的评估需求中。需要注意的是，多模态大语言模型目前仍应被视为测试版技术，若计划在生产系统中使用它们来评估多模态响应，应当采取特殊的审慎标准。