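Multi-Modal Retrieval using Cohere Multi-Modal Embeddings

Cohere has released a multi-modal embedding model. This notebook demonstrates multi-modal retrieval using Cohere multi-modal embeddings.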

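Why are multi-modal embeddings important?

Multi-modal embeddings let AI systems understand and search images and text in a unified way. Instead of running separate search systems for text and for images, a multi-modal embedding model maps both types of content into the same embedding space, so a single query can surface relevant information across media types.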
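
To make the idea of a shared embedding space concrete, here is a minimal sketch that ranks a mix of text and image items against one query by cosine similarity. The three-dimensional vectors, the item labels, and the helper names `cosine_similarity` and `rank_items` are invented for illustration. They are not outputs of the Cohere model and not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, invented for illustration only.
# Real embeddings come from the model and have many more dimensions.
items = {
    "text: Taycan aerodynamics paragraph": [0.9, 0.1, 0.2],
    "image: photo of a Porsche Taycan": [0.8, 0.2, 0.1],
    "image: photo of a Ford Mustang": [0.1, 0.9, 0.3],
}
query_vector = [0.85, 0.15, 0.15]  # stands in for an embedded text query


def rank_items(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query, vec))
        for name, vec in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, score in rank_items(query_vector, items):
    print(f"{score:.3f}  {name}")
```

Because text and images live in one space, the same similarity function compares the query against both kinds of item without any modality-specific logic.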

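The demonstration consists of the following steps:

1. Download text and images from related Wikipedia articles.
2. Build a multi-modal index over the text and images using Cohere multi-modal embeddings.
3. Retrieve relevant text and images simultaneously for a query using a multi-modal retriever.
4. Generate a response to the query using a multi-modal query engine.

NOTE: Because Cohere does not yet support multi-modal LLMs, we use Anthropic's multi-modal LLM to generate responses.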

Installation

We will use the Cohere multi-modal embedding model for retrieval, the Qdrant vector store for storage, and the Anthropic multi-modal LLM for response generation.

In [ ]:

%pip install llama-index-embeddings-cohere
%pip install llama-index-vector-stores-qdrant
%pip install llama-index-multi-modal-llms-anthropic

Setup API Keys

Cohere - multi-modal retrieval

Anthropic - multi-modal LLM

In [ ]:

import os

os.environ["COHERE_API_KEY"] = "<YOUR COHERE API KEY>"

os.environ["ANTHROPIC_API_KEY"] = "<YOUR ANTHROPIC API KEY>"
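
If you would rather not paste keys into the notebook, one option is to prompt for them only when the environment variable is missing. This is a sketch: `ensure_api_key` is a hypothetical helper written for this example, and the `prompt_fn` parameter exists only so the prompt can be replaced in tests.

```python
import getpass
import os


def ensure_api_key(name, prompt_fn=getpass.getpass):
    """Return os.environ[name], prompting only if it is missing or empty."""
    value = os.environ.get(name, "")
    if not value:
        # getpass hides the typed value so it does not end up in cell output
        value = prompt_fn(f"Enter {name}: ")
        os.environ[name] = value
    return value


# Usage in the notebook (prompts only when the variable is not already set):
# ensure_api_key("COHERE_API_KEY")
# ensure_api_key("ANTHROPIC_API_KEY")
```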

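Utils

get_wikipedia_images: Gets the image URLs from the Wikipedia page with the specified title.
plot_images: Plots the images in the specified list of image paths.
delete_large_images: Deletes images larger than 5 MB in the specified directory.

NOTE: The Cohere API only accepts image files smaller than 5 MB.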

In [ ]:

import requests
import matplotlib.pyplot as plt
from PIL import Image
from pathlib import Path
import urllib.request
import os


def get_wikipedia_images(title):
    """
    Get the image URLs from the Wikipedia page with the specified title.
    """
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "imageinfo",
            "iiprop": "url|dimensions|mime",
            "generator": "images",
            "gimlimit": "50",
        },
    ).json()
    image_urls = []
    for page in response["query"]["pages"].values():
        if page["imageinfo"][0]["url"].endswith(".jpg") or page["imageinfo"][
            0
        ]["url"].endswith(".png"):
            image_urls.append(page["imageinfo"][0]["url"])
    return image_urls


def plot_images(image_paths):
    """
    Plot the images in the specified list of image paths.
    """
    images_shown = 0
    plt.figure(figsize=(16, 9))
    for img_path in image_paths:
        if os.path.isfile(img_path):
            image = Image.open(img_path)

            # Use a 3x3 grid so that up to 9 images fit (matches the limit below)
            plt.subplot(3, 3, images_shown + 1)
            plt.imshow(image)
            plt.xticks([])
            plt.yticks([])

            images_shown += 1
            if images_shown >= 9:
                break


def delete_large_images(folder_path):
    """
    Delete images larger than 5 MB in the specified directory.
    """
    # List to hold the names of deleted image files
    deleted_images = []

    # Iterate through each file in the directory
    for file_name in os.listdir(folder_path):
        if file_name.lower().endswith(
            (".png", ".jpg", ".jpeg", ".gif", ".bmp")
        ):
            # Construct the full file path
            file_path = os.path.join(folder_path, file_name)
            # Get the size of the file in bytes
            file_size = os.path.getsize(file_path)
            # Check if the file size is greater than 5 MB (5242880 bytes) and remove it
            if file_size > 5242880:
                os.remove(file_path)
                deleted_images.append(file_name)
                print(
                    f"Image: {file_name} was larger than 5 MB and has been deleted."
                )
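
plot_images places images on a fixed grid. If the number of retrieved images varies, the grid shape can instead be derived from the count. A small sketch, where `grid_shape` is a hypothetical helper and not part of matplotlib:

```python
import math


def grid_shape(num_images, max_cols=3):
    """Return (rows, cols) for a grid that fits num_images subplots."""
    if num_images <= 0:
        return (0, 0)
    cols = min(num_images, max_cols)
    rows = math.ceil(num_images / cols)
    return (rows, cols)


for count in (1, 4, 9):
    print(count, grid_shape(count))
```

The result can be passed straight to `plt.subplot(rows, cols, index)` in place of hard-coded grid dimensions.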

Download text and images from Wikipedia

We will download text and images from the following Wikipedia pages:

Audi e-tron
Ford Mustang
Porsche Taycan

In [ ]:

image_uuid = 0
# image_metadata_dict stores images metadata including image uuid, filename and path
image_metadata_dict = {}
MAX_IMAGES_PER_WIKI = 10

wiki_titles = {
    "Audi e-tron",
    "Ford Mustang",
    "Porsche Taycan",
}


data_path = Path("mixed_wiki")
data_path.mkdir(exist_ok=True)

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

    # Write as UTF-8 explicitly: article text contains non-ASCII characters
    with open(data_path / f"{title}.txt", "w", encoding="utf-8") as fp:
        fp.write(wiki_text)

    images_per_wiki = 0
    try:
        list_img_urls = get_wikipedia_images(title)

        for url in list_img_urls:
            if (
                url.endswith(".jpg")
                or url.endswith(".png")
                or url.endswith(".svg")
            ):
                image_uuid += 1
                urllib.request.urlretrieve(
                    url, data_path / f"{image_uuid}.jpg"
                )
                images_per_wiki += 1
                if images_per_wiki > MAX_IMAGES_PER_WIKI:
                    break
    except Exception:
        print(f"No images found for Wikipedia page: {title}")
        continue

Delete larger image files

The Cohere multi-modal embedding model only accepts image files smaller than 5 MB, so we delete the larger image files here.
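
Since delete_large_images removes files from disk, it can be useful to first list what would be removed. The following is a non-destructive sketch: `find_oversized_images` is a hypothetical helper written for this example, reusing the same 5 MB threshold and file extensions as delete_large_images.

```python
from pathlib import Path

MAX_IMAGE_BYTES = 5 * 1024 * 1024  # 5 MB, the same threshold as delete_large_images
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".bmp"}


def find_oversized_images(folder_path, max_bytes=MAX_IMAGE_BYTES):
    """Return sorted (file name, size in bytes) pairs for images above max_bytes."""
    oversized = []
    for path in Path(folder_path).iterdir():
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            size = path.stat().st_size
            if size > max_bytes:
                oversized.append((path.name, size))
    return sorted(oversized)


# Usage in the notebook: report candidates without deleting anything
# for name, size in find_oversized_images("mixed_wiki"):
#     print(f"{name}: {size / 1_048_576:.1f} MB")
```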

In [ ]:

delete_large_images(data_path)

Image: 8.jpg was larger than 5 MB and has been deleted.
Image: 13.jpg was larger than 5 MB and has been deleted.
Image: 11.jpg was larger than 5 MB and has been deleted.
Image: 21.jpg was larger than 5 MB and has been deleted.
Image: 23.jpg was larger than 5 MB and has been deleted.
Image: 32.jpg was larger than 5 MB and has been deleted.
Image: 19.jpg was larger than 5 MB and has been deleted.
Image: 4.jpg was larger than 5 MB and has been deleted.
Image: 5.jpg was larger than 5 MB and has been deleted.
Image: 7.jpg was larger than 5 MB and has been deleted.
Image: 6.jpg was larger than 5 MB and has been deleted.
Image: 1.jpg was larger than 5 MB and has been deleted.

Set the Embedding Model and LLM

We use the Cohere multi-modal embedding model for retrieval and the Anthropic multi-modal LLM for response generation.

In [ ]:

from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.multi_modal_llms.anthropic import AnthropicMultiModal
from llama_index.core import Settings

Settings.embed_model = CohereEmbedding(
    api_key=os.environ["COHERE_API_KEY"],
    model_name="embed-english-v3.0",  # current v3 models support multimodal embeddings
)

anthropic_multimodal_llm = AnthropicMultiModal(max_tokens=300)

Load the data

We will load the downloaded text and image data.

In [ ]:

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./mixed_wiki/").load_data()

Setup Qdrant VectorStore

We will use the Qdrant vector store to store the image and text embeddings along with their associated metadata.

In [ ]:

from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import StorageContext

import qdrant_client

# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_mm_db")

text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

Create MultiModalVectorStoreIndex

In [ ]:

index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    image_embed_model=Settings.embed_model,
)

WARNING:root:Payload indexes have no effect in the local Qdrant. Please use server Qdrant if you need payload indexes.
WARNING:root:Payload indexes have no effect in the local Qdrant. Please use server Qdrant if you need payload indexes.

Test the Retrieval

Here we create a retriever and test it.

In [ ]:

retriever_engine = index.as_retriever(
    similarity_top_k=4, image_similarity_top_k=4
)
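
The two arguments set separate result limits for text and for images. As a rough mental model only, and not LlamaIndex's actual implementation, the selection can be pictured as taking the best-scoring items of each modality independently and returning them together. The function name `top_k_per_modality` and all scores below are hypothetical.

```python
import heapq


def top_k_per_modality(scored_nodes, text_top_k, image_top_k):
    """Keep the best-scoring text nodes and image nodes independently.

    scored_nodes is a list of (node_id, modality, score) tuples, where
    modality is "text" or "image". Text results are returned first.
    """
    text_nodes = [n for n in scored_nodes if n[1] == "text"]
    image_nodes = [n for n in scored_nodes if n[1] == "image"]
    best_text = heapq.nlargest(text_top_k, text_nodes, key=lambda n: n[2])
    best_images = heapq.nlargest(image_top_k, image_nodes, key=lambda n: n[2])
    return best_text + best_images


# Hypothetical scores, invented for illustration.
candidates = [
    ("t1", "text", 0.49),
    ("t2", "text", 0.31),
    ("t3", "text", 0.47),
    ("i1", "image", 0.22),
    ("i2", "image", 0.35),
]
results = top_k_per_modality(candidates, text_top_k=2, image_top_k=1)
print(results)
```

With separate limits, a large number of strong text matches cannot crowd images out of the result set, and vice versa.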

In [ ]:

query = "Which models of Porsche are discussed here?"
retrieval_results = retriever_engine.retrieve(query)

Inspect the retrieval results

In [ ]:

from llama_index.core.response.notebook_utils import display_source_node
from llama_index.core.schema import ImageNode

retrieved_image = []
for res_node in retrieval_results:
    if isinstance(res_node.node, ImageNode):
        retrieved_image.append(res_node.node.metadata["file_path"])
    else:
        display_source_node(res_node, source_length=200)

plot_images(retrieved_image)

Node ID: ac3e92f1-e192-4aa5-bbc6-45674654d96f
Similarity: 0.49435770203542906
Text: === Aerodynamics === The Taycan Turbo has a drag coefficient of Cd=0.22, which the manufacturer claims is the lowest of any current Porsche model. The Turbo S model has a slightly higher drag coeff...

Node ID: 045cde7c-963f-46cd-b820-9cabe07f1ab5
Similarity: 0.4804621315897337
Text: The Porsche Taycan is a battery electric luxury sports sedan and shooting brake car produced by German automobile manufacturer Porsche. The concept version of the Taycan named the Porsche Mission E...

Node ID: e14475d1-7bd4-48f3-a085-f712d5bc7e5a
Similarity: 0.46787589674504015
Text: === Porsche Mission E Cross Turismo === The Porsche Mission E Cross Turismo previewed the Taycan Cross Turismo, and was presented at the 2018 Geneva Motor Show. The design language of the Mission E...

Node ID: a25b3aea-2fdd-4ae2-b5bc-55eef453fe82
Similarity: 0.4370399571869162
Text: == Specifications ==

=== Chassis === The Taycan's body is mainly steel and aluminium joined by different bonding techniques. The body's B pillars, side roof frame and seat cross member are made f...
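
The similarity scores printed above can also be used to drop weak matches before passing context to an LLM. The sketch below uses a stand-in `ScoredResult` class and a hypothetical `filter_by_score` helper rather than LlamaIndex objects. The scores are rounded from the output above, and the 0.45 cutoff is an arbitrary value chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredResult:
    """Minimal stand-in for a retrieved node with a similarity score."""

    node_id: str
    score: float


def filter_by_score(results, min_score):
    """Keep only results whose score is at least min_score."""
    return [r for r in results if r.score >= min_score]


# Scores rounded from the text results printed above; IDs shortened.
results = [
    ScoredResult("ac3e92f1", 0.494),
    ScoredResult("045cde7c", 0.480),
    ScoredResult("e14475d1", 0.468),
    ScoredResult("a25b3aea", 0.437),
]
kept = filter_by_score(results, min_score=0.45)  # 0.45 is an arbitrary cutoff
print([r.node_id for r in kept])
```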

Test the MultiModal QueryEngine

We will create a QueryEngine from the MultiModalVectorStoreIndex built above.

In [ ]:

from llama_index.core import PromptTemplate

qa_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)

query_engine = index.as_query_engine(
    llm=anthropic_multimodal_llm, text_qa_template=qa_tmpl
)
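
To see what the LLM receives as text, the template can be filled by hand. This sketch uses plain `str.format` as an approximation of what the prompt template does. The `build_prompt` helper and the way chunks are joined (a blank line between them) are assumptions made for illustration, and the context sentence is shortened from the retrieval output shown earlier. Retrieved images are not part of this string.

```python
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_prompt(context_chunks, query):
    """Join retrieved text chunks and fill the template placeholders."""
    context_str = "\n\n".join(context_chunks)  # joining rule is an assumption
    return QA_TEMPLATE.format(context_str=context_str, query_str=query)


prompt = build_prompt(
    ["The Porsche Taycan is a battery electric luxury sports sedan."],
    "Which models of Porsche are discussed here?",
)
print(prompt)
```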

In [ ]:

query = "Which models of Porsche are discussed here?"
response = query_engine.query(query)

In [ ]:

print(str(response))

Based on the context provided, the Porsche models discussed are:

- Porsche Taycan - a battery electric luxury sports sedan. It is offered in several variants at different performance levels, including the Taycan Turbo and Turbo S high-performance AWD models, the mid-range Taycan 4S, and a base RWD model.

- Porsche Taycan Cross Turismo - a lifted shooting brake/wagon version of the Taycan with crossover-like features and styling. 

- Porsche Taycan Sport Turismo - shares the shooting brake profile with the Cross Turismo but without the crossover styling elements. A RWD version is available as the base Taycan Sport Turismo.

- Porsche Mission E - the concept car unveiled in 2015 that previewed the design and technology of the production Taycan models.

Inspect the sources

In [ ]:

from llama_index.core.response.notebook_utils import display_source_node

for text_node in response.metadata["text_nodes"]:
    display_source_node(text_node, source_length=200)
plot_images(
    [n.metadata["file_path"] for n in response.metadata["image_nodes"]]
)

Node ID: ac3e92f1-e192-4aa5-bbc6-45674654d96f
Similarity: 0.49435770203542906
Text: === Aerodynamics === The Taycan Turbo has a drag coefficient of Cd=0.22, which the manufacturer claims is the lowest of any current Porsche model. The Turbo S model has a slightly higher drag coeff...

Node ID: 045cde7c-963f-46cd-b820-9cabe07f1ab5
Similarity: 0.4804621315897337
Text: The Porsche Taycan is a battery electric luxury sports sedan and shooting brake car produced by German automobile manufacturer Porsche. The concept version of the Taycan named the Porsche Mission E...