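Multi-Modal LLM using Google's Gemini model for image understanding and building Retrieval Augmented Generation with LlamaIndex¶

This notebook shows how to use Google's Gemini Vision models for image understanding.

It first walks through several functions that Gemini currently supports:

complete (sync and async): for a single prompt and a list of images
chat (sync and async): for multiple chat messages
stream complete (sync and async): for streaming the output of complete
stream chat (sync and async): for streaming the output of chat

The second part combines Gemini with Pydantic to parse structured information from Google Maps screenshots:

Define a Pydantic class with the desired attribute fields
Let the gemini-pro-vision model understand each image and output a structured result

The third part proposes using Gemini and LlamaIndex to build a simple Retrieval Augmented Generation flow for a small Google Maps restaurant dataset:

Build a vector index on top of the structured outputs from the second part
Use the gemini-pro model to synthesize the results and recommend restaurants based on the user query

Note: google-generativeai is only available in certain countries and regions.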

In [ ]:

%pip install llama-index-multi-modal-llms-gemini
%pip install llama-index-vector-stores-qdrant
%pip install llama-index-embeddings-gemini
%pip install llama-index-llms-gemini

In [ ]:

!pip install llama-index 'google-generativeai>=0.3.0' matplotlib qdrant_client

Use Gemini to understand images from URLs¶

In [ ]:

%env GOOGLE_API_KEY=...

In [ ]:

import os

GOOGLE_API_KEY = ""  # add your GOOGLE API key here
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
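
The cells above set the key either through `%env` or by pasting it into the notebook. As a minimal sketch of an alternative, the helper below reads the key from the environment and only prompts when it is missing, so the key does not have to be saved in the notebook itself. The function name `load_google_api_key` and its `env` and `prompt` parameters are hypothetical and not part of LlamaIndex or the Gemini SDK.

```python
import os
from getpass import getpass
from typing import Callable, Mapping, Optional


def load_google_api_key(
    env: Optional[Mapping[str, str]] = None,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the API key from the environment, prompting only if it is missing."""
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        # Fall back to a hidden prompt so the key is not stored in a cell.
        key = prompt("GOOGLE_API_KEY: ").strip()
    if not key:
        raise ValueError("GOOGLE_API_KEY is empty")
    return key
```

In a notebook this could be used as `os.environ["GOOGLE_API_KEY"] = load_google_api_key()`.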

Initialize `GeminiMultiModal` and load images from URLs¶

In [ ]:

from llama_index.llms.gemini import Gemini
from llama_index.core.llms import ChatMessage, ImageBlock


image_urls = [
    "https://storage.googleapis.com/generativeai-downloads/data/scene.jpg",
    # Add yours here!
]
gemini_pro = Gemini(model_name="models/gemini-1.5-flash")
msg = ChatMessage("Identify the city where this photo was taken.")
for img_url in image_urls:
    msg.blocks.append(ImageBlock(url=img_url))

In [ ]:

from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt

img_response = requests.get(image_urls[0])
print(image_urls[0])
img = Image.open(BytesIO(img_response.content))
plt.imshow(img)

https://storage.googleapis.com/generativeai-downloads/data/scene.jpg

Out[ ]:

<matplotlib.image.AxesImage at 0x128032e40>

Chat using images in the prompt¶

In [ ]:

response = gemini_pro.chat(messages=[msg])

In [ ]:

print(response.message.content)

That's New York City.  More specifically, the photo shows a street in the **SoHo** neighborhood.  The distinctive cast-iron architecture and the pedestrian bridge are characteristic of that area.

Stream chat with images¶

In [ ]:

stream_response = gemini_pro.stream_chat(messages=[msg])

In [ ]:

import time

for r in stream_response:
    print(r.delta, end="")
    # Add an artificial wait to make streaming visible in the notebook
    time.sleep(0.5)

That's New York City.  More specifically, the photo was taken in the **West Village** neighborhood of Manhattan.  The distinctive architecture and the pedestrian bridge are strong clues.
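
The streaming loop above prints each `delta` and then discards it. If the full text is needed afterwards, one option is to collect the deltas while printing them. The sketch below shows that pattern with a stand-in stream, so it runs without calling the API. `FakeChunk`, `fake_stream`, and `collect_stream` are hypothetical names, and only the `delta` attribute used in the loop above is modeled.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a streamed chunk; only `delta` is modeled."""

    delta: Optional[str]


def fake_stream() -> Iterator[FakeChunk]:
    # Made-up pieces, used only to exercise the loop without an API call.
    for piece in ["That's ", "New York ", "City."]:
        yield FakeChunk(delta=piece)


def collect_stream(chunks: Iterable[FakeChunk]) -> str:
    """Print each delta as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        piece = chunk.delta or ""  # tolerate chunks that carry no text
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)


full_text = collect_stream(fake_stream())
```

Under the assumption that real chunks expose `delta` in the same way, passing `stream_response` to `collect_stream` would return the whole reply as one string.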

Async support¶

In [ ]:

response_achat = await gemini_pro.achat(messages=[msg])

In [ ]:

print(response_achat.message.content)

That's New York City.  More specifically, the photo was taken in the **West Village** neighborhood of Manhattan.  The distinctive architecture and the pedestrian bridge are strong clues.
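
A common reason to reach for the async variant is to have several requests in flight at once instead of waiting for each in turn. The sketch below illustrates the pattern with `asyncio.gather`. `fake_achat` is a hypothetical stand-in that only simulates latency; in the notebook it would be replaced by a real async call such as the `achat` shown above.

```python
import asyncio
from typing import List


async def fake_achat(prompt: str) -> str:
    """Hypothetical stand-in for an async chat call; it only simulates latency."""
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"


async def ask_all(prompts: List[str]) -> List[str]:
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_achat(p) for p in prompts))
```

Inside a notebook cell this can be awaited directly (`answers = await ask_all([...])`); in a plain script it would be run with `asyncio.run(ask_all([...]))`.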

Let's see how to stream asynchronously:

In [ ]:

import asyncio

streaming_handler = await gemini_pro.astream_chat(messages=[msg])
async for chunk in streaming_handler:
    print(chunk.delta, end="")
    # Add an artificial wait to make streaming visible in the notebook
    await asyncio.sleep(0.5)

That's New York City.  More specifically, the photo was taken in the **West Village** neighborhood of Manhattan.  The distinctive architecture and the pedestrian bridge are strong clues.

Complete with two images¶

In [ ]:

image_urls = [
    "https://picsum.photos/id/1/200/300",
    "https://picsum.photos/id/26/200/300",
]


msg = ChatMessage("Is there any relationship between these images?")
for img_url in image_urls:
    msg.blocks.append(ImageBlock(url=img_url))

response_multi = gemini_pro.chat(messages=[msg])

In [ ]:

print(response_multi.message.content)

Yes, there is a relationship between the two images.  Both images depict aspects of a **professional or business-casual lifestyle**.

* **Image 1:** Shows someone working on a laptop, suggesting remote work, freelancing, or a business-related task.

* **Image 2:** Shows a flat lay of accessories commonly associated with a professional or stylish individual: sunglasses, a bow tie, a pen, a watch, glasses, and a phone.  These items suggest a certain level of personal style and preparedness often associated with business or professional settings.

The connection is indirect but thematic.  They both visually represent elements of a similar lifestyle or persona.

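2nd Part: `Gemini` + `Pydantic` for structured output parsing from an image¶

Use Gemini for image reasoning
Use a Pydantic program to generate structured output from Gemini's image reasoning results

Download example images for Gemini to understand¶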

In [ ]:

from pathlib import Path

input_image_path = Path("google_restaurants")
# Create the download folder; do nothing if it already exists
input_image_path.mkdir(exist_ok=True)

In [ ]:

!curl -sL "https://docs.google.com/uc?export=download&id=1Pg04p6ss0FlBgz00noHAOAJ1EYXiosKg" -o ./google_restaurants/miami.png
!curl -sL "https://docs.google.com/uc?export=download&id=1dYZy17bD6pSsEyACXx9fRMNx93ok-kTJ" -o ./google_restaurants/orlando.png
!curl -sL "https://docs.google.com/uc?export=download&id=1ShPnYVc1iL_TA1t7ErCFEAHT74-qvMrn" -o ./google_restaurants/sf.png
!curl -sL "https://docs.google.com/uc?export=download&id=1WjISWnatHjwL4z5VD_9o09ORWhRJuYqm" -o ./google_restaurants/toronto.png

Define the Pydantic class for the structured parser¶

In [ ]:

from pydantic import BaseModel
from PIL import Image
import matplotlib.pyplot as plt


class GoogleRestaurant(BaseModel):
    """Data model for a Google Restaurant."""

    restaurant: str
    food: str
    location: str
    category: str
    hours: str
    price: str
    rating: float
    review: str
    description: str
    nearby_tourist_places: str


google_image_url = "./google_restaurants/miami.png"
image = Image.open(google_image_url).convert("RGB")

plt.figure(figsize=(16, 5))
plt.imshow(image)

Out[ ]:

<matplotlib.image.AxesImage at 0x10953cce0>

Call the Pydantic program and generate structured output¶

In [ ]:

from llama_index.multi_modal_llms.gemini import GeminiMultiModal
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser

prompt_template_str = """\
    can you summarize what is in the image\
    and return the answer with json format \
"""


def pydantic_gemini(
    model_name, output_class, image_documents, prompt_template_str
):
    gemini_llm = GeminiMultiModal(model_name=model_name)

    llm_program = MultiModalLLMCompletionProgram.from_defaults(
        output_parser=PydanticOutputParser(output_class),
        image_documents=image_documents,
        prompt_template_str=prompt_template_str,
        multi_modal_llm=gemini_llm,
        verbose=True,
    )

    response = llm_program()
    return response

Generate the Pydantic structured output via the Gemini Vision model¶

In [ ]:

from llama_index.core import SimpleDirectoryReader

google_image_documents = SimpleDirectoryReader(
    "./google_restaurants"
).load_data()

results = []
for img_doc in google_image_documents:
    pydantic_response = pydantic_gemini(
        "models/gemini-1.5-flash",
        GoogleRestaurant,
        [img_doc],
        prompt_template_str,
    )
    # Print the parsed fields only for the Miami image, as an example
    if "miami" in img_doc.image_path:
        for r in pydantic_response:
            print(r)
    results.append(pydantic_response)

> Raw output: ```json
{
  "restaurant": "La Mar by Gaston Acurio",
  "food": "Peruvian & fusion",
  "location": "500 Brickell Key Dr, Miami, FL 33131",
  "category": "South American restaurant",
  "hours": "Opens 6PM, Closes 11 PM",
  "price": "$$$",
  "rating": 4.4,
  "review": "Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.",
  "description": "Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.",
  "nearby_tourist_places": "Brickell Key area with scenic views"
}
```

('restaurant', 'La Mar by Gaston Acurio')
('food', 'Peruvian & fusion')
('location', '500 Brickell Key Dr, Miami, FL 33131')
('category', 'South American restaurant')
('hours', 'Opens 6PM, Closes 11 PM')
('price', '$$$')
('rating', 4.4)
('review', 'Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.')
('description', 'Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.')
('nearby_tourist_places', 'Brickell Key area with scenic views')
> Raw output: ```json
{
  "restaurant": "Mythos Restaurant",
  "food": "American fare in a mythic underwater themed spot",
  "location": "6000 Universal Blvd, Orlando, FL 32819, United States",
  "category": "Restaurant",
  "hours": "Open: Closes in 7 hrs, Islands of Adventure",
  "price": "$$",
  "rating": 4.3,
  "review": "Overlooking Universal Studios/Island sea, this mythic underwater themed spot serves American fare.",
  "description": "Dine-in, Delivery",
  "nearby_tourist_places": "Universal Islands, Jurassic Park River Adventure"
}
```

> Raw output: ```json
{
  "restaurant": "Sam's Grill & Seafood Restaurant",
  "food": "Seafood",
  "location": "374 Bush St, San Francisco, CA 94104, United States",
  "category": "Seafood Restaurant",
  "hours": "Open ⋅ Closes 8:30 PM",
  "price": "$$$",
  "rating": 4.4,
  "review": "Modern spin-off adjacent Sam's Grill, for seafood, drinks & happy hour loungey digs with a patio.",
  "description": "Modern spin-off adjacent Sam's Grill, for seafood, drinks & happy hour loungey digs with a patio.",
  "nearby_tourist_places": "Chinatown, San Francisco"
}
```
> Raw output: ```json
{
  "restaurant": "Lobster Port",
  "food": "Seafood restaurant offering lobster, dim sum & Asian fusion dishes",
  "location": "8432 Leslie St, Thornhill, ON L3T 7M6",
  "category": "Seafood",
  "hours": "Open 10pm",
  "price": "$$",
  "rating": 4.0,
  "review": "Elegant, lively venue with a banquet-hall setup",
  "description": "Elegant, lively venue with a banquet-hall setup offering lobster, dim sum & Asian fusion dishes.",
  "nearby_tourist_places": "Nearby tourist places are not explicitly listed in the image but the map shows various points of interest in the surrounding area."
}
```

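Observations:

Gemini populated every field of the Pydantic class for all four screenshots, although for Toronto the nearby_tourist_places value only notes that none are listed in the image
It can also recognize the nearby park from Google Maps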
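
Each raw reply above wraps its JSON in a Markdown code fence, so the object has to be pulled out of the text before its fields can be validated. The sketch below shows roughly what such a step involves, using only the standard library. It is a stand-in for illustration, not LlamaIndex's `PydanticOutputParser` implementation, and `parse_raw_output` is a hypothetical name. The sample values are copied from the Miami output.

```python
import json
import re

# Field names taken from the GoogleRestaurant class defined earlier.
REQUIRED_FIELDS = {
    "restaurant", "food", "location", "category", "hours",
    "price", "rating", "review", "description", "nearby_tourist_places",
}


def parse_raw_output(raw: str) -> dict:
    """Extract the JSON object from `raw` and check that the expected keys exist."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data


# Rebuild a fenced reply shaped like the Miami one printed earlier.
fence = "`" * 3
sample = {
    "restaurant": "La Mar by Gaston Acurio",
    "food": "Peruvian & fusion",
    "location": "500 Brickell Key Dr, Miami, FL 33131",
    "category": "South American restaurant",
    "hours": "Opens 6PM, Closes 11 PM",
    "price": "$$$",
    "rating": 4.4,
    "review": "Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.",
    "description": "Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.",
    "nearby_tourist_places": "Brickell Key area with scenic views",
}
raw_reply = f"{fence}json\n{json.dumps(sample, indent=2)}\n{fence}"
parsed = parse_raw_output(raw_reply)
```

A reply that omits a field, or contains no JSON at all, raises `ValueError` here instead of silently producing an incomplete record.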

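3rd Part: Build a multi-modal RAG system for restaurant recommendation¶

Our stack consists of Gemini + LlamaIndex + Pydantic structured output capabilities.

Construct text nodes for building the vector store. Store the metadata and description of each restaurant.¶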

In [ ]:

from llama_index.core.schema import TextNode

nodes = []
for res in results:
    text_node = TextNode()
    metadata = {}
    for r in res:
        # set description as text of TextNode
        if r[0] == "description":
            text_node.text = r[1]
        else:
            metadata[r[0]] = r[1]
    text_node.metadata = metadata
    nodes.append(text_node)
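
Iterating over a Pydantic model yields `(field_name, value)` tuples, which is why the loop above indexes `r[0]` and `r[1]`. The sketch below isolates that split into a plain function so it can be checked without LlamaIndex. `split_record` is a hypothetical helper, and the sample pairs are a subset copied from the Miami output printed earlier.

```python
from typing import Any, Dict, Iterable, Tuple


def split_record(pairs: Iterable[Tuple[str, Any]]) -> Tuple[str, Dict[str, Any]]:
    """Return (text, metadata): `description` becomes the text, the rest metadata."""
    text = ""
    metadata: Dict[str, Any] = {}
    for field, value in pairs:
        if field == "description":
            text = value
        else:
            metadata[field] = value
    return text, metadata


# A few (field, value) pairs from the Miami result.
miami_pairs = [
    ("restaurant", "La Mar by Gaston Acurio"),
    ("price", "$$$"),
    ("rating", 4.4),
    (
        "description",
        "Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.",
    ),
]
text, metadata = split_record(miami_pairs)
```

Only the description is embedded as node text in this layout; the remaining fields travel along as metadata.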

Use Gemini Embedding to build a vector store for dense retrieval: index the restaurants as nodes in the vector store¶

In [ ]:

import qdrant_client
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.vector_stores.qdrant import QdrantVectorStore


# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_gemini_3")

vector_store = QdrantVectorStore(client=client, collection_name="collection")

# Use Gemini as the embedding model
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001", api_key=GOOGLE_API_KEY
)
Settings.llm = Gemini(api_key=GOOGLE_API_KEY)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
)

Use Gemini to synthesize the results and recommend restaurants to the user¶

In [ ]:

query_engine = index.as_query_engine(
    similarity_top_k=1,
)

response = query_engine.query(
    "recommend a Orlando restaurant for me and its nearby tourist places"
)
print(response)

For a delightful dining experience, I recommend Mythos Restaurant, known for its American cuisine and unique underwater theme. Overlooking Universal Studios' Inland Sea, this restaurant offers a captivating ambiance. After your meal, explore the nearby tourist attractions such as Universal's Islands of Adventure, Skull Island: Reign of Kong, The Wizarding World of Harry Potter, Jurassic Park River Adventure, and Hollywood Rip Ride Rockit, all located near Mythos Restaurant.
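
With `similarity_top_k=1`, only the single closest node is handed to the LLM for synthesis. The sketch below illustrates the idea of top-k dense retrieval with cosine similarity in plain Python. The three-dimensional vectors are made up for illustration and are not real Gemini embeddings, the helper names are hypothetical, and cosine similarity is assumed here as one common choice; the metric actually used depends on how the vector store collection is configured.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], docs: Dict[str, Sequence[float]], k: int = 1) -> List[str]:
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        docs.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up toy vectors standing in for restaurant embeddings.
toy_vectors = {
    "Mythos Restaurant": [0.9, 0.1, 0.0],
    "La Mar by Gaston Acurio": [0.1, 0.9, 0.0],
    "Lobster Port": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # made-up vector standing in for the embedded query
best = top_k(toy_query, toy_vectors, k=1)
```

Raising `k` would pass more candidate restaurants to the synthesis step, at the cost of a longer prompt.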
