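Multi-Modal LLM Image Understanding and Retrieval-Augmented Generation with Google Gemini and LlamaIndex
This notebook shows how to use Google's Gemini Vision models for image understanding.
The first part walks through the Gemini features currently supported:
- complete (sync and async): takes a single prompt and a list of images
- chat (sync and async): takes multiple chat messages
- stream complete (sync and async): streams the output of complete
- stream chat (sync and async): streams the output of chat
The second part combines Gemini with Pydantic to parse structured information from Google Maps screenshots:
- Define a Pydantic class with the attribute fields to extract
- Have the gemini-pro-vision model read each image and output a structured result
The third part uses Gemini and LlamaIndex to build a simple retrieval-augmented generation (RAG) pipeline over a small dataset of Google Maps restaurants:
- Build a vector index from the structured outputs of the second part
- Use the gemini-pro model to synthesize the results and recommend restaurants based on the user query
Note: google-generativeai is only available in certain countries and regions.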
In [ ]:
%pip install llama-index-multi-modal-llms-gemini
%pip install llama-index-vector-stores-qdrant
%pip install llama-index-embeddings-gemini
%pip install llama-index-llms-gemini
In [ ]:
!pip install llama-index 'google-generativeai>=0.3.0' matplotlib qdrant_client
Use Gemini to understand images from URLs
In [ ]:
%env GOOGLE_API_KEY=...
In [ ]:
import os

GOOGLE_API_KEY = ""  # add your GOOGLE API key here
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
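Typing the key straight into a cell makes it easy to leak when the notebook is shared. As a minimal sketch, the key could be read from the environment first and only prompted for when it is missing. The helper name `resolve_google_api_key` is hypothetical and is not part of LlamaIndex or the Google SDK.

```python
import os
from getpass import getpass
from typing import Callable, Mapping, Optional


def resolve_google_api_key(
    env: Optional[Mapping[str, str]] = None,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the key from the environment, or ask for it interactively.

    Hypothetical helper, for illustration only.
    """
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value so it is not echoed into the notebook.
        key = prompt("GOOGLE_API_KEY: ").strip()
    if not key:
        raise ValueError("GOOGLE_API_KEY is empty")
    return key
```

With this in place, the assignment above could become `os.environ["GOOGLE_API_KEY"] = resolve_google_api_key()`.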
Initialize Gemini and load images from URLs
In [ ]:
from llama_index.llms.gemini import Gemini
from llama_index.core.llms import ChatMessage, ImageBlock

image_urls = [
    "https://storage.googleapis.com/generativeai-downloads/data/scene.jpg",
    # Add yours here!
]

gemini_pro = Gemini(model_name="models/gemini-1.5-flash")
msg = ChatMessage("Identify the city where this photo was taken.")
for img_url in image_urls:
    msg.blocks.append(ImageBlock(url=img_url))
In [ ]:
from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt

img_response = requests.get(image_urls[0])
print(image_urls[0])
img = Image.open(BytesIO(img_response.content))
plt.imshow(img)
https://storage.googleapis.com/generativeai-downloads/data/scene.jpg
Out[ ]:
<matplotlib.image.AxesImage at 0x128032e40>
Chat with an image in the prompt
In [ ]:
response = gemini_pro.chat(messages=[msg])
In [ ]:
print(response.message.content)
That's New York City. More specifically, the photo shows a street in the **SoHo** neighborhood. The distinctive cast-iron architecture and the pedestrian bridge are characteristic of that area.
Stream chat with images
In [ ]:
stream_response = gemini_pro.stream_chat(messages=[msg])
In [ ]:
import time

for r in stream_response:
    print(r.delta, end="")
    # Add an artificial wait to make streaming visible in the notebook
    time.sleep(0.5)
That's New York City. More specifically, the photo was taken in the **West Village** neighborhood of Manhattan. The distinctive architecture and the pedestrian bridge are strong clues.
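The loop above prints each `delta` as it arrives, and the deltas concatenate into the full reply. The sketch below shows that pattern without calling Gemini: `FakeChunk` and `fake_stream` are hypothetical stand-ins that model only the `delta` attribute, so the example runs offline.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeChunk:
    """Stand-in for a streamed response item; only `delta` is modelled."""

    delta: str


def fake_stream(text: str, size: int = 8) -> Iterator[FakeChunk]:
    """Yield `text` in fixed-size pieces, mimicking a token stream."""
    for start in range(0, len(text), size):
        yield FakeChunk(delta=text[start:start + size])


def collect(stream: Iterable[FakeChunk]) -> str:
    """Print each delta as it arrives and return the concatenated text."""
    parts = []
    for chunk in stream:
        print(chunk.delta, end="", flush=True)
        parts.append(chunk.delta)
    print()
    return "".join(parts)


full_text = collect(fake_stream("That's New York City."))
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation when a reply is long.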
Async support
In [ ]:
response_achat = await gemini_pro.achat(messages=[msg])
In [ ]:
print(response_achat.message.content)
That's New York City. More specifically, the photo was taken in the **West Village** neighborhood of Manhattan. The distinctive architecture and the pedestrian bridge are strong clues.
Let's see how to stream asynchronously:
In [ ]:
import asyncio

streaming_handler = await gemini_pro.astream_chat(messages=[msg])
async for chunk in streaming_handler:
    print(chunk.delta, end="")
    # Add an artificial wait to make streaming visible in the notebook
    await asyncio.sleep(0.5)
That's New York City. More specifically, the photo was taken in the **West Village** neighborhood of Manhattan. The distinctive architecture and the pedestrian bridge are strong clues.
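The async cells above rely on top-level `await`, which Jupyter supports because it already runs an event loop. In a plain Python script the same pattern has to live inside a coroutine started with `asyncio.run`. The sketch below shows that shape with a hypothetical `fake_astream` generator standing in for the Gemini stream, so it runs without network access.

```python
import asyncio
from typing import AsyncIterator


async def fake_astream(text: str, size: int = 8) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async token stream."""
    for start in range(0, len(text), size):
        await asyncio.sleep(0)  # hand control back to the loop, as a network read would
        yield text[start:start + size]


async def main() -> str:
    """Consume the stream with `async for` and return the joined text."""
    parts = []
    async for delta in fake_astream("That's New York City."):
        print(delta, end="", flush=True)
        parts.append(delta)
    print()
    return "".join(parts)


# In a script (not a notebook), start the event loop explicitly.
result = asyncio.run(main())
```

Inside a notebook, `await main()` would be used instead, since `asyncio.run` refuses to start while another event loop is already running.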
Chat with two images
In [ ]:
image_urls = [
    "https://picsum.photos/id/1/200/300",
    "https://picsum.photos/id/26/200/300",
]

msg = ChatMessage("Is there any relationship between these images?")
for img_url in image_urls:
    msg.blocks.append(ImageBlock(url=img_url))

response_multi = gemini_pro.chat(messages=[msg])
In [ ]:
print(response_multi.message.content)
Yes, there is a relationship between the two images. Both images depict aspects of a **professional or business-casual lifestyle**. * **Image 1:** Shows someone working on a laptop, suggesting remote work, freelancing, or a business-related task. * **Image 2:** Shows a flat lay of accessories commonly associated with a professional or stylish individual: sunglasses, a bow tie, a pen, a watch, glasses, and a phone. These items suggest a certain level of personal style and preparedness often associated with business or professional settings. The connection is indirect but thematic. They both visually represent elements of a similar lifestyle or persona.
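Part 2: Gemini + Pydantic for structured output parsing from images
- Use Gemini to reason over images
- Use a Pydantic program to turn Gemini's image reasoning into structured output
Download example images for Gemini to understand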
In [ ]:
from pathlib import Path

input_image_path = Path("google_restaurants")
if not input_image_path.exists():
    Path.mkdir(input_image_path)
In [ ]:
!curl -sL "https://docs.google.com/uc?export=download&id=1Pg04p6ss0FlBgz00noHAOAJ1EYXiosKg" -o ./google_restaurants/miami.png
!curl -sL "https://docs.google.com/uc?export=download&id=1dYZy17bD6pSsEyACXx9fRMNx93ok-kTJ" -o ./google_restaurants/orlando.png
!curl -sL "https://docs.google.com/uc?export=download&id=1ShPnYVc1iL_TA1t7ErCFEAHT74-qvMrn" -o ./google_restaurants/sf.png
!curl -sL "https://docs.google.com/uc?export=download&id=1WjISWnatHjwL4z5VD_9o09ORWhRJuYqm" -o ./google_restaurants/toronto.png
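Because `curl -sL` runs silently, a failed download can leave an error page saved under a `.png` name, and the problem only surfaces later when the image is opened. A quick check is sketched below, assuming the files really are PNGs as their extension suggests. The helper names `is_png` and `find_bad_downloads` are hypothetical.

```python
from pathlib import Path
from typing import List

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png(path: Path) -> bool:
    """Return True if the file begins with the PNG signature."""
    with path.open("rb") as fh:
        return fh.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


def find_bad_downloads(folder: Path) -> List[Path]:
    """List *.png files in `folder` whose content is not PNG data."""
    return sorted(p for p in folder.glob("*.png") if not is_png(p))
```

Calling `find_bad_downloads(Path("google_restaurants"))` after the downloads should return an empty list if all four files arrived intact under that assumption.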
Define the Pydantic class for the structured parser
In [ ]:
from pydantic import BaseModel
from PIL import Image
import matplotlib.pyplot as plt


class GoogleRestaurant(BaseModel):
    """Data model for a Google Restaurant."""

    restaurant: str
    food: str
    location: str
    category: str
    hours: str
    price: str
    rating: float
    review: str
    description: str
    nearby_tourist_places: str


google_image_url = "./google_restaurants/miami.png"
image = Image.open(google_image_url).convert("RGB")

plt.figure(figsize=(16, 5))
plt.imshow(image)
Out[ ]:
<matplotlib.image.AxesImage at 0x10953cce0>
Call the Pydantic program and generate structured output
In [ ]:
from llama_index.multi_modal_llms.gemini import GeminiMultiModal
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser

# The trailing backslashes join the lines, so each line needs its own
# trailing space to keep the words apart.
prompt_template_str = """\
can you summarize what is in the image \
and return the answer with json format \
"""


def pydantic_gemini(
    model_name, output_class, image_documents, prompt_template_str
):
    """Run Gemini on the images and parse the reply into `output_class`."""
    gemini_llm = GeminiMultiModal(model_name=model_name)

    llm_program = MultiModalLLMCompletionProgram.from_defaults(
        output_parser=PydanticOutputParser(output_class),
        image_documents=image_documents,
        prompt_template_str=prompt_template_str,
        multi_modal_llm=gemini_llm,
        verbose=True,
    )

    response = llm_program()
    return response
Generate the Pydantic structured output with the Gemini Vision model
In [ ]:
from llama_index.core import SimpleDirectoryReader

google_image_documents = SimpleDirectoryReader(
    "./google_restaurants"
).load_data()

results = []
for img_doc in google_image_documents:
    pydantic_response = pydantic_gemini(
        "models/gemini-1.5-flash",
        GoogleRestaurant,
        [img_doc],
        prompt_template_str,
    )
    # only output the results for miami for example along with image
    if "miami" in img_doc.image_path:
        for r in pydantic_response:
            print(r)
    results.append(pydantic_response)
> Raw output: ```json
{
  "restaurant": "La Mar by Gaston Acurio",
  "food": "Peruvian & fusion",
  "location": "500 Brickell Key Dr, Miami, FL 33131",
  "category": "South American restaurant",
  "hours": "Opens 6PM, Closes 11 PM",
  "price": "$$$",
  "rating": 4.4,
  "review": "Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.",
  "description": "Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.",
  "nearby_tourist_places": "Brickell Key area with scenic views"
}
```
('restaurant', 'La Mar by Gaston Acurio')
('food', 'Peruvian & fusion')
('location', '500 Brickell Key Dr, Miami, FL 33131')
('category', 'South American restaurant')
('hours', 'Opens 6PM, Closes 11 PM')
('price', '$$$')
('rating', 4.4)
('review', 'Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.')
('description', 'Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.')
('nearby_tourist_places', 'Brickell Key area with scenic views')
> Raw output: ```json
{
  "restaurant": "Mythos Restaurant",
  "food": "American fare in a mythic underwater themed spot",
  "location": "6000 Universal Blvd, Orlando, FL 32819, United States",
  "category": "Restaurant",
  "hours": "Open: Closes in 7 hrs, Islands of Adventure",
  "price": "$$",
  "rating": 4.3,
  "review": "Overlooking Universal Studios/Island sea, this mythic underwater themed spot serves American fare.",
  "description": "Dine-in, Delivery",
  "nearby_tourist_places": "Universal Islands, Jurassic Park River Adventure"
}
```
> Raw output: ```json
{
  "restaurant": "Sam's Grill & Seafood Restaurant",
  "food": "Seafood",
  "location": "374 Bush St, San Francisco, CA 94104, United States",
  "category": "Seafood Restaurant",
  "hours": "Open ⋅ Closes 8:30 PM",
  "price": "$$$",
  "rating": 4.4,
  "review": "Modern spin-off adjacent Sam's Grill, for seafood, drinks & happy hour loungey digs with a patio.",
  "description": "Modern spin-off adjacent Sam's Grill, for seafood, drinks & happy hour loungey digs with a patio.",
  "nearby_tourist_places": "Chinatown, San Francisco"
}
```
> Raw output: ```json
{
  "restaurant": "Lobster Port",
  "food": "Seafood restaurant offering lobster, dim sum & Asian fusion dishes",
  "location": "8432 Leslie St, Thornhill, ON L3T 7M6",
  "category": "Seafood",
  "hours": "Open 10pm",
  "price": "$$",
  "rating": 4.0,
  "review": "Elegant, lively venue with a banquet-hall setup",
  "description": "Elegant, lively venue with a banquet-hall setup offering lobster, dim sum & Asian fusion dishes.",
  "nearby_tourist_places": "Nearby tourist places are not explicitly listed in the image but the map shows various points of interest in the surrounding area."
}
```
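Observations:
- Gemini filled in every field required by the Pydantic class for all four screenshots
- It can also recognize nearby parks and attractions from the Google Maps screenshot, although for the Toronto image it reported that none were explicitly listed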
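The raw outputs above wrap the JSON in a fenced json block, so the text has to be unwrapped before it can be validated. The sketch below illustrates that idea with the standard library only. It is not how LlamaIndex's `PydanticOutputParser` is implemented, and `Restaurant` is a reduced, hypothetical stand-in for `GoogleRestaurant` built as a dataclass rather than a Pydantic model.

```python
import json
import re
from dataclasses import dataclass, fields

FENCE = "`" * 3  # three backticks, as used around the model's JSON reply


@dataclass
class Restaurant:
    """Reduced stand-in for GoogleRestaurant (three of its fields)."""

    restaurant: str
    price: str
    rating: float


def parse_fenced_json(raw: str) -> Restaurant:
    """Extract the JSON object from `raw` and coerce it to Restaurant."""
    # Greedy match from the first "{" to the last "}" skips the fence markers.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))
    missing = {f.name for f in fields(Restaurant)} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Extra keys in the reply (such as "food") are ignored in this sketch.
    return Restaurant(
        restaurant=str(data["restaurant"]),
        price=str(data["price"]),
        rating=float(data["rating"]),
    )


raw_output = (
    FENCE + 'json\n{"restaurant": "La Mar by Gaston Acurio", '
    '"price": "$$$", "rating": 4.4, "food": "Peruvian & fusion"}\n' + FENCE
)
parsed = parse_fenced_json(raw_output)
print(parsed)
```

Failing loudly on a missing field is a deliberate choice here: a silently defaulted rating or price would later be indexed as if the model had actually read it from the image.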
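Part 3: Build a multi-modal RAG system for restaurant recommendation
Our stack consists of Gemini, LlamaIndex, and Pydantic structured output.
Construct text nodes for the vector store, storing each restaurant's metadata and description.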
In [ ]:
from llama_index.core.schema import TextNode

nodes = []
for res in results:
    text_node = TextNode()
    metadata = {}
    # Iterating over a Pydantic model yields (field_name, value) pairs
    for r in res:
        # set description as text of TextNode
        if r[0] == "description":
            text_node.text = r[1]
        else:
            metadata[r[0]] = r[1]
    text_node.metadata = metadata
    nodes.append(text_node)
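The loop above works because iterating over a Pydantic model yields `(field_name, value)` tuples, which is why it indexes `r[0]` and `r[1]`. The same split can be sketched with plain tuples and dictionaries. The function name `split_record` and the sample record are illustrative only, with values taken from the Orlando output shown earlier.

```python
from typing import Any, Dict, Iterable, Tuple


def split_record(
    pairs: Iterable[Tuple[str, Any]], text_field: str = "description"
) -> Tuple[str, Dict[str, Any]]:
    """Return (text, metadata) from an iterable of (field, value) pairs."""
    text = ""
    metadata: Dict[str, Any] = {}
    for name, value in pairs:
        if name == text_field:
            text = value  # this field becomes the node's text
        else:
            metadata[name] = value  # everything else is kept as metadata
    return text, metadata


record = [
    ("restaurant", "Mythos Restaurant"),
    ("rating", 4.3),
    ("description", "Dine-in, Delivery"),
]
text, metadata = split_record(record)
```

Unpacking the tuple as `name, value` reads more clearly than `r[0]` and `r[1]`, and would work unchanged in the notebook loop.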
Use Gemini embeddings to build a vector store for dense retrieval, indexing the restaurants as nodes
In [ ]:
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_gemini_3")
vector_store = QdrantVectorStore(client=client, collection_name="collection")

# Use Gemini for both the embedding model and the LLM
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001", api_key=GOOGLE_API_KEY
)
Settings.llm = Gemini(api_key=GOOGLE_API_KEY)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
)
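Dense retrieval ranks stored vectors by their similarity to the query vector and returns the top k. The toy sketch below uses cosine similarity, which is one common choice. The metric actually used depends on how the Qdrant collection is configured, and the three-dimensional vectors here are made up for illustration rather than produced by Gemini embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int = 1
) -> List[Tuple[str, float]]:
    """Return the k (name, score) pairs most similar to `query`."""
    scored = [(name, cosine(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Made-up embeddings, one per restaurant node.
toy_vectors = {
    "miami": [0.9, 0.1, 0.0],
    "orlando": [0.1, 0.9, 0.1],
    "toronto": [0.0, 0.2, 0.9],
}
best = top_k([0.2, 0.8, 0.0], toy_vectors, k=1)
```

With `k=1` only the single closest node is returned, so the quality of the final answer depends entirely on that one match.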
Use Gemini to synthesize the results and recommend restaurants to the user
In [ ]:
query_engine = index.as_query_engine(
    similarity_top_k=1,
)

response = query_engine.query(
    "recommend a Orlando restaurant for me and its nearby tourist places"
)
print(response)
For a delightful dining experience, I recommend Mythos Restaurant, known for its American cuisine and unique underwater theme. Overlooking Universal Studios' Inland Sea, this restaurant offers a captivating ambiance. After your meal, explore the nearby tourist attractions such as Universal's Islands of Adventure, Skull Island: Reign of Kong, The Wizarding World of Harry Potter, Jurassic Park River Adventure, and Hollywood Rip Ride Rockit, all located near Mythos Restaurant.