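Azure CosmosDB MongoDB Vector Store

This notebook shows how to use Azure Cosmos DB for MongoDB vCore to perform vector search in LlamaIndex. We will use Azure OpenAI to create the embeddings.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.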

In [ ]:

%pip install llama-index-embeddings-openai
%pip install llama-index-vector-stores-azurecosmosmongo
%pip install llama-index-llms-azure-openai

In [ ]:

!pip install llama-index

In [ ]:

import os
import json
import openai
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

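Set Up Azure OpenAI

The first step is to configure the models. They will be used to create embeddings for the documents loaded into the database and to run LLM completions.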
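
Because the configuration reads every setting from the environment, a missing variable silently becomes `None`. The following is a minimal sketch, using only the standard library, of a check that could run before the models are created. The variable names are the ones read in the configuration cell; `missing_env_vars` is a hypothetical helper, not part of LlamaIndex.

```python
import os

# Variable names taken from the configuration cell in this notebook.
REQUIRED_ENV_VARS = [
    "OPENAI_MODEL_COMPLETION",
    "OPENAI_MODEL_EMBEDDING",
    "OPENAI_DEPLOYMENT_EMBEDDING",
    "OPENAI_API_BASE",
    "OPENAI_API_KEY",
    "OPENAI_API_TYPE",
    "OPENAI_API_VERSION",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Reporting all missing names at once tends to be more convenient than failing on the first one, since every value is needed before the first request is sent.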

In [ ]:

import os

# Set up the AzureOpenAI instance
llm = AzureOpenAI(
    model_name=os.getenv("OPENAI_MODEL_COMPLETION"),
    deployment_name=os.getenv("OPENAI_MODEL_COMPLETION"),
    api_base=os.getenv("OPENAI_API_BASE"),
    api_key=os.getenv("OPENAI_API_KEY"),
    api_type=os.getenv("OPENAI_API_TYPE"),
    api_version=os.getenv("OPENAI_API_VERSION"),
    temperature=0,
)

# Set up the OpenAIEmbedding instance
embed_model = OpenAIEmbedding(
    model=os.getenv("OPENAI_MODEL_EMBEDDING"),
    deployment_name=os.getenv("OPENAI_DEPLOYMENT_EMBEDDING"),
    api_base=os.getenv("OPENAI_API_BASE"),
    api_key=os.getenv("OPENAI_API_KEY"),
    api_type=os.getenv("OPENAI_API_TYPE"),
    api_version=os.getenv("OPENAI_API_VERSION"),
)

In [ ]:

from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model

Download Data

In [ ]:

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

Load Documents

Use SimpleDirectoryReader to load the documents stored under data/paul_graham/.

In [ ]:

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

print("Document ID:", documents[0].doc_id)

Document ID: c432ff1c-61ea-4c91-bd89-62be29078e79
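
Before indexing, it can help to confirm that the loader picked up the expected text. The sketch below is a hypothetical helper (`summarize_documents` is not a LlamaIndex function) that relies only on the `doc_id` and `text` attributes of each document. The stand-in objects exist solely so the block runs on its own; in the notebook you would pass the real `documents` list instead.

```python
from types import SimpleNamespace


def summarize_documents(documents, preview_chars=80):
    """Return one summary line per document: ID, length, and a text preview.

    Assumes each item exposes `doc_id` and `text`, as LlamaIndex documents do.
    """
    lines = []
    for doc in documents:
        text = doc.text
        # Collapse newlines so each summary stays on a single line.
        preview = text[:preview_chars].replace("\n", " ")
        lines.append(f"{doc.doc_id}: {len(text)} chars | {preview}")
    return lines


# Stand-in object for illustration only; not real notebook data.
sample = [SimpleNamespace(doc_id="demo-1", text="first line\nsecond line")]
for line in summarize_documents(sample):
    print(line)
```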

Create the Index

Here we establish a connection to an Azure Cosmos DB for MongoDB vCore cluster and create a vector search index.
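
To make concrete what a vector search computes, here is a purely conceptual, brute-force sketch in plain Python. It is not how the service is implemented: the database maintains its own index structures, and the similarity metric depends on how the index is configured. Cosine similarity, the toy two-dimensional vectors, and the names `cosine_similarity` and `top_k` are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Score every stored vector against the query and keep the k best matches."""
    scored = [
        (doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real embeddings have hundreds or thousands of dimensions.
toy_vectors = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.7, 0.7],
    "chunk_c": [0.0, 1.0],
}
results = top_k([1.0, 0.1], toy_vectors, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

A dedicated vector index avoids scoring every stored vector, which is what makes search practical on large collections.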

In [ ]:

import pymongo
from llama_index.vector_stores.azurecosmosmongo import (
    AzureCosmosDBMongoDBVectorSearch,
)
from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext
from llama_index.core import SimpleDirectoryReader

connection_string = os.environ.get("AZURE_COSMOSDB_MONGODB_URI")
mongodb_client = pymongo.MongoClient(connection_string)
store = AzureCosmosDBMongoDBVectorSearch(
    mongodb_client=mongodb_client,
    db_name="demo_vectordb",
    collection_name="paul_graham_essay",
)
storage_context = StorageContext.from_defaults(vector_store=store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

Query the Index

We can now ask questions using our index.

In [ ]:

query_engine = index.as_query_engine()
response = query_engine.query("What did the author love working on?")

In [ ]:

import textwrap

print(textwrap.fill(str(response), 100))

The author loved working on multiple projects that were not their thesis while in grad school,
including Lisp hacking and writing On Lisp. They eventually wrote a dissertation on applications of
continuations in just 5 weeks to graduate. Afterward, they applied to art schools and were accepted
into the BFA program at RISD.
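
To see which retrieved chunks an answer was based on, the response's `source_nodes` can be inspected. The sketch below is a hypothetical helper (`format_sources` is not part of LlamaIndex) that assumes each item has a `score` and a `node` with a `get_content()` method. The stand-in response only makes the block runnable on its own and contains placeholder text, not real query results.

```python
from types import SimpleNamespace


def format_sources(response, preview_chars=100):
    """Return one line per retrieved chunk: similarity score and a text preview."""
    lines = []
    for item in response.source_nodes:
        preview = item.node.get_content()[:preview_chars].replace("\n", " ")
        # Some retrievers may not report a score, so handle None explicitly.
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        lines.append(f"score={score} | {preview}")
    return lines


# Stand-in response for illustration only; not real query output.
fake_node = SimpleNamespace(get_content=lambda: "placeholder chunk text")
fake_response = SimpleNamespace(
    source_nodes=[SimpleNamespace(node=fake_node, score=0.8123)]
)
for line in format_sources(fake_response):
    print(line)
```

In the notebook, calling `format_sources(response)` on the query result would list the chunks that were retrieved for that question.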

In [ ]:

response = query_engine.query("What did he/she do in summer of 2016?")

In [ ]:

print(textwrap.fill(str(response), 100))

The person moved to England with their family in the summer of 2016.