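Couchbase Vector Store¶
Couchbase is an award-winning distributed NoSQL cloud database built for versatility, performance, scalability, and cost efficiency across cloud, mobile, AI, and edge computing applications. Couchbase embraces AI with coding assistance for developers and vector search for their applications.
Vector Search is part of the Full Text Search Service (Search Service) in Couchbase.
This tutorial explains how to use Vector Search in Couchbase. It works with both Couchbase Capella and a self-managed Couchbase Server.
Installation¶
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.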
%pip install llama-index-vector-stores-couchbase
!pip install llama-index
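Create Couchbase Connection¶
We first create a connection to the Couchbase cluster and then pass the cluster object to the vector store.
Here, we connect using a username and password. You can also connect in any other way your cluster supports.
For more information on connecting to your Couchbase cluster, please check the Python SDK documentation.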
COUCHBASE_CONNECTION_STRING = (
    "couchbase://localhost"  # or "couchbases://localhost" if using TLS
)
DB_USERNAME = "Administrator"
DB_PASSWORD = "P@ssword1!"
from datetime import timedelta
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
auth = PasswordAuthenticator(DB_USERNAME, DB_PASSWORD)
options = ClusterOptions(auth)
cluster = Cluster(COUCHBASE_CONNECTION_STRING, options)
# Wait until the cluster is ready for use.
cluster.wait_until_ready(timedelta(seconds=5))
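The credentials above are hardcoded for brevity. As a minimal sketch, assuming you would rather keep them out of the notebook, the same three values could be read from environment variables. The variable names (`COUCHBASE_CONNECTION_STRING`, `COUCHBASE_USERNAME`, `COUCHBASE_PASSWORD`) and the helper `load_couchbase_settings` are hypothetical; neither Couchbase nor LlamaIndex defines them.

```python
import os
from typing import Mapping, NamedTuple, Optional


class CouchbaseSettings(NamedTuple):
    connection_string: str
    username: str
    password: str


def load_couchbase_settings(
    env: Optional[Mapping[str, str]] = None,
) -> CouchbaseSettings:
    """Read connection settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    # Fall back to the local, non-TLS endpoint and user used in this tutorial.
    connection_string = env.get(
        "COUCHBASE_CONNECTION_STRING", "couchbase://localhost"
    )
    username = env.get("COUCHBASE_USERNAME", "Administrator")
    password = env.get("COUCHBASE_PASSWORD")
    if not password:
        raise ValueError("COUCHBASE_PASSWORD is not set")
    return CouchbaseSettings(connection_string, username, password)


# Demonstration with an explicit mapping instead of the real environment.
settings = load_couchbase_settings({"COUCHBASE_PASSWORD": "P@ssword1!"})
print(settings.connection_string)  # couchbase://localhost
```

The resulting `settings.username` and `settings.password` can then be passed to `PasswordAuthenticator` exactly as `DB_USERNAME` and `DB_PASSWORD` are above.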
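Create the Search Index¶
Currently, the Search index must be created from the Couchbase Capella or Couchbase Server UI, or through the REST interface.
Let us define a Search index named vector-index on the testing bucket.
For this example, we will use the Import Index feature of the Search Service in the UI.
The index is defined on the _default collection in the _default scope of the testing bucket, with the vector field set to embedding (1536 dimensions) and the text field set to text. All fields under metadata in the document are also indexed and stored through a dynamic mapping, to accommodate varying document structures. The similarity metric is set to dot_product.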
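To make the dot_product metric concrete, here is a small plain-Python sketch. It is not how the Search Service computes scores internally; it only illustrates the arithmetic, and the 3-dimensional vectors are made up for the example. It also shows that for vectors normalized to unit length, the dot product and cosine similarity coincide.

```python
import math
from typing import List, Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products of two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot_product(v, v))
    return [x / norm for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


# Toy vectors standing in for a query embedding and a document embedding.
query_vec = normalize([1.0, 2.0, 2.0])
doc_vec = normalize([2.0, 1.0, 2.0])

print(round(dot_product(query_vec, doc_vec), 4))        # 0.8889
print(round(cosine_similarity(query_vec, doc_vec), 4))  # 0.8889
```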
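How to Import an Index to the Full Text Search Service?¶
- Couchbase Server
  - Click Search -> Add Index -> Import
  - Copy the index definition below into the Import screen
  - Click Create Index to create the index
- Couchbase Capella
  - Copy the index definition below to a new file, index.json
  - Import the file in Capella by following the instructions in the documentation
  - Click Create Index to create the index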
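Besides the UI, the index can also be created through the REST interface. The sketch below only builds the HTTP request and does not send it. It assumes the Search Service listens on port 8094 without TLS and accepts `PUT /api/index/{name}`; verify the endpoint against the Search REST API reference for your server version, in particular if you work with scoped indexes. The helper name `build_create_index_request` is hypothetical, and the `definition` dictionary is a truncated placeholder for the full index definition.

```python
import base64
import json
import urllib.request
from typing import Any, Dict


def build_create_index_request(
    host: str,
    index_name: str,
    definition: Dict[str, Any],
    username: str,
    password: str,
    port: int = 8094,  # assumed Search Service port (non-TLS)
) -> urllib.request.Request:
    """Build, but do not send, a PUT request that would create the index."""
    url = f"http://{host}:{port}/api/index/{index_name}"
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(definition).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Truncated placeholder; in practice use the full definition from index.json.
definition = {
    "name": "vector-index",
    "type": "fulltext-index",
    "sourceName": "testing",
}
request = build_create_index_request(
    "localhost", "vector-index", definition, "Administrator", "P@ssword1!"
)
print(request.get_method(), request.full_url)
# To send it against a running cluster: urllib.request.urlopen(request)
```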
Index Definition¶
{
  "name": "vector-index",
  "type": "fulltext-index",
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "type"
    },
    "mapping": {
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": true,
        "properties": {
          "metadata": {
            "dynamic": true,
            "enabled": true
          },
          "embedding": {
            "enabled": true,
            "dynamic": false,
            "fields": [
              {
                "dims": 1536,
                "index": true,
                "name": "embedding",
                "similarity": "dot_product",
                "type": "vector",
                "vector_index_optimized_for": "recall"
              }
            ]
          },
          "text": {
            "enabled": true,
            "dynamic": false,
            "fields": [
              {
                "index": true,
                "name": "text",
                "store": true,
                "type": "text"
              }
            ]
          }
        }
      },
      "default_type": "_default",
      "docvalues_dynamic": false,
      "index_dynamic": true,
      "store_dynamic": true,
      "type_field": "_type"
    },
    "store": {
      "indexType": "scorch",
      "segmentVersion": 16
    }
  },
  "sourceType": "gocbcore",
  "sourceName": "testing",
  "sourceParams": {},
  "planParams": {
    "maxPartitionsPerPIndex": 103,
    "indexPartitions": 10,
    "numReplicas": 0
  }
}
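The `dims` value in the definition should match the length of the vectors your embedding model produces. As a quick sanity check before importing, a sketch like the following walks an index definition and reports every vector field it finds. The helper `iter_vector_fields` is hypothetical, `EXPECTED_DIMS` is simply the 1536 configured above, and the embedded JSON is a trimmed-down excerpt of the definition rather than the full file.

```python
import json
from typing import Any, Dict, Iterator


def iter_vector_fields(node: Any) -> Iterator[Dict[str, Any]]:
    """Recursively yield every JSON object whose "type" is "vector"."""
    if isinstance(node, dict):
        if node.get("type") == "vector":
            yield node
        for value in node.values():
            yield from iter_vector_fields(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_vector_fields(item)


# Trimmed-down excerpt; in practice load the real file instead:
#   with open("index.json") as f:
#       definition = json.load(f)
definition = json.loads(
    """
    {
      "name": "vector-index",
      "params": {"mapping": {"default_mapping": {"properties": {
        "embedding": {"fields": [
          {"dims": 1536, "name": "embedding",
           "similarity": "dot_product", "type": "vector"}
        ]},
        "text": {"fields": [{"name": "text", "type": "text"}]}
      }}}}
    }
    """
)

EXPECTED_DIMS = 1536  # dimensionality the index above is configured for
vector_fields = list(iter_vector_fields(definition))
for field in vector_fields:
    status = "ok" if field["dims"] == EXPECTED_DIMS else "mismatch"
    print(field["name"], field["dims"], field["similarity"], status)
```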
We will now set the bucket, scope, and collection names in the Couchbase cluster that we want to use for Vector Search.
For this example, we use the default scope and collection.
BUCKET_NAME = "testing"
SCOPE_NAME = "_default"
COLLECTION_NAME = "_default"
SEARCH_INDEX_NAME = "vector-index"
# Import required packages
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.core import Settings
from llama_index.vector_stores.couchbase import CouchbaseSearchVectorStore
For this tutorial, we will use OpenAI embeddings.
import os
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
OpenAI API Key: ········
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
Download Data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2024-04-09 23:31:46--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.008s

2024-04-09 23:31:46 (8.97 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
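If `mkdir` and `wget` are not available in your environment, the two shell commands above could be replaced by a short standard-library sketch. The helper name `download_file` is hypothetical; the URL and destination path are the ones used above.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_file(url: str, destination: str) -> Path:
    """Download url to destination, creating parent directories first."""
    path = Path(destination)
    # Same effect as `mkdir -p` on the parent directory.
    path.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(path))
    return path


ESSAY_URL = (
    "https://raw.githubusercontent.com/run-llama/llama_index/main/"
    "docs/docs/examples/data/paul_graham/paul_graham_essay.txt"
)

# Uncomment to perform the download (requires network access):
# download_file(ESSAY_URL, "data/paul_graham/paul_graham_essay.txt")
```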
Load the Documents¶
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
vector_store = CouchbaseSearchVectorStore(
    cluster=cluster,
    bucket_name=BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=COLLECTION_NAME,
    index_name=SEARCH_INDEX_NAME,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Basic Example¶
We will ask the query engine a question about the essay we just indexed.
query_engine = index.as_query_engine()
response = query_engine.query("What were his investments in Y Combinator?")
print(response)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
His investments in Y Combinator were $6k per founder, totaling $12k in the typical two-founder case, in return for 6% equity.
Metadata Filters¶
We will create some example documents with metadata so that we can see how to filter documents based on metadata.
from llama_index.core.schema import TextNode
nodes = [
    TextNode(
        text="The Shawshank Redemption",
        metadata={
            "author": "Stephen King",
            "theme": "Friendship",
        },
    ),
    TextNode(
        text="The Godfather",
        metadata={
            "director": "Francis Ford Coppola",
            "theme": "Mafia",
        },
    ),
    TextNode(
        text="Inception",
        metadata={
            "director": "Christopher Nolan",
        },
    ),
]
vector_store.add(nodes)
['5abb42cf-7312-46eb-859e-60df4f92842a', 'b90525f4-38bf-453c-a51a-5f0718bccc98', '22f732d0-da17-4bad-b3cd-b54e2102367a']
# Metadata filter
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="theme", value="Mafia")]
)
retriever = index.as_retriever(filters=filters)
retriever.retrieve("What is inception about?")
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[NodeWithScore(node=TextNode(id_='b90525f4-38bf-453c-a51a-5f0718bccc98', embedding=None, metadata={'director': 'Francis Ford Coppola', 'theme': 'Mafia'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='The Godfather', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.3068528194400547)]
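The result above contains only "The Godfather" even though the question mentions Inception, because the filter restricts the candidates to nodes whose theme metadata equals Mafia. The plain-Python sketch below mimics that exact-match behavior on ordinary dictionaries. It is an illustration only, not how `CouchbaseSearchVectorStore` evaluates filters, and the names `sample_nodes` and `exact_match` are hypothetical.

```python
from typing import Any, Dict, List

# Dictionary stand-ins for the three TextNode objects created above.
sample_nodes: List[Dict[str, Any]] = [
    {
        "text": "The Shawshank Redemption",
        "metadata": {"author": "Stephen King", "theme": "Friendship"},
    },
    {
        "text": "The Godfather",
        "metadata": {"director": "Francis Ford Coppola", "theme": "Mafia"},
    },
    {
        "text": "Inception",
        "metadata": {"director": "Christopher Nolan"},
    },
]


def exact_match(
    nodes: List[Dict[str, Any]], filters: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Keep nodes whose metadata contains every key/value pair in filters."""
    return [
        node
        for node in nodes
        if all(
            key in node["metadata"] and node["metadata"][key] == value
            for key, value in filters.items()
        )
    ]


matches = exact_match(sample_nodes, {"theme": "Mafia"})
print([node["text"] for node in matches])  # ['The Godfather']
```

"Inception" is excluded here because it has no theme key at all, regardless of how similar its text is to the question.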
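# Hook passed to the vector store as "custom_query"; here it only prints the
# query it receives and returns it unchanged.
def custom_query(query, query_str):
    print("custom query", query)
    return query


query_engine = index.as_query_engine(
    vector_store_kwargs={
        # Search query supplied alongside the vector search
        "cb_search_options": {
            "query": {"match": "growing up", "field": "text"}
        },
        "custom_query": custom_query,
    }
)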
response = query_engine.query("what were his investments in Y Combinator?")
print(response)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
His investments in Y Combinator were based on a combination of the deal he did with Julian ($10k for 10%) and what Robert said MIT grad students got for the summer ($6k). He invested $6k per founder, which in the typical two-founder case was $12k, in return for 6%.
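The cb_search_options value used above is a dictionary in the Search Service's JSON query syntax. As a sketch, small helpers can make such dictionaries easier to compose. The helper names `match_query` and `conjuncts` are hypothetical and not part of any library. The `match` and `conjuncts` keys follow the Search query format, but whether a given combination behaves as intended together with the vector search is an assumption to verify against your cluster.

```python
from typing import Any, Dict


def match_query(text: str, field: str) -> Dict[str, Any]:
    """Build a match query restricted to a single field."""
    return {"match": text, "field": field}


def conjuncts(*queries: Dict[str, Any]) -> Dict[str, Any]:
    """Combine queries so that all of them must match."""
    return {"conjuncts": list(queries)}


# Same shape as the options used above.
single = {"query": match_query("growing up", "text")}

# Two match queries on the text field that must both match.
combined = {
    "query": conjuncts(
        match_query("growing up", "text"),
        match_query("Y Combinator", "text"),
    )
}

print(single)
print(combined)
# Either dictionary could then be passed as:
#   index.as_query_engine(vector_store_kwargs={"cb_search_options": combined})
```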