CouchbaseVectorStoreDemo

Couchbase 向量存储¶

Couchbase 是一款屡获殊荣的分布式 NoSQL 云数据库，为您的云端、移动端、人工智能和边缘计算应用提供无与伦比的多功能性、性能、可扩展性和经济价值。Couchbase 拥抱人工智能技术，为开发者提供编码辅助功能，并为应用程序配备向量搜索能力。

向量搜索是 Couchbase 中全文搜索服务（搜索服务）的组成部分。

本教程将介绍如何在 Couchbase 中使用向量搜索功能。您可以在 Couchbase Capella 云服务或自托管的 Couchbase Server 上使用该功能。

安装¶

如果您在 Colab 上打开此 Notebook，可能需要安装 LlamaIndex 🦙。

In [ ]:

Copied!

%pip install llama-index-vector-stores-couchbase
%pip install llama-index-vector-stores-couchbase

In [ ]:

Copied!

!pip install llama-index
!pip install llama-index

创建 Couchbase 连接¶

我们首先建立与 Couchbase 集群的连接，然后将集群对象传递给向量存储库。

此处我们使用用户名和密码进行连接。您也可以通过集群支持的其他任何方式进行连接。

有关连接 Couchbase 集群的更多信息，请查阅 Python SDK 文档。

In [ ]:

Copied!





COUCHBASE_CONNECTION_STRING = (
    "couchbase://localhost"  # or "couchbases://localhost" if using TLS
)
DB_USERNAME = "Administrator"
DB_PASSWORD = "P@ssword1!"
COUCHBASE_CONNECTION_STRING = (
    "couchbase://localhost"  # or "couchbases://localhost" if using TLS
)
DB_USERNAME = "Administrator"
DB_PASSWORD = "P@ssword1!"

In [ ]:

Copied!





from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

auth = PasswordAuthenticator(DB_USERNAME, DB_PASSWORD)
options = ClusterOptions(auth)
cluster = Cluster(COUCHBASE_CONNECTION_STRING, options)

# Wait until the cluster is ready for use.
cluster.wait_until_ready(timedelta(seconds=5))
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

auth = PasswordAuthenticator(DB_USERNAME, DB_PASSWORD)
options = ClusterOptions(auth)
cluster = Cluster(COUCHBASE_CONNECTION_STRING, options)

# Wait until the cluster is ready for use.
cluster.wait_until_ready(timedelta(seconds=5))

创建搜索索引¶

目前需要通过 Couchbase Capella 或 Server 的用户界面，或使用 REST 接口来创建搜索索引。

让我们在 testing 存储桶上定义一个名为 vector-index 的搜索索引。

本示例将使用用户界面中 Search Service 的 "导入索引" 功能。

我们将在 testing 存储桶的 _default 作用域下的 _default 集合中定义索引，其中向量字段设置为维度数为 1536 的 embedding 字段，文本字段设置为 text。同时我们通过动态映射对文档中 metadata 下的所有字段进行索引和存储，以适配不同的文档结构。相似度度量指标设置为 dot_product。

如何将索引导入全文搜索服务？¶

Couchbase Server
- 点击 Search -> Add Index -> Import
- 在导入界面复制以下索引定义
- 点击 Create Index 创建索引
Couchbase Capella
- 将索引定义复制到新文件 index.json
- 按照文档说明在 Capella 中导入该文件
- 点击 Create Index 创建索引

索引定义¶

{
 "name": "vector-index",
 "type": "fulltext-index",
 "params": {
  "doc_config": {
   "docid_prefix_delim": "",
   "docid_regexp": "",
   "mode": "type_field",
   "type_field": "type"
  },
  "mapping": {
   "default_analyzer": "standard",
   "default_datetime_parser": "dateTimeOptional",
   "default_field": "_all",
   "default_mapping": {
    "dynamic": true,
    "enabled": true,
    "properties": {
     "metadata": {
      "dynamic": true,
      "enabled": true
     },
     "embedding": {
      "enabled": true,
      "dynamic": false,
      "fields": [
       {
        "dims": 1536,
        "index": true,
        "name": "embedding",
        "similarity": "dot_product",
        "type": "vector",
        "vector_index_optimized_for": "recall"
       }
      ]
     },
     "text": {
      "enabled": true,
      "dynamic": false,
      "fields": [
       {
        "index": true,
        "name": "text",
        "store": true,
        "type": "text"
       }
      ]
     }
    }
   },
   "default_type": "_default",
   "docvalues_dynamic": false,
   "index_dynamic": true,
   "store_dynamic": true,
   "type_field": "_type"
  },
  "store": {
   "indexType": "scorch",
   "segmentVersion": 16
  }
 },
 "sourceType": "gocbcore",
 "sourceName": "testing",
 "sourceParams": {},
 "planParams": {
  "maxPartitionsPerPIndex": 103,
  "indexPartitions": 10,
  "numReplicas": 0
 }
}

现在我们将设置 Couchbase 集群中用于向量搜索的桶（bucket）、作用域（scope）和集合（collection）名称。

本示例中我们使用默认的作用域和集合。

In [ ]:

Copied!





BUCKET_NAME = "testing"
SCOPE_NAME = "_default"
COLLECTION_NAME = "_default"
SEARCH_INDEX_NAME = "vector-index"
BUCKET_NAME = "testing"
SCOPE_NAME = "_default"
COLLECTION_NAME = "_default"
SEARCH_INDEX_NAME = "vector-index"

In [ ]:

Copied!





# Import required packages
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.core import Settings
from llama_index.vector_stores.couchbase import CouchbaseSearchVectorStore
# Import required packages
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.core import Settings
from llama_index.vector_stores.couchbase import CouchbaseSearchVectorStore

在本教程中，我们将使用 OpenAI 嵌入

In [ ]:

Copied!

import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key: ········

In [ ]:

Copied!

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

下载数据¶

In [ ]:

Copied!

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-04-09 23:31:46--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.008s  

2024-04-09 23:31:46 (8.97 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]

加载文档¶

In [ ]:

Copied!

# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

In [ ]:

Copied!





vector_store = CouchbaseSearchVectorStore(
    cluster=cluster,
    bucket_name=BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=COLLECTION_NAME,
    index_name=SEARCH_INDEX_NAME,
)
vector_store = CouchbaseSearchVectorStore(
    cluster=cluster,
    bucket_name=BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=COLLECTION_NAME,
    index_name=SEARCH_INDEX_NAME,
)

In [ ]:

Copied!





storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"

基础示例¶

我们将向查询引擎提出一个关于刚建立索引的论文的问题。

In [ ]:

Copied!

query_engine = index.as_query_engine()
response = query_engine.query("What were his investments in Y Combinator?")
print(response)
query_engine = index.as_query_engine()
response = query_engine.query("What were his investments in Y Combinator?")
print(response)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
His investments in Y Combinator were $6k per founder, totaling $12k in the typical two-founder case, in return for 6% equity.

元数据过滤器¶

我们将创建一些包含元数据的示例文档，以便演示如何基于元数据筛选文档。

In [ ]:

Copied!





from llama_index.core.schema import TextNode

nodes = [
    TextNode(
        text="The Shawshank Redemption",
        metadata={
            "author": "Stephen King",
            "theme": "Friendship",
        },
    ),
    TextNode(
        text="The Godfather",
        metadata={
            "director": "Francis Ford Coppola",
            "theme": "Mafia",
        },
    ),
    TextNode(
        text="Inception",
        metadata={
            "director": "Christopher Nolan",
        },
    ),
]
vector_store.add(nodes)
from llama_index.core.schema import TextNode

nodes = [
    TextNode(
        text="The Shawshank Redemption",
        metadata={
            "author": "Stephen King",
            "theme": "Friendship",
        },
    ),
    TextNode(
        text="The Godfather",
        metadata={
            "director": "Francis Ford Coppola",
            "theme": "Mafia",
        },
    ),
    TextNode(
        text="Inception",
        metadata={
            "director": "Christopher Nolan",
        },
    ),
]
vector_store.add(nodes)

Out[ ]:

['5abb42cf-7312-46eb-859e-60df4f92842a',
 'b90525f4-38bf-453c-a51a-5f0718bccc98',
 '22f732d0-da17-4bad-b3cd-b54e2102367a']

In [ ]:

Copied!

# Metadata filter
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(
    filters=[ExactMatchFilter(key="theme", value="Mafia")]
)

retriever = index.as_retriever(filters=filters)

retriever.retrieve("What is inception about?")
# Metadata filter
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(
    filters=[ExactMatchFilter(key="theme", value="Mafia")]
)

retriever = index.as_retriever(filters=filters)

retriever.retrieve("What is inception about?")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"

Out[ ]:

[NodeWithScore(node=TextNode(id_='b90525f4-38bf-453c-a51a-5f0718bccc98', embedding=None, metadata={'director': 'Francis Ford Coppola', 'theme': 'Mafia'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='The Godfather', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.3068528194400547)]

自定义过滤器与查询重写¶

目前 Couchbase 通过 LlamaIndex 仅支持 ExactMatchFilters。实际上 Couchbase 支持更广泛的过滤器类型，包括范围过滤器、地理空间过滤器等。要使用这些过滤器，您可以通过 cb_search_options 参数以字典列表的形式传入。

关于搜索选项的不同搜索/查询可能性，可查阅此处文档。

In [ ]:

Copied!





def custom_query(query, query_str):
    print("custom query", query)
    return query


query_engine = index.as_query_engine(
    vector_store_kwargs={
        "cb_search_options": {
            "query": {"match": "growing up", "field": "text"}
        },
        "custom_query": custom_query,
    }
)
response = query_engine.query("what were his investments in Y Combinator?")
print(response)
def custom_query(query, query_str):
    print("custom query", query)
    return query


query_engine = index.as_query_engine(
    vector_store_kwargs={
        "cb_search_options": {
            "query": {"match": "growing up", "field": "text"}
        },
        "custom_query": custom_query,
    }
)
response = query_engine.query("what were his investments in Y Combinator?")
print(response)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
His investments in Y Combinator were based on a combination of the deal he did with Julian ($10k for 10%) and what Robert said MIT grad students got for the summer ($6k). He invested $6k per founder, which in the typical two-founder case was $12k, in return for 6%.