TiDB Vector Store¶

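TiDB Cloud is a comprehensive Database-as-a-Service (DBaaS) solution that offers both dedicated and serverless options. TiDB Serverless now integrates built-in vector search into the MySQL ecosystem. With this enhancement, you can develop AI applications directly on TiDB Serverless without deploying a new database or an additional technology stack. Create a free TiDB Serverless cluster and start using vector search at https://pingcap.com/ai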

This notebook walks through how to use TiDB vector search in LlamaIndex.

Setup¶

In [ ]:

%pip install llama-index-vector-stores-tidbvector
%pip install llama-index

In [ ]:

import textwrap

from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.tidbvector import TiDBVectorStore

Configure your OpenAI API key

In [ ]:

import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("Input your OpenAI API key:")

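Configure the TiDB connection settings you will need. To connect to your TiDB Cloud cluster, follow these steps:

1. Go to your TiDB Cloud cluster console and navigate to the Connect page.
2. Select the option to connect using SQLAlchemy with PyMySQL, and copy the provided connection URL (without the password).
3. Paste the connection URL into your code, replacing the value of the tidb_connection_string_template variable.
4. Type your password when prompted.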

In [ ]:

# replace with your tidb connect string from tidb cloud console
tidb_connection_string_template = "mysql+pymysql://<USER>:<PASSWORD>@<HOST>:4000/<DB>?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
# type your tidb password
tidb_password = getpass.getpass("Input your TiDB password:")
tidb_connection_url = tidb_connection_string_template.replace(
    "<PASSWORD>", tidb_password
)
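
One caveat with the plain string replacement above: a password containing characters such as `@`, `/` or `:` changes how the URL is parsed. The following is a minimal sketch of a safer substitution, assuming the template uses the literal `<PASSWORD>` placeholder shown above. The helper name `build_connection_url` is hypothetical and is not part of LlamaIndex or TiDB.

```python
from urllib.parse import quote

PLACEHOLDER = "<PASSWORD>"


def build_connection_url(template: str, password: str) -> str:
    """Return the template with a percent-encoded password substituted in.

    Hypothetical helper for illustration only.
    """
    if PLACEHOLDER not in template:
        raise ValueError(f"template is missing the {PLACEHOLDER} placeholder")
    # safe="" also encodes "/", which would otherwise be read as a path separator.
    return template.replace(PLACEHOLDER, quote(password, safe=""))


url = build_connection_url(
    "mysql+pymysql://<USER>:<PASSWORD>@<HOST>:4000/<DB>", "p@ss/word"
)
print(url)
```

With this approach, `tidb_connection_url` could be built as `build_connection_url(tidb_connection_string_template, tidb_password)` instead of calling `str.replace` directly.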

Prepare the data used for the demonstration

In [ ]:

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

In [ ]:

documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
# Tag every document so the metadata filters used later have a key to match on.
for document in documents:
    document.metadata = {"book": "paul_graham"}

Document ID: 86e12675-2e9a-4097-847c-8b981dd41806

Create the TiDB vector store¶

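The code snippet below creates a table in TiDB that is optimized for vector search. The table name is taken from VECTOR_TABLE_NAME (here, paul_graham_test). Once the code runs successfully, you can view and access that table directly in your TiDB database.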

In [ ]:

VECTOR_TABLE_NAME = "paul_graham_test"
tidbvec = TiDBVectorStore(
    connection_string=tidb_connection_url,
    table_name=VECTOR_TABLE_NAME,
    distance_strategy="cosine",
    vector_dimension=1536,
    drop_existing_table=False,
)
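
To make the `distance_strategy="cosine"` setting more concrete, here is a minimal pure-Python sketch of cosine distance, taken here to mean 1 minus the cosine similarity of two vectors. It only illustrates the metric and is not how TiDB computes it internally. The function name `cosine_distance` is chosen for this example. Both vectors must have the same length, which is why `vector_dimension` has to match the embedding model's output size (1536 for OpenAI's text-embedding-ada-002).

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; smaller values mean more similar vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

Because the metric depends only on direction, scaling a vector does not change its distance to another vector.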

Build an index on top of the TiDB vector store; the query engines used below are created from this index

In [ ]:

storage_context = StorageContext.from_defaults(vector_store=tidbvec)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)

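Note: If you run into errors caused by the MySQL protocol's packet size limit, for example when inserting a large number of vectors (such as 2000 rows) at once, you can mitigate the problem by splitting the insertion into smaller batches. For instance, setting the insert_batch_size parameter to a smaller value (such as 1000) keeps each insert under the packet size limit so that the data is written to the TiDB vector store without errors: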

storage_context = StorageContext.from_defaults(vector_store=tidbvec)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, insert_batch_size=1000, show_progress=True
)
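
The batching idea behind `insert_batch_size` can be sketched in plain Python. This is an illustration of the general technique, not the library's actual implementation, and the helper name `iter_batches` is hypothetical: a long list of rows is cut into consecutive slices so that each insert statement stays small.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(rows: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size rows (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start : start + batch_size])


rows = list(range(2500))
print([len(batch) for batch in iter_batches(rows, 1000)])  # [1000, 1000, 500]
```

A smaller batch size means more round trips to the database but a smaller packet per statement, so the right value depends on the row size and the server's packet limit.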

Semantic similarity search¶

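This section covers the basics of vector search and how to refine results with metadata filters. Note that the TiDB vector store currently supports only the default query mode, VectorStoreQueryMode.DEFAULT.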

In [ ]:

query_engine = index.as_query_engine()
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))

The author wrote a book.

Filter with metadata¶

Run a search with metadata filters to retrieve a given number of nearest-neighbor results that also satisfy the applied filters.

In [ ]:

from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)

query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="book", value="paul_graham", operator="!="),
        ]
    ),
    similarity_top_k=2,
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))

Empty Response
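
The empty result is expected: every loaded document was tagged with book = paul_graham, so a `!=` filter on that value excludes all of them. The plain-Python sketch below mimics the two operators used in this notebook on metadata dictionaries. It is not the store's real implementation (which filters inside the database). The names `OPERATORS` and `apply_filter` are hypothetical, and treating rows that lack the key as non-matching is an assumption of this sketch.

```python
from typing import Any, Callable, Dict, List

# Only the two operators used in this notebook are modeled here.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": lambda actual, expected: actual == expected,
    "!=": lambda actual, expected: actual != expected,
}


def apply_filter(
    rows: List[Dict[str, Any]], key: str, value: Any, operator: str
) -> List[Dict[str, Any]]:
    """Keep the rows whose metadata[key] satisfies the operator (hypothetical helper)."""
    compare = OPERATORS[operator]
    # Assumption of this sketch: rows without the key never match.
    return [row for row in rows if key in row and compare(row[key], value)]


metadata_rows = [{"book": "paul_graham"}, {"book": "paul_graham"}]
print(len(apply_filter(metadata_rows, "book", "paul_graham", "!=")))  # 0
print(len(apply_filter(metadata_rows, "book", "paul_graham", "==")))  # 2
```

Since the filter removes every candidate before ranking, `similarity_top_k=2` has nothing left to select from.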

Query again, this time with an equality filter

In [ ]:

from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)

query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="book", value="paul_graham", operator="=="),
        ]
    ),
    similarity_top_k=2,
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))

The author learned valuable lessons from his experiences.

Delete documents¶

In [ ]:

tidbvec.delete(documents[0].doc_id)

Check whether the document has been deleted

In [ ]:

query_engine = index.as_query_engine()
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))

Empty Response
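
The Empty Response is consistent with deleting by source document ID: the essay was the only document loaded, and removing it removes every chunk derived from it. The sketch below models that behavior in plain Python under the assumption that each stored chunk records the ID of the document it came from. The `StoredChunk` class and the `delete_by_ref_doc_id` helper are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoredChunk:
    """Hypothetical stand-in for one row of the vector table."""

    chunk_id: str
    ref_doc_id: str  # ID of the source document this chunk was split from
    text: str


def delete_by_ref_doc_id(chunks: List[StoredChunk], ref_doc_id: str) -> List[StoredChunk]:
    """Return the chunks that remain after removing everything from one document."""
    return [chunk for chunk in chunks if chunk.ref_doc_id != ref_doc_id]


store = [
    StoredChunk("c1", "doc-1", "first part of the essay"),
    StoredChunk("c2", "doc-1", "second part of the essay"),
]
remaining = delete_by_ref_doc_id(store, "doc-1")
print(len(remaining))  # 0
```

If other documents had been indexed as well, their chunks would survive the delete and later queries would still return results drawn from them.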