ClickHouse Vector Store

This notebook gives a quick demonstration of how to use ClickHouseVectorStore.

If you are opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [ ]:

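!pip install llama-index
# integration package that provides llama_index.vector_stores.clickhouse
!pip install llama-index-vector-stores-clickhouse
!pip install clickhouse_connect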

Creating a ClickHouse Client

In [ ]:

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [ ]:

from os import environ
import clickhouse_connect

environ["OPENAI_API_KEY"] = "sk-*"

# initialize client
client = clickhouse_connect.get_client(
    host="localhost",
    port=8123,
    username="default",
    password="",
)
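
The cell above hard-codes the connection details. As a minimal sketch of one alternative, the settings could be read from environment variables and passed on as keyword arguments. The `CLICKHOUSE_*` variable names and the `clickhouse_settings` helper are assumptions made for illustration; the defaults mirror the values used in this notebook.

```python
import os


def clickhouse_settings(env=None):
    """Build keyword arguments for clickhouse_connect.get_client.

    The CLICKHOUSE_* names are hypothetical; defaults match the notebook's
    hard-coded values (localhost:8123, user "default", empty password).
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("CLICKHOUSE_HOST", "localhost"),
        "port": int(env.get("CLICKHOUSE_PORT", "8123")),
        "username": env.get("CLICKHOUSE_USER", "default"),
        "password": env.get("CLICKHOUSE_PASSWORD", ""),
    }


# Example with an explicit mapping instead of the real environment
settings = clickhouse_settings(
    {"CLICKHOUSE_HOST": "db.internal", "CLICKHOUSE_PORT": "8443"}
)
print(settings["host"], settings["port"])
```

With `clickhouse_connect` installed, the resulting dictionary can be unpacked into the same call the notebook uses: `client = clickhouse_connect.get_client(**settings)`.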

Load documents, then build and store the VectorStoreIndex with ClickHouseVectorStore

Here we use a set of Paul Graham essays as the source text. The text is turned into embeddings, stored in a ClickHouseVectorStore, and queried to supply context for our LLM question-answering loop.
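
To make the embed, store, and query flow concrete, here is a minimal sketch of similarity retrieval in plain Python. It is not how ClickHouseVectorStore is implemented: the three-dimensional vectors, the chunk texts, and the `cosine_similarity` and `top_k` helpers are made up for illustration, and a real embedding model would produce much longer vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored entries most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "vector store": (text chunk, embedding) pairs with invented values
store = [
    ("essay chunk about startups", [0.9, 0.1, 0.0]),
    ("essay chunk about painting", [0.1, 0.9, 0.0]),
    ("essay chunk about Lisp", [0.0, 0.2, 0.9]),
]

# The query embedding is also invented; it points mostly toward "startups"
context = top_k([0.8, 0.3, 0.0], store, k=2)
print(context)
```

The retrieved chunks play the role of the context that is handed to the LLM together with the question.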

In [ ]:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.clickhouse import ClickHouseVectorStore

In [ ]:

# load documents
documents = SimpleDirectoryReader("../data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
print("Number of Documents: ", len(documents))

Document ID: d03ac7db-8dae-4199-bc38-445dec51a534
Number of Documents:  1

Download Data

In [ ]:

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-02-13 10:08:31--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.003s  

2024-02-13 10:08:31 (23.9 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]

You can also process your files individually using SimpleDirectoryReader:

In [ ]:

loader = SimpleDirectoryReader("./data/paul_graham/")
documents = loader.load_data()
for file in loader.input_files:
    print(file)
    # Here is where you would do any preprocessing

data/paul_graham/paul_graham_essay.txt
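
The comment inside the loop marks where per-file preprocessing would go. As a rough sketch of what such a step might look like, the example below uses only the standard library instead of SimpleDirectoryReader: it walks a directory, reads each `.txt` file, and collapses whitespace. The `normalize_whitespace` and `preprocess_directory` helpers and the sample file are hypothetical.

```python
import tempfile
from pathlib import Path


def normalize_whitespace(text):
    """Collapse runs of spaces and newlines into single spaces."""
    return " ".join(text.split())


def preprocess_directory(directory):
    """Return {file name: cleaned text} for every .txt file in the directory."""
    cleaned = {}
    for path in sorted(Path(directory).glob("*.txt")):
        raw = path.read_text(encoding="utf-8")
        cleaned[path.name] = normalize_whitespace(raw)
    return cleaned


# Demonstrate on a throwaway directory so the sketch does not depend on the essay file
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "essay.txt").write_text("What I  worked\n\non", encoding="utf-8")
    result = preprocess_directory(tmp)

print(result)
```

Any cleaning of this kind would need to happen before the text is embedded, since the stored vectors reflect whatever text was indexed.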

In [ ]:

# initialize with metadata filter and store indexes
from llama_index.core import StorageContext

for document in documents:
    document.metadata = {"user_id": "123", "favorite_color": "blue"}
vector_store = ClickHouseVectorStore(clickhouse_client=client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

Query Index

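The ClickHouse vector store currently supports filtered search and hybrid search.

You can learn more about query engines and retrievers in the LlamaIndex documentation.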
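
As a conceptual sketch only, the snippet below shows the general idea behind the two features: an exact-match metadata filter narrows the candidates, and a hybrid score then blends a vector-similarity score with a keyword score. This is not the ClickHouseVectorStore implementation; the `alpha` weight, the keyword scoring rule, the helper names, and all sample data are assumptions chosen for illustration.

```python
def matches(metadata, filters):
    """Exact-match filtering: every filter key must equal the stored value."""
    return all(metadata.get(key) == value for key, value in filters.items())


def keyword_score(query, text):
    """Fraction of query terms found in the text (a stand-in for full-text scoring)."""
    terms = query.lower().split()
    words = set(text.lower().split())
    if not terms:
        return 0.0
    return sum(term in words for term in terms) / len(terms)


def hybrid_search(query, vector_scores, rows, filters, alpha=0.5, k=2):
    """Drop rows failing the filter, then rank by a weighted mix of both scores."""
    ranked = []
    for row_id, row in rows.items():
        if not matches(row["metadata"], filters):
            continue
        score = alpha * vector_scores[row_id] + (1 - alpha) * keyword_score(
            query, row["text"]
        )
        ranked.append((score, row_id))
    ranked.sort(reverse=True)
    return [row_id for _, row_id in ranked[:k]]


rows = {
    "a": {"text": "the author learned to paint", "metadata": {"user_id": "123"}},
    "b": {"text": "notes on Lisp macros", "metadata": {"user_id": "123"}},
    "c": {"text": "the author learned Lisp", "metadata": {"user_id": "456"}},
}
# Invented similarity scores, as if a vector search had already produced them
vector_scores = {"a": 0.6, "b": 0.7, "c": 0.9}

hits = hybrid_search("author learned", vector_scores, rows, {"user_id": "123"})
print(hits)
```

Row "c" has the highest vector score but never reaches the ranking because its `user_id` does not match the filter, which is the same effect the `ExactMatchFilter` on `user_id` has in the query below.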

In [ ]:

import textwrap

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            ExactMatchFilter(key="user_id", value="123"),
        ]
    ),
    similarity_top_k=2,
    vector_store_query_mode="hybrid",
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))

The author learned several things during their time at Interleaf, including the importance of having
technology companies run by product people rather than sales people, the drawbacks of having too
many people edit code, the value of corridor conversations over planned meetings, the challenges of
dealing with big bureaucratic customers, and the importance of being the "entry level" option in a
market.

Clear All Indexes

In [ ]:

for document in documents:
    index.delete_ref_doc(document.doc_id)