ClickHouse Vector Store
This notebook gives a quick demonstration of how to use ClickHouseVectorStore.
If you are opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
!pip install llama-index
!pip install llama-index-vector-stores-clickhouse
!pip install clickhouse_connect
Create a ClickHouse client
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from os import environ
import clickhouse_connect

environ["OPENAI_API_KEY"] = "sk-*"

# initialize client
client = clickhouse_connect.get_client(
    host="localhost",
    port=8123,
    username="default",
    password="",
)
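The connection values above are hardcoded for a local server. As a minimal sketch, assuming you would rather read them from the environment, the helper below builds the same four keyword arguments. The function name and the `CLICKHOUSE_*` variable names are hypothetical and are not part of clickhouse_connect.

```python
import os


def clickhouse_settings_from_env(env=None):
    """Build the host/port/username/password arguments used above.

    The CLICKHOUSE_* variable names are hypothetical; rename them to match
    your deployment. Defaults mirror the values hardcoded in the notebook.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("CLICKHOUSE_HOST", "localhost"),
        "port": int(env.get("CLICKHOUSE_PORT", "8123")),
        "username": env.get("CLICKHOUSE_USER", "default"),
        "password": env.get("CLICKHOUSE_PASSWORD", ""),
    }


# Demonstrate with an explicit mapping instead of the real environment.
settings = clickhouse_settings_from_env(
    {"CLICKHOUSE_HOST": "db.internal", "CLICKHOUSE_PORT": "8443"}
)
print(settings)
```

The resulting dictionary can then be unpacked into the client call, for example `clickhouse_connect.get_client(**clickhouse_settings_from_env())`.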
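Load documents and build a VectorStoreIndex with ClickHouseVectorStore
Here we use a set of Paul Graham essays as the text source. The text is converted to embeddings and stored in ClickHouseVectorStore, which is then queried to supply context for our LLM question-answering loop.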
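To make the retrieval step concrete, here is a conceptual sketch in plain Python of what any vector store does at query time: score stored vectors against a query vector and return the closest chunks. The three-dimensional vectors and chunk labels are invented for illustration, and this is not how ClickHouseVectorStore is implemented internally; real embeddings come from the embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k stored texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings" (made up for this example).
toy_store = [
    ("chunk about painting", [0.9, 0.1, 0.0]),
    ("chunk about startups", [0.1, 0.9, 0.1]),
    ("chunk about Lisp", [0.0, 0.2, 0.9]),
]

results = top_k([0.2, 0.8, 0.2], toy_store, k=2)
print(results)  # ['chunk about startups', 'chunk about Lisp']
```

The retrieved chunks are what the query engine passes to the LLM as context.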
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.clickhouse import ClickHouseVectorStore
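# load documents (assumes the essay already exists under ../data/paul_graham;
# otherwise run the download step below first and adjust the path)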
documents = SimpleDirectoryReader("../data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
print("Number of Documents: ", len(documents))
Document ID: d03ac7db-8dae-4199-bc38-445dec51a534
Number of Documents:  1
Download data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2024-02-13 10:08:31--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.003s

2024-02-13 10:08:31 (23.9 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
You can also use SimpleDirectoryReader to process files individually:
loader = SimpleDirectoryReader("./data/paul_graham/")
documents = loader.load_data()
for file in loader.input_files:
    print(file)
    # Here is where you would do any preprocessing
data/paul_graham/paul_graham_essay.txt
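The preprocessing comment in the loop above is left open. As one possible illustration, the sketch below normalizes whitespace in raw text before it is indexed. The function name and the sample string are hypothetical, and whether the essay needs this cleanup at all is an assumption.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs, trim spaces around line breaks,
    and squeeze three or more newlines into a single paragraph break."""
    text = text.replace("\r\n", "\n")        # unify Windows line endings
    text = re.sub(r"[ \t]+", " ", text)      # collapse spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)     # drop spaces next to newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line
    return text.strip()


sample = "What I  Worked On\r\n\r\n\r\n\r\nBefore college\tthe two main things  \n"
cleaned = normalize_whitespace(sample)
print(repr(cleaned))
```

One way to use the cleaned text is to build fresh `Document(text=...)` objects from it before indexing; how text is updated on an existing document can differ between llama-index versions, so that step is omitted here.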
# initialize with metadata filter and store indexes
from llama_index.core import StorageContext

for document in documents:
    document.metadata = {"user_id": "123", "favorite_color": "blue"}
vector_store = ClickHouseVectorStore(clickhouse_client=client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
import textwrap
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            ExactMatchFilter(key="user_id", value="123"),
        ]
    ),
    similarity_top_k=2,
    vector_store_query_mode="hybrid",
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))
The author learned several things during their time at Interleaf, including the importance of having technology companies run by product people rather than sales people, the drawbacks of having too many people edit code, the value of corridor conversations over planned meetings, the challenges of dealing with big bureaucratic customers, and the importance of being the "entry level" option in a market.
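The query above combines an exact-match metadata filter with `similarity_top_k=2`. The plain-Python sketch below illustrates only the intended semantics: keep records whose metadata matches, then take the k best-scoring ones. The records, scores, and function names are invented for illustration. How ClickHouseVectorStore turns the filter into a database query, and how the hybrid mode scores results, are not covered here.

```python
def exact_match(metadata, required):
    """True when every required key is present with exactly the given value."""
    return all(metadata.get(key) == value for key, value in required.items())


def filtered_top_k(records, required, k):
    """Filter records by metadata first, then keep the k highest scores."""
    candidates = [r for r in records if exact_match(r["metadata"], required)]
    candidates.sort(key=lambda r: r["score"], reverse=True)
    return candidates[:k]


# Hypothetical chunks with precomputed similarity scores.
records = [
    {"text": "A", "score": 0.91, "metadata": {"user_id": "123"}},
    {"text": "B", "score": 0.95, "metadata": {"user_id": "456"}},
    {"text": "C", "score": 0.80, "metadata": {"user_id": "123"}},
    {"text": "D", "score": 0.60, "metadata": {"user_id": "123"}},
]

hits = filtered_top_k(records, {"user_id": "123"}, k=2)
print([h["text"] for h in hits])  # ['A', 'C']
```

Note that record B has the highest score but is excluded because its `user_id` does not match the filter.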
Clear all indexes
for document in documents:
    index.delete_ref_doc(document.doc_id)