As a prerequisite, you need a running Epsilla vector database (for example, via our Docker image) and the pyepsilla package installed.
Full documentation can be found in the docs.
In [ ]:
%pip install llama-index-vector-stores-epsilla
In [ ]:
!pip install pyepsilla
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
!pip install llama-index
In [ ]:
import logging
import sys

# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import SimpleDirectoryReader, Document, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.epsilla import EpsillaVectorStore
import textwrap
Setup OpenAI¶
Let's first begin by adding the OpenAI API key. It will be used to create embeddings for the documents loaded into the index.
In [ ]:
import openai
import getpass

OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
openai.api_key = OPENAI_API_KEY
Download Data¶
In [ ]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Loading documents¶
Load the documents stored in the ./data/paul_graham/ folder using SimpleDirectoryReader.
In [ ]:
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
Total documents: 1
First document, id: ac7f23f0-ce15-4d94-a0a2-5020fa87df61
First document, hash: 4c702b4df575421e1d1af4b1fd50511b226e0c9863dbfffeccb8b689b8448f35
Create the index¶
Here we create an index backed by Epsilla using the documents loaded previously. EpsillaVectorStore takes a few arguments:

client (Any): Epsilla client to connect to.
collection_name (str, optional): Which collection to use. Defaults to "llama_collection".
db_path (str, optional): The path where the database will be persisted. Defaults to "/tmp/langchain-epsilla".
db_name (str, optional): Give a name to the loaded database. Defaults to "langchain_store".
dimension (int, optional): The dimension of the embeddings. If not provided, the collection will be created on the first insert. Defaults to None.
overwrite (bool, optional): Whether to overwrite an existing collection with the same name. Defaults to False.

The Epsilla vector database runs with the default host "localhost" and port "8888".
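If your Epsilla server is not on the default localhost:8888, the client can be pointed elsewhere. Below is a minimal sketch; the `epsilla_endpoint` helper is hypothetical (added here only for illustration), and the `host`/`port` keyword arguments to `vectordb.Client` should be verified against the pyepsilla version you have installed:

```python
def epsilla_endpoint(host: str = "localhost", port: int = 8888) -> str:
    """Build the HTTP endpoint an Epsilla server listens on (illustrative helper)."""
    return f"http://{host}:{port}"

print(epsilla_endpoint())  # http://localhost:8888

# With a reachable Epsilla instance, a non-default connection would look
# roughly like this (not executed here, since it needs a running server):
#
#   from pyepsilla import vectordb
#   client = vectordb.Client(host="192.0.2.10", port="8888")
#   vector_store = EpsillaVectorStore(
#       client=client,
#       collection_name="my_collection",
#       db_path="/tmp/my-epsilla-db",
#       overwrite=True,
#   )
```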
In [ ]:
# Create an index over the documents
from pyepsilla import vectordb

client = vectordb.Client()
vector_store = EpsillaVectorStore(client=client, db_path="/tmp/llamastore")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
[INFO] Connected to localhost:8888 successfully.
Query the data¶
Now we have our document stored in the index, we can ask questions against the index.
In [ ]:
query_engine = index.as_query_engine()
response = query_engine.query("Who is the author?")
print(textwrap.fill(str(response), 100))
The author of the given context information is Paul Graham.
In [ ]:
response = query_engine.query("How did the author learn about AI?")
print(textwrap.fill(str(response), 100))
The author learned about AI through various sources. One source was a novel called "The Moon is a Harsh Mistress" by Heinlein, which featured an intelligent computer called Mike. Another source was a PBS documentary that showed Terry Winograd using SHRDLU, a program that could understand natural language. These experiences sparked the author's interest in AI and motivated them to start learning about it, including teaching themselves Lisp, which was regarded as the language of AI at the time.
Next, let's try to overwrite the previous data.
In [ ]:
vector_store = EpsillaVectorStore(client=client, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
single_doc = Document(text="Epsilla is the vector database we are using.")
index = VectorStoreIndex.from_documents(
    [single_doc],
    storage_context=storage_context,
)

query_engine = index.as_query_engine()
response = query_engine.query("Who is the author?")
print(textwrap.fill(str(response), 100))
There is no information provided about the author in the given context.
In [ ]:
response = query_engine.query("What vector database is being used?")
print(textwrap.fill(str(response), 100))
Epsilla is the vector database being used.
Next, let's add more data to the existing collection.
In [ ]:
vector_store = EpsillaVectorStore(client=client, overwrite=False)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
for doc in documents:
    index.insert(document=doc)

query_engine = index.as_query_engine()
response = query_engine.query("Who is the author?")
print(textwrap.fill(str(response), 100))
The author of the given context information is Paul Graham.
In [ ]:
response = query_engine.query("What vector database is being used?")
print(textwrap.fill(str(response), 100))
Epsilla is the vector database being used.