Deep Lake Vector Store Quickstart¶
Deep Lake can be installed using pip.
In [ ]:
%pip install llama-index-vector-stores-deeplake
In [ ]:
!pip install llama-index
!pip install deeplake
Next, import the required modules and set the necessary environment variables:
In [ ]:
import os
import textwrap
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
os.environ["OPENAI_API_KEY"] = "sk-********************************"
os.environ["ACTIVELOOP_TOKEN"] = "********************************"
We will embed and store one of Paul Graham's essays in a locally-hosted Deep Lake vector store. First, we need to download the data to a directory named data/paul_graham.
In [ ]:
import urllib.request

# Make sure the target directory exists before downloading into it.
os.makedirs("data/paul_graham", exist_ok=True)
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt",
    "data/paul_graham/paul_graham_essay.txt",
)
We can now create documents from the source data file.
In [ ]:
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print(
    "Document ID:",
    documents[0].doc_id,
    "Document Hash:",
    documents[0].hash,
)
Document ID: a98b6686-e666-41a9-a0bc-b79f0d666bde Document Hash: beaa54b3e9cea641e91e6975d2207af4f4200f4b2d629725d688f272372ce5bb
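As a quick sanity check, a short sketch like the following (using standard llama-index Document attributes) confirms how many documents were loaded and previews the text:

# Confirm the load and preview the first document's text
# (a minimal sketch; .text is a standard llama-index Document attribute).
print(len(documents))
print(documents[0].text[:200])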
Finally, let's create the Deep Lake vector store and populate it with data. We use the default tensor configuration, which creates tensors for text (str), metadata (json), id (str, auto-populated), and embedding (float32). Learn more about tensor customization here.
In [ ]:
from llama_index.core import StorageContext

dataset_path = "./dataset/paul_graham"

# Create an index over the documents
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
Uploading data to deeplake dataset.
100%|██████████| 22/22 [00:00<00:00, 684.80it/s]
Dataset(path='./dataset/paul_graham', tensors=['text', 'metadata', 'embedding', 'id'])
  tensor      htype       shape      dtype   compression
 -------     -------     -------    -------    -------
   text       text       (22, 1)      str       None
 metadata     json       (22, 1)      str       None
 embedding  embedding  (22, 1536)   float32     None
    id        text       (22, 1)      str       None
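On a later run you can reconnect to the stored data instead of re-ingesting it. A minimal sketch, assuming the dataset already exists at dataset_path:

# Reopen the existing local dataset without overwriting it
# (a minimal sketch; skips re-embedding the documents).
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)
index = VectorStoreIndex.from_vector_store(vector_store)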
Performing Vector Search¶
Deep Lake offers highly flexible vector search and hybrid search options, discussed in detail in these tutorials. In this Quickstart, we show a simple example using default options.
In [ ]:
query_engine = index.as_query_engine()
response = query_engine.query(
    "What did the author learn?",
)
In [ ]:
print(textwrap.fill(str(response), 100))
The author learned that working on things that are not prestigious can be a good thing, as it can lead to discovering something real and avoiding the wrong track. The author also learned that ignorance can be beneficial, as it can lead to discovering something new and unexpected. The author also learned the importance of working hard, even at the parts of the job they don't like, in order to set an example for others. The author also learned the value of unsolicited advice, as it can be beneficial in unexpected ways, such as when Robert Morris suggested that the author should make sure Y Combinator wasn't the last cool thing they did.
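The defaults can also be tuned. A minimal sketch, using llama-index's standard similarity_top_k parameter (the values here are arbitrary), retrieves more context per query or inspects the raw scored nodes directly:

# Synthesize an answer from more retrieved chunks per query.
query_engine = index.as_query_engine(similarity_top_k=5)

# Or fetch the raw nodes and similarity scores without LLM synthesis.
retriever = index.as_retriever(similarity_top_k=3)
for node_with_score in retriever.retrieve("What did the author learn?"):
    print(node_with_score.score, node_with_score.node.node_id)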
In [ ]:
response = query_engine.query("What was a hard moment for the author?")
response = query_engine.query("What was a hard moment for the author?")
In [ ]:
print(textwrap.fill(str(response), 100))
The author experienced a hard moment when one of his programs on the IBM 1401 computer did not terminate. This was a social as well as a technical error, as the data center manager's expression made clear.
Deleting items from the database¶
To find the id of the document to delete, you can query the underlying DeepLake dataset directly.
In [ ]:
import deeplake
ds = deeplake.load(dataset_path)
idx = ds.id[0].numpy().tolist()
idx
./dataset/paul_graham loaded successfully.
Out[ ]:
['42f8220e-673d-4c65-884d-5a48a1a15b03']
In [ ]:
index.delete(idx[0])
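To confirm the delete took effect, a minimal sketch (reusing the deeplake API shown above) reloads the dataset and checks its length:

# Reload the dataset read-only and verify the row count dropped
# (a minimal sketch; assumes the delete above succeeded).
ds = deeplake.load(dataset_path, read_only=True)
print(len(ds))  # expect 21 rows after deleting one of the original 22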