PostgresML Managed Index
In this notebook, we show how to use PostgresML with LlamaIndex.
If you are opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
!pip install llama-index-indices-managed-postgresml
!pip install llama-index
from llama_index.indices.managed.postgresml import PostgresMLIndex
from llama_index.core import SimpleDirectoryReader
# Need this as asyncio can get pretty wild with notebooks and this prevents event loop errors
import nest_asyncio
nest_asyncio.apply()
Loading documents
Load the paul_graham_essay.txt document.
!mkdir -p data
!curl -o data/paul_graham_essay.txt https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
documents = SimpleDirectoryReader("data").load_data()
print(f"documents loaded into {len(documents)} document objects")
print(f"Document ID of first doc is {documents[0].doc_id}")
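Upserting documents into your PostgresML database
First, set the URL used to connect to your PostgresML database. If you do not have one yet, you can create one for free here: https://postgresml.org/signup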
# Let's set some secrets we need
from google.colab import userdata
PGML_DATABASE_URL = userdata.get("PGML_DATABASE_URL")
# If you don't have that secret set, uncomment the line below and run it instead
# Make sure to replace {REPLACE_ME} with your database URL
# PGML_DATABASE_URL = "{REPLACE_ME}"
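Outside Colab, the google.colab import above is not available. The following is a minimal sketch of one way to fall back to an environment variable; the helper name load_database_url is hypothetical and not part of LlamaIndex or PostgresML, and it assumes the secret and the environment variable share the same name.

```python
import os


def load_database_url(name="PGML_DATABASE_URL"):
    """Return the connection URL from Colab secrets if available, else from the environment."""
    try:
        # This module is only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            value = userdata.get(name)
            if value:
                return value
        except Exception:
            # The secret is missing or notebook access was not granted; fall through.
            pass

    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} secret or environment variable first")
    return value
```

With this helper defined, PGML_DATABASE_URL = load_database_url() works both in Colab and in a local Jupyter session where the variable was exported before starting the kernel.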
index = PostgresMLIndex.from_documents(
    documents,
    collection_name="llama-index-example-demo",
    pgml_database_url=PGML_DATABASE_URL,
)
Querying the PostgresML index
We can now ask questions using the PostgresMLIndex retriever.
query = "What did the author write about?"
We can use the retriever to list the retrieved documents:
retriever = index.as_retriever()
response = retriever.retrieve(query)
texts = [t.node.text for t in response]
print("The Nodes:")
print(response)
print("\nThe Texts")
print(texts)
PostgresML makes it easy to add reranking to retrieval queries:
retriever = index.as_retriever(
    limit=2,  # Limit to returning the 2 most related Nodes
    rerank={
        "model": "mixedbread-ai/mxbai-rerank-base-v1",  # Use the mxbai-rerank-base model for reranking
        "num_documents_to_rerank": 100,  # Rerank up to 100 results returned from the vector search
    },
)
response = retriever.retrieve(query)
texts = [t.node.text for t in response]
print("The Nodes:")
print(response)
print("\nThe Texts")
print(texts)
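To make the roles of limit and num_documents_to_rerank concrete, here is a toy, pure-Python sketch of two-stage retrieval. It is not how PostgresML implements reranking: the term-overlap scorer is a stand-in for a reranking model, and the function name, vectors, and texts are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_vec, query_terms, chunks, limit=2, num_documents_to_rerank=100):
    # Stage 1: cheap vector search keeps the closest candidates.
    candidates = sorted(
        chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True
    )[:num_documents_to_rerank]

    # Stage 2: a more precise scorer reorders only those candidates.
    # Term overlap is a hypothetical stand-in for a reranking model.
    def rerank_score(chunk):
        return len(set(chunk["text"].lower().split()) & query_terms)

    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:limit]


chunks = [
    {"text": "essays about startups", "vector": [0.9, 0.1]},
    {"text": "painting and art school", "vector": [0.8, 0.2]},
    {"text": "writing lisp programs and essays", "vector": [0.7, 0.3]},
    {"text": "cooking recipes", "vector": [0.0, 1.0]},
]
top = retrieve_then_rerank(
    [1.0, 0.0], {"writing", "essays"}, chunks, limit=2, num_documents_to_rerank=3
)
print([c["text"] for c in top])
```

In this toy data the chunk that ranks third in the vector search moves to first place after reranking, which is the effect the second stage is meant to have: a wider candidate pool is fetched cheaply, then narrowed to limit results by a stronger scorer.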
With the as_query_engine() method, we can ask a question and get the response in a single query:
query_engine = index.as_query_engine()
response = query_engine.query(query)
print("The Response:")
print(response)
print("\nThe Source Nodes:")
print(response.get_formatted_sources())
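Note that the "response" object above contains both the summary text and the source documents (citations) used to produce it. The source nodes all come from the same document because we uploaded only one document, which PostgresML automatically split before embedding it for us. All of these parameters are configurable; see the documentation for more information.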
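As a rough illustration of why one uploaded document yields several source nodes, the sketch below splits a text into overlapping character windows that all carry the same document ID. This is an assumption-laden simplification: the function split_into_chunks and its chunk_size and overlap parameters are hypothetical, and PostgresML's actual splitter and its settings are not shown in this notebook.

```python
def split_into_chunks(text, doc_id, chunk_size=40, overlap=10):
    """Split text into overlapping character windows tagged with their source doc_id."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(
            {"doc_id": doc_id, "start": start, "text": text[start:start + chunk_size]}
        )
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "a" * 100
pieces = split_into_chunks(sample, doc_id="essay-1")
print(len(pieces), {p["doc_id"] for p in pieces})
```

Each chunk would be embedded and stored separately, so several retrieved nodes can point back to the same original document.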
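We can enable streaming by passing streaming=True when we create the query_engine.
Note: streaming is unusually slow on Google Colab because of network connectivity issues on that platform.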
query_engine = index.as_query_engine(streaming=True)
results = query_engine.query(query)
for text in results.response_gen:
    print(text, end="", flush=True)
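The loop above prints each piece as it arrives and then discards it. If the full answer is also needed afterwards, one option is to collect the pieces while printing them. The sketch below uses a hypothetical fake_response_gen generator in place of results.response_gen so that it runs without a database connection.

```python
def fake_response_gen(answer):
    """Hypothetical stand-in for a streaming response: yields the answer word by word."""
    for word in answer.split(" "):
        yield word + " "


def consume_stream(response_gen):
    """Print streamed pieces as they arrive and return the assembled text."""
    pieces = []
    for text in response_gen:
        print(text, end="", flush=True)
        pieces.append(text)
    print()  # Finish the line once the stream is exhausted.
    return "".join(pieces)


full_text = consume_stream(fake_response_gen("The author wrote essays"))
```

Because a generator can only be iterated once, collecting the pieces during the first pass is the simplest way to keep the complete response.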