Postgres 向量存储¶
本笔记本将展示如何在 LlamaIndex 中使用 Postgresql 和 pgvector 执行向量搜索
如果您在 Colab 上打开此 Notebook,可能需要安装 LlamaIndex 🦙。
In [ ]:
Copied!
%pip install llama-index-vector-stores-postgres
%pip install llama-index-vector-stores-postgres
In [ ]:
Copied!
!pip install llama-index
!pip install llama-index
运行以下单元格将在 Colab 中安装带有 PGVector 的 Postgres。
In [ ]:
Copied!
!sudo apt update
!echo | sudo apt install -y postgresql-common
!echo | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
!echo | sudo apt install postgresql-15-pgvector
!sudo service postgresql start
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'password';"
!sudo -u postgres psql -c "CREATE DATABASE vector_db;"
!sudo apt update
!echo | sudo apt install -y postgresql-common
!echo | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
!echo | sudo apt install postgresql-15-pgvector
!sudo service postgresql start
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'password';"
!sudo -u postgres psql -c "CREATE DATABASE vector_db;"
In [ ]:
Copied!
# import logging
# import sys
# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
import textwrap
import openai
# import logging
# import sys
# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
import textwrap
import openai
配置 OpenAI¶
第一步是设置 OpenAI 密钥。该密钥将用于为载入索引的文档创建嵌入向量。
In [ ]:
Copied!
import os
os.environ["OPENAI_API_KEY"] = "<your key>"
openai.api_key = os.environ["OPENAI_API_KEY"]
import os
os.environ["OPENAI_API_KEY"] = ""
openai.api_key = os.environ["OPENAI_API_KEY"]
下载数据
In [ ]:
Copied!
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2024-07-29 15:29:26-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 75042 (73K) [text/plain] Saving to: ‘data/paul_graham/paul_graham_essay.txt’ data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.04s 2024-07-29 15:29:26 (1.75 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
加载文档¶
使用 SimpleDirectoryReader 加载存储在 data/paul_graham/ 路径下的文档
In [ ]:
Copied!
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
Document ID: e9a28a97-73af-42dd-8b40-9c585e222e69
创建数据库¶
在本地已运行的 PostgreSQL 实例中,创建我们将要使用的数据库。
In [ ]:
Copied!
import psycopg2
connection_string = "postgresql://postgres:password@localhost:5432"
db_name = "vector_db"
conn = psycopg2.connect(connection_string)
conn.autocommit = True
with conn.cursor() as c:
c.execute(f"DROP DATABASE IF EXISTS {db_name}")
c.execute(f"CREATE DATABASE {db_name}")
import psycopg2
connection_string = "postgresql://postgres:password@localhost:5432"
db_name = "vector_db"
conn = psycopg2.connect(connection_string)
conn.autocommit = True
with conn.cursor() as c:
c.execute(f"DROP DATABASE IF EXISTS {db_name}")
c.execute(f"CREATE DATABASE {db_name}")
创建索引¶
此处我们基于先前加载的文档创建一个由Postgres支持的索引。PGVectorStore需要接收几个参数。以下示例构建了一个PGVectorStore,其中包含m=16、ef_construction=64、ef_search=40的HNSW索引,并采用vector_cosine_ops方法。
In [ ]:
Copied!
from sqlalchemy import make_url
url = make_url(connection_string)
vector_store = PGVectorStore.from_params(
database=db_name,
host=url.host,
password=url.password,
port=url.port,
user=url.username,
table_name="paul_graham_essay",
embed_dim=1536, # openai embedding dimension
hnsw_kwargs={
"hnsw_m": 16,
"hnsw_ef_construction": 64,
"hnsw_ef_search": 40,
"hnsw_dist_method": "vector_cosine_ops",
},
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context, show_progress=True
)
query_engine = index.as_query_engine()
from sqlalchemy import make_url
url = make_url(connection_string)
vector_store = PGVectorStore.from_params(
database=db_name,
host=url.host,
password=url.password,
port=url.port,
user=url.username,
table_name="paul_graham_essay",
embed_dim=1536, # openai embedding dimension
hnsw_kwargs={
"hnsw_m": 16,
"hnsw_ef_construction": 64,
"hnsw_ef_search": 40,
"hnsw_dist_method": "vector_cosine_ops",
},
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context, show_progress=True
)
query_engine = index.as_query_engine()
查询索引¶
现在我们可以使用索引来提问了。
In [ ]:
Copied!
response = query_engine.query("What did the author do?")
response = query_engine.query("What did the author do?")
In [ ]:
Copied!
print(textwrap.fill(str(response), 100))
print(textwrap.fill(str(response), 100))
The author worked on writing essays, programming, developing software, giving talks, and starting a startup.
In [ ]:
Copied!
response = query_engine.query("What happened in the mid 1980s?")
response = query_engine.query("What happened in the mid 1980s?")
In [ ]:
Copied!
print(textwrap.fill(str(response), 100))
print(textwrap.fill(str(response), 100))
In the mid-1980s, the context mentions the presence of a famous fund manager who was not much older than the author and was super rich. This sparked a thought in the author's mind about becoming rich as well to have the freedom to work on whatever he wanted.
查询现有索引¶
In [ ]:
Copied!
vector_store = PGVectorStore.from_params(
database="vector_db",
host="localhost",
password="password",
port=5432,
user="postgres",
table_name="paul_graham_essay",
embed_dim=1536, # openai embedding dimension
hnsw_kwargs={
"hnsw_m": 16,
"hnsw_ef_construction": 64,
"hnsw_ef_search": 40,
"hnsw_dist_method": "vector_cosine_ops",
},
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine()
vector_store = PGVectorStore.from_params(
database="vector_db",
host="localhost",
password="password",
port=5432,
user="postgres",
table_name="paul_graham_essay",
embed_dim=1536, # openai embedding dimension
hnsw_kwargs={
"hnsw_m": 16,
"hnsw_ef_construction": 64,
"hnsw_ef_search": 40,
"hnsw_dist_method": "vector_cosine_ops",
},
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine()
In [ ]:
Copied!
response = query_engine.query("What did the author do?")
response = query_engine.query("What did the author do?")
In [ ]:
Copied!
print(textwrap.fill(str(response), 100))
print(textwrap.fill(str(response), 100))
The author worked on writing essays, programming, developing software, giving talks, and starting a startup.
混合搜索¶
要启用混合搜索功能,您需要:
- 在构建
PGVectorStore时传入hybrid_search=True(可选配置text_search_config参数指定目标语言) - 在构建查询引擎时传入
vector_store_query_mode="hybrid"(该配置会隐式传递给检索器)。您还可以选择性地设置sparse_top_k参数来配置从稀疏文本搜索中获取的结果数量(默认使用与similarity_top_k相同的值)。
In [ ]:
Copied!
from sqlalchemy import make_url
url = make_url(connection_string)
hybrid_vector_store = PGVectorStore.from_params(
database=db_name,
host=url.host,
password=url.password,
port=url.port,
user=url.username,
table_name="paul_graham_essay_hybrid_search",
embed_dim=1536, # openai embedding dimension
hybrid_search=True,
text_search_config="english",
hnsw_kwargs={
"hnsw_m": 16,
"hnsw_ef_construction": 64,
"hnsw_ef_search": 40,
"hnsw_dist_method": "vector_cosine_ops",
},
)
storage_context = StorageContext.from_defaults(
vector_store=hybrid_vector_store
)
hybrid_index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
from sqlalchemy import make_url
url = make_url(connection_string)
hybrid_vector_store = PGVectorStore.from_params(
database=db_name,
host=url.host,
password=url.password,
port=url.port,
user=url.username,
table_name="paul_graham_essay_hybrid_search",
embed_dim=1536, # openai embedding dimension
hybrid_search=True,
text_search_config="english",
hnsw_kwargs={
"hnsw_m": 16,
"hnsw_ef_construction": 64,
"hnsw_ef_search": 40,
"hnsw_dist_method": "vector_cosine_ops",
},
)
storage_context = StorageContext.from_defaults(
vector_store=hybrid_vector_store
)
hybrid_index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
In [ ]:
Copied!
hybrid_query_engine = hybrid_index.as_query_engine(
vector_store_query_mode="hybrid", sparse_top_k=2
)
hybrid_response = hybrid_query_engine.query(
"Who does Paul Graham think of with the word schtick"
)
hybrid_query_engine = hybrid_index.as_query_engine(
vector_store_query_mode="hybrid", sparse_top_k=2
)
hybrid_response = hybrid_query_engine.query(
"Who does Paul Graham think of with the word schtick"
)
In [ ]:
Copied!
print(hybrid_response)
print(hybrid_response)
Roy Lichtenstein
使用 QueryFusionRetriever 优化混合搜索¶
由于文本搜索和向量搜索的分数计算方式不同,仅通过文本搜索找到的节点得分会显著偏低。
通过使用 QueryFusionRetriever 可以显著提升混合搜索性能,该工具能更有效地利用互信息对节点进行排序。
In [ ]:
Copied!
from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
vector_retriever = hybrid_index.as_retriever(
vector_store_query_mode="default",
similarity_top_k=5,
)
text_retriever = hybrid_index.as_retriever(
vector_store_query_mode="sparse",
similarity_top_k=5, # interchangeable with sparse_top_k in this context
)
retriever = QueryFusionRetriever(
[vector_retriever, text_retriever],
similarity_top_k=5,
num_queries=1, # set this to 1 to disable query generation
mode="relative_score",
use_async=False,
)
response_synthesizer = CompactAndRefine()
query_engine = RetrieverQueryEngine(
retriever=retriever,
response_synthesizer=response_synthesizer,
)
from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
vector_retriever = hybrid_index.as_retriever(
vector_store_query_mode="default",
similarity_top_k=5,
)
text_retriever = hybrid_index.as_retriever(
vector_store_query_mode="sparse",
similarity_top_k=5, # interchangeable with sparse_top_k in this context
)
retriever = QueryFusionRetriever(
[vector_retriever, text_retriever],
similarity_top_k=5,
num_queries=1, # set this to 1 to disable query generation
mode="relative_score",
use_async=False,
)
response_synthesizer = CompactAndRefine()
query_engine = RetrieverQueryEngine(
retriever=retriever,
response_synthesizer=response_synthesizer,
)
In [ ]:
Copied!
response = query_engine.query(
"Who does Paul Graham think of with the word schtick, and why?"
)
print(response)
response = query_engine.query(
"Who does Paul Graham think of with the word schtick, and why?"
)
print(response)
Paul Graham associates the word "schtick" with artists who have a distinctive signature style in their work. This signature style serves as a visual identifier unique to the artist, making their work easily recognizable and attributed to them.
元数据过滤器¶
PGVectorStore 支持在节点中存储元数据,并在检索阶段基于这些元数据进行过滤。
下载 Git 提交记录数据集¶
In [ ]:
Copied!
!mkdir -p 'data/git_commits/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/csv/commit_history.csv' -O 'data/git_commits/commit_history.csv'
!mkdir -p 'data/git_commits/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/csv/commit_history.csv' -O 'data/git_commits/commit_history.csv'
In [ ]:
Copied!
import csv
with open("data/git_commits/commit_history.csv", "r") as f:
commits = list(csv.DictReader(f))
print(commits[0])
print(len(commits))
import csv
with open("data/git_commits/commit_history.csv", "r") as f:
commits = list(csv.DictReader(f))
print(commits[0])
print(len(commits))
{'commit': '44e41c12ab25e36c202f58e068ced262eadc8d16', 'author': 'Lakshmi Narayanan Sreethar<lakshmi@timescale.com>', 'date': 'Tue Sep 5 21:03:21 2023 +0530', 'change summary': 'Fix segfault in set_integer_now_func', 'change details': 'When an invalid function oid is passed to set_integer_now_func, it finds out that the function oid is invalid but before throwing the error, it calls ReleaseSysCache on an invalid tuple causing a segfault. Fixed that by removing the invalid call to ReleaseSysCache. Fixes #6037 '}
4167
添加带有自定义元数据的节点¶
In [ ]:
Copied!
# Create TextNode for each of the first 100 commits
from llama_index.core.schema import TextNode
from datetime import datetime
import re
nodes = []
dates = set()
authors = set()
for commit in commits[:100]:
author_email = commit["author"].split("<")[1][:-1]
commit_date = datetime.strptime(
commit["date"], "%a %b %d %H:%M:%S %Y %z"
).strftime("%Y-%m-%d")
commit_text = commit["change summary"]
if commit["change details"]:
commit_text += "\n\n" + commit["change details"]
fixes = re.findall(r"#(\d+)", commit_text, re.IGNORECASE)
nodes.append(
TextNode(
text=commit_text,
metadata={
"commit_date": commit_date,
"author": author_email,
"fixes": fixes,
},
)
)
dates.add(commit_date)
authors.add(author_email)
print(nodes[0])
print(min(dates), "to", max(dates))
print(authors)
# Create TextNode for each of the first 100 commits
from llama_index.core.schema import TextNode
from datetime import datetime
import re
nodes = []
dates = set()
authors = set()
for commit in commits[:100]:
author_email = commit["author"].split("<")[1][:-1]
commit_date = datetime.strptime(
commit["date"], "%a %b %d %H:%M:%S %Y %z"
).strftime("%Y-%m-%d")
commit_text = commit["change summary"]
if commit["change details"]:
commit_text += "\n\n" + commit["change details"]
fixes = re.findall(r"#(\d+)", commit_text, re.IGNORECASE)
nodes.append(
TextNode(
text=commit_text,
metadata={
"commit_date": commit_date,
"author": author_email,
"fixes": fixes,
},
)
)
dates.add(commit_date)
authors.add(author_email)
print(nodes[0])
print(min(dates), "to", max(dates))
print(authors)
Node ID: 50fdca05-b1ce-41d8-b771-b13aa3ad7df0
Text: Fix segfault in set_integer_now_func When an invalid function
oid is passed to set_integer_now_func, it finds out that the function
oid is invalid but before throwing the error, it calls ReleaseSysCache
on an invalid tuple causing a segfault. Fixed that by removing the
invalid call to ReleaseSysCache. Fixes #6037
2023-03-22 to 2023-09-05
{'dmitry@timescale.com', 'jguthrie@timescale.com', 'lakshmi@timescale.com', 'rafia.sabih@gmail.com', 'sven@timescale.com', 'konstantina@timescale.com', 'engel@sero-systems.de', 'nikhil@timescale.com', '36882414+akuzm@users.noreply.github.com', 'mats@timescale.com', 'satish.8483@gmail.com', 'fabriziomello@gmail.com', 'me@noctarius.com', 'erik@timescale.com', 'jan@timescale.com'}
In [ ]:
Copied!
vector_store = PGVectorStore.from_params(
database=db_name,
host=url.host,
password=url.password,
port=url.port,
user=url.username,
table_name="metadata_filter_demo3",
embed_dim=1536, # openai embedding dimension
hnsw_kwargs={
"hnsw_m": 16,
"hnsw_ef_construction": 64,
"hnsw_ef_search": 40,
"hnsw_dist_method": "vector_cosine_ops",
},
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
index.insert_nodes(nodes)
vector_store = PGVectorStore.from_params(
database=db_name,
host=url.host,
password=url.password,
port=url.port,
user=url.username,
table_name="metadata_filter_demo3",
embed_dim=1536, # openai embedding dimension
hnsw_kwargs={
"hnsw_m": 16,
"hnsw_ef_construction": 64,
"hnsw_ef_search": 40,
"hnsw_dist_method": "vector_cosine_ops",
},
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
index.insert_nodes(nodes)
In [ ]:
Copied!
print(index.as_query_engine().query("How did Lakshmi fix the segfault?"))
print(index.as_query_engine().query("How did Lakshmi fix the segfault?"))
Lakshmi fixed the segfault by removing the invalid call to ReleaseSysCache that was causing the issue.
应用元数据筛选器¶
现在检索节点时,我们可以按提交作者或日期进行筛选。
In [ ]:
Copied!
from llama_index.core.vector_stores.types import (
MetadataFilter,
MetadataFilters,
)
filters = MetadataFilters(
filters=[
MetadataFilter(key="author", value="mats@timescale.com"),
MetadataFilter(key="author", value="sven@timescale.com"),
],
condition="or",
)
retriever = index.as_retriever(
similarity_top_k=10,
filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
print(node.node.metadata)
from llama_index.core.vector_stores.types import (
MetadataFilter,
MetadataFilters,
)
filters = MetadataFilters(
filters=[
MetadataFilter(key="author", value="mats@timescale.com"),
MetadataFilter(key="author", value="sven@timescale.com"),
],
condition="or",
)
retriever = index.as_retriever(
similarity_top_k=10,
filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
print(node.node.metadata)
{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-08-27', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-07-13', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-30', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-23', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-07-25', 'author': 'mats@timescale.com', 'fixes': ['5892']}
{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}
In [ ]:
Copied!
filters = MetadataFilters(
filters=[
MetadataFilter(key="commit_date", value="2023-08-15", operator=">="),
MetadataFilter(key="commit_date", value="2023-08-25", operator="<="),
],
condition="and",
)
retriever = index.as_retriever(
similarity_top_k=10,
filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
print(node.node.metadata)
filters = MetadataFilters(
filters=[
MetadataFilter(key="commit_date", value="2023-08-15", operator=">="),
MetadataFilter(key="commit_date", value="2023-08-25", operator="<="),
],
condition="and",
)
retriever = index.as_retriever(
similarity_top_k=10,
filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
print(node.node.metadata)
{'commit_date': '2023-08-23', 'author': 'erik@timescale.com', 'fixes': []}
{'commit_date': '2023-08-17', 'author': 'konstantina@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-24', 'author': 'lakshmi@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-23', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-20', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}
应用嵌套过滤器¶
在上述示例中,我们使用 AND 或 OR 组合了多个过滤器。我们还可以组合多组过滤器。
例如在 SQL 中:
WHERE (commit_date >= '2023-08-01' AND commit_date <= '2023-08-15') AND (author = 'mats@timescale.com' OR author = 'sven@timescale.com')
In [ ]:
Copied!
filters = MetadataFilters(
filters=[
MetadataFilters(
filters=[
MetadataFilter(
key="commit_date", value="2023-08-01", operator=">="
),
MetadataFilter(
key="commit_date", value="2023-08-15", operator="<="
),
],
condition="and",
),
MetadataFilters(
filters=[
MetadataFilter(key="author", value="mats@timescale.com"),
MetadataFilter(key="author", value="sven@timescale.com"),
],
condition="or",
),
],
condition="and",
)
retriever = index.as_retriever(
similarity_top_k=10,
filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
print(node.node.metadata)
filters = MetadataFilters(
filters=[
MetadataFilters(
filters=[
MetadataFilter(
key="commit_date", value="2023-08-01", operator=">="
),
MetadataFilter(
key="commit_date", value="2023-08-15", operator="<="
),
],
condition="and",
),
MetadataFilters(
filters=[
MetadataFilter(key="author", value="mats@timescale.com"),
MetadataFilter(key="author", value="sven@timescale.com"),
],
condition="or",
),
],
condition="and",
)
retriever = index.as_retriever(
similarity_top_k=10,
filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
print(node.node.metadata)
{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}
上述操作可通过使用 IN 运算符简化。PGVectorStore 支持 in、nin 和 contains 操作符用于元素与列表的比较。
In [ ]:
Copied!
filters = MetadataFilters(
filters=[
MetadataFilter(key="commit_date", value="2023-08-01", operator=">="),
MetadataFilter(key="commit_date", value="2023-08-15", operator="<="),
MetadataFilter(
key="author",
value=["mats@timescale.com", "sven@timescale.com"],
operator="in",
),
],
condition="and",
)
retriever = index.as_retriever(
similarity_top_k=10,
filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
print(node.node.metadata)
filters = MetadataFilters(
filters=[
MetadataFilter(key="commit_date", value="2023-08-01", operator=">="),
MetadataFilter(key="commit_date", value="2023-08-15", operator="<="),
MetadataFilter(
key="author",
value=["mats@timescale.com", "sven@timescale.com"],
operator="in",
),
],
condition="and",
)
retriever = index.as_retriever(
similarity_top_k=10,
filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
print(node.node.metadata)
{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}
In [ ]:
Copied!
# Same thing, with NOT IN
filters = MetadataFilters(
filters=[
MetadataFilter(key="commit_date", value="2023-08-01", operator=">="),
MetadataFilter(key="commit_date", value="2023-08-15", operator="<="),
MetadataFilter(
key="author",
value=["mats@timescale.com", "sven@timescale.com"],
operator="nin",
),
],
condition="and",
)
retriever = index.as_retriever(
similarity_top_k=10,
filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
print(node.node.metadata)
# Same thing, with NOT IN
filters = MetadataFilters(
filters=[
MetadataFilter(key="commit_date", value="2023-08-01", operator=">="),
MetadataFilter(key="commit_date", value="2023-08-15", operator="<="),
MetadataFilter(
key="author",
value=["mats@timescale.com", "sven@timescale.com"],
operator="nin",
),
],
condition="and",
)
retriever = index.as_retriever(
similarity_top_k=10,
filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
print(node.node.metadata)
{'commit_date': '2023-08-09', 'author': 'me@noctarius.com', 'fixes': ['5805']}
{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-11', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-03', 'author': 'dmitry@timescale.com', 'fixes': []}
{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}
{'commit_date': '2023-08-03', 'author': 'dmitry@timescale.com', 'fixes': ['5908']}
{'commit_date': '2023-08-01', 'author': 'nikhil@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': 'konstantina@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
In [ ]:
Copied!
# CONTAINS
filters = MetadataFilters(
filters=[
MetadataFilter(key="fixes", value="5680", operator="contains"),
]
)
retriever = index.as_retriever(
similarity_top_k=10,
filters=filters,
)
retrieved_nodes = retriever.retrieve("How did these commits fix the issue?")
for node in retrieved_nodes:
print(node.node.metadata)
# CONTAINS
filters = MetadataFilters(
filters=[
MetadataFilter(key="fixes", value="5680", operator="contains"),
]
)
retriever = index.as_retriever(
similarity_top_k=10,
filters=filters,
)
retrieved_nodes = retriever.retrieve("How did these commits fix the issue?")
for node in retrieved_nodes:
print(node.node.metadata)
{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}
PgVector 查询选项¶
In [ ]:
Copied!
retriever = index.as_retriever(
vector_store_query_mode="hybrid",
similarity_top_k=5,
vector_store_kwargs={"ivfflat_probes": 10},
)
retriever = index.as_retriever(
vector_store_query_mode="hybrid",
similarity_top_k=5,
vector_store_kwargs={"ivfflat_probes": 10},
)
In [ ]:
Copied!
retriever = index.as_retriever(
vector_store_query_mode="hybrid",
similarity_top_k=5,
vector_store_kwargs={"hnsw_ef_search": 300},
)
retriever = index.as_retriever(
vector_store_query_mode="hybrid",
similarity_top_k=5,
vector_store_kwargs={"hnsw_ef_search": 300},
)