Building Retrieval from Scratch¶
In this tutorial, we show you how to build a standard retriever against a vector database, one that fetches nodes via top-k similarity.
We use Pinecone as the vector database, and we load in nodes using our high-level ingestion abstractions (to see how to build this from scratch, see our previous tutorial!).
You will learn about:
- How to generate a query embedding
- How to query the vector database using different search modes (dense, sparse, hybrid; see the short sketch after this list)
- How to parse results into a set of Nodes
- How to put this into a custom retriever
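As a quick reference for those search modes, here is a minimal sketch (assuming the current llama_index API) of the query-mode enum; the plain strings "default", "sparse", and "hybrid" used later in this tutorial map onto these values.
# A minimal sketch: the search modes accepted by VectorStoreQuery's `mode`
# parameter. The plain strings used later in this tutorial map to this enum.
from llama_index.core.vector_stores.types import VectorStoreQueryMode
print(VectorStoreQueryMode.DEFAULT)  # "default" -> dense (embedding-based) search
print(VectorStoreQueryMode.SPARSE)  # "sparse" -> sparse (keyword-style) search
print(VectorStoreQueryMode.HYBRID)  # "hybrid" -> a combination of both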
Setup¶
We build an empty Pinecone index and define the necessary LlamaIndex wrappers/abstractions so that we can start loading data into Pinecone.
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-readers-file pymupdf
%pip install llama-index-vector-stores-pinecone
%pip install llama-index-embeddings-openai
!pip install llama-index
Build a Pinecone Index¶
from pinecone import Pinecone, Index, ServerlessSpec
import os
api_key = os.environ["PINECONE_API_KEY"]
pc = Pinecone(api_key=api_key)
# dimensions are for text-embedding-ada-002
dataset_name = "quickstart"
if dataset_name not in pc.list_indexes().names():
    pc.create_index(
        dataset_name,
        dimension=1536,
        metric="euclidean",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
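If you want to confirm the index is up before using it, the Pinecone client can describe it; a quick optional check (assuming the v3+ pinecone client used above) might be:
# Optional: confirm the index exists and is ready (v3+ pinecone client).
print(pc.describe_index(dataset_name))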
pinecone_index = pc.Index(dataset_name)
# [Optional] drop contents in index
# NOTE: the v3+ pinecone client uses the snake_case `delete_all` kwarg
pinecone_index.delete(delete_all=True)
Create PineconeVectorStore¶
This is a simple wrapper abstraction for use in LlamaIndex. Wrapping it in a StorageContext lets us easily load in Nodes.
from llama_index.vector_stores.pinecone import PineconeVectorStore
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
Load Documents¶
!mkdir data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader
loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")
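A quick, optional peek at what the reader returned before ingesting it:
# Optional: inspect the loaded documents before ingestion.
print(len(documents))
print(documents[0].text[:200])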
Load into the Vector Store¶
Now we load the documents into the PineconeVectorStore.
NOTE: We use the high-level VectorStoreIndex.from_documents ingestion abstraction here; we won't be using VectorStoreIndex for the rest of this tutorial.
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import StorageContext
splitter = SentenceSplitter(chunk_size=1024)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, transformations=[splitter], storage_context=storage_context
)
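For contrast, below is a rough sketch of the lower-level path this abstraction automates (as covered in our previous tutorial): chunk the documents, embed each node yourself, and call vector_store.add directly. Treat it as an illustration, not a step in this tutorial's pipeline.
# Rough sketch of what VectorStoreIndex.from_documents automates:
# chunk the documents, embed each chunk, and add the nodes directly.
from llama_index.core.schema import MetadataMode
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding()
nodes = splitter.get_nodes_from_documents(documents)
for node in nodes:
    node.embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode=MetadataMode.ALL)
    )
vector_store.add(nodes)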
query_str = "Can you tell me about the key concepts for safety finetuning"
1. Generate a Query Embedding¶
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding()
query_embedding = embed_model.get_query_embedding(query_str)
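As an optional sanity check, the query embedding is a plain list of floats whose length must match the dimension of the Pinecone index (1536 for text-embedding-ada-002):
# Optional sanity check: the embedding dimension must match the index dimension.
print(len(query_embedding))  # 1536 for text-embedding-ada-002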
2. Query the Vector Database¶
Next, we build a VectorStoreQuery from the query embedding and run it against the vector store, using one of the supported search modes.
# construct vector store query
from llama_index.core.vector_stores import VectorStoreQuery
query_mode = "default"
# query_mode = "sparse"
# query_mode = "hybrid"
vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2, mode=query_mode
)
# returns a VectorStoreQueryResult
query_result = vector_store.query(vector_store_query)
query_result
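Before parsing the result, it can help to peek at its raw fields: a VectorStoreQueryResult carries parallel lists of nodes, similarities, and ids.
# Peek at the raw fields of the VectorStoreQueryResult.
print(query_result.ids)  # IDs of the matched nodes
print(query_result.similarities)  # similarity scores, parallel to the nodes
print(query_result.nodes[0].get_content()[:200])  # preview the top node's text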
3. Parse the Result into a Set of Nodes¶
The VectorStoreQueryResult returns the set of nodes along with their similarity scores; we use these to construct NodeWithScore objects.
from llama_index.core.schema import NodeWithScore
from typing import Optional
nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
    score: Optional[float] = None
    if query_result.similarities is not None:
        score = query_result.similarities[index]
    nodes_with_scores.append(NodeWithScore(node=node, score=score))
from llama_index.core.response.notebook_utils import display_source_node
for node in nodes_with_scores:
    display_source_node(node, source_length=1000)
4. Put This into a Retriever¶
Let's put this into a BaseRetriever subclass that can plug into the rest of LlamaIndex workflows!
from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from typing import Any, List


class PineconeRetriever(BaseRetriever):
    """Retriever over a pinecone vector store."""

    def __init__(
        self,
        vector_store: PineconeVectorStore,
        embed_model: Any,
        query_mode: str = "default",
        similarity_top_k: int = 2,
    ) -> None:
        """Init params."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve."""
        if query_bundle.embedding is None:
            query_embedding = self._embed_model.get_query_embedding(
                query_bundle.query_str
            )
        else:
            query_embedding = query_bundle.embedding
        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )
        query_result = self._vector_store.query(vector_store_query)
        nodes_with_scores = []
        for index, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None:
                score = query_result.similarities[index]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))
        return nodes_with_scores
retriever = PineconeRetriever(
    vector_store, embed_model, query_mode="default", similarity_top_k=2
)
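Since query_mode is a constructor argument, switching search modes is a one-line change. Note this is hypothetical unless your Pinecone index is actually set up for sparse vectors (e.g. created with the dotproduct metric and populated with sparse values):
# Hypothetical: hybrid search requires a Pinecone index configured for sparse vectors.
hybrid_retriever = PineconeRetriever(
    vector_store, embed_model, query_mode="hybrid", similarity_top_k=2
)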
retrieved_nodes = retriever.retrieve(query_str)
for node in retrieved_nodes:
    display_source_node(node, source_length=1000)
Plug This into Our RetrieverQueryEngine to Synthesize a Response¶
NOTE: We'll cover how to build response synthesis from scratch in future tutorials!
from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(retriever)
response = query_engine.query(query_str)
print(str(response))
The key concepts for safety fine-tuning include supervised safety fine-tuning, safety RLHF (Reinforcement Learning from Human Feedback), and safety context distillation. Supervised safety fine-tuning involves gathering adversarial prompts and safe demonstrations to train the model to align with safety guidelines. Safety RLHF integrates safety into the RLHF pipeline by training a safety-specific reward model and gathering challenging adversarial prompts for fine-tuning. Safety context distillation refines the RLHF pipeline by generating safer model responses using a safety preprompt and fine-tuning the model on these responses without the preprompt. These concepts are used to mitigate safety risks and improve the safety of the model's responses.
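The response object also keeps references to the retrieved source nodes, so you can verify which chunks the answer was synthesized from:
# Inspect which retrieved chunks the response was synthesized from.
for source_node in response.source_nodes:
    print(source_node.score, source_node.node.get_content()[:100])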