Oracle AI Vector Search with Document Processing¶
Oracle AI Vector Search is designed for artificial intelligence (AI) workloads, allowing you to query data based on semantics rather than keywords.
One of the biggest benefits of Oracle AI Vector Search is that semantic search on unstructured data can be combined with relational search on business data in one single system.
This is not only powerful but also significantly more effective, because you don't need to add a specialized vector database, eliminating the pain of data fragmentation between multiple systems.
In addition, your vectors can benefit from all of Oracle Database's most powerful features, such as the following:
- Partitioning support
- Real Application Clusters scalability
- Exadata smart scans
- Shard processing across geographically distributed databases
- Transactions
- Parallel SQL
- Disaster recovery
- Security
- Oracle Machine Learning
- Oracle Graph Database
- Oracle Spatial and Graph
- Oracle Blockchain
- JSON
This guide demonstrates how to use Oracle AI Vector Search with llama_index to build an end-to-end RAG pipeline, covering the following examples:
- Loading documents from various sources using OracleReader
- Summarizing documents inside or outside the database using OracleSummary
- Generating embeddings inside or outside the database using OracleEmbeddings
- Chunking documents as needed using the advanced capabilities of OracleTextSplitter
- Storing, indexing, and querying documents in the OraLlamaVS vector store
If you are new to Oracle Database, consider exploring the free Oracle 23 AI, which provides a great introduction to setting up your database environment. While working with the database, it is often advisable to avoid using the system user by default; instead, create your own user for enhanced security and customization. For detailed steps on user creation, refer to our end-to-end guide, which also shows how to set up a user in Oracle. Additionally, understanding user privileges is crucial for managing database security effectively; you can learn more about this in the official Oracle guide on administering user accounts and security.
Prerequisites¶
Please install the Oracle llama-index integration packages:
%pip install llama-index
%pip install llama-index-embeddings-oracleai
%pip install llama-index-readers-oracleai
%pip install llama-index-utils-oracleai
%pip install llama-index-vector-stores-oracledb
Create Demo User¶
First, create a demo user with all the required privileges.
import sys

import oracledb

# Update with your username, password, hostname, and service_name
username = "<username>"
password = "<password>"
dsn = "<hostname/service_name>"

try:
    conn = oracledb.connect(user=username, password=password, dsn=dsn)
    print("Connection successful!")

    cursor = conn.cursor()
    try:
        cursor.execute(
            """
            begin
                -- Drop user
                begin
                    execute immediate 'drop user testuser cascade';
                exception
                    when others then
                        dbms_output.put_line('Error dropping user: ' || SQLERRM);
                end;

                -- Create user and grant privileges
                execute immediate 'create user testuser identified by testuser';
                execute immediate 'grant connect, unlimited tablespace, create credential, create procedure, create any index to testuser';
                execute immediate 'create or replace directory DEMO_PY_DIR as ''/scratch/hroy/view_storage/hroy_devstorage/demo/orachain''';
                execute immediate 'grant read, write on directory DEMO_PY_DIR to public';
                execute immediate 'grant create mining model to testuser';

                -- Network access
                begin
                    DBMS_NETWORK_ACL_ADMIN.APPEND_HOST_ACE(
                        host => '*',
                        ace => xs$ace_type(privilege_list => xs$name_list('connect'),
                                           principal_name => 'testuser',
                                           principal_type => xs_acl.ptype_db)
                    );
                end;
            end;
            """
        )
        print("User setup done!")
    except Exception as e:
        print(f"User setup failed with error: {e}")
    finally:
        cursor.close()
    conn.close()
except Exception as e:
    print(f"Connection failed with error: {e}")
    sys.exit(1)
Connection successful! User setup done!
Process Documents Using Oracle AI¶
Consider the following scenario: users possess documents stored either in an Oracle Database or a file system and intend to use this data with Oracle AI Vector Search powered by llama_index.
To prepare the documents for analysis, a comprehensive preprocessing workflow is required. First, the documents must be retrieved, summarized (if needed), and chunked as necessary. Subsequent steps involve generating embeddings for these chunks and integrating them into the Oracle AI Vector Store. Users can then perform semantic searches on this data.
The Oracle AI Vector Search llama_index library includes a suite of document-processing tools that facilitate document loading, chunking, summary generation, and embedding creation.
In the sections that follow, we detail how to use the Oracle AI llama_index APIs to carry out each of these processes effectively.
import sys

import oracledb

# please update with your username, password, hostname and service_name
username = "<username>"
password = "<password>"
dsn = "<hostname/service_name>"

try:
    conn = oracledb.connect(user=username, password=password, dsn=dsn)
    print("Connection successful!")
except Exception as e:
    print("Connection failed!")
    sys.exit(1)
Connection successful!
Populate a Demo Table¶
Create a demo table and insert a few sample documents.
try:
    cursor = conn.cursor()

    drop_table_sql = """drop table demo_tab"""
    cursor.execute(drop_table_sql)

    create_table_sql = """create table demo_tab (id number, data clob)"""
    cursor.execute(create_table_sql)

    insert_row_sql = """insert into demo_tab values (:1, :2)"""
    rows_to_insert = [
        (
            1,
            "If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.",
        ),
        (
            2,
            "A tablespace can be online (accessible) or offline (not accessible) whenever the database is open.\nA tablespace is usually online so that its data is available to users. The SYSTEM tablespace and temporary tablespaces cannot be taken offline.",
        ),
        (
            3,
            "The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table.\nSometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.",
        ),
    ]
    cursor.executemany(insert_row_sql, rows_to_insert)
    conn.commit()

    print("Table created and populated.")
    cursor.close()
except Exception as e:
    print("Table creation failed.")
    cursor.close()
    conn.close()
    sys.exit(1)
Table created and populated.
With a demo user and a populated sample table in place, the remaining configuration concerns the embedding and summary functionality. Users have multiple provider options, including in-database solutions and third-party services such as OCIGENAI, Hugging Face, and OpenAI. If users opt for a third-party provider, they need to create a credential with the required authentication information. Conversely, if they select the database as the provider for embeddings, they need to upload an ONNX model to the Oracle Database. No additional setup is required for summary functionality when the database option is selected.
Load ONNX Model¶
Oracle supports a variety of embedding providers, enabling users to choose between proprietary database solutions and third-party services such as OCIGENAI and HuggingFace. This selection dictates how embeddings are generated and managed.
Important: If users select the database option, they must upload an ONNX model into the Oracle Database. Conversely, if they choose a third-party provider for embedding generation, uploading an ONNX model to the Oracle Database is not required.
A significant advantage of using an ONNX model directly within Oracle is the enhanced security and performance it offers by eliminating the need to transmit data to external parties. It also avoids the latency typically associated with network or REST API calls.
Below is example code to upload an ONNX model into the Oracle Database:
from llama_index.embeddings.oracleai import OracleEmbeddings

# please update with your related information
# make sure that you have onnx file in the system
onnx_dir = "DEMO_PY_DIR"
onnx_file = "tinybert.onnx"
model_name = "demo_model"

try:
    OracleEmbeddings.load_onnx_model(conn, onnx_dir, onnx_file, model_name)
    print("ONNX model loaded.")
except Exception as e:
    print("ONNX model loading failed!")
    sys.exit(1)
ONNX model loaded.
Create Credential¶
When selecting a third-party provider for generating embeddings, users need to establish a credential to securely access the provider's endpoints.
Important: No credential is required when the 'database' provider is selected for embedding generation. However, if users decide to use a third-party provider, they must create a credential specific to the chosen provider.
Here is an illustrative example:
try:
    cursor = conn.cursor()
    cursor.execute(
        """
        declare
            jo json_object_t;
        begin
            -- HuggingFace
            dbms_vector_chain.drop_credential(credential_name => 'HF_CRED');
            jo := json_object_t();
            jo.put('access_token', '<access_token>');
            dbms_vector_chain.create_credential(
                credential_name => 'HF_CRED',
                params => json(jo.to_string));

            -- OCIGENAI
            dbms_vector_chain.drop_credential(credential_name => 'OCI_CRED');
            jo := json_object_t();
            jo.put('user_ocid', '<user_ocid>');
            jo.put('tenancy_ocid', '<tenancy_ocid>');
            jo.put('compartment_ocid', '<compartment_ocid>');
            jo.put('private_key', '<private_key>');
            jo.put('fingerprint', '<fingerprint>');
            dbms_vector_chain.create_credential(
                credential_name => 'OCI_CRED',
                params => json(jo.to_string));
        end;
        """
    )
    cursor.close()
    print("Credentials created.")
except Exception as ex:
    cursor.close()
    raise
Load Documents¶
Users have the flexibility to load documents from the Oracle Database, a file system, or both, by configuring the loader parameters appropriately. For comprehensive details on these parameters, please consult the Oracle AI Vector Search Guide.
A significant advantage of OracleReader is its capability to process over 150 distinct file formats, eliminating the need for multiple loaders for different document types. For the complete list of supported formats, please refer to the Oracle Text Supported Document Formats.
Below is a sample code snippet that demonstrates how to use OracleReader:
from llama_index.core.schema import Document
from llama_index.readers.oracleai import OracleReader

# loading from Oracle Database table
# make sure you have the table with this specification
loader_params = {
    "owner": "testuser",
    "tablename": "demo_tab",
    "colname": "data",
}

""" load the docs """
loader = OracleReader(conn=conn, params=loader_params)
docs = loader.load()

""" verify """
print(f"Number of docs loaded: {len(docs)}")
# print(f"Document-0: {docs[0].text}")  # content
Number of docs loaded: 3
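The snippet above reads from a database table. To read from the file system instead, the loader accepts file- or directory-based parameters in place of the table keys. The key names below ("file", "dir") are taken from the Oracle documentation for the underlying loader; treat them as assumptions to verify against your database version:

```python
# Hypothetical alternative loader parameters (verify the key names
# against your Oracle AI Vector Search documentation):
file_params = {"file": "<path/to/a/single/document.pdf>"}  # one file
dir_params = {"dir": "<path/to/a/folder/of/documents>"}    # a whole directory

# Usage would mirror the table-based example:
# loader = OracleReader(conn=conn, params=file_params)
# docs = loader.load()
```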
Generate Summary¶
Now that the documents are loaded, users may want to generate a summary for each one. The Oracle AI Vector Search llama_index library offers a suite of APIs designed for document summarization, supporting multiple summarization providers such as Database, OCIGENAI, and HuggingFace, so users can select the provider that best fits their needs. To use these capabilities, users must configure the summary parameters as specified. For detailed information on these parameters, please consult the Oracle AI Vector Search Guide.
Note: Users may need to set up a proxy if they want to use a third-party summary provider other than Oracle's in-house, default 'database' provider. If you do not have a proxy server, remove the proxy parameter when instantiating OracleSummary.
# proxy to be used when we instantiate summary and embedder object
proxy = ""
The following sample code shows how to generate a summary:
from llama_index.core.schema import Document
from llama_index.utils.oracleai import OracleSummary

# using 'database' provider
summary_params = {
    "provider": "database",
    "glevel": "S",
    "numParagraphs": 1,
    "language": "english",
}

# get the summary instance
# Remove proxy if not required
summ = OracleSummary(conn=conn, params=summary_params, proxy=proxy)

list_summary = []
for doc in docs:
    summary = summ.get_summary(doc.text)
    list_summary.append(summary)

""" verify """
print(f"Number of Summaries: {len(list_summary)}")
# print(f"Summary-0: {list_summary[0]}")  # content
Number of Summaries: 3
Split Documents into Chunks¶
Documents can vary widely in size, from very small to very large. Users often prefer to split a document into smaller chunks to facilitate embedding generation. A wide range of customization options is available for this chunking process. For comprehensive details on these parameters, please consult the Oracle AI Vector Search Guide.
Below is a sample code snippet illustrating how to do this:
from llama_index.core.schema import Document
from llama_index.readers.oracleai import OracleTextSplitter

# split by default parameters
splitter_params = {"normalize": "all"}

""" get the splitter instance """
splitter = OracleTextSplitter(conn=conn, params=splitter_params)

list_chunks = []
for doc in docs:
    chunks = splitter.split_text(doc.text)
    list_chunks.extend(chunks)

""" verify """
print(f"Number of Chunks: {len(list_chunks)}")
# print(f"Chunk-0: {list_chunks[0]}")  # content
Number of Chunks: 3
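The example above relies on the default chunking behavior with only "normalize" set. As a sketch of the customization available, here is a richer parameter set; the key names follow the DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS chunking parameters and should be verified against your database version before use:

```python
# Chunk by words instead of the defaults (key names per the
# DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS chunking parameters; verify before use):
custom_splitter_params = {
    "by": "words",        # measure chunk size in words
    "max": 100,           # cap each chunk at 100 words
    "overlap": 10,        # overlap adjacent chunks by ~10% of "max"
    "split": "sentence",  # prefer sentence boundaries for split points
    "normalize": "all",   # normalize whitespace and punctuation
}
# splitter = OracleTextSplitter(conn=conn, params=custom_splitter_params)
```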
Generate Embeddings¶
Now that the documents are chunked as required, users may want to generate embeddings for these chunks. Oracle AI Vector Search offers several methods for generating embeddings, using either locally hosted ONNX models or third-party APIs. For comprehensive instructions on configuring these options, please refer to the Oracle AI Vector Search Guide.
Note: Users may need to configure a proxy to use third-party embedding providers, other than the 'database' provider that uses an ONNX model.
# proxy to be used when we instantiate summary and embedder object
proxy = ""
The following sample code shows how to generate embeddings:
from llama_index.core.schema import Document
from llama_index.embeddings.oracleai import OracleEmbeddings

# using ONNX model loaded to Oracle Database
embedder_params = {"provider": "database", "model": "demo_model"}

# get the embedding instance
# Remove proxy if not required
embedder = OracleEmbeddings(conn=conn, params=embedder_params, proxy=proxy)

embeddings = []
for doc in docs:
    chunks = splitter.split_text(doc.text)
    for chunk in chunks:
        embed = embedder._get_text_embedding(chunk)
        embeddings.append(embed)

""" verify """
print(f"Number of embeddings: {len(embeddings)}")
# print(f"Embedding-0: {embeddings[0]}")  # content
Number of embeddings: 3
Create Oracle AI Vector Store¶
Now that you know how to use the Oracle AI llama_index library APIs individually to process documents, let us show how to integrate with the Oracle AI Vector Store to perform semantic searches.
First, let's import all the dependencies.
import sys
import oracledb
from llama_index.core.schema import Document, TextNode
from llama_index.readers.oracleai import OracleReader, OracleTextSplitter
from llama_index.embeddings.oracleai import OracleEmbeddings
from llama_index.utils.oracleai import OracleSummary
from llama_index.vector_stores.oracledb import OraLlamaVS, DistanceStrategy
from llama_index.vector_stores.oracledb import base as orallamavs
Next, let's combine all the document-processing stages together. Here is the sample code:
"""
In this sample example, we will use 'database' provider for both summary and embeddings.
So, we don't need to do the followings:
- set proxy for 3rd party providers
- create credential for 3rd party providers
If you choose to use 3rd party provider,
please follow the necessary steps for proxy and credential.
"""
# oracle connection
# please update with your username, password, hostname, and service_name
username = "testuser"
password = "testuser"
dsn = "<hostname/service_name>"
try:
conn = oracledb.connect(user=username, password=password, dsn=dsn)
print("Connection successful!")
except Exception as e:
print("Connection failed!")
sys.exit(1)
# load onnx model
# please update with your related information
onnx_dir = "DEMO_PY_DIR"
onnx_file = "tinybert.onnx"
model_name = "demo_model"
try:
OracleEmbeddings.load_onnx_model(conn, onnx_dir, onnx_file, model_name)
print("ONNX model loaded.")
except Exception as e:
print("ONNX model loading failed!")
sys.exit(1)
# params
# please update necessary fields with related information
loader_params = {
"owner": "testuser",
"tablename": "demo_tab",
"colname": "data",
}
summary_params = {
"provider": "database",
"glevel": "S",
"numParagraphs": 1,
"language": "english",
}
splitter_params = {"normalize": "all"}
embedder_params = {"provider": "database", "model": "demo_model"}
# instantiate loader, summary, splitter, and embedder
loader = OracleReader(conn=conn, params=loader_params)
summary = OracleSummary(conn=conn, params=summary_params)
splitter = OracleTextSplitter(conn=conn, params=splitter_params)
embedder = OracleEmbeddings(conn=conn, params=embedder_params)
# process the documents
loader = OracleReader(conn=conn, params=loader_params)
docs = loader.load()
chunks_with_mdata = []
for id, doc in enumerate(docs, start=1):
summ = summary.get_summary(doc.text)
chunks = splitter.split_text(doc.text)
for ic, chunk in enumerate(chunks, start=1):
chunk_metadata = doc.metadata.copy()
chunk_metadata["id"] = (
chunk_metadata["_oid"] + "$" + str(id) + "$" + str(ic)
)
chunk_metadata["document_id"] = str(id)
chunk_metadata["document_summary"] = str(summ[0])
textnode = TextNode(
text=chunk,
id_=chunk_metadata["id"],
embedding=embedder._get_text_embedding(chunk),
metadata=chunk_metadata,
)
chunks_with_mdata.append(textnode)
""" verify """
print(f"Number of total chunks with metadata: {len(chunks_with_mdata)}")
Connection successful! ONNX model loaded. Number of total chunks with metadata: 3
At this point, we have processed the documents and generated chunks with metadata. Next, we will create an Oracle AI Vector Store from those chunks.
Here is the sample code to do that:
# create Oracle AI Vector Store
vectorstore = OraLlamaVS.from_documents(
    client=conn,
    docs=chunks_with_mdata,
    table_name="oravs",
    distance_strategy=DistanceStrategy.DOT_PRODUCT,
)

""" verify """
print(f"Vector Store Table: {vectorstore.table_name}")
Vector Store Table: oravs
The example above shows how to create a vector store with the DOT_PRODUCT distance strategy. Users can choose from a variety of distance strategies for the Oracle AI Vector Store, as detailed in our comprehensive guide.
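To build intuition for what these strategies compare, here is a small self-contained Python sketch using toy vectors (independent of the database): DOT_PRODUCT rewards both alignment and magnitude, while a cosine-based strategy compares direction only.

```python
import math


def dot_product(a, b):
    # DOT_PRODUCT: sum of element-wise products; grows with vector magnitude
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    # COSINE: dot product with magnitude factored out; direction only
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot_product(a, b) / (norm_a * norm_b)


v1 = [1.0, 2.0, 2.0]
v2 = [2.0, 4.0, 4.0]  # same direction as v1, twice the magnitude

print(dot_product(v1, v2))        # 18.0
print(cosine_similarity(v1, v2))  # 1.0: identical direction
```

Which strategy fits best depends on whether your embedding model produces normalized vectors; for normalized vectors the two rankings coincide.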
With embeddings now stored in a vector store, it is advisable to create an index on them to enhance semantic search performance during query execution.
Note: If you encounter an "out of memory" error, it is recommended to increase the vector_memory_size parameter in your database configuration.
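The note above refers to the vector_memory_size initialization parameter, which sizes Oracle's in-memory vector pool. A privileged session can raise it; the statement below is an illustrative sketch (the 1G value is an example, and a static change like this typically requires an instance restart):

```sql
-- Illustrative only: raise the vector pool to 1 GB (size is an example)
ALTER SYSTEM SET vector_memory_size = 1G SCOPE=SPFILE;
-- then restart the instance for the new value to take effect
```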
Below is a sample code snippet for creating an index:
orallamavs.create_index(
    conn, vectorstore, params={"idx_name": "hnsw_oravs", "idx_type": "HNSW"}
)

print("Index created.")
This example demonstrates the creation of a default HNSW index on the embeddings stored in the 'oravs' table. Users can adjust the parameters according to their specific needs. For detailed information on these parameters, please consult the Oracle AI Vector Search Guide.
Additionally, various types of vector indexes can be created to meet diverse requirements. More details can be found in our comprehensive guide.
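For instance, an IVF (Inverted File Flat) index can be requested through the same create_index interface by changing "idx_type". The "IVF" value below follows the Oracle AI Vector Search documentation, but verify the exact accepted parameters and tuning knobs against your database version:

```python
# Hypothetical IVF variant of the index parameters shown above
# (verify "idx_type": "IVF" and any tuning knobs for your version):
ivf_index_params = {"idx_name": "ivf_oravs", "idx_type": "IVF"}

# Usage mirrors the HNSW example:
# orallamavs.create_index(conn, vectorstore, params=ivf_index_params)
```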
Perform Semantic Search¶
With the index in place, we can run semantic searches over the store. The queries below exercise plain similarity search, filtered search, scored search, and max-marginal-relevance search:
query = "What is Oracle AI Vector Store?"
filter = {"document_id": ["1"]}
# Similarity search without a filter
print(vectorstore.similarity_search(query, 1))
# Similarity search with a filter
print(vectorstore.similarity_search(query, 1, filter=filter))
# Similarity search with relevance score
print(vectorstore.similarity_search_with_score(query, 1))
# Similarity search with relevance score with filter
print(vectorstore.similarity_search_with_score(query, 1, filter=filter))
# Max marginal relevance search
print(
vectorstore.max_marginal_relevance_search(
query, 1, fetch_k=20, lambda_mult=0.5
)
)
# Max marginal relevance search with filter
print(
vectorstore.max_marginal_relevance_search(
query, 1, fetch_k=20, lambda_mult=0.5, filter=filter
)
)
[Document(page_content='The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table. Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.', metadata={'_oid': '662f2f257677f3c2311a8ff999fd34e5', '_rowid': 'AAAR/xAAEAAAAAnAAC', 'id': '662f2f257677f3c2311a8ff999fd34e5$3$1', 'document_id': '3', 'document_summary': 'Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\n\n'})] [] [(Document(page_content='The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table. 
Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.', metadata={'_oid': '662f2f257677f3c2311a8ff999fd34e5', '_rowid': 'AAAR/xAAEAAAAAnAAC', 'id': '662f2f257677f3c2311a8ff999fd34e5$3$1', 'document_id': '3', 'document_summary': 'Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\n\n'}), 0.055675752460956573)] [] [Document(page_content='If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.', metadata={'_oid': '662f2f253acf96b33b430b88699490a2', '_rowid': 'AAAR/xAAEAAAAAnAAA', 'id': '662f2f253acf96b33b430b88699490a2$1$1', 'document_id': '1', 'document_summary': 'If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\n\n'})] [Document(page_content='If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.', metadata={'_oid': '662f2f253acf96b33b430b88699490a2', '_rowid': 'AAAR/xAAEAAAAAnAAA', 'id': '662f2f253acf96b33b430b88699490a2$1$1', 'document_id': '1', 'document_summary': 'If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\n\n'})]