Preprocess

Preprocess is an API service that splits any kind of document into optimal chunks of text for use in language model tasks.

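Given an input document, Preprocess splits it into chunks of text that respect the layout and semantics of the original document. The content is split by taking into account sections, paragraphs, lists, images, data tables, text tables, and slides, and by following the content semantics for long texts.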

Preprocess supports:

PDF files
Microsoft Office documents (Word, PowerPoint, Excel)
OpenOffice documents (ods, odt, odp)
HTML content (web pages, articles, emails)
Plain text
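Before sending a file to the API, it can be convenient to check locally that it looks like one of the formats listed above. The following is a minimal sketch, not part of the Preprocess SDK: the helper name `looks_supported` is hypothetical, and the extension set is an assumption derived from the list above (only ods, odt, and odp are named explicitly in the source).

```python
from pathlib import Path

# Assumed mapping from the formats listed above to file extensions.
# Only .ods, .odt and .odp are named explicitly in the documentation;
# the rest are illustrative guesses and may not match the service exactly.
SUPPORTED_EXTENSIONS = {
    ".pdf",
    ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx",
    ".ods", ".odt", ".odp",
    ".html", ".htm",
    ".txt",
}


def looks_supported(filepath: str) -> bool:
    """Return True if the file extension is in the assumed set of formats."""
    return Path(filepath).suffix.lower() in SUPPORTED_EXTENSIONS
```

A check like this only inspects the file name, so the service may still reject a file whose content does not match its extension.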

PreprocessReader interacts with the Preprocess API library to provide document conversion and chunking, or to load already chunked files into LlamaIndex.

Requirements

If you have not installed the Python Preprocess library yet, install it first:

# Install Preprocess Python SDK package
# $ pip install pypreprocess

Usage

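To use the Preprocess loader, you need to pass a Preprocess API key.
Pass your API key when initializing PreprocessReader. If you do not have one yet, request it by emailing support@preprocess.co. Without an API key, the loader raises an error.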
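To keep the key out of source code, one option is to read it from the environment and fail early if it is missing. This is a minimal sketch: the helper `read_api_key` and the variable name `PREPROCESS_API_KEY` are assumptions chosen for illustration, not conventions defined by Preprocess.

```python
import os


def read_api_key(env_var: str = "PREPROCESS_API_KEY") -> str:
    """Read the API key from an environment variable (name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail before constructing the reader, with an actionable message.
        raise ValueError(
            f"Set the {env_var} environment variable to your Preprocess API key."
        )
    return key
```

The returned string can then be passed as the `api_key` argument when the reader is created.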

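To chunk a file, pass a valid file path and the reader will start converting and chunking it.
Preprocess chunks your files with its own internal Splitter. For this reason, you should not parse the documents into nodes with a Splitter or apply a Splitter while transforming documents in your IngestionPipeline.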

from llama_index.core import VectorStoreIndex
from llama_index.readers.preprocess import PreprocessReader

loader = PreprocessReader(
    api_key="your-api-key", filepath="valid/path/to/file"
)

If you want to work with the nodes directly:

nodes = loader.get_nodes()

# import the nodes in a Vector Store with your configuration
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine()

By default, load_data() returns one document per chunk. Remember not to apply any splitting to these documents.

documents = loader.load_data()

# don't apply any Splitter parser to documents
# if you have an ingestion pipeline you should not apply a Splitter in the transformations
# import the documents in a Vector Store, if you set the service_context parameter remember to avoid including a splitter
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

data = loader.load()

If you want only the extracted text returned, so that you can handle it with a custom pipeline, set return_whole_document = True

document = loader.load_data(return_whole_document=True)
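As an illustration of what a custom pipeline over the extracted text might look like, the sketch below groups blank-line-separated paragraphs into pieces of bounded size. It is a generic example, not the algorithm Preprocess uses: the function `group_paragraphs` is hypothetical, and it assumes the extracted text is available as a plain string (for example, taken from the `text` attribute of the returned Document).

```python
from typing import List


def group_paragraphs(text: str, max_chars: int = 1000) -> List[str]:
    """Group paragraphs into pieces of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    pieces: List[str] = []
    current = ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue  # skip empty paragraphs caused by extra blank lines
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the piece.
            pieces.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces
```

Grouping on paragraph boundaries is only one possible design; any custom logic applied here replaces the layout-aware chunking that Preprocess would otherwise provide.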

If you want to load a file that has already been chunked, pass its process_id to the reader.

# pass a process_id obtained from a previous instance and get the chunks as one string inside a Document
loader = PreprocessReader(api_key="your-api-key", process_id="your-process-id")
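To reuse a process_id across runs, it has to be stored somewhere. The sketch below keeps a small JSON file that maps each file path to its process_id. The helpers `load_process_id` and `save_process_id` are hypothetical and use only the standard library; how the process_id is obtained from a previous run is not covered here and should be checked against the reader's documentation.

```python
import json
from pathlib import Path
from typing import Optional


def load_process_id(cache_file: str, filepath: str) -> Optional[str]:
    """Return the cached process_id for filepath, or None if not cached."""
    path = Path(cache_file)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8")).get(filepath)


def save_process_id(cache_file: str, filepath: str, process_id: str) -> None:
    """Store the process_id for filepath, keeping existing cache entries."""
    path = Path(cache_file)
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    cache[filepath] = process_id
    path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
```

When `load_process_id` returns a value, it can be passed as the `process_id` argument shown above instead of submitting the file again.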

Other info

PreprocessReader is based on pypreprocess from the Preprocess library. For more information or other integration needs, please check the official documentation.