Pydantic Extractor¶
Here we test out the capabilities of the PydanticProgramExtractor, which can use an LLM (either a standard text-completion LLM or a function-calling LLM) to extract an entire Pydantic object from a text chunk.
The advantage over a "single" metadata extractor is that multiple entries can be extracted with a single LLM call.
Installation¶
In [ ]:
%pip install llama-index-readers-web
%pip install llama-index-program-openai
In [ ]:
import nest_asyncio
nest_asyncio.apply()
import os
import openai
In [ ]:
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
openai.api_key = os.getenv("OPENAI_API_KEY")
Setup the Pydantic Model¶
Here we define a basic structured schema that we want to extract. It contains:
- entities: unique entities in a text chunk
- summary: a concise summary of the text chunk
- contains_number: whether the chunk contains any numbers
This is obviously a toy schema. We'd encourage you to be creative about the types of metadata you'd want to extract!
In [ ]:
from pydantic import BaseModel, Field
from typing import List
In [ ]:
class NodeMetadata(BaseModel):
    """Node metadata."""

    entities: List[str] = Field(
        ..., description="Unique entities in this text chunk."
    )
    summary: str = Field(
        ..., description="A concise summary of this text chunk."
    )
    contains_number: bool = Field(
        ...,
        description=(
            "Whether the text chunk contains any numbers (ints, floats, etc.)"
        ),
    )
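Before wiring the schema into an extractor, it can be exercised directly as a plain Pydantic model. A small sketch, assuming Pydantic v2 (`model_dump`); the sample values are invented for illustration:

```python
from typing import List

from pydantic import BaseModel, Field


class NodeMetadata(BaseModel):
    """Node metadata."""

    entities: List[str] = Field(..., description="Unique entities in this text chunk.")
    summary: str = Field(..., description="A concise summary of this text chunk.")
    contains_number: bool = Field(..., description="Whether the chunk contains numbers.")


# Hypothetical sample values, just to show the shape of the extracted object.
meta = NodeMetadata(
    entities=["LlamaIndex", "Pydantic"],
    summary="A short demo chunk.",
    contains_number=False,
)
print(meta.model_dump())
```

Constructing the model by hand like this is a quick way to check that the field names and types match what you expect the LLM to fill in.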
Setup the Extractor¶
Here we set up the metadata extractor. Note that we include the prompt template for visibility into what's going on.
In [ ]:
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core.extractors import PydanticProgramExtractor
EXTRACT_TEMPLATE_STR = """\
Here is the content of the section:
----------------
{context_str}
----------------
Given the contextual information, extract out a {class_name} object.\
"""
openai_program = OpenAIPydanticProgram.from_defaults(
    output_cls=NodeMetadata,
    prompt_template_str="{input}",
    # extract_template_str=EXTRACT_TEMPLATE_STR
)
program_extractor = PydanticProgramExtractor(
    program=openai_program, input_key="input", show_progress=True
)
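To see roughly what the LLM receives per node, the extract template can be formatted by hand: `{context_str}` is replaced with the node text and `{class_name}` with the name of the output class. A plain-Python sketch with an illustrative context string:

```python
EXTRACT_TEMPLATE_STR = """\
Here is the content of the section:
----------------
{context_str}
----------------
Given the contextual information, extract out a {class_name} object.\
"""

# Illustrative values; at runtime these come from the node text and the output class.
prompt = EXTRACT_TEMPLATE_STR.format(
    context_str="LLM patterns include evals, RAG, and fine-tuning.",
    class_name="NodeMetadata",
)
print(prompt)
```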
Load in Data¶
We load in Eugene's essay (https://eugeneyan.com/writing/llm-patterns/) using our LlamaHub SimpleWebPageReader.
In [ ]:
# load in blog
from llama_index.readers.web import SimpleWebPageReader
from llama_index.core.node_parser import SentenceSplitter
reader = SimpleWebPageReader(html_to_text=True)
docs = reader.load_data(urls=["https://eugeneyan.com/writing/llm-patterns/"])
In [ ]:
from llama_index.core.ingestion import IngestionPipeline
node_parser = SentenceSplitter(chunk_size=1024)
pipeline = IngestionPipeline(transformations=[node_parser, program_extractor])
orig_nodes = pipeline.run(documents=docs)
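Conceptually, the ingestion pipeline just threads the documents through each transformation in order (split, then extract). A minimal stdlib sketch of that control flow, using toy stand-ins for the real `SentenceSplitter` and `PydanticProgramExtractor`:

```python
def split_into_chunks(texts, chunk_size=40):
    # Toy stand-in for SentenceSplitter: fixed-size character chunks.
    return [t[i : i + chunk_size] for t in texts for i in range(0, len(t), chunk_size)]


def attach_metadata(chunks):
    # Toy stand-in for PydanticProgramExtractor: tag each chunk with metadata.
    return [
        {"text": c, "metadata": {"contains_number": any(ch.isdigit() for ch in c)}}
        for c in chunks
    ]


def run_pipeline(docs, transformations):
    # Each transformation consumes the previous stage's output.
    nodes = docs
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


nodes = run_pipeline(
    ["Seven patterns for LLM systems, part 1."],
    [split_into_chunks, attach_metadata],
)
print(nodes)
```

The real pipeline works the same way, except each transformation operates on LlamaIndex `Document`/`Node` objects and the extractor makes LLM calls.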
In [ ]:
orig_nodes
In [ ]:
sample_entry = program_extractor.extract(orig_nodes[0:1])[0]
Extracting Pydantic object: 0%| | 0/1 [00:00<?, ?it/s]
In [ ]:
display(sample_entry)
{'entities': ['eugeneyan', 'HackerNews', 'Karpathy'],
'summary': 'This section discusses practical patterns for integrating large language models (LLMs) into systems & products. It introduces seven key patterns and provides information on evaluations and benchmarks in the field of language modeling.',
'contains_number': True}
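`process_nodes` then merges each extracted object's fields into the corresponding node's `metadata` dict. The merge itself is an ordinary dictionary update; the values below are hypothetical, mirroring the sample output above:

```python
# Hypothetical node metadata before extraction.
node_metadata = {"url": "https://eugeneyan.com/writing/llm-patterns/"}

# Fields as extracted by the LLM (sample values from the output above).
extracted = {
    "entities": ["eugeneyan", "HackerNews", "Karpathy"],
    "summary": "Practical patterns for LLM systems.",
    "contains_number": True,
}

# Extracted fields are merged alongside any metadata the node already has.
node_metadata.update(extracted)
print(node_metadata)
```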
In [ ]:
new_nodes = program_extractor.process_nodes(orig_nodes)
Extracting Pydantic object: 0%| | 0/29 [00:00<?, ?it/s]
In [ ]:
display(new_nodes[5:7])