How to Build a Chatbot¶

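LlamaIndex serves as a bridge between your data and large language models (LLMs), providing a toolkit for building a query interface over your data for tasks such as question answering and summarization.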

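This tutorial walks you through building a context-augmented chatbot using a data agent. The agent, powered by an LLM, can carry out tasks over your data. The end result is a chatbot agent equipped with LlamaIndex data interface tools for answering queries about your data.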

Note: This tutorial builds on initial work on creating a query interface over SEC 10-K filings (check it out here).

Context¶

In this guide, we build a "10-K Chatbot" that uses raw UBER 10-K HTML filings downloaded from Dropbox. Users can interact with the chatbot to ask questions about the 10-K filings.

Preparation¶

In [ ]:

%pip install llama-index-readers-file
%pip install llama-index-embeddings-openai
%pip install llama-index-agent-openai
%pip install llama-index-llms-openai
%pip install llama-index-question-gen-openai
%pip install unstructured

In [ ]:

import os

os.environ["OPENAI_API_KEY"] = "sk-..."

In [ ]:

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# global defaults
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model_name="text-embedding-3-large")
Settings.chunk_size = 512
Settings.chunk_overlap = 64
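To get a feel for what `chunk_size = 512` and `chunk_overlap = 64` mean, here is a simplified sketch of fixed-size chunking with overlap. It is an illustration only: a list of integers stands in for tokens, and the helper name `sliding_chunks` is made up for this example. LlamaIndex's own splitter also takes sentence boundaries into account, so real chunk boundaries will differ.

```python
def sliding_chunks(tokens, chunk_size=512, chunk_overlap=64):
    """Split `tokens` into windows of `chunk_size` sharing `chunk_overlap` items."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Hypothetical input: 1000 stand-in "tokens".
tokens = list(range(1000))
chunks = sliding_chunks(tokens)
print([len(c) for c in chunks])  # [512, 512, 104]
```

Consecutive chunks share their boundary region (the last 64 items of one chunk are the first 64 of the next), which reduces the chance that a relevant passage is cut in half at a chunk boundary.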

Ingest Data¶

Let's first download the raw 10-K files for the years 2019-2022.

In [ ]:

# NOTE: the code examples assume you're operating within a Jupyter notebook.
# download files
!mkdir data
!wget "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1" -O data/UBER.zip
!unzip data/UBER.zip -d data
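The shell commands above rely on notebook `!` magics. Outside a notebook, roughly the same steps can be done with the Python standard library. This is a sketch rather than part of the original tutorial; the function names `download_file` and `extract_zip` are illustrative.

```python
import urllib.request
import zipfile
from pathlib import Path

# Same Dropbox URL as in the cell above.
UBER_ZIP_URL = "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1"


def download_file(url: str, dest: Path) -> Path:
    """Download `url` to `dest`, creating parent directories as needed."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest


def extract_zip(zip_path: Path, target_dir: Path) -> list[str]:
    """Extract `zip_path` into `target_dir` and return the archive member names."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)
        return zf.namelist()


# Usage (needs network access, so it is left commented out here):
# zip_path = download_file(UBER_ZIP_URL, Path("data/UBER.zip"))
# extract_zip(zip_path, Path("data"))
```

Unlike a bare `mkdir`, `mkdir(parents=True, exist_ok=True)` does not fail when the directory already exists, so the sketch can be re-run safely.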

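To parse the HTML files into formatted text, we use the Unstructured library. Thanks to LlamaHub, we can integrate directly with Unstructured, which lets us convert any text into the Document format that LlamaIndex can ingest.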

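The required packages, including unstructured, were already installed in the Preparation step, so we can use UnstructuredReader directly to parse the HTML files into a list of Document objects.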

In [ ]:

from llama_index.readers.file import UnstructuredReader
from pathlib import Path

years = [2022, 2021, 2020, 2019]

In [ ]:

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(
        file=Path(f"./data/UBER/UBER_{year}.html"), split_documents=False
    )
    # attach the filing year as metadata to each document
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

Setting up Vector Indices for Each Year¶

We first set up a vector index for each year. Each vector index lets us ask questions about the 10-K filing of a given year.

We build each index and save it to disk.

In [ ]:

# initialize simple vector indices
# NOTE: don't run this cell if the indices are already loaded!
from llama_index.core import VectorStoreIndex, StorageContext


index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults()
    cur_index = VectorStoreIndex.from_documents(
        doc_set[year],
        storage_context=storage_context,
    )
    index_set[year] = cur_index
    storage_context.persist(persist_dir=f"./storage/{year}")

To load an index from disk, do the following:

In [ ]:

# Load indices from disk
from llama_index.core import StorageContext, load_index_from_storage

index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults(
        persist_dir=f"./storage/{year}"
    )
    cur_index = load_index_from_storage(
        storage_context,
    )
    index_set[year] = cur_index
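Because the build cell should not be re-run when the indices already exist, one option is to check which years still lack a persisted index before deciding whether to build or load. A minimal sketch, assuming the `./storage/{year}` layout used above; `years_missing_index` is a hypothetical helper, and treating any non-empty directory as "already persisted" is a simplifying assumption.

```python
from pathlib import Path


def years_missing_index(years, storage_root="./storage"):
    """Return the years whose persist directory is absent or empty."""
    missing = []
    for year in years:
        persist_dir = Path(storage_root) / str(year)
        if not persist_dir.is_dir() or not any(persist_dir.iterdir()):
            missing.append(year)
    return missing


# Build only the years in `to_build`; load the remaining years from disk.
to_build = years_missing_index([2022, 2021, 2020, 2019])
print(to_build)
```

The years returned would go through the build cell, while the rest can be loaded with `load_index_from_storage` as shown above.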

Setting up a Sub Question Query Engine to Synthesize Answers Across 10-K Filings¶

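Since we have access to documents from four years, we may want to ask questions not only about the 10-K filing of a given year, but also questions that require analysis across all of the 10-K filings.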

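To address this, we can use a Sub Question Query Engine. It decomposes a query into sub-questions, each answered by an individual vector index, and then synthesizes the results to answer the overall query.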

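LlamaIndex provides wrappers around indices (and query engines) so that they can be used by query engines and agents. First, we define a QueryEngineTool for each vector index. Each tool has a name and a description; these are what the LLM agent sees when deciding which tool to choose.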

In [ ]:

from llama_index.core.tools import QueryEngineTool

individual_query_engine_tools = [
    QueryEngineTool.from_defaults(
        query_engine=index_set[year].as_query_engine(),
        name=f"vector_index_{year}",
        description=(
            "useful for when you want to answer queries about the"
            f" {year} SEC 10-K for Uber"
        ),
    )
    for year in years
]
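Since the agent picks tools based on their names and descriptions, it can help to list them and confirm that each one is distinct. The sketch below rebuilds the same name/description pairs as plain tuples so it runs without any index or API key; `describe_tools` is a hypothetical helper, not a LlamaIndex function.

```python
def describe_tools(years):
    """Build (name, description) pairs mirroring the tool definitions above."""
    return [
        (
            f"vector_index_{year}",
            "useful for when you want to answer queries about the"
            f" {year} SEC 10-K for Uber",
        )
        for year in years
    ]


catalog = describe_tools([2022, 2021, 2020, 2019])
names = [name for name, _ in catalog]

# Duplicate names would leave the agent unable to tell the tools apart.
assert len(set(names)) == len(names)

for name, description in catalog:
    print(f"{name}: {description}")
```

With the real tools, the same fields should be readable from each tool's `metadata` attribute (for example `tool.metadata.name` and `tool.metadata.description`).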

Now we can create the Sub Question Query Engine, which lets us synthesize answers across the 10-K filings. We pass in the individual_query_engine_tools defined above.

In [ ]:

from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=individual_query_engine_tools,
)
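The decompose, answer, and combine flow described above can be pictured with a toy sketch. This is not how SubQuestionQueryEngine is implemented: in LlamaIndex an LLM generates the sub-questions and another LLM call synthesizes the final answer, whereas here both steps are replaced with trivial string operations, and all function names are made up for illustration.

```python
def decompose(query, years):
    """Toy decomposition: ask the same question once per year."""
    return [(year, f"{query} (fiscal year {year})") for year in years]


def answer_with_stub_engine(year, sub_question):
    """Stand-in for a per-year query engine; returns a placeholder string."""
    return f"[{year}] placeholder answer to: {sub_question}"


def sub_question_query(query, years):
    """Decompose the query, answer each sub-question independently, then combine."""
    sub_questions = decompose(query, years)
    partial_answers = [
        answer_with_stub_engine(year, sub_question)
        for year, sub_question in sub_questions
    ]
    # Toy "synthesis": concatenate the partial answers.
    return "\n".join(partial_answers)


print(sub_question_query("What were the main risk factors?", [2022, 2021]))
```

The point of the structure is that each sub-question only needs the context of a single year's index, which keeps retrieval focused even when the overall question spans several filings.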

Setting up the Chatbot Agent¶

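We use a LlamaIndex data agent to set up the outer chatbot agent, which has access to a set of tools. Specifically, we use a FunctionAgent, which relies on the function-calling capability of the underlying LLM (here, an OpenAI model). We want to give it the separate tools we defined previously for each index (one per year), as well as a tool for the Sub Question Query Engine defined above.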

First, we define a QueryEngineTool for the Sub Question Query Engine:

In [ ]:

query_engine_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="sub_question_query_engine",
    description=(
        "useful for when you want to answer queries that require analyzing"
        " multiple SEC 10-K documents for Uber"
    ),
)

Then, we combine the tools defined above into a single list of tools for the agent:

In [ ]:

tools = individual_query_engine_tools + [query_engine_tool]

Finally, we create the agent by instantiating FunctionAgent with the list of tools defined above.

In [ ]:

from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

agent = FunctionAgent(tools=tools, llm=OpenAI(model="gpt-4o"))

Testing the Agent¶

We can now test the agent with various queries.

If we test it with a simple greeting, the agent does not use any tools.

In [ ]:

from llama_index.core.workflow import Context

# Setup the context for this specific interaction
ctx = Context(agent)

response = await agent.run("hi, i am bob", ctx=ctx)
print(str(response))

Hello Bob! How can I assist you today?

If we test it with a query about the 10-K of a given year, the agent will use the relevant vector index tool.

In [ ]:

response = await agent.run(
    "What were some of the biggest risk factors in 2020 for Uber?", ctx=ctx
)
print(str(response))

In 2020, some of the biggest risk factors for Uber included:

1. **Legal and Regulatory Risks**: Extensive government regulation and oversight could adversely impact operations and future prospects.
2. **Data Privacy and Security Risks**: Risks related to data collection, use, and processing could lead to investigations, litigation, and negative publicity.
3. **Economic Impact of COVID-19**: The pandemic adversely affected business operations, demand for services, and financial condition due to governmental restrictions and changes in consumer behavior.
4. **Market Volatility**: Volatility in the market price of common stock could affect investors' ability to resell shares at favorable prices.
5. **Safety Incidents**: Criminal or dangerous activities on the platform could harm the ability to attract and retain drivers and consumers.
6. **Investment Risks**: Substantial investments in new technologies and offerings carry inherent risks, with no guarantee of realizing expected benefits.
7. **Dependence on Metropolitan Areas**: A significant portion of gross bookings comes from large metropolitan areas, which may be negatively impacted by various external factors.
8. **Talent Retention**: Attracting and retaining high-quality personnel is crucial, and issues with attrition or succession planning could adversely affect the business.
9. **Cybersecurity Threats**: Cyberattacks and data breaches could harm reputation and operational results.
10. **Capital Requirements**: The need for additional capital to support growth may not be met on reasonable terms, impacting business expansion.
11. **Acquisition Challenges**: Difficulty in identifying and integrating suitable businesses could harm operating results and future prospects.
12. **Operational Limitations**: Potential restrictions in certain jurisdictions may require modifications to the business model, affecting service delivery.

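Finally, if we test it with a query that compares and contrasts risk factors across years, the agent will use the Sub Question Query Engine tool.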

In [ ]:

cross_query_str = (
    "Compare/contrast the risk factors described in the Uber 10-K across"
    " years. Give answer in bullet points."
)

response = await agent.run(cross_query_str, ctx=ctx)
print(str(response))

Here's a comparison of the risk factors for Uber across the years 2020, 2021, and 2022:

- **COVID-19 Impact**:
- **2020**: The pandemic significantly affected business operations, demand, and financial condition.
- **2021**: Continued impact of the pandemic was a concern, affecting various parts of the business.
- **2022**: The pandemic's impact was less emphasized, with more focus on operational and competitive risks.

- **Driver Classification**:
- **2020**: Not specifically highlighted.
- **2021**: Potential reclassification of Drivers as employees could alter the business model.
- **2022**: Continued risk of reclassification impacting operational costs.

- **Competition**:
- **2020**: Not specifically highlighted.
- **2021**: Intense competition with low barriers to entry and well-capitalized competitors.
- **2022**: Competitive landscape challenges due to established alternatives and low barriers to entry.

- **Financial Concerns**:
- **2020**: Market volatility and capital requirements were major concerns.
- **2021**: Historical losses and increased operating expenses raised profitability concerns.
- **2022**: Significant losses and rising expenses continued to raise profitability concerns.

- **User and Personnel Retention**:
- **2020**: Talent retention was crucial, with risks from attrition.
- **2021**: Attracting and retaining a critical mass of users and personnel was essential.
- **2022**: Continued emphasis on retaining Drivers, consumers, and high-quality personnel.

- **Brand and Reputation**:
- **2020**: Safety incidents and cybersecurity threats could harm reputation.
- **2021**: Maintaining and enhancing brand reputation was critical, with past negative publicity being a concern.
- **2022**: Brand and reputation were under scrutiny, with negative media coverage potentially harming prospects.

- **Operational Challenges**:
- **2020**: Operational limitations and acquisition challenges were highlighted.
- **2021**: Challenges in managing growth and optimizing organizational structure.
- **2022**: Historical workplace culture and the need for organizational optimization were critical.

- **Safety and Liability**:
- **2020**: Safety incidents and liability claims were significant risks.
- **2021**: Safety incidents and liability claims, especially with vulnerable road users, were concerns.
- **2022**: Safety incidents and public reporting could impact reputation and financial results.

Overall, while some risk factors remained consistent across the years, such as competition, financial concerns, and safety, the emphasis shifted slightly with the evolving business environment and external factors like the pandemic.

Setting up the Chatbot Loop¶

Now that the chatbot is set up, it takes only a few more steps to build a basic interactive loop for chatting with our SEC-augmented chatbot!

In [ ]:

agent = FunctionAgent(tools=tools, llm=OpenAI(model="gpt-4o"))
ctx = Context(agent)

while True:
    text_input = input("User: ")
    if text_input == "exit":
        break
    response = await agent.run(text_input, ctx=ctx)
    print(f"Agent: {response}")

# User: What were some of the legal proceedings against Uber in 2022?
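The loop above uses top-level `await`, which works in a Jupyter notebook but not in a plain Python script. One way to adapt it is to wrap the loop in a coroutine and start it with `asyncio.run`. The sketch below is an assumption-laden illustration: `chat_loop` and `echo_agent` are made-up names, and the stub agent replaces the real `agent.run(...)` call so the example runs without an API key.

```python
import asyncio


async def chat_loop(run_agent, read_input=input, write_output=print):
    """Read user messages until 'exit', sending each one to `run_agent`."""
    while True:
        text_input = read_input("User: ")
        if text_input == "exit":
            break
        response = await run_agent(text_input)
        write_output(f"Agent: {response}")


async def echo_agent(message):
    """Stub standing in for `agent.run(message, ctx=ctx)`."""
    return f"(stub) you said: {message}"


# Scripted demo so the example runs without a terminal or an API key.
scripted = iter(["hello", "exit"])
transcript = []
asyncio.run(
    chat_loop(
        echo_agent,
        read_input=lambda prompt: next(scripted),
        write_output=transcript.append,
    )
)
print(transcript)  # ['Agent: (stub) you said: hello']
```

With the real agent, something like `asyncio.run(chat_loop(lambda text: agent.run(text, ctx=ctx)))` should behave like the notebook loop, assuming the agent and context are created as shown earlier.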