HotpotQADistractor Demo
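This notebook walks through evaluating a query engine on the HotpotQA dataset. In this task, the LLM must answer a question given a pre-configured context. The answer usually has to be concise, and accuracy is measured by token overlap (as an F1 score) and by exact match.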
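To make the two metrics concrete, here is a minimal sketch of exact match and token-level F1. It assumes the SQuAD-style answer normalization (lowercasing, dropping punctuation and English articles) that HotpotQA-style evaluation commonly uses; the function names are hypothetical and this is not taken from the `HotpotQAEvaluator` source.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize_answer(prediction)
    truth = normalize_answer(gold)
    # Assumption: yes/no answers earn credit only when they agree exactly.
    if (pred in {"yes", "no"} or truth in {"yes", "no"}) and pred != truth:
        return 0.0
    pred_tokens = pred.split()
    gold_tokens = truth.split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, the response "Greenwich Village" against the reference "Greenwich Village, New York City" has precision 2/2 and recall 2/5, so F1 is 4/7 (about 0.571) while exact match is 0. This agrees with the per-question output reported later in this notebook.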
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
%pip install llama-index-llms-openai
In [ ]:
!pip install llama-index
In [ ]:
from llama_index.core.evaluation.benchmarks import HotpotQAEvaluator
from llama_index.core import VectorStoreIndex
from llama_index.core import Document
from llama_index.llms.openai import OpenAI
from llama_index.core.embeddings import resolve_embed_model
llm = OpenAI(model="gpt-3.5-turbo")
embed_model = resolve_embed_model(
    "local:sentence-transformers/all-MiniLM-L6-v2"
)
index = VectorStoreIndex.from_documents(
    [Document.example()], embed_model=embed_model, show_progress=True
)
Parsing documents into nodes: 100%|██████████| 1/1 [00:00<00:00, 129.13it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 36.62it/s]
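First we try a very simple engine. In this particular benchmark, the retriever (and hence the index) is actually ignored, because the documents retrieved for each query are supplied by the dataset itself. In HotpotQA this is known as the "distractor" setting.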
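As an illustration of what "supplied by the dataset" means, the sketch below maps each question to the passages that come with it, so that no vector search is involved. It assumes the public HotpotQA JSON layout, in which `context` is a list of `[title, sentences]` pairs. The record itself is invented for illustration, and the helper names are hypothetical rather than part of LlamaIndex.

```python
from typing import Dict, List

# Invented record in the HotpotQA layout; the text is NOT from the dataset.
example = {
    "question": "Which river flows through the city where the Example Museum is located?",
    "answer": "Example River",
    "context": [
        ["Example Museum", ["The Example Museum is located in Sampletown.", "It opened in 1990."]],
        ["Sampletown", ["Sampletown lies on the Example River."]],
        ["Unrelated Town", ["Unrelated Town is known for its market."]],
    ],
}


def context_passages(record: dict) -> List[str]:
    """Turn each (title, sentences) pair into a single passage string."""
    return [
        f"{title}: {' '.join(s.strip() for s in sentences)}"
        for title, sentences in record["context"]
    ]


def build_lookup(records: List[dict]) -> Dict[str, List[str]]:
    """Map each question to its fixed passages, standing in for a retriever."""
    return {r["question"]: context_passages(r) for r in records}


lookup = build_lookup([example])
passages = lookup[example["question"]]
```

A fixed lookup like this isolates the answer-synthesis step: every engine sees the same gold and distractor passages, so score differences come from how the passages are filtered and how the LLM is prompted, not from retrieval quality.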
In [ ]:
engine = index.as_query_engine(llm=llm)
HotpotQAEvaluator().run(engine, queries=5, show_result=True)
Dataset: hotpot_dev_distractor downloaded at: /Users/loganmarkewich/Library/Caches/llama_index/datasets/HotpotQA
Evaluating on dataset: hotpot_dev_distractor
-------------------------------------
Loading 5 queries out of 7405 (fraction: 0.00068)
Question: Were Scott Derrickson and Ed Wood of the same nationality?
Response: No.
Correct answer: yes
EM: 0 F1: 0
-------------------------------------
Question: What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?
Response: Unknown
Correct answer: Chief of Protocol
EM: 0 F1: 0
-------------------------------------
Question: What science fantasy young adult series, told in first person, has a set of companion books narrating the stories of enslaved worlds and alien species?
Response: Animorphs
Correct answer: Animorphs
EM: 1 F1: 1.0
-------------------------------------
Question: Are the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?
Response: Yes.
Correct answer: no
EM: 0 F1: 0
-------------------------------------
Question: The director of the romantic comedy "Big Stone Gap" is based in what New York city?
Response: Greenwich Village
Correct answer: Greenwich Village, New York City
EM: 0 F1: 0.5714285714285715
-------------------------------------
Scores: {'exact_match': 0.2, 'f1': 0.31428571428571433}
Now we try a sentence transformer reranker, which selects 3 of the 10 nodes proposed by the retriever.
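The selection step can be pictured as "score every candidate against the query, keep the best `top_n`". The sketch below shows only that control flow. The scoring function is a deliberately crude token-overlap stand-in, an assumption made so the example runs without downloading a model; `SentenceTransformerRerank` scores candidates with a learned sentence-transformers model instead. All names here are hypothetical.

```python
from typing import Callable, List


def token_overlap_score(query: str, passage: str) -> float:
    """Stand-in relevance score: fraction of query tokens found in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank(
    query: str,
    passages: List[str],
    top_n: int = 3,
    score_fn: Callable[[str, str], float] = token_overlap_score,
) -> List[str]:
    """Sort passages by descending score and keep the first top_n."""
    ranked = sorted(passages, key=lambda p: score_fn(query, p), reverse=True)
    return ranked[:top_n]


# Ten candidates, mirroring the 10 passages provided per question.
candidates = [
    "unrelated passage one",
    "alpha is a film",
    "unrelated passage two",
    "the weather is mild",
    "unrelated passage three",
    "the film alpha was directed by someone",
    "unrelated passage four",
    "unrelated passage five",
    "unrelated passage six",
    "unrelated passage seven",
]
top = rerank("who directed the film alpha", candidates, top_n=3)
```

Trimming the context this way gives the LLM fewer distractor passages to read, which is one plausible reason a reranker can change the answers.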
In [ ]:
from llama_index.core.postprocessor import SentenceTransformerRerank
rerank = SentenceTransformerRerank(top_n=3)
engine = index.as_query_engine(
    llm=llm,
    node_postprocessors=[rerank],
)
HotpotQAEvaluator().run(engine, queries=5, show_result=True)
Dataset: hotpot_dev_distractor downloaded at: /Users/loganmarkewich/Library/Caches/llama_index/datasets/HotpotQA
Evaluating on dataset: hotpot_dev_distractor
-------------------------------------
Loading 5 queries out of 7405 (fraction: 0.00068)
Question: Were Scott Derrickson and Ed Wood of the same nationality?
Response: No.
Correct answer: yes
EM: 0 F1: 0
-------------------------------------
Question: What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?
Response: No government position.
Correct answer: Chief of Protocol
EM: 0 F1: 0
-------------------------------------
Question: What science fantasy young adult series, told in first person, has a set of companion books narrating the stories of enslaved worlds and alien species?
Response: Animorphs
Correct answer: Animorphs
EM: 1 F1: 1.0
-------------------------------------
Question: Are the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?
Response: No.
Correct answer: no
EM: 1 F1: 1.0
-------------------------------------
Question: The director of the romantic comedy "Big Stone Gap" is based in what New York city?
Response: New York City.
Correct answer: Greenwich Village, New York City
EM: 0 F1: 0.7499999999999999
-------------------------------------
Scores: {'exact_match': 0.4, 'f1': 0.55}
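Both scores rise in this run: exact match goes from 0.2 to 0.4 and F1 from about 0.31 to 0.55, although five queries is a very small sample.
Note that the benchmark is optimized for producing short factoid answers without explanations, even though CoT prompting is known to sometimes help with output quality.
The scores are also not a perfect measure of correctness, but they are a quick way to see how changes to your query engine affect the output.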
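To make the comparison between the two runs concrete, the sketch below recomputes the aggregate scores from the per-question EM and F1 values printed in the outputs above. It assumes the aggregate is a plain mean over questions, which is consistent with the reported numbers; the variable and function names are hypothetical.

```python
from typing import Dict, List

# Per-question values copied from the two outputs in this notebook.
baseline = {
    "em": [0, 0, 1, 0, 0],
    "f1": [0, 0, 1.0, 0, 0.5714285714285715],
}
reranked = {
    "em": [0, 0, 1, 1, 0],
    "f1": [0, 0, 1.0, 1.0, 0.7499999999999999],
}


def summarize(run: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the per-question scores (assumed aggregation: simple mean)."""
    return {
        "exact_match": sum(run["em"]) / len(run["em"]),
        "f1": sum(run["f1"]) / len(run["f1"]),
    }


baseline_scores = summarize(baseline)
reranked_scores = summarize(reranked)

# Difference per metric between the reranked run and the baseline.
delta = {k: reranked_scores[k] - baseline_scores[k] for k in baseline_scores}
```

With only five questions, a single answer flipping from wrong to right moves exact match by 0.2, so it is worth raising `queries` before drawing conclusions from a difference of this size.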