Prometheus-2 Cookbook¶
This notebook demonstrates how to use Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models.
Paper Abstract:¶
Proprietary LMs such as GPT-4 are often used to assess the quality of responses from various LMs. However, considerations of transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluation. Existing open evaluator LMs exhibit two critical shortcomings: 1) they issue scores that diverge significantly from those assigned by humans, and 2) they cannot perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment, and they lack the ability to evaluate against custom criteria, focusing instead on general attributes such as helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, an evaluator LM more powerful than its predecessor whose judgements closely mirror those of humans and GPT-4. It handles both direct assessment and pairwise ranking, and supports user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 achieves the highest correlation and agreement with human and proprietary evaluators among all open evaluator LMs tested.
Note: The base models used to build Prometheus-2 are Mistral-7B and Mixtral-8x7B.¶
In this notebook, we demonstrate how to use Prometheus-2 as an evaluator with the following evaluation capabilities provided by LlamaIndex:
- Pairwise Evaluator - evaluates whether the LLM prefers one of two responses generated by two different query engines for the same query.
- Faithfulness Evaluator - checks whether the answer is faithful to the retrieved contexts, i.e. whether the response is free of hallucination.
- Correctness Evaluator - checks whether the generated answer matches the reference answer provided for the query (requires labelled data).
- Relevancy Evaluator - checks whether the response and the retrieved contexts are relevant to the query.
If you are not familiar with the evaluators above, please refer to our Evaluation Guide for more details.
The prompts used in this demonstration are inspired by / adapted from the prometheus-eval repository.
Installation¶
!pip install llama-index
!pip install llama-index-llms-huggingface-api
Set up API Keys¶
import os
os.environ["OPENAI_API_KEY"] = "sk-" # OPENAI API KEY
# attach to the same event-loop
import nest_asyncio
nest_asyncio.apply()
from typing import Tuple, Optional
from IPython.display import Markdown, display
Download Data¶
For this demonstration, we will use the Paul Graham Essay dataset and define a sample query with its corresponding reference answer.
from llama_index.core.llama_dataset import download_llama_dataset
paul_graham_rag_dataset, paul_graham_documents = download_llama_dataset(
"PaulGrahamEssayDataset", "./data/paul_graham"
)
Get the query and reference (ground-truth) answer for the demonstration.
query = paul_graham_rag_dataset[0].query
reference = paul_graham_rag_dataset[0].reference_answer
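To see what we are about to evaluate, you can print the sampled query and its reference answer:

query
reference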
Set up LLM and Embedding Model¶
You need to deploy the model on HuggingFace or load it locally. Here we use HF Inference Endpoints for the deployment.
We will use OpenAI's embedding model and LLM to build the index, and the Prometheus LLM for the evaluations.
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
HF_TOKEN = "YOUR HF TOKEN"
HF_ENDPOINT_URL = "YOUR HF ENDPOINT URL"
prometheus_llm = HuggingFaceInferenceAPI(
model_name=HF_ENDPOINT_URL,
token=HF_TOKEN,
temperature=0.0,
do_sample=True,
top_p=0.95,
top_k=40,
repetition_penalty=1.1,
num_output=1024,
)
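If you prefer not to use an Inference Endpoint, below is a minimal sketch for loading the model locally instead. It assumes the llama-index-llms-huggingface package is installed, that you have enough GPU memory for a 7B model, and uses the prometheus-eval/prometheus-7b-v2.0 checkpoint; adjust these to your setup.

# Alternative (sketch, assumptions noted above): load Prometheus-2 locally.
# Requires: pip install llama-index-llms-huggingface
from llama_index.llms.huggingface import HuggingFaceLLM

prometheus_llm_local = HuggingFaceLLM(
    model_name="prometheus-eval/prometheus-7b-v2.0",
    tokenizer_name="prometheus-eval/prometheus-7b-v2.0",
    max_new_tokens=1024,
    generate_kwargs={"temperature": 0.1, "do_sample": True, "top_p": 0.95},
    device_map="auto",  # place the model on the available GPU(s)
)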
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
Settings.llm = OpenAI()
Settings.embed_model = OpenAIEmbedding()
Settings.chunk_size = 512
Pairwise Evaluation¶
Build two QueryEngines for the pairwise evaluation¶
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
dataset_path = "./data/paul_graham"
rag_dataset = LabelledRagDataset.from_json(f"{dataset_path}/rag_dataset.json")
documents = SimpleDirectoryReader(
input_dir=f"{dataset_path}/source_files"
).load_data()
index = VectorStoreIndex.from_documents(documents=documents)
query_engine1 = index.as_query_engine(similarity_top_k=1)
query_engine2 = index.as_query_engine(similarity_top_k=2)
response1 = str(query_engine1.query(query))
response2 = str(query_engine2.query(query))
response1
'The author mentions using the IBM 1401 computer for programming in his early experiences. The language he used was an early version of Fortran. One of the challenges he faced was the limited input options for programs, as the only form of input was data stored on punched cards, which he did not have access to. This limitation made it difficult for him to create programs that required specific input data.'
response2
'The author mentions using the IBM 1401 computer for programming in his early experiences. The language he used was an early version of Fortran. One of the challenges he faced was the limited input options for programs, as the only form of input was data stored on punched cards, which he did not have access to. This limitation made it difficult for him to create programs that required specific input data, leading to a lack of meaningful programming experiences on the IBM 1401.'
ABS_SYSTEM_PROMPT = "You are a fair judge assistant tasked with providing clear, objective feedback based on specific criteria, ensuring each assessment reflects the absolute standards set for performance."
REL_SYSTEM_PROMPT = "You are a fair judge assistant assigned to deliver insightful feedback that compares individual performances, highlighting how each stands relative to others within the same cohort."
prometheus_pairwise_eval_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of two responses strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, choose a better response between Response A and Response B. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (A or B)"
4. Please do not generate any other opening, closing, and explanations.
###Instruction:
Your task is to compare response A and Response B and give Feedback and score [RESULT] based on Rubric for the following query.
{query}
###Response A:
{answer_1}
###Response B:
{answer_2}
###Score Rubric:
A: If Response A is better than Response B.
B: If Response B is better than Response A.
###Feedback: """
def parser_function(
    outputs: str,
) -> Tuple[Optional[bool], Optional[float], Optional[str]]:
    # Prometheus emits "Feedback: ... [RESULT] (A or B)"; split on the marker.
    parts = outputs.split("[RESULT]")
    if len(parts) == 2:
        feedback, result = parts[0].strip(), parts[1].strip()
        # Score mapping used here: 0.0 -> Response A (first) preferred,
        # 1.0 -> Response B (second) preferred.
        if result == "A":
            return True, 0.0, feedback
        elif result == "B":
            return True, 1.0, feedback
    return None, None, None
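As a quick sanity check, you can run the parser on a mock Prometheus-style output (a hypothetical string, not a real model response):

# Hypothetical output string, just to exercise the parser.
mock_output = "Feedback: Response B covers the challenges in more depth. [RESULT] B"
print(parser_function(mock_output))
# -> (True, 1.0, 'Feedback: Response B covers the challenges in more depth.')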
from llama_index.core.evaluation import PairwiseComparisonEvaluator
prometheus_pairwise_evaluator = PairwiseComparisonEvaluator(
llm=prometheus_llm,
parser_function=parser_function,
enforce_consensus=False,
eval_template=REL_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_pairwise_eval_prompt_template,
)
pairwise_result = await prometheus_pairwise_evaluator.aevaluate(
query,
response=response1,
second_response=response2,
)
pairwise_result
EvaluationResult(query='In the essay, the author mentions his early experiences with programming. Describe the first computer he used for programming, the language he used, and the challenges he faced.', contexts=None, response="\nBoth responses accurately describe the first computer the author used for programming, the language he used, and the challenges he faced. However, Response B provides a more comprehensive understanding of the challenges faced by the author. It not only mentions the limited input options but also connects this limitation to the author's lack of meaningful programming experiences on the IBM 1401. This additional context in Response B enhances the reader's understanding of the author's experiences and the impact of the challenges he faced. Therefore, based on the score rubric, Response B is better than Response A as it offers a more detailed and insightful analysis of the author's early programming experiences. \n[RESULT] B", passing=True, feedback="\nBoth responses accurately describe the first computer the author used for programming, the language he used, and the challenges he faced. However, Response B provides a more comprehensive understanding of the challenges faced by the author. It not only mentions the limited input options but also connects this limitation to the author's lack of meaningful programming experiences on the IBM 1401. This additional context in Response B enhances the reader's understanding of the author's experiences and the impact of the challenges he faced. Therefore, based on the score rubric, Response B is better than Response A as it offers a more detailed and insightful analysis of the author's early programming experiences. \n[RESULT] B", score=1.0, pairwise_source='original', invalid_result=False, invalid_reason=None)
pairwise_result.score
1.0
display(Markdown(f"<b>{pairwise_result.feedback}</b>"))
Observation:¶
Based on the feedback and the score of 1.0 produced by our parser function, the second response is preferred over the first.
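Note that we constructed the evaluator with enforce_consensus=False. LLM judges can exhibit position bias, so if you want the evaluator to re-run the comparison with the order of the responses flipped and only keep verdicts that agree, you can enable consensus enforcement; this re-queries the judge, so it roughly doubles the evaluation cost. A sketch:

# Same evaluator, but with consensus enforcement enabled to mitigate position bias.
prometheus_pairwise_evaluator_consensus = PairwiseComparisonEvaluator(
    llm=prometheus_llm,
    parser_function=parser_function,
    enforce_consensus=True,
    eval_template=REL_SYSTEM_PROMPT
    + "\n\n"
    + prometheus_pairwise_eval_prompt_template,
)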
Correctness Evaluation¶
prometheus_correctness_eval_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a query, a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is either 1 or 2 or 3 or 4 or 5. You should refer to the score rubric.
3. The output format should only look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.
5. Only evaluate on common things between generated answer and reference answer. Don't evaluate on things which are present in reference answer but not in generated answer.
###Instruction:
Your task is to evaluate the generated answer and reference answer for the following query:
{query}
###Generate answer to evaluate:
{generated_answer}
###Reference Answer (Score 5):
{reference_answer}
###Score Rubrics:
Score 1: If the generated answer is not relevant to the user query and reference answer.
Score 2: If the generated answer is according to reference answer but not relevant to user query.
Score 3: If the generated answer is relevant to the user query and reference answer but contains mistakes.
Score 4: If the generated answer is relevant to the user query and has the exact same metrics as the reference answer, but it is not as concise.
Score 5: If the generated answer is relevant to the user query and fully correct according to the reference answer.
###Feedback:"""
from typing import Tuple
import re
def parser_function(output_str: str) -> Tuple[float, str]:
# Print result to backtrack
display(Markdown(f"<b>{output_str}</b>"))
# Pattern to match the feedback and response
# This pattern looks for any text ending with '[RESULT]' followed by a number
pattern = r"(.+?) \[RESULT\] (\d)"
# Using regex to find all matches
matches = re.findall(pattern, output_str)
# Check if any match is found
if matches:
# Assuming there's only one match in the text, extract feedback and response
feedback, score = matches[0]
score = float(score.strip()) if score is not None else score
return score, feedback.strip()
else:
return None, None
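Again, a quick check of the regex-based parser on a mock output (a hypothetical string) shows how the feedback text and the integer score are split apart:

# Hypothetical output string, just to exercise the parser.
mock_output = "The answer matches the reference on all key facts. [RESULT] 5"
print(parser_function(mock_output))
# -> (5.0, 'The answer matches the reference on all key facts.')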
from llama_index.core.evaluation import (
CorrectnessEvaluator,
FaithfulnessEvaluator,
RelevancyEvaluator,
)
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
# CorrectnessEvaluator with Prometheus model
prometheus_correctness_evaluator = CorrectnessEvaluator(
llm=prometheus_llm,
parser_function=parser_function,
eval_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_correctness_eval_prompt_template,
)
correctness_result = prometheus_correctness_evaluator.evaluate(
query=query,
response=response1,
reference=reference,
)
display(Markdown(f"<b>{correctness_result.score}</b>"))
4.0
display(Markdown(f"<b>{correctness_result.passing}</b>"))
True
display(Markdown(f"<b>{correctness_result.feedback}</b>"))
The generated answer is relevant to the user query and the reference answer, as it correctly identifies the IBM 1401 as the first computer used for programming, the early version of Fortran as the programming language, and the challenge of limited input options. However, the response lacks the depth and detail found in the reference answer. For instance, it does not mention the specific age of the author when he started using the IBM 1401, nor does it provide examples of the types of programs he could not create due to the lack of input data. These omissions make the response less comprehensive than the reference answer. Therefore, while the generated answer is accurate and relevant, it is not as thorough as the reference answer. So the score is 4.
Observation:¶
According to the feedback, the generated answer is relevant to the user query and matches the reference answer on its key points, but it is not as concise, and hence the score of 4.0. Based on the threshold, it is still considered as passing (True).
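The passing flag is derived from the evaluator's score threshold (CorrectnessEvaluator treats scores of 4.0 and above as passing by default). If you only want fully correct answers to count as passing, a sketch that raises the threshold might look like:

# Stricter variant: only a 5/5 judgement counts as passing.
strict_correctness_evaluator = CorrectnessEvaluator(
    llm=prometheus_llm,
    parser_function=parser_function,
    score_threshold=5.0,
    eval_template=ABS_SYSTEM_PROMPT
    + "\n\n"
    + prometheus_correctness_eval_prompt_template,
)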
Faithfulness Evaluator¶
prometheus_faithfulness_eval_prompt_template = """###Task Description:
An instruction (might include an Input inside it), an information, a context, and a score rubric representing evaluation criteria are given.
1. You are provided with evaluation task with the help of information, context information to give result based on score rubrics.
2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate: Your task is to evaluate if the given piece of information is supported by context.
###Information:
{query_str}
###Context:
{context_str}
###Score Rubrics:
Score YES: If the given piece of information is supported by context.
Score NO: If the given piece of information is not supported by context
###Feedback:"""
prometheus_faithfulness_refine_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a information, a context information, an existing answer, and a score rubric representing a evaluation criteria are given.
1. You are provided with evaluation task with the help of information, context information and an existing answer.
2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate: If the information is present in the context and also provided with an existing answer.
###Existing answer:
{existing_answer}
###Information:
{query_str}
###Context:
{context_msg}
###Score Rubrics:
Score YES: If the existing answer is already YES or If the Information is present in the context.
Score NO: If the existing answer is NO and If the Information is not present in the context.
###Feedback: """
# FaithfulnessEvaluator with Prometheus model
prometheus_faithfulness_evaluator = FaithfulnessEvaluator(
llm=prometheus_llm,
eval_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_faithfulness_eval_prompt_template,
refine_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_faithfulness_refine_prompt_template,
)
response_vector = query_engine1.query(query)
faithfulness_result = prometheus_faithfulness_evaluator.evaluate_response(
response=response_vector
)
faithfulness_result.score
1.0
faithfulness_result.passing
True
Observation:¶
The score and passing status indicate that no hallucination was observed, i.e. the response is faithful to the retrieved context.
Relevancy Evaluator¶
prometheus_relevancy_eval_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a query with response, context, and a score rubric representing evaluation criteria are given.
1. You are provided with evaluation task with the help of a query with response and context.
2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate: Your task is to evaluate if the response for the query is in line with the context information provided.
###Query and Response:
{query_str}
###Context:
{context_str}
###Score Rubrics:
Score YES: If the response for the query is in line with the context information provided.
Score NO: If the response for the query is not in line with the context information provided.
###Feedback: """
prometheus_relevancy_refine_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a query with response, context, an existing answer, and a score rubric representing a evaluation criteria are given.
1. You are provided with evaluation task with the help of a query with response and context and an existing answer.
2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate: Your task is to evaluate if the response for the query is in line with the context information provided.
###Query and Response:
{query_str}
###Context:
{context_str}
###Score Rubrics:
Score YES: If the existing answer is already YES or If the response for the query is in line with the context information provided.
Score NO: If the existing answer is NO and If the response for the query is not in line with the context information provided.
###Feedback: """
# RelevancyEvaluator with Prometheus model
prometheus_relevancy_evaluator = RelevancyEvaluator(
llm=prometheus_llm,
eval_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_relevancy_eval_prompt_template,
refine_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_relevancy_refine_prompt_template,
)
relevancy_result = prometheus_relevancy_evaluator.evaluate_response(
query=query, response=response_vector
)
relevancy_result.score
1.0
relevancy_result.passing
True
display(Markdown(f"<b>{relevancy_result.feedback}</b>"))
Observation:¶
The feedback indicates that the response to the query is in line with the provided context information, hence a score of 1.0 and a passing status of True.
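Finally, if you want to run the faithfulness and relevancy evaluators over many queries at once rather than one response at a time, below is a minimal sketch using LlamaIndex's BatchEvalRunner. It assumes you reuse query_engine1 and a handful of queries from the dataset downloaded above; adjust the number of workers and queries to your setup.

from llama_index.core.evaluation import BatchEvalRunner

# Run both evaluators over a few queries from the dataset.
queries = [example.query for example in paul_graham_rag_dataset.examples[:5]]

runner = BatchEvalRunner(
    {
        "faithfulness": prometheus_faithfulness_evaluator,
        "relevancy": prometheus_relevancy_evaluator,
    },
    workers=4,
    show_progress=True,
)
eval_results = await runner.aevaluate_queries(query_engine1, queries=queries)

# Fraction of responses that passed each evaluator.
for name, results in eval_results.items():
    passing = sum(r.passing for r in results) / len(results)
    print(f"{name}: {passing:.0%} passing")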