August 16, 2023

How to evaluate a summarization task

In this notebook we take a deep dive into evaluation techniques for abstractive summarization using a simple example. In addition to showcasing a more novel approach that uses LLMs as evaluators, we also explore traditional evaluation methods such as ROUGE and BERTScore.

Evaluating summary quality is a time-consuming process, as it involves different quality dimensions such as coherence, conciseness, readability, and content. Traditional automatic evaluation metrics such as ROUGE and BERTScore are concrete and reliable, but they may not correlate well with the actual quality of summaries: they show relatively low correlation with human judgment, particularly for open-ended generation tasks (Liu et al., 2023). There is a growing need to rely on human evaluations, user feedback, or model-based metrics, while staying vigilant about potential biases. Although human judgment provides invaluable insights, it is often hard to scale and can be cost-prohibitive.

Beyond these traditional metrics, we showcase a method called G-Eval, which uses large language models (LLMs) as a novel, reference-free metric for assessing abstractive summaries. In this case we use gpt-4 to score candidate outputs. gpt-4 has effectively learned an internal model of language quality that allows it to distinguish fluent, coherent text from low-quality text. Harnessing this internal scoring mechanism enables automated evaluation of new candidate outputs generated by an LLM.

# Installing necessary packages for the evaluation
# rouge: For evaluating with ROUGE metric
# bert_score: For evaluating with BERTScore
# openai: To interact with OpenAI's API
!pip install rouge --quiet
!pip install bert_score --quiet
!pip install openai --quiet
from openai import OpenAI
import os
import re
import pandas as pd

# Python Implementation of the ROUGE Metric
from rouge import Rouge

# BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.
from bert_score import BERTScorer

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

Example task

In this notebook we will use the example summarization below. Note that we provide two generated summaries to compare, as well as a reference human-written summary, which evaluation metrics like ROUGE and BERTScore require.

Excerpt (excerpt):

OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.

Summaries:

Reference Summary / ref_summary (human generated):
OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges.

Eval Summary 1 / eval_summary_1 (system generated):
OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good.

Eval Summary 2 / eval_summary_2 (system generated):
OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff.

Take a moment to consider which summary you personally prefer, and which one truly captures OpenAI's mission.

excerpt = "OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges."
ref_summary = "OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges."
eval_summary_1 = "OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good."
eval_summary_2 = "OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff."

Evaluating using ROUGE

ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation, primarily measures the overlap of words between a generated output and a reference text. It is a prevalent metric for evaluating automatic summarization tasks. Among its variants, ROUGE-L scores the longest common subsequence (in-order, though not necessarily contiguous, word matches) between the system-generated and reference summaries, gauging how well the system retains the essence of the reference.
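To make the longest-common-subsequence idea concrete, here is a minimal, self-contained sketch of a ROUGE-L F1 computed over whitespace tokens. It is an illustration only, independent of the rouge package used in the next cell, which applies its own tokenization and handling of multi-sentence inputs.

# Illustrative ROUGE-L F1 over whitespace tokens (a sketch, not the rouge package)
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence over token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def simple_rouge_l_f1(candidate, reference):
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# The LCS here is "to benefit all of humanity", giving F1 of roughly 0.67
print(simple_rouge_l_f1(
    "OpenAI aims to benefit all of humanity",
    "OpenAI's mission is to benefit all of humanity",
))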

# function to calculate the Rouge score
def get_rouge_scores(text1, text2):
    rouge = Rouge()
    return rouge.get_scores(text1, text2)


rouge_scores_out = []

# Calculate the ROUGE scores for both summaries using reference
eval_1_rouge = get_rouge_scores(eval_summary_1, ref_summary)
eval_2_rouge = get_rouge_scores(eval_summary_2, ref_summary)

for metric in ["rouge-1", "rouge-2", "rouge-l"]:
    for label in ["F-Score"]:
        eval_1_score = eval_1_rouge[0][metric][label[0].lower()]
        eval_2_score = eval_2_rouge[0][metric][label[0].lower()]

        row = {
            "Metric": f"{metric} ({label})",
            "Summary 1": eval_1_score,
            "Summary 2": eval_2_score,
        }
        rouge_scores_out.append(row)


def highlight_max(s):
    is_max = s == s.max()
    return [
        "background-color: lightgreen" if v else "background-color: white"
        for v in is_max
    ]


rouge_scores_out = (
    pd.DataFrame(rouge_scores_out)
    .set_index("Metric")
    .style.apply(highlight_max, axis=1)
)

rouge_scores_out
  Summary 1 Summary 2
Metric
rouge-1 (F-Score) 0.488889 0.511628
rouge-2 (F-Score) 0.230769 0.163265
rouge-l (F-Score) 0.488889 0.511628

The table shows the ROUGE scores for evaluating the two different summaries against the reference text. For rouge-1, Summary 2 outperforms Summary 1, indicating a better overlap of individual words; and for rouge-l, Summary 2 also has the higher score, implying a closer match in the longest common subsequence and thus potentially a better job of capturing the main content and order of the original text. Since Summary 2 reuses more words and phrase fragments directly from the excerpt, its overlap with the reference summary is likely higher, leading to higher ROUGE scores.

While ROUGE and similar metrics, such as BLEU and METEOR, offer quantitative measures, they often fail to capture the true essence of a well-generated summary, and they correlate poorly with human judgments. Given the advances of LLMs, which are adept at producing fluent and coherent summaries, traditional metrics like ROUGE may inadvertently penalize these models. This is especially true when summaries are articulated differently yet still encapsulate the core information accurately.

Evaluating using BERTScore

ROUGE relies on the exact presence of words in both the predicted and reference texts and fails to account for their underlying semantics. This is where BERTScore comes in: it leverages the contextual embeddings from the BERT model and aims to evaluate the similarity between a predicted and a reference sentence in the context of machine-generated text. By comparing the embeddings of the two sentences, BERTScore captures semantic similarities that traditional n-gram based metrics might miss.
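To illustrate the mechanism before calling the library, here is a simplified, hedged sketch of the greedy cosine-matching idea behind BERTScore. It assumes the bert-base-uncased checkpoint and omits parts of the real method (IDF weighting, baseline rescaling, special-token handling, and the roberta-large default used by the scorer below).

# Simplified sketch of BERTScore's core idea: embed each token with a contextual
# encoder, then greedily match candidate and reference tokens by cosine similarity.
# This is NOT the bert_score library's implementation (assumed bert-base-uncased;
# no IDF weighting or baseline rescaling, and special tokens are included).
import torch
from transformers import AutoModel, AutoTokenizer


def simple_bertscore_f1(candidate, reference, model_name="bert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    with torch.no_grad():
        cand_emb = model(**tokenizer(candidate, return_tensors="pt")).last_hidden_state[0]
        ref_emb = model(**tokenizer(reference, return_tensors="pt")).last_hidden_state[0]
    # Normalize so that dot products are cosine similarities
    cand_emb = torch.nn.functional.normalize(cand_emb, dim=-1)
    ref_emb = torch.nn.functional.normalize(ref_emb, dim=-1)
    similarity = cand_emb @ ref_emb.T
    precision = similarity.max(dim=1).values.mean()  # best match for each candidate token
    recall = similarity.max(dim=0).values.mean()  # best match for each reference token
    return (2 * precision * recall / (precision + recall)).item()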

# Instantiate the BERTScorer object for English language
scorer = BERTScorer(lang="en")

# Calculate BERTScore for summary 1 against the reference summary
# P1, R1, F1_1 represent Precision, Recall, and F1 Score respectively
P1, R1, F1_1 = scorer.score([eval_summary_1], [ref_summary])

# Calculate BERTScore for summary 2 against the reference summary
# P2, R2, F2_2 represent Precision, Recall, and F1 Score respectively
P2, R2, F2_2 = scorer.score([eval_summary_2], [ref_summary])

print("Summary 1 F1 Score:", F1_1.tolist()[0])
print("Summary 2 F1 Score:", F2_2.tolist()[0])
Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Summary 1 F1 Score: 0.9227314591407776
Summary 2 F1 Score: 0.9189572930335999

The close F1 scores between the two summaries indicate that they may perform similarly in capturing the key information. However, this small difference should be interpreted with caution. Because BERTScore may not fully grasp the subtleties and higher-level concepts that a human evaluator would understand, relying solely on this metric could lead to misinterpreting the actual quality and nuances of the summaries. An integrated approach that combines BERTScore with human judgment and other metrics is likely to provide a more reliable evaluation.

Evaluating using GPT-4

Here we implement an example of a reference-free text evaluator using gpt-4, inspired by the G-Eval framework, which evaluates the quality of generated text using large language models. Unlike metrics such as ROUGE or BERTScore that rely on comparison against reference summaries, the gpt-4 based evaluator assesses the quality of generated content purely from the input prompt and the text itself, without any ground-truth references. This makes it applicable to new datasets and tasks where human references are sparse or unavailable.

Here is an overview of the approach:

  1. We define four distinct criteria:
    1. Relevance: Evaluates whether the summary includes only important information and excludes redundancies.
    2. Coherence: Assesses the logical flow and organization of the summary.
    3. Consistency: Checks whether the summary agrees with the facts in the source document.
    4. Fluency: Rates the grammar and readability of the summary.
  2. We craft a prompt for each of these criteria, taking the original document and the summary as inputs, and leveraging chain-of-thought generation to guide the model to output a numeric score from 1 to 5 for each criterion.
  3. We generate scores from gpt-4 with the defined prompts and compare them across summaries.

For this demonstration we use a direct scoring function in which gpt-4 generates a discrete score (1-5) for each metric. Normalizing the scores and taking a weighted sum could yield a more robust, continuous score that better reflects the quality and diversity of the summaries.
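As a sketch of that weighted-sum idea, the snippet below weights each candidate score from 1 to 5 by the probability the model assigns to its token, which is closer to the scoring used in the G-Eval paper. It assumes the Chat Completions logprobs / top_logprobs parameters are available for the chosen model, and it is not part of the scoring code used in the rest of this notebook.

# Hedged sketch: probability-weighted score instead of a single sampled score.
# Assumes the Chat Completions logprobs / top_logprobs parameters are supported
# for the chosen model; not used by the rest of this notebook.
import math


def get_weighted_score(prompt, model="gpt-4"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=5,
        logprobs=True,
        top_logprobs=5,  # top alternatives for the first generated token
    )
    top_alternatives = response.choices[0].logprobs.content[0].top_logprobs
    weighted, total_prob = 0.0, 0.0
    for alt in top_alternatives:
        token = alt.token.strip()
        if token in {"1", "2", "3", "4", "5"}:
            prob = math.exp(alt.logprob)
            weighted += int(token) * prob
            total_prob += prob
    # Renormalize over the score tokens that appear among the top alternatives
    return weighted / total_prob if total_prob > 0 else float("nan")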

# Evaluation prompt template based on G-Eval
EVALUATION_PROMPT_TEMPLATE = """
You will be given one summary written for an article. Your task is to rate the summary on one metric.
Please make sure you read and understand these instructions very carefully. 
Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

{criteria}

Evaluation Steps:

{steps}

Example:

Source Text:

{document}

Summary:

{summary}

Evaluation Form (scores ONLY):

- {metric_name}
"""

# Metric 1: Relevance

RELEVANCY_SCORE_CRITERIA = """
Relevance(1-5) - selection of important content from the source. \
The summary should include only important information from the source document. \
Annotators were instructed to penalize summaries which contained redundancies and excess information.
"""

RELEVANCY_SCORE_STEPS = """
1. Read the summary and the source document carefully.
2. Compare the summary to the source document and identify the main points of the article.
3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.
4. Assign a relevance score from 1 to 5.
"""

# Metric 2: Coherence

COHERENCE_SCORE_CRITERIA = """
Coherence(1-5) - the collective quality of all sentences. \
We align this dimension with the DUC quality question of structure and coherence \
whereby "the summary should be well-structured and well-organized. \
The summary should not just be a heap of related information, but should build from sentence to a\
coherent body of information about a topic."
"""

COHERENCE_SCORE_STEPS = """
1. Read the article carefully and identify the main topic and key points.
2. Read the summary and compare it to the article. Check if the summary covers the main topic and key points of the article,
and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.
"""

# Metric 3: Consistency

CONSISTENCY_SCORE_CRITERIA = """
Consistency(1-5) - the factual alignment between the summary and the summarized source. \
A factually consistent summary contains only statements that are entailed by the source document. \
Annotators were also asked to penalize summaries that contained hallucinated facts.
"""

CONSISTENCY_SCORE_STEPS = """
1. Read the article carefully and identify the main facts and details it presents.
2. Read the summary and compare it to the article. Check if the summary contains any factual errors that are not supported by the article.
3. Assign a score for consistency based on the Evaluation Criteria.
"""

# Metric 4: Fluency

FLUENCY_SCORE_CRITERIA = """
Fluency(1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.
1: Poor. The summary has many errors that make it hard to understand or sound unnatural.
2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.
3: Good. The summary has few or no errors and is easy to read and follow.
"""

FLUENCY_SCORE_STEPS = """
Read the summary and evaluate its fluency based on the given criteria. Assign a fluency score from 1 to 3.
"""


def get_geval_score(
    criteria: str, steps: str, document: str, summary: str, metric_name: str
):
    prompt = EVALUATION_PROMPT_TEMPLATE.format(
        criteria=criteria,
        steps=steps,
        metric_name=metric_name,
        document=document,
        summary=summary,
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=5,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response.choices[0].message.content


evaluation_metrics = {
    "Relevance": (RELEVANCY_SCORE_CRITERIA, RELEVANCY_SCORE_STEPS),
    "Coherence": (COHERENCE_SCORE_CRITERIA, COHERENCE_SCORE_STEPS),
    "Consistency": (CONSISTENCY_SCORE_CRITERIA, CONSISTENCY_SCORE_STEPS),
    "Fluency": (FLUENCY_SCORE_CRITERIA, FLUENCY_SCORE_STEPS),
}

summaries = {"Summary 1": eval_summary_1, "Summary 2": eval_summary_2}

data = {"Evaluation Type": [], "Summary Type": [], "Score": []}

for eval_type, (criteria, steps) in evaluation_metrics.items():
    for summ_type, summary in summaries.items():
        data["Evaluation Type"].append(eval_type)
        data["Summary Type"].append(summ_type)
        result = get_geval_score(criteria, steps, excerpt, summary, eval_type)
        score_num = int(result.strip())
        data["Score"].append(score_num)

pivot_df = pd.DataFrame(data, index=None).pivot(
    index="Evaluation Type", columns="Summary Type", values="Score"
)
styled_pivot_df = pivot_df.style.apply(highlight_max, axis=1)
display(styled_pivot_df)
Summary Type Summary 1 Summary 2
Evaluation Type
Coherence 5 3
Consistency 5 5
Fluency 3 2
Relevance 5 4

Overall, Summary 1 appears to outperform Summary 2 in three of the four categories (Coherence, Relevance, and Fluency). The two summaries are judged equally consistent with the source. Based on the given evaluation criteria, the results suggest that Summary 1 is generally the better choice.

Limitations

Note that LLM-based metrics can be biased toward preferring LLM-generated text over human-written text. In addition, LLM-based metrics are sensitive to the system message / prompt. We recommend experimenting with other techniques that can help improve performance and/or yield more consistent scores, striking the right balance between high-quality but expensive human evaluation and automated evaluation. It is also worth noting that this scoring methodology is currently limited by gpt-4's context window.

Conclusion

Evaluating abstractive summarization remains an open area ripe for further improvement. Traditional metrics such as ROUGE, BLEU, and BERTScore provide useful automatic evaluation but have limitations in capturing semantic similarity and the nuanced aspects of summarization quality. Moreover, they require reference outputs, which can be expensive to collect or label. LLM-based metrics offer promise as a reference-free way to measure coherence, fluency, and relevance; however, they may carry a bias toward text generated by LLMs. Ultimately, a combination of automatic metrics and human evaluation is ideal for reliably assessing abstractive summarization systems. While human evaluation is indispensable for a comprehensive understanding of summary quality, it should be complemented with automated evaluation to enable efficient, large-scale testing. The field will continue to evolve toward more robust evaluation techniques that balance quality, scalability, and fairness. Advancing evaluation methods is crucial for driving progress in production applications.

References