Answer Relevancy and Context Relevancy Evaluations
In this notebook, we demonstrate how to use the AnswerRelevancyEvaluator and ContextRelevancyEvaluator classes to measure, respectively, the relevancy of a generated answer to the user query and the relevancy of the retrieved contexts to that query. Both evaluators return a score between 0 and 1 as well as a generated feedback that explains the score; a higher score means higher relevancy. Specifically, we prompt the judge LLM to take a step-by-step approach when producing the relevancy score, asking it to answer the following two questions about the relevancy of a generated answer to the query (for context relevancy, these questions are slightly adjusted):
- Does the provided response match the subject matter of the user's query?
- Does the provided response attempt to address the focus or perspective on the subject matter taken on by the user's query?
Each question is worth 1 point, so a perfect evaluation yields a score of 2/2, which is then normalized to the reported 0 to 1 score (see the short sketch below).
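To make the mapping from the judge's raw tally to the reported score concrete, here is a minimal sketch of the normalization implied by the evaluation outputs shown later in this notebook. It is an illustration only; the normalize helper is hypothetical and not part of the evaluator classes.
# Hypothetical helper: divide the judge's raw tally by the maximum attainable total.
def normalize(raw_total: float, max_total: float) -> float:
    return raw_total / max_total


# Figures taken from the evaluation outputs shown later in this notebook:
# answer relevancy is graded out of 2, context relevancy out of 4.
assert normalize(2.0, 2.0) == 1.0    # a perfect answer relevancy evaluation ("2/2")
assert normalize(1.5, 4.0) == 0.375  # a partial context relevancy result ("[RESULT] 1.5")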
%pip install llama-index-llms-openai
import nest_asyncio
from tqdm.asyncio import tqdm_asyncio

# allow awaiting coroutines inside the notebook's already-running event loop
nest_asyncio.apply()
def displayify_df(df):
"""For pretty displaying DataFrame in a notebook."""
display_df = df.style.set_properties(
**{
"inline-size": "300px",
"overflow-wrap": "break-word",
}
)
display(display_df)
Download the Dataset (LabelledRagDataset)
For this demonstration, we will use a llama-dataset provided through llama-hub.
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core import VectorStoreIndex
# download and install dependencies for benchmark dataset
rag_dataset, documents = download_llama_dataset(
"EvaluatingLlmSurveyPaperDataset", "./data"
)
rag_dataset.to_pandas()[:5]
 | query | reference_contexts | reference_answer | reference_answer_by | query_by |
---|---|---|---|---|---|
0 | What are the potential risks associated with... | [Evaluating Large Language Models: A Comprehen... | Based on the context information, the potentia... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
1 | How does the survey categorize the evaluation... | [Evaluating Large Language Models: A Comprehen... | The survey categorizes the evaluation of LLMs... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
2 | What are the different types of reasoning disc... | [Contents\n1 Introduction 4\n2 Taxonomy and Ro... | The different types of reasoning discussed in... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
3 | How is toxicity evaluated in language models... | [Contents\n1 Introduction 4\n2 Taxonomy and Ro... | Toxicity in language models is evaluated based... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
4 | In the context of specialized LLMs evaluation,... | [5.1.3 Alignment Robustness . . . . . . . . . ... | In the context of specialized LLMs evaluation,... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
Next, we build a RAG over the same source documents that were used to create the rag_dataset.
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
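The defaults above are sufficient for this benchmark. If you want to experiment with how retrieval depth affects context relevancy, as_query_engine accepts the usual keyword arguments; the similarity_top_k value below is an illustrative choice, not one used in the run that follows.
# Optional, illustrative variant: retrieve more context chunks per query.
query_engine_top4 = index.as_query_engine(similarity_top_k=4)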
With our RAG (i.e., the query_engine) defined, we can use it to make predictions (i.e., generate responses to the queries) over the rag_dataset.
# generate a response for every query in the dataset; batch_size controls how
# many examples are processed per async batch (see the progress bars below)
prediction_dataset = await rag_dataset.amake_predictions_with(
    predictor=query_engine, batch_size=100, show_progress=True
)
Batch processing of predictions: 100%|████████████████████| 100/100 [00:08<00:00, 12.12it/s]
Batch processing of predictions: 100%|████████████████████| 100/100 [00:08<00:00, 12.37it/s]
Batch processing of predictions: 100%|██████████████████████| 76/76 [00:06<00:00, 10.93it/s]
Evaluating Answer and Context Relevancy Separately
We first need to define our evaluators (i.e., AnswerRelevancyEvaluator and ContextRelevancyEvaluator):
# instantiate the judge LLMs (gpt-3.5-turbo for answer relevancy, gpt-4 for context relevancy)
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import (
AnswerRelevancyEvaluator,
ContextRelevancyEvaluator,
)
judges = {}
judges["answer_relevancy"] = AnswerRelevancyEvaluator(
llm=OpenAI(temperature=0, model="gpt-3.5-turbo"),
)
judges["context_relevancy"] = ContextRelevancyEvaluator(
llm=OpenAI(temperature=0, model="gpt-4"),
)
Now, we can use our evaluators to make evaluations by looping through all of the <example, prediction> pairs.
eval_tasks = []
for example, prediction in zip(
rag_dataset.examples, prediction_dataset.predictions
):
eval_tasks.append(
judges["answer_relevancy"].aevaluate(
query=example.query,
response=prediction.response,
sleep_time_in_seconds=1.0,
)
)
eval_tasks.append(
judges["context_relevancy"].aevaluate(
query=example.query,
contexts=prediction.contexts,
sleep_time_in_seconds=1.0,
)
)
eval_results1 = await tqdm_asyncio.gather(*eval_tasks[:250])
100%|█████████████████████████████████████████████████████| 250/250 [00:28<00:00, 8.85it/s]
eval_results2 = await tqdm_asyncio.gather(*eval_tasks[250:])
100%|█████████████████████████████████████████████████████| 302/302 [00:31<00:00, 9.62it/s]
eval_results = eval_results1 + eval_results2
# the tasks were appended in (answer_relevancy, context_relevancy) pairs,
# so de-interleave the gathered results with stride-2 slices
evals = {
    "answer_relevancy": eval_results[::2],
    "context_relevancy": eval_results[1::2],
}
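As an optional sanity check (a minimal sketch, not part of the original run), each example should have contributed exactly one result to each list:
# every <example, prediction> pair yields one answer-relevancy and one context-relevancy result
assert len(evals["answer_relevancy"]) == len(rag_dataset.examples)
assert len(evals["context_relevancy"]) == len(rag_dataset.examples)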
Taking a Look at the Evaluation Results
Here we use a utility function to convert the list of EvaluationResult objects into something more notebook-friendly. This utility provides two DataFrames: a deep one that contains all of the evaluation results, and a mean one that aggregates by taking the mean of all the scores, per evaluation method.
from llama_index.core.evaluation.notebook_utils import get_eval_results_df
import pandas as pd
deep_dfs = {}
mean_dfs = {}
for metric in evals.keys():
deep_df, mean_df = get_eval_results_df(
names=["baseline"] * len(evals[metric]),
results_arr=evals[metric],
metric=metric,
)
deep_dfs[metric] = deep_df
mean_dfs[metric] = mean_df
mean_scores_df = pd.concat(
[mdf.reset_index() for _, mdf in mean_dfs.items()],
axis=0,
ignore_index=True,
)
mean_scores_df = mean_scores_df.set_index("index")
mean_scores_df.index = mean_scores_df.index.set_names(["metrics"])
mean_scores_df
rag | baseline |
---|---|
metrics | |
mean_answer_relevancy_score | 0.914855 |
mean_context_relevancy_score | 0.572273 |
The above utility also provides the mean scores across all of the evaluations in mean_df.
We can get a look at the raw distribution of the scores by invoking value_counts() on the deep_df.
deep_dfs["answer_relevancy"]["scores"].value_counts()
scores
1.0    250
0.0     21
0.5      5
Name: count, dtype: int64
deep_dfs["context_relevancy"]["scores"].value_counts()
scores
1.000    89
0.000    70
0.750    49
0.250    23
0.625    14
0.500    11
0.375    10
0.875     9
Name: count, dtype: int64
It looks like, for the most part, the default RAG does fairly well at generating answers that are relevant to the query. A deeper look is possible by viewing the records of any of the deep_df's.
displayify_df(deep_dfs["context_relevancy"].head(2))
 | rag | query | answer | contexts | scores | feedbacks |
---|---|---|---|---|---|---|
0 | baseline | What are the potential risks associated with large language models (LLMs) according to the context information? | None | ['Evaluating Large Language Models: A\nComprehensive Survey\nZishan Guo∗, Renren Jin∗, Chuang Liu∗, Yufei Huang, Dan Shi, Supryadi\nLinhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong†\nTianjin University\n{guozishan, rrjin, liuc_09, yuki_731, shidan, supryadi}@tju.edu.cn\n{linhaoyu, yan_liu, jiaxuanlee, xbj1355, dyxiong}@tju.edu.cn\nAbstract\nLarge language models (LLMs) have demonstrated remarkable capabilities\nacross a broad spectrum of tasks. They have attracted significant attention\nand been deployed in numerous downstream applications. Nevertheless, akin\nto a double-edged sword, LLMs also present potential risks. They could\nsuffer from private data leaks or yield inappropriate, harmful, or misleading\ncontent. Additionally, the rapid progress of LLMs raises concerns about the\npotential emergence of superintelligent systems without adequate safeguards.\nTo effectively capitalize on LLM capacities as well as ensure their safe and\nbeneficial development, it is critical to conduct a rigorous and comprehensive\nevaluation of LLMs.\nThis survey endeavors to offer a panoramic perspective on the evaluation\nof LLMs. We categorize the evaluation of LLMs into three major groups:\nknowledgeandcapabilityevaluation, alignmentevaluationandsafetyevaluation.\nIn addition to the comprehensive review on the evaluation methodologies and\nbenchmarks on these three aspects, we collate a compendium of evaluations\npertaining to LLMs’ performance in specialized domains, and discuss the\nconstruction of comprehensive evaluation platforms that cover LLM evaluations\non capabilities, alignment, safety, and applicability.\nWe hope that this comprehensive overview will stimulate further research\ninterests in the evaluation of LLMs, with the ultimate goal of making evaluation\nserve as a cornerstone in guiding the responsible development of LLMs. We\nenvision that this will channel their evolution into a direction that maximizes\nsocietal benefit while minimizing potential risks. A curated list of related\npapers has been publicly available at a GitHub repository.1\n∗Equal contribution\n†Corresponding author.\n1https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers\n1arXiv:2310.19736v3 [cs.CL] 25 Nov 2023', 'criteria. Multilingual Holistic Bias (Costa-jussà et al., 2023) extends the HolisticBias dataset\nto 50 languages, achieving the largest scale of English template-based text expansion.\nWhether using automatic or manual evaluations, both approaches inevitably carry human\nsubjectivity and cannot establish a comprehensive and fair evaluation standard. Unqover\n(Li et al., 2020) is the first to transform the task of evaluating biases generated by models\ninto a multiple-choice question, covering gender, nationality, race, and religion categories.\nThey provide models with ambiguous and disambiguous contexts and ask them to choose\nbetween options with and without stereotypes, evaluating both PLMs and models fine-tuned\non multiple-choice question answering datasets. BBQ (Parrish et al., 2022) adopts this\napproach but extends the types of biases to nine categories. All sentence templates are\nmanually created, and in addition to the two contrasting group answers, the model is also\nprovided with correct answers like “I don’t know” and “I’m not sure”, and a statistical bias\nscore metric is proposed to evaluate multiple question answering models. 
CBBQ (Huang\n& Xiong, 2023) extends BBQ to Chinese. Based on Chinese socio-cultural factors, CBBQ\nadds four categories: disease, educational qualification, household registration, and region.\nThey manually rewrite ambiguous text templates and use GPT-4 to generate disambiguous\ntemplates, greatly increasing the dataset’s diversity and extensibility. Additionally, they\nimprove the experimental setup for LLMs and evaluate existing Chinese open-source LLMs,\nfinding that current Chinese LLMs not only have higher bias scores but also exhibit behavioral\ninconsistencies, revealing a significant gap compared to GPT-3.5-Turbo.\nIn addition to these aforementioned evaluation methods, we could also use advanced LLMs for\nscoring bias, such as GPT-4, or employ models that perform best in training bias detection\ntasks to detect the level of bias in answers. Such models can be used not only in the evaluation\nphase but also for identifying biases in data for pre-training LLMs, facilitating debiasing in\ntraining data.\nAs the development of multilingual LLMs and domain-specific LLMs progresses, studies on\nthe fairness of these models become increasingly important. Zhao et al. (2020) create datasets\nto study gender bias in multilingual embeddings and cross-lingual tasks, revealing gender\nbias from both internal and external perspectives. Moreover, FairLex (Chalkidis et al., 2022)\nproposes a multilingual legal dataset as fairness benchmark, covering four judicial jurisdictions\n(European Commission, United States, Swiss Federation, and People’s Republic of China), five\nlanguages (English, German, French, Italian, and Chinese), and various sensitive attributes\n(gender, age, region, etc.). As LLMs have been applied and deployed in the finance and legal\nsectors, these studies deserve high attention.\n4.3 Toxicity\nLLMs are usually trained on a huge amount of online data which may contain toxic behavior\nand unsafe content. These include hate speech, offensive/abusive language, pornographic\ncontent, etc. It is hence very desirable to evaluate how well trained LLMs deal with toxicity.\nConsidering the proficiency of LLMs in understanding and generating sentences, we categorize\nthe evaluation of toxicity into two tasks: toxicity identification and classification evaluation,\nand the evaluation of toxicity in generated sentences.\n29'] | 1.000000 | 1. The retrieved context does match the subject matter of the user's query. It discusses the potential risks associated with large language models (LLMs), including private data leaks, inappropriate or harmful content, and the emergence of superintelligent systems without adequate safeguards. It also discusses the potential for bias in LLMs, and the risk of toxicity in the content generated by LLMs. Therefore, it is relevant to the user's query about the potential risks associated with LLMs. (2/2) 2. The retrieved context can be used to provide a full answer to the user's query. It provides a comprehensive overview of the potential risks associated with LLMs, including data privacy, inappropriate content, superintelligence, bias, and toxicity. It also discusses the importance of evaluating these risks and the methodologies for doing so. Therefore, it provides a complete answer to the user's query. (2/2) [RESULT] 4/4 |
1 | baseline | 该调查如何对LLM评估进行分类?提到了哪三大类别? | None | ['问题回答工具学习推理知识完整性伦理与道德偏见毒性真实性鲁棒性评估风险评估生物学与医学教育立法计算机科学金融整体评估基准知识与推理基准自然语言理解与生成基准知识与能力大型语言模型评估对齐评估安全性专业LLM评估组织...图1:我们提出的LLM评估主要类别与子类别分类法。\n我们的调查扩大了范围,综合了LLM能力和对齐评估的研究成果。通过整合视角和扩大范围来补充先前调查,我们的工作全面概述了当前LLM评估研究现状。我们的调查与这两项相关工作的区别进一步凸显了本研究对文献的新贡献。\n2 分类法与路线图\n本次调查的主要目标是细致分类LLM评估,为读者提供结构化的分类框架。通过这个框架,读者可以深入理解LLM在不同关键领域的表现及伴随的挑战。\n许多研究认为LLM能力的基石在于知识和推理,这是其在众多任务中卓越表现的基础。然而,要有效应用这些能力,需要仔细检查对齐问题,确保模型输出符合用户期望。此外,LLM易受恶意利用或无意误用的脆弱性凸显了安全考量的重要性。一旦解决对齐和安全问题,LLM就可以明智地部署在专业领域,促进任务自动化并辅助智能决策。因此,我们总体的6', '本调查系统阐述了LLM的核心能力,涵盖知识和推理等关键方面。此外,我们深入探讨了对齐评估和安全评估,包括伦理问题、偏见、毒性和真实性,以确保LLM安全、可信和符合道德的应用。同时,我们探索了LLM在生物学、教育、法律、计算机科学和金融等不同领域的潜在应用。最重要的是,我们提供了一系列流行的基准评估,帮助研究人员、开发者和从业者理解并评估LLM的表现。\n我们期待本次调查能推动LLM评估的发展,为引导这些模型可控进步提供明确指导。这将使LLM能更好地服务社会和世界,确保其在各领域的应用安全、可靠且有益。我们热切期待迎接LLM发展和评估的未来挑战。\n58'] | 0.375000 | 1. 检索到的上下文确实与用户查询主题相关。用户查询是关于调查如何分类大型语言模型(LLM)评估以及提到的三大类别。提供的上下文讨论了调查中对LLM评估的分类,提到了知识和推理、对齐评估、安全评估以及在多个领域的潜在应用等方面。 2. 然而,上下文并未完整回答用户查询。虽然讨论了LLM评估的分类,但没有明确提及三大类别。上下文提到了LLM评估的多个方面,但不清楚哪些被视为三大类别。 [RESULT] 1.5 |
Of course, you may apply any filter you like. For example, suppose you want to look at the examples that received a less-than-perfect result:
cond = deep_dfs["context_relevancy"]["scores"] < 1
displayify_df(deep_dfs["context_relevancy"][cond].head(5))
 | rag | query | answer | contexts | scores | feedbacks |
---|---|---|---|---|---|---|
1 | baseline | 该调查如何对LLM评估进行分类?提到的三大类别是什么? | None | ['问题回答工具学习推理知识完整性伦理与道德偏见毒性真实性鲁棒性评估风险评估生物学与医学教育立法计算机科学金融整体评估基准知识与推理基准自然语言理解与生成基准知识与能力大型语言模型评估对齐评估安全性专业LLM评估组织...图1:我们提出的LLM评估主要类别与子类别分类法。\n我们的调查扩大了范围,综合了LLM能力和对齐评估的研究成果。通过整合视角和扩大范围来补充先前调查,我们的工作全面概述了当前LLM评估研究的现状。我们的调查与这两项相关工作的区别进一步凸显了本研究对文献的新贡献。\n2 分类法与路线图\n本次调查的主要目标是细致分类LLM评估,为读者提供一个结构化的分类框架。通过这个框架,读者可以深入理解LLM在不同关键领域的表现及伴随的挑战。\n许多研究认为LLM能力的基石在于知识和推理,这是其在众多任务中卓越表现的基础。然而,这些能力的有效应用需要对对齐问题进行细致检查,以确保模型输出与用户期望一致。此外,LLM易受恶意利用或无意误用的脆弱性凸显了安全考量的重要性。一旦解决对齐和安全问题,LLM就可以明智地部署在专业领域,促进任务自动化并辅助智能决策。因此,我们总体\n6', '本调查系统阐述了LLM的核心能力,涵盖知识和推理等关键方面。此外,我们深入探讨了对齐评估和安全评估,包括伦理问题、偏见、毒性和真实性,以确保LLM安全、可信和符合伦理的应用。同时,我们探索了LLM在生物学、教育、法律、计算机科学和金融等不同领域的潜在应用。最重要的是,我们提供了一系列流行的基准评估,帮助研究人员、开发者和从业者理解和评估LLM的表现。\n我们期待本次调查能推动LLM评估的发展,为引导这些模型可控发展提供明确指导。这将使LLM能更好地服务社会和世界,确保其在各领域的应用安全、可靠且有益。我们热切期待迎接LLM发展和评估的未来挑战。\n58'] | 0.375000 | 1. 检索到的上下文确实与用户查询的主题相符。用户查询是关于一项调查如何对大型语言模型(LLM)评估进行分类以及提到的三大类别。提供的上下文讨论了调查中LLM评估的分类,提到了知识和推理、对齐评估、安全评估以及在多样化领域的潜在应用等方面。 2. 然而,上下文并未完整回答用户的查询。虽然讨论了LLM评估的分类,但并未明确提及三大类别。上下文提到了LLM评估的几个方面,但不清楚哪些被视为三大类别。 [RESULT] 1.5 |
9 | baseline | 这份关于LLM评估的调查与Chang等人(2023)和Liu等人(2023i)之前进行的综述有何不同? | None | ['本调查系统阐述了LLM的核心能力,涵盖知识和推理等关键方面。此外,我们深入探讨了对齐评估和安全评估,包括伦理问题、偏见、毒性和真实性,以确保LLM的安全、可信和符合伦理的应用。同时,我们探索了LLM在生物学、教育、法律、计算机科学和金融等不同领域的潜在应用。最重要的是,我们提供了一系列流行的基准评估,以帮助研究人员、开发者和从业者理解和评估LLM的性能。 我们期待这项调查能推动LLM评估的发展,为引导这些模型可控发展提供明确指导。这将使LLM能更好地服务社会和世界,确保其在各领域应用的安全、可靠和有益。我们热切期待迎接LLM发展和评估的未来挑战。 58', '(2021)\nBEGIN (Dziri et al., 2022b)\nConsisTest (Lotfi et al., 2022)\nSummarizationXSumFaith (Maynez et al., 2020)\nFactCC (Kryscinski et al., 2020)\nSummEval (Fabbri et al., 2021)\nFRANK (Pagnoni et al., 2021)\nSummaC (Laban et al., 2022)\nWang et al. (2020)\nGoyal & Durrett (2021)\nCao et al. (2022)\nCLIFF (Cao & Wang, 2021)\nAggreFact (Tang et al., 2023a)\nPolyTope (Huang et al., 2020)\nMethodsNLI-based MethodsWelleck et al. (2019)\nLotfi et al. (2022)\nFalke et al. (2019)\nLaban et al. (2022)\nMaynez et al. (2020)\nAharoni et al. (2022)\nUtama et al. (2022)\nRoit et al. (2023)\nQAQG-based MethodsFEQA (Durmus et al., 2020)\nQAGS (Wang et al., 2020)\nQuestEval (Scialom et al., 2021)\nQAFactEval (Fabbri et al., 2022)\nQ2 (Honovich et al., 2021)\nFaithDial (Dziri et al., 2022a)\nDeng et al. (2023b)\nLLMs-based MethodsFIB (Tam et al., 2023)\nFacTool (Chern et al., 2023)\nFActScore (Min et al., 2023)\nSelfCheckGPT (Manakul et al., 2023)\nSAPLMA (Azaria & Mitchell, 2023)\nLin et al. (2022b)\nKadavath et al. (2022)\nFigure 3: Overview of alignment evaluations.\n4 Alignment Evaluation\n尽管经过指令调优的LLM展现出令人印象深刻的能力,但这些对齐的LLM仍受限于标注者偏见、迎合人类、幻觉等问题。为了全面了解LLM的对齐评估,本节我们将讨论伦理、偏见、毒性和真实性等方面的评估,如图3所示。\n21'] | 0.000000 | 1. 检索到的上下文与用户查询的主题不匹配。用户查询要求比较当前LLM评估调查与Chang等人(2023)和Liu等人(2023i)之前的综述。然而上下文完全没有提及这些之前的综述,因此无法进行任何比较。故上下文与用户查询主题不匹配。(0/2) 2. 检索到的上下文不能单独用于完整回答用户查询。如上所述,上下文未提及用户查询主要关注的Chang等人和Liu等人的先前综述,因此无法提供完整答案。(0/2) [RESULT] 0.0 |
11 | baseline | According to the document, what are the two main concerns that need to be addressed before deploying LLMs within specialized domains? | None | ['This survey systematically elaborates on the core capabilities of LLMs, encompassing critical\naspects like knowledge and reasoning. Furthermore, we delve into alignment evaluation and\nsafety evaluation, including ethical concerns, biases, toxicity, and truthfulness, to ensure the\nsafe, trustworthy and ethical application of LLMs. Simultaneously, we explore the potential\napplications of LLMs across diverse domains, including biology, education, law, computer\nscience, and finance. Most importantly, we provide a range of popular benchmark evaluations\nto assist researchers, developers and practitioners in understanding and evaluating LLMs’\nperformance.\nWe anticipate that this survey would drive the development of LLMs evaluations, offering\nclear guidance to steer the controlled advancement of these models. This will enable LLMs\nto better serve the community and the world, ensuring their applications in various domains\nare safe, reliable, and beneficial. With eager anticipation, we embrace the future challenges\nof LLMs’ development and evaluation.\n58', 'objective is to delve into evaluations encompassing these five fundamental domains and their\nrespective subdomains, as illustrated in Figure 1.\nSection 3, titled “Knowledge and Capability Evaluation”, centers on the comprehensive\nassessment of the fundamental knowledge and reasoning capabilities exhibited by LLMs. This\nsection is meticulously divided into four distinct subsections: Question-Answering, Knowledge\nCompletion, Reasoning, and Tool Learning. Question-answering and knowledge completion\ntasks stand as quintessential assessments for gauging the practical application of knowledge,\nwhile the various reasoning tasks serve as a litmus test for probing the meta-reasoning and\nintricate reasoning competencies of LLMs. Furthermore, the recently emphasized special\nability of tool learning is spotlighted, showcasing its significance in empowering models to\nadeptly handle and generate domain-specific content.\nSection 4, designated as “Alignment Evaluation”, hones in on the scrutiny of LLMs’ perfor-\nmance across critical dimensions, encompassing ethical considerations, moral implications,\nbias detection, toxicity assessment, and truthfulness evaluation. The pivotal aim here is to\nscrutinize and mitigate the potential risks that may emerge in the realms of ethics, bias,\nand toxicity, as LLMs can inadvertently generate discriminatory, biased, or offensive content.\nFurthermore, this section acknowledges the phenomenon of hallucinations within LLMs, which\ncan lead to the inadvertent dissemination of false information. As such, an indispensable\nfacet of this evaluation involves the rigorous assessment of truthfulness, underscoring its\nsignificance as an essential aspect to evaluate and rectify.\nSection 5, titled “Safety Evaluation”, embarks on a comprehensive exploration of two funda-\nmental dimensions: the robustness of LLMs and their evaluation in the context of Artificial\nGeneral Intelligence (AGI). LLMs are routinely deployed in real-world scenarios, where their\nrobustness becomes paramount. Robustness equips them to navigate disturbances stemming\nfrom users and the environment, while also shielding against malicious attacks and deception,\nthereby ensuring consistent high-level performance. 
Furthermore, as LLMs inexorably ad-\nvance toward human-level capabilities, the evaluation expands its purview to encompass more\nprofound security concerns. These include but are not limited to power-seeking behaviors\nand the development of situational awareness, factors that necessitate meticulous evaluation\nto safeguard against unforeseen challenges.\nSection 6, titled “Specialized LLMs Evaluation”, serves as an extension of LLMs evaluation\nparadigm into diverse specialized domains. Within this section, we turn our attention to the\nevaluation of LLMs specifically tailored for application in distinct domains. Our selection\nencompasses currently prominent specialized LLMs spanning fields such as biology, education,\nlaw, computer science, and finance. The objective here is to systematically assess their\naptitude and limitations when confronted with domain-specific challenges and intricacies.\nSection 7, denominated “Evaluation Organization”, serves as a comprehensive introduction\nto the prevalent benchmarks and methodologies employed in the evaluation of LLMs. In light\nof the rapid proliferation of LLMs, users are confronted with the challenge of identifying the\nmost apt models to meet their specific requirements while minimizing the scope of evaluations.\nIn this context, we present an overview of well-established and widely recognized benchmark\n7'] | 0.750000 | The retrieved context does match the subject matter of the user's query. It discusses the concerns that need to be addressed before deploying LLMs within specialized domains. The two main concerns mentioned are the alignment evaluation, which includes ethical considerations, moral implications, bias detection, toxicity assessment, and truthfulness evaluation, and the safety evaluation, which includes the robustness of LLMs and their evaluation in the context of Artificial General Intelligence (AGI). However, the context does not provide a full answer to the user's query. While it does mention the two main concerns, it does not go into detail about why these concerns need to be addressed before deploying LLMs within specialized domains. The context provides a general overview of the concerns, but it does not specifically tie these concerns to the deployment of LLMs within specialized domains. [RESULT] 3.0 |
12 | baseline | 在"对齐评估"部分中,评估了哪些维度以减轻与LLMs相关的潜在风险? | None | ['本调研系统阐述了LLMs的核心能力,涵盖知识和推理等关键方面。此外,我们深入探讨了对齐评估和安全评估,包括伦理问题、偏见、毒性和真实性,以确保LLMs安全、可信和符合伦理的应用。同时,我们探索了LLMs在生物学、教育、法律、计算机科学和金融等不同领域的潜在应用。最重要的是,我们提供了一系列流行的基准评估,以帮助研究人员、开发人员和从业者理解和评估LLMs的性能。 我们预期这项调研将推动LLMs评估的发展,为引导这些模型的可控发展提供明确指导。这将使LLMs能更好地服务社会和世界,确保其在各领域的应用安全、可靠且有益。我们热切期待迎接LLMs发展和评估的未来挑战。 58', '问题 回答工具 学习 推理 知识 完成伦理 和 道德偏见 毒性 真实性鲁棒性评估 风险 评估 生物学和 医学 教育立法计算机 科学金融 全面评估基准 知识和推理基准 自然语言理解和生成基准 知识和能力 大语言模型评估对齐评估 安全 专用LLMs 评估组织 ...图1:我们提出的LLM评估主要类别和子类别分类法。 我们的调研扩大了范围,综合了LLMs能力和对齐评估的研究结果。通过整合视角和扩大范围来补充这些先前调研,我们的工作提供了当前LLM评估研究现状的全面概述。我们的调研与这两项相关工作的区别进一步突出了本研究对文献的新贡献。 2 分类法和路线图 本调研的主要目标是细致分类LLMs的评估,为读者提供一个结构良好的分类框架。通过这个框架,读者可以对LLMs在不同关键领域的表现和伴随挑战获得细致入微的理解。 许多研究认为,LLMs能力的基石在于知识和推理,这是它们在众多任务中表现出色的基础。然而,这些能力的有效应用需要仔细检查对齐问题,以确保模型输出与用户期望保持一致。此外,LLMs易受恶意利用或无意滥用的脆弱性凸显了安全考量的必要性。一旦解决了对齐和安全问题,LLMs就可以明智地部署在专业领域,促进任务自动化和智能决策。因此,我们的总体 6'] | 0.750000 | 1. 检索到的上下文确实与用户查询的主题相匹配。用户的查询是关于"对齐评估"部分中评估了哪些维度以减轻与LLMs相关的潜在风险。上下文讨论了LLMs的评估,包括对齐评估和安全评估。它提到了知识和推理、伦理问题、偏见、毒性和真实性等方面。这些是可以评估以减轻与LLMs相关潜在风险的一些维度。因此,上下文与查询相关。(2/2) 2. 然而,检索到的上下文并未提供用户查询的完整答案。虽然它提到了一些可以在对齐评估中评估的维度(如知识和推理、伦理问题、偏见、毒性和真实性),但并未明确说明这些就是用于减轻与LLMs相关潜在风险的评估维度。上下文没有提供维度的全面列表,也没有解释这些维度如何帮助减轻风险。因此,上下文不能单独用于提供用户查询的完整答案。(1/2) [RESULT] 3.0 |
14 | baseline | What is the purpose of evaluating the knowledge and capability of LLMs? | None | ['objective is to delve into evaluations encompassing these five fundamental domains and their\nrespective subdomains, as illustrated in Figure 1.\nSection 3, titled “Knowledge and Capability Evaluation”, centers on the comprehensive\nassessment of the fundamental knowledge and reasoning capabilities exhibited by LLMs. This\nsection is meticulously divided into four distinct subsections: Question-Answering, Knowledge\nCompletion, Reasoning, and Tool Learning. Question-answering and knowledge completion\ntasks stand as quintessential assessments for gauging the practical application of knowledge,\nwhile the various reasoning tasks serve as a litmus test for probing the meta-reasoning and\nintricate reasoning competencies of LLMs. Furthermore, the recently emphasized special\nability of tool learning is spotlighted, showcasing its significance in empowering models to\nadeptly handle and generate domain-specific content.\nSection 4, designated as “Alignment Evaluation”, hones in on the scrutiny of LLMs’ perfor-\nmance across critical dimensions, encompassing ethical considerations, moral implications,\nbias detection, toxicity assessment, and truthfulness evaluation. The pivotal aim here is to\nscrutinize and mitigate the potential risks that may emerge in the realms of ethics, bias,\nand toxicity, as LLMs can inadvertently generate discriminatory, biased, or offensive content.\nFurthermore, this section acknowledges the phenomenon of hallucinations within LLMs, which\ncan lead to the inadvertent dissemination of false information. As such, an indispensable\nfacet of this evaluation involves the rigorous assessment of truthfulness, underscoring its\nsignificance as an essential aspect to evaluate and rectify.\nSection 5, titled “Safety Evaluation”, embarks on a comprehensive exploration of two funda-\nmental dimensions: the robustness of LLMs and their evaluation in the context of Artificial\nGeneral Intelligence (AGI). LLMs are routinely deployed in real-world scenarios, where their\nrobustness becomes paramount. Robustness equips them to navigate disturbances stemming\nfrom users and the environment, while also shielding against malicious attacks and deception,\nthereby ensuring consistent high-level performance. Furthermore, as LLMs inexorably ad-\nvance toward human-level capabilities, the evaluation expands its purview to encompass more\nprofound security concerns. These include but are not limited to power-seeking behaviors\nand the development of situational awareness, factors that necessitate meticulous evaluation\nto safeguard against unforeseen challenges.\nSection 6, titled “Specialized LLMs Evaluation”, serves as an extension of LLMs evaluation\nparadigm into diverse specialized domains. Within this section, we turn our attention to the\nevaluation of LLMs specifically tailored for application in distinct domains. Our selection\nencompasses currently prominent specialized LLMs spanning fields such as biology, education,\nlaw, computer science, and finance. The objective here is to systematically assess their\naptitude and limitations when confronted with domain-specific challenges and intricacies.\nSection 7, denominated “Evaluation Organization”, serves as a comprehensive introduction\nto the prevalent benchmarks and methodologies employed in the evaluation of LLMs. 
In light\nof the rapid proliferation of LLMs, users are confronted with the challenge of identifying the\nmost apt models to meet their specific requirements while minimizing the scope of evaluations.\nIn this context, we present an overview of well-established and widely recognized benchmark\n7', 'evaluations. This serves the purpose of aiding users in making judicious and well-informed\ndecisions when selecting an appropriate LLM for their particular needs.\nPleasebeawarethatourtaxonomyframeworkdoesnotpurporttocomprehensivelyencompass\nthe entirety of the evaluation landscape. In essence, our aim is to address the following\nfundamental questions:\n•What are the capabilities of LLMs?\n•What factors must be taken into account when deploying LLMs?\n•In which domains can LLMs find practical applications?\n•How do LLMs perform in these diverse domains?\nWe will now embark on an in-depth exploration of each category within the LLM evaluation\ntaxonomy, sequentially addressing capabilities, concerns, applications, and performance.\n3 Knowledge and Capability Evaluation\nEvaluating the knowledge and capability of LLMs has become an important research area as\nthese models grow in scale and capability. As LLMs are deployed in more applications, it is\ncrucial to rigorously assess their strengths and limitations across a diverse range of tasks and\ndatasets. In this section, we aim to offer a comprehensive overview of the evaluation methods\nand benchmarks pertinent to LLMs, spanning various capabilities such as question answering,\nknowledge completion, reasoning, and tool use. Our objective is to provide an exhaustive\nsynthesis of the current advancements in the systematic evaluation and benchmarking of\nLLMs’ knowledge and capabilities, as illustrated in Figure 2.\n3.1 Question Answering\nQuestionansweringisaveryimportantmeansforLLMsevaluation, andthequestionanswering\nability of LLMs directly determines whether the final output can meet the expectation. At\nthe same time, however, since any form of LLMs evaluation can be regarded as question\nanswering or transfer to question answering form, there are rare datasets and works that\npurely evaluate question answering ability of LLMs. Most of the datasets are curated to\nevaluate other capabilities of LLMs.\nTherefore, we believe that the datasets simply used to evaluate the question answering ability\nof LLMs must be from a wide range of sources, preferably covering all fields rather than\naiming at some fields, and the questions do not need to be very professional but general.\nAccording to the above criteria for datasets focusing on question answering capability, we can\nfind that many datasets are qualified, e.g., SQuAD (Rajpurkar et al., 2016), NarrativeQA\n(Kociský et al., 2018), HotpotQA (Yang et al., 2018), CoQA (Reddy et al., 2019). Although\nthese datasets predate LLMs, they can still be used to evaluate the question answering ability\nof LLMs. Kwiatkowski et al. (2019) present the Natural Questions corpus. The questions\n8'] | 0.750000 | The retrieved context is relevant to the user's query as it discusses the purpose of evaluating the knowledge and capability of LLMs (Large Language Models). It explains that the evaluation is important to assess their strengths and limitations across a diverse range of tasks and datasets. The context also mentions the different aspects of LLMs that are evaluated, such as question answering, knowledge completion, reasoning, and tool use. However, the context does not fully answer the user's query. 
While it does provide a general idea of why LLMs are evaluated, it does not delve into the specific purpose of these evaluations. For instance, it does not explain how these evaluations can help improve the performance of LLMs, or how they can be used to identify areas where LLMs may need further development or training. [RESULT] 3.0 |
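If you would rather read the judge's reasoning for the weakest retrievals than scan the full table, you can also sort by score and print the feedback directly. This is a small optional sketch; the column names used below follow the deep_df layout shown above.
# print the judge's feedback for the three lowest-scoring context-relevancy rows
worst = deep_dfs["context_relevancy"].sort_values("scores").head(3)
for _, row in worst.iterrows():
    print("Query:", row["query"])
    print("Score:", row["scores"])
    print("Feedback:", row["feedbacks"])
    print()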