How to Assess the Utility of LLM-powered Applications?

November 20, 2023 · 10 min read

Julia Kiseleva

Senior Researcher at Microsoft Research

Negar Arabzadeh

PhD student at the University of Waterloo

Fig.1: A verification framework

Fig.1 illustrates the general flow of AgentEval

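TL;DR:

As a developer of an LLM-powered application, how can you assess the utility it brings to end users while helping them with their tasks?
To shed light on the question above, we introduce AgentEval, the first version of a framework for assessing the utility of any LLM-powered application crafted to assist users in specific tasks. AgentEval aims to simplify the evaluation process by automatically proposing a set of criteria tailored to the unique purpose of your application. This allows for a comprehensive assessment that quantifies the utility of your application against the suggested criteria.
We demonstrate how AgentEval works in the following notebook, using the math problems dataset as an example. Any feedback would be useful for future development. Please contact us on our Discord.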

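Introduction

AutoGen aims to simplify the development of LLM-powered multi-agent systems for various applications, ultimately making end users' lives easier by assisting with their tasks. Next, we all want to understand how our developed systems perform, how useful they are to users and, perhaps most crucially, how we can improve them. Directly evaluating multi-agent systems is challenging because current approaches rely predominantly on success metrics: essentially, whether the agent accomplishes the task. However, understanding how users interact with a system involves far more than success alone. Take math problems as an example: it is not only about whether the agent solves the problem. Equally important is its ability to convey the solution according to various criteria, including completeness, conciseness, and the clarity of the explanation provided. Furthermore, success is not always clearly defined for every task.

Rapid advances in LLMs and multi-agent systems have brought forth many emerging capabilities that we are keen to translate into tangible utility for end users. We introduce the first version of the AgentEval framework, a tool crafted to help developers swiftly gauge the utility of LLM-powered applications designed to help end users accomplish their desired tasks.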

Fig.2: An overview of the tasks taxonomy

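Fig.2 provides an overview of the task taxonomy

First, let us look at an overview of the suggested task taxonomy that a multi-agent system can be designed for. In general, tasks can be split into two types, where: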

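- Success is not clearly defined - refers to instances where users use a system in an assistive manner, seeking suggestions rather than expecting the system to solve the task. For example, a user might ask the system to generate an email. In many cases, the generated content serves as a template that the user will edit later. However, defining success precisely for such tasks is relatively complex.
- Success is clearly defined - refers to instances where we can clearly determine whether a system solved the task or not. Consider agents that assist in accomplishing household tasks, where the definition of success is clear and measurable. This category can be further divided into two separate subcategories:
  - The optimal solution exists - these are tasks where only one solution is possible. For example, if you ask your assistant to turn on the light, the success of this task is clearly defined, and there is only one way to accomplish it.
  - Multiple solutions exist - increasingly, we observe situations where multiple trajectories of agent behavior can lead to either success or failure. In such cases, it is crucial to differentiate between the various successful and unsuccessful trajectories. For example, when you ask the agent to suggest a food recipe or tell you a joke.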

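In our AgentEval framework, we currently focus on tasks where success is clearly defined. Next, we introduce the suggested framework.

`AgentEval` Framework

Our previous research on assistive agents in Minecraft suggested that the most effective way to obtain human judgments is to present two agents side by side and ask for preferences. In this pairwise comparison setup, humans can develop criteria that explain why they prefer the behavior of one agent over another. For example, "the first agent was faster in execution," or "the second agent moves more naturally." The comparative nature of the setup thus leads humans to come up with a list of criteria that help infer the utility of the task. With this idea in mind, we designed AgentEval (shown in Fig. 1), where we employ LLMs to help us understand, verify, and assess task utility for the multi-agent system. Namely:

The goal of CriticAgent is to suggest a list of criteria (Fig. 1) that can be used to assess task utility. This is an example of how CriticAgent is defined using AutoGen: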

import autogen

# Assumption: config_list holds your model configurations and is defined
# elsewhere, for example loaded from a local configuration file.
critic = autogen.AssistantAgent(
    name="critic",
    llm_config={"config_list": config_list},
    # The system message asks for criteria as a dictionary so the reply can be parsed.
    system_message="""You are a helpful assistant. You suggest criteria for evaluating different tasks. They should be distinguishable, quantifiable, and not redundant.
    Convert the evaluation criteria into a dictionary where the keys are the criteria.
    The value of each key is a dictionary as follows {"description": criteria description, "accepted_values": possible accepted inputs for this key}
    Make sure the keys are criteria for assessing the given task. "accepted_values" include the acceptable inputs for each key that are fine-grained and preferably multi-graded levels. "description" includes the criterion description.
    Return only the dictionary."""
)

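Next, the critic is given successful and failed examples of the task execution; it is then able to return a list of criteria (Fig. 1). For reference, use the following notebook.

The goal of QuantifierAgent is to quantify each of the suggested criteria (Fig. 1), giving us an idea of the utility of this system for the given task. Here is an example of how it can be defined: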
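The step just described can be sketched in plain Python. This is a minimal illustration, not the notebook's code: `build_critic_prompt` and `parse_criteria` are hypothetical helper names, the prompt layout is an assumption, and the sample reply is hand-written to match the dictionary shape requested in the critic's system message. How the prompt is actually sent to the critic is left out.

```python
import json


def build_critic_prompt(task_description, successful_example, failed_example):
    """Assemble the message for the critic (hypothetical layout, not from the post)."""
    return (
        f"Task: {task_description}\n\n"
        f"Successful example:\n{successful_example}\n\n"
        f"Failed example:\n{failed_example}"
    )


def parse_criteria(reply):
    """Parse the critic's reply into a criteria dictionary and check its shape."""
    # A model may wrap the dictionary in extra text, so keep only the outermost braces.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no dictionary found in critic reply")
    criteria = json.loads(reply[start : end + 1])
    for name, spec in criteria.items():
        if not isinstance(spec, dict):
            raise ValueError(f"criterion {name!r} is not a dictionary")
        if "description" not in spec or "accepted_values" not in spec:
            raise ValueError(f"criterion {name!r} is missing a required field")
    return criteria


# Hand-written reply for illustration only.
sample_reply = """Here is the dictionary:
{"Calculation Correctness": {"description": "Accuracy of calculations made and solutions given",
 "accepted_values": ["completely incorrect", "mostly incorrect", "neither", "mostly correct", "completely correct"]}}"""

prompt = build_critic_prompt(
    "Solve math problems.", "<successful transcript>", "<failed transcript>"
)
criteria = parse_criteria(sample_reply)
print(list(criteria))
```

Validating the reply before using it is a design choice: the critic is only instructed to return a dictionary, so a malformed reply is better caught here than later during quantification.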

# Uses the same autogen module and config_list as the critic.
quantifier = autogen.AssistantAgent(
    name="quantifier",
    llm_config={"config_list": config_list},
    system_message="""You are a helpful assistant. You quantify the output of different tasks based on the given criteria.
    The criterion is given in a dictionary format where each key is a distinct criteria.
    The value of each key is a dictionary as follows {"description": criteria description , "accepted_values": possible accepted inputs for this key}
    You are going to quantify each of the criteria for a given task based on the task description.
    Return a dictionary where the keys are the criteria and the values are the assessed performance based on accepted values for each criteria.
    Return only the dictionary."""
)
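Since the quantifier returns one accepted value per criterion, its output can be checked and turned into numbers for plotting or comparison. The sketch below is an assumption-laden illustration: `score_assessment` is a hypothetical helper, the assessment is hand-written, and the mapping assumes each `accepted_values` list is ordered from worst to best, which the system message does not guarantee.

```python
def score_assessment(criteria, assessment):
    """Map each assessed label to its index in the criterion's accepted_values.

    Assumes accepted_values are listed from worst (index 0) to best.
    """
    scores = {}
    for name, spec in criteria.items():
        label = assessment.get(name)
        if label not in spec["accepted_values"]:
            raise ValueError(f"{name!r}: {label!r} is not an accepted value")
        scores[name] = spec["accepted_values"].index(label)
    return scores


# Two criteria from the math problems example in this post.
criteria = {
    "Calculation Correctness": {
        "description": "Accuracy of calculations made and solutions given",
        "accepted_values": ["completely incorrect", "mostly incorrect", "neither",
                            "mostly correct", "completely correct"],
    },
    "Explanation Clarity": {
        "description": "Clarity and comprehensibility of explanations",
        "accepted_values": ["not at all clear", "slightly clear", "moderately clear",
                            "very clear", "completely clear"],
    },
}

# Hand-written quantifier output for illustration only.
assessment = {"Calculation Correctness": "mostly correct",
              "Explanation Clarity": "very clear"}

scores = score_assessment(criteria, assessment)
print(scores)  # {'Calculation Correctness': 3, 'Explanation Clarity': 3}
```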

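`AgentEval` Results Based on the Math Problems Dataset

As an example, after running CriticAgent we obtained the following criteria to verify the results on the math problems dataset: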

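Criteria	Description	Accepted Values
Problem Interpretation	Ability to correctly interpret the problem	["completely off", "slightly relevant", "relevant", "mostly accurate", "completely accurate"]
Mathematical Methodology	Adequacy of the chosen mathematical or algorithmic methodology for the question	["inappropriate", "barely adequate", "adequate", "mostly effective", "completely effective"]
Calculation Correctness	Accuracy of calculations made and solutions given	["completely incorrect", "mostly incorrect", "neither", "mostly correct", "completely correct"]
Explanation Clarity	Clarity and comprehensibility of explanations, including language use and structure	["not at all clear", "slightly clear", "moderately clear", "very clear", "completely clear"]
Code Efficiency	Quality of the code in terms of efficiency and elegance	["not at all efficient", "slightly efficient", "moderately efficient", "very efficient", "extremely efficient"]
Code Correctness	Correctness of the provided code	["completely incorrect", "mostly incorrect", "partly correct", "mostly correct", "completely correct"]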

Then, after running QuantifierAgent, we obtained the results presented in Fig. 3, where you can see three models:

AgentChat
ReAct
GPT-4 Vanilla Solver

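Lighter colors represent estimates for failed cases, and brighter colors show how the discovered criteria were quantified for successful cases.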

Fig.3: Results based on overall math problems dataset; _s stands for successful cases, _f stands for failed cases

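Fig.3 presents results based on the overall math problems dataset, where _s stands for successful cases and _f stands for failed cases

We note that while applying AgentEval to math problems, the agent was not exposed to any ground truth information about the problems. As such, the figure shows the estimated performance of the three different agents, namely AutoGen (blue), GPT-4 (red), and ReAct (green). Comparing the performance of each agent on successful cases (the dark bars of any color) with its performance on unsuccessful cases (the lighter version of the same bar), we observe that AgentEval assigns higher quantified values to successful cases than to failed ones. This observation verifies AgentEval's ability to predict task utility. Additionally, AgentEval allows us to go beyond a binary definition of success, enabling a more in-depth comparison between successful and failed cases.

It is important not only to identify what is not working but also to recognize what is actually going well and why.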
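The comparison between successful and failed cases can be sketched as a per-criterion average over quantified scores. This is a minimal illustration under stated assumptions: `mean_scores_by_outcome` is a hypothetical helper, scores are ordinal indices into each criterion's accepted values, and the records are made-up numbers, not results from the post.

```python
from statistics import mean


def mean_scores_by_outcome(records):
    """Average each criterion's ordinal score separately for successes and failures."""
    grouped = {"success": {}, "failure": {}}
    for outcome, scores in records:
        for criterion, value in scores.items():
            grouped[outcome].setdefault(criterion, []).append(value)
    return {
        outcome: {criterion: mean(values) for criterion, values in per_criterion.items()}
        for outcome, per_criterion in grouped.items()
    }


# Made-up scores purely for illustration.
records = [
    ("success", {"Calculation Correctness": 4, "Explanation Clarity": 3}),
    ("success", {"Calculation Correctness": 3, "Explanation Clarity": 4}),
    ("failure", {"Calculation Correctness": 1, "Explanation Clarity": 2}),
    ("failure", {"Calculation Correctness": 0, "Explanation Clarity": 3}),
]

averages = mean_scores_by_outcome(records)
for criterion in averages["success"]:
    gap = averages["success"][criterion] - averages["failure"][criterion]
    print(f"{criterion}: success - failure = {gap}")
```

A positive gap on a criterion corresponds to the pattern described for Fig. 3, where the dark bar is higher than the lighter bar of the same color.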

Limitations and Future Work

The current implementation of AgentEval has a number of limitations that we plan to overcome in the future:

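- The list of criteria varies per run (unless you store a seed). We recommend running CriticAgent at least two times and picking the criteria you think are important for your domain.
- The results of QuantifierAgent can vary with each run, so we recommend conducting multiple runs to observe the extent of result variations.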
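One simple way to observe how much the quantifier's results vary is to collect its output over several runs and measure agreement per criterion. The sketch below is an assumption, not part of AgentEval: `summarize_runs` is a hypothetical helper, and the run outputs are hand-written using accepted values from the math example.

```python
from collections import Counter


def summarize_runs(runs):
    """For each criterion, return the most frequent label and the share of runs agreeing with it."""
    summary = {}
    for name in runs[0]:
        labels = [run[name] for run in runs]
        label, count = Counter(labels).most_common(1)[0]
        summary[name] = (label, count / len(labels))
    return summary


# Hand-written outputs of four hypothetical quantifier runs on the same sample.
runs = [
    {"Code Correctness": "mostly correct", "Code Efficiency": "very efficient"},
    {"Code Correctness": "mostly correct", "Code Efficiency": "moderately efficient"},
    {"Code Correctness": "mostly correct", "Code Efficiency": "very efficient"},
    {"Code Correctness": "partly correct", "Code Efficiency": "very efficient"},
]

summary = summarize_runs(runs)
for name, (label, agreement) in summary.items():
    print(f"{name}: {label} (agreement {agreement:.2f})")
```

A low agreement share flags a criterion whose quantification is unstable and may deserve more runs or a sharper description.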

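To mitigate the limitations mentioned above, we are working on VerifierAgent, whose goal is to stabilize the results and provide additional explanations.

Summary

CriticAgent and QuantifierAgent can be applied to the logs of any type of application, giving you an in-depth understanding of the utility your solution brings to the user for a given task.

We would love to hear how AgentEval works for your application. Any feedback would be useful for future development. Please contact us on our Discord.

Previous Research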

@InProceedings{pmlr-v176-kiseleva22a,
  title = "Interactive Grounded Language Understanding in a Collaborative Environment: IGLU 2021",
  author = "Kiseleva, Julia and Li, Ziming and Aliannejadi, Mohammad and Mohanty, Shrestha and ter Hoeve, Maartje and Burtsev, Mikhail and Skrynnik, Alexey and Zholus, Artem and Panov, Aleksandr and Srinet, Kavya and Szlam, Arthur and Sun, Yuxuan and Hofmann, Katja and C{\^o}t{\'e}, Marc-Alexandre and Awadallah, Ahmed and Abdrazakov, Linar and Churin, Igor and Manggala, Putra and Naszadi, Kata and van der Meer, Michiel and Kim, Taewoon",
  booktitle = "Proceedings of the NeurIPS 2021 Competitions and Demonstrations Track",
  pages = "146--161",
  year = 2022,
  editor = "Kiela, Douwe and Ciccone, Marco and Caputo, Barbara",
  volume = 176,
  series = "Proceedings of Machine Learning Research",
  month = "06--14 Dec",
  publisher = "PMLR",
  pdf = "https://proceedings.mlr.press/v176/kiseleva22a/kiseleva22a.pdf",
  url = "https://proceedings.mlr.press/v176/kiseleva22a.html"
}

@InProceedings{pmlr-v220-kiseleva22a,
  title = "Interactive Grounded Language Understanding in a Collaborative Environment: Retrospective on Iglu 2022 Competition",
  author = "Kiseleva, Julia and Skrynnik, Alexey and Zholus, Artem and Mohanty, Shrestha and Arabzadeh, Negar and C\^{o}t\'e, Marc-Alexandre and Aliannejadi, Mohammad and Teruel, Milagro and Li, Ziming and Burtsev, Mikhail and ter Hoeve, Maartje and Volovikova, Zoya and Panov, Aleksandr and Sun, Yuxuan and Srinet, Kavya and Szlam, Arthur and Awadallah, Ahmed and Rho, Seungeun and Kwon, Taehwan and Wontae Nam, Daniel and Bivort Haiek, Felipe and Zhang, Edwin and Abdrazakov, Linar and Qingyam, Guo and Zhang, Jason and Guo, Zhibin",
  booktitle = "Proceedings of the NeurIPS 2022 Competitions Track",
  pages = "204--216",
  year = 2022,
  editor = "Ciccone, Marco and Stolovitzky, Gustavo and Albrecht, Jacob",
  volume = 220,
  series = "Proceedings of Machine Learning Research",
  month = "28 Nov--09 Dec",
  publisher = "PMLR",
  pdf = "https://proceedings.mlr.press/v220/kiseleva22a/kiseleva22a.pdf",
  url = "https://proceedings.mlr.press/v220/kiseleva22a.html"
}
