AgentEval: A Developer Tool to Assess Utility of LLM-powered Applications

June 21, 2024 · 7 min read

James Woffinden-Luey

Senior Research Engineer at Microsoft Research

Julia Kiseleva

Senior Researcher at Microsoft Research

Fig.1: An AgentEval framework with verification step

Fig. 1 illustrates the general flow of AgentEval, including the verification step.

TL;DR:

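As a developer, how can you assess the utility and effectiveness of an LLM-powered application in helping end users with their tasks?
To shed light on the question above, we previously introduced AgentEval, a framework to assess the multi-dimensional utility of any LLM-powered application crafted to assist users in specific tasks. We have now embedded it in the AutoGen library to ease developer adoption.
Here, we introduce an updated version of AgentEval that includes a verification process to estimate the robustness of the QuantifierAgent. More details can be found in this paper.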

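Introduction

The previously introduced AgentEval is a comprehensive framework designed to bridge the gap in assessing the utility of LLM-powered applications. It leverages recent advancements in LLMs to offer a scalable and cost-effective alternative to traditional human evaluations. The framework comprises three main agents: CriticAgent, QuantifierAgent, and VerifierAgent, each playing a crucial role in assessing the task utility of an application.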

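CriticAgent: Defining the Criteria

The primary function of the CriticAgent is to suggest a set of criteria for evaluating an application, based on the task description and examples of successful and failed executions. For instance, in the context of a math tutoring application, the CriticAgent might propose criteria such as efficiency, clarity, and correctness. These criteria are essential for understanding the various dimensions of the application's performance. Application developers are strongly encouraged to validate the suggested criteria using their domain expertise.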

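QuantifierAgent: Quantifying the Performance

Once the criteria are established, the QuantifierAgent takes over to quantify how well the application performs against each criterion. This quantification produces a multi-dimensional assessment of the application's utility, providing a detailed view of its strengths and weaknesses.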

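VerifierAgent: Ensuring Robustness and Relevance

The VerifierAgent ensures that the criteria used to assess utility are effective for the end user, maintaining both robustness and high discriminative power. It does this through two main actions:

Criteria Stability:
- Ensures that criteria are essential, non-redundant, and consistently measurable.
- Iterates over generating and quantifying criteria, eliminating redundancies, and evaluating their stability.
- Retains only the most robust criteria.
Discriminative Power:
- Tests the system's reliability by introducing adversarial examples (noisy or compromised data).
- Assesses the system's ability to distinguish these from standard cases.
- If the system fails, this indicates that better criteria are needed to handle varied conditions effectively.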
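
The criteria-stability idea above can be illustrated with a small sketch. This is not the VerifierAgent implementation; it only assumes that criteria are generated several times and that a criterion counts as stable when its name recurs in most runs. The function name `stable_criteria` and the `min_fraction` threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def stable_criteria(runs: Iterable[List[str]], min_fraction: float = 0.6) -> List[str]:
    """Keep criteria names that recur in at least `min_fraction` of the runs."""
    # Normalize names so that "Accuracy" and "accuracy" count as the same criterion.
    normalized = [{name.strip().lower() for name in run} for run in runs]
    if not normalized:
        return []
    counts = Counter(name for run in normalized for name in run)
    threshold = min_fraction * len(normalized)
    return sorted(name for name, count in counts.items() if count >= threshold)


# Hypothetical criteria names from three separate generation runs.
runs = [
    ["Accuracy", "Conciseness", "Relevance"],
    ["accuracy", "Clarity", "Relevance"],
    ["Accuracy", "Relevance", "Efficiency"],
]
print(stable_criteria(runs))
```

Matching on normalized names is the simplest possible redundancy check; criteria that are worded differently but mean the same thing would need a semantic comparison instead.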

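A Flexible and Scalable Framework

One of AgentEval's key strengths is its flexibility. It can be applied to a wide range of tasks, whether or not success is clearly defined for them. For tasks with well-defined success criteria, such as household chores, the framework can evaluate whether multiple successful solutions exist and how they compare. For more open-ended tasks, such as generating an email template, AgentEval can assess the utility of the system's suggestions.

Furthermore, AgentEval allows human expertise to be incorporated. Domain experts can take part in the evaluation process by suggesting relevant criteria or by verifying the usefulness of the criteria identified by the agents. This human-in-the-loop approach ensures that the evaluation is grounded in practical, real-world considerations.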

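Empirical Validation

To validate AgentEval, the framework was tested on two applications: math problem solving and ALFWorld, a household task simulation. The math dataset comprises 12,500 challenging problems, each with a step-by-step solution, while the ALFWorld dataset involves multi-turn interactions in a simulated environment. In both cases, AgentEval successfully identified relevant criteria, quantified performance, and verified the robustness of the evaluations, demonstrating its effectiveness and versatility.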

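How to Use `AgentEval`

AgentEval currently has two main stages: criteria generation and criteria quantification (criteria verification is still under development). Both stages use sequential LLM-powered agents to make their determinations.

Criteria Generation:

During criteria generation, AgentEval uses example execution message chains to create a set of criteria for quantifying how well an application performed on a given task.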

def generate_criteria(
    llm_config: Optional[Union[Dict, Literal[False]]] = None,
    task: Task = None,
    additional_instructions: str = "",
    max_round=2,
    use_subcritic: bool = False,
)

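Parameters:

llm_config (dict or bool): LLM inference configuration.
task (Task): The task to evaluate.
additional_instructions (str, optional): Additional instructions for the criteria agent.
max_round (int, optional): The maximum number of rounds to run the conversation.
use_subcritic (bool, optional): Whether to use the Subcritic agent to generate subcriteria. The Subcritic agent breaks a generated criterion down into smaller criteria to be assessed.

Example code: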

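import autogen
from autogen.agentchat.contrib.agent_eval.agent_eval import generate_criteria, quantify_criteria
from autogen.agentchat.contrib.agent_eval.task import Task

# Load the model endpoints from the OAI_CONFIG_LIST file or environment variable.
config_list = autogen.config_list_from_json("OAI_CONFIG_LIST")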
task = Task(
    **{
        "name": "Math problem solving",
        "description": "Given any question, the system needs to solve the problem as concisely and accurately as possible",
        "successful_response": response_successful,
        "failed_response": response_failed,
    }
)

criteria = generate_criteria(task=task, llm_config={"config_list": config_list})

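Note: Only one sample execution chain (success or failure) is required for the task object, but AgentEval will perform better with an example of each case.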
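
The example above refers to `response_successful` and `response_failed` without defining them. A minimal sketch of how they might be built, assuming `Task` accepts each execution chain as a plain string. The messages are made up for illustration, and `to_chain` is a hypothetical helper, not part of AutoGen.

```python
import json


def to_chain(messages):
    """Serialize a list of chat messages into one JSON string."""
    return json.dumps(messages, indent=2, ensure_ascii=False)


# Made-up sample chains; real ones would come from your application's logs.
response_successful = to_chain(
    [
        {"role": "user", "content": "What is 7 * 8?"},
        {"role": "assistant", "content": "7 * 8 = 56."},
    ]
)
response_failed = to_chain(
    [
        {"role": "user", "content": "What is 7 * 8?"},
        {"role": "assistant", "content": "7 * 8 = 54."},
    ]
)
```

Serializing with `json.dumps` instead of writing the JSON by hand avoids escaping mistakes when messages contain backslashes, quotes, or newlines.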

Example output:

[
    {
        "name": "Accuracy",
        "description": "The solution must be correct and adhere strictly to mathematical principles and techniques appropriate for the problem.",
        "accepted_values": ["Correct", "Minor errors", "Major errors", "Incorrect"]
    },
    {
        "name": "Conciseness",
        "description": "The explanation and method provided should be direct and to the point, avoiding unnecessary steps or complexity.",
        "accepted_values": ["Very concise", "Concise", "Somewhat verbose", "Verbose"]
    },
    {
        "name": "Relevance",
        "description": "The content of the response must be relevant to the question posed and should address the specific problem requirements.",
        "accepted_values": ["Highly relevant", "Relevant", "Somewhat relevant", "Not relevant"]
    }
]

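Criteria Quantification:

During the quantification stage, AgentEval uses the generated criteria (or user-defined criteria) to assess a given execution chain and determine how well the application performed.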

def quantify_criteria(
    llm_config: Optional[Union[Dict, Literal[False]]],
    criteria: List[Criterion],
    task: Task,
    test_case: str,
    ground_truth: str,
)

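Parameters:

llm_config (dict or bool): LLM inference configuration.
criteria (List[Criterion]): A list of criteria for evaluating the utility of a given task. This can either be generated by the generate_criteria function or created manually.
task (Task): The task to evaluate. It should match the task used during the generate_criteria step.
test_case (str): The execution chain to assess. Typically this is a JSON list of messages, but it can be any string representation of a conversation chain.
ground_truth (str): The ground truth for the test case.

Example code: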

test_case="""[
    {
      "content": "Find $24^{-1} \\pmod{11^2}$. That is, find the residue $b$ for which $24b \\equiv 1\\pmod{11^2}$.\n\nExpress your answer as an integer from $0$ to $11^2-1$, inclusive.",
      "role": "user"
    },
    {
      "content": "To find the modular inverse of 24 modulo 11^2, we can use the Extended Euclidean Algorithm. Here is a Python function to compute the modular inverse using this algorithm:\n\n```python\ndef mod_inverse(a, m):\n...",
      "role": "assistant"
    }
  ]"""

quantifier_output = quantify_criteria(
    llm_config={"config_list": config_list},
    criteria=criteria,
    task=task,
    test_case=test_case,
    ground_truth="true",
)

The output will be a JSON object consisting of the ground truth and a dictionary mapping each criterion to its score.

{
  "actual_success": true,
  "estimated_performance": {
      "Accuracy": "Correct",
      "Conciseness": "Concise",
      "Relevance": "Highly relevant"
    }
}
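
The labels in `estimated_performance` are categorical, so comparing or averaging runs usually means mapping them to numbers first. The sketch below is one way to do that, assuming each criterion's `accepted_values` list is ordered from best to worst, as in the example output above. The `to_scores` helper and its `best_first` flag are hypothetical and not part of AgentEval.

```python
from typing import Dict, List


def to_scores(criteria: List[dict], estimated: Dict[str, str], best_first: bool = True) -> Dict[str, float]:
    """Map each criterion's label to a score in [0, 1], where 1 is best."""
    scores = {}
    for criterion in criteria:
        name = criterion["name"]
        values = criterion["accepted_values"]
        label = estimated.get(name)
        if label not in values:
            raise ValueError(f"unexpected label {label!r} for criterion {name!r}")
        rank = values.index(label)
        if best_first:
            # Flip the index so the first (best) label gets the highest rank.
            rank = len(values) - 1 - rank
        scores[name] = rank / (len(values) - 1) if len(values) > 1 else 1.0
    return scores


criteria = [
    {"name": "Accuracy", "accepted_values": ["Correct", "Minor errors", "Major errors", "Incorrect"]},
    {"name": "Conciseness", "accepted_values": ["Very concise", "Concise", "Somewhat verbose", "Verbose"]},
    {"name": "Relevance", "accepted_values": ["Highly relevant", "Relevant", "Somewhat relevant", "Not relevant"]},
]
estimated = {"Accuracy": "Correct", "Conciseness": "Concise", "Relevance": "Highly relevant"}
print(to_scores(criteria, estimated))
```

If your criteria list their values from worst to best, pass `best_first=False`.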

What is next?

Enable AgentEval in AutoGen Studio for a no-code solution.
Fully implement the VerifierAgent in the AgentEval framework.

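Conclusion

AgentEval represents a significant advancement in the evaluation of LLM-powered applications. By combining the strengths of the CriticAgent, QuantifierAgent, and VerifierAgent, the framework offers a robust, scalable, and flexible solution for assessing task utility. This approach not only helps developers understand the current performance of their applications but also provides valuable insights that can drive future improvements. As the field of intelligent agents continues to evolve, frameworks like AgentEval will play a crucial role in ensuring that these applications meet the diverse and dynamic needs of their users.

Further reading

Please refer to our paper and codebase for more details about AgentEval.

If you find this blog useful, please consider citing: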

@article{arabzadeh2024assessing,
  title={Assessing and Verifying Task Utility in LLM-Powered Applications},
  author={Arabzadeh, Negar and Huo, Siqing and Mehta, Nikhil and Wu, Qingyun and Wang, Chi and Awadallah, Ahmed and Clarke, Charles LA and Kiseleva, Julia},
  journal={arXiv preprint arXiv:2405.02178},
  year={2024}
}

How to Assess Utility of LLM-powered Applications?

November 20, 2023 · 10 min read

Julia Kiseleva

Senior Researcher at Microsoft Research

Negar Arabzadeh

PhD student at the University of Waterloo

Fig.1: A verification framework

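Fig. 1 illustrates the general flow of AgentEval.

TL;DR:

As a developer of an LLM-powered application, how can you assess the utility it brings to end users while helping them with their tasks?
To shed light on the question above, we introduce AgentEval, the first version of a framework to assess the utility of any LLM-powered application crafted to assist users in specific tasks. AgentEval aims to simplify the evaluation process by automatically proposing a set of criteria tailored to the unique purpose of your application. This allows for a comprehensive assessment that quantifies the utility of your application against the suggested criteria.
We demonstrate how AgentEval works in the following notebook, using the math problems dataset as an example. Any feedback would be useful for future development. Please contact us on our Discord.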

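Introduction

AutoGen aims to simplify the development of LLM-powered multi-agent systems for various applications, ultimately making end users' lives easier by assisting with their tasks. Next, we all want to understand how our developed systems perform, how useful they are to users, and, perhaps most crucially, how we can improve them. Directly evaluating multi-agent systems poses challenges, as current approaches rely predominantly on success metrics: essentially, whether the agent accomplishes the task. However, understanding how users interact with a system involves far more than success alone. Take math problems, for instance: what matters is not merely that the agent solves the problem. Equally important is its ability to convey the solution according to various criteria, including completeness, conciseness, and the clarity of the explanation provided. Furthermore, success is not always clearly defined for every task.

The rapid advances in LLMs and multi-agent systems have brought many emerging capabilities that we are keen to translate into tangible utility for end users. We introduce the first version of the AgentEval framework, a tool crafted to help developers quickly gauge the utility of LLM-powered applications designed to help end users accomplish their desired tasks.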

Fig.2: An overview of the tasks taxonomy

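Fig. 2 provides an overview of the task taxonomy.

First, let's look at an overview of the suggested task taxonomy that a multi-agent system can be designed for. In general, tasks can be split into two types, where:

Success is not clearly defined - refers to instances where users use a system in an assistive manner, seeking suggestions rather than expecting the system to solve the task. For example, a user might ask the system to generate an email. In many cases, the generated content serves as a template that the user will edit later. However, precisely defining success for such tasks is relatively complex.
Success is clearly defined - refers to instances where we can clearly define whether a system solved the task or not. Consider agents that assist in accomplishing household tasks, where the definition of success is clear and measurable. This category can be further divided into two subcategories:
- Optimal solution exists - these are tasks where only one solution is possible. For example, if you ask your assistant to turn on the light, success for this task is clearly defined, and there is only one way to accomplish it.
- Multiple solutions exist - increasingly, we observe situations where multiple trajectories of agent behavior can lead to either success or failure. In such cases, it is crucial to differentiate between the various successful and unsuccessful trajectories. For example, when you ask the agent to suggest a food recipe or tell you a joke.

In our AgentEval framework, we are currently focusing on tasks where success is clearly defined. Next, we introduce the suggested framework.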

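`AgentEval` Framework

Our previous research on assistive agents in Minecraft suggested that the most effective way to obtain human judgments is to present humans with two agents side by side and ask for their preference. In this pairwise comparison setup, humans can develop criteria to explain why they prefer the behavior of one agent over another, for instance, 'the first agent was faster in execution,' or 'the second agent moves more naturally.' The comparative nature of the setup thus leads humans to come up with a list of criteria that helps to infer the utility of the task. With this idea in mind, we designed AgentEval (shown in Fig. 1), in which we employ LLMs to help us understand, verify, and assess task utility for a multi-agent system. Namely:

The goal of the CriticAgent is to suggest a list of criteria (Fig. 1) that can be used to assess task utility. This is an example of how the CriticAgent is defined using AutoGen: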

critic = autogen.AssistantAgent(
    name="critic",
    llm_config={"config_list": config_list},
    system_message="""You are a helpful assistant. You suggest criteria for evaluating different tasks. They should be distinguishable, quantifiable, and not redundant.
    Convert the evaluation criteria into a dictionary where the keys are the criteria.
    The value of each key is a dictionary as follows {"description": criteria description, "accepted_values": possible accepted inputs for this key}
    Make sure the keys are criteria for assessing the given task. "accepted_values" include the acceptable inputs for each key that are fine-grained and preferably multi-graded levels. "description" includes the criterion description.
    Return only the dictionary."""
)

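Next, the critic is given successful and failed examples of the task execution; it can then return a list of criteria (Fig. 1). For reference, use the following notebook.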
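
Because the critic is asked to return only a dictionary, its reply still has to be parsed before it can be reused. The sketch below shows one way to do this, assuming the reply contains a JSON-compatible dictionary, possibly surrounded by extra text. `parse_criteria` is a hypothetical helper and the sample reply is made up.

```python
import json
import re


def parse_criteria(reply: str) -> dict:
    """Extract and validate the criteria dictionary from a critic reply."""
    # Grab everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no dictionary found in the reply")
    criteria = json.loads(match.group(0))
    for name, spec in criteria.items():
        if not isinstance(spec, dict):
            raise ValueError(f"criterion {name!r} is not a dictionary")
        missing = {"description", "accepted_values"} - set(spec)
        if missing:
            raise ValueError(f"criterion {name!r} is missing {sorted(missing)}")
    return criteria


# Made-up reply for illustration; a real reply comes from the critic agent.
reply = """Here are the criteria:
{"Accuracy": {"description": "Correctness of the final answer",
              "accepted_values": ["Incorrect", "Partially correct", "Correct"]}}
"""
parsed = parse_criteria(reply)
print(list(parsed))
```

Validating the two expected keys up front surfaces malformed replies early, before they reach the quantification step.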

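The goal of the QuantifierAgent is to quantify each of the suggested criteria (Fig. 1), giving us an idea of the utility of this system for the given task. Here is an example of how it can be defined: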

quantifier = autogen.AssistantAgent(
    name="quantifier",
    llm_config={"config_list": config_list},
    system_message="""You are a helpful assistant. You quantify the output of different tasks based on the given criteria.
    The criterion is given in a dictionary format where each key is a distinct criteria.
    The value of each key is a dictionary as follows {"description": criteria description , "accepted_values": possible accepted inputs for this key}
    You are going to quantify each of the criteria for a given task based on the task description.
    Return a dictionary where the keys are the criteria and the values are the assessed performance based on accepted values for each criteria.
    Return only the dictionary."""
)
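
The quantifier needs the task, the criteria dictionary, and the test case in a single message. A minimal sketch of assembling that message follows. The layout of the prompt and the `build_quantifier_message` helper are assumptions for illustration, not the exact format used by AgentEval.

```python
import json


def build_quantifier_message(task_description: str, criteria: dict, test_case: str) -> str:
    """Combine the task, the criteria dictionary, and one test case into a prompt."""
    return (
        f"Task: {task_description}\n"
        f"Evaluation dictionary: {json.dumps(criteria)}\n"
        f"Actual test case to evaluate: {test_case}"
    )


# Hypothetical criteria in the format the critic is asked to return.
criteria = {
    "Accuracy": {
        "description": "Whether the final answer is correct.",
        "accepted_values": ["Incorrect", "Partially correct", "Correct"],
    }
}
message = build_quantifier_message(
    "Solve the math problem as concisely and accurately as possible.",
    criteria,
    '[{"role": "user", "content": "What is 7 * 8?"}]',
)
print(message)
```

The resulting string would then be sent to the quantifier agent as the opening chat message.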

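Results of `AgentEval` on the Math Problems Dataset

For example, after running the CriticAgent, we obtained the following criteria to verify the results for the math problems dataset:

Criteria	Description	Accepted Values
Problem Interpretation	Ability to correctly interpret the problem	["completely off", "slightly relevant", "relevant", "mostly accurate", "completely accurate"]
Mathematical Methodology	Adequacy of the chosen mathematical or algorithmic methodology for the problem	["inappropriate", "barely adequate", "adequate", "mostly effective", "completely effective"]
Calculation Correctness	Accuracy of the calculations made and the solutions given	["completely incorrect", "mostly incorrect", "neither", "mostly correct", "completely correct"]
Explanation Clarity	Clarity and comprehensibility of the explanations, including language use and structure	["not at all clear", "slightly clear", "moderately clear", "very clear", "completely clear"]
Code Efficiency	Quality of the code in terms of efficiency and elegance	["not at all efficient", "slightly efficient", "moderately efficient", "very efficient", "extremely efficient"]
Code Correctness	Correctness of the provided code	["completely incorrect", "mostly incorrect", "partly correct", "mostly correct", "completely correct"]

Then, after running the QuantifierAgent, we obtained the results presented in Fig. 3, where you can see three models:

AgentChat
ReAct
GPT-4 Vanilla Solver

Lighter colors represent estimates for failed cases, and brighter colors show how the discovered criteria were quantified.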

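Fig.3: Results based on overall math problems dataset; _s stands for successful cases, _f stands for failed cases

Fig. 3 presents the results on the overall math problems dataset, where _s denotes successful cases and _f denotes failed cases.

Note that while applying AgentEval to the math problems, the agent was not exposed to any ground truth about the problems. The figure therefore shows the estimated performance of three different agents, namely AutoGen (blue), GPT-4 (red), and ReAct (green). Comparing the performance of each agent on successful cases (dark bars of any color) with its performance on failed cases (lighter bars of the same color), we observe that AgentEval assigns higher quantified values to successful cases than to failed ones. This observation verifies AgentEval's ability to predict task utility. Additionally, AgentEval allows us to go beyond a binary definition of success, enabling a more in-depth comparison between successful and failed cases.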
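
The comparison described above can be reduced to a per-criterion gap between successful and failed cases. The sketch below assumes the labels have already been converted to numbers. The scores are hypothetical, not results from the blog post, and `separation` is an illustrative helper.

```python
from statistics import mean


def separation(success_scores, failed_scores):
    """Return mean(success) - mean(failed) for each shared criterion."""
    return {
        name: mean(success_scores[name]) - mean(failed_scores[name])
        for name in success_scores
        if name in failed_scores
    }


# Hypothetical scores on a 1-5 scale, one entry per test case.
success_scores = {"Calculation Correctness": [4, 5, 3], "Explanation Clarity": [4, 4, 4]}
failed_scores = {"Calculation Correctness": [2, 1, 3], "Explanation Clarity": [3, 4, 2]}

gaps = separation(success_scores, failed_scores)
print(gaps)
```

A positive gap means the criterion separates successful from failed cases in the expected direction; a gap near zero suggests the criterion has little discriminative power.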

It is important not only to identify what is not working but also to recognize what is actually going well, and why.

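Limitations and Future Work

The current implementation of AgentEval has a number of limitations that we plan to overcome in the future:

The list of criteria may vary from run to run (unless you store a seed). We recommend running the CriticAgent at least twice and picking the criteria you consider important for your domain.
The results of the QuantifierAgent can vary with each run, so we recommend conducting multiple runs to observe the extent of the variation.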
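
One simple way to observe how much the QuantifierAgent's results vary is to repeat the quantification and measure how often each criterion receives its most common label. The sketch below assumes each run yields a dictionary mapping criteria to labels. The runs shown are made up, and `run_agreement` is a hypothetical helper.

```python
from collections import Counter


def run_agreement(runs):
    """For each criterion, return its most common label and the share of runs that agree."""
    summary = {}
    for name in runs[0]:
        labels = [run[name] for run in runs if name in run]
        label, count = Counter(labels).most_common(1)[0]
        summary[name] = (label, count / len(labels))
    return summary


# Made-up labels from four repeated quantifier runs on the same test case.
runs = [
    {"Accuracy": "Correct", "Conciseness": "Concise"},
    {"Accuracy": "Correct", "Conciseness": "Concise"},
    {"Accuracy": "Correct", "Conciseness": "Verbose"},
    {"Accuracy": "Correct", "Conciseness": "Concise"},
]
print(run_agreement(runs))
```

Criteria with low agreement are the ones whose scores deserve the most caution.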

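To mitigate the limitations mentioned above, we are working on the VerifierAgent, whose goal is to stabilize the results and provide additional explanations.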

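Summary

The CriticAgent and QuantifierAgent can be applied to the logs of any type of application, giving you an in-depth understanding of the utility your solution brings to the user for a given task.

We would love to hear how AgentEval works for your application. Any feedback would be useful for future development. Please contact us on our Discord.

Previous Research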

@InProceedings{pmlr-v176-kiseleva22a,
  title = "Interactive Grounded Language Understanding in a Collaborative Environment: IGLU 2021",
  author = "Kiseleva, Julia and Li, Ziming and Aliannejadi, Mohammad and Mohanty, Shrestha and ter Hoeve, Maartje and Burtsev, Mikhail and Skrynnik, Alexey and Zholus, Artem and Panov, Aleksandr and Srinet, Kavya and Szlam, Arthur and Sun, Yuxuan and Hofmann, Katja and C{\^o}t{\'e}, Marc-Alexandre and Awadallah, Ahmed and Abdrazakov, Linar and Churin, Igor and Manggala, Putra and Naszadi, Kata and van der Meer, Michiel and Kim, Taewoon",
  booktitle = "Proceedings of the NeurIPS 2021 Competitions and Demonstrations Track",
  pages = "146--161",
  year = 2022,
  editor = "Kiela, Douwe and Ciccone, Marco and Caputo, Barbara",
  volume = 176,
  series = "Proceedings of Machine Learning Research",
  month = "06--14 Dec",
  publisher = "PMLR",
  pdf = "https://proceedings.mlr.press/v176/kiseleva22a/kiseleva22a.pdf",
  url = "https://proceedings.mlr.press/v176/kiseleva22a.html"
}

@InProceedings{pmlr-v220-kiseleva22a,
  title = "Interactive Grounded Language Understanding in a Collaborative Environment: Retrospective on Iglu 2022 Competition",
  author = "Kiseleva, Julia and Skrynnik, Alexey and Zholus, Artem and Mohanty, Shrestha and Arabzadeh, Negar and C\^{o}t\'e, Marc-Alexandre and Aliannejadi, Mohammad and Teruel, Milagro and Li, Ziming and Burtsev, Mikhail and ter Hoeve, Maartje and Volovikova, Zoya and Panov, Aleksandr and Sun, Yuxuan and Srinet, Kavya and Szlam, Arthur and Awadallah, Ahmed and Rho, Seungeun and Kwon, Taehwan and Wontae Nam, Daniel and Bivort Haiek, Felipe and Zhang, Edwin and Abdrazakov, Linar and Qingyam, Guo and Zhang, Jason and Guo, Zhibin",
  booktitle = "Proceedings of the NeurIPS 2022 Competitions Track",
  pages = "204--216",
  year = 2022,
  editor = "Ciccone, Marco and Stolovitzky, Gustavo and Albrecht, Jacob",
  volume = 220,
  series = "Proceedings of Machine Learning Research",
  month = "28 Nov--09 Dec",
  publisher = "PMLR",
  pdf = "https://proceedings.mlr.press/v220/kiseleva22a/kiseleva22a.pdf",
  url = "https://proceedings.mlr.press/v220/kiseleva22a.html"
}
