Note

Go to the end to download the full example code.
Evaluation¶
AgentScope provides a built-in evaluation framework for measuring agent performance across different tasks and benchmarks. Its features include:
- Ray-based parallel and distributed evaluation
- Resuming evaluation after interruption
- 🚧 Visualization of evaluation results
Overview¶
The AgentScope evaluation framework consists of several key components:
- Benchmark: a collection of tasks for systematic evaluation
- Task: an individual evaluation unit with an input, ground truth, and metrics
- Metric: a measurement function for assessing solution quality
- Evaluator: the engine that runs the evaluation, aggregates results, and analyzes performance
- Evaluator Storage: persistent storage for recording and retrieving evaluation results
- Solution: a user-defined solution generation function
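These components fit together in a simple loop. The following self-contained sketch uses plain-Python stand-ins (illustrative only, not the AgentScope API) to show the relationship:

```python
# Plain-Python stand-ins for the components above (illustrative only,
# not the AgentScope API).

# Benchmark: a collection of tasks, each with an input and a ground truth.
benchmark = [
    {"id": "t1", "input": "What is 2 + 2?", "ground_truth": 4.0},
    {"id": "t2", "input": "What is 3 * 3?", "ground_truth": 9.0},
]

# Solution: a user-defined function that produces an answer for a task.
def solution(task: dict) -> float:
    answers = {"t1": 4.0, "t2": 9.0}  # a "perfect agent" for this sketch
    return answers[task["id"]]

# Metric: scores a generated answer against the ground truth.
def metric(output: float, ground_truth: float) -> float:
    return 1.0 if output == ground_truth else 0.0

# Evaluator: runs every task, scores it, and aggregates the results.
def evaluate(benchmark: list[dict]) -> dict:
    scores = [metric(solution(t), t["ground_truth"]) for t in benchmark]
    return {"mean": sum(scores) / len(scores)}

print(evaluate(benchmark))  # {'mean': 1.0}
```

The real framework adds persistence, repetition, and parallelism on top of this basic loop.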
Agent Evaluation Framework¶
The current implementations in AgentScope include:
- Evaluators:
  - RayEvaluator: a Ray-based evaluator supporting parallel and distributed evaluation.
  - GeneralEvaluator: a general evaluator that runs tasks sequentially, which is convenient for debugging.
- Benchmarks:
  - ACEBench: a benchmark for evaluating agent capabilities.
We provide a toy example in our GitHub repository that uses RayEvaluator with multi-step agent tasks from ACEBench.
Core Components¶
We build a toy benchmark of simple math problems to demonstrate how to use the AgentScope evaluation module.
TOY_BENCHMARK = [
    {
        "id": "math_problem_1",
        "question": "What is 2 + 2?",
        "ground_truth": 4.0,
        "tags": {
            "difficulty": "easy",
            "category": "math",
        },
    },
    {
        "id": "math_problem_2",
        "question": "What is 12345 + 54321 + 6789 + 9876?",
        "ground_truth": 83331,
        "tags": {
            "difficulty": "medium",
            "category": "math",
        },
    },
]
From Tasks, Solutions, and Metrics to Benchmark Construction¶
A SolutionOutput contains all the information generated by the agent, including the trajectory and the final output. A Metric is a single evaluation callable that compares the generated solution (e.g., its trajectory or final output) against the ground truth.
In the toy example, we define a metric that simply checks whether the output field of the solution matches the ground truth.
from agentscope.evaluate import (
    SolutionOutput,
    MetricBase,
    MetricResult,
    MetricType,
)


class CheckEqual(MetricBase):
    def __init__(
        self,
        ground_truth: float,
    ):
        super().__init__(
            name="math check number equal",
            metric_type=MetricType.NUMERICAL,
            description="Toy metric checking if two numbers are equal",
            categories=[],
        )
        self.ground_truth = ground_truth

    def __call__(
        self,
        solution: SolutionOutput,
    ) -> MetricResult:
        if solution.output == self.ground_truth:
            return MetricResult(
                name=self.name,
                result=1.0,
                message="Correct",
            )
        else:
            return MetricResult(
                name=self.name,
                result=0.0,
                message="Incorrect",
            )
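One subtlety worth noting: the two ground-truth values in TOY_BENCHMARK have different numeric types (the float 4.0 and the int 83331). The `==` comparison in CheckEqual still works in both cases, because Python compares int and float by numeric value:

```python
# Python's == compares numeric value across int and float, so an agent
# output of 83331.0 (a float, per the structured answer format below)
# matches the int ground truth 83331.
print(4.0 == 4)          # True
print(83331.0 == 83331)  # True
print(4.0 == "4")        # False: a string answer would never match
```

If an agent could return non-numeric answers, the metric would need to normalize types before comparing.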
A Task is a single unit in a benchmark that contains all the information needed for agent execution and evaluation (e.g., the input/query and its ground truth). A Benchmark organizes multiple tasks for systematic evaluation.
from typing import Generator

from agentscope.evaluate import (
    Task,
    BenchmarkBase,
)


class ToyBenchmark(BenchmarkBase):
    def __init__(self):
        super().__init__(
            name="Toy bench",
            description="A toy benchmark for demonstrating the evaluation module.",
        )
        self.dataset = self._load_data()

    @staticmethod
    def _load_data() -> list[Task]:
        dataset = []
        for item in TOY_BENCHMARK:
            dataset.append(
                Task(
                    id=item["id"],
                    input=item["question"],
                    ground_truth=item["ground_truth"],
                    tags=item.get("tags", {}),
                    metrics=[
                        CheckEqual(item["ground_truth"]),
                    ],
                    metadata={},
                ),
            )
        return dataset

    def __iter__(self) -> Generator[Task, None, None]:
        """Iterate over the benchmark."""
        for task in self.dataset:
            yield task

    def __getitem__(self, index: int) -> Task:
        """Get a task by index."""
        return self.dataset[index]

    def __len__(self) -> int:
        """Get the length of the benchmark."""
        return len(self.dataset)
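The three dunder methods above give the benchmark a standard sequence-like interface (iteration, indexing, and length), which is what lets an evaluator loop over it. A minimal stand-in with no AgentScope dependency illustrates the same protocol:

```python
from typing import Generator

class MiniBenchmark:
    """Minimal stand-in mirroring ToyBenchmark's sequence interface."""

    def __init__(self) -> None:
        self.dataset = ["task_a", "task_b"]

    def __iter__(self) -> Generator[str, None, None]:
        # Yield tasks one by one, just like ToyBenchmark.__iter__.
        yield from self.dataset

    def __getitem__(self, index: int) -> str:
        return self.dataset[index]

    def __len__(self) -> int:
        return len(self.dataset)

bench = MiniBenchmark()
print(len(bench))   # 2
print(bench[0])     # task_a
print(list(bench))  # ['task_a', 'task_b']
```

Any benchmark implementing this interface can be consumed by an evaluator in a plain `for` loop.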
Evaluator¶
Evaluators manage the evaluation process. They automatically iterate over the tasks in a benchmark and feed each task into a solution generation function, in which developers define the logic for running the agent and collecting its execution result and trajectory. Below is an example that runs GeneralEvaluator on our demo benchmark. For large benchmarks where developers want to speed up evaluation through parallelization, the built-in RayEvaluator can be used instead.
import os
import asyncio
from typing import Callable

from pydantic import BaseModel

from agentscope.message import Msg
from agentscope.model import DashScopeChatModel
from agentscope.formatter import DashScopeChatFormatter
from agentscope.agent import ReActAgent
from agentscope.evaluate import (
    GeneralEvaluator,
    FileEvaluatorStorage,
)


class ToyBenchAnswerFormat(BaseModel):
    answer_as_number: float


async def toy_solution_generation(
    task: Task,
    pre_hook: Callable,
) -> SolutionOutput:
    agent = ReActAgent(
        name="Friday",
        sys_prompt="You are a helpful assistant named Friday. "
        "Your target is to solve the given task with your tools. "
        "Try to solve the task as best as you can.",
        model=DashScopeChatModel(
            api_key=os.environ.get("DASHSCOPE_API_KEY"),
            model_name="qwen-max",
            stream=False,
        ),
        formatter=DashScopeChatFormatter(),
    )

    agent.register_instance_hook(
        "pre_print",
        "save_logging",
        pre_hook,
    )

    msg_input = Msg("user", task.input, role="user")
    res = await agent(
        msg_input,
        structured_model=ToyBenchAnswerFormat,
    )

    return SolutionOutput(
        success=True,
        output=res.metadata.get("answer_as_number", None),
        trajectory=[],
    )


async def main() -> None:
    evaluator = GeneralEvaluator(
        name="Toy benchmark evaluation",
        benchmark=ToyBenchmark(),
        # How many times to repeat each task
        n_repeat=1,
        storage=FileEvaluatorStorage(
            save_dir="./results",
        ),
        # How many workers to use
        n_workers=1,
    )
    # Run the evaluation
    await evaluator.run(toy_solution_generation)


asyncio.run(main())
Friday: The answer to 2 + 2 is 4.
Friday: The sum of 12345, 54321, 6789, and 9876 is 83331.
Repeat ID: 0
Metric: math check number equal
Type: MetricType.NUMERICAL
Involved tasks: 2
Completed tasks: 2
Incomplete tasks: 0
Aggregation: {
"mean": 1.0,
"max": 1.0,
"min": 1.0
}
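The Aggregation block in the report is a simple reduction over the per-task metric results. Assuming (as the report suggests) the evaluator computes the mean, max, and min of the scores, the numbers above can be reproduced as:

```python
# CheckEqual returned 1.0 for each of the two completed tasks.
results = [1.0, 1.0]

aggregation = {
    "mean": sum(results) / len(results),
    "max": max(results),
    "min": min(results),
}
print(aggregation)  # {'mean': 1.0, 'max': 1.0, 'min': 1.0}
```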
Total running time of the script: (0 minutes 7.444 seconds)