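Tests / Evals

You can define tests (evals) for LLM scripts to evaluate the quality of the LLM output over time and across model types.

Tests are executed by promptfoo, a tool for evaluating LLM output quality.

You can also use the redteam feature to find AI vulnerabilities such as bias, toxicity, and factuality issues.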

Defining tests

Tests are declared in the script function of your script. You can define a single test or an array of tests.

script({
  ...,
  tests: [{
    files: "src/rag/testcode.ts",
    rubrics: "is a report with a list of issues",
    facts: `The report says that the input string
      should be validated before use.`,
  }, { ... }],
})

Test models

You can specify a list of models (or model aliases) to test against.

script({
  ...,
  testModels: ["ollama:phi3", "ollama:gpt-4o"],
})

The eval engine (promptfoo) runs each test against each model in the list. This setting can be overridden with the --models command line option.
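
To make the pairing of models and tests concrete, here is a minimal sketch in plain JavaScript of the run matrix that the sentence above describes. The buildRuns helper and the shape of its result are hypothetical and only illustrate the counting; they are not part of GenAIScript or promptfoo.

```javascript
// Hypothetical helper: pair every model with every test.
// This only illustrates the run matrix; it is not GenAIScript or promptfoo code.
function buildRuns(testModels, tests) {
    return testModels.flatMap((model) =>
        tests.map((promptTest, testIndex) => ({ model, testIndex, promptTest }))
    )
}

const testModels = ["ollama:phi3", "ollama:gpt-4o"]
const tests = [
    { files: "src/rag/testcode.ts", rubrics: "is a report with a list of issues" },
    {
        files: "src/rag/testcode.ts",
        facts: "The report says that the input string should be validated before use.",
    },
]

const runs = buildRuns(testModels, tests)
console.log(runs.length) // 2 models x 2 tests = 4 runs
```

Adding a model to the list therefore adds one run per test, which is worth keeping in mind when the models are slow or billed per call.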

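External test files

You can also specify the filenames of external test files in JSON, YAML, or CSV format, as well as .mjs and .mts JavaScript files, which are executed to generate the tests.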

script({
  ...,
  tests: ["tests.json", "more-tests.csv", "tests.mjs"],
})

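JSON and YAML files are expected to contain a list of PromptTest objects. You can validate these files with the JSON schema at https://microsoft.github.io/genaiscript/schemas/tests.json.

CSV files are expected to have a header row as the first line; the columns are mostly properties of the PromptTest object. The file column should be a filename, and the fileContent column holds the content of a virtual file.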

content,rubrics,facts
"const x = 1;",is a report with a list of issues,The report says that the input string should be validated before use.

JavaScript files should export a list of PromptTest objects, or a function that generates a list of PromptTest objects.

export default [
    {
        content: "const x = 1;",
        rubrics: "is a report with a list of issues",
        facts: "The report says that the input string should be validated before use.",
    },
]
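
The snippet above shows the list form. The function form might look like the following sketch. The snippets array and the generateTests name are made up for illustration, and this page does not say whether the function may be async or receives arguments, so the sketch assumes a synchronous function with no parameters.

```javascript
// Hypothetical inputs; replace with whatever your tests should cover.
const snippets = ["const x = 1;", "let y = input.trim();"]

// Returns a list of PromptTest-like objects, one per snippet.
function generateTests() {
    return snippets.map((content) => ({
        content,
        rubrics: "is a report with a list of issues",
        facts: "The report says that the input string should be validated before use.",
    }))
}

console.log(generateTests().length) // one test per snippet
```

In a tests.mjs file, the function would then be exposed with `export default generateTests`.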

`files`

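files takes a list of file paths (relative to the workspace) and populates the env.files variable while the test runs. You can provide multiple files by passing an array of strings.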

script({
  tests: {
    files: "src/rag/testcode.ts",
    ...
  }
})

`rubrics`

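rubrics checks whether the LLM output matches the given requirements, using a language model to grade the output against the rubric (see llm-rubric). You can specify multiple rubrics by passing an array of strings.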

script({
  tests: {
    rubrics: "is a report with a list of issues",
    ...,
  }
})

`facts`

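facts checks factual consistency (see factuality). You can specify multiple facts by passing an array of strings.

Given a completion A and a reference answer B, it evaluates whether A is a subset of B, A is a superset of B, A and B are equivalent, A and B disagree, or A and B differ but the difference does not matter from the perspective of factuality.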

script({
  tests: {
    facts: `The report says that the input string should be validated before use.`,
    ...,
  }
})
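
Since files, rubrics, and facts each accept either a single string or an array of strings, a test with several entries can be written as in the sketch below. The toArray helper is a hypothetical illustration of how such a field could be normalized; it is not a GenAIScript function, and the second file path and second rubric are made-up placeholders.

```javascript
// A PromptTest-like object that mixes array and single-string fields.
const promptTest = {
    // The second path is hypothetical.
    files: ["src/rag/testcode.ts", "src/rag/other.ts"],
    // The second rubric is hypothetical.
    rubrics: ["is a report with a list of issues", "mentions input validation"],
    facts: "The report says that the input string should be validated before use.",
}

// Hypothetical helper: normalize "string | string[] | undefined" to an array.
function toArray(value) {
    if (value === undefined) return []
    return Array.isArray(value) ? value : [value]
}

console.log(toArray(promptTest.files).length, toArray(promptTest.facts).length)
```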

`asserts`

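For other assertions, see promptfoo assertions and metrics.

icontains (not-icontains): output contains the substring (case-insensitive)
equals (not-equals): output equals the string
starts-with (not-starts-with): output starts with the string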

script({
    tests: {
        facts: `The report says that the input string should be validated before use.`,
        asserts: [
            {
                type: "icontains",
                value: "issue",
            },
        ],
    },
})

contains-all (not-contains-all): output contains all of the substrings
contains-any (not-contains-any): output contains any of the substrings
icontains-all (not-icontains-all): output contains all of the substrings (case-insensitive)

script({
    tests: {
        ...,
        asserts: [
            {
                type: "icontains-all",
                value: ["issue", "fix"],
            },
        ],
    },
})
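
The string assertions listed in this section can be read as simple predicates on the output text. The sketch below spells out that reading in plain JavaScript. checkAssertion is a hypothetical helper written for this page, not promptfoo's implementation, and it only covers the six types mentioned here plus their not- variants.

```javascript
// Hypothetical re-implementation of the assertion semantics described above.
const predicates = {
    "icontains": (out, v) => out.toLowerCase().includes(String(v).toLowerCase()),
    "equals": (out, v) => out === String(v),
    "starts-with": (out, v) => out.startsWith(String(v)),
    "contains-all": (out, vs) => vs.every((v) => out.includes(v)),
    "contains-any": (out, vs) => vs.some((v) => out.includes(v)),
    "icontains-all": (out, vs) =>
        vs.every((v) => out.toLowerCase().includes(v.toLowerCase())),
}

function checkAssertion(output, { type, value }) {
    // A "not-" prefix negates the base assertion.
    const negated = type.startsWith("not-")
    const base = negated ? type.slice(4) : type
    const predicate = predicates[base]
    if (!predicate) throw new Error(`unsupported assertion type: ${type}`)
    const result = predicate(output, value)
    return negated ? !result : result
}

const output = "Found 1 Issue: validate the input. Suggested fix: add a check."
console.log(checkAssertion(output, { type: "icontains", value: "issue" })) // true
console.log(checkAssertion(output, { type: "icontains-all", value: ["issue", "fix"] })) // true
console.log(checkAssertion(output, { type: "not-starts-with", value: "Found" })) // false
```

Note that the case-sensitive contains-all variant would fail on "issue" for this output, because the text contains "Issue" with a capital letter.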

transform

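By default, GenAIScript extracts the text field from the output before sending it to promptfoo. You can disable this mode by setting format: "json"; the asserts are then executed on the raw LLM output. You can use a JavaScript expression to select the part of the output to test.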

script({
    tests: {
        files: "src/will-trigger.cancel.txt",
        format: "json",
        asserts: {
            type: "equals",
            value: "cancelled",
            transform: "output.status",
        },
    },
})
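
One way to picture the transform option: the expression is evaluated with the raw result bound to output, and the assertion is then checked against whatever the expression returns. The sketch below mimics that with new Function. The applyTransform helper and the sample result object are assumptions for illustration and do not reflect how promptfoo evaluates the expression internally.

```javascript
// Hypothetical helper: evaluate a transform expression against an output value.
function applyTransform(expression, output) {
    // Builds a function equivalent to: (output) => (<expression>)
    return new Function("output", `return (${expression})`)(output)
}

// Assumed shape of a raw result when format is "json"; only `status` matters here.
const rawOutput = { status: "cancelled", text: "" }

const selected = applyTransform("output.status", rawOutput)
console.log(selected === "cancelled") // true
```

With this reading, the equals assertion in the example above compares "cancelled" with the status field rather than with the whole result object.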

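Running tests

You can run tests from Visual Studio Code or from the command line. In both cases, genaiscript generates a promptfoo configuration file and runs promptfoo on it.

Visual Studio Code

Open the script to test
Right-click in the editor and select Run GenAIScript Tests in the context menu
The promptfoo web view opens automatically and refreshes with the test results.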

Command line

Run the test command with the script file as an argument.

npx genaiscript test <scriptid>

You can specify additional models to test against by passing the --models option.

npx genaiscript test <scriptid> --models "ollama:phi3"