Batch Inference

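The batch inference endpoint provides access to the batch inference APIs offered by some model providers. These APIs run inference at a significantly lower cost than real-time inference, at the expense of much higher latency (sometimes up to a day).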

The batch inference workflow consists of two steps: submitting your batch request, then polling the batch job status until it completes.

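For more details on the batch inference endpoint, see the Batch Inference API Reference. For the model provider integrations that support batch inference, see Integrations.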

Example

Imagine you have a simple TensorZero function that generates haikus using GPT-4o Mini.

[functions.generate_haiku]
type = "chat"

[functions.generate_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini-2024-07-18"

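You can submit a batch inference job to generate multiple haikus with a single request. Each entry in inputs is equivalent to the input field of a regular inference request.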

curl -X POST http://localhost:3000/batch_inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "generate_haiku",
    "variant_name": "gpt_4o_mini",
    "inputs": [
      {
        "messages": [
          {
            "role": "user",
            "content": "Write a haiku about artificial intelligence."
          }
        ]
      },
      {
        "messages": [
          {
            "role": "user",
            "content": "Write a haiku about general aviation."
          }
        ]
      },
      {
        "messages": [
          {
            "role": "user",
            "content": "Write a haiku about anime."
          }
        ]
      }
    ]
  }'
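The same request can also be assembled from Python. The following is a minimal sketch using only the standard library; `build_batch_request` and `submit_batch` are hypothetical helper names, and the gateway address is assumed to be the local one used in the curl example.

```python
import json
import urllib.request

# Assumption: the gateway runs locally, as in the curl example.
GATEWAY_URL = "http://localhost:3000"


def build_batch_request(prompts, function_name="generate_haiku", variant_name="gpt_4o_mini"):
    """Build (but do not send) a POST /batch_inference request with one input per prompt."""
    payload = {
        "function_name": function_name,
        "variant_name": variant_name,
        "inputs": [
            {"messages": [{"role": "user", "content": prompt}]}
            for prompt in prompts
        ],
    }
    return urllib.request.Request(
        f"{GATEWAY_URL}/batch_inference",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_batch(prompts):
    """Send the request and return the parsed JSON response (needs a running gateway)."""
    with urllib.request.urlopen(build_batch_request(prompts)) as response:
        return json.load(response)
```

Separating request construction from sending keeps the payload easy to inspect or unit test without a running gateway.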

The response contains a batch_id as well as the inference_ids and episode_ids for each inference in the batch.

{
  "batch_id": "019470f0-db4c-7811-9e14-6fe6593a2652",
  "inference_ids": [
    "019470f0-d34a-77a3-9e59-bcc66db2b82f",
    "019470f0-d34a-77a3-9e59-bcdd2f8e06aa",
    "019470f0-d34a-77a3-9e59-bcecfb7172a0"
  ],
  "episode_ids": [
    "019470f0-d34a-77a3-9e59-bc933973d087",
    "019470f0-d34a-77a3-9e59-bca6e9b748b2",
    "019470f0-d34a-77a3-9e59-bcb20177bf3a"
  ]
}
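If you want to track each submitted input individually, the two ID lists can be paired up. The sketch below assumes the lists are index-aligned (the first inference_id belongs with the first episode_id), which matches the sample data on this page but is an assumption rather than a documented guarantee; `pair_ids` is a hypothetical helper.

```python
def pair_ids(response):
    """Return (inference_id, episode_id) pairs from a batch submission response.

    Assumption: inference_ids and episode_ids are index-aligned.
    """
    inference_ids = response["inference_ids"]
    episode_ids = response["episode_ids"]
    if len(inference_ids) != len(episode_ids):
        raise ValueError("inference_ids and episode_ids differ in length")
    return list(zip(inference_ids, episode_ids))
```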

You can use this batch_id to poll the job status or retrieve the results through the GET /batch_inference/{batch_id} endpoint.

curl -X GET http://localhost:3000/batch_inference/019470f0-db4c-7811-9e14-6fe6593a2652

While the job is pending, the response contains only the status field.

{
  "status": "pending"
}

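Once the job completes, the response contains the status and inferences fields (along with the batch_id, as in the example below). Each inference object has the same shape as the response from a regular inference request.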

{
  "status": "completed",
  "batch_id": "019470f0-db4c-7811-9e14-6fe6593a2652",
  "inferences": [
    {
      "inference_id": "019470f0-d34a-77a3-9e59-bcc66db2b82f",
      "episode_id": "019470f0-d34a-77a3-9e59-bc933973d087",
      "variant_name": "gpt_4o_mini",
      "content": [
        {
          "type": "text",
          "text": "Whispers of circuits,  \nLearning paths through endless code,  \nDreams in binary."
        }
      ],
      "usage": {
        "input_tokens": 15,
        "output_tokens": 19
      }
    },
    {
      "inference_id": "019470f0-d34a-77a3-9e59-bcdd2f8e06aa",
      "episode_id": "019470f0-d34a-77a3-9e59-bca6e9b748b2",
      "variant_name": "gpt_4o_mini",
      "content": [
        {
          "type": "text",
          "text": "Wings of freedom soar,  \nClouds embrace the lonely flight,  \nSky whispers adventure."
        }
      ],
      "usage": {
        "input_tokens": 15,
        "output_tokens": 20
      }
    },
    {
      "inference_id": "019470f0-d34a-77a3-9e59-bcecfb7172a0",
      "episode_id": "019470f0-d34a-77a3-9e59-bcb20177bf3a",
      "variant_name": "gpt_4o_mini",
      "content": [
        {
          "type": "text",
          "text": "Vivid worlds unfold,  \nHeroes rise with dreams in hand,  \nInk and dreams collide."
        }
      ],
      "usage": {
        "input_tokens": 14,
        "output_tokens": 20
      }
    }
  ]
}
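The polling step can be sketched as a simple loop in Python. This is an illustration under stated assumptions, not a reference implementation: `fetch_batch`, `wait_for_batch`, and `haiku_texts` are hypothetical helpers, the gateway address is the local one from the curl examples, and the polling interval and limit are arbitrary. Because only the pending and completed statuses appear on this page, the loop returns as soon as the status is anything other than pending and leaves interpretation to the caller.

```python
import json
import time
import urllib.request

# Assumption: the gateway runs locally, as in the curl examples.
GATEWAY_URL = "http://localhost:3000"


def fetch_batch(batch_id):
    """GET /batch_inference/{batch_id} and return the parsed JSON body."""
    with urllib.request.urlopen(f"{GATEWAY_URL}/batch_inference/{batch_id}") as response:
        return json.load(response)


def wait_for_batch(batch_id, fetch=fetch_batch, interval_s=60.0, max_polls=1440, sleep=time.sleep):
    """Poll until the job is no longer pending, or raise after max_polls attempts."""
    for _ in range(max_polls):
        result = fetch(batch_id)
        if result["status"] != "pending":
            return result
        sleep(interval_s)
    raise TimeoutError(f"batch {batch_id} still pending after {max_polls} polls")


def haiku_texts(result):
    """Collect the text blocks from every inference in a completed result."""
    return [
        block["text"]
        for inference in result["inferences"]
        for block in inference["content"]
        if block["type"] == "text"
    ]
```

Since batch jobs can take up to a day, a long interval is usually a better fit than tight polling. The `fetch` and `sleep` parameters exist so the loop can be exercised without a gateway or real waiting.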

Technical Notes

Observability
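- Pending batch inference jobs are not currently shown in the TensorZero UI. You can find the relevant information in the BatchRequest and BatchModelInference tables in ClickHouse. See Data Model for details.
- Inferences from completed batch inference jobs are shown in the UI alongside regular inferences.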
Experimentation
- The gateway samples the same variant for the entire batch.
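One way to see this in a completed result is to check that every inference reports the same variant_name. A minimal sketch, where `batch_variant` is a hypothetical helper that expects the completed response shape shown in the example above:

```python
def batch_variant(result):
    """Return the single variant_name shared by all inferences in a completed batch."""
    names = {inference["variant_name"] for inference in result["inferences"]}
    if len(names) != 1:
        raise ValueError(f"expected exactly one variant, found {sorted(names)}")
    return names.pop()
```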
Python Client
- The TensorZero Python client does not natively support batch inference yet. You need to submit batch requests over HTTP, as shown above.