Fallbacks, Load Balancing, Retries
Quick Start - Load Balancing
Step 1 - Set deployments on config
Example config below. Here requests with model=gpt-3.5-turbo will be routed across multiple instances of azure/gpt-3.5-turbo
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/<your-deployment-name>
      api_base: <your-azure-endpoint>
      api_key: <your-azure-api-key>
      rpm: 6      # Rate limit for this deployment: in requests per minute (rpm)
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: <your-azure-api-key>
      rpm: 6
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-large
      api_base: https://openai-france-1234.openai.azure.com/
      api_key: <your-azure-api-key>
      rpm: 1440

router_settings:
  routing_strategy: simple-shuffle # Literal["simple-shuffle", "least-busy", "usage-based-routing", "latency-based-routing"], default="simple-shuffle"
  model_group_alias: {"gpt-4": "gpt-3.5-turbo"} # all requests with `gpt-4` will be routed to models with `gpt-3.5-turbo`
  num_retries: 2
  timeout: 30                           # 30 seconds
  redis_host: <your redis host>         # set this when using multiple litellm proxy deployments, load balancing state stored in redis
  redis_password: <your redis password>
  redis_port: 1992
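With simple-shuffle, deployments that declare rpm can be picked with probability proportional to their rpm, so higher-rpm deployments receive proportionally more traffic. A minimal illustrative sketch of that weighting (not litellm internals; the deployment dicts are trimmed to just what matters here):

import random

# Weights mirror the rpm values in the config above: 6 : 6 : 1440
deployments = [
    {"api_base": "<your-azure-endpoint>", "rpm": 6},
    {"api_base": "https://my-endpoint-canada-berri992.openai.azure.com/", "rpm": 6},
    {"api_base": "https://openai-france-1234.openai.azure.com/", "rpm": 1440},
]

# Pick a deployment with probability proportional to its rpm
choice = random.choices(deployments, weights=[d["rpm"] for d in deployments], k=1)[0]
print(choice["api_base"])  # the 1440-rpm deployment is chosen ~99% of the time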
Step 2: Start Proxy with config
$ litellm --config /path/to/config.yaml
Test - Simple Call
Here requests with model=gpt-3.5-turbo will be routed across multiple instances of azure/gpt-3.5-turbo
👉 Key Change: model="gpt-3.5-turbo"
Check the model_id in the response headers to make sure the requests are being load balanced (see the header-checking sketch after the examples below)
- OpenAI Python v1.0.0+
- Curl Request
- Langchain
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": "this is a test request, write a short poem"
        }
    ]
)

print(response)
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "gpt-3.5-turbo",
    "messages": [
        {
            "role": "user",
            "content": "what llm are you"
        }
    ]
}'
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain.schema import HumanMessage, SystemMessage
import os

os.environ["OPENAI_API_KEY"] = "anything"

chat = ChatOpenAI(
    openai_api_base="http://0.0.0.0:4000",
    model="gpt-3.5-turbo",
)

messages = [
    SystemMessage(
        content="You are a helpful assistant that im using to make a test request to."
    ),
    HumanMessage(
        content="test from litellm. tell me why it's amazing in 1 sentence"
    ),
]
response = chat(messages)

print(response)
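A minimal sketch for checking the model_id response header mentioned above, using the OpenAI SDK's raw-response API (the x-litellm-model-id header name is an assumption here; the raw-response pattern matches the EU-region example later in this doc):

import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

for _ in range(3):
    raw = client.chat.completions.with_raw_response.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "ping"}],
    )
    # Different ids across calls indicate requests hit different deployments
    print(raw.headers.get("x-litellm-model-id"))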
Test - Load Balancing
In this request, the following will occur:
- A rate limit exception will be raised
- The LiteLLM proxy will retry the request on the model group (default is 3 retries)
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "user", "content": "Hi there!"}
    ],
    "mock_testing_rate_limit_error": true
}'
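The same mock flag can be sent from the OpenAI SDK via extra_body, the mechanism this doc uses later for client-side fallbacks (a sketch; the flag is the one from the curl above):

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hi there!"}],
    # Ask the proxy to simulate a rate limit error, then retry across deployments
    extra_body={"mock_testing_rate_limit_error": True},
)
print(response)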
Test - Client Side Fallbacks
In this request, the following will occur:
- The request to model="zephyr-beta" will fail
- The litellm proxy will loop through all the model groups specified in fallbacks=["gpt-3.5-turbo"]
- The request to model="gpt-3.5-turbo" will succeed, and the client making the request will get a response from gpt-3.5-turbo
👉 Key Change: "fallbacks": ["gpt-3.5-turbo"]
- OpenAI Python v1.0.0+
- Curl Request
- Langchain
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="zephyr-beta",
    messages=[
        {
            "role": "user",
            "content": "this is a test request, write a short poem"
        }
    ],
    extra_body={
        "fallbacks": ["gpt-3.5-turbo"]
    }
)

print(response)
Pass fallbacks as part of the request body
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "zephyr-beta",
    "messages": [
        {
            "role": "user",
            "content": "what llm are you"
        }
    ],
    "fallbacks": ["gpt-3.5-turbo"]
}'
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain.schema import HumanMessage, SystemMessage
import os

os.environ["OPENAI_API_KEY"] = "anything"

chat = ChatOpenAI(
    openai_api_base="http://0.0.0.0:4000",
    model="zephyr-beta",
    extra_body={
        "fallbacks": ["gpt-3.5-turbo"]
    }
)

messages = [
    SystemMessage(
        content="You are a helpful assistant that im using to make a test request to."
    ),
    HumanMessage(
        content="test from litellm. tell me why it's amazing in 1 sentence"
    ),
]
response = chat(messages)

print(response)
Advanced
Fallbacks + Retries + Timeouts + Cooldowns
To set fallbacks, just do:
litellm_settings:
  fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo"]}]
This covers all errors (429, 500, etc.)
Set via config
model_list:
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8002
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8003
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: <my-openai-key>
  - model_name: gpt-3.5-turbo-16k
    litellm_params:
      model: gpt-3.5-turbo-16k
      api_key: <my-openai-key>

litellm_settings:
  num_retries: 3 # retry call 3 times on each model_name (e.g. zephyr-beta)
  request_timeout: 10 # raise a Timeout error if the call takes longer than 10s. Sets litellm.request_timeout
  fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo"]}] # fallback to gpt-3.5-turbo if the call fails num_retries times
  allowed_fails: 3 # cooldown the model if it fails more than 3 calls in a minute
  cooldown_time: 30 # how long (in seconds) to cooldown the model if fails/min > allowed_fails
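If you use litellm from Python instead of the proxy, the same reliability settings map onto the Router. A rough sketch (assumes the litellm SDK; the parameter names mirror the config keys above, and the model list is trimmed to two deployments):

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "zephyr-beta",
            "litellm_params": {
                "model": "huggingface/HuggingFaceH4/zephyr-7b-beta",
                "api_base": "http://0.0.0.0:8001",
            },
        },
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {"model": "gpt-3.5-turbo", "api_key": "<my-openai-key>"},
        },
    ],
    num_retries=3,                                   # retry each call 3 times
    timeout=10,                                      # 10s request timeout
    fallbacks=[{"zephyr-beta": ["gpt-3.5-turbo"]}],  # fallback after retries fail
    allowed_fails=3,                                 # failures/min before cooldown
    cooldown_time=30,                                # cooldown period in seconds
)

response = router.completion(
    model="zephyr-beta",
    messages=[{"role": "user", "content": "ping"}],
)
print(response)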
Test Fallbacks!
Check if your fallbacks are working as expected.
Regular Fallbacks
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "my-bad-model",
    "messages": [
        {
            "role": "user",
            "content": "ping"
        }
    ],
    "mock_testing_fallbacks": true # 👈 KEY CHANGE
}'
Content Policy Fallbacks
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "my-bad-model",
    "messages": [
        {
            "role": "user",
            "content": "ping"
        }
    ],
    "mock_testing_content_policy_fallbacks": true # 👈 KEY CHANGE
}'
Context Window Fallbacks
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "my-bad-model",
    "messages": [
        {
            "role": "user",
            "content": "ping"
        }
    ],
    "mock_testing_context_window_fallbacks": true # 👈 KEY CHANGE
}'
Context Window Fallbacks (Pre-Call Checks + Fallbacks)
Before making the call, check if the call is within the model's context window, with enable_pre_call_checks: true.
1. Setup config
For Azure deployments, set the base model. Pick the base model from this list; all Azure models start with azure/.
- Same group
- Context Window Fallbacks (different groups)
Filter older instances of a model (e.g. gpt-3.5-turbo) with smaller context windows
router_settings:
  enable_pre_call_checks: true # 1. Enable pre-call checks

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
    model_info:
      base_model: azure/gpt-4-1106-preview # 2. 👈 (Azure-only) Set the base model
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo-1106
      api_key: os.environ/OPENAI_API_KEY
Start proxy
litellm --config /path/to/config.yaml
# RUNNING on http://0.0.0.0:4000
Test it!
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# 5,000 repetitions: far too large for the 16k gpt-3.5-turbo-1106 deployment,
# but within the 128k window of the azure/gpt-4-1106-preview base model
text = "What is the meaning of 42?" * 5000

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": text},
        {"role": "user", "content": "Who was Alexander?"},
    ],
)

print(response)
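To confirm which deployment the pre-call check selected, you can re-run the request with the raw-response API and read the api-base header (a sketch; the same header is used in the EU-region example further below):

import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

text = "What is the meaning of 42?" * 5000  # far beyond a 16k context window

raw = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": text},
        {"role": "user", "content": "Who was Alexander?"},
    ],
)
# Expect the Azure api_base, since only that deployment's 128k window fits the request
print(raw.headers.get("x-litellm-model-api-base"))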
Fallback to a bigger model if the current model is too small.
router_settings:
  enable_pre_call_checks: true # 1. Enable pre-call checks

model_list:
  - model_name: gpt-3.5-turbo-small
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
    model_info:
      base_model: azure/gpt-4-1106-preview # 2. 👈 (Azure-only) Set the base model
  - model_name: gpt-3.5-turbo-large
    litellm_params:
      model: gpt-3.5-turbo-1106
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-opus
    litellm_params:
      model: claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  context_window_fallbacks: [{"gpt-3.5-turbo-small": ["gpt-3.5-turbo-large", "claude-opus"]}]
Start proxy
litellm --config /path/to/config.yaml
# RUNNING on http://0.0.0.0:4000
Test it!
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

text = "What is the meaning of 42?" * 5000

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(
    model="gpt-3.5-turbo-small",
    messages=[
        {"role": "system", "content": text},
        {"role": "user", "content": "Who was Alexander?"},
    ],
)

print(response)
Content Policy Fallbacks
Fallback across providers (e.g. from Azure OpenAI to Anthropic) on content policy violation errors.
model_list:
  - model_name: gpt-3.5-turbo-small
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
  - model_name: claude-opus
    litellm_params:
      model: claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  content_policy_fallbacks: [{"gpt-3.5-turbo-small": ["claude-opus"]}]
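To verify this route without triggering a real content policy violation, you can reuse the mock_testing_content_policy_fallbacks flag from the "Test Fallbacks!" section above (a sketch using the OpenAI SDK):

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo-small",
    messages=[{"role": "user", "content": "ping"}],
    # Simulate a content policy error so the proxy falls back to claude-opus
    extra_body={"mock_testing_content_policy_fallbacks": True},
)
print(response)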
Default Fallbacks
You can also set default_fallbacks, in case a specific model group is misconfigured or performing badly.
model_list:
  - model_name: gpt-3.5-turbo-small
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
  - model_name: claude-opus
    litellm_params:
      model: claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  default_fallbacks: ["claude-opus"]
This will default to claude-opus in case any model fails.
A model-specific fallback (e.g. {"gpt-3.5-turbo-small": ["claude-opus"]}) overrides the default fallback.
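You can exercise the default fallback with the mock_testing_fallbacks flag from the "Test Fallbacks!" section (a sketch; the simulated failure should be retried on claude-opus):

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo-small",
    messages=[{"role": "user", "content": "ping"}],
    # Simulate a failure so the proxy falls through to default_fallbacks
    extra_body={"mock_testing_fallbacks": True},
)
print(response)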
EU-Region Filtering (Pre-Call Checks)
Before making the call, filter out deployments outside the allowed region, with enable_pre_call_checks: true.
Set the 'region_name' of your deployments.
Note: LiteLLM can automatically infer the region_name for Vertex AI, Bedrock, and IBM WatsonxAI based on your litellm params. For Azure, set litellm.enable_preview = True.
Set config
router_settings:
  enable_pre_call_checks: true # 1. Enable pre-call checks

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
      region_name: "eu" # 👈 SET EU REGION
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo-1106
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gemini-pro
    litellm_params:
      model: vertex_ai/gemini-pro-1.5
      vertex_project: adroit-crow-1234
      vertex_location: us-east1 # 👈 'region_name' is auto-inferred
Start proxy
litellm --config /path/to/config.yaml
# RUNNING on http://0.0.0.0:4000
Test it!
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Who was Alexander?"}]
)

print(response)

# confirm which deployment served the request
print(response.headers.get("x-litellm-model-api-base"))
Custom Timeouts, Stream Timeouts - Per Model
For each model you can set timeout and stream_timeout under litellm_params
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-eu
      api_base: https://my-endpoint-europe-berri-992.openai.azure.com/
      api_key: <your-key>
      timeout: 0.1                      # timeout in (seconds)
      stream_timeout: 0.01              # timeout for stream requests (seconds)
      max_retries: 5
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: <your-key>
      timeout: 0.1                      # timeout in (seconds)
      stream_timeout: 0.01              # timeout for stream requests (seconds)
      max_retries: 5
Start Proxy
$ litellm --config /path/to/config.yaml
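With stream_timeout set this low (0.01s), streaming requests are expected to time out quickly and be retried across deployments, up to max_retries. A quick way to observe this from the OpenAI SDK (a sketch; exact error behavior depends on your proxy version):

import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# A streaming call; with a 0.01s stream_timeout this will likely raise a
# timeout error once the proxy exhausts its retries
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "ping"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")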
Setting Dynamic Timeouts - Per Request
LiteLLM Proxy supports setting a timeout per request
Example Usage
- Curl Request
- OpenAI v1.0.0+
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data-raw '{
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "user", "content": "what color is red"}
    ],
    "logit_bias": {"12481": 100},
    "timeout": 1
}'
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "what color is red"}
    ],
    logit_bias={12481: 100},
    timeout=1
)

print(response)
Setting Fallbacks for Wildcard Models
You can set fallbacks for wildcard models (e.g. azure/*) in your config file.
- Setup config
model_list:
  - model_name: "gpt-4o"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: os.environ/OPENAI_API_KEY
  - model_name: "azure/*"
    litellm_params:
      model: "azure/*"
      api_key: os.environ/AZURE_API_KEY
      api_base: os.environ/AZURE_API_BASE

litellm_settings:
  fallbacks: [{"gpt-4o": ["azure/gpt-4o"]}]
- Start Proxy
litellm --config /path/to/config.yaml
- Test it!
curl -L -X POST 'http://0.0.0.0:4000/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "what color is red"
                }
            ]
        }
    ],
    "max_tokens": 300,
    "mock_testing_fallbacks": true
}'