Fallbacks, Load Balancing, Retries
Quick Start - Load Balancing
Step 1 - Set deployments on config
Example config below. Here requests with model=gpt-3.5-turbo will be routed across multiple instances of azure/gpt-3.5-turbo
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/<your-deployment-name>
      api_base: <your-azure-endpoint>
      api_key: <your-azure-api-key>
      rpm: 6      # Rate limit for this deployment: in requests per minute (rpm)
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: <your-azure-api-key>
      rpm: 6
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-large
      api_base: https://openai-france-1234.openai.azure.com/
      api_key: <your-azure-api-key>
      rpm: 1440

router_settings:
  routing_strategy: simple-shuffle # Literal["simple-shuffle", "least-busy", "usage-based-routing", "latency-based-routing"], default="simple-shuffle"
  model_group_alias: {"gpt-4": "gpt-3.5-turbo"} # all requests with `gpt-4` will be routed to models with `gpt-3.5-turbo`
  num_retries: 2
  timeout: 30                           # 30 seconds
  redis_host: <your redis host>         # set this when using multiple litellm proxy deployments, load balancing state stored in redis
  redis_password: <your redis password>
  redis_port: 1992
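With simple-shuffle, deployments that declare rpm can be picked with probability proportional to their rpm, so higher-rpm deployments receive proportionally more traffic. A minimal illustrative sketch of that weighting (not litellm internals; the deployment dicts are trimmed to just what matters here):

import random

# Weights mirror the rpm values in the config above: 6 : 6 : 1440
deployments = [
    {"api_base": "<your-azure-endpoint>", "rpm": 6},
    {"api_base": "https://my-endpoint-canada-berri992.openai.azure.com/", "rpm": 6},
    {"api_base": "https://openai-france-1234.openai.azure.com/", "rpm": 1440},
]

# Pick a deployment with probability proportional to its rpm
choice = random.choices(deployments, weights=[d["rpm"] for d in deployments], k=1)[0]
print(choice["api_base"])  # the 1440-rpm deployment is chosen ~99% of the time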
Step 2: Start Proxy with config
$ litellm --config /path/to/config.yaml
Test - Simple Call
Here requests with model=gpt-3.5-turbo will be routed across multiple instances of azure/gpt-3.5-turbo
👉 Key Change: model="gpt-3.5-turbo"
Check the model_id in the response headers to make sure the requests are being load balanced (see the header-checking sketch after the examples below)
- OpenAI Python v1.0.0+
- Curl Request
- Langchain
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": "this is a test request, write a short poem"
        }
    ]
)

print(response)
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "gpt-3.5-turbo",
    "messages": [
        {
            "role": "user",
            "content": "what llm are you"
        }
    ]
}'
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain.schema import HumanMessage, SystemMessage
import os

os.environ["OPENAI_API_KEY"] = "anything"

chat = ChatOpenAI(
    openai_api_base="http://0.0.0.0:4000",
    model="gpt-3.5-turbo",
)

messages = [
    SystemMessage(
        content="You are a helpful assistant that im using to make a test request to."
    ),
    HumanMessage(
        content="test from litellm. tell me why it's amazing in 1 sentence"
    ),
]
response = chat(messages)

print(response)
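A minimal sketch for checking the model_id response header mentioned above, using the OpenAI SDK's raw-response API (the x-litellm-model-id header name is an assumption here; the raw-response pattern matches the EU-region example later in this doc):

import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

for _ in range(3):
    raw = client.chat.completions.with_raw_response.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "ping"}],
    )
    # Different ids across calls indicate requests hit different deployments
    print(raw.headers.get("x-litellm-model-id"))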
Test - Load Balancing
In this request, the following will occur:
- A rate limit exception will be raised
- The LiteLLM proxy will retry the request on the model group (default is 3 retries)
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "user", "content": "Hi there!"}
    ],
    "mock_testing_rate_limit_error": true
}'
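The same mock flag can be sent from the OpenAI SDK via extra_body, the mechanism this doc uses later for client-side fallbacks (a sketch; the flag is the one from the curl above):

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hi there!"}],
    # Ask the proxy to simulate a rate limit error, then retry across deployments
    extra_body={"mock_testing_rate_limit_error": True},
)
print(response)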
Test - Client Side Fallbacks
In this request, the following will occur:
- The request to model="zephyr-beta" will fail
- The litellm proxy will loop through all the model groups specified in fallbacks=["gpt-3.5-turbo"]
- The request to model="gpt-3.5-turbo" will succeed, and the client making the request will get a response from gpt-3.5-turbo
👉 Key Change: "fallbacks": ["gpt-3.5-turbo"]
- OpenAI Python v1.0.0+
- Curl Request
- Langchain
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="zephyr-beta",
    messages=[
        {
            "role": "user",
            "content": "this is a test request, write a short poem"
        }
    ],
    extra_body={
        "fallbacks": ["gpt-3.5-turbo"]
    }
)

print(response)
Pass fallbacks as part of the request body
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "zephyr-beta",
    "messages": [
        {
            "role": "user",
            "content": "what llm are you"
        }
    ],
    "fallbacks": ["gpt-3.5-turbo"]
}'
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain.schema import HumanMessage, SystemMessage
import os

os.environ["OPENAI_API_KEY"] = "anything"

chat = ChatOpenAI(
    openai_api_base="http://0.0.0.0:4000",
    model="zephyr-beta",
    extra_body={
        "fallbacks": ["gpt-3.5-turbo"]
    }
)

messages = [
    SystemMessage(
        content="You are a helpful assistant that im using to make a test request to."
    ),
    HumanMessage(
        content="test from litellm. tell me why it's amazing in 1 sentence"
    ),
]
response = chat(messages)

print(response)
Advanced
Fallbacks + Retries + Timeouts + Cooldowns
To set fallbacks, just do:
litellm_settings:
  fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo"]}]
This covers all errors (429, 500, etc.)
Set via config
model_list:
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8002
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8003
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: <my-openai-key>
  - model_name: gpt-3.5-turbo-16k
    litellm_params:
      model: gpt-3.5-turbo-16k
      api_key: <my-openai-key>

litellm_settings:
  num_retries: 3 # retry call 3 times on each model_name (e.g. zephyr-beta)
  request_timeout: 10 # raise a Timeout error if the call takes longer than 10s. Sets litellm.request_timeout
  fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo"]}] # fallback to gpt-3.5-turbo if the call fails num_retries times
  allowed_fails: 3 # cooldown the model if it fails more than 3 calls in a minute
  cooldown_time: 30 # how long (in seconds) to cooldown the model if fails/min > allowed_fails
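If you use litellm from Python instead of the proxy, the same reliability settings map onto the Router. A rough sketch (assumes the litellm SDK; the parameter names mirror the config keys above, and the model list is trimmed to two deployments):

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "zephyr-beta",
            "litellm_params": {
                "model": "huggingface/HuggingFaceH4/zephyr-7b-beta",
                "api_base": "http://0.0.0.0:8001",
            },
        },
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {"model": "gpt-3.5-turbo", "api_key": "<my-openai-key>"},
        },
    ],
    num_retries=3,                                   # retry each call 3 times
    timeout=10,                                      # 10s request timeout
    fallbacks=[{"zephyr-beta": ["gpt-3.5-turbo"]}],  # fallback after retries fail
    allowed_fails=3,                                 # failures/min before cooldown
    cooldown_time=30,                                # cooldown period in seconds
)

response = router.completion(
    model="zephyr-beta",
    messages=[{"role": "user", "content": "ping"}],
)
print(response)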
Test Fallbacks!
Check if your fallbacks are working as expected.
Regular Fallbacks
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "my-bad-model",
    "messages": [
        {
            "role": "user",
            "content": "ping"
        }
    ],
    "mock_testing_fallbacks": true # 👈 KEY CHANGE
}'
Content Policy Fallbacks
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "my-bad-model",
    "messages": [
        {
            "role": "user",
            "content": "ping"
        }
    ],
    "mock_testing_content_policy_fallbacks": true # 👈 KEY CHANGE
}'
Context Window Fallbacks
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "my-bad-model",
    "messages": [
        {
            "role": "user",
            "content": "ping"
        }
    ],
    "mock_testing_context_window_fallbacks": true # 👈 KEY CHANGE
}'
Context Window Fallbacks (Pre-Call Checks + Fallbacks)
Before making the call, check if the call is within the model's context window, with enable_pre_call_checks: true.
1. Setup config
For Azure deployments, set the base model. Pick the base model from this list; all Azure models start with azure/.
- Same group
- Context Window Fallbacks (different groups)
Filter older instances of a model (e.g. gpt-3.5-turbo) with smaller context windows
router_settings:
  enable_pre_call_checks: true # 1. Enable pre-call checks

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
    model_info:
      base_model: azure/gpt-4-1106-preview # 2. 👈 (Azure-only) Set the base model
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo-1106
      api_key: os.environ/OPENAI_API_KEY
Start proxy
litellm --config /path/to/config.yaml
# RUNNING on http://0.0.0.0:4000
Test it!
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# 5,000 repetitions: far too large for the 16k gpt-3.5-turbo-1106 deployment,
# but within the 128k window of the azure/gpt-4-1106-preview base model
text = "What is the meaning of 42?" * 5000

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": text},
        {"role": "user", "content": "Who was Alexander?"},
    ],
)

print(response)
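To confirm which deployment the pre-call check selected, you can re-run the request with the raw-response API and read the api-base header (a sketch; the same header is used in the EU-region example further below):

import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

text = "What is the meaning of 42?" * 5000  # far beyond a 16k context window

raw = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": text},
        {"role": "user", "content": "Who was Alexander?"},
    ],
)
# Expect the Azure api_base, since only that deployment's 128k window fits the request
print(raw.headers.get("x-litellm-model-api-base"))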
Fallback to a bigger model if the current model is too small.
router_settings:
  enable_pre_call_checks: true # 1. Enable pre-call checks

model_list:
  - model_name: gpt-3.5-turbo-small
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
    model_info:
      base_model: azure/gpt-4-1106-preview # 2. 👈 (Azure-only) Set the base model
  - model_name: gpt-3.5-turbo-large
    litellm_params:
      model: gpt-3.5-turbo-1106
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-opus
    litellm_params:
      model: claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  context_window_fallbacks: [{"gpt-3.5-turbo-small": ["gpt-3.5-turbo-large", "claude-opus"]}]
Start proxy
litellm --config /path/to/config.yaml
# RUNNING on http://0.0.0.0:4000
Test it!
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

text = "What is the meaning of 42?" * 5000

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(
    model="gpt-3.5-turbo-small",
    messages=[
        {"role": "system", "content": text},
        {"role": "user", "content": "Who was Alexander?"},
    ],
)

print(response)
Content Policy Fallbacks
Fallback across providers (e.g. from Azure OpenAI to Anthropic) on content policy violation errors.
model_list:
  - model_name: gpt-3.5-turbo-small
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
  - model_name: claude-opus
    litellm_params:
      model: claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  content_policy_fallbacks: [{"gpt-3.5-turbo-small": ["claude-opus"]}]
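To verify this route without triggering a real content policy violation, you can reuse the mock_testing_content_policy_fallbacks flag from the "Test Fallbacks!" section above (a sketch using the OpenAI SDK):

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo-small",
    messages=[{"role": "user", "content": "ping"}],
    # Simulate a content policy error so the proxy falls back to claude-opus
    extra_body={"mock_testing_content_policy_fallbacks": True},
)
print(response)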
Default Fallbacks
You can also set default_fallbacks, in case a specific model group is misconfigured or performing badly.
model_list:
  - model_name: gpt-3.5-turbo-small
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
  - model_name: claude-opus
    litellm_params:
      model: claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  default_fallbacks: ["claude-opus"]
This will default to claude-opus in case any model fails.
A model-specific fallback (e.g. {"gpt-3.5-turbo-small": ["claude-opus"]}) overrides the default fallback.
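You can exercise the default fallback with the mock_testing_fallbacks flag from the "Test Fallbacks!" section (a sketch; the simulated failure should be retried on claude-opus):

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo-small",
    messages=[{"role": "user", "content": "ping"}],
    # Simulate a failure so the proxy falls through to default_fallbacks
    extra_body={"mock_testing_fallbacks": True},
)
print(response)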
EU-Region Filtering (Pre-Call Checks)
Before making the call, filter out deployments outside the allowed region, with enable_pre_call_checks: true.
Set the 'region_name' of your deployments.
Note: LiteLLM can automatically infer the region_name for Vertex AI, Bedrock, and IBM WatsonxAI based on your litellm params. For Azure, set litellm.enable_preview = True.
Set config
router_settings:
  enable_pre_call_checks: true # 1. Enable pre-call checks

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
      region_name: "eu" # 👈 SET EU REGION
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo-1106
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gemini-pro
    litellm_params:
      model: vertex_ai/gemini-pro-1.5
      vertex_project: adroit-crow-1234
      vertex_location: us-east1 # 👈 'region_name' is auto-inferred
Start proxy
litellm --config /path/to/config.yaml
# RUNNING on http://0.0.0.0:4000
Test it!
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Who was Alexander?"}]
)

print(response)

# confirm which deployment served the request
print(response.headers.get("x-litellm-model-api-base"))
Custom Timeouts, Stream Timeouts - Per Model
For each model you can set timeout and stream_timeout under litellm_params
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-eu
      api_base: https://my-endpoint-europe-berri-992.openai.azure.com/
      api_key: <your-key>
      timeout: 0.1                      # timeout in (seconds)
      stream_timeout: 0.01              # timeout for stream requests (seconds)
      max_retries: 5
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: <your-key>
      timeout: 0.1                      # timeout in (seconds)
      stream_timeout: 0.01              # timeout for stream requests (seconds)
      max_retries: 5
Start Proxy
$ litellm --config /path/to/config.yaml
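With stream_timeout set this low (0.01s), streaming requests are expected to time out quickly and be retried across deployments, up to max_retries. A quick way to observe this from the OpenAI SDK (a sketch; exact error behavior depends on your proxy version):

import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# A streaming call; with a 0.01s stream_timeout this will likely raise a
# timeout error once the proxy exhausts its retries
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "ping"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")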
Setting Dynamic Timeouts - Per Request
LiteLLM Proxy supports setting a timeout per request
Example Usage
- Curl Request
- OpenAI v1.0.0+
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data-raw '{
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "user", "content": "what color is red"}
    ],
    "logit_bias": {"12481": 100},
    "timeout": 1
}'
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "what color is red"}
    ],
    logit_bias={12481: 100},
    timeout=1
)

print(response)
Setting Fallbacks for Wildcard Models
You can set fallbacks for wildcard models (e.g. azure/*) in your config file.
- Setup config
model_list:
  - model_name: "gpt-4o"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: os.environ/OPENAI_API_KEY
  - model_name: "azure/*"
    litellm_params:
      model: "azure/*"
      api_key: os.environ/AZURE_API_KEY
      api_base: os.environ/AZURE_API_BASE

litellm_settings:
  fallbacks: [{"gpt-4o": ["azure/gpt-4o"]}]
- Start Proxy
litellm --config /path/to/config.yaml
- Test it!
curl -L -X POST 'http://0.0.0.0:4000/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "what color is red"
                }
            ]
        }
    ],
    "max_tokens": 300,
    "mock_testing_fallbacks": true
}'