[BETA] Request Prioritization

info

Beta feature. Use for testing only.

Prioritize LLM API requests during periods of high traffic.

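Add the request to a priority queue
Poll the queue to check whether the request can be made. Returns 'True':
- if there are healthy deployments
- or if the request is at the top of the queue
Priority - the lower the number, the higher the priority:
- e.g. priority=0 > priority=2000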
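
The add-and-poll rule above can be modelled in a few lines. This is an illustrative sketch only, not LiteLLM's actual scheduler: the class name `ToyScheduler`, its methods, and the in-memory heap are all assumptions made for the example.

```python
import heapq
import itertools


class ToyScheduler:
    """Hypothetical in-memory model of the add/poll rule (not LiteLLM's code)."""

    def __init__(self):
        self._heap = []  # entries are (priority, arrival_order, request_id)
        self._arrival = itertools.count()  # tie-breaker: earlier requests first

    def add_request(self, request_id, priority):
        heapq.heappush(self._heap, (priority, next(self._arrival), request_id))

    def poll(self, request_id, healthy_deployments):
        """Return True if the request may proceed, removing it from the queue."""
        at_top = bool(self._heap) and self._heap[0][2] == request_id
        if not (healthy_deployments or at_top):
            return False
        self._heap = [entry for entry in self._heap if entry[2] != request_id]
        heapq.heapify(self._heap)
        return True


scheduler = ToyScheduler()
scheduler.add_request("low", priority=2000)
scheduler.add_request("high", priority=0)

# No healthy deployments: only the request at the top of the queue proceeds.
print(scheduler.poll("low", healthy_deployments=[]))   # False
print(scheduler.poll("high", healthy_deployments=[]))  # True
```

The arrival counter is a design choice of this sketch: it keeps requests with equal priority in first-come, first-served order.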

Quick Start

import asyncio

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {
                "model": "gpt-3.5-turbo",
                "mock_response": "Hello world this is Macintosh!",  # fakes the LLM API call
                "rpm": 1,
            },
        },
    ],
    timeout=2,  # time out if a request takes longer than 2 seconds
    routing_strategy="usage-based-routing-v2",
    polling_interval=0.03,  # poll the queue every 0.03 s (30 ms) if no healthy deployments are available
)


async def main():
    try:
        _response = await router.acompletion(  # 👈 adds to queue + polls + makes the call
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hey!"}],
            priority=0,  # 👈 lower number = higher priority
        )
    except Exception as e:
        print(f"didn't make request: {e}")


asyncio.run(main())
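
When several requests compete for the same deployment, each caller passes its own `priority`. The sketch below shows one possible way to fire such requests concurrently and collect either the response or the exception for each. The helper `send_with_priorities` and the `StubRouter` class are hypothetical names introduced here; the stub only exists so the snippet runs without litellm or network access, and it does not reproduce any queueing behaviour.

```python
import asyncio


async def send_with_priorities(router, model, prompts_with_priority):
    """Send all requests concurrently; return (priority, result-or-exception) pairs."""

    async def one(prompt, priority):
        try:
            response = await router.acompletion(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                priority=priority,  # lower number = higher priority
            )
            return priority, response
        except Exception as exc:  # e.g. a timeout while waiting in the queue
            return priority, exc

    return await asyncio.gather(
        *(one(prompt, priority) for prompt, priority in prompts_with_priority)
    )


class StubRouter:
    """Hypothetical stand-in for a Router; returns a canned reply immediately."""

    async def acompletion(self, model, messages, priority):
        await asyncio.sleep(0)
        return f"mock reply to {messages[0]['content']!r}"


results = asyncio.run(
    send_with_priorities(
        StubRouter(),
        "gpt-3.5-turbo",
        [("Hey!", 0), ("Background job", 2000)],
    )
)
for priority, outcome in results:
    print(priority, outcome)
```

To try this against a real router, pass the `router` object from the quick start in place of `StubRouter()`.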

LiteLLM Proxy

To prioritize requests on the LiteLLM Proxy, add priority to the request.

# Set "priority" in the request body
curl -X POST 'http://localhost:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "gpt-3.5-turbo-fake-model",
    "messages": [
        {
        "role": "user",
        "content": "what is the meaning of the universe? 1234"
        }],
    "priority": 0
}'

import openai
client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# request sent to the model set on the litellm proxy, `litellm --model`
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages = [
        {
            "role": "user",
            "content": "this is a test request, write a short poem"
        }
    ],
    extra_body={
        "priority": 0  # 👈 set value here
    }
)

print(response)

Advanced - Redis Caching

Use Redis caching to prioritize requests across multiple LiteLLM instances.
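
One way to picture why a shared cache matters: if every instance kept its own queue, each would consider its own request to be at the top. The sketch below is a toy model under stated assumptions, not LiteLLM's implementation. The class `SharedQueueView`, the key name, and the use of a plain dict as a stand-in for Redis are all hypothetical.

```python
import heapq


class SharedQueueView:
    """Hypothetical scheduler instance that keeps its queue in an external store."""

    def __init__(self, store, key="queue:gpt-3.5-turbo"):
        self.store = store  # a dict here; a Redis server would play this role
        self.key = key

    def add_request(self, request_id, priority):
        queue = self.store.setdefault(self.key, [])
        heapq.heappush(queue, (priority, request_id))

    def is_at_top(self, request_id):
        queue = self.store.get(self.key, [])
        return bool(queue) and queue[0][1] == request_id


shared = {}  # stand-in for one cache reachable by both instances
instance_a = SharedQueueView(shared)
instance_b = SharedQueueView(shared)

instance_a.add_request("batch-job", priority=2000)
instance_b.add_request("user-chat", priority=0)

# Both instances see the same queue, so the priority-0 request goes first.
print(instance_a.is_at_top("batch-job"))  # False
print(instance_b.is_at_top("user-chat"))  # True
```

With two separate dicts instead of `shared`, both calls would return `True`, which is the situation the shared cache is meant to avoid.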

SDK

import asyncio
import os

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {
                "model": "gpt-3.5-turbo",
                "mock_response": "Hello world this is Macintosh!",  # fakes the LLM API call
                "rpm": 1,
            },
        },
    ],
    ### REDIS PARAMS ###
    redis_host=os.environ["REDIS_HOST"],
    redis_password=os.environ["REDIS_PASSWORD"],
    redis_port=int(os.environ["REDIS_PORT"]),  # environment variables are strings
)


async def main():
    try:
        _response = await router.acompletion(  # 👈 adds to queue + polls + makes the call
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hey!"}],
            priority=0,  # 👈 lower number = higher priority
        )
    except Exception as e:
        print(f"didn't make request: {e}")


asyncio.run(main())

Proxy

model_list:
    - model_name: gpt-3.5-turbo-fake-model
      litellm_params:
        model: gpt-3.5-turbo
        mock_response: "hello world!" 
        api_key: my-good-key

litellm_settings:
    request_timeout: 600 # 👈 will keep retrying until the timeout occurs

router_settings:
    redis_host: os.environ/REDIS_HOST
    redis_password: os.environ/REDIS_PASSWORD
    redis_port: os.environ/REDIS_PORT

$ litellm --config /path/to/config.yaml 

# RUNNING on http://0.0.0.0:4000

# Set "priority" in the request body
curl -X POST 'http://localhost:4000/queue/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "gpt-3.5-turbo-fake-model",
    "messages": [
        {
        "role": "user",
        "content": "what is the meaning of the universe? 1234"
        }],
    "priority": 0
}'
