VLLM

LiteLLM supports all models served by vLLM.

Quick Start

Usage - litellm.completion (calling a vLLM endpoint)

vLLM exposes an OpenAI-compatible endpoint. Here is how to call it with LiteLLM.

To call a hosted vLLM server with LiteLLM, add the following to your completion call:

```python
model="hosted_vllm/<your-vllm-model-name>"
api_base="<your-hosted-vllm-server>"
```

```python
import litellm

messages = [{"role": "user", "content": "Hello, how are you?"}]

response = litellm.completion(
    model="hosted_vllm/facebook/opt-125m",  # pass the vLLM model name
    messages=messages,
    api_base="https://hosted-vllm-api.co",
    temperature=0.2,
    max_tokens=80,
)

print(response)
```
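The `hosted_vllm/` prefix is only a routing hint: everything after the first `/` is the model name the vLLM server knows about. The sketch below illustrates that naming convention. `split_provider` and `KNOWN_PREFIXES` are hypothetical names, and this is not LiteLLM's internal routing code.

```python
# Hypothetical helper illustrating the "<provider>/<model>" naming convention.
# It is NOT LiteLLM's routing code; it only shows how the prefix is separated.
KNOWN_PREFIXES = ("hosted_vllm", "vllm")


def split_provider(model: str) -> tuple[str, str]:
    """Split 'hosted_vllm/facebook/opt-125m' into ('hosted_vllm', 'facebook/opt-125m')."""
    prefix, sep, rest = model.partition("/")  # split on the FIRST slash only
    if sep and prefix in KNOWN_PREFIXES:
        return prefix, rest
    # No recognised provider prefix: treat the whole string as the model name.
    return "", model


provider, model_name = split_provider("hosted_vllm/facebook/opt-125m")
print(provider, model_name)
```

Only the first slash is consumed, so Hugging Face style `org/name` identifiers reach the server intact.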

Usage - LiteLLM Proxy Server (calling a vLLM endpoint)

Here is how to call an OpenAI-compatible endpoint through the LiteLLM Proxy Server.

Modify config.yaml

```yaml
model_list:
  - model_name: my-model
    litellm_params:
      model: hosted_vllm/facebook/opt-125m  # add the hosted_vllm/ prefix to route as an OpenAI-compatible provider
      api_base: https://hosted-vllm-api.co  # api_base of the OpenAI-compatible server
```
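Clients only ever send the public `model_name` (`my-model`). The proxy looks up that alias and forwards the call using the matching `litellm_params`. Below is a minimal sketch of that lookup. It assumes the YAML above has already been parsed into a Python dict, and `resolve_alias` is a hypothetical name. The real proxy does more than this, for example routing among several entries that share a `model_name`.

```python
# Hypothetical sketch: the config.yaml above as a Python structure, plus the
# alias -> upstream-parameters lookup the proxy conceptually performs.
MODEL_LIST = [
    {
        "model_name": "my-model",
        "litellm_params": {
            "model": "hosted_vllm/facebook/opt-125m",
            "api_base": "https://hosted-vllm-api.co",
        },
    },
]


def resolve_alias(alias: str) -> dict:
    """Return litellm_params for the first entry whose model_name matches alias."""
    for entry in MODEL_LIST:
        if entry["model_name"] == alias:
            return entry["litellm_params"]
    raise KeyError(f"no deployment configured for {alias!r}")


params = resolve_alias("my-model")
print(params["model"], params["api_base"])
```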

Start the proxy
```
$ litellm --config /path/to/config.yaml
```

Send a request to the LiteLLM Proxy Server

```python
import openai

client = openai.OpenAI(
    api_key="sk-1234",              # pass your LiteLLM proxy key, if you use virtual keys
    base_url="http://0.0.0.0:4000"  # LiteLLM proxy base URL
)

response = client.chat.completions.create(
    model="my-model",
    messages=[
        {
            "role": "user",
            "content": "what llm are you"
        }
    ],
)

print(response)
```

```shell
curl --location 'http://0.0.0.0:4000/chat/completions' \
    --header 'Authorization: Bearer sk-1234' \
    --header 'Content-Type: application/json' \
    --data '{
    "model": "my-model",
    "messages": [
        {
        "role": "user",
        "content": "what llm are you"
        }
    ]
}'
```
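For environments without the `openai` package or `curl`, the same POST request can be built using only the standard library. The sketch below reuses the proxy URL and the `sk-1234` key from the examples above. It constructs the request and inspects it but does not send it. Sending requires a running proxy, and the call for that is shown commented out.

```python
import json
import urllib.request

# Values taken from the examples above; adjust for your deployment.
PROXY_URL = "http://0.0.0.0:4000/chat/completions"
API_KEY = "sk-1234"

payload = {
    "model": "my-model",  # the model_name alias from config.yaml
    "messages": [{"role": "user", "content": "what llm are you"}],
}

# Build the same POST request as the curl command (not sent here).
req = urllib.request.Request(
    PROXY_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# To actually send it against a running proxy:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
print(req.get_method(), req.full_url)
```

Serialising the body with `json.dumps` also rules out hand-typed JSON mistakes such as trailing commas.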

Extras - for the `vllm` pip package

Usage - `litellm.completion`

```shell
pip install litellm vllm
```

```python
import litellm

messages = [{"role": "user", "content": "Hello, how are you?"}]

response = litellm.completion(
    model="vllm/facebook/opt-125m",  # the vllm/ prefix tells litellm that custom_llm_provider == "vllm"
    messages=messages,
    temperature=0.2,
    max_tokens=80,
)

print(response)
```

Batch Completion

```python
from litellm import batch_completion

model_name = "facebook/opt-125m"
provider = "vllm"
messages = [[{"role": "user", "content": "Hey, how's it going?"}] for _ in range(5)]

response_list = batch_completion(
    model=model_name,
    custom_llm_provider=provider,  # easy to switch to huggingface, replicate, together ai, sagemaker, etc.
    messages=messages,
    temperature=0.2,
    max_tokens=80,
)
print(response_list)
```
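`batch_completion` takes a list of conversations (a list of message lists) and returns `response_list`. The sketch below shows only the shape of such a call: one result per conversation, in input order. It uses a thread pool and a hypothetical stand-in `fake_completion`, so it runs without a model. It is not LiteLLM's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_completion(messages: list[dict]) -> str:
    """Hypothetical stand-in for one completion call: echoes the last message."""
    return "echo: " + messages[-1]["content"]


def batch(conversations: list[list[dict]]) -> list[str]:
    """Run one completion per conversation; pool.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fake_completion, conversations))


conversations = [[{"role": "user", "content": f"question {i}"}] for i in range(5)]
results = batch(conversations)
print(results[0])  # echo: question 0
```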

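Prompt Templates

For models with special prompt templates (e.g. Llama2), we format the prompt to fit their template.

What if we don't support the model you need? You can specify your own custom prompt formatting in case we don't cover your model yet.

Does this mean you have to specify a prompt for every model? No. By default, we join the contents of your messages into a single prompt (this works for Bloom, T-5, Llama-2 base models, etc.).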

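Default Prompt Template

```python
def default_pt(messages):
    # Join every message's content with single spaces; roles are discarded.
    return " ".join(message["content"] for message in messages)
```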
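Applying the default template to a short conversation shows what it produces. The contents are joined with spaces and the role of each message is dropped, which is why chat-tuned models need a dedicated template. The function is repeated so the block runs on its own.

```python
def default_pt(messages):
    # Join every message's content with single spaces; roles are discarded.
    return " ".join(message["content"] for message in messages)


messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi there"},
]
prompt = default_pt(messages)
print(prompt)  # You are helpful. Hi there
```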

Code for how prompt templates work in LiteLLM

Prompt Templates We Already Support

| Model Name | Works for Models | Function Call |
|---|---|---|
| meta-llama/Llama-2-7b-chat | All meta-llama Llama-2 chat models | `completion(model='vllm/meta-llama/Llama-2-7b-chat', messages=messages, api_base="your_api_endpoint")` |
| tiiuae/falcon-7b-instruct | All Falcon instruct models | `completion(model='vllm/tiiuae/falcon-7b-instruct', messages=messages, api_base="your_api_endpoint")` |
| mosaicml/mpt-7b-chat | All MPT chat models | `completion(model='vllm/mosaicml/mpt-7b-chat', messages=messages, api_base="your_api_endpoint")` |
| codellama/CodeLlama-34b-Instruct-hf | All CodeLlama instruct models | `completion(model='vllm/codellama/CodeLlama-34b-Instruct-hf', messages=messages, api_base="your_api_endpoint")` |
| WizardLM/WizardCoder-Python-34B-V1.0 | All WizardCoder models | `completion(model='vllm/WizardLM/WizardCoder-Python-34B-V1.0', messages=messages, api_base="your_api_endpoint")` |
| Phind/Phind-CodeLlama-34B-v2 | All Phind-CodeLlama models | `completion(model='vllm/Phind/Phind-CodeLlama-34B-v2', messages=messages, api_base="your_api_endpoint")` |
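Each row covers a whole model family, so the template has to be chosen from the model name. The sketch below shows substring-based selection under that assumption. `TEMPLATE_RULES`, the family labels, and `pick_template` are all hypothetical, and LiteLLM's actual matching logic may differ. The rules are checked in order, with the most specific first, so that `phind-codellama` is not caught by the generic `codellama` rule.

```python
# Hypothetical rule table: (lowercase substring of the model name, template family).
# Order matters: more specific patterns must come before generic ones.
TEMPLATE_RULES = [
    ("phind-codellama", "phind-codellama"),
    ("wizardcoder", "wizardcoder"),
    ("codellama", "codellama-instruct"),
    ("llama-2", "llama2-chat"),
    ("falcon", "falcon-instruct"),
    ("mpt", "mpt-chat"),
]


def pick_template(model: str) -> str:
    """Return the template family for a model name, or 'default' if nothing matches."""
    name = model.lower()
    for pattern, family in TEMPLATE_RULES:
        if pattern in name:
            return family
    return "default"  # fall back to joining message contents


print(pick_template("vllm/Phind/Phind-CodeLlama-34B-v2"))  # phind-codellama
```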

Custom Prompt Templates

```python
import litellm
from litellm import completion

# Create your own custom prompt template
litellm.register_prompt_template(
    model="togethercomputer/LLaMA-2-7B-32K",
    roles={
        "system": {
            "pre_message": "[INST] <<SYS>>\n",
            "post_message": "\n<</SYS>>\n [/INST]\n",
        },
        "user": {
            "pre_message": "[INST] ",
            "post_message": " [/INST]\n",
        },
        "assistant": {
            "pre_message": "\n",
            "post_message": "\n",
        },
    },  # tells LiteLLM how to map OpenAI messages to this model
)

messages = [{"role": "user", "content": "Hello, how are you?"}]


def test_vllm_custom_model():
    model = "vllm/togethercomputer/LLaMA-2-7B-32K"
    response = completion(model=model, messages=messages)
    print(response["choices"][0]["message"]["content"])
    return response


test_vllm_custom_model()
```
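The sketch below applies the `roles` mapping by hand, which shows what text the model actually receives. Each message is wrapped in its role's `pre_message` and `post_message`, and the pieces are concatenated. `apply_role_template` is a hypothetical helper, not LiteLLM's formatter, which may also add extra text around the whole prompt.

```python
# Same role mapping as the register_prompt_template call above.
ROLES = {
    "system": {"pre_message": "[INST] <<SYS>>\n", "post_message": "\n<</SYS>>\n [/INST]\n"},
    "user": {"pre_message": "[INST] ", "post_message": " [/INST]\n"},
    "assistant": {"pre_message": "\n", "post_message": "\n"},
}


def apply_role_template(roles: dict, messages: list[dict]) -> str:
    """Hypothetical helper: wrap each message in its role's pre/post strings and join."""
    parts = []
    for message in messages:
        role = roles.get(message["role"], {})  # unknown roles get no wrapping
        parts.append(role.get("pre_message", "") + message["content"] + role.get("post_message", ""))
    return "".join(parts)


messages = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
]
print(apply_role_template(ROLES, messages))
```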

Implementation Code
