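OpenAI Compatible Server

This article covers deploying a single LLM across multiple GPUs on a single node, serving it through an OpenAI-compatible interface, and using the service API. For convenience, we refer to this service as api_server. For serving multiple models in parallel, see the guide on the request distributor server.

The following sections first introduce the ways to launch the service, so that you can choose the one that fits your application scenario.

Next, we focus on the definitions of the service's RESTful API, explore the various ways to interact with the interface, and demonstrate how to try the service through the Swagger UI or the LMDeploy CLI tools.

Finally, we show how to integrate the service into a WebUI, as a reference for easily setting up a demo.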

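Launch Service

Taking the internlm2_5-7b-chat model hosted on the huggingface hub as an example, you can choose one of the following methods to launch the service.

Option 1: Launching with lmdeploy CLI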

lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333

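The arguments of api_server can be viewed with the command lmdeploy serve api_server -h. For instance, --tp sets the tensor parallelism, --session-len specifies the maximum length of the context window, and --cache-max-entry-count adjusts the ratio of GPU memory used for the k/v cache.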
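As a minimal sketch of how these options combine, the helper below assembles an lmdeploy serve api_server command line from the flags mentioned above. The function name build_api_server_command and the example values (2 GPUs, a session length of 8192, a cache ratio of 0.5) are hypothetical and chosen only for illustration; lmdeploy serve api_server -h remains the authoritative list of options.

```python
import shlex
from typing import List, Optional


def build_api_server_command(model: str,
                             server_port: int = 23333,
                             tp: Optional[int] = None,
                             session_len: Optional[int] = None,
                             cache_max_entry_count: Optional[float] = None) -> List[str]:
    """Assemble an argv list for `lmdeploy serve api_server` (illustrative helper)."""
    cmd = ['lmdeploy', 'serve', 'api_server', model,
           '--server-port', str(server_port)]
    # Only append a flag when the caller set it, so the server defaults apply otherwise.
    if tp is not None:
        cmd += ['--tp', str(tp)]
    if session_len is not None:
        cmd += ['--session-len', str(session_len)]
    if cache_max_entry_count is not None:
        cmd += ['--cache-max-entry-count', str(cache_max_entry_count)]
    return cmd


if __name__ == '__main__':
    argv = build_api_server_command('internlm/internlm2_5-7b-chat',
                                    tp=2,
                                    session_len=8192,
                                    cache_max_entry_count=0.5)
    # Print a shell-ready command; pass `argv` to subprocess.run to execute it.
    print(shlex.join(argv))
```

Building the command as a list rather than a single string avoids shell-quoting problems if the model path contains spaces.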

Option 2: Deploying with docker

With the official LMDeploy docker image, you can run the OpenAI compatible server as follows:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 23333:23333 \
    --ipc=host \
    openmmlab/lmdeploy:latest \
    lmdeploy serve api_server internlm/internlm2_5-7b-chat

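The arguments of api_server are the same as those described in the "Option 1" section.

Option 3: Deploying to Kubernetes cluster

Connect to a running Kubernetes cluster and deploy the internlm2_5-7b-chat model service with the kubectl command-line tool (replace <your token> with your huggingface hub token):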

sed 's/{{HUGGING_FACE_HUB_TOKEN}}/<your token>/' k8s/deployment.yaml | kubectl create -f - \
    && kubectl create -f k8s/service.yaml

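In the example above, the model data is placed on the local disk of the node (hostPath). If you need multiple replicas, consider replacing it with highly available shared storage, which can be mounted into the container using a PersistentVolume.

RESTful API

LMDeploy's RESTful API is compatible with the following three OpenAI interfaces: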

/v1/chat/completions
/v1/models
/v1/completions

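In addition, LMDeploy defines /v1/chat/interactive to support interactive inference. Unlike v1/chat/completions, interactive inference does not require the client to pass the user's conversation history, because the history is cached on the server side. This approach performs well in multi-turn, long-context inference.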
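To make the difference concrete, the sketch below builds the request body for one follow-up turn in each style. The helper names and the model name internlm2_5-7b-chat are placeholders for illustration; the interactive fields (prompt, session_id, interactive_mode) are the ones used in the cURL example later in this document.

```python
from typing import Dict, List


def chat_completions_payload(model: str,
                             history: List[Dict[str, str]],
                             user_text: str) -> dict:
    """Stateless style: the client resends the whole conversation every round."""
    messages = history + [{'role': 'user', 'content': user_text}]
    return {'model': model, 'messages': messages}


def chat_interactive_payload(session_id: int, user_text: str) -> dict:
    """Interactive style: only the new prompt is sent; history stays on the server."""
    return {'prompt': user_text,
            'session_id': session_id,
            'interactive_mode': True}


history = [
    {'role': 'user', 'content': 'hi'},
    {'role': 'assistant', 'content': 'Hello! How can I help?'},
]
stateless = chat_completions_payload('internlm2_5-7b-chat', history,
                                     'who developed you?')
interactive = chat_interactive_payload(1, 'who developed you?')

print(len(stateless['messages']))  # 3: two history messages plus the new one
print(interactive)
```

The stateless payload grows with every round, while the interactive payload stays the same size, which is why the interactive endpoint suits long multi-turn sessions.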

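After the service has launched successfully, you can overview and try out the offered RESTful APIs through the Swagger UI at http://0.0.0.0:23333, as shown below.

[Figure: swagger_ui]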

Alternatively, you can use LMDeploy's built-in CLI tool to verify the service directly from the console.

# api_server_url is the URL printed by api_server, e.g. http://localhost:23333
lmdeploy serve api_client ${api_server_url}

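If you need to integrate the service into your own projects or products, we recommend the following approaches:

Integrate with `OpenAI`

Here is an example of interacting with the v1/chat/completions service via the openai package. Before running it, install the openai package with pip install openai.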

from openai import OpenAI
client = OpenAI(
    api_key='YOUR_API_KEY',
    base_url="http://0.0.0.0:23333/v1"
)
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "provide three suggestions about time management"},
    ],
    temperature=0.8,
    top_p=0.8
)
print(response)

If you want to use async functions, try the following example:

import asyncio
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(api_key='YOUR_API_KEY',
                         base_url='http://0.0.0.0:23333/v1')
    # Awaiting the paginator returns the first page; no private method is needed.
    model_cards = await client.models.list()
    response = await client.chat.completions.create(
        model=model_cards.data[0].id,
        messages=[
            {
                'role': 'system',
                'content': 'You are a helpful assistant.'
            },
            {
                'role': 'user',
                'content': ' provide three suggestions about time management'
            },
        ],
        temperature=0.8,
        top_p=0.8)
    print(response)

asyncio.run(main())

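You can invoke other OpenAI interfaces in a similar way. For more details, refer to the OpenAI API guide.

Integrate with lmdeploy `APIClient`

Below are some examples demonstrating how to access the service through APIClient.

If you want to use the /v1/chat/completions endpoint, you can try the following code: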

from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
model_name = api_client.available_models[0]
messages = [{"role": "user", "content": "Say this is a test!"}]
for item in api_client.chat_completions_v1(model=model_name, messages=messages):
    print(item)

For the /v1/completions endpoint, you can try:

from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
model_name = api_client.available_models[0]
for item in api_client.completions_v1(model=model_name, prompt='hi'):
    print(item)

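As for /v1/chat/interactive, the feature is disabled by default. Turn it on by setting interactive_mode = True. Otherwise, it falls back to the OpenAI-compatible interfaces.

Keep in mind that session_id identifies a single sequence, and all requests belonging to the same sequence must share the same session_id. For instance, in a sequence with 10 rounds of chat requests, the session_id in every request should be the same.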

from lmdeploy.serve.openai.api_client import APIClient
# Replace {server_ip} and {server_port} with the address of your api_server.
api_client = APIClient('http://{server_ip}:{server_port}')
messages = [
    "hi, what's your name?",
    "who developed you?",
    "Tell me more about your developers",
    "Summarize the information we've talked so far"
]
for message in messages:
    for item in api_client.chat_interactive_v1(prompt=message,
                                               session_id=1,
                                               interactive_mode=True,
                                               stream=False):
        print(item)
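When one client process serves several conversations at once, each conversation needs its own session_id that stays fixed across rounds. The following is a minimal sketch of one way to manage that on the client side; SessionRegistry is a hypothetical helper and not part of lmdeploy.

```python
import itertools
from typing import Dict


class SessionRegistry:
    """Map each conversation key to one stable session_id (hypothetical helper)."""

    def __init__(self, start: int = 1) -> None:
        self._counter = itertools.count(start)
        self._ids: Dict[str, int] = {}

    def session_id_for(self, conversation: str) -> int:
        # Reuse the id of a known conversation; allocate a fresh one otherwise.
        if conversation not in self._ids:
            self._ids[conversation] = next(self._counter)
        return self._ids[conversation]


registry = SessionRegistry()
alice_round1 = registry.session_id_for('alice')
bob_round1 = registry.session_id_for('bob')
alice_round2 = registry.session_id_for('alice')
print(alice_round1, bob_round1, alice_round2)  # 1 2 1
```

The value returned by session_id_for would be passed as the session_id argument of chat_interactive_v1, so that every round of the same conversation reaches the same server-side history.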

Tools

See api_server_tools.

Integrate with Java/Golang/Rust

You can use openapi-generator-cli to convert http://{server_ip}:{server_port}/openapi.json into a java/rust/golang client. Here is an example:

$ docker run -it --rm -v ${PWD}:/local openapitools/openapi-generator-cli generate -i /local/openapi.json -g rust -o /local/rust

$ ls rust/*
rust/Cargo.toml  rust/git_push.sh  rust/README.md

rust/docs:
ChatCompletionRequest.md  EmbeddingsRequest.md  HttpValidationError.md  LocationInner.md  Prompt.md
DefaultApi.md             GenerateRequest.md    Input.md                Messages.md       ValidationError.md

rust/src:
apis  lib.rs  models

Integrate with cURL

cURL is a tool for inspecting the output of the RESTful APIs.

List served models v1/models

curl http://{server_ip}:{server_port}/v1/models

Chat v1/chat/completions

curl http://{server_ip}:{server_port}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm-chat-7b",
    "messages": [{"role": "user", "content": "Hello! How are you?"}]
  }'
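For readers who prefer a script over the command line, the same chat request can be sketched with only the Python standard library. The helper names build_chat_request and send are illustrative, and the address http://localhost:23333 assumes a locally running api_server; the response layout in the final comment assumes the OpenAI chat-completions format that this endpoint is stated to be compatible with.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str,
                       user_text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {'model': model,
            'messages': [{'role': 'user', 'content': user_text}]}
    return urllib.request.Request(
        url=f'{base_url}/v1/chat/completions',
        data=json.dumps(body).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON reply; needs a running api_server."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode('utf-8'))


req = build_chat_request('http://localhost:23333', 'internlm-chat-7b',
                         'Hello! How are you?')
print(req.get_method(), req.full_url)
# With a server running, the reply text could be read like this:
# print(send(req)['choices'][0]['message']['content'])
```

Separating request construction from sending makes the payload easy to inspect or log before any network traffic happens.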

Text completions v1/completions

curl http://{server_ip}:{server_port}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "llama",
  "prompt": "two steps to build a house:"
}'

Interactive chat v1/chat/interactive

curl http://{server_ip}:{server_port}/v1/chat/interactive \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello! How are you?",
    "session_id": 1,
    "interactive_mode": true
  }'

Integrate with WebUI

# api_server_url is the URL printed by api_server, e.g. http://localhost:23333
# gradio_ui_ip and gradio_ui_port here are for the gradio ui
# example: lmdeploy serve gradio http://localhost:23333 --server-name localhost --server-port 6006
lmdeploy serve gradio ${api_server_url} --server-name ${gradio_ui_ip} --server-port ${gradio_ui_port}

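Launch multiple api servers

The following two steps launch multiple api servers. Just create a Python script with the code below.

1. Launch the proxy server through lmdeploy serve proxy, and note the correct proxy server URL.
2. Launch the script through torchrun --nproc_per_node 2 script.py InternLM/internlm2-chat-1_8b --proxy_url http://{proxy_node_name}:{proxy_node_port}. Note: do not use 0.0.0.0:8000 here; enter the real IP address instead, for example 11.25.34.55:8000.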

import os
import socket
from typing import List, Literal

import fire


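def get_host_ip():
    # Connecting a UDP socket sends no packets; it only makes the OS choose the
    # outbound interface, whose address is then read back.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect(('8.8.8.8', 80))
        return s.getsockname()[0]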


def main(model_path: str,
         tp: int = 1,
         proxy_url: str = 'http://0.0.0.0:8000',
         port: int = 23333,
         backend: Literal['turbomind', 'pytorch'] = 'turbomind'):
    local_rank = int(os.environ.get('LOCAL_RANK', -1))
    world_size = int(os.environ.get('WORLD_SIZE', -1))
    local_ip = get_host_ip()
    if isinstance(port, List):
        assert len(port) == world_size
        port = port[local_rank]
    else:
        port += local_rank * 10
    if (world_size - local_rank) % tp == 0:
        rank_list = ','.join([str(local_rank + i) for i in range(tp)])
        command = f'CUDA_VISIBLE_DEVICES={rank_list} lmdeploy serve api_server {model_path} '\
                  f'--server-name {local_ip} --server-port {port} --tp {tp} '\
                  f'--proxy-url {proxy_url} --backend {backend}'
        print(f'running command: {command}')
        os.system(command)


if __name__ == '__main__':
    fire.Fire(main)
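To see which ranks actually start a server, and on which port and GPUs, the port and device arithmetic of the script can be isolated in a small pure function. This is a sketch that mirrors the integer-port branch of main above; plan_rank is a hypothetical name, and the list-of-ports branch is omitted.

```python
from typing import Optional, Tuple


def plan_rank(local_rank: int,
              world_size: int,
              tp: int = 1,
              base_port: int = 23333) -> Optional[Tuple[int, str]]:
    """Return (port, CUDA_VISIBLE_DEVICES) if this rank launches a server, else None."""
    # Same spacing as `port += local_rank * 10` in the script.
    port = base_port + local_rank * 10
    if (world_size - local_rank) % tp != 0:
        return None  # this rank does not launch a server
    # Each launching rank takes `tp` consecutive GPUs starting at its own index.
    devices = ','.join(str(local_rank + i) for i in range(tp))
    return port, devices


for rank in range(4):
    print(rank, plan_rank(rank, world_size=4, tp=2))
# 0 (23333, '0,1')
# 1 None
# 2 (23353, '2,3')
# 3 None
```

With torchrun --nproc_per_node 4 and tp=2, only ranks 0 and 2 start an api_server, each owning two GPUs, and the ports are spaced 10 apart per rank.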

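FAQ

1. When a user gets "finish_reason":"length", it means the session is too long to be continued. The session length can be changed by passing --session-len to api_server.
2. When OOM appears on the server side, reduce cache_max_entry_count of backend_config when launching the service.
3. When a request to /v1/chat/interactive with the same session_id gets an empty return value and a negative tokens count, consider setting interactive_mode=false to restart the session.
4. The /v1/chat/interactive API disables multi-round conversation by default. The input argument prompt consists of either a single string or the entire chat history.
5. Regarding stop words, only characters that are encoded into a single index are supported. In addition, multiple indexes may decode into results that contain the stop word. In such cases, if the number of these indexes is too large, only the index encoded by the tokenizer is used. If you want to use a stop symbol that is encoded into multiple indexes, consider doing string matching on the streaming client side. Once a match is found, you can break out of the streaming loop.
6. To customize a chat template, refer to chat_template.md.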
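The client-side string matching suggested for multi-index stop words can be sketched as a generator that wraps the stream of text pieces. The function name stream_until_stop and the stop string <|end|> are hypothetical; in practice the chunks would come from the text field of each streamed response item rather than from a list.

```python
from typing import Iterable, Iterator


def stream_until_stop(chunks: Iterable[str], stop: str) -> Iterator[str]:
    """Yield streamed text until `stop` appears, withholding the stop string itself."""
    if not stop:
        raise ValueError('stop must be a non-empty string')
    buffer = ''
    for chunk in chunks:
        buffer += chunk
        idx = buffer.find(stop)
        if idx != -1:
            if idx > 0:
                yield buffer[:idx]
            return  # match found: leave the stream loop
        # Hold back a tail that could still be the beginning of the stop string.
        safe = len(buffer) - (len(stop) - 1)
        if safe > 0:
            yield buffer[:safe]
            buffer = buffer[safe:]
    if buffer:
        yield buffer  # stream ended without a match; flush what is left


pieces = ['Hello wor', 'ld<|e', 'nd|> ignored']
print(''.join(stream_until_stop(pieces, '<|end|>')))  # Hello world
```

Buffering the last len(stop) - 1 characters is what lets the match succeed even when the stop string is split across two or more streamed pieces.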
