Serving LoRA

Launching LoRA

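LoRA is currently supported only by the PyTorch backend. Deployment works the same way as for other models; run lmdeploy serve api_server -h to see the available options. The PyTorch engine arguments include an option for configuring LoRA adapters: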

PyTorch engine arguments:
  --adapters [ADAPTERS [ADAPTERS ...]]
                        Used to set path(s) of lora adapter(s). One can input key-value pairs in xxx=yyy format for multiple lora adapters. If only have one adapter, one can only input the path of the adapter.. Default:
                        None. Type: str

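Pass the Hugging Face model path of the LoRA weights to --adapters as a name=path pair. The name on the left becomes the model name that clients use to select the adapter: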

lmdeploy serve api_server THUDM/chatglm2-6b --adapters mylora=chenchi/lora-chatglm2-6b-guodegang
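When several adapters need to be served, the same command line can be assembled from a mapping of adapter names to paths. The following is a minimal sketch, not part of LMDeploy: build_serve_command is a hypothetical helper, and it relies only on the xxx=yyy format described in the help text above.

```python
import shlex


def build_serve_command(base_model, adapters):
    """Return the argv list for `lmdeploy serve api_server` with LoRA adapters.

    `adapters` maps the served model name (the key) to the adapter path.
    This is a hypothetical helper for illustration, not an LMDeploy API.
    """
    cmd = ["lmdeploy", "serve", "api_server", base_model]
    if adapters:
        cmd.append("--adapters")
        # Each adapter is passed as one name=path token after --adapters.
        cmd.extend(f"{name}={path}" for name, path in adapters.items())
    return cmd


cmd = build_serve_command(
    "THUDM/chatglm2-6b",
    {"mylora": "chenchi/lora-chatglm2-6b-guodegang"},
)
# Prints the same command line as the one shown above.
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd) to start the server from a script.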

Once the service is running, the Swagger UI lists two available model names: 'THUDM/chatglm2-6b' and 'mylora'. The latter is the key given in the --adapters argument.
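The model names can also be checked from a script instead of the Swagger UI. The sketch below assumes the server exposes the OpenAI-style GET /v1/models route and returns the usual {"data": [{"id": ...}]} shape; fetch_model_names and model_names are hypothetical helpers, and the sample payload is hand-written for illustration rather than captured from a server.

```python
import json
from urllib.request import urlopen


def model_names(payload):
    """Extract the model ids from an OpenAI-style list-models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_names(base_url="http://localhost:23333"):
    """Query a running server (assumed route: GET /v1/models)."""
    with urlopen(f"{base_url}/v1/models") as resp:
        return model_names(json.load(resp))


# Hand-written sample in the assumed response shape, not real server output.
sample = {
    "object": "list",
    "data": [
        {"id": "THUDM/chatglm2-6b", "object": "model"},
        {"id": "mylora", "object": "model"},
    ],
}
print(model_names(sample))
```

With the server from the previous step running, fetch_model_names() should include both the base model name and the adapter key.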

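Client usage

Command line interface

When calling the OpenAI-compatible endpoint, the model parameter selects either the base model or a specific LoRA adapter for inference. The following example runs inference with the chenchi/lora-chatglm2-6b-guodegang adapter, which was registered under the name mylora.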

curl -X 'POST' \
  'http://localhost:23333/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "mylora",
  "messages": [
    {
      "content": "hi",
      "role": "user"
    }
  ]
}'

The output is:

{
  "id": "2",
  "object": "chat.completion",
  "created": 1721377275,
  "model": "mylora",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " 很高兴哪有什么赶凳儿?(按东北语说的“起早哇”),哦,东北人都学会外语了?",
        "tool_calls": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "total_tokens": 43,
    "completion_tokens": 26
  }
}
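To use this response in a script, the reply text sits under choices[0].message.content and the token counts under usage. The sketch below parses a trimmed copy of the output above; summarize is a hypothetical helper, and the check that prompt and completion tokens add up to the total reflects this particular output rather than a documented guarantee.

```python
import json

# Trimmed copy of the JSON response shown above.
raw = """
{
  "model": "mylora",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " 很高兴哪有什么赶凳儿?(按东北语说的“起早哇”),哦,东北人都学会外语了?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 17, "total_tokens": 43, "completion_tokens": 26}
}
"""


def summarize(response):
    """Pull out the fields a client usually needs from a chat completion."""
    choice = response["choices"][0]
    usage = response["usage"]
    return {
        "model": response["model"],
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        # Sanity check: prompt + completion tokens should equal the total.
        "usage_consistent": usage["prompt_tokens"] + usage["completion_tokens"]
        == usage["total_tokens"],
    }


summary = summarize(json.loads(raw))
print(summary["model"], summary["finish_reason"], summary["usage_consistent"])
```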

Python

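from openai import OpenAI

# Point the client at the api_server's OpenAI-compatible endpoint.
client = OpenAI(
    api_key='YOUR_API_KEY',
    base_url="http://0.0.0.0:23333/v1"
)
# 'mylora' is the key passed to --adapters; use the base model name to skip the adapter.
model_name = 'mylora'
response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "user", "content": "hi"},
    ],
    temperature=0.8,
    top_p=0.8
)
print(response)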

The printed response is:

ChatCompletion(id='4', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' 很高兴能够见到你哪,我也在辐射区开了个愣儿,你呢,还活着。', role='assistant', function_call=None, tool_calls=None))], created=1721377497, model='mylora', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=22, prompt_tokens=17, total_tokens=39))
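Because the adapter is chosen per request through the model parameter, the same prompt can be sent to the base model and to the LoRA adapter for a side-by-side comparison. The following is a minimal sketch under that assumption; ask and compare_models are hypothetical helpers that accept any client object with the OpenAI SDK's chat.completions.create interface.

```python
def ask(client, model, prompt, **sampling):
    """Send one user message to `model` and return only the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **sampling,  # e.g. temperature=0.8, top_p=0.8
    )
    return response.choices[0].message.content


def compare_models(client, models, prompt, **sampling):
    """Return a dict mapping each model name to its reply for the same prompt."""
    return {model: ask(client, model, prompt, **sampling) for model in models}


# Example call against a running server, with `client` built as shown above:
# replies = compare_models(client, ["THUDM/chatglm2-6b", "mylora"], "hi",
#                          temperature=0.8, top_p=0.8)
```

With sampling enabled the replies vary between runs, so a comparison like this is best read as a qualitative check that the adapter is being applied.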