SmoothQuant#

LMDeploy 提供了使用8位整数（INT8）对大语言模型进行量化和推理的功能。对于像Nvidia H100这样的GPU，lmdeploy还支持8位浮点数（FP8）。

以下NVIDIA GPU分别可用于INT8/FP8推理：

INT8
- V100(sm70): V100
- 图灵(sm75): 20系列, T4
- 安培(sm80,sm86): 30系列, A10, A16, A30, A100
- 艾达·洛夫莱斯(sm89): 40系列
- Hopper(sm90): H100
FP8
- 艾达·洛夫莱斯(sm89): 40系列
- Hopper(sm90): H100

首先，运行以下命令来安装lmdeploy：

pip install lmdeploy[all]

8位权重量化#

执行8位权重量化涉及三个步骤：

平滑权重：首先对语言模型（LLM）的权重进行平滑处理。这个过程使得权重更适合量化。
替换模块：找到DecoderLayers并将模块RSMNorm和nn.Linear分别替换为QRSMNorm和QLinear模块。这些‘Q’模块可以在lmdeploy/pytorch/models/q_modules.py文件中找到。
保存量化模型：一旦你完成了必要的替换，保存新的量化模型。

lmdeploy 提供了 lmdeploy lite smooth_quant 命令来完成上述所有三个任务。请注意，参数 --quant-dtype 用于确定您是否在进行 int8 或 fp8 权重量化。要获取有关 CLI 用法的更多信息，请运行 lmdeploy lite smooth_quant --help

以下是两个示例：

int8

lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-int8 --quant-dtype int8

fp8

lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-fp8 --quant-dtype fp8

推理#

尝试以下代码，您可以使用量化模型执行批量离线推理：

from lmdeploy import pipeline, PytorchEngineConfig

engine_config = PytorchEngineConfig(tp=1)
pipe = pipeline("internlm2_5-7b-chat-int8", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

服务#

LMDeploy的api_server使得模型可以通过一个命令轻松打包成服务。提供的RESTful API与OpenAI的接口兼容。以下是服务启动的示例：

lmdeploy serve api_server ./internlm2_5-7b-chat-int8 --backend pytorch

api_server 的默认端口是 23333。服务器启动后，您可以通过 api_client 在终端与服务器进行通信：

lmdeploy serve api_client http://0.0.0.0:23333

您可以通过Swagger UI在线概览和试用api_server API，访问地址为http://0.0.0.0:23333，或者您也可以从这里阅读API规范。

SmoothQuant

目录

SmoothQuant#

8位权重量化#

推理#

服务#