配置文件请求吞吐量#

在应用程序中，用户输入提示的长度和生成令牌的大小是动态的。静态推理性能不足以反映推理引擎处理动态特性的能力。

因此，有必要使用真实的对话数据来评估推理引擎的动态推理能力。本文将介绍如何在本地主机上测试LMDeploy的动态推理性能。

性能分析脚本是 profile_throughput.py。在运行之前，请安装 lmdeploy 预编译包，下载性能分析脚本和测试数据集：

pip install lmdeploy
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

指标#

LMDeploy 记录了性能指标，如第一个令牌的延迟、令牌吞吐量（令牌/秒）和请求吞吐量（RPM）

first_token_latency 仅在流式推理的情况下报告。

计算token throughput的公式是：

\[\begin{split} TokenThroughput = Number\\ of\\ generated\\ tokens/TotalTime \end{split}\]

计算request throughput的公式是：

\[\begin{split} RPM(request\\ per\\ minute) = Number\\ of\\ prompts/TotalTime * 60 \end{split}\]

总时间包括预填充时间。

个人资料#

在本节中，我们以internlm/internlm-7b为例，展示如何分析LMDeploy的推理引擎。

分析 turbomind 引擎#

python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json internlm/internlm-7b

分析PyTorch引擎#

python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json internlm/internlm-7b  --backend pytorch

有关profile_throughput.py的详细参数说明，例如请求并发性、采样参数、k/v缓存内存百分比等，请运行帮助命令python3 profile_throughput.py -h。

Profile Request Throughput

目录

配置文件请求吞吐量#

指标#

个人资料#

分析 turbomind 引擎#

分析PyTorch引擎#