Profile Token Latency and Throughput

We profile the latency and throughput of token generation at a fixed batch size and with fixed input and output token counts.

The profiling script is profile_generation.py. Before running it, install the lmdeploy precompiled package and download the profiling script:

pip install lmdeploy
git clone --depth=1 https://github.com/InternLM/lmdeploy

Metrics

LMDeploy records test results such as first-token latency, token throughput (tokens/s), percentiles of per-token latency (P50, P75, P95, P99), and GPU memory usage.
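To make the percentile figures concrete, here is a minimal sketch of how P50/P75/P95/P99 could be derived from a list of per-token latencies. The helper name `percentile` and the sample values are hypothetical, and the linear interpolation between ranks is an assumption; profile_generation.py may compute its percentiles differently.

```python
def percentile(samples, p):
    """Return the p-th percentile (0-100) of samples using linear interpolation.

    Assumption: linear interpolation between the two closest ranks.
    This is an illustration, not LMDeploy's own implementation.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = (p / 100.0) * (len(ordered) - 1)   # fractional index into the sorted list
    lower = int(rank)                          # index of the sample at or below the rank
    upper = min(lower + 1, len(ordered) - 1)   # index of the next sample, clamped
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical per-token latencies in milliseconds.
latencies_ms = [10.0, 20.0, 30.0, 40.0, 50.0]
for p in (50, 75, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p):.2f} ms")
```

High percentiles such as P95 and P99 describe the slowest tokens, so they are more sensitive to occasional stalls than the median.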

first_token_latency is reported only for streaming inference.

Throughput is calculated as:

\[ \text{TokenThroughput} = \frac{\text{Number of generated tokens}}{\text{TotalTime}} \]

The total time includes the prefill time.
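As a worked illustration of the formula, the sketch below divides the number of generated tokens by a total time that covers both prefill and decoding. The function name `token_throughput` and all numbers are hypothetical, and counting generated tokens across the whole batch is an assumption made for this example.

```python
def token_throughput(num_generated_tokens, total_time_s):
    """Tokens per second, where total_time_s already includes prefill time."""
    if total_time_s <= 0:
        raise ValueError("total_time_s must be positive")
    return num_generated_tokens / total_time_s


# Hypothetical run: batch size 4, 512 output tokens per sequence.
batch_size = 4
output_tokens_per_sequence = 512
prefill_time_s = 0.5   # time spent processing the input prompts
decode_time_s = 7.5    # time spent generating the output tokens

generated = batch_size * output_tokens_per_sequence   # 2048 tokens
total_time_s = prefill_time_s + decode_time_s          # 8.0 seconds

print(f"{token_throughput(generated, total_time_s):.1f} tokens/s")
```

Because prefill is part of the denominator, runs with long inputs and short outputs will report lower throughput than the decode speed alone would suggest.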

During the test, no GPU on the node should be running any other program; otherwise the GPU memory statistics will be inaccurate.
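One way to check that the GPUs are idle before profiling is to list the compute processes that nvidia-smi reports. The sketch below is not part of LMDeploy: the helper names `parse_compute_apps` and `running_compute_apps` are hypothetical, and it assumes that nvidia-smi is on the PATH and that process names contain no unusual formatting.

```python
import subprocess

# nvidia-smi query listing every compute process currently using a GPU.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse the CSV output of the query above into a list of dicts."""
    apps = []
    for line in csv_text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        apps.append({
            "pid": int(parts[0]),
            "name": ", ".join(parts[1:-1]),   # tolerate commas inside the name
            "used_memory_mib": parts[-1],     # kept as text; may be "[N/A]"
        })
    return apps


def running_compute_apps():
    """Run nvidia-smi and return the compute processes it reports."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

Calling `running_compute_apps()` before a run and proceeding only when it returns an empty list gives a simple guard against polluted memory statistics.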

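In this section, we take internlm/internlm-7b as an example to show how to profile LMDeploy's inference engines. The first command below uses the default backend; the second selects the PyTorch backend with --backend pytorch.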

cd lmdeploy/benchmark
python3 profile_generation.py internlm/internlm-7b

cd lmdeploy/benchmark
python3 profile_generation.py internlm/internlm-7b --backend pytorch

For detailed descriptions of the profile_generation.py arguments, such as the batch size and the number of input and output tokens, run the help command python3 profile_generation.py -h.