性能基准测试#

本文档详细说明了vllm-ascend的基准测试方法,旨在评估各种工作负载下的性能表现。为保持与vLLM的一致性,我们使用了vllm项目提供的benchmark脚本。

基准测试覆盖范围:我们测量了离线端到端延迟和吞吐量,以及固定QPS的在线服务基准测试,更多详情请参阅vllm-ascend基准测试脚本

1. 运行Docker容器#

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.7.3.post1
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash

2. 安装依赖#

cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r benchmarks/requirements-bench.txt

3. (可选)准备模型权重#

为了获得更快的运行速度,我们建议提前下载模型:

modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct

为了进行更快速、更轻量级的测试,建议将参数load-format设置为dummy, 系统将根据传入的模型结构构建随机权重值,这样可以避免从互联网下载模型所耗费的时间。

您也可以将所有json文件中的模型路径替换为您本地的路径以及其他传入的参数:

[
  {
    "test_name": "latency_llama8B_tp1",
    "parameters": {
      "model": "/path/to/model",
      "tensor_parallel_size": 1,
      "load_format": "dummy",
      "num_iters_warmup": 5,
      "num_iters": 15
    }
  }
]

4. 运行基准测试脚本#

运行基准测试脚本:

bash benchmarks/scripts/run-performance-benchmarks.sh

大约10分钟后,输出结果如下所示:

online serving:
qps 1:
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  212.77    
Total input tokens:                      42659     
Total generated tokens:                  43545     
Request throughput (req/s):              0.94      
Output token throughput (tok/s):         204.66    
Total Token throughput (tok/s):          405.16    
---------------Time to First Token----------------
Mean TTFT (ms):                          104.14    
Median TTFT (ms):                        102.22    
P99 TTFT (ms):                           153.82    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          38.78     
Median TPOT (ms):                        38.70     
P99 TPOT (ms):                           48.03     
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.46     
Median ITL (ms):                         36.96     
P99 ITL (ms):                            75.03     
==================================================

qps 4:
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  72.55     
Total input tokens:                      42659     
Total generated tokens:                  43545     
Request throughput (req/s):              2.76      
Output token throughput (tok/s):         600.24    
Total Token throughput (tok/s):          1188.27   
---------------Time to First Token----------------
Mean TTFT (ms):                          115.62    
Median TTFT (ms):                        109.39    
P99 TTFT (ms):                           169.03    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.48     
Median TPOT (ms):                        52.40     
P99 TPOT (ms):                           69.41     
---------------Inter-token Latency----------------
Mean ITL (ms):                           50.47     
Median ITL (ms):                         43.95     
P99 ITL (ms):                            130.29    
==================================================

qps 16:
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  47.82     
Total input tokens:                      42659     
Total generated tokens:                  43545     
Request throughput (req/s):              4.18      
Output token throughput (tok/s):         910.62    
Total Token throughput (tok/s):          1802.70   
---------------Time to First Token----------------
Mean TTFT (ms):                          128.50    
Median TTFT (ms):                        128.36    
P99 TTFT (ms):                           187.87    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          83.60     
Median TPOT (ms):                        77.85     
P99 TPOT (ms):                           165.90    
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.72     
Median ITL (ms):                         54.84     
P99 ITL (ms):                            289.63    
==================================================

qps inf:
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  41.26     
Total input tokens:                      42659     
Total generated tokens:                  43545     
Request throughput (req/s):              4.85      
Output token throughput (tok/s):         1055.44   
Total Token throughput (tok/s):          2089.40   
---------------Time to First Token----------------
Mean TTFT (ms):                          3394.37   
Median TTFT (ms):                        3359.93   
P99 TTFT (ms):                           3540.93   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          66.28     
Median TPOT (ms):                        64.19     
P99 TPOT (ms):                           97.66     
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.62     
Median ITL (ms):                         55.69     
P99 ITL (ms):                            82.90     
==================================================

offline:
latency:
Avg latency: 4.944929537673791 seconds
10% percentile latency: 4.894104263186454 seconds
25% percentile latency: 4.909652255475521 seconds
50% percentile latency: 4.932477846741676 seconds
75% percentile latency: 4.9608619548380375 seconds
90% percentile latency: 5.035418218374252 seconds
99% percentile latency: 5.052476694583893 seconds

throughput:
Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens:  42659
Total num output tokens:  43545

结果JSON文件生成到默认路径benchmark/results 这些文件包含详细的基准测试结果以供进一步分析。