Optimization and Tuning#

This guide aims to help users improve vllm-ascend performance at the system level. It covers OS configuration, library optimization, deployment guidance, and more. Any feedback is welcome.

Preparation#

Run the container:

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the cann base image
export IMAGE=m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
docker run --rm \
--name performance-test \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash

Configure your environment:

# Configure the mirror
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" > /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list

# Install os packages
apt update && apt install wget gcc g++ libnuma-dev git vim -y

Optimization#

1. Compilation Optimization#

1.1 Install the optimized Python#

Python 3.6 and later supports LTO and PGO optimizations, which can be enabled at build time. For convenience, we provide a prebuilt Python package with these optimizations already applied. You can also build Python yourself for your specific scenario by following this tutorial.

mkdir -p /workspace/tmp
cd /workspace/tmp

# Download prebuilt lib and packages
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libcrypto.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libomp.so
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libssl.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/python/py311_bisheng.tar.gz

# Configure python and pip
cp ./*.so* /usr/local/lib
tar -zxvf ./py311_bisheng.*  -C /usr/local/
mv  /usr/local/py311_bisheng/  /usr/local/python
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3.11
ln -sf  /usr/local/python/bin/python3  /usr/bin/python
ln -sf  /usr/local/python/bin/python3  /usr/bin/python3
ln -sf  /usr/local/python/bin/python3.11  /usr/bin/python3.11
ln -sf  /usr/local/python/bin/pip3  /usr/bin/pip3
ln -sf  /usr/local/python/bin/pip3  /usr/bin/pip

export PATH=/usr/bin:/usr/local/python/bin:$PATH

1.2 Install the optimized torch and torch_npu#

As with Python, for convenience we also provide prebuilt torch and torch_npu packages with compilation optimizations applied. You can also rebuild torch for your specific scenario by following this tutorial, or rebuild torch_npu by following this tutorial.

cd /workspace/tmp

# Download prebuilt packages
wget https://repo.oepkgs.net/ascend/pytorch/vllm/torch/torch-2.5.1-cp311-cp311-linux_aarch64.whl
wget https://repo.oepkgs.net/ascend/pytorch/vllm/torch/torch_npu-2.5.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

# Install optimized torch and torch_npu
pip install /workspace/tmp/torch-2.5.1*.whl --force-reinstall --no-deps
pip install /workspace/tmp/torch_npu-2.5.1*.whl --force-reinstall --no-deps

# Clear pip cache and download files
pip cache purge
rm -rf /workspace/tmp/*

# Make sure torch and torch_npu can find the `xxx.so` libs we installed before
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

2. OS Optimization#

TCMalloc (Thread-Caching Malloc) is a general-purpose memory allocator. By introducing a multi-level cache structure, reducing mutex contention, and optimizing the handling of large objects, it improves overall performance while keeping latency low. See here for more details.

# Install tcmalloc
sudo apt update
sudo apt install libgoogle-perftools4 libgoogle-perftools-dev

# Get the location of libtcmalloc.so*
find /usr -name libtcmalloc.so*

# Make the priority of tcmalloc higher
# The <path> is the location of libtcmalloc.so obtained from the command above
# Example: "$LD_PRELOAD:/usr/lib/aarch64-linux-gnu/libtcmalloc.so"
export LD_PRELOAD="$LD_PRELOAD:<path>"

# Verify your configuration
# If your configuration is valid, the path of libtcmalloc.so will appear in the output
ldd `which python`

3. torch_npu Optimization#

Some performance-tuning features in torch_npu are controlled by environment variables. The features and their related environment variables are shown below.

Memory optimization:

# Upper limit (MB) on memory blocks that may be split. Setting this parameter prevents large memory blocks from being split.
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"

# When operators on the communication stream have dependencies, all of them must finish before their memory is released for reuse. Multi-stream reuse releases memory on the communication stream early so that the computing stream can reuse it.
# Note: this overwrites the setting above; the two options are benchmarked separately below (Groups F and G).
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

Scheduling optimization:

# Optimize the operator dispatch queue. This affects peak memory usage and may degrade performance when memory is tight.
export TASK_QUEUE_ENABLE=2

# This greatly improves models bottlenecked by the CPU and keeps performance unchanged for models bottlenecked by the NPU.
export CPU_AFFINITY_CONF=1

4. CANN Optimization#

4.1 HCCL Optimization#

Some performance-tuning features in HCCL currently have scenario-specific limitations, so they are enabled via environment variables.

  • HCCL_INTRA_ROCE_ENABLE: use RDMA links instead of SDMA links as the mesh interconnect between two groups of 8P; see here for more details.

  • HCCL_RDMA_TC: configure the traffic class of the RDMA NIC; see here for more details.

  • HCCL_RDMA_SL: configure the service level of the RDMA NIC; see here for more details.

  • HCCL_BUFFSIZE: control the buffer size for data shared between two NPUs; see here for more details.
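As a sketch, the variables above could be combined like this before launching a multi-node job. The numeric values below are illustrative assumptions, not recommendations; consult the HCCL documentation for values that match your network.

```shell
# Illustrative HCCL settings for an RDMA-connected multi-node setup.
# All values are assumed examples; tune them for your own network.
export HCCL_INTRA_ROCE_ENABLE=1   # prefer RDMA over SDMA between two groups of 8P
export HCCL_RDMA_TC=132           # traffic class of the RDMA NIC (example value)
export HCCL_RDMA_SL=4             # service level of the RDMA NIC (example value)
export HCCL_BUFFSIZE=200          # buffer size (MB) shared between two NPUs (example value)
```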

4.2 mindie_turbo Optimization#

Some performance-tuning features in mindie_turbo currently have scenario-specific limitations, so they are enabled via environment variables. See here for more details.

Benchmark#

Preparation#

# Install necessary dependencies
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install "modelscope<1.23.0" pandas datasets gevent sacrebleu rouge_score pybind11 pytest

# Configure this var to speed up model download
export VLLM_USE_MODELSCOPE=true

Follow the installation guide to make sure vllm, vllm-ascend, and mindie-turbo are installed correctly.

Note

Make sure to install vllm, vllm-ascend, and mindie-turbo only after the Python setup is complete, because these packages build their binaries with the Python in the current environment. If you installed vllm, vllm-ascend, or mindie-turbo before Section 1.1, the binaries will not use the optimized Python.
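Before installing them, a quick sanity check helps confirm which interpreter the builds will use (a sketch; the expected paths assume the setup from Section 1.1):

```shell
# Print which interpreter is on PATH; after Section 1.1 this
# should resolve to the optimized build under /usr/local/python.
command -v python3
python3 -V
python3 -c 'import sys; print(sys.executable)'
```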

Usage#

Start the vllm server:

python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--tensor-parallel-size 1 \
--swap-space 16 \
--disable-log-stats \
--disable-log-requests \
--load-format dummy

Note

Set load-format=dummy for lightweight testing; we don't need to actually download the weights.

When starting the vllm server with the Ascend scheduler, you can pass the --additional-config '{"ascend_scheduler_config":{}}' argument, which speeds up inference with the V1 engine. See here for more details.
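For example, the launch command from above could be extended like this (a sketch; all flags other than --additional-config are the ones already used in this guide):

```shell
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --tensor-parallel-size 1 \
  --load-format dummy \
  --additional-config '{"ascend_scheduler_config":{}}'
```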

Run the benchmark (this takes a while):

cd /vllm-workspace/vllm/benchmarks
python benchmark_serving.py \
--model Qwen/Qwen2.5-7B-Instruct \
--dataset-name random \
--random-input-len 200 \
--num-prompts 200 \
--request-rate 1 \
--save-result --result-dir ./

Results#

We used vllm-ascend:v0.7.3 as the baseline and compared the speedups achieved by different combinations of optimizations. The benchmark was run on a single NPU, and the results are shown below.

Note

Details of the optimization combinations we used:

  • Group A: vllm_ascend (baseline)

  • Group B: vllm_ascend + mindie_turbo

  • Group C: vllm_ascend + optimized python/torch/torch_npu

  • Group D: vllm_ascend + mindie_turbo + optimized python/torch/torch_npu

  • Group E: vllm_ascend + mindie_turbo + optimized python/torch/torch_npu + tcmalloc

  • Group F: Group E + PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"

  • Group G: Group E + PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

  • Group H: Group E + TASK_QUEUE_ENABLE=2

  • Group I: Group E + CPU_AFFINITY_CONF=1

In summary, Group H (vllm_ascend + mindie_turbo + optimized python/torch/torch_npu + tcmalloc + TASK_QUEUE_ENABLE=2) achieved the best performance on a single NPU. Compared with the baseline (vllm_ascend only), TTFT (prefill time) dropped by 43.31%, and TPOT (equivalent to ITL, decode time) dropped by 47.93%.
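As a recap, the environment for the best-performing combination (Group H) can be set up as in the following sketch. The tcmalloc path is the example path from Section 2 and may differ on your machine; the optimized python/torch/torch_npu and mindie_turbo are assumed to be installed already.

```shell
# Preload tcmalloc (example path; locate yours with `find /usr -name "libtcmalloc.so*"`)
export LD_PRELOAD="$LD_PRELOAD:/usr/lib/aarch64-linux-gnu/libtcmalloc.so"
# Optimize the operator dispatch queue (Section 3)
export TASK_QUEUE_ENABLE=2
```

Then start the vllm server as shown in the Usage section.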

In addition, in a distributed setup with multiple NPUs, you can try the further optimizations shown in Sections 4.1 and 4.2 for even faster inference.