Optimization and Tuning
This guide helps you improve the performance of vllm-ascend at the system level. It covers OS configuration, library optimization, deployment guidance, and more. Any feedback is welcome.
Preparation
Run the container:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the cann base image
export IMAGE=m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
docker run --rm \
--name performance-test \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
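Once inside the container, you can confirm the NPU is visible before going further (npu-smi is mounted from the host above):
# Should list the NPU device(s) mapped into the container
npu-smi info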
Configure your environment:
# Configure the mirror
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" > /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list
# Install os packages
apt update && apt install wget gcc g++ libnuma-dev git vim -y
Optimization
1. Compilation optimization
1.1 Install the optimized Python
Python 3.6 and later supports LTO and PGO optimizations, which can be enabled at build time. For convenience, we provide a prebuilt Python package with these optimizations applied. You can also build Python yourself for your specific scenario by following this tutorial; a build sketch is given after the install commands below.
mkdir -p /workspace/tmp
cd /workspace/tmp
# Download prebuilt lib and packages
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libcrypto.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libomp.so
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libssl.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/python/py311_bisheng.tar.gz
# Configure python and pip
cp ./*.so* /usr/local/lib
tar -zxvf ./py311_bisheng.* -C /usr/local/
mv /usr/local/py311_bisheng/ /usr/local/python
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3.11
ln -sf /usr/local/python/bin/python3 /usr/bin/python
ln -sf /usr/local/python/bin/python3 /usr/bin/python3
ln -sf /usr/local/python/bin/python3.11 /usr/bin/python3.11
ln -sf /usr/local/python/bin/pip3 /usr/bin/pip3
ln -sf /usr/local/python/bin/pip3 /usr/bin/pip
export PATH=/usr/bin:/usr/local/python/bin:$PATH
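If you would rather build Python yourself, LTO and PGO are both enabled through configure flags. A minimal sketch, assuming CPython 3.11.9 and an install prefix of /usr/local/python (both are placeholders, adjust to your scenario):
# Download and unpack the CPython source
wget https://www.python.org/ftp/python/3.11.9/Python-3.11.9.tgz
tar -zxvf Python-3.11.9.tgz && cd Python-3.11.9
# --enable-optimizations enables PGO; --with-lto enables LTO
./configure --prefix=/usr/local/python --enable-optimizations --with-lto
make -j "$(nproc)" && make install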
1.2 Install the optimized torch and torch_npu
As with Python, for convenience we also provide prebuilt torch and torch_npu packages with compile-time optimizations. Depending on your scenario, you can also rebuild torch by following this tutorial, or rebuild torch_npu by following this tutorial.
cd /workspace/tmp
# Download prebuilt packages
wget https://repo.oepkgs.net/ascend/pytorch/vllm/torch/torch-2.5.1-cp311-cp311-linux_aarch64.whl
wget https://repo.oepkgs.net/ascend/pytorch/vllm/torch/torch_npu-2.5.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
# Install optimized torch and torch_npu
pip install /workspace/tmp/torch-2.5.1*.whl --force-reinstall --no-deps
pip install /workspace/tmp/torch_npu-2.5.1*.whl --force-reinstall --no-deps
# Clear pip cache and download files
pip cache purge
rm -rf /workspace/tmp/*
# Make sure torch and torch_npu can find the `xxx.so` libs we installed above
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
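A quick sanity check that the optimized packages are the active ones (both should import and report version 2.5.1):
# Verify the reinstalled wheels are picked up
python -c "import torch, torch_npu; print(torch.__version__, torch_npu.__version__)"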
2. OS optimization
TCMalloc (Thread-Caching Malloc) is a general-purpose memory allocator that improves overall performance while keeping latency low, by introducing a multi-level cache structure, reducing mutex contention, and optimizing the handling of large objects. See here for more details.
# Install tcmalloc
sudo apt update
sudo apt install libgoogle-perftools4 libgoogle-perftools-dev
# Get the location of libtcmalloc.so*
find /usr -name libtcmalloc.so*
# Give tcmalloc higher priority at load time
# <path> is the location of libtcmalloc.so found by the command above
# Example: "$LD_PRELOAD:/usr/lib/aarch64-linux-gnu/libtcmalloc.so"
export LD_PRELOAD="$LD_PRELOAD:<path>"
# Verify your configuration
# The path of libtcmalloc.so will be contained in the result if your configuration is valid
ldd `which python`
3. torch_npu optimization
Some performance-tuning features in torch_npu are controlled by environment variables. A few of these features and their related variables are shown below.
Memory optimization:
# Upper limit (MB) on the size of memory blocks that are allowed to be split; setting this prevents large memory blocks from being split up.
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"
# When operators on the communication stream have dependencies, all of them must finish before their memory is released for reuse. Multi-stream reuse releases communication-stream memory early so that the computation stream can reuse it.
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
Scheduling optimization:
# Optimize the operator dispatch queue. This affects peak memory usage and may regress when memory is tight.
export TASK_QUEUE_ENABLE=2
# Binding CPU affinity greatly improves CPU-bound models while keeping NPU-bound models at the same performance.
export CPU_AFFINITY_CONF=1
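For convenience, the variables you settle on can be collected into a small script and sourced before launching the server. A sketch (env.sh is an arbitrary name, and the tcmalloc path is the example path from section 2; keep only the options you need):
cat > env.sh << 'EOF'
# OS: preload tcmalloc (use the path found in section 2)
export LD_PRELOAD="$LD_PRELOAD:/usr/lib/aarch64-linux-gnu/libtcmalloc.so"
# torch_npu: memory and scheduling tuning
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"
export TASK_QUEUE_ENABLE=2
export CPU_AFFINITY_CONF=1
EOF
source env.sh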
4. CANN optimization
4.1 HCCL optimization
Some performance-tuning features in HCCL currently come with scenario-specific limitations, so whether they are enabled is controlled by environment variables.
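For example, the size of the per-communication-domain HCCL buffer is controlled by HCCL_BUFFSIZE, a CANN-documented variable (unit: MB, default 200). The value below is purely illustrative; which HCCL variables apply depends on your CANN version and scenario:
# Enlarge the HCCL communication buffer from its default (illustrative value)
export HCCL_BUFFSIZE=400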
4.2 mindie_turbo optimization
Some performance-tuning features in mindie_turbo currently come with scenario-specific limitations, so whether they are enabled is controlled by environment variables. See here for more details.
Benchmark
Preparation
# Install necessary dependencies
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install "modelscope<1.23.0" pandas datasets gevent sacrebleu rouge_score pybind11 pytest
# Configure this var to speed up model download
export VLLM_USE_MODELSCOPE=true
Follow the installation guide to make sure vllm, vllm-ascend, and mindie-turbo are installed correctly.
Note
Install vllm, vllm-ascend, and mindie-turbo only after finishing the Python setup, because these packages build their binaries with the Python found in the current environment. If you installed them before section 1.1, their binaries will not use the optimized Python.
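A sketch of the install order, assuming the package names used in this guide (see the installation guide for the exact versions, sources, and any extra steps):
# Run this only after section 1.1 so the binaries build against the optimized python
pip install vllm vllm-ascend mindie-turbo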
Usage
Launch the vLLM server:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--tensor-parallel-size 1 \
--swap-space 16 \
--disable-log-stats \
--disable-log-requests \
--load-format dummy
Note
Set load-format=dummy for lightweight testing; we do not need to actually download the weights.
When launching the vLLM server with the Ascend scheduler, you can pass --additional-config '{"ascend_scheduler_config":{}}', which speeds up inference for the V1 engine. See here for more details.
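For example, the same server command with the Ascend scheduler enabled looks like this:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--tensor-parallel-size 1 \
--swap-space 16 \
--disable-log-stats \
--disable-log-requests \
--load-format dummy \
--additional-config '{"ascend_scheduler_config":{}}'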
Run the benchmark (this takes a while):
cd /vllm-workspace/vllm/benchmarks
python benchmark_serving.py \
--model Qwen/Qwen2.5-7B-Instruct \
--dataset-name random \
--random-input-len 200 \
--num-prompts 200 \
--request-rate 1 \
--save-result --result-dir ./
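When the run finishes, the headline latency numbers can be pulled from the saved result JSON. A sketch, assuming a single result file in the current directory and the mean_ttft_ms / mean_tpot_ms fields that benchmark_serving.py writes:
# Print mean TTFT and TPOT from the newest result file
python -c "import glob, json; d = json.load(open(sorted(glob.glob('./*.json'))[-1])); print('TTFT(ms):', d['mean_ttft_ms'], 'TPOT(ms):', d['mean_tpot_ms'])"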
Results
Using vllm-ascend:v0.7.3 as the baseline, we compared the speedups of different combinations of optimization methods. The benchmarks were run on a single NPU, and the results are shown below.
(Figure: benchmark results for optimization groups A-I.)
Note
Details of the optimization combinations we used:
Group A: vllm_ascend only (baseline)
Group B: vllm_ascend + mindie_turbo
Group C: vllm_ascend + optimized python/torch/torch_npu
Group D: vllm_ascend + mindie_turbo + optimized python/torch/torch_npu
Group E: vllm_ascend + mindie_turbo + optimized python/torch/torch_npu + tcmalloc
Group F: Group E + PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"
Group G: Group E + PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
Group H: Group E + TASK_QUEUE_ENABLE=2
Group I: Group E + CPU_AFFINITY_CONF=1
In summary, Group H (vllm_ascend + mindie_turbo + optimized python/torch/torch_npu + tcmalloc + TASK_QUEUE_ENABLE=2) achieves the best performance on a single NPU. Compared with the baseline (vllm_ascend only), TTFT (prefill, time to first token) drops by 43.31%, and TPOT (equivalent to ITL, decode time per output token) drops by 47.93%.
In addition, in distributed environments with multiple NPUs, you can try the further optimizations shown in sections 4.1 and 4.2 to get faster inference; a launch sketch follows.
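A minimal multi-NPU sketch combining these ideas (the device list and parallel size are illustrative; ASCEND_RT_VISIBLE_DEVICES selects which NPUs the process may use):
# Shard the model across 4 NPUs
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--tensor-parallel-size 4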