Optimization and Tuning#

This guide aims to help users improve vllm-ascend performance at the system level. It covers OS configuration, library optimization, deployment guidance, and more. Any feedback is welcome.

Preparation#

Run the container:

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the cann base image
export IMAGE=m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
docker run --rm \
--name performance-test \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash

Configure your environment:

# Configure the mirror
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" > /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list

# Install os packages
apt update && apt install wget gcc g++ libnuma-dev git vim -y

Optimization#

1. Compilation Optimization#

1.1 Install the optimized Python#

Python 3.6 and later supports LTO and PGO optimizations, which can be enabled at build time. For convenience, we provide a prebuilt Python package with these optimizations already applied. You can also build Python yourself for your specific scenario by following this tutorial.

mkdir -p /workspace/tmp
cd /workspace/tmp

# Download prebuilt lib and packages
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libcrypto.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libomp.so
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libssl.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/python/py311_bisheng.tar.gz

# Configure python and pip
cp ./*.so* /usr/local/lib
tar -zxvf ./py311_bisheng.*  -C /usr/local/
mv  /usr/local/py311_bisheng/  /usr/local/python
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3.11
ln -sf  /usr/local/python/bin/python3  /usr/bin/python
ln -sf  /usr/local/python/bin/python3  /usr/bin/python3
ln -sf  /usr/local/python/bin/python3.11  /usr/bin/python3.11
ln -sf  /usr/local/python/bin/pip3  /usr/bin/pip3
ln -sf  /usr/local/python/bin/pip3  /usr/bin/pip

export PATH=/usr/bin:/usr/local/python/bin:$PATH

1.2 Install the optimized torch and torch_npu#

As with Python, for convenience we also provide prebuilt torch and torch_npu packages with compilation optimizations applied. You can also rebuild torch for your specific scenario by following this tutorial, or rebuild torch_npu by following this tutorial.

cd /workspace/tmp

# Download prebuilt packages
wget https://repo.oepkgs.net/ascend/pytorch/vllm/torch/torch-2.5.1-cp311-cp311-linux_aarch64.whl
wget https://repo.oepkgs.net/ascend/pytorch/vllm/torch/torch_npu-2.5.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

# Install optimized torch and torch_npu
pip install /workspace/tmp/torch-2.5.1*.whl --force-reinstall --no-deps
pip install /workspace/tmp/torch_npu-2.5.1*.whl --force-reinstall --no-deps

# Clear pip cache and download files
pip cache purge
rm -rf /workspace/tmp/*

# Make sure torch and torch_npu can find the `xxx.so` libs we installed before
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

2. OS Optimization#

TCMalloc (Thread-Caching Malloc) is a general-purpose memory allocator. By introducing a multi-level cache structure, reducing mutex contention, and optimizing the handling of large objects, it improves overall performance while keeping latency low. See here for more details.

# Install tcmalloc
sudo apt update
sudo apt install libgoogle-perftools4 libgoogle-perftools-dev

# Get the location of libtcmalloc.so*
find /usr -name libtcmalloc.so*

# Make the priority of tcmalloc higher
# The <path> is the location of libtcmalloc.so obtained from the command above
# Example: "$LD_PRELOAD:/usr/lib/aarch64-linux-gnu/libtcmalloc.so"
export LD_PRELOAD="$LD_PRELOAD:<path>"

# Verify your configuration
# If your configuration is valid, the path of libtcmalloc.so will appear in the output
ldd `which python`

3. torch_npu Optimization#

Some performance-tuning features in torch_npu are controlled by environment variables. The features and their related environment variables are shown below.

Memory optimization:

# Upper limit (MB) on memory blocks that may be split. Setting this parameter prevents large memory blocks from being split.
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"

# When operators on the communication stream have dependencies, all of them must finish before their memory is released for reuse. Multi-stream reuse releases memory on the communication stream early so that the computing stream can reuse it.
# Note: this overwrites the setting above; the two options are benchmarked separately below (Groups F and G).
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

Scheduling optimization:

# Optimize the operator dispatch queue. This affects peak memory usage and may degrade performance when memory is tight.
export TASK_QUEUE_ENABLE=2

# This greatly improves models bottlenecked by the CPU and keeps performance unchanged for models bottlenecked by the NPU.
export CPU_AFFINITY_CONF=1

4. CANN Optimization#

4.1 HCCL Optimization#

Some performance-tuning features in HCCL currently have scenario-specific limitations, so they are enabled via environment variables.

  • HCCL_INTRA_ROCE_ENABLE: use RDMA links instead of SDMA links as the mesh interconnect between two groups of 8P; see here for more details.

  • HCCL_RDMA_TC: configure the traffic class of the RDMA NIC; see here for more details.

  • HCCL_RDMA_SL: configure the service level of the RDMA NIC; see here for more details.

  • HCCL_BUFFSIZE: control the buffer size for data shared between two NPUs; see here for more details.
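As a sketch, the variables above could be combined like this before launching a multi-node job. The numeric values below are illustrative assumptions, not recommendations; consult the HCCL documentation for values that match your network.

```shell
# Illustrative HCCL settings for an RDMA-connected multi-node setup.
# All values are assumed examples; tune them for your own network.
export HCCL_INTRA_ROCE_ENABLE=1   # prefer RDMA over SDMA between two groups of 8P
export HCCL_RDMA_TC=132           # traffic class of the RDMA NIC (example value)
export HCCL_RDMA_SL=4             # service level of the RDMA NIC (example value)
export HCCL_BUFFSIZE=200          # buffer size (MB) shared between two NPUs (example value)
```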

4.2 mindie_turbo Optimization#

Some performance-tuning features in mindie_turbo currently have scenario-specific limitations, so they are enabled via environment variables. See here for more details.

Benchmark#

Preparation#

# Install necessary dependencies
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install "modelscope<1.23.0" pandas datasets gevent sacrebleu rouge_score pybind11 pytest

# Configure this var to speed up model download
export VLLM_USE_MODELSCOPE=true

Follow the installation guide to make sure vllm, vllm-ascend, and mindie-turbo are installed correctly.

Note

Make sure to install vllm, vllm-ascend, and mindie-turbo only after the Python setup is complete, because these packages build their binaries with the Python in the current environment. If you installed vllm, vllm-ascend, or mindie-turbo before Section 1.1, the binaries will not use the optimized Python.
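Before installing them, a quick sanity check helps confirm which interpreter the builds will use (a sketch; the expected paths assume the setup from Section 1.1):

```shell
# Print which interpreter is on PATH; after Section 1.1 this
# should resolve to the optimized build under /usr/local/python.
command -v python3
python3 -V
python3 -c 'import sys; print(sys.executable)'
```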

Usage#

Start the vllm server:

python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--tensor-parallel-size 1 \
--swap-space 16 \
--disable-log-stats \
--disable-log-requests \
--load-format dummy

Note

Set load-format=dummy for lightweight testing; we don't need to actually download the weights.

When starting the vllm server with the Ascend scheduler, you can pass the --additional-config '{"ascend_scheduler_config":{}}' argument, which speeds up inference with the V1 engine. See here for more details.
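For example, the launch command from above could be extended like this (a sketch; all flags other than --additional-config are the ones already used in this guide):

```shell
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --tensor-parallel-size 1 \
  --load-format dummy \
  --additional-config '{"ascend_scheduler_config":{}}'
```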

Run the benchmark (this takes a while):

cd /vllm-workspace/vllm/benchmarks
python benchmark_serving.py \
--model Qwen/Qwen2.5-7B-Instruct \
--dataset-name random \
--random-input-len 200 \
--num-prompts 200 \
--request-rate 1 \
--save-result --result-dir ./

Results#

We used vllm-ascend:v0.7.3 as the baseline and compared the speedups achieved by different combinations of optimizations. The benchmark was run on a single NPU, and the results are shown below.

Note

Details of the optimization combinations we used:

  • Group A: vllm_ascend (baseline)

  • Group B: vllm_ascend + mindie_turbo

  • Group C: vllm_ascend + optimized python/torch/torch_npu

  • Group D: vllm_ascend + mindie_turbo + optimized python/torch/torch_npu

  • Group E: vllm_ascend + mindie_turbo + optimized python/torch/torch_npu + tcmalloc

  • Group F: Group E + PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"

  • Group G: Group E + PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

  • Group H: Group E + TASK_QUEUE_ENABLE=2

  • Group I: Group E + CPU_AFFINITY_CONF=1

In summary, Group H (vllm_ascend + mindie_turbo + optimized python/torch/torch_npu + tcmalloc + TASK_QUEUE_ENABLE=2) achieved the best performance on a single NPU. Compared with the baseline (vllm_ascend only), TTFT (prefill time) dropped by 43.31%, and TPOT (equivalent to ITL, decode time) dropped by 47.93%.
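As a recap, the environment for the best-performing combination (Group H) can be set up as in the following sketch. The tcmalloc path is the example path from Section 2 and may differ on your machine; the optimized python/torch/torch_npu and mindie_turbo are assumed to be installed already.

```shell
# Preload tcmalloc (example path; locate yours with `find /usr -name "libtcmalloc.so*"`)
export LD_PRELOAD="$LD_PRELOAD:/usr/lib/aarch64-linux-gnu/libtcmalloc.so"
# Optimize the operator dispatch queue (Section 3)
export TASK_QUEUE_ENABLE=2
```

Then start the vllm server as shown in the Usage section.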

In addition, in a distributed setup with multiple NPUs, you can try the further optimizations shown in Sections 4.1 and 4.2 for even faster inference.