# Expert Parallel Deployment
vLLM supports expert parallelism (EP), which places the experts of a Mixture-of-Experts (MoE) model on different GPUs to improve locality, efficiency, and overall throughput.

EP is typically combined with data parallelism (DP). While DP can be used without EP, EP is more efficient when paired with DP. You can read more about data parallelism here.
## Prerequisites

Before using EP, install the necessary dependencies for your chosen communication backend (see the backend selection guide below). We are actively working to make this process easier in the future.
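The kernels behind these backends live in separate projects rather than in vLLM itself. As a rough, non-authoritative sketch (the repository locations and build steps below are assumptions; follow each project's own instructions):

```bash
# pplx backend kernels (Perplexity AI) - assumed location; build per the project's README
git clone https://github.com/ppl-ai/pplx-kernels
# deepep_* backend kernels (DeepSeek) - assumed location; requires NVSHMEM, build per the project's README
git clone https://github.com/deepseek-ai/DeepEP
```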
## Backend Selection Guide

vLLM provides three communication backends for EP:
| Backend | Use Case | Features | Best For |
|---|---|---|---|
| `pplx` | Single node | Chunked prefill support | Development; intra-node deployments |
| `deepep_high_throughput` | Multi-node prefill | Grouped GEMM with contiguous layout | High-throughput, prefill-dominated workloads |
| `deepep_low_latency` | Multi-node decode | CUDA graph support, masked layout | Low-latency, decode-dominated workloads |
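The backend is selected through the `VLLM_ALL2ALL_BACKEND` environment variable, as in the example commands later on this page; a minimal sketch:

```bash
# Pick one backend per deployment; the examples below set it inline on the vllm serve command.
export VLLM_ALL2ALL_BACKEND=deepep_low_latency   # or: pplx, deepep_high_throughput
```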
## Single Node Deployment

**Warning**

EP is an experimental feature. Argument names and default values may change in the future.

### Configuration
Enable EP by setting the `--enable-expert-parallel` flag. The EP size is computed automatically as `EP_SIZE = TP_SIZE × DP_SIZE`, where:

- `TP_SIZE`: tensor parallel size (currently fixed at 1)
- `DP_SIZE`: data parallel size
- `EP_SIZE`: expert parallel size (computed automatically)
### Example Command

The following command deploys DeepSeek-V3-0324 with 1-way tensor parallelism, 8-way (attention) data parallelism, and 8-way expert parallelism. Attention weights are replicated on every GPU, while expert weights are distributed across the GPUs. This configuration is intended for an H200 (or H20) node with 8 GPUs. On H100, try a smaller model or refer to the multi-node deployment section.
```bash
# Single-node EP deployment with the pplx backend
#   --tensor-parallel-size 1   tensor parallelism across 1 GPU
#   --data-parallel-size 8     data parallelism across 8 processes
#   --enable-expert-parallel   enable expert parallelism
VLLM_ALL2ALL_BACKEND=pplx VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --enable-expert-parallel
```
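Once the server is up, you can sanity-check the OpenAI-compatible endpoint (assuming the default port 8000):

```bash
# List the model(s) served by the deployment above
curl -s http://localhost:8000/v1/models
```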
## Multi-Node Deployment

For multi-node deployments, use the DeepEP communication kernels and choose one of its two modes (see the backend selection guide above).

### Deployment Steps

- **One command per node** - each node needs its own launch command
- **Configure networking** - make sure IP addresses and ports are configured correctly
- **Assign node roles** - the first node handles incoming requests; the other nodes run in headless mode
### Example: Two-Node Deployment

The following example deploys DeepSeek-V3-0324 across 2 nodes using `deepep_low_latency` mode:
```bash
# Node 1 (primary - handles incoming requests)
#   --data-parallel-size 16        total DP size across all nodes
#   --data-parallel-size-local 8   local DP size on this node (8 GPUs per node)
#   --data-parallel-address        IP of Node 1 (replace with the actual address)
#   --data-parallel-rpc-port       RPC communication port; any port reachable from all nodes
#   --api-server-count 8           number of API servers for load handling
#                                  (scaling this toward the total number of ranks is recommended)
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --data-parallel-size 16 \
    --data-parallel-size-local 8 \
    --data-parallel-address 192.168.1.100 \
    --data-parallel-rpc-port 13345 \
    --api-server-count=8
```
```bash
# Node 2 (secondary - headless mode, no API server)
#   --data-parallel-size 16        total DP size across all nodes
#   --data-parallel-size-local 8   local DP size on this node
#   --data-parallel-start-rank 8   starting DP rank offset for this node
#   --data-parallel-address        IP of the primary node (Node 1)
#   --data-parallel-rpc-port       same RPC port as the primary
#   --headless                     no API server; workers only
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --data-parallel-size 16 \
    --data-parallel-size-local 8 \
    --data-parallel-start-rank 8 \
    --data-parallel-address 192.168.1.100 \
    --data-parallel-rpc-port 13345 \
    --headless
```
### Key Configuration Notes

- **Headless mode**: secondary nodes run with the `--headless` flag, meaning all client requests are handled by the primary node
- **Rank calculation**: `--data-parallel-start-rank` should equal the cumulative local DP size of all preceding nodes (see the sketch below)
- **Load scaling**: adjust `--api-server-count` on the primary node to handle higher request load
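For example, in a hypothetical four-node deployment with 8 GPUs per node (total DP size 32), the start ranks are staggered by the local DP size:

```bash
# Hypothetical 4-node layout, 8 local DP ranks per node, total DP size 32.
# --data-parallel-start-rank equals the sum of the local DP sizes of all preceding nodes.
# Node 1 (primary):  --data-parallel-size 32 --data-parallel-size-local 8                                 # ranks 0-7
# Node 2 (headless): --data-parallel-size 32 --data-parallel-size-local 8 --data-parallel-start-rank 8    # ranks 8-15
# Node 3 (headless): --data-parallel-size 32 --data-parallel-size-local 8 --data-parallel-start-rank 16   # ranks 16-23
# Node 4 (headless): --data-parallel-size 32 --data-parallel-size-local 8 --data-parallel-start-rank 24   # ranks 24-31
```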
### Network Configuration

**InfiniBand clusters**

On InfiniBand-equipped clusters, set the following environment variable to prevent hangs during initialization:
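The variable in question points torch's Gloo-based group discovery at the Ethernet interface (`GLOO_SOCKET_IFNAME` is the standard Gloo interface selector); a minimal sketch, assuming that interface is named eth0:

```bash
# Route torch distributed (Gloo) group discovery over Ethernet instead of InfiniBand.
# "eth0" is an assumption - substitute your node's actual Ethernet interface name.
export GLOO_SOCKET_IFNAME=eth0
```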
This ensures that torch distributed group discovery uses Ethernet rather than InfiniBand for the initial setup.

## Expert Parallel Load Balancer (EPLB)
Although MoE models are usually trained so that each expert receives a similar number of tokens, in practice the token distribution across experts can be highly imbalanced. vLLM provides an Expert Parallel Load Balancer (EPLB) that redistributes the expert mapping across EP ranks to even out the per-expert load.
### Configuration

Enable EPLB with the `--enable-eplb` flag.

**Model support**

Currently only the DeepSeek V3 architecture is supported.

When enabled, vLLM collects load statistics on every forward pass and periodically rebalances the expert placement.
### EPLB Parameters

| Parameter | Description | Default |
|---|---|---|
| `--eplb-window-size` | Number of engine steps tracked for rebalancing decisions | - |
| `--eplb-step-interval` | Rebalancing frequency (every N engine steps) | - |
| `--eplb-log-balancedness` | Log balancedness metrics (average tokens per expert ÷ maximum tokens per expert) | `false` |
| `--num-redundant-experts` | Additional global experts beyond the default equal distribution | `0` |
### Expert Distribution Formula

- **Default**: each EP rank holds `NUM_TOTAL_EXPERTS ÷ NUM_EP_RANKS` experts
- **With redundancy**: each EP rank holds `(NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS` experts
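As a worked example, assuming 256 routed experts (as in DeepSeek-V3) and the 8-way EP deployments shown above:

```bash
# Expert counts per EP rank, with and without redundancy
NUM_TOTAL_EXPERTS=256        # routed experts (DeepSeek-V3; assumption for illustration)
NUM_EP_RANKS=8
NUM_REDUNDANT_EXPERTS=32
echo $(( NUM_TOTAL_EXPERTS / NUM_EP_RANKS ))                             # default: 32 experts per rank
echo $(( (NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) / NUM_EP_RANKS ))   # with redundancy: 36 experts per rank
```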
### Example Command

Single-node deployment with EPLB enabled:
```bash
# Single node with EPLB load balancing
#   --enable-eplb               enable the load balancer
#   --eplb-log-balancedness     log balancedness metrics
#   --eplb-window-size 1000     track the last 1000 engine steps
#   --eplb-step-interval 3000   rebalance every 3000 engine steps
VLLM_ALL2ALL_BACKEND=pplx VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --enable-eplb \
    --eplb-log-balancedness \
    --eplb-window-size 1000 \
    --eplb-step-interval 3000
```
For multi-node deployments, add these EPLB flags to every node's command. For large-scale deployments, we recommend setting `--num-redundant-experts` to 32 so that the most popular experts are always available.
## Disaggregated Serving (Prefill/Decode Disaggregation)

For production deployments that require strict time-to-first-token and inter-token-latency SLAs, disaggregated serving allows the prefill and decode stages to be scaled independently.
### Architecture Overview

- **Prefill instances**: use the `deepep_high_throughput` backend for the best prefill performance
- **Decode instances**: use the `deepep_low_latency` backend for the lowest decode latency
- **KV cache transfer**: connect the instances through NIXL or another KV connector
### Setup Steps

- **Install the KV connector**: install NIXL using its installation script
- **Configure both instances**: add the following flag to both the prefill and decode instances: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'`
- **Orchestrate from the client**: use the client script below to coordinate the prefill/decode operations. We are actively working on a routing solution.
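Putting the install and configuration steps together, a hedged sketch of what the two instances might look like (the backend choices follow the architecture overview above; IP addresses, ports, and parallel sizes are illustrative assumptions):

```bash
# Prefill instance (throughput-oriented backend), e.g. reachable at 192.168.1.100:8000
VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --port 8000 \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

# Decode instance (latency-oriented backend), e.g. reachable at 192.168.1.101:8001
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --port 8001 \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
```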
### Client Orchestration Example
```python
from openai import OpenAI
import uuid

try:
    # 1: Set up clients for the prefill and decode instances
    openai_api_key = "EMPTY"  # vLLM doesn't require a real API key

    # Replace these IP addresses with your actual instance addresses
    prefill_client = OpenAI(
        api_key=openai_api_key,
        base_url="http://192.168.1.100:8000/v1",  # Prefill instance URL
    )
    decode_client = OpenAI(
        api_key=openai_api_key,
        base_url="http://192.168.1.101:8001/v1",  # Decode instance URL
    )

    # Get the model name from the prefill instance
    models = prefill_client.models.list()
    model = models.data[0].id
    print(f"Using model: {model}")

    # 2: Prefill phase
    # Generate a unique request ID to link the prefill and decode operations
    request_id = str(uuid.uuid4())
    print(f"Request ID: {request_id}")

    prefill_response = prefill_client.completions.create(
        model=model,
        # Prompt must exceed vLLM's block size (16 tokens) for PD to work
        prompt="Write a detailed explanation of how Paged Attention for Transformers works, including the management of KV cache for multi-turn conversations",
        max_tokens=1,  # Force prefill-only operation
        extra_body={
            "kv_transfer_params": {
                "do_remote_decode": True,    # Enable remote decode
                "do_remote_prefill": False,  # This is the prefill instance
                "remote_engine_id": None,    # Will be populated by vLLM
                "remote_block_ids": None,    # Will be populated by vLLM
                "remote_host": None,         # Will be populated by vLLM
                "remote_port": None,         # Will be populated by vLLM
            }
        },
        extra_headers={"X-Request-Id": request_id},
    )

    print("-" * 50)
    print("✓ Prefill completed successfully")
    print(f"Prefill response: {prefill_response.choices[0].text}")

    # 3: Decode phase
    # Transfer the KV cache parameters from the prefill to the decode instance
    decode_response = decode_client.completions.create(
        model=model,
        prompt="This prompt is ignored during decode",  # Original prompt not needed
        max_tokens=150,  # Generate up to 150 tokens
        extra_body={
            "kv_transfer_params": prefill_response.kv_transfer_params  # Pass KV cache info
        },
        extra_headers={"X-Request-Id": request_id},  # Same request ID
    )

    print("-" * 50)
    print("✓ Decode completed successfully")
    print(f"Final response: {decode_response.choices[0].text}")

except Exception as e:
    print(f"❌ Error during disaggregated serving: {e}")
    print("Check that both prefill and decode instances are running and accessible")
```