
Using Kubernetes

Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.

Alternatively, vLLM can also be deployed to Kubernetes using any of several alternative deployment methods.

Deployment with CPU

Note

The use of CPU here is for demonstration and testing purposes only; its performance will not be on par with GPUs.

First, create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:

Config
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "${HF_TOKEN}"
EOF
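
Note that the heredoc expands ${HF_TOKEN} from your shell, so the token must be exported (for example, export HF_TOKEN=<your token>) before applying. You can then confirm that both resources exist; a minimal check:

kubectl get pvc vllm-models
kubectl get secret hf-token-secret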

Next, start the vLLM server as a Kubernetes Deployment and Service:

Config
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve meta-llama/Llama-3.2-1B-Instruct"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
          - containerPort: 8000
        volumeMounts:
          - name: llama-storage
            mountPath: /root/.cache/huggingface
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
EOF
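
After applying, you can wait for the Deployment to finish rolling out before inspecting the logs; a minimal sketch:

kubectl rollout status deployment/vllm-server
kubectl get pods -l app.kubernetes.io/name=vllm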

We can verify that the vLLM server has started successfully via the logs (downloading the model may take a few minutes):

kubectl logs -l app.kubernetes.io/name=vllm
...
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
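
The Service is of type ClusterIP, so it is not reachable from outside the cluster by default. For a quick local test you can port-forward it and call the OpenAI-compatible completions endpoint; a minimal sketch (the model name matches the one passed to vllm serve above):

# In one terminal: forward the ClusterIP service to localhost
kubectl port-forward service/vllm-server 8000:8000

# In another terminal: query the OpenAI-compatible completions endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
      }'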

Deployment with GPU

Prerequisite: Ensure that you have a running Kubernetes cluster with GPUs.

  1. Create a PVC, Secret and Deployment for vLLM

    The PVC is used to store the model cache. It is optional: you can use hostPath or other storage options instead.

    Yaml

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: mistral-7b
      namespace: default
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: default
      volumeMode: Filesystem
    
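    The storageClassName: default above assumes your cluster provides a StorageClass named default; adjust it to one of the classes your cluster actually offers, which you can list with:

    kubectl get storageclass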

    The Secret is optional and only required for accessing gated models; you can skip this step if you are not using gated models.

    apiVersion: v1
    kind: Secret
    metadata:
      name: hf-token-secret
      namespace: default
    type: Opaque
    stringData:
      token: "REPLACE_WITH_TOKEN"
    
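    If you prefer not to keep the token in a manifest, the same Secret can also be created imperatively; a minimal sketch (replace the placeholder with your actual token):

    kubectl create secret generic hf-token-secret \
      --namespace default \
      --from-literal=token=REPLACE_WITH_TOKEN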

    Next, create the deployment file for vLLM to run the model server. The following example deploys the Mistral-7B-Instruct-v0.3 model.

    Below are two examples, one for NVIDIA GPUs and one for AMD GPUs.

    NVIDIA GPU:

    Yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mistral-7b
      namespace: default
      labels:
        app: mistral-7b
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mistral-7b
      template:
        metadata:
          labels:
            app: mistral-7b
        spec:
          volumes:
          - name: cache-volume
            persistentVolumeClaim:
              claimName: mistral-7b
          # vLLM needs to access the host's shared memory for tensor parallel inference.
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "2Gi"
          containers:
          - name: mistral-7b
            image: vllm/vllm-openai:latest
            command: ["/bin/sh", "-c"]
            args: [
              "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
            ]
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
            ports:
            - containerPort: 8000
            resources:
              limits:
                cpu: "10"
                memory: 20G
                nvidia.com/gpu: "1"
              requests:
                cpu: "2"
                memory: 6G
                nvidia.com/gpu: "1"
            volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: cache-volume
            - name: shm
              mountPath: /dev/shm
            livenessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 5
    
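    Since the manifest requests the nvidia.com/gpu resource, the pod will only be scheduled on nodes where the NVIDIA device plugin advertises GPUs. A quick way to check (the node name is a placeholder):

    kubectl describe node <node-name> | grep nvidia.com/gpu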

    AMD GPU:

    You can refer to the deployment.yaml below if you are using an AMD ROCm GPU such as the MI300X.

    Yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mistral-7b
      namespace: default
      labels:
        app: mistral-7b
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mistral-7b
      template:
        metadata:
          labels:
            app: mistral-7b
        spec:
          volumes:
          # PVC
          - name: cache-volume
            persistentVolumeClaim:
              claimName: mistral-7b
          # vLLM needs to access the host's shared memory for tensor parallel inference.
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "8Gi"
          hostNetwork: true
          hostIPC: true
          containers:
          - name: mistral-7b
            image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
            securityContext:
              seccompProfile:
                type: Unconfined
              runAsGroup: 44
              capabilities:
                add:
                - SYS_PTRACE
            command: ["/bin/sh", "-c"]
            args: [
              "vllm serve mistralai/Mistral-7B-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
            ]
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
            ports:
            - containerPort: 8000
            resources:
              limits:
                cpu: "10"
                memory: 20G
                amd.com/gpu: "1"
              requests:
                cpu: "6"
                memory: 6G
                amd.com/gpu: "1"
            volumeMounts:
            - name: cache-volume
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
    

    You can find the full example, with steps and sample YAML files, at https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve.

  2. Create a Kubernetes Service for vLLM

    Next, create a Kubernetes Service file to expose the mistral-7b deployment:

    Yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: mistral-7b
      namespace: default
    spec:
      ports:
      - name: http-mistral-7b
        port: 80
        protocol: TCP
        targetPort: 8000
      # The label selector should match the deployment labels and is useful for the prefix caching feature
      selector:
        app: mistral-7b
      sessionAffinity: None
      type: ClusterIP
    
  3. Deploy and Test

    Apply the deployment and service configurations using kubectl apply -f <filename>:

    kubectl apply -f deployment.yaml
    kubectl apply -f service.yaml
    

    To test the deployment, run the following curl command:

    curl http://mistral-7b.default.svc.cluster.local/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "mistralai/Mistral-7B-Instruct-v0.3",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0
          }'
    

    If the service is deployed correctly, you should receive a response from the vLLM model.
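
    Note that the hostname mistral-7b.default.svc.cluster.local only resolves inside the cluster. If you are testing from your workstation, you can run the same request from a temporary pod; a minimal sketch, assuming the public curlimages/curl image is acceptable in your environment:

    kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
      curl -s http://mistral-7b.default.svc.cluster.local/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}'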

Troubleshooting

Startup or readiness probe failure, container log contains "KeyboardInterrupt: terminated"

If the failureThreshold of the startup or readiness probe is set too low for the time the server needs to start, the kubelet will kill and restart the container. A couple of signs that this has happened:

  1. The container log contains "KeyboardInterrupt: terminated"
  2. kubectl get events shows the message Container $NAME failed startup probe, will be restarted

To mitigate this, increase the failureThreshold to give the model server more time to start serving. You can determine an appropriate failureThreshold by removing the probes from the manifest and measuring how long the model server takes to report that it is ready.
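
For example, a dedicated startupProbe with a generous failureThreshold keeps the container from being restarted while the model is still loading. The fragment below goes under the container spec of the Deployment; the values are illustrative rather than measured:

startupProbe:
  httpGet:
    path: /health
    port: 8000
  # Allow up to 120 * 10 s = 20 minutes for model download and startup
  failureThreshold: 120
  periodSeconds: 10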

Conclusion

Deploying vLLM with Kubernetes allows for efficient scaling and management of machine learning models that utilize GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you run into any issues or have suggestions, please feel free to contribute to the documentation.
