Using Kubernetes¶
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.
Alternatively, you can deploy vLLM to Kubernetes using any of the following:
- Helm
- InftyAI/llmaz
- KServe
- KubeRay
- kubernetes-sigs/lws
- meta-llama/llama-stack
- substratusai/kubeai
- vllm-project/aibrix
- vllm-project/production-stack
Deployment with CPU¶
Note
The use of a CPU here is for demonstration and testing purposes only, and its performance will not be on par with a GPU.
First, create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:
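The exact manifests are not reproduced here; below is a minimal sketch, assuming the PVC name `vllm-models` and the Secret name `hf-token-secret` (with key `token`) that the Deployment below refers to. Adjust the storage size and token to your environment:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi  # placeholder size; pick enough space for your model
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "REPLACE_WITH_TOKEN"  # Hugging Face token used to download gated models
```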
Next, start the vLLM server as a Kubernetes Deployment and Service:
```bash
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve meta-llama/Llama-3.2-1B-Instruct"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: llama-storage
          mountPath: /root/.cache/huggingface
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
EOF
```
We can confirm via the logs that the vLLM server started successfully (downloading the model may take a few minutes):
```console
kubectl logs -l app.kubernetes.io/name=vllm
...
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
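Once the server is ready, you can send a test request through the `vllm-server` Service defined above. A minimal sketch using a local port-forward (the model name matches the one served by the Deployment; the prompt is just an example):

```bash
# Forward a local port to the vllm-server Service
kubectl port-forward svc/vllm-server 8000:8000 &

# Query the OpenAI-compatible completions endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```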
Deployment with GPU¶
Prerequisite: Ensure that you have a running Kubernetes cluster with GPUs.
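Before deploying, it can help to confirm that the GPU device plugin has registered GPU resources on your nodes. A quick sanity check, assuming the `nvidia.com/gpu` or `amd.com/gpu` resource names used in the manifests below:

```bash
# List GPU resources advertised by each node's device plugin
kubectl describe nodes | grep -E 'nvidia.com/gpu|amd.com/gpu'
```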
- Create a PVC, Secret and Deployment for vLLM

The PVC is used to store the model cache and is optional; you can use hostPath or other storage options instead.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mistral-7b
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: default
  volumeMode: Filesystem
```

The Secret is optional and only required for accessing gated models; you can skip this step if you are not using gated models.
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
  namespace: default
type: Opaque
stringData:
  token: "REPLACE_WITH_TOKEN"
```

Next, create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model. Here are two examples, one using an NVIDIA GPU and one using an AMD GPU.
NVIDIA GPU:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
  namespace: default
  labels:
    app: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: mistral-7b
      # vLLM needs to access the host's shared memory for tensor parallel inference.
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: mistral-7b
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "10"
            memory: 20G
            nvidia.com/gpu: "1"
          requests:
            cpu: "2"
            memory: 6G
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
        - name: shm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 5
```

AMD GPU:
You can refer to the `deployment.yaml` below if using an AMD ROCm GPU such as the MI300X.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
  namespace: default
  labels:
    app: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      volumes:
      # PVC
      - name: cache-volume
        persistentVolumeClaim:
          claimName: mistral-7b
      # vLLM needs to access the host's shared memory for tensor parallel inference.
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "8Gi"
      hostNetwork: true
      hostIPC: true
      containers:
      - name: mistral-7b
        image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
        securityContext:
          seccompProfile:
            type: Unconfined
          runAsGroup: 44
          capabilities:
            add:
            - SYS_PTRACE
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve mistralai/Mistral-7B-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "10"
            memory: 20G
            amd.com/gpu: "1"
          requests:
            cpu: "6"
            memory: 6G
            amd.com/gpu: "1"
        volumeMounts:
        - name: cache-volume
          mountPath: /root/.cache/huggingface
        - name: shm
          mountPath: /dev/shm
```

You can get the full example with steps and sample yaml files from https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve.
- Create a Kubernetes Service for vLLM

Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
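The Service manifest itself is not included above; the following is a minimal sketch, assuming the Service listens on port 80 and forwards to container port 8000 (the `curl` example in the next step addresses `mistral-7b.default.svc.cluster.local` without an explicit port, i.e. port 80):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mistral-7b
  namespace: default
spec:
  selector:
    app: mistral-7b
  ports:
  - protocol: TCP
    port: 80          # port used by the in-cluster curl example below
    targetPort: 8000  # container port exposed by the vLLM server
  type: ClusterIP
```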
- Deploy and Test
Apply the deployment and Service configurations using `kubectl apply -f`:
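For example, assuming the manifests above were saved as `deployment.yaml` and `service.yaml` (illustrative filenames):

```bash
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
```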
To test the deployment, run the following `curl` command:

```bash
curl http://mistral-7b.default.svc.cluster.local/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```

If the service is deployed correctly, you should receive a response from the vLLM model.
Troubleshooting¶
Startup or readiness probe failure, container log contains "KeyboardInterrupt: terminated"¶
If the startup or readiness probe's failureThreshold is set too low for the time the server needs to start up, Kubernetes will terminate the container. A few signs that this is happening:
- The container log contains "KeyboardInterrupt: terminated"
- `kubectl get events` shows the message `Container $NAME failed startup probe, will be restarted`
To mitigate this, increase the failureThreshold to give the model server more time to start serving. You can determine a suitable failureThreshold by removing the probes from the manifest and measuring how long the model server takes to report ready.
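For example, a startup probe along these lines (a sketch, assuming the `/health` endpoint and port 8000 used in the manifests above) allows roughly 30 x 10 = 300 seconds for startup before the container is restarted:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 30  # 30 failures x 10s period allows ~5 minutes to start
```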
Conclusion¶
Deploying vLLM with Kubernetes enables efficient scaling and management of machine learning models that leverage GPU resources. Following the steps above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, feel free to contribute to the documentation.