Install SGLang

You can install SGLang using any of the methods below.

Method 1: With pip

pip install --upgrade pip
pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/

Note: Check the FlashInfer installation documentation and install the build that matches your PyTorch and CUDA versions.
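The --find-links URL above encodes a CUDA version (cu124) and a PyTorch minor version (torch2.4). As a rough sketch, the helper below rebuilds that URL from version strings. The name flashinfer_wheel_index is hypothetical, and the URL pattern is only inferred from the single example in this guide; whether a wheel index actually exists for a given combination still has to be checked in the FlashInfer installation documentation.

```python
def flashinfer_wheel_index(cuda_version: str, torch_version: str) -> str:
    """Build a --find-links URL following the pattern used in this guide.

    Assumption: the layout cu<major><minor>/torch<major>.<minor> is inferred
    from the one URL shown above; it is not a documented FlashInfer API.
    """
    # "12.4" -> "12", "4"
    cuda_major, cuda_minor = cuda_version.split(".")[:2]
    # "2.4.0+cu124" -> "2", "4" (drop the local build tag and the patch number)
    torch_major, torch_minor = torch_version.split("+")[0].split(".")[:2]
    return (
        "https://flashinfer.ai/whl/"
        f"cu{cuda_major}{cuda_minor}/torch{torch_major}.{torch_minor}/flashinfer/"
    )


print(flashinfer_wheel_index("12.4", "2.4.0+cu124"))
```

On a machine that already has PyTorch installed, torch.version.cuda and torch.__version__ supply the two inputs.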

Method 2: From source

# Use the last release branch
git clone -b v0.4.1.post3 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/

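Note: Check the FlashInfer installation documentation and install the build that matches your PyTorch and CUDA versions.

Note: For AMD ROCm systems with Instinct/MI GPUs, use the following instead: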

# Use the last release branch
git clone -b v0.4.1.post3 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all_hip]"

Method 3: Using Docker

The Docker images are available on Docker Hub as lmsysorg/sglang and are built from the Dockerfile. Replace <secret> below with your Hugging Face Hub token.

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
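
Once the container is up, the server listens on port 30000 of the host. The sketch below prepares a request for SGLang's native /generate endpoint using only the Python standard library. The helper names, the prompt, and the sampling values are illustrative assumptions; the base URL assumes the -p 30000:30000 mapping from the command above, and the response is assumed to carry the completion under the "text" key.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumes the -p 30000:30000 mapping above


def build_generate_request(prompt: str, max_new_tokens: int = 32,
                           base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare a POST to /generate; nothing is sent until urlopen is called."""
    payload = {
        "text": prompt,
        "sampling_params": {"temperature": 0, "max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout_s: float = 60.0) -> str:
    """Send the request to a running server and return the generated text."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read())["text"]
```

With the server running, calling generate("The capital of France is") should return a short completion; the first request may be slow while the model finishes loading.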

Note: For AMD ROCm systems with Instinct/MI GPUs, it is recommended to build the image from docker/Dockerfile.rocm, for example:

docker build --build-arg SGL_BRANCH=v0.4.1.post3 -t v0.4.1.post3-rocm620 -f Dockerfile.rocm .

alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host \
    --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v $HOME/dockerx:/dockerx -v /data:/data'

drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    v0.4.1.post3-rocm620 \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000

# Until the FlashInfer backend is available, --attention-backend triton --sampling-backend pytorch are set by default
drun v0.4.1.post3-rocm620 python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8

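Method 4: Using Docker Compose

This method is recommended if you plan to run SGLang as a service. A better approach is to use k8s-sglang-service.yaml.

Copy compose.yml to your local machine.
Run docker compose up -d in your terminal.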

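Method 5: Run on Kubernetes or clouds with SkyPilot

To deploy on Kubernetes or on 12+ clouds, you can use SkyPilot.

Install SkyPilot and set up access to your Kubernetes cluster or cloud: see SkyPilot's documentation.
Deploy on your own infrastructure with a single command and get the HTTP API endpoint: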

SkyPilot YAML: sglang.yaml

# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000

# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml

# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
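
A freshly launched cluster may need some time to download the model before it answers requests. The following sketch polls the server until it responds. It assumes that sky status --endpoint prints an address of the form host:port and that the server answers GET /health with HTTP 200 once it is up. The function names and the injectable probe parameter are hypothetical; the latter exists so the loop can be exercised without a live server.

```python
import time
import urllib.request
from typing import Callable


def http_health_probe(base_url: str) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False


def wait_until_ready(endpoint: str, timeout_s: float = 600.0,
                     interval_s: float = 5.0,
                     probe: Callable[[str], bool] = http_health_probe) -> bool:
    """Poll the endpoint until the probe succeeds or timeout_s elapses."""
    # Accept both "1.2.3.4:30000" and "http://1.2.3.4:30000".
    base_url = endpoint if endpoint.startswith("http") else f"http://{endpoint}"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Passing the address printed by the command above to wait_until_ready returns True as soon as the server is reachable, or False if the timeout expires first.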

To scale your deployment further with autoscaling and failure recovery, see the SkyServe + SGLang guide.

Common Notes

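FlashInfer is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), switch to other kernels by adding --attention-backend triton --sampling-backend pytorch and open an issue on GitHub.
If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using pip install "sglang[openai]".
The language frontend operates independently of the backend runtime. You can install the frontend locally without a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run pip install sglang; for the backend, use pip install "sglang[srt]". This lets you build SGLang programs locally and execute them by connecting to a remote backend.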
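
The first note above ties FlashInfer to the GPU's compute capability. As a minimal sketch, the helper below decides which extra launch flags to pass. The function names are hypothetical, and the fallback flags are the ones quoted in that note. Applying the same flags to GPUs older than sm75 is an assumption made here; the note only prescribes them for FlashInfer problems on sm75+ devices.

```python
from typing import List, Tuple

MIN_FLASHINFER_CAPABILITY = (7, 5)  # sm75, per the note above
FALLBACK_FLAGS = ["--attention-backend", "triton", "--sampling-backend", "pytorch"]


def supports_flashinfer(capability: Tuple[int, int]) -> bool:
    """True when the (major, minor) compute capability is sm75 or newer."""
    return tuple(capability) >= MIN_FLASHINFER_CAPABILITY


def extra_launch_flags(capability: Tuple[int, int],
                       force_fallback: bool = False) -> List[str]:
    """Flags to append to `python3 -m sglang.launch_server ...`.

    force_fallback covers the case where FlashInfer misbehaves on a
    supported device and the other kernels are wanted anyway.
    """
    if force_fallback or not supports_flashinfer(capability):
        return list(FALLBACK_FLAGS)
    return []


print(extra_launch_flags((7, 0)))  # e.g. V100 (sm70): fallback flags
print(extra_launch_flags((8, 0)))  # e.g. A100 (sm80): no extra flags
```

On a CUDA machine with PyTorch installed, torch.cuda.get_device_capability() returns the (major, minor) tuple expected by both helpers.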
