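Using Spot Instances for Serving#
SkyServe supports serving models on a mixture of spot and on-demand replicas, with two options: base_ondemand_fallback_replicas and dynamic_ondemand_fallback. Currently, SkyServe relies on the user side to retry requests when a spot instance is preempted.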
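Because retrying is left to the user side, a client might wrap its requests in a small retry loop. The following is a minimal sketch and not part of SkyServe: the names fetch, get_with_retry, attempts, and backoff_s are hypothetical, and exponential backoff is an assumed policy rather than a recommendation from the SkyServe documentation.

```python
import time
import urllib.request
from typing import Callable


def fetch(url: str, timeout_s: float = 5.0) -> bytes:
    """Issue one GET request and return the response body."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read()


def get_with_retry(
    url: str,
    attempts: int = 5,
    backoff_s: float = 0.5,
    do_fetch: Callable[[str], bytes] = fetch,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Retry a request that fails, e.g. because its replica was preempted.

    Waits backoff_s, 2 * backoff_s, 4 * backoff_s, ... between attempts and
    re-raises the last error once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return do_fetch(url)
        except OSError:
            # urllib.error.URLError, ConnectionError and socket timeouts
            # all derive from OSError. Note that HTTP error statuses raise
            # HTTPError, which is also an OSError, so they are retried too.
            if attempt == attempts - 1:
                raise
            sleep(backoff_s * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

A caller would pass the service endpoint reported by sky serve status, for example get_with_retry("http://<endpoint>/health"). The do_fetch and sleep parameters exist only so the loop can be exercised without a network or real delays.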
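Base on-demand Fallback#
base_ondemand_fallback_replicas sets the number of on-demand replicas to keep running at all times. This is useful for ensuring service availability, because some capacity is always available even when no spot replicas are. use_spot should be set to true to enable spot replicas.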
service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 2
    max_replicas: 3
    target_qps_per_replica: 1
    # Ensures that one of the replicas is run on on-demand instances
    base_ondemand_fallback_replicas: 1

resources:
  ports: 8081
  cpus: 2+
  use_spot: true

workdir: examples/serve/http_server

run: python3 server.py
Tip
Kubernetes instances are treated as on-demand instances. You can use the base_ondemand_fallback_replicas option to run some replicas on Kubernetes while the others run on cloud spot instances.
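Dynamic on-demand Fallback#
SkyServe supports dynamically falling back to on-demand replicas when spot replicas are not available.
This is enabled by setting dynamic_ondemand_fallback to true.
It is useful for maintaining the required replica capacity when spot instances are interrupted.
When spot replicas become available again, SkyServe automatically switches back to them to maximize cost savings.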
service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 2
    max_replicas: 3
    target_qps_per_replica: 1
    # Allows replicas to be run on on-demand instances if spot instances are not available
    dynamic_ondemand_fallback: true

resources:
  ports: 8081
  cpus: 2+
  use_spot: true

workdir: examples/serve/http_server

run: python3 server.py
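Tip
SkyServe supports specifying base_ondemand_fallback_replicas and dynamic_ondemand_fallback together. With both set, SkyServe keeps a base number of on-demand replicas running and additionally falls back to on-demand replicas dynamically when spot replicas are not available.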
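To make the interplay of the two options concrete, here is a small sketch of how the replica mix could be derived. It is an illustrative model inferred from the behavior described on this page, not SkyServe's actual autoscaler code: the ReplicaPlan class, the plan_replicas function, and the target and ready_spot parameters are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaPlan:
    spot: int       # spot replicas to keep launched
    on_demand: int  # on-demand replicas to keep launched


def plan_replicas(
    target: int,
    ready_spot: int,
    base_ondemand_fallback_replicas: int = 0,
    dynamic_ondemand_fallback: bool = False,
) -> ReplicaPlan:
    """Assumed model: base on-demand replicas always run, spot covers the rest."""
    base = min(base_ondemand_fallback_replicas, target)
    spot_target = target - base
    on_demand = base
    if dynamic_ondemand_fallback:
        # Temporarily cover every spot replica that is not ready yet.
        on_demand += max(spot_target - ready_spot, 0)
    return ReplicaPlan(spot=spot_target, on_demand=on_demand)


# Target of 2 with dynamic fallback only: no spot replica ready yet,
# so two on-demand replicas are launched alongside the two spot replicas.
print(plan_replicas(target=2, ready_spot=0, dynamic_ondemand_fallback=True))

# Both options set: one base on-demand replica plus one temporary
# on-demand replica covering the spot replica that is not ready.
print(plan_replicas(target=2, ready_spot=0,
                    base_ondemand_fallback_replicas=1,
                    dynamic_ondemand_fallback=True))
```

Under this model the temporary on-demand replicas drop to zero as soon as ready_spot reaches the spot target, while the base on-demand replicas stay up.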
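Example#
The following example demonstrates how to use spot replicas with dynamic fallback in SkyServe. It is a simple HTTP server listening on port 8081, with dynamic_ondemand_fallback: true. To run it: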
$ sky serve up examples/serve/spot_policy/dynamic_on_demand_fallback.yaml -n http-server
Once the service is up, we can check the status of the service and its replicas with the following command. Initially, we will see:
$ sky serve status http-server
Services
NAME         VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT
http-server  1        1m 17s  NO_REPLICA  0/4       54.227.229.217:30001

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT  LAUNCHED    RESOURCES             STATUS        REGION
http-server   1   1        -         1 min ago   1x GCP([Spot]vCPU=2)  PROVISIONING  us-east1
http-server   2   1        -         1 min ago   1x GCP([Spot]vCPU=2)  PROVISIONING  us-central1
http-server   3   1        -         1 mins ago  1x GCP(vCPU=2)        PROVISIONING  us-east1
http-server   4   1        -         1 min ago   1x GCP(vCPU=2)        PROVISIONING  us-central1
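When the required number of spot replicas is not available, SkyServe provisions on-demand replicas to meet the target number of replicas. For example, when the target number is 2 and no spot replicas are ready, SkyServe provisions 2 on-demand replicas to meet the target.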
$ sky serve status http-server
Services
NAME         VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
http-server  1        1m 17s  READY   2/4       54.227.229.217:30001

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                   LAUNCHED    RESOURCES             STATUS         REGION
http-server   1   1        http://34.23.22.160:8081   3 min ago   1x GCP([Spot]vCPU=2)  READY          us-east1
http-server   2   1        http://34.68.226.193:8081  3 min ago   1x GCP([Spot]vCPU=2)  READY          us-central1
http-server   3   1        -                          3 mins ago  1x GCP(vCPU=2)        SHUTTING_DOWN  us-east1
http-server   4   1        -                          3 min ago   1x GCP(vCPU=2)        SHUTTING_DOWN  us-central1
When the spot replicas are ready, SkyServe automatically scales down the on-demand replicas to maximize cost savings.
$ sky serve status http-server
Services
NAME         VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
http-server  1        3m 59s  READY   2/2       54.227.229.217:30001

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                   LAUNCHED    RESOURCES             STATUS  REGION
http-server   1   1        http://34.23.22.160:8081   4 mins ago  1x GCP([Spot]vCPU=2)  READY   us-east1
http-server   2   1        http://34.68.226.193:8081  4 mins ago  1x GCP([Spot]vCPU=2)  READY   us-central1
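If a spot instance is interrupted (e.g. replica 1), SkyServe automatically falls back to on-demand replicas (e.g. by launching one on-demand replica) to meet the required replica capacity. SkyServe keeps trying to provision a spot replica in case spot availability returns. Note that SkyServe tries different regions and clouds to maximize the chance of successfully provisioning spot instances.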
$ sky serve status http-server
Services
NAME         VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
http-server  1        7m 2s   READY   1/3       54.227.229.217:30001

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                   LAUNCHED     RESOURCES             STATUS        REGION
http-server   2   1        http://34.68.226.193:8081  7 mins ago   1x GCP([Spot]vCPU=2)  READY         us-central1
http-server   5   1        -                          13 secs ago  1x GCP([Spot]vCPU=2)  PROVISIONING  us-central1
http-server   6   1        -                          13 secs ago  1x GCP(vCPU=2)        PROVISIONING  us-central1
Eventually, when spot availability returns, SkyServe automatically scales down the on-demand replicas.
$ sky serve status http-server
Services
NAME         VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
http-server  1        10m 5s  READY   2/3       54.227.229.217:30001

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                   LAUNCHED     RESOURCES             STATUS  REGION
http-server   2   1        http://34.68.226.193:8081  10 mins ago  1x GCP([Spot]vCPU=2)  READY   us-central1
http-server   5   1        http://34.121.49.94:8081   1 min ago    1x GCP([Spot]vCPU=2)  READY   us-central1