Using Spot Instances for Serving

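SkyServe supports serving models on a mixture of spot and on-demand replicas, controlled by two options: base_ondemand_fallback_replicas and dynamic_ondemand_fallback. Currently, SkyServe relies on the user side to retry requests when a spot instance is preempted.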
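Because retries on preemption are left to the client, a caller may want to wrap its requests in a small retry loop. The following is a minimal sketch, not part of SkyServe: the helper names call_with_retry and fetch, the backoff schedule, and the set of exceptions treated as retryable are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def call_with_retry(send, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send() until it succeeds or attempts run out (hypothetical helper).

    Waits base_delay * 2**attempt seconds between failed attempts.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return send()
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)


def fetch(url, timeout=10):
    """Issue one GET request and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

A caller would then use something like call_with_retry(lambda: fetch(url)), where url is built from the ENDPOINT value reported by sky serve status. Retrying is only safe for idempotent requests; whether that holds depends on the served application.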

Base on-demand fallback

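base_ondemand_fallback_replicas sets the number of on-demand replicas to keep running at all times. This keeps the service available by guaranteeing some capacity even when no spot replicas can be obtained. use_spot must be set to true to enable spot replicas.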

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 2
    max_replicas: 3
    target_qps_per_replica: 1
    # Ensures that one of the replicas is run on on-demand instances
    base_ondemand_fallback_replicas: 1

resources:
  ports: 8081
  cpus: 2+
  use_spot: true

workdir: examples/serve/http_server

run: python3 server.py

Tip

Kubernetes instances are considered on-demand instances. You can use the base_ondemand_fallback_replicas option to run some replicas on Kubernetes while the others run on cloud spot instances.

Dynamic on-demand fallback

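SkyServe supports dynamically falling back to on-demand replicas when spot replicas are not available. This is enabled by setting dynamic_ondemand_fallback to true. It is useful for maintaining the required replica capacity during spot instance interruptions. When spot replicas become available again, SkyServe automatically switches back to spot replicas to maximize cost savings.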

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 2
    max_replicas: 3
    target_qps_per_replica: 1
    # Allows replicas to be run on on-demand instances if spot instances are not available
    dynamic_ondemand_fallback: true

resources:
  ports: 8081
  cpus: 2+
  use_spot: true

workdir: examples/serve/http_server

run: python3 server.py

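Tip

SkyServe supports specifying base_ondemand_fallback_replicas and dynamic_ondemand_fallback together. Specifying both sets a base number of on-demand replicas and also falls back to on-demand replicas dynamically when spot replicas are not available.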
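To make the interaction between the two options concrete, here is a simplified model of the replica mix they imply. It is a sketch under stated assumptions, not SkyServe's actual autoscaler code: the names ReplicaPlan and plan_replicas are hypothetical, and the rule "one temporary on-demand replica per spot replica that is not yet ready" is inferred from the behavior described in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaPlan:
    spot: int       # spot replicas to keep launched or launching
    on_demand: int  # on-demand replicas to keep launched or launching


def plan_replicas(target, ready_spot, base_ondemand_fallback_replicas=0,
                  dynamic_ondemand_fallback=False):
    """Simplified model of the spot/on-demand mix (illustrative only)."""
    if target < 0 or ready_spot < 0 or base_ondemand_fallback_replicas < 0:
        raise ValueError("counts must be non-negative")
    # Assumption: the base on-demand replicas always run, capped at the target.
    base = min(base_ondemand_fallback_replicas, target)
    # The rest of the target is requested as spot replicas.
    spot = target - base
    on_demand = base
    if dynamic_ondemand_fallback:
        # Cover each spot replica that is not ready with a temporary on-demand one.
        on_demand += max(0, spot - ready_spot)
    return ReplicaPlan(spot=spot, on_demand=on_demand)
```

Under this model, a target of 2 with dynamic_ondemand_fallback enabled and no ready spot replicas yields 2 spot and 2 on-demand replicas, and the on-demand count drops to 0 once both spot replicas are ready.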

Example

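The following example demonstrates how to use spot replicas with dynamic fallback in SkyServe. The example is a simple HTTP server listening on port 8081, with dynamic_ondemand_fallback: true. To run it: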

$ sky serve up examples/serve/spot_policy/dynamic_on_demand_fallback.yaml -n http-server

Once the service is up, we can check the status of the service and its replicas with the following command. Initially, we will see:

$ sky serve status http-server

Services
NAME         VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT
http-server  1        1m 17s  NO_REPLICA  0/4       54.227.229.217:30001

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                   LAUNCHED    RESOURCES             STATUS         REGION
http-server   1   1        -                          1 min ago   1x GCP([Spot]vCPU=2)  PROVISIONING  us-east1
http-server   2   1        -                          1 min ago   1x GCP([Spot]vCPU=2)  PROVISIONING  us-central1
http-server   3   1        -                          1 mins ago  1x GCP(vCPU=2)        PROVISIONING  us-east1
http-server   4   1        -                          1 min ago   1x GCP(vCPU=2)        PROVISIONING  us-central1

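When the required number of spot replicas is not available, SkyServe provisions on-demand replicas to meet the target number of replicas. For example, when the target number is 2 and no spot replicas are ready, SkyServe provisions 2 on-demand replicas to meet the target, which is why the output above lists two on-demand replicas (3 and 4) alongside the two spot replicas (1 and 2).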

$ sky serve status http-server

Services
NAME         VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
http-server  1        1m 17s  READY   2/4       54.227.229.217:30001

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                   LAUNCHED    RESOURCES             STATUS         REGION
http-server   1   1        http://34.23.22.160:8081   3 min ago   1x GCP([Spot]vCPU=2)  READY          us-east1
http-server   2   1        http://34.68.226.193:8081  3 min ago   1x GCP([Spot]vCPU=2)  READY          us-central1
http-server   3   1        -                          3 mins ago  1x GCP(vCPU=2)        SHUTTING_DOWN  us-east1
http-server   4   1        -                          3 min ago   1x GCP(vCPU=2)        SHUTTING_DOWN  us-central1

When the spot replicas are ready, SkyServe automatically scales down the on-demand replicas to maximize cost savings.

$ sky serve status http-server

Services
NAME         VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
http-server  1        3m 59s  READY   2/2       54.227.229.217:30001

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                   LAUNCHED    RESOURCES             STATUS  REGION
http-server   1   1        http://34.23.22.160:8081   4 mins ago  1x GCP([Spot]vCPU=2)  READY   us-east1
http-server   2   1        http://34.68.226.193:8081  4 mins ago  1x GCP([Spot]vCPU=2)  READY   us-central1

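In the event of a spot instance interruption (e.g. replica 1), SkyServe automatically falls back to on-demand replicas (e.g. by launching one on-demand replica) to meet the required replica capacity. Meanwhile, SkyServe keeps trying to provision a spot replica for when spot availability returns. Note that SkyServe tries different regions and clouds to maximize the chance of successfully provisioning spot instances.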

$ sky serve status http-server

Services
NAME         VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
http-server  1        7m 2s   READY   1/3       54.227.229.217:30001

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                   LAUNCHED     RESOURCES             STATUS        REGION
http-server   2   1        http://34.68.226.193:8081  7 mins ago   1x GCP([Spot]vCPU=2)  READY         us-central1
http-server   5   1        -                          13 secs ago  1x GCP([Spot]vCPU=2)  PROVISIONING  us-central1
http-server   6   1        -                          13 secs ago  1x GCP(vCPU=2)        PROVISIONING  us-central1

Eventually, when spot availability is back, SkyServe automatically scales down the on-demand replicas.

$ sky serve status http-server

Services
NAME         VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
http-server  1        10m 5s  READY   2/3       54.227.229.217:30001

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                   LAUNCHED     RESOURCES             STATUS         REGION
http-server   2   1        http://34.68.226.193:8081  10 mins ago  1x GCP([Spot]vCPU=2)  READY          us-central1
http-server   5   1        http://34.121.49.94:8081   1 min ago    1x GCP([Spot]vCPU=2)  READY          us-central1
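
To follow these transitions from a script, one option is to tally READY replicas by pricing type from the "Service Replicas" table. The sketch below is an assumption-laden illustration rather than a SkyServe API: the function name count_ready_replicas is hypothetical, and it relies on the text layout shown above (columns separated by two or more spaces, spot replicas marked with [Spot] in the RESOURCES column), which may differ between versions.

```python
import re


def count_ready_replicas(replica_table):
    """Count READY replicas by pricing type from the replica table text."""
    lines = [ln for ln in replica_table.strip().splitlines() if ln.strip()]
    # Columns are separated by two or more spaces; single spaces occur inside cells.
    header = re.split(r"\s{2,}", lines[0].strip())
    res_col = header.index("RESOURCES")
    status_col = header.index("STATUS")
    counts = {"spot": 0, "on_demand": 0}
    for line in lines[1:]:
        cells = re.split(r"\s{2,}", line.strip())
        # Skip rows that do not match the header layout or are not serving yet.
        if len(cells) != len(header) or cells[status_col] != "READY":
            continue
        kind = "spot" if "[Spot]" in cells[res_col] else "on_demand"
        counts[kind] += 1
    return counts
```

The input is the table starting at the SERVICE_NAME header line, for example captured from the output of sky serve status http-server.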