Kubernetes#

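When running outside your Kubernetes cluster, SkyPilot uses your local ~/.kube/config file to authenticate and to create resources on your Kubernetes cluster.

When running inside your Kubernetes cluster (e.g., as a Spot controller or Serve controller), SkyPilot can operate using any of the following three authentication methods:

  1. Automatically create a service account: SkyPilot can automatically create a service account and roles to manage resources in the Kubernetes cluster. This is the default method when running inside the cluster and requires no additional configuration.

    For details on the permissions granted to the service account, see the Minimum permissions required for SkyPilot section below.

  2. Use a custom service account: If you have a custom service account with the necessary permissions, you can configure SkyPilot to use it by adding the following to your ~/.sky/config.yaml file: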

    kubernetes:
      remote_identity: your-service-account-name
    
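  3. Use the local kubeconfig file: In this case, SkyPilot copies your local ~/.kube/config file to the controller pod and uses it for authentication. To use this method, set remote_identity: LOCAL_CREDENTIALS under the kubernetes key in your ~/.sky/config.yaml file: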

    kubernetes:
      remote_identity: LOCAL_CREDENTIALS
    

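    Note

    If your cluster uses exec-based authentication in your ~/.kube/config file (e.g., GKE uses exec auth by default), SkyPilot may not be able to authenticate using this method. In this case, consider using one of the service account methods (1 or 2) instead.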

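Note

Service account based authentication applies only when the remote SkyPilot cluster (including spot and serve controllers) is launched inside the Kubernetes cluster. When running outside the cluster (e.g., on AWS), SkyPilot uses the local ~/.kube/config file for authentication.

The sections below list the permissions SkyPilot requires, followed by an example service account YAML that you can use to create a service account with the necessary permissions.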

Minimum permissions required for SkyPilot#

SkyPilot requires permissions equivalent to the following roles to manage resources in the Kubernetes cluster:

# Namespaced role for the service account
# Required for creating pods, services and other necessary resources in the namespace.
# Note these permissions only apply in the namespace where SkyPilot is deployed, and the namespace can be changed below.
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-sa-role  # Can be changed if needed
  namespace: default  # Change to your namespace if using a different one.
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]
---
# ClusterRole for accessing cluster-wide resources. Details for each resource below:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-sa-cluster-role  # Can be changed if needed
  namespace: default  # Change to your namespace if using a different one.
  labels:
    parent: skypilot
rules:
  - apiGroups: [""]
    resources: ["nodes"]  # Required for getting node resources.
    verbs: ["get", "list", "watch"]
  - apiGroups: ["node.k8s.io"]
    resources: ["runtimeclasses"]   # Required for autodetecting the runtime class of the nodes.
    verbs: ["get", "list", "watch"]

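Tip

If you are using a namespace other than default, make sure to change the namespace in the manifest above.

These roles must apply to both the user account configured in the kubeconfig file and the service account used by SkyPilot (if configured).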
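As an illustration of granting the same roles to a kubeconfig user, the bindings below are a minimal sketch. The user name `jane@example.com` and the binding names `sky-user-rb` and `sky-user-crb` are hypothetical placeholders; the actual user name depends on how your cluster authenticates users. The role names assume the manifest above was applied unchanged in the default namespace.

```yaml
# Sketch only: bind the namespaced role to a kubeconfig user (hypothetical names).
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-user-rb  # Hypothetical binding name
  namespace: default  # Must match the namespace of sky-sa-role
subjects:
  - kind: User
    name: jane@example.com  # Placeholder: the user name your cluster's authenticator reports
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: sky-sa-role
  apiGroup: rbac.authorization.k8s.io
---
# Sketch only: bind the cluster-wide role to the same user.
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-user-crb  # Hypothetical binding name
subjects:
  - kind: User
    name: jane@example.com  # Placeholder, as above
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: sky-sa-cluster-role
  apiGroup: rbac.authorization.k8s.io
```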

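If you need to view real-time GPU availability with sky show-gpus, if your tasks use object store mounting, or if your tasks need access to ingress resources, you will need to grant the additional permissions described below.

Permissions for sky show-gpus#

sky show-gpus needs to list all pods across all namespaces to calculate GPU availability. To do this, SkyPilot needs the get and list permissions for pods in a ClusterRole: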

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sky-sa-cluster-role-pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]

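Tip

If this role is not granted to the service account, sky show-gpus will still work, but it will show only the total number of GPUs on the nodes, not the number of free GPUs.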
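A ClusterRole has no effect until it is bound to a subject. The following is a minimal sketch of a binding for the pod-reader role, assuming a service account named sky-sa in the default namespace; the binding name is a hypothetical choice and can be changed.

```yaml
# Sketch only: bind the pod-reader ClusterRole to the service account.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: sky-sa-cluster-role-pod-reader-binding  # Hypothetical name
subjects:
  - kind: ServiceAccount
    name: sky-sa  # Assumed service account name
    namespace: default  # Assumed namespace of the service account
roleRef:
  kind: ClusterRole
  name: sky-sa-cluster-role-pod-reader  # Must match the ClusterRole name above
  apiGroup: rbac.authorization.k8s.io
```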

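Permissions for object store mounting#

If your tasks use object store mounting (e.g., S3, GCS), SkyPilot needs to run a DaemonSet that exposes the FUSE device as a Kubernetes resource to SkyPilot pods.

To allow this, you also need to create a skypilot-system namespace, which runs the DaemonSet, and grant the service account the necessary permissions in that namespace.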

# Required only if using object store mounting
# Create namespace for SkyPilot system
apiVersion: v1
kind: Namespace
metadata:
  name: skypilot-system  # Do not change this
  labels:
    parent: skypilot
---
# Role for the skypilot-system namespace to create FUSE device manager and
# any other system components required by SkyPilot.
# This role must be bound in the skypilot-system namespace to the service account used for SkyPilot.
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: skypilot-system-service-account-role  # Can be changed if needed
  namespace: skypilot-system  # Do not change this namespace
  labels:
    parent: skypilot
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]

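Permissions for using Ingress#

If your tasks use Ingress to expose ports, you will need to grant the service account the necessary permissions in the ingress-nginx namespace.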

# Required only if using ingresses
# Role for accessing ingress service IP
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: ingress-nginx  # Do not change this
  name: sky-sa-role-ingress-nginx  # Can be changed if needed
rules:
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["list", "get"]

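Example using a custom service account#

To create a service account that has all the permissions SkyPilot needs (including those for object store mounting), you can use the following YAML.

Tip

In this example, the service account is named sky-sa and is created in the default namespace. Change the namespace and service account name as needed.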

  1 # create-sky-sa.yaml
  2 kind: ServiceAccount
  3 apiVersion: v1
  4 metadata:
  5   name: sky-sa  # Change to your service account name
  6   namespace: default  # Change to your namespace if using a different one.
  7   labels:
  8     parent: skypilot
  9 ---
 10 # Role for the service account
 11 kind: Role
 12 apiVersion: rbac.authorization.k8s.io/v1
 13 metadata:
 14   name: sky-sa-role  # Can be changed if needed
 15   namespace: default  # Change to your namespace if using a different one.
 16   labels:
 17     parent: skypilot
 18 rules:
 19   - apiGroups: ["*"]  # Required for creating pods, services, secrets and other necessary resources in the namespace.
 20     resources: ["*"]
 21     verbs: ["*"]
 22 ---
 23 # RoleBinding for the service account
 24 kind: RoleBinding
 25 apiVersion: rbac.authorization.k8s.io/v1
 26 metadata:
 27   name: sky-sa-rb  # Can be changed if needed
 28   namespace: default  # Change to your namespace if using a different one.
 29   labels:
 30     parent: skypilot
 31 subjects:
 32   - kind: ServiceAccount
 33     name: sky-sa  # Change to your service account name
 34 roleRef:
 35   kind: Role
 36   name: sky-sa-role  # Use the same name as the role at line 14
 37   apiGroup: rbac.authorization.k8s.io
 38 ---
 39 # ClusterRole for the service account
 40 kind: ClusterRole
 41 apiVersion: rbac.authorization.k8s.io/v1
 42 metadata:
 43   name: sky-sa-cluster-role  # Can be changed if needed
 44   namespace: default  # Change to your namespace if using a different one.
 45   labels:
 46     parent: skypilot
 47 rules:
 48   - apiGroups: [""]
 49     resources: ["nodes"]  # Required for getting node resources.
 50     verbs: ["get", "list", "watch"]
 51   - apiGroups: ["node.k8s.io"]
 52     resources: ["runtimeclasses"]   # Required for autodetecting the runtime class of the nodes.
 53     verbs: ["get", "list", "watch"]
 54   - apiGroups: ["networking.k8s.io"]   # Required for exposing services through ingresses
 55     resources: ["ingressclasses"]
 56     verbs: ["get", "list", "watch"]
 57   - apiGroups: [""]                 # Required for `sky show-gpus` command
 58     resources: ["pods"]
 59     verbs: ["get", "list"]
 60 ---
 61 # ClusterRoleBinding for the service account
 62 apiVersion: rbac.authorization.k8s.io/v1
 63 kind: ClusterRoleBinding
 64 metadata:
 65   name: sky-sa-cluster-role-binding  # Can be changed if needed
 66   namespace: default  # Change to your namespace if using a different one.
 67   labels:
 68     parent: skypilot
 69 subjects:
 70   - kind: ServiceAccount
 71     name: sky-sa  # Change to your service account name
 72     namespace: default  # Change to your namespace if using a different one.
 73 roleRef:
 74   kind: ClusterRole
 75   name: sky-sa-cluster-role  # Use the same name as the cluster role at line 43
 76   apiGroup: rbac.authorization.k8s.io
 77 ---
 78 # Optional: If using object store mounting, create the skypilot-system namespace
 79 apiVersion: v1
 80 kind: Namespace
 81 metadata:
 82   name: skypilot-system  # Do not change this
 83   labels:
 84     parent: skypilot
 85 ---
 86 # Optional: If using object store mounting, create role in the skypilot-system
 87 # namespace to create FUSE device manager.
 88 kind: Role
 89 apiVersion: rbac.authorization.k8s.io/v1
 90 metadata:
 91   name: skypilot-system-service-account-role  # Can be changed if needed
 92   namespace: skypilot-system  # Do not change this namespace
 93   labels:
 94     parent: skypilot
 95 rules:
 96   - apiGroups: ["*"]
 97     resources: ["*"]
 98     verbs: ["*"]
 99 ---
100 # Optional: If using object store mounting, create rolebinding in the skypilot-system
101 # namespace to create FUSE device manager.
102 apiVersion: rbac.authorization.k8s.io/v1
103 kind: RoleBinding
104 metadata:
105   name: sky-sa-skypilot-system-role-binding
106   namespace: skypilot-system  # Do not change this namespace
107   labels:
108     parent: skypilot
109 subjects:
110   - kind: ServiceAccount
111     name: sky-sa  # Change to your service account name
112     namespace: default  # Change this to the namespace where the service account is created
113 roleRef:
114   kind: Role
115   name: skypilot-system-service-account-role  # Use the same name as the role at line 91
116   apiGroup: rbac.authorization.k8s.io
117 ---
118 # Optional: Role for accessing ingress resources
119 apiVersion: rbac.authorization.k8s.io/v1
120 kind: Role
121 metadata:
122   name: sky-sa-role-ingress-nginx  # Can be changed if needed
123   namespace: ingress-nginx  # Do not change this namespace
124   labels:
125     parent: skypilot
126 rules:
127   - apiGroups: [""]
128     resources: ["services"]
129     verbs: ["list", "get", "watch"]
130   - apiGroups: ["rbac.authorization.k8s.io"]
131     resources: ["roles", "rolebindings"]
132     verbs: ["list", "get", "watch"]
133 ---
134 # Optional: RoleBinding for accessing ingress resources
135 apiVersion: rbac.authorization.k8s.io/v1
136 kind: RoleBinding
137 metadata:
138   name: sky-sa-rolebinding-ingress-nginx  # Can be changed if needed
139   namespace: ingress-nginx  # Do not change this namespace
140   labels:
141     parent: skypilot
142 subjects:
143   - kind: ServiceAccount
144     name: sky-sa  # Change to your service account name
145     namespace: default  # Change this to the namespace where the service account is created
146 roleRef:
147   kind: Role
148   name: sky-sa-role-ingress-nginx  # Use the same name as the role at line 122
149   apiGroup: rbac.authorization.k8s.io

Create the service account with the following command:

$ kubectl apply -f create-sky-sa.yaml

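After the service account is created, the cluster admin can distribute kubeconfigs configured with the sky-sa service account to users who need access to the cluster.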
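For illustration, a kubeconfig that authenticates as the service account might look like the sketch below. All values in angle brackets are placeholders, and the cluster and context names (my-cluster, sky-sa@my-cluster) are hypothetical. On recent Kubernetes versions, one possible way to obtain a time-limited token is `kubectl create token sky-sa`; how tokens are issued and rotated is up to the cluster admin.

```yaml
# Sketch only: kubeconfig using a service account token (placeholders throughout).
apiVersion: v1
kind: Config
clusters:
  - name: my-cluster  # Hypothetical cluster name
    cluster:
      server: https://<API_SERVER_ADDRESS>  # Placeholder
      certificate-authority-data: <BASE64_ENCODED_CA_CERT>  # Placeholder
users:
  - name: sky-sa
    user:
      token: <SERVICE_ACCOUNT_TOKEN>  # Placeholder: token issued for the sky-sa service account
contexts:
  - name: sky-sa@my-cluster  # Hypothetical context name
    context:
      cluster: my-cluster
      user: sky-sa
      namespace: default  # Assumed namespace of the service account
current-context: sky-sa@my-cluster
```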

Users should also configure SkyPilot to use the sky-sa service account through ~/.sky/config.yaml:

# ~/.sky/config.yaml
kubernetes:
  remote_identity: sky-sa   # Or your service account name