1. Scheduling Overview

1.1 Introduction

The Scheduler is the Kubernetes component that assigns newly created pods to nodes in the cluster. It mainly has to balance the following concerns:
- Fairness: make sure every node gets a fair chance of being assigned workloads
- High resource utilization: maximize the use of all cluster resources
- Efficiency: scheduling must perform well and be able to place large batches of pods quickly
- Flexibility: allow users to control the scheduling logic according to their own needs
The Scheduler runs as a standalone process. Once started, it keeps watching the API Server for pods whose PodSpec.NodeName is empty, and for each such pod it creates a binding that records which node the pod should be placed on.
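For example, the pods still waiting for a node can be listed with a field selector on the empty spec.nodeName (a quick check, assuming kubectl access to the cluster):

```bash
# pods that the scheduler has not bound to a node yet
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=
```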
1.2 Scheduling Process
First, nodes that do not satisfy the pod's requirements are filtered out; this stage is called predicate.
Next, the surviving nodes are ranked; this is the priority stage.
Finally, the node with the highest priority is chosen.
In short: pre-selection (predicate) + preferred selection (priority).
Predicate algorithms:
- PodFitsResources: whether the node's remaining resources are greater than what the pod requests
- PodFitsHost: if the pod specifies a NodeName, whether the current node's name matches it
- PodFitsHostPorts: whether the ports the pod requests are already in use on the node
- PodSelectorMatches: filter out nodes whose labels do not match the pod's node selector
- NoDiskConflict: volumes already mounted on the node must not conflict with the volumes the pod requests, unless both are read-only
If no node survives the predicate stage, the pod stays in the Pending state and scheduling is retried until some node satisfies the conditions. If several nodes satisfy the conditions, the process continues with the priorities stage, which ranks them by priority.
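To see why a pod is stuck in Pending, the scheduler's FailedScheduling events can be inspected on the pod itself; the sketch below is illustrative, with the pod name and event message assumed:

```bash
$ kubectl describe pod pod-1
...
Events:
  Type     Reason            From               Message
  ----     ------            ----               -------
  Warning  FailedScheduling  default-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.
```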
Priority functions:
- LeastRequestedPriority: weight is derived from CPU and memory utilization; the lower the utilization, the higher the weight
- BalancedResourceAllocation: the closer the CPU and memory utilization are to each other, the higher the weight; usually used together with the previous function
- ImageLocalityPriority: favors nodes that already hold the pod's images; the larger the total size of those local images, the higher the weight
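For reference, LeastRequestedPriority computes a 0-10 score per resource from the unrequested fraction of capacity and averages CPU and memory; a worked example with assumed numbers:

```bash
# score = (cpu_score + memory_score) / 2, where per resource:
#   score = (capacity - requested) * 10 / capacity
# e.g. a node with 4 CPU / 8Gi, with 1 CPU and 2Gi already requested:
#   cpu:    (4 - 1) * 10 / 4 = 7.5
#   memory: (8 - 2) * 10 / 8 = 7.5
#   node score = (7.5 + 7.5) / 2 = 7.5
```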
1.3 Custom Scheduler

spec.schedulerName selects which scheduler handles the pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: annotation-second-scheduler
  labels:
    name: multischeduler-example
spec:
  schedulerName: my-scheduler
  containers:
  - name: pod-with-second-annotation-container
    image: gcr.io/google_containers/pause:2.0
```
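A custom scheduler can be as simple as a loop that finds pods requesting it and POSTs a Binding object. The sketch below is a toy example, assuming `kubectl proxy` is serving the API on localhost:8001 and `jq` is installed; it picks a random node rather than implementing real predicates and priorities:

```bash
#!/bin/bash
# Toy scheduler: binds any unassigned pod that requests "my-scheduler"
# to a random node in the cluster.
SERVER='localhost:8001'
while true; do
  for PODNAME in $(kubectl get pods -o json \
      | jq -r '.items[] | select(.spec.schedulerName == "my-scheduler")
               | select(.spec.nodeName == null) | .metadata.name'); do
    # pick a random node
    NODES=($(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'))
    NODE=${NODES[$((RANDOM % ${#NODES[@]}))]}
    # at the API level, scheduling is just creating a Binding object
    curl -s "http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding/" \
      -H "Content-Type: application/json" -X POST -d \
      "{\"apiVersion\":\"v1\",\"kind\":\"Binding\",\"metadata\":{\"name\":\"$PODNAME\"},\"target\":{\"apiVersion\":\"v1\",\"kind\":\"Node\",\"name\":\"$NODE\"}}"
    echo "Assigned $PODNAME to $NODE"
  done
  sleep 1
done
```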
2. Scheduling Affinity

2.1 Node Affinity

pod.spec.affinity.nodeAffinity supports two policies:
- preferredDuringSchedulingIgnoredDuringExecution: soft requirement
- requiredDuringSchedulingIgnoredDuringExecution: hard requirement
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: busybox
    command: ["/bin/sh", "-c", "sleep 600"]
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: NotIn
            values:
            - k8s-node02
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: source
            operator: In
            values:
            - k8s-node01
```
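A usage sketch for this manifest: the hard rule keeps the pod off k8s-node02, and the soft rule prefers nodes labeled source=k8s-node01, a label assumed here and applied by hand:

```bash
$ kubectl label node k8s-node01 source=k8s-node01   # satisfy the preferred term
$ kubectl apply -f node-affinity.yaml
$ kubectl get pod node-affinity -o wide             # expected to land on k8s-node01
```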
2.2 Pod Affinity

pod.spec.affinity.podAffinity / podAntiAffinity supports the same two policies:
- preferredDuringSchedulingIgnoredDuringExecution: soft requirement
- requiredDuringSchedulingIgnoredDuringExecution: hard requirement
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-affinity
  labels:
    app: pod-3
spec:
  containers:
  - name: pod-3
    image: busybox
    command: ["/bin/sh", "-c", "sleep 600"]
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - pod-1
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - pod-2
          topologyKey: kubernetes.io/hostname
```
The pod-affinity pod requires a pod labeled app=pod-1 in the same topology domain, so it stays Pending until such a pod exists:

```bash
$ kubectl get pod
NAME            READY   STATUS    RESTARTS   AGE
node-affinity   1/1     Running   0          9m22s
pod-affinity    0/1     Pending   0          10s

$ kubectl label pod node-affinity app=pod-1 --overwrite=true
pod/node-affinity labeled

$ kubectl get pod --show-labels
NAME            READY   STATUS    RESTARTS   AGE   LABELS
node-affinity   1/1     Running   1          10m   app=pod-1
pod-affinity    1/1     Running   0          95s   app=pod-3
```
2.3 Affinity/Anti-Affinity Comparison

| Scheduling policy | Labels matched on | Operators | Topology domain support | Scheduling goal |
| --- | --- | --- | --- | --- |
| nodeAffinity | Node | In, NotIn, Exists, DoesNotExist, Gt, Lt | No | Pin the pod to specific nodes |
| podAffinity | Pod | In, NotIn, Exists, DoesNotExist | Yes | Place the pod in the same topology domain as the specified pods |
| podAntiAffinity | Pod | In, NotIn, Exists, DoesNotExist | Yes | Keep the pod out of the topology domain of the specified pods |
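topologyKey in the table above is simply a node label whose value delimits the domain. Which keys are usable depends on the labels the nodes carry; kubernetes.io/hostname exists on every node, while zone labels depend on the environment (newer clusters use topology.kubernetes.io/zone, older ones failure-domain.beta.kubernetes.io/zone):

```bash
# print candidate topology labels as extra columns (zone may be empty on bare metal)
$ kubectl get nodes -L kubernetes.io/hostname -L topology.kubernetes.io/zone
```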
3. Taints and Tolerations

3.1 Taint

3.1.1 Composition of a Taint

Each taint has the form key=value:effect. The value may be empty; the effect describes what the taint does, and three effects are currently supported:
- NoSchedule: pods will not be scheduled onto a node carrying this taint
- PreferNoSchedule: the scheduler tries to avoid placing pods onto a node carrying this taint
- NoExecute: pods will not be scheduled onto the node, and pods already running there are evicted (see the sketch after this list)
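NoExecute is the only effect that acts on pods already running. A minimal sketch, reusing the node name from the examples in this article and an arbitrary check=value taint:

```bash
# evict running pods that do not tolerate the taint
$ kubectl taint nodes k8s-node01 check=value:NoExecute
$ kubectl get pod -o wide -w    # watch pods on k8s-node01 terminate and reschedule elsewhere

# clean up
$ kubectl taint nodes k8s-node01 check=value:NoExecute-
```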
3.1.2 Setting, Viewing, and Removing Taints

```bash
# set a taint
$ kubectl taint nodes k8s-node01 kickoff=test:NoSchedule

# view the node's taints
$ kubectl describe node k8s-node01 | grep -i taint
Taints:             kickoff=test:NoSchedule

# remove the taint (note the trailing "-")
$ kubectl taint nodes k8s-node01 kickoff=test:NoSchedule-
```
3.2 Toleration

Tolerations are declared under pod.spec.tolerations:
```yaml
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key2"
  operator: "Equal"
  value: "value2"
  effect: "NoExecute"
  # tolerationSeconds is only valid together with effect NoExecute
  tolerationSeconds: 3600
- key: "key3"
  operator: "Exists"
  effect: "NoSchedule"
```
- key, value, and effect must match the taint set on the Node
- when operator is Exists, the value is ignored
- tolerationSeconds is how long the pod may keep running on the node before being evicted (NoExecute only)
- when no key is specified, the toleration matches every taint key, as shown below
```yaml
tolerations:
- operator: "Exists"
```
When no effect is specified, the toleration matches every effect of the given taint key:
```yaml
tolerations:
- key: "key"
  operator: "Exists"
```
On a cluster with multiple master nodes, the default master taint can be relaxed to PreferNoSchedule so that masters accept pods only when nothing else fits:
```bash
$ kubectl describe node k8s-master | grep -i taint
Taints:             node-role.kubernetes.io/master:NoSchedule

$ kubectl taint nodes k8s-master node-role.kubernetes.io/master=:PreferNoSchedule
```
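To remove the default taint entirely instead of downgrading it, append "-" to the taint, as with any other taint (run against each master node):

```bash
$ kubectl taint nodes k8s-master node-role.kubernetes.io/master:NoSchedule-
```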
3.2.1 Example

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-1
spec:
  containers:
  - name: pod-1
    image: busybox
    command: ["/bin/sh", "-c", "sleep 600"]
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-2
spec:
  containers:
  - name: pod-2
    image: busybox
    command: ["/bin/sh", "-c", "sleep 600"]
  tolerations:
  - key: "kickoff"
    operator: "Equal"
    value: "test"
    effect: "NoSchedule"
```
With both nodes tainted, pod-1 (no toleration) stays Pending while pod-2 (matching toleration) runs; once the taint is removed from k8s-node01, pod-1 gets scheduled there:

```bash
$ kubectl taint nodes k8s-node01 kickoff=test:NoSchedule
$ kubectl taint nodes k8s-node02 kickoff=test:NoSchedule
$ kubectl create -f taint-toleration.yaml

$ kubectl get pod -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE         NOMINATED NODE   READINESS GATES
pod-1   0/1     Pending   0          58s   <none>        <none>       <none>           <none>
pod-2   1/1     Running   0          58s   10.244.2.55   k8s-node02   <none>           <none>

$ kubectl taint nodes k8s-node01 kickoff=test:NoSchedule-

$ kubectl get pod -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE         NOMINATED NODE   READINESS GATES
pod-1   1/1     Running   0          2m    10.244.1.40   k8s-node01   <none>           <none>
pod-2   1/1     Running   0          2m    10.244.2.55   k8s-node02   <none>           <none>
```
4. Pinning Pods to Nodes

4.1 Specifying the Node Name

pod.spec.nodeName binds the pod directly to the named node, bypassing the scheduler entirely:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tools
  template:
    metadata:
      labels:
        app: tools
    spec:
      nodeName: k8s-node01
      containers:
      - name: pod-1
        image: busybox
        command: ["/bin/sh", "-c", "sleep 600"]
```
```bash
$ kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE         NOMINATED NODE   READINESS GATES
test-1-5c889d444f-pp9td   1/1     Running   0          48s   10.244.1.41   k8s-node01   <none>           <none>
test-1-5c889d444f-rtk25   1/1     Running   0          48s   10.244.1.43   k8s-node01   <none>           <none>
test-1-5c889d444f-rv2fc   1/1     Running   0          48s   10.244.1.42   k8s-node01   <none>           <none>
```
4.2 Specifying a Node Selector

pod.spec.nodeSelector selects nodes through the label-selector mechanism: the scheduler matches node labels against the selector and places the pods on a matching node.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-2
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      nodeSelector:
        type: backendNode1
      containers:
      - name: web
        image: busybox
        command: ["/bin/sh", "-c", "sleep 600"]
```
The pods stay Pending until some node carries the type=backendNode1 label:

```bash
$ kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
test-2-564fd7c7df-4jftd   0/1     Pending   0          3s    <none>   <none>   <none>           <none>
test-2-564fd7c7df-tdwj7   0/1     Pending   0          3s    <none>   <none>   <none>           <none>

$ kubectl label node k8s-node02 type=backendNode1
node/k8s-node02 labeled

$ kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE         NOMINATED NODE   READINESS GATES
test-2-564fd7c7df-4jftd   1/1     Running   0          3m24s   10.244.2.56   k8s-node02   <none>           <none>
test-2-564fd7c7df-tdwj7   1/1     Running   0          3m24s   10.244.2.57   k8s-node02   <none>           <none>
```