Eli's Blog

1. Scheduling Overview

1.1 Introduction

The Scheduler is the Kubernetes component that assigns newly created Pods to cluster nodes. Its design balances the following concerns:

  • Fairness: make sure every node has a chance to receive workloads
  • High resource utilization: maximize the use of the cluster's resources
  • Efficiency: scheduling must perform well and be able to place large batches of Pods quickly
  • Flexibility: allow users to customize the scheduling logic to their own needs

The Scheduler runs as a separate process. Once started, it continuously watches the API Server for Pods whose PodSpec.NodeName is empty, and for each such Pod it creates a binding that records which node the Pod should be placed on.
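
The binding itself is an ordinary API object. As a minimal sketch (the Pod name mypod and the node k8s-node01 are placeholder assumptions), the object the scheduler writes back looks like this:

apiVersion: v1
kind: Binding
metadata:
  name: mypod          # the Pod being scheduled (placeholder name)
target:
  apiVersion: v1
  kind: Node
  name: k8s-node01     # the node chosen for the Pod (placeholder name)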

1.2 The Scheduling Process

  • First, filter out the nodes that do not satisfy the Pod's requirements; this step is called predicate.
  • Next, rank the remaining nodes by priority; this step is called priority.
  • Finally, pick the node with the highest priority.

In short: filtering (predicates) followed by scoring (priorities).

Predicate algorithms:

  • PodFitsResources: whether the node's remaining resources are greater than the Pod's requests
  • PodFitsHost: if the Pod specifies a NodeName, check whether the node's name matches it
  • PodFitsHostPorts: whether the host ports requested by the Pod are already in use on the node
  • PodSelectorMatches: filter out nodes whose labels do not match the Pod's node selector
  • NoDiskConflict: volumes already mounted on the node must not conflict with the volumes the Pod requests, unless both are read-only

If no node survives the predicate step, the Pod stays Pending and the scheduler keeps retrying until some node satisfies its requirements. If multiple nodes pass, the process moves on to the priorities step, which ranks them by score.
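
As a concrete illustration of what two of these predicates look at (a hedged example; the name, image, and values are placeholders), PodFitsResources checks the resources.requests block, and PodFitsHostPorts checks each hostPort:

apiVersion: v1
kind: Pod
metadata:
  name: predicate-demo
spec:
  containers:
  - name: web
    image: busybox
    command: ["/bin/sh", "-c", "sleep 600"]
    ports:
    - containerPort: 80
      hostPort: 8080        # PodFitsHostPorts: fails on nodes where port 8080 is already taken
    resources:
      requests:             # PodFitsResources: the node must have at least this much free
        cpu: 500m
        memory: 256Mi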

Priority functions:

  • LeastRequestedPriority: derives a weight from CPU and memory utilization; the lower the utilization, the higher the weight
  • BalancedResourceAllocation: the closer the CPU and memory utilization are to each other, the higher the weight; usually used together with the previous function
  • ImageLocalityPriority: favors nodes that already have the Pod's images downloaded; the larger the total size of the local images, the higher the weight
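
For reference, the classic kube-scheduler formulas behind the first two functions are roughly the following (a sketch of the historical implementation; exact details vary across versions):

LeastRequestedPriority:
  score = ( (cpuCapacity - cpuRequested) * 10 / cpuCapacity
          + (memCapacity - memRequested) * 10 / memCapacity ) / 2

BalancedResourceAllocation:
  cpuFraction = cpuRequested / cpuCapacity
  memFraction = memRequested / memCapacity
  score       = 10 - abs(cpuFraction - memFraction) * 10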

1.3 Custom Schedulers

spec.schedulerName specifies which scheduler should schedule the Pod:

apiVersion: v1
kind: Pod
metadata:
  name: annotation-second-scheduler
  labels:
    name: multischeduler-example
spec:
  schedulerName: my-scheduler
  containers:
  - name: pod-with-second-annotation-container
    image: gcr.io/google_containers/pause:2.0
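
Note that a scheduler named my-scheduler must actually be deployed and running in the cluster; the default scheduler ignores Pods whose schedulerName differs from its own, so otherwise this Pod would stay Pending indefinitely.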

2. Scheduling Affinity

2.1 Node Affinity

pod.spec.affinity.nodeAffinity:

  • preferredDuringSchedulingIgnoredDuringExecution: soft requirement (a preference)
  • requiredDuringSchedulingIgnoredDuringExecution: hard requirement
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: busybox
    command: ["/bin/sh", "-c", "sleep 600"]
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: NotIn
            values:
            - k8s-node02
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: source
            operator: In
            values:
            - k8s-node01
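
Taken together, the two rules say: never schedule onto k8s-node02 (hard), and among the remaining nodes prefer those labeled source=k8s-node01 (soft). Assuming a small test cluster with these node names, a quick check might look like this:

$ kubectl apply -f node-affinity.yaml
$ kubectl get pod node-affinity -o wide    # should land on any node except k8s-node02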

2.2 Pod Affinity

pod.spec.affinity.podAffinity/podAntiAffinity:

  • preferredDuringSchedulingIgnoredDuringExecution: soft requirement (a preference)
  • requiredDuringSchedulingIgnoredDuringExecution: hard requirement
apiVersion: v1
kind: Pod
metadata:
  name: pod-affinity
  labels:
    app: pod-3
spec:
  containers:
  - name: pod-3
    image: busybox
    command: ["/bin/sh", "-c", "sleep 600"]
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - pod-1
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - pod-2
          topologyKey: kubernetes.io/hostname
$ kubectl get pod
NAME            READY   STATUS    RESTARTS   AGE
node-affinity   1/1     Running   0          9m22s
pod-affinity    0/1     Pending   0          10s

# Note: node-affinity must be Running; otherwise, even if its label is changed to satisfy the condition, pod-affinity will not be scheduled
$ kubectl label pod node-affinity app=pod-1 --overwrite=true
pod/node-affinity labeled

$ kubectl get pod --show-labels
NAME            READY   STATUS    RESTARTS   AGE    LABELS
node-affinity   1/1     Running   1          10m    app=pod-1
pod-affinity    1/1     Running   0          95s    app=pod-3
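
The topologyKey decides what "together" and "apart" mean: with kubernetes.io/hostname the affinity/anti-affinity domain is a single node, whereas a zone label such as topology.kubernetes.io/zone would widen the domain to an entire availability zone.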

2.3 Affinity/Anti-Affinity Policy Comparison

Scheduling policy   Matches labels on   Operators                                  Topology domain   Scheduling goal
nodeAffinity        Node                In, NotIn, Exists, DoesNotExist, Gt, Lt    No                place the Pod on specific nodes
podAffinity         Pod                 In, NotIn, Exists, DoesNotExist            Yes               place the Pod in the same topology domain as matching Pods
podAntiAffinity     Pod                 In, NotIn, Exists, DoesNotExist            Yes               keep the Pod out of the topology domain of matching Pods

3. Taints and Tolerations

  • Affinity: a preference or hard requirement of a Pod that attracts it to a class of nodes

  • Taint: the opposite of affinity; it lets a node repel a class of Pods

    • Taint: applied to a node to keep Pods that do not belong there from being assigned to it

    • Toleration: applied to a Pod to indicate that it can (tolerate and) be scheduled onto nodes with matching taints

3.1 Taint

3.1.1 Anatomy of a Taint

key=value:effect

The value may be empty. The effect describes what the taint does; three effects are currently supported:

  • NoSchedule: Pods will not be scheduled onto a Node with this taint
  • PreferNoSchedule: the scheduler tries to avoid placing Pods onto a Node with this taint
  • NoExecute: Pods will not be scheduled onto the Node, and Pods already running on it will be evicted
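
To see NoExecute in action (a hedged sketch; the node and key names are assumptions):

$ kubectl taint nodes k8s-node01 key1=value1:NoExecute
# Pods already running on k8s-node01 without a matching toleration are evicted immediately;
# Pods whose toleration sets tolerationSeconds are evicted after that many seconds.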

3.1.2 Setting, Viewing, and Removing Taints

# Set a taint
$ kubectl taint nodes k8s-node01 kickoff=test:NoSchedule

# View taints
$ kubectl describe node k8s-node01 | grep -i taint
Taints: kickoff=test:NoSchedule

# Remove a taint
$ kubectl taint nodes k8s-node01 kickoff=test:NoSchedule-

3.2 Toleration

pod.spec.tolerations

tolerations:
- key: key1
  operator: Equal
  value: value1
  effect: NoSchedule
- key: key2
  operator: Equal
  value: value2
  effect: NoExecute
  tolerationSeconds: 3600   # grace period before the Pod is evicted (only valid with NoExecute)
- key: key3
  operator: Exists
  effect: NoSchedule
  • key, value, and effect must match the taint set on the Node
  • when operator is Exists, the value is ignored
  • tolerationSeconds: how long the Pod may keep running on the node before it is evicted (only meaningful with effect NoExecute)
  1. When no key is specified, all taint keys are tolerated:

tolerations:
- operator: Exists
  2. When no effect is specified, all taint effects are tolerated:

tolerations:
- key: key
  operator: Exists
  3. With multiple master nodes, the default master taint can be relaxed:

# Master nodes carry a taint by default
$ kubectl describe node k8s-master | grep -i taint
Taints: node-role.kubernetes.io/master:NoSchedule

$ kubectl taint nodes k8s-master node-role.kubernetes.io/master=:PreferNoSchedule
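
With PreferNoSchedule the masters become a last resort: the scheduler avoids them when it can but may still place Pods there. Note that taints are identified by key and effect together, so the command above adds a second taint; to actually allow scheduling you would also remove the original one (kubectl taint nodes k8s-master node-role.kubernetes.io/master:NoSchedule-).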

3.2.1 Example

# taint-toleration.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-1
spec:
  containers:
  - name: pod-1
    image: busybox
    command: ["/bin/sh", "-c", "sleep 600"]

---
apiVersion: v1
kind: Pod
metadata:
  name: pod-2
spec:
  containers:
  - name: pod-2
    image: busybox
    command: ["/bin/sh", "-c", "sleep 600"]
  tolerations:
  - key: kickoff
    operator: Equal
    value: test
    effect: NoSchedule
# Taint both nodes
$ kubectl taint nodes k8s-node01 kickoff=test:NoSchedule
$ kubectl taint nodes k8s-node02 kickoff=test:NoSchedule

$ kubectl create -f taint-toleration.yaml

$ kubectl get pod -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE         NOMINATED NODE   READINESS GATES
pod-1   0/1     Pending   0          58s   <none>        <none>       <none>           <none>
pod-2   1/1     Running   0          58s   10.244.2.55   k8s-node02   <none>           <none>

# Remove the taint
$ kubectl taint nodes k8s-node01 kickoff=test:NoSchedule-

# No longer Pending
$ kubectl get pod -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE         NOMINATED NODE   READINESS GATES
pod-1   1/1     Running   0          2m    10.244.1.40   k8s-node01   <none>           <none>
pod-2   1/1     Running   0          2m    10.244.2.55   k8s-node02   <none>           <none>

4. Pinning Pods to Nodes

4.1 Specifying a Node Name

pod.spec.nodeName assigns the Pod directly to the named node, skipping the Scheduler altogether:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tools
  template:
    metadata:
      labels:
        app: tools
    spec:
      nodeName: k8s-node01   # specify the node name
      containers:
      - name: pod-1
        image: busybox
        command: ["/bin/sh", "-c", "sleep 600"]
$ kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE         NOMINATED NODE   READINESS GATES
test-1-5c889d444f-pp9td   1/1     Running   0          48s   10.244.1.41   k8s-node01   <none>           <none>
test-1-5c889d444f-rtk25   1/1     Running   0          48s   10.244.1.43   k8s-node01   <none>           <none>
test-1-5c889d444f-rv2fc   1/1     Running   0          48s   10.244.1.42   k8s-node01   <none>           <none>

4.2 Specifying a Node Selector

pod.spec.nodeSelector selects nodes through the label-selector mechanism: the scheduler matches the selector against node labels and places the Pod on a matching node.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-2
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      nodeSelector:          # specify the node label
        type: backendNode1
      containers:
      - name: web
        image: busybox
        command: ["/bin/sh", "-c", "sleep 600"]
$ kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP       NODE     NOMINATED NODE   READINESS GATES
test-2-564fd7c7df-4jftd   0/1     Pending   0          3s      <none>   <none>   <none>           <none>
test-2-564fd7c7df-tdwj7   0/1     Pending   0          3s      <none>   <none>   <none>           <none>

# Label the node
$ kubectl label node k8s-node02 type=backendNode1
node/k8s-node02 labeled

$ kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE         NOMINATED NODE   READINESS GATES
test-2-564fd7c7df-4jftd   1/1     Running   0          3m24s   10.244.2.56   k8s-node02   <none>           <none>
test-2-564fd7c7df-tdwj7   1/1     Running   0          3m24s   10.244.2.57   k8s-node02   <none>           <none>