
1. Controllers

Autonomous Pods vs. controller-managed Pods:

  • Autonomous Pod: if the Pod exits, it is not recreated, because nothing (no resource controller) manages it.
  • Controller-managed Pod: throughout the controller's lifecycle, the desired number of Pod replicas is always maintained.

Kubernetes ships with many built-in controllers. Each one works like a state machine that drives Pods toward a desired state and controls their behavior.

2. ReplicaSet

Purpose: maintain the number of Pod replicas. A ReplicaSet ensures that the number of replicas of a containerized application always matches the user-defined count: if a container exits abnormally, a new Pod is created automatically to replace it, and any surplus containers are reclaimed automatically. The RS uses its selector labels to decide which Pods belong to it, so the labels in template must match the selector.

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: rs-nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template: # Pod template
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: nginx
        image: nginx
        env:
        - name: GET_HOSTS_FROM
          value: dns
        ports:
        - containerPort: 80
$ kubectl get pod --show-labels
rs-nginx-fqgmx 1/1 Running 0 107s tier=frontend
rs-nginx-r8949 1/1 Running 0 107s tier=frontend
rs-nginx-wstkr 1/1 Running 0 107s tier=frontend


$ kubectl label pod rs-nginx-fqgmx tier=backend --overwrite=true
pod/rs-nginx-fqgmx labeled

$ kubectl get pod --show-labels
NAME READY STATUS RESTARTS AGE LABELS
rs-nginx-6lqps 1/1 Running 0 93s tier=frontend
rs-nginx-fqgmx 1/1 Running 0 4m46s tier=backend # no longer managed by the ReplicaSet
rs-nginx-r8949 1/1 Running 0 4m46s tier=frontend
rs-nginx-wstkr 1/1 Running 0 4m46s tier=frontend

3. Deployment

A Deployment provides a declarative way to manage Pods and ReplicaSets, replacing the older ReplicationController (RC) for convenient application management. Typical use cases include:

  • Defining a Deployment to create Pods and a ReplicaSet
  • Rolling upgrades and rollbacks of an application (a new ReplicaSet is created; the new RS gains one Pod while the old RS loses one, step by step)
  • Scaling out and scaling in
  • Pausing and resuming a Deployment

Aside: imperative vs. declarative programming

  • Imperative programming: focuses on how to accomplish the task; you write out, step by step, the logic that produces the result.
  • Declarative programming: focuses on what you want; you describe the desired result and let the computer/engine work out how to achieve it (e.g. SQL).

Declarative (Deployment): prefer apply

Imperative (RS): prefer create
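
As a quick illustration, assuming the Deployment manifest from the next section is saved as deploy-nginx.yaml (a hypothetical file name):

# Imperative: create the object once; re-running the command fails if it already exists
$ kubectl create -f deploy-nginx.yaml

# Declarative: declare the desired state; safe to re-run after editing the file
$ kubectl apply -f deploy-nginx.yaml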

How a ReplicaSet relates to a Deployment:

(figure: the relationship between a Deployment and its ReplicaSet)

Deploying a simple Nginx application

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        app: nginx
        tier: frontend
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
$ kubectl get pod --show-labels
NAME READY STATUS RESTARTS AGE LABELS
deploy-nginx-864fc9c987-7kxx7 1/1 Running 0 25s app=nginx,pod-template-hash=864fc9c987,tier=frontend
deploy-nginx-864fc9c987-d6pxg 1/1 Running 0 25s app=nginx,pod-template-hash=864fc9c987,tier=frontend
deploy-nginx-864fc9c987-nkk4p 1/1 Running 0 25s app=nginx,pod-template-hash=864fc9c987,tier=frontend

# Scale out
$ kubectl scale deployment deploy-nginx --replicas=5

# Update the image; a new ReplicaSet is created automatically
$ kubectl set image deployment/deploy-nginx nginx=nginx:1.21.3

# Roll back
$ kubectl rollout undo deployment/deploy-nginx

# Check the rollout status
$ kubectl rollout status deployment/deploy-nginx

# View the revision history (pass --record at creation time to record a change cause)
$ kubectl rollout history deployment/deploy-nginx

# Roll back to a specific revision
$ kubectl rollout undo deployment/deploy-nginx --to-revision=2

# Pause the rollout
$ kubectl rollout pause deployment/deploy-nginx

Update strategy: rolling update by default, with both maxSurge and maxUnavailable set to 25%.

Cleaning up old revisions: spec.revisionHistoryLimit controls how many revisions (old ReplicaSets) the Deployment keeps for rollback. In apps/v1 the default is 10; if it is set to 0, the Deployment cannot be rolled back.
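
Both settings live directly in the Deployment spec. A minimal sketch (the values are illustrative, not taken from the example above):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-nginx
spec:
  replicas: 3
  revisionHistoryLimit: 10       # how many old ReplicaSets to keep for rollback
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%              # extra Pods allowed above the desired count during an update
      maxUnavailable: 25%        # Pods allowed to be unavailable during an update
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: nginx
        image: nginx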

4. DaemonSet

A DaemonSet ensures that every node that is not tainted (or whose taints are tolerated) runs one copy of a Pod. When a new node joins the cluster, a Pod is automatically added on it; when a node is removed from the cluster, that Pod is reclaimed. (A sketch with a toleration follows the basic manifest below.)

Typical scenarios for a DaemonSet:

  • Running a cluster storage daemon on every node, e.g. glusterd or ceph
  • Running a log collection daemon on every node, e.g. fluentd or logstash
  • Running a monitoring daemon on every node, e.g. Prometheus Node Exporter, collectd, the Datadog agent, the New Relic agent, or Ganglia gmond
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ds-nginx
  labels:
    app: daemonset
spec:
  selector:
    matchLabels:
      name: ds
  template:
    metadata:
      labels:
        name: ds
    spec:
      containers:
      - name: nginx
        image: nginx
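
If the daemon should also run on tainted nodes (for example the control-plane node), the Pod template can tolerate the taint. A minimal sketch, assuming the node-role.kubernetes.io/control-plane taint used by recent kubeadm clusters:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ds-nginx
spec:
  selector:
    matchLabels:
      name: ds
  template:
    metadata:
      labels:
        name: ds
    spec:
      tolerations:
      # Allow scheduling onto control-plane nodes despite their NoSchedule taint.
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nginx
        image: nginx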

5. Job

A Job runs a task to completion: it ensures that one or more Pods of a batch task terminate successfully.

Notes:

  • spec.template has the same format as a Pod spec
  • RestartPolicy only supports Never or OnFailure
  • With a single Pod, the Job is complete by default once that Pod finishes successfully
  • .spec.completions: how many Pods must finish successfully for the Job to complete; defaults to 1
  • .spec.parallelism: how many Pods run in parallel; defaults to 1
  • .spec.activeDeadlineSeconds: the maximum time for retrying failed Pods; once exceeded, no further retries are attempted (a sketch follows this list)
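
A sketch of how these fields combine (names and values are illustrative; the pi example below only sets restartPolicy):

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-demo              # hypothetical name
spec:
  completions: 5                # done after 5 Pods finish successfully
  parallelism: 2                # run at most 2 Pods at a time
  activeDeadlineSeconds: 300    # stop retrying after 5 minutes
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo processing one work item; sleep 5"]
      restartPolicy: Never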
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
$ kubectl get job
NAME COMPLETIONS DURATION AGE
pi 0/1 10s 10s

$ kubectl get pod
NAME READY STATUS RESTARTS AGE
pi-779v9 0/1 ContainerCreating 0 14s

$ kubectl describe pod pi-779v9

6. CronJob

A CronJob manages time-based Jobs, i.e.:

  • Running a Job only once at a given point in time
  • Running a Job periodically at given points in time

Notes (a combined sketch follows this list):

  • .spec.schedule: the schedule; required field, in Cron format

  • .spec.jobTemplate: the template for the Job; same format as a Job

  • .spec.startingDeadlineSeconds: the deadline for starting a Job; optional field. If a run misses its scheduled time for any reason, the missed Job is counted as failed once this deadline has passed

  • .spec.concurrencyPolicy: the concurrency policy; optional field

    • Allow: the default; Jobs may run concurrently
    • Forbid: concurrent Jobs are forbidden; runs must execute sequentially
    • Replace: the currently running Job is replaced by the new one
  • .spec.suspend: suspend; optional field. If set to true, all subsequent executions are suspended. Defaults to false

  • .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit: history limits; optional fields. They specify how many completed and failed Jobs are kept; the defaults are 3 and 1 respectively. If a limit is set to 0, Jobs of that kind are not kept after they finish
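
For instance, the optional policy fields might be combined like this (a sketch with illustrative values; the hello example below only sets schedule):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: nightly-report            # hypothetical name
spec:
  schedule: "0 2 * * *"           # every day at 02:00
  startingDeadlineSeconds: 120    # a run missed by more than 2 minutes counts as failed
  concurrencyPolicy: Forbid       # never run two Jobs at the same time
  suspend: false
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: report
            image: busybox
            command: ["sh", "-c", "date; echo generating the report"]
          restartPolicy: OnFailure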

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
$ kubectl get cj
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
hello */1 * * * * False 0 52s 2m24s

$ kubectl get job
NAME COMPLETIONS DURATION AGE
hello-1602576000 1/1 17s 49s

$ kubectl get pod
NAME READY STATUS RESTARTS AGE
hello-1602576000-r6mgh 0/1 Completed 0 52

7. StatefulSet (stateful services)

As a controller, a StatefulSet gives each Pod a unique, stable identity and guarantees the order of deployment and scaling.

StatefulSet solves the problem of running stateful services. Its use cases include:

  • Stable persistent storage: after a Pod is rescheduled it can still reach the same persisted data; implemented with PVCs
  • Stable network identity: after a Pod is rescheduled its PodName and HostName stay the same; implemented with a Headless Service (a Service without a Cluster IP)
  • Ordered deployment and ordered scaling: Pods are ordered and are created strictly in the defined order (from 0 to N-1; before the next Pod starts, all previous Pods must be Running and Ready); implemented with init containers
  • Ordered scale-in and ordered deletion (from N-1 down to 0)
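
The stable network identity comes from the headless service: each Pod is named <statefulset-name>-<ordinal> and gets a DNS record <pod-name>.<service-name>. A quick way to check this from inside the cluster (a sketch; the pod and namespace names are assumptions based on the mysql example below):

# Pods of the StatefulSet below are named mysql-0, mysql-1, mysql-2,
# each resolvable through the headless service "mysql"
$ kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
    nslookup mysql-0.mysql.default.svc.cluster.local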
# Storage
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    path: /nfsdata
    server: 192.168.80.240

---
# MySQL configurations
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql
  labels:
    app: mysql
data:
  master.cnf: |
    # Apply this config only on the master.
    [mysqld]
    log-bin
    default-time-zone='+8:00'
    character-set-client-handshake=FALSE
    character-set-server=utf8mb4
    collation-server=utf8mb4_unicode_ci
    init_connect='SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci'
  slave.cnf: |
    # Apply this config only on slaves.
    [mysqld]
    super-read-only
    default-time-zone='+8:00'
    character-set-client-handshake=FALSE
    character-set-server=utf8mb4
    collation-server=utf8mb4_unicode_ci
    init_connect='SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci'

---
# Headless service for stable DNS entries of StatefulSet members.
apiVersion: v1
kind: Service
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  ports:
  - name: mysql
    port: 3306
  clusterIP: None
  selector:
    app: mysql

---
# Client service for connecting to any MySQL instance for reads.
# For writes, you must instead connect to the master: mysql-0.mysql.
apiVersion: v1
kind: Service
metadata:
  name: mysql-read
  labels:
    app: mysql
spec:
  ports:
  - name: mysql
    port: 3306
  selector:
    app: mysql

---
# Application
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: mysql
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
    spec:
      initContainers:
      - name: init-mysql
        image: mysql:5.7
        command:
        - bash
        - "-c"
        - |
          set -ex
          # Generate mysql server-id from pod ordinal index.
          [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
          ordinal=${BASH_REMATCH[1]}
          echo [mysqld] > /mnt/conf.d/server-id.cnf
          # Add an offset to avoid reserved server-id=0 value.
          echo server-id=$((100 + $ordinal)) >> /mnt/conf.d/server-id.cnf
          # Copy appropriate conf.d files from config-map to emptyDir.
          if [[ $ordinal -eq 0 ]]; then
            cp /mnt/config-map/master.cnf /mnt/conf.d/
          else
            cp /mnt/config-map/slave.cnf /mnt/conf.d/
          fi
        volumeMounts:
        - name: conf
          mountPath: /mnt/conf.d/
        - name: config-map
          mountPath: /mnt/config-map
      - name: clone-mysql
        image: ipunktbs/xtrabackup
        command:
        - bash
        - "-c"
        - |
          set -ex
          # Skip the clone if data already exists.
          [[ -d /var/lib/mysql/mysql ]] && exit 0
          # Skip the clone on master (ordinal index 0).
          [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
          ordinal=${BASH_REMATCH[1]}
          [[ $ordinal -eq 0 ]] && exit 0
          # Clone data from previous peer.
          ncat --recv-only mysql-$(($ordinal-1)).mysql 3307 | xbstream -x -C /var/lib/mysql
          # Prepare the backup.
          xtrabackup --prepare --target-dir=/var/lib/mysql
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
      containers:
      - name: mysql
        image: mysql:5.7
        env:
        - name: MYSQL_ALLOW_EMPTY_PASSWORD
          value: "1"
        ports:
        - name: mysql
          containerPort: 3306
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          exec:
            command: ["mysqladmin", "ping"]
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
        readinessProbe:
          exec:
            # Check we can execute queries over TCP (skip-networking is off).
            command: ["mysql", "-h", "127.0.0.1", "-e", "SELECT 1"]
          initialDelaySeconds: 5
          periodSeconds: 2
          timeoutSeconds: 1
      - name: xtrabackup
        image: ipunktbs/xtrabackup
        ports:
        - name: xtrabackup
          containerPort: 3307
        command:
        - bash
        - "-c"
        - |
          set -ex
          cd /var/lib/mysql

          # Determine binlog position of cloned data, if any.
          if [[ -f xtrabackup_slave_info && "x$(<xtrabackup_slave_info)" != "x" ]]; then
            # XtraBackup already generated a partial "CHANGE MASTER TO" query
            # because we're cloning from an existing slave. (Need to remove the tailing semicolon!)
            cat xtrabackup_slave_info | sed -E 's/;$//g' > change_master_to.sql.in
            # Ignore xtrabackup_binlog_info in this case (it's useless).
            rm -f xtrabackup_slave_info xtrabackup_binlog_info
          elif [[ -f xtrabackup_binlog_info ]]; then
            # We're cloning directly from master. Parse binlog position.
            [[ `cat xtrabackup_binlog_info` =~ ^(.*?)[[:space:]]+(.*?)$ ]] || exit 1
            rm -f xtrabackup_binlog_info xtrabackup_slave_info
            echo "CHANGE MASTER TO MASTER_LOG_FILE='${BASH_REMATCH[1]}',\
                  MASTER_LOG_POS=${BASH_REMATCH[2]}" > change_master_to.sql.in
          fi

          # Check if we need to complete a clone by starting replication.
          if [[ -f change_master_to.sql.in ]]; then
            echo "Waiting for mysqld to be ready (accepting connections)"
            until mysql -h 127.0.0.1 -e "SELECT 1"; do sleep 1; done

            echo "Initializing replication from clone position"
            mysql -h 127.0.0.1 \
                  -e "$(<change_master_to.sql.in), \
                      MASTER_HOST='mysql-0.mysql', \
                      MASTER_USER='root', \
                      MASTER_PASSWORD='', \
                      MASTER_CONNECT_RETRY=10; \
                      START SLAVE;" || exit 1
            # In case of container restart, attempt this at-most-once.
            mv change_master_to.sql.in change_master_to.sql.orig
          fi

          # Start a server to send backups when requested by peers.
          exec ncat --listen --keep-open --send-only --max-conns=1 3307 -c \
            "xtrabackup --backup --slave-info --stream=xbstream --host=127.0.0.1 --user=root"
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: conf
        emptyDir: {}
      - name: config-map
        configMap:
          name: mysql
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 5Gi

8. Horizontal Pod Autoscaling (HPA)

An application's resource usage usually has peaks and troughs. To smooth out the peaks, fill the troughs, and raise the overall resource utilization of the cluster, HPA provides horizontal autoscaling of Pods.

It autoscales by driving the replica count of a ReplicaSet or Deployment.
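
A minimal sketch of enabling HPA for the earlier deploy-nginx Deployment (this assumes the metrics-server add-on is installed so CPU metrics are available):

# Keep between 3 and 10 replicas, targeting 80% average CPU utilization
$ kubectl autoscale deployment deploy-nginx --min=3 --max=10 --cpu-percent=80

# Inspect the resulting HorizontalPodAutoscaler object
$ kubectl get hpa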