
1. Controllers

Autonomous Pods vs. controller-managed Pods:

  • Autonomous Pod: if the Pod exits, it is not recreated, because nothing (no resource controller) manages it.
  • Controller-managed Pod: throughout the controller's lifecycle, the desired number of Pod replicas is always maintained.

Kubernetes ships with many built-in controllers. Each one works like a state machine that drives Pods toward a desired state and controls their behavior.

2. ReplicaSet

Purpose: maintain the number of Pod replicas. A ReplicaSet ensures that the number of replicas of a containerized application always matches the user-defined count: if a container exits abnormally, a new Pod is created automatically to replace it, and any surplus containers are reclaimed automatically. The RS uses its selector labels to decide which Pods belong to it, so the labels in template must match the selector.

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: rs-nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template: # Pod template
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: nginx
        image: nginx
        env:
        - name: GET_HOSTS_FROM
          value: dns
        ports:
        - containerPort: 80
$ kubectl get pod --show-labels
rs-nginx-fqgmx 1/1 Running 0 107s tier=frontend
rs-nginx-r8949 1/1 Running 0 107s tier=frontend
rs-nginx-wstkr 1/1 Running 0 107s tier=frontend


$ kubectl label pod rs-nginx-fqgmx tier=backend --overwrite=true
pod/rs-nginx-fqgmx labeled

$ kubectl get pod --show-labels
NAME READY STATUS RESTARTS AGE LABELS
rs-nginx-6lqps 1/1 Running 0 93s tier=frontend
rs-nginx-fqgmx 1/1 Running 0 4m46s tier=backend # no longer managed by the ReplicaSet
rs-nginx-r8949 1/1 Running 0 4m46s tier=frontend
rs-nginx-wstkr 1/1 Running 0 4m46s tier=frontend

3. Deployment

A Deployment provides a declarative way to manage Pods and ReplicaSets, replacing the older ReplicationController (RC) for convenient application management. Typical use cases include:

  • Defining a Deployment to create Pods and a ReplicaSet
  • Rolling upgrades and rollbacks of an application (a new ReplicaSet is created; the new RS gains one Pod while the old RS loses one, step by step)
  • Scaling out and scaling in
  • Pausing and resuming a Deployment

Aside: imperative vs. declarative programming

  • Imperative programming: focuses on how to accomplish the task; you write out, step by step, the logic that produces the result.
  • Declarative programming: focuses on what you want; you describe the desired result and let the computer/engine work out how to achieve it (e.g. SQL).

Declarative (Deployment): prefer apply

Imperative (RS): prefer create
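
As a quick illustration, assuming the Deployment manifest from the next section is saved as deploy-nginx.yaml (a hypothetical file name):

# Imperative: create the object once; re-running the command fails if it already exists
$ kubectl create -f deploy-nginx.yaml

# Declarative: declare the desired state; safe to re-run after editing the file
$ kubectl apply -f deploy-nginx.yaml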

How a ReplicaSet relates to a Deployment:

(figure: the relationship between a Deployment and its ReplicaSet)

Deploying a simple Nginx application

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        app: nginx
        tier: frontend
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
$ kubectl get pod --show-labels
NAME READY STATUS RESTARTS AGE LABELS
deploy-nginx-864fc9c987-7kxx7 1/1 Running 0 25s app=nginx,pod-template-hash=864fc9c987,tier=frontend
deploy-nginx-864fc9c987-d6pxg 1/1 Running 0 25s app=nginx,pod-template-hash=864fc9c987,tier=frontend
deploy-nginx-864fc9c987-nkk4p 1/1 Running 0 25s app=nginx,pod-template-hash=864fc9c987,tier=frontend

# Scale out
$ kubectl scale deployment deploy-nginx --replicas=5

# Update the image; a new ReplicaSet is created automatically
$ kubectl set image deployment/deploy-nginx nginx=nginx:1.21.3

# Roll back
$ kubectl rollout undo deployment/deploy-nginx

# Check the rollout status
$ kubectl rollout status deployment/deploy-nginx

# View the revision history (pass --record at creation time to record a change cause)
$ kubectl rollout history deployment/deploy-nginx

# Roll back to a specific revision
$ kubectl rollout undo deployment/deploy-nginx --to-revision=2

# Pause the rollout
$ kubectl rollout pause deployment/deploy-nginx

Update strategy: rolling update by default, with both maxSurge and maxUnavailable set to 25%.

Cleaning up old revisions: spec.revisionHistoryLimit controls how many revisions (old ReplicaSets) the Deployment keeps for rollback. In apps/v1 the default is 10; if it is set to 0, the Deployment cannot be rolled back.
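
Both settings live directly in the Deployment spec. A minimal sketch (the values are illustrative, not taken from the example above):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-nginx
spec:
  replicas: 3
  revisionHistoryLimit: 10       # how many old ReplicaSets to keep for rollback
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%              # extra Pods allowed above the desired count during an update
      maxUnavailable: 25%        # Pods allowed to be unavailable during an update
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: nginx
        image: nginx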

4. DaemonSet

A DaemonSet ensures that every node that is not tainted (or whose taints are tolerated) runs one copy of a Pod. When a new node joins the cluster, a Pod is automatically added on it; when a node is removed from the cluster, that Pod is reclaimed. (A sketch with a toleration follows the basic manifest below.)

Typical scenarios for a DaemonSet:

  • Running a cluster storage daemon on every node, e.g. glusterd or ceph
  • Running a log collection daemon on every node, e.g. fluentd or logstash
  • Running a monitoring daemon on every node, e.g. Prometheus Node Exporter, collectd, the Datadog agent, the New Relic agent, or Ganglia gmond
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ds-nginx
  labels:
    app: daemonset
spec:
  selector:
    matchLabels:
      name: ds
  template:
    metadata:
      labels:
        name: ds
    spec:
      containers:
      - name: nginx
        image: nginx
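
If the daemon should also run on tainted nodes (for example the control-plane node), the Pod template can tolerate the taint. A minimal sketch, assuming the node-role.kubernetes.io/control-plane taint used by recent kubeadm clusters:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ds-nginx
spec:
  selector:
    matchLabels:
      name: ds
  template:
    metadata:
      labels:
        name: ds
    spec:
      tolerations:
      # Allow scheduling onto control-plane nodes despite their NoSchedule taint.
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nginx
        image: nginx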

5. Job

A Job runs a task to completion: it ensures that one or more Pods of a batch task terminate successfully.

Notes:

  • spec.template has the same format as a Pod spec
  • RestartPolicy only supports Never or OnFailure
  • With a single Pod, the Job is complete by default once that Pod finishes successfully
  • .spec.completions: how many Pods must finish successfully for the Job to complete; defaults to 1
  • .spec.parallelism: how many Pods run in parallel; defaults to 1
  • .spec.activeDeadlineSeconds: the maximum time for retrying failed Pods; once exceeded, no further retries are attempted (a sketch follows this list)
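
A sketch of how these fields combine (names and values are illustrative; the pi example below only sets restartPolicy):

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-demo              # hypothetical name
spec:
  completions: 5                # done after 5 Pods finish successfully
  parallelism: 2                # run at most 2 Pods at a time
  activeDeadlineSeconds: 300    # stop retrying after 5 minutes
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo processing one work item; sleep 5"]
      restartPolicy: Never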
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
$ kubectl get job
NAME COMPLETIONS DURATION AGE
pi 0/1 10s 10s

$ kubectl get pod
NAME READY STATUS RESTARTS AGE
pi-779v9 0/1 ContainerCreating 0 14s

$ kubectl describe pod pi-779v9

6. CronJob

A CronJob manages time-based Jobs, i.e.:

  • Running a Job only once at a given point in time
  • Running a Job periodically at given points in time

Notes (a combined sketch follows this list):

  • .spec.schedule: the schedule; required field, in Cron format

  • .spec.jobTemplate: the template for the Job; same format as a Job

  • .spec.startingDeadlineSeconds: the deadline for starting a Job; optional field. If a run misses its scheduled time for any reason, the missed Job is counted as failed once this deadline has passed

  • .spec.concurrencyPolicy: the concurrency policy; optional field

    • Allow: the default; Jobs may run concurrently
    • Forbid: concurrent Jobs are forbidden; runs must execute sequentially
    • Replace: the currently running Job is replaced by the new one
  • .spec.suspend: suspend; optional field. If set to true, all subsequent executions are suspended. Defaults to false

  • .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit: history limits; optional fields. They specify how many completed and failed Jobs are kept; the defaults are 3 and 1 respectively. If a limit is set to 0, Jobs of that kind are not kept after they finish
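
For instance, the optional policy fields might be combined like this (a sketch with illustrative values; the hello example below only sets schedule):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: nightly-report            # hypothetical name
spec:
  schedule: "0 2 * * *"           # every day at 02:00
  startingDeadlineSeconds: 120    # a run missed by more than 2 minutes counts as failed
  concurrencyPolicy: Forbid       # never run two Jobs at the same time
  suspend: false
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: report
            image: busybox
            command: ["sh", "-c", "date; echo generating the report"]
          restartPolicy: OnFailure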

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
$ kubectl get cj
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
hello */1 * * * * False 0 52s 2m24s

$ kubectl get job
NAME COMPLETIONS DURATION AGE
hello-1602576000 1/1 17s 49s

$ kubectl get pod
NAME READY STATUS RESTARTS AGE
hello-1602576000-r6mgh 0/1 Completed 0 52

7. StatefulSet (stateful services)

As a controller, a StatefulSet gives each Pod a unique, stable identity and guarantees the order of deployment and scaling.

StatefulSet solves the problem of running stateful services. Its use cases include:

  • Stable persistent storage: after a Pod is rescheduled it can still reach the same persisted data; implemented with PVCs
  • Stable network identity: after a Pod is rescheduled its PodName and HostName stay the same; implemented with a Headless Service (a Service without a Cluster IP)
  • Ordered deployment and ordered scaling: Pods are ordered and are created strictly in the defined order (from 0 to N-1; before the next Pod starts, all previous Pods must be Running and Ready); implemented with init containers
  • Ordered scale-in and ordered deletion (from N-1 down to 0)
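
The stable network identity comes from the headless service: each Pod is named <statefulset-name>-<ordinal> and gets a DNS record <pod-name>.<service-name>. A quick way to check this from inside the cluster (a sketch; the pod and namespace names are assumptions based on the mysql example below):

# Pods of the StatefulSet below are named mysql-0, mysql-1, mysql-2,
# each resolvable through the headless service "mysql"
$ kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
    nslookup mysql-0.mysql.default.svc.cluster.local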
# Storage
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    path: /nfsdata
    server: 192.168.80.240

---
# MySQL configurations
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql
  labels:
    app: mysql
data:
  master.cnf: |
    # Apply this config only on the master.
    [mysqld]
    log-bin
    default-time-zone='+8:00'
    character-set-client-handshake=FALSE
    character-set-server=utf8mb4
    collation-server=utf8mb4_unicode_ci
    init_connect='SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci'
  slave.cnf: |
    # Apply this config only on slaves.
    [mysqld]
    super-read-only
    default-time-zone='+8:00'
    character-set-client-handshake=FALSE
    character-set-server=utf8mb4
    collation-server=utf8mb4_unicode_ci
    init_connect='SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci'

---
# Headless service for stable DNS entries of StatefulSet members.
apiVersion: v1
kind: Service
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  ports:
  - name: mysql
    port: 3306
  clusterIP: None
  selector:
    app: mysql

---
# Client service for connecting to any MySQL instance for reads.
# For writes, you must instead connect to the master: mysql-0.mysql.
apiVersion: v1
kind: Service
metadata:
  name: mysql-read
  labels:
    app: mysql
spec:
  ports:
  - name: mysql
    port: 3306
  selector:
    app: mysql

---
# Application
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: mysql
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
    spec:
      initContainers:
      - name: init-mysql
        image: mysql:5.7
        command:
        - bash
        - "-c"
        - |
          set -ex
          # Generate mysql server-id from pod ordinal index.
          [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
          ordinal=${BASH_REMATCH[1]}
          echo [mysqld] > /mnt/conf.d/server-id.cnf
          # Add an offset to avoid reserved server-id=0 value.
          echo server-id=$((100 + $ordinal)) >> /mnt/conf.d/server-id.cnf
          # Copy appropriate conf.d files from config-map to emptyDir.
          if [[ $ordinal -eq 0 ]]; then
            cp /mnt/config-map/master.cnf /mnt/conf.d/
          else
            cp /mnt/config-map/slave.cnf /mnt/conf.d/
          fi
        volumeMounts:
        - name: conf
          mountPath: /mnt/conf.d/
        - name: config-map
          mountPath: /mnt/config-map
      - name: clone-mysql
        image: ipunktbs/xtrabackup
        command:
        - bash
        - "-c"
        - |
          set -ex
          # Skip the clone if data already exists.
          [[ -d /var/lib/mysql/mysql ]] && exit 0
          # Skip the clone on master (ordinal index 0).
          [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
          ordinal=${BASH_REMATCH[1]}
          [[ $ordinal -eq 0 ]] && exit 0
          # Clone data from previous peer.
          ncat --recv-only mysql-$(($ordinal-1)).mysql 3307 | xbstream -x -C /var/lib/mysql
          # Prepare the backup.
          xtrabackup --prepare --target-dir=/var/lib/mysql
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
      containers:
      - name: mysql
        image: mysql:5.7
        env:
        - name: MYSQL_ALLOW_EMPTY_PASSWORD
          value: "1"
        ports:
        - name: mysql
          containerPort: 3306
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          exec:
            command: ["mysqladmin", "ping"]
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
        readinessProbe:
          exec:
            # Check we can execute queries over TCP (skip-networking is off).
            command: ["mysql", "-h", "127.0.0.1", "-e", "SELECT 1"]
          initialDelaySeconds: 5
          periodSeconds: 2
          timeoutSeconds: 1
      - name: xtrabackup
        image: ipunktbs/xtrabackup
        ports:
        - name: xtrabackup
          containerPort: 3307
        command:
        - bash
        - "-c"
        - |
          set -ex
          cd /var/lib/mysql

          # Determine binlog position of cloned data, if any.
          if [[ -f xtrabackup_slave_info && "x$(<xtrabackup_slave_info)" != "x" ]]; then
            # XtraBackup already generated a partial "CHANGE MASTER TO" query
            # because we're cloning from an existing slave. (Need to remove the tailing semicolon!)
            cat xtrabackup_slave_info | sed -E 's/;$//g' > change_master_to.sql.in
            # Ignore xtrabackup_binlog_info in this case (it's useless).
            rm -f xtrabackup_slave_info xtrabackup_binlog_info
          elif [[ -f xtrabackup_binlog_info ]]; then
            # We're cloning directly from master. Parse binlog position.
            [[ `cat xtrabackup_binlog_info` =~ ^(.*?)[[:space:]]+(.*?)$ ]] || exit 1
            rm -f xtrabackup_binlog_info xtrabackup_slave_info
            echo "CHANGE MASTER TO MASTER_LOG_FILE='${BASH_REMATCH[1]}',\
                  MASTER_LOG_POS=${BASH_REMATCH[2]}" > change_master_to.sql.in
          fi

          # Check if we need to complete a clone by starting replication.
          if [[ -f change_master_to.sql.in ]]; then
            echo "Waiting for mysqld to be ready (accepting connections)"
            until mysql -h 127.0.0.1 -e "SELECT 1"; do sleep 1; done

            echo "Initializing replication from clone position"
            mysql -h 127.0.0.1 \
                  -e "$(<change_master_to.sql.in), \
                      MASTER_HOST='mysql-0.mysql', \
                      MASTER_USER='root', \
                      MASTER_PASSWORD='', \
                      MASTER_CONNECT_RETRY=10; \
                      START SLAVE;" || exit 1
            # In case of container restart, attempt this at-most-once.
            mv change_master_to.sql.in change_master_to.sql.orig
          fi

          # Start a server to send backups when requested by peers.
          exec ncat --listen --keep-open --send-only --max-conns=1 3307 -c \
            "xtrabackup --backup --slave-info --stream=xbstream --host=127.0.0.1 --user=root"
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: conf
        emptyDir: {}
      - name: config-map
        configMap:
          name: mysql
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 5Gi

8. Horizontal Pod Autoscaling (HPA)

An application's resource usage usually has peaks and troughs. To smooth out the peaks, fill the troughs, and raise the overall resource utilization of the cluster, HPA provides horizontal autoscaling of Pods.

It autoscales by driving the replica count of a ReplicaSet or Deployment.
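
A minimal sketch of enabling HPA for the earlier deploy-nginx Deployment (this assumes the metrics-server add-on is installed so CPU metrics are available):

# Keep between 3 and 10 replicas, targeting 80% average CPU utilization
$ kubectl autoscale deployment deploy-nginx --min=3 --max=10 --cpu-percent=80

# Inspect the resulting HorizontalPodAutoscaler object
$ kubectl get hpa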