Eli's Blog

1. API-Server

1.1 How It Works

Core function: the entry point for all resource operations

  • Exposes the REST API for cluster management, covering authentication, authorization, admission control, data validation, and cluster state changes
  • Acts as the hub for data exchange and communication between the other components (they query or modify data through the API Server; only the API Server operates on etcd directly)

img

1.2 Access

1.2.1 Ports

vi /etc/kubernetes/kube-apiserver.conf
--insecure-bind-address=127.0.0.1 \
--insecure-port=8080 \
--bind-address=192.168.80.11 \
--secure-port=6443 \

curl http://localhost:8080
curl https://192.168.80.11:6443

1.2.2 Access Methods

# 1. SDK (client-go)
go get k8s.io/client-go@latest

# 2. kubectl
kubectl get --raw /api/v1/namespaces | python -m json.tool
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods

# 3. kubectl proxy
kubectl proxy --port=8081 &
curl http://localhost:8081/api/
{
"kind": "APIVersions",
"versions": [
"v1"
],
"serverAddressByClientCIDRs": [
{
"clientCIDR": "0.0.0.0/0",
"serverAddress": "192.168.80.45:6443"
}
]
}

# 4. curl
TOKEN=$(kubectl describe secrets $(kubectl get secrets -n kube-system |grep admin |cut -f1 -d ' ') -n kube-system |grep -E '^token' |cut -f2 -d':'|tr -d '\t'|tr -d ' ')
APISERVER=$(kubectl config view |grep server|cut -f 2- -d ":" | tr -d " ")
curl -H "Authorization: Bearer $TOKEN" $APISERVER/api --insecure
{
"kind": "APIVersions",
"versions": [
"v1"
],
"serverAddressByClientCIDRs": [
{
"clientCIDR": "0.0.0.0/0",
"serverAddress": "192.168.80.45:6443"
}
]
}

1.3 API Resources

# All supported resources
$ kubectl api-resources
NAME SHORTNAMES APIGROUP NAMESPACED KIND
bindings true Binding
componentstatuses cs false ComponentStatus
configmaps cm true ConfigMap
endpoints ep true Endpoints
events ev true Event
limitranges limits true LimitRange
namespaces ns false Namespace
nodes no false Node
persistentvolumeclaims pvc true PersistentVolumeClaim
persistentvolumes pv false PersistentVolume
pods po true Pod
podtemplates true PodTemplate
replicationcontrollers rc true ReplicationController
resourcequotas quota true ResourceQuota
secrets true Secret
serviceaccounts sa true ServiceAccount
services svc true Service
mutatingwebhookconfigurations admissionregistration.k8s.io false MutatingWebhookConfiguration
validatingwebhookconfigurations admissionregistration.k8s.io false ValidatingWebhookConfiguration
customresourcedefinitions crd,crds apiextensions.k8s.io false CustomResourceDefinition
apiservices apiregistration.k8s.io false APIService
controllerrevisions apps true ControllerRevision
daemonsets ds apps true DaemonSet
deployments deploy apps true Deployment
replicasets rs apps true ReplicaSet
statefulsets sts apps true StatefulSet
tokenreviews authentication.k8s.io false TokenReview
localsubjectaccessreviews authorization.k8s.io true LocalSubjectAccessReview
selfsubjectaccessreviews authorization.k8s.io false SelfSubjectAccessReview
selfsubjectrulesreviews authorization.k8s.io false SelfSubjectRulesReview
subjectaccessreviews authorization.k8s.io false SubjectAccessReview
horizontalpodautoscalers hpa autoscaling true HorizontalPodAutoscaler
cronjobs cj batch true CronJob
jobs batch true Job
certificatesigningrequests csr certificates.k8s.io false CertificateSigningRequest
leases coordination.k8s.io true Lease
endpointslices discovery.k8s.io true EndpointSlice
events ev events.k8s.io true Event
ingresses ing extensions true Ingress
ingressclasses networking.k8s.io false IngressClass
ingresses ing networking.k8s.io true Ingress
networkpolicies netpol networking.k8s.io true NetworkPolicy
runtimeclasses node.k8s.io false RuntimeClass
poddisruptionbudgets pdb policy true PodDisruptionBudget
podsecuritypolicies psp policy false PodSecurityPolicy
clusterrolebindings rbac.authorization.k8s.io false ClusterRoleBinding
clusterroles rbac.authorization.k8s.io false ClusterRole
rolebindings rbac.authorization.k8s.io true RoleBinding
roles rbac.authorization.k8s.io true Role
priorityclasses pc scheduling.k8s.io false PriorityClass
csidrivers storage.k8s.io false CSIDriver
csinodes storage.k8s.io false CSINode
storageclasses sc storage.k8s.io false StorageClass
volumeattachments storage.k8s.io false VolumeAttachment


# Resources in the apps API group
$ kubectl api-resources --api-group apps
NAME SHORTNAMES APIGROUP NAMESPACED KIND
controllerrevisions apps true ControllerRevision
daemonsets ds apps true DaemonSet
deployments deploy apps true Deployment
replicasets rs apps true ReplicaSet
statefulsets sts apps true StatefulSet

# Detailed explanation of a resource
$ kubectl explain svc
KIND: Service
VERSION: v1

DESCRIPTION:
Service is a named abstraction of software service (for example, mysql)
consisting of local port (for example 3306) that the proxy listens on, and
the selector that determines which pods will answer requests sent through
the proxy.

FIELDS:
apiVersion <string>
APIVersion defines the versioned schema of this representation of an
object. Servers should convert recognized schemas to the latest internal
value, and may reject unrecognized values. More info:
https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources

kind <string>
Kind is a string value representing the REST resource this object
represents. Servers may infer this from the endpoint the client submits
requests to. Cannot be updated. In CamelCase. More info:
https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds

metadata <Object>
Standard object's metadata. More info:
https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata

spec <Object>
Spec defines the behavior of a service.
https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status

status <Object>
Most recently observed status of the service. Populated by the system.
Read-only. More info:
https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status

# API versions supported by the cluster
$ kubectl api-versions
admissionregistration.k8s.io/v1
admissionregistration.k8s.io/v1beta1
apiextensions.k8s.io/v1
apiextensions.k8s.io/v1beta1
apiregistration.k8s.io/v1
apiregistration.k8s.io/v1beta1
apps/v1
authentication.k8s.io/v1
authentication.k8s.io/v1beta1
authorization.k8s.io/v1
authorization.k8s.io/v1beta1
autoscaling/v1
autoscaling/v2beta1
autoscaling/v2beta2
batch/v1
batch/v1beta1
certificates.k8s.io/v1
certificates.k8s.io/v1beta1
coordination.k8s.io/v1
coordination.k8s.io/v1beta1
discovery.k8s.io/v1beta1
events.k8s.io/v1
events.k8s.io/v1beta1
extensions/v1beta1
networking.k8s.io/v1
networking.k8s.io/v1beta1
node.k8s.io/v1beta1
policy/v1beta1
rbac.authorization.k8s.io/v1
rbac.authorization.k8s.io/v1beta1
scheduling.k8s.io/v1
scheduling.k8s.io/v1beta1
storage.k8s.io/v1
storage.k8s.io/v1beta1
v1

1.4 Example

https://v1-19.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#read-pod-v1-core

cat > pod.yml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pod-example
spec:
  containers:
  - name: alpine
    image: alpine:latest
    command: ["echo"]
    args: ["Hello World"]
EOF

kubectl apply -f pod.yml
kubectl get pod

curl http://localhost:8080/api/v1/namespaces/default/pods/pod-example
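If the insecure port 8080 is disabled, the same object can be read through the authenticated channels from 1.2.2; a sketch assuming the kubectl proxy is listening on port 8081 and TOKEN/APISERVER are set as shown there:

# Via kubectl proxy (no manual token handling)
kubectl proxy --port=8081 &
curl http://localhost:8081/api/v1/namespaces/default/pods/pod-example

# Via the secure port with a bearer token
curl -H "Authorization: Bearer $TOKEN" --insecure \
  $APISERVER/api/v1/namespaces/default/pods/pod-example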

2. Controller-Manager

The Controller Manager consists of kube-controller-manager and cloud-controller-manager. It is the brain of Kubernetes: through the API Server it watches the state of the whole cluster and keeps the cluster in the desired working state.

img

As the management and control center inside the cluster, the Controller Manager manages Nodes, Pod replicas, service endpoints (Endpoints), Namespaces, ServiceAccounts, and resource quotas (ResourceQuota). When a Node goes down unexpectedly, the Controller Manager notices promptly and runs the automated repair flow, keeping the cluster in the desired working state.

2.1 Controller Types

kube-controller-manager:

  • Replication Controller
  • Node Controller
  • CronJob Controller
  • Daemon Controller
  • Deployment Controller
  • Endpoint Controller
  • Garbage Collector
  • Namespace Controller
  • Job Controller
  • Pod AutoScaler
  • ReplicaSet Controller
  • Service Controller
  • ServiceAccount Controller
  • StatefulSet Controller
  • Volume Controller
  • Resource quota Controller

cloud-controller-manager: only needed when Kubernetes runs with a Cloud Provider enabled; it cooperates with the cloud provider's control plane

  • Node Controller
  • Route Controller
  • Service Controller

2.2 Replication Controller (RC)

The Replication Controller, RC for short, keeps the number of Pod replicas associated with an RC at the configured value at all times.

  • An RC only manages Pods whose restart policy is RestartPolicy=Always (create, destroy, restart, and so on)
  • The Pod template in an RC is only used when creating Pods; once a Pod exists, changes to the template do not affect it
  • A Pod can be detached from RC management by changing its labels; this is useful for migrating a Pod out of the cluster or for data-repair debugging
  • Deleting an RC does not affect the Pods it created; to delete the Pods, set the RC's replica count to 0
  • Do not create Pods bypassing the RC: the RC automates Pod management and improves fault tolerance

2.2.1 RC Responsibilities

  • Maintain the number of Pod replicas in the cluster
  • Scale the system up or down by adjusting spec.replicas in the RC
  • Perform rolling upgrades by changing the Pod template in the RC

2.2.2 Liveness Probes

Kubernetes has three mechanisms for probing containers:

  • HTTP GET probe: performs an HTTP GET request against http://ip:port/path of the container
    • Success: a response is received and its status code does not indicate an error (2xx, 3xx)
    • Failure: no response is received, or the status code indicates an error
  • TCP socket probe: tries to open a TCP connection to the specified container port. If the connection is established the probe succeeds; otherwise the container is restarted.
  • Exec probe: executes an arbitrary command inside the container and checks its exit code. Exit code 0 means success; any other code counts as failure.
spec:
  containers:
  - name: nginx
    image: nginx:latest
    # an HTTP GET based liveness probe
    livenessProbe:
      # first check runs 15 seconds after the container starts
      initialDelaySeconds: 15
      httpGet:
        port: 8080
        path: /
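The TCP-socket and exec mechanisms from the list above look similar; a minimal sketch (the pod name and probe values are invented for illustration, and redis-cli ping is simply a convenient in-container health command):

cat > probe-demo.yml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: redis
    image: redis:6.2.3
    livenessProbe:
      # TCP socket probe: succeeds if the port accepts a connection
      tcpSocket:
        port: 6379
      initialDelaySeconds: 15
    readinessProbe:
      # Exec probe: succeeds if the command exits with status 0
      exec:
        command: ["redis-cli", "ping"]
      initialDelaySeconds: 5
EOF

kubectl apply -f probe-demo.yml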

2.3 ReplicaSet (RS)

RS is the successor to RC; it is managed through Deployments and is more powerful than RC.

2.4 Node Controller

When the kubelet starts, it registers its node information through the API Server and then reports status periodically; the API Server stores the information in etcd.

When the Controller Manager starts with --cluster-cidr set, it generates a CIDR for every Node whose Spec.PodCIDR is empty and sets the node's Spec.PodCIDR to it, preventing CIDR conflicts between nodes.

img

2.4.1 Node Eviction

After a node becomes unhealthy, the Node controller evicts it at the default rate (--node-eviction-rate=0.1, i.e. one node every 10 seconds). The Node controller groups nodes by Zone and adjusts the rate according to the Zone's state (the relevant kube-controller-manager flags are sketched after this list):

  • Normal: all nodes are Ready; evict at the default rate.

  • PartialDisruption: more than 33% of the nodes are NotReady. Once the unhealthy fraction exceeds --unhealthy-zone-threshold=0.55, the rate is reduced:

    • small clusters (fewer than --large-cluster-size-threshold=50 nodes): eviction stops
    • large clusters: the rate drops to --secondary-node-eviction-rate=0.01
  • FullDisruption: all nodes are NotReady; fall back to the default eviction rate. If every Zone is in FullDisruption, eviction stops.
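These rates and thresholds are kube-controller-manager flags; a sketch of how they might appear in its startup arguments (the values shown are the defaults quoted above, the trailing dots stand for the rest of the arguments):

kube-controller-manager \
  --node-eviction-rate=0.1 \
  --secondary-node-eviction-rate=0.01 \
  --unhealthy-zone-threshold=0.55 \
  --large-cluster-size-threshold=50 \
  ...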

2.5 ResourceQuota Controller

Resource quota management makes sure that the specified resource objects never take more than their allotted share of the system's physical resources at any time.

Three levels of resource quota management are supported:

  • Container level: limits CPU and Memory
  • Pod level: limits the resources available to all containers in a Pod
  • Namespace level:
    • number of Pods
    • number of RSs
    • number of Services
    • number of ResourceQuotas
    • number of Secrets
    • number of PVs (Persistent Volumes) that can be held

Notes:

  1. Quota management is enforced through Admission Control
  2. Admission Control provides two kinds of quota constraints (see the sketch below):
    • LimitRanger: applies to Pods and Containers
    • ResourceQuota: applies to a Namespace and caps the total usage of each kind of resource in that Namespace
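A minimal sketch of the two constraint types named above (object names and limits are invented for illustration):

cat > quota-demo.yml <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-demo
  namespace: default
spec:
  hard:
    pods: "10"
    services: "5"
    secrets: "20"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: limits-demo
  namespace: default
spec:
  limits:
  - type: Container
    default:          # default limits injected into containers that set none
      cpu: 200m
      memory: 256Mi
    defaultRequest:   # default requests injected into containers that set none
      cpu: 100m
      memory: 128Mi
EOF

kubectl apply -f quota-demo.yml
kubectl describe quota quota-demo
kubectl describe limits limits-demo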

ResourceQuota Controller workflow:

img

2.6 Namespace Controller

Users create new Namespaces through the API Server, and they are stored in etcd; the Namespace Controller periodically reads this Namespace information through the API Server.

If the API marks a Namespace for graceful deletion (i.e. sets a deletion deadline, DeletionTimestamp), the Namespace's status is set to "Terminating" and saved to etcd, and the Namespace Controller deletes the ServiceAccounts, RSs, Pods, and other resource objects in that Namespace.

2.7 Endpoint Controller

The relationship between Service, Endpoints, and Pod:

img

An Endpoints object holds the addresses of all Pod replicas behind a Service. The Endpoints Controller is the controller that generates and maintains all Endpoints objects; it watches Services and their Pod replicas:

  • When a Service is deleted, delete the Endpoints object with the same name
  • When a Service is created or modified, build or update the corresponding Endpoints object from the Pod list selected by that Service
  • On Pod events, update the Endpoints objects of the Services that Pod belongs to

kube-proxy fetches the Endpoints of every Service and uses them to implement Service load balancing; the commands below show the relationship.
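A quick check on a live cluster (the kubernetes Service in the default namespace always exists, so it is a safe target):

# The Service and its Endpoints object share the same name
kubectl get svc kubernetes
kubectl get endpoints kubernetes
kubectl describe svc kubernetes | grep -i endpoints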

2.8 Service Controller

The Service Controller is the interface controller between the Kubernetes cluster and the external cloud platform. It watches for Service changes and, for Services of type LoadBalancer, makes sure the corresponding load-balancer instance on the cloud platform is created, deleted, and its forwarding rules kept up to date.

3. Scheduler

The Scheduler is responsible for Pod scheduling and links the upstream and downstream parts of the system:

Upstream: it accepts the new Pods created by the Controller Manager and picks a suitable Node for each of them

Downstream: the kubelet on that Node then takes over the Pod's lifecycle

The Scheduler as the cluster's dispatcher:

1) Using the scheduling algorithm, it picks a suitable Node, binds the pending Pod to that Node, and writes the binding into etcd

2) The kubelet watches the Pod binding produced by the Scheduler through the API Server, fetches the corresponding Pod manifest, pulls the image, and starts the containers

img

3.1 Scheduling Workflow

  • Predicate phase: iterate over all candidate Nodes and filter out those that meet the requirements. Kubernetes ships a number of built-in predicate policies (Predicates) to choose from
  • Priority phase: score every remaining candidate node with the priority policies (Priorities) and pick the one with the highest score

The scheduling workflow is implemented through pluggable algorithm providers (Algorithm Provider); an algorithm provider is a structure that bundles a set of predicates with a set of priorities.

3.2 Predicates

CheckNodeCondition: check that the node itself is healthy (IP, disk, etc.)
GeneralPredicates
HostName: check whether the Pod defines pod.spec.hostname
PodFitsHostPorts: the host ports requested by the pod (pods.spec.containers.ports.hostPort) must be free on the node
MatchNodeSelector: check the node labels against pods.spec.nodeSelector
PodFitsResources: check whether the node can satisfy the Pod's resource requests
NoDiskConflict: check whether the volumes the Pod depends on can be satisfied (not used by default)
PodToleratesNodeTaints: check that the Pod's spec.tolerations cover all of the node's taints
PodToleratesNodeNoExecuteTaints: the same check for NoExecute taints (not used by default)
CheckNodeLabelPresence: check whether the specified labels exist on the node
CheckServiceAffinity: try to place pods of the same Service together (not used by default)
MaxEBSVolumeCount: check the maximum number of EBS (AWS) volumes
MaxGCEPDVolumeCount: check the maximum number of GCE PD volumes
MaxAzureDiskVolumeCount: check the maximum number of AzureDisk volumes
CheckVolumeBinding: check bound and unbound PVCs on the node
NoVolumeZoneConflict: check for zone conflicts between the volumes and the pod
CheckNodeMemoryPressure: check whether the node is under memory pressure
CheckNodePIDPressure: check whether the node has too many PIDs in use
CheckNodeDiskPressure: check whether the node is under disk / disk-IO pressure
MatchInterPodAffinity: check whether the node satisfies the pod's affinity and anti-affinity rules

3.3 Priorities

LeastRequested: the more free capacity, the higher the score
(cpu((capacity-sum(requested))*10/capacity) + memory((capacity-sum(requested))*10/capacity)) / 2

BalancedResourceAllocation: nodes whose CPU and memory utilisation are closest to each other win
NodePreferAvoidPods: based on the node annotation "scheduler.alpha.kubernetes.io/preferAvoidPods"
TaintToleration: match the Pod's spec.tolerations against the node's taints; the more matching entries, the lower the score

SelectorSpreading: label-selector spreading; nodes that already run more pods selected by the same selector score lower
InterPodAffinity: iterate over the pod's affinity terms; the more terms matched, the higher the score
NodeAffinity: node affinity
MostRequested: the less free capacity, the higher the score; the opposite of LeastRequested (not enabled by default)
NodeLabel: whether the node has the corresponding labels (not enabled by default)
ImageLocality: based on the total size of the images needed by the Pod that are already present on the node (not enabled by default)

3.4 Advanced Scheduling

3.4.1 nodeSelector

# 1. Create a redis Deployment
cat > redis-deploy.yml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  selector:
    matchLabels:
      app: redis
  replicas: 2
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:6.2.3
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 6379
      nodeSelector:
        disk: ssd # require nodes with this disk type
EOF

kubectl apply -f redis-deploy.yml

# Check the pod status
kubectl get pod
NAME READY STATUS RESTARTS AGE
redis-9fc84569-2jlxh 0/1 Pending 0 60s
redis-9fc84569-q78jd 0/1 Pending 0 60s

kubectl describe pod redis-9fc84569-2jlxh
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 20s (x3 over 76s) default-scheduler 0/4 nodes are available: 4 node(s) didn't match node selector.

# Add the label disk=ssd to k8s-node2
kubectl label node k8s-node2 disk=ssd

kubectl get nodes --show-labels | grep disk=ssd
k8s-node2 Ready node 4d21h v1.19.11 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disk=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node2,kubernetes.io/os=linux,node-role.kubernetes.io/node=

# Check again that the pods were scheduled
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redis-9fc84569-2jlxh 1/1 Running 0 5m20s 10.244.1.8 k8s-node2 <none> <none>
redis-9fc84569-q78jd 1/1 Running 0 5m20s 10.244.1.7 k8s-node2 <none> <none>

3.4.2 Affinity

1. Soft affinity

preferredDuringSchedulingIgnoredDuringExecution

Soft affinity: prefer nodes that match more of the conditions, but the Pod is still created even if none of them match

cat > preferred-affinity-pod.yml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: preferred-affinity-pod
  labels:
    app: my-pod
spec:
  containers:
  - name: preferred-affinity-pod
    image: nginx
    ports:
    - name: http
      containerPort: 80
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: apps # label key
            operator: In
            values:
            - mysql # apps=mysql
            - redis # apps=redis
        weight: 60 # weight of this nodeSelectorTerm, 1-100
EOF

kubectl apply -f preferred-affinity-pod.yml

# Created successfully even though no node matches
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
preferred-affinity-pod 1/1 Running 0 61s 10.244.2.18 k8s-node1 <none> <none>

2. Hard affinity

requiredDuringSchedulingIgnoredDuringExecution

Hard affinity: at least one of the match conditions must be satisfied, otherwise the Pod is not scheduled

cat > required-affinity-pod.yml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: required-affinity-pod
  labels:
    app: my-pod
spec:
  containers:
  - name: required-affinity-pod
    image: nginx
    ports:
    - name: http
      containerPort: 80
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: apps # label key
            operator: In
            values:
            - mysql # apps=mysql
            - redis # apps=redis
EOF

kubectl apply -f required-affinity-pod.yml

# Cannot be scheduled while no node matches
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
required-affinity-pod 0/1 Pending 0 18s <none> <none> <none> <none>

# Label k8s-node1
kubectl label node k8s-node1 apps=mysql

# Now the pod is scheduled
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
required-affinity-pod 1/1 Running 0 2m31s 10.244.2.19 k8s-node1 <none> <none>

3.4.3 Anti-Affinity

cat > anti-affinity.yml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: myapp1
  labels:
    app: myapp1
spec:
  containers:
  - name: myapp1
    image: nginx
    ports:
    - name: http
      containerPort: 80
---
apiVersion: v1
kind: Pod
metadata:
  name: myapp2
  labels:
    app: myapp2
spec:
  containers:
  - name: myapp2
    image: nginx
    ports:
    - name: http
      containerPort: 80
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - myapp1 # app=myapp1
        topologyKey: kubernetes.io/hostname # nodes with the same kubernetes.io/hostname value count as the same location
EOF

kubectl apply -f anti-affinity.yml

# The two pods land on different nodes
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp1 1/1 Running 0 43s 10.244.2.20 k8s-node1 <none> <none>
myapp2 1/1 Running 0 43s 10.244.1.9 k8s-node2 <none> <none>

3.5 Taints and Tolerations

The effect of a taint defines how it repels Pods:

  • NoSchedule: only affects scheduling; existing Pods are not affected, i.e. not evicted
  • NoExecute: affects both scheduling and existing Pods; existing Pods are evicted
  • PreferNoSchedule: avoid placing Pods here if possible, but allow it when no other node fits

3.5.1 Managing Taints

kubectl describe node k8s-master1 | grep Taints
Taints: node-role.kubernetes.io/master:NoSchedule

kubectl describe node k8s-node1 | grep Taints
Taints: <none>

# Add a taint
kubectl taint node k8s-node1 node-role.kubernetes.io/node=:NoSchedule

kubectl describe node k8s-node1 | grep Taints
Taints: node-role.kubernetes.io/node:NoSchedule

# Remove the taint
kubectl taint node k8s-node1 node-role.kubernetes.io/node-

3.5.2 Tolerations

# Add the node-type taint to all worker nodes
kubectl taint node k8s-node1 node-type=:NoSchedule
kubectl taint node k8s-node2 node-type=:NoSchedule

cat > toleration-pod.yml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: toleration-pod
  labels:
    app: toleration-pod
spec:
  containers:
  - name: toleration-pod
    image: nginx
    ports:
    - name: http
      containerPort: 80
  tolerations:
  - key: "node-type"          # taint key
    operator: "Equal"         # Exists/Equal
    value: "PreferNoSchedule" # taint value
    effect: "NoSchedule"
    #tolerationSeconds: 3600  # how long to keep tolerating before eviction; only meaningful with effect NoExecute
EOF

kubectl apply -f toleration-pod.yml

# Cannot be scheduled
kubectl get pod
NAME READY STATUS RESTARTS AGE
toleration-pod 0/1 Pending 0 5s

kubectl describe pod toleration-pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 26s (x2 over 26s) default-scheduler 0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node-type: }, that the pod didn't tolerate.

# Add a PreferNoSchedule taint to k8s-node1 and remove its NoSchedule taint
kubectl taint node k8s-node1 node-type=:PreferNoSchedule
kubectl taint node k8s-node1 node-type=:NoSchedule-

# Scheduled successfully
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
toleration-pod 1/1 Running 0 11m 10.244.2.21 k8s-node1 <none> <none>

4. Kubelet

Every Node runs a kubelet service process, which listens on port 10250 by default, receives and executes the instructions sent by the Master, and manages the Pods and their containers on that node. Each kubelet registers its node's information with the API Server, periodically reports the node's resource usage to the Master, and monitors node and container resources through cAdvisor. Think of the kubelet as the agent in a server-agent architecture: the Pod housekeeper on each Node.

4.1 Node Management

Node management mainly consists of node self-registration and node status updates:

  • The startup flag --register-node determines whether the kubelet registers itself with the API Server
  • If self-registration is not used, the user has to configure the Node resource information manually and configure the API Server address
  • At startup the kubelet registers its node information through the API Server and keeps sending node updates; when the API Server receives them it writes the information to etcd

Main parameters (sketched below):

  • --kubeconfig: path to the kubeconfig file, commonly used to point at the certificates
  • --hostname-override: the hostname this node shows in the cluster
  • --node-status-update-frequency: how often the kubelet reports its heartbeat to the API Server, 10s by default
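A sketch of how these flags might appear on the kubelet command line (the kubeconfig path is a placeholder, not taken from this cluster; the trailing dots stand for the remaining arguments):

kubelet \
  --kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
  --hostname-override=k8s-node1 \
  --node-status-update-frequency=10s \
  ...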

4.2 Pod Management

4.2.1 Fetching the Pod Manifest

The kubelet uses the API Server client with a watch-plus-list pattern to monitor the "/registry/csinodes" and "/registry/pods" directories and syncs what it receives into its local cache.

etcdctl get /registry/csinodes --prefix --keys-only
/registry/csinodes/k8s-master1
/registry/csinodes/k8s-master2
/registry/csinodes/k8s-node1
/registry/csinodes/k8s-node2

# Details for a single node
etcdctl get /registry/csinodes/k8s-master1
/registry/csinodes/k8s-master1
k8s

storage.k8s.io/v1CSINode⚌


k8s-master1"*$cc45d1d2-dde9-4740-ac49-878a971ef0852⚌⚌⚌j=
Node
k8s-master1"$5e851f78-5bfc-4acb-b5c3-4a10cf353237*v1z⚌⚌
kubeletUpdatestorage.k8s.io/v⚌⚌⚌FieldsV1:⚌
⚌{"f:metadata":{"f:ownerReferences":{".":{},"k:{\"uid\":\"5e851f78-5bfc-4acb-b5c3-4a10cf353237\"}":{".":{},"f:apiVersion":{},"f:kind":{},"f:name":{},"f:uid":{}}}}}"

# Pod keys
etcdctl get /registry/pods --prefix --keys-only
/registry/pods/kube-system/coredns-867bfd96bd-264bb
/registry/pods/kube-system/kube-flannel-ds-48kz2
/registry/pods/kube-system/kube-flannel-ds-bsfpp
/registry/pods/kube-system/kube-flannel-ds-h5shb
/registry/pods/kube-system/kube-flannel-ds-qpvlt
/registry/pods/kubernetes-dashboard/dashboard-metrics-scraper-79c5968bdc-62hlk
/registry/pods/kubernetes-dashboard/kubernetes-dashboard-9f9799597-b8hfr

Every operation on a Pod is seen by the kubelet, which creates, modifies, or deletes Pods on its node according to the instructions it observes.

4.2.2 Pod Creation Flow

img

When the kubelet reads information about a Pod being created or modified, it does the following:

  • Create a data directory for the Pod
  • Fetch the Pod manifest from the API Server
  • Mount the external volumes for the Pod
  • Download the Secrets the Pod uses
  • Check whether the Pod is already running on the node; if the Pod has no containers or its Pause container is not running, stop all container processes of the Pod first. If the Pod contains containers that need to be removed, remove them
  • Create a container for each Pod from the "kubernetes/pause" image. The Pause container takes over the network for all other containers of the Pod; every time a new Pod is created, the kubelet creates the Pause container first and then the remaining containers.
  • For every container in the Pod:
    • Compute a hash for the container, then look up the hash of the running container with the same name in Docker. If a container is found but the hashes differ, stop that container process and the Pause container associated with it; if the hashes match, do nothing
    • If the container has terminated and no restartPolicy is specified, do nothing
    • Call the Docker client to pull the image and start the container

4.2.3 Container Health Checks

A Pod checks the health of its containers with two kinds of probes:

  • LivenessProbe: liveness check. If a container is found to be unhealthy, it is removed and handled according to the container's restart policy.
  • ReadinessProbe: readiness check. If a container is not ready, its entry is removed from the Endpoints of the associated Service.

LivenessProbe has three implementations:

  • ExecAction: run a command inside the container; an exit code of 0 means the container is healthy
  • TCPSocketAction: run a TCP check against the container's IP:PORT; if the port is reachable, the container is healthy
  • HTTPGetAction: issue an HTTP GET to http://IP:PORT/path of the container; a success status code (2xx, 3xx) means the container is healthy

4.2.4 Static Pod

Any Pod created without going through the API Server is called a Static Pod. The kubelet reports the status of its Static Pods to the API Server, which creates a matching Mirror Pod for each of them; the Mirror Pod reflects the real status of the Static Pod, and when the Static Pod is deleted, its Mirror Pod is deleted as well. A minimal sketch of creating one follows.
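The sketch assumes the kubelet's static-pod directory is the common default /etc/kubernetes/manifests (set by the kubelet's --pod-manifest-path flag or staticPodPath config field, so it may differ on your nodes):

# Run on the node itself: drop a manifest into the static-pod directory
cat > /etc/kubernetes/manifests/static-web.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
EOF

# The kubelet starts it without the API Server; the cluster then shows a
# Mirror Pod, typically named static-web-<node name>
kubectl get pod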

4.3 cAdvisor Resource Monitoring

Monitoring levels: container, Pod, Service, and the whole cluster.

Heapster: provides a cluster-level monitoring platform for Kubernetes; it is an aggregator of cluster-wide monitoring and event data. It runs as a Pod in the cluster, discovers all cluster nodes through the kubelet, and collects resource usage from them; the kubelet in turn obtains the data of its node and containers from cAdvisor. Heapster groups the data by Pod together with the relevant labels and pushes it to a configurable backend for storage and visualisation.

cAdvisor: an open-source agent that analyses container resource usage and performance. It is built into the kubelet and starts together with it; one cAdvisor instance monitors exactly one Node. cAdvisor automatically discovers all containers on its node and collects CPU, memory, filesystem, and network usage statistics; it also reports the overall usage of the node by analysing the node's root container.

cAdvisor exposes a simple UI on port 4194 of its node.

4.4 How It Works

img

Internal components of the kubelet:

  • kubelet APIs: authenticated API (10250), cAdvisor API (4194), read-only API (10255), health-check API (10248)
  • syncLoop: receives Pod updates from the API or the manifest directory and hands them to the podWorkers; makes heavy use of channels for asynchronous processing
  • auxiliary managers: cAdvisor, PLEG, Volume Manager, etc., which handle work outside the syncLoop
  • CRI: the container runtime interface, which communicates with the container runtime shim
  • container runtimes: dockershim, rkt, etc.
  • network plugins: CNI, kubenet

4.5 Kubelet Eviction

The kubelet monitors the node's resource usage and uses an eviction mechanism to prevent compute and storage resources from being exhausted. When a Pod is evicted, all of its containers are stopped and its PodPhase is set to Failed.

At a regular interval (housekeeping-interval) it checks whether system resources have reached the preconfigured eviction thresholds:

Eviction Signal | Condition | Description
memory.available | MemoryPressure | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet
nodefs.available | DiskPressure | nodefs.available := node.stats.fs.available (kubelet volumes, logs, etc.)
nodefs.inodesFree | DiskPressure | nodefs.inodesFree := node.stats.fs.inodesFree
imagefs.available | DiskPressure | imagefs.available := node.stats.runtime.imagefs.available (images and container writable layers)
imagefs.inodesFree | DiskPressure | imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree

Eviction thresholds can be expressed as percentages or as absolute values:

--eviction-hard=memory.available<500Mi,nodefs.available<1Gi,imagefs.available<100Gi
--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"
--system-reserved=memory=1.5Gi

Eviction types:

  • Soft eviction (Soft Eviction): used together with the eviction grace periods (eviction-soft-grace-period and eviction-max-pod-grace-period). Eviction is triggered only after the resource has stayed above the soft threshold for longer than the grace period (see the sketch below)
  • Hard eviction (Hard Eviction): eviction is triggered immediately once a resource reaches the hard threshold
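A sketch of the soft-eviction flags on the kubelet command line (the thresholds and grace periods are invented for illustration):

kubelet \
  --eviction-soft=memory.available<1Gi,nodefs.available<10% \
  --eviction-soft-grace-period=memory.available=2m,nodefs.available=2m \
  --eviction-max-pod-grace-period=60 \
  ...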

Eviction actions:

  • Reclaim node resources
    • imagefs threshold configured
      • nodefs threshold reached: delete stopped Pods
      • imagefs threshold reached: delete unused images
    • imagefs threshold not configured
      • nodefs threshold reached: delete stopped Pods first, then unused images, in that order
  • Evict user Pods
    • eviction order: BestEffort, Burstable, Guaranteed
    • imagefs threshold configured
      • nodefs threshold reached: evict based on nodefs usage (local volumes + logs)
      • imagefs threshold reached: evict based on imagefs usage (container writable layers)
    • imagefs threshold not configured
      • nodefs threshold reached: evict based on total disk usage (local volumes + logs + container writable layers)

Other container and image garbage-collection options:

Garbage collection flag | Eviction flag | Notes
--image-gc-high-threshold | --eviction-hard or --eviction-soft | existing eviction signals can trigger image garbage collection
--image-gc-low-threshold | --eviction-minimum-reclaim | eviction reclaim achieves the same behaviour
--minimum-image-ttl-duration | (none) | still supported, since eviction has no TTL setting
--maximum-dead-containers | (none) | deprecated once old logs are stored outside the container's context
--maximum-dead-containers-per-container | (none) | deprecated once old logs are stored outside the container's context
--minimum-container-ttl-duration | (none) | deprecated once old logs are stored outside the container's context
--low-diskspace-threshold-mb | --eviction-hard or --eviction-soft | eviction generalises the disk threshold to other resources
--outofdisk-transition-frequency | --eviction-pressure-transition-period | eviction generalises disk-pressure transitions to other resources

4.6 Container Runtime

The container runtime (Container Runtime) is the component that actually manages the lifecycle of images and containers. The kubelet talks to it through the Container Runtime Interface (CRI) to manage images and containers.

img

CRI container engines:

  • Docker: dockershim
  • OCI (Open Container Initiative), the open container standard
    • containerd
    • CRI-O
    • runc, the OCI standard container runtime
  • PouchContainer: Alibaba's open-source rich-container engine

4.7 Node Summary Metrics

  • From inside the cluster: curl http://k8s-master1:10255/stats/summary

  • From outside the cluster (not yet working):

    kubectl proxy &
    curl http://localhost:8001/api/v1/proxy/csinodes/k8s-master1:10255/stats/summary

5. Kube-proxy

5.1 Overview

kube-proxy watches the API Server for changes to Services and Endpoints and configures load balancing for Services (TCP and UDP only) through one of the proxiers: userspace, iptables, ipvs, or winuserspace.

kube-proxy can run directly on the physical host, or as a static pod or a DaemonSet.

img

kube-proxy implementations:

  • userspace: the early implementation. It listens on a port in user space, all Services are forwarded to that port by iptables, and the process then load-balances to the actual Pods internally. Its main problem is low efficiency and an obvious performance bottleneck.

  • iptables: the recommended implementation; Service load balancing is done entirely with iptables rules. Its main problem is that it creates a very large number of iptables rules; updates are non-incremental and introduce latency, with visible performance issues at large scale

  • ipvs: solves the performance problems of iptables. Updates are incremental, and existing connections can be kept open while a Service is updated

# ipvs mode requires these kernel modules to be loaded
    modprobe -- ip_vs
    modprobe -- ip_vs_rr
    modprobe -- ip_vs_wrr
    modprobe -- ip_vs_sh
    modprobe -- nf_conntrack_ipv4

    # to check loaded modules, use
    lsmod | grep -e ip_vs -e nf_conntrack_ipv4

    # or
    cut -f1 -d " " /proc/modules | grep -e ip_vs -e nf_conntrack_ipv4

5.2 iptables Example

img
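The original figure only showed a diagram; on a node running in iptables mode the generated rules can be inspected directly (KUBE-SERVICES, KUBE-SVC-* and KUBE-SEP-* are the chains kube-proxy creates):

# Entry chain hooked into the nat table
iptables -t nat -L KUBE-SERVICES -n | head

# Each Service gets a KUBE-SVC-* chain that fans out to KUBE-SEP-* endpoint chains
iptables-save -t nat | grep KUBE-SVC | head
iptables-save -t nat | grep KUBE-SEP | head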

5.3 IPVS Example

img
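Similarly for ipvs mode, assuming the ipvsadm tool is installed on the node:

# Every Service cluster IP appears as a virtual server, every Pod endpoint as a real server
ipvsadm -Ln

# kube-proxy binds the Service cluster IPs to a dummy interface named kube-ipvs0
ip addr show kube-ipvs0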

5.4 Limitations of kube-proxy

It supports only TCP and UDP, offers no HTTP routing, and has no health-check mechanism. These gaps can be filled with a custom Ingress Controller.

6. Kube-DNS

6.1 kube-dns

6.1.1 How It Works

img

kube-dns consists of three containers (a quick resolution check follows this list):

  • kube-dns: the core component
    • KubeDNS: watches Services and Endpoints and pushes the relevant records into SkyDNS
    • SkyDNS: performs the DNS resolution, listening on port 10053; it also serves metrics on port 10055
    • kube-dns also listens on port 8081 for health checks
  • dnsmasq-nanny: starts dnsmasq and restarts it whenever the configuration changes
    • dnsmasq's upstream is SkyDNS, i.e. in-cluster DNS resolution is handled by SkyDNS
  • sidecar: performs health checks and exposes DNS metrics (port 10054)
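With the manifest from 6.1.2 (clusterIP 10.0.0.2), resolution can be spot-checked; a sketch (kubernetes.default is the API server Service, which always exists, and <kube-dns-pod-ip> is a placeholder for the kube-dns Pod's IP):

# Query through the kube-dns Service (port 53, answered by dnsmasq)
nslookup kubernetes.default.svc.cluster.local 10.0.0.2

# Query SkyDNS directly on its own port inside the kube-dns Pod
dig @<kube-dns-pod-ip> -p 10053 kubernetes.default.svc.cluster.local +short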

6.1.2 Deploying kube-dns

cat > kube-dns.yaml <<EOF
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "KubeDNS"
spec:
  selector:
    k8s-app: kube-dns
  clusterIP: 10.0.0.2
  ports:
  - name: dns
    port: 53
    protocol: UDP
  - name: dns-tcp
    port: 53
    protocol: TCP
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  # replicas: not specified here:
  # 1. In order to make Addon Manager do not reconcile this replicas parameter.
  # 2. Default is 1.
  # 3. Will be tuned in real time if DNS horizontal auto-scaling is turned on.
  strategy:
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 0
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
      annotations:
        prometheus.io/port: "10054"
        prometheus.io/scrape: "true"
    spec:
      priorityClassName: system-cluster-critical
      securityContext:
        seccompProfile:
          type: RuntimeDefault
        supplementalGroups: [ 65534 ]
        fsGroup: 65534
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values: ["kube-dns"]
              topologyKey: kubernetes.io/hostname
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      volumes:
      - name: kube-dns-config
        configMap:
          name: kube-dns
          optional: true
      nodeSelector:
        kubernetes.io/os: linux
      containers:
      - name: kubedns
        image: k8s.gcr.io/dns/k8s-dns-kube-dns:1.17.3
        resources:
          # TODO: Set memory limits when we've profiled the container for large
          # clusters, then set request = limit to keep this container in
          # guaranteed class. Currently, this container falls into the
          # "burstable" category so the kubelet doesn't backoff from restarting it.
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        livenessProbe:
          httpGet:
            path: /healthcheck/kubedns
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        readinessProbe:
          httpGet:
            path: /readiness
            port: 8081
            scheme: HTTP
          # we poll on pod startup for the Kubernetes master service and
          # only setup the /readiness HTTP server once that's available.
          initialDelaySeconds: 3
          timeoutSeconds: 5
        args:
        - --domain=cluster.local.
        - --dns-port=10053
        - --config-dir=/kube-dns-config
        - --v=2
        env:
        - name: PROMETHEUS_PORT
          value: "10055"
        ports:
        - containerPort: 10053
          name: dns-local
          protocol: UDP
        - containerPort: 10053
          name: dns-tcp-local
          protocol: TCP
        - containerPort: 10055
          name: metrics
          protocol: TCP
        volumeMounts:
        - name: kube-dns-config
          mountPath: /kube-dns-config
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsUser: 1001
          runAsGroup: 1001
      - name: dnsmasq
        image: k8s.gcr.io/dns/k8s-dns-dnsmasq-nanny:1.17.3
        livenessProbe:
          httpGet:
            path: /healthcheck/dnsmasq
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        args:
        - -v=2
        - -logtostderr
        - -configDir=/etc/k8s/dns/dnsmasq-nanny
        - -restartDnsmasq=true
        - --
        - -k
        - --cache-size=1000
        - --no-negcache
        - --dns-loop-detect
        - --log-facility=-
        - --server=/cluster.local/127.0.0.1#10053
        - --server=/in-addr.arpa/127.0.0.1#10053
        - --server=/ip6.arpa/127.0.0.1#10053
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        # see: https://github.com/kubernetes/kubernetes/issues/29055 for details
        resources:
          requests:
            cpu: 150m
            memory: 20Mi
        volumeMounts:
        - name: kube-dns-config
          mountPath: /etc/k8s/dns/dnsmasq-nanny
        securityContext:
          capabilities:
            drop:
            - all
            add:
            - NET_BIND_SERVICE
            - SETGID
      - name: sidecar
        image: k8s.gcr.io/dns/k8s-dns-sidecar:1.17.3
        livenessProbe:
          httpGet:
            path: /metrics
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        args:
        - --v=2
        - --logtostderr
        - --probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,SRV
        - --probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,SRV
        ports:
        - containerPort: 10054
          name: metrics
          protocol: TCP
        resources:
          requests:
            memory: 20Mi
            cpu: 10m
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsUser: 1001
          runAsGroup: 1001
      dnsPolicy: Default  # Don't use cluster DNS.
      serviceAccountName: kube-dns
EOF

kubectl apply -f kube-dns.yaml
kubectl get pod -n kube-system

6.1.3 Troubleshooting

1. Locating the problem

# Spot the problem
kubectl describe pod kube-dns-594c5b5cb5-mdxp6 -n kube-system
...
Normal Pulled 13m kubelet Container image "k8s.gcr.io/dns/k8s-dns-kube-dns:1.17.3" already present on machine
Warning Unhealthy 12m (x2 over 13m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 503
Warning Unhealthy 9m32s (x25 over 13m) kubelet Readiness probe failed: Get "http://10.244.2.28:8081/readiness": dial tcp 10.244.2.28:8081: connect: connection refused
Warning BackOff 4m30s (x19 over 10m) kubelet Back-off restarting failed container

# Check the container logs
kubectl logs kube-dns-594c5b5cb5-mdxp6 kubedns -n kube-system
...
I0520 05:59:53.947378 1 server.go:195] Skydns metrics enabled (/metrics:10055)
I0520 05:59:53.947996 1 log.go:172] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0520 05:59:53.948005 1 log.go:172] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
E0520 05:59:53.957842 1 reflector.go:125] pkg/mod/k8s.io/client-go@v0.0.0-20190620085101-78d2af792bab/tools/cache/reflector.go:98: Failed to list *v1.Service: services is forbidden: User "system:serviceaccount:kube-system:kube-dns" cannot list resource "services" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:kube-dns" not found
E0520 05:59:53.957894 1 reflector.go:125] pkg/mod/k8s.io/client-go@v0.0.0-20190620085101-78d2af792bab/tools/cache/reflector.go:98: Failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kube-system:kube-dns" cannot list resource "endpoints" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:kube-dns" not found
I0520 05:59:54.447988 1 dns.go:220] Waiting for [endpoints services] to be initialized from apiserver...

Root cause: the RBAC ClusterRole "system:kube-dns" was not found, so kube-dns cannot access the apiserver.

2. Fix

# Create the RBAC objects
cat > kube-dns-rbac.yaml <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:kube-dns
rules:
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:kube-dns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:kube-dns
subjects:
- kind: ServiceAccount
  name: kube-dns
  namespace: kube-system
EOF

kubectl apply -f kube-dns-rbac.yaml

kubectl describe clusterrole system:kube-dns
kubectl describe clusterrolebinding system:kube-dns

3. Verify the fix

kubectl apply -f kube-dns.yaml

kubectl get pod -n kube-system -o wide | grep kube-dns
kube-dns-594c5b5cb5-6wttp 3/3 Running 0 13m 10.244.2.29 k8s-node1 <none> <none>

kubectl describe pod kube-dns-594c5b5cb5-6wttp -n kube-system
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 48s default-scheduler Successfully assigned kube-system/kube-dns-594c5b5cb5-6wttp to k8s-node1
Normal Pulled 47s kubelet Container image "k8s.gcr.io/dns/k8s-dns-kube-dns:1.17.3" already present on machine
Normal Created 47s kubelet Created container kubedns
Normal Started 47s kubelet Started container kubedns
Normal Pulled 47s kubelet Container image "k8s.gcr.io/dns/k8s-dns-dnsmasq-nanny:1.17.3" already present on machine
Normal Created 47s kubelet Created container dnsmasq
Normal Started 47s kubelet Started container dnsmasq
Normal Pulled 47s kubelet Container image "k8s.gcr.io/dns/k8s-dns-sidecar:1.17.3" already present on machine
Normal Created 47s kubelet Created container sidecar
Normal Started 47s kubelet Started container sidecar

6.2 CoreDNS

The successor to kube-dns. CoreDNS is more efficient and has a smaller resource footprint.

6.2.1 Installing CoreDNS

wget https://github.com/coredns/deployment/archive/refs/tags/coredns-1.14.0.tar.gz
tar zxvf coredns-1.14.0.tar.gz
cd deployment-coredns-1.14.0/kubernetes

# Deploy
./deploy.sh | kubectl apply -f -
kubectl delete --namespace=kube-system deployment kube-dns

# Uninstall
./rollback.sh | kubectl apply -f -
kubectl delete --namespace=kube-system deployment coredns

6.2.2 Supported DNS Records

  • Service

    • A record: ${my-svc}.${my-namespace}.svc.cluster.local; resolution depends on the Service type
      • a normal Service resolves to its Cluster IP
      • a Headless Service resolves to the list of selected Pod IPs
    • SRV record: _${my-port-name}._${my-port-protocol}.${my-svc}.${my-namespace}.svc.cluster.local
  • Pod

    • A record: ${pod-ip-address}.${my-namespace}.pod.cluster.local
    • with hostname and subdomain specified: ${hostname}.${custom-subdomain}.default.svc.cluster.local

Example:

cat > dns-test.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    name: nginx
spec:
  hostname: nginx
  subdomain: default-subdomain
  containers:
  - name: nginx
    image: nginx
    ports:
    - name: http
      containerPort: 80
---
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  labels:
    name: dnsutils
spec:
  containers:
  - image: tutum/dnsutils
    command:
    - sleep
    - "7200"
    name: dnsutils
EOF

kubectl apply -f dns-test.yaml

kubectl exec -it dnsutils /bin/sh
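Inside the dnsutils pod the record formats from 6.2.2 can be verified; a sketch (the dashed Pod IP is a placeholder, and the hostname.subdomain lookup additionally assumes a headless Service named default-subdomain exists):

# A record of a Service
nslookup kubernetes.default.svc.cluster.local

# A record of a Pod, using the dashed form of its IP
nslookup 10-244-2-29.default.pod.cluster.local

# hostname.subdomain form for the nginx Pod defined above
nslookup nginx.default-subdomain.default.svc.cluster.local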

6.3 Private and Upstream DNS Servers

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"acme.local": ["1.2.3.4"]}
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]

Queries are first sent to the DNS cache layer of kube-dns (the dnsmasq server). Dnsmasq checks the suffix of the query: names with the cluster suffix (e.g. ".cluster.local") are forwarded to kube-dns; names with a stub-domain suffix (e.g. ".acme.local") are sent to the configured private DNS server ["1.2.3.4"]; everything else is sent to the upstream DNS servers ["8.8.8.8", "8.8.4.4"].

img