Eli's Blog

1. API-Server

1.1 How It Works

Core function: the entry point for all resource operations

  • Exposes the REST API for cluster management, covering authentication, authorization, admission control, data validation, and cluster state changes
  • Acts as the hub for data exchange and communication between the other components (they query or modify data through the API Server; only the API Server operates on etcd directly)

img

1.2 Access

1.2.1 Ports

vi /etc/kubernetes/kube-apiserver.conf
--insecure-bind-address=127.0.0.1 \
--insecure-port=8080 \
--bind-address=192.168.80.11 \
--secure-port=6443 \

curl http://localhost:8080
curl https://192.168.80.11:6443

1.2.2 Access Methods

# 1. SDK (client-go)
go get k8s.io/client-go@latest

# 2. kubectl
kubectl get --raw /api/v1/namespaces | python -m json.tool
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods

# 3. kubectl proxy
kubectl proxy --port=8081 &
curl http://localhost:8081/api/
{
"kind": "APIVersions",
"versions": [
"v1"
],
"serverAddressByClientCIDRs": [
{
"clientCIDR": "0.0.0.0/0",
"serverAddress": "192.168.80.45:6443"
}
]
}

# 4. curl
TOKEN=$(kubectl describe secrets $(kubectl get secrets -n kube-system |grep admin |cut -f1 -d ' ') -n kube-system |grep -E '^token' |cut -f2 -d':'|tr -d '\t'|tr -d ' ')
APISERVER=$(kubectl config view |grep server|cut -f 2- -d ":" | tr -d " ")
curl -H "Authorization: Bearer $TOKEN" $APISERVER/api --insecure
{
"kind": "APIVersions",
"versions": [
"v1"
],
"serverAddressByClientCIDRs": [
{
"clientCIDR": "0.0.0.0/0",
"serverAddress": "192.168.80.45:6443"
}
]
}

1.3 API Resources

# All supported resources
$ kubectl api-resources
NAME SHORTNAMES APIGROUP NAMESPACED KIND
bindings true Binding
componentstatuses cs false ComponentStatus
configmaps cm true ConfigMap
endpoints ep true Endpoints
events ev true Event
limitranges limits true LimitRange
namespaces ns false Namespace
nodes no false Node
persistentvolumeclaims pvc true PersistentVolumeClaim
persistentvolumes pv false PersistentVolume
pods po true Pod
podtemplates true PodTemplate
replicationcontrollers rc true ReplicationController
resourcequotas quota true ResourceQuota
secrets true Secret
serviceaccounts sa true ServiceAccount
services svc true Service
mutatingwebhookconfigurations admissionregistration.k8s.io false MutatingWebhookConfiguration
validatingwebhookconfigurations admissionregistration.k8s.io false ValidatingWebhookConfiguration
customresourcedefinitions crd,crds apiextensions.k8s.io false CustomResourceDefinition
apiservices apiregistration.k8s.io false APIService
controllerrevisions apps true ControllerRevision
daemonsets ds apps true DaemonSet
deployments deploy apps true Deployment
replicasets rs apps true ReplicaSet
statefulsets sts apps true StatefulSet
tokenreviews authentication.k8s.io false TokenReview
localsubjectaccessreviews authorization.k8s.io true LocalSubjectAccessReview
selfsubjectaccessreviews authorization.k8s.io false SelfSubjectAccessReview
selfsubjectrulesreviews authorization.k8s.io false SelfSubjectRulesReview
subjectaccessreviews authorization.k8s.io false SubjectAccessReview
horizontalpodautoscalers hpa autoscaling true HorizontalPodAutoscaler
cronjobs cj batch true CronJob
jobs batch true Job
certificatesigningrequests csr certificates.k8s.io false CertificateSigningRequest
leases coordination.k8s.io true Lease
endpointslices discovery.k8s.io true EndpointSlice
events ev events.k8s.io true Event
ingresses ing extensions true Ingress
ingressclasses networking.k8s.io false IngressClass
ingresses ing networking.k8s.io true Ingress
networkpolicies netpol networking.k8s.io true NetworkPolicy
runtimeclasses node.k8s.io false RuntimeClass
poddisruptionbudgets pdb policy true PodDisruptionBudget
podsecuritypolicies psp policy false PodSecurityPolicy
clusterrolebindings rbac.authorization.k8s.io false ClusterRoleBinding
clusterroles rbac.authorization.k8s.io false ClusterRole
rolebindings rbac.authorization.k8s.io true RoleBinding
roles rbac.authorization.k8s.io true Role
priorityclasses pc scheduling.k8s.io false PriorityClass
csidrivers storage.k8s.io false CSIDriver
csinodes storage.k8s.io false CSINode
storageclasses sc storage.k8s.io false StorageClass
volumeattachments storage.k8s.io false VolumeAttachment


# Resources in the apps API group
$ kubectl api-resources --api-group apps
NAME SHORTNAMES APIGROUP NAMESPACED KIND
controllerrevisions apps true ControllerRevision
daemonsets ds apps true DaemonSet
deployments deploy apps true Deployment
replicasets rs apps true ReplicaSet
statefulsets sts apps true StatefulSet

# Detailed explanation of a resource
$ kubectl explain svc
KIND: Service
VERSION: v1

DESCRIPTION:
Service is a named abstraction of software service (for example, mysql)
consisting of local port (for example 3306) that the proxy listens on, and
the selector that determines which pods will answer requests sent through
the proxy.

FIELDS:
apiVersion <string>
APIVersion defines the versioned schema of this representation of an
object. Servers should convert recognized schemas to the latest internal
value, and may reject unrecognized values. More info:
https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources

kind <string>
Kind is a string value representing the REST resource this object
represents. Servers may infer this from the endpoint the client submits
requests to. Cannot be updated. In CamelCase. More info:
https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds

metadata <Object>
Standard object's metadata. More info:
https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata

spec <Object>
Spec defines the behavior of a service.
https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status

status <Object>
Most recently observed status of the service. Populated by the system.
Read-only. More info:
https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status

# API versions supported by the cluster
$ kubectl api-versions
admissionregistration.k8s.io/v1
admissionregistration.k8s.io/v1beta1
apiextensions.k8s.io/v1
apiextensions.k8s.io/v1beta1
apiregistration.k8s.io/v1
apiregistration.k8s.io/v1beta1
apps/v1
authentication.k8s.io/v1
authentication.k8s.io/v1beta1
authorization.k8s.io/v1
authorization.k8s.io/v1beta1
autoscaling/v1
autoscaling/v2beta1
autoscaling/v2beta2
batch/v1
batch/v1beta1
certificates.k8s.io/v1
certificates.k8s.io/v1beta1
coordination.k8s.io/v1
coordination.k8s.io/v1beta1
discovery.k8s.io/v1beta1
events.k8s.io/v1
events.k8s.io/v1beta1
extensions/v1beta1
networking.k8s.io/v1
networking.k8s.io/v1beta1
node.k8s.io/v1beta1
policy/v1beta1
rbac.authorization.k8s.io/v1
rbac.authorization.k8s.io/v1beta1
scheduling.k8s.io/v1
scheduling.k8s.io/v1beta1
storage.k8s.io/v1
storage.k8s.io/v1beta1
v1

1.4 Example

https://v1-19.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#read-pod-v1-core

cat > pod.yml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pod-example
spec:
  containers:
  - name: alpine
    image: alpine:latest
    command: ["echo"]
    args: ["Hello World"]
EOF

kubectl apply -f pod.yml
kubectl get pod

curl http://localhost:8080/api/v1/namespaces/default/pods/pod-example
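If the insecure port 8080 is disabled, the same object can be read through the authenticated channels from 1.2.2; a sketch assuming the kubectl proxy is listening on port 8081 and TOKEN/APISERVER are set as shown there:

# Via kubectl proxy (no manual token handling)
kubectl proxy --port=8081 &
curl http://localhost:8081/api/v1/namespaces/default/pods/pod-example

# Via the secure port with a bearer token
curl -H "Authorization: Bearer $TOKEN" --insecure \
  $APISERVER/api/v1/namespaces/default/pods/pod-example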

2. Controller-Manager

The Controller Manager consists of kube-controller-manager and cloud-controller-manager. It is the brain of Kubernetes: through the API Server it watches the state of the whole cluster and keeps the cluster in the desired working state.

img

As the management and control center inside the cluster, the Controller Manager manages Nodes, Pod replicas, service endpoints (Endpoints), Namespaces, ServiceAccounts, and resource quotas (ResourceQuota). When a Node goes down unexpectedly, the Controller Manager notices promptly and runs the automated repair flow, keeping the cluster in the desired working state.

2.1 Controller Types

kube-controller-manager:

  • Replication Controller
  • Node Controller
  • CronJob Controller
  • Daemon Controller
  • Deployment Controller
  • Endpoint Controller
  • Garbage Collector
  • Namespace Controller
  • Job Controller
  • Pod AutoScaler
  • ReplicaSet Controller
  • Service Controller
  • ServiceAccount Controller
  • StatefulSet Controller
  • Volume Controller
  • Resource quota Controller

cloud-controller-manager: only needed when Kubernetes runs with a Cloud Provider enabled; it cooperates with the cloud provider's control plane

  • Node Controller
  • Route Controller
  • Service Controller

2.2 Replication Controller (RC)

The Replication Controller, RC for short, keeps the number of Pod replicas associated with an RC at the configured value at all times.

  • An RC only manages Pods whose restart policy is RestartPolicy=Always (create, destroy, restart, and so on)
  • The Pod template in an RC is only used when creating Pods; once a Pod exists, changes to the template do not affect it
  • A Pod can be detached from RC management by changing its labels; this is useful for migrating a Pod out of the cluster or for data-repair debugging
  • Deleting an RC does not affect the Pods it created; to delete the Pods, set the RC's replica count to 0
  • Do not create Pods bypassing the RC: the RC automates Pod management and improves fault tolerance

2.2.1 RC Responsibilities

  • Maintain the number of Pod replicas in the cluster
  • Scale the system up or down by adjusting spec.replicas in the RC
  • Perform rolling upgrades by changing the Pod template in the RC

2.2.2 Liveness Probes

Kubernetes has three mechanisms for probing containers:

  • HTTP GET probe: performs an HTTP GET request against http://ip:port/path of the container
    • Success: a response is received and its status code does not indicate an error (2xx, 3xx)
    • Failure: no response is received, or the status code indicates an error
  • TCP socket probe: tries to open a TCP connection to the specified container port. If the connection is established the probe succeeds; otherwise the container is restarted.
  • Exec probe: executes an arbitrary command inside the container and checks its exit code. Exit code 0 means success; any other code counts as failure.
spec:
  containers:
  - name: nginx
    image: nginx:latest
    # an HTTP GET based liveness probe
    livenessProbe:
      # first check runs 15 seconds after the container starts
      initialDelaySeconds: 15
      httpGet:
        port: 8080
        path: /
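The TCP-socket and exec mechanisms from the list above look similar; a minimal sketch (the pod name and probe values are invented for illustration, and redis-cli ping is simply a convenient in-container health command):

cat > probe-demo.yml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: redis
    image: redis:6.2.3
    livenessProbe:
      # TCP socket probe: succeeds if the port accepts a connection
      tcpSocket:
        port: 6379
      initialDelaySeconds: 15
    readinessProbe:
      # Exec probe: succeeds if the command exits with status 0
      exec:
        command: ["redis-cli", "ping"]
      initialDelaySeconds: 5
EOF

kubectl apply -f probe-demo.yml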

2.3 ReplicaSet (RS)

RS is the successor to RC; it is managed through Deployments and is more powerful than RC.

2.4 Node Controller

When the kubelet starts, it registers its node information through the API Server and then reports status periodically; the API Server stores the information in etcd.

When the Controller Manager starts with --cluster-cidr set, it generates a CIDR for every Node whose Spec.PodCIDR is empty and sets the node's Spec.PodCIDR to it, preventing CIDR conflicts between nodes.

img

2.4.1 Node Eviction

After a node becomes unhealthy, the Node controller evicts it at the default rate (--node-eviction-rate=0.1, i.e. one node every 10 seconds). The Node controller groups nodes by Zone and adjusts the rate according to the Zone's state (the relevant kube-controller-manager flags are sketched after this list):

  • Normal: all nodes are Ready; evict at the default rate.

  • PartialDisruption: more than 33% of the nodes are NotReady. Once the unhealthy fraction exceeds --unhealthy-zone-threshold=0.55, the rate is reduced:

    • small clusters (fewer than --large-cluster-size-threshold=50 nodes): eviction stops
    • large clusters: the rate drops to --secondary-node-eviction-rate=0.01
  • FullDisruption: all nodes are NotReady; fall back to the default eviction rate. If every Zone is in FullDisruption, eviction stops.
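These rates and thresholds are kube-controller-manager flags; a sketch of how they might appear in its startup arguments (the values shown are the defaults quoted above, the trailing dots stand for the rest of the arguments):

kube-controller-manager \
  --node-eviction-rate=0.1 \
  --secondary-node-eviction-rate=0.01 \
  --unhealthy-zone-threshold=0.55 \
  --large-cluster-size-threshold=50 \
  ...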

2.5 ResourceQuota Controller

Resource quota management makes sure that the specified resource objects never take more than their allotted share of the system's physical resources at any time.

Three levels of resource quota management are supported:

  • Container level: limits CPU and Memory
  • Pod level: limits the resources available to all containers in a Pod
  • Namespace level:
    • number of Pods
    • number of RSs
    • number of Services
    • number of ResourceQuotas
    • number of Secrets
    • number of PVs (Persistent Volumes) that can be held

Notes:

  1. Quota management is enforced through Admission Control
  2. Admission Control provides two kinds of quota constraints (see the sketch below):
    • LimitRanger: applies to Pods and Containers
    • ResourceQuota: applies to a Namespace and caps the total usage of each kind of resource in that Namespace
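A minimal sketch of the two constraint types named above (object names and limits are invented for illustration):

cat > quota-demo.yml <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-demo
  namespace: default
spec:
  hard:
    pods: "10"
    services: "5"
    secrets: "20"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: limits-demo
  namespace: default
spec:
  limits:
  - type: Container
    default:          # default limits injected into containers that set none
      cpu: 200m
      memory: 256Mi
    defaultRequest:   # default requests injected into containers that set none
      cpu: 100m
      memory: 128Mi
EOF

kubectl apply -f quota-demo.yml
kubectl describe quota quota-demo
kubectl describe limits limits-demo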

ResourceQuota Controller workflow:

img

2.6 Namespace Controller

Users create new Namespaces through the API Server, and they are stored in etcd; the Namespace Controller periodically reads this Namespace information through the API Server.

If the API marks a Namespace for graceful deletion (i.e. sets a deletion deadline, DeletionTimestamp), the Namespace's status is set to "Terminating" and saved to etcd, and the Namespace Controller deletes the ServiceAccounts, RSs, Pods, and other resource objects in that Namespace.

2.7 Endpoint Controller

The relationship between Service, Endpoints, and Pod:

img

An Endpoints object holds the addresses of all Pod replicas behind a Service. The Endpoints Controller is the controller that generates and maintains all Endpoints objects; it watches Services and their Pod replicas:

  • When a Service is deleted, delete the Endpoints object with the same name
  • When a Service is created or modified, build or update the corresponding Endpoints object from the Pod list selected by that Service
  • On Pod events, update the Endpoints objects of the Services that Pod belongs to

kube-proxy fetches the Endpoints of every Service and uses them to implement Service load balancing; the commands below show the relationship.
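A quick check on a live cluster (the kubernetes Service in the default namespace always exists, so it is a safe target):

# The Service and its Endpoints object share the same name
kubectl get svc kubernetes
kubectl get endpoints kubernetes
kubectl describe svc kubernetes | grep -i endpoints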

2.8 Service Controller

The Service Controller is the interface controller between the Kubernetes cluster and the external cloud platform. It watches for Service changes and, for Services of type LoadBalancer, makes sure the corresponding load-balancer instance on the cloud platform is created, deleted, and its forwarding rules kept up to date.

3. Scheduler

The Scheduler is responsible for Pod scheduling and links the upstream and downstream parts of the system:

Upstream: it accepts the new Pods created by the Controller Manager and picks a suitable Node for each of them

Downstream: the kubelet on that Node then takes over the Pod's lifecycle

The Scheduler as the cluster's dispatcher:

1) Using the scheduling algorithm, it picks a suitable Node, binds the pending Pod to that Node, and writes the binding into etcd

2) The kubelet watches the Pod binding produced by the Scheduler through the API Server, fetches the corresponding Pod manifest, pulls the image, and starts the containers

img

3.1 Scheduling Workflow

  • Predicate phase: iterate over all candidate Nodes and filter out those that meet the requirements. Kubernetes ships a number of built-in predicate policies (Predicates) to choose from
  • Priority phase: score every remaining candidate node with the priority policies (Priorities) and pick the one with the highest score

The scheduling workflow is implemented through pluggable algorithm providers (Algorithm Provider); an algorithm provider is a structure that bundles a set of predicates with a set of priorities.

3.2 Predicates

CheckNodeCondition: check that the node itself is healthy (IP, disk, etc.)
GeneralPredicates
HostName: check whether the Pod defines pod.spec.hostname
PodFitsHostPorts: the host ports requested by the pod (pods.spec.containers.ports.hostPort) must be free on the node
MatchNodeSelector: check the node labels against pods.spec.nodeSelector
PodFitsResources: check whether the node can satisfy the Pod's resource requests
NoDiskConflict: check whether the volumes the Pod depends on can be satisfied (not used by default)
PodToleratesNodeTaints: check that the Pod's spec.tolerations cover all of the node's taints
PodToleratesNodeNoExecuteTaints: the same check for NoExecute taints (not used by default)
CheckNodeLabelPresence: check whether the specified labels exist on the node
CheckServiceAffinity: try to place pods of the same Service together (not used by default)
MaxEBSVolumeCount: check the maximum number of EBS (AWS) volumes
MaxGCEPDVolumeCount: check the maximum number of GCE PD volumes
MaxAzureDiskVolumeCount: check the maximum number of AzureDisk volumes
CheckVolumeBinding: check bound and unbound PVCs on the node
NoVolumeZoneConflict: check for zone conflicts between the volumes and the pod
CheckNodeMemoryPressure: check whether the node is under memory pressure
CheckNodePIDPressure: check whether the node has too many PIDs in use
CheckNodeDiskPressure: check whether the node is under disk / disk-IO pressure
MatchInterPodAffinity: check whether the node satisfies the pod's affinity and anti-affinity rules

3.3 Priorities

LeastRequested: the more free capacity, the higher the score
(cpu((capacity-sum(requested))*10/capacity) + memory((capacity-sum(requested))*10/capacity)) / 2

BalancedResourceAllocation: nodes whose CPU and memory utilisation are closest to each other win
NodePreferAvoidPods: based on the node annotation "scheduler.alpha.kubernetes.io/preferAvoidPods"
TaintToleration: match the Pod's spec.tolerations against the node's taints; the more matching entries, the lower the score

SelectorSpreading: label-selector spreading; nodes that already run more pods selected by the same selector score lower
InterPodAffinity: iterate over the pod's affinity terms; the more terms matched, the higher the score
NodeAffinity: node affinity
MostRequested: the less free capacity, the higher the score; the opposite of LeastRequested (not enabled by default)
NodeLabel: whether the node has the corresponding labels (not enabled by default)
ImageLocality: based on the total size of the images needed by the Pod that are already present on the node (not enabled by default)

3.4 Advanced Scheduling

3.4.1 nodeSelector

# 1. Create a redis Deployment
cat > redis-deploy.yml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  selector:
    matchLabels:
      app: redis
  replicas: 2
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:6.2.3
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 6379
      nodeSelector:
        disk: ssd # require nodes with this disk type
EOF

kubectl apply -f redis-deploy.yml

# Check the pod status
kubectl get pod
NAME READY STATUS RESTARTS AGE
redis-9fc84569-2jlxh 0/1 Pending 0 60s
redis-9fc84569-q78jd 0/1 Pending 0 60s

kubectl describe pod redis-9fc84569-2jlxh
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 20s (x3 over 76s) default-scheduler 0/4 nodes are available: 4 node(s) didn't match node selector.

# Add the label disk=ssd to k8s-node2
kubectl label node k8s-node2 disk=ssd

kubectl get nodes --show-labels | grep disk=ssd
k8s-node2 Ready node 4d21h v1.19.11 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disk=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node2,kubernetes.io/os=linux,node-role.kubernetes.io/node=

# Check again that the pods were scheduled
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redis-9fc84569-2jlxh 1/1 Running 0 5m20s 10.244.1.8 k8s-node2 <none> <none>
redis-9fc84569-q78jd 1/1 Running 0 5m20s 10.244.1.7 k8s-node2 <none> <none>

3.4.2 Affinity

1. Soft affinity

preferredDuringSchedulingIgnoredDuringExecution

Soft affinity: prefer nodes that match more of the conditions, but the Pod is still created even if none of them match

cat > preferred-affinity-pod.yml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: preferred-affinity-pod
  labels:
    app: my-pod
spec:
  containers:
  - name: preferred-affinity-pod
    image: nginx
    ports:
    - name: http
      containerPort: 80
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: apps # label key
            operator: In
            values:
            - mysql # apps=mysql
            - redis # apps=redis
        weight: 60 # weight of this nodeSelectorTerm, 1-100
EOF

kubectl apply -f preferred-affinity-pod.yml

# Created successfully even though no node matches
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
preferred-affinity-pod 1/1 Running 0 61s 10.244.2.18 k8s-node1 <none> <none>

2. Hard affinity

requiredDuringSchedulingIgnoredDuringExecution

Hard affinity: at least one of the match conditions must be satisfied, otherwise the Pod is not scheduled

cat > required-affinity-pod.yml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: required-affinity-pod
  labels:
    app: my-pod
spec:
  containers:
  - name: required-affinity-pod
    image: nginx
    ports:
    - name: http
      containerPort: 80
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: apps # label key
            operator: In
            values:
            - mysql # apps=mysql
            - redis # apps=redis
EOF

kubectl apply -f required-affinity-pod.yml

# Cannot be scheduled while no node matches
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
required-affinity-pod 0/1 Pending 0 18s <none> <none> <none> <none>

# Label k8s-node1
kubectl label node k8s-node1 apps=mysql

# Now the pod is scheduled
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
required-affinity-pod 1/1 Running 0 2m31s 10.244.2.19 k8s-node1 <none> <none>

3.4.3 Anti-Affinity

cat > anti-affinity.yml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: myapp1
  labels:
    app: myapp1
spec:
  containers:
  - name: myapp1
    image: nginx
    ports:
    - name: http
      containerPort: 80
---
apiVersion: v1
kind: Pod
metadata:
  name: myapp2
  labels:
    app: myapp2
spec:
  containers:
  - name: myapp2
    image: nginx
    ports:
    - name: http
      containerPort: 80
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - myapp1 # app=myapp1
        topologyKey: kubernetes.io/hostname # nodes with the same kubernetes.io/hostname value count as the same location
EOF

kubectl apply -f anti-affinity.yml

# The two pods land on different nodes
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp1 1/1 Running 0 43s 10.244.2.20 k8s-node1 <none> <none>
myapp2 1/1 Running 0 43s 10.244.1.9 k8s-node2 <none> <none>

3.5 Taints and Tolerations

The effect of a taint defines how it repels Pods:

  • NoSchedule: only affects scheduling; existing Pods are not affected, i.e. not evicted
  • NoExecute: affects both scheduling and existing Pods; existing Pods are evicted
  • PreferNoSchedule: avoid placing Pods here if possible, but allow it when no other node fits

3.5.1 Managing Taints

kubectl describe node k8s-master1 | grep Taints
Taints: node-role.kubernetes.io/master:NoSchedule

kubectl describe node k8s-node1 | grep Taints
Taints: <none>

# Add a taint
kubectl taint node k8s-node1 node-role.kubernetes.io/node=:NoSchedule

kubectl describe node k8s-node1 | grep Taints
Taints: node-role.kubernetes.io/node:NoSchedule

# Remove the taint
kubectl taint node k8s-node1 node-role.kubernetes.io/node-

3.5.2 Tolerations

# Add the node-type taint to all worker nodes
kubectl taint node k8s-node1 node-type=:NoSchedule
kubectl taint node k8s-node2 node-type=:NoSchedule

cat > toleration-pod.yml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: toleration-pod
  labels:
    app: toleration-pod
spec:
  containers:
  - name: toleration-pod
    image: nginx
    ports:
    - name: http
      containerPort: 80
  tolerations:
  - key: "node-type"          # taint key
    operator: "Equal"         # Exists/Equal
    value: "PreferNoSchedule" # taint value
    effect: "NoSchedule"
    #tolerationSeconds: 3600  # how long to keep tolerating before eviction; only meaningful with effect NoExecute
EOF

kubectl apply -f toleration-pod.yml

# Cannot be scheduled
kubectl get pod
NAME READY STATUS RESTARTS AGE
toleration-pod 0/1 Pending 0 5s

kubectl describe pod toleration-pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 26s (x2 over 26s) default-scheduler 0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node-type: }, that the pod didn't tolerate.

# Add a PreferNoSchedule taint to k8s-node1 and remove its NoSchedule taint
kubectl taint node k8s-node1 node-type=:PreferNoSchedule
kubectl taint node k8s-node1 node-type=:NoSchedule-

# Scheduled successfully
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
toleration-pod 1/1 Running 0 11m 10.244.2.21 k8s-node1 <none> <none>

4. Kubelet

Every Node runs a kubelet service process, which listens on port 10250 by default, receives and executes the instructions sent by the Master, and manages the Pods and their containers on that node. Each kubelet registers its node's information with the API Server, periodically reports the node's resource usage to the Master, and monitors node and container resources through cAdvisor. Think of the kubelet as the agent in a server-agent architecture: the Pod housekeeper on each Node.

4.1 Node Management

Node management mainly consists of node self-registration and node status updates:

  • The startup flag --register-node determines whether the kubelet registers itself with the API Server
  • If self-registration is not used, the user has to configure the Node resource information manually and configure the API Server address
  • At startup the kubelet registers its node information through the API Server and keeps sending node updates; when the API Server receives them it writes the information to etcd

Main parameters (sketched below):

  • --kubeconfig: path to the kubeconfig file, commonly used to point at the certificates
  • --hostname-override: the hostname this node shows in the cluster
  • --node-status-update-frequency: how often the kubelet reports its heartbeat to the API Server, 10s by default
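A sketch of how these flags might appear on the kubelet command line (the kubeconfig path is a placeholder, not taken from this cluster; the trailing dots stand for the remaining arguments):

kubelet \
  --kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
  --hostname-override=k8s-node1 \
  --node-status-update-frequency=10s \
  ...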

4.2 Pod Management

4.2.1 Fetching the Pod Manifest

The kubelet uses the API Server client with a watch-plus-list pattern to monitor the "/registry/csinodes" and "/registry/pods" directories and syncs what it receives into its local cache.

etcdctl get /registry/csinodes --prefix --keys-only
/registry/csinodes/k8s-master1
/registry/csinodes/k8s-master2
/registry/csinodes/k8s-node1
/registry/csinodes/k8s-node2

# Details for a single node
etcdctl get /registry/csinodes/k8s-master1
/registry/csinodes/k8s-master1
k8s

storage.k8s.io/v1CSINode⚌


k8s-master1"*$cc45d1d2-dde9-4740-ac49-878a971ef0852⚌⚌⚌j=
Node
k8s-master1"$5e851f78-5bfc-4acb-b5c3-4a10cf353237*v1z⚌⚌
kubeletUpdatestorage.k8s.io/v⚌⚌⚌FieldsV1:⚌
⚌{"f:metadata":{"f:ownerReferences":{".":{},"k:{\"uid\":\"5e851f78-5bfc-4acb-b5c3-4a10cf353237\"}":{".":{},"f:apiVersion":{},"f:kind":{},"f:name":{},"f:uid":{}}}}}"

# Pod keys
etcdctl get /registry/pods --prefix --keys-only
/registry/pods/kube-system/coredns-867bfd96bd-264bb
/registry/pods/kube-system/kube-flannel-ds-48kz2
/registry/pods/kube-system/kube-flannel-ds-bsfpp
/registry/pods/kube-system/kube-flannel-ds-h5shb
/registry/pods/kube-system/kube-flannel-ds-qpvlt
/registry/pods/kubernetes-dashboard/dashboard-metrics-scraper-79c5968bdc-62hlk
/registry/pods/kubernetes-dashboard/kubernetes-dashboard-9f9799597-b8hfr

Every operation on a Pod is seen by the kubelet, which creates, modifies, or deletes Pods on its node according to the instructions it observes.

4.2.2 Pod Creation Flow

img

When the kubelet reads information about a Pod being created or modified, it does the following:

  • Create a data directory for the Pod
  • Fetch the Pod manifest from the API Server
  • Mount the external volumes for the Pod
  • Download the Secrets the Pod uses
  • Check whether the Pod is already running on the node; if the Pod has no containers or its Pause container is not running, stop all container processes of the Pod first. If the Pod contains containers that need to be removed, remove them
  • Create a container for each Pod from the "kubernetes/pause" image. The Pause container takes over the network for all other containers of the Pod; every time a new Pod is created, the kubelet creates the Pause container first and then the remaining containers.
  • For every container in the Pod:
    • Compute a hash for the container, then look up the hash of the running container with the same name in Docker. If a container is found but the hashes differ, stop that container process and the Pause container associated with it; if the hashes match, do nothing
    • If the container has terminated and no restartPolicy is specified, do nothing
    • Call the Docker client to pull the image and start the container

4.2.3 Container Health Checks

A Pod checks the health of its containers with two kinds of probes:

  • LivenessProbe: liveness check. If a container is found to be unhealthy, it is removed and handled according to the container's restart policy.
  • ReadinessProbe: readiness check. If a container is not ready, its entry is removed from the Endpoints of the associated Service.

LivenessProbe has three implementations:

  • ExecAction: run a command inside the container; an exit code of 0 means the container is healthy
  • TCPSocketAction: run a TCP check against the container's IP:PORT; if the port is reachable, the container is healthy
  • HTTPGetAction: issue an HTTP GET to http://IP:PORT/path of the container; a success status code (2xx, 3xx) means the container is healthy

4.2.4 Static Pod

Any Pod created without going through the API Server is called a Static Pod. The kubelet reports the status of its Static Pods to the API Server, which creates a matching Mirror Pod for each of them; the Mirror Pod reflects the real status of the Static Pod, and when the Static Pod is deleted, its Mirror Pod is deleted as well. A minimal sketch of creating one follows.
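The sketch assumes the kubelet's static-pod directory is the common default /etc/kubernetes/manifests (set by the kubelet's --pod-manifest-path flag or staticPodPath config field, so it may differ on your nodes):

# Run on the node itself: drop a manifest into the static-pod directory
cat > /etc/kubernetes/manifests/static-web.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
EOF

# The kubelet starts it without the API Server; the cluster then shows a
# Mirror Pod, typically named static-web-<node name>
kubectl get pod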

4.3 cAdvisor Resource Monitoring

Monitoring levels: container, Pod, Service, and the whole cluster.

Heapster: provides a cluster-level monitoring platform for Kubernetes; it is an aggregator of cluster-wide monitoring and event data. It runs as a Pod in the cluster, discovers all cluster nodes through the kubelet, and collects resource usage from them; the kubelet in turn obtains the data of its node and containers from cAdvisor. Heapster groups the data by Pod together with the relevant labels and pushes it to a configurable backend for storage and visualisation.

cAdvisor: an open-source agent that analyses container resource usage and performance. It is built into the kubelet and starts together with it; one cAdvisor instance monitors exactly one Node. cAdvisor automatically discovers all containers on its node and collects CPU, memory, filesystem, and network usage statistics; it also reports the overall usage of the node by analysing the node's root container.

cAdvisor exposes a simple UI on port 4194 of its node.

4.4 How It Works

img

Internal components of the kubelet:

  • kubelet APIs: authenticated API (10250), cAdvisor API (4194), read-only API (10255), health-check API (10248)
  • syncLoop: receives Pod updates from the API or the manifest directory and hands them to the podWorkers; makes heavy use of channels for asynchronous processing
  • auxiliary managers: cAdvisor, PLEG, Volume Manager, etc., which handle work outside the syncLoop
  • CRI: the container runtime interface, which communicates with the container runtime shim
  • container runtimes: dockershim, rkt, etc.
  • network plugins: CNI, kubenet

4.5 Kubelet Eviction

The kubelet monitors the node's resource usage and uses an eviction mechanism to prevent compute and storage resources from being exhausted. When a Pod is evicted, all of its containers are stopped and its PodPhase is set to Failed.

At a regular interval (housekeeping-interval) it checks whether system resources have reached the preconfigured eviction thresholds:

Eviction Signal | Condition | Description
memory.available | MemoryPressure | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet
nodefs.available | DiskPressure | nodefs.available := node.stats.fs.available (kubelet volumes, logs, etc.)
nodefs.inodesFree | DiskPressure | nodefs.inodesFree := node.stats.fs.inodesFree
imagefs.available | DiskPressure | imagefs.available := node.stats.runtime.imagefs.available (images and container writable layers)
imagefs.inodesFree | DiskPressure | imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree

Eviction thresholds can be expressed as percentages or as absolute values:

--eviction-hard=memory.available<500Mi,nodefs.available<1Gi,imagefs.available<100Gi
--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"
--system-reserved=memory=1.5Gi

Eviction types:

  • Soft eviction (Soft Eviction): used together with the eviction grace periods (eviction-soft-grace-period and eviction-max-pod-grace-period). Eviction is triggered only after the resource has stayed above the soft threshold for longer than the grace period (see the sketch below)
  • Hard eviction (Hard Eviction): eviction is triggered immediately once a resource reaches the hard threshold
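A sketch of the soft-eviction flags on the kubelet command line (the thresholds and grace periods are invented for illustration):

kubelet \
  --eviction-soft=memory.available<1Gi,nodefs.available<10% \
  --eviction-soft-grace-period=memory.available=2m,nodefs.available=2m \
  --eviction-max-pod-grace-period=60 \
  ...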

Eviction actions:

  • Reclaim node resources
    • imagefs threshold configured
      • nodefs threshold reached: delete stopped Pods
      • imagefs threshold reached: delete unused images
    • imagefs threshold not configured
      • nodefs threshold reached: delete stopped Pods first, then unused images, in that order
  • Evict user Pods
    • eviction order: BestEffort, Burstable, Guaranteed
    • imagefs threshold configured
      • nodefs threshold reached: evict based on nodefs usage (local volumes + logs)
      • imagefs threshold reached: evict based on imagefs usage (container writable layers)
    • imagefs threshold not configured
      • nodefs threshold reached: evict based on total disk usage (local volumes + logs + container writable layers)

Other container and image garbage-collection options:

Garbage collection flag | Eviction flag | Notes
--image-gc-high-threshold | --eviction-hard or --eviction-soft | existing eviction signals can trigger image garbage collection
--image-gc-low-threshold | --eviction-minimum-reclaim | eviction reclaim achieves the same behaviour
--minimum-image-ttl-duration | (none) | still supported, since eviction has no TTL setting
--maximum-dead-containers | (none) | deprecated once old logs are stored outside the container's context
--maximum-dead-containers-per-container | (none) | deprecated once old logs are stored outside the container's context
--minimum-container-ttl-duration | (none) | deprecated once old logs are stored outside the container's context
--low-diskspace-threshold-mb | --eviction-hard or --eviction-soft | eviction generalises the disk threshold to other resources
--outofdisk-transition-frequency | --eviction-pressure-transition-period | eviction generalises disk-pressure transitions to other resources

4.6 Container Runtime

The container runtime (Container Runtime) is the component that actually manages the lifecycle of images and containers. The kubelet talks to it through the Container Runtime Interface (CRI) to manage images and containers.

img

CRI container engines:

  • Docker: dockershim
  • OCI (Open Container Initiative), the open container standard
    • containerd
    • CRI-O
    • runc, the OCI standard container runtime
  • PouchContainer: Alibaba's open-source rich-container engine

4.7 Node Summary Metrics

  • From inside the cluster: curl http://k8s-master1:10255/stats/summary

  • From outside the cluster (not yet working):

    kubectl proxy &
    curl http://localhost:8001/api/v1/proxy/csinodes/k8s-master1:10255/stats/summary

5. Kube-proxy

5.1 Overview

kube-proxy watches the API Server for changes to Services and Endpoints and configures load balancing for Services (TCP and UDP only) through one of the proxiers: userspace, iptables, ipvs, or winuserspace.

kube-proxy can run directly on the physical host, or as a static pod or a DaemonSet.

img

kube-proxy implementations:

  • userspace: the early implementation. It listens on a port in user space, all Services are forwarded to that port by iptables, and the process then load-balances to the actual Pods internally. Its main problem is low efficiency and an obvious performance bottleneck.

  • iptables: the recommended implementation; Service load balancing is done entirely with iptables rules. Its main problem is that it creates a very large number of iptables rules; updates are non-incremental and introduce latency, with visible performance issues at large scale

  • ipvs: solves the performance problems of iptables. Updates are incremental, and existing connections can be kept open while a Service is updated

# ipvs mode requires these kernel modules to be loaded
    modprobe -- ip_vs
    modprobe -- ip_vs_rr
    modprobe -- ip_vs_wrr
    modprobe -- ip_vs_sh
    modprobe -- nf_conntrack_ipv4

    # to check loaded modules, use
    lsmod | grep -e ip_vs -e nf_conntrack_ipv4

    # or
    cut -f1 -d " " /proc/modules | grep -e ip_vs -e nf_conntrack_ipv4

5.2 iptables Example

img
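The original figure only showed a diagram; on a node running in iptables mode the generated rules can be inspected directly (KUBE-SERVICES, KUBE-SVC-* and KUBE-SEP-* are the chains kube-proxy creates):

# Entry chain hooked into the nat table
iptables -t nat -L KUBE-SERVICES -n | head

# Each Service gets a KUBE-SVC-* chain that fans out to KUBE-SEP-* endpoint chains
iptables-save -t nat | grep KUBE-SVC | head
iptables-save -t nat | grep KUBE-SEP | head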

5.3 IPVS Example

img
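Similarly for ipvs mode, assuming the ipvsadm tool is installed on the node:

# Every Service cluster IP appears as a virtual server, every Pod endpoint as a real server
ipvsadm -Ln

# kube-proxy binds the Service cluster IPs to a dummy interface named kube-ipvs0
ip addr show kube-ipvs0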

5.4 Limitations of kube-proxy

It supports only TCP and UDP, offers no HTTP routing, and has no health-check mechanism. These gaps can be filled with a custom Ingress Controller.

6. Kube-DNS

6.1 kube-dns

6.1.1 How It Works

img

kube-dns consists of three containers (a quick resolution check follows this list):

  • kube-dns: the core component
    • KubeDNS: watches Services and Endpoints and pushes the relevant records into SkyDNS
    • SkyDNS: performs the DNS resolution, listening on port 10053; it also serves metrics on port 10055
    • kube-dns also listens on port 8081 for health checks
  • dnsmasq-nanny: starts dnsmasq and restarts it whenever the configuration changes
    • dnsmasq's upstream is SkyDNS, i.e. in-cluster DNS resolution is handled by SkyDNS
  • sidecar: performs health checks and exposes DNS metrics (port 10054)
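With the manifest from 6.1.2 (clusterIP 10.0.0.2), resolution can be spot-checked; a sketch (kubernetes.default is the API server Service, which always exists, and <kube-dns-pod-ip> is a placeholder for the kube-dns Pod's IP):

# Query through the kube-dns Service (port 53, answered by dnsmasq)
nslookup kubernetes.default.svc.cluster.local 10.0.0.2

# Query SkyDNS directly on its own port inside the kube-dns Pod
dig @<kube-dns-pod-ip> -p 10053 kubernetes.default.svc.cluster.local +short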

6.1.2 Deploying kube-dns

cat > kube-dns.yaml <<EOF
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "KubeDNS"
spec:
  selector:
    k8s-app: kube-dns
  clusterIP: 10.0.0.2
  ports:
  - name: dns
    port: 53
    protocol: UDP
  - name: dns-tcp
    port: 53
    protocol: TCP
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  # replicas: not specified here:
  # 1. In order to make Addon Manager do not reconcile this replicas parameter.
  # 2. Default is 1.
  # 3. Will be tuned in real time if DNS horizontal auto-scaling is turned on.
  strategy:
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 0
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
      annotations:
        prometheus.io/port: "10054"
        prometheus.io/scrape: "true"
    spec:
      priorityClassName: system-cluster-critical
      securityContext:
        seccompProfile:
          type: RuntimeDefault
        supplementalGroups: [ 65534 ]
        fsGroup: 65534
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values: ["kube-dns"]
              topologyKey: kubernetes.io/hostname
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      volumes:
      - name: kube-dns-config
        configMap:
          name: kube-dns
          optional: true
      nodeSelector:
        kubernetes.io/os: linux
      containers:
      - name: kubedns
        image: k8s.gcr.io/dns/k8s-dns-kube-dns:1.17.3
        resources:
          # TODO: Set memory limits when we've profiled the container for large
          # clusters, then set request = limit to keep this container in
          # guaranteed class. Currently, this container falls into the
          # "burstable" category so the kubelet doesn't backoff from restarting it.
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        livenessProbe:
          httpGet:
            path: /healthcheck/kubedns
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        readinessProbe:
          httpGet:
            path: /readiness
            port: 8081
            scheme: HTTP
          # we poll on pod startup for the Kubernetes master service and
          # only setup the /readiness HTTP server once that's available.
          initialDelaySeconds: 3
          timeoutSeconds: 5
        args:
        - --domain=cluster.local.
        - --dns-port=10053
        - --config-dir=/kube-dns-config
        - --v=2
        env:
        - name: PROMETHEUS_PORT
          value: "10055"
        ports:
        - containerPort: 10053
          name: dns-local
          protocol: UDP
        - containerPort: 10053
          name: dns-tcp-local
          protocol: TCP
        - containerPort: 10055
          name: metrics
          protocol: TCP
        volumeMounts:
        - name: kube-dns-config
          mountPath: /kube-dns-config
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsUser: 1001
          runAsGroup: 1001
      - name: dnsmasq
        image: k8s.gcr.io/dns/k8s-dns-dnsmasq-nanny:1.17.3
        livenessProbe:
          httpGet:
            path: /healthcheck/dnsmasq
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        args:
        - -v=2
        - -logtostderr
        - -configDir=/etc/k8s/dns/dnsmasq-nanny
        - -restartDnsmasq=true
        - --
        - -k
        - --cache-size=1000
        - --no-negcache
        - --dns-loop-detect
        - --log-facility=-
        - --server=/cluster.local/127.0.0.1#10053
        - --server=/in-addr.arpa/127.0.0.1#10053
        - --server=/ip6.arpa/127.0.0.1#10053
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        # see: https://github.com/kubernetes/kubernetes/issues/29055 for details
        resources:
          requests:
            cpu: 150m
            memory: 20Mi
        volumeMounts:
        - name: kube-dns-config
          mountPath: /etc/k8s/dns/dnsmasq-nanny
        securityContext:
          capabilities:
            drop:
            - all
            add:
            - NET_BIND_SERVICE
            - SETGID
      - name: sidecar
        image: k8s.gcr.io/dns/k8s-dns-sidecar:1.17.3
        livenessProbe:
          httpGet:
            path: /metrics
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        args:
        - --v=2
        - --logtostderr
        - --probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,SRV
        - --probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,SRV
        ports:
        - containerPort: 10054
          name: metrics
          protocol: TCP
        resources:
          requests:
            memory: 20Mi
            cpu: 10m
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsUser: 1001
          runAsGroup: 1001
      dnsPolicy: Default  # Don't use cluster DNS.
      serviceAccountName: kube-dns
EOF

kubectl apply -f kube-dns.yaml
kubectl get pod -n kube-system

6.1.3 Troubleshooting

1. Locating the problem

# Spot the problem
kubectl describe pod kube-dns-594c5b5cb5-mdxp6 -n kube-system
...
Normal Pulled 13m kubelet Container image "k8s.gcr.io/dns/k8s-dns-kube-dns:1.17.3" already present on machine
Warning Unhealthy 12m (x2 over 13m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 503
Warning Unhealthy 9m32s (x25 over 13m) kubelet Readiness probe failed: Get "http://10.244.2.28:8081/readiness": dial tcp 10.244.2.28:8081: connect: connection refused
Warning BackOff 4m30s (x19 over 10m) kubelet Back-off restarting failed container

# Check the container logs
kubectl logs kube-dns-594c5b5cb5-mdxp6 kubedns -n kube-system
...
I0520 05:59:53.947378 1 server.go:195] Skydns metrics enabled (/metrics:10055)
I0520 05:59:53.947996 1 log.go:172] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0520 05:59:53.948005 1 log.go:172] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
E0520 05:59:53.957842 1 reflector.go:125] pkg/mod/k8s.io/client-go@v0.0.0-20190620085101-78d2af792bab/tools/cache/reflector.go:98: Failed to list *v1.Service: services is forbidden: User "system:serviceaccount:kube-system:kube-dns" cannot list resource "services" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:kube-dns" not found
E0520 05:59:53.957894 1 reflector.go:125] pkg/mod/k8s.io/client-go@v0.0.0-20190620085101-78d2af792bab/tools/cache/reflector.go:98: Failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kube-system:kube-dns" cannot list resource "endpoints" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:kube-dns" not found
I0520 05:59:54.447988 1 dns.go:220] Waiting for [endpoints services] to be initialized from apiserver...

Root cause: the RBAC ClusterRole "system:kube-dns" was not found, so kube-dns cannot access the apiserver.

2. Fix

# Create the RBAC objects
cat > kube-dns-rbac.yaml <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:kube-dns
rules:
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:kube-dns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:kube-dns
subjects:
- kind: ServiceAccount
  name: kube-dns
  namespace: kube-system
EOF

kubectl apply -f kube-dns-rbac.yaml

kubectl describe clusterrole system:kube-dns
kubectl describe clusterrolebinding system:kube-dns

3. Verify the fix

kubectl apply -f kube-dns.yaml

kubectl get pod -n kube-system -o wide | grep kube-dns
kube-dns-594c5b5cb5-6wttp 3/3 Running 0 13m 10.244.2.29 k8s-node1 <none> <none>

kubectl describe pod kube-dns-594c5b5cb5-6wttp -n kube-system
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 48s default-scheduler Successfully assigned kube-system/kube-dns-594c5b5cb5-6wttp to k8s-node1
Normal Pulled 47s kubelet Container image "k8s.gcr.io/dns/k8s-dns-kube-dns:1.17.3" already present on machine
Normal Created 47s kubelet Created container kubedns
Normal Started 47s kubelet Started container kubedns
Normal Pulled 47s kubelet Container image "k8s.gcr.io/dns/k8s-dns-dnsmasq-nanny:1.17.3" already present on machine
Normal Created 47s kubelet Created container dnsmasq
Normal Started 47s kubelet Started container dnsmasq
Normal Pulled 47s kubelet Container image "k8s.gcr.io/dns/k8s-dns-sidecar:1.17.3" already present on machine
Normal Created 47s kubelet Created container sidecar
Normal Started 47s kubelet Started container sidecar

6.2 CoreDNS

The successor to kube-dns. CoreDNS is more efficient and has a smaller resource footprint.

6.2.1 Installing CoreDNS

wget https://github.com/coredns/deployment/archive/refs/tags/coredns-1.14.0.tar.gz
tar zxvf coredns-1.14.0.tar.gz
cd deployment-coredns-1.14.0/kubernetes

# Deploy
./deploy.sh | kubectl apply -f -
kubectl delete --namespace=kube-system deployment kube-dns

# Uninstall
./rollback.sh | kubectl apply -f -
kubectl delete --namespace=kube-system deployment coredns

6.2.2 Supported DNS Records

  • Service

    • A record: ${my-svc}.${my-namespace}.svc.cluster.local; resolution depends on the Service type
      • a normal Service resolves to its Cluster IP
      • a Headless Service resolves to the list of selected Pod IPs
    • SRV record: _${my-port-name}._${my-port-protocol}.${my-svc}.${my-namespace}.svc.cluster.local
  • Pod

    • A record: ${pod-ip-address}.${my-namespace}.pod.cluster.local
    • with hostname and subdomain specified: ${hostname}.${custom-subdomain}.default.svc.cluster.local

Example:

cat > dns-test.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    name: nginx
spec:
  hostname: nginx
  subdomain: default-subdomain
  containers:
  - name: nginx
    image: nginx
    ports:
    - name: http
      containerPort: 80
---
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  labels:
    name: dnsutils
spec:
  containers:
  - image: tutum/dnsutils
    command:
    - sleep
    - "7200"
    name: dnsutils
EOF

kubectl apply -f dns-test.yaml

kubectl exec -it dnsutils /bin/sh
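Inside the dnsutils pod the record formats from 6.2.2 can be verified; a sketch (the dashed Pod IP is a placeholder, and the hostname.subdomain lookup additionally assumes a headless Service named default-subdomain exists):

# A record of a Service
nslookup kubernetes.default.svc.cluster.local

# A record of a Pod, using the dashed form of its IP
nslookup 10-244-2-29.default.pod.cluster.local

# hostname.subdomain form for the nginx Pod defined above
nslookup nginx.default-subdomain.default.svc.cluster.local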

6.3 Private and Upstream DNS Servers

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"acme.local": ["1.2.3.4"]}
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]

Queries are first sent to the DNS cache layer of kube-dns (the dnsmasq server). Dnsmasq checks the suffix of the query: names with the cluster suffix (e.g. ".cluster.local") are forwarded to kube-dns; names with a stub-domain suffix (e.g. ".acme.local") are sent to the configured private DNS server ["1.2.3.4"]; everything else is sent to the upstream DNS servers ["8.8.8.8", "8.8.4.4"].

img