应用自测与调试

运行应用时，不可避免的需要定位问题。前面我们介绍了如何使用 kubectl get pods 来查询 pod 的简单信息。除此之外，还有一系列的方法来获取应用的更详细信息。

使用 `kubectl describe pod` 命令获取 pod 详情

与之前的例子类似，我们使用一个 Deployment 来创建两个 pod。

`application/nginx-with-request.yaml`
`apiVersion: apps/v1 kind: Deployment metadata: name: nginx-deployment spec: selector: matchLabels: app: nginx replicas: 2 template: metadata: labels: app: nginx spec: containers: - name: nginx image: nginx resources: limits: memory: "128Mi" cpu: "500m" ports: - containerPort: 80`

application/nginx-with-request.yaml Copy application/nginx-with-request.yaml to clipboard

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            memory: "128Mi"
            cpu: "500m"
        ports:
        - containerPort: 80

使用如下命令创建 deployment：

kubectl apply -f https://k8s.io/examples/application/nginx-with-request.yaml

deployment.apps/nginx-deployment created

使用如下命令查看 pod 状态：

kubectl get pods

NAME                                READY     STATUS    RESTARTS   AGE
nginx-deployment-1006230814-6winp   1/1       Running   0          11s
nginx-deployment-1006230814-fmgu3   1/1       Running   0          11s

我们可以使用 kubectl describe pod 命令来查询每个 pod 的更多信息，比如：

kubectl describe pod nginx-deployment-1006230814-6winp

Name:		nginx-deployment-1006230814-6winp
Namespace:	default
Node:		kubernetes-node-wul5/10.240.0.9
Start Time:	Thu, 24 Mar 2016 01:39:49 +0000
Labels:		app=nginx,pod-template-hash=1006230814
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"nginx-deployment-1956810328","uid":"14e607e7-8ba1-11e7-b5cb-fa16" ...
Status:		Running
IP:		10.244.0.6
Controllers:	ReplicaSet/nginx-deployment-1006230814
Containers:
  nginx:
    Container ID:	docker://90315cc9f513c724e9957a4788d3e625a078de84750f244a40f97ae355eb1149
    Image:		nginx
    Image ID:		docker://6f62f48c4e55d700cf3eb1b5e33fa051802986b77b874cc351cce539e5163707
    Port:		80/TCP
    QoS Tier:
      cpu:	Guaranteed
      memory:	Guaranteed
    Limits:
      cpu:	500m
      memory:	128Mi
    Requests:
      memory:		128Mi
      cpu:		500m
    State:		Running
      Started:		Thu, 24 Mar 2016 01:39:51 +0000
    Ready:		True
    Restart Count:	0
    Environment:        <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-5kdvl (ro)
Conditions:
  Type          Status
  Initialized   True
  Ready         True
  PodScheduled  True
Volumes:
  default-token-4bcbi:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-4bcbi
    Optional:   false
QoS Class:      Guaranteed
Node-Selectors: <none>
Tolerations:    <none>
Events:
  FirstSeen	LastSeen	Count	From					SubobjectPath		Type		Reason		Message
  ---------	--------	-----	----					-------------		--------	------		-------
  54s		54s		1	{default-scheduler }						Normal		Scheduled	Successfully assigned nginx-deployment-1006230814-6winp to kubernetes-node-wul5
  54s		54s		1	{kubelet kubernetes-node-wul5}	spec.containers{nginx}	Normal		Pulling		pulling image "nginx"
  53s		53s		1	{kubelet kubernetes-node-wul5}	spec.containers{nginx}	Normal		Pulled		Successfully pulled image "nginx"
  53s		53s		1	{kubelet kubernetes-node-wul5}	spec.containers{nginx}	Normal		Created		Created container with docker id 90315cc9f513
  53s		53s		1	{kubelet kubernetes-node-wul5}	spec.containers{nginx}	Normal		Started		Started container with docker id 90315cc9f513

这里可以看到容器和 pod 的 label、资源需求等配置信息，还可以看到状态、就绪状态、重启次数、事件等状态信息。

容器状态包括 Waiting、Running 和 Terminated。根据状态的不同，将提供额外的信息——在这里您可以看到，对于处于运行状态的容器，系统会告诉您容器的启动时间。

Ready 指示是否通过了最后一个就绪探测。 (在本例中，容器没有配置就绪探测；如果没有配置就绪探测，则假定容器已经就绪。)

Restart Count 告诉您容器已重启的次数；这些信息对于定位配置了“always”重启策略的容器持续崩溃问题非常有用。

目前，唯一与 Pod 有关的状态是就绪状态，这表明 Pod 能够为请求提供服务，并且应该添加到相应服务的负载平衡池中。

最后，您将看到与 Pod 相关的近期事件。系统通过指示第一次和最后一次看到它以及看到它的次数来压缩多个相同的事件。 “From”表示记录事件的组件， “SubobjectPath”告诉您引用了哪个对象(例如pod中的容器)， “Reason”和“Message”告诉您发生了什么。

例子: 调试 Pending Pods

一个常见的场景是，当你创建了一个 Pod，但是他没有被调度到任何一个 node，可以使用 event 来调试。比如说，这个 Pod 需求的资源比较多，没有任何一个 node 能够满足，或者它指定了一个标签，没有 node 被匹配到。假如，我们创建之前的 Deployment 时指定副本数是5（不再是2），并且需求 600 millicore（不再是 500），对于一个4个节点的集群并且每个节点只有1个CPU。这个情况下，至少有一个 Pod 不能被调度。（需要注意的是，其他集群组件，比如 fluentd、skydns等等会在每个 node 上运行，如果我们需求 1000 millicore，那么将不会有 Pod 会被调度。）

kubectl get pods

NAME                                READY     STATUS    RESTARTS   AGE
nginx-deployment-1006230814-6winp   1/1       Running   0          7m
nginx-deployment-1006230814-fmgu3   1/1       Running   0          7m
nginx-deployment-1370807587-6ekbw   1/1       Running   0          1m
nginx-deployment-1370807587-fg172   0/1       Pending   0          1m
nginx-deployment-1370807587-fz9sd   0/1       Pending   0          1m

为了查找 Pod nginx-deployment-1370807587-fz9sd 没有运行的原因，我们可以使用 kubectl describe pod 命令查询处理 pending 状态的 Pod：

kubectl describe pod nginx-deployment-1370807587-fz9sd

  Name:		nginx-deployment-1370807587-fz9sd
  Namespace:	default
  Node:		/
  Labels:		app=nginx,pod-template-hash=1370807587
  Status:		Pending
  IP:
  Controllers:	ReplicaSet/nginx-deployment-1370807587
  Containers:
    nginx:
      Image:	nginx
      Port:	80/TCP
      QoS Tier:
        memory:	Guaranteed
        cpu:	Guaranteed
      Limits:
        cpu:	1
        memory:	128Mi
      Requests:
        cpu:	1
        memory:	128Mi
      Environment Variables:
  Volumes:
    default-token-4bcbi:
      Type:	Secret (a volume populated by a Secret)
      SecretName:	default-token-4bcbi
  Events:
    FirstSeen	LastSeen	Count	From			        SubobjectPath	Type		Reason			    Message
    ---------	--------	-----	----			        -------------	--------	------			    -------
    1m		    48s		    7	    {default-scheduler }			        Warning		FailedScheduling	pod (nginx-deployment-1370807587-fz9sd) failed to fit in any node
  fit failure on node (kubernetes-node-6ta5): Node didn't have enough resource: CPU, requested: 1000, used: 1420, capacity: 2000
  fit failure on node (kubernetes-node-wul5): Node didn't have enough resource: CPU, requested: 1000, used: 1100, capacity: 2000

这里你可以看到由调度器记录的事件，它表明了 Pod 不能被调度的原因是 FailedScheduling（也可能是其他值）。这个信息表明，没有任何 node 拥有足够多的资源。

要纠正这种情况，可以使用“kubectl scale”更新部署，以指定四个或更少的副本。 (或者你可以让 Pod 继续保持这个状态，这是无害的。)

与您在“kubectl describe pod”结尾处看到的一样，这些事件都将保存在 etcd 中，并提供关于集群中正在发生的事情的高级信息。如果需要列出所有事件，可使用命令：

kubectl get events

但是，需要注意的是，事件是按照 namespace 分组的。如果你对些些 namespace 下的对象感兴趣（比如查看 namespace my-namespace 下的 Pod 事件）, 你需要显式的在命令行中指定 namespace：

kubectl get events --namespace=my-namespace

查看所有 namespace 的事件，可使用 --all-namespaces 参数：

除了 kubectl describe pod 以外，另一种获取 Pod 额外信息（超越 kubectl get pod）的方法是给 kubectl get pod 增加 -o yaml 输出格式参数。这将以YAML格式为您提供比“kubectl describe pod”更多的信息——实际上是系统拥有的关于pod的所有信息。在这里，您将看到注释(没有标签限制的键值元数据，由Kubernetes系统组件在内部使用)、重启策略、端口和卷。

kubectl get pod nginx-deployment-1006230814-6winp -o yaml

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/created-by: |
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"nginx-deployment-1006230814","uid":"4c84c175-f161-11e5-9a78-42010af00005","apiVersion":"extensions","resourceVersion":"133434"}}
  creationTimestamp: 2016-03-24T01:39:50Z
  generateName: nginx-deployment-1006230814-
  labels:
    app: nginx
    pod-template-hash: "1006230814"
  name: nginx-deployment-1006230814-6winp
  namespace: default
  resourceVersion: "133447"
  uid: 4c879808-f161-11e5-9a78-42010af00005
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
    ports:
    - containerPort: 80
      protocol: TCP
    resources:
      limits:
        cpu: 500m
        memory: 128Mi
      requests:
        cpu: 500m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-4bcbi
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: kubernetes-node-wul5
  restartPolicy: Always
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  volumes:
  - name: default-token-4bcbi
    secret:
      secretName: default-token-4bcbi
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2016-03-24T01:39:51Z
    status: "True"
    type: Ready
  containerStatuses:
  - containerID: docker://90315cc9f513c724e9957a4788d3e625a078de84750f244a40f97ae355eb1149
    image: nginx
    imageID: docker://6f62f48c4e55d700cf3eb1b5e33fa051802986b77b874cc351cce539e5163707
    lastState: {}
    name: nginx
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2016-03-24T01:39:51Z
  hostIP: 10.240.0.9
  phase: Running
  podIP: 10.244.0.6
  startTime: 2016-03-24T01:39:49Z

例子: 调试 down/unreachable node

有时候，在调试时，查看节点的状态是很有用的——例如，因为您已经注意到节点上运行的 Pod 的奇怪行为，或者想了解为什么 Pod 不会调度到节点上。与Pods一样，您可以使用 kubectl describe node 和 kubectl get node -o yaml 来查询节点的详细信息。例如，如果某个节点关闭(与网络断开连接，或者 kubelet 挂掉，无法重新启动，等等)，您将看到以下情况。请注意显示节点未就绪的事件，也请注意 pod 不再运行(它们在5分钟未就绪状态后被驱逐)。

kubectl get nodes

NAME                     STATUS       ROLES     AGE     VERSION
kubernetes-node-861h     NotReady     <none>    1h      v1.13.0
kubernetes-node-bols     Ready        <none>    1h      v1.13.0
kubernetes-node-st6x     Ready        <none>    1h      v1.13.0
kubernetes-node-unaj     Ready        <none>    1h      v1.13.0

kubectl describe node kubernetes-node-861h

Name:			kubernetes-node-861h
Role
Labels:		 kubernetes.io/arch=amd64
           kubernetes.io/os=linux
           kubernetes.io/hostname=kubernetes-node-861h
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:	Mon, 04 Sep 2017 17:13:23 +0800
Phase:
Conditions:
  Type		Status		LastHeartbeatTime			LastTransitionTime			Reason					Message
  ----    ------    -----------------     ------------------      ------          -------
  OutOfDisk             Unknown         Fri, 08 Sep 2017 16:04:28 +0800         Fri, 08 Sep 2017 16:20:58 +0800         NodeStatusUnknown       Kubelet stopped posting node status.
  MemoryPressure        Unknown         Fri, 08 Sep 2017 16:04:28 +0800         Fri, 08 Sep 2017 16:20:58 +0800         NodeStatusUnknown       Kubelet stopped posting node status.
  DiskPressure          Unknown         Fri, 08 Sep 2017 16:04:28 +0800         Fri, 08 Sep 2017 16:20:58 +0800         NodeStatusUnknown       Kubelet stopped posting node status.
  Ready                 Unknown         Fri, 08 Sep 2017 16:04:28 +0800         Fri, 08 Sep 2017 16:20:58 +0800         NodeStatusUnknown       Kubelet stopped posting node status.
Addresses:	10.240.115.55,104.197.0.26
Capacity:
 cpu:           2
 hugePages:     0
 memory:        4046788Ki
 pods:          110
Allocatable:
 cpu:           1500m
 hugePages:     0
 memory:        1479263Ki
 pods:          110
System Info:
 Machine ID:                    8e025a21a4254e11b028584d9d8b12c4
 System UUID:                   349075D1-D169-4F25-9F2A-E886850C47E3
 Boot ID:                       5cd18b37-c5bd-4658-94e0-e436d3f110e0
 Kernel Version:                4.4.0-31-generic
 OS Image:                      Debian GNU/Linux 8 (jessie)
 Operating System:              linux
 Architecture:                  amd64
 Container Runtime Version:     docker://1.12.5
 Kubelet Version:               v1.6.9+a3d1dfa6f4335
 Kube-Proxy Version:            v1.6.9+a3d1dfa6f4335
ExternalID:                     15233045891481496305
Non-terminated Pods:            (9 in total)
  Namespace                     Name                                            CPU Requests    CPU Limits      Memory Requests Memory Limits
  ---------                     ----                                            ------------    ----------      --------------- -------------
......
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits      Memory Requests         Memory Limits
  ------------  ----------      ---------------         -------------
  900m (60%)    2200m (146%)    1009286400 (66%)        5681286400 (375%)
Events:         <none>

kubectl get node kubernetes-node-861h -o yaml

apiVersion: v1
kind: Node
metadata:
  creationTimestamp: 2015-07-10T21:32:29Z
  labels:
    kubernetes.io/hostname: kubernetes-node-861h
  name: kubernetes-node-861h
  resourceVersion: "757"
  selfLink: /api/v1/nodes/kubernetes-node-861h
  uid: 2a69374e-274b-11e5-a234-42010af0d969
spec:
  externalID: "15233045891481496305"
  podCIDR: 10.244.0.0/24
  providerID: gce://striped-torus-760/us-central1-b/kubernetes-node-861h
status:
  addresses:
  - address: 10.240.115.55
    type: InternalIP
  - address: 104.197.0.26
    type: ExternalIP
  capacity:
    cpu: "1"
    memory: 3800808Ki
    pods: "100"
  conditions:
  - lastHeartbeatTime: 2015-07-10T21:34:32Z
    lastTransitionTime: 2015-07-10T21:35:15Z
    reason: Kubelet stopped posting node status.
    status: Unknown
    type: Ready
  nodeInfo:
    bootID: 4e316776-b40d-4f78-a4ea-ab0d73390897
    containerRuntimeVersion: docker://Unknown
    kernelVersion: 3.16.0-0.bpo.4-amd64
    kubeProxyVersion: v0.21.1-185-gffc5a86098dc01
    kubeletVersion: v0.21.1-185-gffc5a86098dc01
    machineID: ""
    osImage: Debian GNU/Linux 7 (wheezy)
    systemUUID: ABE5F6B4-D44B-108B-C46A-24CCE16C8B6E

接下来

了解更多的调试工具：

Black lives matter.

文档

应用自测与调试

使用 `kubectl describe pod` 命令获取 pod 详情

例子: 调试 Pending Pods

例子: 调试 down/unreachable node

接下来

反馈

应用自测与调试

使用 kubectl describe pod 命令获取 pod 详情

例子: 调试 Pending Pods

例子: 调试 down/unreachable node

接下来

反馈

使用 `kubectl describe pod` 命令获取 pod 详情