Managing and debugging Kubernetes (K8s) is hard. I thought I understood Kubernetes; I had provisioned clusters from scratch, installed Istio and a whole bunch of other tools, and fixed outages in the past. Then I joined an organization whose expertise was Kubernetes, and oh boy, did I have a lot to learn.
This isn’t a blog post about everything I’ve learned; that would be a fairly useless post, irrelevant within weeks. Instead, this is a post on how to debug clusters using concepts from Troubleshoot, which I’m hoping will stay relevant for a little while longer.
Note: This blog was originally published as an exploration by Replicated engineer Alexander Trelore on his own page here, and is republished here only lightly edited. Replicated doesn't necessarily recommend K9s to manage and visualize a cluster, but this was a nice way to help others understand Troubleshoot.
This section will be brief; it’s just here to make sure everyone is up to speed on the basics of debugging Kubernetes clusters.
The Kubernetes docs have a great set of resources for monitoring, logging, and troubleshooting.
Another great resource is this flowchart.
Lastly, I want to highlight a few tools that help visualize the cluster. Personally, I’m a visual learner, and the greater insight I have into the cluster, the better. For starters there’s k9s, a CLI tool that describes itself as a way to manage your cluster - but there are a few visualization views within it. Secondly, there’s Lens, the visualization choice of many (it’s on my to-do list to try on stream), which just looks phenomenal.
This should serve as a baseline for everyone to get up to speed. So let’s get started with the main course!
Replicated sponsors an open source project called Troubleshoot, follow the link for the official troubleshoot.sh page. There are two major sides to Troubleshoot, which are preflight checks and support bundles.
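Both ship as kubectl plugins. If you want to follow along later in this post, and assuming you have krew installed, something like the following should get you set up (the troubleshoot.sh docs have the canonical install instructions):
[.pre]kubectl krew install preflight
kubectl krew install support-bundle[.pre]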
Preflight checks are intended to be run before an application is installed onto a cluster. They allow the administrator to know things like, "is there enough CPU for my application?", "is the Kubernetes version at least version X?", and "do certain secrets exist?"
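As a quick taste (we won’t run this one in this post), a minimal preflight spec that checks the Kubernetes version might look something like this - a sketch based on my reading of the troubleshoot.sh docs, so double-check the field names there:
[.pre]apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: example-preflight
spec:
  analyzers:
    - clusterVersion:
        outcomes:
          - fail:
              when: "< 1.20.0"
              message: This application requires at least Kubernetes 1.20.0.
          - pass:
              message: Your cluster meets the minimum Kubernetes version requirement.[.pre]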
Support bundles are subtly different. Their main use case is “something broke, what can I inspect and share with the software’s maintainers?” They let cluster administrators answer the same kinds of questions as preflight checks, but they export a file that users can explore and share after the fact.
Both of these have three components: collectors, redactors, and analyzers.
Collectors allow administrators to collect (as the name implies) details about their cluster - from logs to cluster information and everything in between.
Redactors (again, as the name implies) allow users to redact information. The use case here is that a collector may pick up sensitive information such as database connection strings, and this step lets us remove it in a couple of different ways. By default, a few redactors are run automatically.
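You can also add your own. As a rough sketch (the connection string below is hypothetical, and the field names are best confirmed against the troubleshoot.sh docs), a redactor that masks a known sensitive value might look something like:
[.pre]apiVersion: troubleshoot.sh/v1beta2
kind: Redactor
metadata:
  name: connection-strings
spec:
  redactors:
    - name: database connection strings
      removals:
        values:
          # hypothetical literal value to mask wherever it appears
          - postgres://user:password@db.example.com:5432/mydb[.pre]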
Lastly, we have analyzers (also excellently named, if I may add). Instead of making users manually dig through an incredibly large amount of information, analyzers quickly highlight issues with the cluster.
For this section we’ll just spin up an empty K8s cluster with a couple of extra little things. Feel free to deploy as much or as little as you want to your cluster.
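If you don’t already have a cluster to play with, any throwaway local cluster will do; for example, assuming you have kind installed:
[.pre]kind create cluster --name troubleshoot-demo[.pre]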
Let’s create a simple password secret with the following command:
[.pre]kubectl create secret -n default generic mysecret --from-literal=password=hunter2[.pre]
Once your demo application (or real application) is ready to be troubleshot (troubleshooted?), we’ll explore how to write collectors, redactors, and analyzers.
There are many types of collectors to choose from, ranging from host-level information to copying files from pods. We’re going to copy a file from a pod.
[.pre]apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: example
spec:
  collectors:
    - copy:
        selector:
          - run=busybox
        namespace: default
        containerPath: /etc/foo
        containerName: busybox[.pre]
This will copy the path /etc/foo from the pod with the label run=busybox into the support bundle.
There are many more collectors; I’d heavily recommend exploring them and trying out the ones that may be useful to you.
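For instance, a logs collector is one I reach for constantly. A sketch of one that grabs logs from our busybox pod (it slots into the same SupportBundle spec as the copy collector above; check the docs for the exact fields) might look like:
[.pre]spec:
  collectors:
    - logs:
        name: busybox-logs   # subdirectory inside the bundle (assumed name)
        selector:
          - run=busybox
        namespace: default
        limits:
          maxLines: 10000[.pre]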
Redactors are slightly different in that they have their own manifest kind (i.e. it’s not SupportBundle or Preflight):
[.pre]apiVersion: troubleshoot.sh/v1beta2
kind: Redactor
metadata:
  name: example
spec:
  redactors:
    - name: all files
      removals:
        yamlPath:
          - password[.pre]
There are a few different ways to redact information; we’ve chosen yamlPath because our collector is collecting a YAML file.
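The effect, once the bundle is generated, is that the collected copy of secrets.yaml has its password value masked. Based on Troubleshoot’s default masking, I’d expect it to look something like this:
[.pre]username: Alexander
password: '***HIDDEN***'[.pre]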
Analyzers allow us to quickly identify issues with the cluster.
[.pre]apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: example
spec:
  analyzers:
    - yamlCompare:
        checkName: Compare YAML Example
        fileName: default/busybox/busybox/etc/foo/secrets.yaml
        path: username
        value: "Alexander"
        outcomes:
          - fail:
              when: "false"
              message: The collected data does not match the value.
          - pass:
              when: "true"
              message: The collected data matches the value[.pre]
Following our YAML obsession, we’re going to use yamlCompare, which lets us specify a file, a path, and a value to compare against. We can also specify the pass and fail conditions.
As with collectors and redactors, there are also many analyzers.
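To give a flavor of one more, a deploymentStatus analyzer can check that a deployment has ready replicas. A rough sketch (assuming a hypothetical deployment named api; field names are best confirmed against the troubleshoot.sh docs):
[.pre]spec:
  analyzers:
    - deploymentStatus:
        name: api            # hypothetical deployment name
        namespace: default
        outcomes:
          - fail:
              when: "< 1"
              message: The api deployment has no ready replicas.
          - pass:
              message: The api deployment has at least one ready replica.[.pre]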
Here's the command to create our secret, followed by the secrets.yaml file it reads from. (If you created mysecret earlier with --from-literal, delete it first; kubectl won't overwrite an existing secret.)
[.pre]kubectl create secret -n default generic mysecret --from-file=secrets.yaml[.pre]
[.pre]username: Alexander
password: "12345678"[.pre]
Here's the command to deploy our busybox pod, and the deployment.yaml file it applies:
[.pre]kubectl apply -f deployment.yaml[.pre]
[.pre]apiVersion: v1
kind: Pod
metadata:
  name: busybox
  labels:
    run: busybox
spec:
  containers:
    - command:
        - sleep
        - "3600"
      image: busybox
      name: busybox
      volumeMounts:
        - name: foo
          mountPath: /etc/foo/secrets.yaml # needed for volumeMounts
          subPath: secrets.yaml # needed otherwise it's a symlink
          readOnly: true
  volumes:
    - name: foo
      secret:
        secretName: mysecret
        optional: false[.pre]
Here's the command to generate the support bundle, and the support-bundle.yaml file it reads:
[.pre]kubectl support-bundle -f support-bundle.yaml[.pre]
[.pre]apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: example
spec:
  collectors:
    - copy:
        selector:
          - run=busybox
        namespace: default
        containerPath: /etc/foo
        containerName: busybox
  analyzers:
    - yamlCompare:
        checkName: Compare YAML Example
        fileName: default/busybox/busybox/etc/foo/secrets.yaml
        path: username
        value: "Alexander"
        outcomes:
          - fail:
              when: "false"
              message: The collected data does not match the value.
          - pass:
              when: "true"
              message: The collected data matches the value
---
apiVersion: troubleshoot.sh/v1beta2
kind: Redactor
metadata:
  name: example
spec:
  redactors:
    - name: all files
      removals:
        yamlPath:
          - password[.pre]
With this you should see a support bundle in your terminal. (You can ignore this if you used --interactive.)
You can then share this bundle with the maintainers of your cluster, or anyone else that’s interested.
To recap, we’ve created a secret, deployed a pod, and created a support bundle. We redacted the user’s password, and automatically confirmed the user’s username is correct.
I hope these examples get the point across of how automation can easily diagnose problems with your own cluster.