---
title: "Troubleshooting Control Plane"
description: "Troubleshoot control plane failures for running cluster and bootstrap process."
aliases:
  - ../guides/troubleshooting-control-plane
---

<!-- markdownlint-disable MD026 -->

This guide is written as a series of topics, with a detailed answer for each topic.
It starts with the basics of the control plane and moves on to Talos specifics.

In this guide, we assume that the Talos client configuration and Talos API access are available.
The Kubernetes client configuration can be pulled from the control plane nodes with `talosctl -n <IP> kubeconfig`
(this command works before Kubernetes is fully booted).

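For example, a minimal sketch of fetching the client configuration to a local file and using it (the node IP `172.20.0.2` is illustrative):

```bash
# Write the cluster's kubeconfig to a local file, then use it with kubectl.
talosctl -n 172.20.0.2 kubeconfig ./kubeconfig
kubectl --kubeconfig ./kubeconfig get nodes
```
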
### What is a control plane node?

A control plane node is a node which:

- runs etcd, the Kubernetes database
- runs the Kubernetes control plane
  - kube-apiserver
  - kube-controller-manager
  - kube-scheduler
- serves as an administrative proxy to the worker nodes

These nodes are critical to the operation of your cluster.
Without control plane nodes, Kubernetes will not respond to changes in the
system, and certain central services may not be available.

Talos nodes which have a `.machine.type` of `controlplane` are control plane nodes.

Control plane nodes are tainted by default to prevent workloads from being scheduled onto them.

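The taint can be verified with `kubectl`; for example (using the illustrative node name `talos-default-master-1` that appears later in this guide):

```bash
# Show the taints applied to a control plane node.
kubectl describe node talos-default-master-1 | grep Taints
```
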
### How many control plane nodes should be deployed?

Because control plane nodes are so important, they should be
deployed with redundancy to ensure consistent, reliable operation of the cluster
during upgrades, reboots, hardware failures, and other such events.
This is also known as high availability, or HA.
Non-HA clusters are sometimes used as test clusters, CI clusters, or in specific scenarios
which warrant the loss of redundancy, but they should almost never be used in production.

Maintaining the proper count of control plane nodes is also critical.
The etcd database operates on the principles of membership and quorum:
a majority of the members must be available for the cluster to make progress, so
membership should always be an odd number (an even member count adds overhead without improving fault tolerance),
and each additional member increases the coordination overhead.
Therefore, the number of control plane nodes should almost always be 3.
In some particularly large or distributed clusters, the count may be 5, but this
is very rare.

See [this document]({{< relref "../learn-more/concepts#control-planes-are-not-linear-replicas" >}}) for more information on the topic.

### What is the control plane endpoint?

The Kubernetes control plane endpoint is the single canonical URL by which the
Kubernetes API is accessed.
Especially with high-availability (HA) control planes, this endpoint commonly does not point to a single Kubernetes API server
directly, but instead points to a load balancer or a DNS name which may
have multiple `A` and `AAAA` records.

Like Talos' own API, the Kubernetes API is secured with mutual TLS, client
certs, and a common Certificate Authority (CA).
Unlike general-purpose websites, there is no need for an upstream CA, so tools
such as cert-manager, services such as Let's Encrypt, or purchased products such
as validated TLS certificates are not required.
Encryption, however, _is_ required, and hence the URL scheme will always be `https://`.

By default, the Kubernetes API server in Talos runs on port 6443.
As such, the control plane endpoint URLs for Talos will almost always be of the form
`https://endpoint:6443`, noting that the port is _required_, since it is not the
`https` default of `443`.
The `endpoint` above may be a DNS name or IP address, but it should
ultimately be directed to the _set_ of all control plane nodes, as opposed to a
single one.

As mentioned above, this can be achieved by a number of strategies, including:

- an external load balancer
- DNS records
- Talos-builtin shared IP ([VIP]({{< relref "../talos-guides/network/vip" >}}))
- BGP peering of a shared IP (such as with [kube-vip](https://kube-vip.io))

Using a DNS name here is usually a good idea: it is the most flexible
option, since it can be combined with any _other_ option while offering
a layer of abstraction.
It allows the underlying IP addresses to change over time without impacting the
canonical URL.

Unlike most services in Kubernetes, the API server runs with host networking,
meaning that it shares the network namespace with the host.
This means you can use the IP address(es) of the host to refer to the Kubernetes
API server.

For availability of the API, it is important that any load balancer be aware of
the health of the backend API servers.
This makes a load balancer-based system valuable for minimizing disruptions during
common node lifecycle operations like reboots and upgrades.

It is critical that the control plane endpoint work correctly during the cluster bootstrap phase, as nodes discover
each other using the control plane endpoint.

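A quick, hedged way to verify that the endpoint is reachable is to query the API server's `/version` path (anonymous access to `/version` is permitted by the default Kubernetes RBAC rules, so no client certificate is needed for this check):

```bash
# -k skips CA verification; this only checks reachability, not trust.
curl -k https://<endpoint>:6443/version
```
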
### kubelet is not running on control plane node

The `kubelet` service should be running on control plane nodes as soon as networking is configured:

```bash
$ talosctl -n <IP> service kubelet
NODE     172.20.0.2
ID       kubelet
STATE    Running
HEALTH   OK
EVENTS   [Running]: Health check successful (2m54s ago)
         [Running]: Health check failed: Get "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused (3m4s ago)
         [Running]: Started task kubelet (PID 2334) for container kubelet (3m6s ago)
         [Preparing]: Creating service runner (3m6s ago)
         [Preparing]: Running pre state (3m15s ago)
         [Waiting]: Waiting for service "timed" to be "up" (3m15s ago)
         [Waiting]: Waiting for service "cri" to be "up", service "timed" to be "up" (3m16s ago)
         [Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", service "timed" to be "up" (3m18s ago)
```

If the `kubelet` is not running, it may be due to an invalid configuration.
Check the `kubelet` logs with the `talosctl logs` command:

```bash
$ talosctl -n <IP> logs kubelet
172.20.0.2: I0305 20:45:07.756948    2334 controller.go:101] kubelet config controller: starting controller
172.20.0.2: I0305 20:45:07.756995    2334 controller.go:267] kubelet config controller: ensuring filesystem is set up correctly
172.20.0.2: I0305 20:45:07.757000    2334 fsstore.go:59] kubelet config controller: initializing config checkpoints directory "/etc/kubernetes/kubelet/store"
```

### etcd is not running

By far the most likely cause of `etcd` not running is that the cluster has
not yet been bootstrapped, or that bootstrapping is currently in progress.
The `talosctl bootstrap` command must be run manually and only _once_ per
cluster, and this step is commonly missed.
Once a node is bootstrapped, it will start `etcd` and, over the course of a
minute or two (depending on the download speed of the control plane nodes), the
other control plane nodes should discover it and join themselves to the cluster.

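If the cluster has not been bootstrapped yet, the bootstrap call looks like this (run it against exactly one control plane node, only once):

```bash
# Bootstrap etcd on a single control plane node.
talosctl -n <IP> bootstrap
```
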
Also, `etcd` will only run on control plane nodes.
If a node is designated as a worker node, you should not expect `etcd` to be
running on it.

When a node boots for the first time, the `etcd` data directory (`/var/lib/etcd`) is empty, and it will only be populated when `etcd` is launched.

If `etcd` is not running, check the `etcd` service state:

```bash
$ talosctl -n <IP> service etcd
NODE     172.20.0.2
ID       etcd
STATE    Running
HEALTH   OK
EVENTS   [Running]: Health check successful (3m21s ago)
         [Running]: Started task etcd (PID 2343) for container etcd (3m26s ago)
         [Preparing]: Creating service runner (3m26s ago)
         [Preparing]: Running pre state (3m26s ago)
         [Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", service "timed" to be "up" (3m26s ago)
```

If the service is stuck in the `Preparing` state on the bootstrap node, the cause may be a slow network: at this stage,
Talos pulls the `etcd` image from the container registry.

If the `etcd` service is crashing and restarting, check its logs with `talosctl -n <IP> logs etcd`.
The most common reasons for crashes are:

- wrong arguments passed via `extraArgs` in the configuration;
- booting Talos on a non-empty disk with a previous Talos installation, so that `/var/lib/etcd` contains data from the old cluster.

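In the latter case, one hedged way to return such a node to a clean state is `talosctl reset`, which wipes the node's ephemeral data (including `/var/lib/etcd`); use with care, as this is destructive:

```bash
# Destructive: wipes the node's ephemeral partition and reboots it.
talosctl -n <IP> reset
```
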
### etcd is not running on non-bootstrap control plane node

The `etcd` service on control plane nodes which were not the target of the cluster bootstrap will wait until the bootstrapped control plane node has completed its bootstrap sequence.
The bootstrap and discovery processes may take a few minutes to complete.
As soon as the bootstrapped node starts its Kubernetes control plane components, `kubectl get endpoints` will return the IP of the bootstrapped control plane node.
At this point, the other control plane nodes will start their `etcd` services, join the cluster, and then start their own Kubernetes control plane components.

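Progress can be watched from outside the cluster by querying the `kubernetes` service endpoints (the output below is illustrative):

```bash
$ kubectl get endpoints kubernetes
NAME         ENDPOINTS         AGE
kubernetes   172.20.0.2:6443   5m
```
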
### Kubernetes static pod definitions are not generated

Talos should write the static pod definitions for the Kubernetes control plane
in `/etc/kubernetes/manifests`:

```bash
$ talosctl -n <IP> ls /etc/kubernetes/manifests
NODE         NAME
172.20.0.2   .
172.20.0.2   talos-kube-apiserver.yaml
172.20.0.2   talos-kube-controller-manager.yaml
172.20.0.2   talos-kube-scheduler.yaml
```

If the static pod definitions are not rendered, check `etcd` and `kubelet` service health (see above)
and the controller runtime logs (`talosctl logs controller-runtime`).

### Talos prints error `an error on the server ("") has prevented the request from succeeding`

This is expected during initial cluster bootstrap and sometimes after a reboot:

```bash
[   70.093289] [talos] task labelNodeAsMaster (1/1): starting
[   80.094038] [talos] retrying error: an error on the server ("") has prevented the request from succeeding (get nodes talos-default-master-1)
```

Initially, the `kube-apiserver` component is not running yet, and it takes some time to come fully up
during bootstrap (its image must be pulled from the Internet, etc.).
Once the control plane endpoint is up, Talos should continue with its boot
process.

If Talos doesn't proceed, it may be due to a configuration issue.

In any case, the status of the control plane components on each control plane node can be checked with `talosctl containers -k`:

```bash
$ talosctl -n <IP> containers --kubernetes
NODE         NAMESPACE   ID                                                                     IMAGE                                            PID    STATUS
172.20.0.2   k8s.io      kube-system/kube-apiserver-talos-default-master-1                     k8s.gcr.io/pause:3.2                             2539   SANDBOX_READY
172.20.0.2   k8s.io      └─ kube-system/kube-apiserver-talos-default-master-1:kube-apiserver   k8s.gcr.io/kube-apiserver:v{{< k8s_release >}}   2572   CONTAINER_RUNNING
```

If `kube-apiserver` shows as `CONTAINER_EXITED`, it might have exited due to a configuration error.
Logs can be checked with `talosctl logs --kubernetes` (or with `-k` as a shorthand):

```bash
$ talosctl -n <IP> logs -k kube-system/kube-apiserver-talos-default-master-1:kube-apiserver
172.20.0.2: 2021-03-05T20:46:13.133902064Z stderr F 2021/03/05 20:46:13 Running command:
172.20.0.2: 2021-03-05T20:46:13.133933824Z stderr F Command env: (log-file=, also-stdout=false, redirect-stderr=true)
172.20.0.2: 2021-03-05T20:46:13.133938524Z stderr F Run from directory:
172.20.0.2: 2021-03-05T20:46:13.13394154Z stderr F Executable path: /usr/local/bin/kube-apiserver
...
```

### Talos prints error `nodes "talos-default-master-1" not found`

This error means that `kube-apiserver` is up and the control plane endpoint is healthy, but the `kubelet` hasn't received
its client certificate yet, so it wasn't able to register itself with Kubernetes.
The Kubernetes controller manager (`kube-controller-manager`) is responsible for monitoring the certificate
signing requests (CSRs) and issuing certificates for each of them.
The kubelet is responsible for generating and submitting the CSRs for its
associated node.

For the `kubelet` to get its client certificate, then, the Kubernetes control plane
must be healthy:

- the API server is running and available at the Kubernetes control plane
  endpoint URL
- the controller manager is running and a leader has been elected

The state of any CSRs can be checked with `kubectl get csr`:

```bash
$ kubectl get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                 CONDITION
csr-jcn9j   14m   kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:q9pyzr   Approved,Issued
csr-p6b9q   14m   kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:q9pyzr   Approved,Issued
csr-sw6rm   14m   kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:q9pyzr   Approved,Issued
csr-vlghg   14m   kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:q9pyzr   Approved,Issued
```

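If a CSR remains in the `Pending` condition (for example, because the controller manager was unhealthy when it was submitted), it can be approved manually; `csr-jcn9j` here is just the example name from the output above:

```bash
# Manually approve a stuck kubelet client CSR.
kubectl certificate approve csr-jcn9j
```
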
### Talos prints error `node not ready`

A Node in Kubernetes is marked as `Ready` only once its CNI is up.
It takes a minute or two for the CNI images to be pulled and for the CNI to start.
If the node is stuck in this state for too long, check the CNI pods and logs with `kubectl`.
Usually, CNI-related resources are created in the `kube-system` namespace.

For example, for the default Talos Flannel CNI:

```bash
$ kubectl -n kube-system get pods
NAME                 READY   STATUS    RESTARTS   AGE
...
kube-flannel-25drx   1/1     Running   0          23m
kube-flannel-8lmb6   1/1     Running   0          23m
kube-flannel-gl7nx   1/1     Running   0          23m
kube-flannel-jknt9   1/1     Running   0          23m
...
```

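The logs of a suspect CNI pod can then be inspected by name, taking a pod name from the listing above:

```bash
# Inspect the logs of one Flannel pod from the listing above.
kubectl -n kube-system logs kube-flannel-25drx
```
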
### Talos prints error `x509: certificate signed by unknown authority`

The full error might look like:

```bash
x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
```

Usually, this occurs because the control plane endpoint points to a different
cluster than the one for which the client certificate was generated.
If a node was recycled between clusters, make sure it was properly wiped between
uses.
If a client has multiple client configurations, make sure you are matching the correct `talosconfig` with the
correct cluster.

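When juggling multiple client configurations, the contexts known to `talosctl` can be listed, and the active one switched, with:

```bash
# List all known talosconfig contexts, then select the right one.
talosctl config contexts
talosctl config context <context-name>
```
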
### etcd is running on bootstrap node, but stuck in `pre` state on non-bootstrap nodes

Please see the question `etcd is not running on non-bootstrap control plane node` above.

### Checking `kube-controller-manager` and `kube-scheduler`

If the control plane endpoint is up, the status of the pods can be ascertained with `kubectl`:

```bash
$ kubectl get pods -n kube-system -l k8s-app=kube-controller-manager
NAME                                             READY   STATUS    RESTARTS   AGE
kube-controller-manager-talos-default-master-1   1/1     Running   0          28m
kube-controller-manager-talos-default-master-2   1/1     Running   0          28m
kube-controller-manager-talos-default-master-3   1/1     Running   0          28m
```

If the control plane endpoint is not yet up, the container status of the control plane components can be queried with
`talosctl containers --kubernetes`:

```bash
$ talosctl -n <IP> c -k
NODE         NAMESPACE   ID                                                                                      IMAGE                                                     PID    STATUS
...
172.20.0.2   k8s.io      kube-system/kube-controller-manager-talos-default-master-1                              k8s.gcr.io/pause:3.2                                      2547   SANDBOX_READY
172.20.0.2   k8s.io      └─ kube-system/kube-controller-manager-talos-default-master-1:kube-controller-manager   k8s.gcr.io/kube-controller-manager:v{{< k8s_release >}}   2580   CONTAINER_RUNNING
172.20.0.2   k8s.io      kube-system/kube-scheduler-talos-default-master-1                                       k8s.gcr.io/pause:3.2                                      2638   SANDBOX_READY
172.20.0.2   k8s.io      └─ kube-system/kube-scheduler-talos-default-master-1:kube-scheduler                     k8s.gcr.io/kube-scheduler:v{{< k8s_release >}}            2670   CONTAINER_RUNNING
...
```

If some of the containers are not running, it could be that the image is still being pulled.
Otherwise, the process might be crashing.
The logs can be checked with `talosctl logs --kubernetes <containerID>`:

```bash
$ talosctl -n <IP> logs -k kube-system/kube-controller-manager-talos-default-master-1:kube-controller-manager
172.20.0.3: 2021-03-09T13:59:34.291667526Z stderr F 2021/03/09 13:59:34 Running command:
172.20.0.3: 2021-03-09T13:59:34.291702262Z stderr F Command env: (log-file=, also-stdout=false, redirect-stderr=true)
172.20.0.3: 2021-03-09T13:59:34.291707121Z stderr F Run from directory:
172.20.0.3: 2021-03-09T13:59:34.291710908Z stderr F Executable path: /usr/local/bin/kube-controller-manager
172.20.0.3: 2021-03-09T13:59:34.291719163Z stderr F Args (comma-delimited): /usr/local/bin/kube-controller-manager,--allocate-node-cidrs=true,--cloud-provider=,--cluster-cidr=10.244.0.0/16,--service-cluster-ip-range=10.96.0.0/12,--cluster-signing-cert-file=/system/secrets/kubernetes/kube-controller-manager/ca.crt,--cluster-signing-key-file=/system/secrets/kubernetes/kube-controller-manager/ca.key,--configure-cloud-routes=false,--kubeconfig=/system/secrets/kubernetes/kube-controller-manager/kubeconfig,--leader-elect=true,--root-ca-file=/system/secrets/kubernetes/kube-controller-manager/ca.crt,--service-account-private-key-file=/system/secrets/kubernetes/kube-controller-manager/service-account.key,--profiling=false
172.20.0.3: 2021-03-09T13:59:34.293870359Z stderr F 2021/03/09 13:59:34 Now listening for interrupts
172.20.0.3: 2021-03-09T13:59:34.761113762Z stdout F I0309 13:59:34.760982      10 serving.go:331] Generated self-signed cert in-memory
...
```

### Checking controller runtime logs

Talos runs a set of controllers which operate on resources to build and support the Kubernetes control plane.

Some debugging information can be queried from the controller logs with `talosctl logs controller-runtime`:

```bash
$ talosctl -n <IP> logs controller-runtime
172.20.0.2: 2021/03/09 13:57:11 secrets.EtcdController: controller starting
172.20.0.2: 2021/03/09 13:57:11 config.MachineTypeController: controller starting
172.20.0.2: 2021/03/09 13:57:11 k8s.ManifestApplyController: controller starting
172.20.0.2: 2021/03/09 13:57:11 v1alpha1.BootstrapStatusController: controller starting
172.20.0.2: 2021/03/09 13:57:11 v1alpha1.TimeStatusController: controller starting
...
```

Controllers continuously run a reconcile loop, so at any point in time, they may be starting, failing, or restarting.
This is expected behavior.

Things to look for:

`v1alpha1.BootstrapStatusController: bootkube initialized status not found`: the control plane is not self-hosted, and is running with static pods.

`k8s.KubeletStaticPodController: writing static pod "/etc/kubernetes/manifests/talos-kube-apiserver.yaml"`: static pod definitions were rendered successfully.

`k8s.ManifestApplyController: controller failed: error creating mapping for object /v1/Secret/bootstrap-token-q9pyzr: an error on the server ("") has prevented the request from succeeding`: the control plane endpoint is not up yet, so bootstrap manifests can't be injected; the controller is going to retry.

`k8s.KubeletStaticPodController: controller failed: error refreshing pod status: error fetching pod status: an error on the server ("Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)") has prevented the request from succeeding`: the kubelet hasn't been able to contact `kube-apiserver` yet to push pod status; the controller
is going to retry.

`k8s.ManifestApplyController: created rbac.authorization.k8s.io/v1/ClusterRole/psp:privileged`: one of the bootstrap manifests was successfully applied.

`secrets.KubernetesController: controller failed: missing cluster.aggregatorCA secret`: Talos is running with a 0.8 configuration; if the cluster was upgraded from 0.8, this is expected, and the conversion process will fix the machine config
automatically.
If this cluster was bootstrapped with version 0.9, the machine configuration should be regenerated with the 0.9 `talosctl`.

If there are no new messages in the `controller-runtime` log, it means that the controllers have successfully finished reconciling, and that the current system state is the desired system state.

### Checking static pod definitions

Talos generates static pod definitions for the `kube-apiserver`, `kube-controller-manager`, and `kube-scheduler`
components based on its machine configuration.
These definitions can be checked as resources with `talosctl get staticpods`:

```bash
$ talosctl -n <IP> get staticpods -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: StaticPods.kubernetes.talos.dev
    id: kube-apiserver
    version: 2
    phase: running
    finalizers:
        - k8s.StaticPodStatus("kube-apiserver")
spec:
    apiVersion: v1
    kind: Pod
    metadata:
        annotations:
            talos.dev/config-version: "1"
            talos.dev/secrets-version: "1"
        creationTimestamp: null
        labels:
            k8s-app: kube-apiserver
            tier: control-plane
        name: kube-apiserver
        namespace: kube-system
...
```

The status of the static pods can be queried with `talosctl get staticpodstatus`:

```bash
$ talosctl -n <IP> get staticpodstatus
NODE         NAMESPACE      TYPE              ID                                                            VERSION   READY
172.20.0.2   controlplane   StaticPodStatus   kube-system/kube-apiserver-talos-default-master-1             1         True
172.20.0.2   controlplane   StaticPodStatus   kube-system/kube-controller-manager-talos-default-master-1   1         True
172.20.0.2   controlplane   StaticPodStatus   kube-system/kube-scheduler-talos-default-master-1             1         True
```

The most important status field is `READY`, which is the last column printed.
The complete status can be fetched by adding the `-o yaml` flag.

### Checking bootstrap manifests

As part of the bootstrap process, Talos injects bootstrap manifests into the Kubernetes API server.
There are two kinds of these manifests: system manifests built into Talos, and extra manifests downloaded by Talos (a custom CNI, extra manifests listed in the machine config):

```bash
$ talosctl -n <IP> get manifests
NODE         NAMESPACE      TYPE       ID                                                         VERSION
172.20.0.2   controlplane   Manifest   00-kubelet-bootstrapping-token                             1
172.20.0.2   controlplane   Manifest   01-csr-approver-role-binding                               1
172.20.0.2   controlplane   Manifest   01-csr-node-bootstrap                                      1
172.20.0.2   controlplane   Manifest   01-csr-renewal-role-binding                                1
172.20.0.2   controlplane   Manifest   02-kube-system-sa-role-binding                             1
172.20.0.2   controlplane   Manifest   03-default-pod-security-policy                             1
172.20.0.2   controlplane   Manifest   05-https://docs.projectcalico.org/manifests/calico.yaml   1
172.20.0.2   controlplane   Manifest   10-kube-proxy                                              1
172.20.0.2   controlplane   Manifest   11-core-dns                                                1
172.20.0.2   controlplane   Manifest   11-core-dns-svc                                            1
172.20.0.2   controlplane   Manifest   11-kube-config-in-cluster                                  1
```

Details of each manifest can be queried by adding `-o yaml`:

```bash
$ talosctl -n <IP> get manifests 01-csr-approver-role-binding --namespace=controlplane -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: Manifests.kubernetes.talos.dev
    id: 01-csr-approver-role-binding
    version: 1
    phase: running
spec:
    - apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: system-bootstrap-approve-node-client-csr
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: system:certificates.k8s.io:certificatesigningrequests:nodeclient
      subjects:
        - apiGroup: rbac.authorization.k8s.io
          kind: Group
          name: system:bootstrappers
```

### Worker node is stuck with `apid` health check failures

Control plane nodes have enough secret material to generate `apid` server certificates, but worker nodes
depend on the control plane `trustd` services to generate certificates.
Worker nodes wait for their `kubelet` to join the cluster.
Then the Talos `apid` queries the Kubernetes endpoints via the control plane
endpoint to find `trustd` endpoints.
The worker nodes then use `trustd` to request and receive their certificate.

So if `apid` health checks are failing on a worker node:

- make sure the control plane endpoint is healthy
- check that the worker node's `kubelet` has joined the cluster
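
A sketch of checking both conditions (the first command relies on anonymous access to `/version`, which the default Kubernetes RBAC rules allow; the second assumes a working `kubeconfig`):

```bash
# Check that the control plane endpoint answers:
curl -k https://<endpoint>:6443/version

# Check that the worker node has registered with Kubernetes:
kubectl get nodes
```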