Add a new health check which verifies that the etcd members match the control plane nodes. Closes siderolabs#5553.
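A minimal sketch of the idea behind the check (the names and matching key are illustrative, not the actual Talos health-check code; it assumes both sides are identified by the same node names):

```
package check

import (
	"errors"
	"fmt"
	"strings"
)

// checkEtcdMembersMatchControlPlane reports a mismatch between the etcd
// member list and the set of control plane nodes, in either direction.
func checkEtcdMembersMatchControlPlane(members, controlPlaneNodes []string) error {
	memberSet := toSet(members)
	nodeSet := toSet(controlPlaneNodes)

	var problems []string

	for member := range memberSet {
		if _, ok := nodeSet[member]; !ok {
			problems = append(problems, fmt.Sprintf("etcd member %q has no matching control plane node", member))
		}
	}

	for node := range nodeSet {
		if _, ok := memberSet[node]; !ok {
			problems = append(problems, fmt.Sprintf("control plane node %q is not an etcd member", node))
		}
	}

	if len(problems) > 0 {
		return errors.New(strings.Join(problems, "; "))
	}

	return nil
}

func toSet(items []string) map[string]struct{} {
	set := make(map[string]struct{}, len(items))

	for _, item := range items {
		set[item] = struct{}{}
	}

	return set
}
```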
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
This fixes an error where an integration test becomes stuck with a message
like:
```
waiting for coredns to report ready: some pods are not ready: [coredns-868c687b7-g2z64]
```
After some random sequence of node restarts, one of the pods might become
"stuck" in the `Completed` state (as shown in `kubectl get pods`),
blocking the check, as the pod will never become ready.
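A hedged sketch of the fix's idea (illustrative, not the exact check code): pods that have run to completion are skipped before the readiness check, since they will never report ready again.

```
package check

import corev1 "k8s.io/api/core/v1"

// notReadyPods returns the names of pods that should block the check;
// pods in the Succeeded phase (shown as "Completed" by kubectl) are
// skipped because they will never become ready again.
func notReadyPods(pods []corev1.Pod) []string {
	var notReady []string

	for _, pod := range pods {
		if pod.Status.Phase == corev1.PodSucceeded {
			continue
		}

		ready := false

		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready = true

				break
			}
		}

		if !ready {
			notReady = append(notReady, pod.Name)
		}
	}

	return notReady
}
```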
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This fixes an issue where `talosctl upgrade-k8s` fails with an unhelpful
message if the version is specified as `v1.23.5` instead of `1.23.5`.
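A minimal sketch of the normalization (the helper name is hypothetical):

```
import "strings"

// normalizeVersion accepts both "v1.23.5" and "1.23.5" by stripping an
// optional leading "v" before the version is validated further.
func normalizeVersion(version string) string {
	return strings.TrimPrefix(version, "v")
}
```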
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Having polymorphic (spec type depends on ID) resources is not a good
idea, and it's not compatible with protobuf encoding.
Introduce new resources for each polymorphic sub-spec using the new Go 1.18
generic `typed.Resource` to reduce the boilerplate code.
(This still needs proper deepcopy-gen, but I'm skipping it for now, as
`K8sControlPlane` also had a broken deep copy.)
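An illustrative sketch of how generics remove that boilerplate; the type and method names below are hypothetical and do not reproduce the actual COSI `typed.Resource` API:

```
// Metadata is a stand-in for resource metadata (namespace, type, ID).
type Metadata struct {
	Namespace string
	Type      string
	ID        string
}

// Resource is a generic wrapper: one implementation of the Metadata/Spec
// accessors serves every concrete spec type.
type Resource[T any] struct {
	md   Metadata
	spec T
}

func (r *Resource[T]) Metadata() *Metadata { return &r.md }
func (r *Resource[T]) Spec() *T            { return &r.spec }

// ExampleConfigSpec stands in for one of the former polymorphic sub-specs;
// each sub-spec becomes its own strongly-typed resource.
type ExampleConfigSpec struct {
	Enabled bool
}

type ExampleConfig = Resource[ExampleConfigSpec]
```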
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
The containerd CRI plugin was merged into the main repo, but we were using
the old import path, so the constants coming from that module were outdated.
This fixes the image version for the pause container.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Use the last `:` in the image reference.
Handle the case when no version was discovered.
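A hedged sketch of the parsing logic (the helper name is illustrative):

```
import "strings"

// imageVersion splits an image reference on the last ':' so that a registry
// port (e.g. "registry:5000/pause" without a tag) is not mistaken for a
// version; the boolean reports whether a version was discovered at all.
func imageVersion(imageRef string) (string, bool) {
	idx := strings.LastIndex(imageRef, ":")
	if idx < 0 || strings.Contains(imageRef[idx+1:], "/") {
		return "", false
	}

	return imageRef[idx+1:], true
}
```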
See https://github.com/siderolabs/theila/issues/138
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This showed up frequently in recent integration-provision tests
(it might be related to the Kubernetes upgrade), but in any case such errors
should be retried.
Refactored the function to extract the retryable part.
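A generic sketch of the shape of that refactoring (the real code uses the project's retry helpers):

```
import "time"

// withRetries wraps the extracted retryable part in a simple retry loop.
func withRetries(attempts int, delay time.Duration, retryable func() error) error {
	var err error

	for i := 0; i < attempts; i++ {
		if err = retryable(); err == nil {
			return nil
		}

		time.Sleep(delay)
	}

	return err
}
```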
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Some failures can be fixed by updating the machine configuration.
Now `userDisks` and `userFiles` failures no longer make Talos enter a reboot
loop; instead, Talos pauses for 35 minutes.
Additionally, `apid` and `machined` are now started right after
containerd is up and running.
That makes it possible for the operator to connect to the node using
talosctl and fix the config.
Fixes: https://github.com/talos-systems/talos/issues/4669
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
With graceful kubelet shutdown (#5108), after a graceful node restart pods
on the restarted node might stay in the `Terminated` status, which breaks
the pod readiness check.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Allow cluster nodes to have multiple internal IP addresses when checking
for all Kubernetes nodes.
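A minimal sketch of the relaxed check using client-go node objects (the helper is illustrative, not the exact check code): collect every `InternalIP` address instead of assuming a single one.

```
import corev1 "k8s.io/api/core/v1"

// internalIPs returns all internal IP addresses reported by a node.
func internalIPs(node *corev1.Node) []string {
	var ips []string

	for _, addr := range node.Status.Addresses {
		if addr.Type == corev1.NodeInternalIP {
			ips = append(ips, addr.Address)
		}
	}

	return ips
}
```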
Fixes #4807
Signed-off-by: Seán C McCord <ulexus@gmail.com>
Talos shouldn't try to re-encode the machine config it was provided
with.
So add a `ReadonlyWrapper` around `*v1alpha1.Config` which makes sure
that the raw config object is no longer available (it's a private field),
while config accessors remain available for read-only access.
`ReadonlyWrapper` also preserves the original `[]byte` encoding of the
config, keeping it exactly as it was loaded from a file or read over the
network.
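A generic sketch of the wrapper idea (the field and method names here are illustrative, not the actual Talos types):

```
// rawConfig stands in for the parsed machine configuration.
type rawConfig struct {
	MachineType string
}

// ReadonlyWrapper hides the raw config object behind read-only accessors
// while preserving the exact bytes the config was loaded from.
type ReadonlyWrapper struct {
	cfg rawConfig // private: callers can no longer mutate or re-encode it
	src []byte    // original encoding, byte-for-byte as read from disk or the network
}

func NewReadonlyWrapper(src []byte, cfg rawConfig) *ReadonlyWrapper {
	return &ReadonlyWrapper{cfg: cfg, src: src}
}

// Bytes returns the configuration exactly as it was provided to Talos.
func (w *ReadonlyWrapper) Bytes() []byte {
	return append([]byte(nil), w.src...)
}

// MachineType is an example read-only accessor.
func (w *ReadonlyWrapper) MachineType() string {
	return w.cfg.MachineType
}
```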
Improved `talosctl edit mc` to preserve the config as it was submitted,
and preserve the edits on error from Talos (previously edits were lost).
`ReadonlyWrapper` is not used on the config generation path though: there
the config is represented by `*v1alpha1.Config` and can be freely modified.
With this change, the config provided to Talos is preserved almost exactly.
Why almost? Some parts of Talos (platform code) patch the machine
configuration with new data. We need to fix the platforms to provide
networking configuration in a different way, but this will come in
other PRs later.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #4656
As changes to the kubelet configuration can now be applied without a reboot,
`talosctl upgrade-k8s` can handle kubelet upgrades as well.
The gist is simply modifying the machine config and waiting for the `Node`
version to be updated; the rest of the code is required for the reliability
of the process.
Also fixed a bug in the API while watching deleted items with
tombstones.
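A hedged sketch of the "wait for the `Node` version" step using client-go (the function name and polling interval are illustrative, not the exact Talos code):

```
package upgrade

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForKubeletVersion polls the Node object until the kubelet reports the
// expected version; transient API errors are simply retried.
func waitForKubeletVersion(ctx context.Context, clientset kubernetes.Interface, nodeName, version string) error {
	return wait.PollImmediateUntil(5*time.Second, func() (bool, error) {
		node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return false, nil // transient API errors are retried
		}

		return node.Status.NodeInfo.KubeletVersion == "v"+version, nil
	}, ctx.Done())
}
```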
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes: https://github.com/talos-systems/talos/issues/4065
Get all Talos-generated manifests and apply them, then wait for the
deployments to be updated and become ready.
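A small sketch of the readiness condition used for such a wait (illustrative, not the exact check):

```
import appsv1 "k8s.io/api/apps/v1"

// deploymentReady reports whether a Deployment has observed its latest
// generation and all replicas are updated and ready.
func deploymentReady(d *appsv1.Deployment) bool {
	if d.Spec.Replicas == nil {
		return d.Status.ReadyReplicas > 0
	}

	return d.Status.ObservedGeneration >= d.Generation &&
		d.Status.UpdatedReplicas == *d.Spec.Replicas &&
		d.Status.ReadyReplicas == *d.Spec.Replicas
}
```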
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Looks like we bumped the sonobuoy library, and it silently changed a lot of
things in the way it works with results.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #3951
Bootkube support was removed in Talos 0.9. Talos versions 0.9-0.11
support conversion of the self-hosted bootkube-based control plane to the
new-style control plane running as static pods managed by Talos.
This commit removes all backwards compatibility and the conversion
code.
For the k8s controllers, `BootstrapStatus` is removed and a dependency
on `etcd` service status is added (as it was implicitly there via
`BootstrapStatus`).
Remove the control plane conversion code.
In the k8s upgrade code, remove the self-hosted part.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Scan all pods in `kube-system`, find the `kube-proxy`, `kube-scheduler`,
`kube-controller-manager` and `kube-apiserver` ones, then pick the
lowest version amongst them.
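A hedged sketch of the detection logic (container names and the version comparison are assumptions for illustration; the real code may differ):

```
package upgrade

import (
	"context"
	"fmt"
	"strings"

	"golang.org/x/mod/semver"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// lowestControlPlaneVersion finds the control plane component pods in
// kube-system and returns the lowest version among their image tags.
func lowestControlPlaneVersion(ctx context.Context, clientset kubernetes.Interface) (string, error) {
	components := map[string]struct{}{
		"kube-apiserver":          {},
		"kube-controller-manager": {},
		"kube-scheduler":          {},
		"kube-proxy":              {},
	}

	pods, err := clientset.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{})
	if err != nil {
		return "", err
	}

	lowest := ""

	for _, pod := range pods.Items {
		for _, container := range pod.Spec.Containers {
			if _, ok := components[container.Name]; !ok {
				continue
			}

			idx := strings.LastIndex(container.Image, ":")
			if idx < 0 {
				continue
			}

			version := strings.TrimPrefix(container.Image[idx+1:], "v")

			if lowest == "" || semver.Compare("v"+version, "v"+lowest) < 0 {
				lowest = version
			}
		}
	}

	if lowest == "" {
		return "", fmt.Errorf("no control plane pods found in kube-system")
	}

	return lowest, nil
}
```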
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This is going to be useful in third-party code which uses the
upgrade modules, to collect output logs instead of printing them to
stdout.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
The problem was that the gRPC method `status.Code(err)` doesn't unwrap
errors, while the Talos client returns errors wrapped with
`multierror.Error` and `fmt.Errorf`, so `status.Code` doesn't return the
error code correctly.
Fix that by introducing our own client method which correctly goes over
the chain of wrapped errors.
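A hedged sketch of such a helper (the real client method may handle multi-errors more thoroughly): walk the chain of wrapped errors until one of them carries a gRPC status.

```
package client

import (
	"errors"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// StatusCode unwraps the error chain until a gRPC status is found,
// falling back to codes.Unknown.
func StatusCode(err error) codes.Code {
	if err == nil {
		return codes.OK
	}

	for err != nil {
		if s, ok := status.FromError(err); ok {
			return s.Code()
		}

		err = errors.Unwrap(err)
	}

	return codes.Unknown
}
```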
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
The problem is that there's no official way to close the Kubernetes client's
underlying TCP/HTTP connections. So each time Talos initializes a
connection to the control plane endpoint, a new client is built, but this
client is never closed, so the connection stays active on the load
balancers, at the API server level, etc. It also consumes some resources on
the Talos side.
We add a way to close the underlying connections by using a helper from the
Kubernetes client libraries to force-close all TCP connections, which
should shut down all HTTP/2 connections as well.
An alternative approach might be to cache a client for some time, but many
of the clients are created with temporary PKI, so even a cached client
still needs to be closed once it gets stale, and it's not clear how to
recreate a client in case the existing one is broken for one reason or
another (and we need to force a re-connection).
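A hedged sketch of the underlying idea (the actual helper from the Kubernetes client libraries may differ): ask the client's transport to close its idle connections via the standard `CloseIdleConnections` hook.

```
import "net/http"

// closeIdleConnections force-closes idle connections held by the client's
// transport; *http.Transport implements this hook, which tears down idle
// HTTP/1.1 and HTTP/2 connections kept open to the control plane endpoint.
func closeIdleConnections(rt http.RoundTripper) {
	type closeIdler interface {
		CloseIdleConnections()
	}

	if t, ok := rt.(closeIdler); ok {
		t.CloseIdleConnections()
	}
}
```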
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This removes `retrying error` messages while waiting for the API server
pod state to reflect changes from the updated static pod definition.
Log more lines to notify about the progress.
Skip `kube-proxy` if not found (as we allow it to be disabled).
```
$ talosctl upgrade-k8s -n 172.20.0.2 --from 1.21.0 --to 1.21.2
discovered master nodes ["172.20.0.2" "172.20.0.3" "172.20.0.4"]
updating "kube-apiserver" to version "1.21.2"
> "172.20.0.2": starting update
> "172.20.0.2": machine configuration patched
> "172.20.0.2": waiting for API server state pod update
< "172.20.0.2": successfully updated
> "172.20.0.3": starting update
> "172.20.0.3": machine configuration patched
> "172.20.0.3": waiting for API server state pod update
< "172.20.0.3": successfully updated
> "172.20.0.4": starting update
> "172.20.0.4": machine configuration patched
> "172.20.0.4": waiting for API server state pod update
< "172.20.0.4": successfully updated
updating "kube-controller-manager" to version "1.21.2"
> "172.20.0.2": starting update
> "172.20.0.2": machine configuration patched
> "172.20.0.2": waiting for API server state pod update
< "172.20.0.2": successfully updated
> "172.20.0.3": starting update
> "172.20.0.3": machine configuration patched
> "172.20.0.3": waiting for API server state pod update
< "172.20.0.3": successfully updated
> "172.20.0.4": starting update
> "172.20.0.4": machine configuration patched
> "172.20.0.4": waiting for API server state pod update
< "172.20.0.4": successfully updated
updating "kube-scheduler" to version "1.21.2"
> "172.20.0.2": starting update
> "172.20.0.2": machine configuration patched
> "172.20.0.2": waiting for API server state pod update
< "172.20.0.2": successfully updated
> "172.20.0.3": starting update
> "172.20.0.3": machine configuration patched
> "172.20.0.3": waiting for API server state pod update
< "172.20.0.3": successfully updated
> "172.20.0.4": starting update
> "172.20.0.4": machine configuration patched
> "172.20.0.4": waiting for API server state pod update
< "172.20.0.4": successfully updated
updating daemonset "kube-proxy" to version "1.21.2"
kube-proxy skipped as DaemonSet was not found
```
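A minimal sketch of the kube-proxy skip shown above (illustrative, not the exact upgrade code):

```
import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func updateKubeProxy(ctx context.Context, clientset kubernetes.Interface, version string) error {
	ds, err := clientset.AppsV1().DaemonSets("kube-system").Get(ctx, "kube-proxy", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// kube-proxy may be disabled, so a missing DaemonSet is not an error.
		fmt.Println("kube-proxy skipped as DaemonSet was not found")

		return nil
	}

	if err != nil {
		return err
	}

	// ...otherwise patch ds.Spec.Template with the new image version.
	_ = ds

	return nil
}
```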
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
With the recent changes, the bootstrap API might wait for the time to be in
sync (as apid is launched before time is in sync). We set a 500ms timeout
for the bootstrap API call, so there's a chance that a call might
time out, and we should ignore it.
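An illustrative sketch of the tolerated-timeout handling (not the exact test code): the bootstrap call is wrapped in a 500ms timeout and a deadline-exceeded result is treated as non-fatal.

```
import (
	"context"
	"errors"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// tryBootstrap ignores timeouts of the bootstrap call, since apid may
// still be waiting for time sync; the call is retried later.
func tryBootstrap(ctx context.Context, bootstrap func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()

	err := bootstrap(ctx)
	if err == nil {
		return nil
	}

	if errors.Is(err, context.DeadlineExceeded) || status.Code(err) == codes.DeadlineExceeded {
		return nil
	}

	return err
}
```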
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This PR can be split into two parts:
* controllers
* apid binding into COSI world
Controllers
-----------
* `k8s.EndpointController` provides control plane endpoints on worker
nodes (it isn't required for now on control plane nodes)
* `secrets.RootController` now provides OS top-level secrets (CA cert)
and secret configuration
* `secrets.APIController` generates API secrets (certificates) in a slightly
different way for workers and control plane nodes: control plane nodes
generate them directly, while workers reach out to `trustd` on the control
plane nodes via the `k8s.Endpoint` resource
apid Binding
------------
The `secrets.API` resource provides the binding to protobuf by converting
itself back and forth to the protobuf spec.
apid no longer receives the machine configuration; instead it receives a
gRPC-backed socket to access the Resource API. apid watches the `secrets.API`
resource, fetches the certs and CA from it, and uses them in its TLS
configuration.
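A hedged sketch of the serving-side pattern (not the actual apid code): the latest certificate obtained from the watched `secrets.API` resource is kept in an atomic value and handed out via `tls.Config.GetCertificate`, so certificate rotation needs no restart.

```
import (
	"crypto/tls"
	"errors"
	"sync/atomic"
)

// certProvider holds the most recent certificate delivered by the watch.
type certProvider struct {
	cert atomic.Value // stores *tls.Certificate
}

// Update is called whenever the watched resource delivers new certs.
func (p *certProvider) Update(certPEM, keyPEM []byte) error {
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return err
	}

	p.cert.Store(&cert)

	return nil
}

// TLSConfig serves whatever certificate was stored last.
func (p *certProvider) TLSConfig() *tls.Config {
	return &tls.Config{
		GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
			cert, ok := p.cert.Load().(*tls.Certificate)
			if !ok {
				return nil, errors.New("certificate not received yet")
			}

			return cert, nil
		},
	}
}
```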
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
We're no longer testing against Talos <= 0.8, so there's no reason to
run this check (even if it's a no-op).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>