Use new version of go-kubernetes, and move the `kube-proxy` DaemonSet
update to follow common logic of bootstrap manifests update.
This fixes a confusing behavior when after `k8s-upgrade` the version of
`kube-proxy` is not updated in the machine config.
See https://github.com/siderolabs/go-kubernetes/pull/3
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Add cilium e2e tests. The existing cilium check was very old, update to
latest cilium version and also add a test for KPR strict mode.
Signed-off-by: Noel Georgi <git@frezbo.dev>
The shared code is going out to the
github.com/siderolabs/go-kubernetes library.
The code will be used in Talos and other projects using same features.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Not all Kubernetes deprecated resources are same - if the old API
version is deprecated, but new one is available, API server handles
trnasition for us. If some resource is removed completely, we need to
check for it. This reduces number of items to check, and simplifies the
check.
Move the check under the umbrella of the 'upgrade pre-checks', and make
it actually fatal.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Add more checks for the Talos Kubernetes upgrade.
The removed api-server resources checks are kept as is, needs to be
moved to the new checks as part of #6599.
Fixes: #6444
Signed-off-by: Noel Georgi <git@frezbo.dev>
This PR adds two additional checks which are performed during boot sequence and in `talosctl health`. They ensure that nodes have enough memory and disk.
- Boot check will print a warning if memory / disk size is not sufficient.
- Health check will fail if memory / disk size is not sufficient.
Closes#6467
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
There's a cyclic dependency on siderolink library which imports talos
machinery back. We will fix that after we get talos pushed under a new
name.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This the first step towards replacing all import paths to be based on
`siderolabs/` instead of `talos-systems/`.
All updates contain no functional changes, just refactorings to adapt to
the new path structure.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
It got broken with the changes to the kubelet now sourcing static pods
from a HTTP internal server.
As we don't want it to be broken, and to make health checks better, add
a new check to make sure kubelet reports control plane static pods as
running. This coupled with API server check should make it more
thorough.
Also add logging when static pod definitions are updated (they were
previously there for file-based implementation). These logs are very
helpful for troubleshooting.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
I had to do several things:
- contextcheck now supports Go 1.18 generics, but I had to disable it because of this https://github.com/kkHAIKE/contextcheck/issues/9
- dupword produces to many false positives, so it's also disabled
- revive found all packages which didn't have a documentation comment before. And tehre is A LOT of them. I updated some of them, but gave up at some point and just added them to exclude rules for now.
- change lint-vulncheck to use `base` stage as base
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
If a member has no IP addresses, prevent cluster health checks from failing with a panic by checking for the length of member IPs and not assuming there's always at least 1 IP.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Track the progress of the long-running actions `reboot`, `reset`, `upgrade` and `shutdown` on the client side by default, unless `--no-wait=true` is specified.
Use the events API to follow the events using the actor ID of the action and display it using an stderr reporter with a spinner.
Closessiderolabs/talos#5499.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Overview: deprecate existing Talos resource API, and introduce new COSI
API.
Consequences:
* COSI API can only go via one-2-one proxy (`client.WithNode`)
* client-side API access is way easier with `state.State` wrappers
* lots of small changes on the client side to use new APIs
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
As we submit results to Certified Kubernetes, we provide metadata which
should be updated now, and also we lost the logo in our assets.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
When integration tests run without data from Talos provisioner (e.g.
against AWS/GCP), it should work only with `talosconfig` as an input.
This specific flow was missing filling out `infoWrapper` properly.
Clean up things a bit by reducing code duplication.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Query the discovery service to fetch the node list and use the results in health checks. Closes siderolabs#5554.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Introduce `cluster.NodeInfo` to represent the basic info about a node which can be used in the health checks. This information, where possible, will be populated by the discovery service in following PRs. Part of siderolabs#5554.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Add new health check which checks if the etcd members match the control plane nodes. Closes siderolabs#5553.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
This fixes an error when integration test become stuck with the message
like:
```
waiting for coredns to report ready: some pods are not ready: [coredns-868c687b7-g2z64]
```
After some random sequence of node restarts one of the pods might become
"stuck" in `Completed` state (as it is shown in `kubectl get pods`)
blocking the check, as the pod will never become ready.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This fixes an issue when `talosctl upgrade-k8s` fails with unhelpful
message if the version is specified as `v1.23.5` vs. `1.23.5`.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Having polymorphic (spec type depends on ID) resources is not a good
idea, and it's not compatible with protobuf encoding.
Introduce new resources for each polymorphic sub-spec using new Go 1.18
generic typed.Resource to reduce the boilerplate code.
(Still needs proper deepcopy-gen, but I'm skipping it for now, as
K8sControlPlane had also broken deep copy).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Containerd CRI plugin was merged into the main repo, but we were using
old import path, so our constants coming from the module were outdated.
This fixes the image version for the pause container.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Use the last `:` in the image reference.
Handle the case when no version was discovered.
See https://github.com/siderolabs/theila/issues/138
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This showed up recently frequently in integration-provision tests
(might be related to Kubernetes upgrade), but anyways errors should be
retried.
Refactored the function to extract the retryable part.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Some failures can be fixed by updating the machine configuration.
Now `userDisks` and `userFiles` do not make Talos to enter into reboot
loop but pause for 35 minutes.
Additionally, `apid` and `machined` are now started right after
containerd is up and running.
That makes it possible for the operator to connect to the node using
talosctl and fix the config.
Fixes: https://github.com/talos-systems/talos/issues/4669
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
With graceful kubelet shutdown (#5108), after graceful node restart pods
on the restarted node might stay in the status `Terminated` which breaks
the check on pod readiness.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>