38 Commits

Author SHA1 Message Date
Dmitriy Matrenichev
45e6e27af7
chore: bump runtime
Use new functions and methods from runtime module.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-05-11 17:18:08 -04:00
Noel Georgi
cad43f0ad3
chore: remove k8s master label
Since talos now defaults to k8s 1.27, remove the handling
of `master` label for controlplane nodes.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-04-25 20:48:05 +05:30
Noel Georgi
a78281214d
feat: add cilium e2e tests
Add cilium e2e tests. The existing cilium check was very old, so update to
the latest cilium version and also add a test for KPR strict mode.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-03-03 20:03:25 +05:30
Dmitriy Matrenichev
eb332cfcb7
feat: add health check for a minimal memory / disk size
This PR adds two additional checks which are performed during boot sequence and in `talosctl health`. They ensure that nodes have enough memory and disk.

- Boot check will print a warning if memory / disk size is not sufficient.
- Health check will fail if memory / disk size is not sufficient.

Closes #6467

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-12-10 07:05:08 +03:00
Andrey Smirnov
96aa9638f7
chore: rename talos-systems/talos to siderolabs/talos
There's a cyclic dependency on siderolink library which imports talos
machinery back. We will fix that after we get talos pushed under a new
name.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-11-03 16:50:32 +04:00
Andrey Smirnov
08e7e49a29
test: update versions for upgrade tests
Use the latest releases in each branch.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-11-01 10:40:19 +04:00
Andrey Smirnov
0b41923c36
fix: restore the StaticPodStatus resource
It got broken with the changes to the kubelet now sourcing static pods
from an internal HTTP server.

As we don't want it to be broken, and to make health checks better, add
a new check to make sure kubelet reports control plane static pods as
running. This coupled with API server check should make it more
thorough.

Also add logging when static pod definitions are updated (they were
previously there for file-based implementation). These logs are very
helpful for troubleshooting.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-10-31 18:48:03 +04:00
Dmitriy Matrenichev
fc48849d00
chore: move maps/slices/ordered to gen module
Use github.com/siderolabs/gen

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-09-21 20:22:43 +03:00
Utku Ozdemir
0847400f72
fix: prevent panic on health check if a member has no IPs
If a member has no IP addresses, prevent cluster health checks from failing with a panic by checking for the length of member IPs and not assuming there's always at least 1 IP.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2022-09-02 15:16:59 +02:00
Utku Ozdemir
0b339a9dc5
feat: track progress of action API calls
Track the progress of the long-running actions `reboot`, `reset`, `upgrade` and `shutdown` on the client side by default, unless `--no-wait=true` is specified.

Use the events API to follow the events using the actor ID of the action and display it using an stderr reporter with a spinner.

Closes siderolabs/talos#5499.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2022-08-29 22:54:40 +02:00
Dmitriy Matrenichev
b59ca5810e
chore: move from inet.af/netaddr to net/netip and go4.org/netipx
Closes #6007

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-08-25 17:51:32 +03:00
Noel Georgi
b62b18a972
feat: bump k8s to v1.25.0-beta.0
Bump k8s to v1.25.0-beta.0

Update most kubernetes `master` references to `controlplane`

Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-08-10 22:17:53 +05:30
Andrey Smirnov
a6b010a8b4
chore: update Go to 1.19, Linux to 5.15.58
See https://go.dev/doc/go1.19

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-03 17:03:58 +04:00
Andrey Smirnov
a167a54021
test: fix CLI nodes discovery without provisioner data
When integration tests run without data from Talos provisioner (e.g.
against AWS/GCP), they should work with only `talosconfig` as input.

This specific flow was missing filling out `infoWrapper` properly.

Clean up things a bit by reducing code duplication.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-06-21 18:42:26 +04:00
Utku Ozdemir
6759fcd4ae
feat: use discovery service on cluster health checks
Query the discovery service to fetch the node list and use the results in health checks. Closes siderolabs#5554.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2022-06-15 16:01:38 +02:00
Utku Ozdemir
8d2be5e315
feat: extend node definition used in health checks
Introduce `cluster.NodeInfo` to represent the basic info about a node which can be used in the health checks. This information, where possible, will be populated by the discovery service in following PRs. Part of siderolabs#5554.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2022-06-13 14:13:42 +02:00
Dmitriy Matrenichev
70fc424099
chore: add generic methods and use them
Things like ToSet, Keys etc...

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-06-09 02:59:23 +08:00
Utku Ozdemir
c19dd1b892
feat: add 'etcd members should be control plane nodes' health check
Add new health check which checks if the etcd members match the control plane nodes. Closes siderolabs#5553.
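
One way to picture such a check is a two-way set difference between the etcd member list and the control plane node list. The sketch below assumes both sides are identified by plain strings; the function names are hypothetical and the real check may compare different identifiers.

```go
package main

import (
	"fmt"
	"sort"
)

// missing returns the sorted names present in want but absent from have.
func missing(want, have []string) []string {
	seen := make(map[string]struct{}, len(have))
	for _, name := range have {
		seen[name] = struct{}{}
	}

	var out []string

	for _, name := range want {
		if _, ok := seen[name]; !ok {
			out = append(out, name)
		}
	}

	sort.Strings(out)

	return out
}

// checkEtcdMatchesControlPlane fails unless both lists contain the same nodes.
func checkEtcdMatchesControlPlane(etcdMembers, controlPlane []string) error {
	notInEtcd := missing(controlPlane, etcdMembers)
	notControlPlane := missing(etcdMembers, controlPlane)

	if len(notInEtcd) > 0 || len(notControlPlane) > 0 {
		return fmt.Errorf(
			"control plane nodes missing from etcd: %v; etcd members that are not control plane nodes: %v",
			notInEtcd, notControlPlane,
		)
	}

	return nil
}

func main() {
	err := checkEtcdMatchesControlPlane(
		[]string{"10.0.0.2", "10.0.0.3", "10.0.0.9"},
		[]string{"10.0.0.2", "10.0.0.3", "10.0.0.4"},
	)
	fmt.Println(err)
}
```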

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2022-06-07 10:34:38 +02:00
Dmitriy Matrenichev
bf7a6443ee
feat: add 'etcd membership is consistent across nodes' health check
Add new health check which waits for all etcd members. Closes #5552.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-05-20 21:51:17 +08:00
Andrey Smirnov
5a91f6076d
fix: ignore completed pods in cluster health check
This fixes an error where integration tests become stuck with a message
like:

```
waiting for coredns to report ready: some pods are not ready: [coredns-868c687b7-g2z64]
```

After some random sequence of node restarts one of the pods might become
"stuck" in `Completed` state (as it is shown in `kubectl get pods`)
blocking the check, as the pod will never become ready.
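
A rough sketch of the idea, assuming that a pod shown as `Completed` by `kubectl` reports the phase `Succeeded`: such pods are skipped before readiness is evaluated. The `pod` type below is a simplified, hypothetical stand-in for the Kubernetes API object.

```go
package main

import "fmt"

type podPhase string

const (
	phaseRunning   podPhase = "Running"
	phaseSucceeded podPhase = "Succeeded"
)

// pod is a simplified, hypothetical view of a Kubernetes pod.
type pod struct {
	Name  string
	Phase podPhase
	Ready bool
}

// notReady returns the names of pods that still block the check.
// Pods that already ran to completion are ignored, because they
// will never become ready.
func notReady(pods []pod) []string {
	var names []string

	for _, p := range pods {
		if p.Phase == phaseSucceeded {
			continue
		}

		if !p.Ready {
			names = append(names, p.Name)
		}
	}

	return names
}

func main() {
	pods := []pod{
		{Name: "coredns-a", Phase: phaseRunning, Ready: true},
		{Name: "coredns-b", Phase: phaseSucceeded, Ready: false},
	}

	fmt.Println("pods not ready:", notReady(pods))
}
```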

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-05-16 14:28:25 +03:00
Andrey Smirnov
50594ab1a7
fix: ignore terminated pods in pod health checks
With graceful kubelet shutdown (#5108), after graceful node restart pods
on the restarted node might stay in the status `Terminated` which breaks
the check on pod readiness.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-17 19:17:56 +03:00
Seán C McCord
6af83afd5a
fix: handle multiple-IP cluster nodes
Allow cluster nodes to have multiple internal IP addresses when checking
for all Kubernetes nodes.

Fixes #4807

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2022-01-17 11:41:54 -05:00
Andrey Smirnov
9f24b519dc
chore: remove bootkube check from cluster health check
We're no longer testing against Talos <= 0.8, so no reason to
run this check (even if it's no-op).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-17 10:04:32 -07:00
Alexey Palazhchenko
7662d033bf
fix: talosctl health should not check kube-proxy when it is disabled
Fixes #3299.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-16 13:21:36 -07:00
Artem Chernyshev
22f375300c
chore: update golangci-lint to 1.38.0
Fix all discovered issues.
Detected a couple of bugs and fixed them as well.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-12 06:50:02 -08:00
Alexey Palazhchenko
df52c13581
chore: fix //nolint directives
That's the recommended syntax:
https://golangci-lint.run/usage/false-positives/

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-05 05:58:33 -08:00
Artem Chernyshev
7108bb3f5b
test: upgrade master to master tests
Verify upgrade flow using the same version of the installer.
Run that with disk encryption enabled.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-02-24 07:56:44 -08:00
Artem Chernyshev
02b3719df9
feat: skip filesystem for state and ephemeral partitions in the installer
The filesystem creation step is moved to a later stage: when Talos mounts
the partition for the first time.
It now checks whether the partition has a filesystem and, if not, formats
it right before mounting.

Additionally refactored mount options a bit:
- replaced separate options with a set of binary flags.
- implemented pre-mount and post-unmount hooks.
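
To illustrate how bit flags and a pre-mount hook can combine into "format on first mount", here is a small sketch. Every name in it (`Flags`, `FormatIfEmpty`, `partition`, `mount`) is hypothetical, and the real formatting and mount calls are replaced by placeholders.

```go
package main

import "fmt"

// Flags is a bit set of hypothetical mount behaviors.
type Flags uint

const (
	ReadOnly Flags = 1 << iota
	FormatIfEmpty
)

// Has reports whether flag is set in f.
func (f Flags) Has(flag Flags) bool { return f&flag != 0 }

// partition is a simplified, hypothetical mount point description.
type partition struct {
	Label      string
	Filesystem string // empty means no filesystem was detected
	Flags      Flags
	PreMount   []func(*partition) error
}

// formatIfEmpty is a pre-mount hook that creates a filesystem only
// when the partition has none and the flag asks for it.
func formatIfEmpty(p *partition) error {
	if p.Flags.Has(FormatIfEmpty) && p.Filesystem == "" {
		p.Filesystem = "xfs" // placeholder for running the real formatter
	}

	return nil
}

// mount runs the pre-mount hooks in order; the mount call itself is omitted.
func mount(p *partition) error {
	for _, hook := range p.PreMount {
		if err := hook(p); err != nil {
			return fmt.Errorf("pre-mount hook for %s: %w", p.Label, err)
		}
	}

	return nil
}

func main() {
	p := &partition{
		Label:    "EPHEMERAL",
		Flags:    FormatIfEmpty,
		PreMount: []func(*partition) error{formatIfEmpty},
	}

	if err := mount(p); err != nil {
		fmt.Println("mount failed:", err)

		return
	}

	fmt.Println(p.Label, "filesystem:", p.Filesystem)
}
```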

Also fixed typos in a couple of places and increased the timeout for `apid ready`.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-02-17 09:37:21 -08:00
Andrey Smirnov
0aaf8fa968
feat: replace bootkube with Talos-managed control plane
Control plane components are running as static pods managed by the
kubelets.

The whole subsystem is managed via resources/controllers from os-runtime.

Many supporting changes/refactoring to enable new code paths.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-26 14:22:35 -08:00
Andrey Smirnov
5e3b8ee099
fix: ignore pods spun up from checkpoints in health checks
Pods which are spun up as checkpoints shouldn't be counted towards the
normal pods spun up as part of the daemon set.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-22 13:24:30 -08:00
Andrey Smirnov
362bb933a8
test: add an extra 'node boot done' health check
This makes sure node boot sequence is done before we consider cluster to
be healthy.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-15 06:10:23 -08:00
Andrey Smirnov
07f4ed7fb4
feat: upgrade etcd to 3.4.14
No major fixes, just keeping version up to date.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-26 09:14:41 -08:00
Andrey Smirnov
8560fb9662
chore: enable nlreturn linter
Most of the fixes were automatically applied.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-09 06:48:07 -08:00
Andrey Smirnov
bc9e0c0dba
fix: re-implement upgrade (install) with preserve
For 0.6 -> 0.7 upgrade, in any case config.yaml is preserved and moved
from `/boot` to `/system/state`.

For single node upgrade, `EPHEMERAL` partition is not touched and other
partitions are re-created as needed.

Bump provision tests to 0.6/0.7 upgrades as we get closer to the new
release.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-28 07:25:26 -07:00
Andrey Smirnov
d1c9fc1b49
fix: correctly calculate output width in colored health reporter
This was bugging me for quite some time, as (randomly) the output was
endlessly scrolling a repeated line when the terminal width was small.

The root cause was that `\t` on output takes a variable number of spaces
(columns). Also fixed small issues with counting UTF-8 runes correctly
(important for the spinner and if we ever do i18n), and made sure the
spinner itself is included in the calculations.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-21 06:19:35 -07:00
Andrey Smirnov
d7f5de62c3
feat: colorize output of cluster health checks
It only gets enabled if output is a terminal. Failures which resolve
themselves are removed from the final output. Small spinner to indicate
progress.

While I was at it, I fixed client-side `talosctl health` when init node
is missing.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-06 07:59:30 -07:00
Andrey Smirnov
bddd4f1bf6
refactor: move external API packages into machinery/
This moves `pkg/config`, `pkg/client` and `pkg/constants`
under `pkg/machinery` umbrella.

And `pkg/machinery` is published as Go module inside Talos repository.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-17 09:56:14 -07:00
Andrey Smirnov
9379cf9ee1
refactor: expose provision as public package
This change is only moving packages and updating import paths.

Goal: expose `internal/pkg/provision` as `pkg/provision` to enable other
projects to import Talos provisioning library.

As cluster checks are almost always required as part of provisioning
process, package `internal/pkg/cluster` was also made public as
`pkg/cluster`.

Other changes were direct dependencies discovered by `importvet` which
were updated.

Public packages (useful, general purpose packages with stable API):

* `internal/pkg/conditions` -> `pkg/conditions`
* `internal/pkg/tail` -> `pkg/tail`

Private packages (used only internally by the provisioning library):

* `internal/pkg/inmemhttp` -> `pkg/provision/internal/inmemhttp`
* `internal/pkg/kernel/vmlinuz` -> `pkg/provision/internal/vmlinuz`
* `internal/pkg/cniutils` -> `pkg/provision/internal/cniutils`

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-12 05:12:05 -07:00