22 Commits

Author SHA1 Message Date
Dmitriy Matrenichev
70fc424099
chore: add generic methods and use them
Things like ToSet, Keys, etc.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-06-09 02:59:23 +08:00
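
For illustration, generic helpers of the kind this commit names could look roughly like the sketch below. The names `ToSet` and `Keys` come from the commit message; the signatures and bodies here are assumptions, not the code that was actually merged.

```go
package main

import (
	"fmt"
	"sort"
)

// Keys returns the keys of m in unspecified order.
// Hypothetical signature; the real helper may differ.
func Keys[K comparable, V any](m map[K]V) []K {
	keys := make([]K, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}

	return keys
}

// ToSet converts a slice into a set represented as map[T]struct{}.
// Duplicate items collapse into a single entry.
func ToSet[T comparable](items []T) map[T]struct{} {
	set := make(map[T]struct{}, len(items))
	for _, item := range items {
		set[item] = struct{}{}
	}

	return set
}

func main() {
	set := ToSet([]string{"a", "b", "a"})

	keys := Keys(set)
	sort.Strings(keys) // map iteration order is random, so sort for stable output

	fmt.Println(keys)
}
```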
Utku Ozdemir
c19dd1b892
feat: add 'etcd members should be control plane nodes' health check
Add new health check which checks if the etcd members match the control plane nodes. Closes siderolabs#5553.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2022-06-07 10:34:38 +02:00
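
A minimal sketch of how such a check might compare the two sets of addresses. The function names and the use of plain IP strings are assumptions made for illustration; the actual check talks to etcd and the cluster API and may apply different matching rules.

```go
package main

import (
	"fmt"
	"sort"
)

// notIn returns the sorted items of from that do not appear in in.
func notIn(from, in []string) []string {
	set := make(map[string]struct{}, len(in))
	for _, item := range in {
		set[item] = struct{}{}
	}

	var out []string

	for _, item := range from {
		if _, ok := set[item]; !ok {
			out = append(out, item)
		}
	}

	sort.Strings(out)

	return out
}

// assertMembersAreControlPlane is a hypothetical helper: it fails when the
// etcd member addresses and the control plane node addresses differ.
func assertMembersAreControlPlane(memberIPs, controlPlaneIPs []string) error {
	extra := notIn(memberIPs, controlPlaneIPs)
	absent := notIn(controlPlaneIPs, memberIPs)

	if len(extra) == 0 && len(absent) == 0 {
		return nil
	}

	return fmt.Errorf("etcd members %v are not control plane nodes; control plane nodes %v are not etcd members", extra, absent)
}

func main() {
	err := assertMembersAreControlPlane(
		[]string{"10.5.0.2", "10.5.0.3", "10.5.0.9"},
		[]string{"10.5.0.2", "10.5.0.3", "10.5.0.4"},
	)
	fmt.Println(err)
}
```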
Dmitriy Matrenichev
bf7a6443ee
feat: add 'etcd membership is consistent across nodes' health check
Add new health check which waits for all etcd members. Closes #5552.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-05-20 21:51:17 +08:00
Andrey Smirnov
5a91f6076d
fix: ignore completed pods in cluster health check
This fixes an error where an integration test becomes stuck with a message
like:

```
waiting for coredns to report ready: some pods are not ready: [coredns-868c687b7-g2z64]
```

After some random sequence of node restarts one of the pods might become
"stuck" in `Completed` state (as it is shown in `kubectl get pods`)
blocking the check, as the pod will never become ready.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-05-16 14:28:25 +03:00
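
The idea can be sketched as a filter applied before the readiness check. The `Pod` type below is a simplified stand-in invented for this example (the real check works with Kubernetes API objects); it assumes that a pod shown as `Completed` by `kubectl get pods` has the phase `Succeeded`.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for a Kubernetes pod.
type Pod struct {
	Name  string
	Phase string // e.g. "Running", "Pending", "Succeeded"
	Ready bool
}

// notReadyPods returns the names of pods that block the health check.
// Pods that already ran to completion are skipped, because they will
// never report ready again.
func notReadyPods(pods []Pod) []string {
	var names []string

	for _, pod := range pods {
		if pod.Phase == "Succeeded" {
			continue
		}

		if !pod.Ready {
			names = append(names, pod.Name)
		}
	}

	return names
}

func main() {
	pods := []Pod{
		{Name: "coredns-a", Phase: "Running", Ready: true},
		{Name: "coredns-b", Phase: "Succeeded", Ready: false},
		{Name: "coredns-c", Phase: "Pending", Ready: false},
	}

	fmt.Println(notReadyPods(pods))
}
```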
Andrey Smirnov
50594ab1a7
fix: ignore terminated pods in pod health checks
With graceful kubelet shutdown (#5108), after a graceful node restart, pods
on the restarted node might stay in the `Terminated` status, which breaks
the check on pod readiness.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-17 19:17:56 +03:00
Seán C McCord
6af83afd5a
fix: handle multiple-IP cluster nodes
Allow cluster nodes to have multiple internal IP addresses when checking
for all Kubernetes nodes.

Fixes #4807

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2022-01-17 11:41:54 -05:00
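
One way to picture the change: an expected node address counts as present if it matches any of a node's internal IPs, not only the first one. The `Node` type and `missingNodes` function are hypothetical names used only for this sketch.

```go
package main

import "fmt"

// Node is a simplified, hypothetical stand-in for a Kubernetes node.
type Node struct {
	Name        string
	InternalIPs []string
}

// missingNodes returns the expected IPs that do not match any internal IP
// of any registered node.
func missingNodes(expectedIPs []string, nodes []Node) []string {
	seen := make(map[string]struct{})

	for _, node := range nodes {
		for _, ip := range node.InternalIPs {
			seen[ip] = struct{}{}
		}
	}

	var missing []string

	for _, ip := range expectedIPs {
		if _, ok := seen[ip]; !ok {
			missing = append(missing, ip)
		}
	}

	return missing
}

func main() {
	nodes := []Node{
		{Name: "cp-1", InternalIPs: []string{"10.5.0.2", "fd00::2"}},
		{Name: "worker-1", InternalIPs: []string{"192.168.1.10", "10.5.0.3"}},
	}

	fmt.Println(missingNodes([]string{"10.5.0.2", "10.5.0.3", "10.5.0.4"}, nodes))
}
```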
Andrey Smirnov
9f24b519dc chore: remove bootkube check from cluster health check
We're no longer testing against Talos <= 0.8, so there is no reason to
run this check (even if it's a no-op).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-17 10:04:32 -07:00
Alexey Palazhchenko
7662d033bf fix: talosctl health should not check kube-proxy when it is disabled
Fixes #3299.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-16 13:21:36 -07:00
Artem Chernyshev
22f375300c chore: update golangci-lint to 1.38.0
Fix all discovered issues.
Detected a couple of bugs and fixed them as well.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-12 06:50:02 -08:00
Alexey Palazhchenko
df52c13581 chore: fix //nolint directives
That's the recommended syntax:
https://golangci-lint.run/usage/false-positives/

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-05 05:58:33 -08:00
Artem Chernyshev
7108bb3f5b test: upgrade master to master tests
Verify upgrade flow using the same version of the installer.
Run that with disk encryption enabled.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-02-24 07:56:44 -08:00
Artem Chernyshev
02b3719df9 feat: skip filesystem for state and ephemeral partitions in the installer
The filesystem creation step is moved to a later stage: when Talos mounts
the partition for the first time.
It now checks whether the partition has a filesystem and, if it has none,
formats it right before mounting.

Additionally refactored mount options a bit:
- replaced separate options with a set of binary flags.
- implemented pre-mount and post-unmount hooks.

Also fixed typos in a couple of places and increased the timeout for `apid ready`.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-02-17 09:37:21 -08:00
Andrey Smirnov
0aaf8fa968 feat: replace bootkube with Talos-managed control plane
Control plane components are running as static pods managed by the
kubelets.

The whole subsystem is managed via resources/controllers from os-runtime.

Many supporting changes/refactoring to enable new code paths.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-26 14:22:35 -08:00
Andrey Smirnov
5e3b8ee099 fix: ignore pods spun up from checkpoints in health checks
Pods which are spun up as checkpoints shouldn't be counted towards the
normal pods spun up as part of the daemon set.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-22 13:24:30 -08:00
Andrey Smirnov
362bb933a8 test: add an extra 'node boot done' health check
This makes sure the node boot sequence is done before we consider the
cluster to be healthy.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-15 06:10:23 -08:00
Andrey Smirnov
07f4ed7fb4 feat: upgrade etcd to 3.4.14
No major fixes, just keeping version up to date.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-26 09:14:41 -08:00
Andrey Smirnov
8560fb9662 chore: enable nlreturn linter
Most of the fixes were automatically applied.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-09 06:48:07 -08:00
Andrey Smirnov
bc9e0c0dba fix: re-implement upgrade (install) with preserve
For 0.6 -> 0.7 upgrade, in any case config.yaml is preserved and moved
from `/boot` to `/system/state`.

For single node upgrade, `EPHEMERAL` partition is not touched and other
partitions are re-created as needed.

Bump provision tests to 0.6/0.7 upgrades as we get closer to the new
release.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-28 07:25:26 -07:00
Andrey Smirnov
d1c9fc1b49 fix: correctly calculate output width in colored health reporter
This was bugging me for quite some time, as (randomly) the output kept
scrolling with an endlessly repeated line when the terminal width was small.

The root cause was that `\t` on output takes a variable number of columns.
Also fixed small issues with counting UTF-8 runes correctly (important
for the spinner and if we ever do i18n), and made sure the spinner itself is
included in the calculations.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-21 06:19:35 -07:00
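
A rough sketch of the kind of width calculation described above, assuming tabs are replaced by a fixed number of spaces and the width is counted in runes rather than bytes. The function name, the tab replacement, and the single separating space are assumptions for this example, not the reporter's actual code. Rune count is only an approximation of display width, since wide characters can occupy two columns.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// tabReplacement is an assumed fixed expansion for tabs, so that the width
// of a line no longer depends on where the tab happens to fall.
const tabReplacement = "    "

// lineWidth returns the number of columns taken by "<spinner> <line>",
// counting runes instead of bytes.
func lineWidth(spinner, line string) int {
	line = strings.ReplaceAll(line, "\t", tabReplacement)

	// spinner + one separating space + the line itself
	return utf8.RuneCountInString(spinner) + 1 + utf8.RuneCountInString(line)
}

func main() {
	spinner := "◰" // a multi-byte UTF-8 rune that takes one column
	line := "waiting for etcd\tOK"

	fmt.Println("bytes:", len(spinner)+1+len(line))
	fmt.Println("columns:", lineWidth(spinner, line))
}
```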
Andrey Smirnov
d7f5de62c3 feat: colorize output of cluster health checks
It only gets enabled if output is a terminal. Failures which resolve
themselves are removed from the final output. Small spinner to indicate
progress.

While I was at it, I fixed client-side `talosctl health` when init node
is missing.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-06 07:59:30 -07:00
Andrey Smirnov
bddd4f1bf6 refactor: move external API packages into machinery/
This moves `pkg/config`, `pkg/client` and `pkg/constants`
under `pkg/machinery` umbrella.

`pkg/machinery` is published as a Go module inside the Talos repository.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-17 09:56:14 -07:00
Andrey Smirnov
9379cf9ee1 refactor: expose provision as public package
This change is only moving packages and updating import paths.

Goal: expose `internal/pkg/provision` as `pkg/provision` to enable other
projects to import the Talos provisioning library.

As cluster checks are almost always required as part of the provisioning
process, the package `internal/pkg/cluster` was also made public as
`pkg/cluster`.

Other changes were direct dependencies discovered by `importvet` which
were updated.

Public packages (useful, general purpose packages with stable API):

* `internal/pkg/conditions` -> `pkg/conditions`
* `internal/pkg/tail` -> `pkg/tail`

Private packages (used only internally by the provisioning library):

* `internal/pkg/inmemhttp` -> `pkg/provision/internal/inmemhttp`
* `internal/pkg/kernel/vmlinuz` -> `pkg/provision/internal/vmlinuz`
* `internal/pkg/cniutils` -> `pkg/provision/internal/cniutils`

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-12 05:12:05 -07:00