96 Commits

Author SHA1 Message Date
Dmitriy Matrenichev
4dbbf4ac50
chore: add generic methods and use them part #2
Use things from #5702.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-06-09 23:10:02 +08:00
Dmitriy Matrenichev
70fc424099
chore: add generic methods and use them
Things like ToSet, Keys etc...
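
A minimal sketch of what such generic helpers could look like with Go 1.18 type parameters. Only the names `ToSet` and `Keys` come from the message above; the signatures and bodies here are assumptions, not the code from this commit.

```go
package main

import (
	"fmt"
	"sort"
)

// Keys returns the keys of m in unspecified order.
// (Hypothetical signature, for illustration only.)
func Keys[K comparable, V any](m map[K]V) []K {
	keys := make([]K, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	return keys
}

// ToSet builds a set (a map with empty-struct values) from a slice.
// (Hypothetical signature, for illustration only.)
func ToSet[T comparable](items []T) map[T]struct{} {
	set := make(map[T]struct{}, len(items))
	for _, item := range items {
		set[item] = struct{}{}
	}
	return set
}

func main() {
	set := ToSet([]string{"a", "b", "a"})
	keys := Keys(set)
	sort.Strings(keys) // map iteration order is random, so sort for stable output
	fmt.Println(keys)  // duplicates collapse: [a b]
}
```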

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-06-09 02:59:23 +08:00
Utku Ozdemir
c19dd1b892
feat: add 'etcd members should be control plane nodes' health check
Add new health check which checks if the etcd members match the control plane nodes. Closes siderolabs#5553.
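
The idea behind such a check can be sketched as a set comparison. This is an illustration under assumptions: the function names and the use of plain address strings are hypothetical and not taken from the commit.

```go
package main

import (
	"fmt"
	"sort"
)

// diff returns the items of a that are not present in b, sorted.
func diff(a, b []string) []string {
	inB := make(map[string]struct{}, len(b))
	for _, item := range b {
		inB[item] = struct{}{}
	}
	var out []string
	for _, item := range a {
		if _, ok := inB[item]; !ok {
			out = append(out, item)
		}
	}
	sort.Strings(out)
	return out
}

// checkEtcdMembers (hypothetical name) fails when the two address sets differ.
func checkEtcdMembers(etcdMembers, controlPlaneNodes []string) error {
	missing := diff(controlPlaneNodes, etcdMembers)
	unexpected := diff(etcdMembers, controlPlaneNodes)
	if len(missing) > 0 || len(unexpected) > 0 {
		return fmt.Errorf("etcd members do not match control plane nodes: missing %v, unexpected %v",
			missing, unexpected)
	}
	return nil
}

func main() {
	err := checkEtcdMembers(
		[]string{"172.20.0.2", "172.20.0.3"},
		[]string{"172.20.0.2", "172.20.0.3", "172.20.0.4"},
	)
	fmt.Println(err)
}
```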

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2022-06-07 10:34:38 +02:00
Dmitriy Matrenichev
bf7a6443ee
feat: add 'etcd membership is consistent across nodes' health check
Add new health check which waits for all etcd members. Closes #5552.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-05-20 21:51:17 +08:00
Andrey Smirnov
5a91f6076d
fix: ignore completed pods in cluster health check
This fixes an error where integration tests become stuck with a message
like:

```
waiting for coredns to report ready: some pods are not ready: [coredns-868c687b7-g2z64]
```

After some random sequence of node restarts one of the pods might become
"stuck" in `Completed` state (as it is shown in `kubectl get pods`)
blocking the check, as the pod will never become ready.
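
A rough sketch of the filtering described above, using a trimmed-down stand-in type instead of the Kubernetes API types. The `Pod` struct and `notReady` function are hypothetical, and the sketch assumes that the `Completed` status shown by `kubectl get pods` corresponds to the pod phase `Succeeded`.

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down stand-in for a Kubernetes pod.
type Pod struct {
	Name  string
	Phase string // e.g. "Pending", "Running", "Succeeded"
	Ready bool
}

// notReady returns the names of pods that still block the readiness check,
// skipping pods that have already run to completion.
func notReady(pods []Pod) []string {
	var names []string
	for _, p := range pods {
		if p.Phase == "Succeeded" {
			continue // a completed pod will never become ready
		}
		if !p.Ready {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	pods := []Pod{
		{Name: "coredns-a", Phase: "Running", Ready: true},
		{Name: "coredns-b", Phase: "Succeeded", Ready: false},
	}
	fmt.Println(len(notReady(pods))) // 0: the completed pod is ignored
}
```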

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-05-16 14:28:25 +03:00
Andrey Smirnov
f1f43131f8
fix: strip 'v' prefix from versions on Kubernetes upgrade
This fixes an issue where `talosctl upgrade-k8s` fails with an unhelpful
message if the version is specified as `v1.23.5` rather than `1.23.5`.
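
The normalization can be sketched with the standard library alone. The function name `normalizeVersion` is an assumption for illustration; the commit may implement this differently.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeVersion (hypothetical name) drops a single leading "v" so that
// "v1.23.5" and "1.23.5" are treated the same.
func normalizeVersion(version string) string {
	return strings.TrimPrefix(version, "v")
}

func main() {
	fmt.Println(normalizeVersion("v1.23.5")) // 1.23.5
	fmt.Println(normalizeVersion("1.23.5"))  // 1.23.5
}
```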

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-04-22 14:59:12 +03:00
Andrey Smirnov
4eb9f45cc8
refactor: split polymorphic K8sControlPlane into typed resources
Having polymorphic (spec type depends on ID) resources is not a good
idea, and it's not compatible with protobuf encoding.

Introduce new resources for each polymorphic sub-spec using new Go 1.18
generic typed.Resource to reduce the boilerplate code.

(Still needs proper deepcopy-gen, but I'm skipping it for now, as
K8sControlPlane had also broken deep copy).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-04-19 16:53:09 +03:00
Andrey Smirnov
8af50fcd27
fix: correct cri package import path
The containerd CRI plugin was merged into the main repo, but we were using
the old import path, so our constants coming from the module were outdated.

This fixes the image version for the pause container.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-04-14 16:27:45 +03:00
Andrey Smirnov
0cb84e8c1a
fix: correctly parse tags out of images
Use the last `:` in the image reference.

Handle the case when no version was discovered.
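
A minimal sketch of the "last `:`" rule, assuming plain `name:tag` references (digest references such as `name@sha256:...` are out of scope here). The function name `imageTag` is hypothetical, and the extra guard for registry ports is an assumption of this sketch, not something stated in the commit.

```go
package main

import (
	"fmt"
	"strings"
)

// imageTag (hypothetical name) returns the tag of an image reference,
// splitting on the last ':'. ok is false when no tag is present.
func imageTag(ref string) (tag string, ok bool) {
	idx := strings.LastIndex(ref, ":")
	if idx < 0 || strings.Contains(ref[idx+1:], "/") {
		// No ':' at all, or the last ':' belongs to a registry port.
		return "", false
	}
	return ref[idx+1:], true
}

func main() {
	fmt.Println(imageTag("registry.local:5000/kube-apiserver:v1.23.5")) // v1.23.5 true
	fmt.Println(imageTag("registry.local:5000/kube-apiserver"))         //  false
}
```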

See https://github.com/siderolabs/theila/issues/138

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-04-07 19:32:12 +03:00
Andrey Smirnov
2ca5279e56
fix: retry manifest updates in upgrade-k8s
This has recently shown up frequently in integration-provision tests
(might be related to the Kubernetes upgrade), but errors should be
retried in any case.

Refactored the function to extract the retryable part.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-04-01 16:20:25 +03:00
Andrey Smirnov
ca8b9c0a3a
feat: update Kubernetes to 1.24.0-alpha.4
See https://github.com/kubernetes/kubernetes/releases/tag/v1.24.0-alpha.4

Fix some incompatibilities around dropped flags/API versions.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-30 22:59:07 +03:00
Dmitriy Matrenichev
e06e1473b0
feat: update golangci-lint to 1.45.0 and gofumpt to 0.3.0
- Update golangci-lint to 1.45.0
- Update gofumpt to 0.3.0
- Fix gofumpt errors
- Add goimports and format imports since gofumports is removed
- Update Dockerfile
- Fix .golangci.yml configuration
- Fix linting errors

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-03-24 08:14:04 +04:00
Artem Chernyshev
27af5d41c6
feat: pause the boot process on some failures instead of rebooting
Some failures can be fixed by updating the machine configuration.
Failures in `userDisks` and `userFiles` no longer send Talos into a reboot
loop; instead, the boot process pauses for 35 minutes.

Additionally, `apid` and `machined` are now started right after
containerd is up and running.

That makes it possible for the operator to connect to the node using
talosctl and fix the config.

Fixes: https://github.com/talos-systems/talos/issues/4669
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-03-21 17:39:45 +03:00
Andrey Smirnov
50594ab1a7
fix: ignore terminated pods in pod health checks
With graceful kubelet shutdown (#5108), after graceful node restart pods
on the restarted node might stay in the status `Terminated` which breaks
the check on pod readiness.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-17 19:17:56 +03:00
Andrew Rynhard
84ee1795dc
docs: update logo
Changes the logo and reformats the description on the front page.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2022-03-16 15:57:56 +03:00
Noel Georgi
dcde2c4f68
chore: update k8s upgrade message
Update k8s upgrade message

Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-01-31 16:49:25 +05:30
Artem Chernyshev
831f65a07f
fix: close client provider instead of Talos client in the upgrade module
Otherwise it breaks Theila, which never closes Talos clients during
operation.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-01-27 15:07:28 +03:00
Seán C McCord
6af83afd5a
fix: handle multiple-IP cluster nodes
Allow cluster nodes to have multiple internal IP addresses when checking
for all Kubernetes nodes.

Fixes #4807

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2022-01-17 11:41:54 -05:00
Artem Chernyshev
2f2bdb26aa
feat: replace flags with --mode in apply, edit and patch commands
Fixes: https://github.com/talos-systems/talos/issues/4588

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-01-13 16:09:53 +03:00
Andrey Smirnov
2f4b9d8d6d
feat: make machine configuration read-only in Talos (almost)
Talos shouldn't try to re-encode the machine config it was provided
with.

So add a `ReadonlyWrapper` around `*v1alpha1.Config` which makes sure
that raw config object is not available anymore (it's a private field),
but config accessors are available for read-only access.

`ReadonlyWrapper` also preserves the original `[]byte` encoding of the
config, keeping it exactly as it was loaded from file or read over the
network.

Improved `talosctl edit mc` to preserve the config as it was submitted,
and preserve the edits on error from Talos (previously edits were lost).

`ReadonlyWrapper` is not used on the config generation path though: config
there is represented by `*v1alpha1.Config` and can be freely modified.

Why almost? Some parts of Talos (platform code) patch the machine
configuration with new data. We need to fix platforms to provide
networking configuration in a different way, but this will come with
other PRs later.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-12-28 20:12:55 +03:00
Andrey Smirnov
f49f40a336
fix: pass path to conformance retrieve results
Sonobuoy once again changed the API in a way that breaks our tool.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-12-22 17:28:05 +03:00
Andrey Smirnov
dc9a0cfe94
chore: bump Go dependencies
Bump all dependencies and update usages of `grpc.WithInsecure()`, which is
now deprecated.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-12-20 23:05:32 +03:00
Andrey Smirnov
97ffa7a645
feat: upgrade kubelet version in talosctl upgrade-k8s
Fixes #4656

Now that changes to kubelet configuration can be applied without a reboot,
`talosctl upgrade-k8s` can handle the kubelet upgrades as well.

The gist is simply modifying the machine config and waiting for the `Node`
version to be updated; the rest of the code is required for the
reliability of the process.

Also fixed a bug in the API while watching deleted items with
tombstones.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-12-08 21:12:17 +03:00
Andrey Smirnov
753a82188f
refactor: move pkg/resources to machinery
Fixes #4420

No functional changes, just moving packages around.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-11-15 19:50:35 +03:00
Alexey Palazhchenko
95105071de
chore: fix simple issues found by golangci-lint
Avoid slice mutation with append.
Simplify code.
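
The kind of bug this class of lint finding guards against can be shown in a few lines. This is a generic illustration, not code from the commit: when a slice has spare capacity, `append` writes into the shared backing array, so two appends to the same base can overwrite each other. The `withExtra` helper is hypothetical.

```go
package main

import "fmt"

// withExtra (hypothetical helper) returns base plus extra without ever
// writing into base's backing array.
func withExtra(base []string, extra ...string) []string {
	out := make([]string, 0, len(base)+len(extra))
	out = append(out, base...)
	return append(out, extra...)
}

func main() {
	backing := make([]string, 2, 4) // spare capacity lets append reuse the array
	backing[0], backing[1] = "a", "b"

	x := append(backing, "x") // writes "x" into the shared array
	y := append(backing, "y") // overwrites the same slot with "y"
	fmt.Println(x[2], y[2])   // y y

	safeX := withExtra(backing, "x")
	safeY := withExtra(backing, "y")
	fmt.Println(safeX[2], safeY[2]) // x y
}
```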

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>
2021-11-12 15:20:28 +00:00
Alexey Palazhchenko
8e8687d759
fix: use temporary sonobuoy version
`replace` should be removed when v0.55.1+ is released.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>
2021-11-12 11:34:09 +00:00
Alexey Palazhchenko
d6147eb17d
chore: update sonobuoy
See https://github.com/vmware-tanzu/sonobuoy/issues/1520.

Closes #4516.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>
2021-11-11 14:53:54 +00:00
Artem Chernyshev
261c497c71
feat: implement talosctl support command
Fixes: https://github.com/talos-systems/talos/issues/4406

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2021-11-08 16:20:50 +03:00
Andrey Smirnov
ae5af9d3fa
feat: update Kubernetes to 1.23.0-alpha.3
See https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.23.md#v1230-alpha3

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-10-22 14:59:41 +03:00
Artem Chernyshev
e3e2113adc
feat: upgrade CoreDNS during upgrade-k8s call
Fixes: https://github.com/talos-systems/talos/issues/4065

Get all Talos generated manifests and apply them, wait for deployments to be
updated and to become ready.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2021-10-13 15:47:06 +03:00
Andrey Smirnov
a1c9d64907
fix: update the way results are retrieved for certified conformance
Looks like we bumped sonobuoy library, and it silently changed a lot of
things in the way it works with the results.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-09-13 23:32:59 +03:00
Alexey Palazhchenko
d53e9e8963
chore: use named constants
Just for consistency.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>
2021-09-07 12:13:48 +00:00
Alexey Palazhchenko
032e7c6b86
chore: import yaml.v3 consistently
Do not use yaml.v2.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>
2021-08-26 11:36:50 +00:00
Artem Chernyshev
2b614e430e
feat: check if cluster has deprecated resources versions
Fixes: https://github.com/talos-systems/talos/issues/4026

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2021-08-18 23:26:36 +03:00
Alexey Palazhchenko
09d70b7eaf
feat: update Kubernetes to v1.22.0
Closes #3967.
Closes #3997.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>
2021-08-06 09:06:32 -07:00
Andrey Smirnov
539f42090e
chore: bump dependencies via dependabot
Fixes #3993

Fixes #3994

Fixes #3995

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-08-03 10:25:17 -07:00
Andrey Smirnov
0c7ce1cd81
feat: remove remnants of bootkube support
Fixes #3951

Bootkube support was removed in Talos 0.9. Talos versions 0.9-0.11
support conversion of self-hosted bootkube-based control plane to the
new style control plane running as static pods managed by Talos.

This commit removes all backwards compatibility and removes conversion
code.

For the k8s controllers, `BootstrapStatus` is removed and a dependency
on `etcd` service status is added (as it was implicitly there via
`BootstrapStatus`).

Remove control plane conversion code.

In k8s upgrade code, remove self-hosted part.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-08-03 07:55:42 -07:00
Artem Chernyshev
70d2505b7c
fix: do not require ToVersion to be set when detecting version
We do not know the upgrade version when checking component versions in
Theila.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-07-21 08:51:26 -07:00
Artem Chernyshev
f8f1c83a75
feat: detect the lowest Kubernetes version in upgrade-k8s CLI command
Scan all pods in `kube-system` and find `kube-proxy`, `kube-scheduler`,
`kube-controller-manager` and `kube-apiserver` ones, then check the
lowest version amongst them.
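
Finding the lowest version needs a numeric comparison, since a plain string comparison would rank `1.9.9` above `1.21.0`. The sketch below assumes the versions have already been extracted from the pod images and are plain `major.minor.patch` strings; pre-release suffixes are rejected rather than handled. All function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parse splits "1.21.2" (or "v1.21.2") into numeric components.
func parse(version string) ([]int, error) {
	parts := strings.Split(strings.TrimPrefix(version, "v"), ".")
	nums := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, fmt.Errorf("invalid version %q: %w", version, err)
		}
		nums[i] = n
	}
	return nums, nil
}

// less reports whether a sorts before b, component by component.
func less(a, b []int) bool {
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return len(a) < len(b)
}

// lowest returns the smallest version in versions.
func lowest(versions []string) (string, error) {
	if len(versions) == 0 {
		return "", errors.New("no versions given")
	}
	lowestSoFar := versions[0]
	lowestParsed, err := parse(lowestSoFar)
	if err != nil {
		return "", err
	}
	for _, v := range versions[1:] {
		parsed, err := parse(v)
		if err != nil {
			return "", err
		}
		if less(parsed, lowestParsed) {
			lowestSoFar, lowestParsed = v, parsed
		}
	}
	return lowestSoFar, nil
}

func main() {
	v, err := lowest([]string{"1.21.2", "1.21.0", "1.9.9"})
	fmt.Println(v, err) // 1.9.9 <nil>
}
```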

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-07-19 08:24:04 -07:00
Artem Chernyshev
2e463348b2
fix: pass all logs through the options.Log method
Looks like I've missed some 🤦

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-07-15 08:32:48 -07:00
Artem Chernyshev
bf61c2cc4a
fix: write upgrade logs only to the LogOutput if it's defined
No need to print them to stdout in that case.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-07-15 07:02:45 -07:00
Artem Chernyshev
23ef1d40af
chore: add ability to redirect talos upgrade module logs to io.Writer
This is useful for third-party code that uses the upgrade modules and
wants to collect the output logs instead of having them printed to
stdout.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-07-13 08:12:06 -07:00
Andrey Smirnov
10c28758a4
fix: ignore DeadlineExceeded error correctly on bootstrap
The problem was that the gRPC function `status.Code(err)` doesn't unwrap
errors, while the Talos client returns errors wrapped with
`multierror.Error` and `fmt.Errorf`, so `status.Code` doesn't return the
error code correctly.

Fix that by introducing our own client method which correctly goes over
the chain of wrapped errors.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-07 12:02:26 -07:00
Andrey Smirnov
6d13d2cf92
fix: close Kubernetes API client
The problem is that there's no official way to close the Kubernetes
client's underlying TCP/HTTP connections. So each time Talos initializes
a connection to the control plane endpoint, a new client is built, but
this client is never closed, so the connection stays active on the load
balancers, on the API server level, etc. It also eats some resources out
of Talos itself.

We add a way to close underlying connections by using helper from the
Kubernetes client libraries to force close all TCP connections which
should shut down all HTTP/2 connections as well.

Alternative approach might be to cache a client for some time, but many
of the clients are created with temporary PKI, so even cached client
still needs to be closed once it gets stale, and it's not clear how to
recreate a client in case existing one is broken for one reason or
another (and we need to force a re-connection).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-05 14:25:26 -07:00
Andrey Smirnov
e883c12b31
fix: make output of upgrade-k8s command less scary
This removes `retrying error` messages while waiting for the API server
pod state to reflect changes from the updated static pod definition.

Log more lines to notify about the progress.

Skip `kube-proxy` if not found (as we allow it to be disabled).

```
$ talosctl upgrade-k8s -n 172.20.0.2 --from 1.21.0 --to 1.21.2
discovered master nodes ["172.20.0.2" "172.20.0.3" "172.20.0.4"]
updating "kube-apiserver" to version "1.21.2"
 > "172.20.0.2": starting update
 > "172.20.0.2": machine configuration patched
 > "172.20.0.2": waiting for API server state pod update
 < "172.20.0.2": successfully updated
 > "172.20.0.3": starting update
 > "172.20.0.3": machine configuration patched
 > "172.20.0.3": waiting for API server state pod update
 < "172.20.0.3": successfully updated
 > "172.20.0.4": starting update
 > "172.20.0.4": machine configuration patched
 > "172.20.0.4": waiting for API server state pod update
 < "172.20.0.4": successfully updated
updating "kube-controller-manager" to version "1.21.2"
 > "172.20.0.2": starting update
 > "172.20.0.2": machine configuration patched
 > "172.20.0.2": waiting for API server state pod update
 < "172.20.0.2": successfully updated
 > "172.20.0.3": starting update
 > "172.20.0.3": machine configuration patched
 > "172.20.0.3": waiting for API server state pod update
 < "172.20.0.3": successfully updated
 > "172.20.0.4": starting update
 > "172.20.0.4": machine configuration patched
 > "172.20.0.4": waiting for API server state pod update
 < "172.20.0.4": successfully updated
updating "kube-scheduler" to version "1.21.2"
 > "172.20.0.2": starting update
 > "172.20.0.2": machine configuration patched
 > "172.20.0.2": waiting for API server state pod update
 < "172.20.0.2": successfully updated
 > "172.20.0.3": starting update
 > "172.20.0.3": machine configuration patched
 > "172.20.0.3": waiting for API server state pod update
 < "172.20.0.3": successfully updated
 > "172.20.0.4": starting update
 > "172.20.0.4": machine configuration patched
 > "172.20.0.4": waiting for API server state pod update
 < "172.20.0.4": successfully updated
updating daemonset "kube-proxy" to version "1.21.2"
kube-proxy skipped as DaemonSet was not found
```

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-01 06:54:36 -07:00
Andrey Smirnov
60d7360944
fix: ignore deadline exceeded errors on bootstrap
With the recent changes, the bootstrap API might wait for the time to be
in sync (as apid is launched before time is in sync). We set the timeout
to 500ms for the bootstrap API call, so there's a chance that a call
might time out, and we should ignore it.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-30 06:59:36 -07:00
Andrey Smirnov
d8c2bca1b5
feat: reimplement apid certificate generation on top of COSI
This PR can be split into two parts:

* controllers
* apid binding into COSI world

Controllers
-----------

* `k8s.EndpointController` provides control plane endpoints on worker
nodes (it isn't required for now on control plane nodes)
* `secrets.RootController` now provides OS top-level secrets (CA cert)
and secret configuration
* `secrets.APIController` generates API secrets (certificates) in a
slightly different way for workers and control plane nodes: control plane
nodes generate them directly, while workers reach out to `trustd` on
control plane nodes via the `k8s.Endpoint` resource

apid Binding
------------

Resource `secrets.API` provides binding to protobuf by converting
itself back and forth to protobuf spec.

apid no longer receives machine configuration, instead it receives
gRPC-backed socket to access Resource API. apid watches `secrets.API`
resource, fetches certs and CA from it and uses that in its TLS
configuration.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-23 13:07:00 -07:00
Alexey Palazhchenko
06209bba28
chore: update RBAC rules, remove old APIs
Refs #3421.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-18 09:54:49 -07:00
Andrey Smirnov
9f24b519dc
chore: remove bootkube check from cluster health check
We're no longer testing against Talos <= 0.8, so no reason to
run this check (even if it's no-op).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-17 10:04:32 -07:00
Alexey Palazhchenko
f63ab9dd9b
feat: implement talosctl config new command
Refs #3421.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-17 09:06:43 -07:00