2550 Commits

Author SHA1 Message Date
Alexey Palazhchenko
4708beaee5 feat: implement talosctl config info command
Closes #3852.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-07-06 00:58:47 -07:00
Andrey Smirnov
6d13d2cf92 fix: close Kubernetes API client
The problem is that there's no official way to close Kubernetes client
underlying TCP/HTTP connections. So each time Talos initializes
connection to the control plane endpoint, new client is built, but this
client is never closed, so the connection stays active on the load
balancers, on the API server level, etc. It also eats some resources out
of Talos itself.

We add a way to close underlying connections by using helper from the
Kubernetes client libraries to force close all TCP connections which
should shut down all HTTP/2 connections as well.
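
As an illustration of the force-close idea, here is a minimal sketch of a
dialer that remembers every connection it opened and can close them all at
once. The helper actually used is not named in this message; `trackingDialer`,
`CloseAll` and `pipeDial` are hypothetical names, and an in-memory pipe stands
in for TCP so the example is self-contained.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// trackingDialer (hypothetical) wraps a dial function and remembers every
// connection it opened so that all of them can be force-closed later.
type trackingDialer struct {
	dial func(network, addr string) (net.Conn, error)

	mu    sync.Mutex
	conns map[net.Conn]struct{}
}

// Dial opens a connection through the wrapped function and records it.
func (d *trackingDialer) Dial(network, addr string) (net.Conn, error) {
	conn, err := d.dial(network, addr)
	if err != nil {
		return nil, err
	}

	d.mu.Lock()
	defer d.mu.Unlock()

	if d.conns == nil {
		d.conns = map[net.Conn]struct{}{}
	}
	d.conns[conn] = struct{}{}

	return conn, nil
}

// CloseAll force-closes every tracked connection and reports how many it closed.
func (d *trackingDialer) CloseAll() int {
	d.mu.Lock()
	conns := d.conns
	d.conns = nil
	d.mu.Unlock()

	for conn := range conns {
		conn.Close()
	}

	return len(conns)
}

// pipeDial is a stand-in for a real TCP dial: it returns one end of an
// in-memory pipe and ignores the address.
func pipeDial(network, addr string) (net.Conn, error) {
	client, _ := net.Pipe()
	return client, nil
}

func main() {
	d := &trackingDialer{dial: pipeDial}

	conn, _ := d.Dial("tcp", "controlplane.example:6443")
	d.Dial("tcp", "controlplane.example:6443")

	fmt.Println("closed:", d.CloseAll())

	// The connection is unusable once it has been force-closed.
	_, err := conn.Write([]byte("ping"))
	fmt.Println("write after close failed:", err != nil)
}
```

In a real client the `Dial` method would be plugged into the HTTP transport,
so closing the tracked connections also tears down whatever runs on top of
them.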

Alternative approach might be to cache a client for some time, but many
of the clients are created with temporary PKI, so even cached client
still needs to be closed once it gets stale, and it's not clear how to
recreate a client in case existing one is broken for one reason or
another (and we need to force a re-connection).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-05 14:25:26 -07:00
Andrey Smirnov
aaa36f3b4f fix: ignore 'not a leader' error on forfeit leadership
When forfeiting etcd leadership, it might be that the node still reports
leadership status while not being a leader once the actual API call is
used. We should ignore such an error as the node is not a leader.
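
A minimal sketch of this kind of error filtering, assuming the "not a leader"
condition can be matched with `errors.Is`. The sentinel `errNotLeader` and the
function `forfeitLeadership` are hypothetical stand-ins, not the actual etcd
client API.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotLeader is a hypothetical sentinel standing in for the error returned
// when the node is no longer the leader at the time of the call.
var errNotLeader = errors.New("not a leader")

// forfeitLeadership calls moveLeader and treats "not a leader" as success:
// if the node is not the leader, there is nothing left to forfeit.
func forfeitLeadership(moveLeader func() error) error {
	err := moveLeader()
	if err == nil || errors.Is(err, errNotLeader) {
		return nil
	}

	return fmt.Errorf("failed to forfeit leadership: %w", err)
}

func main() {
	// The reported leadership status was stale, so the call fails with a
	// wrapped "not a leader" error, which is ignored.
	err := forfeitLeadership(func() error {
		return fmt.Errorf("move leader: %w", errNotLeader)
	})
	fmt.Println(err == nil)
}
```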

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-05 14:23:24 -07:00
Andrey Smirnov
22a4193678 fix: workaround 'Unauthorized' errors when accessing Kubernetes API
This should fix an error like:

```
failed to create etcd client: error getting kubernetes endpoints: Unauthorized
```

The problem is that the generated cert was used immediately, so even a
slight time sync issue across nodes might render the cert not (yet)
usable. The cert is generated on one node, but might be used on any
other node (as it goes via the LB).
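
The effect of clock skew on a freshly issued cert can be sketched as below.
This message does not say how the workaround is implemented; backdating
`NotBefore` by an allowed skew is shown only as one common mitigation, and
`validAt`, `demo` and the 10 second skew are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// validAt reports whether a certificate with the given validity window is
// usable at time t, mirroring the NotBefore/NotAfter check performed during
// certificate verification.
func validAt(notBefore, notAfter, t time.Time) bool {
	return !t.Before(notBefore) && !t.After(notAfter)
}

// demo returns the verification result without and with a backdated NotBefore.
func demo() (immediate, backdated bool) {
	issuedAt := time.Date(2021, 7, 5, 12, 0, 0, 0, time.UTC)
	notAfter := issuedAt.Add(time.Hour)

	// The verifying node's clock is two seconds behind the issuing node.
	verifierNow := issuedAt.Add(-2 * time.Second)

	// Used immediately: from the verifier's view the cert is not valid yet.
	immediate = validAt(issuedAt, notAfter, verifierNow)

	// Hypothetical mitigation: backdate NotBefore by an allowed skew.
	const allowedSkew = 10 * time.Second
	backdated = validAt(issuedAt.Add(-allowedSkew), notAfter, verifierNow)

	return immediate, backdated
}

func main() {
	immediate, backdated := demo()
	fmt.Println("valid when used immediately:", immediate)
	fmt.Println("valid with backdated NotBefore:", backdated)
}
```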

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-05 14:15:03 -07:00
Alexey Palazhchenko
71c6f7004e chore: bump go.mod dependencies
Closes #3879, #3880, #3881, #3882, #3883, #3884, #3885, #3886.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-07-05 06:59:14 -07:00
Alexey Palazhchenko
915cd8fe20 docs: add guide for RBAC
Document how to enable RBAC without screwing up.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-07-05 05:56:29 -07:00
Serge Logvinov
f5721050de fix: controlplane keyusage
* kube-apiserver keyusage serverAuth
* kube-scheduler keyusage clientAuth
* kube-controller-manager keyusage clientAuth
* kubeconfig keyusage clientAuth
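
The list above can be restated as a lookup table over the extended key usage
constants from Go's `crypto/x509`. This is only a sketch: the table and
`usageFor` are hypothetical and do not reflect how Talos wires these values
into certificate generation.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// extKeyUsages restates the list from this commit as a lookup table.
var extKeyUsages = map[string]x509.ExtKeyUsage{
	"kube-apiserver":          x509.ExtKeyUsageServerAuth,
	"kube-scheduler":          x509.ExtKeyUsageClientAuth,
	"kube-controller-manager": x509.ExtKeyUsageClientAuth,
	"kubeconfig":              x509.ExtKeyUsageClientAuth,
}

// usageFor returns the extended key usage for a component, if it is known.
func usageFor(component string) (x509.ExtKeyUsage, bool) {
	usage, ok := extKeyUsages[component]
	return usage, ok
}

func main() {
	for _, component := range []string{"kube-apiserver", "kube-scheduler"} {
		usage, _ := usageFor(component)

		// A certificate template carrying exactly one extended key usage.
		template := x509.Certificate{ExtKeyUsage: []x509.ExtKeyUsage{usage}}

		fmt.Println(component, "server auth:", template.ExtKeyUsage[0] == x509.ExtKeyUsageServerAuth)
	}
}
```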

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
2021-07-01 12:49:29 -07:00
Andrey Smirnov
3d7726613c fix: fill uuid argument correctly in the config download URL
It was broken, because `?uuid=` URL parses to `{"uuid": []string{""}}`.
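
The parsing behaviour can be reproduced with Go's `net/url`: a key with an
empty value is present in the parsed query and holds one empty string, so a
presence check alone is not enough. `uuidParam` is a hypothetical helper
written for this example.

```go
package main

import (
	"fmt"
	"net/url"
)

// uuidParam returns the first uuid value of a raw query string and whether
// the key is present at all.
func uuidParam(rawQuery string) (value string, present bool) {
	values, err := url.ParseQuery(rawQuery)
	if err != nil {
		return "", false
	}

	vs, ok := values["uuid"]
	if !ok || len(vs) == 0 {
		return "", false
	}

	return vs[0], true
}

func main() {
	// "uuid=" parses to a present key holding a single empty string.
	value, present := uuidParam("uuid=")
	fmt.Printf("%q %v\n", value, present)

	// Without the key, nothing is present.
	_, present = uuidParam("")
	fmt.Println(present)
}
```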

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-01 10:50:39 -07:00
Serge Logvinov
d8602025c8 chore: update containerd config version 2
* Rename key cri -> io.containerd.grpc.v1.cri
* Disable plugins aufs,zfs,devmapper,btrfs (less warning messages on
  boot time)

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
2021-07-01 09:08:54 -07:00
Andrey Smirnov
5949ec4e6e docs: describe the new network configuration subsystem
Internal details, resources, examples inspecting the configuration.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-01 09:02:56 -07:00
Spencer Smith
444d72b4d7 feat: update pkgs version
This PR bumps pkgs to v0.7.0-alpha.0, so that we gain a fix for
hotplugging of nvme drives.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2021-07-01 07:55:00 -07:00
Andrey Smirnov
e883c12b31 fix: make output of upgrade-k8s command less scary
This removes `retrying error` messages while waiting for the API server
pod state to reflect changes from the updated static pod definition.

Log more lines to notify about the progress.

Skip `kube-proxy` if not found (as we allow it to be disabled).

```
$ talosctl upgrade-k8s -n 172.20.0.2 --from 1.21.0 --to 1.21.2
discovered master nodes ["172.20.0.2" "172.20.0.3" "172.20.0.4"]
updating "kube-apiserver" to version "1.21.2"
 > "172.20.0.2": starting update
 > "172.20.0.2": machine configuration patched
 > "172.20.0.2": waiting for API server state pod update
 < "172.20.0.2": successfully updated
 > "172.20.0.3": starting update
 > "172.20.0.3": machine configuration patched
 > "172.20.0.3": waiting for API server state pod update
 < "172.20.0.3": successfully updated
 > "172.20.0.4": starting update
 > "172.20.0.4": machine configuration patched
 > "172.20.0.4": waiting for API server state pod update
 < "172.20.0.4": successfully updated
updating "kube-controller-manager" to version "1.21.2"
 > "172.20.0.2": starting update
 > "172.20.0.2": machine configuration patched
 > "172.20.0.2": waiting for API server state pod update
 < "172.20.0.2": successfully updated
 > "172.20.0.3": starting update
 > "172.20.0.3": machine configuration patched
 > "172.20.0.3": waiting for API server state pod update
 < "172.20.0.3": successfully updated
 > "172.20.0.4": starting update
 > "172.20.0.4": machine configuration patched
 > "172.20.0.4": waiting for API server state pod update
 < "172.20.0.4": successfully updated
updating "kube-scheduler" to version "1.21.2"
 > "172.20.0.2": starting update
 > "172.20.0.2": machine configuration patched
 > "172.20.0.2": waiting for API server state pod update
 < "172.20.0.2": successfully updated
 > "172.20.0.3": starting update
 > "172.20.0.3": machine configuration patched
 > "172.20.0.3": waiting for API server state pod update
 < "172.20.0.3": successfully updated
 > "172.20.0.4": starting update
 > "172.20.0.4": machine configuration patched
 > "172.20.0.4": waiting for API server state pod update
 < "172.20.0.4": successfully updated
updating daemonset "kube-proxy" to version "1.21.2"
kube-proxy skipped as DaemonSet was not found
```

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-01 06:54:36 -07:00
Andrey Smirnov
7f8e50de4d fix: restart the merge controllers on conflict
Fixes #3861

This change effectively replaces the immediate reconcile request with an
error return, so that the controller is restarted with a backoff.

More details:

* root cause of the update/teardown conflict is that the finalizer is
still pending on the tearing down resource
* finalizer might not be removed immediately, e.g. if the controller
which put the finalizer is itself in the crash loop
* if the merge controller queues reconcile immediately, it restarts
itself, but the finalizer is still there, so it goes into the reconcile
loop again, and that continues until the finalizer is removed
* if the controller fails instead, it is restarted with exponential
backoff, lowering the load on the system
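
To illustrate why a backoff lowers the load compared to an immediate retry,
here is a minimal sketch of a capped exponential delay. The function name and
the 100ms/2s values are assumptions for the example, not the parameters of the
controller runtime.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before restart number attempt (0-based):
// the delay doubles from base on every attempt and is capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}

	if d > max {
		d = max
	}

	return d
}

func main() {
	// While the finalizer is pending, each failed attempt waits longer
	// instead of spinning in a tight reconcile loop.
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, 100*time.Millisecond, 2*time.Second))
	}
}
```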

Change is validated with the unit-tests reproducing the conflict.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-30 08:23:33 -07:00
Andrey Smirnov
60d7360944 fix: ignore deadline exceeded errors on bootstrap
With the recent changes, bootstrap API might wait for the time to be in
sync (as the apid is launched before time is in sync). We set timeout to
500ms for the bootstrap API call, so there's a chance that a call might
time out, and we should ignore it.
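
A minimal sketch of the pattern using only the standard library: the call runs
under a timeout and a deadline exceeded error is treated as non-fatal.
`callBootstrap` and `slowCall` are hypothetical names; with a gRPC client the
check would typically be made on the returned status code rather than on
`context.DeadlineExceeded`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callBootstrap runs call under the given timeout and ignores a deadline
// exceeded error; any other error is returned to the caller.
func callBootstrap(timeout time.Duration, call func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	err := call(ctx)
	if errors.Is(err, context.DeadlineExceeded) {
		return nil
	}

	return err
}

// slowCall stands in for an API call that is still waiting (for example for
// time sync) when the deadline expires.
func slowCall(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	err := callBootstrap(500*time.Millisecond, slowCall)
	fmt.Println("error after timeout:", err)
}
```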

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-30 06:59:36 -07:00
Andrey Smirnov
ee06dd69fc fix: don't print git sha of the release twice in the dashboard
This is a small nit: the Talos version already contains the sha when it
is needed.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-30 03:45:34 -07:00
Andrey Smirnov
07fb61e5d2 fix: issue worker apid certs properly on renewal
This fixes an endless block in the RemoteGenerator.Close method by
rewriting the RemoteGenerator using the retry package.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-29 09:02:35 -07:00
Andrey Smirnov
84817f7334 chore: bump Talos version in upgrade tests
Preparing for 0.11 to be stable release soon.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-29 07:24:48 -07:00
Alexey Palazhchenko
2fa54107b2 chore: fix tests for disabled RBAC
This commit also introduces a hidden `--json` flag for `talosctl version` command
that is not supported and should be re-worked at #907.

Refs #3852.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-28 13:56:40 -07:00
Andrey Smirnov
78583ba985 fix: don't set bond delay options if miimon is not enabled
Basically all delay options are interlocked with `miimon`: if `miimon`
is zero, all delays are set to zero, and kernel complains even if zero
delay attribute is sent while miimon is zero.
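
The interlock can be sketched as follows: delay attributes are only included
when `miimon` is non-zero. The `bondSettings` type and its `attributes` method
are hypothetical and much simpler than the real link configuration; `updelay`
and `downdelay` are used here as examples of delay options.

```go
package main

import "fmt"

// bondSettings is a hypothetical, simplified view of bond options
// (all values in milliseconds).
type bondSettings struct {
	MIIMon    uint32
	UpDelay   uint32
	DownDelay uint32
}

// attributes returns the options to send to the kernel. Delay options are
// omitted entirely when miimon is zero, as even a zero delay is rejected then.
func (b bondSettings) attributes() map[string]uint32 {
	attrs := map[string]uint32{"miimon": b.MIIMon}

	if b.MIIMon != 0 {
		attrs["updelay"] = b.UpDelay
		attrs["downdelay"] = b.DownDelay
	}

	return attrs
}

func main() {
	fmt.Println(len(bondSettings{}.attributes()))
	fmt.Println(len(bondSettings{MIIMon: 100, UpDelay: 200, DownDelay: 200}.attributes()))
}
```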

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-28 11:55:39 -07:00
Alexey Palazhchenko
bbf1c091d4 feat: add RBAC to talosctl version output
Refs #3852.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-28 07:10:25 -07:00
Andrey Smirnov
5f6ec3ef66 fix: handle cases when merged resource re-appears before being destroyed
The sequence of events to reproduce the problem:

* some resource was merged as final representation with ID `x`
* underlying source resource gets destroyed
* merge controller marks final resource `x` for teardown and waits
for the finalizers to be empty
* another source resource appears which gets merged to same final `x`
* as `x` is in the teardown phase, spec controller will ignore it
* merge controller doesn't see the problem as well, as `x` spec is
correct, but the phase is wrong (which merge controller ignores)

This pulls in COSI fix to return an error if a resource in teardown
phase is modified. This way merge controller knows that the resource `x`
is in the teardown phase, so it should be first fully torn down, and
then new representation should be re-created as new resource with same ID
`x`.
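
A minimal sketch of the behaviour described above: an update to a resource in
the teardown phase is rejected with an error, which tells the caller to wait
for the teardown to finish and then re-create the resource. All types and
names here are hypothetical and are not the COSI API.

```go
package main

import (
	"errors"
	"fmt"
)

type phase int

const (
	phaseRunning phase = iota
	phaseTearingDown
)

// resource is a hypothetical, simplified resource with a phase and a spec.
type resource struct {
	id    string
	phase phase
	spec  string
}

// errTearingDown signals that the resource has to be fully torn down and
// re-created instead of being modified in place.
var errTearingDown = errors.New("resource is in teardown phase")

// update modifies the spec unless the resource is being torn down.
func update(r *resource, spec string) error {
	if r.phase == phaseTearingDown {
		return errTearingDown
	}

	r.spec = spec

	return nil
}

func main() {
	x := &resource{id: "x", phase: phaseTearingDown, spec: "old"}

	if err := update(x, "new"); errors.Is(err, errTearingDown) {
		// Wait for finalizers to be removed, destroy x, then create it again.
		fmt.Println("update rejected:", err)
	}
}
```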

Regression unit-tests included (they don't reproduce the sequence of
events always reliably, but they do with 10% probability).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-28 06:45:44 -07:00
Rui Lopes
1e9a0e745d fix: documentation typos
Fix a couple of documentation typos.

Signed-off-by: Rui Lopes <rgl@ruilopes.com>
2021-06-28 02:50:31 -07:00
Alexey Palazhchenko
f228af4061 chore: bump go.mod dependencies
Closes #3848, #3849, #3850, #3851.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-28 02:25:43 -07:00
Spencer Smith
2060ceaa0b chore: add CAPI version to CI setup
This PR makes sure we pin to a known CAPI version because with the new
v0.4.x released, we'll fail until we support the v1alpha4 APIs.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2021-06-25 10:44:07 -04:00
Alexey Palazhchenko
ad047a7dee chore: small RBAC improvements
* `talosctl config new` now sets endpoints in the generated config.
* Avoid duplication of roles in metadata.
* Remove method name prefix handling. All methods should be set explicitly.
* Add tests.

Closes #3421.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-25 05:50:38 -07:00
Andrey Smirnov
829e54f1a4 fix: limit apid access to COSI runtime resources
This makes sure that apid can't access any resources other than the
ones it actually needs. This improves the security in case of a
container breach.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-24 14:10:27 -07:00
Andrey Smirnov
f9e01d0274 fix: ignore EINVAL on unmount operations
Fixes #3837

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-24 14:09:53 -07:00
Artem Chernyshev
7672435e16 feat: add a method to get gRPC connection from the client
This change is for Theila which is going to use gRPC proxy to forward
requests from TS frontend right to the node's apid.
`gRPC` proxy operates on top of `grpc.ClientConn` objects, so getting
this connection from the clients which are already being created is the
easiest path.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-06-24 23:02:12 +03:00
Andrey Smirnov
b5244bf182 chore: bump go.mod dependencies, fix netaddr API changes
Bump dependencies, clean up go.mod files, update for netaddr changes
(all around `netaddr.IPPrefix` being a private struct now).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-24 08:37:37 -07:00
Serge Logvinov
c7e6225671 chore: update coredns to 1.8.4
* Coredns 1.8.0 -> 1.8.4
* Add RBAC endpointslices list/watch

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
2021-06-24 07:47:36 -07:00
Andrey Smirnov
3a34f1a51d chore: bump Talos Go modules to release versions
No actual changes, just tag updates.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-24 07:45:41 -07:00
Andrey Smirnov
8d60abff7a chore: use tagged versions of bldr dependencies for 0.11
No actual changes, just tag updates.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-24 07:17:16 -07:00
Serge Logvinov
8ef68a6fb8 feat: remove go-runner in staticpods
Do not use the legacy method to run the control plane

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
2021-06-24 06:06:05 -07:00
Andrey Smirnov
a650531fab release(v0.11.0-alpha.2): prepare release
This is the official v0.11.0-alpha.2 release.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
v0.11.0-alpha.2
2021-06-23 16:58:05 -07:00
Artem Chernyshev
71fff02ff0 fix: revert back resource.proto order
Otherwise it breaks older `talosctl` compatibility.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-06-23 16:37:36 -07:00
Andrey Smirnov
d3f4e6006f fix: replace tabs with spaces in console output
See https://github.com/talos-systems/go-kmsg/pull/2

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-23 16:12:47 -07:00
Artem Chernyshev
1990ad2525 feat: add created and updated timestamps to the resource metadata
This makes it possible to keep track of when the resource was created
and updated.
Update is tied to the version bump.
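
How the two timestamps relate to the version can be sketched as below: the
created timestamp is set once, and the updated timestamp moves together with
every version bump. The `metadata` type and its methods are hypothetical and
only illustrate the rule.

```go
package main

import (
	"fmt"
	"time"
)

// metadata is a hypothetical, simplified resource metadata record.
type metadata struct {
	version uint64
	created time.Time
	updated time.Time
}

// newMetadata sets both timestamps to the creation time.
func newMetadata(now time.Time) metadata {
	return metadata{version: 1, created: now, updated: now}
}

// bumpVersion increments the version and moves the updated timestamp with
// it; the created timestamp never changes.
func (m *metadata) bumpVersion(now time.Time) {
	m.version++
	m.updated = now
}

// example creates a resource and updates it one minute later.
func example() metadata {
	start := time.Date(2021, 6, 23, 13, 0, 0, 0, time.UTC)

	m := newMetadata(start)
	m.bumpVersion(start.Add(time.Minute))

	return m
}

func main() {
	m := example()
	fmt.Println("version:", m.version)
	fmt.Println("updated after created:", m.updated.After(m.created))
}
```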

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-06-23 13:56:49 -07:00
Spencer Smith
0731be908b feat: add cloud images to releases
This PR updates our CI so that when we release talos, a json file
containing our cloud images for AWS will be published as a release
asset.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2021-06-23 16:40:54 -04:00
Serge Logvinov
b52b206665 feat: split etcd certificates to peer/client
Changes:
* Etcd peer port key usage: ServerAuth,ClientAuth
* Etcd client port key usage: ServerAuth,ClientAuth
* Talos etcd client key usage: ClientAuth
* KubeAPI etcd client key usage: ClientAuth
* List of etcd allowed ciphers

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-23 13:26:48 -07:00
Andrey Smirnov
33119d2b8e chore: add an option to launch cluster with bad RTC state
This is useful for time sync testing.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-23 13:08:20 -07:00
Andrey Smirnov
d8c2bca1b5 feat: reimplement apid certificate generation on top of COSI
This PR can be split into two parts:

* controllers
* apid binding into COSI world

Controllers
-----------

* `k8s.EndpointController` provides control plane endpoints on worker
nodes (it isn't required for now on control plane nodes)
* `secrets.RootController` now provides OS top-level secrets (CA cert)
and secret configuration
* `secrets.APIController` generates API secrets (certificates) in a bit
different way for workers and control plane nodes: controlplane nodes
generate directly, while workers reach out to `trustd` on control plane
nodes via `k8s.Endpoint` resource

apid Binding
------------

Resource `secrets.API` provides binding to protobuf by converting
itself back and forth to protobuf spec.

apid no longer receives machine configuration, instead it receives
gRPC-backed socket to access Resource API. apid watches `secrets.API`
resource, fetches certs and CA from it and uses that in its TLS
configuration.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-23 13:07:00 -07:00
Alexey Palazhchenko
3c1b32199d chore: refactor CLI tests
Use testing.T.TempDir.
Add support for `talosctl --endpoints`.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-23 05:49:00 -07:00
Andrew Rynhard
0fd9ea2d63 feat: enable MACVTAP support
Brings in the latest version of `pkgs` with a kernel that has MACVTAP
support.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2021-06-23 05:17:33 -07:00
Spencer Smith
898673e8d3 chore: update e2e tests to use latest capi releases
This PR version bumps cacppt, cabpt, capa, capg, and cluster api itself

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2021-06-22 12:37:22 -07:00
Andrey Smirnov
e26c5583c2 docs: add AMI IDs for Talos 0.10.4
Just new AMI IDs.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-22 11:14:27 -07:00
Andrey Smirnov
72ef48f0ea fix: assign source address to the DHCP default gateway routes
This isn't strictly required, but it should be backwards compatible with
Talos 0.10 (networkd).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-22 10:01:43 -07:00
Andrey Smirnov
004885a379 feat: update Linux kernel to 5.10.45, etcd to 3.4.16
This also pulls in HP ILO driver, dmesg restrict mode by default and
dm-crypt options.

See talos-systems/pkgs#289, talos-systems/pkgs#290,
talos-systems/pkgs#287

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-22 02:42:09 -07:00
Andrew Rynhard
821f469a1d feat: skip overlay mount checks with docker
We need to be able to run an install with `docker run`. This checks if
we are running from docker and skips overlay mount checks if we are, as
docker creates a handful of overlay mounts by default that we can't
work around (not easily at least).

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2021-06-21 15:51:39 -07:00
Alexey Palazhchenko
b6e02311a3 feat: use COSI RD's sensitivity for RBAC
Refs #3421.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-21 14:05:06 -07:00
Serge Logvinov
46751c1ad2 feat: improve security of Kubernetes control plane components
Fix of fixes #3765

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
2021-06-21 13:04:54 -07:00