40 Commits

Author SHA1 Message Date
Andrey Smirnov
c3e4182000
refactor: use COSI runtime with new controller runtime DB
See https://github.com/cosi-project/runtime/pull/336

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-10-12 19:44:44 +04:00
Andrey Smirnov
3c9f7a7de6
chore: re-enable nolintlint and typecheck linters
Drop startup/rand.go: since Go 1.20 the global `math/rand` source is
seeded automatically.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-25 01:05:41 +04:00
Andrey Smirnov
dc6764871c
refactor: move around config interfaces, make RawV1Alpha1 typed
See #7230

Refactor more config interfaces, move config accessor interfaces
to different package to break the dependency loop.

Make `.RawV1Alpha1()` method typed to avoid type assertions everywhere.

No functional changes.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-23 22:08:58 +04:00
Andrey Smirnov
860002c735
fix: don't reload control plane pods on cert SANs changes
Fixes #7159

The change looks big, but it's actually pretty simple inside: the static
pods had an annotation which tracks a version of the secrets which
forced control plane pods to reload on a change. At the same time
`kube-apiserver` can reload certificate inputs automatically from files
without restart.

So the inputs were split: the dynamic (for kube-apiserver) inputs don't
need to be reloaded, so its version is not tracked in static pod
annotation, so they don't cause a reload. The previous non-dynamic
resource still causes a reload, but it doesn't get updated when e.g.
node addresses change.

Much more refactoring could be done here, as the resource chain is a bit
of a mess, but I wanted to keep the number of changes minimal to keep
this backportable.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-05 16:59:09 +04:00
Niklas Wik
34babe858d
chore: make organization selection an interface
Make organization an interface, in preparation for no longer giving
system:masters access to the certificate generated by talosctl kubeconfig.

Signed-off-by: Niklas Wik <niklas.wik@nokia.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-12-19 15:12:30 +04:00
Andrey Smirnov
a505b8909a
fix: update COSI and reset restart backoff on success
See https://github.com/cosi-project/runtime/pull/191

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-12-06 17:43:26 +04:00
Andrey Smirnov
96aa9638f7
chore: rename talos-systems/talos to siderolabs/talos
There's a cyclic dependency on siderolink library which imports talos
machinery back. We will fix that after we get talos pushed under a new
name.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-11-03 16:50:32 +04:00
Andrey Smirnov
6882725157
fix: use different username for Talos Kubernetes API access
Fixes #6156

Now access from Talos itself goes with `talos:admin` username in the
Kubernetes API server audit log, while access with admin kubeconfig goes
with `admin` username as before.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-09 19:30:36 +04:00
Andrey Smirnov
f62d17125b
chore: update crypto to use new import path siderolabs/crypto
No functional changes in this PR, just updating import paths.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-07 23:02:50 +04:00
Utku Ozdemir
ae3840dbc3
refactor: move kubeconfig package under public api
Move the kubeconfig package under pkg/ so that other projects can reuse parts of it.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2022-07-01 19:22:16 +02:00
Andrey Smirnov
da2985fe1b
fix: respect local API server port
It wasn't used when building an endpoint to the local API server, so
Talos couldn't talk to the local API server when port was changed from
the default one.

Fixes #5706

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-06-09 00:33:49 +04:00
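The port fix above (da2985fe1b) can be illustrated with a minimal sketch. It is not the code from the commit: the function name `localAPIServerEndpoint`, the `localhost` host, and the zero-means-default convention are assumptions made for illustration. The only point it shows is that the configured port has to be carried into the endpoint instead of a hard-coded default.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// defaultAPIServerPort is the conventional kube-apiserver secure port.
const defaultAPIServerPort = 6443

// localAPIServerEndpoint builds the URL of the API server on the same node.
// Hypothetical helper: a zero port falls back to the default, any other
// value is used as configured.
func localAPIServerEndpoint(port int) string {
	if port == 0 {
		port = defaultAPIServerPort
	}

	return "https://" + net.JoinHostPort("localhost", strconv.Itoa(port))
}

func main() {
	fmt.Println(localAPIServerEndpoint(0))
	fmt.Println(localAPIServerEndpoint(7443))
}
```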
Dmitriy Matrenichev
6351928611
chore: redo pointer with github.com/siderolabs/go-pointer module
With the advent of generics, redo pointer functionality and remove github.com/AlekSi/pointer dependency.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-05-02 02:17:13 +04:00
Andrey Smirnov
85b328e997
refactor: convert secrets resources to use typed.Resource
No functional changes.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-04-26 14:51:56 +03:00
Andrey Smirnov
e91350acd7
refactor: convert time & v1alpha1 resources to use typed.Resource
No functional changes, just pure refactoring.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-04-25 22:41:52 +03:00
Andrey Smirnov
b085343dcb
feat: use discovery information for etcd join (and other etcd calls)
Talos historically relied on `kubernetes` `Endpoints` resource (which
specifies `kube-apiserver` endpoints) to find other controlplane members
of the cluster to connect to the `etcd` nodes for the cluster (when node
local etcd instance is not up, for example). This method works great,
but it relies on Kubernetes endpoint being up. If the Kubernetes API is
down for whatever reason, or if the loadbalancer malfunctions, endpoints
are not available and join/leave operations don't work.

This PR replaces the endpoints lookup to use the `Endpoints` COSI
resource which is filled in using two methods:

* from the discovery data (if discovery is enabled, which is the default)
* from the Kubernetes `Endpoints` resource

If the discovery is disabled (or not available), this change does almost
nothing: still Kubernetes is used to discover control plane endpoints,
but as the data persists in memory, even if the Kubernetes control plane
endpoint went down, cached copy will be used to connect to the endpoint.

If the discovery is enabled, Talos can join the etcd cluster immediately
on boot without waiting for Kubernetes to be up on the bootstrap node
which means that Talos cluster initial bootstrap runs in parallel on all
control plane nodes, while previously nodes were waiting for the first
node to finish bootstrap enough to fill in the endpoints data.

As the `etcd` communication is anyways protected with mutual TLS,
there's no risk even if the discovery data is stale or poisoned, as etcd
operations would fail on TLS mismatch.

Most of the changes in this PR actually enable populating Talos
`Endpoints` resource based on the `Kubernetes` `endpoints` resource
using the watch API.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-04-21 22:00:27 +03:00
Dmitriy Matrenichev
e06e1473b0
feat: update golangci-lint to 1.45.0 and gofumpt to 0.3.0
- Update golangci-lint to 1.45.0
- Update gofumpt to 0.3.0
- Fix gofumpt errors
- Add goimports and format imports since gofumports is removed
- Update Dockerfile
- Fix .golangci.yml configuration
- Fix linting errors

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-03-24 08:14:04 +04:00
Andrey Smirnov
753a82188f
refactor: move pkg/resources to machinery
Fixes #4420

No functional changes, just moving packages around.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-11-15 19:50:35 +03:00
Andrey Smirnov
8329d21114
chore: split polymorphic RootSecret resource into specific types
Fixes #4418

Only one resource (one of the very first ones) was polymorphic: its
actual spec type depends on its ID. This was a bad idea, and it doesn't
work with protobuf specs (as type <> protobuf relationship can't be
established).

Refactor this by splitting into three separate resource types:
`OSRoot` (OS-level root secrets), `EtcdRoot` (for etcd),
`KubernetesRoot` (for Kubernetes).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-10-27 19:56:04 +03:00
Andrey Smirnov
c3b2429ce9
fix: suppress spurious Kubernetes API server cert updates
With the last changes, `kube-apiserver` certificates are generated based
on the assigned `NodeAddresses`, machine configuration, etc. Whenever the
certificate is regenerated, `kube-apiserver` is reloaded to pick up the
new cert.

With Virtual IP enabled, Virtual IP address is included into the
certificate from the beginning as it is specified in the machine
configuration, but as virtual IP moves between the nodes this causes
`NodeAddresses` update, which triggers the controller, generates new
certs and reloads `kube-apiserver` at a bad time (right after the VIP got
moved). Even though the cert generated is identical to the previous one,
the API server reload makes it unavailable for 30-90 seconds.

This change extracts `CertSANs` as a separate resource so that its
updates are suppressed if the CertSANs sources change, but the final
list stays the same, and in turn prevents final certificate from being
updated.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-09-09 00:31:54 +03:00
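The suppression idea in the commit above (c3b2429ce9) amounts to comparing the final SAN list and not its inputs. A minimal sketch, assuming a plain string list; the helper names `normalizeSANs` and `needsUpdate` are hypothetical and the real `CertSANs` resource is not modeled here:

```go
package main

import (
	"fmt"
	"slices"
)

// normalizeSANs returns a sorted, de-duplicated copy of the input list.
func normalizeSANs(sans []string) []string {
	out := slices.Clone(sans)
	slices.Sort(out)

	return slices.Compact(out)
}

// needsUpdate reports whether the final SAN list differs from the current
// one. Reordered or duplicated sources do not count as a change.
func needsUpdate(current, candidate []string) bool {
	return !slices.Equal(normalizeSANs(current), normalizeSANs(candidate))
}

func main() {
	current := []string{"10.0.0.2", "10.0.0.10", "cp.example.com"}

	// The VIP moved: the same addresses arrive in a different order.
	afterVIPMove := []string{"10.0.0.10", "cp.example.com", "10.0.0.2", "10.0.0.10"}

	fmt.Println(needsUpdate(current, afterVIPMove))
	fmt.Println(needsUpdate(current, append(afterVIPMove, "10.0.0.3")))
}
```

When the comparison reports no change, the downstream certificate stays untouched and `kube-apiserver` is not reloaded.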
Andrey Smirnov
af6622109f
feat: implement Kubernetes cluster discovery registry
This implements pushing to and pulling from Kubernetes cluster discovery
registry which is simply using extra Talos annotations on the Node
resources.

Note: cluster discovery is still disabled by default.

This means that each Talos node is going to push data from its own local
`Affiliate` structure to the `Node` resource, and also watches the other
`Node`s to scrape data to build `Affiliate`s from each other cluster
member.

Further down the pipeline, `Affiliate` is converted to a cluster
`Member` which is an easy way to see the cluster membership.

In its current form, `talosctl get members` is mostly equivalent to
`kubectl get nodes`, but as we add more registries, it will become more
powerful.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-09-03 22:09:26 +03:00
Andrey Smirnov
2c66e1b3c5
feat: provide building of local Affiliate structure (for the node)
Fixes #4139

This builds the local (for the node) `Affiliate` structure which
describes the node for the cluster discovery. Depending on the
configuration, KubeSpan information might be included as well.

`NodeAddresses` were updated to hold CIDRs instead of simple IPs.

The `Affiliate` will be pushed to the registries, while `Affiliate`s for
other nodes will be fetched back from the registries.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-09-03 16:44:19 +03:00
Andrey Smirnov
0b347570a7
feat: use dynamic NodeAddresses/HostnameStatus in Kubernetes certs
This is a PR on a path towards removing `ApplyDynamicConfig`.

This fixes Kubernetes API server certificate generation to use dynamic
data to generate cert with proper SANs for IPs of the node.

As part of that, apid certificate generation was refactored a bit
(without any functional changes).

Added two unit-tests for apid and Kubernetes certificate generation.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-09-01 20:56:53 +03:00
Andrey Smirnov
22a4193678 fix: workaround 'Unauthorized' errors when accessing Kubernetes API
This should fix an error like:

```
failed to create etcd client: error getting kubernetes endpoints: Unauthorized
```

The problem is that the generated cert was used immediately, so even
slight time sync issue across nodes might render the cert not (yet)
usable. Cert is generated on one node, but might be used on any other
node (as it goes via the LB).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-05 14:15:03 -07:00
Serge Logvinov
f5721050de fix: controlplane keyusage
* kube-apiserver keyusage serverAuth
* kube-scheduler keyusage clientAuth
* kube-controller-manager keyusage clientAuth
* kubeconfig keyusage clientAuth

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
2021-07-01 12:49:29 -07:00
Andrey Smirnov
70ac771e08 fix: use localhost API server endpoint for internal communication
This includes communication from controller-manager and scheduler to the
API server and manifest injection by Talos controllers.

This eliminates the dependency on the control plane endpoint being up,
and might speed up bootstrap on platforms where the load balancer needs
some time to start proxying to the first API server instance.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-18 12:06:47 -07:00
Andrey Smirnov
a941eb7da0 feat: improve security of Kubernetes control plane components
Fixes #3765

See #3581

There are several changes:

* `kube-controller-manager` insecure port is disabled
* `kube-controller-manager` and `kube-scheduler` now listen securely
only on localhost by default, this can be overridden with `--bind-addr`
in extra args
* `kube-controller-manager` and `kube-scheduler` now use kubeconfig with
limited access role instead of admin one

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-18 10:21:45 -07:00
Andrey Smirnov
f2ae9cd0c1 feat: replace networkd with new network implementation
This removes networkd, updates network ready condition, enables all the
controllers which were previously disabled.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-15 17:37:28 -07:00
Artem Chernyshev
1db301edf6 feat: switch controller-runtime to zap.Logger
Enable logging using default development config with some fine tuning.
Additionally, now `info` and below logs go to kmsg.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-05-25 02:15:31 -07:00
Andrey Smirnov
d24df8f844 chore: re-import talos-systems/os-runtime as cosi-project/runtime
No changes, just import path change (as project got moved).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-12 07:44:24 -07:00
Andrey Smirnov
fbfd1eb2b1 refactor: pull new version of os-runtime, update code
This is mostly refactoring to adapt to the new APIs.

There are some small changes which are not user-visible immediately (but
visible when using `talosctl get` to inspect low-level details):

* `extras` namespace is removed, it was a hack to distinguish extra and
system manifests
* `Manifests` are managed by two controllers as shared outputs, stored
in the `controlplane` namespace now
* `talosctl inspect dependencies` output got slightly changed
* resources now have `md.owner` set to the controller name which manages
the resource

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-07 06:55:09 -07:00
Andrey Smirnov
2ea20f598a feat: replace timed with time sync controller
This is a complete rewrite of time sync process.

Now the time sync process starts early at boot time, and it adapts to
configuration changes:

* before config is available, `pool.ntp.org` is used
* once config is available, configured time servers are used

The controller updates the same time sync resource that other controllers
already depend on, so they have a chance to wait for the time sync event.

Talos services which depend on time now wait on same resource instead of
waiting on timed health.

New features:

* time sync now sticks to the particular time server unless there's an
error from that server, and server is changed in that case, this
improves time sync accuracy

* time sync acts on config changes immediately, so it's possible to
reconfigure time sync at any time

* there's a new 'epoch' field in time sync resources which allows
time-dependent controllers to regenerate certs when there's a big enough
jump in time

Features to implement later:

* apid shouldn't depend on timed, it should be started early and it
should regenerate certs on time jump

* trustd should be updated in same way

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-29 09:29:43 -07:00
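Two behaviours from the commit above (2ea20f598a) lend themselves to a small sketch: falling back to `pool.ntp.org` until servers are configured, and sticking to one server until it reports an error. The `serverPicker` type is hypothetical, and moving to the next server in round-robin order is an assumption; the commit only says the server is changed on error.

```go
package main

import "fmt"

// defaultTimeServer is used before any configuration is available.
const defaultTimeServer = "pool.ntp.org"

// serverPicker sticks to one time server until an error is reported.
type serverPicker struct {
	servers []string
	current int
}

// newServerPicker uses the configured servers, or the default if there are none.
func newServerPicker(configured []string) *serverPicker {
	if len(configured) == 0 {
		configured = []string{defaultTimeServer}
	}

	return &serverPicker{servers: configured}
}

// Current returns the server to query; it does not change between calls.
func (p *serverPicker) Current() string {
	return p.servers[p.current]
}

// ReportError moves on to the next server (round-robin is an assumption).
func (p *serverPicker) ReportError() {
	p.current = (p.current + 1) % len(p.servers)
}

func main() {
	fmt.Println(newServerPicker(nil).Current())

	p := newServerPicker([]string{"time.example.org", "ntp.example.net"})
	fmt.Println(p.Current(), p.Current())

	p.ReportError()
	fmt.Println(p.Current())
}
```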
Artem Chernyshev
22f375300c chore: update golangci-lint to 1.38.0
Fix all discovered issues.
Detected a couple of bugs and fixed them as well.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-12 06:50:02 -08:00
Alexey Palazhchenko
df52c13581 chore: fix //nolint directives
That's the recommended syntax:
https://golangci-lint.run/usage/false-positives/

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-05 05:58:33 -08:00
Andrey Smirnov
60aa011c7a feat: rename namespaces, resources, types etc
See https://github.com/talos-systems/os-runtime/pull/12 for new naming
conventions.

No functional changes.

Additionally implements printing extra columns in `talosctl get xyz`.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-02 13:34:15 -08:00
Andrey Smirnov
31e56e63db fix: update in-cluster kubeconfig validity to match other certs
Talos generates in-cluster kubeconfig for the kube-scheduler and
kube-controller-manager to authenticate to kube-apiserver. Bug was that
validity of that kubeconfig was set to 24h by mistake. Fix that by
bumping validity to default for other Kubernetes certs (1 year).

Add a certificate refresh at 50% of the validity.

Fix bugs with copying secret resources which was leading to updates not
being propagated correctly.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-01 11:16:04 -08:00
Andrey Smirnov
b914398154 refactor: split kubernetes/etcd resource generation into subresources
Fixes #3062

There's no user-visible change in this PR.

It carefully separates generated secrets (e.g. certs) from source
secrets from the config (e.g. CAs), so that certs are generated on
config changes which actually affect cert input.

And same way separates etcd and Kubernetes PKI, so if etcd CA got
changed, only etcd certs will be regenerated.

This should have noticeable impact with RSA-based PKI as it reduces
number of times PKI gets generated.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-18 22:01:28 -08:00
Andrey Smirnov
7751920dba feat: add a tool and package to convert self-hosted CP to static pods
This is required to upgrade from Talos 0.8.x to 0.9.x. After the cluster
is fully upgraded, control plane is still self-hosted (as it was
bootstrapped with bootkube).

Tool `talosctl convert-k8s` (and library behind it) performs the
conversion from the self-hosted control plane to static pods.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-17 23:26:57 -08:00
Andrey Smirnov
85ae9f75e9 fix: wait for time sync before generating Kubernetes certificates
Certificate generation depends on the current time, and this bug is
visible on RPi, which doesn't have an RTC: controllers can generate certs
before `timed` does its initial sync, creating certs which are not
usable.

The fix introduces a new intermediate resource `TimeSync` which tracks
time sync status (it aggregates `timed` service status and `timed`
enabled/disabled in the config).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-09 10:01:19 -08:00
Andrey Smirnov
2277ce8abe feat: move to ECDSA keys for all Kubernetes/etcd certs and keys
ECDSA keys are smaller which decreases Talos config size, they are more
efficient in terms of key generation, signing, etc., so it makes boot
performance better (and config generation as well).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-02 13:25:00 -08:00
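The size claim in the commit above (2277ce8abe) can be checked with a short sketch that encodes one key of each kind and compares the lengths. The commit does not name the curve or the previous RSA key size, so P-256 and RSA-2048 are assumptions here, as are the helper names.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"fmt"
)

// ecdsaKeyDER generates an ECDSA P-256 key and returns its PKCS#8 encoding.
func ecdsaKeyDER() ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}

	return x509.MarshalPKCS8PrivateKey(key)
}

// rsaKeyDER generates an RSA key of the given size and returns its PKCS#8 encoding.
func rsaKeyDER(bits int) ([]byte, error) {
	key, err := rsa.GenerateKey(rand.Reader, bits)
	if err != nil {
		return nil, err
	}

	return x509.MarshalPKCS8PrivateKey(key)
}

func main() {
	ecKey, err := ecdsaKeyDER()
	if err != nil {
		panic(err)
	}

	rsaKey, err := rsaKeyDER(2048)
	if err != nil {
		panic(err)
	}

	fmt.Printf("ECDSA P-256: %d bytes, RSA-2048: %d bytes\n", len(ecKey), len(rsaKey))
}
```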
Andrey Smirnov
0aaf8fa968 feat: replace bootkube with Talos-managed control plane
Control plane components are running as static pods managed by the
kubelets.

Whole subsystem is managed via resources/controllers from os-runtime.

Many supporting changes/refactoring to enable new code paths.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-26 14:22:35 -08:00