See #7230
Refactor more config interfaces, move config accessor interfaces
to different package to break the dependency loop.
Make `.RawV1Alpha1()` method typed to avoid type assertions everywhere.
No functional changes.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#7159
The change looks big, but it's actually pretty simple inside: the static
pods had an annotation which tracks a version of the secrets which
forced control plane pods to reload on a change. At the same time
`kube-apiserver` can reload certificate inputs automatically from files
without restart.
So the inputs were split: the dynamic (for kube-apiserver) inputs don't
need to be reloaded, so its version is not tracked in static pod
annotation, so they don't cause a reload. The previous non-dynamic
resource still causes a reload, but it doesn't get updated when e.g.
node addresses change.
There might be many more refactoring done, the resource chain is a bit
of a mess there, but I wanted to keep number of changes minimal to keep
this backportable.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Making organization a interface for preparing to avoid giving
system:masters access to the talosctl kubeconfig generated certificate.
Signed-off-by: Niklas Wik <niklas.wik@nokia.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
There's a cyclic dependency on siderolink library which imports talos
machinery back. We will fix that after we get talos pushed under a new
name.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#6156
Now access from Talos itself goes with `talos:admin` username in the
Kubernetes API server audit log, while access with admin kubeconfig goes
with `admin` username as before.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
It wasn't used when building an endpoint to the local API server, so
Talos couldn't talk to the local API server when port was changed from
the default one.
Fixes#5706
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
With the advent of generics, redo pointer functionality and remove github.com/AlekSi/pointer dependency.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Talos historically relied on `kubernetes` `Endpoints` resource (which
specifies `kube-apiserver` endpoints) to find other controlplane members
of the cluster to connect to the `etcd` nodes for the cluster (when node
local etcd instance is not up, for example). This method works great,
but it relies on Kubernetes endpoint being up. If the Kubernetes API is
down for whatever reason, or if the loadbalancer malfunctions, endpoints
are not available and join/leave operations don't work.
This PR replaces the endpoints lookup to use the `Endpoints` COSI
resource which is filled in using two methods:
* from the discovery data (if discovery is enabled, default to enabled)
* from the Kubernetes `Endpoints` resource
If the discovery is disabled (or not available), this change does almost
nothing: still Kubernetes is used to discover control plane endpoints,
but as the data persists in memory, even if the Kubernetes control plane
endpoint went down, cached copy will be used to connect to the endpoint.
If the discovery is enabled, Talos can join the etcd cluster immediately
on boot without waiting for Kubernetes to be up on the bootstrap node
which means that Talos cluster initial bootstrap runs in parallel on all
control plane nodes, while previously nodes were waiting for the first
node to finish bootstrap enough to fill in the endpoints data.
As the `etcd` communication is anyways protected with mutual TLS,
there's no risk even if the discovery data is stale or poisoned, as etcd
operations would fail on TLS mismatch.
Most of the changes in this PR actually enable populating Talos
`Endpoints` resource based on the `Kubernetes` `endpoints` resource
using the watch API.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#4418
Only one resource (one of the very first ones) was polymorphic: its
actual spec type depends on its ID. This was a bad idea, and it doesn't
work with protobuf specs (as type <> protobuf relationship can't be
established).
Refactor this by splitting into three separate resource types:
`OSRoot` (OS-level root secrets), `EtcdRoot` (for etcd),
`KubernetesRoot` (for Kubernetes).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
With the last changes, `kube-apiserver` certificates are generated based
on the assigned `NodeAdresses`, machine configuration, etc. Whenver the
certificate is regenerated, `kube-apiserver` is reloaded to pick up the
new cert.
With Virtual IP enabled, Virtual IP address is included into the
certificate from the beginning as it is specified in the machine
configuration, but as virtual IP moves between the nodes this causes
`NodeAddresses` update, which triggers the controller, generates new
certs and reloads `kube-apiserver` at bad time (right after VIP got
moved). Even though the cert generated is identical to the previous one,
the API server reload makes it unavailable for 30-90 seconds.
This change extracts `CertSANs` as a separate resource so that its
updates are suppressed if the CertSANs sources change, but the final
list stays the same, and in turn prevents final certificate from being
updated.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This implements pushing to and pulling from Kubernetes cluster discovery
registry which is simply using extra Talos annotations on the Node
resources.
Note: cluster discovery is still disabled by default.
This means that each Talos node is going to push data from its own local
`Affiliate` structure to the `Node` resource, and also watches the other
`Node`s to scrape data to build `Affiliate`s from each other cluster
member.
Further down the pipeline, `Affiliate` is converted to a cluster
`Member` which is an easy way to see the cluster membership.
In its current form, `talosctl get members` is mostly equivalent to
`kubectl get nodes`, but as we add more registries, it will become more
powerful.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#4139
This builds the local (for the node) `Affiliate` structure which
describes node for the cluster discovery. Dependending on the
configuration, KubeSpan information might be included as well.
`NodeAddresses` were updated to hold CIDRs instead of simple IPs.
The `Affiliate` will be pushed to the registries, while `Affiliate`s for
other nodes will be fetched back from the registries.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This is a PR on a path towards removing `ApplyDynamicConfig`.
This fixes Kubernetes API server certificate generation to use dynamic
data to generate cert with proper SANs for IPs of the node.
As part of that refactored a bit apid certificate generation (without
any changes).
Added two unit-tests for apid and Kubernetes certificate generation.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This should fix an error like:
```
failed to create etcd client: error getting kubernetes endpoints: Unauthorized
```
The problem is that the generated cert was used immediately, so even
slight time sync issue across nodes might render the cert not (yet)
usable. Cert is generated on one node, but might be used on any other
node (as it goes via the LB).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This includes communication from controller-manager and scheduler to the
API server and manifest injection by Talos controllers.
This eliminates dependency on control plane endpoint to be up, and might
speed up bootstrap on platform where load balancer might need some time
to start proxying to the first API server instance.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes#3765
See #3581
There are several changes:
* `kube-controller-manager` insecure port is disabled
* `kube-controller-manager` and `kube-scheduler` now listen securely
only on localhost by default, this can be overridden with `--bind-addr`
in extra args
* `kube-controller-manager` and `kube-scheduler` now use kubeconfig with
limited access role instead of admin one
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This removes networkd, updates network ready condition, enables all the
controllers which were previously disabled.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Enable logging using default development config with some fine tuning.
Additionally, now `info` and below logs go to kmsg.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This is mostly refactoring to adapt to the new APIs.
There are some small changes which are not user-visible immediately (but
visible when using `talosctl get` to inspect low-level details):
* `extras` namespace is removed, it was a hack to distinguish extra and
system manifests
* `Manifests` are managed by two controllers as shared outputs, stored
in the `controlplane` namespace now
* `talosctl inspect dependencies` output got slightly changed
* resources now have `md.owner` set to the controller name which manages
the resource
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This is a complete rewrite of time sync process.
Now the time sync process starts early at boot time, and it adapts to
configuration changes:
* before config is available, `pool.ntp.org` is used
* once config is available, configured time servers are used
Controller updates same time sync resource as other controllers had
dependency on, so they have a chance to wait for the time sync event.
Talos services which depend on time now wait on same resource instead of
waiting on timed health.
New features:
* time sync now sticks to the particular time server unless there's an
error from that server, and server is changed in that case, this
improves time sync accuracy
* time sync acts on config changes immediately, so it's possible to
reconfigure time sync at any time
* there's a new 'epoch' field in time sync resources which allows
time-dependent controllers to regenerate certs when there's a big enough
jump in time
Features to implement later:
* apid shouldn't depend on timed, it should be started early and it
should regenerate certs on time jump
* trustd should be updated in same way
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
See https://github.com/talos-systems/os-runtime/pull/12 for new mnaming
conventions.
No functional changes.
Additionally implements printing extra columns in `talosctl get xyz`.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Talos generates in-cluster kubeconfig for the kube-scheduler and
kube-controller-manager to authenticate to kube-apiserver. Bug was that
validity of that kubeconfig was set to 24h by mistake. Fix that by
bumping validity to default for other Kubernetes certs (1 year).
Add a certificate refresh at 50% of the validity.
Fix bugs with copying secret resources which was leading to updates not
being propagated correctly.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes#3062
There's no user-visible change in this PR.
It carefully separates generated secrets (e.g. certs) from source
secrets from the config (e.g. CAs), so that certs are generated on
config changes which actually affect cert input.
And same way separates etcd and Kubernetes PKI, so if etcd CA got
changed, only etcd certs will be regenerated.
This should have noticeable impact with RSA-based PKI as it reduces
number of times PKI gets generated.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This is required to upgrade from Talos 0.8.x to 0.9.x. After the cluster
is fully upgraded, control plane is still self-hosted (as it was
bootstrapped with bootkube).
Tool `talosctl convert-k8s` (and library behind it) performs the upgrade
to self-hosted version.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Certificate generation depends on current time, and this bug is visible
on RPi which doesn't have RTC clock - controllers can generate certs
before `timed` does its initial sync creating certs which are not
usable.
Fix generates new intermediate resource `TimeSync` which tracks time
sync status (aggregates `timed` service status and `timed`
enabled/disabled in the config).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
ECDSA keys are smaller which decreases Talos config size, they are more
efficient in terms of key generation, signing, etc., so it makes boot
performance better (and config generation as well).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Control plane components are running as static pods managed by the
kubelets.
Whole subsystem is managed via resources/controllers from os-runtime.
Many supporting changes/refactoring to enable new code paths.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>