talos

mirror of https://github.com/siderolabs/talos.git synced 2025-10-27 14:31:11 +01:00

Author	SHA1	Message	Date
Andrey Smirnov	c3e4182000	refactor: use COSI runtime with new controller runtime DB See https://github.com/cosi-project/runtime/pull/336 Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>	2023-10-12 19:44:44 +04:00
Andrey Smirnov	3c9f7a7de6	chore: re-enable nolintlint and typecheck linters Drop startup/rand.go, as since Go 1.20 `rand.Seed` is done automatically. Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>	2023-08-25 01:05:41 +04:00
Andrey Smirnov	dc6764871c	refactor: move around config interfaces, make RawV1Alpha1 typed See #7230 Refactor more config interfaces, move config accessor interfaces to different package to break the dependency loop. Make `.RawV1Alpha1()` method typed to avoid type assertions everywhere. No functional changes. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2023-05-23 22:08:58 +04:00
Andrey Smirnov	860002c735	fix: don't reload control plane pods on cert SANs changes Fixes #7159 The change looks big, but it's actually pretty simple inside: the static pods had an annotation which tracks a version of the secrets which forced control plane pods to reload on a change. At the same time `kube-apiserver` can reload certificate inputs automatically from files without restart. So the inputs were split: the dynamic (for kube-apiserver) inputs don't need to be reloaded, so their version is not tracked in the static pod annotation, and they don't cause a reload. The previous non-dynamic resource still causes a reload, but it doesn't get updated when e.g. node addresses change. Much more refactoring could be done, the resource chain is a bit of a mess there, but I wanted to keep the number of changes minimal to keep this backportable. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2023-05-05 16:59:09 +04:00
Niklas Wik	34babe858d	chore: make organization selection an interface Make organization an interface, in preparation for no longer giving system:masters access to the certificate in the kubeconfig generated by talosctl. Signed-off-by: Niklas Wik <niklas.wik@nokia.com> Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-12-19 15:12:30 +04:00
Andrey Smirnov	a505b8909a	fix: update COSI and reset restart backoff on success See https://github.com/cosi-project/runtime/pull/191 Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-12-06 17:43:26 +04:00
Andrey Smirnov	96aa9638f7	chore: rename talos-systems/talos to siderolabs/talos There's a cyclic dependency on siderolink library which imports talos machinery back. We will fix that after we get talos pushed under a new name. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-11-03 16:50:32 +04:00
Andrey Smirnov	6882725157	fix: use different username for Talos Kubernetes API access Fixes #6156 Now access from Talos itself goes with `talos:admin` username in the Kubernetes API server audit log, while access with admin kubeconfig goes with `admin` username as before. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-09-09 19:30:36 +04:00
Andrey Smirnov	f62d17125b	chore: update crypto to use new import path siderolabs/crypto No functional changes in this PR, just updating import paths. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-09-07 23:02:50 +04:00
Utku Ozdemir	ae3840dbc3	refactor: move kubeconfig package under public api Move the kubeconfig package under pkg/ so that other projects can reuse parts of it. Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>	2022-07-01 19:22:16 +02:00
Andrey Smirnov	da2985fe1b	fix: respect local API server port It wasn't used when building an endpoint to the local API server, so Talos couldn't talk to the local API server when port was changed from the default one. Fixes #5706 Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-06-09 00:33:49 +04:00
Dmitriy Matrenichev	6351928611	chore: redo pointer with github.com/siderolabs/go-pointer module With the advent of generics, redo pointer functionality and remove github.com/AlekSi/pointer dependency. Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>	2022-05-02 02:17:13 +04:00
Andrey Smirnov	85b328e997	refactor: convert secrets resources to use typed.Resource No functional changes. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-04-26 14:51:56 +03:00
Andrey Smirnov	e91350acd7	refactor: convert time & v1alpha1 resources to use typed.Resource No functional changes, just pure refactoring. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-04-25 22:41:52 +03:00
Andrey Smirnov	b085343dcb	feat: use discovery information for etcd join (and other etcd calls) Talos historically relied on `kubernetes` `Endpoints` resource (which specifies `kube-apiserver` endpoints) to find other controlplane members of the cluster to connect to the `etcd` nodes for the cluster (when node local etcd instance is not up, for example). This method works great, but it relies on Kubernetes endpoint being up. If the Kubernetes API is down for whatever reason, or if the loadbalancer malfunctions, endpoints are not available and join/leave operations don't work. This PR replaces the endpoints lookup to use the `Endpoints` COSI resource which is filled in using two methods: * from the discovery data (if discovery is enabled, default to enabled) * from the Kubernetes `Endpoints` resource If the discovery is disabled (or not available), this change does almost nothing: still Kubernetes is used to discover control plane endpoints, but as the data persists in memory, even if the Kubernetes control plane endpoint went down, cached copy will be used to connect to the endpoint. If the discovery is enabled, Talos can join the etcd cluster immediately on boot without waiting for Kubernetes to be up on the bootstrap node which means that Talos cluster initial bootstrap runs in parallel on all control plane nodes, while previously nodes were waiting for the first node to finish bootstrap enough to fill in the endpoints data. As the `etcd` communication is anyways protected with mutual TLS, there's no risk even if the discovery data is stale or poisoned, as etcd operations would fail on TLS mismatch. Most of the changes in this PR actually enable populating Talos `Endpoints` resource based on the `Kubernetes` `endpoints` resource using the watch API. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-04-21 22:00:27 +03:00
Dmitriy Matrenichev	e06e1473b0	feat: update golangci-lint to 1.45.0 and gofumpt to 0.3.0 - Update golangci-lint to 1.45.0 - Update gofumpt to 0.3.0 - Fix gofumpt errors - Add goimports and format imports since gofumports is removed - Update Dockerfile - Fix .golangci.yml configuration - Fix linting errors Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>	2022-03-24 08:14:04 +04:00
Andrey Smirnov	753a82188f	refactor: move pkg/resources to machinery Fixes #4420 No functional changes, just moving packages around. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-11-15 19:50:35 +03:00
Andrey Smirnov	8329d21114	chore: split polymorphic RootSecret resource into specific types Fixes #4418 Only one resource (one of the very first ones) was polymorphic: its actual spec type depends on its ID. This was a bad idea, and it doesn't work with protobuf specs (as type <> protobuf relationship can't be established). Refactor this by splitting into three separate resource types: `OSRoot` (OS-level root secrets), `EtcdRoot` (for etcd), `KubernetesRoot` (for Kubernetes). Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-10-27 19:56:04 +03:00
Andrey Smirnov	c3b2429ce9	fix: suppress spurious Kubernetes API server cert updates With the last changes, `kube-apiserver` certificates are generated based on the assigned `NodeAddresses`, machine configuration, etc. Whenever the certificate is regenerated, `kube-apiserver` is reloaded to pick up the new cert. With Virtual IP enabled, Virtual IP address is included into the certificate from the beginning as it is specified in the machine configuration, but as virtual IP moves between the nodes this causes `NodeAddresses` update, which triggers the controller, generates new certs and reloads `kube-apiserver` at a bad time (right after VIP got moved). Even though the cert generated is identical to the previous one, the API server reload makes it unavailable for 30-90 seconds. This change extracts `CertSANs` as a separate resource so that its updates are suppressed if the CertSANs sources change, but the final list stays the same, and in turn prevents final certificate from being updated. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-09-09 00:31:54 +03:00
Andrey Smirnov	af6622109f	feat: implement Kubernetes cluster discovery registry This implements pushing to and pulling from Kubernetes cluster discovery registry which is simply using extra Talos annotations on the Node resources. Note: cluster discovery is still disabled by default. This means that each Talos node is going to push data from its own local `Affiliate` structure to the `Node` resource, and also watches the other `Node`s to scrape data to build `Affiliate`s from each other cluster member. Further down the pipeline, `Affiliate` is converted to a cluster `Member` which is an easy way to see the cluster membership. In its current form, `talosctl get members` is mostly equivalent to `kubectl get nodes`, but as we add more registries, it will become more powerful. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-09-03 22:09:26 +03:00
Andrey Smirnov	2c66e1b3c5	feat: provide building of local `Affiliate` structure (for the node) Fixes #4139 This builds the local (for the node) `Affiliate` structure which describes node for the cluster discovery. Depending on the configuration, KubeSpan information might be included as well. `NodeAddresses` were updated to hold CIDRs instead of simple IPs. The `Affiliate` will be pushed to the registries, while `Affiliate`s for other nodes will be fetched back from the registries. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-09-03 16:44:19 +03:00
Andrey Smirnov	0b347570a7	feat: use dynamic NodeAddresses/HostnameStatus in Kubernetes certs This is a PR on a path towards removing `ApplyDynamicConfig`. This fixes Kubernetes API server certificate generation to use dynamic data to generate cert with proper SANs for IPs of the node. As part of that refactored a bit apid certificate generation (without any changes). Added two unit-tests for apid and Kubernetes certificate generation. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-09-01 20:56:53 +03:00
Andrey Smirnov	22a4193678	fix: workaround 'Unauthorized' errors when accessing Kubernetes API This should fix an error like: ``` failed to create etcd client: error getting kubernetes endpoints: Unauthorized ``` The problem is that the generated cert was used immediately, so even slight time sync issue across nodes might render the cert not (yet) usable. Cert is generated on one node, but might be used on any other node (as it goes via the LB). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-07-05 14:15:03 -07:00
Serge Logvinov	f5721050de	fix: controlplane keyusage * kube-apiserver keyusage serverAuth * kube-scheduler keyusage clientAuth * kube-controller-manager keyusage clientAuth * kubeconfig keyusage clientAuth Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>	2021-07-01 12:49:29 -07:00
Andrey Smirnov	70ac771e08	fix: use localhost API server endpoint for internal communication This includes communication from controller-manager and scheduler to the API server and manifest injection by Talos controllers. This eliminates dependency on control plane endpoint to be up, and might speed up bootstrap on platform where load balancer might need some time to start proxying to the first API server instance. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-18 12:06:47 -07:00
Andrey Smirnov	a941eb7da0	feat: improve security of Kubernetes control plane components Fixes #3765 See #3581 There are several changes: * `kube-controller-manager` insecure port is disabled * `kube-controller-manager` and `kube-scheduler` now listen securely only on localhost by default, this can be overridden with `--bind-addr` in extra args * `kube-controller-manager` and `kube-scheduler` now use kubeconfig with limited access role instead of admin one Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-18 10:21:45 -07:00
Andrey Smirnov	f2ae9cd0c1	feat: replace networkd with new network implementation This removes networkd, updates network ready condition, enables all the controllers which were previously disabled. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-15 17:37:28 -07:00
Artem Chernyshev	1db301edf6	feat: switch controller-runtime to zap.Logger Enable logging using default development config with some fine tuning. Additionally, now `info` and below logs go to kmsg. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-05-25 02:15:31 -07:00
Andrey Smirnov	d24df8f844	chore: re-import talos-systems/os-runtime as cosi-project/runtime No changes, just import path change (as project got moved). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-12 07:44:24 -07:00
Andrey Smirnov	fbfd1eb2b1	refactor: pull new version of os-runtime, update code This is mostly refactoring to adapt to the new APIs. There are some small changes which are not user-visible immediately (but visible when using `talosctl get` to inspect low-level details): * `extras` namespace is removed, it was a hack to distinguish extra and system manifests * `Manifests` are managed by two controllers as shared outputs, stored in the `controlplane` namespace now * `talosctl inspect dependencies` output got slightly changed * resources now have `md.owner` set to the controller name which manages the resource Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-07 06:55:09 -07:00
Andrey Smirnov	2ea20f598a	feat: replace timed with time sync controller This is a complete rewrite of time sync process. Now the time sync process starts early at boot time, and it adapts to configuration changes: * before config is available, `pool.ntp.org` is used * once config is available, configured time servers are used Controller updates same time sync resource as other controllers had dependency on, so they have a chance to wait for the time sync event. Talos services which depend on time now wait on same resource instead of waiting on timed health. New features: * time sync now sticks to the particular time server unless there's an error from that server, and server is changed in that case, this improves time sync accuracy * time sync acts on config changes immediately, so it's possible to reconfigure time sync at any time * there's a new 'epoch' field in time sync resources which allows time-dependent controllers to regenerate certs when there's a big enough jump in time Features to implement later: * apid shouldn't depend on timed, it should be started early and it should regenerate certs on time jump * trustd should be updated in same way Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-29 09:29:43 -07:00
Artem Chernyshev	22f375300c	chore: update golangci-lint to 1.38.0 Fix all discovered issues. Detected a couple of bugs, fixed them as well. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-12 06:50:02 -08:00
Alexey Palazhchenko	df52c13581	chore: fix //nolint directives That's the recommended syntax: https://golangci-lint.run/usage/false-positives/ Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-03-05 05:58:33 -08:00
Andrey Smirnov	60aa011c7a	feat: rename namespaces, resources, types etc See https://github.com/talos-systems/os-runtime/pull/12 for new naming conventions. No functional changes. Additionally implements printing extra columns in `talosctl get xyz`. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-02 13:34:15 -08:00
Andrey Smirnov	31e56e63db	fix: update in-cluster kubeconfig validity to match other certs Talos generates in-cluster kubeconfig for the kube-scheduler and kube-controller-manager to authenticate to kube-apiserver. Bug was that validity of that kubeconfig was set to 24h by mistake. Fix that by bumping validity to default for other Kubernetes certs (1 year). Add a certificate refresh at 50% of the validity. Fix bugs with copying secret resources which was leading to updates not being propagated correctly. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-01 11:16:04 -08:00
Andrey Smirnov	b914398154	refactor: split kubernetes/etcd resource generation into subresources Fixes #3062 There's no user-visible change in this PR. It carefully separates generated secrets (e.g. certs) from source secrets from the config (e.g. CAs), so that certs are generated on config changes which actually affect cert input. And same way separates etcd and Kubernetes PKI, so if etcd CA got changed, only etcd certs will be regenerated. This should have noticeable impact with RSA-based PKI as it reduces number of times PKI gets generated. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-18 22:01:28 -08:00
Andrey Smirnov	7751920dba	feat: add a tool and package to convert self-hosted CP to static pods This is required to upgrade from Talos 0.8.x to 0.9.x. After the cluster is fully upgraded, control plane is still self-hosted (as it was bootstrapped with bootkube). Tool `talosctl convert-k8s` (and library behind it) performs the conversion from the self-hosted control plane to static pods. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-17 23:26:57 -08:00
Andrey Smirnov	85ae9f75e9	fix: wait for time sync before generating Kubernetes certificates Certificate generation depends on current time, and this bug is visible on RPi which doesn't have RTC clock - controllers can generate certs before `timed` does its initial sync creating certs which are not usable. Fix generates new intermediate resource `TimeSync` which tracks time sync status (aggregates `timed` service status and `timed` enabled/disabled in the config). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-09 10:01:19 -08:00
Andrey Smirnov	2277ce8abe	feat: move to ECDSA keys for all Kubernetes/etcd certs and keys ECDSA keys are smaller which decreases Talos config size, they are more efficient in terms of key generation, signing, etc., so it makes boot performance better (and config generation as well). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-02 13:25:00 -08:00
Andrey Smirnov	0aaf8fa968	feat: replace bootkube with Talos-managed control plane Control plane components are running as static pods managed by the kubelets. Whole subsystem is managed via resources/controllers from os-runtime. Many supporting changes/refactoring to enable new code paths. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-26 14:22:35 -08:00

40 Commits