talos

mirror of https://github.com/siderolabs/talos.git synced 2025-10-09 22:51:12 +02:00

Author	SHA1	Message	Date
Andrey Smirnov	f62d17125b	chore: update crypto to use new import path siderolabs/crypto No functional changes in this PR, just updating import paths. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-09-07 23:02:50 +04:00
Noel Georgi	b62b18a972	feat: bump k8s to v1.25.0-beta.0 Bump k8s to v1.25.0-beta.0 Update most kubernetes `master` references to `controlplane` Signed-off-by: Noel Georgi <git@frezbo.dev>	2022-08-10 22:17:53 +05:30
Utku Ozdemir	84e712a9f1	feat: introduce Talos API access from Kubernetes We add a new CRD, `serviceaccounts.talos.dev` (with `tsa` as short name), and its controller which allows users to get a `Secret` containing a short-lived Talosconfig in their namespaces with the roles they need. Additionally, we introduce the `talosctl inject serviceaccount` command to accept a YAML file with Kubernetes manifests and inject them with Talos service accounts so that they can be directly applied to Kubernetes afterwards. If Talos API access feature is enabled on Talos side, the injected workloads will be able to talk to Talos API. Closes siderolabs/talos#4422. Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>	2022-08-08 18:27:26 +02:00
Andrey Smirnov	a6b010a8b4	chore: update Go to 1.19, Linux to 5.15.58 See https://go.dev/doc/go1.19 Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-08-03 17:03:58 +04:00
Dmitriy Matrenichev	30f7851d2a	chore: bump golangci-lint from 1.45.2 to 1.47.2 Minor linter upgrade. Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>	2022-07-22 17:49:44 +03:00
Utku Ozdemir	284a2f9596	fix: filter static pods correctly and optimize fetching When we query kubelet API to populate the StaticPodStatuses, instead of checking for ownerReferences to be empty, we check the annotation "kubernetes.io/config.source" value so we avoid including standalone pods (that are regular pods but not part of a replicaset). We also optimize their fetching by avoiding to unmarshal the fields we do not need. Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>	2022-06-27 18:50:47 +02:00
Andrey Smirnov	b085343dcb	feat: use discovery information for etcd join (and other etcd calls) Talos historically relied on `kubernetes` `Endpoints` resource (which specifies `kube-apiserver` endpoints) to find other controlplane members of the cluster to connect to the `etcd` nodes for the cluster (when node local etcd instance is not up, for example). This method works great, but it relies on Kubernetes endpoint being up. If the Kubernetes API is down for whatever reason, or if the loadbalancer malfunctions, endpoints are not available and join/leave operations don't work. This PR replaces the endpoints lookup to use the `Endpoints` COSI resource which is filled in using two methods: * from the discovery data (if discovery is enabled, default to enabled) * from the Kubernetes `Endpoints` resource If the discovery is disabled (or not available), this change does almost nothing: still Kubernetes is used to discover control plane endpoints, but as the data persists in memory, even if the Kubernetes control plane endpoint went down, cached copy will be used to connect to the endpoint. If the discovery is enabled, Talos can join the etcd cluster immediately on boot without waiting for Kubernetes to be up on the bootstrap node which means that Talos cluster initial bootstrap runs in parallel on all control plane nodes, while previously nodes were waiting for the first node to finish bootstrap enough to fill in the endpoints data. As the `etcd` communication is anyways protected with mutual TLS, there's no risk even if the discovery data is stale or poisoned, as etcd operations would fail on TLS mismatch. Most of the changes in this PR actually enable populating Talos `Endpoints` resource based on the `Kubernetes` `endpoints` resource using the watch API. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-04-21 22:00:27 +03:00
Andrey Smirnov	5e0c80f616	fix: ignore connection reset errors on k8s upgrade This fixes `talosctl upgrade-k8s`: ``` Get "https://172.21.0.1:6443/api/v1/namespaces/kube-system/pods?labelSelector=k8s-app+%3D+kube-apiserver": read tcp 172.21.0.1:51416->172.21.0.1:6443: read: connection reset by peer ``` The error happens when the `kube-apiserver` is restarted during the control plane upgrade, and it should be ignored as a transient error. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-03-18 22:11:28 +03:00
Andrey Smirnov	c6a67b8662	fix: ignore not existing nodes on cordoning Fixes #4557 When running `reset` for a node which was already deleted from Kubernetes, we should ignore failure to cordon and proceed with other actions. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-11-18 19:07:35 +03:00
Artem Chernyshev	e3e2113adc	feat: upgrade CoreDNS during `upgrade-k8s` call Fixes: https://github.com/talos-systems/talos/issues/4065 Get all Talos generated manifests and apply them, wait for deployments to be updated and to become ready. Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>	2021-10-13 15:47:06 +03:00
Andrey Smirnov	0b347570a7	feat: use dynamic NodeAddresses/HostnameStatus in Kubernetes certs This is a PR on a path towards removing `ApplyDynamicConfig`. This fixes Kubernetes API server certificate generation to use dynamic data to generate cert with proper SANs for IPs of the node. As part of that refactored a bit apid certificate generation (without any changes). Added two unit-tests for apid and Kubernetes certificate generation. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-09-01 20:56:53 +03:00
Alexey Palazhchenko	eea750de2c	chore: rename "join" type to "worker" Closes #3413. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-07-09 07:10:45 -07:00
Andrey Smirnov	6d13d2cf92	fix: close Kubernetes API client The problem is that there's no official way to close Kubernetes client underlying TCP/HTTP connections. So each time Talos initializes connection to the control plane endpoint, new client is built, but this client is never closed, so the connection stays active on the load balancers, on the API server level, etc. It also eats some resources out of Talos itself. We add a way to close underlying connections by using helper from the Kubernetes client libraries to force close all TCP connections which should shut down all HTTP/2 connections as well. Alternative approach might be to cache a client for some time, but many of the clients are created with temporary PKI, so even cached client still needs to be closed once it gets stale, and it's not clear how to recreate a client in case existing one is broken for one reason or another (and we need to force a re-connection). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-07-05 14:25:26 -07:00
Andrey Smirnov	22a4193678	fix: workaround 'Unauthorized' errors when accessing Kubernetes API This should fix an error like: ``` failed to create etcd client: error getting kubernetes endpoints: Unauthorized ``` The problem is that the generated cert was used immediately, so even slight time sync issue across nodes might render the cert not (yet) usable. Cert is generated on one node, but might be used on any other node (as it goes via the LB). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-07-05 14:15:03 -07:00
Andrey Smirnov	3aae94e530	feat: provide Kubernetes nodename as a COSI resource This changes the way Kubernetes nodename is computed: it is set by the controller based on the hostname and machine configuration, and pulled from the resource when needed. Kubelet client now also uses nodename to fix the certificate mismatch issue on AWS. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-18 19:58:19 +03:00
Andrey Smirnov	5811f4dda1	feat: implement link (interface) controllers The structure of the controllers is really similar to addresses and routes: * `LinkSpec` resource describes desired link state * `LinkConfig` controller generates `LinkSpecs` based on machine configuration and kernel cmdline * `LinkMerge` controller merges multiple configuration sources into a single `LinkSpec` paying attention to the config layer priority * `LinkSpec` controller applies the specs to the kernel state Controller `LinkStatus` (which was implemented before) watches the kernel state and publishes current link status. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-01 09:36:25 -07:00
Andrey Smirnov	2261d7ed02	fix: use both self-signed and Kubernetes CA to verify Kubelet cert Kubelet might be running either self-signed cert (by default) or API server issued cert (signed by the CA). User might switch between the two methods, so instead of guessing based on filesystem contents, accept both Kubernetes CA and self-signed cert (if available). Spotted by @aceat64 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-26 12:21:22 -07:00
Andrey Smirnov	e26c977d85	fix: check retryable network errors by interface Looks like tls errors implement the interface, but they are not derived from the `*net.OpError`, so this check should catch more errors. Fixes #3457 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-12 09:56:17 -07:00
Andrey Smirnov	a1e6415403	fix: retry Kubernetes API errors on cordon/uncordon/etc This extracts function which was used in upgrade/convert flows to retry transient errors to the main `kubernetes` package, expands it to ignore timeout errors, and it is now used to retry errors where applicable in `pkg/kubernetes`. Fixes #3403 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-02 03:51:40 -07:00
Alexey Palazhchenko	df52c13581	chore: fix //nolint directives That's the recommended syntax: https://golangci-lint.run/usage/false-positives/ Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-03-05 05:58:33 -08:00
Artem Chernyshev	638af35db0	chore: properly propagate context object in the controller This is required to correctly handle ACPI reboot or forceful reboots during sequence that locks the controller. Additionally fix `NoSchedule` untaint when the configuration is changed. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-03 16:59:27 +03:00
Andrey Smirnov	779ac74a08	fix: improve the drain function Critical bug (I believe) was that drain code entered the loop to evict the pod after wait for pod to be deleted returned success effectively evicting pod once again once it got rescheduled to a different node. Add a global timeout to prevent draining code from running forever. Filter more pod types which shouldn't be ever drained. Fixes #3124 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-25 07:02:24 -08:00
Andrey Smirnov	41430e72d2	fix: handle case when kubelet serving certificates are issued If kubelet is configured to issue certificates from the control plane, `/var/lib/kubelet/pki/kubelet.crt` file is never created, and cluster CA can be used to verify the TLS connection. Use k8s `RESTClient` instead of a custom client, this also results in much more descriptive error messages if API call fails. Fix a problem in apid on worker nodes with issued serving certificates: `/var/lib/kubelet/pki` doesn't exist by the time `apid` starts. First write static pods, then try to build kubelet client: for issued serving kubelet certificates, control plane should be up first. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-19 13:21:26 -08:00
Andrey Smirnov	9205870ee6	fix: move versions to annotations in control plane static pods Labels shouldn't be used, as this is not supposed to be used for filtering pods. Use proper annotation prefix private for Talos. Add config-version annotation to track how static pod propagates up to API server (it will be used in control plane upgrade). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-16 14:57:17 -08:00
Andrey Smirnov	8d7a36cc0c	fix: find master node IPs correctly in health checks Health checks verify node list in Kubernetes to match expectations, but initial set of nodes for server-side health checks was driven by `MasterIPs` functions which returns list of master endpoints which is not exactly same as master nodes: endpoints also include some healthchecks. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-16 06:28:02 -08:00
Andrey Smirnov	2277ce8abe	feat: move to ECDSA keys for all Kubernetes/etcd certs and keys ECDSA keys are smaller which decreases Talos config size, they are more efficient in terms of key generation, signing, etc., so it makes boot performance better (and config generation as well). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-02 13:25:00 -08:00
Andrey Smirnov	0aaf8fa968	feat: replace bootkube with Talos-managed control plane Control plane components are running as static pods managed by the kubelets. Whole subsystem is managed via resources/controllers from os-runtime. Many supporting changes/refactoring to enable new code paths. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-26 14:22:35 -08:00
Andrey Smirnov	f836f145f3	fix: synchronize bootkube timeouts and various boot timeouts When bootkube service fails, it can clean up manifests after itself, but it only happens if we give it a chance to shut down cleanly. If boot sequence times out, `machined` does emergency reboot and it doesn't let `bootkube` do the cleanup. So this fix has two paths: * synchronize boot/bootstrap sequence timeouts with bootkube asset timeout; * cleanup bootkube-generated manifests and bootkube service startup. Also logs errors on initial phases like `labelNodeAsMaster` to provide some feedback on why boot is stuck. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-18 13:45:28 -08:00
Andrey Smirnov	92cde0c2ea	fix: node taint doesn't contain value anymore As code was looking for existing taint with `value == true`, it failed to find existing taint and tried to add another one which never succeeds. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-03 13:12:42 -08:00
Andrey Smirnov	a26acfef9c	fix: remove value (change to empty) for `NoSchedule` taint This seems to be more preferred way and fixes compatibility with deployments which don't do `operator: Exists` in tolerations. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-02 07:05:49 -08:00
Andrey Smirnov	28ba6e416e	feat: update Kubernetes to v1.20.0-beta.2 Talos 0.8 is going to ship with K8s 1.20.x. Changes to support new `control-plane` label, upgrade-k8s supports automated fixups for 1.20. See also: https://github.com/talos-systems/bootkube-plugin/pull/22 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-25 06:39:14 -08:00
Andrey Smirnov	a2efa44663	chore: enable gci linter Fixes were applied automatically. Import ordering might be questionable, but it's strict: * stdlib * other packages * same package imports Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-09 08:09:48 -08:00
Andrey Smirnov	8560fb9662	chore: enable nlreturn linter Most of the fixes were automatically applied. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-09 06:48:07 -08:00
Artem Chernyshev	9c969a4be5	feat: allow disabling NoSchedule on master nodes Add talosconfig parameter that allows to disable NoSchedule taint on master nodes. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-10-06 10:52:37 -07:00
Andrey Smirnov	788cd15c29	test: add e2e test to the provision (upgrade) tests Add sonobuoy runner code with log fetching on failure. Use hand-picked set of e2e tests to run: verify basic pod functionality, verify service connectivity. Add option `--run-e2e` to the `talosctl health` to run quick e2e test to verify cluster health. Add option to run provision tests with custom CNI, run one track of provision tests with Cilium. Bump Cilium to 1.8.2. Talos 0.6 won't uncordon node automatically after upgrade from 0.5, as 0.5 doesn't put annotation. Workaround that in upgrade tests. Bump upgrade test version to 0.6.0 release. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-09-08 13:26:31 -07:00
Andrey Smirnov	f6ecf000c9	refactor: extract packages loadbalancer and retry This removes in-tree packages in favor of: * github.com/talos-systems/go-retry * github.com/talos-systems/go-loadbalancer Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-09-02 13:46:22 -07:00
Andrey Smirnov	bddd4f1bf6	refactor: move external API packages into `machinery/` This moves `pkg/config`, `pkg/client` and `pkg/constants` under `pkg/machinery` umbrella. And `pkg/machinery` is published as Go module inside Talos repository. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-17 09:56:14 -07:00
Andrey Smirnov	52c5911fcd	chore: extract pkg/crypto as external module Package `pkg/crypto` was extracted as `github.com/talos-systems/crypto` repository and Go module. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-14 06:33:30 -07:00
Andrey Smirnov	b110a9fa4d	fix: retry non-HTTP errors from API server While waiting for node ready condition, API server endpoint might return networking errors (e.g. if endpoint is a RR DNS record). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-10 07:26:52 -07:00
Andrey Smirnov	3926442704	feat: taint master nodes with `NoSchedule` taint Fixes #2350 This also brings in a fix for `coredns` tolerations from https://github.com/talos-systems/bootkube-plugin/pull/19. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-29 14:02:41 -07:00
Andrey Smirnov	c54639e541	feat: implement server-side API for cluster health checks This implements existing server-side health checks as defined in `internal/pkg/cluster/checks` in Talos API. Summary of changes: * new `cluster` API * `apid` now listens without auth on local file socket * `cluster` API is for now implemented in `machined`, but we can move it to the new service if we find it more appropriate * `talosctl health` by default now does server-side health check UX: `talosctl health` without arguments does health check for the cluster if it has healthy K8s to return master/worker nodes. If needed, node list can be overridden with flags. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-15 13:52:13 -07:00
Andrey Smirnov	804f162756	fix: improve node uncordon tasks 1. Increase retry timeout. 2. Use timeout per attempt. 3. Check for node readiness as a gate to succeed. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 09:26:47 -07:00
Andrey Smirnov	a4a2a3c83a	feat: uncordon nodes automatically on boot Talos will mark node as schedulable if it was previously cordoned by Talos (for upgrade, reset, etc.) If user marked node as not schedulable, Talos won't change it on boot. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 15:32:36 -07:00
Andrey Smirnov	ddbe9cfc2f	fix: update timeouts on service startup to match boot timeout There's a global timeout for all services to be up: it's 5 minutes. We need to make sure each service startup takes less than that, otherwise boot sequence is aborted and there's no way to see the error message for each particular service. Also propagate contexts correctly and set some default timeouts to make sure API operations are not hanging forever. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 07:39:36 -07:00
Spencer Smith	3a4eaeeef0	feat: upgrade kubernetes to 1.18 This PR will pull in the latest release of k8s 1.18 so we can start validating it through our test suite. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-03-26 14:59:43 -04:00
Spencer Smith	fa82454be4	chore: fix formatting of imports This PR cleans up the formatting for various package imports as they were causing the linter to throw errors. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-03-19 15:06:05 -04:00
Andrey Smirnov	01d696ed10	chore: update golangci-lint-1.23.3 `gomnd` disabled, as it complains about every number used in the code, and `wsl` became much more thorough. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-04 08:56:39 -08:00
Andrew Rynhard	f3623d22b0	refactor: use tls.Config as client credentials The `client.Creds` struct was not used very often, and made using the `client.NewClient` function impossible to use in combination with the `RemoteRenewingFileCertificateProvider`. This modifies `client.NewClient` to accept a `tls.Config` instead of `client.Creds`, allowing for the use of `RemoteRenewingFileCertificateProvider` with `client.NewClient`. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-01-21 17:10:07 -08:00
Andrew Rynhard	3e5ca30aa5	refactor: simplify NewTemporaryClientFromPKI This is a simple refactor that reduces the number of arguments required by `NewTemporaryClientFromPKI`. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-12-03 09:10:24 -08:00
Andrew Rynhard	6a1a9fc8d9	fix: retry cordon and uncordon When implementing the controller-manager I found a race condition between it and the cordon operation. The controller-manager annotates the node to indicate that an upgrade is in progress, and Talos tries to mark the node as unschedulable at nearly the same time. This leads to a race condition. The fix is to simply retry the cordon. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-16 11:15:22 -08:00


62 Commits