talos

mirror of https://github.com/siderolabs/talos.git synced 2025-10-08 14:11:13 +02:00

Author	SHA1	Message	Date
Noel Georgi	cad43f0ad3	chore: remove k8s master label Since talos now defaults to k8s 1.27, remove the handling of `master` label for controlplane nodes. Signed-off-by: Noel Georgi <git@frezbo.dev>	2023-04-25 20:48:05 +05:30
Andrey Smirnov	230e46e567	refactor: extract parts of kubernetes libraries The shared code is going out to the github.com/siderolabs/go-kubernetes library. The code will be used in Talos and other projects using same features. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2023-02-22 14:56:49 +04:00
Andrey Smirnov	0a5a8802e7	feat: use 'localhost' endpoint for controlplane nodes This switches the last usage of Kubernetes controlplane endpoint to use `localhost` (itself) for controlplane nodes. Worker nodes still use cluster-wide controlplane endpoint. This allows controlplane nodes to boot fully even if the controlplane endpoint (e.g. loadbalancer) doesn't function. The process of joining etcd still requires either a discovery service or a proper functioning controlplane endpoint. With this fix, Talos controlplane nodes can boot successfully without a loadbalancer being up, while worker nodes obviously won't join. This improves Talos behavior in single-node clusters when controlplane endpoint is not available, the node will still boot just fine and function properly. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2023-01-10 20:50:51 +04:00
Philipp Sauter	e1e340bdd9	feat: expose Talos node labels as a machine configuration field We add the `nodeLabels` key to the machine config to allow users to add node labels to the kubernetes Node object. A controller reads the nodeLabels from the machine config and applies them via the kubernetes API. Older versions of talosctl will throw an unknown keys error if `edit mc` is called on a node with this change. Fixes #6301 Signed-off-by: Philipp Sauter <philipp.sauter@siderolabs.com> Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-11-15 21:25:40 +04:00
Andrey Smirnov	96aa9638f7	chore: rename talos-systems/talos to siderolabs/talos There's a cyclic dependency on siderolink library which imports talos machinery back. We will fix that after we get talos pushed under a new name. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-11-03 16:50:32 +04:00
Andrey Smirnov	343c55762e	chore: replace talos-systems Go modules with siderolabs This is the first step towards replacing all import paths to be based on `siderolabs/` instead of `talos-systems/`. All updates contain no functional changes, just refactorings to adapt to the new path structure. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-11-01 12:55:40 +04:00
Andrey Smirnov	f62d17125b	chore: update crypto to use new import path siderolabs/crypto No functional changes in this PR, just updating import paths. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-09-07 23:02:50 +04:00
Noel Georgi	b62b18a972	feat: bump k8s to v1.25.0-beta.0 Bump k8s to v1.25.0-beta.0 Update most kubernetes `master` references to `controlplane` Signed-off-by: Noel Georgi <git@frezbo.dev>	2022-08-10 22:17:53 +05:30
Andrey Smirnov	b085343dcb	feat: use discovery information for etcd join (and other etcd calls) Talos historically relied on `kubernetes` `Endpoints` resource (which specifies `kube-apiserver` endpoints) to find other controlplane members of the cluster to connect to the `etcd` nodes for the cluster (when node local etcd instance is not up, for example). This method works great, but it relies on Kubernetes endpoint being up. If the Kubernetes API is down for whatever reason, or if the loadbalancer malfunctions, endpoints are not available and join/leave operations don't work. This PR replaces the endpoints lookup to use the `Endpoints` COSI resource which is filled in using two methods: * from the discovery data (if discovery is enabled, which is the default) * from the Kubernetes `Endpoints` resource If the discovery is disabled (or not available), this change does almost nothing: Kubernetes is still used to discover control plane endpoints, but as the data persists in memory, even if the Kubernetes control plane endpoint went down, the cached copy will be used to connect to the endpoint. If the discovery is enabled, Talos can join the etcd cluster immediately on boot without waiting for Kubernetes to be up on the bootstrap node which means that Talos cluster initial bootstrap runs in parallel on all control plane nodes, while previously nodes were waiting for the first node to finish bootstrap enough to fill in the endpoints data. As the `etcd` communication is protected with mutual TLS anyway, there's no risk even if the discovery data is stale or poisoned, as etcd operations would fail on TLS mismatch. Most of the changes in this PR actually enable populating Talos `Endpoints` resource based on the `Kubernetes` `endpoints` resource using the watch API. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-04-21 22:00:27 +03:00
Andrey Smirnov	c6a67b8662	fix: ignore not existing nodes on cordoning Fixes #4557 When running `reset` for a node which was already deleted from Kubernetes, we should ignore failure to cordon and proceed with other actions. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-11-18 19:07:35 +03:00
Artem Chernyshev	e3e2113adc	feat: upgrade CoreDNS during `upgrade-k8s` call Fixes: https://github.com/talos-systems/talos/issues/4065 Get all Talos generated manifests and apply them, wait for deployments to be updated and to become ready. Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>	2021-10-13 15:47:06 +03:00
Andrey Smirnov	0b347570a7	feat: use dynamic NodeAddresses/HostnameStatus in Kubernetes certs This is a PR on a path towards removing `ApplyDynamicConfig`. This fixes Kubernetes API server certificate generation to use dynamic data to generate cert with proper SANs for IPs of the node. As part of that refactored a bit apid certificate generation (without any changes). Added two unit-tests for apid and Kubernetes certificate generation. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-09-01 20:56:53 +03:00
Alexey Palazhchenko	eea750de2c	chore: rename "join" type to "worker" Closes #3413. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-07-09 07:10:45 -07:00
Andrey Smirnov	6d13d2cf92	fix: close Kubernetes API client The problem is that there's no official way to close Kubernetes client underlying TCP/HTTP connections. So each time Talos initializes connection to the control plane endpoint, new client is built, but this client is never closed, so the connection stays active on the load balancers, on the API server level, etc. It also eats some resources out of Talos itself. We add a way to close underlying connections by using helper from the Kubernetes client libraries to force close all TCP connections which should shut down all HTTP/2 connections as well. Alternative approach might be to cache a client for some time, but many of the clients are created with temporary PKI, so even cached client still needs to be closed once it gets stale, and it's not clear how to recreate a client in case existing one is broken for one reason or another (and we need to force a re-connection). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-07-05 14:25:26 -07:00
Andrey Smirnov	22a4193678	fix: workaround 'Unauthorized' errors when accessing Kubernetes API This should fix an error like: ``` failed to create etcd client: error getting kubernetes endpoints: Unauthorized ``` The problem is that the generated cert was used immediately, so even slight time sync issue across nodes might render the cert not (yet) usable. Cert is generated on one node, but might be used on any other node (as it goes via the LB). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-07-05 14:15:03 -07:00
Andrey Smirnov	5811f4dda1	feat: implement link (interface) controllers The structure of the controllers is really similar to addresses and routes: * `LinkSpec` resource describes desired link state * `LinkConfig` controller generates `LinkSpecs` based on machine configuration and kernel cmdline * `LinkMerge` controller merges multiple configuration sources into a single `LinkSpec` paying attention to the config layer priority * `LinkSpec` controller applies the specs to the kernel state Controller `LinkStatus` (which was implemented before) watches the kernel state and publishes current link status. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-01 09:36:25 -07:00
Andrey Smirnov	a1e6415403	fix: retry Kubernetes API errors on cordon/uncordon/etc This extracts function which was used in upgrade/convert flows to retry transient errors to the main `kubernetes` package, expands it to ignore timeout errors, and it is now used to retry errors where applicable in `pkg/kubernetes`. Fixes #3403 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-02 03:51:40 -07:00
Alexey Palazhchenko	df52c13581	chore: fix //nolint directives That's the recommended syntax: https://golangci-lint.run/usage/false-positives/ Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-03-05 05:58:33 -08:00
Artem Chernyshev	638af35db0	chore: properly propagate context object in the controller This is required to correctly handle ACPI reboot or forceful reboots during sequence that locks the controller. Additionally fix `NoSchedule` untaint when the configuration is changed. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-03 16:59:27 +03:00
Andrey Smirnov	779ac74a08	fix: improve the drain function Critical bug (I believe) was that drain code entered the loop to evict the pod after wait for pod to be deleted returned success effectively evicting pod once again once it got rescheduled to a different node. Add a global timeout to prevent draining code from running forever. Filter more pod types which shouldn't be ever drained. Fixes #3124 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-25 07:02:24 -08:00
Andrey Smirnov	9205870ee6	fix: move versions to annotations in control plane static pods Labels shouldn't be used, as this is not supposed to be used for filtering pods. Use proper annotation prefix private for Talos. Add config-version annotation to track how static pod propagates up to API server (it will be used in control plane upgrade). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-16 14:57:17 -08:00
Andrey Smirnov	8d7a36cc0c	fix: find master node IPs correctly in health checks Health checks verify node list in Kubernetes to match expectations, but initial set of nodes for server-side health checks was driven by `MasterIPs` functions which returns list of master endpoints which is not exactly same as master nodes: endpoints also include some healthchecks. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-16 06:28:02 -08:00
Andrey Smirnov	2277ce8abe	feat: move to ECDSA keys for all Kubernetes/etcd certs and keys ECDSA keys are smaller which decreases Talos config size, they are more efficient in terms of key generation, signing, etc., so it makes boot performance better (and config generation as well). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-02 13:25:00 -08:00
Andrey Smirnov	0aaf8fa968	feat: replace bootkube with Talos-managed control plane Control plane components are running as static pods managed by the kubelets. Whole subsystem is managed via resources/controllers from os-runtime. Many supporting changes/refactoring to enable new code paths. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-26 14:22:35 -08:00
Andrey Smirnov	f836f145f3	fix: synchronize bootkube timeouts and various boot timeouts When bootkube service fails, it can clean up manifests after itself, but it only happens if we give it a chance to shut down cleanly. If boot sequence times out, `machined` does emergency reboot and it doesn't let `bootkube` do the cleanup. So this fix has two paths: * synchronize boot/bootstrap sequence timeouts with bootkube asset timeout; * cleanup bootkube-generated manifests and bootkube service startup. Also logs errors on initial phases like `labelNodeAsMaster` to provide some feedback on why boot is stuck. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-18 13:45:28 -08:00
Andrey Smirnov	92cde0c2ea	fix: node taint doesn't contain value anymore As code was looking for existing taint with `value == true`, it failed to find existing taint and tried to add another one which never succeeds. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-03 13:12:42 -08:00
Andrey Smirnov	a26acfef9c	fix: remove value (change to empty) for `NoSchedule` taint This seems to be more preferred way and fixes compatibility with deployments which don't do `operator: Exists` in tolerations. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-02 07:05:49 -08:00
Andrey Smirnov	28ba6e416e	feat: update Kubernetes to v1.20.0-beta.2 Talos 0.8 is going to ship with K8s 1.20.x. Changes to support new `control-plane` label, upgrade-k8s supports automated fixups for 1.20. See also: https://github.com/talos-systems/bootkube-plugin/pull/22 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-25 06:39:14 -08:00
Andrey Smirnov	a2efa44663	chore: enable gci linter Fixes were applied automatically. Import ordering might be questionable, but it's strict: * stdlib * other packages * same package imports Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-09 08:09:48 -08:00
Andrey Smirnov	8560fb9662	chore: enable nlreturn linter Most of the fixes were automatically applied. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-09 06:48:07 -08:00
Artem Chernyshev	9c969a4be5	feat: allow disabling NoSchedule on master nodes Add a talosconfig parameter that allows disabling the NoSchedule taint on master nodes. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-10-06 10:52:37 -07:00
Andrey Smirnov	788cd15c29	test: add e2e test to the provision (upgrade) tests Add sonobuoy runner code with log fetching on failure. Use hand-picked set of e2e tests to run: verify basic pod functionality, verify service connectivity. Add option `--run-e2e` to the `talosctl health` to run quick e2e test to verify cluster health. Add option to run provision tests with custom CNI, run one track of provision tests with Cilium. Bump Cilium to 1.8.2. Talos 0.6 won't uncordon node automatically after upgrade from 0.5, as 0.5 doesn't put annotation. Workaround that in upgrade tests. Bump upgrade test version to 0.6.0 release. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-09-08 13:26:31 -07:00
Andrey Smirnov	f6ecf000c9	refactor: extract packages loadbalancer and retry This removes in-tree packages in favor of: * github.com/talos-systems/go-retry * github.com/talos-systems/go-loadbalancer Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-09-02 13:46:22 -07:00
Andrey Smirnov	bddd4f1bf6	refactor: move external API packages into `machinery/` This moves `pkg/config`, `pkg/client` and `pkg/constants` under `pkg/machinery` umbrella. And `pkg/machinery` is published as Go module inside Talos repository. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-17 09:56:14 -07:00
Andrey Smirnov	52c5911fcd	chore: extract pkg/crypto as external module Package `pkg/crypto` was extracted as `github.com/talos-systems/crypto` repository and Go module. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-14 06:33:30 -07:00
Andrey Smirnov	b110a9fa4d	fix: retry non-HTTP errors from API server While waiting for node ready condition, API server endpoint might return networking errors (e.g. if endpoint is a RR DNS record). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-10 07:26:52 -07:00
Andrey Smirnov	3926442704	feat: taint master nodes with `NoSchedule` taint Fixes #2350 This also brings in a fix for `coredns` tolerations from https://github.com/talos-systems/bootkube-plugin/pull/19. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-29 14:02:41 -07:00
Andrey Smirnov	c54639e541	feat: implement server-side API for cluster health checks This implements existing server-side health checks as defined in `internal/pkg/cluster/checks` in Talos API. Summary of changes: * new `cluster` API * `apid` now listens without auth on local file socket * `cluster` API is for now implemented in `machined`, but we can move it to the new service if we find it more appropriate * `talosctl health` by default now does server-side health check UX: `talosctl health` without arguments does health check for the cluster if it has healthy K8s to return master/worker nodes. If needed, node list can be overridden with flags. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-15 13:52:13 -07:00
Andrey Smirnov	804f162756	fix: improve node uncordon tasks 1. Increase retry timeout. 2. Use timeout per attempt. 3. Check for node readiness as a gate to succeed. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 09:26:47 -07:00
Andrey Smirnov	a4a2a3c83a	feat: uncordon nodes automatically on boot Talos will mark node as schedulable if it was previously cordoned by Talos (for upgrade, reset, etc.) If user marked node as not schedulable, Talos won't change it on boot. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 15:32:36 -07:00
Andrey Smirnov	ddbe9cfc2f	fix: update timeouts on service startup to match boot timeout There's a global timeout for all services to be up: it's 5 minutes. We need to make sure each service startup takes less than that, otherwise boot sequence is aborted and there's no way to see the error message for each particular service. Also propagate contexts correctly and set some default timeouts to make sure API operations are not hanging forever. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 07:39:36 -07:00
Spencer Smith	3a4eaeeef0	feat: upgrade kubernetes to 1.18 This PR will pull in the latest release of k8s 1.18 so we can start validating it through our test suite. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-03-26 14:59:43 -04:00
Spencer Smith	fa82454be4	chore: fix formatting of imports This PR cleans up the formatting for various package imports as they were causing the linter to throw errors. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-03-19 15:06:05 -04:00
Andrey Smirnov	01d696ed10	chore: update golangci-lint-1.23.3 `gomnd` disabled, as it complains about every number used in the code, and `wsl` became much more thorough. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-04 08:56:39 -08:00
Andrew Rynhard	f3623d22b0	refactor: use tls.Config as client credentials The `client.Creds` struct was not used very often, and made using the `client.NewClient` function impossible to use in combination with the `RemoteRenewingFileCertificateProvider`. This modifies `client.NewClient` to accept a `tls.Config` instead of `client.Creds`, allowing for the use of `RemoteRenewingFileCertificateProvider` with `client.NewClient`. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-01-21 17:10:07 -08:00
Andrew Rynhard	3e5ca30aa5	refactor: simplify NewTemporaryClientFromPKI This is a simple refactor that reduces the number of arguments required by `NewTemporaryClientFromPKI`. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-12-03 09:10:24 -08:00
Andrew Rynhard	6a1a9fc8d9	fix: retry cordon and uncordon When implementing the controller-manager I found a race condition between it and the cordon operation. The controller-manager annotates the node to indicate that an upgrade is in progress, and Talos tries to mark the node as unschedulable at nearly the same time. This leads to a race condition. The fix is to simply retry the cordon. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-16 11:15:22 -08:00
Andrew Rynhard	03a09c2294	refactor: rename Helper to Client The name helper isn't very good. This renames it to Client. A new func was also added, NewForConfig, that will allow for the creation of the helper client from an arbitrary Kubernetes REST config. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-04 19:31:27 -08:00
Andrey Smirnov	d3d011c8d2	chore: replace `/* */` comments with `//` comments in license header This fixes issues with `// +build` directives not being recognized in source files. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-10-25 14:15:17 -07:00
Spencer Smith	d0111fe617	feat: allow specification of full URL for endpoint This PR moves to using the full URL for endpoint instead of trying to hardcode 6443 in various places like we were doing. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2019-10-16 13:45:05 -04:00

59 Commits