talos

mirror of https://github.com/siderolabs/talos.git synced 2025-10-09 22:51:12 +02:00

Author	SHA1	Message	Date
Alexey Palazhchenko	eea750de2c	chore: rename "join" type to "worker" Closes #3413. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-07-09 07:10:45 -07:00
Andrey Smirnov	9bf899bdd8	fix: make forfeit leadership connect to the right node I believe `clientv3.SetEndpoints()` calls doesn't make etcd client connect to the endpoints mentioned immediately, it might still reuse old connection (?). At the same time `MaintenanceClient` which implements `MoveLeader` calls doesn't support explicit endpoint setting (as other similar calls do), so we have to manually force the connection to the leader node we need. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-07-06 13:50:57 -07:00
Andrey Smirnov	6d13d2cf92	fix: close Kubernetes API client The problem is that there's no official way to close Kubernetes client underlying TCP/HTTP connections. So each time Talos initializes connection to the control plane endpoint, new client is built, but this client is never closed, so the connection stays active on the load balancers, on the API server level, etc. It also eats some resources out of Talos itself. We add a way to close underlying connections by using helper from the Kubernetes client libraries to force close all TCP connections which should shut down all HTTP/2 connections as well. Alternative approach might be to cache a client for some time, but many of the clients are created with temporary PKI, so even cached client still needs to be closed once it gets stale, and it's not clear how to recreate a client in case existing one is broken for one reason or another (and we need to force a re-connection). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-07-05 14:25:26 -07:00
Andrey Smirnov	aaa36f3b4f	fix: ignore 'not a leader' error on forfeit leadership When forfeiting etcd leadership, it might be that the node still reports leadership status while not being a leader once the actual API call is used. We should ignore such an error as the node is not a leader. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-07-05 14:23:24 -07:00
Serge Logvinov	b52b206665	feat: split etcd certificates to peer/client Changes: * Etcd peer port key usage: ServerAuth,ClientAuth * Etcd client port key usage: ServerAuth,ClientAuth * Talos etcd client key usage: ClientAuth * KubeAPI etcd client key usage: ClientAuth * List of etcd allowed ciphers Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev> Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-23 13:26:48 -07:00
Andrey Smirnov	4ac9bea27d	fix: stop etcd client logs from going to the server console When etcd calls start failing, these log messages start spamming the console a lot. Disable the client log until we figure out a better destination for them. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-17 10:04:14 -07:00
Andrey Smirnov	59cfd312c1	chore: bump dependencies via dependabot There were some upstream code changes in etcd, some code got moved around. PRs #3651 #3652 #3653 #3654 #3655 #3655 #3656 #3657 #3658 #3659 #3660 #3661 #3662 #3663 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-05-24 12:15:15 -07:00
Andrey Smirnov	6cb266e74e	fix: update etcd client errors, print etcd join failures Better error message to understand where the error is coming from, also print errors to console when etcd is trying to join - this is invaluable to understand why etcd doesn't join the cluster. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-15 11:54:25 -07:00
Andrey Smirnov	ce795f1cea	fix: command `etcd remove-member` shouldn't remove etcd data directory There are two APIs and `talosctl` commands: * `etcd leave` removes the member from the cluster and removes etcd data directory for the called node * `etcd remove-member <node>` removes some other node from the etcd cluster, but it doesn't affect called node state This fixes confusing naming of the methods vs. what they're doing. Fixes #3340 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-22 02:11:06 -07:00
Artem Chernyshev	22f375300c	chore: update golangci-lint to 1.38.0 Fix all discovered issues. Detected a couple of bugs, fixed them as well. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-12 06:50:02 -08:00
Alexey Palazhchenko	df52c13581	chore: fix //nolint directives That's the recommended syntax: https://golangci-lint.run/usage/false-positives/ Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-03-05 05:58:33 -08:00
Artem Chernyshev	376fdcf6cb	feat: implement etcd remove-member cli command Fixes: https://github.com/talos-systems/talos/issues/3219 We already have `etcd leave`, which makes the node exclude itself from etcd members. But in case if the node can't remove itself because it doesn't have connection to etcd we need this etcd remove-member cli, which basically removes a node from a different node. No unit tests for that as it's going to destroy the test cluster. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-01 07:55:08 -08:00
Andrey Smirnov	953ce643ab	feat: bump etcd client library to 3.5.0-alpha.0 This version is finally using working `go.mod` files and tags, so no more hacks with imports, and allows us to bump `grpc` library to the latest version (I also did for this PR). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-25 10:36:15 -08:00
Andrey Smirnov	0aaf8fa968	feat: replace bootkube with Talos-managed control plane Control plane components are running as static pods managed by the kubelets. Whole subsystem is managed via resources/controllers from os-runtime. Many supporting changes/refactoring to enable new code paths. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-26 14:22:35 -08:00
Andrey Smirnov	a2efa44663	chore: enable gci linter Fixes were applied automatically. Import ordering might be questionable, but it's strict: * stdlib * other packages * same package imports Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-09 08:09:48 -08:00
Andrey Smirnov	8560fb9662	chore: enable nlreturn linter Most of the fixes were automatically applied. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-09 06:48:07 -08:00
Seán C McCord	a7a27e7edd	feat: extend etcd health check on upgrade When an etcd node is upgraded, we now perform additional quorum checks. This is necessary because when etcd nodes are upgraded, they are removed from membership. If, for instance, two etcd nodes were to upgrade simultaneously, quorum may be lost. This, of course, does not apply to single-node etcd clusters. Fixes #1422 Signed-off-by: Seán C McCord <ulexus@gmail.com>	2020-10-23 15:49:55 -07:00
Andrew Rynhard	4eeef28e90	feat: add etcd API This adds RPCs for basic etcd management tasks. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-10-06 11:30:04 -07:00
Andrew Rynhard	d4f103ffcb	fix: pass config via stdin In order to perform upgrades the way we would like, it is important that we avoid any bind mounts into containers. This change ensures that all system services get their config via stdin. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-08-20 15:26:13 -07:00
Andrey Smirnov	bddd4f1bf6	refactor: move external API packages into `machinery/` This moves `pkg/config`, `pkg/client` and `pkg/constants` under `pkg/machinery` umbrella. And `pkg/machinery` is published as Go module inside Talos repository. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-17 09:56:14 -07:00
Andrey Smirnov	2697b99b7d	refactor: extract `pkg/net` as `github.com/talos-systems/net` This extracts common package as new module/repository. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-14 11:04:50 -07:00
Andrey Smirnov	52c5911fcd	chore: extract pkg/crypto as external module Package `pkg/crypto` was extracted as `github.com/talos-systems/crypto` repository and Go module. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-14 06:33:30 -07:00
Andrey Smirnov	7474b8ba52	feat: upgrade etcd to 3.4.10 This upgrades etcd to latest v3.4.x version as smooth upgrade from version 3.3.22 in 0.6. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-13 07:33:51 -07:00
Andrey Smirnov	47608fb874	refactor: make `pkg/config` not rely on `machined/../internal/runtime` This makes `pkg/config` directly importable from other projects. There should be no functional changes. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-29 12:40:12 -07:00
Andrey Smirnov	ddbe9cfc2f	fix: update timeouts on service startup to match boot timeout There's a global timeout for all services to be up: it's 5 minutes. We need to make sure each service startup takes less than that, otherwise boot sequence is aborted and there's no way to see the error message for each particular service. Also propagate contexts correctly and set some default timeouts to make sure API operations are not hanging forever. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 07:39:36 -07:00
Seán C McCord	8ba8742f2e	fix: wrap etcd address URLs with formatting Make certain that all etcd address URLs are properly wrapped to handle IPv6 addresses. Fixes #2120 Signed-off-by: Seán C McCord <ulexus@gmail.com>	2020-05-18 10:30:16 -07:00
Andrew Rynhard	49307d554d	refactor: improve machined This is a rewrite of machined. It addresses some of the limitations and complexity in the implementation. This introduces the idea of a controller. A controller is responsible for managing the runtime, the sequencer, and a new state type introduced in this PR. A few highlights are: - no more event bus - functional approach to tasks (no more types defined for each task) - the task function definition now offers a lot more context, like access to raw API requests, the current sequence, a logger, the new state interface, and the runtime interface. - no more panics to handle reboots - additional initialize and reboot sequences - graceful gRPC server shutdown on critical errors - config is now stored at install time to avoid having to download it at install time and at boot time - upgrades now use the local config instead of downloading it - the upgrade API's preserve option takes precedence over the config's install force option Additionally, this pulls various packages in under machined to make the code easier to navigate. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-04-28 08:20:55 -07:00
Andrew Rynhard	69fa63a7b2	refactor: perform upgrade upon reboot This PR introduces a new strategy for upgrades. Instead of attempting to zap the partition table, create a new one, and then format the partitions, this change will only update the `vmlinuz`, and `initramfs.xz` being used to boot. It introduces an A/B style upgrade process, which will allow for easy rollbacks. One deviation from our original intention with upgrades is that this change does not completely reset a node. It falls just short of that and does not reset the partition table. This forces us to keep the current partition scheme in mind as we make changes in the future, because an upgrade assumes a specific partition scheme. We can improve upgrades further in the future, but this will at least make them more dependable. Finally, one more feature in this PR is the ability to keep state. This enables single node clusters to upgrade since we keep the etcd data around. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-03-20 17:32:18 -07:00
Spencer Smith	12bfd8dd94	feat: allow for persistence of config data This PR will allow users to set the `persist: true` value in their config data to tell talos not to re-pull the config data at each reboot. The default will still remain as a "pull every time" methodology in order to encourage immutability by default. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-03-06 11:42:00 -05:00
Spencer Smith	7719a67834	fix: refuse to upgrade if single master This PR adds some simple logic to bail early in the upgrade process if there only seems to be a single etcd node present in the cluster. This allows us to keep from blowing up non-HA clusters if users issue an upgrade command. Will close #1770. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-01-13 15:31:01 -05:00
Andrew Rynhard	898cf01f0a	refactor: unify generate type and machine type We have been using two packages that define a config type and a machine type, when really they are one and the same. This unifies the types down to one set. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-01-10 16:46:28 -08:00
Andrew Rynhard	3e5ca30aa5	refactor: simplify NewTemporaryClientFromPKI This is a simple refactor that reduces the number of arguments required by `NewTemporaryClientFromPKI`. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-12-03 09:10:24 -08:00
Andrew Rynhard	c9732458c1	fix: verify that all etcd members are running before upgrading This verifies that all etcd members are running before performing an upgrade. Without this we run the risk of destroying the etcd cluster. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-04 18:17:13 -08:00
Andrew Rynhard	33468f4d6a	fix: don't use 127.0.0.1 for etcd client We should use 127.0.0.1 only in special cases (like when bootstrapping the cluster). There is the potential that the local etcd member is unhealthy and/or not responsive. This adds function for creating an etcd client configured with all control plane node IPs in order to better handle this case. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-04 16:54:15 -08:00
Andrew Rynhard	ce911c02da	refactor: use etcd package This DRYs things up by using the etcd package for client creation. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-01 21:02:44 -07:00
Andrew Rynhard	5abbb9b041	fix: Avoid running bootkube on reboots Since bootkube should only be run once, we need a way to determine if it has already been run. This makes use of etcd to store a key-value pair indicating that the cluster has been initialized. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-01 15:20:43 -07:00

36 Commits