talos

mirror of https://github.com/siderolabs/talos.git synced 2025-10-06 13:11:12 +02:00

Author	SHA1	Message	Date
Andrey Smirnov	62c702c4fd	fix: remove conflicting etcd member on rejoin with empty data directory This fixes a scenario when control plane node loses contents of `/var` without leaving etcd first: on reboot etcd data directory is empty, but member is already present in the etcd member list, so etcd won't be able to join because of raft log being empty. The fix is to remove a member with matching hostname if found in the etcd member list followed by new member add. The risk here is removing another member which has same hostname as the joining node, but having duplicate hostnames for control plane node is a problem anyways. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-03 15:11:44 -07:00
Andrey Smirnov	0acb04ad7a	feat: implement route network controllers Route handling is very similar to addresses: * `RouteStatus` describes kernel routing table state, `RouteStatusController` reflects kernel state into resources * `RouteSpec` defines routes to be configured * `RouteConfigController` creates `RouteSpec`s based on cmdline and machine configuration * `RouteMergeController` merges different configuration layers into the final representation * `RouteSpecController` applies the specs to the kernel routing table Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-05-25 11:09:21 -07:00
Alexey Palazhchenko	1fcf38f9d6	feat: add support for "none" CNI type Closes #3411. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-04-09 12:53:00 -07:00
Andrey Smirnov	e0650218a6	feat: support etcd recovery from snapshot on bootstrap When Talos `controlplane` node is waiting for a bootstrap, `etcd` contents can be recovered from a snapshot created with `talosctl etcd snapshot` on a healthy cluster. Bootstrap process goes same way as before, but the etcd data directory is recovered from the snapshot. This flow enables disaster recovery for the control plane: given that periodic backups are available, destroy control plane nodes, re-create them with the same config, and bootstrap one node with the saved snapshot to recover etcd state at the time of the snapshot. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-08 10:15:37 -07:00
Andrey Smirnov	7d91258475	test: fix data race in apply config tests Variable `chanErr` was read before waiting for the goroutine to finish. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-31 10:46:50 -07:00
Andrey Smirnov	204caf8eb9	test: fix apply-config integration test, bump clusterctl version Tests for ApplyConfig API were relying on not really supported behavior of modifying config via the `Provider` interface (and it was "fixed" in another PR which cleans up such access to the configuration). Cluster version bumped to try to workaround strange CAPI bootstrap failures in e2e-capi. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-31 09:55:53 -07:00
Andrey Smirnov	2ea20f598a	feat: replace timed with time sync controller This is a complete rewrite of time sync process. Now the time sync process starts early at boot time, and it adapts to configuration changes: * before config is available, `pool.ntp.org` is used * once config is available, configured time servers are used Controller updates same time sync resource as other controllers had dependency on, so they have a chance to wait for the time sync event. Talos services which depend on time now wait on same resource instead of waiting on timed health. New features: * time sync now sticks to the particular time server unless there's an error from that server, and server is changed in that case, this improves time sync accuracy * time sync acts on config changes immediately, so it's possible to reconfigure time sync at any time * there's a new 'epoch' field in time sync resources which allows time-dependent controllers to regenerate certs when there's a big enough jump in time Features to implement later: * apid shouldn't depend on timed, it should be started early and it should regenerate certs on time jump * trustd should be updated in same way Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-29 09:29:43 -07:00
Alexey Palazhchenko	d7e9f6d6a8	chore: build integration tests with -race Refs https://github.com/talos-systems/talos/issues/3378. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-03-26 10:08:12 -07:00
Artem Chernyshev	6ffabe5169	feat: add ability to find disk by disk properties Fixes: https://github.com/talos-systems/talos/issues/3323 Not exactly matching with udevd generated `by-<id>` symlinks, but should provide sufficient amount of property selectors to be able to pick specific disks for any kind of disk: sd card, hdd, ssd, nvme. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-23 14:23:02 -07:00
Artem Chernyshev	22f375300c	chore: update golangci-lint to 1.38.0 Fix all discovered issues. Detected a couple of bugs, fixed them as well. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-12 06:50:02 -08:00
Alexey Palazhchenko	df52c13581	chore: fix //nolint directives That's the recommended syntax: https://golangci-lint.run/usage/false-positives/ Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-03-05 05:58:33 -08:00
Andrey Smirnov	31e56e63db	fix: update in-cluster kubeconfig validity to match other certs Talos generates in-cluster kubeconfig for the kube-scheduler and kube-controller-manager to authenticate to kube-apiserver. Bug was that validity of that kubeconfig was set to 24h by mistake. Fix that by bumping validity to default for other Kubernetes certs (1 year). Add a certificate refresh at 50% of the validity. Fix bugs with copying secret resources which was leading to updates not being propagated correctly. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-01 11:16:04 -08:00
Artem Chernyshev	7108bb3f5b	test: upgrade master to master tests Verify upgrade flow using the same version of the installer. Run that with disk encryption enabled. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-02-24 07:56:44 -08:00
Artem Chernyshev	06b8c09484	test: enable disk encryption key rotation test Verify that disk encryption sync operations work properly. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-02-20 06:17:55 -08:00
Andrey Smirnov	32d2588528	test: update integration tests to use wrapped client for etcd APIs This continues the fix from #3167. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-18 08:08:48 -08:00
Artem Chernyshev	58ff2c9808	feat: implement ephemeral partition encryption This PR introduces the first part of disk encryption support. New config section `systemDiskEncryption` was added into MachineConfig. For now it contains only Ephemeral partition encryption. Encryption itself supports two kinds of keys for now: - node id deterministic key. - static key which is hardcoded in the config and mainly used for test purposes. Talosctl cluster create can now be told to encrypt ephemeral partition by using `--encrypt-ephemeral` flag. Additionally: - updated pkgs library version. - changed Dockerfile to copy cryptsetup deps from pkgs. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-02-17 13:39:04 -08:00
Andrey Smirnov	cc83b83808	feat: rename apply-config --no-reboot to --on-reboot This explains the intention better: config is applied on reboot, and makes it easy to distinguish from `apply-config --immediate` which applies config immediately without a reboot (that is coming in a different PR). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-17 12:49:47 -08:00
Andrey Smirnov	d99a016af2	fix: correct response structure for GenerateConfig API Also fix recovery grpc handler to print panic stacktrace to the log. Any API should follow the structure compatible with apid proxying injection of errors/nodes. Explicitly fail GenerateConfig API on worker nodes, as it panics on worker nodes (missing certificates in node config). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-11 06:34:10 -08:00
Andrey Smirnov	7f3dca8e4c	test: add support for IPv6 in talosctl cluster create Modify provision library to support multiple IPs, CIDRs, gateways, which can be IPv4/IPv6. Based on IP types, enable services in the cluster to run DHCPv4/DHCPv6 in the test environment. There's outstanding bug left with routes not being properly set up in the cluster so, IPs are not properly routable, but DHCPv6 works and IPs are allocated (validates DHCPv6 client). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-09 13:28:53 -08:00
Andrey Smirnov	87ccf0eb21	test: clear connection refused errors after reset After node reboot (and gRPC API unavailability), gRPC stack might cache connection refused errors for up to backoff timeout. Explicitly clear such errors in reset tests before trying to read data from the node to verify reset success. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-01 08:11:27 -08:00
Andrey Smirnov	0aaf8fa968	feat: replace bootkube with Talos-managed control plane Control plane components are running as static pods managed by the kubelets. Whole subsystem is managed via resources/controllers from os-runtime. Many supporting changes/refactoring to enable new code paths. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-26 14:22:35 -08:00
Andrey Smirnov	47fb5720cf	test: skip etcd tests on non-HA clusters We can't test much of the flow on single-node clusters. Fixes #3013 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-08 07:39:36 -08:00
Andrey Smirnov	a8dd2ff30d	fix: checkpoint controller-manager and scheduler Default manifests created by bootkube so far were only enabling pod-checkpointer for kube-apiserver. This seems to have issues with single-node control plane scenario, when without scheduler and controller-manager node might fall into `NodeAffinity` state. See https://github.com/talos-systems/bootkube-plugin/pull/23 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-28 11:53:17 -08:00
Andrey Smirnov	3dae6df27b	test: stabilize upgrade test by running health check several times For single node clusters, control plane is unstable after reboot, run health check several times to let it settle down to avoid failures in subsequent checks. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-11 08:31:01 -08:00
Andrey Smirnov	54ed80e244	feat: reset with system disk wipe spec Idea is to add an option to perform "selective" reset: default reset operation is to wipe all partitions (triggering reinstall), while spec allows wiping only some of the partitions. Other operations are performed exactly in the same way for any reset flow. Possible use case: reset only `EPHEMERAL` partition. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-10 11:31:07 -08:00
Andrey Smirnov	350280eb59	feat: implement "staged" (failsafe/backup) upgrades Regular upgrade path takes just one reboot, but it requires all the processes to be stopped on the node before upgrade might proceed. Under some circumstances and with potential Talos bugs it might not work rendering Talos upgrades almost impossible. Staged upgrades build upon regular install flow to run the upgrade on the node reboot. Such upgrades require two reboots of the node, and it requires two pulls of the installer image, but they should be much less susceptible to failure. Once the upgrade is staged, node can be rebooted in any possible way, including hard reset and upgrade is performed on the next boot. New ADV format was implemented as well to allow to store install image ref/options across reboots. New format allows for bigger values and takes 50% of the `META` partition. Old ADV is still kept for compatibility reasons. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-08 08:34:26 -08:00
Artem Chernyshev	8aad711f18	feat: implement network interfaces list API To be used in the interactive installer to configure networking. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-27 10:48:45 -08:00
Artem Chernyshev	f96cffd2b2	feat: add ability to choose CNI config Initial version which only allows setting CNI using preset, no custom CNI urls are supported at the moment. Still need to figure out what kind of UI can be used for that. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-26 06:49:54 -08:00
Andrey Smirnov	9a32e34cb1	feat: implement apply configuration without reboot This allows config to be written to disk without being applied immediately. Small refactoring to extract common code paths. At first, I tried to implement this via the sequencer, but looks like it's too hard to get it right, as sequencer lacks context and config to be written is not applied to the runtime. Fixes #2828 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-23 12:42:44 -08:00
Artem Chernyshev	2588e2960b	feat: make GenerateConfiguration API reuse current node auth Fixes: https://github.com/talos-systems/talos/issues/2819 Only if requested config type is not `TypeInit`. This functionality will help implementing TUI installer cluster extension workflow. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-23 12:12:15 -08:00
Artem Chernyshev	8513123d22	feat: return client config as the second value in GenerateConfiguration To be used in interactive installer to output the node client configuration to a file. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-17 07:20:05 -08:00
Artem Chernyshev	0f924b5122	feat: add generate config gRPC API Fixes: https://github.com/talos-systems/talos/issues/2766 This API is implemented in Maintenance and Machine services. Can be used to generate configuration on the node, instead of using talosctl to generate it locally. To be used in interactive installer and talosctl gen config. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-13 08:07:32 -08:00
Andrey Smirnov	8560fb9662	chore: enable nlreturn linter Most of the fixes were automatically applied. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-09 06:48:07 -08:00
Andrey Smirnov	773912833e	test: clean up integration test code, fix flakes This enables golangci-lint via build tags for integration tests (this should have been done long ago!), and fixes the linting errors. Two tests were updated to reduce flakiness: * apply config: wait for nodes to issue "boot done" sequence event before proceeding * recover: kill pods even if they appear after the initial set gets killed (potential race condition with previous test). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-10-19 15:44:14 -07:00
Artem Chernyshev	e7e99cf1b3	feat: support disk usage command in talosctl Usage example: ```bash talosctl du --nodes 10.5.0.2 /var -H -d 2 NODE NAME 10.5.0.2 8.4 kB etc 10.5.0.2 1.3 GB lib 10.5.0.2 16 MB log 10.5.0.2 25 kB run 10.5.0.2 4.1 kB tmp 10.5.0.2 1.3 GB . ``` Supported flags: - `-a` writes counts for all files, not just directories. - `-d` recursion depth. - `-H` humanize size outputs. - `-t` size threshold (skip files if < size or > size). Fixes: https://github.com/talos-systems/talos/issues/2504 Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-10-13 09:30:31 -07:00
Andrew Rynhard	4eeef28e90	feat: add etcd API This adds RPCs for basic etcd management tasks. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-10-06 11:30:04 -07:00
Seán C McCord	ff92d2a14b	feat: add ApplyConfiguration API Adds the ability to apply (replace) an existing node configuration with a new one via the Machine API. Fixes #2345 Signed-off-by: Seán C McCord <ulexus@gmail.com>	2020-09-29 14:44:06 -07:00
Andrey Smirnov	f6ecf000c9	refactor: extract packages loadbalancer and retry This removes in-tree packages in favor of: * github.com/talos-systems/go-retry * github.com/talos-systems/go-loadbalancer Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-09-02 13:46:22 -07:00
Marco De Luca	1fbb171fd0	test: determine reboots using boot id Changed the RebootSuite to use /proc/sys/kernel/random/boot_id rather than /proc/uptime Signed-off-by: Marco De Luca <marcodl404@gmail.com>	2020-08-26 06:09:02 -07:00
Andrey Smirnov	6a7cc02648	fix: handle bootkube recover correctly, support recovery from etcd Bootkube recover process (and `talosctl recover`) was actually regenerating assets each time `recover` runs forcing control plane to be at the state when cluster got created. This PR fixes that by running recover process correctly. Recovery via etcd was fixed to handle encrypted etcd data: it follows the way `apiserver` handles encryption at rest, and as at the moment AES CBC is the only supported encryption method, code simply follows the same path. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-18 14:24:14 -07:00
Andrey Smirnov	bddd4f1bf6	refactor: move external API packages into `machinery/` This moves `pkg/config`, `pkg/client` and `pkg/constants` under `pkg/machinery` umbrella. And `pkg/machinery` is published as Go module inside Talos repository. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-17 09:56:14 -07:00
Andrey Smirnov	47608fb874	refactor: make `pkg/config` not rely on `machined/../internal/runtime` This makes `pkg/config` directly importable from other projects. There should be no functional changes. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-29 12:40:12 -07:00
Andrey Smirnov	3d8418a689	feat: force nodes to be set in `talosctl` commands using the API With load-balancing enabled by default running `talosctl` without `--nodes` is risky, as it might hit any control plane by default without `--nodes`. Only two commands do not enforce this check, as they do their own node contexts: `crashdump` and `health` (client-side). Integration tests were updated to always supply `--nodes` cli argument, while doing that I refactored the storage for discovered nodes to use existing `cluster.Info` interface. The downside is that with e2e CAPI tests CLI tests will be mostly skipped as we don't support discovery in CLI tests at the moment. This can be fixed by using `talosctl kubeconfig` + `kubectl get nodes` for node discovery. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-21 12:17:43 -07:00
Andrey Smirnov	1a0e1bc393	chore: update module dependencies Fixes #2316 Simply update dependencies we don't track on version level to be compatible with Talos components (like etcd or k8s). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 12:00:50 -07:00
Andrey Smirnov	cbb7ca8390	refactor: merge osd into machined This merges `osd` API into `machined`. API was copied from `osd` into `machined`, and `osd` API was deprecated. For backwards compatibility, `machined` still implements `osd` API, so older Talos API clients can still talk to the node without changes. Docs were updated. No functional changes. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-13 12:50:00 -07:00
Andrey Smirnov	931237b23c	test: update init node check in reset API tests Previously we assumed that node 0 is the init node, and it can't be reset. With new bootstrap API approach, there's no init node, and all the nodes can be reset. This corrects the check to skip only the init node, and with bootstrap API there's no init node (so no nodes are skipped). Fixes #2277 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 10:48:14 -07:00
Andrey Smirnov	5ecddf2866	feat: add round-robin LB policy to Talos client by default Handling of multiple endpoints has already been implemented in #2094. This PR enables round-robin policy so that grpc picks up new endpoint for each call (and not send each request to the first control plane node). Endpoint list is randomized to handle cases when only one request is going to be sent, so that it doesn't go always to the first node in the list. grpc handles dead/unresponsive nodes automatically for us. `talosctl cluster create` and provision tests switched to use client-side load balancer for Talos API. On the additional improvements we got: * `talosctl` now reports correct node IP when using commands without `-n`, not the loadbalancer IP (if using multiple endpoints of course) * loadbalancer can't provide reliable handling of errors when upstream server is unresponsive or there're no upstreams available, grpc returns much more helpful errors Fixes #1641 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 08:35:15 -07:00
Andrey Smirnov	4cc074cdba	feat: implement API access to event history 1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids are sortable and unique enough. Xids also encode event publishing time with a second precision. 2. Add three ways to look back into event history: based on number of events, on time and ID. Lookup via ID might be used to restart event polling in case of broken API connection from the same moment. 3. Reimplement core event buffer with positions which are always incremented instead of generation+index, this implementation is much more simple (idea from circular buffer). 4. By default, Events API works the same - it shows no history and starts streaming new events only. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 10:54:50 -07:00
Andrey Smirnov	a6b3bd2ff6	feat: implement service events This implements service events, adds test for events API based on service events as they're the easiest to generate on demand. Disabled validate test for 'metal' as it validates disk device against local system which doesn't make much sense. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-03 13:52:53 -07:00
Andrey Smirnov	81d1c2bfe7	chore: enable godot linter Issues were fixed automatically. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-30 10:39:56 -07:00

77 Commits