talos

mirror of https://github.com/siderolabs/talos.git synced 2025-12-06 01:51:14 +01:00

Author	SHA1	Message	Date
Andrey Smirnov	dadaa65d54	feat: print uid/gid for the files in `ls -l` This adds information about file ownership in the long listing which is crucial sometimes. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-08-13 00:10:49 +03:00
Andrey Smirnov	eefe1c21c3	feat: add new etcd members in learner mode Fixes #3714 This provides more safe way to join new members to the etcd cluster. See https://etcd.io/docs/v3.4/learning/design-learner/ With learner mode join there are few differences: * new nodes are joined one by one, because etcd enforces a single learner member in the cluster * learner members are not counted in quorum calculations, so while learner catches up with the master node, quorum is not affected and cluster is still operational Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-08-12 17:56:57 +03:00
Andrey Smirnov	0c7ce1cd81	feat: remove remnants of bootkube support Fixes #3951 Bootkube support was removed in Talos 0.9. Talos versions 0.9-0.11 support conversion of self-hosted bootkube-based control plane to the new style control plane running as static pods managed by Talos. This commit removes all backwards compatibility and removes conversion code. For the k8s controllers, `BootstrapStatus` is removed and a dependency on `etcd` service status is added (as it was implicitly there via `BootstrapStatus`). Remove control plane conversion code. In k8s upgrade code, remove self-hosted part. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-08-03 07:55:42 -07:00
Alexey Palazhchenko	eea750de2c	chore: rename "join" type to "worker" Closes #3413. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-07-09 07:10:45 -07:00
Alexey Palazhchenko	bbf1c091d4	feat: add RBAC to `talosctl version` output Refs #3852. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-06-28 07:10:25 -07:00
Alexey Palazhchenko	06209bba28	chore: update RBAC rules, remove old APIs Refs #3421. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-06-18 09:54:49 -07:00
Alexey Palazhchenko	f63ab9dd9b	feat: implement `talosctl config new` command Refs #3421. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-06-17 09:06:43 -07:00
Artem Chernyshev	9a91142a38	feat: print complete member info in etcd members Fixes: https://github.com/talos-systems/talos/issues/3487 Example output: ``` NODE ID HOSTNAME PEERS CLIENTS 10.5.0.2 c3d3020cf75b8728 talos-default-master-1 https://10.5.0.2:2380 https://10.5.0.2:2379 ``` Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-04-17 11:07:59 -07:00
Andrey Smirnov	0bd8b0e800	feat: provide an option to recover etcd from data directory copy Sometimes `talosctl etcd snapshot` might not be available, for example when etcd is not healthy. In that case it's possible to copy raw etcd data directory with `talosctl cp /var/lib/etcd .` and use `member/snap/db` to recover the cluster. But such copy won't pass integrity checks, so they should be disabled explicitly. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-14 08:25:32 -07:00
Alexey Palazhchenko	29da22d063	feat: add config validation warnings Closes #3412. Refs #3413. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-04-08 13:49:58 -07:00
Andrey Smirnov	e0650218a6	feat: support etcd recovery from snapshot on bootstrap When Talos `controlplane` node is waiting for a bootstrap, `etcd` contents can be recovered from a snapshot created with `talosctl etcd snapshot` on a healthy cluster. Bootstrap process goes same way as before, but the etcd data directory is recovered from the snapshot. This flow enables disaster recovery for the control plane: given that periodic backups are available, destroy control plane nodes, re-create them with the same config, and bootstrap one node with the saved snapshot to recover etcd state at the time of the snapshot. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-08 10:15:37 -07:00
Andrey Smirnov	e664362cec	feat: add API and command to save etcd snapshot (backup) This adds a simple API and `talosctl etcd snapshot` command to stream snapshot of etcd from one of the control plane nodes to the local file. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-02 09:20:16 -07:00
Artem Chernyshev	376fdcf6cb	feat: implement etcd remove-member cli command Fixes: https://github.com/talos-systems/talos/issues/3219 We already have `etcd leave`, which makes the node exclude itself from etcd members. But in case if the node can't remove itself because it doesn't have connection to etcd we need this etcd remove-member cli, which basically removes a node from a different node. No unit tests for that as it's going to destroy the test cluster. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-01 07:55:08 -08:00
Andrey Smirnov	7751920dba	feat: add a tool and package to convert self-hosted CP to static pods This is required to upgrade from Talos 0.8.x to 0.9.x. After the cluster is fully upgraded, control plane is still self-hosted (as it was bootstrapped with bootkube). Tool `talosctl convert-k8s` (and library behind it) performs the upgrade to self-hosted version. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-17 23:26:57 -08:00
Andrey Smirnov	cc83b83808	feat: rename apply-config --no-reboot to --on-reboot This explains the intetion better: config is applied on reboot, and allows to easily distinguish it from `apply-config --immediate` which applies config immediately without a reboot (that is coming in a different PR). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-17 12:49:47 -08:00
Andrey Smirnov	d99a016af2	fix: correct response structure for GenerateConfig API Also fix recovery grpc handler to print panic stacktrace to the log. Any API should follow the structure compatible with apid proxying injection of errors/nodes. Explicitly fail GenerateConfig API on worker nodes, as it panics on worker nodes (missing certificates in node config). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-11 06:34:10 -08:00
Andrey Smirnov	edf5777222	feat: add an option to force upgrade without checks Our upgrades are safe by default - we check etcd health, take locks, etc. But sometimes upgrades might be a way to recover broken (or semi-broken) cluster, in that case we need upgrade to run even if the checks are not passing. This is not a safe way to do upgrades, but it might be a way to recover a cluster. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-09 10:20:03 -08:00
Andrey Smirnov	76a6794436	fix: kill all processes and umount all disk on reboot/shutdown There are several ways Talos node might be restarted or shut down: * error in sequence (initiated from machined) * panic in main goroutine (machined recovers panics) * error in sequence (initiated via API, event caught by machined) * reboot/shutdown via Talos API Before this change, paths (1) and (2) were handled in machined, and no disks were unmounted and processes killed, so technically all the processes are running and potentially writing to the filesystems. Paths (3) and (4) try to stop services (but not pods) and unmount explicitly mounted filesystems, followed by reboot directly from sequencer (bypassing machined handler). There was a bug that user disks were never explicitly unmounted (but they might have been unmounted if mounted on top `/var`). This refactors all the reboot/shutdown paths to flow through machined's main function: on paths (4) event is sent via event API from the sequencer back to the machined and machined initiates proper shutdown sequence. Refactoring in machined leads to all the paths (1)-(4) flowing through the same function `handle(error)`. Added two additional checks before flushing buffers: * kill all non-system processes, this also kills all mount namespaces * unmount any filesystem backed by `/dev/*` This ensures all filesystems are unmounted before buffers are flushed. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-29 06:14:07 -08:00
Andrey Smirnov	0aaf8fa968	feat: replace bootkube with Talos-managed control plane Control plane components are running as static pods managed by the kubelets. Whole subsystem is managed via resources/controllers from os-runtime. Many supporting changes/refactoring to enable new code paths. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-26 14:22:35 -08:00
Alexey Palazhchenko	f3465b8e3e	feat: support type filter in list API and CLI Closes #2068. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2020-12-24 06:34:02 -08:00
Artem Chernyshev	a83e8758db	feat: add commands to manage/query etcd cluster Used already existing protobufs for that. Commands: `talosctl etcd members -n <node>` `talosctl etcd leave -n <node>` `talosctl etcd forfeit-leadership -n <node>` Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-12-22 11:49:10 -08:00
Andrey Smirnov	54ed80e244	feat: reset with system disk wipe spec Idea is to add an option to perform "selective" reset: default reset operation is to wipe all partitions (triggering reinstall), while spec allows only to wipe some of the operations. Other operations are performed exactly in the same way for any reset flow. Possible use case: reset only `EPHEMERAL` partition. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-10 11:31:07 -08:00
Andrey Smirnov	350280eb59	feat: implement "staged" (failsafe/backup) upgrades Regular upgrade path takes just one reboot, but it requires all the processes to be stopped on the node before upgrade might proceed. Under some circumstances and with potential Talos bugs it might not work rendering Talos upgrades almost impossible. Staged upgrades build upon regular install flow to run the upgrade on the node reboot. Such upgrades require two reboots of the node, and it requires two pulls of the installer image, but they should be much less suspicious to the failure. Once the upgrade is staged, node can be rebooted in any possible way, including hard reset and upgrade is performed on the next boot. New ADV format was implemented as well to allow to store install image ref/options across reboots. New format allows for bigger values and takes 50% of the `META` partition. Old ADV is still kept for compatibility reasons. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-08 08:34:26 -08:00
Artem Chernyshev	5d48bd5f6a	feat: allow disabling NoSchedule taint on masters using TUI installer I think this should come handy for setting up single node SBC clusters. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-12-07 07:31:54 -08:00
Artem Chernyshev	63e0d02aa9	feat: add TUI for configuring network interfaces settings Allows configuring: - cidr. - dhcp enable/disable. - MTU. - Ignore. - Dhcp metric. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-12-03 11:05:55 -08:00
Artem Chernyshev	c7062e3f4d	feat: make GenerateConfiguration accept current time as a parameter If the node time is out of sync, it can generate incorrect configuration. And maintenance mode does not allow us starting ntp, because there is no containerd. By providing current UTC time of the machine where talosctl client is running, it is possible to force GenerateConfiguration use correct time. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-12-03 08:28:11 -08:00
Artem Chernyshev	f96cffd2b2	feat: add ability to choose CNI config Initial version which only allows setting CNI using preset, no custom CNI urls are supported at the moment. Still need to figure out what kind of UI can be used for that. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-26 06:49:54 -08:00
Andrey Smirnov	9a32e34cb1	feat: implement apply configuration without reboot This allows config to be written to disk without being applied immediately. Small refactoring to extract common code paths. At first, I tried to implement this via the sequencer, but looks like it's too hard to get it right, as sequencer lacks context and config to be written is not applied to the runtime. Fixes #2828 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-23 12:42:44 -08:00
Artem Chernyshev	8513123d22	feat: return client config as the second value in GenerateConfiguration To be used in interactive installer to output the node client configuration to a file. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-17 07:20:05 -08:00
Artem Chernyshev	0f924b5122	feat: add generate config gRPC API Fixes: https://github.com/talos-systems/talos/issues/2766 This API is implemented in Maintenance and Machine services. Can be used to generate configuration on the node, instead of using talosctl to generate it locally. To be used in interactive installer and talosctl gen config. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-13 08:07:32 -08:00
Artem Chernyshev	93e30a1738	chore: remove maintenance service interface and use machine service Now maintenance service implements `MachineService` interface, stubbing all not implemented methods. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-11 12:33:44 -08:00
Andrew Rynhard	71321214a1	feat: add storage API This is the initial implementation of a storage API. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-11-11 10:12:25 -08:00
Andrew Rynhard	562f816526	refactor: use gRPC for interactive installation Instead of hosting a web service, we decided to implement a gRPC service that exposes APIs that can be used in a client-side interactive installer. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-11-03 08:36:44 -08:00
Artem Chernyshev	e7e99cf1b3	feat: support disk usage command in talosctl Usage example: ```bash talosctl du --nodes 10.5.0.2 /var -H -d 2 NODE NAME 10.5.0.2 8.4 kB etc 10.5.0.2 1.3 GB lib 10.5.0.2 16 MB log 10.5.0.2 25 kB run 10.5.0.2 4.1 kB tmp 10.5.0.2 1.3 GB . ``` Supported flags: - `-a` writes counts for all files, not just directories. - `-d` recursion depth - '-H' humanize size outputs. - '-t' size threshold (skip files if < size or > size). Fixes: https://github.com/talos-systems/talos/issues/2504 Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-10-13 09:30:31 -07:00
Andrew Rynhard	4eeef28e90	feat: add etcd API This adds RPCs for basic etcd management tasks. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-10-06 11:30:04 -07:00
Seán C McCord	ff92d2a14b	feat: add ApplyConfiguration API Adds the ability to apply (replace) an existing node configuration with a new one via the Machine API. Fixes #2345 Signed-off-by: Seán C McCord <ulexus@gmail.com>	2020-09-29 14:44:06 -07:00
Andrey Smirnov	bddd4f1bf6	refactor: move external API packages into `machinery/` This moves `pkg/config`, `pkg/client` and `pkg/constants` under `pkg/machinery` umbrella. And `pkg/machinery` is published as Go module inside Talos repository. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-17 09:56:14 -07:00
Andrey Smirnov	ad99cb6421	feat: implement talosctl dashboard command This builds a simple CLI UI for Talos cluster monitoring. Some new APIs were added for monitoring based on Prometheus procfs package. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 14:24:04 -07:00
Andrey Smirnov	cbb7ca8390	refactor: merge osd into machined This merges `osd` API into `machined`. API was copied from `osd` into `machined`, and `osd` API was deprecated. For backwards compatibility, `machined` still implements `osd` API, so older Talos API clients can still talk to the node without changes. Docs were updated. No functional changes. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-13 12:50:00 -07:00
Andrey Smirnov	4cc074cdba	feat: implement API access to event history 1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids are sortable and unique enough. Xids also encode event publishing time with a second precision. 2. Add three ways to look back into event history: based on number of events, on time and ID. Lookup via ID might be used to restart event polling in case of broken API connection from the same moment. 3. Reimplement core event buffer with positions which are always incremented instead of generation+index, this implementation is much more simple (idea from circular buffer). 4. By default, Events API works the same - it shows no history and starts streaming new events only. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 10:54:50 -07:00
Andrey Smirnov	a6b3bd2ff6	feat: implement service events This implements service events, adds test for events API based on service events as they're the easiest to generate on demand. Disabled validate test for 'metal' as it validates disk device against local system which doesn't make much sense. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-03 13:52:53 -07:00
Andrew Rynhard	a5a2d959ed	feat: upgrade runc to v1.0.0-rc90 This updates runc to the same version vendored by containerd. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-02 13:19:33 -07:00
Andrew Rynhard	11ad2a5ea8	feat: add rollback API This adds an API for rolling back the version of Talos loaded by the bootloader. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-06-09 16:18:40 -07:00
Andrey Smirnov	1739439674	fix: update Events API response type to match proxying conventions Streaming APIs are not supposed to wrap response into `repeated` container, as streaming allows to send as many responses back as required. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-05-15 11:57:47 -07:00
Andrew Rynhard	7915c73a86	fix: register event service with router This adds the events streaming RPC to routerd. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-15 07:33:32 -07:00
Andrew Rynhard	1902519727	feat: add events API This adds an event stream to the runtime, and the ability to stream events via the API. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-13 12:18:10 -07:00
Andrew Rynhard	8e07b1bab3	feat: add bootstrap API This adds the ability to bootstrap a cluster using the API. The API simply starts the bootkube service. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-07 16:47:28 -07:00
Andrew Rynhard	56d7bf19fe	feat: add recovery API This adds an API for recovering the self-hosted control plane. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-04 19:38:30 -07:00
Andrew Rynhard	49307d554d	refactor: improve machined This is a rewrite of machined. It addresses some of the limitations and complexity in the implementation. This introduces the idea of a controller. A controller is responsible for managing the runtime, the sequencer, and a new state type introduced in this PR. A few highlights are: - no more event bus - functional approach to tasks (no more types defined for each task) - the task function definition now offers a lot more context, like access to raw API requests, the current sequence, a logger, the new state interface, and the runtime interface. - no more panics to handle reboots - additional initialize and reboot sequences - graceful gRPC server shutdown on critical errors - config is now stored at install time to avoid having to download it at install time and at boot time - upgrades now use the local config instead of downloading it - the upgrade API's preserve option takes precedence over the config's install force option Additionally, this pulls various packes in under machined to make the code easier to navigate. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-04-28 08:20:55 -07:00
Andrew Rynhard	69fa63a7b2	refactor: perform upgrade upon reboot This PR introduces a new strategy for upgrades. Instead of attempting to zap the partition table, create a new one, and then format the partitions, this change will only update the `vmlinuz`, and `initramfs.xz` being used to boot. It introduces an A/B style upgrade process, which will allow for easy rollbacks. One deviation from our original intention with upgrades is that this change does not completely reset a node. It falls just short of that and does not reset the partition table. This forces us to keep the current partition scheme in mind as we make changes in the future, because an upgrade assumes a specific partition scheme. We can improve upgrades further in the future, but this will at least make them more dependable. Finally, one more feature in this PR is the ability to keep state. This enables single node clusters to upgrade since we keep the etcd data around. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-03-20 17:32:18 -07:00

1 2

68 Commits