talos

mirror of https://github.com/siderolabs/talos.git synced 2025-10-11 07:31:18 +02:00

Author	SHA1	Message	Date
Artem Chernyshev	376fdcf6cb	feat: implement etcd remove-member cli command Fixes: https://github.com/talos-systems/talos/issues/3219 We already have `etcd leave`, which makes the node remove itself from the etcd members. But if the node can't remove itself because it has no connection to etcd, we need this `etcd remove-member` cli command, which removes the node by running the command against a different node. No unit tests for that as it's going to destroy the test cluster. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-01 07:55:08 -08:00
Andrey Smirnov	589d01892c	fix: update the layout of the Disks API to match proxying requirements Fixes #3199 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-24 11:33:15 -08:00
Andrey Smirnov	7751920dba	feat: add a tool and package to convert self-hosted CP to static pods This is required to upgrade from Talos 0.8.x to 0.9.x. After the cluster is fully upgraded, control plane is still self-hosted (as it was bootstrapped with bootkube). Tool `talosctl convert-k8s` (and library behind it) performs the conversion from the self-hosted version to static pods. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-17 23:26:57 -08:00
Andrey Smirnov	e5bd35ae3c	feat: add resource watch API + CLI This uses API in `os-runtime` to pull the initial list of resources + updates for resource by type. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-17 13:24:47 -08:00
Andrey Smirnov	cc83b83808	feat: rename apply-config --no-reboot to --on-reboot This explains the intention better: config is applied on reboot, and it is easy to distinguish from `apply-config --immediate`, which applies config immediately without a reboot (that is coming in a different PR). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-17 12:49:47 -08:00
Andrey Smirnov	d99a016af2	fix: correct response structure for GenerateConfig API Also fix recovery grpc handler to print panic stacktrace to the log. Any API should follow the structure compatible with apid proxying injection of errors/nodes. Explicitly fail GenerateConfig API on worker nodes, as it panics on worker nodes (missing certificates in node config). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-11 06:34:10 -08:00
Andrey Smirnov	edf5777222	feat: add an option to force upgrade without checks Our upgrades are safe by default - we check etcd health, take locks, etc. But sometimes upgrades might be a way to recover broken (or semi-broken) cluster, in that case we need upgrade to run even if the checks are not passing. This is not a safe way to do upgrades, but it might be a way to recover a cluster. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-09 10:20:03 -08:00
Andrey Smirnov	76a6794436	fix: kill all processes and umount all disk on reboot/shutdown There are several ways Talos node might be restarted or shut down: * error in sequence (initiated from machined) * panic in main goroutine (machined recovers panics) * error in sequence (initiated via API, event caught by machined) * reboot/shutdown via Talos API Before this change, paths (1) and (2) were handled in machined, and no disks were unmounted and processes killed, so technically all the processes are running and potentially writing to the filesystems. Paths (3) and (4) try to stop services (but not pods) and unmount explicitly mounted filesystems, followed by reboot directly from sequencer (bypassing machined handler). There was a bug that user disks were never explicitly unmounted (but they might have been unmounted if mounted on top `/var`). This refactors all the reboot/shutdown paths to flow through machined's main function: on paths (4) event is sent via event API from the sequencer back to the machined and machined initiates proper shutdown sequence. Refactoring in machined leads to all the paths (1)-(4) flowing through the same function `handle(error)`. Added two additional checks before flushing buffers: * kill all non-system processes, this also kills all mount namespaces * unmount any filesystem backed by `/dev/*` This ensures all filesystems are unmounted before buffers are flushed. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-29 06:14:07 -08:00
Andrey Smirnov	0aaf8fa968	feat: replace bootkube with Talos-managed control plane Control plane components are running as static pods managed by the kubelets. Whole subsystem is managed via resources/controllers from os-runtime. Many supporting changes/refactoring to enable new code paths. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-26 14:22:35 -08:00
Andrey Smirnov	11863dd74d	feat: implement resource API in Talos This brings in `os-runtime` package and exposes resources with first iteration of read-only API. Two Talos resources (and one controller) are implemented: * legacy.Service resource tracks Talos 'service' `RUNNING` state * config.V1Alpha1 stores current runtime config Glue point between existing runtime and new os-runtime based runtime is in `v1alpha2` implementation and `V1Alpha2()` sub-interfaces of existing `Runtime`, `State`, `Controller` interfaces. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-19 11:45:46 -08:00
Alexey Palazhchenko	f3465b8e3e	feat: support type filter in list API and CLI Closes #2068. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2020-12-24 06:34:02 -08:00
Andrey Smirnov	6a0e652f0c	fix: correctly transport gRPC errors from apid Before these changes, errors were always sent as strings, so if original error was gRPC error (which is almost always the case for apid), it is formatted as string and original fields (like code) are lost in the formatted string. With this change, apid sends errors as official `grpc.Status` protobuf structure, and client decodes that into Go grpc.Status based error. This change is backwards and forwards compatible. This should fix more cases when integration tests were not able to ignore grpc `transport is closing` errors when they were sent as strings from the apid endpoint. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-23 11:08:51 -08:00
Artem Chernyshev	a83e8758db	feat: add commands to manage/query etcd cluster Used already existing protobufs for that. Commands: `talosctl etcd members -n <node>` `talosctl etcd leave -n <node>` `talosctl etcd forfeit-leadership -n <node>` Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-12-22 11:49:10 -08:00
Andrey Smirnov	54ed80e244	feat: reset with system disk wipe spec Idea is to add an option to perform "selective" reset: default reset operation is to wipe all partitions (triggering reinstall), while spec allows wiping only some of the partitions. Other operations are performed exactly in the same way for any reset flow. Possible use case: reset only `EPHEMERAL` partition. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-10 11:31:07 -08:00
Andrey Smirnov	350280eb59	feat: implement "staged" (failsafe/backup) upgrades Regular upgrade path takes just one reboot, but it requires all the processes to be stopped on the node before upgrade might proceed. Under some circumstances and with potential Talos bugs it might not work, rendering Talos upgrades almost impossible. Staged upgrades build upon regular install flow to run the upgrade on the node reboot. Such upgrades require two reboots of the node and two pulls of the installer image, but they should be much less susceptible to failure. Once the upgrade is staged, node can be rebooted in any possible way, including hard reset, and upgrade is performed on the next boot. New ADV format was implemented as well to allow storing install image ref/options across reboots. New format allows for bigger values and takes 50% of the `META` partition. Old ADV is still kept for compatibility reasons. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-08 08:34:26 -08:00
Artem Chernyshev	5d48bd5f6a	feat: allow disabling NoSchedule taint on masters using TUI installer I think this should come handy for setting up single node SBC clusters. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-12-07 07:31:54 -08:00
Artem Chernyshev	63e0d02aa9	feat: add TUI for configuring network interfaces settings Allows configuring: - cidr. - dhcp enable/disable. - MTU. - Ignore. - Dhcp metric. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-12-03 11:05:55 -08:00
Artem Chernyshev	c7062e3f4d	feat: make GenerateConfiguration accept current time as a parameter If the node time is out of sync, it can generate incorrect configuration. And maintenance mode does not allow us starting ntp, because there is no containerd. By providing current UTC time of the machine where talosctl client is running, it is possible to force GenerateConfiguration use correct time. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-12-03 08:28:11 -08:00
Artem Chernyshev	f96cffd2b2	feat: add ability to choose CNI config Initial version which only allows setting CNI using preset, no custom CNI urls are supported at the moment. Still need to figure out what kind of UI can be used for that. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-26 06:49:54 -08:00
Andrey Smirnov	9a32e34cb1	feat: implement apply configuration without reboot This allows config to be written to disk without being applied immediately. Small refactoring to extract common code paths. At first, I tried to implement this via the sequencer, but looks like it's too hard to get it right, as sequencer lacks context and config to be written is not applied to the runtime. Fixes #2828 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-23 12:42:44 -08:00
Artem Chernyshev	8513123d22	feat: return client config as the second value in GenerateConfiguration To be used in interactive installer to output the node client configuration to a file. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-17 07:20:05 -08:00
Artem Chernyshev	0f924b5122	feat: add generate config gRPC API Fixes: https://github.com/talos-systems/talos/issues/2766 This API is implemented in Maintenance and Machine services. Can be used to generate configuration on the node, instead of using talosctl to generate it locally. To be used in interactive installer and talosctl gen config. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-13 08:07:32 -08:00
Artem Chernyshev	93e30a1738	chore: remove maintenance service interface and use machine service Now maintenance service implements `MachineService` interface, stubbing all not implemented methods. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-11 12:33:44 -08:00
Andrew Rynhard	71321214a1	feat: add storage API This is the initial implementation of a storage API. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-11-11 10:12:25 -08:00
Andrey Smirnov	026244097a	refactor: drop osd compatibility layer Fixes #2761 Service `osd` was merged into machined on Jul, 13th, before 0.6 release. It's time to drop the backwards compatibility with clients before 0.6. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-11 09:38:19 -08:00
Andrew Rynhard	562f816526	refactor: use gRPC for interactive installation Instead of hosting a web service, we decided to implement a gRPC service that exposes APIs that can be used in a client-side interactive installer. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-11-03 08:36:44 -08:00
Artem Chernyshev	e7e99cf1b3	feat: support disk usage command in talosctl Usage example: ```bash talosctl du --nodes 10.5.0.2 /var -H -d 2 NODE NAME 10.5.0.2 8.4 kB etc 10.5.0.2 1.3 GB lib 10.5.0.2 16 MB log 10.5.0.2 25 kB run 10.5.0.2 4.1 kB tmp 10.5.0.2 1.3 GB . ``` Supported flags: - `-a` writes counts for all files, not just directories. - `-d` recursion depth - '-H' humanize size outputs. - '-t' size threshold (skip files if < size or > size). Fixes: https://github.com/talos-systems/talos/issues/2504 Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-10-13 09:30:31 -07:00
Andrew Rynhard	4eeef28e90	feat: add etcd API This adds RPCs for basic etcd management tasks. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-10-06 11:30:04 -07:00
Seán C McCord	ff92d2a14b	feat: add ApplyConfiguration API Adds the ability to apply (replace) an existing node configuration with a new one via the Machine API. Fixes #2345 Signed-off-by: Seán C McCord <ulexus@gmail.com>	2020-09-29 14:44:06 -07:00
Andrey Smirnov	bddd4f1bf6	refactor: move external API packages into `machinery/` This moves `pkg/config`, `pkg/client` and `pkg/constants` under `pkg/machinery` umbrella. And `pkg/machinery` is published as Go module inside Talos repository. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-17 09:56:14 -07:00
Andrey Smirnov	74413b1393	fix: ignore sequence lock errors in machined This prevents reboots when some action triggers a sequence while another sequence is still running. Fixes #2209 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-20 14:36:06 -07:00
Andrey Smirnov	ad99cb6421	feat: implement talosctl dashboard command This builds a simple CLI UI for Talos cluster monitoring. Some new APIs were added for monitoring based on Prometheus procfs package. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 14:24:04 -07:00
Andrey Smirnov	c54639e541	feat: implement server-side API for cluster health checks This implements existing server-side health checks as defined in `internal/pkg/cluster/checks` in Talos API. Summary of changes: * new `cluster` API * `apid` now listens without auth on local file socket * `cluster` API is for now implemented in `machined`, but we can move it to the new service if we find it more appropriate * `talosctl health` by default now does server-side health check UX: `talosctl health` without arguments does health check for the cluster if it has healthy K8s to return master/worker nodes. If needed, node list can be overridden with flags. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-15 13:52:13 -07:00
Andrey Smirnov	cbb7ca8390	refactor: merge osd into machined This merges `osd` API into `machined`. API was copied from `osd` into `machined`, and `osd` API was deprecated. For backwards compatibility, `machined` still implements `osd` API, so older Talos API clients can still talk to the node without changes. Docs were updated. No functional changes. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-13 12:50:00 -07:00
Andrey Smirnov	4cc074cdba	feat: implement API access to event history 1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids are sortable and unique enough. Xids also encode event publishing time with a second precision. 2. Add three ways to look back into event history: based on number of events, on time and ID. Lookup via ID might be used to restart event polling in case of broken API connection from the same moment. 3. Reimplement core event buffer with positions which are always incremented instead of generation+index, this implementation is much more simple (idea from circular buffer). 4. By default, Events API works the same - it shows no history and starts streaming new events only. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 10:54:50 -07:00
Andrey Smirnov	a6b3bd2ff6	feat: implement service events This implements service events, adds test for events API based on service events as they're the easiest to generate on demand. Disabled validate test for 'metal' as it validates disk device against local system which doesn't make much sense. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-03 13:52:53 -07:00
Andrew Rynhard	a5a2d959ed	feat: upgrade runc to v1.0.0-rc90 This updates runc to the same version vendored by containerd. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-02 13:19:33 -07:00
Andrew Rynhard	11ad2a5ea8	feat: add rollback API This adds an API for rolling back the version of Talos loaded by the bootloader. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-06-09 16:18:40 -07:00
Andrey Smirnov	1739439674	fix: update Events API response type to match proxying conventions Streaming APIs are not supposed to wrap response into `repeated` container, as streaming allows to send as many responses back as required. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-05-15 11:57:47 -07:00
Andrew Rynhard	7915c73a86	fix: register event service with router This adds the events streaming RPC to routerd. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-15 07:33:32 -07:00
Andrew Rynhard	1902519727	feat: add events API This adds an event stream to the runtime, and the ability to stream events via the API. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-13 12:18:10 -07:00
Andrew Rynhard	8e07b1bab3	feat: add bootstrap API This adds the ability to bootstrap a cluster using the API. The API simply starts the bootkube service. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-07 16:47:28 -07:00
Andrew Rynhard	56d7bf19fe	feat: add recovery API This adds an API for recovering the self-hosted control plane. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-04 19:38:30 -07:00
Andrew Rynhard	49307d554d	refactor: improve machined This is a rewrite of machined. It addresses some of the limitations and complexity in the implementation. This introduces the idea of a controller. A controller is responsible for managing the runtime, the sequencer, and a new state type introduced in this PR. A few highlights are: - no more event bus - functional approach to tasks (no more types defined for each task) - the task function definition now offers a lot more context, like access to raw API requests, the current sequence, a logger, the new state interface, and the runtime interface. - no more panics to handle reboots - additional initialize and reboot sequences - graceful gRPC server shutdown on critical errors - config is now stored at install time to avoid having to download it at install time and at boot time - upgrades now use the local config instead of downloading it - the upgrade API's preserve option takes precedence over the config's install force option Additionally, this pulls various packages in under machined to make the code easier to navigate. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-04-28 08:20:55 -07:00
Andrew Rynhard	69fa63a7b2	refactor: perform upgrade upon reboot This PR introduces a new strategy for upgrades. Instead of attempting to zap the partition table, create a new one, and then format the partitions, this change will only update the `vmlinuz`, and `initramfs.xz` being used to boot. It introduces an A/B style upgrade process, which will allow for easy rollbacks. One deviation from our original intention with upgrades is that this change does not completely reset a node. It falls just short of that and does not reset the partition table. This forces us to keep the current partition scheme in mind as we make changes in the future, because an upgrade assumes a specific partition scheme. We can improve upgrades further in the future, but this will at least make them more dependable. Finally, one more feature in this PR is the ability to keep state. This enables single node clusters to upgrade since we keep the etcd data around. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-03-20 17:32:18 -07:00
Andrew Rynhard	fe7847e0b8	feat: add reboot flag to reset API This adds the ability to automatically reboot a machine after a reset. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-02-19 05:10:58 -08:00
Spencer Smith	8092362098	fix: fix reset command This PR will fix the reset command to actually wipe the system disk as expected. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-02-18 16:18:43 -05:00
Brad Beam	88df1b50b8	feat(networkd): Add health api This introduces a health/ready api for networkd. This will allow us to better determine the state of networkd and allow for some level of monitoring. Signed-off-by: Brad Beam <brad.beam@talos-systems.com>	2020-01-29 09:09:27 -06:00
Andrey Smirnov	6e05dd70c4	feat: add support for tailing logs Fixes #1564 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-17 22:35:47 +03:00
Andrew Rynhard	ad863a7f92	refactor: rename protobuf services, RPCs, and messages This PR brings our protobuf files into conformance with the protobuf style guide, and community conventions. It is purely renames, along with generated docs. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-12-11 11:41:40 -08:00


72 Commits
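
Note on commit 4cc074cdba: its message describes an event buffer whose positions are always incremented (instead of generation+index), with lookback by number of events or by ID. The following is a minimal Go sketch of that idea only, not the code from the commit. The names `Event`, `Buffer`, `Publish`, `Tail`, and `After` are hypothetical, a plain integer position stands in for the xid-based IDs, and time-based lookback is left out.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for a runtime event.
// ID is the write position; the commit itself uses xid-based IDs.
type Event struct {
	ID      int64
	Payload string
}

// Buffer is a fixed-capacity ring addressed by an ever-increasing position.
// The slot for position p is p % capacity, so no generation counter is needed.
type Buffer struct {
	events []Event
	next   int64 // total number of events published so far
}

// NewBuffer allocates a buffer that retains the last `capacity` events.
func NewBuffer(capacity int) *Buffer {
	return &Buffer{events: make([]Event, capacity)}
}

// Publish stores the payload at the next position and returns that position.
func (b *Buffer) Publish(payload string) int64 {
	pos := b.next
	b.events[pos%int64(len(b.events))] = Event{ID: pos, Payload: payload}
	b.next++
	return pos
}

// oldest returns the smallest position that has not been overwritten yet.
func (b *Buffer) oldest() int64 {
	if o := b.next - int64(len(b.events)); o > 0 {
		return o
	}
	return 0
}

// slice copies out events from position `from` (clamped to what is retained).
func (b *Buffer) slice(from int64) []Event {
	if from < b.oldest() {
		from = b.oldest()
	}
	if from > b.next {
		from = b.next
	}
	out := make([]Event, 0, b.next-from)
	for p := from; p < b.next; p++ {
		out = append(out, b.events[p%int64(len(b.events))])
	}
	return out
}

// Tail returns up to n of the most recent events, oldest first.
func (b *Buffer) Tail(n int) []Event {
	return b.slice(b.next - int64(n))
}

// After returns the retained events published after the given ID,
// which is how a client could resume polling after a broken connection.
func (b *Buffer) After(id int64) []Event {
	return b.slice(id + 1)
}

func main() {
	b := NewBuffer(3)
	for _, p := range []string{"a", "b", "c", "d", "e"} {
		b.Publish(p)
	}
	fmt.Println(b.Tail(2))  // the two newest events
	fmt.Println(b.After(0)) // everything still retained after ID 0
}
```

Because positions never wrap, "is this event still in the buffer" reduces to a single comparison against `oldest()`, which is presumably why the commit message calls this approach simpler.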