talos

mirror of https://github.com/siderolabs/talos.git synced 2025-10-11 07:31:18 +02:00

Author	SHA1	Message	Date
Andrey Smirnov	1df841bb54	refactor: change the interface of META Use a global instance, handle loading/saving META in global context. Deprecate legacy syslinux ADV, provide an easier interface for consumers. Expose META as resources. Fix the bootloader revert process (it was completely broken for quite a while :sad:). This is a first step which mostly does preparation work, real changes will come in the next PRs: * add APIs to write to META * consume META keys for platform network config for `metal` * custom key for URL `${code}` Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2023-03-15 15:43:16 +04:00
Utku Ozdemir	f55f5df739	feat: move dashboard package & run it in tty2 Move dashboard package into a common location where both Talos and talosctl can use it. Add support for overriding stdin, stdout, stderr and ctt in process runner. Create a dashboard service which runs the dashboard on /dev/tty2. Redirect kernel messages to tty1 and switch to tty2 after starting the dashboard on it. Related to siderolabs/talos#6841, siderolabs/talos#4791. Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>	2023-02-28 12:00:25 +01:00
Utku Ozdemir	5ac9f43e45	feat: start machined earlier & in maintenance mode Load & start machined earlier and in initialize sequence, so that it is possible to use its API over its unix socket in maintenance mode. Additionally, do not return features from Version API if a config is not yet available. Related to siderolabs/talos#4791. Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>	2023-02-21 12:21:36 +01:00
Noel Georgi	5cb2915d8e	feat: use wrapper for starting processes Use a wrapper for starting processes which can setup proper cgroups, OOMscore, and also drop capabilities for the process, then it calls `execve`. The containerd test is also fixed to support cgroups when running tests in buildkit. It used to pass previously as we did not error if cgroup setup failed. Signed-off-by: Noel Georgi <git@frezbo.dev>	2023-02-03 18:32:09 +05:30
Andrey Smirnov	96aa9638f7	chore: rename talos-systems/talos to siderolabs/talos There's a cyclic dependency on siderolink library which imports talos machinery back. We will fix that after we get talos pushed under a new name. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-11-03 16:50:32 +04:00
Andrey Smirnov	343c55762e	chore: replace talos-systems Go modules with siderolabs This is the first step towards replacing all import paths to be based on `siderolabs/` instead of `talos-systems/`. All updates contain no functional changes, just refactorings to adapt to the new path structure. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-11-01 12:55:40 +04:00
Dmitriy Matrenichev	93e55b85f2	chore: bump golangci-lint to v1.50.0 I had to do several things: - contextcheck now supports Go 1.18 generics, but I had to disable it because of this https://github.com/kkHAIKE/contextcheck/issues/9 - dupword produces too many false positives, so it's also disabled - revive found all packages which didn't have a documentation comment before. And there are A LOT of them. I updated some of them, but gave up at some point and just added them to exclude rules for now. - change lint-vulncheck to use `base` stage as base Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>	2022-10-20 18:33:19 +03:00
Andrey Smirnov	139c62d762	feat: allow upgrades in maintenance mode (only over SideroLink) This implements a simple way to upgrade Talos node running in maintenance mode (only if Talos is installed, i.e. if `STATE` and `EPHEMERAL` partitions are wiped). Upgrade is only available over SideroLink for security reasons. Upgrade in maintenance mode doesn't support any options, and it works without machine configuration, so proxy environment variables are not available, registry mirrors can't be used, and extensions are not installed. Fixes #6224 Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-09-30 21:16:15 +04:00
Andrey Smirnov	8c3ac4c42b	chore: limit GOMAXPROCS for Talos services Fixes #5971 Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-08-24 15:42:49 +04:00
Andrey Smirnov	f9b664c947	fix: reload trusted CA list when client is recreated Fixes #5652 This reworks and unifies HTTP client/transport management in Talos: * cleanhttp is used everywhere consistently * DefaultClient is using pooled client, other clients use regular transport * like before, Proxy vars are inspected on each request (but now consistently) * manifest download functions now recreate the client on each run to pick up latest changes * system CA list is picked up from a fixed locations, and supports reloading on changes Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-08-04 20:01:35 +04:00
Artem Chernyshev	ae1bec59e9	feat: allow running only one sequence at a time Fix `Talos` sequencer to run only a single sequence at the same time. Sequence priorities were updated to match the table: \| what is running (columns) what is requested (rows) \| boot \| reboot \| reset \| upgrade \| \|----------------------------------------------------\|------\|--------\|-------\|---------\| \| reboot \| Y \| Y \| Y \| N \| \| reset \| Y \| N \| N \| N \| \| upgrade \| Y \| N \| N \| N \| With a small addition that `WithTakeover` is still there. If set, priority is ignored. This is mainly used for `Shutdown` sequence invocation, and when applying config with reboot enabled. Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>	2022-07-27 17:21:36 +03:00
Philipp Sauter	f54d907871	fix: enable orderly poweroff in hyper-v on Azure Previously Talos would not shutdown gracefully if hyper-v issued the 'perform_shutdown' call. Said call would execute '/sbin/poweroff' which did not exist in Talos. We hardlink machined to '/sbin/poweroff' and make it send a shutdown API call to PID 1 machined. Fixes #5641 Signed-off-by: Philipp Sauter <philipp.sauter@siderolabs.com>	2022-06-15 12:49:17 +02:00
Dmitriy Matrenichev	e06e1473b0	feat: update golangci-lint to 1.45.0 and gofumpt to 0.3.0 - Update golangci-lint to 1.45.0 - Update gofumpt to 0.3.0 - Fix gofumpt errors - Add goimports and format imports since gofumports is removed - Update Dockerfile - Fix .golangci.yml configuration - Fix linting errors Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>	2022-03-24 08:14:04 +04:00
Artem Chernyshev	27af5d41c6	feat: pause the boot process on some failures instead of rebooting Some failures can be fixed by updating the machine configuration. Now `userDisks` and `userFiles` failures do not make Talos enter a reboot loop; instead it pauses for 35 minutes. Additionally, `apid` and `machined` are now started right after containerd is up and running. That makes it possible for the operator to connect to the node using talosctl and fix the config. Fixes: https://github.com/talos-systems/talos/issues/4669 Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>	2022-03-21 17:39:45 +03:00
Andrey Smirnov	3257751bc0	fix: initialize Drainer properly Because of the bug Drainer never worked properly, as `shutdown` channel wasn't initialized. Also add unit-tests and add some small clean-ups which don't affect functionality. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-11-26 17:18:05 +03:00
Artem Chernyshev	7433150fd8	feat: implement events sink controller Report Talos events to any gRPC server. Destination address is specified by using kernel parameters. Fixes: https://github.com/talos-systems/talos/issues/4458 Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>	2021-11-25 18:37:31 +03:00
Alexey Palazhchenko	0dad5f4d78	chore: small cleanup Remove empty tests. Remove unused parameter. Remove extra parameter. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>	2021-10-14 08:54:24 +00:00
Alexey Palazhchenko	f63ab9dd9b	feat: implement `talosctl config new` command Refs #3421. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-06-17 09:06:43 -07:00
Alexey Palazhchenko	c81cfb2167	chore: allow building with debug handlers Refs #3534. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-05-13 02:20:15 -07:00
Andrey Smirnov	2ea20f598a	feat: replace timed with time sync controller This is a complete rewrite of time sync process. Now the time sync process starts early at boot time, and it adapts to configuration changes: * before config is available, `pool.ntp.org` is used * once config is available, configured time servers are used Controller updates same time sync resource as other controllers had dependency on, so they have a chance to wait for the time sync event. Talos services which depend on time now wait on same resource instead of waiting on timed health. New features: * time sync now sticks to the particular time server unless there's an error from that server, and server is changed in that case, this improves time sync accuracy * time sync acts on config changes immediately, so it's possible to reconfigure time sync at any time * there's a new 'epoch' field in time sync resources which allows time-dependent controllers to regenerate certs when there's a big enough jump in time Features to implement later: * apid shouldn't depend on timed, it should be started early and it should regenerate certs on time jump * trustd should be updated in same way Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-29 09:29:43 -07:00
Andrey Smirnov	b0209fd29d	refactor: move networkd, timed APIs to machined, remove routerd This moves implementation of the user-facing APIs to the machined, and as now all the APIs are implemented by machined, remove routerd and adjust apid to proxy to machined. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-24 00:00:28 -07:00
Andrey Smirnov	ac8764702f	refactor: move apid, routerd, timed and trustd to single executable This removes container images for the aforementioned services, they are now built into `machined` executable which launches one or another service based on `argv[0]`. Containers are started with rootfs directory which contains only a single executable file for the service. This creates rootfs on squashfs for each container in `/opt/<container>`. Service `networkd` is not touched as it's handled in #3350. This removes all the image imports, snapshots and other things which were associated with the existing way to run containers. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-23 09:48:11 -07:00
Alexey Palazhchenko	df52c13581	chore: fix //nolint directives That's the recommended syntax: https://golangci-lint.run/usage/false-positives/ Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-03-05 05:58:33 -08:00
Artem Chernyshev	4e47f6766e	feat: bypass lock if ACPI reboot/shutdown issued Fixes: https://github.com/talos-systems/talos/issues/2997 Listen for restart events in parallel with the boot sequence and cancel the context if got `RestartEvent`. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-03 22:05:59 +03:00
Artem Chernyshev	638af35db0	chore: properly propagate context object in the controller This is required to correctly handle ACPI reboot or forceful reboots during sequence that locks the controller. Additionally fix `NoSchedule` untaint when the configuration is changed. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-03 16:59:27 +03:00
Artem Chernyshev	f96548e165	refactor: extract go-cmd into a separate library To be used in the `go-blockdevice` library. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-02-16 10:31:20 -08:00
Andrey Smirnov	512c79e8d6	fix: lower memory usage a bit by disabling memory profiling As of now, we're not using Go profiling, so it's safe to disable it to save some memory and CPU costs. Once we start using it, we can re-enable it conditionally. Each process allocates around 1.4MiB on amd64 for memory profiling buckets. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-01 04:49:59 -08:00
Andrey Smirnov	76a6794436	fix: kill all processes and umount all disk on reboot/shutdown There are several ways Talos node might be restarted or shut down: * error in sequence (initiated from machined) * panic in main goroutine (machined recovers panics) * error in sequence (initiated via API, event caught by machined) * reboot/shutdown via Talos API Before this change, paths (1) and (2) were handled in machined, and no disks were unmounted and processes killed, so technically all the processes are running and potentially writing to the filesystems. Paths (3) and (4) try to stop services (but not pods) and unmount explicitly mounted filesystems, followed by reboot directly from sequencer (bypassing machined handler). There was a bug that user disks were never explicitly unmounted (but they might have been unmounted if mounted on top of `/var`). This refactors all the reboot/shutdown paths to flow through machined's main function: on path (4) the event is sent via event API from the sequencer back to the machined and machined initiates proper shutdown sequence. Refactoring in machined leads to all the paths (1)-(4) flowing through the same function `handle(error)`. Added two additional checks before flushing buffers: * kill all non-system processes, this also kills all mount namespaces * unmount any filesystem backed by `/dev/*` This ensures all filesystems are unmounted before buffers are flushed. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-29 06:14:07 -08:00
Andrey Smirnov	11863dd74d	feat: implement resource API in Talos This brings in `os-runtime` package and exposes resources with first iteration of read-only API. Two Talos resources (and one controller) are implemented: * legacy.Service resource tracks Talos 'service' `RUNNING` state * config.V1Alpha1 stores current runtime config Glue point between existing runtime and new os-runtime based runtime is in `v1alpha2` implementation and `V1Alpha2()` sub-interfaces of existing `Runtime`, `State`, `Controller` interfaces. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-19 11:45:46 -08:00
Andrey Smirnov	a2efa44663	chore: enable gci linter Fixes were applied automatically. Import ordering might be questionable, but it's strict: * stdlib * other packages * same package imports Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-09 08:09:48 -08:00
Andrey Smirnov	93f6586900	fix: don't abort reboot sequence on bootloader meta failure If bootloader meta failed to be found/to be reverted, don't abort the whole sequence of actions leading to reboot, otherwise control returns back and machined tries to run next sequence in failed state. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-09-07 13:59:22 -07:00
Andrew Rynhard	1a4059a553	feat: add grub bootloader This moves to using grub instead of syslinux. BREAKING CHANGE: Single node upgrades will fail in this change. This will also break the A/B fallback setup since this version introduces an entirely new partition scheme, that any fallback will not know about. We plan on addressing these issues in a follow up change. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-09-01 12:06:43 -07:00
Andrey Smirnov	bddd4f1bf6	refactor: move external API packages into `machinery/` This moves `pkg/config`, `pkg/client` and `pkg/constants` under `pkg/machinery` umbrella. And `pkg/machinery` is published as Go module inside Talos repository. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-17 09:56:14 -07:00
Andrey Smirnov	74413b1393	fix: ignore sequence lock errors in machined This prevents reboots when some action triggers a sequence while another sequence is still running. Fixes #2209 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-20 14:36:06 -07:00
Andrey Smirnov	4cc074cdba	feat: implement API access to event history 1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids are sortable and unique enough. Xids also encode event publishing time with a second precision. 2. Add three ways to look back into event history: based on number of events, on time and ID. Lookup via ID might be used to restart event polling in case of broken API connection from the same moment. 3. Reimplement core event buffer with positions which are always incremented instead of generation+index, this implementation is much more simple (idea from circular buffer). 4. By default, Events API works the same - it shows no history and starts streaming new events only. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 10:54:50 -07:00
Andrey Smirnov	fb585902a3	chore: replace underlying event implementation with single slice The idea here is to use single slice of events for all the consumers. Each consumer keeps its own position within the stream, and stream is structured as circular buffer to avoid using too much memory. This implementation allows for one more feature: looking "back" into the event history and returning past events starting with some offset (e.g. timestamp, event ID, etc.). This feature is not implemented yet. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-05-20 11:12:43 -07:00
Andrew Rynhard	a733a9714f	fix: run machined API as a service In recent refactoring the machined API service was changed to run outside of the service framework. This brings it back as a service. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-15 17:27:19 -07:00
Andrew Rynhard	1902519727	feat: add events API This adds an event stream to the runtime, and the ability to stream events via the API. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-13 12:18:10 -07:00
Andrew Rynhard	83062f37bd	fix: write machined RPC logs to file This ensures that the machined RPC logs are written to disk so that users can retrieve them via the log API. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-07 14:17:59 -07:00
Andrew Rynhard	49307d554d	refactor: improve machined This is a rewrite of machined. It addresses some of the limitations and complexity in the implementation. This introduces the idea of a controller. A controller is responsible for managing the runtime, the sequencer, and a new state type introduced in this PR. A few highlights are: - no more event bus - functional approach to tasks (no more types defined for each task) - the task function definition now offers a lot more context, like access to raw API requests, the current sequence, a logger, the new state interface, and the runtime interface. - no more panics to handle reboots - additional initialize and reboot sequences - graceful gRPC server shutdown on critical errors - config is now stored at install time to avoid having to download it at install time and at boot time - upgrades now use the local config instead of downloading it - the upgrade API's preserve option takes precedence over the config's install force option Additionally, this pulls various packages in under machined to make the code easier to navigate. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-04-28 08:20:55 -07:00
Andrew Rynhard	a10acd592a	chore: address random CI nits This PR does the following: - updates the conform config - cleans up conform scopes - moves slash commands to the talos-bot - adds a check list to the pull request template - disables codecov comments - uses `BOT_TOKEN` so all actions are performed as the talos-bot user - adds a `make conformance` target to make it easy for contributors to check their commit before creating a PR - bumps golangci-lint to v1.24.0 Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-04-13 13:01:14 -07:00
Andrew Rynhard	83d0851563	fix: delete tag on revert with empty label We need to ensure that we delete the upgrade tag from the ADV even if the tag value is an empty string. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-03-30 15:15:22 -07:00
Andrew Rynhard	47327eca09	fix: move empty label check We should always set the fallback tag on an upgrade, and only revert if the tag value is not an empty string. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-03-30 13:42:08 -07:00
Andrew Rynhard	6fe5fed6f9	fix: make upgrades work with UEFI Since the `--once` option of `extlinux` seems to only work with BIOS, we needed to change to remove any reliance on this option. Instead of booting the upgraded version once, and then making it the default after a successful boot, we now make it the default, and then revert on any boot error. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-03-26 13:34:00 -07:00
Andrew Rynhard	69fa63a7b2	refactor: perform upgrade upon reboot This PR introduces a new strategy for upgrades. Instead of attempting to zap the partition table, create a new one, and then format the partitions, this change will only update the `vmlinuz`, and `initramfs.xz` being used to boot. It introduces an A/B style upgrade process, which will allow for easy rollbacks. One deviation from our original intention with upgrades is that this change does not completely reset a node. It falls just short of that and does not reset the partition table. This forces us to keep the current partition scheme in mind as we make changes in the future, because an upgrade assumes a specific partition scheme. We can improve upgrades further in the future, but this will at least make them more dependable. Finally, one more feature in this PR is the ability to keep state. This enables single node clusters to upgrade since we keep the etcd data around. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-03-20 17:32:18 -07:00
Andrew Rynhard	fe7847e0b8	feat: add reboot flag to reset API This adds the ability to automatically reboot a machine after a reset. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-02-19 05:10:58 -08:00
Spencer Smith	8092362098	fix: fix reset command This PR will fix the reset command to actually wipe the system disk as expected. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-02-18 16:18:43 -05:00
Andrey Smirnov	565c747582	fix: install sequence stuck on event bus machined's main.go waits for boot sequence to finish, while metal platform initializer tries to send a message to the event bus without any listeners, so this is pure deadlock. Resolve that by panicking from initializer, this aborts phase and sequence, and leads to reboot on panic. Not really clean as it leaves scary stacktraces in the dmesg, but it works. Cleanup might be done by introducing error value for reboot, and ignoring it when printing the errors. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-01-21 16:28:00 -06:00
Andrew Rynhard	5b5d171c07	fix: block when handling bus event If we don't block, there is the potential for multiple shutdown, reboot, and upgrade requests to be processed. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-01-20 09:19:50 -08:00
Brad Beam	f722adb865	fix(machined): Add additional defaults for http transport Followup from #1680. This also moves the setting from phases to machine.init to set it earlier in the boot sequence to ensure that we get the defaults set properly from the start and set it only once. Signed-off-by: Brad Beam <brad.beam@talos-systems.com>	2019-12-30 08:13:22 -08:00

79 Commits