talos

mirror of https://github.com/siderolabs/talos.git synced 2025-08-08 07:37:06 +02:00

Author	SHA1	Message	Date
Andrey Smirnov	05fd042bb3	test: improve the reset integration tests Provide a trace for each step of the reset sequence taken, so if one of those fails, integration test produces a meaningful message instead of proceeding and failing somewhere else. More cleanup/refactor, should be functionally equivalent. Fixes #8635 Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>	2024-04-24 18:35:39 +04:00
Dmitriy Matrenichev	19f15a840c	chore: bump golangci-lint to 1.57.0 Fix all discovered issues. Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>	2024-03-21 01:06:53 +03:00
Andrey Smirnov	a52d3cda3b	chore: update gen and COSI runtime No actual changes, adapting to use new APIs. Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>	2023-09-22 12:13:13 +04:00
Andrey Smirnov	3c9f7a7de6	chore: re-enable nolintlint and typecheck linters Drop startup/rand.go, as since Go 1.20 `rand.Seed` is done automatically. Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>	2023-08-25 01:05:41 +04:00
Noel Georgi	6b0373ebef	chore: move bash tests to integration move extensions and secureboot tests to integration. Makes it easier to test. Signed-off-by: Noel Georgi <git@frezbo.dev>	2023-08-17 19:58:35 +05:30
Noel Georgi	e3f3f5794d	feat: implement revert for sd-boot Implement revert for sd-boot. Signed-off-by: Noel Georgi <git@frezbo.dev>	2023-06-22 20:20:31 +05:30
Andrey Smirnov	badbc51e63	refactor: rewrite code to include preliminary support for multi-doc `config.Container` implements a multi-doc container which implements both the `Container` interface (encoding, validation, etc.) and the `Config` interface (accessing parts of the config). Refactor `generate` and `bundle` packages to support multi-doc, and provide backwards compatibility. Implement a first (mostly example) machine config document for SideroLink API URL. Many places don't properly support multi-doc yet (e.g. config patches). Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2023-05-31 18:38:05 +04:00
Artem Chernyshev	b520710810	feat: introduce new flag in reset API that makes Talos reset user disks Fixes: https://github.com/siderolabs/talos/issues/6815 Additionally, make it possible to run reset in maintenance mode, to enable resetting the system disk and removing all traces of Talos from it. The new reset flow works in a separate sequence; the disk probe lookup was changed to check the boot partition instead of the ephemeral one. Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>	2023-02-28 15:10:41 +03:00
Andrey Smirnov	96aa9638f7	chore: rename talos-systems/talos to siderolabs/talos There's a cyclic dependency on siderolink library which imports talos machinery back. We will fix that after we get talos pushed under a new name. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-11-03 16:50:32 +04:00
Andrey Smirnov	343c55762e	chore: replace talos-systems Go modules with siderolabs This is the first step towards replacing all import paths to be based on `siderolabs/` instead of `talos-systems/`. All updates contain no functional changes, just refactorings to adapt to the new path structure. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-11-01 12:55:40 +04:00
Dmitriy Matrenichev	29bd632401	chore: remove old build tags syntax This commit removes lines containing the old build tag syntax. Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>	2022-08-24 17:27:01 +03:00
Artem Chernyshev	e5994ff7a7	fix: skip `ResetDuringBoot` test if the `Cluster` config is unknown And improve retry logic in the test. Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>	2022-07-28 15:57:58 +03:00
Artem Chernyshev	ae1bec59e9	feat: allow running only one sequence at a time Fix the `Talos` sequencer to run only a single sequence at a time. Sequence priorities were updated to match the following table (rows: requested sequence; columns: running sequence, in the order boot / reboot / reset / upgrade): reboot: Y / Y / Y / N; reset: Y / N / N / N; upgrade: Y / N / N / N. With a small addition: `WithTakeover` is still there. If set, priority is ignored. This is mainly used for `Shutdown` sequence invocation, and when applying config with reboot enabled. Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>	2022-07-27 17:21:36 +03:00
Utku Ozdemir	8d2be5e315	feat: extend node definition used in health checks Introduce `cluster.NodeInfo` to represent the basic info about a node which can be used in the health checks. This information, where possible, will be populated by the discovery service in following PRs. Part of siderolabs#5554. Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>	2022-06-13 14:13:42 +02:00
Dmitriy Matrenichev	e06e1473b0	feat: update golangci-lint to 1.45.0 and gofumpt to 0.3.0 - Update golangci-lint to 1.45.0 - Update gofumpt to 0.3.0 - Fix gofumpt errors - Add goimports and format imports since gofumports is removed - Update Dockerfile - Fix .golangci.yml configuration - Fix linting errors Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>	2022-03-24 08:14:04 +04:00
Alexey Palazhchenko	7462733bcb	chore: update golangci-lint Fix context propagation. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>	2021-11-15 14:55:25 +00:00
Andrey Smirnov	a059454045	chore: build using Go 1.17 `initramfs` size for amd64 shrinks by 1.3 MiB. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-09-13 22:33:47 +03:00
Alexey Palazhchenko	eea750de2c	chore: rename "join" type to "worker" Closes #3413. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-07-09 07:10:45 -07:00
Andrey Smirnov	62c702c4fd	fix: remove conflicting etcd member on rejoin with empty data directory This fixes a scenario where a control plane node loses the contents of `/var` without leaving etcd first: on reboot the etcd data directory is empty, but the member is already present in the etcd member list, so etcd won't be able to join because the raft log is empty. The fix is to remove a member with a matching hostname if found in the etcd member list, followed by a new member add. The risk here is removing another member which has the same hostname as the joining node, but having duplicate hostnames for control plane nodes is a problem anyway. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-03 15:11:44 -07:00
Andrey Smirnov	e0650218a6	feat: support etcd recovery from snapshot on bootstrap When a Talos `controlplane` node is waiting for a bootstrap, `etcd` contents can be recovered from a snapshot created with `talosctl etcd snapshot` on a healthy cluster. The bootstrap process goes the same way as before, but the etcd data directory is recovered from the snapshot. This flow enables disaster recovery for the control plane: given that periodic backups are available, destroy control plane nodes, re-create them with the same config, and bootstrap one node with the saved snapshot to recover etcd state at the time of the snapshot. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-08 10:15:37 -07:00
Alexey Palazhchenko	df52c13581	chore: fix //nolint directives That's the recommended syntax: https://golangci-lint.run/usage/false-positives/ Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-03-05 05:58:33 -08:00
Andrey Smirnov	7f3dca8e4c	test: add support for IPv6 in talosctl cluster create Modify provision library to support multiple IPs, CIDRs, gateways, which can be IPv4/IPv6. Based on IP types, enable services in the cluster to run DHCPv4/DHCPv6 in the test environment. There's an outstanding bug with routes not being properly set up in the cluster, so IPs are not properly routable, but DHCPv6 works and IPs are allocated (validates DHCPv6 client). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-09 13:28:53 -08:00
Andrey Smirnov	87ccf0eb21	test: clear connection refused errors after reset After node reboot (and gRPC API unavailability), gRPC stack might cache connection refused errors for up to backoff timeout. Explicitly clear such errors in reset tests before trying to read data from the node to verify reset success. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-01 08:11:27 -08:00
Andrey Smirnov	3dae6df27b	test: stabilize upgrade test by running health check several times For single node clusters, control plane is unstable after reboot, run health check several times to let it settle down to avoid failures in subsequent checks. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-11 08:31:01 -08:00
Andrey Smirnov	54ed80e244	feat: reset with system disk wipe spec Idea is to add an option to perform "selective" reset: the default reset operation is to wipe all partitions (triggering reinstall), while the spec allows wiping only some of the partitions. Other operations are performed exactly in the same way for any reset flow. Possible use case: reset only the `EPHEMERAL` partition. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-10 11:31:07 -08:00
Andrey Smirnov	350280eb59	feat: implement "staged" (failsafe/backup) upgrades The regular upgrade path takes just one reboot, but it requires all the processes to be stopped on the node before the upgrade can proceed. Under some circumstances and with potential Talos bugs it might not work, rendering Talos upgrades almost impossible. Staged upgrades build upon the regular install flow to run the upgrade on node reboot. Such upgrades require two reboots of the node and two pulls of the installer image, but they should be much less susceptible to failure. Once the upgrade is staged, the node can be rebooted in any possible way, including a hard reset, and the upgrade is performed on the next boot. A new ADV format was implemented as well to allow storing the install image ref/options across reboots. The new format allows for bigger values and takes 50% of the `META` partition. The old ADV is still kept for compatibility reasons. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-08 08:34:26 -08:00
Andrey Smirnov	8560fb9662	chore: enable nlreturn linter Most of the fixes were automatically applied. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-09 06:48:07 -08:00
Andrey Smirnov	773912833e	test: clean up integration test code, fix flakes This enables golangci-lint via build tags for integration tests (this should have been done long ago!), and fixes the linting errors. Two tests were updated to reduce flakiness: * apply config: wait for nodes to issue "boot done" sequence event before proceeding * recover: kill pods even if they appear after the initial set gets killed (potential race condition with previous test). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-10-19 15:44:14 -07:00
Andrey Smirnov	bddd4f1bf6	refactor: move external API packages into `machinery/` This moves `pkg/config`, `pkg/client` and `pkg/constants` under `pkg/machinery` umbrella. And `pkg/machinery` is published as Go module inside Talos repository. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-17 09:56:14 -07:00
Andrey Smirnov	47608fb874	refactor: make `pkg/config` not rely on `machined/../internal/runtime` This makes `pkg/config` directly importable from other projects. There should be no functional changes. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-29 12:40:12 -07:00
Andrey Smirnov	3d8418a689	feat: force nodes to be set in `talosctl` commands using the API With load-balancing enabled by default, running `talosctl` without `--nodes` is risky, as it might hit any control plane node. Only two commands do not enforce this check, as they do their own node contexts: `crashdump` and `health` (client-side). Integration tests were updated to always supply the `--nodes` CLI argument; while doing that I refactored the storage for discovered nodes to use the existing `cluster.Info` interface. The downside is that with e2e CAPI tests, CLI tests will be mostly skipped, as we don't support discovery in CLI tests at the moment. This can be fixed by using `talosctl kubeconfig` + `kubectl get nodes` for node discovery. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-21 12:17:43 -07:00
Andrey Smirnov	1a0e1bc393	chore: update module dependencies Fixes #2316 Simply update dependencies we don't track on version level to be compatible with Talos components (like etcd or k8s). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 12:00:50 -07:00
Andrey Smirnov	931237b23c	test: update init node check in reset API tests Previously we assumed that node 0 is the init node, and it can't be reset. With new bootstrap API approach, there's no init node, and all the nodes can be reset. This corrects the check to skip only the init node, and with bootstrap API there's no init node (so no nodes are skipped). Fixes #2277 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 10:48:14 -07:00
Andrey Smirnov	6fb55229a2	test: fix and improve reboot/reset tests These tests rely on node uptime checks, which are quite flaky. The following fixes were applied: * code was refactored as a common method shared between reset/reboot tests (reboot all nodes does checks in a different way, so it wasn't updated) * each request to read uptime times out in 5 seconds, so that checks don't wait forever when a node is down (or the connection is aborted) * to account for node availability vs. lower uptime at the beginning of the test, add extra elapsed time to the check condition Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-29 13:56:48 -07:00
Andrey Smirnov	23be80fd96	test: stabilize tests by bumping timeouts Bump timeouts for reset API test as K8s control plane teardown might take 3 minutes on its own. Bump Go Firecracker SDK timeout when talking to firecracker process. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-05-06 08:26:18 -07:00
Andrew Rynhard	56d7bf19fe	feat: add recovery API This adds an API for recovering the self-hosted control plane. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-04 19:38:30 -07:00
Andrey Smirnov	682dd433ba	refactor: move Talos client package to `pkg/` As this implements the Go client for the Talos API, it makes sense to publish it on the top level. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-04-01 23:45:58 +03:00
Andrey Smirnov	b94be4f6a1	test: mark long tests as !short This skips long-running integration tests if `-test.short` mode is enabled. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-03-27 22:34:26 +03:00
Andrew Rynhard	5dbc26c7a3	feat: rename osctl to talosctl This is a rename of the osctl binary. We decided that talosctl is a better name for the Talos CLI. This does not break any APIs, but does make older documentation only accurate for previous versions of Talos. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-03-20 19:07:39 -07:00
Andrey Smirnov	d5f80858dd	test: add 'reset' integration test for Reset() API Every node is reset, rebooted and comes back up again, except for the init node, due to known issues with the init node bootstrapping the etcd cluster from scratch when metadata is missing (as the node was wiped). The planned workaround is to prohibit resetting the init node (should be coming next). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-03-06 23:05:46 +03:00

40 Commits
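As a reading aid for the sequence-priority table in commit ae1bec59e9, the sketch below encodes it as a Go lookup. All identifiers (`Sequence`, `allowed`, `CanStart`) are hypothetical and do not come from the Talos codebase. It assumes "Y" means the requested sequence may start while the column's sequence is running, models `WithTakeover` as a `takeover` flag that ignores priority (as the commit states), and, since boot is not a row in the table, treats a requested boot as not allowed.

```go
package main

import "fmt"

// Sequence names a Talos sequence. These identifiers are hypothetical and
// only mirror the row/column labels of the priority table in ae1bec59e9.
type Sequence int

const (
	Boot Sequence = iota
	Reboot
	Reset
	Upgrade
)

// allowed[requested][running] is true where the table has "Y".
// Boot has no row, so looking it up yields a nil map and therefore false.
var allowed = map[Sequence]map[Sequence]bool{
	Reboot:  {Boot: true, Reboot: true, Reset: true, Upgrade: false},
	Reset:   {Boot: true, Reboot: false, Reset: false, Upgrade: false},
	Upgrade: {Boot: true, Reboot: false, Reset: false, Upgrade: false},
}

// CanStart reports whether the requested sequence may start while running is
// active. takeover stands in for WithTakeover, which ignores priority.
func CanStart(running, requested Sequence, takeover bool) bool {
	if takeover {
		return true
	}

	return allowed[requested][running]
}

func main() {
	fmt.Println(CanStart(Boot, Upgrade, false))   // true: upgrade may start during boot
	fmt.Println(CanStart(Upgrade, Reboot, false)) // false: reboot may not interrupt upgrade
	fmt.Println(CanStart(Reset, Reset, true))     // true: takeover ignores priority
}
```

A nested map keeps each row of the table visible in the source; a fixed-size two-dimensional array indexed by `Sequence` would work equally well.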