talos

mirror of https://github.com/siderolabs/talos.git synced 2025-11-21 18:51:46 +01:00

Author	SHA1	Message	Date
Andrey Smirnov	0da370dfef	test: unlock CABPT/CACPPT provider versions We should always test latest versions of our providers. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-02-10 00:14:15 +03:00
Artem Chernyshev	2f2bdb26aa	feat: replace flags with --mode in `apply`, `edit` and `patch` commands Fixes: https://github.com/talos-systems/talos/issues/4588 Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>	2022-01-13 16:09:53 +03:00
Andrey Smirnov	2f4b9d8d6d	feat: make machine configuration read-only in Talos (almost) Talos shouldn't try to re-encode the machine config it was provided with. So add a `ReadonlyWrapper` around `v1alpha1.Config` which makes sure that raw config object is not available anymore (it's a private field), but config accessors are available for read-only access. Another thing that `ReadonlyWrapper` does is that it preserves the original `[]byte` encoding of the config keeping it exactly same way as it was loaded from file or read over the network. Improved `talosctl edit mc` to preserve the config as it was submitted, and preserve the edits on error from Talos (previously edits were lost). `ReadonlyWrapper` is not used on config generation path though - config there is represented by `v1alpha1.Config` and can be freely modified. Why almost? Some parts of Talos (platform code) patch the machine configuration with new data. We need to fix platforms to provide networking configuration in a different way, but this will come with other PRs later. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-12-28 20:12:55 +03:00
Andrey Smirnov	d2a7e082c2	test: retry in discovery tests Sometimes pushing/pulling to Kubernetes registry is delayed due to backoff on failed attempts to talk to the API server when the cluster is still bootstrapping. Workaround that by adding retries. Also disable kernel module controller in container mode, as it will keep always failing. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-12-28 16:55:41 +03:00
Noel Georgi	4c96e936ed	docs: add cilium guide - Add Cilium CNI install guide - Use Canal CNI for default examples Fixes #4477 Signed-off-by: Noel Georgi <git@frezbo.dev>	2021-12-16 20:37:03 +05:30
Rohit Dandamudi	7f9922296a	feat: add powercycle mode in reboot - Fixes #4569 - Updated reboot process sequence - Updated api.descriptors to avoid proto type change linting error https://github.com/talos-systems/talos/pull/4612#discussion_r758599242 Signed-off-by: Rohit Dandamudi <rohit.dandamudi@siderolabs.com> Signed-off-by: Rohit Dandamudi <rohit.dandamudi@siderolabs.com>	2021-12-02 22:40:04 +05:30
Andrey Smirnov	753a82188f	refactor: move pkg/resources to machinery Fixes #4420 No functional changes, just moving packages around. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-11-15 19:50:35 +03:00
Alexey Palazhchenko	7462733bcb	chore: update golangci-lint Fix context propagation. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>	2021-11-15 14:55:25 +00:00
Andrey Smirnov	b6b78e7fef	test: add cluster discovery integration tests This verifies that members match cluster state and that both cluster registries work in sync producing same discovery data. Fixes #4191 Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-10-25 21:03:29 +03:00
Andrey Smirnov	3e100aa977	test: workaround EventsWatch test flakiness This test sometimes fails with a message like: ``` === RUN TestIntegration/api.EventsSuite/TestEventsWatch assertion_compare.go:323: Error Trace: events.go:88 Error: "0" is not greater than or equal to "14" Test: TestIntegration/api.EventsSuite/TestEventsWatch Messages: [] ``` I believe the root cause is that the initial (first event) delivery might be more than 100ms, so instead of waiting for 100ms for each event, block for 500ms for all events to arrive. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-10-15 12:51:56 +03:00
Andrey Smirnov	b450b7cef0	chore: deprecate Interfaces and Routes APIs Fixes #4094 Deprecate old networkd APIs, `talosctl interfaces` and `talosctl routes` now suggest different commands to be used to achieve same task. TUI installer was updated to stop using Interfaces API. Those APIs will be completely removed in 0.14. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-09-27 15:21:02 +03:00
Andrey Smirnov	a059454045	chore: build using Go 1.17 `initramfs` size for amd64 shrinks by 1.3 MiB. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-09-13 22:33:47 +03:00
Alexey Palazhchenko	eea750de2c	chore: rename "join" type to "worker" Closes #3413. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-07-09 07:10:45 -07:00
Andrey Smirnov	b969e7720e	chore: update references to old protobuf package This simply uses new protobuf package instead of old one. Old protobuf package is still in use by Talos dependencies. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-07-08 05:34:12 -07:00
Andrey Smirnov	10c28758a4	fix: ignore DeadlineExceeded error correctly on bootstrap The problem was that gRPC method `status.Code(err)` doesn't unwrap errors, while Talos client returns errors wrapped with `multierror.Error` and `fmt.Errorf`, so `status.Code` doesn't return error code correctly. Fix that by introducing our own client method which correctly goes over the chain of wrapped errors. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-07-07 12:02:26 -07:00
Andrey Smirnov	62c702c4fd	fix: remove conflicting etcd member on rejoin with empty data directory This fixes a scenario when control plane node loses contents of `/var` without leaving etcd first: on reboot etcd data directory is empty, but member is already present in the etcd member list, so etcd won't be able to join because of raft log being empty. The fix is to remove a member with matching hostname if found in the etcd member list followed by new member add. The risk here is removing another member which has same hostname as the joining node, but having duplicate hostnames for control plane node is a problem anyways. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-03 15:11:44 -07:00
Andrey Smirnov	0acb04ad7a	feat: implement route network controllers Route handling is very similar to addresses: * `RouteStatus` describes kernel routing table state, `RouteStatusController` reflects kernel state into resources * `RouteSpec` defines routes to be configured * `RouteConfigController` creates `RouteSpec`s based on cmdline and machine configuration * `RouteMergeController` merges different configuration layers into the final representation * `RouteSpecController` applies the specs to the kernel routing table Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-05-25 11:09:21 -07:00
Alexey Palazhchenko	1fcf38f9d6	feat: add support for "none" CNI type Closes #3411. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-04-09 12:53:00 -07:00
Andrey Smirnov	e0650218a6	feat: support etcd recovery from snapshot on bootstrap When Talos `controlplane` node is waiting for a bootstrap, `etcd` contents can be recovered from a snapshot created with `talosctl etcd snapshot` on a healthy cluster. Bootstrap process goes same way as before, but the etcd data directory is recovered from the snapshot. This flow enables disaster recovery for the control plane: given that periodic backups are available, destroy control plane nodes, re-create them with the same config, and bootstrap one node with the saved snapshot to recover etcd state at the time of the snapshot. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-08 10:15:37 -07:00
Andrey Smirnov	7d91258475	test: fix data race in apply config tests Variable `chanErr` was read before waiting for the goroutine to finish. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-31 10:46:50 -07:00
Andrey Smirnov	204caf8eb9	test: fix apply-config integration test, bump clusterctl version Tests for ApplyConfig API were relying on not really supported behavior of modifying config via the `Provider` interface (and it was "fixed" in another PR which cleans up such access to the configuration). Cluster version bumped to try to workaround strange CAPI bootstrap failures in e2e-capi. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-31 09:55:53 -07:00
Andrey Smirnov	2ea20f598a	feat: replace timed with time sync controller This is a complete rewrite of time sync process. Now the time sync process starts early at boot time, and it adapts to configuration changes: * before config is available, `pool.ntp.org` is used * once config is available, configured time servers are used Controller updates same time sync resource as other controllers had dependency on, so they have a chance to wait for the time sync event. Talos services which depend on time now wait on same resource instead of waiting on timed health. New features: * time sync now sticks to the particular time server unless there's an error from that server, and server is changed in that case, this improves time sync accuracy * time sync acts on config changes immediately, so it's possible to reconfigure time sync at any time * there's a new 'epoch' field in time sync resources which allows time-dependent controllers to regenerate certs when there's a big enough jump in time Features to implement later: * apid shouldn't depend on timed, it should be started early and it should regenerate certs on time jump * trustd should be updated in same way Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-29 09:29:43 -07:00
Alexey Palazhchenko	d7e9f6d6a8	chore: build integration tests with -race Refs https://github.com/talos-systems/talos/issues/3378. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-03-26 10:08:12 -07:00
Artem Chernyshev	6ffabe5169	feat: add ability to find disk by disk properties Fixes: https://github.com/talos-systems/talos/issues/3323 Not exactly matching with udevd generated `by-<id>` symlinks, but should provide sufficient amount of property selectors to be able to pick specific disks for any kind of disk: sd card, hdd, ssd, nvme. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-23 14:23:02 -07:00
Artem Chernyshev	22f375300c	chore: update golangci-lint to 1.38.0 Fix all discovered issues. Detected a couple of bugs, fixed them as well. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-12 06:50:02 -08:00
Alexey Palazhchenko	df52c13581	chore: fix //nolint directives That's the recommended syntax: https://golangci-lint.run/usage/false-positives/ Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-03-05 05:58:33 -08:00
Andrey Smirnov	31e56e63db	fix: update in-cluster kubeconfig validity to match other certs Talos generates in-cluster kubeconfig for the kube-scheduler and kube-controller-manager to authenticate to kube-apiserver. Bug was that validity of that kubeconfig was set to 24h by mistake. Fix that by bumping validity to default for other Kubernetes certs (1 year). Add a certificate refresh at 50% of the validity. Fix bugs with copying secret resources which was leading to updates not being propagated correctly. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-03-01 11:16:04 -08:00
Artem Chernyshev	7108bb3f5b	test: upgrade master to master tests Verify upgrade flow using the same version of the installer. Run that with disk encryption enabled. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-02-24 07:56:44 -08:00
Artem Chernyshev	06b8c09484	test: enable disk encryption key rotation test Verify that disk encryption sync operations work properly. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-02-20 06:17:55 -08:00
Andrey Smirnov	32d2588528	test: update integration tests to use wrapped client for etcd APIs This continues the fix from #3167. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-18 08:08:48 -08:00
Artem Chernyshev	58ff2c9808	feat: implement ephemeral partition encryption This PR introduces the first part of disk encryption support. New config section `systemDiskEncryption` was added into MachineConfig. For now it contains only Ephemeral partition encryption. Encryption itself supports two kinds of keys for now: - node id deterministic key. - static key which is hardcoded in the config and mainly used for test purposes. Talosctl cluster create can now be told to encrypt ephemeral partition by using `--encrypt-ephemeral` flag. Additionally: - updated pkgs library version. - changed Dockerfile to copy cryptsetup deps from pkgs. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-02-17 13:39:04 -08:00
Andrey Smirnov	cc83b83808	feat: rename apply-config --no-reboot to --on-reboot This explains the intention better: config is applied on reboot, and allows to easily distinguish it from `apply-config --immediate` which applies config immediately without a reboot (that is coming in a different PR). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-17 12:49:47 -08:00
Andrey Smirnov	d99a016af2	fix: correct response structure for GenerateConfig API Also fix recovery grpc handler to print panic stacktrace to the log. Any API should follow the structure compatible with apid proxying injection of errors/nodes. Explicitly fail GenerateConfig API on worker nodes, as it panics on worker nodes (missing certificates in node config). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-11 06:34:10 -08:00
Andrey Smirnov	7f3dca8e4c	test: add support for IPv6 in talosctl cluster create Modify provision library to support multiple IPs, CIDRs, gateways, which can be IPv4/IPv6. Based on IP types, enable services in the cluster to run DHCPv4/DHCPv6 in the test environment. There's outstanding bug left with routes not being properly set up in the cluster so, IPs are not properly routable, but DHCPv6 works and IPs are allocated (validates DHCPv6 client). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-09 13:28:53 -08:00
Andrey Smirnov	87ccf0eb21	test: clear connection refused errors after reset After node reboot (and gRPC API unavailability), gRPC stack might cache connection refused errors for up to backoff timeout. Explicitly clear such errors in reset tests before trying to read data from the node to verify reset success. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-01 08:11:27 -08:00
Andrey Smirnov	0aaf8fa968	feat: replace bootkube with Talos-managed control plane Control plane components are running as static pods managed by the kubelets. Whole subsystem is managed via resources/controllers from os-runtime. Many supporting changes/refactoring to enable new code paths. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-26 14:22:35 -08:00
Andrey Smirnov	47fb5720cf	test: skip etcd tests on non-HA clusters We can't test much of the flow on single-node clusters. Fixes #3013 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-08 07:39:36 -08:00
Andrey Smirnov	a8dd2ff30d	fix: checkpoint controller-manager and scheduler Default manifests created by bootkube so far were only enabling pod-checkpointer for kube-apiserver. This seems to have issues with single-node control plane scenario, when without scheduler and controller-manager node might fall into `NodeAffinity` state. See https://github.com/talos-systems/bootkube-plugin/pull/23 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-28 11:53:17 -08:00
Andrey Smirnov	3dae6df27b	test: stabilize upgrade test by running health check several times For single node clusters, control plane is unstable after reboot, run health check several times to let it settle down to avoid failures in subsequent checks. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-11 08:31:01 -08:00
Andrey Smirnov	54ed80e244	feat: reset with system disk wipe spec Idea is to add an option to perform "selective" reset: default reset operation is to wipe all partitions (triggering reinstall), while spec allows only to wipe some of the partitions. Other operations are performed exactly in the same way for any reset flow. Possible use case: reset only `EPHEMERAL` partition. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-10 11:31:07 -08:00
Andrey Smirnov	350280eb59	feat: implement "staged" (failsafe/backup) upgrades Regular upgrade path takes just one reboot, but it requires all the processes to be stopped on the node before upgrade might proceed. Under some circumstances and with potential Talos bugs it might not work rendering Talos upgrades almost impossible. Staged upgrades build upon regular install flow to run the upgrade on the node reboot. Such upgrades require two reboots of the node, and it requires two pulls of the installer image, but they should be much less susceptible to failure. Once the upgrade is staged, node can be rebooted in any possible way, including hard reset and upgrade is performed on the next boot. New ADV format was implemented as well to allow to store install image ref/options across reboots. New format allows for bigger values and takes 50% of the `META` partition. Old ADV is still kept for compatibility reasons. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-08 08:34:26 -08:00
Artem Chernyshev	8aad711f18	feat: implement network interfaces list API To be used in the interactive installer to configure networking. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-27 10:48:45 -08:00
Artem Chernyshev	f96cffd2b2	feat: add ability to choose CNI config Initial version which only allows setting CNI using preset, no custom CNI urls are supported at the moment. Still need to figure out what kind of UI can be used for that. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-26 06:49:54 -08:00
Andrey Smirnov	9a32e34cb1	feat: implement apply configuration without reboot This allows config to be written to disk without being applied immediately. Small refactoring to extract common code paths. At first, I tried to implement this via the sequencer, but looks like it's too hard to get it right, as sequencer lacks context and config to be written is not applied to the runtime. Fixes #2828 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-23 12:42:44 -08:00
Artem Chernyshev	2588e2960b	feat: make GenerateConfiguration API reuse current node auth Fixes: https://github.com/talos-systems/talos/issues/2819 Only if requested config type is not `TypeInit`. This functionality will help implementing TUI installer cluster extension workflow. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-23 12:12:15 -08:00
Artem Chernyshev	8513123d22	feat: return client config as the second value in GenerateConfiguration To be used in interactive installer to output the node client configuration to a file. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-17 07:20:05 -08:00
Artem Chernyshev	0f924b5122	feat: add generate config gRPC API Fixes: https://github.com/talos-systems/talos/issues/2766 This API is implemented in Maintenance and Machine services. Can be used to generate configuration on the node, instead of using talosctl to generate it locally. To be used in interactive installer and talosctl gen config. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-13 08:07:32 -08:00
Andrey Smirnov	8560fb9662	chore: enable nlreturn linter Most of the fixes were automatically applied. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-09 06:48:07 -08:00
Andrey Smirnov	773912833e	test: clean up integration test code, fix flakes This enables golangci-lint via build tags for integration tests (this should have been done long ago!), and fixes the linting errors. Two tests were updated to reduce flakiness: * apply config: wait for nodes to issue "boot done" sequence event before proceeding * recover: kill pods even if they appear after the initial set gets killed (potential race condition with previous test). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-10-19 15:44:14 -07:00
Artem Chernyshev	e7e99cf1b3	feat: support disk usage command in talosctl Usage example: ```bash talosctl du --nodes 10.5.0.2 /var -H -d 2 NODE NAME 10.5.0.2 8.4 kB etc 10.5.0.2 1.3 GB lib 10.5.0.2 16 MB log 10.5.0.2 25 kB run 10.5.0.2 4.1 kB tmp 10.5.0.2 1.3 GB . ``` Supported flags: - `-a` writes counts for all files, not just directories. - `-d` recursion depth - '-H' humanize size outputs. - '-t' size threshold (skip files if < size or > size). Fixes: https://github.com/talos-systems/talos/issues/2504 Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-10-13 09:30:31 -07:00

92 Commits