talos

mirror of https://github.com/siderolabs/talos.git synced 2025-11-22 03:01:14 +01:00

Author	SHA1	Message	Date
Andrey Smirnov	47fb5720cf	test: skip etcd tests on non-HA clusters We can't test much of the flow on single-node clusters. Fixes #3013 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-08 07:39:36 -08:00
Andrey Smirnov	a8dd2ff30d	fix: checkpoint controller-manager and scheduler Default manifests created by bootkube so far were only enabling pod-checkpointer for kube-apiserver. This seems to have issues with single-node control plane scenario, when without scheduler and controller-manager node might fall into `NodeAffinity` state. See https://github.com/talos-systems/bootkube-plugin/pull/23 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-28 11:53:17 -08:00
Andrey Smirnov	3dae6df27b	test: stabilize upgrade test by running health check several times For single node clusters, control plane is unstable after reboot, run health check several times to let it settle down to avoid failures in subsequent checks. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-11 08:31:01 -08:00
Andrey Smirnov	54ed80e244	feat: reset with system disk wipe spec Idea is to add an option to perform "selective" reset: default reset operation is to wipe all partitions (triggering reinstall), while spec allows only to wipe some of the operations. Other operations are performed exactly in the same way for any reset flow. Possible use case: reset only `EPHEMERAL` partition. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-10 11:31:07 -08:00
Andrey Smirnov	350280eb59	feat: implement "staged" (failsafe/backup) upgrades Regular upgrade path takes just one reboot, but it requires all the processes to be stopped on the node before upgrade might proceed. Under some circumstances and with potential Talos bugs it might not work rendering Talos upgrades almost impossible. Staged upgrades build upon regular install flow to run the upgrade on the node reboot. Such upgrades require two reboots of the node, and it requires two pulls of the installer image, but they should be much less suspicious to the failure. Once the upgrade is staged, node can be rebooted in any possible way, including hard reset and upgrade is performed on the next boot. New ADV format was implemented as well to allow to store install image ref/options across reboots. New format allows for bigger values and takes 50% of the `META` partition. Old ADV is still kept for compatibility reasons. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-08 08:34:26 -08:00
Artem Chernyshev	8aad711f18	feat: implement network interfaces list API To be used in the interactive installer to configure networking. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-27 10:48:45 -08:00
Artem Chernyshev	f96cffd2b2	feat: add ability to choose CNI config Initial version which only allows setting CNI using preset, no custom CNI urls are supported at the moment. Still need to figure out what kind of UI can be used for that. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-26 06:49:54 -08:00
Andrey Smirnov	9a32e34cb1	feat: implement apply configuration without reboot This allows config to be written to disk without being applied immediately. Small refactoring to extract common code paths. At first, I tried to implement this via the sequencer, but looks like it's too hard to get it right, as sequencer lacks context and config to be written is not applied to the runtime. Fixes #2828 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-23 12:42:44 -08:00
Artem Chernyshev	2588e2960b	feat: make GenerateConfiguration API reuse current node auth Fixes: https://github.com/talos-systems/talos/issues/2819 Only if requested config type is not `TypeInit`. This functionality will help implementing TUI installer cluster extension workflow. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-23 12:12:15 -08:00
Artem Chernyshev	8513123d22	feat: return client config as the second value in GenerateConfiguration To be used in interactive installer to output the node client configuration to a file. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-17 07:20:05 -08:00
Artem Chernyshev	0f924b5122	feat: add generate config gRPC API Fixes: https://github.com/talos-systems/talos/issues/2766 This API is implemented in Maintenance and Machine services. Can be used to generate configuration on the node, instead of using talosctl to generate it locally. To be used in interactive installer and talosctl gen config. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-13 08:07:32 -08:00
Andrey Smirnov	8560fb9662	chore: enable nlreturn linter Most of the fixes were automatically applied. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-09 06:48:07 -08:00
Andrey Smirnov	773912833e	test: clean up integration test code, fix flakes This enables golangci-lint via build tags for integration tests (this should have been done long ago!), and fixes the linting errors. Two tests were updated to reduce flakiness: * apply config: wait for nodes to issue "boot done" sequence event before proceeding * recover: kill pods even if they appear after the initial set gets killed (potential race condition with previous test). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-10-19 15:44:14 -07:00
Artem Chernyshev	e7e99cf1b3	feat: support disk usage command in talosctl Usage example: ```bash talosctl du --nodes 10.5.0.2 /var -H -d 2 NODE NAME 10.5.0.2 8.4 kB etc 10.5.0.2 1.3 GB lib 10.5.0.2 16 MB log 10.5.0.2 25 kB run 10.5.0.2 4.1 kB tmp 10.5.0.2 1.3 GB . ``` Supported flags: - `-a` writes counts for all files, not just directories. - `-d` recursion depth - '-H' humanize size outputs. - '-t' size threshold (skip files if < size or > size). Fixes: https://github.com/talos-systems/talos/issues/2504 Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-10-13 09:30:31 -07:00
Andrew Rynhard	4eeef28e90	feat: add etcd API This adds RPCs for basic etcd management tasks. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-10-06 11:30:04 -07:00
Seán C McCord	ff92d2a14b	feat: add ApplyConfiguration API Adds the ability to apply (replace) an existing node configuration with a new one via the Machine API. Fixes #2345 Signed-off-by: Seán C McCord <ulexus@gmail.com>	2020-09-29 14:44:06 -07:00
Andrey Smirnov	f6ecf000c9	refactor: extract packages loadbalancer and retry This removes in-tree packages in favor of: * github.com/talos-systems/go-retry * github.com/talos-systems/go-loadbalancer Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-09-02 13:46:22 -07:00
Marco De Luca	1fbb171fd0	test: determine reboots using boot id Changed the RebootSuite to use /proc/sys/kernel/random/boot_id rather than /proc/uptime Signed-off-by: Marco De Luca <marcodl404@gmail.com>	2020-08-26 06:09:02 -07:00
Andrey Smirnov	6a7cc02648	fix: handle bootkube recover correctly, support recovery from etcd Bootkube recover process (and `talosctl recover`) was actually regenerating assets each time `recover` runs forcing control plane to be at the state when cluster got created. This PR fixes that by running recover process correctly. Recovery via etcd was fixed to handle encrypted etcd data: it follows the way `apiserver` handles encryption at rest, and as at the moment AES CBC is the only supported encryption method, code simply follows the same path. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-18 14:24:14 -07:00
Andrey Smirnov	bddd4f1bf6	refactor: move external API packages into `machinery/` This moves `pkg/config`, `pkg/client` and `pkg/constants` under `pkg/machinery` umbrella. And `pkg/machinery` is published as Go module inside Talos repository. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-17 09:56:14 -07:00
Andrey Smirnov	47608fb874	refactor: make `pkg/config` not rely on `machined/../internal/runtime` This makes `pkg/config` directly importable from other projects. There should be no functional changes. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-29 12:40:12 -07:00
Andrey Smirnov	3d8418a689	feat: force nodes to be set in `talosctl` commands using the API With load-balancing enabled by default running `talosctl` without `--nodes` is risky, as it might hit any control plane by default without `--nodes`. Only two commands do not enforce this check, as they do their own node contexts: `crashdump` and `health` (client-side). Integration tests were updated to always supply `--nodes` cli argument, while doing that I refactored the storage for discovered nodes to use existing `cluster.Info` interface. The downside is that with e2e CAPI tests CLI tests will be mostly skipped as we don't support discovery in CLI tests at the momemnt. This can be fixed by using `talosctl kubeconfig` + `kubectl get nodes` for node discovery. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-21 12:17:43 -07:00
Andrey Smirnov	1a0e1bc393	chore: update module dependencies Fixes #2316 Simply update dependencies we don't track on version level to be compatible with Talos components (like etcd or k8s). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 12:00:50 -07:00
Andrey Smirnov	cbb7ca8390	refactor: merge osd into machined This merges `osd` API into `machined`. API was copied from `osd` into `machined`, and `osd` API was deprecated. For backwards compatibility, `machined` still implements `osd` API, so older Talos API clients can still talk to the node without changes. Docs were updated. No functional changes. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-13 12:50:00 -07:00
Andrey Smirnov	931237b23c	test: update init node check in reset API tests Previously we assumed that node 0 is the init node, and it can't be reset. With new bootstrap API approach, there's no init node, and all the nodes can be reset. This corrects the check to skip only the init node, and with bootstrap API there's no init node (so no nodes are skipped). Fixes #2277 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 10:48:14 -07:00
Andrey Smirnov	5ecddf2866	feat: add round-robin LB policy to Talos client by default Handling of multiple endpoints has already been implemented in #2094. This PR enables round-robin policy so that grpc picks up new endpoint for each call (and not send each request to the first control plane node). Endpoint list is randomized to handle cases when only one request is going to be sent, so that it doesn't go always to the first node in the list. gprc handles dead/unresponsive nodes automatically for us. `talosctl cluster create` and provision tests switched to use client-side load balancer for Talos API. On the additional improvements we got: * `talosctl` now reports correct node IP when using commands without `-n`, not the loadbalancer IP (if using multiple endpoints of course) * loadbalancer can't provide reliable handling of errors when upstream server is unresponsive or there're no upstreams available, grpc returns much more helpful errors Fixes #1641 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 08:35:15 -07:00
Andrey Smirnov	4cc074cdba	feat: implement API access to event history 1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids are sortable and unique enough. Xids also encode event publishing time with a second precision. 2. Add three ways to look back into event history: based on number of events, on time and ID. Lookup via ID might be used to restart event polling in case of broken API connection from the same moment. 3. Reimplement core event buffer with positions which are always incremented instead of generation+index, this implementation is much more simple (idea from circular buffer). 4. By default, Events API works the same - it shows no history and starts streaming new events only. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 10:54:50 -07:00
Andrey Smirnov	a6b3bd2ff6	feat: implement service events This implements service events, adds test for events API based on service events as they're the easiest to generate on demand. Disabled validate test for 'metal' as it validates disk device against local system which doesn't make much sense. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-03 13:52:53 -07:00
Andrey Smirnov	81d1c2bfe7	chore: enable godot linter Issues were fixed automatically. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-30 10:39:56 -07:00
Andrey Smirnov	6fb55229a2	test: fix and improve reboot/reset tests These tests rely on node uptime checks. These checks are quite flaky. Following fixes were applied: * code was refactored as common method shared between reset/reboot tests (reboot all nodes does checks in a different way, so it wasn't updated) * each request to read uptime times out in 5 seconds, so that checks don't wait forever when node is down (or connection is aborted) * to account for node availability vs. lower uptime in the beginning of test, add extra elapsed time to the check condition Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-29 13:56:48 -07:00
Andrey Smirnov	0a4645fe80	feat: implement circular buffer for system logs This replaces logging to files with inotify following to pure in-memory circular buffer which grows on demand capped at specified maximum capacity. The concern with previous approach was that logs on tmpfs were growing without any bound potentially consuming all the node memory. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-26 15:33:54 -07:00
Andrey Smirnov	28a6eb207a	test: add node name to error messages in RebootAllNodes This makes troubleshooting easier. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-05-07 12:12:46 -07:00
Andrey Smirnov	23be80fd96	test: stabilize tests by bumping timeouts Bump timeouts for reset API test as K8s control plane teardown might take 3 minutes on its own. Bump Go Firecracker SDK timeout when talking to firecracker process. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-05-06 08:26:18 -07:00
Andrew Rynhard	56d7bf19fe	feat: add recovery API This adds an API for recovering the self-hosted control plane. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-04 19:38:30 -07:00
Andrew Rynhard	49307d554d	refactor: improve machined This is a rewrite of machined. It addresses some of the limitations and complexity in the implementation. This introduces the idea of a controller. A controller is responsible for managing the runtime, the sequencer, and a new state type introduced in this PR. A few highlights are: - no more event bus - functional approach to tasks (no more types defined for each task) - the task function definition now offers a lot more context, like access to raw API requests, the current sequence, a logger, the new state interface, and the runtime interface. - no more panics to handle reboots - additional initialize and reboot sequences - graceful gRPC server shutdown on critical errors - config is now stored at install time to avoid having to download it at install time and at boot time - upgrades now use the local config instead of downloading it - the upgrade API's preserve option takes precedence over the config's install force option Additionally, this pulls various packes in under machined to make the code easier to navigate. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-04-28 08:20:55 -07:00
Spencer Smith	31668f1c4c	chore: update timeout values for e2e tests This PR will update the values for timeout when testing e2e. We were hitting issues in GCP on the reboot test, as the nodes seemed to be taking a few minutes to become responsive again. I also moved the "cluster health" check in the node-by-node reboot test to use the default suite context, so it'll have a timeout of 30m instead of the 5 that it had initially. This seems to solve the node-by-node bailing as well. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-04-03 19:16:30 -04:00
Andrey Smirnov	682dd433ba	refactor: move Talos client package to `pkg/` As this implements Go client for Talos API, it makes sense to publish it one the top level. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-04-01 23:45:58 +03:00
Andrey Smirnov	b94be4f6a1	test: mark long tests as !short This skips long-running integration tests if `-test.short` mode is enabled. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-03-27 22:34:26 +03:00
Andrew Rynhard	5dbc26c7a3	feat: rename osctl to talosctl This is a rename of the osctl binary. We decided that talosctl is a better name for the Talos CLI. This does not break any APIs, but does make older documentation only accurate for previous versions of Talos. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-03-20 19:07:39 -07:00
Andrey Smirnov	d5f80858dd	test: add 'reset' integration test for Reset() API Every node is reset, rebooted and it comes back up again except for the init node due to known issues with init node boostrapping etcd cluster from scratch when metadata is missing (as node was wiped). Planned workaround is to prohibit resetting init node (should be coming next). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-03-06 23:05:46 +03:00
Andrey Smirnov	9bfb5f1501	test: fix `RebootAllNodes` test to reboot all nodes in one call As calls to the nodes are proxied through `apid` on init node, we can't reboot all nodes concurrently, as init node might be already down by the moment any other node is going to be rebooted. Rewrite the test to reboot all the nodes in a single multi-node request. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-17 14:34:00 -08:00
Andrey Smirnov	491e7e58e0	test: implement RebootAllNodes test This complements "rolling restart" RebootNodeByNode test by providing more of a disaster scenario, when all the nodes are restarted at once. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-17 13:58:57 -08:00
Andrey Smirnov	76c2038b13	chore: implement loadbalancer for firecracker provisioner This PR contains generic simple TCP loadbalancer code, and glue code for firecracker provisioner to use this loadbalancer. K8s control plane is passed through the load balancer, and Talos API is passed only to the init node (for now, as some APIs, including kubeconfig, don't work with non-init node). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-13 23:07:13 +03:00
Andrey Smirnov	a2dee289d1	test: skip reboot tests Seems that with a single endpoint k8s is not able to recover (?). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-04 08:37:32 -08:00
Andrey Smirnov	afa8a48174	chore: implement reboot test Reboot test does node-by-node reboots followed by cluster health checks (same as done by provisioner). Fixed bug with `Read()` returning `Reader` instead of `ReadCloser` (minor). Allowed `bootkube` to be `Skipped` (for rebooted node). Added support for doing checks via provided client instance. Implemented generic capabilities to skip tests based on cluster platform. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-03 11:02:43 -08:00
Andrey Smirnov	6e05dd70c4	feat: add support for tailing logs Fixes #1564 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-17 22:35:47 +03:00
Andrey Smirnov	1fbf40796f	feat: implement streaming mode of dmesg, parse messages Fixes #1563 This implements dmesg reading via `/dev/kmsg`, with message parsing and formatting. Kernel log facility and severity are parsed, timestamp is calculated relative to boot time (it's accurate unless time jumps a lot during node lifetime). New flags to follow dmesg was added, tail flag allows to stream only new message (ignoring old messages). We could try to implement tailing last N messages, just a bit more work, open to suggestions (for symmetry with regular logs). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-16 17:40:15 +03:00
Andrew Rynhard	ad863a7f92	refactor: rename protobuf services, RPCs, and messages This PR brings our protobuf files into conformance with the protobuf style guide, and community conventions. It is purely renames, along with generated docs. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-12-11 11:41:40 -08:00
Andrey Smirnov	399aeda0b9	feat: rename confusing target options, --endpoints, etc. Fixes #1610 1. In `talosconfig`, deprecate `Target` in favor of `Endpoints` (client-side LB to come next). 2. In `osctl`, use `--nodes` in place of `--target`. 3. In `osctl` add option `--endpoints` to override `Endpoints` for the call. Other changes are just updates to catch up with the changes. Most probably I missed something... And CAPI provider needs update. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-10 02:23:54 +03:00
Andrey Smirnov	16f1f6996e	test: add retries to the test which verifies cluster version It fails on AWS, need to figure out if it's transient failure or not. While I was there, found lots of small bugs when endpoint is unresponsive, or target nodes are unresponsive and fixed them. In retry formatting added `\t` so that embedded errors are better aligned in the output (same as multierror). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-06 11:24:58 -08:00

1 2

56 Commits