talos

mirror of https://github.com/siderolabs/talos.git synced 2025-11-02 09:21:13 +01:00

Author	SHA1	Message	Date
Andrey Smirnov	a4a2a3c83a	feat: uncordon nodes automatically on boot Talos will mark node as schedulable if it was previously cordoned by Talos (for upgrade, reset, etc.) If user marked node as not schedulable, Talos won't change it on boot. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 15:32:36 -07:00
Andrey Smirnov	97d18b1c43	test: fix cli tests after load-balancing got enabled There were three problems: * cli tests did commands in sequence assuming they all hit the same node, but with load-balancing it's no longer true * restart test was affected, as it hit different node for check after restart, and it succeeded immediately, while on original node process was still starting which resulted in failure in the next tests; replace the check to make sure service is up and healthy, so that test leaves cluster in a good state * restart API response had wrong format (no message returned) which resulted in failures with apid proxy (when used with `-n`) Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 14:06:30 -07:00
Andrey Smirnov	5ecddf2866	feat: add round-robin LB policy to Talos client by default Handling of multiple endpoints has already been implemented in #2094. This PR enables round-robin policy so that grpc picks up new endpoint for each call (and not send each request to the first control plane node). Endpoint list is randomized to handle cases when only one request is going to be sent, so that it doesn't go always to the first node in the list. grpc handles dead/unresponsive nodes automatically for us. `talosctl cluster create` and provision tests switched to use client-side load balancer for Talos API. On the additional improvements we got: * `talosctl` now reports correct node IP when using commands without `-n`, not the loadbalancer IP (if using multiple endpoints of course) * loadbalancer can't provide reliable handling of errors when upstream server is unresponsive or there're no upstreams available, grpc returns much more helpful errors Fixes #1641 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 08:35:15 -07:00
Andrey Smirnov	4cc074cdba	feat: implement API access to event history 1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids are sortable and unique enough. Xids also encode event publishing time with a second precision. 2. Add three ways to look back into event history: based on number of events, on time and ID. Lookup via ID might be used to restart event polling in case of broken API connection from the same moment. 3. Reimplement core event buffer with positions which are always incremented instead of generation+index, this implementation is much more simple (idea from circular buffer). 4. By default, Events API works the same - it shows no history and starts streaming new events only. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 10:54:50 -07:00
Andrey Smirnov	a6b3bd2ff6	feat: implement service events This implements service events, adds test for events API based on service events as they're the easiest to generate on demand. Disabled validate test for 'metal' as it validates disk device against local system which doesn't make much sense. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-03 13:52:53 -07:00
Andrey Smirnov	81d1c2bfe7	chore: enable godot linter Issues were fixed automatically. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-30 10:39:56 -07:00
Andrey Smirnov	6fb55229a2	test: fix and improve reboot/reset tests These tests rely on node uptime checks. These checks are quite flaky. Following fixes were applied: * code was refactored as common method shared between reset/reboot tests (reboot all nodes does checks in a different way, so it wasn't updated) * each request to read uptime times out in 5 seconds, so that checks don't wait forever when node is down (or connection is aborted) * to account for node availability vs. lower uptime in the beginning of test, add extra elapsed time to the check condition Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-29 13:56:48 -07:00
Andrey Smirnov	51112a1d86	fix: use kubernetes version in config generator Update all k8s image references to point to the version specified by the user. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-26 17:05:19 -07:00
Andrey Smirnov	0a4645fe80	feat: implement circular buffer for system logs This replaces logging to files with inotify following to pure in-memory circular buffer which grows on demand capped at specified maximum capacity. The concern with previous approach was that logs on tmpfs were growing without any bound potentially consuming all the node memory. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-26 15:33:54 -07:00
Andrew Rynhard	d0d2ac3c74	test: default to using the bootstrap API This moves our test scripts to using the bootstrap API. Some automation around invoking the bootstrap API was also added to give the same ease of use when creating clusters with the CLI. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-06-24 08:46:10 -07:00
Andrew Rynhard	77150f51cf	chore: update provision test versions This adds latest 0.6 alpha and 0.5 stable to the upgrade tests. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-29 14:58:54 -07:00
Andrey Smirnov	795a10b681	test: improve reboot/reset test resiliency against request timeouts After node reboot test code tries endlessly to read the uptime until it goes down after reboot, but during actual reboot API won't be responsive and it might happen that this call will time out only with parent context canceling, and by that time retry timeout is already exhausted, so no more attempts will be made (while node successfully booted after a reboot). ``` uptime didn't go down: before 219.730000, after 267.020000 uptime didn't go down: before 219.730000, after 268.030000 EOF rpc error: code = DeadlineExceeded desc = context deadline exceeded timeout ``` Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-05-22 12:31:06 -07:00
Andrey Smirnov	652531853f	test: update Talos versions for upgrade tests Our policy is to support the last two releases (0.4, 0.5 at the moment). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-05-20 07:43:10 -07:00
Seán C McCord	3e0e01e2c3	fix: refactor client creation API Create a new `client.New` to make external API systems easier to construct. A new type `client.OptionFunc` allows the client to be extended with specific configuration. This also makes a first pass at supporting multiple endpoints properly by creating a custom grpc resolver. (Proper load balancing support is still a TODO.) Fixes #2093 Signed-off-by: Seán C McCord <ulexus@gmail.com>	2020-05-11 10:21:07 -07:00
Andrey Smirnov	28a6eb207a	test: add node name to error messages in RebootAllNodes This makes troubleshooting easier. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-05-07 12:12:46 -07:00
Andrey Smirnov	23be80fd96	test: stabilize tests by bumping timeouts Bump timeouts for reset API test as K8s control plane teardown might take 3 minutes on its own. Bump Go Firecracker SDK timeout when talking to firecracker process. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-05-06 08:26:18 -07:00
Andrew Rynhard	56d7bf19fe	feat: add recovery API This adds an API for recovering the self-hosted control plane. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-04 19:38:30 -07:00
Andrew Rynhard	49307d554d	refactor: improve machined This is a rewrite of machined. It addresses some of the limitations and complexity in the implementation. This introduces the idea of a controller. A controller is responsible for managing the runtime, the sequencer, and a new state type introduced in this PR. A few highlights are: - no more event bus - functional approach to tasks (no more types defined for each task) - the task function definition now offers a lot more context, like access to raw API requests, the current sequence, a logger, the new state interface, and the runtime interface. - no more panics to handle reboots - additional initialize and reboot sequences - graceful gRPC server shutdown on critical errors - config is now stored at install time to avoid having to download it at install time and at boot time - upgrades now use the local config instead of downloading it - the upgrade API's preserve option takes precedence over the config's install force option Additionally, this pulls various packages in under machined to make the code easier to navigate. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-04-28 08:20:55 -07:00
Andrey Smirnov	55dcbbc8d0	feat: add commands talosctl health/crashdump This extracts health & crashdump features which were specific to provisioning code into separate package which can be used standalone. Everything else is just new glue. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-04-27 20:43:10 -07:00
Andrey Smirnov	ff2267eb99	test: update versions used for upgrade tests We should stick to the latest version in each release series. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-04-07 15:51:56 -07:00
Spencer Smith	31668f1c4c	chore: update timeout values for e2e tests This PR will update the values for timeout when testing e2e. We were hitting issues in GCP on the reboot test, as the nodes seemed to be taking a few minutes to become responsive again. I also moved the "cluster health" check in the node-by-node reboot test to use the default suite context, so it'll have a timeout of 30m instead of the 5 that it had initially. This seems to solve the node-by-node bailing as well. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-04-03 19:16:30 -04:00
Andrey Smirnov	682dd433ba	refactor: move Talos client package to `pkg/` As this implements Go client for Talos API, it makes sense to publish it at the top level. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-04-01 23:45:58 +03:00
Andrey Smirnov	b94be4f6a1	test: mark long tests as !short This skips long-running integration tests if `-test.short` mode is enabled. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-03-27 22:34:26 +03:00
Spencer Smith	3a4eaeeef0	feat: upgrade kubernetes to 1.18 This PR will pull in the latest release of k8s 1.18 so we can start validating it through our test suite. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-03-26 14:59:43 -04:00
Andrey Smirnov	e38cde9b48	chore: update upgrade tests for new version, split into two tracks This updates upgrade tests to run two flows with 3+1 clusters: 1. 0.3 -> current (testing upgrade with partition wiping) 2. 0.4-alpha.7 -> current (testing upgrade without partition wiping, boot-a/boot-b) And small upgrade with preserve enabled for single-node cluster. Provision tests are now split into two parallel tracks in Drone. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-03-24 15:30:00 -07:00
Andrew Rynhard	5dbc26c7a3	feat: rename osctl to talosctl This is a rename of the osctl binary. We decided that talosctl is a better name for the Talos CLI. This does not break any APIs, but does make older documentation only accurate for previous versions of Talos. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-03-20 19:07:39 -07:00
Andrew Rynhard	69fa63a7b2	refactor: perform upgrade upon reboot This PR introduces a new strategy for upgrades. Instead of attempting to zap the partition table, create a new one, and then format the partitions, this change will only update the `vmlinuz`, and `initramfs.xz` being used to boot. It introduces an A/B style upgrade process, which will allow for easy rollbacks. One deviation from our original intention with upgrades is that this change does not completely reset a node. It falls just short of that and does not reset the partition table. This forces us to keep the current partition scheme in mind as we make changes in the future, because an upgrade assumes a specific partition scheme. We can improve upgrades further in the future, but this will at least make them more dependable. Finally, one more feature in this PR is the ability to keep state. This enables single node clusters to upgrade since we keep the etcd data around. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-03-20 17:32:18 -07:00
Andrey Smirnov	0babc39653	feat: split `osctl` commands into Talos API and cluster management This keeps backwards compatibility with `osctl` CLI binary with the exception of `osctl config generate` which was renamed to `osctl gen config` to avoid confusion with other `osctl config` commands which operate on client config, not Talos server config. Command implementation and helpers were split into subpackages for cleaner code and more visible boundaries. The resulting binary still combines commands from both sections into a single binary. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-03-20 22:45:04 +03:00
Andrey Smirnov	d5f80858dd	test: add 'reset' integration test for Reset() API Every node is reset, rebooted and it comes back up again except for the init node due to known issues with init node bootstrapping etcd cluster from scratch when metadata is missing (as node was wiped). Planned workaround is to prohibit resetting init node (should be coming next). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-03-06 23:05:46 +03:00
Andrey Smirnov	2e3681054d	chore: improve handling of etcd responses in bootkube pre-func Try more attempts, wait for the response. Treat empty response as no error (as this is what to expect when key is not set yet). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-03-06 21:06:48 +03:00
Andrey Smirnov	bbe2c53d29	feat: generate kubeconfig on the fly on request This extracts admin kubeconfig generation out of bootkube, now based on Talos x509 library. On each API request for `kubeconfig`, config is generated on the fly and sent back on the wire. This fixes two issues: * any master node can now generate `kubeconfig` (worker nodes can do that too, but that should probably change in the future) * after upgrade-and-wipe the disk scenario, `osctl kubeconfig` still works Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-28 21:00:52 +03:00
Andrey Smirnov	d5d3035c8c	test: enable upgrade tests 0.4.x -> latest With the fix #1904, it's now possible to upgrade 0.4.x with `machine.File` extra files (caused by registry mirror for registry.ci.svc). Bump resources for upgrade tests in attempt to speed it up. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-26 00:09:32 +03:00
Andrey Smirnov	923ef4537b	test: implement new class of tests: provision tests (upgrades) This class of tests is included/excluded by build tags, but as it is pretty different from other integration tests, we build it as separate executable. Provision tests provision cluster for the test run, perform some actions and verify results (could be upgrade, reset, scale up/down, etc.) There's now framework to implement upgrade tests, first of the tests tests upgrade from latest 0.3 (0.3.2 at the moment) to current version of Talos (being built in CI). Tests starts by booting with 0.3 kernel/initramfs, runs 0.3 installer to install 0.3.2 cluster, wait for bootstrap, followed by upgrade to 0.4 in rolling fashion. As Firecracker supports bootloader, this boots 0.4 system from boot disk (as installed by installer). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-21 07:04:03 -08:00
Andrey Smirnov	9bfb5f1501	test: fix `RebootAllNodes` test to reboot all nodes in one call As calls to the nodes are proxied through `apid` on init node, we can't reboot all nodes concurrently, as init node might be already down by the moment any other node is going to be rebooted. Rewrite the test to reboot all the nodes in a single multi-node request. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-17 14:34:00 -08:00
Andrey Smirnov	491e7e58e0	test: implement RebootAllNodes test This complements "rolling restart" RebootNodeByNode test by providing more of a disaster scenario, when all the nodes are restarted at once. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-17 13:58:57 -08:00
Andrey Smirnov	76c2038b13	chore: implement loadbalancer for firecracker provisioner This PR contains generic simple TCP loadbalancer code, and glue code for firecracker provisioner to use this loadbalancer. K8s control plane is passed through the load balancer, and Talos API is passed only to the init node (for now, as some APIs, including kubeconfig, don't work with non-init node). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-13 23:07:13 +03:00
Andrey Smirnov	a2dee289d1	test: skip reboot tests Seems that with a single endpoint k8s is not able to recover (?). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-04 08:37:32 -08:00
Andrey Smirnov	afa8a48174	chore: implement reboot test Reboot test does node-by-node reboots followed by cluster health checks (same as done by provisioner). Fixed bug with `Read()` returning `Reader` instead of `ReadCloser` (minor). Allowed `bootkube` to be `Skipped` (for rebooted node). Added support for doing checks via provided client instance. Implemented generic capabilities to skip tests based on cluster platform. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-03 11:02:43 -08:00
Andrey Smirnov	0afd0f651b	chore: provide provisioned cluster info to integration test Integration test can optionally consume cluster state as generated by the call to `osctl cluster create` and use it to discover nodes in integration tests. This means that now CLI tests can use that as discovery source, and API/K8s tests by default as well. Flat list of nodes is to be replaced by something more complex in the next iteration, but it's good for this PR. As a demo, add CLI test with multiple nodes (dmesg). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-01-31 18:21:30 +03:00
Andrey Smirnov	9da687d2a3	test: firecracker provisioner fixes, implement cluster destroy This implements `osctl cluster destroy` for Firecracker, adds new utility command `osctl cluster show`. Firecracker mode now has control process for firecracker VMs, allowing clean reboots and background operations. Lots of small fixes to Firecracker mode, clean CNI shutdown, cleaning up netns, etc. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-01-21 17:11:06 -08:00
Andrew Rynhard	f3623d22b0	refactor: use tls.Config as client credentials The `client.Creds` struct was not used very often, and made using the `client.NewClient` function impossible to use in combination with the `RemoteRenewingFileCertificateProvider`. This modifies `client.NewClient` to accept a `tls.Config` instead of `client.Creds`, allowing for the use of `RemoteRenewingFileCertificateProvider` with `client.NewClient`. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-01-21 17:10:07 -08:00
Andrey Smirnov	ebd40bd0eb	chore: use osctl cluster --wait in basic-integration There are few workarounds for Drone way of running integration test: DinD runs as a separate pod, and we can only access its exposed on the "host" ports, while from Talos cluster this endpoint is not reachable. So internally Talos nodes still use addresses like "10.5.0.2", while test is using "docker" to access it (that's name of the `docker` service in the pipeline). When running locally, 127.0.0.1 is used as endpoint, which should work fine both on OS X and Linux. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-30 15:15:42 -08:00
Andrey Smirnov	3a021e4579	test: add integration tests for (most) CLI commands I added tests for all the commands which work reliably in container mode. Some tests are naive, some are more sophisticated. While going through the tests, I think I found a small bug in `osctl gen keypair`. When we get reliable KVM tests, I can revisit and add missing tests for time, reboot, shutdown and friends. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-20 23:33:35 +03:00
Andrey Smirnov	f3dff87957	fix: fail on multiple nodes for commands which don't support it Fixes #1663 (I believe it's 0.3 backport strong candidate). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-18 18:51:40 +03:00
Andrey Smirnov	6e05dd70c4	feat: add support for tailing logs Fixes #1564 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-17 22:35:47 +03:00
Andrey Smirnov	1fbf40796f	feat: implement streaming mode of dmesg, parse messages Fixes #1563 This implements dmesg reading via `/dev/kmsg`, with message parsing and formatting. Kernel log facility and severity are parsed, timestamp is calculated relative to boot time (it's accurate unless time jumps a lot during node lifetime). A new flag to follow dmesg was added; the tail flag allows streaming only new messages (ignoring old messages). We could try to implement tailing last N messages, just a bit more work, open to suggestions (for symmetry with regular logs). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-16 17:40:15 +03:00
Andrew Rynhard	ad863a7f92	refactor: rename protobuf services, RPCs, and messages This PR brings our protobuf files into conformance with the protobuf style guide, and community conventions. It is purely renames, along with generated docs. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-12-11 11:41:40 -08:00
Andrey Smirnov	399aeda0b9	feat: rename confusing target options, --endpoints, etc. Fixes #1610 1. In `talosconfig`, deprecate `Target` in favor of `Endpoints` (client-side LB to come next). 2. In `osctl`, use `--nodes` in place of `--target`. 3. In `osctl` add option `--endpoints` to override `Endpoints` for the call. Other changes are just updates to catch up with the changes. Most probably I missed something... And CAPI provider needs update. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-10 02:23:54 +03:00
Andrey Smirnov	16f1f6996e	test: add retries to the test which verifies cluster version It fails on AWS, need to figure out if it's transient failure or not. While I was there, found lots of small bugs when endpoint is unresponsive, or target nodes are unresponsive and fixed them. In retry formatting added `\t` so that embedded errors are better aligned in the output (same as multierror). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-06 11:24:58 -08:00
Andrey Smirnov	edb40437ec	feat: add support for `osctl logs -f` Now default is not to follow the logs (which is similar to `kubectl logs`). Integration test was added for `Logs()` API and `osctl logs` command. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-05 13:58:52 -08:00

160 Commits