talos

mirror of https://github.com/siderolabs/talos.git synced 2025-10-06 21:21:53 +02:00

Author	SHA1	Message	Date
Andrey Smirnov	28a6eb207a	test: add node name to error messages in RebootAllNodes This makes troubleshooting easier. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-05-07 12:12:46 -07:00
Andrey Smirnov	23be80fd96	test: stabilize tests by bumping timeouts Bump timeouts for reset API test as K8s control plane teardown might take 3 minutes on its own. Bump Go Firecracker SDK timeout when talking to firecracker process. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-05-06 08:26:18 -07:00
Andrew Rynhard	56d7bf19fe	feat: add recovery API This adds an API for recovering the self-hosted control plane. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-04 19:38:30 -07:00
Andrew Rynhard	49307d554d	refactor: improve machined This is a rewrite of machined. It addresses some of the limitations and complexity in the implementation. This introduces the idea of a controller. A controller is responsible for managing the runtime, the sequencer, and a new state type introduced in this PR. A few highlights are: - no more event bus - functional approach to tasks (no more types defined for each task) - the task function definition now offers a lot more context, like access to raw API requests, the current sequence, a logger, the new state interface, and the runtime interface. - no more panics to handle reboots - additional initialize and reboot sequences - graceful gRPC server shutdown on critical errors - config is now stored at install time to avoid having to download it at install time and at boot time - upgrades now use the local config instead of downloading it - the upgrade API's preserve option takes precedence over the config's install force option Additionally, this pulls various packages in under machined to make the code easier to navigate. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-04-28 08:20:55 -07:00
Spencer Smith	31668f1c4c	chore: update timeout values for e2e tests This PR will update the values for timeout when testing e2e. We were hitting issues in GCP on the reboot test, as the nodes seemed to be taking a few minutes to become responsive again. I also moved the "cluster health" check in the node-by-node reboot test to use the default suite context, so it'll have a timeout of 30m instead of the 5 that it had initially. This seems to solve the node-by-node bailing as well. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-04-03 19:16:30 -04:00
Andrey Smirnov	682dd433ba	refactor: move Talos client package to `pkg/` As this implements Go client for Talos API, it makes sense to publish it on the top level. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-04-01 23:45:58 +03:00
Andrey Smirnov	b94be4f6a1	test: mark long tests as !short This skips long-running integration tests if `-test.short` mode is enabled. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-03-27 22:34:26 +03:00
Andrew Rynhard	5dbc26c7a3	feat: rename osctl to talosctl This is a rename of the osctl binary. We decided that talosctl is a better name for the Talos CLI. This does not break any APIs, but does make older documentation only accurate for previous versions of Talos. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-03-20 19:07:39 -07:00
Andrey Smirnov	d5f80858dd	test: add 'reset' integration test for Reset() API Every node is reset, rebooted and it comes back up again except for the init node due to known issues with init node bootstrapping etcd cluster from scratch when metadata is missing (as node was wiped). Planned workaround is to prohibit resetting init node (should be coming next). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-03-06 23:05:46 +03:00
Andrey Smirnov	9bfb5f1501	test: fix `RebootAllNodes` test to reboot all nodes in one call As calls to the nodes are proxied through `apid` on init node, we can't reboot all nodes concurrently, as init node might be already down by the moment any other node is going to be rebooted. Rewrite the test to reboot all the nodes in a single multi-node request. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-17 14:34:00 -08:00
Andrey Smirnov	491e7e58e0	test: implement RebootAllNodes test This complements "rolling restart" RebootNodeByNode test by providing more of a disaster scenario, when all the nodes are restarted at once. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-17 13:58:57 -08:00
Andrey Smirnov	76c2038b13	chore: implement loadbalancer for firecracker provisioner This PR contains generic simple TCP loadbalancer code, and glue code for firecracker provisioner to use this loadbalancer. K8s control plane is passed through the load balancer, and Talos API is passed only to the init node (for now, as some APIs, including kubeconfig, don't work with non-init node). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-13 23:07:13 +03:00
Andrey Smirnov	a2dee289d1	test: skip reboot tests Seems that with a single endpoint k8s is not able to recover (?). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-04 08:37:32 -08:00
Andrey Smirnov	afa8a48174	chore: implement reboot test Reboot test does node-by-node reboots followed by cluster health checks (same as done by provisioner). Fixed bug with `Read()` returning `Reader` instead of `ReadCloser` (minor). Allowed `bootkube` to be `Skipped` (for rebooted node). Added support for doing checks via provided client instance. Implemented generic capabilities to skip tests based on cluster platform. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-03 11:02:43 -08:00
Andrey Smirnov	6e05dd70c4	feat: add support for tailing logs Fixes #1564 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-17 22:35:47 +03:00
Andrey Smirnov	1fbf40796f	feat: implement streaming mode of dmesg, parse messages Fixes #1563 This implements dmesg reading via `/dev/kmsg`, with message parsing and formatting. Kernel log facility and severity are parsed, timestamp is calculated relative to boot time (it's accurate unless time jumps a lot during node lifetime). A new flag to follow dmesg was added; the tail flag allows streaming only new messages (ignoring old messages). We could try to implement tailing last N messages, just a bit more work, open to suggestions (for symmetry with regular logs). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-16 17:40:15 +03:00
Andrew Rynhard	ad863a7f92	refactor: rename protobuf services, RPCs, and messages This PR brings our protobuf files into conformance with the protobuf style guide, and community conventions. It is purely renames, along with generated docs. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-12-11 11:41:40 -08:00
Andrey Smirnov	399aeda0b9	feat: rename confusing target options, --endpoints, etc. Fixes #1610 1. In `talosconfig`, deprecate `Target` in favor of `Endpoints` (client-side LB to come next). 2. In `osctl`, use `--nodes` in place of `--target`. 3. In `osctl` add option `--endpoints` to override `Endpoints` for the call. Other changes are just updates to catch up with the changes. Most probably I missed something... And CAPI provider needs update. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-10 02:23:54 +03:00
Andrey Smirnov	16f1f6996e	test: add retries to the test which verifies cluster version It fails on AWS, need to figure out if it's transient failure or not. While I was there, found lots of small bugs when endpoint is unresponsive, or target nodes are unresponsive and fixed them. In retry formatting added `\t` so that embedded errors are better aligned in the output (same as multierror). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-06 11:24:58 -08:00
Andrey Smirnov	edb40437ec	feat: add support for `osctl logs -f` Now default is not to follow the logs (which is similar to `kubectl logs`). Integration test was added for `Logs()` API and `osctl logs` command. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-05 13:58:52 -08:00
Andrey Smirnov	10a40a15d9	fix: extract errors from API response This PR only touches `Version` method, but I will expand it to other methods in the next PR. When proxying to many upstreams, errors are wrapped as responses as we can't return error and response from grpc call. Reflect-based function was introduced to filter out responses which contain errors as multierror. Reflection was used, as each response is a different Go type, and we can't write a generic function for it. osctl was updated to support having both resp & err not nil. One failed response shouldn't result in error. Re-enabled integration test for multiple targets and version consistency, need e2e validation. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-05 09:44:10 -08:00
Andrey Smirnov	8c7fadde95	test: disable discovery-based test as it might break e2e It seems to work reliably in basic-integration, but fails in e2e (receives fewer responses than expected). We can re-enable once we get to the root cause of the problem. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-15 14:29:27 -08:00
Andrey Smirnov	af2b6fa130	test: implement node discovery for integration tests This adds support for node discovery for API-based tests, but discovery is based on k8s state. Discovery can be overridden if we provide a list of node IPs as a flag. Also adds a test for K8s API server version. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-14 15:35:07 -08:00
Andrey Smirnov	551fa45d33	test: add CLI integration test This starts with a very simple test for `osctl version` using regexps as output of the command depends a lot on current version. We might use more of 'gold' matches for other commands potentially. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-05 17:59:23 -08:00
Andrey Smirnov	b0aef2cf22	test: add integration test framework This is just first steps and core foundation. It can be used like: ``` make integration.test osctl cluster create build/integration.test -test.v ``` This should run the test against the Docker instance. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-05 17:21:38 +03:00

25 Commits