These tests rely on node uptime checks, which are quite flaky.
The following fixes were applied:
* the code was refactored into a common method shared between the reset and reboot tests
(the reboot-all-nodes test does its checks differently, so it wasn't updated)
* each request to read uptime times out after 5 seconds, so that checks
don't wait forever when a node is down (or the connection is aborted)
* to account for a node being available but reporting low uptime at the
beginning of the test, extra elapsed time is added to the check condition
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This PR updates the timeout values used in e2e testing. We were
hitting issues in GCP on the reboot test, as the nodes seemed to take
a few minutes to become responsive again. I also moved the
"cluster health" check in the node-by-node reboot test to use the
default suite context, so it has a timeout of 30m instead of the 5m
it had initially. This also seems to solve the node-by-node test
bailing out early.
Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
This is a rename of the osctl binary. We decided that talosctl is a
better name for the Talos CLI. This does not break any APIs, but does
make older documentation only accurate for previous versions of Talos.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
Every node is reset, rebooted, and comes back up again, except for the
init node, due to known issues with the init node bootstrapping the etcd
cluster from scratch when metadata is missing (as the node was wiped).
The planned workaround is to prohibit resetting the init node (should be
coming next).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
As calls to the nodes are proxied through `apid` on the init node, we
can't reboot all nodes concurrently: the init node might already be down
by the time any other node is rebooted.
The test was rewritten to reboot all the nodes in a single multi-node
request.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This complements the "rolling restart" RebootNodeByNode test by providing
more of a disaster scenario, in which all the nodes are restarted at once.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This PR contains generic, simple TCP load balancer code, plus glue code
for the firecracker provisioner to use this load balancer.
The K8s control plane is passed through the load balancer, while the
Talos API is forwarded only to the init node (for now, as some APIs,
including kubeconfig, don't work with a non-init node).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
The reboot test does node-by-node reboots followed by cluster health
checks (the same checks the provisioner does). Additional changes:
* fixed a minor bug with `Read()` returning a `Reader` instead of a
`ReadCloser`
* allowed `bootkube` to be `Skipped` (for a rebooted node)
* added support for running checks via a provided client instance
* implemented a generic capability to skip tests based on cluster
platform
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>