Bump timeouts for reset API test as K8s control plane teardown might
take 3 minutes on its own.
Bump Go Firecracker SDK timeout when talking to firecracker process.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This is a rewrite of machined. It addresses some of the limitations and
complexity in the implementation. This introduces the idea of a
controller. A controller is responsible for managing the runtime, the
sequencer, and a new state type introduced in this PR.
A few highlights are:
- no more event bus
- functional approach to tasks (no more types defined for each task)
- the task function definition now offers a lot more context, like
access to raw API requests, the current sequence, a logger, the new
state interface, and the runtime interface.
- no more panics to handle reboots
- additional initialize and reboot sequences
- graceful gRPC server shutdown on critical errors
- config is now stored at install time to avoid having to download it at
install time and at boot time
- upgrades now use the local config instead of downloading it
- the upgrade API's preserve option takes precedence over the config's
install force option
Additionally, this pulls various packes in under machined to make the
code easier to navigate.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This PR will update the values for timeout when testing e2e. We were
hitting issues in GCP on the reboot test, as the nodes seemed to be
taking a few minutes to become responsive again. I also moved the
"cluster health" check in the node-by-node reboot test to use the
default suite context, so it'll have a timeout of 30m instead of the 5
that it had initially. This seems to solve the node-by-node bailing as
well.
Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
This is a rename of the osctl binary. We decided that talosctl is a
better name for the Talos CLI. This does not break any APIs, but does
make older documentation only accurate for previous versions of Talos.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
Every node is reset, rebooted and it comes back up again except for the
init node due to known issues with init node boostrapping etcd cluster
from scratch when metadata is missing (as node was wiped).
Planned workaround is to prohibit resetting init node (should be coming
next).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
As calls to the nodes are proxied through `apid` on init node, we can't
reboot all nodes concurrently, as init node might be already down by the
moment any other node is going to be rebooted.
Rewrite the test to reboot all the nodes in a single multi-node
request.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This complements "rolling restart" RebootNodeByNode test by providing
more of a disaster scenario, when all the nodes are restarted at once.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This PR contains generic simple TCP loadbalancer code, and glue code for
firecracker provisioner to use this loadbalancer.
K8s control plane is passed through the load balancer, and Talos API is
passed only to the init node (for now, as some APIs, including
kubeconfig, don't work with non-init node).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Reboot test does node-by-node reboots followed by cluster health checks
(same as done by provisioner).
Fixed bug with `Read()` returning `Reader` instead of `ReadCloser`
(minor).
Allowed `bootkube` to be `Skipped` (for rebooted node).
Added support for doing checks via provided client instance.
Implemented generic capabilities to skip tests based on cluster
platform.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes#1563
This implements dmesg reading via `/dev/kmsg`, with message parsing and
formatting. Kernel log facility and severity are parsed, timestamp is
calculated relative to boot time (it's accurate unless time jumps a
lot during node lifetime).
New flags to follow dmesg was added, tail flag allows to stream only new
message (ignoring old messages). We could try to implement tailing last
N messages, just a bit more work, open to suggestions (for symmetry with
regular logs).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This PR brings our protobuf files into conformance with the protobuf
style guide, and community conventions. It is purely renames, along with
generated docs.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
Fixes#1610
1. In `talosconfig`, deprecate `Target` in favor of `Endpoints`
(client-side LB to come next).
2. In `osctl`, use `--nodes` in place of `--target`.
3. In `osctl` add option `--endpoints` to override `Endpoints` for the
call.
Other changes are just updates to catch up with the changes. Most
probably I missed something... And CAPI provider needs update.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
It fails on AWS, need to figure out if it's transient failure or not.
While I was there, found lots of small bugs when endpoint is
unresponsive, or target nodes are unresponsive and fixed them.
In retry formatting added `\t` so that embedded errors are better
aligned in the output (same as multierror).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Now default is not to follow the logs (which is similar to `kubectl logs`).
Integration test was added for `Logs()` API and `osctl logs` command.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This PR only touches `Version` method, but I will expand it to other
methods in the next PR.
When proxying to many upstreams, errors are wrapped as responses as we
can't return error and response from grpc call. Reflect-based function
was introduced to filter out responses which contain errors as
multierror. Reflection was used, as each response is a different Go
type, and we can't write a generic function for it.
osctl was updated to support having both resp & err not nil. One failed
response shouldn't result in error.
Re-enabled integration test for multiple targets and version
consistency, need e2e validation.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
It seems to work reliably in basic-integration, but fails in e2e
(receives less responses than expected). We can re-enable once we get to
the root cause of the problem.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This adds support for node discovery for API-based tests, but discovery
is based on k8s state. Discovery can be overridden if we provide a list
of node IPs as a flag.
Also adds a test for K8s API server version.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This starts with a very simple test for `osctl version` using regexps as
output of the command depends a lot on current version.
We might use more of 'gold' matches for other commands potentially.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This is just first steps and core foundation.
It can be used like:
```
make integration.test
osctl cluster create
build/integration.test -test.v
```
This should run the test against the Docker instance.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>