41 Commits

Author SHA1 Message Date
Andrey Smirnov
ff2267eb99 test: update versions used for upgrade tests
We should stick to the latest version in each release series.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-04-07 15:51:56 -07:00
Spencer Smith
31668f1c4c chore: update timeout values for e2e tests
This PR will update the values for timeout when testing e2e. We were
hitting issues in GCP on the reboot test, as the nodes seemed to be
taking a few minutes to become responsive again. I also moved the
"cluster health" check in the node-by-node reboot test to use the
default suite context, so it'll have a timeout of 30m instead of the 5m
that it had initially. This seems to solve the node-by-node bailing as
well.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-04-03 19:16:30 -04:00
Andrey Smirnov
682dd433ba refactor: move Talos client package to pkg/
As this implements the Go client for the Talos API, it makes sense to
publish it at the top level.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-04-01 23:45:58 +03:00
Andrey Smirnov
b94be4f6a1 test: mark long tests as !short
This skips long-running integration tests if `-test.short` mode is
enabled.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-27 22:34:26 +03:00
Spencer Smith
3a4eaeeef0 feat: upgrade kubernetes to 1.18
This PR will pull in the latest release of k8s 1.18 so we can start
validating it through our test suite.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-26 14:59:43 -04:00
Andrey Smirnov
e38cde9b48 chore: update upgrade tests for new version, split into two tracks
This updates upgrade tests to run two flows with 3+1 clusters:

1. 0.3 -> current (testing upgrade with partition wiping)
2. 0.4-alpha.7 -> current (testing upgrade without partition wiping,
boot-a/boot-b)

Plus a small upgrade test with preserve enabled for a single-node cluster.

Provision tests are now split into two parallel tracks in Drone.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-24 15:30:00 -07:00
Andrew Rynhard
5dbc26c7a3 feat: rename osctl to talosctl
This is a rename of the osctl binary. We decided that talosctl is a
better name for the Talos CLI. This does not break any APIs, but does
make older documentation only accurate for previous versions of Talos.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-20 19:07:39 -07:00
Andrew Rynhard
69fa63a7b2 refactor: perform upgrade upon reboot
This PR introduces a new strategy for upgrades. Instead of attempting to
zap the partition table, create a new one, and then format the
partitions, this change will only update the `vmlinuz`, and
`initramfs.xz` being used to boot. It introduces an A/B style upgrade
process, which will allow for easy rollbacks. One deviation from our
original intention with upgrades is that this change does not completely
reset a node. It falls just short of that and does not reset the
partition table. This forces us to keep the current partition scheme in
mind as we make changes in the future, because an upgrade assumes a
specific partition scheme. We can improve upgrades further in the
future, but this will at least make them more dependable. Finally, one
more feature in this PR is the ability to keep state. This enables
single node clusters to upgrade since we keep the etcd data around.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-20 17:32:18 -07:00
Andrey Smirnov
0babc39653 feat: split osctl commands into Talos API and cluster management
This keeps backwards compatibility with `osctl` CLI binary with the
exception of `osctl config generate` which was renamed to `osctl
gen config` to avoid confusion with other `osctl config`
commands which operate on client config, not Talos server config.

Command implementation and helpers were split into subpackages for
cleaner code and more visible boundaries. The resulting binary still
combines commands from both sections into a single binary.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-20 22:45:04 +03:00
Andrey Smirnov
d5f80858dd test: add 'reset' integration test for Reset() API
Every node is reset and rebooted, and it comes back up again, except for
the init node, due to known issues with the init node bootstrapping the
etcd cluster from scratch when metadata is missing (as the node was wiped).

Planned workaround is to prohibit resetting init node (should be coming
next).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-06 23:05:46 +03:00
Andrey Smirnov
2e3681054d chore: improve handling of etcd responses in bootkube pre-func
Try more attempts, wait for the response. Treat empty response as no
error (as this is what to expect when key is not set yet).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-06 21:06:48 +03:00
Andrey Smirnov
bbe2c53d29 feat: generate kubeconfig on the fly on request
This extracts admin kubeconfig generation out of bootkube, now based on
Talos x509 library. On each API request for `kubeconfig`, config is
generated on the fly and sent back on the wire.

This fixes two issues:

* any master node can now generate `kubeconfig` (worker nodes can do
that too, but that should probably change in the future)
* after the upgrade-and-wipe-the-disk scenario, `osctl kubeconfig` still
works

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-28 21:00:52 +03:00
Andrey Smirnov
d5d3035c8c test: enable upgrade tests 0.4.x -> latest
With the fix #1904, it's now possible to upgrade 0.4.x with
`machine.File` extra files (caused by registry mirror for
registry.ci.svc).

Bump resources for upgrade tests in attempt to speed it up.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-26 00:09:32 +03:00
Andrey Smirnov
923ef4537b test: implement new class of tests: provision tests (upgrades)
This class of tests is included/excluded by build tags, but as it is
pretty different from other integration tests, we build it as separate
executable. Provision tests provision cluster for the test run, perform
some actions and verify results (could be upgrade, reset, scale up/down,
etc.)

There is now a framework for implementing upgrade tests. The first test
covers the upgrade from the latest 0.3 (0.3.2 at the moment) to the current
version of Talos (being built in CI). The test starts by booting with the
0.3 kernel/initramfs, runs the 0.3 installer to install a 0.3.2 cluster,
waits for bootstrap, and then upgrades to 0.4 in a rolling fashion. As
Firecracker supports a bootloader, this boots the 0.4 system from the boot
disk (as installed by the installer).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-21 07:04:03 -08:00
Andrey Smirnov
9bfb5f1501 test: fix RebootAllNodes test to reboot all nodes in one call
As calls to the nodes are proxied through `apid` on init node, we can't
reboot all nodes concurrently, as init node might be already down by the
moment any other node is going to be rebooted.

Rewrite the test to reboot all the nodes in a single multi-node
request.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-17 14:34:00 -08:00
Andrey Smirnov
491e7e58e0 test: implement RebootAllNodes test
This complements "rolling restart" RebootNodeByNode test by providing
more of a disaster scenario, when all the nodes are restarted at once.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-17 13:58:57 -08:00
Andrey Smirnov
76c2038b13 chore: implement loadbalancer for firecracker provisioner
This PR contains generic simple TCP loadbalancer code, and glue code for
firecracker provisioner to use this loadbalancer.

K8s control plane is passed through the load balancer, and Talos API is
passed only to the init node (for now, as some APIs, including
kubeconfig, don't work with non-init node).
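
To illustrate what a generic, simple TCP load balancer can look like, here is a minimal round-robin sketch in Go. It is not the code from this PR: the `roundRobin` type, the `proxy` and `serve` helpers, and the upstream addresses are all hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net"
	"sync/atomic"
)

// roundRobin hands out upstream addresses in rotation (hypothetical type).
type roundRobin struct {
	next      uint64 // kept first so atomic access is aligned on 32-bit platforms
	upstreams []string
}

// pick returns the next upstream address; it is safe for concurrent use.
func (r *roundRobin) pick() string {
	n := atomic.AddUint64(&r.next, 1)
	return r.upstreams[(n-1)%uint64(len(r.upstreams))]
}

// proxy copies bytes in both directions between the client and one upstream.
func proxy(client net.Conn, upstream string) {
	defer client.Close()

	backend, err := net.Dial("tcp", upstream)
	if err != nil {
		log.Printf("dial %s: %v", upstream, err)
		return
	}
	defer backend.Close()

	go io.Copy(backend, client) // client -> upstream
	io.Copy(client, backend)    // upstream -> client
}

// serve accepts connections and forwards each one to the next upstream.
func serve(l net.Listener, rr *roundRobin) error {
	for {
		conn, err := l.Accept()
		if err != nil {
			return err
		}
		go proxy(conn, rr.pick())
	}
}

func main() {
	// Assumed control plane endpoints, for illustration only.
	rr := &roundRobin{upstreams: []string{"10.5.0.2:6443", "10.5.0.3:6443"}}
	for i := 0; i < 3; i++ {
		fmt.Println(rr.pick())
	}
	// To serve traffic, pass a listener from net.Listen to serve(l, rr).
}
```

Round-robin selection keeps the sketch stateless per connection; health checking of upstreams is deliberately left out.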

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-13 23:07:13 +03:00
Andrey Smirnov
a2dee289d1 test: skip reboot tests
Seems that with a single endpoint k8s is not able to recover (?).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-04 08:37:32 -08:00
Andrey Smirnov
afa8a48174 chore: implement reboot test
Reboot test does node-by-node reboots followed by cluster health checks
(same as done by provisioner).

Fixed bug with `Read()` returning `Reader` instead of `ReadCloser`
(minor).

Allowed `bootkube` to be `Skipped` (for rebooted node).

Added support for doing checks via provided client instance.

Implemented generic capabilities to skip tests based on cluster
platform.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-03 11:02:43 -08:00
Andrey Smirnov
0afd0f651b chore: provide provisioned cluster info to integration test
Integration test can optionally consume cluster state as generated by
the call to `osctl cluster create` and use it to discover nodes in
integration tests.

This means that now CLI tests can use that as discovery source, and
API/K8s tests by default as well.

Flat list of nodes is to be replaced by something more complex in the
next iteration, but it's good for this PR.

As a demo, add CLI test with multiple nodes (dmesg).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-01-31 18:21:30 +03:00
Andrey Smirnov
9da687d2a3 test: firecracker provisioner fixes, implement cluster destroy
This implements `osctl cluster destroy` for Firecracker and adds a
new utility command, `osctl cluster show`.

Firecracker mode now has control process for firecracker VMs, allowing
clean reboots and background operations.

Lots of small fixes to Firecracker mode, clean CNI shutdown, cleaning up
netns, etc.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-01-21 17:11:06 -08:00
Andrew Rynhard
f3623d22b0 refactor: use tls.Config as client credentials
The `client.Creds` struct was not used very often, and made using the
`client.NewClient` function impossible to use in combination with the
`RemoteRenewingFileCertificateProvider`. This modifies
`client.NewClient` to accept a `tls.Config` instead of `client.Creds`,
allowing for the use of `RemoteRenewingFileCertificateProvider` with
`client.NewClient`.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-01-21 17:10:07 -08:00
Andrey Smirnov
ebd40bd0eb chore: use osctl cluster --wait in basic-integration
There are a few workarounds for the way Drone runs the integration test:
DinD runs as a separate pod, and we can only access the ports it exposes on
the "host", while this endpoint is not reachable from the Talos cluster.

So internally Talos nodes still use addresses like "10.5.0.2", while
test is using "docker" to access it (that's name of the `docker` service
in the pipeline).

When running locally, 127.0.0.1 is used as endpoint, which should work
fine both on OS X and Linux.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-30 15:15:42 -08:00
Andrey Smirnov
3a021e4579 test: add integration tests for (most) CLI commands
I added tests for all the commands which work reliably in container mode.

Some tests are naive, some are more sophisticated. While going
through the tests, I think I found a small bug in `osctl gen keypair`.

When we get reliable KVM tests, I can revisit and add missing
tests for time, reboot, shutdown and friends.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-20 23:33:35 +03:00
Andrey Smirnov
f3dff87957 fix: fail on multiple nodes for commands which don't support it
Fixes #1663

(I believe it's 0.3 backport strong candidate).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-18 18:51:40 +03:00
Andrey Smirnov
6e05dd70c4 feat: add support for tailing logs
Fixes #1564

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-17 22:35:47 +03:00
Andrey Smirnov
1fbf40796f feat: implement streaming mode of dmesg, parse messages
Fixes #1563

This implements dmesg reading via `/dev/kmsg`, with message parsing and
formatting. Kernel log facility and severity are parsed, timestamp is
calculated relative to boot time (it's accurate unless time jumps a
lot during node lifetime).

A new flag to follow dmesg was added, and the tail flag allows streaming
only new messages (ignoring old messages). We could try to implement tailing
the last N messages; it's just a bit more work. Open to suggestions (for
symmetry with regular logs).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-16 17:40:15 +03:00
Andrew Rynhard
ad863a7f92 refactor: rename protobuf services, RPCs, and messages
This PR brings our protobuf files into conformance with the protobuf
style guide, and community conventions. It is purely renames, along with
generated docs.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-12-11 11:41:40 -08:00
Andrey Smirnov
399aeda0b9 feat: rename confusing target options, --endpoints, etc.
Fixes #1610

1. In `talosconfig`, deprecate `Target` in favor of `Endpoints`
(client-side LB to come next).

2. In `osctl`, use `--nodes` in place of `--target`.

3. In `osctl` add option `--endpoints` to override `Endpoints` for the
call.

Other changes are just updates to catch up with the changes. Most
probably I missed something... And CAPI provider needs update.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-10 02:23:54 +03:00
Andrey Smirnov
16f1f6996e test: add retries to the test which verifies cluster version
It fails on AWS, need to figure out if it's transient failure or not.

While I was there, found lots of small bugs when endpoint is
unresponsive, or target nodes are unresponsive and fixed them.

In retry formatting added `\t` so that embedded errors are better
aligned in the output (same as multierror).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-06 11:24:58 -08:00
Andrey Smirnov
edb40437ec feat: add support for osctl logs -f
Now default is not to follow the logs (which is similar to `kubectl logs`).

Integration test was added for `Logs()` API and `osctl logs` command.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-05 13:58:52 -08:00
Andrey Smirnov
10a40a15d9 fix: extract errors from API response
This PR only touches `Version` method, but I will expand it to other
methods in the next PR.

When proxying to many upstreams, errors are wrapped as responses as we
can't return error and response from grpc call. Reflect-based function
was introduced to filter out responses which contain errors as
multierror. Reflection was used, as each response is a different Go
type, and we can't write a generic function for it.
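
A reflection-based filter of this kind might look roughly like the sketch below. It is an assumption-laden illustration, not the function from this PR: the `Metadata`, `VersionReply` and `VersionResponse` types and the `filterMessages` name are hypothetical stand-ins, and the only convention relied on is that a response has a `Messages` slice whose elements carry a `Metadata` field with an `Error` string.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
	"strings"
)

// Hypothetical stand-ins for generated response types.
type Metadata struct {
	Hostname string
	Error    string
}

type VersionReply struct {
	Metadata *Metadata
	Tag      string
}

type VersionResponse struct {
	Messages []*VersionReply
}

// filterMessages drops every message whose Metadata.Error is set and returns
// those errors combined into one. resp must be a pointer to a struct with a
// Messages slice of pointers to structs that have a Metadata field.
func filterMessages(resp interface{}) error {
	msgs := reflect.ValueOf(resp).Elem().FieldByName("Messages")
	kept := reflect.MakeSlice(msgs.Type(), 0, msgs.Len())

	var failed []string

	for i := 0; i < msgs.Len(); i++ {
		msg := msgs.Index(i)

		md, _ := msg.Elem().FieldByName("Metadata").Interface().(*Metadata)
		if md != nil && md.Error != "" {
			failed = append(failed, fmt.Sprintf("%s: %s", md.Hostname, md.Error))
			continue
		}

		kept = reflect.Append(kept, msg)
	}

	// Replace the slice in place so the caller sees only successful replies.
	msgs.Set(kept)

	if len(failed) == 0 {
		return nil
	}

	return errors.New(strings.Join(failed, "; "))
}

func main() {
	resp := &VersionResponse{Messages: []*VersionReply{
		{Metadata: &Metadata{Hostname: "10.5.0.2"}, Tag: "v0.4.0"},
		{Metadata: &Metadata{Hostname: "10.5.0.3", Error: "connection refused"}},
	}}

	err := filterMessages(resp)
	fmt.Println(len(resp.Messages), err)
}
```

Returning both the filtered response and an error lets a caller print the successful replies and still report the failures, in line with the note below that one failed response should not fail the whole call.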

osctl was updated to support having both resp & err not nil. One failed
response shouldn't result in error.

Re-enabled integration test for multiple targets and version
consistency, need e2e validation.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-05 09:44:10 -08:00
Andrey Smirnov
96a7289f06 test: fix integration version test as 'NODE:' might be missing
When invoked without `-t`, `osctl` shouldn't print `NODE:` anymore.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-03 07:45:41 -08:00
Andrey Smirnov
5b7bea2471 feat: use grpc-proxy in apid
This replaces codegen version of apid proxying with
talos-systems/grpc-proxy based version. Proxying is transparent, it
doesn't require exact information about methods and response types. It
requires some common layout response to enhance it properly with node
metadata or errors.

There should be no significant changes to the API compared with the
previous version, but it's worth mentioning a few changes:

1. grpc.ClientConn is established just once per upstream (either local
service or remote apid instance).

2. When called without `-t` (`targets`), apid proxies immediately down
to local service skipping proxying to itself (as before), which results
in empty node metadata in response (before it had local node IP). Might
revert this later to proxy to itself (?).

3. Streaming APIs are now fully supported with multiple targets, but
message definition doesn't contain `ResponseMetadata`, so streaming APIs
are broken now with targets (needs a fix).

4. Errors are now returned as responses with `Error` field set in
`ResponseMetadata`, this requires client library update and `osctl` to
handle it properly.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-29 22:57:25 +03:00
Andrey Smirnov
8c7fadde95 test: disable discovery-based test as it might break e2e
It seems to work reliably in basic-integration, but fails in e2e
(receives fewer responses than expected). We can re-enable once we get to
the root cause of the problem.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-15 14:29:27 -08:00
Andrey Smirnov
af2b6fa130 test: implement node discovery for integration tests
This adds support for node discovery for API-based tests, but discovery
is based on k8s state. Discovery can be overridden if we provide a list
of node IPs as a flag.

Also adds a test for K8s API server version.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-14 15:35:07 -08:00
Sekerin Evgeniy
83d5f4c721 feat: Add context key to osctl
Added a context key for changing the context in osctl

Signed-off-by: Sekerin Evgeniy <sekerin.e.a@gmail.com>
2019-11-13 11:32:15 -08:00
Andrey Smirnov
63212ab17e test: fix integration test for k8s version
Push versions to constants, introduce 'platform' to version API to
discover node mode. Check kernel version for non-containers.

A bit of refactoring on version package to expose something closer to a
single response.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-11 13:42:21 -08:00
Andrey Smirnov
cdda81df66 test: add k8s integration tests
Once again, mostly groundwork and one simple test for node versions.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-06 17:08:44 -08:00
Andrey Smirnov
551fa45d33 test: add CLI integration test
This starts with a very simple test for `osctl version` using regexps as
output of the command depends a lot on current version.

We might use more of 'gold' matches for other commands potentially.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-05 17:59:23 -08:00
Andrey Smirnov
b0aef2cf22 test: add integration test framework
These are just the first steps and the core foundation.

It can be used like:

```
make integration.test
osctl cluster create
build/integration.test -test.v
```

This should run the test against the Docker instance.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-05 17:21:38 +03:00