Commit Graph

56 Commits

Author SHA1 Message Date
Andrey Smirnov
47fb5720cf test: skip etcd tests on non-HA clusters
We can't test much of the flow on single-node clusters.

Fixes #3013

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-08 07:39:36 -08:00
Andrey Smirnov
a8dd2ff30d fix: checkpoint controller-manager and scheduler
Default manifests created by bootkube so far were only enabling
pod-checkpointer for kube-apiserver. This seems to have issues with
single-node control plane scenario, when without scheduler and
controller-manager node might fall into `NodeAffinity` state.

See https://github.com/talos-systems/bootkube-plugin/pull/23

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-28 11:53:17 -08:00
Andrey Smirnov
3dae6df27b test: stabilize upgrade test by running health check several times
For single node clusters, control plane is unstable after reboot, run
health check several times to let it settle down to avoid failures in
subsequent checks.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-11 08:31:01 -08:00
Andrey Smirnov
54ed80e244 feat: reset with system disk wipe spec
Idea is to add an option to perform "selective" reset: default reset
operation is to wipe all partitions (triggering reinstall), while spec
allows only to wipe some of the operations.

Other operations are performed exactly in the same way for any reset
flow.

Possible use case: reset only `EPHEMERAL` partition.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-10 11:31:07 -08:00
Andrey Smirnov
350280eb59 feat: implement "staged" (failsafe/backup) upgrades
Regular upgrade path takes just one reboot, but it requires all the
processes to be stopped on the node before upgrade might proceed. Under
some circumstances and with potential Talos bugs it might not work
rendering Talos upgrades almost impossible.

Staged upgrades build upon regular install flow to run the upgrade on
the node reboot. Such upgrades require two reboots of the node, and it
requires two pulls of the installer image, but they should be much less
suspicious to the failure. Once the upgrade is staged, node can be
rebooted in any possible way, including hard reset and upgrade is
performed on the next boot.

New ADV format was implemented as well to allow to store install image
ref/options across reboots. New format allows for bigger values and
takes 50% of the `META` partition. Old ADV is still kept for
compatibility reasons.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-08 08:34:26 -08:00
Artem Chernyshev
8aad711f18 feat: implement network interfaces list API
To be used in the interactive installer to configure networking.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-27 10:48:45 -08:00
Artem Chernyshev
f96cffd2b2 feat: add ability to choose CNI config
Initial version which only allows setting CNI using preset, no custom
CNI urls are supported at the moment. Still need to figure out what kind
of UI can be used for that.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-26 06:49:54 -08:00
Andrey Smirnov
9a32e34cb1 feat: implement apply configuration without reboot
This allows config to be written to disk without being applied
immediately.

Small refactoring to extract common code paths.

At first, I tried to implement this via the sequencer, but looks like
it's too hard to get it right, as sequencer lacks context and config to
be written is not applied to the runtime.

Fixes #2828

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-23 12:42:44 -08:00
Artem Chernyshev
2588e2960b feat: make GenerateConfiguration API reuse current node auth
Fixes: https://github.com/talos-systems/talos/issues/2819

Only if requested config type is not `TypeInit`.
This functionality will help implementing TUI installer cluster
extension workflow.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-23 12:12:15 -08:00
Artem Chernyshev
8513123d22 feat: return client config as the second value in GenerateConfiguration
To be used in interactive installer to output the node client
configuration to a file.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-17 07:20:05 -08:00
Artem Chernyshev
0f924b5122 feat: add generate config gRPC API
Fixes: https://github.com/talos-systems/talos/issues/2766

This API is implemented in Maintenance and Machine services.
Can be used to generate configuration on the node, instead of using
talosctl to generate it locally.

To be used in interactive installer and talosctl gen config.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-13 08:07:32 -08:00
Andrey Smirnov
8560fb9662 chore: enable nlreturn linter
Most of the fixes were automatically applied.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-09 06:48:07 -08:00
Andrey Smirnov
773912833e test: clean up integration test code, fix flakes
This enables golangci-lint via build tags for integration tests (this
should have been done long ago!), and fixes the linting errors.

Two tests were updated to reduce flakiness:

* apply config: wait for nodes to issue "boot done" sequence event
before proceeding
* recover: kill pods even if they appear after the initial set gets
killed (potential race condition with previous test).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-19 15:44:14 -07:00
Artem Chernyshev
e7e99cf1b3 feat: support disk usage command in talosctl
Usage example:

```bash
talosctl du --nodes 10.5.0.2 /var -H -d 2
NODE       NAME
10.5.0.2   8.4 kB   etc
10.5.0.2   1.3 GB   lib
10.5.0.2   16 MB    log
10.5.0.2   25 kB    run
10.5.0.2   4.1 kB   tmp
10.5.0.2   1.3 GB   .
```

Supported flags:
- `-a` writes counts for all files, not just directories.
- `-d` recursion depth
- '-H' humanize size outputs.
- '-t' size threshold (skip files if < size or > size).

Fixes: https://github.com/talos-systems/talos/issues/2504

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-10-13 09:30:31 -07:00
Andrew Rynhard
4eeef28e90 feat: add etcd API
This adds RPCs for basic etcd management tasks.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-10-06 11:30:04 -07:00
Seán C McCord
ff92d2a14b feat: add ApplyConfiguration API
Adds the ability to apply (replace) an existing node configuration with
a new one via the Machine API.

Fixes #2345

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2020-09-29 14:44:06 -07:00
Andrey Smirnov
f6ecf000c9 refactor: extract packages loadbalancer and retry
This removes in-tree packages in favor of:

* github.com/talos-systems/go-retry
* github.com/talos-systems/go-loadbalancer

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-02 13:46:22 -07:00
Marco De Luca
1fbb171fd0 test: determine reboots using boot id
Changed the RebootSuite to use /proc/sys/kernel/random/boot_id rather than /proc/uptime

Signed-off-by: Marco De Luca <marcodl404@gmail.com>
2020-08-26 06:09:02 -07:00
Andrey Smirnov
6a7cc02648 fix: handle bootkube recover correctly, support recovery from etcd
Bootkube recover process (and `talosctl recover`) was actually
regenerating assets each time `recover` runs forcing control plane to be
at the state when cluster got created. This PR fixes that by running
recover process correctly.

Recovery via etcd was fixed to handle encrypted etcd data:
it follows the way `apiserver` handles encryption at rest, and as at
the moment AES CBC is the only supported encryption method, code simply
follows the same path.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-18 14:24:14 -07:00
Andrey Smirnov
bddd4f1bf6 refactor: move external API packages into machinery/
This moves `pkg/config`, `pkg/client` and `pkg/constants`
under `pkg/machinery` umbrella.

And `pkg/machinery` is published as Go module inside Talos repository.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-17 09:56:14 -07:00
Andrey Smirnov
47608fb874 refactor: make pkg/config not rely on machined/../internal/runtime
This makes `pkg/config` directly importable from other projects.

There should be no functional changes.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-29 12:40:12 -07:00
Andrey Smirnov
3d8418a689 feat: force nodes to be set in talosctl commands using the API
With load-balancing enabled by default running `talosctl` without
`--nodes` is risky, as it might hit any control plane by default without
`--nodes`.

Only two commands do not enforce this check, as they do their own node
contexts: `crashdump` and `health` (client-side).

Integration tests were updated to always supply `--nodes` cli argument,
while doing that I refactored the storage for discovered nodes to use
existing `cluster.Info` interface.

The downside is that with e2e CAPI tests CLI tests will be mostly
skipped as we don't support discovery in CLI tests at the momemnt. This
can be fixed by using `talosctl kubeconfig` + `kubectl get nodes` for
node discovery.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-21 12:17:43 -07:00
Andrey Smirnov
1a0e1bc393 chore: update module dependencies
Fixes #2316

Simply update dependencies we don't track on version level to be
compatible with Talos components (like etcd or k8s).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 12:00:50 -07:00
Andrey Smirnov
cbb7ca8390 refactor: merge osd into machined
This merges `osd` API into `machined`. API was copied from `osd` into
`machined`, and `osd` API was deprecated.

For backwards compatibility, `machined` still implements `osd` API, so
older Talos API clients can still talk to the node without changes.

Docs were updated. No functional changes.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-13 12:50:00 -07:00
Andrey Smirnov
931237b23c test: update init node check in reset API tests
Previously we assumed that node 0 is the init node, and it can't be
reset. With new bootstrap API approach, there's no init node, and all
the nodes can be reset. This corrects the check to skip only the init
node, and with bootstrap API there's no init node (so no nodes are
skipped).

Fixes #2277

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 10:48:14 -07:00
Andrey Smirnov
5ecddf2866 feat: add round-robin LB policy to Talos client by default
Handling of multiple endpoints has already been implemented in #2094.

This PR enables round-robin policy so that grpc picks up new endpoint
for each call (and not send each request to the first control plane
node).

Endpoint list is randomized to handle cases when only one request is
going to be sent, so that it doesn't go always to the first node in the
list.

gprc handles dead/unresponsive nodes automatically for us.

`talosctl cluster create` and provision tests switched to use
client-side load balancer for Talos API.

On the additional improvements we got:

* `talosctl` now reports correct node IP when using commands without
`-n`, not the loadbalancer IP (if using multiple endpoints of course)

* loadbalancer can't provide reliable handling of errors when upstream
server is unresponsive or there're no upstreams available, grpc returns
much more helpful errors

Fixes #1641

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 08:35:15 -07:00
Andrey Smirnov
4cc074cdba feat: implement API access to event history
1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids
are sortable and unique enough. Xids also encode event publishing
time with a second precision.

2. Add three ways to look back into event history: based on number of
events, on time and ID. Lookup via ID might be used to restart event
polling in case of broken API connection from the same moment.

3. Reimplement core event buffer with positions which are always
incremented instead of generation+index, this implementation is much
more simple (idea from circular buffer).

4. By default, Events API works the same - it shows no history and
starts streaming new events only.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-08 10:54:50 -07:00
Andrey Smirnov
a6b3bd2ff6 feat: implement service events
This implements service events, adds test for events API based on
service events as they're the easiest to generate on demand.

Disabled validate test for 'metal' as it validates disk device against
local system which doesn't make much sense.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-03 13:52:53 -07:00
Andrey Smirnov
81d1c2bfe7 chore: enable godot linter
Issues were fixed automatically.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-30 10:39:56 -07:00
Andrey Smirnov
6fb55229a2 test: fix and improve reboot/reset tests
These tests rely on node uptime checks. These checks are quite flaky.

Following fixes were applied:

* code was refactored as common method shared between reset/reboot tests
(reboot all nodes does checks in a different way, so it wasn't updated)

* each request to read uptime times out in 5 seconds, so that checks
don't wait forever when node is down (or connection is aborted)

* to account for node availability vs. lower uptime in the beginning of
test, add extra elapsed time to the check condition

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-29 13:56:48 -07:00
Andrey Smirnov
0a4645fe80 feat: implement circular buffer for system logs
This replaces logging to files with inotify following to pure in-memory
circular buffer which grows on demand capped at specified maximum
capacity.

The concern with previous approach was that logs on tmpfs were growing
without any bound potentially consuming all the node memory.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-26 15:33:54 -07:00
Andrey Smirnov
28a6eb207a test: add node name to error messages in RebootAllNodes
This makes troubleshooting easier.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-05-07 12:12:46 -07:00
Andrey Smirnov
23be80fd96 test: stabilize tests by bumping timeouts
Bump timeouts for reset API test as K8s control plane teardown might
take 3 minutes on its own.

Bump Go Firecracker SDK timeout when talking to firecracker process.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-05-06 08:26:18 -07:00
Andrew Rynhard
56d7bf19fe feat: add recovery API
This adds an API for recovering the self-hosted control plane.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-05-04 19:38:30 -07:00
Andrew Rynhard
49307d554d refactor: improve machined
This is a rewrite of machined. It addresses some of the limitations and
complexity in the implementation. This introduces the idea of a
controller. A controller is responsible for managing the runtime, the
sequencer, and a new state type introduced in this PR.

A few highlights are:

- no more event bus
- functional approach to tasks (no more types defined for each task)
  - the task function definition now offers a lot more context, like
    access to raw API requests, the current sequence, a logger, the new
    state interface, and the runtime interface.
- no more panics to handle reboots
- additional initialize and reboot sequences
- graceful gRPC server shutdown on critical errors
- config is now stored at install time to avoid having to download it at
  install time and at boot time
- upgrades now use the local config instead of downloading it
- the upgrade API's preserve option takes precedence over the config's
  install force option

Additionally, this pulls various packes in under machined to make the
code easier to navigate.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-28 08:20:55 -07:00
Spencer Smith
31668f1c4c chore: update timeout values for e2e tests
This PR will update the values for timeout when testing e2e. We were
hitting issues in GCP on the reboot test, as the nodes seemed to be
taking a few minutes to become responsive again. I also moved the
"cluster health" check in the node-by-node reboot test to use the
default suite context, so it'll have a timeout of 30m instead of the 5
that it had initially. This seems to solve the node-by-node bailing as
well.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-04-03 19:16:30 -04:00
Andrey Smirnov
682dd433ba refactor: move Talos client package to pkg/
As this implements Go client for Talos API, it makes sense to publish it
one the top level.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-04-01 23:45:58 +03:00
Andrey Smirnov
b94be4f6a1 test: mark long tests as !short
This skips long-running integration tests if `-test.short` mode is
enabled.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-27 22:34:26 +03:00
Andrew Rynhard
5dbc26c7a3 feat: rename osctl to talosctl
This is a rename of the osctl binary. We decided that talosctl is a
better name for the Talos CLI. This does not break any APIs, but does
make older documentation only accurate for previous versions of Talos.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-20 19:07:39 -07:00
Andrey Smirnov
d5f80858dd test: add 'reset' integration test for Reset() API
Every node is reset, rebooted and it comes back up again except for the
init node due to known issues with init node boostrapping etcd cluster
from scratch when metadata is missing (as node was wiped).

Planned workaround is to prohibit resetting init node (should be coming
next).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-06 23:05:46 +03:00
Andrey Smirnov
9bfb5f1501 test: fix RebootAllNodes test to reboot all nodes in one call
As calls to the nodes are proxied through `apid` on init node, we can't
reboot all nodes concurrently, as init node might be already down by the
moment any other node is going to be rebooted.

Rewrite the test to reboot all the nodes in a single multi-node
request.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-17 14:34:00 -08:00
Andrey Smirnov
491e7e58e0 test: implement RebootAllNodes test
This complements "rolling restart" RebootNodeByNode test by providing
more of a disaster scenario, when all the nodes are restarted at once.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-17 13:58:57 -08:00
Andrey Smirnov
76c2038b13 chore: implement loadbalancer for firecracker provisioner
This PR contains generic simple TCP loadbalancer code, and glue code for
firecracker provisioner to use this loadbalancer.

K8s control plane is passed through the load balancer, and Talos API is
passed only to the init node (for now, as some APIs, including
kubeconfig, don't work with non-init node).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-13 23:07:13 +03:00
Andrey Smirnov
a2dee289d1 test: skip reboot tests
Seems that with a single endpoint k8s is not able to recover (?).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-04 08:37:32 -08:00
Andrey Smirnov
afa8a48174 chore: implement reboot test
Reboot test does node-by-node reboots followed by cluster health checks
(same as done by provisioner).

Fixed bug with `Read()` returning `Reader` instead of `ReadCloser`
(minor).

Allowed `bootkube` to be `Skipped` (for rebooted node).

Added support for doing checks via provided client instance.

Implemented generic capabilities to skip tests based on cluster
platform.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-03 11:02:43 -08:00
Andrey Smirnov
6e05dd70c4 feat: add support for tailing logs
Fixes #1564

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-17 22:35:47 +03:00
Andrey Smirnov
1fbf40796f feat: implement streaming mode of dmesg, parse messages
Fixes #1563

This implements dmesg reading via `/dev/kmsg`, with message parsing and
formatting. Kernel log facility and severity are parsed, timestamp is
calculated relative to boot time (it's accurate unless time jumps a
lot during node lifetime).

New flags to follow dmesg was added, tail flag allows to stream only new
message (ignoring old messages). We could try to implement tailing last
N messages, just a bit more work, open to suggestions (for symmetry with
regular logs).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-16 17:40:15 +03:00
Andrew Rynhard
ad863a7f92 refactor: rename protobuf services, RPCs, and messages
This PR brings our protobuf files into conformance with the protobuf
style guide, and community conventions. It is purely renames, along with
generated docs.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-12-11 11:41:40 -08:00
Andrey Smirnov
399aeda0b9 feat: rename confusing target options, --endpoints, etc.
Fixes #1610

1. In `talosconfig`, deprecate `Target` in favor of `Endpoints`
(client-side LB to come next).

2. In `osctl`, use `--nodes` in place of `--target`.

3. In `osctl` add option `--endpoints` to override `Endpoints` for the
call.

Other changes are just updates to catch up with the changes. Most
probably I missed something... And CAPI provider needs update.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-10 02:23:54 +03:00
Andrey Smirnov
16f1f6996e test: add retries to the test which verifies cluster version
It fails on AWS, need to figure out if it's transient failure or not.

While I was there, found lots of small bugs when endpoint is
unresponsive, or target nodes are unresponsive and fixed them.

In retry formatting added `\t` so that embedded errors are better
aligned in the output (same as multierror).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-06 11:24:58 -08:00