talos

mirror of https://github.com/siderolabs/talos.git synced 2025-08-09 08:07:05 +02:00

Author	SHA1	Message	Date
Alexey Palazhchenko	7462733bcb	chore: update golangci-lint Fix context propagation. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>	2021-11-15 14:55:25 +00:00
Andrey Smirnov	b6b78e7fef	test: add cluster discovery integration tests This verifies that members match cluster state and that both cluster registries work in sync producing same discovery data. Fixes #4191 Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-10-25 21:03:29 +03:00
Andrey Smirnov	a059454045	chore: build using Go 1.17 `initramfs` size for amd64 shrinks by 1.3 MiB. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2021-09-13 22:33:47 +03:00
Alexey Palazhchenko	f63ab9dd9b	feat: implement `talosctl config new` command Refs #3421. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-06-17 09:06:43 -07:00
Andrey Smirnov	5811f4dda1	feat: implement link (interface) controllers The structure of the controllers is really similar to addresses and routes: * `LinkSpec` resource describes desired link state * `LinkConfig` controller generates `LinkSpecs` based on machine configuration and kernel cmdline * `LinkMerge` controller merges multiple configuration sources into a single `LinkSpec` paying attention to the config layer priority * `LinkSpec` controller applies the specs to the kernel state Controller `LinkStatus` (which was implemented before) watches the kernel state and publishes current link status. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-01 09:36:25 -07:00
Andrey Smirnov	e0650218a6	feat: support etcd recovery from snapshot on bootstrap When Talos `controlplane` node is waiting for a bootstrap, `etcd` contents can be recovered from a snapshot created with `talosctl etcd snapshot` on a healthy cluster. Bootstrap process goes same way as before, but the etcd data directory is recovered from the snapshot. This flow enables disaster recovery for the control plane: given that periodic backups are available, destroy control plane nodes, re-create them with the same config, and bootstrap one node with the saved snapshot to recover etcd state at the time of the snapshot. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-08 10:15:37 -07:00
Alexey Palazhchenko	df52c13581	chore: fix //nolint directives That's the recommended syntax: https://golangci-lint.run/usage/false-positives/ Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-03-05 05:58:33 -08:00
Andrey Smirnov	87ccf0eb21	test: clear connection refused errors after reset After node reboot (and gRPC API unavailability), gRPC stack might cache connection refused errors for up to backoff timeout. Explicitly clear such errors in reset tests before trying to read data from the node to verify reset success. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-01 08:11:27 -08:00
Andrey Smirnov	ff4d702f77	fix: implement preserving contents of partition on install This fixes A/B upgrades and rollback API. Installer manifest supports now an option to preserve partition contents while disk is being re-partitioned and partitions are re-formatted. Mount `/boot` partition as needed (to find current label before starting the installation and in the rollback API). Fix upgrade API for non-master nodes. Contents of `/boot`, `/system/state` and META partitions are preserved in memory while the disk is re-partitioned. Remove `--save` flag from the installer as it's not being used. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-10-22 23:56:39 +03:00
Andrey Smirnov	56f1ee37fd	feat: upgrade Kubernetes to 1.19.3 Just minor release bump. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-10-20 05:12:32 -07:00
Andrey Smirnov	773912833e	test: clean up integration test code, fix flakes This enables golangci-lint via build tags for integration tests (this should have been done long ago!), and fixes the linting errors. Two tests were updated to reduce flakiness: * apply config: wait for nodes to issue "boot done" sequence event before proceeding * recover: kill pods even if they appear after the initial set gets killed (potential race condition with previous test). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-10-19 15:44:14 -07:00
Andrey Smirnov	f6ecf000c9	refactor: extract packages loadbalancer and retry This removes in-tree packages in favor of: * github.com/talos-systems/go-retry * github.com/talos-systems/go-loadbalancer Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-09-02 13:46:22 -07:00
Marco De Luca	1fbb171fd0	test: determine reboots using boot id Changed the RebootSuite to use /proc/sys/kernel/random/boot_id rather than /proc/uptime Signed-off-by: Marco De Luca <marcodl404@gmail.com>	2020-08-26 06:09:02 -07:00
Andrey Smirnov	bddd4f1bf6	refactor: move external API packages into `machinery/` This moves `pkg/config`, `pkg/client` and `pkg/constants` under `pkg/machinery` umbrella. And `pkg/machinery` is published as Go module inside Talos repository. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-17 09:56:14 -07:00
Andrey Smirnov	9379cf9ee1	refactor: expose `provision` as public package This change is only moving packages and updating import paths. Goal: expose `internal/pkg/provision` as `pkg/provision` to enable other projects to import Talos provisioning library. As cluster checks are almost always required as part of provisioning process, package `internal/pkg/cluster` was also made public as `pkg/cluster`. Other changes were direct dependencies discovered by `importvet` which were updated. Public packages (useful, general purpose packages with stable API): * `internal/pkg/conditions` -> `pkg/conditions` * `internal/pkg/tail` -> `pkg/tail` Private packages (used only on provisioning library internally): * `internal/pkg/inmemhttp` -> `pkg/provision/internal/inmemhttp` * `internal/pkg/kernel/vmlinuz` -> `pkg/provision/internal/vmlinuz` * `internal/pkg/cniutils` -> `pkg/provision/internal/cniutils` Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-12 05:12:05 -07:00
Andrey Smirnov	47608fb874	refactor: make `pkg/config` not rely on `machined/../internal/runtime` This makes `pkg/config` directly importable from other projects. There should be no functional changes. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-29 12:40:12 -07:00
Andrey Smirnov	3d8418a689	feat: force nodes to be set in `talosctl` commands using the API With load-balancing enabled by default running `talosctl` without `--nodes` is risky, as it might hit any control plane by default without `--nodes`. Only two commands do not enforce this check, as they do their own node contexts: `crashdump` and `health` (client-side). Integration tests were updated to always supply `--nodes` cli argument, while doing that I refactored the storage for discovered nodes to use existing `cluster.Info` interface. The downside is that with e2e CAPI tests CLI tests will be mostly skipped as we don't support discovery in CLI tests at the moment. This can be fixed by using `talosctl kubeconfig` + `kubectl get nodes` for node discovery. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-21 12:17:43 -07:00
Andrey Smirnov	a4a2a3c83a	feat: uncordon nodes automatically on boot Talos will mark node as schedulable if it was previously cordoned by Talos (for upgrade, reset, etc.) If user marked node as not schedulable, Talos won't change it on boot. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 15:32:36 -07:00
Andrey Smirnov	81d1c2bfe7	chore: enable godot linter Issues were fixed automatically. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-30 10:39:56 -07:00
Andrey Smirnov	6fb55229a2	test: fix and improve reboot/reset tests These tests rely on node uptime checks. These checks are quite flaky. Following fixes were applied: * code was refactored as common method shared between reset/reboot tests (reboot all nodes does checks in a different way, so it wasn't updated) * each request to read uptime times out in 5 seconds, so that checks don't wait forever when node is down (or connection is aborted) * to account for node availability vs. lower uptime in the beginning of test, add extra elapsed time to the check condition Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-29 13:56:48 -07:00
Andrey Smirnov	795a10b681	test: improve reboot/reset test resiliency against request timeouts After node reboot test code tries endlessly to read the uptime until it goes down after reboot, but during actual reboot API won't be responsive and it might happen that this call will time out only with parent context canceling, and by that time retry timeout is already exhausted, so no more attempts will be made (while node successfully booted after a reboot). ``` uptime didn't go down: before 219.730000, after 267.020000 uptime didn't go down: before 219.730000, after 268.030000 EOF rpc error: code = DeadlineExceeded desc = context deadline exceeded timeout ``` Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-05-22 12:31:06 -07:00
Seán C McCord	3e0e01e2c3	fix: refactor client creation API Create a new `client.New` to make external API systems easier to construct. A new type `client.OptionFunc` allows the client to be extended with specific configuration. This also makes a first pass at supporting multiple endpoints properly by creating a custom grpc resolver. (Proper load balancing support is still a TODO.) Fixes #2093 Signed-off-by: Seán C McCord <ulexus@gmail.com>	2020-05-11 10:21:07 -07:00
Andrew Rynhard	56d7bf19fe	feat: add recovery API This adds an API for recovering the self-hosted control plane. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-05-04 19:38:30 -07:00
Andrew Rynhard	49307d554d	refactor: improve machined This is a rewrite of machined. It addresses some of the limitations and complexity in the implementation. This introduces the idea of a controller. A controller is responsible for managing the runtime, the sequencer, and a new state type introduced in this PR. A few highlights are: - no more event bus - functional approach to tasks (no more types defined for each task) - the task function definition now offers a lot more context, like access to raw API requests, the current sequence, a logger, the new state interface, and the runtime interface. - no more panics to handle reboots - additional initialize and reboot sequences - graceful gRPC server shutdown on critical errors - config is now stored at install time to avoid having to download it at install time and at boot time - upgrades now use the local config instead of downloading it - the upgrade API's preserve option takes precedence over the config's install force option Additionally, this pulls various packages in under machined to make the code easier to navigate. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-04-28 08:20:55 -07:00
Andrey Smirnov	55dcbbc8d0	feat: add commands talosctl health/crashdump This extracts health & crashdump features which were specific to provisioning code into separate package which can be used standalone. Everything else is just new glue. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-04-27 20:43:10 -07:00
Andrey Smirnov	682dd433ba	refactor: move Talos client package to `pkg/` As this implements Go client for Talos API, it makes sense to publish it on the top level. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-04-01 23:45:58 +03:00
Andrew Rynhard	5dbc26c7a3	feat: rename osctl to talosctl This is a rename of the osctl binary. We decided that talosctl is a better name for the Talos CLI. This does not break any APIs, but does make older documentation only accurate for previous versions of Talos. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-03-20 19:07:39 -07:00
Andrey Smirnov	d5f80858dd	test: add 'reset' integration test for Reset() API Every node is reset, rebooted and it comes back up again except for the init node due to known issues with init node bootstrapping etcd cluster from scratch when metadata is missing (as node was wiped). Planned workaround is to prohibit resetting init node (should be coming next). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-03-06 23:05:46 +03:00
Andrey Smirnov	afa8a48174	chore: implement reboot test Reboot test does node-by-node reboots followed by cluster health checks (same as done by provisioner). Fixed bug with `Read()` returning `Reader` instead of `ReadCloser` (minor). Allowed `bootkube` to be `Skipped` (for rebooted node). Added support for doing checks via provided client instance. Implemented generic capabilities to skip tests based on cluster platform. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-02-03 11:02:43 -08:00
Andrey Smirnov	0afd0f651b	chore: provide provisioned cluster info to integration test Integration test can optionally consume cluster state as generated by the call to `osctl cluster create` and use it to discover nodes in integration tests. This means that now CLI tests can use that as discovery source, and API/K8s tests by default as well. Flat list of nodes is to be replaced by something more complex in the next iteration, but it's good for this PR. As a demo, add CLI test with multiple nodes (dmesg). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-01-31 18:21:30 +03:00
Andrew Rynhard	f3623d22b0	refactor: use tls.Config as client credentials The `client.Creds` struct was not used very often, and made using the `client.NewClient` function impossible to use in combination with the `RemoteRenewingFileCertificateProvider`. This modifies `client.NewClient` to accept a `tls.Config` instead of `client.Creds`, allowing for the use of `RemoteRenewingFileCertificateProvider` with `client.NewClient`. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-01-21 17:10:07 -08:00
Andrey Smirnov	ebd40bd0eb	chore: use osctl cluster --wait in basic-integration There are few workarounds for Drone way of running integration test: DinD runs as a separate pod, and we can only access its exposed on the "host" ports, while from Talos cluster this endpoint is not reachable. So internally Talos nodes still use addresses like "10.5.0.2", while test is using "docker" to access it (that's name of the `docker` service in the pipeline). When running locally, 127.0.0.1 is used as endpoint, which should work fine both on OS X and Linux. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-30 15:15:42 -08:00
Andrey Smirnov	399aeda0b9	feat: rename confusing target options, --endpoints, etc. Fixes #1610 1. In `talosconfig`, deprecate `Target` in favor of `Endpoints` (client-side LB to come next). 2. In `osctl`, use `--nodes` in place of `--target`. 3. In `osctl` add option `--endpoints` to override `Endpoints` for the call. Other changes are just updates to catch up with the changes. Most probably I missed something... And CAPI provider needs update. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-10 02:23:54 +03:00
Andrey Smirnov	16f1f6996e	test: add retries to the test which verifies cluster version It fails on AWS, need to figure out if it's transient failure or not. While I was there, found lots of small bugs when endpoint is unresponsive, or target nodes are unresponsive and fixed them. In retry formatting added `\t` so that embedded errors are better aligned in the output (same as multierror). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-12-06 11:24:58 -08:00
Andrey Smirnov	5b7bea2471	feat: use grpc-proxy in apid This replaces codegen version of apid proxying with talos-systems/grpc-proxy based version. Proxying is transparent, it doesn't require exact information about methods and response types. It requires some common layout response to enhance it properly with node metadata or errors. There should be no significant changes to the API with the previous version, but it's worth mentioning a few changes: 1. grpc.ClientConn is established just once per upstream (either local service or remote apid instance). 2. When called without `-t` (`targets`), apid proxies immediately down to local service skipping proxying to itself (as before), which results in empty node metadata in response (before it had local node IP). Might revert this later to proxy to itself (?). 3. Streaming APIs are now fully supported with multiple targets, but message definition doesn't contain `ResponseMetadata`, so streaming APIs are broken now with targets (needs a fix). 4. Errors are now returned as responses with `Error` field set in `ResponseMetadata`, this requires client library update and `osctl` to handle it properly. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-29 22:57:25 +03:00
Andrey Smirnov	af2b6fa130	test: implement node discovery for integration tests This adds support for node discovery for API-based tests, but discovery is based on k8s state. Discovery can be overridden if we provide a list of node IPs as a flag. Also adds a test for K8s API server version. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-14 15:35:07 -08:00
Sekerin Evgeniy	83d5f4c721	feat: Add context key to osctl Added context key for change context on osctl Signed-off-by: Sekerin Evgeniy <sekerin.e.a@gmail.com>	2019-11-13 11:32:15 -08:00
Andrey Smirnov	551fa45d33	test: add CLI integration test This starts with a very simple test for `osctl version` using regexps as output of the command depends a lot on current version. We might use more of 'gold' matches for other commands potentially. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-05 17:59:23 -08:00
Andrey Smirnov	b0aef2cf22	test: add integration test framework This is just first steps and core foundation. It can be used like: ``` make integration.test osctl cluster create build/integration.test -test.v ``` This should run the test against the Docker instance. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-05 17:21:38 +03:00

39 Commits