talos

mirror of https://github.com/siderolabs/talos.git synced 2025-10-08 14:11:13 +02:00

Author	SHA1	Message	Date
Andrey Smirnov	96aa9638f7	chore: rename talos-systems/talos to siderolabs/talos There's a cyclic dependency on siderolink library which imports talos machinery back. We will fix that after we get talos pushed under a new name. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-11-03 16:50:32 +04:00
Andrey Smirnov	343c55762e	chore: replace talos-systems Go modules with siderolabs This the first step towards replacing all import paths to be based on `siderolabs/` instead of `talos-systems/`. All updates contain no functional changes, just refactorings to adapt to the new path structure. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-11-01 12:55:40 +04:00
Andrey Smirnov	a6b010a8b4	chore: update Go to 1.19, Linux to 5.15.58 See https://go.dev/doc/go1.19 Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>	2022-08-03 17:03:58 +04:00
Utku Ozdemir	8d2be5e315	feat: extend node definition used in health checks Introduce `cluster.NodeInfo` to represent the basic info about a node which can be used in the health checks. This information, where possible, will be populated by the discovery service in following PRs. Part of siderolabs#5554. Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>	2022-06-13 14:13:42 +02:00
Artem Chernyshev	27af5d41c6	feat: pause the boot process on some failures instead of rebooting Some failures can be fixed by updating the machine configuration. Now `userDisks` and `userFiles` do not make Talos to enter into reboot loop but pause for 35 minutes. Additionally, `apid` and `machined` are now started right after containerd is up and running. That makes it possible for the operator to connect to the node using talosctl and fix the config. Fixes: https://github.com/talos-systems/talos/issues/4669 Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>	2022-03-21 17:39:45 +03:00
Andrey Smirnov	10c28758a4	fix: ignore DeadlineExceeded error correctly on bootstrap The problem was that gRPC method `status.Code(err)` doesn't unwrap errors, while Talos client returns errors wrapped with `multierror.Error` and `fmt.Errrorf`, so `status.Code` doesn't return error code correctly. Fix that by introducing our own client method which correctly goes over the chain of wrapped errors. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-07-07 12:02:26 -07:00
Andrey Smirnov	60d7360944	fix: ignore deadline exceeded errors on bootstrap With the recent changes, bootstrap API might wait for the time to be in sync (as the apid is launched before time is sync). We set timeout to 500ms for the bootstrap API call, so there's a chance that a call might time out, and we should ignore it. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-30 06:59:36 -07:00
Andrey Smirnov	d8c2bca1b5	feat: reimplement apid certificate generation on top of COSI This PR can be split into two parts: * controllers * apid binding into COSI world Controllers ----------- * `k8s.EndpointController` provides control plane endpoints on worker nodes (it isn't required for now on control plane nodes) * `secrets.RootController` now provides OS top-level secrets (CA cert) and secret configuration * `secrets.APIController` generates API secrets (certificates) in a bit different way for workers and control plane nodes: controlplane nodes generate directly, while workers reach out to `trustd` on control plane nodes via `k8s.Endpoint` resource apid Binding ------------ Resource `secrets.API` provides binding to protobuf by converting itself back and forth to protobuf spec. apid no longer receives machine configuration, instead it receives gRPC-backed socket to access Resource API. apid watches `secrets.API` resource, fetches certs and CA from it and uses that in its TLS configuration. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-23 13:07:00 -07:00
Andrey Smirnov	5811f4dda1	feat: implement link (interface) controllers The structure of the controllers is really similar to addresses and routes: * `LinkSpec` resource describes desired link state * `LinkConfig` controller generates `LinkSpecs` based on machine configuration and kernel cmdline * `LinkMerge` controller merges multiple configuration sources into a single `LinkSpec` paying attention to the config layer priority * `LinkSpec` controller applies the specs to the kernel state Controller `LinkStatus` (which was implemented before) watches the kernel state and publishes current link status. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-01 09:36:25 -07:00
Andrey Smirnov	e0650218a6	feat: support etcd recovery from snapshot on bootstrap When Talos `controlplane` node is waiting for a bootstrap, `etcd` contents can be recovered from a snapshot created with `talosctl etcd snapshot` on a healthy cluster. Bootstrap process goes same way as before, but the etcd data directory is recovered from the snapshot. This flow enables disaster recovery for the control plane: given that periodic backups are available, destroy control plane nodes, re-create them with the same config, and bootstrap one node with the saved snapshot to recover etcd state at the time of the snapshot. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-08 10:15:37 -07:00
Andrey Smirnov	0aaf8fa968	feat: replace bootkube with Talos-managed control plane Control plane components are running as static pods managed by the kubelets. Whole subsystem is managed via resources/controllers from os-runtime. Many supporting changes/refactoring to enable new code paths. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-26 14:22:35 -08:00
Andrey Smirnov	8ebaa60b71	fix: retry connection refused errors while bootstrapping a cluster This fixes a random failure at least in the tests. As the nodes are booting, one node might boot earlier than others. As client is using all control plane endpoints with load-balancing, check for apid might succeed via one node, but next request might hit a different endpoint which still has cached connection error, and we should retry that. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-10-28 08:32:58 -07:00
Andrey Smirnov	f6ecf000c9	refactor: extract packages loadbalancer and retry This removes in-tree packages in favor of: * github.com/talos-systems/go-retry * github.com/talos-systems/go-loadbalancer Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-09-02 13:46:22 -07:00
Andrey Smirnov	5c6c522994	refactor: extract cluster bootstrapper via API as common component It should be useful in any project provisioning Talos clusters. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-19 14:32:58 -07:00

14 Commits