50 Commits

Author SHA1 Message Date
Andrey Smirnov
f62d17125b
chore: update crypto to use new import path siderolabs/crypto
No functional changes in this PR, just updating import paths.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-07 23:02:50 +04:00
Dmitriy Matrenichev
b59ca5810e
chore: move from inet.af/netaddr to net/netip and go4.org/netipx
Closes #6007

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-08-25 17:51:32 +03:00
Andrey Smirnov
053af1d59e
fix: update etcd certificates when node addresses changes
Fixes #6110

I somehow missed the fact that etcd certs were not made fully reactive
to node address changes (I wrongly assume it was already the fact).

This PR refactors etcd certificate generation process to be
resource-based and introduces unit-tests for the controller.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-25 00:27:52 +04:00
Utku Ozdemir
84e712a9f1
feat: introduce Talos API access from Kubernetes
We add a new CRD, `serviceaccounts.talos.dev` (with `tsa` as short name), and its controller which allows users to get a `Secret` containing a short-lived Talosconfig in their namespaces with the roles they need. Additionally, we introduce the `talosctl inject serviceaccount` command to accept a YAML file with Kubernetes manifests and inject them with Talos service accounts so that they can be directly applied to Kubernetes afterwards. If Talos API access feature is enabled on Talos side, the injected workloads will be able to talk to Talos API.

Closes siderolabs/talos#4422.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2022-08-08 18:27:26 +02:00
Andrey Smirnov
7795de313a
fix: use controllers/resources for etcd configuration
This extracts etcd configuration and finalized run arguments as
resources managed by controllers.

The biggest change in terms of UX is that Talos now waits for the etcd
configured subnet to be actually available before starting etcd.
Previously etcd quickly failed if the requested subnet was not available
on the host.

Coupled with other fixes (#5951, #5988), this should bring etcd
join/promote sequence back into proper shape.

I also reverted all temporary measures for discovering etcd endpoints,
now etcd join doesn't depend on Kubernetes (once again).

Fixes #5889

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-04 21:14:43 +04:00
Andrey Smirnov
6fc38bae69
fix: iterate over etcd members endpoints for member promotion
This uses all available (potential) etcd endpoints, which includes the
member being promoted as well. We avoid failures by iterating over the
list of endpoints on each attempt to make sure each and every endpoint
is tried.

Part of #5889

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-02 00:16:33 +04:00
Andrey Smirnov
626ef05e60
fix: correct SANs for etcd certs
I would like to rewrite whole cert generation process, but for now a few
fixes:

* client cert doesn't need any SANs
* peer cert should contain only non-localhost SANs
* server cert same as before (localhost + addresses)

See https://etcd.io/docs/v3.5/op-guide/security/ for details.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-07-11 14:51:27 +04:00
Andrey Smirnov
8a038d40ee
fix: stabilize etcd join and promote sequences
There were two issues with using discovery service for join and promote:

* on join, that resulted in joining too fast which triggers race bugs in
  etcd cert generation (to be fixed as separate PR)
* on promote, Talos has to connect to non-learner member of the cluster
  which is somehow "automatic" with Kuberentes discovery, as it only
  lists `kube-apiserver` running which is up only when etcd on the same
  node is healthy. etcd client doesn't allow to avoid learner members,
  as even getting a member list from a learner doesn't work (to be fixed
  as a separate PR)

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-07-08 21:28:13 +04:00
Andrey Smirnov
b085343dcb
feat: use discovery information for etcd join (and other etcd calls)
Talos historically relied on `kubernetes` `Endpoints` resource (which
specifies `kube-apiserver` endpoints) to find other controlplane members
of the cluster to connect to the `etcd` nodes for the cluster (when node
local etcd instance is not up, for example). This method works great,
but it relies on Kubernetes endpoint being up. If the Kubernetes API is
down for whatever reason, or if the loadbalancer malfunctions, endpoints
are not available and join/leave operations don't work.

This PR replaces the endpoints lookup to use the `Endpoints` COSI
resource which is filled in using two methods:

* from the discovery data (if discovery is enabled, default to enabled)
* from the Kubernetes `Endpoints` resource

If the discovery is disabled (or not available), this change does almost
nothing: still Kubernetes is used to discover control plane endpoints,
but as the data persists in memory, even if the Kubernetes control plane
endpoint went down, cached copy will be used to connect to the endpoint.

If the discovery is enabled, Talos can join the etcd cluster immediately
on boot without waiting for Kubernetes to be up on the bootstrap node
which means that Talos cluster initial bootstrap runs in parallel on all
control plane nodes, while previously nodes were waiting for the first
node to finish bootstrap enough to fill in the endpoints data.

As the `etcd` communication is anyways protected with mutual TLS,
there's no risk even if the discovery data is stale or poisoned, as etcd
operations would fail on TLS mismatch.

Most of the changes in this PR actually enable populating Talos
`Endpoints` resource based on the `Kubernetes` `endpoints` resource
using the watch API.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-04-21 22:00:27 +03:00
Andrey Smirnov
2cd3f9be1f
feat: filter out SideroLink addresses by default
As SideroLink addresses are ephemeral and point-to-point, filter them
out for node addresses, Kubelet, etcd, etc.

Fixes #4448

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-11-30 15:31:31 +03:00
Alexey Palazhchenko
7462733bcb
chore: update golangci-lint
Fix context propagation.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>
2021-11-15 14:55:25 +00:00
Andrey Smirnov
ef02295928
fix: print etcd member ID in hex
It's confusing in base 10, as every other place prints member ID in hex.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-09-13 15:25:49 +03:00
Alexey Palazhchenko
eea750de2c chore: rename "join" type to "worker"
Closes #3413.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-07-09 07:10:45 -07:00
Andrey Smirnov
9bf899bdd8 fix: make forfeit leadership connect to the right node
I believe `clientv3.SetEndpoints()` calls doesn't make etcd client
connect to the endpoints mentioned immediately, it might stil reuse old
connection (?).

At the same time `MaintenanceClient` which implements `MoveLeader` calls
doesn't support explicit endpoint setting (as other similar calls do),
so we have to manually force the connection to the leader node we need.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-06 13:50:57 -07:00
Andrey Smirnov
6d13d2cf92 fix: close Kubernetes API client
The problem is that there's no official way to close Kuberentes client
underlying TCP/HTTP connections. So each time Talos initializes
connection to the control plane endpoint, new client is built, but this
client is never closed, so the connection stays active on the load
balancers, on the API server level, etc. It also eats some resources out
of Talos itself.

We add a way to close underlying connections by using helper from the
Kubernetes client libraries to force close all TCP connections which
should shut down all HTTP/2 connections as well.

Alternative approach might be to cache a client for some time, but many
of the clients are created with temporary PKI, so even cached client
still needs to be closed once it gets stale, and it's not clear how to
recreate a client in case existing one is broken for one reason or
another (and we need to force a re-connection).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-05 14:25:26 -07:00
Andrey Smirnov
aaa36f3b4f fix: ignore 'not a leader' error on forfeit leadership
When forfeiting etcd leadership, it might be that the node still reports
leadership status while not being a leader once the actual API call is
used. We should ignore such an error as the node is not a leader.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-05 14:23:24 -07:00
Serge Logvinov
b52b206665 feat: split etcd certificates to peer/client
Changes:
* Etcd peer port key usage: ServerAuth,ClientAuth
* Etcd client port key usage: ServerAuth,ClientAuth
* Talos etcd client key usage: ClientAuth
* KubeAPI etcd client key usage: ClientAuth
* List of etcd allowed ciphers

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-23 13:26:48 -07:00
Andrey Smirnov
4ac9bea27d fix: stop etcd client logs from going to the server console
When etcd calls start failing, these log messages start spamming the
console a lot. Disable the client log until we figure out a better
destination for them.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-17 10:04:14 -07:00
Alexey Palazhchenko
f63ab9dd9b feat: implement talosctl config new command
Refs #3421.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-17 09:06:43 -07:00
Andrey Smirnov
59cfd312c1 chore: bump dependencies via dependabot
There were some upstream code changes in etcd, some code got moved
around.

PRs #3651 #3652 #3653 #3654 #3655 #3655 #3656 #3657 #3658
    #3659 #3660 #3661 #3662 #3663

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-05-24 12:15:15 -07:00
Andrey Smirnov
6cb266e74e fix: update etcd client errors, print etcd join failures
Better error message to understand where the error is coming from, also
print errors to console when etcd is trying to join - this is invaluable
to understand why etcd doesn't join the cluster.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-15 11:54:25 -07:00
Andrey Smirnov
ce795f1cea fix: command etcd remove-member shouldn't remove etcd data directory
There are two APIs and `talosctl` commands:

* `etcd leave` removes the member from the cluster and removes etcd
data directory for the called node
* `etcd remove-member <node>` removes some other node from the etcd
cluster, but it doesn't affect called node state

This fixes confusing naming of the methods vs. what they're doing.

Fixes #3340

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-22 02:11:06 -07:00
Artem Chernyshev
22f375300c chore: update golanci-lint to 1.38.0
Fix all discovered issues.
Detected couple bugs, fixed them as well.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-12 06:50:02 -08:00
Alexey Palazhchenko
df52c13581 chore: fix //nolint directives
That's the recommended syntax:
https://golangci-lint.run/usage/false-positives/

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-05 05:58:33 -08:00
Artem Chernyshev
376fdcf6cb feat: implement etcd remove-member cli command
Fixes: https://github.com/talos-systems/talos/issues/3219

We already have `etcd leave`, which makes the node exclude itself from
etcd members.
But in case if the node can't remove itself because it doesn't have
connection to etcd we need this etcd remove-member cli, which basically removes
a node from a different node.

No unit tests for that as it's going to destroy the test cluster.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-01 07:55:08 -08:00
Andrey Smirnov
953ce643ab feat: bump etcd client library to 3.5.0-alpha.0
This version is finally using working `go.mod` files and tags, so no
more hacks with imports, and allows us to bump `grpc` library to the
latest version (I also did for this PR).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-25 10:36:15 -08:00
Andrey Smirnov
2277ce8abe feat: move to ECDSA keys for all Kubernetes/etcd certs and keys
ECDSA keys are smaller which decreases Talos config size, they are more
efficient in terms of key generation, signing, etc., so it makes boot
performance better (and config generation as well).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-02 13:25:00 -08:00
Andrey Smirnov
0aaf8fa968 feat: replace bootkube with Talos-managed control plane
Control plane components are running as static pods managed by the
kubelets.

Whole subsystem is managed via resources/controllers from os-runtime.

Many supporting changes/refactoring to enable new code paths.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-26 14:22:35 -08:00
Andrey Smirnov
a2efa44663 chore: enable gci linter
Fixes were applied automatically.

Import ordering might be questionable, but it's strict:

* stdlib
* other packages
* same package imports

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-09 08:09:48 -08:00
Andrey Smirnov
8560fb9662 chore: enable nlreturn linter
Most of the fixes were automatically applied.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-09 06:48:07 -08:00
Seán C McCord
a7a27e7edd feat: extend etcd health check on upgrade
When an etcd node is upgraded, we now perform additional quorum checks.
This is necessary because when etcd nodes are upgraded, they are removed
from membership.  If, for instance, two etcd nodes were to upgrade
simultaneously, quorum may be lost.  This, of course, does not apply to
single-node etcd clusters.

Fixes #1422

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2020-10-23 15:49:55 -07:00
Andrew Rynhard
4eeef28e90 feat: add etcd API
This adds RPCs for basic etcd management tasks.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-10-06 11:30:04 -07:00
Andrew Rynhard
d4f103ffcb fix: pass config via stdin
In order to perform upgrades the way we would like, it is important that
we avoid any bind mounts into containers. This change ensures that all
system services get their config via stdin.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-08-20 15:26:13 -07:00
Andrey Smirnov
bddd4f1bf6 refactor: move external API packages into machinery/
This moves `pkg/config`, `pkg/client` and `pkg/constants`
under `pkg/machinery` umbrella.

And `pkg/machinery` is published as Go module inside Talos repository.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-17 09:56:14 -07:00
Andrey Smirnov
2697b99b7d refactor: extract pkg/net as github.com/talos-systems/net
This extracts common package as new module/repository.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-14 11:04:50 -07:00
Andrey Smirnov
52c5911fcd chore: extract pkg/crypto as external module
Package `pkg/crypto` was extracted as `github.com/talos-systems/crypto`
repository and Go module.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-14 06:33:30 -07:00
Andrey Smirnov
7474b8ba52 feat: upgrade etcd to 3.4.10
This upgrades etcd to latest v3.4.x version as smooth upgrade from
version 3.3.22 in 0.6.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-13 07:33:51 -07:00
Andrey Smirnov
47608fb874 refactor: make pkg/config not rely on machined/../internal/runtime
This makes `pkg/config` directly importable from other projects.

There should be no functional changes.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-29 12:40:12 -07:00
Andrey Smirnov
ddbe9cfc2f fix: update timeouts on service startup to match boot timeout
There's a global timeout for all services to be up: it's 5 minutes. We
need to make sure each service startup takes less than that, otherwise
boot sequence is aborted and there's no way to see the error message for
each particular service.

Also propagate contexts correctly and set some default timeouts to make
sure API operations are not hanging forever.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-08 07:39:36 -07:00
Seán C McCord
8ba8742f2e fix: wrap etcd address URLs with formatting
Make certain that all etcd address URLs are properly wrapped to handle
IPv6 addresses.

Fixes #2120

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2020-05-18 10:30:16 -07:00
Andrew Rynhard
49307d554d refactor: improve machined
This is a rewrite of machined. It addresses some of the limitations and
complexity in the implementation. This introduces the idea of a
controller. A controller is responsible for managing the runtime, the
sequencer, and a new state type introduced in this PR.

A few highlights are:

- no more event bus
- functional approach to tasks (no more types defined for each task)
  - the task function definition now offers a lot more context, like
    access to raw API requests, the current sequence, a logger, the new
    state interface, and the runtime interface.
- no more panics to handle reboots
- additional initialize and reboot sequences
- graceful gRPC server shutdown on critical errors
- config is now stored at install time to avoid having to download it at
  install time and at boot time
- upgrades now use the local config instead of downloading it
- the upgrade API's preserve option takes precedence over the config's
  install force option

Additionally, this pulls various packes in under machined to make the
code easier to navigate.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-28 08:20:55 -07:00
Andrew Rynhard
69fa63a7b2 refactor: perform upgrade upon reboot
This PR introduces a new strategy for upgrades. Instead of attempting to
zap the partition table, create a new one, and then format the
partitions, this change will only update the `vmlinuz`, and
`initramfs.xz` being used to boot. It introduces an A/B style upgrade
process, which will allow for easy rollbacks. One deviation from our
original intention with upgrades is that this change does not completely
reset a node. It falls just short of that and does not reset the
partition table. This forces us to keep the current partition scheme in
mind as we make changes in the future, because an upgrade assumes a
specific partition scheme. We can improve upgrades further in the
future, but this will at least make them more dependable. Finally, one
more feature in this PR is the ability to keep state. This enables
single node clusters to upgrade since we keep the etcd data around.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-20 17:32:18 -07:00
Spencer Smith
12bfd8dd94 feat: allow for persistence of config data
This PR will allow users to set the `persist: true` value in their
config data to tell talos not to re-pull the config data at each reboot.
The default will still remain as a "pull every time" methodolgy in order
to encourage immutability by default.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-06 11:42:00 -05:00
Spencer Smith
7719a67834 fix: refuse to upgrade if single master
This PR adds some simple logic to bail early in the upgrade process if
there only seems to be a single etcd node present in the cluster. This
allows us to keep from blowing up non-HA clusters if users issue an
upgrade command.

Will close #1770.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-01-13 15:31:01 -05:00
Andrew Rynhard
898cf01f0a refactor: unify generate type and machine type
We have been using two packages that define a config type and a machine
type, when really they are one and the same. This unifies the types down
to one set.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-01-10 16:46:28 -08:00
Andrew Rynhard
3e5ca30aa5 refactor: simplify NewTemporaryClientFromPKI
This is a simple refactor that reduces the number of arguments required
by `NewTemporaryClientFromPKI`.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-12-03 09:10:24 -08:00
Andrew Rynhard
c9732458c1 fix: verify that all etcd members are running before upgrading
This verifies that all etcd members are running before performing an
upgrade. Without this we run the risk of destroying the etcd cluster.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-04 18:17:13 -08:00
Andrew Rynhard
33468f4d6a fix: don't use 127.0.0.1 for etcd client
We should use 127.0.0.1 only in special cases (like when bootstrapping
the cluster). There is the potential that the local etcd member is
unhealthy and/or not responsive. This adds function for creating an etcd
client configured with all control plane node IPs in order to better
handle this case.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-04 16:54:15 -08:00
Andrew Rynhard
ce911c02da refactor: use etcd package
This DRYs things up by using the etcd package for client creation.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-01 21:02:44 -07:00
Andrew Rynhard
5abbb9b041 fix: Avoid running bootkube on reboots
Since bootkube should only be ran once, we need a way to determine if it
has already been ran. This makes use of etcd to store a key-value pair
indicating that the cluster has been initialized.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-01 15:20:43 -07:00