2626 Commits

Author SHA1 Message Date
Caleb Woodbine
da6f786cab fix: kuberentes => kubernetes typo
uh uh, small typo... nothing to see here.

Signed-off-by: Caleb Woodbine <calebwoodbine.public@gmail.com>
2021-07-19 05:59:35 -07:00
Artem Chernyshev
2e463348b2 fix: pass all logs through the options.Log method
Looks like I've missed some 🤦

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-07-15 08:32:48 -07:00
Andrey Smirnov
4e9c5afb6d fix: make ethtool optional in link status controller
When Talos runs in a container, `ethtool` availability depends on host
kernel support, and we don't strictly need `ethtool` to make networking
work, so make it optional instead of hard failure.

Example: https://gist.github.com/rgl/392d6e16d176f28430230b06ec80496c

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-15 08:32:15 -07:00
Artem Chernyshev
bf61c2cc4a fix: write upgrade logs only to the LogOutput if it's defined
No need to print them to stdout in that case.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-07-15 07:02:45 -07:00
Andrey Smirnov
9c73257cb1 feat: update Go to 1.16.6
See:

* https://github.com/talos-systems/tools/pull/140
* https://github.com/talos-systems/pkgs/pull/300
* https://github.com/talos-systems/extras/pull/21

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-14 06:44:22 -07:00
Artem Chernyshev
23ef1d40af chore: add ability to redirect talos upgrade module logs to io.Writer
This is going to be useful in the third party code which is using
upgrade modules, to collect output logs instead of printing them to the
stdout.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-07-13 08:12:06 -07:00
dependabot[bot]
33e9d6c984 chore: bump github.com/aws/aws-sdk-go in /hack/cloud-image-uploader
Bumps [github.com/aws/aws-sdk-go](https://github.com/aws/aws-sdk-go) from 1.39.0 to 1.39.4.
- [Release notes](https://github.com/aws/aws-sdk-go/releases)
- [Changelog](https://github.com/aws/aws-sdk-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/aws/aws-sdk-go/compare/v1.39.0...v1.39.4)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2021-07-12 05:06:06 -07:00
dependabot[bot]
604434c43e chore: bump github.com/prometheus/procfs from 0.6.0 to 0.7.0
Bumps [github.com/prometheus/procfs](https://github.com/prometheus/procfs) from 0.6.0 to 0.7.0.
- [Release notes](https://github.com/prometheus/procfs/releases)
- [Commits](https://github.com/prometheus/procfs/compare/v0.6.0...v0.7.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/procfs
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2021-07-12 04:33:39 -07:00
dependabot[bot]
2ea28f62d8 chore: bump node from 16.3.0-alpine to 16.4.2-alpine
Bumps node from 16.3.0-alpine to 16.4.2-alpine.

---
updated-dependencies:
- dependency-name: node
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2021-07-12 03:20:49 -07:00
Andrey Smirnov
b358a189bc fix: correctly pick route scope for link-local destination
Route scope doesn't depend on the destination IP being link-local, e.g.
in Azure the route to a link-local address is created with a gateway, and
that should be a global (universe) scope route.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-09 13:01:27 -07:00
Serge Logvinov
6848d43142 feat: can change clusterdns ip lists
Add the ability to change the clusterdns IP list on a node.

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
2021-07-09 12:33:34 -07:00
Andrey Smirnov
72b76abfd4 fix: workaround issues when IPv6 is fully or partially disabled
Fixes #3847

Fixes #3919

1. Looks like `::1/128` is assigned to the `lo` interface by the kernel
without our help, and the kernel does it properly whether IPv6 is enabled
or not (including on a particular interface).

2. If IPv6 is disabled completely with command line, we should ignore
failures to write ipv6 sysctls (as these are not security-related,
skipping them isn't a risk).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-09 12:33:22 -07:00
Alexey Palazhchenko
679b08f4fa docs: update docs for 0.12
Plus remove versions in a few places.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-07-09 09:39:51 -07:00
Andrey Smirnov
6fbec9e0cb fix: cache etcd client used for healthchecks
We run etcd health check every 30s, and create/destroy client every 30s.
This puts a lot of pressure on etcd itself and machined.

There's protobuf overhead, TLS connection overhead, etc.

As we don't support changing etcd PKI (yet), client created once is good
enough for the lifetime of the node.
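
The caching idea can be sketched as follows. This is an illustration under assumptions, not the Talos implementation: `client` and `newClient` are hypothetical stand-ins for the etcd client and its constructor.

```go
package main

import (
	"fmt"
	"sync"
)

// client is a hypothetical stand-in for an etcd client.
type client struct{ id int }

var created int

// newClient is a hypothetical constructor; a real one would set up TLS, etc.
func newClient() (*client, error) {
	created++

	return &client{id: created}, nil
}

// cachedClient builds the client on first use and reuses it afterwards.
type cachedClient struct {
	mu sync.Mutex
	c  *client
}

// Get returns the cached client, creating it if needed.
func (cc *cachedClient) Get() (*client, error) {
	cc.mu.Lock()
	defer cc.mu.Unlock()

	if cc.c != nil {
		return cc.c, nil
	}

	c, err := newClient()
	if err != nil {
		return nil, err // nothing is cached, so the next check retries
	}

	cc.c = c

	return c, nil
}

func main() {
	var cc cachedClient

	for i := 0; i < 3; i++ {
		c, _ := cc.Get()
		fmt.Println("health check using client", c.id)
	}

	fmt.Println("clients created:", created)
}
```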

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-09 07:40:00 -07:00
Alexey Palazhchenko
eea750de2c chore: rename "join" type to "worker"
Closes #3413.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-07-09 07:10:45 -07:00
Andrey Smirnov
951493ac83 docs: update what's new for Talos 0.11
This is just copy-paste from our changelog.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-08 14:47:48 -07:00
Andrey Smirnov
b47d1098b1 docs: promote 0.11 docs to be the latest
Also adds AWS AMIs for 0.11.0

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-08 13:37:12 -07:00
Andrey Smirnov
d930a26502 chore: implement DeepCopy for machine configuration
Resources code extensively uses DeepCopy to prevent in-memory copy of
the resource to be mutated outside of the resource model.

The previous implementation relied on YAML serialization to copy the
machine configuration, which was slow, could potentially lead to panics,
and put pressure on the garbage collector.

This implementation uses k8s code generator to generate DeepCopy methods
with some manual helpers when code generator can't handle it.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-08 07:21:24 -07:00
Andrey Smirnov
fe4ed3c734 chore: ignore tags which don't look like semantic version
This allows us to use tags for Go submodules `pkg/machinery/v0.11.0` and
still keep the Talos tag following semantic versioning `v0.11.0`.
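
As an illustration only (the commit does not say how the filtering is done), a tag filter of this kind could look like the sketch below. The regular expression and the `looksLikeSemver` helper are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// semverTag matches tags such as v0.11.0 or v0.7.0-alpha.0, but not
// prefixed submodule tags such as pkg/machinery/v0.11.0.
var semverTag = regexp.MustCompile(`^v\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$`)

// looksLikeSemver is a hypothetical helper for filtering release tags.
func looksLikeSemver(tag string) bool {
	return semverTag.MatchString(tag)
}

func main() {
	for _, tag := range []string{"v0.11.0", "pkg/machinery/v0.11.0", "v0.7.0-alpha.0"} {
		fmt.Printf("%-24s %v\n", tag, looksLikeSemver(tag))
	}
}
```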

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-08 07:20:04 -07:00
Andrey Smirnov
b969e7720e chore: update references to old protobuf package
This simply uses new protobuf package instead of old one.

Old protobuf package is still in use by Talos dependencies.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-08 05:34:12 -07:00
Alexey Palazhchenko
2ba8ac9ab4 docs: add documentation directory for 0.12
Plus, convert a few absolute URLs with a version number to relative URLs without versions.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-07-08 04:44:51 -07:00
Andrey Smirnov
011e2885e7 fix: validate bond slaves addressing
This extends network device machine configuration validation to make
sure that bond slaves don't have any addressing methods set, as this
might run into a conflict with the bond setup.

Also makes sure no interface is part of two bonds.
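
The two checks can be sketched roughly as below. The `Device` type and its fields are hypothetical and much simpler than the real machine configuration.

```go
package main

import "fmt"

// Device is a hypothetical, simplified network device configuration.
type Device struct {
	Name       string
	Addresses  []string // static addresses
	DHCP       bool
	BondSlaves []string // non-empty when the device is a bond
}

// validateBonds reports interfaces claimed by two bonds and bond slaves
// that have an addressing method configured.
func validateBonds(devices []Device) []error {
	var errs []error

	owner := map[string]string{} // slave interface -> bond that claims it

	for _, d := range devices {
		for _, s := range d.BondSlaves {
			if prev, ok := owner[s]; ok && prev != d.Name {
				errs = append(errs, fmt.Errorf("interface %q is part of two bonds: %q and %q", s, prev, d.Name))

				continue
			}

			owner[s] = d.Name
		}
	}

	for _, d := range devices {
		if bond, ok := owner[d.Name]; ok && (d.DHCP || len(d.Addresses) > 0) {
			errs = append(errs, fmt.Errorf("interface %q is a slave of %q and must not have addressing set", d.Name, bond))
		}
	}

	return errs
}

func main() {
	errs := validateBonds([]Device{
		{Name: "bond0", BondSlaves: []string{"eth0", "eth1"}},
		{Name: "bond1", BondSlaves: []string{"eth1"}},
		{Name: "eth0", DHCP: true},
	})

	for _, err := range errs {
		fmt.Println(err)
	}
}
```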

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-07 12:32:34 -07:00
Andrey Smirnov
10c28758a4 fix: ignore DeadlineExceeded error correctly on bootstrap
The problem was that the gRPC function `status.Code(err)` doesn't unwrap
errors, while the Talos client returns errors wrapped with
`multierror.Error` and `fmt.Errorf`, so `status.Code` doesn't return
the error code correctly.

Fix that by introducing our own client method which correctly goes over
the chain of wrapped errors.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-07 12:02:26 -07:00
Andrey Smirnov
77fabaceca chore: ignore future pkg/machinery/vX.Y.Z tags
Drone shouldn't build releases for `pkg/machinery/vX.Y.Z` tags.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-07 10:33:10 -07:00
Andrey Smirnov
6b661114d0 fix: make COSI runtime history depth smaller
This reduces Talos memory usage.

See https://github.com/cosi-project/runtime/pull/51

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-07 10:32:54 -07:00
Andrey Smirnov
9bf899bdd8 fix: make forfeit leadership connect to the right node
I believe a `clientv3.SetEndpoints()` call doesn't make the etcd client
connect to the endpoints mentioned immediately, it might still reuse the
old connection (?).

At the same time, `MaintenanceClient`, which implements the `MoveLeader`
call, doesn't support explicit endpoint setting (as other similar calls do),
so we have to manually force the connection to the leader node we need.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-06 13:50:57 -07:00
Alexey Palazhchenko
4708beaee5 feat: implement talosctl config info command
Closes #3852.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-07-06 00:58:47 -07:00
Andrey Smirnov
6d13d2cf92 fix: close Kubernetes API client
The problem is that there's no official way to close Kubernetes client
underlying TCP/HTTP connections. So each time Talos initializes
connection to the control plane endpoint, new client is built, but this
client is never closed, so the connection stays active on the load
balancers, on the API server level, etc. It also eats some resources out
of Talos itself.

We add a way to close underlying connections by using helper from the
Kubernetes client libraries to force close all TCP connections which
should shut down all HTTP/2 connections as well.

Alternative approach might be to cache a client for some time, but many
of the clients are created with temporary PKI, so even cached client
still needs to be closed once it gets stale, and it's not clear how to
recreate a client in case existing one is broken for one reason or
another (and we need to force a re-connection).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-05 14:25:26 -07:00
Andrey Smirnov
aaa36f3b4f fix: ignore 'not a leader' error on forfeit leadership
When forfeiting etcd leadership, it might be that the node still reports
leadership status while not being a leader once the actual API call is
used. We should ignore such an error as the node is not a leader.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-05 14:23:24 -07:00
Andrey Smirnov
22a4193678 fix: workaround 'Unauthorized' errors when accessing Kubernetes API
This should fix an error like:

```
failed to create etcd client: error getting kubernetes endpoints: Unauthorized
```

The problem is that the generated cert was used immediately, so even
slight time sync issue across nodes might render the cert not (yet)
usable. Cert is generated on one node, but might be used on any other
node (as it goes via the LB).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-05 14:15:03 -07:00
Alexey Palazhchenko
71c6f7004e chore: bump go.mod dependencies
Closes #3879, #3880, #3881, #3882, #3883, #3884, #3885, #3886.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-07-05 06:59:14 -07:00
Alexey Palazhchenko
915cd8fe20 docs: add guide for RBAC
Document how to enable RBAC without screwing up.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-07-05 05:56:29 -07:00
Serge Logvinov
f5721050de fix: controlplane keyusage
* kube-apiserver keyusage serverAuth
* kube-scheduler keyusage clientAuth
* kube-controller-manager keyusage clientAuth
* kubeconfig keyusage clientAuth

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
2021-07-01 12:49:29 -07:00
Andrey Smirnov
3d7726613c fix: fill uuid argument correctly in the config download URL
It was broken, because `?uuid=` URL parses to `{"uuid": []string{""}}`.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-01 10:50:39 -07:00
Serge Logvinov
d8602025c8 chore: update containerd config version 2
* Rename key cri -> io.containerd.grpc.v1.cri
* Disable plugins aufs,zfs,devmapper,btrfs (less warning messages on
  boot time)

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
2021-07-01 09:08:54 -07:00
Andrey Smirnov
5949ec4e6e docs: describe the new network configuration subsystem
Internal details, resources, examples inspecting the configuration.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-01 09:02:56 -07:00
Spencer Smith
444d72b4d7 feat: update pkgs version
This PR bumps pkgs to v0.7.0-alpha.0, so that we gain a fix for
hotplugging of nvme drives.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2021-07-01 07:55:00 -07:00
Andrey Smirnov
e883c12b31 fix: make output of upgrade-k8s command less scary
This removes `retrying error` messages while waiting for the API server
pod state to reflect changes from the updated static pod definition.

Log more lines to notify about the progress.

Skip `kube-proxy` if not found (as we allow it to be disabled).

```
$ talosctl upgrade-k8s -n 172.20.0.2 --from 1.21.0 --to 1.21.2
discovered master nodes ["172.20.0.2" "172.20.0.3" "172.20.0.4"]
updating "kube-apiserver" to version "1.21.2"
 > "172.20.0.2": starting update
 > "172.20.0.2": machine configuration patched
 > "172.20.0.2": waiting for API server state pod update
 < "172.20.0.2": successfully updated
 > "172.20.0.3": starting update
 > "172.20.0.3": machine configuration patched
 > "172.20.0.3": waiting for API server state pod update
 < "172.20.0.3": successfully updated
 > "172.20.0.4": starting update
 > "172.20.0.4": machine configuration patched
 > "172.20.0.4": waiting for API server state pod update
 < "172.20.0.4": successfully updated
updating "kube-controller-manager" to version "1.21.2"
 > "172.20.0.2": starting update
 > "172.20.0.2": machine configuration patched
 > "172.20.0.2": waiting for API server state pod update
 < "172.20.0.2": successfully updated
 > "172.20.0.3": starting update
 > "172.20.0.3": machine configuration patched
 > "172.20.0.3": waiting for API server state pod update
 < "172.20.0.3": successfully updated
 > "172.20.0.4": starting update
 > "172.20.0.4": machine configuration patched
 > "172.20.0.4": waiting for API server state pod update
 < "172.20.0.4": successfully updated
updating "kube-scheduler" to version "1.21.2"
 > "172.20.0.2": starting update
 > "172.20.0.2": machine configuration patched
 > "172.20.0.2": waiting for API server state pod update
 < "172.20.0.2": successfully updated
 > "172.20.0.3": starting update
 > "172.20.0.3": machine configuration patched
 > "172.20.0.3": waiting for API server state pod update
 < "172.20.0.3": successfully updated
 > "172.20.0.4": starting update
 > "172.20.0.4": machine configuration patched
 > "172.20.0.4": waiting for API server state pod update
 < "172.20.0.4": successfully updated
updating daemonset "kube-proxy" to version "1.21.2"
kube-proxy skipped as DaemonSet was not found
```

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-01 06:54:36 -07:00
Andrey Smirnov
7f8e50de4d fix: restart the merge controllers on conflict
Fixes #3861

What this change effectively does is that it changes immediate reconcile
request to an error return, so that controller will be restarted with a
backoff.

More details:

* root cause of the update/teardown conflict is that the finalizer is
still pending on the tearing down resource
* finalizer might not be removed immediately, e.g. if the controller
which put the finalizer is itself in the crash loop
* if the merge controller queues reconcile immediately, it restarts
itself, but the finalizer is still there, so it once again goes into
reconcile loop and that goes forever until the finalizer is removed, so
instead if the controller fails, it will be restarted with exponential
backoff lowering the load on the system

Change is validated with the unit-tests reproducing the conflict.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-30 08:23:33 -07:00
Andrey Smirnov
60d7360944 fix: ignore deadline exceeded errors on bootstrap
With the recent changes, the bootstrap API might wait for the time to be in
sync (as apid is launched before time is synced). We set the timeout to
500ms for the bootstrap API call, so there's a chance that a call might
time out, and we should ignore it.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-30 06:59:36 -07:00
Andrey Smirnov
ee06dd69fc fix: don't print git sha of the release twice in the dashboard
This is a small nit, Talos version already contains sha when it is
needed.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-30 03:45:34 -07:00
Andrey Smirnov
07fb61e5d2 fix: issue worker apid certs properly on renewal
This fixes an endless block in the RemoteGenerator.Close method by
rewriting the RemoteGenerator using the retry package.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-29 09:02:35 -07:00
Andrey Smirnov
84817f7334 chore: bump Talos version in upgrade tests
Preparing for 0.11 to be stable release soon.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-29 07:24:48 -07:00
Alexey Palazhchenko
2fa54107b2 chore: fix tests for disabled RBAC
This commit also introduces a hidden `--json` flag for `talosctl version` command
that is not supported and should be re-worked at #907.

Refs #3852.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-28 13:56:40 -07:00
Andrey Smirnov
78583ba985 fix: don't set bond delay options if miimon is not enabled
Basically all delay options are interlocked with `miimon`: if `miimon`
is zero, all delays are set to zero, and kernel complains even if zero
delay attribute is sent while miimon is zero.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-28 11:55:39 -07:00
Alexey Palazhchenko
bbf1c091d4 feat: add RBAC to talosctl version output
Refs #3852.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-28 07:10:25 -07:00
Andrey Smirnov
5f6ec3ef66 fix: handle cases when merged resource re-appears before being destroyed
The sequence of events to reproduce the problem:

* some resource was merged as final representation with ID `x`
* underlying source resource gets destroyed
* merge controller marks final resource `x` for teardown and waits
for the finalizers to be empty
* another source resource appears which gets merged to same final `x`
* as `x` is in the teardown phase, spec controller will ignore it
* merge controller doesn't see the problem as well, as `x` spec is
correct, but the phase is wrong (which merge controller ignores)

This pulls in COSI fix to return an error if a resource in teardown
phase is modified. This way merge controller knows that the resource `x`
is in the teardown phase, so it should be first fully torn down, and
then new representation should be re-created as new resource with same ID
`x`.

Regression unit-tests included (they don't reproduce the sequence of
events always reliably, but they do with 10% probability).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-28 06:45:44 -07:00
Rui Lopes
1e9a0e745d fix: documentation typos
Fix a couple of documentation typos.

Signed-off-by: Rui Lopes <rgl@ruilopes.com>
2021-06-28 02:50:31 -07:00
Alexey Palazhchenko
f228af4061 chore: bump go.mod dependencies
Closes #3848, #3849, #3850, #3851.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-28 02:25:43 -07:00
Spencer Smith
2060ceaa0b chore: add CAPI version to CI setup
This PR makes sure we pin to a known CAPI version because with the new
v0.4.x released, we'll fail until we support the v1alpha4 APIs.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2021-06-25 10:44:07 -04:00