306 Commits

Author SHA1 Message Date
Andrey Smirnov
a12eb76734 test: add unit-test for the installer manifest
This test only works on local machine (see notes in the file).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-15 13:31:42 -07:00
Andrew Rynhard
4eeef28e90 feat: add etcd API
This adds RPCs for basic etcd management tasks.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-10-06 11:30:04 -07:00
Andrey Smirnov
90d0efec48 feat: pull kubeconfig from the cluster on successful cluster create
Kubeconfig is merged into `~/.kube/config` with rename option
(existing configuration is never overwritten).

If endpoint was used, it is automatically put into the `kubeconfig`.

This should make OS X experience literally `talosctl cluster create`
followed by any `kubectl get ...`.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-06 05:45:28 -07:00
Andrey Smirnov
018086d1fa refactor: extract blockdevice library
Library `blockdevice` was extracted as `talos-systems/go-blockdevice`,
this PR finalizes the move by removing Talos copy of it.

Some functions around `mkfs`/`growfs` were extracted as `makefs`
package, as they depend on `cmd` package.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-05 11:18:43 -07:00
Andrey Smirnov
16eb47a1a3 feat: use kubeconfig merge in talosctl kubeconfig by default
Kubeconfig merge was completely rewritten to be "smarter":

* automatically apply renames done at previous stages to avoid asking
over and over again (in general should ask just once)

* skip checks if parts of the config match exactly

* allow overwrite as an option

* flexible way to control the output

* activating context in the end

* custom merged context name

Fixes #2578

Fixes #2587

Fixes #2577

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-03 05:36:15 -07:00
Seán C McCord
ff92d2a14b feat: add ApplyConfiguration API
Adds the ability to apply (replace) an existing node configuration with
a new one via the Machine API.

Fixes #2345

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2020-09-29 14:44:06 -07:00
Andrey Smirnov
54887c094d fix: provide unique username in generate kubeconfig
This allows for more clean merge of multiple kubeconfigs from different
Talos clusters.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-29 13:10:36 -07:00
Andrey Smirnov
98443cd0e9 fix: retry container image import
This bug is sometimes reproducible with QEMU/arm64, as it runs really
slow. Looks like multiple concurrent image unpacks sharing some layers
might fail unexpectedly.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-28 08:58:47 -07:00
Andrey Smirnov
8236822c90 fix: retry image pulling, stop on 404, no duplicate pulls
This uses go-retry feature
(https://github.com/talos-systems/go-retry/pull/3) to print errors being
retried.

If image is not found in the index, abort retries immediately.

Don't pull installer image twice (if already pulled by the validation
code before).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-22 07:07:45 -07:00
Andrey Smirnov
f6ecf000c9 refactor: extract packages loadbalancer and retry
This removes in-tree packages in favor of:

* github.com/talos-systems/go-retry
* github.com/talos-systems/go-loadbalancer

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-02 13:46:22 -07:00
Andrew Rynhard
1a4059a553 feat: add grub bootloader
This moves to using grub instead of syslinux.

BREAKING CHANGE: Single node upgrades will fail in this change. This
will also break the A/B fallback setup since this version introduces
an entirely new partition scheme, that any fallback will not know about.
We plan on addressing these issues in a follow up change.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-09-01 12:06:43 -07:00
Andrew Rynhard
d4f103ffcb fix: pass config via stdin
In order to perform upgrades the way we would like, it is important that
we avoid any bind mounts into containers. This change ensures that all
system services get their config via stdin.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-08-20 15:26:13 -07:00
Andrey Smirnov
bddd4f1bf6 refactor: move external API packages into machinery/
This moves `pkg/config`, `pkg/client` and `pkg/constants`
under `pkg/machinery` umbrella.

And `pkg/machinery` is published as Go module inside Talos repository.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-17 09:56:14 -07:00
Andrey Smirnov
2697b99b7d refactor: extract pkg/net as github.com/talos-systems/net
This extracts common package as new module/repository.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-14 11:04:50 -07:00
Andrey Smirnov
52c5911fcd chore: extract pkg/crypto as external module
Package `pkg/crypto` was extracted as `github.com/talos-systems/crypto`
repository and Go module.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-14 06:33:30 -07:00
Andrey Smirnov
7474b8ba52 feat: upgrade etcd to 3.4.10
This upgrades etcd to latest v3.4.x version as smooth upgrade from
version 3.3.22 in 0.6.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-13 07:33:51 -07:00
Andrey Smirnov
9379cf9ee1 refactor: expose provision as public package
This change is only moving packages and updating import paths.

Goal: expose `internal/pkg/provision` as `pkg/provision` to enable other
projects to import Talos provisioning library.

As cluster checks are almost always required as part of provisioning
process, package `internal/pkg/cluster` was also made public as
`pkg/cluster`.

Other changes were direct dependencies discovered by `importvet` which
were updated.

Public packages (useful, general purpose packages with stable API):

* `internal/pkg/conditions` -> `pkg/conditions`
* `internal/pkg/tail` -> `pkg/tail`

Private packages (used only on provisioning library internally):

* `internal/pkg/inmemhttp` -> `pkg/provision/internal/inmemhttp`
* `internal/pkg/kernel/vmlinuz` -> `pkg/provision/internal/vmlinuz`
* `internal/pkg/cniutils` -> `pkg/provision/internal/cniutils`

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-12 05:12:05 -07:00
Andrey Smirnov
050d34275a chore: integrate importvet
This integrates [importvet](https://github.com/talos-systems/importvet)
into `lint` target.

First rule file was added for public packages `pkg/` which shouldn't
depend on other parts of Talos tree (except for the API definitions).

Only one change: `internal/cis` was moved under single user -
`pkg/config/internal/cis` to satisfy the rules.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-11 13:19:15 -07:00
Andrew Rynhard
92523bc422 refactor: remove structs from config provider
This make the config provider a pure interface definition by removing
all concrete internal types, and making them an interface.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-08-06 13:21:41 -07:00
Andrey Smirnov
b5d082de8a fix: update qemu launcher on arm64 to boot Talos properly
This includes better machine args, support for UEFI parallel flash
images required as low-level bootloader, and miscallenous cleanups.

Qemu support was enabled for mapping host random source to the guest as
entropy source to prevent stalls on the boot waiting for the entropy.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-04 11:45:19 -07:00
Andrey Smirnov
a5d64d97c1 test: update qemu/firecracker provisioners
Fixes #2363 #2364 #2370 #2371

Several changes packed together:

* use compressed `vmlinuz` everywhere, firecracker provisioner
uncompresses it before first use, drop `vmlinux`

* handle reboots in qemu launcher to support reset API case, update
empty disk check to handle reset behavior (erasing partition table)

* make bootloader support default in provisioners, and flag to disable
that

* early support for target architecture for qemu provisioner

This should allow us to use `qemu` in CI/CD (not included into this PR):
integration test passes with qemu.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-30 21:17:25 +03:00
Andrew Rynhard
849959fefc feat: add dynamic config decoder
This adds the ability to dynamically decode mult-doc YAML files.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-30 08:07:14 -07:00
Andrey Smirnov
3926442704 feat: taint master nodes with NoSchedule taint
Fixes #2350

This also brings in a fix for `coredns` tolerations from
https://github.com/talos-systems/bootkube-plugin/pull/19.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-29 14:02:41 -07:00
Andrey Smirnov
47608fb874 refactor: make pkg/config not rely on machined/../internal/runtime
This makes `pkg/config` directly importable from other projects.

There should be no functional changes.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-29 12:40:12 -07:00
Artem Chernyshev
c6eb18eed5 feat: qemu provisioner
Starts and stops qemu VMs, has some initial configuration subset.
Sets up networking through CNI tools, sets up DHCP server which gives IP
addresses to nodes.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-07-28 14:55:35 -07:00
Andrey Smirnov
76c44ac468 test: remove apid load balancer for firecracker
We're not using load balancer for `apid` (always using client-side load
balancing), so we can remove this safely.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-28 20:21:21 +03:00
Andrey Smirnov
020aea1b89 fix: fail ntpd service if initial time sync fails
Talos depends on accurate time for many actions, so many services depend
on timed successful health check. If timed fails to do initial sync, it
enters pretty long wait loop for the next attempt which might not come
in time for the boot timeout. Instead, fail timed service on initial
sync and rely on service restart for another attempt.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-28 08:44:58 -07:00
Andrew Rynhard
d8b689e9d1 fix: generate admin kubeconfig with default namespace
This ensures that the generated kubeconfig has a namespace. This fixes
an edge case when a user attempts to use the kubeconfig from within a
pod of a different kubernetes cluster. If the kubeconfig does not have a
namespace, kubectl will use the "in cluster namespace" which is
unexpected, especially if the "in cluster namespace" does not exist in
the target cluster.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-27 20:08:17 -07:00
Andrey Smirnov
c85608b8d9 test: add an option to bind docker to specific host IP
This allows to override default `0.0.0.0` (`*`) to a specific IP to
avoid conflicts.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-27 21:13:28 +03:00
Andrey Smirnov
13c0052a6c test: fix racy test ReaderNoFollow
Due to the race between `Read()` and context cancellation, error might
be returned which we can safely ignore.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-27 21:13:11 +03:00
Artem Chernyshev
c70c08c8ce chore: extract loadbalancer, network, crashdup and process from firecracker
Second part of refactoring to split common logic for VM provisioners
from Firecracker provisioner.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-07-20 11:03:03 -07:00
Andrey Smirnov
f047c42ae7 test: provider correct installer kernel args for firecracker
Firecracker never executes the bootloader, so kernel args passed to the
installer aren't used, but if the same disk image is used to boot Talos
e.g. in `qemu`, it fails to set up console properly for example.

This PR simply provides those kernel args to the installer so that
they're persisted in the image.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-20 16:52:08 +03:00
Artem Chernyshev
19cd46459b chore: initial extraction of base vm provisioner
Created base provisioner struct for all VM based provisioners.
Moved state.go and reflect.go to the common module.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-07-18 15:45:54 -07:00
Artem Chernyshev
3d25ceb13e chore: move inmemhttp from firecracker provisioner to internal/pkg/
To be reused in qemu provisioner later on.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-07-18 07:11:50 -07:00
Andrey Smirnov
2f4fb34baf fix: update container name in docker crashdump
Small bug resulted in container names being cut in the wrong way.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 12:49:29 -07:00
Andrey Smirnov
41d5f7859a chore: update golangci-lint to 1.28.3
Fixes #2272

`gofumpt` is now included into `golangci-lint`, but not the
`gofumports`, so we keep it using it as separate binary, but we keep
versions in sync with `golangci-lint`.

This contains fixes from:

* `gofumpt` (automated, mostly around octal constants)
* `exhaustive` in `switch` statements
* `noctx` (adding context with default timeout to http requests)

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 08:05:42 -07:00
Andrey Smirnov
c54639e541 feat: implement server-side API for cluster health checks
This implements existing server-side health checks as defined in
`internal/pkg/cluster/checks` in Talos API.

Summary of changes:

* new `cluster` API

* `apid` now listens without auth on local file socket

* `cluster` API is for now implemented in `machined`, but we can move it
to the new service if we find it more appropriate

* `talosctl health` by default now does server-side health check

UX: `talosctl health` without arguments does health check for the
cluster if it has healthy K8s to return master/worker nodes. If needed,
node list can be overridden with flags.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-15 13:52:13 -07:00
Andrey Smirnov
a4a2a3c83a feat: uncordon nodes automatically on boot
Talos will mark node as schedulable if it was previously cordoned by
Talos (for upgrade, reset, etc.)

If user marked node as not schedulable, Talos won't change it on boot.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 15:32:36 -07:00
Andrey Smirnov
5ecddf2866 feat: add round-robin LB policy to Talos client by default
Handling of multiple endpoints has already been implemented in #2094.

This PR enables round-robin policy so that grpc picks up new endpoint
for each call (and not send each request to the first control plane
node).

Endpoint list is randomized to handle cases when only one request is
going to be sent, so that it doesn't go always to the first node in the
list.

gprc handles dead/unresponsive nodes automatically for us.

`talosctl cluster create` and provision tests switched to use
client-side load balancer for Talos API.

On the additional improvements we got:

* `talosctl` now reports correct node IP when using commands without
`-n`, not the loadbalancer IP (if using multiple endpoints of course)

* loadbalancer can't provide reliable handling of errors when upstream
server is unresponsive or there're no upstreams available, grpc returns
much more helpful errors

Fixes #1641

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 08:35:15 -07:00
Andrey Smirnov
aa687cf8cd fix: update the control plane cluster health check
Include kube-apiserver in the list of daemon sets to be checked, and
for each daemon set verify number of pods running and ready, as when
control plane is damaged daemon set properties are not updated properly.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-08 17:53:21 +03:00
Andrey Smirnov
ddbe9cfc2f fix: update timeouts on service startup to match boot timeout
There's a global timeout for all services to be up: it's 5 minutes. We
need to make sure each service startup takes less than that, otherwise
boot sequence is aborted and there's no way to see the error message for
each particular service.

Also propagate contexts correctly and set some default timeouts to make
sure API operations are not hanging forever.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-08 07:39:36 -07:00
Andrey Smirnov
219425f629 test: resolve old TODO item
I had to copy over some oci stuff from newer package version, but as we
for a long time use newer oic, we don't need a copy anymore.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-02 11:09:58 -07:00
Andrey Smirnov
ba12095ac7 test: stabilize race unit-tests (circular, events)
Fixes #2243

These tests rely on some kind of sync between readers and writers, as if
circular buffer is overrun, test no longer runs as expected.

We use time-sensitive rate limiter to limit write speed to make sure
readers can always catch up. Lowering the rate should slow down writers
and make tests more likely to succeed.

For #2243, the failure was from buffer overrun: when overrun is
detected, `Watch` function closes the channel (and test "receives" zero
element).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-01 13:39:49 -07:00
Andrew Rynhard
888c8b948a feat: add /system directory
This adds the `/system` directory to provide a dedicated
directory for all system related runtime files.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-01 09:51:56 -07:00
Andrey Smirnov
81d1c2bfe7 chore: enable godot linter
Issues were fixed automatically.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-30 10:39:56 -07:00
Andrey Smirnov
4ad4511b38 chore: enable nolintlint linter
It makes sure our `//nolint:` directives are not redundant.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-30 07:39:19 -07:00
Andrey Smirnov
0a4645fe80 feat: implement circular buffer for system logs
This replaces logging to files with inotify following to pure in-memory
circular buffer which grows on demand capped at specified maximum
capacity.

The concern with previous approach was that logs on tmpfs were growing
without any bound potentially consuming all the node memory.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-26 15:33:54 -07:00
Andrew Rynhard
d0d2ac3c74 test: default to using the bootstrap API
This moves our test scripts to using the bootstrap API. Some
automation around invoking the bootstrap API was also added
to give the same ease of use when creating clusters with the
CLI.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-06-24 08:46:10 -07:00
Andrew Rynhard
6ea313fa7d fix: detect if partition table is missing
This adds a sentinel error for a missing partition table. This error
is used to detect if a partition table already exists when setting
up user defined disks.

In addition to the fix, this removes a legacy parameter from the
`PartitionTable` method that indicated that the partition table
should be read. It is safer to just read it every time. Also, I
can't think of a case when the block device partition table is nil
and we want to read.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-06-16 18:26:59 -07:00
Andrey Smirnov
a9766d31bc refactor: implement LoggingManager as central log flow processor
Using this `LoggingManager` all the log flows (reading and writing) were
refactored. Inteface of `LoggingManager` should be now generic enough to
replace log handling with almost any implementation - log rotation,
sending logs to remote destination, keeping logs in memory, etc.

There should be no functional changes.

As part of changes, `follow.Reader` was implemented which makes
appending file feel like a stream. `file.NewChunker` was refactored to
use `follow.Reader` and `stream.NewChunker` to do the actual work. So
basically now we have only a single instance of chunker - stream
chunker, as everything is represented as a stream.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-10 14:30:36 -07:00