979 Commits

Author SHA1 Message Date
Andrew Rynhard
ab4e058489 feat: upgrade Kubernetes to v1.16.0-rc.2
This brings in the release candidate for Kubernetes v1.16.0.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-16 14:56:55 -07:00
Andrey Smirnov
54dd1bd95d chore: make ntpd depend on networkd
As ntpd relies on outbound networking, it makes sense to wait for
networkd.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-17 00:30:08 +03:00
Andrey Smirnov
c2176ee0fa chore: update github.com/stretchr/testify library to 1.4.0
The new release comes with bugfixes (we had already picked up some of
them via an untagged release) and a few interesting new assertions,
including `Eventually` for polling.

See: https://github.com/stretchr/testify/milestone/2?closed=1

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-16 19:06:47 +03:00
Andrey Smirnov
669bb5e1c6 chore: move interface type assertion to unit-tests
This moves optional interface checks to unit-tests, removing type checks
via global variable assignment.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-16 17:16:30 +03:00
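A minimal sketch of the two styles of interface check described in the commit above. The `HealthChecker` interface and `ntpService` type are hypothetical names used only for illustration; they are not taken from the Talos source.

```go
package main

import "fmt"

// HealthChecker is a hypothetical optional interface a service may implement.
type HealthChecker interface {
	HealthCheck() error
}

// ntpService is a hypothetical service type used only for illustration.
type ntpService struct{}

// HealthCheck implements HealthChecker.
func (ntpService) HealthCheck() error { return nil }

// Old style: a package-level assignment makes the compiler verify the
// interface, at the cost of a global that exists only for that check:
//
//	var _ HealthChecker = (*ntpService)(nil)

// implementsHealthChecker performs the equivalent check at run time, which
// is the kind of assertion a unit test can make instead.
func implementsHealthChecker(v interface{}) bool {
	_, ok := v.(HealthChecker)
	return ok
}

func main() {
	fmt.Println(implementsHealthChecker(ntpService{}))
	fmt.Println(implementsHealthChecker(struct{}{}))
}
```

In a unit test the same type assertion would be wrapped in a test assertion, so a service that stops implementing the interface fails the test rather than the build.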
Andrey Smirnov
7d8c40e3aa chore: randomize containerd namespace in tests
It looks like containerd creates shim sockets in the Linux abstract
namespace with names that are fixed (they don't depend on the containerd
root directory) and that depend on the container namespace and ID. So if
two containerd instances on the same host run the same namespace/ID pair,
that is going to create a conflict on that shim socket.

Avoid that by randomizing namespace name. CRI tests should be fine as
namespace is fixed, but container ID is random.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-13 23:56:40 +03:00
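One way to randomize a namespace name per test run, as a sketch. The helper name `randomNamespace` and the `talos-test-` prefix are assumptions for illustration, not the names used in the actual tests.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomNamespace returns prefix followed by 8 random hex characters, so
// that two test runs on the same host are very unlikely to share a name.
func randomNamespace(prefix string) (string, error) {
	buf := make([]byte, 4)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}

	return prefix + hex.EncodeToString(buf), nil
}

func main() {
	ns, err := randomNamespace("talos-test-") // hypothetical prefix
	if err != nil {
		panic(err)
	}

	fmt.Println(ns)
}
```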
Andrew Rynhard
75746266ce feat: upgrade Kubernetes to v1.16.0-rc.1
This brings in the latest RC of 1.16.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-12 20:20:48 -07:00
Andrey Smirnov
362d403707 chore: make TestRunRestartFailed test more reliable
Replace sleep with polling for desired state.

Fixes #1162

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-13 01:18:22 +03:00
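Replacing a fixed sleep with polling can be sketched as follows. `pollUntil` is a hypothetical helper written for this example; the actual test may instead rely on an existing helper or on testify's `Eventually`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is returned when the condition never became true.
var errTimeout = errors.New("condition not met before timeout")

// pollUntil calls cond every interval until it returns true or the timeout
// elapses. Unlike a fixed sleep, it returns as soon as the state is reached.
func pollUntil(timeout, interval time.Duration, cond func() bool) error {
	deadline := time.Now().Add(timeout)

	for {
		if cond() {
			return nil
		}

		if time.Now().After(deadline) {
			return errTimeout
		}

		time.Sleep(interval)
	}
}

func main() {
	start := time.Now()

	// Stand-in for "service reached the desired state".
	err := pollUntil(time.Second, 10*time.Millisecond, func() bool {
		return time.Since(start) > 50*time.Millisecond
	})

	fmt.Println("error:", err)
}
```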
Andrey Smirnov
b68e6395d8 feat(machined): filter actions stop/start/restart on per-service level
This implements a 'default deny' policy for service operations via the
API: by default, services do not allow any operations.

A service whitelists itself for stop/start/restart by implementing the
interface and returning a boolean flag, which might depend on userdata.

Machined APIs `Stop/Start` were renamed to `ServiceStop`/`ServiceStart`
to avoid confusion with osd API `Restart` which is not related to
services. Old APIs are deprecated and compatibility code forwards old
APIs to the new code.

`ServiceRestart` API was introduced to distinguish restart action from
stop/start (previously restart was implemented as stop+start in the
CLI).

Service udevd-trigger was whitelisted for all operations (allows
stopping hanging run, restarting to trigger once again).

Services proxyd & ntpd were whitelisted for restart and start (start is
whitelisted to help with service stuck in stopped state while restarting).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-13 00:38:19 +03:00
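The 'default deny' pattern from the commit above, sketched with an optional interface. All type and method names here (`Service`, `APIRestartable`, `APIRestartAllowed`, `lockedService`) are assumptions for illustration; the real interface may take userdata as an argument and use different names.

```go
package main

import "fmt"

// Service is a hypothetical minimal service interface.
type Service interface {
	ID() string
}

// APIRestartable is a hypothetical optional interface. A service that does
// not implement it cannot be restarted via the API (default deny).
type APIRestartable interface {
	APIRestartAllowed() bool
}

// ntpd opts in to restarts, mirroring the whitelist in the commit message.
type ntpd struct{}

func (ntpd) ID() string              { return "ntpd" }
func (ntpd) APIRestartAllowed() bool { return true }

// lockedService is a made-up service that does not opt in to anything.
type lockedService struct{}

func (lockedService) ID() string { return "locked" }

// canRestart allows the operation only when the service both implements
// the optional interface and returns true from it.
func canRestart(svc Service) bool {
	r, ok := svc.(APIRestartable)

	return ok && r.APIRestartAllowed()
}

func main() {
	for _, svc := range []Service{ntpd{}, lockedService{}} {
		fmt.Printf("%s restart allowed: %v\n", svc.ID(), canRestart(svc))
	}
}
```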
Andrew Rynhard
5ee554128e chore: move from gofumpt to gofumports
gofumports does everything that gofumpt does and additionally formats
imports. This change proposes using the `-local` flag so that imports are
separated in the following order:

- standard library
- third party
- Talos specific

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-12 07:49:12 -07:00
Andrew Rynhard
2e8b570302 chore: add fmt target
This provides a target that can be useful for developers. It will format
code according to our standards.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-11 15:19:53 -07:00
Andrey Smirnov
980829708e chore: upgrade golangci-lint to 1.18.0
New linter 'funlen' was disabled as too many functions break the default
limit, but might be considered for the future.

To limit peak memory usage, `GOGC=50` was added to the golangci-lint run
to make Go's garbage collector more aggressive. With this setting, peak
memory usage seems to be around 8 GB.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-11 15:18:57 -07:00
Andrew Rynhard
d563988778 fix: use /var/log for default log path
This moves the default log path to /var/log. An exception is made for
machined-api and system-containerd since they must have zero
dependencies on the ephemeral disk. In the case of machined-api, we
cannot stop the service since it is required to perform an upgrade. As
for system-containerd, it starts before any ephemeral disk is mounted so
we will fail to start the service since /var/log is a read-only file system.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-11 15:07:34 -07:00
Andrew Rynhard
20c88bac2c feat: move node certificate to tmpfs
This ensures that node certificates are ephemeral by storing them in a
tmpfs.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-11 14:10:34 -07:00
Spencer Smith
fa9b08145f docs: add machine configuration proposal
This PR will add the machine configuration proposal for review and merge
once agreed upon.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-09-11 12:01:40 -07:00
Andrew Rynhard
2955428850 chore: format code with gofumpt
The gofumpt linter is a stricter drop-in replacement for gofmt. The
rules are ones that I strongly agree with and I think it would be better
if we added this linter instead of nit picking every PR.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-11 11:03:29 -07:00
Brad Beam
a8c69bf753 chore(machined): Clean up unnecessary ticker alloc
There was a new ticker being created for each run of the healthcheck.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-09-11 12:10:02 -05:00
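The fix above amounts to allocating the ticker once, outside the loop. A sketch under assumed names (`runHealthChecks` and its parameters are hypothetical, not the machined code):

```go
package main

import (
	"fmt"
	"time"
)

// runHealthChecks runs check on every tick until stop is closed and returns
// how many checks ran. The ticker is created once and reused for every
// iteration, rather than being allocated again on each health check run.
func runHealthChecks(stop <-chan struct{}, interval time.Duration, check func()) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop() // release the ticker's resources on exit

	runs := 0

	for {
		select {
		case <-stop:
			return runs
		case <-ticker.C:
			check()
			runs++
		}
	}
}

func main() {
	stop := make(chan struct{})
	time.AfterFunc(35*time.Millisecond, func() { close(stop) })

	runs := runHealthChecks(stop, 10*time.Millisecond, func() {})
	fmt.Println("health checks run:", runs)
}
```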
Andrew Rynhard
bf16b1e916 chore: remove invalid TODO
This TODO no longer applies. We have settled on a fixed boot size. This
also removes variables no longer needed.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-10 10:53:36 -07:00
Andrew Rynhard
298ddc8f49 fix: enable slub_debug=P
This is the last KSPP kernel parameter we need to be compliant with KSPP
guidelines.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-10 10:53:19 -07:00
Andrew Rynhard
38690d72df chore: remove unneeded packages
This removes packages we don't need anymore.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-10 08:12:07 -07:00
Spencer Smith
473df84cf6 fix: move to per-platform console setup
This PR will make sure that each platform gets the console settings it
needs by setting them as extra flags in the makefile. This should ensure
that we have console logs flowing properly for each cloud.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-09-10 07:50:34 -07:00
Andrew Rynhard
761805e910 feat: set expiry of certificates to 24 hours
This defaults certificates to a 24 hour TTL.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-10 07:34:25 -07:00
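A 24-hour TTL maps onto the `NotBefore`/`NotAfter` fields of a certificate template. The sketch below uses only the standard library; `certTTL` and `newCertTemplate` are hypothetical names, and the fixed serial number is a placeholder that real code would replace with a random one.

```go
package main

import (
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// certTTL is the default certificate lifetime from the commit message.
const certTTL = 24 * time.Hour

// newCertTemplate builds an x509 template that is valid from now until
// now+certTTL. Signing the template is out of scope for this sketch.
func newCertTemplate(commonName string) *x509.Certificate {
	now := time.Now()

	return &x509.Certificate{
		SerialNumber: big.NewInt(1), // placeholder, not suitable for real use
		Subject:      pkix.Name{CommonName: commonName},
		NotBefore:    now,
		NotAfter:     now.Add(certTTL),
	}
}

func main() {
	tmpl := newCertTemplate("node.example")
	fmt.Println("validity:", tmpl.NotAfter.Sub(tmpl.NotBefore))
}
```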
Andrew Rynhard
e48cee6343 chore: remove existing AMI
We need to remove an existing AMI, if it exists, in order to create a new
one with the same name.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-10 04:52:43 -07:00
Andrew Rynhard
44dd2fc7c9 chore: remove packer from installer
This moves to making AWS releases align with Azure, and GCP. We no
longer need packer since we will now release an artifact that users can
import.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-09 18:54:37 -07:00
Brad Beam
9a50da0ed7 fix(osd): Mount host directory for grpc sockets
Should prevent broken mounts from occurring when services are restarted.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-09-09 16:20:38 -05:00
Brad Beam
309856083b fix: Add retry/delay to probing device file
Fixes flaky image creation.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-09-09 16:05:35 -05:00
Brad Beam
63eb62f52c fix(machined): Fix hostname value when retrieving from cloud providers
There was an issue where the hostname was getting set too early in the boot. This caused
the hostname retrieved from platform.Hostname() to be ignored.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-09-09 15:55:46 -05:00
Brad Beam
f21d1244bd test(ci): Add aws for e2e and conformance targets
Add additional scripts and steps to enable doing tests against aws.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-09-09 13:56:19 -05:00
Spencer Smith
aed8c06730 chore: rename v1 node configs to v1alpha1
This PR moves to using v1alpha1 as the initial node config version, so
we can graduate these configs a little more cleanly later on.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-09-09 13:03:49 -04:00
Brad Beam
be4f7e1e6a chore: Rename maintainers channel
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-09-09 10:59:48 -05:00
Seán C McCord
a99637cc0a fix: use ntp client constructor
Uses NTP client constructor so that defaults are appropriately used.

Fixes #1126

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-09-08 19:18:37 -07:00
Seán C McCord
3c41770478 fix: translate machine.network to networking.os
Add translation for v1 to v0 machine networking.  Also adds "Ignore"
property to v1 network interfaces.

Fixes #1134

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-09-08 18:20:10 -07:00
Seán C McCord
beecb70374 feat: Allow spec of canonical controlplane addr
Broke the binding between the discrete IP addresses of the control plane
elements and the ControlPlaneEndpoint.  This allows the specification of
a canonical controlplane address which may optionally be a DNS name.

Fixes #1131

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-09-08 17:18:52 -07:00
Seán C McCord
47a361c5b6 fix(osctl): use real userdata as defaults for install
This modifies `osctl install` to use the provided userdata as the source
for default installation values.  This allows such things as
userdata-supplied extra kernel parameters to be automatically
included in the bootloader.

Fixes #1102

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-09-08 17:00:12 -07:00
Seán C McCord
bcb6a2d3a5 fix: prepend custom options for kernel commandline
Added a decomposition option to the kernel.NewDefaultCmdline() so that
the Defaults can be added _after_ constructing a custom commandline.
This is then implemented for `osctl install`.

Fixes #1128

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-09-08 16:58:49 -07:00
Seán C McCord
f7ad24ec4f feat: allow network interface to be ignored
Added a property to userdata to allow a network interface to be ignored,
such that Talos will perform no operations on it (including DHCP).

Also added kernel commandline parameter (talos.network.interface.ignore)
to specify a network interface should be ignored.

Also allows chaining of kernel cmdline parameter Contains() where the
parameter in question does not exist.

Fixes #1124

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-09-07 16:33:52 -07:00
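Chaining `Contains()` on a parameter that does not exist can be made safe with a nil-tolerant method receiver. The sketch below illustrates that idea only; `Cmdline`, `Parameter`, `ParseCmdline` and `Get` are simplified stand-ins written for this example, not the API of the Talos kernel package.

```go
package main

import (
	"fmt"
	"strings"
)

// Parameter holds the values given for one kernel command line key.
type Parameter struct {
	values []string
}

// Cmdline is a simplified, hypothetical kernel command line model.
type Cmdline struct {
	params map[string]*Parameter
}

// ParseCmdline splits "key=value" fields separated by whitespace.
func ParseCmdline(s string) *Cmdline {
	c := &Cmdline{params: map[string]*Parameter{}}

	for _, field := range strings.Fields(s) {
		kv := strings.SplitN(field, "=", 2)

		p, ok := c.params[kv[0]]
		if !ok {
			p = &Parameter{}
			c.params[kv[0]] = p
		}

		if len(kv) == 2 {
			p.values = append(p.values, kv[1])
		}
	}

	return c
}

// Get returns the parameter for key, or nil when the key is absent.
func (c *Cmdline) Get(key string) *Parameter {
	return c.params[key]
}

// Contains is safe to call on a nil *Parameter, which is what allows
// Get(...).Contains(...) to be chained for keys that do not exist.
func (p *Parameter) Contains(value string) bool {
	if p == nil {
		return false
	}

	for _, v := range p.values {
		if v == value {
			return true
		}
	}

	return false
}

func main() {
	c := ParseCmdline("console=ttyS0 talos.network.interface.ignore=eth1")

	fmt.Println(c.Get("talos.network.interface.ignore").Contains("eth1"))
	fmt.Println(c.Get("does.not.exist").Contains("eth1"))
}
```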
Andrew Rynhard
71e8a5fccf chore: remove top output border
This should give it a closer feel to the rest of the UX.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-06 19:48:12 -07:00
Brad Beam
2fadd4da6f chore(machined): Increase pid_max to 262k
Minor improvement for busy systems

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-09-06 19:47:24 -07:00
Spencer Smith
8b019d8f33 chore: update provider-components for capi v0.1.9
This PR updates our e2e tests with the provider-components file that's
generated by our capi v0.1.9 update.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-09-06 22:45:44 -04:00
Spencer Smith
71cddfd30b fix: remove basic integration teardown
This was breaking e2e testing, as we depend on it for applying CAPI and
launching VMs from there.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-09-06 15:15:24 -05:00
Andrew Rynhard
37a8ce78ae fix: prevent EBUSY when unmounting system disk
Reading /proc/mounts while simultaneously unmounting mountpoints
prevents unmounting all submounts under /var. This is due to the fact
that /proc/mounts will change as we perform unmounts, and that causes a
read of the file to become inaccurate. We now read /proc/mounts into
memory to get a snapshot of all submounts under /var, and then we
proceed with unmounting them.

This also adds some additional logging that I found to be useful while
debugging this. It also adds logic to skip DaemonSet-managed pods.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-06 05:05:59 -07:00
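The snapshot approach described above can be sketched as: read the mount table once, pick out the submounts from that in-memory copy, and only then start unmounting. The function name `submountsUnder` is hypothetical, and sorting the deepest paths first is an assumption of this sketch rather than something stated in the commit.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// submountsUnder parses a snapshot of /proc/mounts and returns the mount
// points strictly below prefix, longest path first so that nested mounts
// come before their parents. Working from a snapshot means the result does
// not shift while unmounts are in progress.
func submountsUnder(snapshot, prefix string) []string {
	var targets []string

	scanner := bufio.NewScanner(strings.NewReader(snapshot))
	for scanner.Scan() {
		// Format: device mountpoint fstype options dump pass
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 {
			continue
		}

		if strings.HasPrefix(fields[1], prefix+"/") {
			targets = append(targets, fields[1])
		}
	}

	sort.Slice(targets, func(i, j int) bool {
		return len(targets[i]) > len(targets[j])
	})

	return targets
}

func main() {
	// A real caller would read /proc/mounts once into memory; a fixed
	// sample keeps this example runnable anywhere.
	snapshot := "/dev/sda6 /var xfs rw 0 0\n" +
		"tmpfs /var/run tmpfs rw 0 0\n" +
		"overlay /var/lib/containerd/rootfs overlay rw 0 0\n"

	for _, target := range submountsUnder(snapshot, "/var") {
		fmt.Println("would unmount", target)
	}
}
```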
Brad Beam
f03975bdc3 chore: Retry check for HA control plane
I think this was causing some of the flakiness in this test.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-09-05 22:04:38 -05:00
Brad Beam
a0ace6881b refactor(ntpd): Improvements to the robustness of ntp
- Use the Validate method to ensure we get an appropriate time back
- Hard set the clock initially, adjust clock by offsets afterwards
- Introduce functional opts to configure ntp client
- Add additional test coverage

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-09-05 21:52:29 -05:00
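Functional options, as mentioned in the list above, let a constructor apply defaults first and overrides second. The sketch below shows the general pattern; `NTPClient`, `WithServer`, `WithTimeout` and the default values are assumptions for illustration, not the actual ntpd package API.

```go
package main

import (
	"fmt"
	"time"
)

// NTPClient is a hypothetical client with two configurable settings.
type NTPClient struct {
	Server  string
	Timeout time.Duration
}

// Option mutates a client during construction.
type Option func(*NTPClient)

// WithServer overrides the default server.
func WithServer(server string) Option {
	return func(c *NTPClient) { c.Server = server }
}

// WithTimeout overrides the default query timeout.
func WithTimeout(d time.Duration) Option {
	return func(c *NTPClient) { c.Timeout = d }
}

// NewNTPClient sets defaults first and then applies options in order, so a
// caller that goes through the constructor always gets sane defaults.
func NewNTPClient(opts ...Option) *NTPClient {
	c := &NTPClient{
		Server:  "pool.ntp.org",  // assumed default for this sketch
		Timeout: 5 * time.Second, // assumed default for this sketch
	}

	for _, opt := range opts {
		opt(c)
	}

	return c
}

func main() {
	fmt.Printf("%+v\n", *NewNTPClient())
	fmt.Printf("%+v\n", *NewNTPClient(WithServer("time.example.com")))
}
```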
Andrew Rynhard
9337dcdfcd feat: configure interfaces concurrently
This uses a wait group to configure interfaces concurrently.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-05 14:45:42 -07:00
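Configuring interfaces concurrently with a wait group might look like the sketch below. `configureAll`, the interface names and `errLinkDown` are hypothetical; the error collection via a mutex-guarded map is a design choice of this example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errLinkDown is a made-up error used to demonstrate failure handling.
var errLinkDown = errors.New("link down")

// configureAll runs configure for every interface in its own goroutine,
// waits for all of them to finish, and returns the failures by name.
func configureAll(names []string, configure func(string) error) map[string]error {
	var (
		wg sync.WaitGroup
		mu sync.Mutex
	)

	errs := make(map[string]error)

	for _, name := range names {
		wg.Add(1)

		go func(name string) {
			defer wg.Done()

			if err := configure(name); err != nil {
				mu.Lock()
				errs[name] = err
				mu.Unlock()
			}
		}(name)
	}

	wg.Wait()

	return errs
}

func main() {
	errs := configureAll([]string{"lo", "eth0", "eth1"}, func(name string) error {
		if name == "eth1" {
			return errLinkDown
		}

		return nil
	})

	fmt.Println("failed interfaces:", len(errs))
}
```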
Andrew Rynhard
a6e12b498d chore: align time command with output standards
This changes the output to a table writer with all caps for headers.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-05 14:42:43 -07:00
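A table writer with upper-case headers can be built on the standard library's `text/tabwriter`. This is a sketch only: `renderTable`, the column names and the padding value are assumptions, not the actual osctl output code.

```go
package main

import (
	"fmt"
	"strings"
	"text/tabwriter"
)

// renderTable formats rows as aligned columns and upper-cases the headers.
func renderTable(headers []string, rows [][]string) (string, error) {
	var sb strings.Builder

	// minwidth 0, tabwidth 0, padding 3, pad with spaces, no flags.
	tw := tabwriter.NewWriter(&sb, 0, 0, 3, ' ', 0)

	upper := make([]string, len(headers))
	for i, h := range headers {
		upper[i] = strings.ToUpper(h)
	}

	fmt.Fprintln(tw, strings.Join(upper, "\t"))

	for _, row := range rows {
		fmt.Fprintln(tw, strings.Join(row, "\t"))
	}

	// Flush writes the aligned output to the underlying builder.
	if err := tw.Flush(); err != nil {
		return "", err
	}

	return sb.String(), nil
}

func main() {
	out, err := renderTable(
		[]string{"ntp-server", "local-time"}, // hypothetical columns
		[][]string{{"pool.ntp.org", "2019-09-05 21:42:43 UTC"}},
	)
	if err != nil {
		panic(err)
	}

	fmt.Print(out)
}
```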
Andrey Smirnov
c0698c1815 chore(machined): implement process reaper for PID 1 machined process
In UNIX, any zombies without a parent process get re-parented to the
process with PID 1 (usually running init), and the PID 1 process should
take care of them (usually by simply cleaning them up). Cleaning up
zombies is important, as they still take kernel resources, and having an
enormous number of zombie processes significantly degrades system performance.

For Talos, the PID 1 process is machined, and machined itself forks to
run other processes in the process runner and in `pkg/cmd` one-time
commands. The naive solution of running a `wait()` loop doesn't work, as
it might race with `Process.Wait()` and clean up a zombie which wasn't
re-parented, which leads to a false failure of process execution.

After considering other solutions, we decided to go with the simple
approach: machined runs global zombie process reaper which publishes
information about reaped zombies. Any call to `Process.Wait()` (or
`Command.Wait()` which calls it) should be replaced with listening to
reaper's channel for notifications to catch info about the process which
was created in this call.

There are several changes in this PR:

1. Reaper implementation itself, started from machined.

2. Process runner and `pkg/cmd` can either use regular `Command.Wait()`
or use reaper notifications depending on reaper status (running/not
running). This allows using this code outside of machined.

3. Small bug fixes with process log which was affecting the tests.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-05 10:01:02 -07:00
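The reaper design described above, one global collector that publishes exit information to listeners, can be sketched as follows. This is an illustration under assumptions and not the machined implementation: the names `Reaper`, `ProcessInfo`, `Notify`, `Stop` and `ReapAll` are hypothetical, the code is Unix-only, and dropping notifications for slow listeners is a simplification of this sketch.

```go
package main

import (
	"fmt"
	"sync"
	"syscall"
)

// ProcessInfo describes one reaped child process.
type ProcessInfo struct {
	Pid    int
	Status syscall.WaitStatus
}

// Reaper collects exited children and fans the results out to listeners.
type Reaper struct {
	mu        sync.Mutex
	listeners map[chan<- ProcessInfo]struct{}
}

// NewReaper returns a reaper with no listeners.
func NewReaper() *Reaper {
	return &Reaper{listeners: map[chan<- ProcessInfo]struct{}{}}
}

// Notify registers ch to receive information about reaped processes.
func (r *Reaper) Notify(ch chan<- ProcessInfo) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.listeners[ch] = struct{}{}
}

// Stop unregisters ch.
func (r *Reaper) Stop(ch chan<- ProcessInfo) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.listeners, ch)
}

// publish sends info to every listener without blocking the reaper.
func (r *Reaper) publish(info ProcessInfo) {
	r.mu.Lock()
	defer r.mu.Unlock()

	for ch := range r.listeners {
		select {
		case ch <- info:
		default: // listener buffer is full; dropped in this sketch
		}
	}
}

// ReapAll collects every child that has already exited, without blocking,
// and returns how many were reaped. PID 1 would call it on each SIGCHLD.
func (r *Reaper) ReapAll() int {
	reaped := 0

	for {
		var status syscall.WaitStatus

		pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
		if err == syscall.EINTR {
			continue
		}

		if err != nil || pid <= 0 {
			return reaped // no children, or none have exited yet
		}

		r.publish(ProcessInfo{Pid: pid, Status: status})
		reaped++
	}
}

func main() {
	r := NewReaper()

	ch := make(chan ProcessInfo, 8)
	r.Notify(ch)
	defer r.Stop(ch)

	fmt.Println("reaped:", r.ReapAll())
}
```

A caller that previously used `Command.Wait()` would register a buffered channel before starting its process and then read from the channel until it sees the PID it started.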
Andrew Rynhard
db78ed93ec fix: set default install image
This sets the default install image just before installation. It was
erroneously placed in the boot verification.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-04 11:48:23 -07:00
Andrew Rynhard
fd3521649f chore: remove buildkit cache directory
This cache was more important back when builds of Talos took upwards of
40 minutes. Since this is no longer the case, and I have seen
performance issues by mounting a host path into the container, I think
we should drop this.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-04 10:43:41 -07:00
Seán C McCord
845cd92e5d fix: increase retries for DHCP
Increased retry count to 6 for DHCP.  In my testing, this worked
reliably in my setup, where the default (3) did not.

Ultimately, this should probably be configurable from the userdata.
Instead, this just makes it work for me.

Fixes #1099

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-09-02 19:02:53 -07:00
Andrey Smirnov
7ab0f8a7f2 chore: enable unit-tests-race
This is an experiment to see how stable they are.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-02 19:02:38 -07:00
Andrey Smirnov
662ef94026 chore: make TestContainerdSuite/TestRunTwice more robust
Fixes #1010

Wait for containerd shim socket to be removed before running container
second time.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-02 19:02:05 -07:00