852 Commits

Author SHA1 Message Date
Spencer Smith
aed8c06730 chore: rename v1 node configs to v1alpha1
This PR moves to using v1alpha1 as the initial node config version, so
we can graduate these configs a little more cleanly later on.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-09-09 13:03:49 -04:00
Brad Beam
be4f7e1e6a chore: Rename maintainers channel
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-09-09 10:59:48 -05:00
Seán C McCord
a99637cc0a fix: use ntp client constructor
Uses NTP client constructor so that defaults are appropriately used.

Fixes #1126

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-09-08 19:18:37 -07:00
Seán C McCord
3c41770478 fix: translate machine.network to networking.os
Add translation for v1 to v0 machine networking.  Also adds "Ignore"
property to v1 network interfaces.

Fixes #1134

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-09-08 18:20:10 -07:00
Seán C McCord
beecb70374 feat: Allow spec of canonical controlplane addr
Broke the binding between the discrete IP addresses of the control plane
elements and the ControlPlaneEndpoint.  This allows the specification of
a canonical controlplane address which may optionally be a DNS name.

Fixes #1131

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-09-08 17:18:52 -07:00
Seán C McCord
47a361c5b6 fix(osctl): use real userdata as defaults for install
This modifies `osctl install` to use the provided userdata as the source
for default installation values.  This allows such things as
userdata-supplied extra kernel parameters to be automatically
included in the bootloader.

Fixes #1102

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-09-08 17:00:12 -07:00
Seán C McCord
bcb6a2d3a5 fix: prepend custom options for kernel commandline
Added a decomposition option to the kernel.NewDefaultCmdline() so that
the Defaults can be added _after_ constructing a custom commandline.
This is then implemented for `osctl install`.

Fixes #1128

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-09-08 16:58:49 -07:00
Seán C McCord
f7ad24ec4f feat: allow network interface to be ignored
Added a property to userdata to allow a network interface to be ignored,
such that Talos will perform no operations on it (including DHCP).

Also added kernel commandline parameter (talos.network.interface.ignore)
to specify a network interface should be ignored.

Also allows chaining of kernel cmdline parameter Contains() where the
parameter in question does not exist.
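
The chaining behaviour described above can be illustrated with a nil-safe method receiver. This is only a sketch under assumptions: the `Parameter`, `Cmdline`, `Parse`, and `Get` names below are hypothetical and are not taken from the Talos kernel package; only the `talos.network.interface.ignore` parameter name comes from this commit.

```go
package main

import (
	"fmt"
	"strings"
)

// Parameter is a hypothetical representation of one kernel cmdline key
// together with all values given for it.
type Parameter struct {
	key    string
	values []string
}

// Contains reports whether the parameter has the given value. It is safe to
// call on a nil *Parameter, which is what makes chaining off Get possible
// when the parameter does not exist.
func (p *Parameter) Contains(value string) bool {
	if p == nil {
		return false
	}
	for _, v := range p.values {
		if v == value {
			return true
		}
	}
	return false
}

// Cmdline holds the parsed parameters, keyed by name.
type Cmdline struct {
	params map[string]*Parameter
}

// Parse splits a kernel command line on whitespace; repeated keys
// accumulate their values.
func Parse(line string) *Cmdline {
	c := &Cmdline{params: make(map[string]*Parameter)}
	for _, field := range strings.Fields(line) {
		kv := strings.SplitN(field, "=", 2)
		p, ok := c.params[kv[0]]
		if !ok {
			p = &Parameter{key: kv[0]}
			c.params[kv[0]] = p
		}
		if len(kv) == 2 {
			p.values = append(p.values, kv[1])
		}
	}
	return c
}

// Get returns the parameter for key, or nil when it is absent.
func (c *Cmdline) Get(key string) *Parameter {
	return c.params[key]
}

func main() {
	c := Parse("console=ttyS0 talos.network.interface.ignore=eth1")
	fmt.Println(c.Get("talos.network.interface.ignore").Contains("eth1"))
	fmt.Println(c.Get("no.such.parameter").Contains("eth1"))
}
```

With this shape, callers do not need a separate existence check before calling `Contains()`.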

Fixes #1124

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-09-07 16:33:52 -07:00
Andrew Rynhard
71e8a5fccf chore: remove top output border
This should give it a closer feel to the rest of the UX.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-06 19:48:12 -07:00
Brad Beam
2fadd4da6f chore(machined): Increase pid_max to 262k
Minor improvement for busy systems

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-09-06 19:47:24 -07:00
Spencer Smith
8b019d8f33 chore: update provider-components for capi v0.1.9
This PR updates our e2e tests with the provider-components file that's
generated by our capi v0.1.9 update.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-09-06 22:45:44 -04:00
Spencer Smith
71cddfd30b fix: remove basic integration teardown
This was breaking e2e testing, as we depend on it for applying CAPI and
launching VMs from there.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-09-06 15:15:24 -05:00
Andrew Rynhard
37a8ce78ae fix: prevent EBUSY when unmounting system disk
Reading /proc/mounts while simultaneously unmounting mountpoints
prevents unmounting all submounts under /var. This is due to the fact
that /proc/mounts will change as we perform unmounts, and that causes a
read of the file to become inaccurate. We now read /proc/mounts into
memory to get a snapshot of all submounts under /var, and then we
proceed with unmounting them.
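
A minimal sketch of the snapshot idea, assuming the mount table has already been read fully into memory (for example with `os.ReadFile` on /proc/mounts). The function name `submountsUnder` and the sample mount entries are hypothetical and not taken from the Talos code.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// submountsUnder parses an in-memory snapshot of /proc/mounts and returns
// the mountpoints strictly below prefix, longest path first, so that nested
// mounts are handled before their parents.
func submountsUnder(snapshot, prefix string) []string {
	var targets []string
	scanner := bufio.NewScanner(strings.NewReader(snapshot))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 {
			continue // not a valid mount entry
		}
		mountpoint := fields[1]
		if strings.HasPrefix(mountpoint, prefix+"/") {
			targets = append(targets, mountpoint)
		}
	}
	// A child mount always has a longer path than its parent.
	sort.Slice(targets, func(i, j int) bool {
		return len(targets[i]) > len(targets[j])
	})
	return targets
}

func main() {
	// Illustrative data; a real snapshot would come from /proc/mounts.
	snapshot := "/dev/sda6 /var xfs rw 0 0\n" +
		"tmpfs /var/lib/kubelet/pods/a tmpfs rw 0 0\n" +
		"overlay /var/run overlay rw 0 0\n" +
		"proc /proc proc rw 0 0\n"
	for _, m := range submountsUnder(snapshot, "/var") {
		fmt.Println("would unmount", m)
	}
}
```

Because the list is computed once from the snapshot, later unmounts cannot change what the loop iterates over.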

This also adds some additional logging that I found to be useful while
debugging this. It also adds logic to skip DaemonSet-managed pods.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-06 05:05:59 -07:00
Brad Beam
f03975bdc3 chore: Retry check for HA control plane
Think this was causing some of our flakiness for this test

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-09-05 22:04:38 -05:00
Brad Beam
a0ace6881b refactor(ntpd): Improvements to the robustness of ntp
- Use the Validate method to ensure we get an appropriate time back
- Hard set the clock initially, adjust clock by offsets afterwards
- Introduce functional opts to configure ntp client
- Add additional test coverage
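
The functional options pattern mentioned in the list above can be sketched as follows. The `Client` fields, the option names, and the default values are assumptions made for illustration; they are not the actual Talos ntp client API.

```go
package main

import (
	"fmt"
	"time"
)

// Client is a hypothetical NTP client; the fields are illustrative only.
type Client struct {
	Server  string
	Timeout time.Duration
}

// Option mutates a Client during construction.
type Option func(*Client)

// WithServer overrides the default server.
func WithServer(server string) Option {
	return func(c *Client) { c.Server = server }
}

// WithTimeout overrides the default query timeout.
func WithTimeout(d time.Duration) Option {
	return func(c *Client) { c.Timeout = d }
}

// NewClient applies defaults first, then any caller-supplied options, so a
// caller that passes nothing still gets a usable client.
func NewClient(opts ...Option) *Client {
	c := &Client{Server: "pool.ntp.org", Timeout: 5 * time.Second}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", *NewClient())
	fmt.Printf("%+v\n", *NewClient(WithServer("time.example.org"), WithTimeout(time.Second)))
}
```

Going through a constructor like this, rather than building the struct literal directly, is what guarantees that defaults are applied.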

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-09-05 21:52:29 -05:00
Andrew Rynhard
9337dcdfcd feat: configure interfaces concurrently
This uses a wait group to configure interfaces concurrently.
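
A minimal sketch of the wait group approach. The `configureAll` helper, the `errNoLink` error, and the interface names are hypothetical; the real per-interface configuration logic is replaced by a callback.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNoLink is an illustrative error used only by this example.
var errNoLink = errors.New("no link")

// configureAll runs configure for every interface concurrently, waits for
// all of them to finish, and returns any errors keyed by interface name.
func configureAll(names []string, configure func(string) error) map[string]error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs = make(map[string]error)
	)
	for _, name := range names {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			if err := configure(name); err != nil {
				mu.Lock()
				errs[name] = err
				mu.Unlock()
			}
		}(name)
	}
	wg.Wait()
	return errs
}

func main() {
	errs := configureAll([]string{"eth0", "eth1", "eth2"}, func(name string) error {
		if name == "eth1" {
			return errNoLink
		}
		return nil
	})
	fmt.Println("failed interfaces:", len(errs))
}
```

The mutex is needed because several goroutines may record errors in the shared map at the same time.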

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-05 14:45:42 -07:00
Andrew Rynhard
a6e12b498d chore: align time command with output standards
This changes the output to a table writer with all caps for headers.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-05 14:42:43 -07:00
Andrey Smirnov
c0698c1815 chore(machined): implement process reaper for PID 1 machined process
In UNIX, any zombies without a parent process get re-parented to the process
with PID 1 (usually running init), and the PID 1 process should take care of
them (usually by simply cleaning them up). Cleaning up zombies is important,
as they still take kernel resources, and having an enormous number of
zombie processes significantly degrades system performance.

For Talos, the PID 1 process is machined, and machined itself forks to run
other processes in the process runner and `pkg/cmd` one-time commands. The naive
solution of running a `wait()` loop doesn't work, as it might race with
`Process.Wait()` and clean up a zombie which wasn't re-parented, which
leads to a false process execution failure.

After considering other solutions, we decided to go with the simple
approach: machined runs global zombie process reaper which publishes
information about reaped zombies. Any call to `Process.Wait()` (or
`Command.Wait()` which calls it) should be replaced with listening to
reaper's channel for notifications to catch info about the process which
was created in this call.

There are several changes in this PR:

1. Reaper implementation itself, started from machined.

2. Process runner and `pkg/cmd` can either use regular `Command.Wait()`
or use reaper notifications depending on reaper status (running/not
running). This allows using this code outside of machined.

3. Small bug fixes with process log which was affecting the tests.
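
The notification side of such a reaper can be sketched as below. This is an illustration under assumptions, not the machined implementation: the `Reaper`, `ProcessInfo`, and `WaitFor` names are hypothetical, and the loop that actually collects zombies (for example via `wait4` on SIGCHLD) is left out, with `Publish` standing in for it.

```go
package main

import (
	"fmt"
	"sync"
)

// ProcessInfo describes one reaped child process.
type ProcessInfo struct {
	Pid      int
	ExitCode int
}

// Reaper fans out information about reaped processes to subscribers.
type Reaper struct {
	mu   sync.Mutex
	subs map[chan ProcessInfo]struct{}
}

// NewReaper returns a reaper with no subscribers.
func NewReaper() *Reaper {
	return &Reaper{subs: make(map[chan ProcessInfo]struct{})}
}

// Subscribe registers a buffered notification channel. Callers should
// subscribe before starting their process so no notification is missed.
func (r *Reaper) Subscribe() chan ProcessInfo {
	ch := make(chan ProcessInfo, 16)
	r.mu.Lock()
	r.subs[ch] = struct{}{}
	r.mu.Unlock()
	return ch
}

// Unsubscribe removes a channel registered with Subscribe.
func (r *Reaper) Unsubscribe(ch chan ProcessInfo) {
	r.mu.Lock()
	delete(r.subs, ch)
	r.mu.Unlock()
}

// Publish would be called by the reaping loop for every collected child.
func (r *Reaper) Publish(info ProcessInfo) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for ch := range r.subs {
		select {
		case ch <- info:
		default: // drop instead of blocking the reaper on a slow subscriber
		}
	}
}

// WaitFor replaces a direct Wait call: it blocks until the notification for
// pid arrives and ignores notifications about other processes.
func WaitFor(ch <-chan ProcessInfo, pid int) ProcessInfo {
	for info := range ch {
		if info.Pid == pid {
			return info
		}
	}
	return ProcessInfo{Pid: pid, ExitCode: -1}
}

func main() {
	r := NewReaper()
	ch := r.Subscribe()
	defer r.Unsubscribe(ch)

	r.Publish(ProcessInfo{Pid: 100, ExitCode: 0}) // unrelated re-parented zombie
	r.Publish(ProcessInfo{Pid: 42, ExitCode: 1})  // the process we started

	fmt.Printf("%+v\n", WaitFor(ch, 42))
}
```

Dropping notifications for a full channel is one possible trade-off in this sketch; a production reaper would need to decide how to handle slow subscribers.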

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-05 10:01:02 -07:00
Andrew Rynhard
db78ed93ec fix: set default install image
This sets the default install image just before installation. It was
erroneously placed in the boot verification.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-04 11:48:23 -07:00
Andrew Rynhard
fd3521649f chore: remove buildkit cache directory
This cache was more important back when builds of Talos took upwards of
40 minutes. Since this is no longer the case, and I have seen
performance issues by mounting a host path into the container, I think
we should drop this.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-04 10:43:41 -07:00
Seán C McCord
845cd92e5d fix: increase retries for DHCP
Increased retry count to 6 for DHCP.  In my testing, this worked
reliably in my setup, where the default (3) did not.

Ultimately, this should probably be configurable from the userdata.
Instead, this just makes it work for me.

Fixes #1099

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-09-02 19:02:53 -07:00
Andrey Smirnov
7ab0f8a7f2 chore: enable unit-tests-race
This is an experiment to see how stable they are.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-02 19:02:38 -07:00
Andrey Smirnov
662ef94026 chore: make TestContainerdSuite/TestRunTwice more robust
Fixes #1010

Wait for containerd shim socket to be removed before running container
second time.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-02 19:02:05 -07:00
Andrey Smirnov
d49c4baf62 chore: make health tests more robust
Fixes #1018 #1020

Add more wait loops to address cases when unit-tests are running
extremely slow under high load on the build machine.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-02 19:01:33 -07:00
Andrey Smirnov
3012851208 fix(machined): limit max stderr output, use pkg/cmd consistently
Use circular buffer instead of (unlimited) `bytes.Buffer` to limit
amount of stderr output captured. If command being run produces too much
output on stderr, this might consume too much RAM.
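
A minimal sketch of a bounded buffer of this kind. The `RingBuffer` type below is hypothetical and written for illustration; it is not the buffer implementation used by machined.

```go
package main

import "fmt"

// RingBuffer keeps only the most recent bytes written to it, up to a fixed
// capacity, so memory use stays bounded regardless of how much is written.
type RingBuffer struct {
	data  []byte
	start int // index of the oldest stored byte
	size  int // number of bytes currently stored
}

// NewRingBuffer allocates a buffer that retains at most capacity bytes.
func NewRingBuffer(capacity int) *RingBuffer {
	return &RingBuffer{data: make([]byte, capacity)}
}

// Write implements io.Writer. It never fails; once the buffer is full the
// oldest bytes are overwritten.
func (b *RingBuffer) Write(p []byte) (int, error) {
	n := len(p)
	capacity := len(b.data)
	if capacity == 0 {
		return n, nil
	}
	// Only the last capacity bytes of p can survive.
	if len(p) > capacity {
		p = p[len(p)-capacity:]
	}
	for _, c := range p {
		end := (b.start + b.size) % capacity
		b.data[end] = c
		if b.size < capacity {
			b.size++
		} else {
			b.start = (b.start + 1) % capacity
		}
	}
	return n, nil
}

// Bytes returns a copy of the retained bytes, oldest first.
func (b *RingBuffer) Bytes() []byte {
	out := make([]byte, 0, b.size)
	for i := 0; i < b.size; i++ {
		out = append(out, b.data[(b.start+i)%len(b.data)])
	}
	return out
}

func main() {
	buf := NewRingBuffer(8)
	fmt.Fprint(buf, "a very chatty command wrote this to stderr")
	fmt.Printf("%q\n", buf.Bytes())
}
```

Because the type satisfies `io.Writer`, it could be assigned to the `Stderr` field of an `exec.Cmd` in place of a `bytes.Buffer`.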

Use `pkg/cmd` to run command in `udevd` service. This should allow
easier udevd integration.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-02 19:01:15 -07:00
Brad Beam
1373806165 fix(init): Enable containerd subreaper
Should take care of our issue with Zombies

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-30 14:32:13 -07:00
Andrey Smirnov
029374f07d chore: disable go test result cache
Go by default caches unit-tests results via build cache, so if source
code doesn't have any changes, test results are cached on package level.
As our unit-tests are not that pure and depend on the environment, it
would be more helpful to make sure all the unit-tests run during each build.

Setting the number of test runs to one disables the test result cache (but
the build cache is still being used).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-08-30 22:03:00 +03:00
Andrew Rynhard
ef2154745d fix: leave etcd when upgrading control plane node
We need to remove the current node from etcd when upgrading.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-30 07:16:56 -07:00
Andrew Rynhard
1bbed6907b chore: fix generate version flag and mark v0 as deprecated
Since the command's name is 'generate' the 'gen' prefix is not needed
in the version flag. The flag is scoped under the generate command so
it should be very clear that the '--version' flag is used to control the
config version.

We also move to defaulting to v0 since v1 is new and still needs to be
tested in the real world. We can default to v1 in the next release.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-30 06:59:54 -07:00
Andrey Smirnov
de49903a5f chore: fix location of Go build cache mount for unit-tests-race
This step is based on `golang` image, so `GOCACHE` is set in a bit of a
different way.

No big deal, but should speed up subsequent runs a bit.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-08-29 16:35:14 -07:00
Brad Beam
a6ba81bf4e fix(networkd): Fix hostname retrieval
If multiple interfaces exist on a node, but the first interface was unsuccessful
in getting a DHCP response, we would segfault when trying to retrieve the hostname
for that interface. This was due to d.Ack being nil and us having no guard around it.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-28 21:25:15 -05:00
Brad Beam
b1dc400fea chore: Fix azure image upload
Single quote causes variable to not be evaluated

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-28 20:38:30 -05:00
Brad Beam
9b91cd4511 chore: Clean up e2e scripts
- Use az/gcloud cli bundled with container
- Use consistent spacing in scripts ( 2 spaces vs tab )
- Updated count functions to handle the count inline
- Made platform kubeconfig the default

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-28 08:31:47 -05:00
Andrew Rynhard
d89b199825 chore: change upgrade request "url" to "image"
This aligns the nomenclature used throughout the codebase.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-27 21:43:20 -07:00
Andrew Rynhard
2e8f393fc5 chore: remove unused init token
This removes a token that we never used. Right now it's just noise, so
let's remove it.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-27 21:36:52 -07:00
Andrew Rynhard
1b8bf0d3aa fix: use unique variables for CLI flags
Since the cluster create command and the upgrade command shared a common
variable, and the upgrade defaults to an empty string, we get an invalid
reference format error when attempting to create a cluster. This makes
the variables unique to avoid that.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-27 19:33:30 -07:00
Andrew Rynhard
295cbf9dc6 chore: remove generated raw disk
This was mistakenly removed.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-27 19:08:51 -07:00
Andrew Rynhard
66c848cc0d fix: make --target persistent across all commands
We have this flag missing in a number of places. This ensures that all
commands in the future will have this flag. A potential cleanup would
be to hide this flag in commands where it does not make sense. For now I
think it's best to have it everywhere.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-27 18:57:53 -07:00
Andrew Rynhard
d098785a17 chore: remove local upgrade functionality
We have no need for this anymore since installs and upgrades are now
completely handled in a container.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-27 18:44:18 -07:00
Andrew Rynhard
bf8fc1dcbd chore: lint protobuf definitions
This adds linting to our protobuf definitions via prototool.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-27 18:12:36 -07:00
Andrew Rynhard
4247b1befc chore: output top header in all caps
This changes the top output to be consistent with the rest of the CLI
output.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-27 18:04:39 -07:00
Andrew Rynhard
83b978c983 chore: prepare release v0.2.0-alpha.7
This is the official v0.2.0-alpha.7 release.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
v0.2.0-alpha.7
2019-08-27 15:00:30 -07:00
Andrew Rynhard
d4770d41ad feat: run installs via container
This moves to performing installs via a container.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-27 15:01:20 -05:00
Spencer Smith
739e232896 feat: upgrade kubernetes to v1.16.0-beta.1
This PR will upgrade to the latest beta of v1.16 in order to get us
closer to catching the v1.16.0 release as soon as it drops.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-08-27 13:25:33 -04:00
Brad Beam
f028d29d31 chore: Increase timers for healthchecks
We've seen some instances where the initial delay is not long enough (containerd).
In addition, a check period of one second increases the log size for services like
proxyd, which log incoming connections.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-27 09:54:05 -07:00
Andrew Rynhard
0bdaff1a90 feat: perform upgrades via container
This moves to performing upgrades via a container.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-27 09:44:50 -07:00
Spencer Smith
f85750cdca feat: generate and use v1 machine configs
This PR will implement the v1 machine config proposal. This will allow
for a streamlined config for talos nodes.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-08-26 19:36:14 -04:00
Andrew Rynhard
15cfd42168 chore: upgrade tools
This brings in Go v1.12.9 to address CVEs and bugs.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-26 15:57:30 -07:00
Andrew Rynhard
43e20217e8 feat: add ability to pass data on event bus
We need to support eventing with associated data. This moves the event
bus to an observer design pattern that allows observers to register for
specific events, and to receive the associated data.
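
The observer pattern described above can be sketched as follows. The `Bus`, `Event`, and `EventType` names, the two event types, and the sample payload are assumptions for illustration and do not reflect the actual Talos event bus.

```go
package main

import (
	"fmt"
	"sync"
)

// EventType identifies a kind of event; the values here are hypothetical.
type EventType int

const (
	Shutdown EventType = iota
	Upgrade
)

// Event carries a type and optional associated data.
type Event struct {
	Type EventType
	Data interface{}
}

// Bus delivers events to observers registered for the event's type.
type Bus struct {
	mu        sync.RWMutex
	observers map[EventType][]func(Event)
}

// NewBus returns a bus with no observers.
func NewBus() *Bus {
	return &Bus{observers: make(map[EventType][]func(Event))}
}

// Register subscribes fn to events of type t only.
func (b *Bus) Register(t EventType, fn func(Event)) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.observers[t] = append(b.observers[t], fn)
}

// Notify calls every observer registered for e.Type, passing along the data.
// The observer list is copied so callbacks may register further observers
// without deadlocking.
func (b *Bus) Notify(e Event) {
	b.mu.RLock()
	observers := append([]func(Event){}, b.observers[e.Type]...)
	b.mu.RUnlock()
	for _, fn := range observers {
		fn(e)
	}
}

func main() {
	bus := NewBus()
	bus.Register(Upgrade, func(e Event) {
		fmt.Println("upgrade requested with data:", e.Data)
	})
	bus.Notify(Event{Type: Shutdown})                         // no observer, ignored
	bus.Notify(Event{Type: Upgrade, Data: "example/image:v1"}) // delivered
}
```

Observers here are plain functions called synchronously; a channel-based variant would be an equally valid design.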

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-26 13:27:02 -07:00
Spencer Smith
6f8e089271 chore: use kubeadm v1beta2 structs everywhere
This PR will move to using the external kubeadm v1beta2 structs for our
code base. This will hopefully allow for more stable integrations with
kubeadm in the long term, as well as solve some needs we have in the
machine config rewrite.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-08-26 12:07:36 -04:00