132 Commits

Author SHA1 Message Date
Andrew Rynhard
90c91807bd refactor: restructure the project layout
This change moves packages into more appropriate places.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-01 22:19:42 -07:00
Andrew Rynhard
ca35b85300 refactor: improve installation reliability
This change aims to make installations more unified and reliable. It
introduces the concept of a mountpoint manager that is capable of
mounting, unmounting, and moving a set of mountpoints in the correct
order.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-01 11:44:40 -07:00
Andrew Rynhard
e63c882b89 refactor: split machined into phases
This change aims to standardize the boot process. It introduces the
concept of a phase, which is comprised of tasks. Phases are ran in serial and
the tasks that make up a phase are ran concurrently.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-29 12:40:03 -07:00
Andrew Rynhard
b7a9acbe88 refactor: move setup logic into machined
The responsibility of init should only be to mount the rootfs. This
change moves Talos specific logic into machined. This will allow us to
define a version of Talos in a single binary instead of split across
two. This will enable cleaner upgrades and helps make the codebase
easier to reason about.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-26 07:48:49 -07:00
Andrew Rynhard
0ec17e4169 feat: run rootfs from squashfs
This change moves the rootfs to a squashfs image.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-25 08:38:31 -07:00
Spencer Smith
c9f0dbbd4c feat: set default mtu for gce platform
This PR is needed so that the eth0 device will have the proper mtu when
coming online in google cloud

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-07-17 19:16:50 -04:00
Andrew Rynhard
8e8aae98dd feat: add machined
This commit splits our current init into init and machined.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-16 13:12:21 -07:00
Brad Beam
7adef1ea62 feat(init): Add azure as a supported platform
Update initramfs to interact with azure endpoints for userdata.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-07-16 12:59:53 -07:00
Brad Beam
e9482a4041 fix: Fix integration of extra kernel args
Switch from `StringSliceVar` to `StringArrayVar` to maintain commas
in kernel args.

Update entrypoint script to allow specifying extra kernel args.

Remove default console settings in kernel config.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-07-16 14:38:55 -05:00
Andrew Rynhard
1e9548d149 feat: use new pkgs for initramfs and rootfs
This brings in the newly compiled libraries and binaries from our new
pkg builds.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-15 10:32:29 -07:00
Andrew Rynhard
992c54c667 chore: improve network setup logging
Minor improvements to help when debugging.
Without this, if bringing up the default interface fails, the logs can
be misleading.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-13 15:52:49 -07:00
Andrew Rynhard
c40802b122 fix: return non-nil response in reset
The gRPC response will fail to be decoded because our reply is nil.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-13 15:52:25 -07:00
Andrew Rynhard
d197d5c6cd feat: add install flag for extra kernel args
In addition to adding a flag, this adds a field to the user data that allows
for extra kernel arguments to be specified.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-12 13:27:44 -07:00
Andrey Smirnov
82fe5b55e5 chore: make unit-tests use isolated instances of containerd
This makes test launch their own isolated instance of containerd with
its own root/state directories and listening socket address. Each test
brings this instance up/down on its own.

Add options to override containerd address in the code (used only in the
tests).

Enable parallel go test runs once again.

P.S. I wish I could share that 'SetupSuite' phase across the tests, but
afaik there's no way in Go to share `_test.go` code across packages. If
we put it as normal package, this might pull in test dependencies (like
`testify`) into production code, which I don't like.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-07-10 19:46:32 +03:00
Brad Beam
551e24e268 fix(init): Dont log an error when context canceled
When we receive all the necessary files from trustd, we cancel the context. This
was treated as an error case and a message was logged accordingly. However,
this case was not really an error versus a signal to stop trying to fetch a
given file.

Fixes #723

Add basic FileSet tests. Minor refactor to FileSet call to allow easier testing
Add context canceled test for download
Add config tests and trustd coverage

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-07-06 14:54:02 -07:00
Brad Beam
c194621e56 feat(initramfs): Add kernel arg for default interface
Should allow us to handle edge cases where eth0 is not the primary interface

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-07-05 12:17:18 -07:00
Andrew Rynhard
5d8ee0a3a5 fix: use existing logic to perform reset
This PR moves the reset API to the init API definition.
It leverages the same code we use for upgrades.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-04 18:26:14 -07:00
Brad Beam
40d3484469
refactor: Userdata.download supports functional args (#819)
This also adds in support for downloading userdata that is initially encoded in
base64.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-07-03 10:05:20 -05:00
Andrey Smirnov
0662af19d1 chore: seed math.rand PRNG on startup in every service (#801)
This is important as otherwise `math/rand` outputs predictable sequence
each time.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-28 11:03:15 -07:00
Andrey Smirnov
6b0a66b514
fix(init): secret data at rest encryption key should be truly random (#797)
First, use cryptographically secure random number generator.

Second, generate random 32 bytes, don't limit them to any range, as
they're going to be base64-encoded anyways.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-28 17:57:51 +03:00
Andrew Rynhard
fde6b4b6b8
feat: enable debug in udevd service (#783)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-06-26 08:17:13 -07:00
Andrey Smirnov
6d5ee0ca80
feat(init): unify filesystem walkers for ls/cp APIs (#779)
This unifies low-level filesystem walker code for `ls` and `cp`.

New features:

* `ls` now reports relative filenames
* `ls` now prints symlink destination for symlinks
* `cp` now properly always reports errors from the API
* `cp` now reports all the errors back to the client

Example for `ls`:

```
osctl-linux-amd64 --talosconfig talosconfig ls -l /var
MODE          SIZE(B)   LASTMOD       NAME
drwxr-xr-x    4096      Jun 26 2019   .
Lrwxrwxrwx    4         Jun 25 2019   etc -> /etc
drwxr-xr-x    4096      Jun 26 2019   lib
drwxr-xr-x    4096      Jun 21 2019   libexec
drwxr-xr-x    4096      Jun 26 2019   log
drwxr-xr-x    4096      Jun 21 2019   mail
drwxr-xr-x    4096      Jun 26 2019   opt
Lrwxrwxrwx    6         Jun 21 2019   run -> ../run
drwxr-xr-x    4096      Jun 21 2019   spool
dtrwxrwxrwx   4096      Jun 21 2019   tmp
-rw-------    14979     Jun 26 2019   userdata.yaml
```

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-26 17:43:09 +03:00
Andrew Rynhard
85afe4f828
feat: use eudev for udevd (#780)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-06-25 19:25:57 -07:00
Andrew Rynhard
ebc725afa6
feat: add support for upgrading init nodes (#761)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-06-24 15:25:32 -07:00
Brad Beam
d935ee0b33 fix(init): Add modules mountpoint for kube services (#767)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-06-24 12:38:57 -07:00
Andrey Smirnov
76071abbb8
feat(init): move 'ls' API to init from osd (#755)
Service `osd` doesn't have access to rootfs, as it is running in a
container, so move API to `init` which has unconstrained access to
rootfs. (This is in line with another API, `osctl cp`).

Fixes: #752

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-21 22:29:39 +03:00
Andrey Smirnov
9ed45f7090 feat(osctl): implement 'cp' to copy files out of the Talos node (#740)
Actual API is implemented in the `init`, as it has access to root
filesystem. `osd` proxies API back to `init` with some tricks to support
grpc streaming.

Given some absolute path, `init` produces and streams back .tar.gz
archive with filesystem contents.

`osctl cp` works in two modes. First mode streams data to stdout, so
that we can do e.g.: `osctl cp /etc - | tar tz`. Second mode extracts
archive to specified location, dropping ownership info and adjusting
permissions a bit. Timestamps are not preserved.

If full dump with owner/permisisons is required, it's better to stream
data to `tar xz`, for quick and dirty look into filesystem contents
under unprivileged user it's easier to use in-place extraction.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-20 17:02:58 -07:00
Andrey Smirnov
854395517f
chore: improve test stability for containerd tests (#733)
This should be no-op but allows to depend less on timing for concurrent
operations.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-15 00:00:06 +03:00
Brad Beam
0d5f521291
feat(init): Add support for kubeadm reset during upgrade (#714)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-06-06 22:41:22 -05:00
Brad Beam
d68e303f27
feat(init): Add service stop api (#708)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-06-05 14:49:03 -05:00
Andrey Smirnov
7a4a677f04
fix(init): use 127.0.0.1 IP in healthchecks to avoid resolver weirdness (#715)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-05 19:30:28 +03:00
Brad Beam
1a01440482
feat(init): Add support for stopping individual services (#706)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-06-04 15:51:30 -05:00
Andrey Smirnov
bf6ef7043c
chore: address flaky tests instability (#713)
For #711, this should be a complete fix - waiting for container to be
started.

For #712, this should be more of a workaround - playing with timeouts to
hit the failure less likely. Idea of the test is that health check
should be aborted on timeout (1ms) while health check succeeds if not
aborted in 50ms. Before the fix it was 1ms/10ms, but still concurrently
there was a chance that goroutine exits successfully after 10ms while
1ms context deadline is not reached.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-04 23:22:05 +03:00
Andrey Smirnov
d9f4f378c2 fix(osd): consistent container ids in stats, ps and reset (#707)
Fixes: #689, #690

Refactor container inspection code into a package of its own with some
rudimentary tests. Use this package consistently in osd commands dealing
with containers.

Improvements for the next PRs:

* implement API to fetch info about container by ID (to avoid fetching
full list)

* handle and display errors on client side, not to the log of the
server

* more tests, including k8s containers (how can we do that?)

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-03 20:51:01 -05:00
Brad Beam
8537e7eeb6
feat(init): Add support for control plane join config (#700)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-05-31 12:21:00 -05:00
Andrey Smirnov
dc79b0ad05
refactor(init): use 'switch' instead of long condition (#701)
Based on feedback from #699

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-31 17:39:38 +03:00
Andrey Smirnov
32826e3d14
feat(init): update 'waiting' state descritpion when conditions change (#698)
If some of the conditions are already satisfied, we can update the
description to reflect that, e.g.:

```
EVENTS   [Waiting]: Waiting for service "containerd" to be "up", service "networkd" to be "up", service "trustd" to be "up" (14m17s ago)
         [Waiting]: Waiting for service "trustd" to be "up" (14m16s ago)
```

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-31 16:40:32 +03:00
Andrew Rynhard
f95f8f87a4
feat: upgrade Kubernetes to v1.15.0-beta.1 (#696)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-05-30 18:56:30 -07:00
Andrey Smirnov
7b7f4d4484 fix(init): consider 'finished' services to be 'up' (#699)
This fixes a race condition when kubeadm doesn't start waiting for
networkd forever.

If one service enters 'Waiting' state after another service already
finishes running, we should consider 'Finished' service to be 'up' in
the sense it has finished successfully (vs. 'Failed' state which is not
'up').

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-30 20:02:03 -05:00
Brad Beam
a1e635a4b2
feat(init): Prioritize usage of local userdata (#694)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-05-30 09:56:14 -05:00
Andrey Smirnov
20f4d77d39 fix(init): move directory creation to kubeadm pre-func (#688)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-28 09:51:38 -07:00
Andrey Smirnov
40a5b7c177
feat(init): expose networkd as goroutine-based server (#682)
This adds generic goroutine runner which simply wraps service as process
goroutine. It supports log redirection and basic panic handling.

DHCP-related part of the network package was slightly adjusted to run as
service with logging updates (to redirect logs to a file) and context
canceling.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-27 17:07:28 +03:00
Brad Beam
d8249c8779
refactor(init): Allow kubeadm init on controlplane (#658)
* refactor(init): Allow kubeadm init on controlplane

This shifts the cluster formation from init(bootstrap) and join(control plane)
to init(control plane).

This makes use of the previously implemented initToken to provide a TTL for
cluster initialization to take place and allows us to mostly treat all control
plane nodes equal. This also sets up the path for us to handle master upgrades
and not be concerned with odd behavior when upgrading the previously defined
init node.

To facilitate kubeadm init across all control plane nodes, we make use of the
initToken to run `kubeadm init phase certs` command to generate any missing
certificates once. All other control plane nodes will attempt to sync the
necessary certs/files via all defined trustd endpoints and being the startup
process.

* feat(init): Add service runner context to PreFunc

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-05-24 16:05:49 -05:00
Andrey Smirnov
a0188aff73
feat(init): implement service dependencies, correct start and shutdown (#680)
This PR introduces dependencies between the services. Now each service
has two virtual events associated with it: 'up' (running and healthy)
and 'down' (finished or failed). These events are used to establish
correct order via conditions abstraction.

Service image unpacking was moved into 'pre' stage simplifying
`init/main.go`, service images are now closer to the code which runs the
service itself.

Step 'pre' now runs after 'wait' step, and service dependencies are now
mixed into other conditions of 'wait' step on startup.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-24 19:17:52 +03:00
Andrey Smirnov
06bff97a3f
refactor: change conditions to be interface, add descriptions (#677)
Conditions are now implemented as interface with two methods: `Wait` for
condition to be true (cancelable via context) and 'String' which
describes what condition is waiting for.

Generic 'WaitForAll' was implemented to wait for multiple conditions at
once.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-21 21:25:08 +03:00
Brad Beam
a64de7ed51
feat(init): Add initToken parameter to userdata (#664)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-05-20 14:23:38 -05:00
Andrey Smirnov
204873e257
refactor: fix filechunker not exiting on context cancel (#668)
This started as a simple unit-test for file chunker, but the first test
hung immediately, so I started looking into the code.

One problem was that when entering inotify() code, ctx cancel wasn't
considered. Another problem is that remove fsnotify was never triggered,
but I saw that with unit-test later.

Small nit was that inotify() was initialized every time we got to EOF,
which is not efficient for "follow" mode.

So I moved inotify into the main loop, and plugged context cancel watch
into the place when chunk is delivered. Chunker code is supposed to
block in two places: when it tries to deliver next chunk (as client
might be slow to recieve buffers) or when there's no new data (on
inotify). So it makes sense to assert context canceled condition in both
cases.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-20 18:00:40 +03:00
Andrey Smirnov
54168cef1c feat(init): implement healthchecks for the services (#667)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-18 08:44:56 -07:00
Andrey Smirnov
75b2ce7fd2
feat(init): implement services list API and osctl service CLI (#662)
This returns list of all the services registered, with their current
status, past events, health state, etc.

New CLI is `osctl service [<id>]`: without `<id>` it prints list of all
the services, with specific `<id>` it provides details for a service.

I decided to create "parallel" data structures in protobuf as Go
structures don't map nicely onto what protoc generates: pointers vs.
values, additional fields like mutexes, etc. Probably there's a better
approach, I'm open for it.

For CLI, I tried to keep CLI stuff in `cmd/` package, and I also created
simple wrapper to remove duplicated code which sets up client for each
command.

Examples:

```
$ osctl service
SERVICE      STATE     HEALTH   LAST CHANGE   LAST EVENT
containerd   Running   OK       21s ago       Health check successful
kubeadm      Running   ?        2s ago        Started task kubeadm (PID 280) for container kubeadm
kubelet      Running   ?        0s ago        Started task kubelet (PID 383) for container kubelet
ntpd         Running   ?        14s ago       Started task ntpd (PID 129) for container ntpd
osd          Running   ?        14s ago       Started task osd (PID 126) for container osd
proxyd       Waiting   ?        14s ago       Waiting for conditions
trustd       Running   ?        14s ago       Started task trustd (PID 125) for container trustd
udevd        Running   ?        14s ago       Started task udevd (PID 130) for container udevd
```

```
$ osctl service proxyd
ID       proxyd
STATE    Running
HEALTH   ?
EVENTS   [Preparing]: Running pre state (22s ago)
         [Waiting]: Waiting for conditions (22s ago)
         [Preparing]: Creating service runner (6s ago)
         [Running]: Started task proxyd (PID 461) for container proxyd (6s ago)
```

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-17 18:01:12 +03:00
Andrey Smirnov
d034987f71 fix(init): fix containerd healthcheck leaking memory in init/containerd (#661)
As containerd client API wasn't closed after use, connection was leaking
every time healthcheck was run.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-16 18:35:14 -05:00