This change aims to make installations more unified and reliable. It
introduces the concept of a mountpoint manager that is capable of
mounting, unmounting, and moving a set of mountpoints in the correct
order.
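As a rough illustration of what "correct order" can mean here, the sketch
below sorts mount targets so that parents are mounted before children and
unmounted after them. The depth-based ordering and the helper names
(`depth`, `mountOrder`, `unmountOrder`) are assumptions made for this
example, not the manager's actual implementation.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// depth counts the components of a cleaned absolute path ("/" has depth 0).
func depth(target string) int {
	target = path.Clean(target)
	if target == "/" {
		return 0
	}
	return strings.Count(target, "/")
}

// mountOrder returns a copy of targets sorted so that shallower paths
// (parents) come before deeper paths (children).
func mountOrder(targets []string) []string {
	out := append([]string(nil), targets...)
	sort.SliceStable(out, func(i, j int) bool {
		return depth(out[i]) < depth(out[j])
	})
	return out
}

// unmountOrder is the reverse of mountOrder: children first, parents last.
func unmountOrder(targets []string) []string {
	out := mountOrder(targets)
	for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
		out[i], out[j] = out[j], out[i]
	}
	return out
}

func main() {
	// Hypothetical set of mount targets.
	targets := []string{"/var/lib", "/", "/var", "/boot"}
	fmt.Println("mount:  ", mountOrder(targets))
	fmt.Println("unmount:", unmountOrder(targets))
}
```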
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This change aims to standardize the boot process. It introduces the
concept of a phase, which is composed of tasks. Phases are run serially and
the tasks that make up a phase are run concurrently.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
The responsibility of init should only be to mount the rootfs. This
change moves Talos-specific logic into machined. This will allow us to
define a version of Talos in a single binary instead of splitting it
across two. This will enable cleaner upgrades and help make the codebase
easier to reason about.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This PR is needed so that the eth0 device will have the proper MTU when
coming online in Google Cloud.
Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
Switch from `StringSliceVar` to `StringArrayVar` to maintain commas
in kernel args.
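The difference matters because a kernel argument such as
`console=ttyS0,115200` contains a comma. The sketch below reproduces the
two behaviours with the standard library `flag` package only, so it is an
analogue rather than the pflag code itself: `arrayValue` keeps each
occurrence verbatim, while `sliceValue` splits on commas. The type names
and the flag names are made up for this example.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// arrayValue appends every occurrence of the flag unchanged.
type arrayValue []string

func (a *arrayValue) String() string { return strings.Join(*a, " ") }
func (a *arrayValue) Set(v string) error {
	*a = append(*a, v)
	return nil
}

// sliceValue splits every occurrence of the flag on commas.
type sliceValue []string

func (s *sliceValue) String() string { return strings.Join(*s, " ") }
func (s *sliceValue) Set(v string) error {
	*s = append(*s, strings.Split(v, ",")...)
	return nil
}

// parse feeds the same kind of input to both flag styles.
func parse(args []string) (array, slice []string, err error) {
	var a arrayValue
	var s sliceValue
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	fs.Var(&a, "array-arg", "kept verbatim (hypothetical flag)")
	fs.Var(&s, "slice-arg", "split on commas (hypothetical flag)")
	err = fs.Parse(args)
	return a, s, err
}

func main() {
	array, slice, err := parse([]string{
		"-array-arg", "console=ttyS0,115200",
		"-slice-arg", "console=ttyS0,115200",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("array: %q\n", array)
	fmt.Printf("slice: %q\n", slice)
}
```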
Update entrypoint script to allow specifying extra kernel args.
Remove default console settings in kernel config.
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
Minor improvements to help when debugging.
Without this, if bringing up the default interface fails, the logs can
be misleading.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
In addition to adding a flag, this adds a field to the user data that allows
for extra kernel arguments to be specified.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This makes each test launch its own isolated instance of containerd with
its own root/state directories and listening socket address. Each test
brings this instance up and down on its own.
Add options to override containerd address in the code (used only in the
tests).
Enable parallel go test runs once again.
P.S. I wish I could share that 'SetupSuite' phase across the tests, but
afaik there's no way in Go to share `_test.go` code across packages. If
we put it in a normal package, this might pull test dependencies (like
`testify`) into production code, which I don't like.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
When we receive all the necessary files from trustd, we cancel the context. This
was treated as an error case and a message was logged accordingly. However,
this case is not really an error but rather a signal to stop trying to fetch a
given file.
Fixes #723
Add basic FileSet tests. Minor refactor to the FileSet call to allow easier testing.
Add context canceled test for download.
Add config tests and trustd coverage.
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
This PR moves the reset API to the init API definition.
It leverages the same code we use for upgrades.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
First, use a cryptographically secure random number generator.
Second, generate 32 random bytes without limiting them to any range, as
they're going to be base64-encoded anyway.
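For illustration, the approach can be sketched with `crypto/rand` and
`encoding/base64` as below. The function name `randomToken` and the choice
of `base64.StdEncoding` are assumptions for this example.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// randomToken reads 32 bytes from the system CSPRNG and base64-encodes
// them. No byte range filtering is needed because the encoding step
// already produces printable output.
func randomToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(b), nil
}

func main() {
	token, err := randomToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(token)
}
```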
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This unifies low-level filesystem walker code for `ls` and `cp`.
New features:
* `ls` now reports relative filenames
* `ls` now prints symlink destination for symlinks
* `cp` now properly always reports errors from the API
* `cp` now reports all the errors back to the client
Example for `ls`:
```
osctl-linux-amd64 --talosconfig talosconfig ls -l /var
MODE SIZE(B) LASTMOD NAME
drwxr-xr-x 4096 Jun 26 2019 .
Lrwxrwxrwx 4 Jun 25 2019 etc -> /etc
drwxr-xr-x 4096 Jun 26 2019 lib
drwxr-xr-x 4096 Jun 21 2019 libexec
drwxr-xr-x 4096 Jun 26 2019 log
drwxr-xr-x 4096 Jun 21 2019 mail
drwxr-xr-x 4096 Jun 26 2019 opt
Lrwxrwxrwx 6 Jun 21 2019 run -> ../run
drwxr-xr-x 4096 Jun 21 2019 spool
dtrwxrwxrwx 4096 Jun 21 2019 tmp
-rw------- 14979 Jun 26 2019 userdata.yaml
```
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Service `osd` doesn't have access to rootfs, as it is running in a
container, so move API to `init` which has unconstrained access to
rootfs. (This is in line with another API, `osctl cp`).
Fixes: #752
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
The actual API is implemented in `init`, as it has access to the root
filesystem. `osd` proxies the API back to `init` with some tricks to
support gRPC streaming.
Given some absolute path, `init` produces and streams back a .tar.gz
archive with the filesystem contents.
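The archive side can be sketched with the standard `archive/tar` and
`compress/gzip` packages, roughly as below. This is a simplified sketch:
it writes to any `io.Writer` instead of a gRPC stream, it skips symlinks
and special files, and the names `archive`, `list`, and `demo` are
invented for this example.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// archive writes directories and regular files under root to w as .tar.gz,
// using paths relative to root.
func archive(root string, w io.Writer) error {
	gz := gzip.NewWriter(w)
	tw := tar.NewWriter(gz)

	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() && !info.Mode().IsRegular() {
			return nil // this sketch skips symlinks, devices, sockets
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = filepath.ToSlash(rel)
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if info.IsDir() {
			return nil
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f)
		return err
	})
	if err != nil {
		return err
	}
	if err := tw.Close(); err != nil {
		return err
	}
	return gz.Close()
}

// list reads a .tar.gz stream and returns the entry names, similar in
// spirit to `tar tz`.
func list(r io.Reader) ([]string, error) {
	gz, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer gz.Close()
	tr := tar.NewReader(gz)
	var names []string
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		names = append(names, hdr.Name)
	}
}

// demo archives a temporary directory with one file and lists the result.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "cp-sketch")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "hello.txt"), []byte("hi\n"), 0o600); err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	if err := archive(dir, &buf); err != nil {
		return nil, err
	}
	return list(&buf)
}

func main() {
	names, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```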
`osctl cp` works in two modes. First mode streams data to stdout, so
that we can do e.g.: `osctl cp /etc - | tar tz`. Second mode extracts
archive to specified location, dropping ownership info and adjusting
permissions a bit. Timestamps are not preserved.
If a full dump with owner/permissions is required, it's better to stream
data to `tar xz`. For a quick and dirty look into filesystem contents
as an unprivileged user, it's easier to use in-place extraction.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
For #711, this should be a complete fix: wait for the container to be
started.
For #712, this is more of a workaround: adjust the timeouts to make the
failure less likely. The idea of the test is that the health check
should be aborted on timeout (1ms), while the health check succeeds if
not aborted within 50ms. Before the fix the values were 1ms/10ms, but
under concurrency there was still a chance that the goroutine exited
successfully after 10ms while the 1ms context deadline had not been
reached.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes: #689, #690
Refactor container inspection code into a package of its own with some
rudimentary tests. Use this package consistently in osd commands dealing
with containers.
Improvements for the next PRs:
* implement API to fetch info about container by ID (to avoid fetching
full list)
* handle and display errors on client side, not to the log of the
server
* more tests, including k8s containers (how can we do that?)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
If some of the conditions are already satisfied, we can update the
description to reflect that, e.g.:
```
EVENTS [Waiting]: Waiting for service "containerd" to be "up", service "networkd" to be "up", service "trustd" to be "up" (14m17s ago)
[Waiting]: Waiting for service "trustd" to be "up" (14m16s ago)
```
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This fixes a race condition in which kubeadm never starts because it
keeps waiting for networkd forever.
If one service enters 'Waiting' state after another service already
finishes running, we should consider 'Finished' service to be 'up' in
the sense it has finished successfully (vs. 'Failed' state which is not
'up').
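Expressed as code, the rule might look like the sketch below. The `State`
type and its constants are assumptions for this example, and health is
left out to keep the focus on the 'Finished' case.

```go
package main

import "fmt"

// State is a hypothetical service state enum.
type State int

const (
	Waiting State = iota
	Running
	Finished
	Failed
)

var names = []string{"Waiting", "Running", "Finished", "Failed"}

// isUp reports whether a dependent service may treat s as 'up'.
// A service that finished successfully counts as 'up'; a failed one does not.
func isUp(s State) bool {
	return s == Running || s == Finished
}

func main() {
	for _, s := range []State{Waiting, Running, Finished, Failed} {
		fmt.Printf("%-8s up=%v\n", names[s], isUp(s))
	}
}
```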
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This adds a generic goroutine runner which simply wraps a service as an
in-process goroutine. It supports log redirection and basic panic handling.
The DHCP-related part of the network package was slightly adjusted to run
as a service, with logging updates (to redirect logs to a file) and
context canceling.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
* refactor(init): Allow kubeadm init on controlplane
This shifts the cluster formation from init(bootstrap) and join(control plane)
to init(control plane).
This makes use of the previously implemented initToken to provide a TTL for
cluster initialization to take place and allows us to mostly treat all control
plane nodes equal. This also sets up the path for us to handle master upgrades
and not be concerned with odd behavior when upgrading the previously defined
init node.
To facilitate kubeadm init across all control plane nodes, we make use of the
initToken to run `kubeadm init phase certs` command to generate any missing
certificates once. All other control plane nodes will attempt to sync the
necessary certs/files via all defined trustd endpoints and begin the startup
process.
* feat(init): Add service runner context to PreFunc
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
This PR introduces dependencies between the services. Now each service
has two virtual events associated with it: 'up' (running and healthy)
and 'down' (finished or failed). These events are used to establish
correct order via conditions abstraction.
Service image unpacking was moved into 'pre' stage simplifying
`init/main.go`, service images are now closer to the code which runs the
service itself.
Step 'pre' now runs after 'wait' step, and service dependencies are now
mixed into other conditions of 'wait' step on startup.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Conditions are now implemented as an interface with two methods: `Wait`,
which waits for the condition to be true (cancelable via context), and
`String`, which describes what the condition is waiting for.
A generic `WaitForAll` was implemented to wait for multiple conditions at
once.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This started as a simple unit-test for file chunker, but the first test
hung immediately, so I started looking into the code.
One problem was that context cancellation wasn't considered when entering
the inotify() code. Another problem was that the fsnotify remove event was
never triggered, but I only saw that later with the unit test.
Small nit was that inotify() was initialized every time we got to EOF,
which is not efficient for "follow" mode.
So I moved inotify into the main loop, and plugged a context cancel watch
into the place where the chunk is delivered. Chunker code is supposed to
block in two places: when it tries to deliver the next chunk (as the
client might be slow to receive buffers) or when there's no new data (on
inotify). So it makes sense to check for context cancellation in both
cases.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This returns list of all the services registered, with their current
status, past events, health state, etc.
New CLI is `osctl service [<id>]`: without `<id>` it prints list of all
the services, with specific `<id>` it provides details for a service.
I decided to create "parallel" data structures in protobuf, as Go
structures don't map nicely onto what protoc generates: pointers vs.
values, additional fields like mutexes, etc. There is probably a better
approach, and I'm open to it.
For CLI, I tried to keep CLI stuff in `cmd/` package, and I also created
simple wrapper to remove duplicated code which sets up client for each
command.
Examples:
```
$ osctl service
SERVICE STATE HEALTH LAST CHANGE LAST EVENT
containerd Running OK 21s ago Health check successful
kubeadm Running ? 2s ago Started task kubeadm (PID 280) for container kubeadm
kubelet Running ? 0s ago Started task kubelet (PID 383) for container kubelet
ntpd Running ? 14s ago Started task ntpd (PID 129) for container ntpd
osd Running ? 14s ago Started task osd (PID 126) for container osd
proxyd Waiting ? 14s ago Waiting for conditions
trustd Running ? 14s ago Started task trustd (PID 125) for container trustd
udevd Running ? 14s ago Started task udevd (PID 130) for container udevd
```
```
$ osctl service proxyd
ID proxyd
STATE Running
HEALTH ?
EVENTS [Preparing]: Running pre state (22s ago)
[Waiting]: Waiting for conditions (22s ago)
[Preparing]: Creating service runner (6s ago)
[Running]: Started task proxyd (PID 461) for container proxyd (6s ago)
```
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Because the containerd client wasn't closed after use, a connection was
leaked every time the health check ran.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>