The problem was that etcd was only stopped in `LeaveEtcd`, so an upgrade
with preserve never stopped etcd, leaving the ephemeral partition
busy.
Refactored the code that stops a single service and shuts down all
services to provide the interface we need:
* stop a service without considering reverse dependencies (force);
* stop a service (services) waiting for reverse dependencies;
* shutdown all the services waiting for reverse dependencies.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This change is only moving packages and updating import paths.
Goal: expose `internal/pkg/provision` as `pkg/provision` to enable other
projects to import the Talos provisioning library.
As cluster checks are almost always required as part of the provisioning
process, package `internal/pkg/cluster` was also made public as
`pkg/cluster`.
The other moves cover direct dependencies flagged by `importvet`, which
were updated as well.
Public packages (useful, general purpose packages with stable API):
* `internal/pkg/conditions` -> `pkg/conditions`
* `internal/pkg/tail` -> `pkg/tail`
Private packages (used only internally by the provisioning library):
* `internal/pkg/inmemhttp` -> `pkg/provision/internal/inmemhttp`
* `internal/pkg/kernel/vmlinuz` -> `pkg/provision/internal/vmlinuz`
* `internal/pkg/cniutils` -> `pkg/provision/internal/cniutils`
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Logs:
```
[ 27.739699] [talos] bootstrap request received
[ 27.740500] [talos] bootstrap sequence: 3 phase(s)
[ 27.741297] [talos] phase etcd (1/3): 1 tasks(s)
[ 27.741991] [talos] task bootstrapEtcd (1/1): starting
[ 27.742855] [talos] service[etcd](Failed): Failed to run pre stage: context canceled
[ 27.744355] [talos] service[etcd](Finished): Bootstrap requested
```
`etcd` was stopped, the `Finished` state was injected, but the new
service never started. This is most likely a race in `Start`: it removes
the service from `running` after the service stops, but the event
announcing the stop is sent before that. A task might therefore see the
service as stopped, unload it and load it back, but `Start()` will be a
no-op as the service is still considered to be running.
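
One way to avoid the ordering described above, sketched under
assumptions: the `registry` type and its `Start` signature are
hypothetical and only illustrate clearing the running flag before the
stop is announced. This is not necessarily the fix that was applied.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for the state tracking running services.
type registry struct {
	mu      sync.Mutex
	running map[string]bool
}

func newRegistry() *registry {
	return &registry{running: map[string]bool{}}
}

// Start runs the service in a goroutine unless it is already marked as
// running. It reports whether the service was actually started.
func (r *registry) Start(id string, run func(), onStopped func()) bool {
	r.mu.Lock()
	if r.running[id] {
		r.mu.Unlock()
		return false // no-op: the service is considered to be running
	}
	r.running[id] = true
	r.mu.Unlock()

	go func() {
		run()

		// Clear the running flag BEFORE announcing the stop, so that
		// whoever reacts to the event can start the service again.
		r.mu.Lock()
		delete(r.running, id)
		r.mu.Unlock()

		onStopped()
	}()

	return true
}

func main() {
	r := newRegistry()
	restarted := make(chan bool, 1)

	r.Start("etcd", func() {}, func() {
		// Reacting to the "stopped" event: start a fresh instance.
		restarted <- r.Start("etcd", func() {}, func() {})
	})

	fmt.Println("restart accepted:", <-restarted)
}
```

If the flag were cleared after `onStopped()`, the nested `Start` call
would hit the early return and the new instance would never run.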
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
The problem was that the flow to re-run a service with different
parameters was not consistent: it depended on whether the service was
loaded before or not, which is not reliable. E.g. with the bootstrap API,
`bootkube` is loaded for the bootstrap and stays loaded until reboot, but
it is never loaded on any other boot.
`Unload()` stops and removes the service completely so that a new
instance of the service can be loaded and started.
This fixes the edge case with the recovery API not running bootkube
properly before reboot after bootstrap.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This is a rewrite of machined. It addresses some of the limitations and
complexity in the implementation. This introduces the idea of a
controller. A controller is responsible for managing the runtime, the
sequencer, and a new state type introduced in this PR.
A few highlights are:
- no more event bus
- functional approach to tasks (no more types defined for each task)
- the task function definition now offers a lot more context, like
access to raw API requests, the current sequence, a logger, the new
state interface, and the runtime interface.
- no more panics to handle reboots
- additional initialize and reboot sequences
- graceful gRPC server shutdown on critical errors
- config is now stored at install time to avoid having to download it at
install time and at boot time
- upgrades now use the local config instead of downloading it
- the upgrade API's preserve option takes precedence over the config's
install force option
Additionally, this pulls various packages in under machined to make the
code easier to navigate.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
`gomnd` disabled, as it complains about every number used in the code,
and `wsl` became much more thorough.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This extracts Docker Talos cluster provisioner as common code
which might be shared between `osctl cluster` and integration-test.
There should be almost no functional changes.
As a proof of concept, abstract cluster readiness checks were implemented
based on the provisioned cluster state. They implement the same checks as
`basic-integration.sh` in pure Go via Talos/K8s clients.
`conditions` package was promoted from machined-internal to
`internal/pkg` as it is used to run the checks.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Not sure if there was an update in the fmt code path, but these are the
results after running `make fmt`.
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
This removes the `github.com/pkg/errors` package in favor of the official
error wrapping introduced in Go 1.13.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This should provide a better UX around misconfigured Talos nodes. It is
just the start of something we can expand on.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This moves from translating a config into an internal config
representation to using an interface. The idea is that an interface
gives us stronger compile-time checks and saves us from having to copy
from one struct to another. As long as a concrete type implements the
Configurator interface, it can be used to provide instructions to Talos.
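
A minimal sketch of the pattern, with assumptions spelled out: the
methods on `Configurator` below (`Hostname`, `InstallDisk`) and the
`exampleConfig` type are hypothetical and chosen only for illustration;
the real interface is larger.

```go
package main

import "fmt"

// Configurator is sketched here with two hypothetical methods.
type Configurator interface {
	Hostname() string
	InstallDisk() string
}

// exampleConfig is a hypothetical concrete config format.
type exampleConfig struct {
	Host string
	Disk string
}

func (c *exampleConfig) Hostname() string    { return c.Host }
func (c *exampleConfig) InstallDisk() string { return c.Disk }

// Compile-time check: the build fails if *exampleConfig stops
// satisfying Configurator.
var _ Configurator = (*exampleConfig)(nil)

// describe consumes the interface directly; no copying into an
// intermediate struct is needed.
func describe(c Configurator) string {
	return fmt.Sprintf("%s installs to %s", c.Hostname(), c.InstallDisk())
}

func main() {
	fmt.Println(describe(&exampleConfig{Host: "node-1", Disk: "/dev/sda"}))
}
```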
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This implements a 'default deny' policy for service operations via the
API: by default, services do not allow any operations.
A service whitelists itself for stop/start/restart by implementing the
interface and returning a boolean flag which might depend on userdata.
Machined APIs `Stop/Start` were renamed to `ServiceStop`/`ServiceStart`
to avoid confusion with osd API `Restart` which is not related to
services. Old APIs are deprecated and compatibility code forwards old
APIs to the new code.
`ServiceRestart` API was introduced to distinguish restart action from
stop/start (previously restart was implemented as stop+start in the
CLI).
Service udevd-trigger was whitelisted for all operations (this allows
stopping a hanging run and restarting to trigger once again).
Services proxyd & ntpd were whitelisted for restart and start (start is
whitelisted to help with a service stuck in the stopped state while
restarting).
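
The 'default deny' idea can be sketched with optional interfaces. The
interface and helper names below are hypothetical, and the userdata
argument is left out for brevity: a service that implements nothing is
denied every operation.

```go
package main

import "fmt"

// Service is a reduced, hypothetical service interface.
type Service interface {
	ID() string
}

// Hypothetical opt-in interfaces: a service allows an operation only by
// implementing the matching interface and returning true.
type APIStartable interface{ APIStartAllowed() bool }
type APIStoppable interface{ APIStopAllowed() bool }
type APIRestartable interface{ APIRestartAllowed() bool }

func CanStart(svc Service) bool {
	s, ok := svc.(APIStartable)
	return ok && s.APIStartAllowed()
}

func CanStop(svc Service) bool {
	s, ok := svc.(APIStoppable)
	return ok && s.APIStopAllowed()
}

func CanRestart(svc Service) bool {
	s, ok := svc.(APIRestartable)
	return ok && s.APIRestartAllowed()
}

// etcd implements none of the opt-in interfaces: everything is denied.
type etcd struct{}

func (etcd) ID() string { return "etcd" }

// ntpd opts in to start and restart, but not to stop.
type ntpd struct{}

func (ntpd) ID() string              { return "ntpd" }
func (ntpd) APIStartAllowed() bool   { return true }
func (ntpd) APIRestartAllowed() bool { return true }

func main() {
	for _, svc := range []Service{etcd{}, ntpd{}} {
		fmt.Printf("%s: start=%v stop=%v restart=%v\n",
			svc.ID(), CanStart(svc), CanStop(svc), CanRestart(svc))
	}
}
```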
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This re-arranges phases a bit so that shutdown actions are pushed back
to the top-level main.go of machined.
A small, rudimentary event.Bus is introduced to facilitate event passing
(shutdown/restart) between various machined components and main.go. This
might not be the best implementation, just something to allow this
message passing without global variables or such.
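
For illustration only, a channel-based sketch of such a bus. The names
(`Bus`, `Subscribe`, `Notify`, `EventType`) are hypothetical and the
actual event.Bus may be shaped differently.

```go
package main

import (
	"fmt"
	"sync"
)

// EventType enumerates the events passed around (hypothetical).
type EventType int

const (
	Shutdown EventType = iota
	Reboot
)

// Bus fans out events to all subscribers.
type Bus struct {
	mu   sync.Mutex
	subs []chan EventType
}

// Subscribe registers a new listener and returns its channel.
func (b *Bus) Subscribe() <-chan EventType {
	ch := make(chan EventType, 1)

	b.mu.Lock()
	b.subs = append(b.subs, ch)
	b.mu.Unlock()

	return ch
}

// Notify delivers the event to every subscriber without blocking; the
// event is dropped for a subscriber whose buffer is already full.
func (b *Bus) Notify(e EventType) {
	b.mu.Lock()
	defer b.mu.Unlock()

	for _, ch := range b.subs {
		select {
		case ch <- e:
		default:
		}
	}
}

func main() {
	bus := &Bus{}
	events := bus.Subscribe() // e.g. main.go listening for shutdown

	bus.Notify(Shutdown) // e.g. sent by a signal or ACPI handler task

	if <-events == Shutdown {
		fmt.Println("shutting down")
	}
}
```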
The machined API was refactored to run as a goroutine service.
ACPI & signal handlers were rebuilt as phase tasks, activated in
non-container and container modes respectively.
As part of the fix, now `docker stop` triggers correct shutdown of Talos
(not a big deal, but good for testing).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
It is now possible to `start`/`stop`/`restart` any service via `osctl`
commands.
There are some changes in `ServiceRunner` to support re-use (re-entering
the running state). The `Services` singleton now tracks service running
state to avoid calling `Start()` on an already running `ServiceRunner`
instance.
Method `Start()` was renamed to `LoadAndStart()` to separate service
loading (adding to the list of services) from the actual service start.
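
A simplified sketch of the load/start split and the running-state
tracking, assuming a service can be reduced to a start function. The
`Services` type below is a hypothetical stand-in, not the real
singleton.

```go
package main

import (
	"fmt"
	"sync"
)

// Services tracks which services are loaded and which are running.
type Services struct {
	mu      sync.Mutex
	loaded  map[string]func() // service ID -> start function
	running map[string]bool
}

func NewServices() *Services {
	return &Services{loaded: map[string]func(){}, running: map[string]bool{}}
}

// Load adds the service to the list of services without starting it.
func (s *Services) Load(id string, start func()) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if _, ok := s.loaded[id]; !ok {
		s.loaded[id] = start
	}
}

// Start starts a loaded service; it is a no-op if already running.
func (s *Services) Start(id string) error {
	s.mu.Lock()
	start, ok := s.loaded[id]
	if !ok {
		s.mu.Unlock()
		return fmt.Errorf("service %q not loaded", id)
	}
	if s.running[id] {
		s.mu.Unlock()
		return nil
	}
	s.running[id] = true
	s.mu.Unlock()

	start()
	return nil
}

// Stop marks the service as stopped so it can be started again.
func (s *Services) Stop(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()

	delete(s.running, id)
}

// LoadAndStart combines both steps, as the renamed method does.
func (s *Services) LoadAndStart(id string, start func()) error {
	s.Load(id, start)
	return s.Start(id)
}

func main() {
	svcs := NewServices()
	starts := 0

	_ = svcs.LoadAndStart("ntpd", func() { starts++ })
	_ = svcs.Start("ntpd") // already running: no-op
	svcs.Stop("ntpd")
	_ = svcs.Start("ntpd") // re-enters the running state

	fmt.Println("start calls:", starts)
}
```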
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>