25 Commits

Author SHA1 Message Date
Spencer Smith
c63c7f15e2 fix: respect nameservers when using docker cluster
This PR will fix some unexpected user behavior where nameservers were
always getting written to 8.8.8.8,1.1.1.1 for the docker-based talos
clusters. This occurred even when updating the docker daemon's config.
This PR will make the docker provisioner respect the --nameserver flag
and allow that to be used to override the defaults.
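
A minimal sketch of the override behavior described above. The helper name `resolveNameservers` and its comma-separated input format are assumptions made for illustration; only the two default addresses come from this message.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// defaultNameservers mirrors the defaults named in the commit message.
var defaultNameservers = []string{"8.8.8.8", "1.1.1.1"}

// resolveNameservers is a hypothetical helper: it returns the defaults when
// the flag value is empty and the parsed flag value otherwise.
func resolveNameservers(flagValue string) ([]net.IP, error) {
	raw := defaultNameservers
	if strings.TrimSpace(flagValue) != "" {
		raw = strings.Split(flagValue, ",")
	}

	ips := make([]net.IP, 0, len(raw))
	for _, s := range raw {
		ip := net.ParseIP(strings.TrimSpace(s))
		if ip == nil {
			return nil, fmt.Errorf("invalid nameserver %q", s)
		}
		ips = append(ips, ip)
	}
	return ips, nil
}

func main() {
	ips, err := resolveNameservers("192.168.1.1")
	fmt.Println(ips, err)
}
```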

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-05-15 13:58:30 -07:00
Andrew Rynhard
49307d554d refactor: improve machined
This is a rewrite of machined. It addresses some of the limitations and
complexity in the implementation. This introduces the idea of a
controller. A controller is responsible for managing the runtime, the
sequencer, and a new state type introduced in this PR.

A few highlights are:

- no more event bus
- functional approach to tasks (no more types defined for each task)
  - the task function definition now offers a lot more context, like
    access to raw API requests, the current sequence, a logger, the new
    state interface, and the runtime interface.
- no more panics to handle reboots
- additional initialize and reboot sequences
- graceful gRPC server shutdown on critical errors
- config is now stored at install time to avoid having to download it at
  install time and at boot time
- upgrades now use the local config instead of downloading it
- the upgrade API's preserve option takes precedence over the config's
  install force option

Additionally, this pulls various packages in under machined to make the
code easier to navigate.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-28 08:20:55 -07:00
Andrew Rynhard
37a7906f09 chore: fix markdown linting issues
This fixes random markdown linting issues. The previous `sentences-per-line`
library seems to be broken now, and unmaintained. This moves to using
`textlint` instead.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-26 20:38:03 -07:00
Spencer Smith
71f97dbb69 feat: make machine config persist by default
This PR will change the default behavior for machine configs to
`persist: true`. This seems to be the behavior our users expect, so
we'll move to this method for v0.5.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-04-15 12:12:42 -07:00
Andrey Smirnov
0af7624c7d fix: resolve race condition in createNodes
Due to the race, main goroutine might consume all the errors from
`errCh` and close `nodesCh`, so node goroutine might hit panic on send
to closed channel.

```
panic: send on closed channel

goroutine 40 [running]:
github.com/talos-systems/talos/internal/pkg/provision/providers/firecracker.(*provisioner).createNodes.func1(0x26ab668, 0xc00025a000, 0xc0005a83c0, 0xc00029d540, 0xc000536120, 0xc000464540, 0xc000041d80, 0x18, 0xc0006d406c, 0x4, ...)
	/src/internal/pkg/provision/providers/firecracker/node.go:55 +0x1fa
created by github.com/talos-systems/talos/internal/pkg/provision/providers/firecracker.(*provisioner).createNodes
	/src/internal/pkg/provision/providers/firecracker/node.go:50 +0x1ca
```
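
One common way to rule out this class of panic is sketched below. It is not necessarily the fix applied in this commit: each worker sends exactly one result on a buffered channel, the collector receives exactly one result per worker, and no channel is closed while a sender may still be running. The names `createAll`, `result`, and `errCreate` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errCreate is a hypothetical sentinel used only by the demo below.
var errCreate = errors.New("create failed")

type result struct {
	name string
	err  error
}

// createAll runs create for every name concurrently and collects one result
// per goroutine. The channel is buffered and never closed, so a late sender
// can neither block nor hit "send on closed channel".
func createAll(names []string, create func(string) error) ([]string, error) {
	results := make(chan result, len(names))

	for _, name := range names {
		go func(name string) {
			results <- result{name: name, err: create(name)}
		}(name)
	}

	var (
		created  []string
		firstErr error
	)
	for range names { // receive exactly len(names) results
		r := <-results
		if r.err != nil {
			if firstErr == nil {
				firstErr = r.err
			}
			continue
		}
		created = append(created, r.name)
	}
	return created, firstErr
}

func main() {
	created, err := createAll([]string{"a", "b", "c"}, func(n string) error {
		if n == "b" {
			return errCreate
		}
		return nil
	})
	fmt.Println(len(created), err)
}
```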

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-04-10 14:15:41 -07:00
Spencer Smith
b84d5e2660 feat: allow for exposing ports on docker clusters
This PR will introduce a `-p/--exposed-ports` flag to talosctl. This
flag will allow us to enable port forwards on worker nodes only. This
will allow for ingresses on docker clusters so we can hopefully use
ingress for Arges initial bootstrapping. I modeled this after how KIND allows ingresses
[here](https://kind.sigs.k8s.io/docs/user/ingress/)

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-30 15:24:25 -04:00
Andrew Rynhard
6fe5fed6f9 fix: make upgrades work with UEFI
Since the `--once` option of `extlinux` seems to only work with BIOS, we
needed to change to remove any reliance on this option. Instead of
booting the upgraded version once, and then making it the default after
a successful boot, we now make it the default, and then revert on any
boot error.
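
The "make it the default, then revert on error" ordering can be sketched as follows. The `bootConfig` type and its methods are hypothetical and do not reflect the actual bootloader code; the entry labels `A` and `B` are placeholders.

```go
package main

import "fmt"

// bootConfig is a hypothetical stand-in for the bootloader configuration:
// Default is the entry booted next, Fallback is the entry to return to.
type bootConfig struct {
	Default  string
	Fallback string
}

// stageUpgrade makes the new entry the default right away and remembers the
// previous default, instead of relying on a boot-once option.
func (c *bootConfig) stageUpgrade(next string) {
	c.Fallback = c.Default
	c.Default = next
}

// revert restores the previous default after a boot error. It reports
// whether there was an entry to revert to.
func (c *bootConfig) revert() bool {
	if c.Fallback == "" {
		return false
	}
	c.Default, c.Fallback = c.Fallback, ""
	return true
}

func main() {
	cfg := &bootConfig{Default: "A"}
	cfg.stageUpgrade("B")
	fmt.Println("booting", cfg.Default)

	bootFailed := true // pretend the upgraded entry failed to boot
	if bootFailed {
		cfg.revert()
	}
	fmt.Println("default after revert:", cfg.Default)
}
```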

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-26 13:34:00 -07:00
Andrew Rynhard
5dbc26c7a3 feat: rename osctl to talosctl
This is a rename of the osctl binary. We decided that talosctl is a
better name for the Talos CLI. This does not break any APIs, but does
make older documentation only accurate for previous versions of Talos.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-20 19:07:39 -07:00
Andrew Rynhard
69fa63a7b2 refactor: perform upgrade upon reboot
This PR introduces a new strategy for upgrades. Instead of attempting to
zap the partition table, create a new one, and then format the
partitions, this change will only update the `vmlinuz`, and
`initramfs.xz` being used to boot. It introduces an A/B style upgrade
process, which will allow for easy rollbacks. One deviation from our
original intention with upgrades is that this change does not completely
reset a node. It falls just short of that and does not reset the
partition table. This forces us to keep the current partition scheme in
mind as we make changes in the future, because an upgrade assumes a
specific partition scheme. We can improve upgrades further in the
future, but this will at least make them more dependable. Finally, one
more feature in this PR is the ability to keep state. This enables
single node clusters to upgrade since we keep the etcd data around.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-20 17:32:18 -07:00
Andrey Smirnov
d5f80858dd test: add 'reset' integration test for Reset() API
Every node is reset and rebooted, and it comes back up again, except for the
init node, due to known issues with the init node bootstrapping the etcd cluster
from scratch when metadata is missing (as the node was wiped).

Planned workaround is to prohibit resetting init node (should be coming
next).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-06 23:05:46 +03:00
Andrey Smirnov
bbe2c53d29 feat: generate kubeconfig on the fly on request
This extracts admin kubeconfig generation out of bootkube, now based on
Talos x509 library. On each API request for `kubeconfig`, config is
generated on the fly and sent back on the wire.

This fixes two issues:

* any master node can now generate `kubeconfig` (worker nodes can do
that too, but that should probably change in the future)
* after an upgrade that wipes the disk, `osctl kubeconfig` still
works

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-28 21:00:52 +03:00
Andrey Smirnov
d5d3035c8c test: enable upgrade tests 0.4.x -> latest
With the fix #1904, it's now possible to upgrade 0.4.x with
`machine.File` extra files (caused by registry mirror for
registry.ci.svc).

Bump resources for upgrade tests in an attempt to speed them up.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-26 00:09:32 +03:00
Andrew Rynhard
64b5b32732 refactor: use go-procfs
This makes use of the external procfs package that is based on the
package we are removing here.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-02-19 15:58:57 -08:00
Andrey Smirnov
afea21bc5a fix: stop firecracker launcher on signal
When the inner function was added, `return nil` was not aborting the launch
sequence, but rather leading to a VM restart. `cluster destroy` still
worked fine, as it removes the state directory and the launcher exits on
failure.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-19 18:04:48 +03:00
Andrey Smirnov
33332f4c74 chore: support bootloader emulation in firecracker provisioner
Before every boot, the Firecracker launcher tries to open the VM disk image,
parses the partition table, finds the boot partition, tries to read it as a FAT32
filesystem, extracts the uncompressed kernel from `bzImage` (firecracker
doesn't support `bzImage` yet), extracts the initramfs and passes it to the
firecracker binary.

This flow allows for extended tests, e.g. testing installer, upgrade and
downgrade tests, etc.

Bootloader emulation is disabled by default for now; it can be enabled via
the `--with-bootloader-emulation` flag to `osctl cluster create`.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-13 23:21:37 +03:00
Andrey Smirnov
76c2038b13 chore: implement loadbalancer for firecracker provisioner
This PR contains generic simple TCP loadbalancer code, and glue code for
firecracker provisioner to use this loadbalancer.

K8s control plane is passed through the load balancer, and Talos API is
passed only to the init node (for now, as some APIs, including
kubeconfig, don't work with non-init node).
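
For illustration, a simple round-robin TCP load balancer might look like the sketch below. This is not the code from this PR: the `roundRobin` type, the `serve` and `proxy` functions, and the second upstream address are assumptions. Only the standard `net` and `io` packages are used.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"sync/atomic"
)

// roundRobin hands out upstream addresses in rotation.
type roundRobin struct {
	upstreams []string
	next      uint64
}

// pick returns the next upstream; it is safe for concurrent use.
func (r *roundRobin) pick() string {
	n := atomic.AddUint64(&r.next, 1) - 1
	return r.upstreams[n%uint64(len(r.upstreams))]
}

// closeWrite half-closes a TCP connection so the peer sees EOF.
func closeWrite(c net.Conn) {
	if tc, ok := c.(*net.TCPConn); ok {
		tc.CloseWrite()
	}
}

// proxy copies bytes in both directions between client and upstream.
func proxy(client net.Conn, upstream string) {
	defer client.Close()

	backend, err := net.Dial("tcp", upstream)
	if err != nil {
		return // upstream unreachable: drop the connection
	}
	defer backend.Close()

	done := make(chan struct{}, 2)
	go func() { io.Copy(backend, client); closeWrite(backend); done <- struct{}{} }()
	go func() { io.Copy(client, backend); closeWrite(client); done <- struct{}{} }()
	<-done
	<-done
}

// serve accepts connections until the listener is closed.
func serve(l net.Listener, rr *roundRobin) {
	for {
		conn, err := l.Accept()
		if err != nil {
			return
		}
		go proxy(conn, rr.pick())
	}
}

func main() {
	// Hypothetical control plane endpoints.
	rr := &roundRobin{upstreams: []string{"10.5.0.2:6443", "10.5.0.3:6443"}}
	for i := 0; i < 3; i++ {
		fmt.Println(rr.pick())
	}
}
```

To route real traffic, `serve` would be given a listener from `net.Listen("tcp", ...)`. Routing an API to a single node only, as described for the Talos API, corresponds to a `roundRobin` with one upstream.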

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-13 23:07:13 +03:00
Andrey Smirnov
fae5e6915d chore: rework firecracker code around upstream Go SDK + PRs
This removes use of private fork with custom `ip=` kernel argument
handling and switches fully to upstream version of it.

Firecracker Go SDK version is `master` + following PRs:

* https://github.com/firecracker-microvm/firecracker-go-sdk/pull/167
* https://github.com/firecracker-microvm/firecracker-go-sdk/pull/177
* https://github.com/firecracker-microvm/firecracker-go-sdk/pull/178

MTU handling support was implemented as well.

Changes:

* hostname to each node is passed via `talos.hostname=` kernel arg
* IP configuration is generated by SDK from CNI result
* fixed bugs with wrong netmask
* nameservers & MTU are passed via Talos config

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-01-29 02:35:15 +03:00
Andrey Smirnov
cdfc0b8099 chore: remove Firecracker bridge interface in osctl cluster destroy
Cleaning things up so that the IP network can be re-used with another
network name (and interface name).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-01-28 17:18:45 +03:00
Andrey Smirnov
9da687d2a3 test: firecracker provisioner fixes, implement cluster destroy
This implements `osctl cluster destroy` for Firecracker and adds a
new utility command, `osctl cluster show`.

Firecracker mode now has control process for firecracker VMs, allowing
clean reboots and background operations.

Lots of small fixes to Firecracker mode, clean CNI shutdown, cleaning up
netns, etc.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-01-21 17:11:06 -08:00
Andrey Smirnov
2bf8540855 test: provision Talos clusters via Firecracker VMs
This is the initial PR to push the initial code; it has several known
problems which are going to be addressed in follow-up PRs:

1. there's no "cluster destroy", so the only way to stop the VMs is to
`pkill firecracker`

2. provisioner creates state in `/tmp` and never deletes it, that is
required to keep cluster running when `osctl cluster create` finishes

3. doesn't run any controller process around firecracker to support
reboots/CNI cleanup (vethxyz interfaces are lingering on the host as
they're never cleaned up)

The plan is to create some structure in `~/.talos` to manage cluster
state, e.g. `~/.talos/clusters/<name>` which will contain all the
required files (disk images, file sockets, VM logs, etc.). This
directory structure will also work as a way to detect running clusters
and clean them up.

For point number 3, `osctl cluster create` is going to exec lightweight
process to control the firecracker VM process and to simulate VM reboots
if firecracker finishes cleanly (when VM reboots).
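
The control process described in this plan could follow a loop like the one sketched here. The names `superviseVM` and `errLaunch` are hypothetical, and `launch` stands in for starting the firecracker process and waiting for it to exit.

```go
package main

import (
	"errors"
	"fmt"
)

// errLaunch is a hypothetical sentinel for a failed VM launch.
var errLaunch = errors.New("launch failed")

// superviseVM runs launch repeatedly. A clean exit is treated as a VM reboot
// and triggers another launch; a closed stop channel or a launch error ends
// the loop.
func superviseVM(stop <-chan struct{}, launch func() error) error {
	for {
		err := launch()

		select {
		case <-stop:
			return nil // stop requested: exit instead of restarting
		default:
		}

		if err != nil {
			return err // failure: give up rather than restart
		}
		// Clean exit without a stop request: simulate a reboot.
	}
}

func main() {
	stop := make(chan struct{})
	boots := 0

	err := superviseVM(stop, func() error {
		boots++
		if boots == 3 {
			close(stop) // pretend a destroy request arrived during boot 3
		}
		return nil
	})
	fmt.Println(boots, err)
}
```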

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-01-16 00:27:08 +03:00
Andrew Rynhard
898cf01f0a refactor: unify generate type and machine type
We have been using two packages that define a config type and a machine
type, when really they are one and the same. This unifies the types down
to one set.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-01-10 16:46:28 -08:00
Spencer Smith
75d9f7b454 feat: support configurable docker-based clusters
This PR will allow users to issue `osctl config generate`, tweak the
configs to their liking, then use those configs to call `osctl cluster
create`.

Example workflow:

```
osctl config generate my-cluster https://10.5.0.2:6443 -o ./my-cluster

** tweaky tweak **

osctl cluster create --name my-cluster --input-dir "$PWD/my-cluster"
```

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-01-08 14:11:56 -05:00
Spencer Smith
6722a52aba chore: allow re-use of docker network for local clusters
This PR will allow users to use an existing docker network for their
talos cluster. Hoping this will be useful for those wanting further
control and configuration of their local docker clusters, as well as
possibly useful for us during CI. The docker networks can be pre-created
with something like: `docker network create my-cluster --subnet
192.168.0.0/24 --label talos.owned=true --label
talos.cluster.name=my-cluster`. Note that the labels are pre-reqs for our discovery and re-use of these networks.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-01-03 16:21:07 -05:00
Andrey Smirnov
ebd40bd0eb chore: use osctl cluster --wait in basic-integration
There are a few workarounds for the way Drone runs the integration test:
DinD runs as a separate pod, and we can only access its ports exposed on the
"host", while from the Talos cluster this endpoint is not reachable.

So internally Talos nodes still use addresses like "10.5.0.2", while the
test uses "docker" to access it (that's the name of the `docker` service
in the pipeline).

When running locally, 127.0.0.1 is used as endpoint, which should work
fine both on OS X and Linux.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-30 15:15:42 -08:00
Andrey Smirnov
0081ac5fac refactor: extract Talos cluster provisioner as common code
This extracts Docker Talos cluster provisioner as common code
which might be shared between `osctl cluster` and integration-test.

There should be almost no functional changes.

As proof of concept, abstract cluster readiness checks were implemented
based on provisioned cluster state. It implements same checks as
`basic-integration.sh` in pure Go via Talos/K8s clients.
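
As a rough illustration of readiness checks expressed as polled conditions, consider the sketch below. The `check` type and `waitAll` function are hypothetical and are not the API of the `conditions` package.

```go
package main

import (
	"fmt"
	"time"
)

// check is a hypothetical named readiness condition.
type check struct {
	name string
	ok   func() bool
}

// waitAll polls each check in order, up to attempts times with delay between
// tries, and fails on the first check that never passes.
func waitAll(checks []check, attempts int, delay time.Duration) error {
	for _, c := range checks {
		passed := false
		for i := 0; i < attempts; i++ {
			if c.ok() {
				passed = true
				break
			}
			time.Sleep(delay)
		}
		if !passed {
			return fmt.Errorf("check %q did not pass after %d attempts", c.name, attempts)
		}
	}
	return nil
}

func main() {
	polls := 0
	checks := []check{
		{name: "api reachable", ok: func() bool { return true }},
		{name: "nodes ready", ok: func() bool { polls++; return polls >= 3 }},
	}
	fmt.Println(waitAll(checks, 5, 10*time.Millisecond))
}
```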

`conditions` package was promoted from machined-internal to
`internal/pkg` as it is used to run the checks.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-27 12:14:19 -08:00