This allows the config.Debug setting to control container output for easier troubleshooting.
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
Fixes #1419
This is required to avoid later startup failures while trying to connect
to etcd if it hasn't actually bootstrapped.
This health check performs only a connectivity check, with no quorum or
leader checks, as those depend on the cluster state in general.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Just a small nit: since all the services share the same package, a global
variable with a generic name might lead to fun collisions.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This fixes a long standing issue with upgrading the init node. We
currently have no way of knowing whether the init node should join an
existing etcd cluster, or create a new one. This makes use of the node's
metadata to determine if the node has already created the etcd cluster.
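
As a rough sketch of the decision described above: the `nodeMetadata` type and
the `initialClusterState` helper are hypothetical names, and mapping the
upgrade flag onto etcd's `--initial-cluster-state` values (`new` or `existing`)
is an assumption, not necessarily how this change implements it.

```go
package main

import "fmt"

// nodeMetadata is a hypothetical stand-in for the node's install metadata.
type nodeMetadata struct {
	// Upgraded is true when the last install ran as part of an upgrade.
	Upgraded bool
}

// initialClusterState picks a value for etcd's --initial-cluster-state flag.
// Assumption: an upgraded init node has already created the cluster, so it
// must join the existing one instead of bootstrapping a new one.
func initialClusterState(meta nodeMetadata) string {
	if meta.Upgraded {
		return "existing"
	}
	return "new"
}

func main() {
	fmt.Println(initialClusterState(nodeMetadata{Upgraded: true}))
	fmt.Println(initialClusterState(nodeMetadata{Upgraded: false}))
}
```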
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This introduces the notion of metadata for a node. In this initial pass
there are only two fields. A timestamp to indicate when the install was
performed, and a field to indicate if the install was performed as part
of an upgrade.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This starts with a very simple test for `osctl version`, using regexps
because the output of the command depends heavily on the current version.
We might use more 'gold' matches for other commands.
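
To illustrate why a regexp fits better than an exact match here, a minimal
sketch follows. The `looksLikeVersion` helper and the sample strings are
invented for illustration and are not actual `osctl version` output.

```go
package main

import (
	"fmt"
	"regexp"
)

// versionRe matches a tag such as v0.3.0, optionally followed by a suffix.
// The pattern is an assumption about what a version tag looks like.
var versionRe = regexp.MustCompile(`^v\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$`)

// looksLikeVersion reports whether s has the shape of a version tag, without
// pinning the test to any specific release.
func looksLikeVersion(s string) bool {
	return versionRe.MatchString(s)
}

func main() {
	for _, s := range []string{"v0.3.0", "v0.3.0-alpha.7-12-gabc1234", "latest"} {
		fmt.Println(s, looksLikeVersion(s))
	}
}
```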
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This PR will re-enable e2e testing by using the new cluster api
bootstrap provider and various infra providers.
Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
Since APId/gRPC connections should never go through a proxy, we will explicitly exclude
these environment variables from apid.
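
A minimal sketch of this kind of filtering is shown below. The commit does not
name the variables, so the set used here (`HTTP_PROXY`, `HTTPS_PROXY`,
`NO_PROXY`, matched case-insensitively) is an assumption, and
`withoutProxyEnv` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// proxyVars lists the conventional proxy variable names. This set is an
// assumption; the commit does not spell out which variables it excludes.
var proxyVars = map[string]bool{
	"HTTP_PROXY":  true,
	"HTTPS_PROXY": true,
	"NO_PROXY":    true,
}

// withoutProxyEnv returns a copy of env ("KEY=value" entries) with the proxy
// variables removed. Names are compared case-insensitively.
func withoutProxyEnv(env []string) []string {
	filtered := make([]string, 0, len(env))
	for _, kv := range env {
		name := strings.SplitN(kv, "=", 2)[0]
		if proxyVars[strings.ToUpper(name)] {
			continue
		}
		filtered = append(filtered, kv)
	}
	return filtered
}

func main() {
	env := []string{"PATH=/bin", "https_proxy=http://proxy:3128", "NO_PROXY=localhost"}
	fmt.Println(withoutProxyEnv(env))
}
```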
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
Without the host network namespace, networkd and ntpd didn't work properly. NTP failed to
start up because it couldn't reach the NTP servers, and networkd failed to configure
the interfaces and display interface information.
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
This is just the first step and the core foundation.
It can be used like:
```
make integration.test
osctl cluster create
build/integration.test -test.v
```
This should run the test against the Docker instance.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
The name helper isn't very good, so this renames it to Client. This also
adds a new func, NewForConfig, which allows creating the helper client
from an arbitrary Kubernetes REST config.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This verifies that all etcd members are running before performing an
upgrade. Without this we run the risk of destroying the etcd cluster.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
We should use 127.0.0.1 only in special cases (like when bootstrapping
the cluster). There is the potential that the local etcd member is
unhealthy and/or not responsive. This adds a function for creating an etcd
client configured with all control plane node IPs in order to better
handle this case.
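
Building the endpoint list could look roughly like the sketch below.
`endpointsFor` is a hypothetical helper, the `https` scheme is an assumption,
and 2379 is etcd's default client port. The resulting slice is what would be
handed to an etcd client as its endpoints.

```go
package main

import (
	"fmt"
	"net"
)

// etcdClientPort is etcd's default client port.
const etcdClientPort = "2379"

// endpointsFor builds one etcd client endpoint per control plane IP.
// net.JoinHostPort takes care of bracketing IPv6 addresses.
func endpointsFor(ips []string) []string {
	endpoints := make([]string, 0, len(ips))
	for _, ip := range ips {
		endpoints = append(endpoints, "https://"+net.JoinHostPort(ip, etcdClientPort))
	}
	return endpoints
}

func main() {
	fmt.Println(endpointsFor([]string{"10.0.0.2", "10.0.0.3", "fd00::1"}))
}
```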
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
We should add an etcd member only if it has not already been added. When
a control plane node is rebooted, or down for whatever reason, when it
comes back up it will attempt to add itself again. When it does so, the
cluster is unhealthy because the node was down. A feature
of etcd called "strict-reconfig-check" prevents any member adds when the
cluster is unhealthy since doing so would cause the cluster to lose
quorum.
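
The guard can be sketched as a membership check before the add. The `member`
type and `shouldAddMember` helper are hypothetical; in practice the member
list would come from the etcd cluster API.

```go
package main

import "fmt"

// member is a hypothetical, trimmed-down view of an etcd cluster member.
type member struct {
	Name string
}

// shouldAddMember reports whether a node named name still needs to be added,
// i.e. it does not already appear in the current member list.
func shouldAddMember(members []member, name string) bool {
	for _, m := range members {
		if m.Name == name {
			return false
		}
	}
	return true
}

func main() {
	members := []member{{Name: "master-1"}, {Name: "master-2"}}
	fmt.Println(shouldAddMember(members, "master-2")) // rebooted node, already a member
	fmt.Println(shouldAddMember(members, "master-3")) // new node
}
```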
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This moves the Kubeconfig API endpoint to machined and consolidates the
"read a file" code into machined. This also changes Kubeconfig to
use the CopyOut method, which makes Kubeconfig a streaming gRPC call.
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
We need to stop etcd earlier in the upgrade sequence to prevent machined
from trying to restart it after leaving the etcd cluster. We also need
to remove the data-dir since all the data becomes invalid once we leave
the etcd cluster.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
Using the CRI seems to be more dependable in ensuring that we don't hit
EBUSY when trying to reset the system disk after stopping all
containers.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This adds an extra phase to the upgrade sequence that ensures we don't hit
EBUSY when attempting to delete the ephemeral partition. This is crucial
because if we fail to do so, the disk does not have a bootloader and we
effectively destroy the machine. It works by attempting to open the block
device with O_EXCL: if the block device is in use by the system (e.g., mounted),
open() fails with the error EBUSY.
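
A minimal, Linux-only sketch of the O_EXCL probe follows. `deviceBusy` is a
hypothetical helper name and `/dev/sda` is only an example path; this is not
necessarily how the upgrade phase itself is written.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// deviceBusy reports whether the block device at path is in use. On Linux,
// opening a block device with O_EXCL (and without O_CREAT) fails with EBUSY
// while the device is in use, for example while it is mounted.
func deviceBusy(path string) (bool, error) {
	f, err := os.OpenFile(path, os.O_RDONLY|os.O_EXCL, 0)
	if err != nil {
		if errors.Is(err, syscall.EBUSY) {
			return true, nil
		}
		// Any other error (missing device, permissions) is reported as-is.
		return false, err
	}
	return false, f.Close()
}

func main() {
	path := "/dev/sda" // example path only; pass a real device as an argument
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	busy, err := deviceBusy(path)
	fmt.Println(busy, err)
}
```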
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
There is no need for these packages to be in the base image. This moves
to installing them using ONBUILD.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This moves to using the retry package for retrying NTP queries. It also
adds some additional logging that is useful when NTP queries fail.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
There are cases where we can see EBUSY when attempting to use the BLKPG
ioctl. The recommendation seems to be to retry when this happens.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
Since dmesg is not streamed, it becomes difficult to debug issues with
machined. This fixes that by setting up machined's logging to go to
/dev/kmsg and to a log file.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This adds a timestamp to /boot/installed. It can be useful for
determining the last known successful install.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This addresses an issue caused by containers that refuse to exit with
SIGTERM. After sending SIGTERM, we send SIGKILL after a timeout of one minute.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
Trying to be smart about whether or not an install is being performed
as part of an upgrade has proven to be error-prone. This moves to
performing installs with explicit args.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This adds `CAP_DAC_READ_SEARCH`, `CAP_DAC_OVERRIDE`, and `CAP_SYSLOG`
capabilities to osd. This fixes the ability to read dmesg and kubeconfig.
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
Since bootkube should only be run once, we need a way to determine if it
has already been run. This makes use of etcd to store a key-value pair
indicating that the cluster has been initialized.
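
The run-once guard can be sketched against a small key-value interface. The
`store` interface, the in-memory `memStore`, the key name, and `runOnce` are
all hypothetical; a real implementation would wrap an etcd client and would
likely want the check and the write to be atomic.

```go
package main

import "fmt"

// store is a hypothetical minimal key-value interface.
type store interface {
	Get(key string) (value string, found bool, err error)
	Put(key, value string) error
}

// memStore is an in-memory store used only to make the sketch runnable.
type memStore map[string]string

func (m memStore) Get(key string) (string, bool, error) {
	v, ok := m[key]
	return v, ok, nil
}

func (m memStore) Put(key, value string) error {
	m[key] = value
	return nil
}

// initializedKey is a made-up key name for the "cluster initialized" marker.
const initializedKey = "initialized"

// runOnce runs bootstrap only if the marker key is absent, then records it.
func runOnce(s store, bootstrap func() error) error {
	_, found, err := s.Get(initializedKey)
	if err != nil || found {
		return err
	}
	if err := bootstrap(); err != nil {
		return err
	}
	return s.Put(initializedKey, "true")
}

func main() {
	s := memStore{}
	runs := 0
	for i := 0; i < 2; i++ {
		_ = runOnce(s, func() error { runs++; return nil })
	}
	fmt.Println("bootstrap runs:", runs)
}
```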
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
- adjust ul margin to keep the bullets inside the content area
- fix a few docs page responsiveness problems on small screens
- adjust the layout of the logo relative to the docs sidebar
- clean up some vestigial CSS classes
Signed-off-by: Tim Gerla <tim@gerla.net>
The v0.2 docs are inaccurate, and in general just bad. Since we made so
many breaking changes in v0.3 I think it's better we just hit the reset
button and stick to v0.3 going forward.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This sets the list-style-position to inside by default, and overrides
the landing page to use outside. This way we only need to maintain the
CSS for the landing page and not all the other potential places we would
want inside in the future.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>