6 Commits

Author SHA1 Message Date
Andrey Smirnov
a2233bfe46
fix: improve NTP sync process
Fixes #4425

* add more logging for responses and sync process
* adjust time sync constants
* change the way poll interval is chosen (increasing on good sync,
decreasing on variation)
* filter out spikes

Based on flow in https://github.com/systemd/systemd/blob/main/src/timesync/timesyncd-manager.c

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-11-11 20:39:07 +03:00
Andrey Smirnov
e8fccbf535
fix: clear time adjustment error when setting time to specific value
I don't think this is going to fix time issues, so just a small cleanup.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-11-10 16:34:27 +03:00
Andrey Smirnov
983d2459e2
feat: suppress logging NTP sync to the console
Only `jump` syncs are logged to the console and any errors syncing.
Regular `slew` syncs are suppressed (only visible in
`talosctl logs controller-runtime`).

The very first sync is always reported to console.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-10-08 15:15:36 +03:00
Artem Chernyshev
1db301edf6 feat: switch controller-runtime to zap.Logger
Enable logging using default development config with some fine tuning.
Additionally, now `info` and below logs go to kmsg.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-05-25 02:15:31 -07:00
Andrey Smirnov
4d50a4edd0 fix: update the way NTP sync uses adjtimex syscall
Fixes #3582

Time adjustment code was rewritten taking a peek at other time sync
implementations. Looks like `adjtimex` was used incorrectly before which
leads to huge time oscillations and `STA_UNSYNC` being set by the
kernel. Instead of setting time via `settimeofday`, use `adjtimex` as
well to set the time on big jump.

With this change, oscillation is pretty stable around zero, in
microsecond range (polling interval lowered for testing):

```
172.20.0.2: 2021/05/06 18:51:28  time.SyncController: adjusting time (slew) by -11.375µs via 192.36.143.130, state TIME_OK, status STA_PLL | STA_NANO
172.20.0.2: 2021/05/06 18:51:37  time.SyncController: adjusting time (slew) by 426.276µs via 192.36.143.130, state TIME_OK, status STA_PLL | STA_NANO
172.20.0.2: 2021/05/06 18:51:50  time.SyncController: adjusting time (slew) by -622.037µs via 192.36.143.130, state TIME_OK, status STA_PLL | STA_NANO
172.20.0.2: 2021/05/06 18:51:58  time.SyncController: adjusting time (slew) by 59.822µs via 192.36.143.130, state TIME_OK, status STA_PLL | STA_NANO
172.20.0.2: 2021/05/06 18:52:11  time.SyncController: adjusting time (slew) by 126.855µs via 192.36.143.130, state TIME_OK, status STA_NANO | STA_PLL
172.20.0.2: 2021/05/06 18:52:20  time.SyncController: adjusting time (slew) by 17.334µs via 192.36.143.130, state TIME_OK, status STA_NANO | STA_PLL
172.20.0.2: 2021/05/06 18:52:28  time.SyncController: adjusting time (slew) by -108.787µs via 192.36.143.130, state TIME_OK, status STA_NANO | STA_PLL
172.20.0.2: 2021/05/06 18:52:34  time.SyncController: adjusting time (slew) by -71.687µs via 192.36.143.130, state TIME_OK, status STA_PLL | STA_NANO
172.20.0.2: 2021/05/06 18:52:40  time.SyncController: adjusting time (slew) by 114.759µs via 192.36.143.130, state TIME_OK, status STA_PLL | STA_NANO
172.20.0.2: 2021/05/06 18:52:47  time.SyncController: adjusting time (slew) by 46.716µs via 192.36.143.130, state TIME_OK, status STA_PLL | STA_NANO
```

Also one should pick a time server close to the node to get lower RTT
and dispersion.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-05-07 07:27:18 -07:00
Andrey Smirnov
2ea20f598a feat: replace timed with time sync controller
This is a complete rewrite of time sync process.

Now the time sync process starts early at boot time, and it adapts to
configuration changes:

* before config is available, `pool.ntp.org` is used
* once config is available, configured time servers are used

Controller updates same time sync resource as other controllers had
dependency on, so they have a chance to wait for the time sync event.

Talos services which depend on time now wait on same resource instead of
waiting on timed health.

New features:

* time sync now sticks to the particular time server unless there's an
error from that server, and server is changed in that case, this
improves time sync accuracy

* time sync acts on config changes immediately, so it's possible to
reconfigure time sync at any time

* there's a new 'epoch' field in time sync resources which allows
time-dependent controllers to regenerate certs when there's a big enough
jump in time

Features to implement later:

* apid shouldn't depend on timed, it should be started early and it
should regenerate certs on time jump

* trustd should be updated in same way

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-29 09:29:43 -07:00