The new code has several instabilities that need to be addressed.
Fixes: https://github.com/siderolabs/omni/issues/1074
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
- [x] Ensure `Reconciler` stays internally consistent across all variations of `Reconcile` calls (including parallel ones). Track aliases and clusters
side by side.
- [x] Add tests for the above.
- [x] Replace HealthCheck logic with actual TCP probing.
- [x] Replace the probing port on alias removal: if we lose the alias for the probing port, find a new one and use it.
- [x] Expose metrics, specifically `connectionLatency`, `requestStartLatency`, `responseStartLatency`, `inFlightRequests` and `workingProbes`. Register them in Prometheus.
- [x] Add happy path test for the http.Handler.
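
A minimal sketch of the TCP probing and probing-port replacement items above, using only the Go standard library. The names `probeTCP`, `pickProbePort` and `startListener`, as well as the alias-to-port map, are assumptions made for illustration and are not taken from the actual code:

```go
package main

import (
	"fmt"
	"net"
	"sort"
	"time"
)

// probeTCP reports whether a TCP connection to addr can be opened within timeout.
func probeTCP(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}

	conn.Close()

	return true
}

// pickProbePort keeps the current probing port while some alias still points
// at it; otherwise it falls back to the lowest port that still has an alias.
// The alias -> port map is a hypothetical data model for this sketch.
func pickProbePort(current int, aliasPorts map[string]int) (int, bool) {
	ports := make([]int, 0, len(aliasPorts))

	for _, port := range aliasPorts {
		if port == current {
			return current, true
		}

		ports = append(ports, port)
	}

	if len(ports) == 0 {
		return 0, false
	}

	sort.Ints(ports)

	return ports[0], true
}

// startListener opens a throwaway local listener so the probe has a target.
func startListener() (string, func(), error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}

	return ln.Addr().String(), func() { ln.Close() }, nil
}

func main() {
	addr, stop, err := startListener()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}

	fmt.Println("healthy while listening:", probeTCP(addr, time.Second))
	stop()
	fmt.Println("healthy after close:", probeTCP(addr, time.Second))

	// Port 8080 lost its alias, so a remaining aliased port is chosen.
	port, ok := pickProbePort(8080, map[string]int{"grafana": 3000, "api": 9090})
	fmt.Println("new probing port:", port, ok)
}
```

Choosing the lowest remaining port is only there to keep the sketch deterministic; any stable selection rule would do.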
For #886
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Most of the changes are simplifications/refactoring, but there are a few fixes:
* increase LB upstream healthcheck interval to 1 minute
* pass a logger to the LB (as otherwise it creates its own)
* shut down the LB by waiting for it to shut down
* close the LB even when it fails to start to avoid leaking health check goroutines
Additionally, add an integration test for workload proxying.
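
The two shutdown fixes listed above can be sketched roughly as follows. `sketchLB` is a hypothetical stand-in, not the real load balancer type; it assumes only that the LB spawns a health check goroutine on creation and that shutting it down stops that goroutine:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// sketchLB is a hypothetical load balancer that starts a health check
// goroutine as soon as it is created.
type sketchLB struct {
	stop      chan struct{}
	done      chan struct{}
	once      sync.Once
	failStart bool
}

func newSketchLB(failStart bool) *sketchLB {
	lb := &sketchLB{
		stop:      make(chan struct{}),
		done:      make(chan struct{}),
		failStart: failStart,
	}

	// Stand-in for the health check loop: it runs until stop is closed.
	go func() {
		defer close(lb.done)
		<-lb.stop
	}()

	return lb
}

// Start simulates binding the listener, which may fail.
func (lb *sketchLB) Start() error {
	if lb.failStart {
		return errors.New("listen failed")
	}

	return nil
}

// Shutdown signals the health check goroutine and waits for it to exit.
// It is safe to call more than once.
func (lb *sketchLB) Shutdown() {
	lb.once.Do(func() { close(lb.stop) })
	<-lb.done
}

// healthCheckRunning reports whether the health check goroutine is still alive.
func (lb *sketchLB) healthCheckRunning() bool {
	select {
	case <-lb.done:
		return false
	default:
		return true
	}
}

// startLB shuts the LB down even when Start fails, so the health check
// goroutine is not leaked.
func startLB(lb *sketchLB) error {
	if err := lb.Start(); err != nil {
		lb.Shutdown()

		return fmt.Errorf("failed to start LB: %w", err)
	}

	return nil
}

func main() {
	lb := newSketchLB(true)

	fmt.Println("start error:", startLB(lb))
	fmt.Println("health check still running:", lb.healthCheckRunning())
}
```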
Co-authored-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Improve the exposed service reliability by using a TCP loadbalancer between the nodes exposing the service.
Rework the exposed service proxy registry into a COSI controller to simplify the logic and improve reliability and testability.
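
To illustrate why balancing across the nodes helps reliability, here is a minimal sketch of picking an upstream while skipping unhealthy nodes. The `upstream` and `roundRobin` types and the round-robin policy are assumptions for illustration only; the actual load balancer may select upstreams differently:

```go
package main

import "fmt"

// upstream is a hypothetical node endpoint exposing the service.
type upstream struct {
	addr    string
	healthy bool
}

// roundRobin cycles through upstreams; it is not safe for concurrent use.
type roundRobin struct {
	upstreams []upstream
	next      int
}

// pick returns the next healthy upstream address, skipping unhealthy ones.
// It returns false when no upstream is healthy.
func (r *roundRobin) pick() (string, bool) {
	for i := 0; i < len(r.upstreams); i++ {
		idx := (r.next + i) % len(r.upstreams)

		if r.upstreams[idx].healthy {
			r.next = (idx + 1) % len(r.upstreams)

			return r.upstreams[idx].addr, true
		}
	}

	return "", false
}

func main() {
	r := &roundRobin{upstreams: []upstream{
		{addr: "10.0.0.1:8080", healthy: true},
		{addr: "10.0.0.2:8080", healthy: false},
		{addr: "10.0.0.3:8080", healthy: true},
	}}

	for i := 0; i < 3; i++ {
		addr, ok := r.pick()
		fmt.Println(addr, ok)
	}
}
```

With a single fixed node, one unhealthy machine makes the service unreachable; with a picker like this, requests keep flowing as long as any node passes its health check.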
Closes siderolabs/omni#396.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>