Replace the misleading kube_router_controller_bgp_peers gauge, which
only counted cluster nodes, with a new per-peer metric
kube_router_bgp_peer_info implemented as a GaugeVec that exposes the
actual BGP session state from GoBGP. Labels include peer address, ASN,
type, and state. The metric value is 1 if the session is established
and 0 otherwise.
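
A minimal sketch of such a per-peer GaugeVec using
prometheus/client_golang; the names and the state-string comparison are
illustrative, not the exact kube-router implementation:

```go
package metrics

import (
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

// bgpPeerInfo reports one time series per BGP peer; the value encodes
// whether the session is established.
var bgpPeerInfo = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "kube_router",
		Name:      "bgp_peer_info",
		Help:      "BGP peer session state (1 = established, 0 otherwise).",
	},
	[]string{"peer", "asn", "type", "state"},
)

// recordPeerState updates the metric for a single peer. The state
// string here is assumed to come from GoBGP's session FSM.
func recordPeerState(peer string, asn uint32, peerType, state string) {
	value := 0.0
	if state == "ESTABLISHED" { // illustrative comparison
		value = 1.0
	}
	bgpPeerInfo.WithLabelValues(peer, strconv.FormatUint(uint64(asn), 10),
		peerType, state).Set(value)
}
```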
Closes: https://github.com/cloudnativelabs/kube-router/issues/848
Signed-off-by: Roman Kuzmitskii <roman@damex.org>
Changes AFI SAFI configuration to:
* Use consolidated AFI SAFI configuration logic for both internal
  peers and external peers
* Configure AFI SAFIs regardless of whether GracefulRestart is enabled
  (see the sketch after this list)

The second point is important because, by default, GoBGP only
configures a default AFI SAFI for the address family of its configured
peering IP. This meant that dual-stack configurations that did not
enable GracefulRestart previously did not work (see: #1992).
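
A hedged sketch of configuring both address families unconditionally,
assuming the GoBGP v3 gRPC API (github.com/osrg/gobgp/v3/api); this is
not the exact kube-router code:

```go
package main

import api "github.com/osrg/gobgp/v3/api"

// buildAfiSafis enables IPv4 and IPv6 unicast for a peer regardless of
// whether GracefulRestart is configured.
func buildAfiSafis() []*api.AfiSafi {
	families := []*api.Family{
		{Afi: api.Family_AFI_IP, Safi: api.Family_SAFI_UNICAST},
		{Afi: api.Family_AFI_IP6, Safi: api.Family_SAFI_UNICAST},
	}
	afiSafis := make([]*api.AfiSafi, 0, len(families))
	for _, f := range families {
		afiSafis = append(afiSafis, &api.AfiSafi{
			Config: &api.AfiSafiConfig{Family: f, Enabled: true},
		})
	}
	return afiSafis
}
```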
The problem here stems from the fact that when netpol generates its
list of expected ipsets, it includes the inet6: prefix; however, when
the proxy and routing controllers sent their lists of expected ipsets,
they did not. This meant that no matter how we handled it in ipset.go,
it was wrong for one or the other use case.

I decided to standardize on the netpol way of sending the list of
expected ipset names so that BuildIPSetRestore() can function in the
same way for all invocations.
Attempt to filter out sets that we are not authoritative for to avoid
race conditions with other operators (like Istio) that might be
attempting to modify ipsets at the same time.
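
A minimal sketch of that filtering, assuming a hypothetical
"kube-router-" naming prefix and the netpol-style inet6: prefix
convention:

```go
package main

import "strings"

// filterOwnedSets keeps only the ipsets kube-router is authoritative
// for, so a restore doesn't clobber sets managed by other operators.
// The "kube-router-" prefix is illustrative, not the project's exact
// naming convention.
func filterOwnedSets(allSets []string) []string {
	owned := make([]string, 0, len(allSets))
	for _, name := range allSets {
		// The IPv6 handler names sets with an "inet6:" prefix.
		trimmed := strings.TrimPrefix(name, "inet6:")
		if strings.HasPrefix(trimmed, "kube-router-") {
			owned = append(owned, name)
		}
	}
	return owned
}
```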
Back in commit 9fd46cc, when I was pulling out the krnode struct, I
made a mistake in the `syncNodeIPSets()` function and didn't grab the
IPs of all nodes; instead, I only grabbed the IP of the current node
multiple times.

This caused other nodes (besides the current one) to get removed from
the `kube-router-node-ips` ipset, which ensures that we don't NAT
traffic from pods to nodes (daemons and HostNetwork'd items).

This should fix that problem.
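
An illustration of the bug class (hypothetical helper, not the actual
`syncNodeIPSets()` code): the fix is to read each iterated node's
addresses rather than the current node's, assuming k8s.io/api/core/v1:

```go
package main

import corev1 "k8s.io/api/core/v1"

// collectNodeIPs gathers the internal IP of every node in the cluster.
// The buggy version appended the current node's IP on every loop
// iteration instead of reading the iterated node's addresses.
func collectNodeIPs(nodes []corev1.Node) []string {
	ips := make([]string, 0, len(nodes))
	for _, node := range nodes {
		for _, addr := range node.Status.Addresses {
			if addr.Type == corev1.NodeInternalIP {
				ips = append(ips, addr.Address)
			}
		}
	}
	return ips
}
```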
Over time, feedback from users has been that our interpretation of how
the kube-router service.local annotation interacts with the internal
traffic policy is too restrictive: tuning it to fall in line with a
Local internal traffic policy goes further than users expect. This
commit changes that posture by equating the service.local annotation
with External Traffic Policy Local and Internal Traffic Policy Cluster.

This means that when service.local is set, the following will be true:
* ExternalIPs / LoadBalancer IPs will only be available on a node that
hosts the workload
* ExternalIPs / LoadBalancer IPs will only be BGP advertised (when
enabled) by nodes that host the workload
* Services will have the same posture as if External Traffic Policy
  were set to Local
* ClusterIPs will be available on all nodes for load balancing
* ClusterIPs will only be BGP advertised (when enabled) by nodes that
host the workload
* ClusterIP services will have the same posture as if Internal Traffic
  Policy were set to Cluster
For anyone desiring the original functionality of the service.local
annotation that has been in place since kube-router v2.1.0, all that
would need to be done is to set `internalTrafficPolicy` to Local as
described here: https://kubernetes.io/docs/concepts/services-networking/service-traffic-policy/
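
A hedged sketch of the resulting policy resolution (hypothetical
helper, not the actual kube-router implementation), assuming a recent
k8s.io/api/core/v1:

```go
package main

import corev1 "k8s.io/api/core/v1"

// effectivePolicies resolves the traffic policies for a service under
// the new posture: service.local forces the external policy to Local
// while leaving the internal policy at whatever the spec says
// (Cluster by default).
func effectivePolicies(svc *corev1.Service) (corev1.ServiceExternalTrafficPolicy, corev1.ServiceInternalTrafficPolicy) {
	external := corev1.ServiceExternalTrafficPolicyCluster
	if svc.Spec.ExternalTrafficPolicy != "" {
		external = svc.Spec.ExternalTrafficPolicy
	}
	internal := corev1.ServiceInternalTrafficPolicyCluster
	if svc.Spec.InternalTrafficPolicy != nil {
		internal = *svc.Spec.InternalTrafficPolicy
	}
	if svc.Annotations["kube-router.io/service.local"] == "true" {
		external = corev1.ServiceExternalTrafficPolicyLocal
	}
	return external, internal
}
```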
Make the health controller more robust and extensible by adding
constants for heartbeats instead of 3-character random strings that are
easy to get wrong.
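
A minimal sketch of the idea; the type and component names are
illustrative of kube-router's controller abbreviations, not the exact
constants:

```go
package healthcheck

// HealthComponent gives heartbeat identifiers a type, so controllers
// reference shared constants instead of hand-typed short strings.
type HealthComponent string

const (
	NetworkPolicyController   HealthComponent = "NPC"
	NetworkRoutingController  HealthComponent = "NRC"
	NetworkServicesController HealthComponent = "NSC"
)
```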
There are a couple of items that have typically ended up as no-ops for
us when considering routes to inject. However, now that we have a route
map where we track route state, these need to not just be no-ops, but
to also update the route state cache to ensure that the route doesn't
get replaced in the future.
When we find tunnels to clean up, we need to not only remove the tunnel
and the route to that tunnel, but also remove the route from the state
map.
When we discover that no route needs to be added to the host, because
it's not in the same subnet and we weren't supposed to create a tunnel,
then we also clean it up and ensure that it isn't in our state map.
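
A hedged sketch of the pattern (hypothetical structure, not the exact
kube-router route syncer): every path that removes or skips a route
also deletes it from the state map, so the background sync won't
re-install it:

```go
package main

import (
	"net"
	"sync"

	"github.com/vishvananda/netlink"
)

// routeSyncer tracks the routes we have injected so a background sync
// can re-install them if they disappear from the host.
type routeSyncer struct {
	mu       sync.Mutex
	routeMap map[string]*netlink.Route
}

// delInjectedRoute must be called on cleanup and no-op paths too;
// otherwise the syncer would replace the route later.
func (rs *routeSyncer) delInjectedRoute(dst *net.IPNet) {
	rs.mu.Lock()
	defer rs.mu.Unlock()
	delete(rs.routeMap, dst.String())
}
```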
This prepares the way for broader refactors in the way that we handle
nodes by:
* Separating frequently used node logic from the controller creation
  steps
* Keeping reused code DRY-er
* Adding interface abstractions for key groups of node data and
  starting to rely on those more rather than on concrete types (see the
  sketch after this list)
* Separating node data from the rest of the controller data structure
  so that smaller definitions of data can be passed around to functions
  that need them, rather than always passing the entire controller,
  which contains more data / surface area than most functions need.
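
A hedged sketch of the kind of interface abstraction meant here (the
names are illustrative):

```go
package main

import "net"

// NodeIPAware exposes just the node addressing data most functions
// need, instead of the whole controller.
type NodeIPAware interface {
	GetPrimaryNodeIP() net.IP
	GetNodeIPv4Addrs() []net.IP
	GetNodeIPv6Addrs() []net.IP
}

// NodeNameAware exposes the node's name for functions that only need
// to identify the node.
type NodeNameAware interface {
	GetNodeName() string
}
```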
kube-router v2.X introduced the idea of iptables and ipset handlers
that allow kube-router to be dual-stack capable. However, the cleanup
logic for the various controllers was not properly ported when this
happened.

When the cleanup functions run, their controllers often have not been
fully initialized, as cleanup should not be dependent on kube-router
being able to reach a kube-apiserver. As such, they were missing these
handlers, and they either silently ended up doing no-ops or, worse,
they would run into nil pointer failures.

This corrects that, so that kube-router no longer fails this way and
cleans up as it did in v1.X.
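
A minimal sketch of the fix's shape, assuming
github.com/coreos/go-iptables (illustrative, not the exact kube-router
cleanup code): build the per-family handlers inside cleanup instead of
relying on controller initialization:

```go
package main

import "github.com/coreos/go-iptables/iptables"

// newCleanupHandlers builds iptables handlers for both address
// families without touching the kube-apiserver, so cleanup can never
// hit a nil handler.
func newCleanupHandlers() (map[iptables.Protocol]*iptables.IPTables, error) {
	handlers := make(map[iptables.Protocol]*iptables.IPTables)
	for _, proto := range []iptables.Protocol{iptables.ProtocolIPv4, iptables.ProtocolIPv6} {
		ipt, err := iptables.NewWithProtocol(proto)
		if err != nil {
			return nil, err
		}
		handlers[proto] = ipt
	}
	return handlers, nil
}
```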
Since version v6.5.0 of iproute2, the files under /etc/iproute2 are no
longer created automatically; iproute2 instead prefers to install its
files to /usr/lib/iproute2 and, later, /usr/share/iproute2.

This adds fallback path matching to kube-router so that it can find
/etc/iproute2/rt_tables wherever it is defined instead of just failing.

This also means people running kube-router in containers will need to
change their mounts depending on where this file is located on their
host OS. However, ensuring that this file is copied to `/etc/iproute2`
is a legitimate way to keep it consistent across a fleet of multiple
OS versions.
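
A minimal sketch of the fallback lookup, with the candidate list
mirroring the paths named above (the helper name is hypothetical):

```go
package main

import (
	"fmt"
	"os"
)

// findRTTablesPath returns the first rt_tables file found among the
// locations used by different iproute2 versions.
func findRTTablesPath() (string, error) {
	candidates := []string{
		"/etc/iproute2/rt_tables",
		"/usr/share/iproute2/rt_tables",
		"/usr/lib/iproute2/rt_tables",
	}
	for _, p := range candidates {
		if _, err := os.Stat(p); err == nil {
			return p, nil
		}
	}
	return "", fmt.Errorf("rt_tables not found in any of %v", candidates)
}
```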
Upgrades to Go 1.21.7 now that Go 1.20 is no longer being maintained.

This also resolves the race conditions that we were seeing in the BGP
server tests when we upgraded from 1.20 -> 1.21. These appear to stem
from an efficiency change in 1.21 that caused BGP to write to the
events at the same time that the test harness was trying to read from
them. Solved this in a coarse manner by adding surrounding mutexes to
the test code.
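
A hedged sketch of that coarse fix (illustrative, not the actual test
harness): guard the shared event buffer with a mutex on both the write
and read sides:

```go
package main

import "sync"

// eventBuffer is shared between the BGP server goroutine and the test
// harness; the mutex serializes concurrent writes and reads.
type eventBuffer struct {
	mu     sync.Mutex
	events []string
}

func (b *eventBuffer) append(e string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.events = append(b.events, e)
}

func (b *eventBuffer) snapshot() []string {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := make([]string, len(b.events))
	copy(out, b.events)
	return out
}
```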
Additionally, upgraded dependencies.
Adds support for spec.internalTrafficPolicy and fixes support for
spec.externalTrafficPolicy so that it only affects external traffic.

Keeps existing support for the kube-router.io/service.local annotation,
which overrides both to local when set to true. Any other value in this
annotation is ignored.
Prepare for upcoming changes by increasing unit test coverage to ensure
that we correctly handle different boundary conditions when we change
how service local / traffic policies work.