Differentiate headless services from ClusterIP being none, in
preparation for handling the service.kubernetes.io/headless label. One
might thing that handling these is similar, which it sort of is and sort
of isn't. ClusterIP is an immutable field, whereas labels are mutable.
This changes our handling of ClusterIP none-ness from the presence of
the headless label.
When we consider what to do with ClusterIP being none, that is
fundamentally different, because once it is None, the k8s API guarantees
that the service won't ever change.
Whereas the label can be added and removed.
With advertiseService set to false by default, it means that it won't
ever get re-evaluated if the service isn't a local host and will ALWAYS
result in withdrawing the VIPs which is incorrect. It needs to default
to true, and only override the boolean if serviceLocal is set to true.
Before this, we had 2 different ways to interact with ipsets, through
the handler interface which had the best handling for IPv6 because NPC
heavily utilizes it, and through the ipset struct which mostly repeated
the handler logic, but didn't handle some key things.
NPC utilized the handler functions and NSC / NRC mostly utilized the old
ipset struct functions. This caused a lot of duplication between the two
groups of functions and also caused issues with proper IPv6 handling.
This commit consolidates the two sets of usage into just the handler
interface. This greatly simplifies how the controllers interact with
ipsets and it also reduces the logic complexity on the ipset side.
This also fixes up some inconsistency with how we handled IPv6 ipset
names. ipset likes them to be prefixed with inet6:, but we weren't
always doing this in a way that made sense and was consistent across all
functions in the ipset struct.
Previously when a user selected to override the next-hop via GoBGP's
NextHopActions: Self functionality, we did it for all exported routes.
However, in a dual-stack use-case this causes problems for internal pod
IP routes that are spread via BGP advertisements.
Currently, kube-router only peers with an internal peer once over
whatever it's primary IP is according to it's Kubernetes node
information. This means that when overriding next-hop the IP is either
an IPv4 or IPv6 address depending on how the node has configured itself.
Therefore when it attempts to add a route for an IPv6 address and
override next-hop is configured, if the node's primary IP was an IPv4
address this will not succeed as a next-hop for an IPv6 address cannot
be an IPv4 gateway.
Rather than making the code base overly complicated with both an IPv4
and IPv6 peering for internal nodes, this change presents a bit of a
middle ground. By choosing not to override the next-hop for pod subnet
advertisements to internal (Kubernetes node) peers, we eliminate this
problem.
This does change the functionality of kube-router a bit, but one of the
foundational aspects to Kubernetes networking is that all nodes should
be able to contact each other. So I cannot currently think of a good
use-case where overriding the next-hop for pod subnets of internal peers
would be necessary, so I think that this is an ok concession to make.
FoU implementation now properly handles a whole host of things:
* It now actually handles IPv6 by changing the encapsulation protocol to
GUE instead of generic FoU. I worked with generic FoU tunnels for
several days and could get it to support IPv4 and IPv6 at all even
when placing using it with the IPv6 proto and with iproute2 in IPv6
mode (-6)
* It now handles converting between the two tunnel types seemlessly and
without leaving legacy tunnel artifacts behind. Previously, you could
change the encap type but it wouldn't change the tunnels
* Abstracted constants
The previous version of the bgp_policies code only allowed for creating
a policy when the policy didn't exist already. However, with the advent
of dual-stack we need to be able to add / remove statements if we add or
lose a specific IP family (e.g. IPv4 or IPv6) since they are handled in
different statements.
Given that the owner of GoBGP has let us know that policies are
idempotent, this now involves quite a bit of work. We need to follow the
following procedure:
add statements if missing -> add them to a policy -> if policy doesn't
equal the one already in GoBGP -> create the new policy and associate
it -> de-associate the old policy -> remove the old policy
Previously we used to do an idempotent sync all active VIPs any time we
got a service or endpoint update. However, this only worked when we
assumed a single-stack deployment model where IPs were never deleted
unless the whole service was deleted.
In a dual-stack model, we can add / remove LoadBalancer IPs and Cluster
IPs on updates. Given this, we need to take into account the finite
change that happens, and not just revert to sync-all because we'll never
stop advertising IPs that should be removed.
As a fall-back, we still have the outer Run loop that syncs all active
routes every X amount of seconds (configured by user CLI parameter). So
on that timer we'll still have something that syncs all active VIPs and
works as an outer control loop to ensure that desired state eventually
becomes active state if we accidentally remove a VIP that should have
been there.
When a single IP family's set looks to be equal, switch to continue
instead of return so that other families can still be evaluated as those
might have changes.
When enabled, generate the router id by hashing the primary IP.
With this no explicit router id has to be provided on IPv6-only clusters.
Signed-off-by: Erik Larsson <who+github@cnackers.org>
Use different IP ranges in BGP Policies unit test so that it becomes
more obvious when there are unit test failures resulting from
multi-processing of unit tests.
Fixes a problem where a user would end up with redundant external peers
in their BGP policies because getting peers is IP family agnostic and
yet is run twice on the same list.
This also ruined unit test consistency.
Without this logic, it appears that sometimes GoBGP is inclined to match
unintentional routes in policy because of the MATCHSET_ANY declaration
and the way that it interacts with empty sets.
In my testing, without this logic I found that it often resulted in
various routes not being advertised correctly and not even showing up in
GoBGP itself. My current guess is that policy keeps GoBGP from importing
the route into the RIB even from the Protobuf socket connection that
kube-router establishes directly.
We do a lot of getting defined sets for GoBGP and are planning to do
more of it in the future. This commit centralizes the logic for this and
reduces repetition.
Annotations were taken into account during startup, but after they were
advertised the affect of annotations was only additive because we
were only tracking current state of VIPs that should be advertised and
not taking into account VIPs that should be withdrawn for anything other
than service locality.
Fixes#1491
Rather than setting BGP Graceful Restart on both IPv4 and IPv6
regardless of which family is enabled, check the current mode via
nrc.isIpv6 and only set on appropriate family.
Note, this mode is exclusive as the current portions of NRC kube-router
code are only meant to work with IPv4 or IPv6 not both at the same time.
Fixes#1323
Changes the custom import reject annotation support to not only block
the given subnet exactly, but also all subnets of the subnet given.
For example, this change blocks 10.100.100.0/24 when customimportreject
annotation has 10.100.0.0/16 in it.
Added the following items to the original logic:
* Added map route entry deletion on withdrawl so that the system doesn't
incorrectly sync it back to the kernel's routing table
* Added an immediate route sync upon BGP path receive
* Added a mutex to ensure that deleted routes aren't accidentally synced
back to the system
* Added stopCh and wg (wait group) handling
* Increase default sync time from 15 seconds to 1 minute since this
scenario is unlikely and netlink calls could potentially be burdensome
in large clusters.