kube-router v2.X introduced the idea of iptables and ipset handlers that
allow kube-router to be dual-stack capable. However, the cleanup logic
for the various controllers was not properly ported when this happened.
When the cleanup functions run, they often have not had their
controllers fully initialized, as cleanup should not be dependent on
kube-router being able to reach a kube-apiserver.
As a result, they were missing these handlers and would either silently
become no-ops or, worse, run into nil pointer failures.
This corrects that, so that kube-router no longer fails this way and
cleans up as it did in v1.X.
rp_filter on RedHat-based OSes is often set to 1 (strict) rather than 2
(loose); the loose setting is more permissive and allows the outbound
route for traffic to differ from the route of the incoming traffic.
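As a rough illustration (not kube-router's actual code), switching an interface to loose reverse-path filtering is just a sysctl write; the interface name below is purely hypothetical:

```go
package main

import (
	"fmt"
	"os"
)

// setLooseRPFilter writes 2 ("loose" mode) to the rp_filter sysctl for the
// given interface, allowing return traffic to take a different route than
// the one the incoming packet arrived on.
func setLooseRPFilter(iface string) error {
	path := fmt.Sprintf("/proc/sys/net/ipv4/conf/%s/rp_filter", iface)
	if err := os.WriteFile(path, []byte("2"), 0644); err != nil {
		return fmt.Errorf("failed to set rp_filter on %s: %w", iface, err)
	}
	return nil
}

func main() {
	// "kube-bridge" is used here purely for illustration.
	if err := setLooseRPFilter("kube-bridge"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```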
Upgrades to Go 1.21.7 now that Go 1.20 is no longer being maintained.
It also resolves the race conditions that we were seeing in the BGP
server tests after upgrading from Go 1.20 to 1.21. This appears to be
because some efficiency change in 1.21 caused BGP to write to the
events at the same time that the test harness was trying to read from
them. This was solved in a coarse manner by adding mutexes around the
shared state in the test code.
Additionally, upgraded dependencies.
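The mutex approach mentioned above amounts to guarding the shared event state so the writer and the test reader cannot touch it concurrently. A minimal sketch (names are illustrative, not the actual test harness):

```go
package bgptest

import "sync"

// eventRecorder guards the shared BGP event slice with a mutex so the server
// goroutine can append while the test harness reads a consistent snapshot.
type eventRecorder struct {
	mu     sync.Mutex
	events []string
}

func (r *eventRecorder) add(ev string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.events = append(r.events, ev)
}

func (r *eventRecorder) snapshot() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]string(nil), r.events...)
}
```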
NodePort Health Check has long been part of the Kubernetes API, but
kube-router hasn't implemented it in the past. This is meant to be a
port that is assigned by the kube-controller-manager for LoadBalancer
services that have a traffic policy of `externalTrafficPolicy=Local`.
When set, the k8s networking implementation is meant to open a port and
provide HTTP responses that inform parties external to the Kubernetes
cluster about whether or not a local endpoint exists on the node. It
should return a 200 status if the node contains a local endpoint and
return a 503 status if the node does not contain a local endpoint.
This allows applications outside the cluster to choose their endpoint in
such a way that their source IP could be preserved. For more details
see:
https://kubernetes.io/docs/tutorials/services/source-ip/#source-ip-for-services-with-type-loadbalancer
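In practice, the responder described above is a small HTTP handler. The sketch below is illustrative only (the port and the hasLocalEndpoints helper are hypothetical; the real port comes from service.spec.healthCheckNodePort):

```go
package main

import (
	"fmt"
	"net/http"
)

// healthCheckHandler responds on the health check NodePort: 200 when the node
// has at least one local endpoint for the service, 503 otherwise.
// hasLocalEndpoints stands in for however the proxy tracks local endpoints.
func healthCheckHandler(hasLocalEndpoints func() bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if hasLocalEndpoints() {
			w.WriteHeader(http.StatusOK)
			fmt.Fprintln(w, "local endpoints available")
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprintln(w, "no local endpoints")
	}
}

func main() {
	// The port here is illustrative; the real one is assigned by
	// kube-controller-manager in service.spec.healthCheckNodePort.
	http.Handle("/healthz", healthCheckHandler(func() bool { return true }))
	http.ListenAndServe(":30000", nil)
}
```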
Add isReady, isServing, and isTerminating to the internal EndpointSlice
struct so that downstream consumers have more information about the
service to make decisions later on.
In order to be compliant with upstream network implementation
expectations we choose to proxy an endpoint as long as it is either
ready OR serving. This means that endpoints that are terminating will
still be proxied which makes kube-router conformant with the upstream
e2e tests.
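Expressed as code, the decision is a simple predicate over those conditions (a sketch using hypothetical names, not kube-router's exact struct):

```go
package proxy

// endpointInfo mirrors the fields described above (an illustrative stand-in,
// not kube-router's exact internal type).
type endpointInfo struct {
	isReady       bool
	isServing     bool
	isTerminating bool
}

// shouldProxy keeps an endpoint in the proxy set as long as it is either
// ready or still serving, so terminating-but-serving endpoints remain usable.
func shouldProxy(ep endpointInfo) bool {
	return ep.isReady || ep.isServing
}
```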
Adds support for spec.internalTrafficPolicy and fixes support for
spec.externalTrafficPolicy so that it only affects external traffic.
Keeps existing support for the kube-router.io/service-local annotation,
which overrides both policies to Local when set to true. Any other
value in this annotation is ignored.
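A rough sketch of that precedence (illustrative names, not kube-router's actual types):

```go
package proxy

const (
	policyCluster = "Cluster"
	policyLocal   = "Local"
)

// svcPolicyInfo is an illustrative stand-in for kube-router's service info.
type svcPolicyInfo struct {
	internalPolicy  string // from spec.internalTrafficPolicy
	externalPolicy  string // from spec.externalTrafficPolicy
	localAnnotation string // value of kube-router.io/service-local
}

// effectivePolicies applies the precedence described above: the annotation
// forces both policies to Local only when it is exactly "true"; any other
// value is ignored and the spec fields are used, defaulting to Cluster.
func effectivePolicies(svc svcPolicyInfo) (internal, external string) {
	if svc.localAnnotation == "true" {
		return policyLocal, policyLocal
	}
	internal, external = svc.internalPolicy, svc.externalPolicy
	if internal == "" {
		internal = policyCluster
	}
	if external == "" {
		external = policyCluster
	}
	return internal, external
}
```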
Abide by the service.kubernetes.io/headless label as defined by the
upstream standard.
Resolves the failing e2e test:
should implement service.kubernetes.io/headless
Differentiate services whose ClusterIP is None from services carrying
the service.kubernetes.io/headless label, in preparation for handling
that label. One might think that handling these is the same, and it
partly is, but not entirely: ClusterIP is an immutable field, whereas
labels are mutable. This separates our handling of ClusterIP being None
from our handling of the headless label.
Handling a ClusterIP of None is fundamentally different because, once
it is None, the k8s API guarantees that it will never change, whereas
the label can be added and removed at any time.
This is a feature that has been requested a few times over the years and
would bring us closer to feature parity with other k8s network
implementations for service proxy.
In some cases it is possible for Endpoint.Conditions.Ready to be nil
during the early stages of initialization. When this happens it causes
kube-router to segfault. This fix tests for nil before testing for
Ready.
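A minimal sketch of the guard, using the upstream discovery/v1 types (illustrative, not the exact kube-router helper):

```go
package proxy

import discoveryv1 "k8s.io/api/discovery/v1"

// endpointIsReady treats a nil Ready condition (possible early in an
// EndpointSlice's lifecycle) as not ready instead of dereferencing it.
func endpointIsReady(ep discoveryv1.Endpoint) bool {
	return ep.Conditions.Ready != nil && *ep.Conditions.Ready
}
```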
It used to be that the kubelet handled setting hairpin mode for us:
https://github.com/kubernetes/kubernetes/pull/13628
Then this functionality moved to the dockershim:
https://github.com/kubernetes/kubernetes/pull/62212
Then the functionality was removed entirely:
https://github.com/kubernetes/kubernetes/commit/83265c9171f
Unfortunately, it was lost that we ever depended on this in order for
our hairpin implementation to work, if we ever knew it at all.
Additionally, I suspect that containerd and cri-o implementations never
worked correctly with hairpinning.
Without this, the NAT rules that we implement for hairpinning don't work
correctly. Because hairpin_mode isn't enabled on the container's virtual
interface on the host, the packet bubbles up to the kube-bridge. At some
point in the traffic flow, the route back to the pod gets resolved to
the MAC address inside the container; at that point, the packet's source
and destination MACs don't match the kube-bridge interface and the
packet is black-holed.
This can also be fixed by putting the kube-bridge interface into
promiscuous mode so that it accepts all MAC addresses, but I think that
going back to the original functionality of enabling hairpin_mode on the
veth interface of the container is the lesser of two evils here, as
putting the kube-bridge interface into promiscuous mode will likely
have unintended consequences.
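For reference, hairpin_mode is a per-port flag on the bridge; once a veth is attached to kube-bridge it can be enabled with a sysfs write along these lines (a sketch, not kube-router's actual implementation):

```go
package hairpin

import (
	"fmt"
	"os"
)

// enableHairpinMode writes "1" to the bridge-port hairpin_mode flag for the
// given veth, letting the bridge send traffic back out the same port it
// arrived on. The flag only exists once the veth is attached to a bridge.
func enableHairpinMode(vethName string) error {
	path := fmt.Sprintf("/sys/class/net/%s/brport/hairpin_mode", vethName)
	if err := os.WriteFile(path, []byte("1"), 0644); err != nil {
		return fmt.Errorf("failed to enable hairpin mode on %s: %w", vethName, err)
	}
	return nil
}
```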
Before this, we had two different ways to interact with ipsets: through
the handler interface, which had the best IPv6 handling because NPC
heavily utilizes it, and through the ipset struct, which mostly repeated
the handler logic but didn't handle some key things.
NPC utilized the handler functions and NSC / NRC mostly utilized the old
ipset struct functions. This caused a lot of duplication between the two
groups of functions and also caused issues with proper IPv6 handling.
This commit consolidates the two sets of usage into just the handler
interface. This greatly simplifies how the controllers interact with
ipsets and it also reduces the logic complexity on the ipset side.
This also fixes up some inconsistency in how we handled IPv6 ipset
names. ipset likes them to be prefixed with inet6:, but we weren't
always doing this consistently or sensibly across all functions in the
ipset struct.
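Centralizing that naming convention is straightforward; something along these lines (illustrative only, not the exact helper in the handler interface):

```go
package ipsets

// ipv6SetPrefix is the prefix used for sets in the inet6 family.
const ipv6SetPrefix = "inet6:"

// setNameFor returns the set name adjusted for the address family, keeping
// the prefix handling in one place instead of scattering it across callers.
func setNameFor(name string, isIPv6 bool) string {
	if isIPv6 {
		return ipv6SetPrefix + name
	}
	return name
}
```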
With IPv6 now integrated into the NSC, we no longer get all IPs from
Endpoints, but rather just the primary IP of the pod (which is often,
but not always, the IPv4 address).
In order to get all possible endpoint addresses for a given service we
need to switch to using EndpointSlice which also nicely groups addresses
into IPv4 and IPv6 by AddressType and also gives us more information
about the endpoint status by giving us attributes for serving and
terminating, instead of just ready or not ready.
This does mean that users will need to add another permission to their
RBAC in order for kube-router to access these objects.
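As a sketch of what consuming EndpointSlices looks like with client-go (kube-router uses informers and its own types; the function below is illustrative, and the new RBAC permission corresponds to the endpointslices resource in the discovery.k8s.io API group):

```go
package proxy

import (
	"context"

	discoveryv1 "k8s.io/api/discovery/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// endpointAddressesForService gathers all endpoint addresses for a service
// from its EndpointSlices, grouped by address family. Slices are selected via
// the standard kubernetes.io/service-name label.
func endpointAddressesForService(ctx context.Context, client kubernetes.Interface,
	namespace, service string) (v4, v6 []string, err error) {
	slices, err := client.DiscoveryV1().EndpointSlices(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: discoveryv1.LabelServiceName + "=" + service,
	})
	if err != nil {
		return nil, nil, err
	}
	for _, slice := range slices.Items {
		for _, ep := range slice.Endpoints {
			for _, addr := range ep.Addresses {
				switch slice.AddressType {
				case discoveryv1.AddressTypeIPv4:
					v4 = append(v4, addr)
				case discoveryv1.AddressTypeIPv6:
					v6 = append(v6, addr)
				}
			}
		}
	}
	return v4, v6, nil
}
```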
There is absolutely no reason that we should ever assume netmasks, and
even if we did, we shouldn't modify them as a side effect of a
completely different operation. No idea why this was ever coded this
way. The netmask is now set upstream to the appropriate mask for the IP
family.
During our initial run, fail fatally when we encounter problems rather
than just continuing on and causing subsequent problems and potentially
burying the real error.
This change allows defining two cluster CIDRs for compatibility with
Kubernetes dual-stack, with the assumption that the two CIDRs are
usually one IPv4 and one IPv6.
Signed-off-by: Michal Rostecki <vadorovsky@gmail.com>
Previously, we would iterate over rulesFromNode but then check each rule
against the entirety of the rulesNeeded hash. This caused the loop to
break as soon as it found any matching rule from the host, rather than
breaking only when it matched the rule that we were currently
processing.
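The corrected comparison boils down to matching each needed rule against the specific rules found on the node; a simplified sketch (not the actual kube-router code):

```go
package routes

// missingRules returns the needed rules that are not present on the node.
// Each needed rule is compared against every rule found on the node, so a
// match on some unrelated rule no longer short-circuits the check.
func missingRules(rulesNeeded map[string]bool, rulesFromNode []string) []string {
	var missing []string
	for needed := range rulesNeeded {
		found := false
		for _, existing := range rulesFromNode {
			if existing == needed {
				found = true
				break
			}
		}
		if !found {
			missing = append(missing, needed)
		}
	}
	return missing
}
```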
When we receive service or endpoint updates from Kubernetes we process a
type of partial sync because the service and endpoints have already been
updated by the handler. However, previously, this partial update did not
include updating the hairpinning rules for services.
This would cause hairpinning changes to be delayed until the next full
sync or until kube-router restarted. This change adds hairpinning to
the partial service sync flow.
* fact(NSC): consolidate constants to top
* fix(NSC): increase IPVS add service logging
* fix(NSC): improve logging for FWMark IPVS entries
* fix(NSC): add missing parameter to logging
* feat(NSC): generate unique FW marks
Because we trim the 32-bit FNV-1a hash to 16 bits, there is the potential
for FW marks to collide with each other even for unique inputs of IP,
protocol, and port. This reduces that chance (up to the 16-bit maximum)
by keeping track of which FW marks we've already allocated and which IP,
protocol, and port combination they've been allocated for. A sketch of
this allocation scheme appears after this list.
Fixes #1045
* fact(NSC): move utility funcs to utils
* fix(NSC): reduce IPVS service shell outs
This also aligns it more closely with the almost identical function used
for non-FWmarked services, ipvsAddService(), which is also called from
setupExternalIPServices with this same list of ipvsServices.
* fix(NSC): fix & consolidate DSR cleanup code
A lot of this is refactor work, but it's important to know why the DSR
rules in the mangle table were not being cleaned up in the first place.
When we transitioned to iptables-save to look over the mangle rules, we
didn't realize that iptables-save changes the format of the marks from
the integer values that the CLI works with to hexadecimal.
This meant that we were never actually matching on a mangle rule, which
left them all behind. When these mangle rules were left behind, IPs that
used to be part of a DSR service were essentially black-holed on the
system and were no longer routable (see the sketch after this list).
Fixes #1167
* doc(dsr): expand DSR documentation
Fixes #1055
* ensure active service map is updated for non-DSR services
Co-authored-by: Murali Reddy <muralimmreddy@gmail.com>
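As referenced above, the FW mark allocation can be sketched as a collision-aware FNV-1a hash trimmed to 16 bits; the code below is illustrative only, not kube-router's exact implementation:

```go
package fwmark

import (
	"fmt"
	"hash/fnv"
)

// fwMarkAllocator tracks which 16-bit FW marks have been handed out and for
// which IP/protocol/port combination, probing forward when a mark collides.
type fwMarkAllocator struct {
	byMark map[uint32]string // mark -> service key it was issued for
	byKey  map[string]uint32 // service key -> mark
}

func newFWMarkAllocator() *fwMarkAllocator {
	return &fwMarkAllocator{byMark: map[uint32]string{}, byKey: map[string]uint32{}}
}

// allocate returns a 16-bit FW mark for the ip/protocol/port combination,
// starting from the trimmed 32-bit FNV-1a hash and moving to the next free
// value (up to the 16-bit maximum) when the preferred mark is already taken.
func (a *fwMarkAllocator) allocate(ip, protocol string, port uint16) (uint32, error) {
	key := fmt.Sprintf("%s-%s-%d", ip, protocol, port)
	if mark, ok := a.byKey[key]; ok {
		return mark, nil
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	base := h.Sum32() & 0xFFFF // trim the 32-bit hash to 16 bits
	for i := uint32(0); i <= 0xFFFF; i++ {
		candidate := (base + i) & 0xFFFF
		if _, taken := a.byMark[candidate]; !taken {
			a.byMark[candidate] = key
			a.byKey[key] = candidate
			return candidate, nil
		}
	}
	return 0, fmt.Errorf("no free FW marks available for %s", key)
}
```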
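The DSR cleanup pitfall described above comes down to comparing marks in the representation iptables-save actually emits; a minimal illustration (hypothetical helper, not the real cleanup code):

```go
package dsr

import (
	"fmt"
	"strings"
)

// ruleReferencesMark checks an iptables-save rule line for a given mark.
// iptables-save renders marks in hexadecimal (e.g. 0x4d2), while the CLI was
// given the decimal form (1234), so the comparison must use the hex form.
func ruleReferencesMark(savedRule string, mark uint32) bool {
	return strings.Contains(savedRule, fmt.Sprintf("0x%x", mark))
}
```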