This was originally added in PR #210, but in my testing scenarios it
appears to cause more problems than it solves. When this is enabled,
traffic from kube workers can no longer be routed to DSR-enabled
services that are hosted on other nodes in the cluster.
Previously, kube-router only considered externalIPs when setting up the
source routing policy; notably absent were LoadBalancer IPs, which are
equally important to get right for DSR.
This appears to be a long-standing use case that was never handled
correctly after kube-router gained a LoadBalancer controller.
Instead of deleting the route and just hoping for the best, this change
first checks whether or not the route actually exists. This reduces
needless warnings shown to the user and is all around more accurate.
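A rough sketch of that existence check using the netlink library (the
helper name and table id are illustrative, not the actual kube-router
code):

    package sketch

    import (
        "fmt"
        "net"

        "github.com/vishvananda/netlink"
    )

    // deleteRouteIfExists removes a route to dst from the given table only
    // if it is actually present, avoiding spurious warnings when it is not.
    func deleteRouteIfExists(dst *net.IPNet, table int) error {
        filter := &netlink.Route{Dst: dst, Table: table}
        routes, err := netlink.RouteListFiltered(netlink.FAMILY_ALL, filter,
            netlink.RT_FILTER_DST|netlink.RT_FILTER_TABLE)
        if err != nil {
            return fmt.Errorf("failed to list routes: %w", err)
        }
        for _, route := range routes {
            if err := netlink.RouteDel(&route); err != nil {
                return fmt.Errorf("failed to delete route %s: %w", route.String(), err)
            }
        }
        return nil
    }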
When ip rules are evaluated in the netlink library, default routes for
src and dst are represented as nil. This makes them difficult to
compare and requires additional handling.
I filed an issue upstream so that this could potentially get fixed:
https://github.com/vishvananda/netlink/issues/1080; however, if it
doesn't get resolved, this should allow us to move forward.
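A sketch of the kind of nil handling this requires when comparing rules
(the helper names are illustrative):

    package sketch

    import (
        "bytes"
        "net"

        "github.com/vishvananda/netlink"
    )

    // ipNetsEqual treats nil as "default" (0.0.0.0/0 or ::/0), which is how
    // the netlink library represents it on the rules it returns.
    func ipNetsEqual(a, b *net.IPNet) bool {
        if a == nil || b == nil {
            return a == b
        }
        return a.IP.Equal(b.IP) && bytes.Equal(a.Mask, b.Mask)
    }

    // rulesMatch compares the fields we care about on two ip rules.
    func rulesMatch(a, b *netlink.Rule) bool {
        return a.Table == b.Table &&
            ipNetsEqual(a.Src, b.Src) &&
            ipNetsEqual(a.Dst, b.Dst)
    }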
It has proven to be tricky to insert new rules without calling the
designated NewRule() function from the netlink library; attempts to do
so usually fail with an "operation not supported" error. Using
NewRule() improves the reliability of rule insertion.
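A minimal sketch of rule insertion via NewRule(), with illustrative
field values:

    package sketch

    import (
        "fmt"
        "net"

        "github.com/vishvananda/netlink"
    )

    // addSourceRule inserts an ip rule that sends traffic sourced from src
    // through a custom routing table. Building the rule with NewRule()
    // keeps the unset fields (Priority, Mark, Goto, ...) at their proper
    // defaults; a bare &netlink.Rule{} tends to be rejected with
    // "operation not supported".
    func addSourceRule(src *net.IPNet, table int) error {
        rule := netlink.NewRule()
        rule.Src = src
        rule.Table = table
        if err := netlink.RuleAdd(rule); err != nil {
            return fmt.Errorf("failed to add ip rule: %w", err)
        }
        return nil
    }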
In order for a local route to be valid, its scope needs to be set to
host. When we were executing ip commands, iproute2 did this for us so
that the command was accurate. Now that we're communicating with the
netlink socket directly, we need to do this conversion ourselves.
Without it, the netlink subsystem returns an "invalid argument" error.
And if the route isn't local, most of the routing logic for services
doesn't work correctly, because it acts on external traffic as well as
local traffic, which isn't correct.
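A sketch of what adding the local route looks like through the netlink
socket, assuming an IPv4 /32 service IP and an illustrative table id:

    package sketch

    import (
        "fmt"
        "net"

        "github.com/vishvananda/netlink"
        "golang.org/x/sys/unix"
    )

    // addLocalRoute adds a local route for svcIP in the given table. Both
    // the route type (RTN_LOCAL) and the scope (host) have to be set
    // explicitly; `ip route add local ...` used to infer the scope for us.
    func addLocalRoute(link netlink.Link, svcIP net.IP, table int) error {
        route := &netlink.Route{
            LinkIndex: link.Attrs().Index,
            Dst:       &net.IPNet{IP: svcIP, Mask: net.CIDRMask(32, 32)},
            Type:      unix.RTN_LOCAL,
            Scope:     netlink.SCOPE_HOST,
            Table:     table,
        }
        if err := netlink.RouteAdd(route); err != nil {
            return fmt.Errorf("failed to add local route: %w", err)
        }
        return nil
    }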
Previously we were accidentally deleting all routes that were found.
This mimics the earlier behavior more closely by only deleting external
IPs found in the externalIPRouteTable that are no longer in the
activeExternalIPs map.
This also improves logging around any routes that are deleted, as this
is likely of interest to all kube-router administrators.
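A rough sketch of that cleanup, reusing the externalIPRouteTable and
activeExternalIPs names from above; the exact types and the use of klog
here are assumptions:

    package sketch

    import (
        "github.com/vishvananda/netlink"
        "k8s.io/klog/v2"
    )

    // cleanupStaleExternalIPRoutes deletes routes from the external IP
    // route table whose destination is no longer an active external IP,
    // logging each deletion so administrators can see what changed.
    func cleanupStaleExternalIPRoutes(tableID int, activeExternalIPs map[string]bool) error {
        routes, err := netlink.RouteListFiltered(netlink.FAMILY_ALL,
            &netlink.Route{Table: tableID}, netlink.RT_FILTER_TABLE)
        if err != nil {
            return err
        }
        for _, route := range routes {
            if route.Dst == nil {
                continue
            }
            if _, ok := activeExternalIPs[route.Dst.IP.String()]; !ok {
                klog.Infof("removing stale external IP route: %s", route.String())
                if err := netlink.RouteDel(&route); err != nil {
                    klog.Errorf("failed to remove stale route %s: %v", route.String(), err)
                }
            }
        }
        return nil
    }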
Not making direct exec calls to user-space binaries has long been a
principle of kube-router. When kube-router was first written, the
netlink library was missing significant features that forced us to exec
out. However, netlink now seems to have most of the functionality that
we need.
This converts all of the places where we can use netlink to use the
netlink functionality.
Since version v6.5.0, iproute2 no longer automatically creates the
files in /etc/iproute2, instead preferring to install them to
/usr/lib/iproute2 and, later, /usr/share/iproute2.
This adds fallback path matching to kube-router so that it can find
rt_tables wherever it is defined instead of just failing.
This also means people running kube-router in containers will need to
adjust their mounts depending on where this file is located on their
host OS. However, ensuring that this file is copied to `/etc/iproute2`
is a legitimate way to keep it consistent across a fleet of multiple OS
versions.
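A minimal sketch of the fallback lookup, assuming this search order:

    package sketch

    import (
        "fmt"
        "os"
    )

    // findRTTablesPath returns the first rt_tables file found, checking
    // the historical /etc location first and then the newer iproute2
    // install locations.
    func findRTTablesPath() (string, error) {
        candidates := []string{
            "/etc/iproute2/rt_tables",
            "/usr/lib/iproute2/rt_tables",
            "/usr/share/iproute2/rt_tables",
        }
        for _, path := range candidates {
            if _, err := os.Stat(path); err == nil {
                return path, nil
            }
        }
        return "", fmt.Errorf("rt_tables not found in any of %v", candidates)
    }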
This is a feature that has been requested a few times over the years
and would bring us closer to feature parity with other k8s network
implementations' service proxies.
It used to be that the kubelet handled setting hairpin mode for us:
https://github.com/kubernetes/kubernetes/pull/13628
Then this functionality moved to the dockershim:
https://github.com/kubernetes/kubernetes/pull/62212
Then the functionality was removed entirely:
https://github.com/kubernetes/kubernetes/commit/83265c9171f
Unfortunately, the fact that our hairpin implementation ever depended
on this was lost, if we ever knew it at all. Additionally, I suspect
that the containerd and cri-o implementations never worked correctly
with hairpinning.
Without this, the NAT rules that we implement for hairpinning don't
work correctly. Because hairpin_mode isn't enabled on the container's
virtual interface on the host, the packet bubbles up to the
kube-bridge. At some point in the traffic flow, the route back to the
pod gets resolved to the MAC address inside the container; at that
point, the packet's source and destination MACs don't match the
kube-bridge interface and the packet is black-holed.
This could also be fixed by putting the kube-bridge interface into
promiscuous mode so that it accepts all MAC addresses, but I think that
going back to the original behavior of enabling hairpin_mode on the
container's veth interface is the lesser of two evils here, as putting
kube-bridge into promiscuous mode would likely have unintended
consequences.
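A sketch of enabling hairpin_mode on the host-side veth through the
netlink library (the interface name handling is illustrative):

    package sketch

    import (
        "fmt"

        "github.com/vishvananda/netlink"
    )

    // enableHairpin turns on hairpin_mode for the pod's host-side veth so
    // that the bridge will forward hairpin NAT traffic back out the port
    // it arrived on.
    func enableHairpin(vethName string) error {
        link, err := netlink.LinkByName(vethName)
        if err != nil {
            return fmt.Errorf("failed to find veth %s: %w", vethName, err)
        }
        if err := netlink.LinkSetHairpin(link, true); err != nil {
            return fmt.Errorf("failed to enable hairpin mode on %s: %w", vethName, err)
        }
        return nil
    }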
Adds more logging information (in the form of warnings) when we come
across common errors that are not serious enough to stop processing but
will still confuse users when the error gets bubbled up to the NSC.