The problem here stems from the fact that when netpol generates its list of
expected ipsets, it includes the inet6: prefix; however, when the proxy and
routing controllers send their lists of expected ipsets, they do not. This
meant that no matter how we handled it in ipset.go, it was wrong for one or
the other use case.
I decided to standardize on the netpol way of sending the list of expected ipset names so that BuildIPSetRestore() can
function in the same way for all invocations.
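As a rough sketch of that standardization (the helper name and signature here
are hypothetical, not kube-router's actual API), every caller would normalize
IPv6 set names with the inet6: prefix before handing them to
BuildIPSetRestore():

    import "strings"

    const ipv6SetPrefix = "inet6:"

    // expectedSetName is a hypothetical helper illustrating the idea: IPv6
    // set names are always reported with the inet6: prefix (the netpol
    // convention), so BuildIPSetRestore() sees one consistent naming scheme
    // regardless of which controller built the list.
    func expectedSetName(name string, isIPv6 bool) string {
        if isIPv6 && !strings.HasPrefix(name, ipv6SetPrefix) {
            return ipv6SetPrefix + name
        }
        return name
    }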
Attempt to filter out sets that we are not authoritative for, in order to
avoid race conditions with other operators (like Istio) that might be
attempting to modify ipsets at the same time.
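A minimal sketch of that filtering, assuming the sets we own can be
identified by a kube-router naming prefix (the prefix argument and map shape
here are illustrative):

    import "strings"

    // filterOwnedSets keeps only the sets whose names carry our (assumed)
    // kube-router prefix, so a restore never touches sets that another
    // operator such as Istio is managing concurrently.
    func filterOwnedSets(sets map[string][]string, ownedPrefix string) map[string][]string {
        owned := make(map[string][]string, len(sets))
        for name, entries := range sets {
            bare := strings.TrimPrefix(name, "inet6:")
            if strings.HasPrefix(bare, ownedPrefix) {
                owned[name] = entries
            }
        }
        return owned
    }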
This was originally added in PR #210, but it appears to cause more
problems in my testing scenarios than it solves. When this is enabled,
traffic from kube workers to DSR-enabled services fails whenever the
service is hosted on other nodes in the cluster.
Previously, kube-router only considered externalIPs when setting up the
source routing policy; notably absent were LoadBalancer IPs, which are
equally important to get right for DSR.
This appears to be a long-standing use case that was never handled
correctly since kube-router added a LoadBalancer controller.
Instead of deleting and just hoping for the best, this change makes us
check first whether a route exists. This reduces the needless warnings
that the user receives and is all around more accurate.
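Roughly, the check looks like the sketch below; the table ID and destination
are placeholders, while the calls are the standard vishvananda/netlink API:

    import (
        "fmt"
        "net"

        "github.com/vishvananda/netlink"
    )

    // routeExists reports whether a route to dst is present in the given
    // routing table, so callers can skip RouteDel (and its warning) when
    // there is nothing to delete.
    func routeExists(dst *net.IPNet, tableID int) (bool, error) {
        filter := &netlink.Route{Dst: dst, Table: tableID}
        routes, err := netlink.RouteListFiltered(netlink.FAMILY_ALL, filter,
            netlink.RT_FILTER_DST|netlink.RT_FILTER_TABLE)
        if err != nil {
            return false, fmt.Errorf("failed to list routes: %w", err)
        }
        return len(routes) > 0, nil
    }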
It used to be that, when we were using iproute2's CLI, we needed the
fwmark as a hex number, so we passed it around as a string in that format.
However, now that we use the netlink library directly, we already have
the fwmark in the form that we need. So instead of doing all of these
string <-> int conversions, let's just keep things simpler.
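A sketch of the simplification: the fwmark stays numeric end-to-end instead
of being formatted as a hex string for the iproute2 CLI. The field names
follow the vishvananda/netlink Rule struct; note that Rule.Mark is a uint32
in recent netlink releases (older releases used an int), so the exact type
depends on the library version:

    import (
        "fmt"

        "github.com/vishvananda/netlink"
    )

    func addFwmarkRule(fwmark uint32, tableID int) error {
        rule := netlink.NewRule()
        rule.Mark = fwmark // the old CLI path formatted this as a hex string
        rule.Table = tableID
        if err := netlink.RuleAdd(rule); err != nil {
            return fmt.Errorf("failed to add fwmark rule: %w", err)
        }
        return nil
    }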
When ip rules are evaluated in the netlink library, default routes for
src and dst are equated to nil. This makes it difficult to evaluate
them and requires additional handling in order to compare them.
I filed an issue upstream so that this could potentially get fixed:
https://github.com/vishvananda/netlink/issues/1080. However, if it doesn't
get resolved, this should allow us to move forward.
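The extra handling amounts to treating nil as the "match everything"
selector when comparing what netlink returns, roughly like this sketch
(the helper name is ours, not the library's):

    import (
        "bytes"
        "net"
    )

    // ipNetsEqual compares two Src/Dst selectors as returned by the netlink
    // library, where a nil *net.IPNet stands in for the default ("from all" /
    // "to all") selector and therefore only equals another nil.
    func ipNetsEqual(a, b *net.IPNet) bool {
        if a == nil || b == nil {
            return a == b
        }
        return a.IP.Equal(b.IP) && bytes.Equal(a.Mask, b.Mask)
    }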
It has proven to be tricky to insert new rules without calling the
designated NewRule() function from the netlink library. Usually attempts
will fail with an "operation not supported" message.
This improves the reliability of rule insertion.
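Concretely, rules are now built from NewRule(), which fills in the sentinel
defaults the kernel expects, before being passed to RuleAdd(); the priority
and table values below are placeholders:

    import (
        "net"

        "github.com/vishvananda/netlink"
    )

    func addSourceRule(src *net.IPNet, tableID int) error {
        // NewRule() initializes fields such as Priority, Mark, and Goto to
        // the sentinel values the kernel expects; building a bare
        // &netlink.Rule{} by hand is what tends to produce the
        // "operation not supported" failures described above.
        rule := netlink.NewRule()
        rule.Src = src
        rule.Table = tableID
        rule.Priority = 1000 // placeholder priority
        return netlink.RuleAdd(rule)
    }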
In order for a local route to be valid, it needs to have its scope set to
host. When we were executing ip commands, iproute2 just did this for us.
Now that we're communicating with the netlink socket directly, we need to
do this conversion ourselves.
Without it we get an "invalid argument" error from the netlink subsystem.
And if the route isn't local, then most of the routing logic for services
doesn't work correctly, because it acts upon external traffic as well as
local traffic, which isn't correct.
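In netlink terms the conversion is just setting Scope to SCOPE_HOST on the
RTN_LOCAL route before RouteAdd; the link and service IP below are
placeholders, and IPv4 is shown for brevity:

    import (
        "net"

        "github.com/vishvananda/netlink"
        "golang.org/x/sys/unix"
    )

    func addLocalServiceRoute(link netlink.Link, svcIP net.IP) error {
        route := &netlink.Route{
            LinkIndex: link.Attrs().Index,
            Dst:       &net.IPNet{IP: svcIP, Mask: net.CIDRMask(32, 32)},
            Type:      unix.RTN_LOCAL,
            // Without SCOPE_HOST the kernel rejects the RTN_LOCAL route with
            // "invalid argument"; iproute2 used to set this for us implicitly.
            Scope: netlink.SCOPE_HOST,
            Table: unix.RT_TABLE_LOCAL,
        }
        return netlink.RouteAdd(route)
    }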
Previously we were accidentally deleting all routes that were found. This
better mimics the previous functionality by only deleting external IPs
found in the externalIPRouteTable that are no longer in the
activeExternalIPs map.
Also improves logging around any routes that are deleted, as this is
likely of interest to kube-router administrators.
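The cleanup now looks roughly like the sketch below; externalIPRouteTableID,
the activeExternalIPs map shape, and the klog calls mirror the description
above, but the exact names and signatures are assumptions:

    import (
        "fmt"

        "github.com/vishvananda/netlink"
        "k8s.io/klog/v2"
    )

    func cleanupStaleExternalIPRoutes(externalIPRouteTableID int, activeExternalIPs map[string]bool) error {
        routes, err := netlink.RouteListFiltered(netlink.FAMILY_ALL,
            &netlink.Route{Table: externalIPRouteTableID}, netlink.RT_FILTER_TABLE)
        if err != nil {
            return fmt.Errorf("failed to list routes in external IP table: %w", err)
        }
        for idx, route := range routes {
            if route.Dst == nil {
                continue
            }
            if activeExternalIPs[route.Dst.IP.String()] {
                continue // still active, keep the route
            }
            klog.Infof("Removing stale route to %s from the external IP route table", route.Dst.IP)
            if err := netlink.RouteDel(&routes[idx]); err != nil {
                klog.Errorf("failed to delete route to %s: %v", route.Dst.IP, err)
            }
        }
        return nil
    }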
Not making direct exec calls to userspace binaries has long been a
principle of kube-router. When kube-router was first written, the netlink
library was missing significant features that forced us to exec out.
However, netlink now seems to have most of the functionality that we
need.
This converts every place where we can to use the netlink library
directly.
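As a representative before/after (the route itself is made up), an
exec-based call becomes a direct netlink call:

    import (
        "net"

        "github.com/vishvananda/netlink"
    )

    // Before (shelling out to iproute2):
    //   exec.Command("ip", "route", "replace", "10.2.0.0/24", "via", "192.168.1.10").Run()
    //
    // After (talking to the kernel via netlink):
    func replaceRouteExample() error {
        _, dst, err := net.ParseCIDR("10.2.0.0/24")
        if err != nil {
            return err
        }
        return netlink.RouteReplace(&netlink.Route{
            Dst: dst,
            Gw:  net.ParseIP("192.168.1.10"),
        })
    }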
Setting rp_filter to 2 when it is 0 overrides its status to always enabled
(in loose mode). This behavior could break some networking solutions
because it makes packet admission rules more strict.
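A minimal sketch of the guard this implies, assuming we only want to relax
strict (1) filtering to loose (2) and leave a disabled (0) setting alone;
the sysctl path and helper name are illustrative:

    import (
        "os"
        "strings"
    )

    // relaxRPFilter only moves rp_filter from strict (1) to loose (2); if the
    // administrator disabled it (0), we leave it that way rather than
    // silently re-enabling reverse-path filtering.
    func relaxRPFilter(iface string) error {
        path := "/proc/sys/net/ipv4/conf/" + iface + "/rp_filter"
        raw, err := os.ReadFile(path)
        if err != nil {
            return err
        }
        if strings.TrimSpace(string(raw)) != "1" {
            return nil
        }
        return os.WriteFile(path, []byte("2"), 0o644)
    }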
Over time, feedback from users has been that our interpretation of how
the kube-router service.local annotation interacts with the internal
traffic policy is too restrictive. Tuning it to fall in line with a Local
internal traffic policy turned out to be too limiting. This commit changes
that posture by equating the service.local annotation with External
Traffic Policy Local and Internal Traffic Policy Cluster.
This means that when service.local is set the following will be true:
* ExternalIPs / LoadBalancer IPs will only be available on a node that
hosts the workload
* ExternalIPs / LoadBalancer IPs will only be BGP advertised (when
enabled) by nodes that host the workload
* Services will have the same posture as External Traffic Policy set to
local
* ClusterIPs will be available on all nodes for LoadBalancing
* ClusterIPs will only be BGP advertised (when enabled) by nodes that
host the workload
* Cluster IP services will have the same posture as Internal Traffic
Policy set to cluster
For anyone desiring the original functionality of the service.local
annotation that has been in place since kube-router v2.1.0, all that
would need to be done is to set `internalTrafficPolicy` to Local as
described here: https://kubernetes.io/docs/concepts/services-networking/service-traffic-policy/
Make the health controller more robust and extensible by adding constants
for heartbeats instead of 3-character random strings that are easy to get
wrong.
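The constants look something like the sketch below; the names and values
here are illustrative rather than the exact ones kube-router uses:

    // HealthControllerType identifies which controller a heartbeat came
    // from, so callers reference a named constant instead of retyping a
    // short string literal.
    type HealthControllerType string

    const (
        NetworkPolicyController   HealthControllerType = "NPC"
        NetworkRoutingController  HealthControllerType = "NRC"
        NetworkServicesController HealthControllerType = "NSC"
    )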
This prepares the way for broader refactors in the way that we handle
nodes by:
* Separating frequently used node logic from the controller creation
steps
* Keeping reused code DRY-er
* Adding interface abstractions for key groups of node data and starting
to rely on those more rather than on concrete types (see the sketch after
this list)
* Separating node data from the rest of the controller data structure so
that smaller definitions of data can be passed around to the functions
that need them, rather than always passing the entire controller, which
contains more data / surface area than most functions need.
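As a sketch of that interface direction (the names here are illustrative,
not the exact kube-router types), a function that only needs addressing
information can take a narrow view of the node instead of the whole
controller:

    import "net"

    // NodeIPAware is a hypothetical narrow interface over node addressing data.
    type NodeIPAware interface {
        GetPrimaryNodeIP() net.IP
        GetNodeIPv4Addrs() []net.IP
        GetNodeIPv6Addrs() []net.IP
    }

    // advertiseIPv4Addresses depends only on NodeIPAware rather than on the
    // full controller struct, keeping its surface area small and easy to test.
    func advertiseIPv4Addresses(node NodeIPAware) []string {
        addrs := make([]string, 0, len(node.GetNodeIPv4Addrs()))
        for _, ip := range node.GetNodeIPv4Addrs() {
            addrs = append(addrs, ip.String())
        }
        return addrs
    }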
This fixes the problem where, if network policy is applied before any
communication between two pods, all subsequent communication fails
because the two pods aren't able to discover each other.