383 Commits

Author SHA1 Message Date
Aaron U'Ren
959022fdca feat(NSC): add endpoint statuses to internal struct
Add isReady, isServing, and isTerminating to internal EndpointSlice
struct so that downstream consumers have more information about the
service to make decisions later on.
2024-03-01 16:52:05 -06:00
Aaron U'Ren
16daa08c7b feat(NSC): add endpoints that are ready or serving
In order to be compliant with upstream network implementation
expectations we choose to proxy an endpoint as long as it is either
ready OR serving. This means that endpoints that are terminating will
still be proxied which makes kube-router conformant with the upstream
e2e tests.
2024-03-01 16:52:05 -06:00
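
The ready-or-serving rule described in 16daa08c7b could be sketched roughly as below. The `endpoint` type and `shouldProxy` function are hypothetical names used only for illustration and are not kube-router's actual internals; only the three status flags come from the commit messages above.

```go
package main

import "fmt"

// endpoint is a hypothetical stand-in for the internal struct; only the
// three status flags are taken from the commit messages.
type endpoint struct {
	ip            string
	isReady       bool
	isServing     bool
	isTerminating bool
}

// shouldProxy reports whether an endpoint is proxied: ready OR serving.
// A terminating endpoint that is still serving therefore stays in rotation.
func shouldProxy(ep endpoint) bool {
	return ep.isReady || ep.isServing
}

func main() {
	eps := []endpoint{
		{ip: "10.0.0.1", isReady: true, isServing: true},
		{ip: "10.0.0.2", isServing: true, isTerminating: true}, // draining, still serving
		{ip: "10.0.0.3", isTerminating: true},                  // no longer serving
	}
	for _, ep := range eps {
		fmt.Println(ep.ip, shouldProxy(ep))
	}
}
```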
Aaron U'Ren
fcd21b4759 feat: fully support service traffic policies
Adds support for spec.internalTrafficPolicy and fixes support for
spec.externalTrafficPolicy so that it only affects external traffic.

Keeps existing support for kube-router.io/service-local annotation which
overrides both to local when set to true. Any other value in this
annotation is ignored.
2024-01-24 09:05:24 -08:00
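
The annotation override described in fcd21b4759 might look something like the sketch below. The `policies` type and `resolvePolicies` function are hypothetical; the annotation key and the rule that only the value true has any effect are taken from the commit message, and "Local" / "Cluster" are the standard Kubernetes policy values.

```go
package main

import "fmt"

const serviceLocalAnnotation = "kube-router.io/service-local"

// policies is a hypothetical holder for the two resolved traffic policies.
type policies struct {
	internalLocal bool
	externalLocal bool
}

// resolvePolicies starts from the two spec fields and then applies the
// annotation: "true" forces both to local, any other value is ignored.
func resolvePolicies(internalPolicy, externalPolicy string, annotations map[string]string) policies {
	p := policies{
		internalLocal: internalPolicy == "Local",
		externalLocal: externalPolicy == "Local",
	}
	if annotations[serviceLocalAnnotation] == "true" {
		p.internalLocal = true
		p.externalLocal = true
	}
	return p
}

func main() {
	fmt.Println(resolvePolicies("Cluster", "Local", nil))
	fmt.Println(resolvePolicies("Cluster", "Cluster",
		map[string]string{serviceLocalAnnotation: "true"}))
}
```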
Aaron U'Ren
84042603b0 feat: increase unit test coverage
Prepare for upcoming changes by increasing unit test coverage to ensure
that we correctly handle different boundary conditions when we change
how service local / traffic policies work.
2024-01-24 09:05:24 -08:00
Aaron U'Ren
24505f03ae fact(service_endpoints_sync.go): standardize error handling 2024-01-24 09:05:24 -08:00
Aaron U'Ren
d3cf4d13a7 feat(NSC): add / clarify log messages 2024-01-24 09:05:24 -08:00
Aaron U'Ren
d757f49d55 feat(NSC): honor headless label
Abide by the service.kubernetes.io/headless label as defined by the
upstream standard.

Resolves the failing e2e test:
should implement service.kubernetes.io/headless
2024-01-05 10:27:23 -06:00
Aaron U'Ren
8afdee87d9 fact(NSC): differentiate headless services
Differentiate headless services from ClusterIP being none, in
preparation for handling the service.kubernetes.io/headless label. One
might think that handling these is similar, which it sort of is and sort
of isn't. ClusterIP is an immutable field, whereas labels are mutable.
This separates our handling of ClusterIP none-ness from the presence of
the headless label.

When we consider what to do with ClusterIP being none, that is
fundamentally different, because once it is None, the k8s API guarantees
that the service won't ever change.

Whereas the label can be added and removed.
2024-01-05 10:27:23 -06:00
Aaron U'Ren
30d37695d6 fact(NSC): update Errorf syntax 2024-01-05 10:27:23 -06:00
Aaron U'Ren
a0fe844a93 feat(NSC): honor service-proxy-name label
Abide by the service.kubernetes.io/service-proxy-name label as defined by
the upstream standard here:
https://github.com/kubernetes-sigs/kpng/blob/master/doc/service-proxy.md#ignored-servicesendpoints

Resolves the failing e2e test:
should implement service.kubernetes.io/service-proxy-name

Fixes: #979
2024-01-05 10:27:23 -06:00
Aaron U'Ren
ced5102d99 feat(NSC): add IPVS service timeouts
This is a feature that has been requested a few times over the years and
would bring us closer to feature parity with other k8s network
implementations for service proxy.
2023-12-26 14:26:11 -06:00
Aaron U'Ren
eb462bae08 feat(linux_networking.go): add more error info
Direct people to a potentially missing hostPID attribute in their
kube-router deployment if they are getting a no such file or directory
message.
2023-12-08 17:01:48 -06:00
Aaron U'Ren
aebaa48ea1 fix(NSC): handle endpoint slice ready nil
In some cases it is possible for Endpoint.Conditions.Ready to be nil
during the early stages of initialization. When this happens it causes
kube-router to segfault. This fix tests for nil before testing for
Ready.
2023-12-08 14:38:50 -06:00
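
A nil-safe check of the kind aebaa48ea1 describes could be sketched as follows. The `conditions` type is a local stand-in for the pointer-typed EndpointSlice conditions so that the example needs no Kubernetes imports, and treating nil as "not ready" is an assumption of this sketch rather than something the commit message states.

```go
package main

import "fmt"

// conditions is a local stand-in for EndpointSlice endpoint conditions,
// where Ready is a pointer and may be nil.
type conditions struct {
	Ready *bool
}

// isReady tests for nil before dereferencing, so a nil Ready cannot
// cause a nil-pointer panic. Mapping nil to false is an assumption.
func isReady(c conditions) bool {
	return c.Ready != nil && *c.Ready
}

func main() {
	ready := true
	fmt.Println(isReady(conditions{}))              // Ready is nil
	fmt.Println(isReady(conditions{Ready: &ready})) // Ready is set
}
```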
Aaron U'Ren
0f3714b9b7 fix(hairpin): set hairpin_mode for veth iface
It used to be that the kubelet handled setting hairpin mode for us:
https://github.com/kubernetes/kubernetes/pull/13628

Then this functionality moved to the dockershim:
https://github.com/kubernetes/kubernetes/pull/62212

Then the functionality was removed entirely:
https://github.com/kubernetes/kubernetes/commit/83265c9171f

Unfortunately, it was lost that we ever depended on this in order for
our hairpin implementation to work, if we ever knew it at all.
Additionally, I suspect that containerd and cri-o implementations never
worked correctly with hairpinning.

Without this, the NAT rules that we implement for hairpinning don't work
correctly. Because hairpin_mode isn't implemented on the virtual
interface of the container on the host, the packet bubbles up to the
kube-bridge. At some point in the traffic flow, the route back to the
pod gets resolved to the mac address inside the container. At that
point, the packet's source mac and destination mac don't match the
kube-bridge interface and the packet is black-holed.

This can also be fixed by putting the kube-bridge interface into
promiscuous mode so that it accepts all mac addresses, but I think that
going back to the original functionality of enabling hairpin_mode on the
veth interface of the container is likely the lesser of two evils here
as putting the kube-bridge interface into promiscuous mode will likely
have unintentional consequences.
2023-12-07 12:44:51 -06:00
Martin -nexus- Mlynář
66890d5f12 feat: Disable binding overlay tunnels to specific device 2023-10-30 08:05:26 -05:00
Aaron U'Ren
4cd6d94826 fix(NSC): only run for enabled families
Don't run iptables or ipset logic for disabled families

Fixes #1558
2023-10-19 16:51:21 -05:00
Aaron U'Ren
1a4896f465 feat(lint): upgrade golangci-lint v1.50.1 -> v1.54.2 2023-10-07 14:20:28 -05:00
Aaron U'Ren
678b7129c3 fix(ecmp_vip.go): non-local service advertisement
With advertiseService set to false by default, it means that it won't
ever get re-evaluated if the service isn't a local host and will ALWAYS
result in withdrawing the VIPs which is incorrect. It needs to default
to true, and only override the boolean if serviceLocal is set to true.
2023-10-07 08:52:31 -05:00
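
The default described in 678b7129c3 can be illustrated with a small sketch. `shouldAdvertise` and its `hasLocalEndpoint` parameter are hypothetical names; the commit only states that the flag defaults to true and is overridden only when serviceLocal is set.

```go
package main

import "fmt"

// shouldAdvertise defaults to true and only re-evaluates the decision
// when the service is local. hasLocalEndpoint is an assumed input that
// stands for "this node hosts an endpoint of the service".
func shouldAdvertise(serviceLocal, hasLocalEndpoint bool) bool {
	advertise := true
	if serviceLocal {
		advertise = hasLocalEndpoint
	}
	return advertise
}

func main() {
	fmt.Println(shouldAdvertise(false, false)) // non-local service: still advertised
	fmt.Println(shouldAdvertise(true, false))  // local service, no local endpoint
	fmt.Println(shouldAdvertise(true, true))   // local service with a local endpoint
}
```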
Aaron U'Ren
1a891c33ee fix(dsr): add family specific link inside pod
For IPv6 we need to have family specific links inside the pod to receive
the ip6ip6 and ipip traffic that we are sending.
2023-10-07 08:52:31 -05:00
Aaron U'Ren
514a8af7ed fix(dsr): add family for fwmark 2023-10-07 08:52:31 -05:00
Aaron U'Ren
c92f76aadf fix(service_endpoints_sync.go): use save command 2023-10-07 08:52:31 -05:00
Aaron U'Ren
9abe20d581 fix(NSC): compare all pod IPs for endpoint check
Don't just compare the primary IP according to k8s, but all IPs that the
pod contains.
2023-10-07 08:52:31 -05:00
Aaron U'Ren
9f23cf5a6e fix(linux_networking.go): add better error messages 2023-10-07 08:52:31 -05:00
Aaron U'Ren
7ce09a64d9 fix(linux_networking.go): don't return err on warn 2023-10-07 08:52:31 -05:00
Aaron U'Ren
9d63cc689b feat(debug): add some extra debug at level 3 2023-10-07 08:52:31 -05:00
Aaron U'Ren
4c6e19f2e1 feat(ipset): consolidate ipset usage across controllers
Before this, we had 2 different ways to interact with ipsets, through
the handler interface which had the best handling for IPv6 because NPC
heavily utilizes it, and through the ipset struct which mostly repeated
the handler logic, but didn't handle some key things.

NPC utilized the handler functions and NSC / NRC mostly utilized the old
ipset struct functions. This caused a lot of duplication between the two
groups of functions and also caused issues with proper IPv6 handling.

This commit consolidates the two sets of usage into just the handler
interface. This greatly simplifies how the controllers interact with
ipsets and it also reduces the logic complexity on the ipset side.

This also fixes up some inconsistency with how we handled IPv6 ipset
names. ipset likes them to be prefixed with inet6:, but we weren't
always doing this in a way that made sense and was consistent across all
functions in the ipset struct.
2023-10-07 08:52:31 -05:00
Aaron U'Ren
c62e1b7902 feat(linux_networking.go): add more logging info
Adds more logging information (in the form of warnings) when we come
across common errors that are not big enough to stop processing, but
will still confuse users when the error gets bubbled up to NSC.
2023-10-07 08:52:31 -05:00
Aaron U'Ren
da73dea69b feat(NSC): use EndpointSlice instead of Endpoints
With the advent of IPv6 integrated into the NSC we no longer get all IPs
from endpoints, but rather just the primary IP of the pod (which is
often, but not always the IPv4 address).

In order to get all possible endpoint addresses for a given service we
need to switch to using EndpointSlice which also nicely groups addresses
into IPv4 and IPv6 by AddressType and also gives us more information
about the endpoint status by giving us attributes for serving and
terminating, instead of just ready or not ready.

This does mean that users will need to add another permission to their
RBAC in order for kube-router to access these objects.
2023-10-07 08:52:31 -05:00
Aaron U'Ren
15cd4eb099 feat(nsc): add more insight into sync steps 2023-10-07 08:52:31 -05:00
Aaron U'Ren
81bc9e20ef fix(nsc): don't modify netmask during flag setup
There is absolutely no reason that we should ever assume netmasks, and
even if we did, we shouldn't modify them as a side-effect of a
completely different operation. No idea why this was ever coded this
way. Netmask is now set upstream to the appropriate mask for the IP
family.
2023-10-07 08:52:31 -05:00
Aaron U'Ren
903466b745 fix(nsc): fail fast during init
During our initial run, fail fatally when we encounter problems rather
than just continuing on and causing subsequent problems and potentially
burying the real error.
2023-10-07 08:52:31 -05:00
Aaron U'Ren
25ecb098c6 feat(nsc): add dualstack capabilities 2023-10-07 08:52:31 -05:00
Aaron U'Ren
f397a1f011 feat: increase log level for save/restore msgs 2023-10-07 08:52:31 -05:00
Aaron U'Ren
68a7d03bac fix: take family metrics out of defer
Deferring these will end up making the end times match for both families
as the variables aren't tracked separately. Since these are the same
metrics, it should be safe to emit them at time of generation.
2023-10-07 08:52:31 -05:00
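
The defer pitfall behind 68a7d03bac can be reproduced in a few lines. This is only an illustration with an integer counter standing in for a clock; `deferredEnds` and `immediateEnds` are hypothetical names and not the metrics code itself.

```go
package main

import "fmt"

var families = []string{"ipv4", "ipv6"}

// deferredEnds records each family's "end time" in a deferred closure.
// Every closure runs at function exit and reads the same shared tick,
// so both families report the same value.
func deferredEnds() (ends []int) {
	tick := 0
	for range families {
		tick++ // pretend this family's sync took one tick
		defer func() { ends = append(ends, tick) }()
	}
	return ends
}

// immediateEnds records the end time as soon as each family finishes.
func immediateEnds() []int {
	var ends []int
	tick := 0
	for range families {
		tick++
		ends = append(ends, tick)
	}
	return ends
}

func main() {
	fmt.Println(deferredEnds(), immediateEnds()) // prints "[2 2] [1 2]"
}
```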
Aaron U'Ren
301e856a92 fix(NPC): remove redundant assign 2023-10-07 08:52:31 -05:00
Brad Davidson
b06b4f05c3 Move ipset restore outside policy loop
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-07 08:52:31 -05:00
Brad Davidson
e34ef29fe2 Add additional save/restore metrics
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-07 08:52:31 -05:00
Brad Davidson
aa107d6376 Make metrics registerer/gatherer replaceable
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-07 08:52:31 -05:00
Aaron U'Ren
e6f668cbb7 fix: syntax updates for Go 1.20.X and k8s 1.27 2023-10-07 08:52:31 -05:00
Aaron U'Ren
5cf1265fb7 fix(NRC): prevent adding routes with mixed families 2023-10-07 08:52:31 -05:00
Aaron U'Ren
bab0d4ff83 feat(bgp_policies.go): don't override-nexthop for internal peers
Previously when a user selected to override the next-hop via GoBGP's
NextHopActions: Self functionality, we did it for all exported routes.
However, in a dual-stack use-case this causes problems for internal pod
IP routes that are spread via BGP advertisements.

Currently, kube-router only peers with an internal peer once over
whatever its primary IP is according to its Kubernetes node
information. This means that when overriding next-hop the IP is either
an IPv4 or IPv6 address depending on how the node has configured itself.
Therefore when it attempts to add a route for an IPv6 address and
override next-hop is configured, if the node's primary IP was an IPv4
address this will not succeed as a next-hop for an IPv6 address cannot
be an IPv4 gateway.

Rather than making the code base overly complicated with both an IPv4
and IPv6 peering for internal nodes, this change presents a bit of a
middle ground. By choosing not to override the next-hop for pod subnet
advertisements to internal (Kubernetes node) peers, we eliminate this
problem.

This does change the functionality of kube-router a bit, but one of the
foundational aspects to Kubernetes networking is that all nodes should
be able to contact each other. So I cannot currently think of a good
use-case where overriding the next-hop for pod subnets of internal peers
would be necessary, so I think that this is an ok concession to make.
2023-10-07 08:52:31 -05:00
Erik Larsson
afdf553fa8 add loadbalancer address allocator
This adds a simple controller that will watch for services of type LoadBalancer
and try to allocate addresses from the specified IPv4 and/or IPv6 ranges.
It's assumed that kube-router (or another network controller) will announce the addresses.

As the controller uses leases for leader election and updates the service status new
RBAC permissions are required.
2023-10-07 08:52:31 -05:00
Aaron U'Ren
944ab91725 fix(FoU): make more robust
FoU implementation now properly handles a whole host of things:

* It now actually handles IPv6 by changing the encapsulation protocol to
  GUE instead of generic FoU. I worked with generic FoU tunnels for
  several days and couldn't get it to support IPv4 and IPv6 at all even
  when using it with the IPv6 proto and with iproute2 in IPv6
  mode (-6)
* It now handles converting between the two tunnel types seamlessly and
  without leaving legacy tunnel artifacts behind. Previously, you could
  change the encap type but it wouldn't change the tunnels
* Abstracted constants
2023-10-07 08:52:31 -05:00
Aaron U'Ren
bac4ae6299 fix(FoU): add docs, sanity checking, and logic reduction 2023-10-07 08:52:31 -05:00
Kartik Raval
2a57d6c163 Adding FoU encapsulation over IPIP tunnel : added checks for restart and multi-node cases 2023-10-07 08:52:31 -05:00
Kartik Raval
6ce37e6167 Support for FoU encapsulation for IPIP tunnel 2023-10-07 08:52:31 -05:00
Aaron U'Ren
4861021797 fix(NPC): update IPBlocks to be ipFamily specific
Previously, IPBlocks (like srcIPBlocks) only contained a single IP
Family which meant that a len() > 0 would indicate that an IP block had
been defined in the NetworkPolicy. However, now the IPBlocks structs are
IP family specific which means that they will always contain 2 entries,
one for the IPv4 family and one for the IPv6 family. Which means that
this condition will evaluate to true for all NetworkPolicies and waste
system resources creating empty ipsets and bad iptables rules.
2023-10-07 08:52:31 -05:00
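
The len() pitfall described in 4861021797 can be shown with a simplified model. The `ipBlocks` map type and `hasEntries` helper are hypothetical and only approximate the real structures: once the container always holds one entry per family, its length no longer says whether any block was actually defined.

```go
package main

import "fmt"

// ipBlocks is a hypothetical, simplified model: one entry per IP family,
// each holding the CIDRs defined for that family.
type ipBlocks map[string][]string

// hasEntries checks the per-family contents instead of the number of
// families, which is always 2 in this model.
func hasEntries(b ipBlocks) bool {
	for _, cidrs := range b {
		if len(cidrs) > 0 {
			return true
		}
	}
	return false
}

func main() {
	empty := ipBlocks{"ipv4": nil, "ipv6": nil}
	fmt.Println(len(empty) > 0)    // true, even though no block is defined
	fmt.Println(hasEntries(empty)) // false
}
```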
Boleyn Su
f0d7f1e17a netpol: Fix ipset only containing one IP when port name is used. 2023-10-07 08:52:31 -05:00
Aaron U'Ren
384ed97a76 fix(bgp_policy): allow for statement add / remove
The previous version of the bgp_policies code only allowed for creating
a policy when the policy didn't exist already. However, with the advent
of dual-stack we need to be able to add / remove statements if we add or
lose a specific IP family (e.g. IPv4 or IPv6) since they are handled in
different statements.

Given that the owner of GoBGP has let us know that policies are
idempotent, this now involves quite a bit of work. We need to follow the
following procedure:

add statements if missing -> add them to a policy -> if policy doesn't
  equal the one already in GoBGP -> create the new policy and associate
  it -> de-associate the old policy -> remove the old policy
2023-10-07 08:52:31 -05:00
Aaron U'Ren
1d5c9ce25c fix(ecmp_vip): update VIPs based on svc change
Previously we used to do an idempotent sync of all active VIPs any time we
got a service or endpoint update. However, this only worked when we
assumed a single-stack deployment model where IPs were never deleted
unless the whole service was deleted.

In a dual-stack model, we can add / remove LoadBalancer IPs and Cluster
IPs on updates. Given this, we need to take into account the finite
change that happens, and not just revert to sync-all because we'll never
stop advertising IPs that should be removed.

As a fall-back, we still have the outer Run loop that syncs all active
routes every X amount of seconds (configured by user CLI parameter). So
on that timer we'll still have something that syncs all active VIPs and
works as an outer control loop to ensure that desired state eventually
becomes active state if we accidentally remove a VIP that should have
been there.
2023-10-07 08:52:31 -05:00
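
The change-based handling described in 1d5c9ce25c amounts to diffing the VIPs before and after a service update. The sketch below uses a hypothetical `vipsToWithdraw` helper over plain strings; it is not the controller's actual code.

```go
package main

import "fmt"

// vipsToWithdraw returns the VIPs that were present before an update but
// are absent afterwards, i.e. the ones that should stop being advertised.
func vipsToWithdraw(oldVIPs, newVIPs []string) []string {
	keep := make(map[string]struct{}, len(newVIPs))
	for _, v := range newVIPs {
		keep[v] = struct{}{}
	}
	var withdraw []string
	for _, v := range oldVIPs {
		if _, ok := keep[v]; !ok {
			withdraw = append(withdraw, v)
		}
	}
	return withdraw
}

func main() {
	before := []string{"10.96.0.10", "2001:db8::10"}
	after := []string{"10.96.0.10"} // the IPv6 ClusterIP was removed
	fmt.Println(vipsToWithdraw(before, after))
}
```

A sync-all pass over `after` alone would never notice the removed address, which is why the diff is needed in addition to the periodic full sync.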