Commit Graph

33 Commits

Author SHA1 Message Date
Willy Tarreau
8a96216847 MEDIUM: sock-inet: re-check IPv6 connectivity every 30s
IPv6 connectivity might start off (e.g. network not fully up when
haproxy starts), so for features like resolvers, it would be nice to
periodically recheck.

With this change, instead of having the resolvers code rely on a variable
indicating connectivity, it will now call a function that will check for
how long a connectivity check hasn't been run, and will perform a new one
if needed. The age was set to 30s which seems reasonable considering that
the DNS will cache results anyway. There's no saving in spacing it more
since the syscall is very check (just a connect() without any packet being
emitted).

The variables remain exported so that we could present them in show info
or anywhere else.

This way, "dns-accept-family auto" will now stay up to date. Warning
though, it does perform some caching so even with a refreshed IPv6
connectivity, an older record may be returned anyway.
2025-05-09 15:45:44 +02:00
Willy Tarreau
5d41d476f3 MINOR: sock-inet: detect apparent IPv6 connectivity
In order to ease dual-stack deployments, we could at least try to
check if ipv6 seems to be reachable. For this we're adding a test
based on a UDP connect (no traffic) on port 53 to the base of
public addresses (2001::) and see if the connect() is permitted,
indicating that the routing table knows how to reach it, or fails.
Based on this result we're setting a global variable that other
subsystems might use to preset their defaults.
2025-04-24 17:52:28 +02:00
Aperence
20efb856e1 MEDIUM: protocol: add MPTCP per address support
Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension
that enables a TCP connection to use different paths.

Multipath TCP has been used for several use cases. On smartphones, MPTCP
enables seamless handovers between cellular and Wi-Fi networks while
preserving established connections. This use-case is what pushed Apple
to use MPTCP since 2013 in multiple applications [2]. On dual-stack
hosts, Multipath TCP enables the TCP connection to automatically use the
best performing path, either IPv4 or IPv6. If one path fails, MPTCP
automatically uses the other path.

To benefit from MPTCP, both the client and the server have to support
it. Multipath TCP is a backward-compatible TCP extension that is enabled
by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...).
Multipath TCP is included in the Linux kernel since version 5.6 [3]. To
use it on Linux, an application must explicitly enable it when creating
the socket. No need to change anything else in the application.

This attached patch adds MPTCP per address support, to be used with:

  mptcp{,4,6}@<address>[:port1[-port2]]

MPTCP v4 and v6 protocols have been added: they are mainly a copy of the
TCP ones, with small differences: names, proto, and receivers lists.

These protocols are stored in __protocol_by_family, as an alternative to
TCP, similar to what has been done with QUIC. By doing that, the size of
__protocol_by_family has not been increased, and it behaves like TCP.

MPTCP is both supported for the frontend and backend sides.

Also added an example of configuration using mptcp along with a backend
allowing to experiment with it.

Note that this is a re-implementation of Björn's work from 3 years ago
[4], when haproxy's internals were probably less ready to deal with
this, causing his work to be left pending for a while.

Currently, the TCP_MAXSEG socket option doesn't seem to be supported
with MPTCP [5]. This results in a warning when trying to set the MSS of
sockets in proto_tcp:tcp_bind_listener.

This can be resolved by adding two new variables:
sock_inet(6)_mptcp_maxseg_default that will hold the default
value of the TCP_MAXSEG option. Note that for the moment, this
will always be -1 as the option isn't supported. However, in the
future, when the support for this option will be added, it should
contain the correct value for the MSS, allowing to correctly
set the TCP_MAXSEG option.

Link: https://www.rfc-editor.org/rfc/rfc8684.html [1]
Link: https://www.tessares.net/apples-mptcp-story-so-far/ [2]
Link: https://www.mptcp.dev [3]
Link: https://github.com/haproxy/haproxy/issues/1028 [4]
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/515 [5]

Co-authored-by: Dorian Craps <dorian.craps@student.vinci.be>
Co-authored-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
2024-08-30 18:53:49 +02:00
Willy Tarreau
2a799b64b0 MINOR: protocol: add the real address family to the protocol
For custom families, there's sometimes an underlying real address and
it would be nice to be able to directly use the real family in calls
to bind() and connect() without having to add explicit checks for
exceptions everywhere.

Let's add a .real_family field to struct proto_fam for this. For now
it's always equal to the family except for non-transferable ones such
as rhttp where it's equal to the custom one (anything else could fit).
2024-08-21 17:37:46 +02:00
Willy Tarreau
785b89f551 MINOR: protocol: move the global reuseport flag to the protocols
Some protocol support SO_REUSEPORT and others not. Some have such a
limitation in the kernel, and others in haproxy itself (e.g. sock_unix
cannot support multiple bindings since each one will unbind the previous
one). Also it's really protocol-dependent and not just family-dependent
because on Linux for some time it was supported for TCP and not UDP.

Let's move the definition to the protocols instead. Now it's preset in
tcp/udp/quic when SO_REUSEPORT is defined, and is otherwise left unset.
The enabled() config condition test validates IPv4 (generally sufficient),
and -dR / noreuseport all protocols at once.
2023-04-23 09:46:15 +02:00
Willy Tarreau
9f53b7b41a BUG/MINOR: sock_inet: use SO_REUSEPORT_LB where available
On FreeBSD 13.1 I noticed that thread balancing using shards was not
always working. Sometimes several threads would work, but most of the
time a single one was taking all the traffic. This is related to how
SO_REUSEPORT works on FreeBSD since version 12, as it seems there is
no guarantee that multiple sockets will receive the traffic. However
there is SO_REUSEPORT_LB that is designed exactly for this, so we'd
rather use it when available.

This patch may possibly be backported, but nobody complained and it's
not sure that many users rely on shards. So better wait for some feedback
before backporting this.
2023-04-23 09:46:15 +02:00
Willy Tarreau
0e1aaf4e78 MEDIUM: proto: duplicate receivers marked RX_F_MUST_DUP
The different protocol's ->bind() function will now check the receiver's
RX_F_MUST_DUP flag to decide whether to bind a fresh new listener from
scratch or reuse an existing one and just duplicate it. It turns out
that the existing code already supports reusing FDs since that was done
as part of the FD passing and inheriting mechanism. Here it's not much
different, we pass the FD of the reference receiver, it gets duplicated
and becomes the new receiver's FD.

These FDs are also marked RX_F_INHERITED so that they are not exported
and avoid being touched directly (only the reference should be touched).
2023-04-21 17:41:26 +02:00
Willy Tarreau
145b17fd2f BUG/MEDIUM: listener: duplicate inherited FDs if needed
Since commit 36d9097cf ("MINOR: fd: Add BUG_ON checks on fd_insert()"),
there is currently a test in fd_insert() to detect that we're not trying
to reinsert an FD that had already been inserted. This test catches the
following anomalies:

   frontend fail1
       bind fd@0
       bind fd@0

and:

   frontend fail2
       bind fd@0 shards 2

What happens is that clone_listener() is called on a listener already
having an FD, and when sock_{inet,unix}_bind_receiver() are called, the
same FD will be registered multiple times and rightfully crash in the
sanity check.

It wouldn't be correct to block shards though (e.g. they could be used
in a default-bind line). What looks like a safer and more future-proof
approach simply is to dup() the FD so that each listener has one copy.
This is also the only solution that might allow later to support more
than 64 threads on an inherited FD.

This needs to be backported as far as 2.4. Better wait for at least
one extra -dev version before backporting though, as the bug should
not be triggered often anyway.
2023-01-11 11:27:20 +01:00
Amaury Denoyelle
487d04f6d7 BUG/MINOR: quic: set IP_PKTINFO socket option for QUIC receivers only
Move code which activates IP_PKTINFO socket option (or affiliated
options) from sock_inet_bind_receiver() to quic_bind_listener()
function. This change is useful for two reasons :

* first, and the most important one : this activates IP_PKTINFO only for
  QUIC receivers. The previous version impacted all datagram receivers,
  used for example by log-forwarder. This should reduce memory usage for
  these datagram sockets which do not need this option.

* second, USE_QUIC preprocessor statements are removed from
  src/sock_inet.c which clean up the code.

IP_PKTINFO was introduced recently by the following patch :
  97ecc7a8ea (quic-dev/qns)
  MEDIUM: quic: retrieve frontend destination address

For the moment, this does not impact any stable release. However, as
previous patch is scheduled for 2.6 backporting, the current change must
also be backported to the same versions.
2022-10-11 16:46:04 +02:00
Amaury Denoyelle
97ecc7a8ea MEDIUM: quic: retrieve frontend destination address
Retrieve the frontend destination address for a QUIC connection. This
address is retrieve from the first received datagram and then stored in
the associated quic-conn.

This feature relies on IP_PKTINFO or affiliated flags support on the
socket. This flag is set for each QUIC listeners in
sock_inet_bind_receiver(). To retrieve the destination address,
recvfrom() has been replaced by recvmsg() syscall. This operation and
parsing of msghdr structure has been extracted in a wrapper quic_recv().

This change is useful to finalize the implementation of 'dst' sample
fetch. As such, quic_sock_get_dst() has been edited to return local
address from the quic-conn. As a best effort, if local address is not
available due to kernel non-support of IP_PKTINFO, address of the
listener is returned instead.

This should be backported up to 2.6.
2022-10-10 11:48:27 +02:00
Willy Tarreau
9464bb1f05 MEDIUM: fd: add the tgid to the fd and pass it to fd_insert()
The file descriptors will need to know the thread group ID in addition
to the mask. This extends fd_insert() to take the tgid, and will store
it into the FD.

In the FD, the tgid is stored as a combination of tgid on the lower 16
bits and a refcount on the higher 16 bits. This allows to know when it's
really possible to trust the tgid and the running mask. If a refcount is
higher than 1 it indeed indicates another thread else might be in the
process of updating these values.

Since a closed FD must necessarily have a zero refcount, a test was
added to fd_insert() to make sure that it is the case.
2022-07-15 19:58:06 +02:00
Willy Tarreau
512dd2dc1c MINOR: fd: make fd_insert() apply the thread mask itself
It's a bit ugly to see that half of the callers of fd_insert() have to
apply all_threads_mask themselves to the bit field they're passing,
because usually it comes from a listener that may have other bits set.
Let's make the function apply the mask itself.
2022-07-15 19:58:06 +02:00
Willy Tarreau
82e378aa8a MINOR: fd/thread: get rid of thread_mask()
Since commit d2494e048 ("BUG/MEDIUM: peers/config: properly set the
thread mask") there must not remain any single case of a receiver that
is bound nowhere, so there's no need anymore for thread_mask().

We're adding a test in fd_insert() to make sure this doesn't happen by
accident though, but the function was removed and its rare uses were
replaced with the original value of the bind_thread msak.
2022-07-15 19:43:10 +02:00
Willy Tarreau
382474348c CLEANUP: tree-wide: use fd_set_nonblock() and fd_set_cloexec()
This gets rid of most open-coded fcntl() calls, some of which were passed
through DISGUISE() to avoid a useless test. The FD_CLOEXEC was most often
set without preserving previous flags, which could become a problem once
new flags are created. Now this will not happen anymore.
2022-04-26 10:59:48 +02:00
Willy Tarreau
01cac3f721 MEDIUM: listeners: split the thread mask between receiver and bind_conf
With groups at some point we'll have to have distinct masks/groups in the
receiver and the bind_conf, because a single bind_conf might require to
instantiate multiple receivers (one per group).

Let's split the thread mask and group to have one for the bind_conf and
another one for the receiver while it remains easy to do. This will later
allow to use different storage for the bind_conf if needed (e.g. support
multiple groups).
2021-10-14 21:27:48 +02:00
Willy Tarreau
6823a3acee MINOR: protocol: uniformize protocol errors
Some protocols fail with "error blah [ip:port]" and other fail with
"[ip:port] error blah". All this already appears in a "starting" or
"binding" context after a proxy name. Let's choose a more universal
approach like below where the ip:port remains at the end of the line
prefixed with "for".

  [WARNING]  (18632) : Binding [binderr.cfg:10] for proxy http: cannot bind receiver to device 'eth2' (No such device) for [0.0.0.0:1080]
  [WARNING]  (18632) : Starting [binderr.cfg:10] for proxy http: cannot set MSS to 12 for [0.0.0.0:1080]
2021-10-14 21:22:52 +02:00
Willy Tarreau
f78b52eb7d MINOR: inet: report the faulty interface name in "bind" errors
When a "bind ... interface foo" statement fails, let's report the
interface name in the error message to help locating it in the file.
2021-10-14 21:22:52 +02:00
Willy Tarreau
9063a660cc MINOR: fd: move .exported into fdtab[].state
No need to keep this flag apart any more, let's merge it into the global
state.
2021-04-07 18:10:36 +02:00
Willy Tarreau
4bfc6630ba CLEANUP: socket: replace SOL_IP/IPV6/TCP with IPPROTO_IP/IPV6/TCP
Historically we've used SOL_IP/SOL_IPV6/SOL_TCP everywhere as the socket
level value in getsockopt() and setsockopt() but as we've seen over time
it regularly broke the build and required to have them defined to their
IPPROTO_* equivalent. The Linux ip(7) man page says:

   Using the SOL_IP socket options level isn't portable; BSD-based
   stacks use the IPPROTO_IP level.

And it indeed looks like a pure linuxism inherited from old examples and
documentation. strace also reports SOL_* instead of IPPROTO_*, which does
not help... A check to linux/in.h shows they have the same values. Only
SOL_SOCKET and other non-IP values make sense since there is no IPPROTO
equivalent.

Let's get rid of this annoying confusion by removing all redefinitions of
SOL_IP/IPV6/TCP and using IPPROTO_* instead, just like any other operating
system. This also removes duplicated tests for the same value.

Note that this should not result in exposing syscalls to other OSes
as the only ones that were still conditionned to SOL_IPV6 were for
IPV6_UNICAST_HOPS which already had an IPPROTO_IPV6 equivalent, and
IPV6_TRANSPARENT which is Linux-specific.
2021-03-31 08:59:34 +02:00
Willy Tarreau
73bed9ff13 MINOR: protocol: add a ->set_port() helper to address families
At various places we need to set a port on an IPv4 or IPv6 address, and
it requires casts that are easy to get wrong. Let's add a new set_port()
helper to the address family to assist in this. It will be directly
accessible from the protocol and will make the operation seamless.
Right now this is only implemented for sock_inet as other families do
not need a port.
2020-12-04 15:08:00 +01:00
Willy Tarreau
233ad288cd CLEANUP: protocol: remove the now unused <handler> field of proto_fam->bind()
We don't need to specify the handler anymore since it's set in the
receiver. Let's remove this argument from the function and clean up
the remains of code that were still setting it.
2020-10-15 21:47:56 +02:00
Willy Tarreau
f2cda10b1d BUILD: sock_inet: include errno.h
I was careful to have it for sock_unix.c but missed it for sock_inet
which broke with commit 36722d227 ("MINOR: sock_inet: report the errno
string in binding errors") depending on the build options. No backport
is needed.
2020-09-17 14:02:01 +02:00
Willy Tarreau
36722d2274 MINOR: sock_inet: report the errno string in binding errors
With the socket binding code cleanup it becomes easy to add more info to
error messages. One missing thing used to be the error string, which is
now added after the generic one, for example:

  [ALERT] 260/082852 (12974) : Starting frontend f: cannot bind socket (Permission denied) [0.0.0.0:4]
  [ALERT] 260/083053 (13292) : Starting frontend f: cannot bind socket (Address already in use) [0.0.0.0:4444]
  [ALERT] 260/083104 (13298) : Starting frontend f: cannot bind socket (Cannot assign requested address) [1.1.1.1:4444]
2020-09-17 08:32:17 +02:00
Willy Tarreau
f1f660978c MINOR: protocol: retrieve the family-specific fields from the family
We now take care of retrieving sock_family, l3_addrlen, bind(),
addrcmp(), get_src() and get_dst() from the protocol family and
not just the protocol itself. There are very few places, this was
only seldom used. Interestingly in sock_inet.c used to rely on
->sock_family instead of ->sock_domain, and sock_unix.c used to
hard-code PF_UNIX instead of using ->sock_domain.

Also it appears obvious we have something wrong it the protocol
selection algorithm because sock_domain is the one set to the custom
protocols while it ought to be sock_family instead, which would avoid
having to hard-code some conversions for UDP namely.
2020-09-16 22:08:07 +02:00
Willy Tarreau
b0254cb361 MINOR: protocol: add a new proto_fam structure for protocol families
We need to specially handle protocol families which regroup common
functions used for a given address family. These functions include
bind(), addrcmp(), get_src() and get_dst() for now. Some fields are
also added about the address family, socket domain (protocol family
passed to the socket() syscall), and address length.

These protocol families are referenced from the protocols but not yet
used.
2020-09-16 22:08:07 +02:00
Willy Tarreau
d69ce1ffbc MEDIUM: sock_inet: implement sock_inet_bind_receiver()
This function collects all the receiver-specific code from both
tcp_bind_listener() and udp_bind_listener() in order to provide a more
generic AF_INET/AF_INET6 socket binding function. For now the API is
not very elegant because some info are still missing from the receiver
while there's no ideal place to fill them except when calling ->listen()
at the protocol level. It looks like some polishing code is needed in
check_config_validity() or somewhere around this in order to finalize
the receivers' setup. The main issue is that listeners and receivers
are created *before* bind_conf options are parsed and that there's no
finishing step to resolve some of them.

The function currently sets up a receiver and subscribes it to the
poller. In an ideal world we wouldn't subscribe it but let the caller
do it after having finished to configure the L4 stuff. The problem is
that the caller would then need to perform an fd_insert() call and to
possibly set the exported flag on the FD while it's not its job. Maybe
an improvement could be to have a separate sock_start_receiver() call
in sock.c.

For now the function is not used but it will soon be. It's already
referenced as tcp and udp's ->bind().
2020-09-16 22:08:07 +02:00
Willy Tarreau
3fd3bdc836 MINOR: receiver: move the FOREIGN and V6ONLY options from listener to settings
The new RX_O_FOREIGN, RX_O_V6ONLY and RX_O_V4V6 options are now set into
the rx_settings part during the parsing, so that we don't need to adjust
them in each and every listener anymore. We have to keep both v4v6 and
v6only due to the precedence from v6only over v4v6.
2020-09-16 22:08:07 +02:00
Willy Tarreau
37bafdcbb1 MINOR: sock_inet: move the IPv4/v6 transparent mode code to sock_inet
This code was highly redundant, existing for TCP clients, TCP servers
and UDP servers. Let's move it to sock_inet where it belongs. The new
functions are sock_inet4_make_foreign() and sock_inet6_make_foreign().
2020-08-28 18:51:36 +02:00
Willy Tarreau
e5bdc51bb5 REORG: sock_inet: move default_tcp_maxseg from proto_tcp.c
Let's determine it at boot time instead of doing it on first use. It
also saves us from having to keep it thread local. It's been moved to
the new sock_inet_prepare() function, and the variables were renamed
to sock_inet_tcp_maxseg_default and sock_inet6_tcp_maxseg_default.
2020-08-28 18:51:36 +02:00
Willy Tarreau
d88e8c06ac REORG: sock_inet: move v6only_default from proto_tcp.c to sock_inet.c
The v6only_default variable is not specific to TCP but to AF_INET6, so
let's move it to the right file. It's now immediately filled on startup
during the PREPARE stage so that it doesn't have to be tested each time.
The variable's name was changed to sock_inet6_v6only_default.
2020-08-28 18:51:36 +02:00
Willy Tarreau
25140cc573 REORG: inet: replace tcp_is_foreign() with sock_inet_is_foreign()
The function now makes it clear that it's independent on the socket
type and solely relies on the address family. Note that it supports
both IPv4 and IPv6 as we don't seem to need it per-family.
2020-08-28 18:51:36 +02:00
Willy Tarreau
c5a94c936b MINOR: sock_inet: implement sock_inet_get_dst()
This one is common to the TCPv4 and UDPv4 code, it retrieves the
destination address of a socket, taking care of the possiblity that for
an incoming connection the traffic was possibly redirected. The TCP and
UDP definitions were updated to rely on it and remove duplicated code.
2020-08-28 18:51:36 +02:00
Willy Tarreau
0d06df6448 MINOR: sock: introduce sock_inet and sock_unix
These files will regroup everything specific to AF_INET, AF_INET6 and
AF_UNIX socket definitions and address management. Some code there might
be agnostic to the socket type and could later move to af_xxxx.c but for
now we only support regular sockets so no need to go too far.

The files are quite poor at this step, they only contain the address
comparison function for each address family.
2020-08-28 18:51:36 +02:00