Commit Graph

78 Commits

Author SHA1 Message Date
Frederic Lecaille
52ec3430f2 MINOR: sock: Add protocol and socket types parameters to sock_create_server_socket()
This patch only adds <proto_type> new proto_type enum parameter and <sock_type>
socket type parameter to sock_create_server_socket() and adapts its callers.
This is to prepare the use of this function by QUIC servers/backends.
2025-06-11 18:37:34 +02:00
Ilia Shipitsin
27a6353ceb CLEANUP: assorted typo fixes in the code, commits and doc 2025-04-03 11:37:25 +02:00
Ilia Shipitsin
78b849b839 CLEANUP: assorted typo fixes in the code and comments
code, comments and doc actually.
2025-04-02 11:12:20 +02:00
Christopher Faulet
71320fc9c1 MINOR: tevt/connection: Add support for POLL_HUP/POLL_ERR events
Connection errors can be detected via connect/recv/send syscall, but also
because it was reported by the poller. So dedicated events, at the FD level,
are introduced to make the difference.

term_events tool was updated accordingly.
2025-01-31 10:41:50 +01:00
Christopher Faulet
f2778ccc7d MINOR: tevt/connection: Add dedicated termination events for lower locations
To be able to add more accurate termination events for each location, the
enum will be splitted by location. Indeed, there are at most 16 possbile
events. It will be pretty confusing to use same termination events for the
different locations. So the best is to split them.

In this patch, the termination events for the fd, hs and xprt locations are
introduced. For now some holes are added to keep similar events aligned
across enums. But this may change in future.
2025-01-31 10:41:50 +01:00
Christopher Faulet
e944944990 MINOR: tevt: Add the termination events log's fundations
Termination events logs will be used to report the events that led to close
a connection. Unlike flags, that reflect a state, the idea here is to store
a log to preserve the order of the events. Most of time, when debugging an
issue, the order of the events is crucial to be able to understand the root
cause of the issue. The traces are trully heplful to do so. But it is not
always possible to active them because it is pretty verbose. On heavily
loaded platforms, it is not acceptable. We hope that the termination events
logs will help us in that situations.

One termination events log will be be store at each layer (connection, mux
connection, mux stream...) as a 32-bits integer. Each event will be store on
8 bits, 4 bits for the location and 4 bits for the type. So the first four
events will be stored only for each layer. It should be enough why a
connection is closed.

In this patch, the enums defining the termination event locations and types
are added. The macro to report a new event is also added and a function to
convert a termination events log to a string that could be display in log
messages for instance.
2025-01-31 10:41:49 +01:00
Christopher Faulet
7262433183 BUG/MEDIUM: sock: Remove FD_POLL_HUP during connect() if FD_POLL_ERR is not set
epoll_wait() may return EPOLLUP and/or EPOLLRDHUP after an asynchronous
connect(), to indicate that the peer accepted the connection then
immediately closed before epoll_wait() returned. When this happens,
sock_conn_check() is called to check whether or not the connection correctly
established, and after that the receive channel of the socket is assumed to
already be closed. This lets haproxy send the request at best (if RDHUP and
not HUP) then immediately close.

Over the last two years, there were a few reports about this spuriously
happening on connections where network captures proved that the server had
not closed at all (and sometimes even received the request and responded to
it after haproxy had closed). The logs show that a successful connection is
immediately reported on error after the request was sent. After
investigations, it appeared that a EPOLLUP, or eventually a EPOLLRDHUP, can
be reported by epool_wait() during the connect() but in sock_conn_check(),
the connect() reports a success. So the connection is validated but the HUP
is handled on the first receive and an error is reported.

The same behavior could be observed on health-checks, leading HAProxy to
consider the server as DOWN while it is not.

The only explanation at this point is that it is a kernel bug, notably
because it does not even match the documentation for connect() nor epoll. In
addition for now it was only observed with Ubuntu kernels 5.4 and 5.15 and
was never reproduced on any other one.

We have no reproducer but here is the typical strace observed:

socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 114
fcntl(114, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
setsockopt(114, SOL_TCP, TCP_NODELAY, [1], 4) = 0
connect(114, {sa_family=AF_INET, sin_port=htons(11000), sin_addr=inet_addr("A.B.C.D")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_ctl(19, EPOLL_CTL_ADD, 114, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP, data={u32=114, u64=114}}) = 0
epoll_wait(19, [{events=EPOLLIN, data={u32=15, u64=15}}, {events=EPOLLIN, data={u32=151, u64=151}}, {events=EPOLLIN, data={u32=59, u64=59}}, {events=EPOLLIN|EPOLLRDHUP, data={u32=114, u64=114}}], 200, 0) = 4
epoll_ctl(19, EPOLL_CTL_MOD, 114, {events=EPOLLOUT, data={u32=114, u64=114}}) = 0
epoll_wait(19, [{events=EPOLLOUT, data={u32=114, u64=114}}, {events=EPOLLIN, data={u32=15, u64=15}}, {events=EPOLLIN, data={u32=10, u64=10}}, {events=EPOLLIN, data={u32=165, u64=165}}], 200, 0) = 4
connect(114, {sa_family=AF_INET, sin_port=htons(11000), sin_addr=inet_addr("A.B.C.D")}, 16) = 0
sendto(114, "POST "..., 1009, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 1009
close(114)                              = 0

Some ressources about this issue:
  - https://www.spinics.net/lists/netdev/msg876470.html
  - https://github.com/haproxy/haproxy/issues/1863
  - https://github.com/haproxy/haproxy/issues/2368

So, to workaround the issue, we have decided to remove FD_POLL_HUP flag on
the FD during the connection establishement if FD_POLL_ERR is not reported
too in sock_conn_check(). This way, the call to connect() is able to
validate or reject the connection. At the end, if the HUP or RDHUP flags
were valid, either connect() would report the error itself, or the next
recv() would return 0 confirming the closure that the poller tried to
report. EPOLL_RDHUP is only an optimization to save a syscall anyway, and
this pattern is so rare that nobody will ever notice the extra call to
recv().

Please note that at least one reporter confirmed that using poll() instead
of epoll() also addressed the problem, so that can also be a temporary
workaround for those discovering the problem without the ability to
immediately upgrade.

The event is accounted via a COUNT_IF(), to be able to spot it in future
issue. Just in case.

This patch should fix the issue #1863 and #2368. It may be related
to #2751. It should be backported as far as 2.4. In 3.0 and below, the
COUNT_IF() must be removed.
2024-11-27 12:16:25 +01:00
Aurelien DARRAGON
d879bf6600 MEDIUM: sock: also restore effective unix family in get_{src,dst}()
As in previous commit, let's push the logic a bit further in order to
properly restore the effective UNIX socket type when leveraging
get_src() and get_dst() sock functions, since they rely on getpeername()
and getsockname() under the hood, both of which will actually loose the
effective family and return AF_UNIX for all our custom UNIX sockets.

To do this, add sock_restore_unix_family() helper function from the logic
implemented in the previous commit, and call this function from get_src()
and get_dst() in case of unix socket prior to returning.
2024-10-29 12:15:03 +01:00
Aurelien DARRAGON
ae64444303 MINOR: sock: restore effective UNIX family in sock_get_old_sockets()
When getting sockets from older process in sock_get_old_sockets(), we
leverage getsockname() to fill sockaddr struct from known fd.

However, the kernel doesn't know about our custom UNIX families such
as CUST_ABNS and CUST_ABNSZ which are both based on AF_UNIX real family.

Since haproxy socket API relies on effective family (and not real family)
to recognize the socket type instead of having to guess it by analyzing
the path content, let's restore it right after getsockname() since we
have all the infos needed to deduce the right family.

If the path starts with a NULL byte, we know that it is an abstract sock.
Then we simply check <addrlen> value from getsockname() to know if the
addr makes uses of the whole path space (normal ABNS) or partial path
space (zero ABNS / aka ABNZ) terminated by 0.
2024-10-29 12:14:57 +01:00
Valentine Krasnobaeva
4931d1ca5f BUG/MEIDUM: mworker: fix fd leak from master to worker
During re-execution master keeps always opened "reload" sockpair FDs and
shared sockpair ipc_fd[0], the latter is using to transfert listeners sockets
from the previously forked worker to the new one. So, these master's FDs are
inherited in the newly forked worker and must be closed in its context.

"reload" sockpair inherited FDs and shared sockpair FD (ipc_fd[0]) are closed
separately, becase master doesn't recreate "reload" sockpair each time after
its re-exec. It always keeps the same FDs for this "reload" sockpair. So in
worker context it can be closed immediately after the fork.

At contrast, shared sockpair is created each time after reload, when the new
worker will be forked. So, if N previous workers are still exist at this moment,
the new worker will inherit N ipc_fd[0] from master. So, it's more save to
close all these FDs after get_listeners_fd() and bind_listeners() calls.
Otherwise, early closed FDs in the worker context will be immediately bound to
listeners and we could potentially have some bugs.
2024-10-26 22:53:24 +02:00
Aperence
20efb856e1 MEDIUM: protocol: add MPTCP per address support
Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension
that enables a TCP connection to use different paths.

Multipath TCP has been used for several use cases. On smartphones, MPTCP
enables seamless handovers between cellular and Wi-Fi networks while
preserving established connections. This use-case is what pushed Apple
to use MPTCP since 2013 in multiple applications [2]. On dual-stack
hosts, Multipath TCP enables the TCP connection to automatically use the
best performing path, either IPv4 or IPv6. If one path fails, MPTCP
automatically uses the other path.

To benefit from MPTCP, both the client and the server have to support
it. Multipath TCP is a backward-compatible TCP extension that is enabled
by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...).
Multipath TCP is included in the Linux kernel since version 5.6 [3]. To
use it on Linux, an application must explicitly enable it when creating
the socket. No need to change anything else in the application.

This attached patch adds MPTCP per address support, to be used with:

  mptcp{,4,6}@<address>[:port1[-port2]]

MPTCP v4 and v6 protocols have been added: they are mainly a copy of the
TCP ones, with small differences: names, proto, and receivers lists.

These protocols are stored in __protocol_by_family, as an alternative to
TCP, similar to what has been done with QUIC. By doing that, the size of
__protocol_by_family has not been increased, and it behaves like TCP.

MPTCP is both supported for the frontend and backend sides.

Also added an example of configuration using mptcp along with a backend
allowing to experiment with it.

Note that this is a re-implementation of Björn's work from 3 years ago
[4], when haproxy's internals were probably less ready to deal with
this, causing his work to be left pending for a while.

Currently, the TCP_MAXSEG socket option doesn't seem to be supported
with MPTCP [5]. This results in a warning when trying to set the MSS of
sockets in proto_tcp:tcp_bind_listener.

This can be resolved by adding two new variables:
sock_inet(6)_mptcp_maxseg_default that will hold the default
value of the TCP_MAXSEG option. Note that for the moment, this
will always be -1 as the option isn't supported. However, in the
future, when the support for this option will be added, it should
contain the correct value for the MSS, allowing to correctly
set the TCP_MAXSEG option.

Link: https://www.rfc-editor.org/rfc/rfc8684.html [1]
Link: https://www.tessares.net/apples-mptcp-story-so-far/ [2]
Link: https://www.mptcp.dev [3]
Link: https://github.com/haproxy/haproxy/issues/1028 [4]
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/515 [5]

Co-authored-by: Dorian Craps <dorian.craps@student.vinci.be>
Co-authored-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
2024-08-30 18:53:49 +02:00
Aperence
2f171fe36a MEDIUM: sock: use protocol when creating socket
Use the protocol configured for a connection when creating the socket,
instead of always using 0.

This change is needed to allow new protocol to be used when creating
the sockets, such as MPTCP. Note however that this patch won't change
anything for now, as the only other value that proto->sock_prot could
hold is IPPROTO_TCP, which has the same behavior as 0 when passed to
socket.
2024-08-30 18:53:49 +02:00
Willy Tarreau
034974106f MINOR: socket: don't ban all custom families from reuseport
The test on ss_family >= AF_MAX is too strict if we want to support new
custom families, let's apply this to the real_family instead so that we
check that the underlying socket supports reuseport.
2024-08-21 17:37:46 +02:00
Willy Tarreau
d592ebdbeb MEDIUM: socket: always properly use the sock_domain for requested families
Now we make sure to always look up the protocol's domain for an address
family. Previously we would use it as-is, which prevented from properly
using custom addresses (which is when they differ).

This removes some hard-coded tests such as in log.c where UNIX vs UDP
was explicitly checked for example. It requires a bit of care, however,
so as to properly pass value 1 in the 3rd arg of the protocol_lookup()
for DGRAM stuff. Maybe one day we'll change these for defines or enums
to limit mistakes.
2024-08-21 17:36:58 +02:00
Willy Tarreau
732913f848 MINOR: protocol: properly assign the sock_domain and sock_family
When we finally split sock_domain from sock_family in 2.3, something
was not cleanly finished. The family is what should be stored in the
address while the domain is what is supposed to be passed to socket().
But for the custom addresses, we did the opposite, just because the
protocol_lookup() function was acting on the domain, not the family
(both of which are equal for non-custom addresses).

This is an API bug but there's no point backporting it since it does
not have visible effects. It was visible in the code since a few places
were using PF_UNIX while others were comparing the domain against AF_MAX
instead of comparing the family.

This patch clarifies this in the comments on top of proto_fam, addresses
the indexing issue and properly reconfigures the two custom families.
2024-08-21 16:46:15 +02:00
Amaury Denoyelle
45f40bac4c MEDIUM: config: prevent communication with privileged ports
This commit introduces a new global setting named
harden.reject_privileged_ports.{tcp|quic}. When active, communications
with clients which use privileged source ports are forbidden. Such
behavior is considered suspicious as it can be used as spoofing or
DNS/NTP amplication attack.

Value is configured per transport protocol. For each TCP and QUIC
distinct code locations are impacted by this setting. The first one is
in sock_accept_conn() which acts as a filter for all TCP based
communications just after accept() returns a new connection. The second
one is dedicated for QUIC communication in quic_recv(). In both cases,
if a privileged source port is used and setting is disabled, received
message is silently dropped.

By default, protection are disabled for both protocols. This is to be
able to backport it without breaking changes on stable release.

This should be backported as it is an interesting security feature yet
relatively simple to implement.
2024-05-24 14:36:31 +02:00
Valentine Krasnobaeva
83ab1479d0 BUG/MINOR: sock: fix sock_create_server_socket
Set stream_err value as SF_ERR_NONE, if obtained socket fd has passed all
common runtime and configuration related checks.

'.connect()' method implementation in higher protocol layers requires Stream
Error Flag as the return value. So, at the socket layer, we need to pass to
sock_create_server_socket() a variable to set this flag, because syscalls and
some socket options checks are convenient to performe at the socket layer.
2024-05-22 11:59:55 +02:00
Valentine Krasnobaeva
39caa20b3c MINOR: sock: set conn->err_code in case of EPERM
To improve the readability of sock_handle_system_err(), let's
set explicitly conn->err_code as CO_ER_SOCK_ERR in case of EPERM
(could be returned by setns syscall).
2024-05-21 20:14:31 +02:00
Valentine Krasnobaeva
5f713c03be BUG/MEDIUM: proto: fix fd leak in <proto>_connect_server
This fixes the fd leak, introduced in the commit d3fc982cd7
("MEDIUM: proto: make common fd checks in sock_create_server_socket").

Initially sock_create_server_socket() was designed to return only created
socket FD or -1. Its callers from upper protocol layers were required to test
the returned errno and were required then to apply different configuration
related checks to obtained positive sock_fd. A lot of this code was duplicated
among protocols implementations.

The new refactored version of sock_create_server_socket() gathers in one place
all duplicated checks, but in order to be complient with upper protocol
layers, it needs the 3rd parameter: 'stream_err', in which it sets the
Stream Error Flag for upper levels, if the obtained sock_fd has passed all
additional checks.

No backport needed since this was introduced in 3.0-dev10.
2024-05-21 20:14:05 +02:00
Valentine Krasnobaeva
13ef552488 MINOR: sock: add EPERM case in sock_handle_system_err
setns() may return EPERM if thread, that tries to move into different
namespace, do not have CAP_SYS_ADMIN capability in its Effective set.
So, extending sock_handle_system_err() with this error allows to send
appropriate log message and set SF_ERR_PRXCOND (SC termination
flag in log) as stream termination error code. This error code can be
simply checked with SF_ERR_MASK at protocol layer.
2024-04-30 21:39:32 +02:00
Valentine Krasnobaeva
d3fc982cd7 MEDIUM: proto: make common fd checks in sock_create_server_socket
quic_connect_server(), tcp_connect_server(), uxst_connect_server() duplicate
same code to check different ERRNOs, that socket() and setns() may return.
They also duplicate some runtime condition checks, applied to the obtained
server socket fd.

So, in order to remove these duplications and to improve code readability,
let's encapsulate socket() and setns() ERRNOs handling in
sock_handle_system_err(). It must be called just before fd's runtime condition
checks, which we also move in sock_create_server_socket by the same reason.
2024-04-30 21:39:24 +02:00
Valentine Krasnobaeva
772d070ab5 MINOR: sock_set_mark: take sock family in account
SO_MARK, SO_USER_COOKIE, SO_RTABLE socket options (used to set the special
mark/ID on socket, in order to perform mark-based routing) are only supported
by AF_INET sockets. So, let's check socket address family, when we enter into
this function.
2024-04-30 21:38:29 +02:00
Valentine Krasnobaeva
a0b5324cff MINOR: sock: rename sock to sock_fd in sock_create_server_socket
Renaming sock to sock_fd makes it more clear, that sock_create_server_socket
returns the fd of newly created server socket and then we check this fd.
As we heavily use "fd" variable name in all protocol implementations, let's
prefix this one with the name of its object file: sock.o.
2024-04-30 21:38:12 +02:00
Willy Tarreau
b4734c2bd7 BUG/MINOR: sock: handle a weird condition with connect()
As reported on github issue #2491, there's a very strange situation where
epoll_wait() appears to be reported EPOLLERR only (and not IN/OUT/HUP etc
as normally happens with EPOLLERR), and when connect() is called again to
check the state of the ongoing connection, it returns EALREADY, basically
saying "no news, please wait". This obviously triggers a wakeup loop. For
now it has remained impossible to reproduce this issue outside of the
reporter's environment, but that's definitely something that is impossible
to get out from.

The workaround here is to address the lowest level cause we can act on,
which is to avoid returning to wait if EPOLLERR was returned. Indeed, in
this case we know it will loop, so we must definitely take this one into
account. We only do that after connect() asks us to wait, so that a
properly established connection with a queued error at the end of an
exchange will not be diverted and will be handled as usual.

This should be backported to approximately all versions, at least as far
as 2.4 according to the reporter who observed it there.

Thanks to @donnyxray for their useful captures isolating the problem.
2024-04-19 17:04:25 +02:00
Aurelien DARRAGON
42a97d9feb MEDIUM: tcp-act/backend: support for set-bc-{mark,tos} actions
set-bc-{mark,tos} actions are pretty similar to set-fc-{mark,tos} to set
mark/tos on packets sent from haproxy to server: set-bc-{mark,tos} actions
act on the whole backend/srv connection: from connect() to connection
teardown, thus they may only be used before the connection to the server
is instantiated, meaning that they are only relevant for request-oriented
rules such as tcp-request or http-request rules. For now their use is
limited to content request rules, because tos and mark informations are
stored directly within the stream, thus it is required that the stream
already exists.

stream flags are used in combination with dedicated stream struct members
variables to pass 'tos' and 'mark' informations so that they are correctly
considered during stream connection assignment logic (prior to connecting
to actually connecting to the server)

'tos' and 'mark' fd sockopts are taken into account in conn hash
parameters for connection reuse mechanism.

The documentation was updated accordingly.
2024-02-01 10:58:30 +01:00
Willy Tarreau
445fc1fe3a BUG/MINOR: sock: mark abns sockets as non-suspendable and always unbind them
In 2.3, we started to get a cleaner socket unbinding mechanism with
commit f58b8db47 ("MEDIUM: receivers: add an rx_unbind() method in
the protocols"). This mechanism rightfully refrains from unbinding
when sockets are expected to be transferrable to another worker via
"expose-fd listeners", but this is not compatible with ABNS sockets,
which do not support reuseport, unbinding nor being renamed: in short
they will always prevent a new process from binding.

It turns out that this is not much visible because by pure accident,
GTUNE_SOCKET_TRANSFER is only set in the code dealing with master mode
and deamons, so it's never set in foreground mode nor in tests even if
present on the stats socket. However with master mode, it is now always
set even when not present on the stats socket, and will always conflict.

The only reasonable approach seems to consist in marking these abns
sockets as non-suspendable so that the generic sock_unbind() code can
decide to just unbind them regardless of GTUNE_SOCKET_TRANSFER.

This should carefully be backported as far as 2.4.
2023-11-20 11:38:26 +01:00
Willy Tarreau
6a4591c3d0 BUG/MEDIUM: connection: report connection errors even when no mux is installed
An annoying issue was met when testing the reverse-http mechanism, by
which failed connection attempts would apparently not be attempted again
when there was no connect timeout. It turned out to be more generalized
than the rhttp system, and actually affects all outgoing connections
relying on NPN or ALPN to choose the mux, on which no mux is installed
and for which the subscriber (ssl_sock) must be notified instead.

The problem appeared during 2.2-dev1 development. First, commit
062df2c23 ("MEDIUM: backend: move the connection finalization step to
back_handle_st_con()") broke the error reporting by testing CO_FL_ERROR
only under CO_FL_CONNECTED. While it still worked OK for cases where a
mux was present, it did not for this specific situation because no
single error path would be considered when no mux was present. Changing
the CO_FL_CONNECTED test to also include CO_FL_ERROR did work, until
a few commits later with 477902bd2 ("MEDIUM: connections: Get ride of
the xprt_done callback.") which removed the xprt_done callback that was
used to indicate success or failure of the transport layer setup, since,
as the commit explains, we can report this via the mux. What this last
commit says is true, except when there is no mux.

For this, however, the sock_conn_iocb() function (formerly conn_fd_handler)
is called for such errors, evaluates a number of conditions, none of which
is matched in this error condition case, since sock_conn_check() instantly
reports an error causing a jump to the leave label. There, the mux is
notified if installed, and the function returns. In other error condition
cases, readiness and activity are checked for both sides, the tasklets
woken up and the corresponding subscriber flags removed. This means that
a sane (and safe) approach would consist in just notifying the subscriber
in case of error, if such a subscriber still exists: if still there, it
means the event hasn't been caught earlier, then it's the right moment
to report it. And since this is done after conn_notify_mux(), it still
leaves all control to the mux once it's installed.

This commit should be progressively backported as far as 2.2 since it's
where the problem was introduced. It's important to clearly check the
error path in each function to make sure the fix still does what it's
supposed to.
2023-11-14 08:49:23 +01:00
Willy Tarreau
b073573c10 MINOR: sock: add a function to check for SO_REUSEPORT support at runtime
The new function _sock_supports_reuseport() will be used to check if a
protocol type supports SO_REUSEPORT or not. This will be useful to verify
that shards can really work.
2023-04-23 09:46:15 +02:00
Willy Tarreau
9b773ec118 CLEANUP: sock: always perform last connection updates before wakeup
Normally the task_wakeup() in sock_conn_io_cb() is expected to
happen on the same thread the FD is attached to. But due to the
way the code was arranged in the past (with synchronous callbacks)
we continue to update connections after the wakeup, which always
makes the reader have to think deeply whether it's possible or not
to call another thread there. Let's just move the tasklet_wakeup()
at the end to make sure there's no problem with that.
2023-03-08 16:07:32 +01:00
William Lallemand
708949da49 MINOR: sockpair: move send_fd_uxst() error message in caller
Move the ha_alert() in send_fd_uxst() in the callers and add the FD
numbers in the message.
2022-07-25 16:11:11 +02:00
Willy Tarreau
27a3245599 MEDIUM: fd: make fd_insert() take local thread masks
fd_insert() was already given a thread group ID and a global thread mask.
Now we're changing the few callers to take the group-local thread mask
instead. It's passed directly into the FD's thread mask. Just like for
previous commit, it must not change anything when a single group is
configured.
2022-07-15 20:16:30 +02:00
Willy Tarreau
9464bb1f05 MEDIUM: fd: add the tgid to the fd and pass it to fd_insert()
The file descriptors will need to know the thread group ID in addition
to the mask. This extends fd_insert() to take the tgid, and will store
it into the FD.

In the FD, the tgid is stored as a combination of tgid on the lower 16
bits and a refcount on the higher 16 bits. This allows to know when it's
really possible to trust the tgid and the running mask. If a refcount is
higher than 1 it indeed indicates another thread else might be in the
process of updating these values.

Since a closed FD must necessarily have a zero refcount, a test was
added to fd_insert() to make sure that it is the case.
2022-07-15 19:58:06 +02:00
Willy Tarreau
030b3e6bcc MINOR: connection: get rid of the CO_FL_ADDR_*_SET flags
Just like for the conn_stream, now that these addresses are dynamically
allocated, there is no single case where the pointer is set without the
corresponding flag, and the flag is used as a permission to dereference
the pointer. Let's just replace the test of the flag with a test of the
pointer and remove all flag assignment. This makes the code clearer
(especially in "if" conditions) and saves the need for future code to
think about properly setting the flag after setting the pointer.
2022-05-02 17:47:46 +02:00
Willy Tarreau
382474348c CLEANUP: tree-wide: use fd_set_nonblock() and fd_set_cloexec()
This gets rid of most open-coded fcntl() calls, some of which were passed
through DISGUISE() to avoid a useless test. The FD_CLOEXEC was most often
set without preserving previous flags, which could become a problem once
new flags are created. Now this will not happen anymore.
2022-04-26 10:59:48 +02:00
Willy Tarreau
acef5e27b0 MINOR: tree-wide: always consider EWOULDBLOCK in addition to EAGAIN
Some older systems may routinely return EWOULDBLOCK for some syscalls
while we tend to check only for EAGAIN nowadays. Modern systems define
EWOULDBLOCK as EAGAIN so that solves it, but on a few older ones (AIX,
VMS etc) both are different, and for portability we'd need to test for
both or we never know if we risk to confuse some status codes with
plain errors.

There were few entries, the most annoying ones are the switch/case
because they require to only add the entry when it differs, but the
other ones are really trivial.
2022-04-25 20:32:15 +02:00
Willy Tarreau
8f6c0c32b8 BUG/MINOR: sock: do not double-close the accepted socket on the error path
Coverity found in issue #1646 that I added a double-close bug in last
commit e4d09cedb ("MINOR: sock: check configured limits at the sock layer,
not the listener's") because the error path already closes the FD. No
backport needed.
2022-04-12 07:49:11 +02:00
Willy Tarreau
07ecfc5e88 MEDIUM: connection: panic when calling FD-specific functions on FD-less conns
Certain functions cannot be called on an FD-less conn because they are
normally called as part of the protocol-specific setup/teardown sequence.
Better place a few BUG_ON() to make sure none of them is called in other
situations. If any of them would trigger in ambiguous conditions, it would
always be possible to replace it with an error.
2022-04-11 19:31:47 +02:00
Willy Tarreau
e4d09cedb6 MINOR: sock: check configured limits at the sock layer, not the listener's
listener_accept() used to continue to enforce the FD limits relative to
global.maxsock by itself while it's the last FD-specific test in the
whole file. This test has nothing to do there, it ought to be placed in
sock_accept_conn() which is the one in charge of FD allocation and tests.
Similar tests are already located there by the way. The only tiny
difference is that listener_accept() used to pause for one second when
this limit was reached, while other similar conditions were pausing only
100ms, so now the same 100ms will apply. But that's not important and
could even be considered as an improvement.
2022-04-11 19:31:47 +02:00
Willy Tarreau
b510116fd2 MINOR: sock: move the unused socket cleaning code into its own function
The startup code used to scan the list of unused sockets retrieved from
an older process, and to close them one by one. This also required that
the knowledge of the internal storage of these temporary sockets was
known from outside sock.c and that the code was copy-pasted at every
call place.

This patch moves this into sock.c under the name
sock_drop_unused_old_sockets(), and removes the xfer_sock_list
definition from sock.h since the rest of the code doesn't need to know
this.

This cleanup is minimal and preliminary to a future fix that will need
to be backported to all versions featuring FD transfers over the CLI.
2022-01-28 19:04:02 +01:00
William Lallemand
2be557f7cb MEDIUM: mworker: seamless reload use the internal sockpairs
With the master worker, the seamless reload was still requiring an
external stats socket to the previous process, which is a pain to
configure.

This patch implements a way to use the internal socketpair between the
master and the workers to transfer the sockets during the reload.
This way, the master will always try to transfer the socket, even
without any configuration.

The master will still reload with the -x argument, followed by the
sockpair@ syntax. ( ex -x sockpair@4 ). Which use the FD of internal CLI
to the worker.
2021-11-24 19:00:39 +01:00
Tim Duesterhus
f897fc99bd CLEANUP: sock: Wrap accept4_broken = 1 into additional parenthesis
This makes it clear to static analysis tools that this assignment is
intentional and not a mistyped comparison.
2021-11-20 14:52:01 +01:00
Willy Tarreau
e3b4518414 MINOR: protocols: make use of the protocol type to select the protocol
Instead of using sock_type and ctrl_type to select a protocol, let's
make use of the new protocol type. For now they always match so there
is no change. This is applied to address parsing and to socket retrieval
from older processes.
2021-10-27 17:31:20 +02:00
Willy Tarreau
20b622e04b MINOR: connection: add a new CO_FL_WANT_DRAIN flag to force drain on close
Sometimes we'd like to do our best to drain pending data before closing
in order to save the peer from risking to receive an RST on close.

This adds a new connection flag CO_FL_WANT_DRAIN that is used to
trigger a call to conn_ctrl_drain() from conn_ctrl_close(), and the
sock_drain() function ignores fd_recv_ready() if this flag is set,
in order to catch latest data. It's not used for now.
2021-10-21 21:48:23 +02:00
Willy Tarreau
5d9ddc5442 BUILD: tree-wide: add several missing activity.h
A number of files currently access activity counters but rely on their
definitions to be inherited from other files (task.c, backend.c hlua.c,
sock.c, pool.c, stats.c, fd.c).
2021-10-07 01:36:51 +02:00
Willy Tarreau
5a9c637bf3 BUG/MEDIUM: sock: make sure to never miss early connection failures
As shown in issue #1251, it is possible for a connect() to report an
error directly via the poller without ever reporting send readiness,
but currentlt sock_conn_check() manages to ignore that situation,
leading to high CPU usage as poll() wakes up on these FDs.

The bug was apparently introduced in 1.5-dev22 with commit fd803bb4d
("MEDIUM: connection: add check for readiness in I/O handlers"), but
was likely only woken up by recent changes to conn_fd_handler() that
made use of wakeups instead of direct calls between 1.8 and 1.9,
voiding any chance to catch such errors in the early recv() callback.

The exact sequence that leads to this situation remains obscure though
because the poller does not report send readiness nor does it report an
error. Only HUP and IN are reported on the FD. It is also possible that
some recent kernel updates made this condition appear while it never
used to previously.

This needs to be backported to all stable branches, at least as far
as 2.0. Before 2.2 the code was in tcp_connect_probe() in proto_tcp.c.
2021-07-06 10:52:19 +02:00
Willy Tarreau
b41a6e9101 MINOR: fd: move .linger_risk into fdtab[].state
No need to keep this flag apart any more, let's merge it into the global
state. The CLI's output state was extended to 6 digits and the linger/cloned
flags moved inside the parenthesis.
2021-04-07 18:07:49 +02:00
Willy Tarreau
f509065191 MEDIUM: fd: merge fdtab[].ev and state for FD_EV_* and FD_POLL_* into state
For a long time we've had fdtab[].ev and fdtab[].state which contain two
arbitrary sets of information, one is mostly the configuration plus some
shutdown reports and the other one is the latest polling status report
which also contains some sticky error and shutdown reports.

These ones used to be stored into distinct chars, complicating certain
operations and not even allowing to clearly see concurrent accesses (e.g.
fd_delete_orphan() would set the state to zero while fd_insert() would
only set the event to zero).

This patch creates a single uint with the two sets in it, still delimited
at the byte level for better readability. The original FD_EV_* values
remained at the lowest bit levels as they are also known by their bit
value. The next step will consist in merging the remaining bits into it.

The whole bits are now cleared both in fd_insert() and _fd_delete_orphan()
because after a complete check, it is certain that in both cases these
functions are the only ones touching these areas. Indeed, for
_fd_delete_orphan(), the thread_mask has already been zeroed before a
poller can call fd_update_event() which would touch the state, so it
is certain that _fd_delete_orphan() is alone. Regarding fd_insert(),
only one thread will get an FD at any moment, and it as this FD has
already been released by _fd_delete_orphan() by definition it is certain
that previous users have definitely stopped touching it.

Strictly speaking there's no need for clearing the state again in
fd_insert() but it's cheap and will remove some doubts during some
troubleshooting sessions.
2021-04-07 18:04:39 +02:00
Willy Tarreau
61cfdf4fd8 CLEANUP: tree-wide: replace free(x);x=NULL with ha_free(&x)
This makes the code more readable and less prone to copy-paste errors.
In addition, it allows to place some __builtin_constant_p() predicates
to trigger a link-time error in case the compiler knows that the freed
area is constant. It will also produce compile-time error if trying to
free something that is not a regular pointer (e.g. a function).

The DEBUG_MEM_STATS macro now also defines an instance for ha_free()
so that all these calls can be checked.

178 occurrences were converted. The vast majority of them were handled
by the following Coccinelle script, some slightly refined to better deal
with "&*x" or with long lines:

  @ rule @
  expression E;
  @@
  - free(E);
  - E = NULL;
  + ha_free(&E);

It was verified that the resulting code is the same, more or less a
handful of cases where the compiler optimized slightly differently
the temporary variable that holds the copy of the pointer.

A non-negligible amount of {free(str);str=NULL;str_len=0;} are still
present in the config part (mostly header names in proxies). These
ones should also be cleaned for the same reasons, and probably be
turned into ist strings.
2021-02-26 21:21:09 +01:00
Remi Tricot-Le Breton
25dd0ad123 BUG/MINOR: sock: Unclosed fd in case of connection allocation failure
If allocating a connection object failed right after a successful accept
on a listener, the new file descriptor was not properly closed.

This fixes GitHub issue #905.
It can be backported to 2.3.
2021-02-05 12:14:51 +01:00
Willy Tarreau
472125bc04 MINOR: protocol: add a pair of check_events/ignore_events functions at the ctrl layer
Right now the connection subscribe/unsubscribe code needs to manipulate
FDs, which is not compatible with QUIC. In practice what we need there
is to be able to either subscribe or wake up depending on readiness at
the moment of subscription.

This commit introduces two new functions at the control layer, which are
provided by the socket code, to check for FD readiness or subscribe to it
at the control layer. For now it's not used.
2020-12-11 17:02:50 +01:00