For now it's the same as abns. We'll need to modify sock_unix_addrcmp(),
and a few other ones to support effective path length when dealing with
the \0. Let's check with Tristan's patch for this (upcoming patch).
Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>
This is a pre-requisite to adding the abnsz socket address family:
in this patch we make use of protocol API rework started by 732913f
("MINOR: protocol: properly assign the sock_domain and sock_family") in
order to implement a dedicated address family for ABNS sockets (based on
UNIX parent family).
Thanks to this, it will become trivial to implement a new ABNSZ (for abns
zero) family which is essentially the same as ABNS but with a slight
difference when it comes to path handling (ABNS uses the whole sun_path
length, while ABNSZ's path is zero terminated and evaluation stops at 0)
It was verified that this patch doesn't break reg-tests and behaves
properly (tests performed on the CLI with show sess and show fd).
Anywhere relevant, AF_CUST_ABNS is handled alongside AF_UNIX. If no
distinction needs to be made, real_family() is used to fetch the proper
real family type to handle it properly.
Both stream and dgram were converted, so no functional change should be
expected for this "internal" rework, except that proto will be displayed
as "abns_{stream,dgram}" instead of "unix_{stream,dgram}".
Before ("show sess" output):
0x64c35528aab0: proto=unix_stream src=unix:1 fe=GLOBAL be=<NONE> srv=<none> ts=00 epoch=0 age=0s calls=1 rate=0 cpu=0 lat=0 rq[f=848000h,i=0,an=00h,ax=] rp[f=80008000h,i=0,an=00h,ax=] scf=[8,0h,fd=21,rex=10s,wex=] scb=[8,1h,fd=-1,rex=,wex=] exp=10s rc=0 c_exp=
After:
0x619da7ad74c0: proto=abns_stream src=unix:1 fe=GLOBAL be=<NONE> srv=<none> ts=00 epoch=0 age=0s calls=1 rate=0 cpu=0 lat=0 rq[f=848000h,i=0,an=00h,ax=] rp[f=80008000h,i=0,an=00h,ax=] scf=[8,0h,fd=22,rex=10s,wex=] scb=[8,1h,fd=-1,rex=,wex=] exp=10s rc=0 c_exp=
Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>
For custom families, there's sometimes an underlying real address and
it would be nice to be able to directly use the real family in calls
to bind() and connect() without having to add explicit checks for
exceptions everywhere.
Let's add a .real_family field to struct proto_fam for this. For now
it's always equal to the family except for non-transferable ones such
as rhttp where it's equal to the custom one (anything else could fit).
When we finally split sock_domain from sock_family in 2.3, something
was not cleanly finished. The family is what should be stored in the
address while the domain is what is supposed to be passed to socket().
But for the custom addresses, we did the opposite, just because the
protocol_lookup() function was acting on the domain, not the family
(both of which are equal for non-custom addresses).
This is an API bug but there's no point backporting it since it does
not have visible effects. It was visible in the code since a few places
were using PF_UNIX while others were comparing the domain against AF_MAX
instead of comparing the family.
This patch clarifies this in the comments on top of proto_fam, addresses
the indexing issue and properly reconfigures the two custom families.
Add ->get_info() new control layer callback definition to protocol struct to
retreive statiscal counters information at transport layer (TCPv4/TCPv6) identified by
an integer into a long long int.
Move the TCP specific code from get_tcp_info() to the tcp_get_info() control layer
function (src/proto_tcp.c) and define it as the ->get_info() callback for
TCPv4 and TCPv6.
Note that get_tcp_info() is called for several TCP sample fetches.
This patch is useful to support some of these sample fetches for QUIC and to
keep the code simple and easy to maintain.
Since the following patch, protocol API to update a connection TID
affinity has been extended.
commit 1a43b9f32c71267e3cb514aa70a13c75adb20742
MINOR: proto: extend connection thread rebind API
The single callback set_affinity has been splitted in 3 different
functions which are called at different stages during listener_accept(),
depending on accept queue push success or not. However, the naming was
rendered confusing by the usage of function prefix 1 and 2.
Rename proto callback related to TID affinity update and use the
following names :
* bind_tid_prep
* bind_tid_commit
* bind_tid_reset
This commit should probably be backported at least up to 3.0 with the
above patch. This is because the fix was recently backported and it
would allow to keep changes minimal between the two versions. It could
even be backported up to 2.8 if there is no major conflict.
MINOR: listener: define callback for accept queue push
Extend API for connection thread rebind API by replacing single callback
set_affinity by three different ones. Each one of them is used at a
different stage of the operation :
* set_affinity1 is used similarly to previous set_affinity
* set_affinity2 is called directly from accept_queue_push_mp() when an
entry has been found in accept ring. This operation cannot fail.
* reset_affinity is called after set_affinity1 in case of failure from
accept_queue_push_mp() due to no space left in accept ring. This is
necessary for protocols which must reconfigure resources before
fallback on the current tid.
This patch does not have any functional changes. However, it will be
required to fix crashes for QUIC connections when accept queue ring is
full. As such, it must be backported with it.
This commit introduces a new global setting named
harden.reject_privileged_ports.{tcp|quic}. When active, communications
with clients which use privileged source ports are forbidden. Such
behavior is considered suspicious as it can be used as spoofing or
DNS/NTP amplication attack.
Value is configured per transport protocol. For each TCP and QUIC
distinct code locations are impacted by this setting. The first one is
in sock_accept_conn() which acts as a filter for all TCP based
communications just after accept() returns a new connection. The second
one is dedicated for QUIC communication in quic_recv(). In both cases,
if a privileged source port is used and setting is disabled, received
message is silently dropped.
By default, protection are disabled for both protocols. This is to be
able to backport it without breaking changes on stable release.
This should be backported as it is an interesting security feature yet
relatively simple to implement.
Previous commit renames 'proto_reverse_connect' module to 'proto_rhttp'.
This commits follows this by replacing various custom prefix by 'rhttp_'
to make the code uniform.
Note that 'reverse_' prefix was kept in connection module. This is
because if a new reversable protocol not based on HTTP is implemented,
it may be necessary to reused the same connection function which are
protocol agnostic.
A new protocol named "reverse_connect" is created. This will be used to
instantiate connections that are opened by a reverse bind.
For the moment, only a minimal set of callbacks are defined with no real
work. This will be extended along the next patches.
When testing if a protocol supports SO_REUSEPORT, we're now able to
verify if the OS does really support it. While it may be supported at
build time, it may possibly have been blocked in a container for
example so we'd rather know what it's like.
Some protocol support SO_REUSEPORT and others not. Some have such a
limitation in the kernel, and others in haproxy itself (e.g. sock_unix
cannot support multiple bindings since each one will unbind the previous
one). Also it's really protocol-dependent and not just family-dependent
because on Linux for some time it was supported for TCP and not UDP.
Let's move the definition to the protocols instead. Now it's preset in
tcp/udp/quic when SO_REUSEPORT is defined, and is otherwise left unset.
The enabled() config condition test validates IPv4 (generally sufficient),
and -dR / noreuseport all protocols at once.
We'll use these flags to know if some protocols are supported, and if
so, with what options/extensions. Reuseport will move there for example.
Two functions were added to globally set/clear a flag.
This field forces an unaligned hole between two list heads. Let's move
it up where it will be more easily combined with other fields. In
addition, turn it to unsigned while it's still not used.
There's a two-byte hole in proto_fam after sock_family, let's move the
l3_addrlen there as a ushort. Note that contrary to what the comment
says, it's still not used by hash algorithms though it could.
Define a new protocol callback set_affinity. This function is used
during listener_accept() to notify about a rebind on a new thread just
before pushing the connection on the selected thread queue. If the
callback fails, accept is done locally.
This change will be useful for protocols with state allocated before
accept is done. For the moment, only QUIC protocol is concerned. This
will allow to rebind the quic_conn to a new thread depending on its
load.
This should be backported up to 2.7 after a period of observation.
There's been some great confusion between proto_type, ctrl_type and
sock_type. It turns out that ctrl_type was improperly chosen because
it's not the control layer that is of this or that type, but the
transport layer, and it turns out that the transport layer doesn't
(normally) denaturate the underlying control layer, except for QUIC
which turns dgrams to streams. The fact that the SOCK_{DGRAM|STREAM}
set of values was used added to the confusion.
Let's replace it with xprt_type which reuses the later introduced
PROTO_TYPE_* values, and update the comments to explain which one
works at what level.
Right now the proto_fam descriptor provides a family-specific
get_src() and get_dst() pair of calls to retrieve a socket's source
or destination address. However this only works for connected mode
sockets. QUIC provides its own stream protocol, which relies on a
datagram protocol underneath, so the get_src()/get_dst() at that
protocol's family will not work, and QUIC would need to provide its
own.
This patch implements get_src() and get_dst() at the protocol level
from a connection, and makes sure that conn_get_src()/conn_get_dst()
will automatically use them if defined before falling back to the
family's pair of functions.
The protocol selection is currently performed based on the family,
control type and socket type. But this is often not enough, as both
only provide DGRAM or STREAM, leaving few variants. Protocols like
SCTP for example might be indistinguishable from TCP here. Same goes
for TCP extensions like MPTCP.
This commit introduces a new enum proto_type that is placed in each
and every protocol definition, that will usually more or less match
the sock_type, but being an enum, will support additional values.
No need to include the full tree management code, type files only
need the definitions. Doing so reduces the whole code size by around
3.6% and the build time is down to just 6s.
Right now the connection subscribe/unsubscribe code needs to manipulate
FDs, which is not compatible with QUIC. In practice what we need there
is to be able to either subscribe or wake up depending on readiness at
the moment of subscription.
This commit introduces two new functions at the control layer, which are
provided by the socket code, to check for FD readiness or subscribe to it
at the control layer. For now it's not used.
This is what we need to drain pending incoming data from an connection.
The code was taken from conn_sock_drain() without the connection-specific
stuff. It still takes a connection for now for API simplicity.
Currnetly conn_ctrl_init() does an fd_insert() and conn_ctrl_close() does an
fd_delete(). These are the two only short-term obstacles against using a
non-fd handle to set up a connection. Let's have pur these into the protocol
layer, along with the other connection-level stuff so that the generic
connection code uses them instead. This will allow to define new ones for
other protocols (e.g. QUIC).
Since we only support regular sockets at the moment, the code was placed
into sock.c and shared with proto_tcp, proto_uxst and proto_sockpair.
For the sake of an improved readability, let's group the protocol
field members according to where they're supposed to be defined:
- connection layer (note: for now even UDP needs one)
- binding layer
- address family
- socket layer
Nothing else was changed.
This field used to be needed before commit 2b5e0d8b6 ("MEDIUM: proto_udp:
replace last AF_CUST_UDP* with AF_INET*") as it was used as a protocol
entry selector. Since this commit it's always equal to the socket family's
value so it's entirely redundant. Let's remove it now to simplify the
protocol definition a little bit.
With the removal of the family-specific port setting, all protocol had
exactly the same implementation of ->add(). A generic one was created
with the name "default_add_listener" so that all other ones can now be
removed. The API was slightly adjusted so that the protocol and the
listener are passed instead of the listener and the port.
Note that all protocols continue to provide this ->add() method instead
of routinely calling default_add_listener() from create_listeners(). This
makes sure that any non-standard protocol will still be able to intercept
the listener addition if needed.
This could be backported to 2.3 along with the few previous patches on
listners as a pure code cleanup.
At various places we need to set a port on an IPv4 or IPv6 address, and
it requires casts that are easy to get wrong. Let's add a new set_port()
helper to the address family to assist in this. It will be directly
accessible from the protocol and will make the operation seamless.
Right now this is only implemented for sock_inet as other families do
not need a port.
We don't need to specify the handler anymore since it's set in the
receiver. Let's remove this argument from the function and clean up
the remains of code that were still setting it.
Now we define a new sock_accept_iocb() for socket-based stream protocols
and use it as a wrapper for listener_accept() which now takes a listener
and not an FD anymore. This will allow the receiver's I/O cb to be
redefined during registration, and more specifically to get rid of the
hard-coded hacks in protocol_bind_all() made for syslog.
The previous ->accept() callback in the protocol was removed since it
doesn't have anything to do with accept() anymore but is more generic.
A few places where listener_accept() was compared against the FD's IO
callback for debugging purposes on the CLI were updated.
For now we're still using the protocol's default accept() function as
the I/O callback registered by the receiver into the poller. While
this is usable for most TCP connections where a listener is needed,
this is not suitable for UDP where a different handler is needed.
Let's make this configurable in the receiver just like the upper layer
is configurable for listeners. In order to ease stream protocols
handling, the protocols will now provide a default I/O callback
which will be preset into the receivers upon allocation so that
almost none of them has to deal with it.
This per-protocol function will be used to accept an incoming
connection and return it as a struct connection*. As such the protocol
stack's internal representation of a connection will not need to be
handled by the listener code.
No protocol defines it anymore. The last user used to be the monitor-net
stuff that got partially broken already when the tcp_drain() function
moved to conn_sock_drain() with commit e215bba95 ("MINOR: connection:
make conn_sock_drain() work for all socket families") in 1.9-dev2.
A part of this will surely move back later when non-socket connections
arrive with QUIC but better keep the API clean and implement what's
needed in time instead.
Now we introdce a new .rx_listening() function to report if a receiver is
actually a listening socket. The reason for this is to help detect shared
sockets that might have been broken by sibling processes.
Now we have ->suspend() and ->resume() for listeners at the protocol
level. This means that it now becomes possible for a protocol to redefine
its own way to suspend and resume. The default functions are provided for
TCP, UDP and unix, and they are pass-through to the receiver equivalent
as it used to be till now. Nothing was defined for sockpair since it does
not need to suspend/resume during reloads, hence it will succeed.
The inner part now goes into the protocol and is used to decide how to
unbind a given protocol's listener. The existing code which is able to
also unbind the receiver was provided as a default function that we
currently use everywhere. Some complex listeners like QUIC will use this
to decide how to unbind without impacting existing connections, possibly
by setting up other incoming paths for the traffic.
This is used as a generic way to unbind a receiver at the end of
do_unbind_listener(). This allows to considerably simplify that function
since we can now let the protocol perform the cleanup. The generic code
was moved to sock.c, along with the conditional rx_disable() call. Now
the code also supports that the ->disable() function of the protocol
which acts on the listener performs the close itself and adjusts the
RX_F_BUOND flag accordingly.
These methods will be used to enable/disable accepting new connections
so that listeners do not play with FD directly anymore. Since all the
currently supported protocols work on socket for now, these are identical
to the rx_enable/rx_disable functions. However they were not defined in
sock.c since it's likely that some will quickly start to differ. At the
moment they're not used.
We have to take care of fd_updt before calling fd_{want,stop}_recv()
because it's allocated fairly late in the boot process and some such
functions may be called very early (e.g. to stop a disabled frontend's
listeners).
These methods will be used to enable/disable rx at the receiver level so
that callers don't play with FDs directly anymore. All our protocols use
the generic ones from sock.c at the moment. For now they're not used.
This one undoes ->rx_suspend(), it tries to restore an operational socket.
It was only implemented for TCP since it's the only one we support right
now.
The ->pause method is inappropriate since it doesn't exactly "pause" a
listener but rather temporarily disables it so that it's not visible at
all to let another process take its place. The term "suspend" is more
suitable, since the "pause" is actually what we'll need to apply to the
FULL and LIMITED states which really need to make a pause in the accept
process. And it goes well with the use of the "resume" function that
will also need to be made per-protocol.
Let's rename the function and make it act on the receiver since it's
already what it essentially does, hence the prefix "_rx" to make it
more explicit.
The protocol struct was a bit reordered because it was becoming a real
mess between the parts related to the listeners and those for the
receivers.
Since the listeners were split into receiver+listener, this field ought
to have been renamed because it's confusing. It really links receivers
and not listeners, as most of the time it's used via rx.proto_list!
The nb_listeners field was updated accordingly.
We don't need to cheat with the sock_domain anymore, we now always have
the SOCK_DGRAM sock_type as a complementary selector. This patch restores
the sock_domain to AF_INET* in the udp* protocols and removes all traces
of the now unused AF_CUST_*.
This one will be needed to more accurately select a protocol. It may
differ from the socket type for QUIC, which uses dgram at the socket
layer and provides stream at the control layer. The upper level requests
a control layer only so we need this field.
At some places (log fd@XXX, bind fd@XXX) we support using an explicit
file descriptor number, that is placed into the sockaddr for later use.
The problem is that till now it was done with an AF_UNSPEC family, which
is also used for other situations like missing info or rings (for logs).
Let's create an "official" family AF_CUST_EXISTING_FD for this case so
that we are certain the FD can be found in the address when it is set.
This removes the following fields from struct protocol that are now
retrieved from the protocol family instead: .sock_family, .sock_addrlen,
.l3_addrlen, .addrcmp, .bind, .get_src, .get_dst.
This also removes the UDP-specific udp{,6}_get_{src,dst}() functions
which were referenced but not used yet. Their goal was only to remap
the original AF_INET* addresses to AF_CUST_UDP*.
Note that .sock_domain is still there as it's used as a selector for
the protocol struct to be used.
We need to specially handle protocol families which regroup common
functions used for a given address family. These functions include
bind(), addrcmp(), get_src() and get_dst() for now. Some fields are
also added about the address family, socket domain (protocol family
passed to the socket() syscall), and address length.
These protocol families are referenced from the protocols but not yet
used.
This will be the function that must be used to bind the receiver. It
solely depends on the address family but for now it's simpler to have
it per protocol.