Commit Graph

7303 Commits

Author SHA1 Message Date
Amaury Denoyelle
37d7e52cc6 MINOR: cfgparse: forbid mixing reverse and standard listeners
Reverse HTTP listeners are very specific and share only a very limited
subset of keywords with other listeners. As such, it is probable
meaningless to mix standard and reverse addresses on the same bind line.
This patch emits a fatal error during configuration parsing if this is
the case.
2023-10-20 14:44:37 +02:00
Christopher Faulet
60e7116be0 BUG/MEDIUM: peers: Fix synchro for huge number of tables
The number of updates sent at once was limited to not loop too long to emit
updates when the buffer size is huge or when the number of sync tables is
huge. The limit can be configured and is set to 200 by default. However,
this fix introduced a bug. It is impossible to syncrhonize two peers if the
number of tables is higher than this limit. Thus by default, it is not
possible to sync two peers if there are more than 200 tables to sync.

Technically speacking, a teaching process is finished if we loop on all tables
with no new update messages sent. Because we are limited at each call, the loop
is splitted on several calls. However the restart point for the next loop is
always the last table for which we emitted an update message. Thus with more
tables than the limit, the loop never reachs the end point.

Worse, in conjunction with the bug fixed by "BUG/MEDIUM: peers: Be sure to
always refresh recconnect timer in sync task", it is possible to trigger the
watchdog because the applets may be woken up in loop and leave requesting
more room while its buffer is empty.

To fix the issue, restart conditions for a teaching loop were changed. If
the teach process is interrupted, we now save the restart point, called
stop_local_table. It is the last evaluated table on the previous loop. This
restart point is reset when the teach process is finished.

In additionn, the updates_sent variable in peer_send_msgs() was renamed to
updates to avoid ambiguities. Indeed, the variable is incremented, whether
messages were sent or not.

This patch must be backported as far as 2.6.
2023-10-20 14:32:12 +02:00
Willy Tarreau
3dd963b35f BUG/MINOR: mux-h2: fix http-request and http-keep-alive timeouts again
Stefan Behte reported that since commit f279a2f14 ("BUG/MINOR: mux-h2:
refresh the idle_timer when the mux is empty"), the http-request and
http-keep-alive timeouts don't work anymore on H2. Before this patch,
and since 3e448b9b64 ("BUG/MEDIUM: mux-h2: make sure control frames do
not refresh the idle timeout"), they would only be refreshed after stream
frames were sent (HEADERS or DATA) but the patch above that adds more
refresh points broke these so they don't expire anymore as long as
there's some activity.

We cannot just revert the fix since it also addressed an isse by which
sometimes the timeout would trigger too early and provoque truncated
responses. The right approach here is in fact to only use refresh the
idle timer when the mux buffer was flushed from any such stream frames.

In order to achieve this, we're now setting a flag on the connection
whenever we write a stream frame, and we consider that flag when deciding
to refresh the buffer after it's emptied. This way we'll only clear that
flag once the buffer is empty and there were stream data in it, not if
there were no such stream data. In theory it remains possible to leave
the flag on if some control data is appended after the buffer and it's
never cleared, but in practice it's not a problem as a buffer will always
get sent in large blocks when the window opens. Even a large buffer should
be emptied once in a while as control frames will not fill it as much as
data frames could.

Given the patch above was backported as far as 2.6, this patch should
also be backported as far as 2.6.
2023-10-18 17:17:58 +02:00
Willy Tarreau
91ed52976c MINOR: dgram: allow to set rcv/sndbuf for dgram sockets as well
tune.rcvbuf.client and tune.rcvbuf.server are not suitable for shared
dgram sockets because they're per connection so their units are not the
same. However, QUIC's listener and log servers are not connected and
take per-thread or per-process traffic where a socket log buffer might
be too small, causing undesirable packet losses and retransmits in the
case of QUIC. This essentially manifests in listener mode with new
connections taking a lot of time to set up under heavy traffic due to
the small queues causing delays. Let's add a few new settings allowing
to set these shared socket sizes on the frontend and backend side (which
reminds that these are per-front/back and not per client/server hence
not per connection).
2023-10-18 17:01:19 +02:00
Christopher Faulet
203211f4cb REORG: stconn/muxes: Rename init step in fast-forwarding
Instead of speaking of an initialisation stage for each data
fast-forwarding, we now use the negociate term. Thus init_ff/init_fastfwd
functions were renamed nego_ff/nego_fastfwd.
2023-10-18 12:46:55 +02:00
Christopher Faulet
023564b685 MINOR: global: Add an option to disable the zero-copy forwarding
The zero-copy forwarding or the mux-to-mux forwarding is a way to
fast-forward data without using the channels buffers. Data are transferred
from a mux to the other one. The kernel splicing is an optimization of the
zero-copy forwarding. But it can also use normal buffers (but not channels
ones). This way, it could be possible to fast-forward data with muxes not
supporting the kernel splicing (H2 and H3 muxes) but also with applets.

However, this mode can introduce regressions or bugs in future (just like
the kernel splicing). Thus, It could be usefull to disable this optim. To do
so, in configuration, the global tune settting
'tune.disable-zero-copy-forwarding' may be set in a global section or the
'-dZ' command line parameter may be used to start HAProxy. Of course, this
also disables the kernel splicing.
2023-10-17 18:51:13 +02:00
Christopher Faulet
322d660d08 MINOR: tree-wide: Only rely on co_data() to check channel emptyness
Because channel_is_empty() function does now only check the channel's
buffer, we can remove it and rely on co_data() instead. Of course, all tests
must be inverted.

channel_is_empty() is thus removed.
2023-10-17 18:51:13 +02:00
Christopher Faulet
20c463955d MEDIUM: channel: don't look at iobuf to report an empty channel
It is important to split channels and I/O buffers. When data are pushed in
an I/O buffer, we consider them as forwarded. The channel never sees
them. Fast-forwarded data are now handled in the SE only.
2023-10-17 18:51:13 +02:00
Christopher Faulet
2d80eb5b7a MEDIUM: mux-h1: Add fast-forwarding support
The H1 multiplexer now implements callbacks function to produce and consume
fast-forwarded data.
2023-10-17 18:51:13 +02:00
Christopher Faulet
91f1c5519a MEDIUM: raw-sock: Specifiy amount of data to send via snd_pipe callback
When data were sent using the kernel splicing, we tried to send all data
with no restriction. Most of time it is valid. However, because the payload
representation may differ between the producer and the consumer, it is
important to be able to specify how must data to send via the splicing.

Of course, for performance reason, it is important to maximize amount of
data send via splicing at each call. However, on edge-cases, this now can be
limited.
2023-10-17 18:51:13 +02:00
Christopher Faulet
7ffb7624fe MINOR: connection: Remove mux callbacks about splicing
The kernel splicing support was totally remove waiting for the mux-to-mux
fast-forward implementation. So corresponding mux callbacks can be removed
now.
2023-10-17 18:51:13 +02:00
Christopher Faulet
8b89fe3d8f MINOR: stconn: Temporarily remove kernel splicing support
mux-to-mux fast-forwarding will be added. To avoid mix with the splicing and
simplify the commits, the kernel splicing support is removed from the
stconn. CF_KERN_SPLICING flag is removed and the support is no longer tested
in process_stream().

In the stconn part, rcv_pipe() callback function is no longer called.

Reg-tests scripts testing the kernel splicing are temporarly marked as
broken.
2023-10-17 18:51:13 +02:00
Christopher Faulet
242c6f0ded MINOR: connection: Add new mux callbacks to perform data fast-forwarding
To perform the mux-to-mux data fast-forwarding, 4 new callbacks were added
into the mux_ops structure. 2 callbacks will be used from the stconn for
fast-forward data. The 2 other callbacks will be used by the endpoint to
request an iobuf to the opposite endpoint.

 * fastfwd() callback function is used by a producer to forward data

 * resume_fastfwd() callback function is used by a consumer if some data are
   blocked in the iobuf, to resume the data forwarding.

 * init_fastfwd() must be used by an endpoint (the producer one), inside the
   fastfwd() callback to request an iobuf to the opposite side (the consumer
   one).

 * done_fastfwd() must be used by an endpoint (the producer one) at the end
   of fastfwd() to notify the opposite endpoint (the consumer one) if data
   were forwarded or not.

This API is still under development, so it may evolved. Especially when the
fast-forward will be extended to applets.

2 helper functions were also added into the SE api to wrap init_fastfwd()
and done_fastfwd() callback function of the underlying endpoint.

For now, this API is unsed and not implemented at all in muxes.
2023-10-17 18:51:13 +02:00
Christopher Faulet
1d68bebb70 MINOR: stconn: Extend iobuf to handle a buffer in addition to a pipe
It is unused for now, but the iobuf structure now owns a pointer to a
buffer. This buffer will be used to perform mux-to-mux fast-forwarding when
splicing is not supported or unusable. This pointer should be filled by an
endpoint to let the opposite one forward data.

Extra fields, in addition to the buffer, are mandatory because the buffer
may already contains some data. the ".offset" field may be used may be used
as the position to start to copy data. Finally, the amount of data copied in
this buffer must be saved in ".data" field.

Some flags are also added to prepare next changes. And helper stconn
fnuctions are updated to also count data in the buffer. For a first
implementation, it is not planned to handle data in the buffer and in the
pipe in same time. But it will be possible to do so.
2023-10-17 18:51:13 +02:00
Christopher Faulet
e52519ac83 MINOR: stconn: Start to introduce mux-to-mux fast-forwarding notion
Instead of talking about kernel splicing at stconn/sedesc level, we now try
to talk about mux-to-mux fast-forwarding. To do so, 2 functions were added
to know if there are fast-forwarded data and to retrieve this amount of
data. Of course, for now, there is only data in a pipe.

In addition, some flags were renamed to reflect this notion. Note the
channel's documentation was not updated yet.
2023-10-17 18:51:13 +02:00
Christopher Faulet
8bee0dcd7d MEDIUM: stconn/channel: Move pipes used for the splicing in the SE descriptors
The pipes used to put data when the kernel splicing is in used are moved in
the SE descriptors. For now, it is just a simple remplacement but there is a
major difference with the pipes in the channel. The data are pushed in the
consumer's pipe while it was pushed in the producer's pipe. So it means the
request data are now pushed in the pipe of the backend SE descriptor and
response data are pushed in the pipe of the frontend SE descriptor.

The idea is to hide the pipe from the channel/SC side and to be able to
handle fast-forwading in pipe but also in buffer. To do so, the pipe is
inside a new entity, called iobuf. This entity will be extended.
2023-10-17 18:51:13 +02:00
Willy Tarreau
68d02e5fa9 BUG/MINOR: mux-h2: make up other blocked streams upon removal from list
An interesting issue was met when testing the mux-to-mux forwarding code.

In order to preserve fairness, in h2_snd_buf() if other streams are waiting
in send_list or fctl_list, the stream that is attempting to send also goes
to its list, and will be woken up by h2_process_mux() or h2_send() when
some space is released. But on rare occasions, there are only a few (or
even a single) streams waiting in this list, and these streams are just
quickly removed because of a timeout or a quick h2_detach() that calls
h2s_destroy(). In this case there's no even to wake up the other waiting
stream in its list, and this will possibly resume processing after some
client WINDOW_UPDATE frames or even new streams, so usually it doesn't
last too long and it not much noticeable, reason why it was left that
long. In addition, measures have shown that in heavy network-bound
benchmark, this exact situation happens on less than 1% of the streams
(reached 4% with mux-mux).

The fix here consists in replacing these LIST_DEL_INIT() calls on
h2s->list with a function call that checks if other streams were queued
to the send_list recently, and if so, which also tries to resume them
by calling h2_resume_each_sending_h2s(). The detection of late additions
is made via a new flag on the connection, H2_CF_WAIT_INLIST, which is set
when a stream is queued due to other streams being present, and which is
cleared when this is function is called.

It is particularly difficult to reproduce this case which is particularly
timing-dependent, but in a constrained environment, a test involving 32
conns of 20 streams each, all downloading a 10 MB object previously
showed a limitation of 17 Gbps with lots of idle CPU time, and now
filled the cable at 25 Gbps.

This should be backported to all versions where it applies.
2023-10-17 16:43:44 +02:00
Aurelien DARRAGON
94d0f77deb MINOR: server: introduce "log-bufsize" kw
"log-bufsize" may now be used for a log server (in a log backend) to
configure the bufsize of implicit ring associated to the server (which
defaults to BUFSIZE).
2023-10-13 10:05:07 +02:00
Aurelien DARRAGON
b30bd7adba MEDIUM: log/balance: support for the "hash" lb algorithm
hash lb algorithm can be configured with the "log-balance hash <cnv_list>"
directive. With this algorithm, the user specifies a converter list with
<cnv_list>.

The produced log message will be passed as-is to the provided converter
list, and the resulting hash will be used to select the log server that
will receive the log message.
2023-10-13 10:05:06 +02:00
Aurelien DARRAGON
7251344748 MINOR: sample: add sample_process_cnv() function
split sample_process() in 2 parts in order to be able to only process
the converter part of a sample expression from an existing input sample
struct passed as parameter.
2023-10-13 10:05:06 +02:00
Aurelien DARRAGON
a7563158f7 MINOR: lbprm: support for the "none" hash-type function
Allow the use of the "none" hash-type function so that the key resulting
from the sample expression is directly used as the hash.

This can be useful to do the hashing manually using available hashing
converters, or even custom ones, and then inform haproxy that it can
directly rely on the sample expression result which is explictly handled
as an integer in this case.
2023-10-13 10:05:06 +02:00
Aurelien DARRAGON
9a74a6cb17 MAJOR: log: introduce log backends
Using "mode log" in a backend section turns the proxy in a log backend
which can be used to log-balance logs between multiple log targets
(udp or tcp servers)

log backends can be used as regular log targets using the log directive
with "backend@be_name" prefix, like so:

  | log backend@mybackend local0

A log backend will distribute log messages to servers according to the
log load-balancing algorithm that can be set using the "log-balance"
option from the log backend section. For now, only the roundrobin
algorithm is supported and set by default.
2023-10-13 10:05:06 +02:00
Aurelien DARRAGON
e58a9b4baf MINOR: sink: add sink_new_from_srv() function
This helper function can be used to create a new sink from an existing
server struct (and thus existing proxy as well), in order to spare some
resources when possible.
2023-10-13 10:05:06 +02:00
Aurelien DARRAGON
5c0d1c1a74 MEDIUM: sink: inherit from caller fmt in ring_write() when rings didn't set one
implicit rings were automatically forced to the parent logger format, but
this was done upon ring creation.

This is quite restrictive because we might want to choose the desired
format right before generating the log header (ie: when producing the
log message), depending on the logger (log directive) that is
responsible for the log message, and with current logic this is not
possible. (To this day, we still have dedicated implicit ring per log
directive, but this might change)

In ring_write(), we check if the sink->fmt is specified:
 - defined: we use it since it is the most precise format
   (ie: for named rings)
 - undefined: then we fallback to the format from the logger

With this change, implicit rings' format is now set to UNSPEC upon
creation. This is safe because the log header building function
automatically enforces the "raw" format when UNSPEC is set. And since
logger->format also defaults to "raw", no change of default behavior
should be expected.
2023-10-13 10:05:06 +02:00
Aurelien DARRAGON
6dad0549a5 MEDIUM: log/sink: simplify log header handling
Introduce log_header struct to easily pass log header data between
functions and use that to simplify the logic around log header
handling.

While at it, some outdated comments were updated as well.

No change in behavior should be expected.
2023-10-13 10:05:06 +02:00
Aurelien DARRAGON
a9b185f34e MEDIUM: log: introduce log target
log targets were immediately embedded in logger struct (previously
named logsrv) and could not be used outside of this context.

In this patch, we're introducing log_target type with the associated
helper functions so that it becomes possible to declare and use log
targets outside of loggers scope.
2023-10-13 10:05:06 +02:00
Aurelien DARRAGON
18da35c123 MEDIUM: tree-wide: logsrv struct becomes logger
When 'log' directive was implemented, the internal representation was
named 'struct logsrv', because the 'log' directive would directly point
to the log target, which used to be a (UDP) log server exclusively at
that time, hence the name.

But things have become more complex, since today 'log' directive can point
to ring targets (implicit, or named) for example.

Indeed, a 'log' directive does no longer reference the "final" server to
which the log will be sent, but instead it describes which log API and
parameters to use for transporting the log messages to the proper log
destination.

So now the term 'logsrv' is rather confusing and prevents us from
introducing a new level of abstraction because they would be mixed
with logsrv.

So in order to better designate this 'log' directive, and make it more
generic, we chose the word 'logger' which now replaces logsrv everywhere
it was used in the code (including related comments).

This is internal rewording, so no functional change should be expected
on user-side.
2023-10-13 10:05:06 +02:00
Amaury Denoyelle
7d76ffb2a4 BUG/MINOR: quic: fix qc.cids access on quic-conn fail alloc
CIDs tree is now allocated dynamically since the following commit :
  276697438d
  MINOR: quic: Use a pool for the connection ID tree.

This can caused a crash if qc_new_conn() is interrupted due to an
intermediary failed allocation. When freeing all connection members,
free_quic_conn_cids() is used. However, this function does not support a
NULL cids.

To fix this, simply check that cids is NULL during free_quic_conn_cids()
prologue.

This bug was reproduced using -dMfail.

No need to backport.
2023-10-13 08:52:16 +02:00
Willy Tarreau
5798b5bb14 BUG/MAJOR: connection: make sure to always remove a connection from the tree
Since commit 5afcb686b ("MAJOR: connection: purge idle conn by last usage")
in 2.9-dev4, the test on conn->toremove_list added to conn_get_idle_flag()
in 2.8 by commit 3a7b539b1 ("BUG/MEDIUM: connection: Preserve flags when a
conn is removed from an idle list") becomes misleading. Indeed, now both
toremove_list and idle_list are shared by a union since the presence in
these lists is mutually exclusive. However, in conn_get_idle_flag() we
check for the presence in the toremove_list to decide whether or not to
delete the connection from the tree. This test now fails because instead
it sees the presence in the idle or safe list via the union, and concludes
the element must not be removed. Thus the element remains in the tree and
can be found later after the connection is released, causing crashes that
Tristan reported in issue #2292.

The following config is sufficient to reproduce it with 2 threads:

   defaults
        mode http
        timeout client 5s
        timeout server 5s
        timeout connect 1s

   listen front
        bind :8001
        server next 127.0.0.1:8002

   frontend next
        bind :8002
        timeout http-keep-alive 1
        http-request redirect location /

Sending traffic with a few concurrent connections and some short timeouts
suffices to instantly crash it after ~10k reqs:

   $ h2load -t 4 -c 16 -n 10000 -m 1 -w 1 http://0:8001/

With Amaury we analyzed the conditions in which the function is called
in order to figure a better condition for the test and concluded that
->toremove_list is never filled there so we can safely remove that part
from the test and just move the flag retrieval back to what it was prior
to the 2.8 patch above. Note that the patch is not reverted though, as
the parts that would drop the unexpected flags removal are unchanged.

This patch must NOT be backported. The code in 2.8 works correctly, it's
only the change in 2.9 that makes it misbehave.
2023-10-12 14:20:03 +02:00
Amaury Denoyelle
f59f8326f9 REORG: quic: cleanup traces definition
Move all QUIC trace definitions from quic_conn.h to quic_trace-t.h. Also
remove multiple definition trace_quic macro definition into
quic_trace.h. This forces all QUIC source files who relies on trace to
include it while reducing the size of quic_conn.h.
2023-10-11 14:15:31 +02:00
Frédéric Lécaille
bd83b6effb BUG/MINOR: quic: Avoid crashing with unsupported cryptographic algos
This bug was detected when compiling haproxy against aws-lc TLS stack
during QUIC interop runner tests. Some algorithms could be negotiated by haproxy
through the TLS stack but not fully supported by haproxy QUIC implentation.
This leaded tls_aead() to return NULL (same thing for tls_md(), tls_hp()).
As these functions returned values were never checked, they could triggered
segfaults.

To fix this, one closes the connection as soon as possible with a
handshake_failure(40) TLS alert. Note that as the TLS stack successfully
negotiates an algorithm, it provides haproxy with CRYPTO data before entering
->set_encryption_secrets() callback. This is why this callback
(ha_set_encryption_secrets() on haproxy side) is modified to release all
the CRYPTO frames before triggering a CONNECTION_CLOSE with a TLS alert. This is
done calling qc_release_pktns_frms() for all the packet number spaces.
Modify some quic_tls_keys_hexdump to avoid crashes when the ->aead or ->hp EVP_CIPHER
are NULL.
Modify qc_release_pktns_frms() to do nothing if the packet number space passed
as parameter is not intialized.

This bug does not impact the QUIC TLS compatibily mode (USE_QUIC_OPENSSL_COMPAT).

Thank you to @ilia-shipitsin for having reported this issue in GH #2309.

Must be backported as far as 2.6.
2023-10-11 11:52:22 +02:00
William Lallemand
deed2b6077 BUILD: ssl: enable keylog for WolfSSL
Enable the keylog feature when linking against an WolfSSL library which
has the 'HAVE_SECRET_CALLBACK' define.

Only supports <= TLSv1.2 secret dump.
2023-10-09 21:34:25 +02:00
William Lallemand
9a4c53d96c CLEANUP: ssl: remove compat functions for openssl < 1.0.0
The openssl-compat.h file has some function which were implemented in
order to provide compatibility with openssl < 1.0.0. Most of them where
to support the 0.9.8 version, but we don't support this version anymore.

This patch removes the deprecated code from openssl-compat.h
2023-10-09 17:27:53 +02:00
William Lallemand
1918bcbc12 BUILD: ssl: enable keylog for awslc
AWSLC suports SSL_CTX_set_keylog_callback(), this patch enables the
build with the keylog feature for this library.
2023-10-09 16:17:30 +02:00
William Lallemand
4428ac4f70 BUILD: ssl: add 'secure_memcmp' converter for WolfSSL and awslc
CRYPTO_memcmp is supported by both awslc and wolfssl, lets add the
suport for the 'secure_memcmp' converter into the build.
2023-10-09 15:44:50 +02:00
William Lallemand
bf426eecd7 BUILD: ssl: add 'ssl_c_r_dn' fetch for WolfSSL
WolfSSL supports SSL_get0_verified_chain() so we can activate this
feature.
2023-10-09 15:09:47 +02:00
William Lallemand
d75bc06bdc BUILD: ssl: enable 'ciphersuites' for WolfSSL
WolfSSL supports setting the 'ciphersuites', lets enable the keyword for
it.
2023-10-09 14:56:43 +02:00
Willy Tarreau
1e3422e6b0 BUG/MEDIUM: actions: always apply a longest match on prefix lookup
Many actions take arguments after a parenthesis. When this happens, they
have to be tagged in the parser with KWF_MATCH_PREFIX so that a sub-word
is sufficient (since by default the whole block including the parenthesis
is taken).

The problem with this is that the parser stops on the first match. This
was OK years ago when there were very few actions, but over time new ones
were added and many actions are the prefix of another one (e.g. "set-var"
is the prefix of "set-var-fmt"). And what happens in this case is that the
first word is picked. Most often that doesn't cause trouble because such
similar-looking actions involve the same custom parser so actually the
wrong selection of the first entry results in the correct parser to be
used anyway and the error to be silently hidden.

But it's getting worse when accidentally declaring prefixes in multiple
files, because in this case it will solely depend on the object file link
order: if the longest name appears first, it will be properly detected,
but if it appears last, its other prefix will be detected and might very
well not be related at all and use a distinct parser. And this is random
enough to make some actions succeed or fail depending on the build options
that affect the linkage order. Worse: what if a keyword is the prefix of
another one, with a different parser but a compatible syntax ? It could
seem to work by accident but not do the expected operations.

The correct solution is to always look for the longest matching name.
This way the correct keyword will always be matched and used and there
will be no risk to randomly pick the wrong anymore.

This fix must be backported to the relevant stable releases.
2023-10-06 17:06:44 +02:00
Christopher Faulet
a633338b55 BUG/MEDIUM: stconn: Fix comparison sign in sc_need_room()
sc_need_room() function may be called with a negative value. In this case,
the intent is to be notified if any space was made in the channel buffer. In
the function, we get the min between the requested room and the maximum
possible room in the buffer, considering it may be an HTX buffer.

However this max value is unsigned and leads to an unsigned comparison,
casting the negative value to an unsigned value. Of course, in this case,
this always leads to the wrong result. This bug seems to have no effect but
it is hard to be sure.

To fix the issue, we take care to respect the requested room sign by casting
the max value to a signed integer.

This patch must be backported to 2.8.
2023-10-06 15:34:31 +02:00
Aurelien DARRAGON
205d480d9f MINOR: sink: refine forward_px usage
now forward_px only serves as a hint to know if a proxy was created
specifically for the sink, in which case the sink is responsible for it.

Everywhere forward_px was used in appctx context: get the parent proxy from
the sft->srv instead.

This permits to finally get rid of the double link dependency between sink
and proxy.
2023-10-06 15:34:31 +02:00
Willy Tarreau
90fa2eaa15 MINOR: haproxy: permit to register features during boot
The regtests are using the "feature()" predicate but this one can only
rely on build-time options. It would be nice if some runtime-specific
options could be detected at boot time so that regtests could more
flexibly adapt to what is supported (capabilities, splicing, etc).

Similarly, certain features that are currently enabled with USE_XXX
could also be automatically detected at build time using ifdefs and
would simplify the configuration, but then we'd lose the feature
report in the feature list which is convenient for regtests.

This patch makes sure that haproxy -vv shows the variable's contents
and not the macro's contents, and adds a new hap_register_feature()
to allow the code to register a new keyword.
2023-10-06 11:40:02 +02:00
Remi Tricot-Le Breton
a5e96425a2 MEDIUM: cache: Add "Origin" header to secondary cache key
This patch add a hash of the Origin header to the cache's secondary key.
This enables to manage store responses that have a "Vary: Origin" header
in the cache when vary is enabled.
This cannot be considered as a means to manage CORS requests though, it
only processes the Origin header and hashes the presented value without
any form of URI normalization.

This need was expressed by Philipp Hossner in GitHub issue #251.

Co-Authored-by: Philipp Hossner <philipp.hossner@posteo.de>
2023-10-05 10:53:54 +02:00
William Lallemand
45174e4fdc BUILD: quic: allow USE_QUIC to work with AWSLC
This patch fixes the build with AWSLC and USE_QUIC=1, this is only meant
to be able to build for now and it's not feature complete.

The set_encryption_secrets callback has been split in set_read_secret
and set_write_secret.

Missing features:

- 0RTT was disabled.
- TLS1_3_CK_CHACHA20_POLY1305_SHA256, TLS1_3_CK_AES_128_CCM_SHA256 were disabled
- clienthello callback is missing, certificate selection could be
  limited (RSA + ECDSA at the same time)
2023-10-04 16:55:19 +02:00
Christopher Faulet
f32e28eddc MINOR: mux-h1: Add flags if outgoing msg contains a header about its payload
If a "Content-length" or "Transfer-Encoding; chunked" headers is found or
inserted in an outgoing message, a specific flag is now set on the H1
stream. H1S_F_HAVE_CLEN is set for "Content-length" header and
H1S_F_HAVE_CHNK for "Transfer-Encoding: chunked".

This will be useful to properly format outgoing messages, even if one of
these headers was removed by hand (with no update of the message meta-data).
2023-10-04 15:34:18 +02:00
Amaury Denoyelle
bd001ff346 MINOR: backend: refactor specific source address allocation
Refactor alloc_bind_address() function which is used to allocate a
sockaddr if a connection to a target server relies on a specific source
address setting.

The main objective of this change is to be able to use this function
outside of backend module, namely for preconnections using a reverse
server. As such, this function is now exported globally.

For reverse connect, there is no stream instance. As such, the function
parts which relied on it were reduced to the minimal. Now, stream is
only used if a non-static address is configured which is useful for
usesrc client|clientip|hdr_ip. These options have no sense for reverse
connect so it should be safe to use the same function.
2023-10-03 17:49:12 +02:00
Amaury Denoyelle
2ac5d9a657 MINOR: quic: handle perm error on bind during runtime
Improve EACCES permission errors encounterd when using QUIC connection
socket at runtime :

* First occurence of the error on the process will generate a log
  warning. This should prevent users from using a privileged port
  without mandatory access rights.

* Socket mode will automatically fallback to listener socket for the
  receiver instance. This requires to duplicate the settings from the
  bind_conf to the receiver instance to support configurations with
  multiple addresses on the same bind line.
2023-10-03 16:52:02 +02:00
Amaury Denoyelle
3ef6df7387 MINOR: quic: define quic-socket bind setting
Define a new bind option quic-socket :
  quic-socket [ connection | listener ]

This new setting works in conjunction with the existing configuration
global tune.quic.socket-owner and reuse the same semantics.

The purpose of this setting is to allow to disable connection socket
usage on listener instances individually. This will notably be useful
when needing to deactivating it when encountered a fatal permission
error on bind() at runtime.
2023-10-03 16:49:26 +02:00
Willy Tarreau
7c69c9b51f BUG/MAJOR: plock: fix major bug in pl_take_w() introduced with EBO
When EBO was brought to pl_take_w() by plock commit 60d750d ("plock: use
EBO when waiting for readers to leave in take_w() and stow()"), a mistake
was made: the mask against which the current value of the lock is tested
excludes the first reader like in stow(), but it must not because it was
just obtained via an ldadd() which means that it doesn't count itself.

The problem this causes is that if there is exactly one reader when a
writer grabs the lock, the writer will not wait for it to leave before
starting its operations.

The solution consists in checking for any reader in the IF. However the
mask passed to pl_wait_unlock_*() must still exclude the lowest bit as
it's verified after a subsequent load.

Kudos to Remi Tricot-Le Breton for reporting and bisecting this issue
with a reproducer.

No backport is needed since this was brought in 2.9-dev3 with commit
8178a5211 ("MAJOR: threads/plock: update the embedded library again").
The code is now on par with plock commit ada70fe.
2023-10-03 08:28:12 +02:00
Amaury Denoyelle
337c71423f MINOR: connection: define mux flag for reverse support
Add a new MUX flag MX_FL_REVERSABLE. This value is used to indicate that
MUX instance supports connection reversal. For the moment, only HTTP/2
multiplexer is flagged with it.

This allows to dynamically check if reversal can be completed during MUX
installation. This will allow to relax requirement on config writing for
'tcp-request session attach-srv' which currently cannot be used mixed
with non-http/2 listener instances, even if used conditionnally with an
ACL.
2023-09-29 18:09:08 +02:00
Amaury Denoyelle
ac1164de7c MINOR: connection: define error for reverse connect
Define a new error code for connection CO_ER_REVERSE. This will be used
to report an issue which happens on a connection targetted for reversal
before reverse process is completed.
2023-09-29 18:08:26 +02:00
Emeric Brun
3c250cb847 Revert "BUG/MEDIUM: quic: missing check of dcid for init pkt including a token"
This reverts commit 072e774939.

Doing h2load with h3 tests we notice this behavior:

Client ---- INIT no token SCID = a , DCID = A ---> Server (1)
Client <--- RETRY+TOKEN DCID = a, SCID = B    ---- Server (2)
Client ---- INIT+TOKEN SCID = a , DCID = B    ---> Server (3)
Client <--- INIT DCID = a, SCID = C           ---- Server (4)
Client ---- INIT+TOKEN SCID = a, DCID = C     ---> Server (5)

With (5) dropped by haproxy due to token validation.

Indeed the previous patch adds SCID of retry packet sent to the aad
of the token ciphering aad. It was useful to validate the next INIT
packets including the token are sent by the client using the new
provided SCID for DCID as mantionned into the RFC 9000.
But this stateless information is lost on received INIT packets
following the first outgoing INIT packet from the server because
the client is also supposed to re-use a second time the lastest
received SCID for its new DCID. This will break the token validation
on those last packets and they will be dropped by haproxy.

It was discussed there:
https://mailarchive.ietf.org/arch/msg/quic/7kXVvzhNCpgPk6FwtyPuIC6tRk0/

To resume: this is not the role of the server to verify the re-use of
retry's SCID for DCID in further client's INIT packets.

The previous patch must be reverted in all versions where it was
backported (supposed until 2.6)
2023-09-29 09:27:22 +02:00
Willy Tarreau
d956db6638 CLEANUP: stream: remove the now unused stream_dump() function
It was superseded by strm_dump_to_buffer() which provides much more
complete information and supports anonymizing.
2023-09-29 09:20:27 +02:00
Willy Tarreau
c185bc4656 MEDIUM: stream: now provide full stream dumps in case of loops
When a stream is caught looping, we produce some output to help figure
its internal state explaining why it's looping. The problem is that this
debug output is quite old and the info it provides are quite insufficient
to debug a modern process, and since such bugs happen only once or twice
a year the situation doesn't improve.

On the other hand the output of "show sess all" is extremely detailed
and kept up to date with code evolutions since it's a heavily used
debugging tool.

This commit replaces the call to the totally outdated stream_dump() with
a call to strm_dump_to_buffer(), and removes the filters dump since they
are already emitted there, and it now produces much more exploitable
output:

  [ALERT]    (5936) : A bogus STREAM [0x7fa8dc02f660] is spinning at 5653514 calls per second and refuses to die, aborting now! Please report this error to developers:
  0x7fa8dc02f660: [28/Sep/2023:09:53:08.811818] id=2 proto=tcpv4 source=127.0.0.1:58306
     flags=0xc4a, conn_retries=0, conn_exp=<NEVER> conn_et=0x000 srv_conn=0x133f220, pend_pos=(nil) waiting=0 epoch=0x1
     frontend=public (id=2 mode=http), listener=? (id=1) addr=127.0.0.1:4080
     backend=public (id=2 mode=http) addr=127.0.0.1:61932
     server=s1 (id=1) addr=127.0.0.1:7443
     task=0x7fa8dc02fa40 (state=0x01 nice=0 calls=5749559 rate=5653514 exp=3s tid=1(1/1) age=1s)
     txn=0x7fa8dc02fbf0 flags=0x3000 meth=1 status=-1 req.st=MSG_DONE rsp.st=MSG_RPBEFORE req.f=0x4c rsp.f=0x00
     scf=0x7fa8dc02f5f0 flags=0x00000482 state=EST endp=CONN,0x7fa8dc02b4b0,0x05004001 sub=1 rex=58s wex=<NEVER>
         h1s=0x7fa8dc02b4b0 h1s.flg=0x100010 .sd.flg=0x5004001 .req.state=MSG_DONE .res.state=MSG_RPBEFORE
          .meth=GET status=0 .sd.flg=0x05004001 .sc.flg=0x00000482 .sc.app=0x7fa8dc02f660
          .subs=0x7fa8dc02f608(ev=1 tl=0x7fa8dc02fae0 tl.calls=0 tl.ctx=0x7fa8dc02f5f0 tl.fct=sc_conn_io_cb)
          h1c=0x7fa8dc0272d0 h1c.flg=0x0 .sub=0 .ibuf=0@(nil)+0/0 .obuf=0@(nil)+0/0 .task=0x7fa8dc0273f0 .exp=<NEVER>
         co0=0x7fa8dc027040 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM target=LISTENER:0x12840c0
         flags=0x00000300 fd=32 fd.state=20 updt=0 fd.tmask=0x2
     scb=0x7fa8dc02fb30 flags=0x00001411 state=EST endp=CONN,0x7fa8dc0300c0,0x05000001 sub=1 rex=58s wex=<NEVER>
         h1s=0x7fa8dc0300c0 h1s.flg=0x4010 .sd.flg=0x5000001 .req.state=MSG_DONE .res.state=MSG_RPBEFORE
          .meth=GET status=0 .sd.flg=0x05000001 .sc.flg=0x00001411 .sc.app=0x7fa8dc02f660
          .subs=0x7fa8dc02fb48(ev=1 tl=0x7fa8dc02feb0 tl.calls=2 tl.ctx=0x7fa8dc02fb30 tl.fct=sc_conn_io_cb)
          h1c=0x7fa8dc02ff00 h1c.flg=0x80000000 .sub=1 .ibuf=0@(nil)+0/0 .obuf=0@(nil)+0/0 .task=0x7fa8dc030020 .exp=<NEVER>
         co1=0x7fa8dc02fcd0 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM target=SERVER:0x133f220
         flags=0x10000300 fd=33 fd.state=10421 updt=0 fd.tmask=0x2
     req=0x7fa8dc02f680 (f=0x1840000 an=0x8000 pipe=0 tofwd=0 total=79)
         an_exp=<NEVER> buf=0x7fa8dc02f688 data=(nil) o=0 p=0 i=0 size=0
         htx=0xc18f60 flags=0x0 size=0 data=0 used=0 wrap=NO extra=0
     res=0x7fa8dc02f6d0 (f=0x80000000 an=0x1400000 pipe=0 tofwd=0 total=0)
         an_exp=<NEVER> buf=0x7fa8dc02f6d8 data=(nil) o=0 p=0 i=0 size=0
         htx=0xc18f60 flags=0x0 size=0 data=0 used=0 wrap=NO extra=0
    call trace(10):
    |       0x59f2b7 [0f 0b 0f 1f 80 00 00 00]: stream_dump_and_crash+0x1f7/0x2bf
    |       0x5a0d71 [e9 af e6 ff ff ba 40 00]: process_stream+0x19f1/0x3a56
    |       0x68d7bb [49 89 c7 4d 85 ff 74 77]: run_tasks_from_lists+0x3ab/0x924
    |       0x68e0b4 [29 44 24 14 8b 4c 24 14]: process_runnable_tasks+0x374/0x6d6
    |       0x656f67 [83 3d f2 75 84 00 01 0f]: run_poll_loop+0x127/0x5a8
    |       0x6575d7 [48 8b 1d 42 50 5c 00 48]: main+0x1b22f7
    | 0x7fa8e0f35e45 [64 48 89 04 25 30 06 00]: libpthread:+0x7e45
    | 0x7fa8e0e5a4af [48 89 c7 b8 3c 00 00 00]: libc:clone+0x3f/0x5a

Note that the output is subject to the global anon key so that IPs and
object names can be anonymized if required. It could make sense to
backport this and the few related previous patches next time such an
issue is reported.
2023-09-29 09:20:27 +02:00
Willy Tarreau
5743eeea88 MINOR: stream: make stream_dump() always multi-line
There used to be two working modes for this function, a single-line one
and a multi-line one, the difference being made on the "eol" argument
which could contain either a space or an LF (and with the prefix being
adjusted accordingly). Let's get rid of the single-line mode as it's
what limits the output contents because it's difficult to produce
exploitable structured data this way. It was only used in the rare case
of spinning streams and applets and these are the ones lacking info. Now
a spinning stream produces:

[ALERT]    (3511) : A bogus STREAM [0x227e7b0] is spinning at 5581202 calls per second and refuses to die, aborting now! Please report this error to developers:
  strm=0x227e7b0,c4a src=127.0.0.1 fe=public be=public dst=s1
  txn=0x2041650,3000 txn.req=MSG_DONE,4c txn.rsp=MSG_RPBEFORE,0
  rqf=1840000 rqa=8000 rpf=80000000 rpa=1400000
  scf=0x24af280,EST,482 scb=0x24af430,EST,1411
  af=(nil),0 sab=(nil),0
  cof=0x7fdb28026630,300:H1(0x24a6f60)/RAW((nil))/tcpv4(33)
  cob=0x23199f0,10000300:H1(0x24af630)/RAW((nil))/tcpv4(32)
  filters={}
  call trace(11):
  (...)
2023-09-29 09:20:27 +02:00
Willy Tarreau
48b2233d36 CLEANUP: freq_ctr: make all freq_ctr readers take a const
Since 2.4-dev18 with commit b4476c6a8 ("CLEANUP: freq_ctr: make
arguments of freq_ctr_total() const"), most of the freq_ctr readers
should be fine with a const, except that they were not updated to
reflect this and they continue to force variable on some functions
that call them. Let's update this. This could even be backported if
needed.
2023-09-29 09:20:27 +02:00
Vladimir Vdovin
f8b81f6eb7 MINOR: support for http-request set-timeout client
Added set-timeout for frontend side of session, so it can be used to set
custom per-client timeouts if needed. Added cur_client_timeout to fetch
client timeout samples.
2023-09-28 08:49:22 +02:00
Amaury Denoyelle
b9bb3b932c MINOR: proto_reverse_connect: emit log for preconnect
Add reporting using send_log() for preconnect operation. This is minimal
to ensure we understand the current status of listener in active reverse
connect.

To limit logging quantity, only important transition are considered.
This requires to implement a minimal state machine as a new field in
receiver structure.

Here are the logs produced :
* Initiating : first time preconnect is enabled on a listener
* Error : last preconnect attempt interrupted on a connection error
* Reaching maxconn : all necessary connections were reversed and are
  operational on a listener
2023-09-22 17:21:53 +02:00
Amaury Denoyelle
1f43fb71be MINOR: proto_reverse_connect: refactor preconnect failure
When a connection is freed during preconnect before reversal, the error
must be notified to the listener to remove any connection reference and
rearm a new preconnect attempt. Currently, this can occur through 2 code
paths :
* conn_free() called directly by H2 mux
* error during conn_create_mux(). For this case, connection is flagged
  with CO_FL_ERROR and reverse_connect task is woken up. The process
  task handler is then responsible to call conn_free() for such
  connection.

Duplicated steps where done both in conn_free() and process task
handler. These are now removed. To facilitate code maintenance,
dedicated operation have been centralized in a new function
rev_notify_preconn_err() which is called by conn_free().
2023-09-22 16:43:36 +02:00
Emeric Brun
27b2fd2e06 MINOR: quic: handle external extra CIDs generator.
This patch adds the ability to externalize and customize the code
of the computation of extra CIDs after the first one was derived from
the ODCID.

This is to prepare interoperability with extra components such as
different QUIC proxies or routers for instance.

To process the patch defines two function callbacks:
- the first one to compute a hash 64bits from the first generated CID
  (itself continues to be derived from ODCID). Resulting hash is stored
  into the 'quic_conn' and 64bits is chosen large enought to be able to
  store an entire haproxy's CID.
- the second callback re-uses the previoulsy computed hash to derive
  an extra CID using the custom algorithm. If not set haproxy will
  continue to choose a randomized CID value.

Those two functions have also the 'cluster_secret' passed as an argument:
this way, it is usable for obfuscation or ciphering.
2023-09-22 10:32:14 +02:00
Aurelien DARRAGON
acb7d8a89c MINOR: pattern: fix pat_{parse,match}_ip() function comments
Function comments were outdated, probably because they have not been
updated during the previous refactors.

Fixing comments to better reflect the current behavior.

This may be backported up to 2.2, or even 2.0 by slightly adapting the
patch (in 2.0, such functions are documented in proto/pattern.h)
2023-09-21 09:50:55 +02:00
Willy Tarreau
cbbee15462 CLEANUP: ring: rename the ring lock "RING_LOCK" instead of "LOGSRV_LOCK"
The ring lock was initially mostly used for the logs and used to inherit
its name in lock stats. Now that it's exclusively used by rings, let's
rename it accordingly.
2023-09-20 21:38:33 +02:00
Willy Tarreau
cec8b42cb3 MEDIUM: logs: atomically check and update the log sample index
The log server lock is pretty visible in perf top when using log samples
because it's taken for each server in turn while trying to validate and
update the log server's index. Let's change this for a CAS, since we have
the index and the range at hand now. This allow us to remove the logsrv
lock.

The test on 4 servers now shows a 3.7 times improvement thanks to much
lower contention. Without log sampling a test producing 4.4M logs/s
delivers 4.4M logs/s at 21 CPUs used, everything spent in the kernel.
After enabling 4 samples (1:4, 2:4, 3:4 and 4:4), the throughput would
previously drop to 1.13M log/s with 37 CPUs used and 75% spent in
process_send_log(). Now with this change, 4.25M logs/s are emitted,
using 26 CPUs and 22% in process_send_log(). That's a 3.7x throughput
improvement for a 30% global CPU usage reduction, but in practice it
mostly shows that the performance drop caused by having samples is much
less noticeable (each of the 4 servers has its index updated for each
log).

Note that in order to even avoid incrementing an index for each log srv
that is consulted, it would be more convenient to have a single index
per frontend and apply the modulus on each log server in turn to see if
the range has to be updated. It would then only perform one write per
range switch. However the place where this is done doesn't have access
to a frontend, so some changes would need to be performed for this, and
it would require to update the current range independently in each
logsrv, which is not necessarily easier since we don't know yet if we
can commit it.
2023-09-20 21:38:33 +02:00
Willy Tarreau
e00470378b MINOR: logs: use a single index to store the current range and index
By using a single long long to store both the current range and the
next index, we'll make it possible to perform atomic operations instead
of locking. Let's only regroup them for now under a new "curr_rg_idx".
The upper word is the range, the lower is the index.
2023-09-20 21:38:33 +02:00
Willy Tarreau
3f1284560f MINOR: log: remove the unused curr_idx in struct smp_log_range
This index is useless because it only serves to know when the global
index reached the end, while the global one already knows it. Let's
just drop it and perform the test on the global range.

It was verified with the following config that the first server continues
to take 1/10 of the traffic, the 2nd one 2/10, the 3rd one 3/10 and the
4th one 4/10:

    log 127.0.0.1:10001 sample 1:10 local0
    log 127.0.0.1:10002 sample 2,5:10 local0
    log 127.0.0.1:10003 sample 3,7,9:10 local0
    log 127.0.0.1:10004 sample 4,6,8,10:10 local0
2023-09-20 21:38:33 +02:00
Willy Tarreau
4351364700 MINOR: logs: clarify the check of the log range
The test of the log range is not very clear, in part due to the
reuse of the "curr_idx" name that happens at two levels. The call
to in_smp_log_range() applies to the smp_info's index to which 1 is
added: it verifies that the next index is still within the current
range.

Let's just have a local variable "next_index" in process_send_log()
that gets assigned the next index (current+1) and compare it to the
current range's boundaries. This makes the test much clearer. We can
then simply remove in_smp_log_range() that's no longer needed.
2023-09-20 21:38:33 +02:00
Willy Tarreau
6cbb5a057b Revert "MAJOR: import: update mt_list to support exponential back-off"
This reverts commit c618ed5ff4.

The list iterator is broken. As found by Fred, running QUIC single-
threaded shows that only the first connection is accepted because the
accepter relies on the element being initialized once detached (which
is expected and matches what MT_LIST_DELETE_SAFE() used to do before).
However while doing this in the quic_sock code seems to work, doing it
inside the macro show total breakage and the unit test doesn't work
anymore (random crashes). Thus it looks like the fix is not trivial,
let's roll this back for the time it will take to fix the loop.
2023-09-15 17:13:43 +02:00
Willy Tarreau
e3b2704e26 BUG/MINOR: freq_ctr: fix possible negative rate with the scaled API
In 1.9 with commit 627505d36 ("MINOR: freq_ctr: add swrate_add_scaled()
to work with large samples") we got the ability to indicate when adding
some values that they represent a number of samples. However there is an
issue in the calculation which is that the number of samples that is
added to the sum before the division in order to avoid fading away too
fast, is multiplied by the scale. The problem it causes is that this is
done in the negative part of the expression, and that as soon if the sum
of old_sum and v*s is too small (e.g. zero), we end up with a negative
value of -s.

This is visible in "show pools" which occasionally report a very large
value on "needed_avg" since 2.9, though the bug has been there for longer.
Indeed in 2.9 since they're hashed in buckets, it suffices that any
thread got one such error once for the sum to be wrong. One possible
impact is memory usage not shrinking after a short burst due to pools
refraining from releasing objects, believing they don't have enough.

This must be backported to all versions. Note that the opportunistic
version can be dropped before 2.8.
2023-09-14 11:09:07 +02:00
Willy Tarreau
c618ed5ff4 MAJOR: import: update mt_list to support exponential back-off
The new mt_list code supports exponential back-off on conflict, which
is important for use cases where there is contention on a large number
of threads. The API evolved a little bit and required some updates:

  - mt_list_for_each_entry_safe() is now in upper case to explicitly
    show that it is a macro, and only uses the back element, doesn't
    require a secondary pointer for deletes anymore.

  - MT_LIST_DELETE_SAFE() doesn't exist anymore, instead one just has
    to set the list iterator to NULL so that it is not re-inserted
    into the list and the list is spliced there. One must be careful
    because it was usually performed before freeing the element. Now
    instead the element must be nulled before the continue/break.

  - MT_LIST_LOCK_ELT() and MT_LIST_UNLOCK_ELT() have always been
    unclear. They were replaced by mt_list_cut_around() and
    mt_list_connect_elem() which more explicitly detach the element
    and reconnect it into the list.

  - MT_LIST_APPEND_LOCKED() was only in haproxy so it was left as-is
    in list.h. It may however possibly benefit from being upstreamed.

This required tiny adaptations to event_hdl.c and quic_sock.c. The
test case was updated and the API doc added. Note that in order to
keep include files small, the struct mt_list definition remains in
list-t.h (par of the internal API) and was ifdef'd out in mt_list.h.

A test on QUIC with both quictls 1.1.1 and wolfssl 5.6.3 on ARM64 with
80 threads shows a drastic reduction of CPU usage thanks to this and
the refined memory barriers. Please note that the CPU usage on OpenSSL
3.0.9 is significantly higher due to the excessive use of atomic ops
by openssl, but 3.1 is only slightly above 1.1.1 though:

  - before: 35 Gbps, 3.5 Mpps, 7800% CPU
  - after:  41 Gbps, 4.2 Mpps, 2900% CPU
2023-09-13 11:50:33 +02:00
Frdric Lcaille
84757e32e6 BUG/MEDIUM: quic: quic_cc_conn ->cntrs counters unreachable
This bug arrived with this commit in 2.9-dev3:

    MEDIUM: quic: Allow the quic_conn memory to be asap released.

When sending packets from quic_cc_conn_io_cb(), e.g. when the quic_conn
object has been released and replaced by a lighter one (quic_cc_conn),
some counters may have to be incremented. But they were not reachable
because not shared between quic_conn and quic_cc_conn struct.

To fix this, one has only to move the ->cntrs counters from quic_conn
to QUIC_CONN_COMMON struct which is shared between both quic_cc_conn

Thank you to Tristan for having reported this in GH #2247.

No need to backport.
2023-09-12 18:13:36 +02:00
Willy Tarreau
efc46dede9 DEBUG: pools: inspect pools on fatal error and dump information found
It's a bit frustrating sometimes to see pool checks catch a bug but not
provide exploitable information without a core.

Here we're adding a function "pool_inspect_item()" which is called just
before aborting in pool_check_pattern() and POOL_DEBUG_CHECK_MARK() and
which will display the error type, the pool's pointer and name, and will
try to check if the item's tag matches the pool, and if not, will iterate
over all pools to see if one would be a better candidate, then will try
to figure the last known caller and possibly other likely candidates if
the pool's tag is not sufficiently trusted. This typically helps better
diagnose corruption in use-after-free scenarios, or freeing to a pool
that differs from the one the object was allocated from, and will also
indicate calling points that may help figure where an object was last
released or allocated. The info is printed on stderr just before the
backtrace.

For example, the recent off-by-one test in the PPv2 changes would have
produced the following output in vtest logs:

  ***  h1    debug|FATAL: pool inconsistency detected in thread 1: tag mismatch on free().
  ***  h1    debug|  caller: 0x62bb87 (conn_free+0x147/0x3c5)
  ***  h1    debug|  pool: 0x2211ec0 ('pp_tlv_256', size 304, real 320, users 1)
  ***  h1    debug|Tag does not match. Possible origin pool(s):
  ***  h1    debug|  tag: @0x2565530 = 0x2216740 (pp_tlv_128, size 176, real 192, users 1)
  ***  h1    debug|Recorded caller if pool 'pp_tlv_128':
  ***  h1    debug|  @0x2565538 (+0184) = 0x62c76d (conn_recv_proxy+0x4cd/0xa24)

A mismatch in the allocated/released pool is already visible, and the
callers confirm it once resolved, where the allocator indeed allocates
from pp_tlv_128 and conn_free() releases to pp_tlv_256:

  $ addr2line -spafe ./haproxy <<< $'0x62bb87\n0x62c76d'
  0x000000000062bb87: conn_free at connection.c:568
  0x000000000062c76d: conn_recv_proxy at connection.c:1177
2023-09-11 15:46:14 +02:00
Willy Tarreau
f6bee5a50b DEBUG: pools: make pool_check_pattern() take a pointer to the pool
This will be useful to report detailed bug traces.
2023-09-11 15:19:49 +02:00
Willy Tarreau
e92e96b00f DEBUG: pools: pass the caller pointer to the check functions and macros
In preparation for more detailed pool error reports, let's pass the
caller pointers to the check functions. This will be useful to produce
messages indicating where the issue happened.
2023-09-11 15:19:49 +02:00
Willy Tarreau
baf2070421 DEBUG: pools: always record the caller for uncached allocs as well
When recording the caller of a pool_alloc(), we currently store it only
when the object comes from the cache and never when it comes from the
heap. There's no valid reason for this except that the caller's pointer
was not passed to pool_alloc_nocache(), so it used to set NULL there.
Let's just pass it down the chain.
2023-09-11 15:19:49 +02:00
Willy Tarreau
4a18d9e560 REORG: cpuset: move parse_cpu_set() and parse_cpumap() to cpuset.c
These ones were still in cfgparse.c but they're not specific to the
config at all and may actually be used even when parsing cpu list
entries in /sys. Better move them where they can be reused.
2023-09-08 16:25:19 +02:00
Willy Tarreau
5119109e3f MINOR: cpuset: dynamically allocate cpu_map
cpu_map is 8.2kB/entry and there's one such entry per group, that's
~520kB total. In addition, the init code is still in haproxy.c enclosed
in ifdefs. Let's make this a dynamically allocated array in the cpuset
code and remove that init code.

Later we may even consider reallocating it once the number of threads
and groups is known, in order to shrink it a little bit, as the typical
setup with a single group will only need 8.2kB, thus saving half a MB
of RAM. This would require that the upper bound is placed in a variable
though.
2023-09-08 16:25:19 +02:00
Willy Tarreau
1f2433fb6a MINOR: tools: add function read_line_to_trash() to read a line of a file
This function takes on input a printf format for the file name, making
it particularly suitable for /proc or /sys entries which take a lot of
numbers. It also automatically trims the trailing CR and/or LF chars.
2023-09-08 16:25:19 +02:00
Frédéric Lécaille
e3e218b98e CLEANUP: quic: Remove useless free_quic_tx_pkts() function.
This function define but no more used since this commit:
    BUG/MAJOR: quic: Really ignore malformed ACK frames.
2023-09-08 10:17:25 +02:00
Frédéric Lécaille
292dfdd78d BUG/MINOR: quic: Wrong cluster secret initialization
The function generate_random_cluster_secret() which initializes the cluster secret
when not supplied by configuration is buggy. There 1/256 that the cluster secret
string is empty.

To fix this, one stores the cluster as a reduced size first 128 bits of its own
SHA1 (160 bits) digest, if defined by configuration. If this is not the case, it
is initialized with a 128 bits random value. Furthermore, thus the cluster secret
is always initialized.

As the cluster secret is always initialized, there are several tests which
are for now on useless. This patch removes such tests (if(global.cluster_secret))
in the QUIC code part and at parsing time: no need to check that a cluster
secret was initialized with "quic-force-retry" option.

Must be backported as far as 2.6.
2023-09-08 09:50:58 +02:00
William Lallemand
15e591b6e0 MINOR: ssl: add support for 'curves' keyword on server lines
This patch implements the 'curves' keyword on server lines as well as
the 'ssl-default-server-curves' keyword in the global section.

It also add the keyword on the server line in the ssl_curves reg-test.

These keywords allow the configuration of the curves list for a server.
2023-09-07 23:29:10 +02:00
Willy Tarreau
28ff1a5d56 MINOR: tasks/stats: report the number of niced tasks in "show info"
We currently know the number of tasks in the run queue that are niced,
and we don't expose it. It's too bad because it can give a hint about
what share of the load is relevant. For example if one runs a Lua
script that was purposely reniced, or if a stats page or the CLI is
hammered with slow operations, seeing them appear there can help
identify what part of the load is not caused by the traffic, and
improve monitoring systems or autoscalers.
2023-09-06 17:44:44 +02:00
Remi Tricot-Le Breton
e03d060aa3 MINOR: cache: Change hash function in default normalizer used in case of "vary"
When building the secondary signature for cache entries when vary is
enabled, the referer part of the signature was a simple crc32 of the
first referer header.
This patch changes it to a 64bits hash based of xxhash algorithm with a
random seed built during init. This will prevent "malicious" hash
collisions between entries of the cache.
2023-09-06 16:11:31 +02:00
Aurelien DARRAGON
d9b81e5b49 MEDIUM: log/sink: make logsrv postparsing more generic
We previously had postparsing logic but only for logsrv sinks, but now we
need to make this operation on logsrv directly instead of sinks to prepare
for additional postparsing logic that is not sink-specific.

To do this, we migrated post_sink_resolve() and sink_postresolve_logsrvs()
to their postresolve_logsrvs() and postresolve_logsrv_list() equivalents.

Then, we split postresolve_logsrv_list() so that the sink-only logic stays
in sink.c (sink_resolve_logsrv_buffer() function), and the "generic"
target part stays in log.c as resolve_logsrv().

Error messages formatting was preserved as far as possible but some slight
variations are to be expected.
As for the functional aspect, no change should be expected.
2023-09-06 16:06:39 +02:00
Aurelien DARRAGON
969e212c66 MINOR: log: add dup_logsrv() helper function
ease code maintenance by introducing dup_logsrv() helper function to
properly duplicate an existing logsrv struct.
2023-09-06 16:06:39 +02:00
Aurelien DARRAGON
d499485aa9 MINOR: sink: simplify post_sink_resolve function
Simplify post_sink_resolve() function to reduce code duplication and
make it easier to maintain.
2023-09-06 16:06:39 +02:00
Aurelien DARRAGON
5b295ff409 MINOR: ring: add a function to compute max ring payload
Add a helper function to the ring API to compute the maximum payload
length that could fit into the ring based on ring size.
2023-09-06 16:06:39 +02:00
Christopher Faulet
3ec156f027 BUG/MEDIUM: applet: Fix API for function to push new data in channels buffer
All applets only check the -1 error value (need room) for applet_put*
functions while the underlying functions may also return -2 if the input is
closed or -3 if the data length is invalid. It means applets already handle
other cases by their own.

The API should be fixed but for now, to ease backports, we only fix
applet_put* functions to always return -1 on error. This way, at least for
the applets point of view, the API is consistent.

This patch should be backported to 2.8. Probably not further. Except if we
suspect it could fix a bug.
2023-09-06 09:29:27 +02:00
Frdric Lcaille
fb4294be55 BUG/MINOR: quic: Wrong RTT computation (srtt and rrt_var)
Due to the fact that several variable values (rtt_var, srtt) were stored as multiple
of their real values, some calculations were less accurate as expected.

Stop storing 4*rtt_var values, and 8*srtt values.
Adjust all the impacted statements.

Must be backported as far as 2.6.
2023-09-05 17:14:51 +02:00
William Lallemand
d90d3bf894 MINOR: global: export the display_version() symbol
Export the display_version() function which can be used elsewhere than
in haproxy.c
2023-09-05 15:24:39 +02:00
Willy Tarreau
86854dd032 MEDIUM: threads: detect excessive thread counts vs cpu-map
This detects when there are more threads bound via cpu-map than CPUs
enabled in cpu-map, or when there are more total threads than the total
number of CPUs available at boot (for unbound threads) and configured
for bound threads. In this case, a warning is emitted to explain the
problems it will cause, and explaining how to address the situation.

Note that some configurations will not be detected as faulty because
the algorithmic complexity to resolve all arrangements grows in O(N!).
This means that having 3 threads on 2 CPUs and one thread on 2 CPUs
will not be detected as it's 4 threads for 4 CPUs. But at least configs
such as T0:(1,4) T1:(1,4) T2:(2,4) T3:(3,4) will not trigger a warning
since they're valid.
2023-09-04 19:39:17 +02:00
Willy Tarreau
8357f950cb MEDIUM: threads: detect incomplete CPU bindings
It's very easy to mess up with some cpu-map directives and to leave
some thread unbound. Let's add a test that checks that either all
threads are bound or none are bound, but that we do not face the
intermediary situation where some are pinned and others are left
wandering around, possibly on the same CPUs as bound ones.

Note that this should not be backported, or maybe turned into a
notice only, as it appears that it will easily catch invalid
configs and that may break updates for some users.
2023-09-04 19:39:17 +02:00
Willy Tarreau
e65f54cf96 MINOR: cpuset: centralize a reliable bound cpu detection
Till now the CPUs that were bound were only retrieved in
thread_cpus_enabled() in order to count the number of CPUs allowed,
and it relied on arch-specific code.

Let's slightly arrange this into ha_cpuset_detect_bound() that
reuses the ha_cpuset struct and the accompanying code. This makes
the code much clearer without having to carry along some arch-specific
stuff out of this area.

Note that the macos-specific code used in thread.c to only count
online CPUs but not retrieve a mask, so for now we can't infer
anything from it and can't implement it.

In addition and more importantly, this function is reliable in that
it will only return a value when the detection is accurate, and will
not return incomplete sets on operating systems where we don't have
an exact list, such as online CPUs.
2023-09-04 19:39:17 +02:00
Willy Tarreau
d3ecc67a01 MINOR: cpuset: add ha_cpuset_or() to bitwise-OR two CPU sets
This operation was not implemented and will be needed later.
2023-09-04 19:39:17 +02:00
Willy Tarreau
eb10567254 MINOR: cpuset: add ha_cpuset_isset() to check for the presence of a CPU in a set
This function will be convenient to test for the presence of a given CPU
in a set.
2023-09-04 19:39:17 +02:00
Willy Tarreau
17a7baca07 BUILD: bug: make BUG_ON() void to avoid a rare warning
When building without threads, the recently introduced BUG_ON(tid != 0)
turns to a constant expression that evaluates to 0 and that is not used,
resulting in this warning:

  src/connection.c: In function 'conn_free':
  src/connection.c:584:3: warning: statement with no effect [-Wunused-value]

This is because the whole thing is declared as an expression for clarity.
Make it return void to avoid this. No backport is needed.
2023-09-04 19:38:51 +02:00
Andrew Hopkins
b3f94f8b3b BUILD: ssl: Build with new cryptographic library AWS-LC
This adds a new option for the Makefile USE_OPENSSL_AWSLC, and
update the documentation with instructions to use HAProxy with
AWS-LC.

Update the type of the OCSP callback retrieved with
SSL_CTX_get_tlsext_status_cb with the actual type for
libcrypto versions greater than 1.0.2. This doesn't affect
OpenSSL which casts the callback to void* in SSL_CTX_ctrl.
2023-09-04 18:19:18 +02:00
Christopher Faulet
b50a471adb BUG/MEDIUM: stconn: Don't block sends if there is a pending shutdown
For the same reason than the previous patch, we must not block the sends
when there is a pending shutdown. In other words, we must consider the sends
are allowed when there is a pending shutdown.

This patch must slowly be backported as far as 2.2. It should partially fix
issue #2249.
2023-09-01 14:18:26 +02:00
Willy Tarreau
844a3bc25b MEDIUM: checks: implement a queue in order to limit concurrent checks
The progressive adoption of OpenSSL 3 and its abysmal handshake
performance has started to reveal situations where it simply isn't
possible anymore to succesfully run health checks on many servers,
because between the moment all the checks are started and the moment
the handshake finally completes, the timeout has expired!

This also has consequences on production traffic which gets
significantly delayed as well, all that for lots of checks. While it's
possible to increase the check delays, it doesn't solve everything as
checks still take a huge amount of time to converge in such conditions.

Here we take a different approach by permitting to enforce the maximum
concurrent checks per thread limitation and implementing an ordered
queue. Thanks to this, if a thread about to start a check has reached
its limit, it will add the check at the end of a queue and it will be
processed once another check is finished. This proves to be extremely
efficient, with all checks completing in a reasonable amount of time
and not being disturbed by the rest of the traffic from other checks.
They're just cycling slower, but at the speed the machine can handle.

One must understand however that if some complex checks perform multiple
exchanges, they will take a check slot for all the required duration.
This is why the limit is not enforced by default.

Tests on SSL show that a limit of 5-50 checks per thread on local
servers gives excellent results already, so that could be a good starting
point.
2023-09-01 14:00:04 +02:00
Willy Tarreau
cfc0bceeb5 MEDIUM: checks: search more aggressively for another thread on overload
When the current check is overloaded (more running checks than the
configured limit), we'll try more aggressively to find another thread.
Instead of just opportunistically looking for one half as loaded, now if
the current thread has more than 1% more active checks than another one,
or has more than a configured limit of concurrent running checks, it will
search for a more suitable thread among 3 other random ones in order to
migrate the check there. The number of migrations remains very low (~1%)
and the checks load very fair across all threads (~1% as well). The new
parameter is called tune.max-checks-per-thread.
2023-09-01 08:26:06 +02:00
Willy Tarreau
00de9e0804 MINOR: checks: maintain counters of active checks per thread
Let's keep two check counters per thread:
  - one for "active" checks, i.e. checks that are no more sleeping
    and are assigned to the thread. These include sleeping and
    running checks ;

  - one for "running" checks, i.e. those which are currently
    executing on the thread.

By doing so, we'll be able to spread the health checks load a bit better
and refrain from sending too many at once per thread. The counters are
atomic since a migration increments the target thread's active counter.
These numbers are reported in "show activity", which allows to check
per thread and globally how many checks are currently pending and running
on the system.

Ideally, we should only consider checks in the process of establishing
a connection since that's really the expensive part (particularly with
OpenSSL 3.0). But the inner layers are really not suitable to doing
this. However knowing the number of active checks is already a good
enough hint.
2023-09-01 08:26:06 +02:00
Willy Tarreau
3b7942a1c9 MINOR: check/activity: collect some per-thread check activity stats
We now count the number of times a check was started on each thread
and the number of times a check was adopted. This helps understand
better what is observed regarding checks.
2023-09-01 08:26:06 +02:00
Willy Tarreau
e03d05c6ce MINOR: check: remember when we migrate a check
The goal here is to explicitly mark that a check was migrated so that
we don't do it again. This will allow us to perform other actions on
the target thread while still knowing that we don't want to be migrated
again. The new READY bit combine with SLEEPING to form 4 possible states:

 SLP  RDY   State      Description
  0    0    -          (reserved)
  0    1    RUNNING    Check is bound to current thread and running
  1    0    SLEEPING   Check is sleeping, not bound to a thread
  1    1    MIGRATING  Check is migrating to another thread

Thus we set READY upon migration, and check for it before migrating, this
is sufficient to prevent a second migration. To make things a bit clearer,
the SLEEPING bit was switched with FASTINTER so that SLEEPING and READY are
adjacent.
2023-09-01 08:26:06 +02:00
Willy Tarreau
7163f95b43 MINOR: checks: start the checks in sleeping state
The CHK_ST_SLEEPING state was introduced by commit d114f4a68 ("MEDIUM:
checks: spread the checks load over random threads") to indicate that
a check was not currently bound to a thread and that it could easily
be migrated to any other thread. However it did not start the checks
in this state, meaning that they were not redispatchable on startup.

Sometimes under heavy load (e.g. when using SSL checks with OpenSSL 3.0)
the cost of setting up new connections is so high that some threads may
experience connection timeouts on startup. In this case it's better if
they can transfer their excess load to other idle threads. By just
marking the check as sleeping upon startup, we can do this and
significantly reduce the number of failed initial checks.
2023-09-01 08:26:06 +02:00
Willy Tarreau
52b260bae4 MINOR: server/ssl: maintain an index of the last known valid SSL session
When a thread creates a new session for a server, if none was known yet,
we assign the thread id (hence the reused_sess index) to a shared variable
so that other threads will later be able to find it when they don't have
one yet. For now we only set and clear the pointer upon session creation,
we do not yet pick it.

Note that we could have done it per thread-group, so as to avoid any
cross-thread exchanges, but it's anticipated that this is essentially
used during startup, at a moment where the cost of inter-thread contention
is very low compared to the ability to restart at full speed, which
explains why instead we store a single entry.
2023-08-31 08:50:01 +02:00
Willy Tarreau
607041dec3 MEDIUM: server/ssl: place an rwlock in the per-thread ssl server session
The goal will be to permit a thread to update its session while having
it shared with other threads. For now we only place the lock and arrange
the code around it so that this is quite light. For now only the owner
thread uses this lock so there is no contention.

Note that there is a subtlety in the openssl API regarding
i2s_SSL_SESSION() in that it fills the area pointed to by its argument
with a dump of the session and returns a size that's equal to the
previously allocated one. As such, it does modify the shared area even
if that's not obvious at first glance.
2023-08-31 08:50:01 +02:00
Alexander Stephan
ece0d1ab49 MINOR: sample: Refactor fc_pp_authority by wrapping the generic TLV fetch
We already have a call that can retreive an TLV with any value.
Therefore, the fetch logic is redundant and can be simplified
by simply calling the generic fetch with the correct TLV ID
set as an argument.
2023-08-29 15:31:51 +02:00
Alexander Stephan
fecc573da1 MEDIUM: connection: Generic, list-based allocation and look-up of PPv2 TLVs
In order to be able to implement fetches in the future that allow
retrieval of any TLVs, a new generic data structure for TLVs is introduced.

Existing TLV fetches for PP2_TYPE_AUTHORITY and PP2_TYPE_UNIQUE_ID are
migrated to use this new data structure. TLV related pools are updated
to not rely on type, but only on size. Pools accomodate the TLV list
element with their associated value. For now, two pools for 128 B and
256 B values are introduced. More fine-grained solutions are possible
in the future, if necessary.
2023-08-29 15:15:47 +02:00
Alexander Stephan
c9d47652d2 CLEANUP/MINOR: connection: Improve consistency of PPv2 related constants
This patch improves readability by scoping HA proxy related PPv2 constants
with a 'HA" prefix. Besides, a new constant for the length of a CRC32C
TLV is introduced. The length is derived from the PPv2 spec, so 32 Bit.
2023-08-29 15:15:47 +02:00
Willy Tarreau
bd84387beb MEDIUM: capabilities: enable support for Linux capabilities
For a while there has been the constraint of having to run as root for
transparent proxying, and we're starting to see some cases where QUIC is
not running in socket-per-connection mode due to the missing capability
that would be needed to bind a privileged port. It's not realistic to
ask all QUIC users on port 443 to run as root, so instead let's provide
a basic support for capabilities at least on linux. The ones currently
supported are cap_net_raw, cap_net_admin and cap_net_bind_service. The
mechanism was made OS-specific with a dedicated file because it really
is. It can be easily refined later for other OSes if needed.

A new keyword "setcaps" is added to the global section, to enumerate the
capabilities that must be kept when switching from root to non-root. This
is ignored in other situations though. HAProxy has to be built with
USE_LINUX_CAP=1 for this to be supported, which is enabled by default
for linux-glibc, linux-glibc-legacy and linux-musl.

A good way to test this is to start haproxy with such a config:

    global
        uid 1000
        setcap cap_net_bind_service

    frontend test
        mode http
        timeout client 3s
        bind quic4@:443 ssl crt rsa+dh2048.pem allow-0rtt

and run it under "sudo strace -e trace=bind,setuid", then connecting
there from an H3 client. The bind() syscall must succeed despite the
user id having been switched.
2023-08-29 11:11:50 +02:00
William Lallemand
e7d9082315 BUG/MINOR: ssl/cli: can't find ".crt" files when replacing a certificate
Bug was introduced by commit 26654 ("MINOR: ssl: add "crt" in the
cert_exts array").

When looking for a .crt directly in the cert_exts array, the
ssl_sock_load_pem_into_ckch() function will be called with a argument
which does not have its ".crt" extensions anymore.

If "ssl-load-extra-del-ext" is used this is not a problem since we try
to add the ".crt" when doing the lookup in the tree.

However when using directly a ".crt" without this option it will failed
looking for the file in the tree.

The fix removes the "crt" entry from the array since it does not seem to
be really useful without a rework of all the lookups.

Should fix issue #2265

Must be backported as far as 2.6.
2023-08-28 18:20:39 +02:00
Willy Tarreau
892d04733f BUILD: import: guard plock.h against multiple inclusion
Surprisingly there's no include guard in plock.h though there is one in
atomic-ops.h. Let's add one, or we cannot risk including the file multiple
times.
2023-08-26 17:28:08 +02:00
Amaury Denoyelle
5afcb686b9 MAJOR: connection: purge idle conn by last usage
Backend idle connections are purged on a recurring occurence during the
process lifetime. An estimated number of needed connections is
calculated and the excess is removed periodically.

Before this patch, purge was done directly using the idle then the safe
connection tree of a server instance. This has a major drawback to take
no account of a specific ordre and it may removed functional connections
while leaving ones which will fail on the next reuse.

The problem can be worse when using criteria to differentiate idle
connections such as the SSL SNI. In this case, purge may remove
connections with a high rate of reusing while leaving connections with
criteria never matched once, thus reducing drastically the reuse rate.

To improve this, introduce an alternative storage for idle connection
used in parallel of the idle/safe trees. Now, each connection inserted
in one of this tree is also inserted in the new list at
`srv_per_thread.idle_conn_list`. This guarantees that recently used
connection is present at the end of the list.

During the purge, use this list instead of idle/safe trees. Remove first
connection in front of the list which were not reused recently. This
will ensure that connection that are frequently reused are not purged
and should increase the reuse rate, particularily if distinct idle
connection criterias are in used.
2023-08-25 15:57:48 +02:00
Amaury Denoyelle
61fc9568fb MINOR: server: move idle tree insert in a dedicated function
Define a new function _srv_add_idle(). This is a simple wrapper to
insert a connection in the server idle tree. This is reserved for simple
usage and require to idle_conns lock. In most cases,
srv_add_to_idle_list() should be used.

This patch does not have any functional change. However, it will help
with the next patch as idle connection will be always inserted in a list
as secondary storage along with idle/safe trees.
2023-08-25 15:57:48 +02:00
Amaury Denoyelle
77ac8eb4a6 MINOR: connection: simplify removal of idle conns from their trees
Small change of API for conn_delete_from_tree(). Now the connection
instance is taken as argument instead of its inner node.

No functional change introduced with this commit. This simplifies
slightly invocation of conn_delete_from_tree(). The most useful changes
is that this function will be extended in the next patch to be able to
remove the connection from its new idle list at the same time as in its
idle tree.
2023-08-25 15:57:48 +02:00
Frdric Lcaille
81815a9a83 MEDIUM: map/acl: Replace map/acl spin lock by a read/write lock.
Replace ->lock type of pat_ref struct by HA_RWLOCK_T.
Replace all calls to HA_SPIN_LOCK() (resp. HA_SPIN_UNLOCK()) by HA_RWLOCK_WRLOCK()
(resp. HA_RWLOCK_WRUNLOCK()) when a write access is required.
There is only one read access which is needed. This is in the "show map" command
callback, cli_io_handler_map_lookup() where a HA_SPIN_LOCK() call is replaced
by HA_RWLOCK_RDLOCK() (resp. HA_SPIN_UNLOCK() by HA_RWLOCK_RDUNLOCK).
Replace HA_SPIN_INIT() calls by HA_RWLOCK_INIT() calls.
2023-08-25 15:42:03 +02:00
Frdric Lcaille
745d1a269b MEDIUM: map/acl: Improve pat_ref_set_elt() efficiency (for "set-map", "add-acl"action perfs)
Store a pointer to the expression (struct pattern_expr) into the data structure
used to chain/store the map element references (struct pat_ref_elt) , e.g. the
struct pattern_tree when stored into an ebtree or struct pattern_list when
chained to a list.

Modify pat_ref_set_elt() to stop inspecting all the expressions attached to a map
and to look for the <elt> element passed as parameter to retrieve the sample data
to be parsed. Indeed, thanks to the pointer added above to each pattern tree nodes
or list elements, they all can be inspected directly from the <elt> passed as
parameter and its ->tree_head and ->list_head member: the pattern tree nodes are
stored into elt->tree_head, and the pattern list elements are chained to
elt->list_head list. This inspection was also the job of pattern_find_smp() which
is no more useful. This patch removes the code of this function.
2023-08-25 15:41:59 +02:00
Frdric Lcaille
0844bed7d3 MEDIUM: map/acl: Improve pat_ref_set() efficiency (for "set-map", "add-acl" action perfs)
Organize reference to pattern element of map (struct pat_ref_elt) into an ebtree:
  - add an eb_root member to the map (pat_ref struct) and an ebpt_node to its
    element (pat_ref_elt struct),
  - modify the code to insert these nodes into their ebtrees each time they are
    allocated. This is done in pat_ref_append().

Note that ->head member (struct list) of map (struct pat_ref) is not removed
could have been removed. This is not the case because still necessary to dump
the map contents from the CLI in the order the map elememnts have been inserted.

This patch also modifies http_action_set_map() which is the callback at least
used by "set-map" action. The pat_ref_elt element returned by pat_ref_find_elt()
is no more ignored, but reused if not NULL by pat_ref_set() as first element to
lookup from. This latter is also modified to use the ebtree attached to the map
in place of the ->head list attached to each map element (pat_ref_elt struct).

Also modify pat_ref_find_elt() to makes it use ->eb_root map ebtree added to the
map by this patch in place of inspecting all the elements with a strcmp() call.
2023-08-25 15:41:56 +02:00
Amaury Denoyelle
5053e89142 MEDIUM: h2: prevent stream opening before connection reverse completed
HTTP/2 demux must be handled with care for active reverse connection.
Until accept has been completed, it should be forbidden to handle
HEADERS frame as session is not yet ready to handle streams.

To implement this, use the flag H2_CF_DEM_TOOMANY which blocks demux
process. This flag is automatically set just after conn_reverse()
invocation. The flag is removed on rev_accept_conn() callback via a new
H2 ctl enum. H2 tasklet is woken up to restart demux process.

As a side-effect, reporting in H2 mux may be blocked as demux functions
are used to convert error status at the connection level with
CO_FL_ERROR. To ensure error is reported for a reverse connection, check
h2c_is_dead() specifically for this case in h2_wake(). This change also
has its own side-effect : h2c_is_dead() conditions have been adjusted to
always exclude !h2c->conn->owner condition which is always true for
reverse connection or else H2 mux may kill them unexpectedly.
2023-08-24 17:03:08 +02:00
Amaury Denoyelle
47f502df5e MEDIUM: proto_reverse_connect: bootstrap active reverse connection
Implement active reverse connection initialization. This is done through
a new task stored in the receiver structure. This task is instantiated
via bind callback and first woken up via enable callback.

Task handler is separated into two halves. On the first step, a new
connection is allocated and stored in <pend_conn> member of the
receiver. This new client connection will proceed to connect using the
server instance referenced in the bind_conf.

When connect has successfully been executed and HTTP/2 connection is
ready for exchange after SETTINGS, reverse_connect task is woken up. As
<pend_conn> is still set, the second halve is executed which only
execute listener_accept(). This will in turn execute accept_conn
callback which is defined to return the pending connection.

The task is automatically requeued inside accept_conn callback if bind
maxconn is not yet reached. This allows to specify how many connection
should be opened. Each connection is instantiated and reversed serially
one by one until maxconn is reached.

conn_free() has been modified to handle failure if a reverse connection
fails before being accepted. In this case, no session exists to notify
about the failure. Instead, reverse_connect task is requeud with a 1
second delay, giving time to fix a possible network issue. This will
allow to attempt a new connection reverse.

Note that for the moment connection rebinding after accept is disabled
for simplicity. Extra operations are required to migrate an existing
connection and its stack to a new thread which will be implemented
later.
2023-08-24 17:03:06 +02:00
Amaury Denoyelle
0747e493a0 MINOR: proto_reverse_connect: parse rev@ addresses for bind
Implement parsing for "rev@" addresses on bind line. On config parsing,
server name is stored on the bind_conf.

Several new callbacks are defined on reverse_connect protocol to
complete parsing. listen callback is used to retrieve the server
instance from the bind_conf server name. If found, the server instance
is stored on the receiver. Checks are implemented to ensure HTTP/2
protocol only is used by the server.
2023-08-24 17:02:37 +02:00
Amaury Denoyelle
008e8f67ee MINOR: connection: extend conn_reverse() for active reverse
Implement active reverse support inside conn_reverse(). This is used to
transfer the connection from the backend to the frontend side.

A new flag is defined CO_FL_REVERSED which is set just after this
transition. This will be used to identify connections which were
reversed but not yet accepted.
2023-08-24 17:02:37 +02:00
Amaury Denoyelle
5db6dde058 MINOR: proto: define dedicated protocol for active reverse connect
A new protocol named "reverse_connect" is created. This will be used to
instantiate connections that are opened by a reverse bind.

For the moment, only a minimal set of callbacks are defined with no real
work. This will be extended along the next patches.
2023-08-24 17:02:37 +02:00
Amaury Denoyelle
1723e21af2 MINOR: connection: use attach-srv name as SNI reuse parameter on reverse
On connection passive reverse from frontend to backend, its hash node is
calculated to be able to select it from the idle server pool. If
attach-srv rule defined an associated name, reuse it as the value for
SNI prehash.

This change allows a client to select a reverse connection by its name
by configuring its server line with a SNI to permit this.
2023-08-24 17:02:34 +02:00
Amaury Denoyelle
0b3758e18f MINOR: tcp-act: define optional arg name for attach-srv
Add an optional argument 'name' for attach-srv rule. This contains an
expression which will be used as an identifier inside the server idle
pool after reversal. To match this connection for a future transfer
through the server, the SNI server parameter must match this name. If no
name is defined, match will only occur with an empty SNI value.

For the moment, only the parsing step is implemented. An extra check is
added to ensure that the reverse server uses SSL with a SNI. Indeed, if
name is defined but server does not uses a SNI, connections will never
be selected on reused after reversal due to a hash mismatch.
2023-08-24 15:28:38 +02:00
Amaury Denoyelle
58cb76d7e1 MINOR: tcp-act: parse 'tcp-request attach-srv' session rule
Create a new tcp-request session rule 'attach-srv'.

The parsing handler is used to extract the server targetted with the
notation 'backend/server'. The server instance is stored in the act_rule
instance under the new union variant 'attach_srv'.

Extra checks are implemented in parsing to ensure attach-srv is only
used for proxy in HTTP mode and with listeners/server with no explicit
protocol reference or HTTP/2 only.

The action handler itself is really simple. It assigns the stored server
instance to the 'reverse' member of the connection instance. It will be
used in a future patch to implement passive reverse-connect.
2023-08-24 15:02:32 +02:00
Amaury Denoyelle
6e428dfaf2 MINOR: backend: only allow reuse for reverse server
A reverse server relies solely on its pool of idle connection to
transfer requests which will be populated through a new tcp-request rule
'attach-srv'.

Several changes are required on connect_server() to implement this.
First, reuse mode is forced to always for this type of server. Then, if
no idle connection is found, the request will be aborted. This results
with a 503 HTTP error code, similarly to when no server is available.
2023-08-24 14:49:03 +02:00
Amaury Denoyelle
e6223a3188 MINOR: server: define reverse-connect server
Implement reverse-connect server. This server type cannot instantiate
its own connection on transfer. Instead, it can only reuse connection
from its idle pool. These connections will be populated using the future
'tcp-request session attach-srv' rule.

A reverse-connect has no address. Instead, it uses a new custom server
notation with '@' character prefix. For the moment, only '@reverse' is
defined. An extra check is implemented to ensure server is used in a
HTTP proxy.
2023-08-24 14:49:03 +02:00
Amaury Denoyelle
4fb538d4b6 MEDIUM: h2: reverse connection after SETTINGS reception
Reverse connection after SETTINGS reception if it was set as reversable.
This operation is done in a new function h2_conn_reverse(). It regroups
common changes which are needed for both reversal direction :
H2_CF_IS_BACK is set or unset and timeouts are inverted.

For the moment, only passive reverse is fully implemented. Once done,
the connection instance is directly inserted in its targetted server
pool. It can then be used immediately for future transfers using this
server.
2023-08-24 14:49:03 +02:00
Amaury Denoyelle
1f76b8ae07 MEDIUM: connection: implement passive reverse
Define a new method conn_reverse(). This method is used to reverse a
connection from frontend to backend or vice-versa depending on its
initial status.

For the moment, passive reverse only is implemented. This covers the
transition from frontend to backend side. The connection is detached
from its owner session which can then be freed. Then the connection is
linked to the server instance.

only for passive connection on
frontend to transfer them on the backend side. This requires to free the
connection session after detaching it from.
2023-08-24 14:44:33 +02:00
Amaury Denoyelle
fbe35afaa4 MINOR: proxy: simplify parsing 'backend/server'
Several CLI handlers use a server argument specified with the format
'<backend>/<server>'. The parsing of this arguement is done in two
steps, first splitting the string with '/' delimiter and then use
get_backend_server() to retrieve the server instance.

Refactor this code sections with the following changes :
* splitting is reimplented using ist API
* get_backend_server() is removed. Instead use the already existing
  proxy_be_by_name() then server_find_by_name() which contains
  duplicated code with the now removed function.

No functional change occurs with this commit. However, it will be useful
to add new configuration options reusing the same '<backend>/<server>'
for reverse connect.
2023-08-24 14:44:33 +02:00
Willy Tarreau
9b47ed1a93 IMPORT: xxhash: update xxHash to version 0.8.2
Peter Varkoly reported a build issue on ppc64le in xxhash.h. Our version
(0.8.1) was the last one 9 months ago, and since then this specific issue
was addressed in 0.8.2, so let's apply the maintenance update.

This should be backported to 2.8 and 2.7.
2023-08-24 12:01:06 +02:00
Amaury Denoyelle
cd97ba147c BUILD/IMPORT: fix compilation with PLOCK_DISABLE_EBO=1
Compilation is broken due to missing __pl_wait_unlock_long() definition
when building with PLOCK_DISABLE_EBO=1. This has been introduced since
the following commit which activates the inlining version of
pl_wait_unlock_long() :
  commit 071d689a51
  MINOR: threads: inline the wait function for pthread_rwlock emulation

Add an extra check on PLOCK_DISABLE_EBO before choosing the inline or
default version of pl_wait_unlock_long() to fix this.
2023-08-17 11:16:54 +02:00
Willy Tarreau
78fa54863d MINOR: atomic: make sure to always relax after a failed CAS
There were a few places left where we forgot to call __ha_cpu_relax()
after a failed CAS, in the HA_ATOMIC_UPDATE_{MIN,MAX} macros, and in
a few sync_* API macros (the same as above plus HA_ATOMIC_CAS and
HA_ATOMIC_XCHG). Let's add them now.

This could have been a cause of contention, particularly with
process_stream() calling stream_update_time_stats() which uses 8 of them
in a call (4 for the server, 4 for the proxy). This may be a possible
explanation for the high CPU consumption reported in GH issue #2251.

This should be backported at least to 2.6 as it's harmless.
2023-08-17 09:09:20 +02:00
Willy Tarreau
071d689a51 MINOR: threads: inline the wait function for pthread_rwlock emulation
When using pthread_rwlock emulation, contention is reported on
pl_wait_unlock_long(). This is really not convenient to analyse what is
happening. Now plock supports inlining the wait call for just the lorw
functions by enabling PLOCK_LORW_INLINE_WAIT. Let's do this so that now
the wait time will be precisely reported as either pthread_rwlock_rdlock()
or pthread_rwlock_wrlock() depending on the contended function, but no
more on pl_wait_unlock_long(), which will still be reported for all
other locks.
2023-08-17 00:09:05 +02:00
Willy Tarreau
e56275378f IMPORT: lorw: support inlining the wait call
Now when PLOCK_LORW_INLINE_WAIT is defined, the pl_wait_unlock_long()
calls in pl_lorw_rdlock() and pl_lorw_wrlock() will be inlined so that
all the CPU time is accounted for in the calling function.

This is plock upstream commit c993f81d581732a6eb8fe3033f21970420d21e5e.
2023-08-17 00:09:05 +02:00
Willy Tarreau
66dcc0550e IMPORT: plock: always expose the inline version of the lock wait function
Doing so will allow to expose the time spent in certain highly
contended functions, which can be desirable for more accurate CPU
profiling. For example this could be done in locking functions that
are already not inlined so that they are the ones being reported as
those consuming the CPU instead of just pl_wait_unlock_long().

This is plock upstream commit 7505c2e2c8c4aa0ab8f52a2288e1334ae6412be4.
2023-08-17 00:09:05 +02:00
Willy Tarreau
c6b98f05d2 IMPORT: plock: also support inlining the int code
Commit 9db830b ("plock: support inlining exponential backoff code")
added an option to support inlining of the wait code for longs but
forgot to do it for ints. Let's do it now.

This is plock upstream commit b1f9f0d252fa40577d11cfb2bc0a809d6960a297.
2023-08-17 00:09:05 +02:00
Willy Tarreau
7bf829ace1 MAJOR: pools: move the shared pool's free_list over multiple buckets
This aims at further reducing the contention on the free_list when using
global pools. The free_list pointer now appears for each bucket, and both
the alloc and the release code skip to a next bucket when ending on a
contended entry. The default entry used for allocations and releases
depend on the thread ID so that locality is preserved as much as possible
under low contention.

It would be nice to improve the situation to make sure that releases to
the shared pools doesn't consider the first entry's pointer but only an
argument that would be passed and that would correspond to the bucket in
the thread's cache. This would reduce computations and make sure that the
shared cache only contains items whose pointers match the same bucket.
This was not yet done. One possibility could be to keep the same splitting
in the local cache.

With this change, an h2load test with 5 * 160 conns & 40 streams on 80
threads that was limited to 368k RPS with the shared cache jumped to
3.5M RPS for 8 buckets, 4M RPS for 16 buckets, 4.7M RPS for 32 buckets
and 5.5M RPS for 64 buckets.
2023-08-12 19:04:34 +02:00
Willy Tarreau
8a0b5f783b MINOR: pools: move the failed allocation counter over a few buckets
The failed allocation counter cannot depend on a pointer, but since it's
a perpetually increasing counter and not a gauge, we don't care where
it's incremented. Thus instead we're hashing on the TID. There's no
contention there anyway, but it's better not to waste the room in
the pool's heads and to move that with the other counters.
2023-08-12 19:04:34 +02:00
Willy Tarreau
da6999f839 MEDIUM: pools: move the needed_avg counter over a few buckets
That's the same principle as for ->allocated and ->used. Here we return
the summ of the raw values, so the result still needs to be fed to
swrate_avg(). It also means that we now use the local ->used instead
of the global one for the calculations and do not need to call pool_used()
anymore on fast paths. The number of samples should likely be divided by
the number of buckets, but that's not done yet (better observe first).

A function pool_needed_avg() was added to report aggregated values for
the "show pools" command.

With this change, an h2load made of 5 * 160 conn * 40 streams on 80
threads raised from 1.5M RPS to 6.7M RPS.
2023-08-12 19:04:34 +02:00
Willy Tarreau
9e5eb586b1 MEDIUM: pools: move the used counter over a few buckets
That's the same principle as for ->allocated. The small difference here
is that it's no longer possible to decrement ->used in batches when
releasing clusters from the cache to the shared cache, so the counter
has to be decremented for each of them. But as it provides less
contention and it's done only during forced eviction, it shouldn't be
a problem.

A function "pool_used()" was added to return the sum of the entries.
It's used by pool_alloc_nocache() and pool_free_nocache() which need
to count the number of used entries. It's not a problem since such
operations are done when picking/releasing objects to/from the OS,
but it is a reminder that the number of buckets should remain small.

With this change, an h2load test made of 5 * 160 conn * 40 streams on
80 threads raised from 812k RPS to 1.5M RPS.
2023-08-12 19:04:34 +02:00
Willy Tarreau
cdb711e42b MEDIUM: pools: spread the allocated counter over a few buckets
The ->used counter is one of the most stressed, and it heavily
depends on the ->allocated one, so let's first move ->allocated
to a few buckets.

A function "pool_allocated()" was added to return the sum of the entries.
It's important not to abuse it as it does iterate, so everywhere it's
possible to avoid it by keeping a local counter, it's better. Currently
it's used for limited pools which need to make sure they do not allocate
too many objects. That's an acceptable tradeoff to save CPU on large
machines at the expense of spending a little bit more on small ones which
normally are not under load.
2023-08-12 19:04:34 +02:00
Willy Tarreau
06885aaea7 MINOR: pools: introduce the use of multiple buckets
On many threads and without the shared cache, there can be extreme
contention on the ->allocated counter, the ->free_list pointer, and
the ->used counter. It's possible to limit this contention by spreading
the counters a little bit over multiple entries, that are summed up when
a consultation is needed. The criterion used to spread the values cannot
be related to the thread ID due to migrations, since we need to keep
consistent stats (allocated vs used).

Instead we'll just hash the pointer, it provides an index that does the
job and that is consistent for the object. When having just a few entries
(16 here as it showed almost identical performance between global and
non-global pools) even iterations should be short enough during
measurements to not be a problem.

A pair of functions designed to ease pointer hash bucket calculation were
added, with one of them doing it for thread IDs because allocation failures
will be associated with a thread and not a pointer.

For now this patch only brings in the relevant parts of the infrastructure,
the CONFIG_HAP_POOL_BUCKETS_BITS macro that defaults to 6 bits when 512
threads or more are supported, 5 bits when 128 or more are supported, 4
bits when 16 or more are supported, otherwise 3 bits for small setups.
The array in the pool_head and the two utility functions are already
added. It should have no measurable impact beyond inflating the pool_head
structure.
2023-08-12 19:04:34 +02:00
Willy Tarreau
29ad61fb00 OPTIM: pools: make pool_get_from_os() / pool_put_to_os() not update ->allocated
The pool's allocation counter doesn't strictly require to be updated
from these functions, it may more efficiently be done in the caller
(even out of a loop for pool_flush() and pool_gc()), and doing so will
also help us spread the counters over an array later. The functions
were renamed _noinc and _nodec to make sure we catch any possible
user in an external patch. If needed, the original functions may easily
be reimplemented in an inline function.
2023-08-12 19:04:34 +02:00
Willy Tarreau
f0d188f6ed OPTIM: tools: improve hash distribution using a better prime seed
During tests it was noticed that the current hash is not that good
on 4- and 5- bit hashes. About 7.5% of all the 32-bit primes were tested
as candidates for the hash function, by submitting them 128 arrangements
of N pointers among 40k extracted from haproxy's pools, and the average
fill rates for 1- to 12- bit hashes were measured and compared. It was
clear that some values do not provide great hashes and other ones are
way more resistant.

The current value is not bad at all but delivers 42.6% unique 2-bit
outputs, 41.6% 3-bit, 38.0% 4-bit, 38.2% 5-bit and 37.1% 10-bit. Some
values did perform significantly better, among which 0xacd1be85 which
does 43.2% 2-bit, 42.5% 3-bit, 42.2% 4-bit, 39.2% 5-bit and 37.3% 10-bit.

The reverse value used in the ptr2_hash() was really underperforming and
was replaced with 0x9d28e4e9 which does 49.6%, 40.4%, 42.6%, 39.1%, and
37.2% respectvely.

This should slightly improve the accuracy of the task and memory
profiling, and will be useful for pools.
2023-08-12 19:04:34 +02:00
Willy Tarreau
58946d44f8 MINOR: tools: improve ptr hash distribution on 64 bits
When testing the pointer hash on 64-bit real pointers (map entries),
it appeared that the shift by 33 bits that hoped to compensate for the
3 nul LSB degrades the hash, and the centering is more optimal on
31-(bits+1)/2. This makes sense since the topmost bit of the
multiplicator is 31, so for an input of 1 bit and 1 bit of output we
would always get zero. With the formula adjusted this way, we can get
up to ~15% more unique entries at 10 bits and ~24% more at 11 bits.
2023-08-12 19:04:34 +02:00
Willy Tarreau
ab6cb5dea0 MINOR: tools: make ptr_hash() support 0-bit outputs
When dealing with macro-based size definitions, it is useful to be able
to hash pointers on zero bits so that the macro automatically returns a
constant 0. For now it only supports 1-32. Let's just add this special
case. It's automatically optimized out by the compiler since the function
is inlined.
2023-08-12 19:04:34 +02:00
Willy Tarreau
59c347c15e BUILD: defaults: use __WORDSIZE not LONGBITS for MAX_THREADS_PER_GROUP
LONGBITS was defined long ago with old compilers that didn't provide the
word size. It's still present as being referenced in various places in the
code, but we must not use it to define other macros that may be evaluated
at pre-processing time since it contains sizeof() and casts that are not
compatible with preprocessor conditions. Let's switch MAX_THREADS_PER_GROUP
to __WORDSIZE so that we can condition blocks of code on it if needed.

LONGBITS should really be removed by now, given that we don't support
compilers not providing __WORDSIZE anymore (gcc < 4.2).
2023-08-12 19:04:34 +02:00
Willy Tarreau
9e52c35de4 CLEANUP: stick-table: slightly reorder the stktable struct
By moving the config-time stuff after the updt_lock, we can plug some
holes without interfering with it. This allows us to get back to the
768-bytes struct. The performance was not affected at all.
2023-08-11 19:03:35 +02:00
Willy Tarreau
9c6248560e MINOR: stick-table: move the update lock into its own cache line
The read-lock contention observed on the update lock while turning it
into an upgradable lock were due to false sharing with the nearby
updates. Simply moving the lock alone into its own cache line is
sufficient to almost double the performance again, raising from 2355
to 4480k RPS with very low contention:

  Samples: 1M of event 'cycles', 4000 Hz, Event count (approx.): 743422995452 lost
  Overhead  Shared Object          Symbol
    15.88%  haproxy                [.] stktable_lookup_key
     5.94%  haproxy                [.] ebmb_lookup
     5.69%  haproxy                [.] http_wait_for_request
     3.66%  haproxy                [.] stktable_touch_with_exp
     2.62%  [kernel]               [k] _raw_spin_unlock_irqrestore
     1.86%  haproxy                [.] http_action_return
     1.79%  haproxy                [.] stream_process_counters
     1.78%  [kernel]               [k] skb_release_data
     1.77%  haproxy                [.] process_stream

Unfortunately, trying to move the line anywhere else didn't work,
despite the remaining holes, because this structure is not quite
clean. This adds 64 bytes to a struct that was already 768 long,
so it's now 832. It's possible to repack it a little bit and regain
these bytes by removing the THREAD_ALIGN before "keys" because we
rarely use the config stuff, but that's a bit unsafe.
2023-08-11 19:03:35 +02:00
Willy Tarreau
87e072eea5 MEDIUM: stick-table: use a distinct lock for the updates tree
Updating an entry in the updates tree is currently performed under the
table's write lock, which causes huge contention with other accesses
such as lookups and free. Aside the updates tree, the update,
localupdate and commitupdate variables, nothing is manipulated, so
let's create a distinct lock (updt_lock) to protect these together
to remove this contention. It required to add an extra lock in the
few places where we delete the update (though only if we're really
going to delete it) to protect the tree. This is very convenient
because now peer_send_teachmsgs() only needs to take this read lock,
and there is very little contention left on the stick-table.

With this alone, the performance jumped from 614k to 1140k/s on a
80-thread machine with a peers section! Stick-table updates with
no peers however now has to stand two locks and slightly regressed
from 4.0-4.1M/s to 3.9-4.0. This is fairly minimal compared to the
significant unlocking of the peers updates and considered totally
acceptable.
2023-08-11 19:03:35 +02:00