25206 Commits

Author SHA1 Message Date
Christopher Faulet
4a99f15f0c Revert "REGTESTS: stop using truncated.vtc on freebsd"
This reverts commit 0b9a75e8781593c250f6366a64a019018ade688e.

Thanks to the previous fixes ("REGTESTS: Fix truncated.vtc to send 0-CRLF" and
"BUG/MINOR: mux-h2: Properly handle full or truncated HTX messages on shut"),
this script can be reenabled for FreeBSD.
2025-02-18 17:35:00 +01:00
Christopher Faulet
b70921f2c1 BUG/MINOR: mux-h2: Properly handle full or truncated HTX messages on shut
On shut, truncated HTX messages were not properly handled by the H2
multiplexer. Depending on how data were emitted, a chunked HTX message
without the 0-CRLF could be considered as full and an empty data with ES
flag set could be emitted instead of a RST_STREAM(CANCEL) frame.

In the H2 multiplexer, when a shut is performed, an HTX message is
considered as truncated if more HTX data are still expected. It is based on
the presence or not of the H2_SF_MORE_HTX_DATA flag on the H2 stream.
However, this flag is set or unset depending on the HTX extra field
value. This field is used to state how much data that must still be
transferred, based on the announced data length. For a message with a
content-length, this assumption is valid. But for a chunked message, it is
not true. Only the length of the current chunk is announced. So we cannot
rely on this field in that case to know if a message is full or not.

Instead, we must rely on the HTX start-line flags to know if more HTX data
are expected or not. If the xfer length is known (the HTX_SL_F_XFER_LEN flag
is set on the HTX start-line), it means that more data are always expected,
until the end of message is reached (the HTX_FL_EOM flag is set on the HTX
message). This is true for bodyless message because the end of message is
reported with the end of headers. This is also true for tunneled messages
because the end of message is received before switching the H2 stream in
tunnel mode.

This patch must be backported as far as 2.8.
2025-02-18 17:34:59 +01:00
Christopher Faulet
b93e419750 REGTESTS: Fix truncated.vtc to send 0-CRLF
When a chunked messages is sent, the 0-CRLF must be explicitely sent. Since
the begining, it is missing. Just add it.
2025-02-18 17:34:59 +01:00
Willy Tarreau
af5c07eee9 DEV: h2: fix flags for the continuation frame
It's flag 2 (end of headers) that's defined there, not 3 (padded).
2025-02-18 14:17:17 +01:00
Amaury Denoyelle
2715dbe9d0 BUG/MINOR: mux-quic: prevent crash after MUX init failure
qmux_init() may fail for several reasons. In this case, connection
resources are freed and underlying and a CONNECTION_CLOSE will be
emitted via its quic_conn instance.

In case of qmux_init() failure, qcc_release() is used to clean up
resources, but QCC <conn> member is first resetted to NULL, as
connection released must be delayed. Some cleanup operations are thus
skipped, one of them is the resetting of <ctx> connection member to
NULL. This may cause a crash as <ctx> is a dangling pointer after QCC
release. One of the possible reproducer is to activate QMUX traces,
which will cause a segfault on the qmux_init() error leave trace.

To fix this, simply reset <ctx> to NULL manually on qmux_init() failure.

This must be backported up to 3.0.
2025-02-18 11:02:46 +01:00
Amaury Denoyelle
2cdc4695cb BUG/MINOR: quic: prevent crash on conn access after MUX init failure
Initially, QUIC-MUX was responsible to reset quic_conn <conn> member to
NULL when MUX was released. This was performed via qcc_release().

However, qcc_release() is also used on qmux_init() failure. In this
case, connection must be freed via its session, so QCC <conn> member is
resetted to NULL prior to qcc_release(), which prevents quic_conn <conn>
member to also be resetted. As the connection is freed soon after,
quic_conn <conn> is a dangling pointer, which may cause crashes.

This bug should be very rare as first it implies that QUIC-MUX
initialization has failed (for example due to a memory alloc error).
Also, <conn> member is rarely used by quic_conn instance. In fact, the
only reproducible crash was done with QUIC traces activated, as in this
case connection is accessed via quic_conn under __trace_enabled()
function.

To fix this, detach connection from quic_conn via the XPRT layer instead
of the MUX. More precisely, this is performed via quic_close(). This
should ensure that it will always be conducted, either on normal
connection closure, but also after special conditions such as MUX init
failure.

This should be backported up to 2.6.
2025-02-18 10:43:56 +01:00
Willy Tarreau
607aa57b2e DEV: h2: add a Lua-based HTTP/2 connection tracer
The following config is sufficient to trace H2 exchanges between a client
and a server:

   global
       lua-load "dev/h2/h2-tracer.lua"

   listen h2_sniffer
       mode tcp
       bind :8002
       filter lua.h2-tracer #hex
       server s1 127.0.0.1:8003

The commented "hex" argument will also display full frames in hex (not
recommended). The connections are prefixed with a 3-hex digit number in
order to also support a bit of multiplexing without impacting the reading
too much. The screen is split in two, with the request on the left and
the response on the right. Here's an example of what it does between an
haproxy backend and an haproxy frontend both in H2, when submitted a
curl request for /?s=30k handled by httpterm:

  [001] ### req start
  [001] [PREFACE len=24]
  [001] [SETTINGS sid=0 len=24 (bytes=24)]
  [001]                                          | ### res start
  [001]                                          | [SETTINGS sid=0 len=18 (bytes=27)]
  [001]                                          | [SETTINGS ACK sid=0 len=0 (bytes=0)]
  [001] [SETTINGS ACK sid=0 len=0 (bytes=56)]
  [001] [HEADERS EH+ES sid=1 len=47 (bytes=47)]
  [001]                                          | [HEADERS EH sid=1 len=101 (bytes=15351)]
  [001]                                          | [DATA sid=1 len=15126 (bytes=15241)]
  [001]                                          | [DATA sid=1 len=1258 (bytes=106)]
  [001]                                          |                  ... -106 = 1152
  [001]                                          |                    ... -1152 = 0
  [001] [WINDOW_UPDATE sid=1 len=4 (bytes=43)]
  [001] [WINDOW_UPDATE sid=0 len=4 (bytes=30)]
  [001] [WINDOW_UPDATE sid=1 len=4 (bytes=17)]
  [001] [WINDOW_UPDATE sid=0 len=4 (bytes=4)]
  [001]                                          | [DATA ES sid=1 len=14336 (bytes=14336)]
  [001] [WINDOW_UPDATE sid=0 len=4 (bytes=4)]
  [001] ### req end: 31080 bytes total
  [001]                                          | [GOAWAY sid=0 len=8 (bytes=8)]
  [001]                                          | ### res end: 31097 bytes total

It deserves some improvements. For instance at the moment it does not
verify the preface, any 24 bytes will work. It does not perform any
protocol validation either. Detecting some issues such as out-of-sequence
frames could be helpful. But it already helps as-is.
2025-02-18 09:26:15 +01:00
William Lallemand
764f6910ed DOC: configuration: document the "crt" frontend keyword
Document the "crt" keyword of frontend and listen section.
2025-02-17 18:26:37 +01:00
William Lallemand
cd6a02ace9 MEDIUM: ssl/crtlist: "crt" keyword in frontend
This patch implements the "crt" keywords in frontend, declaring an
implicit crt-list named after the frontend.

The patch is split in two steps:

The first step is the crt keyword parser, which parses crt lines and
fill a "cfg_crt_node" struct containing a ssl_bind_conf and a
ckch_conf which are put in a list to be used later.

After parsing the frontend section, as a 2nd step, a
post_section_parser is called, it will create a crt-list named after
the frontend and will fill it with certificates from the list of
cfg_crt_node. Once created this crt-list will be loaded in every "ssl"
bind lines that didn't declare any crt or crt-list.

Example:

    listen https
       bind :443 ssl
       crt foobar.pem
       crt test1.net.crt key test1.net.key

Implements part of #2854
2025-02-17 18:26:37 +01:00
William Lallemand
82f927817e MINOR: ssl/ckch: return from ckch_conf_clean() when conf is NULL
ckch_conf_clean() mustn't be executed when the argument is NULL, this
will keep the API more consistant like any free() function.
2025-02-17 18:26:37 +01:00
William Lallemand
0330011acf MINOR: ssl/crtlist: handle crt_path == cc->crt in crtlist_load_crt()
Handle the case where crt_path == cc->crt, so the pointer doesn't get
free'd before getting strdup'ed in crtlist_load_crt().
2025-02-17 18:26:37 +01:00
William Lallemand
69163cd63e MINOR: ssl/crtlist: split the ckch_conf loading from the crtlist line parsing
ckch_conf loading is not that simple as it requires to check
- if the cert already exists in the ckchs_tree
- if the ckch_conf is compatible with an existing cert in ckchs_tree
- if the cert is a bundle which need to load multiple ckch_store

This logic could be reuse elsewhere, so this commit introduce the new
crtlist_load_crt() function which does that.
2025-02-17 18:26:37 +01:00
Christopher Faulet
ca79ed5eef BUG/MINOR: fcgi: Don't set the status to 302 if it is already set
When a "Location" header was found in a FCGI response, the status code was
forced to 302. But it should only be performed if no status code was set
first.

So now, we take care to not override an already defined status code when the
"Location" header is found.

This patch should fix the issue #2865. It must backported to all stable
versions.
2025-02-17 16:37:53 +01:00
Christopher Faulet
34542d5ec2 BUG/MEDIUM: filters: Handle filters registered on data with no payload callback
An HTTP filter with no http_payload callback function may be registered on
data. In that case, this filter is obviously not called when some data are
received but it remains important to update its internal state to be sure to
keep it synchronized on the stream, especially its offet value. Otherwise,
the wrong calculation on the global offset may be performed in
flt_http_end(), leading to an integer overflow when data are moved from
input to output. This overflow triggers a BUG_ON() in c_adv().

The same is true for TCP filters with no tcp_payload callback function.

This patch must be backport to all stable versions.
2025-02-17 16:16:29 +01:00
Christopher Faulet
49b7bcf583 BUG/MINOR: cli: Wait for the last ACK when FDs are xferred from the old worker
On reload, the new worker requests bound FDs to the old one. The old worker
sends them in message of at most 252 FDs. Each message is acknowledged by
the new worker. All messages sent or received by the old worker are handled
manually via sendmsg/recv syscalls. So the old worker must be sure consume
all the ACK replies. However, the last one was never consumed. So it was
considered as a command by the CLI applet. This issue was hidden since
recently. But it was the root cause of the issue #2862.

Note this last ack is also the first one when there are less than 252 FDs to
transfer.

This patch must be backported to all stable versions.
2025-02-17 15:31:07 +01:00
Christopher Faulet
972ce87676 BUG/MEDIUM: cli: Be sure to drop all input data in END state
Commit 7214dcd ("BUG/MEDIUM: applet: Don't pretend to have more data to
handle EOI/EOS/ERROR") revealed a bug with the CLI applet. Pending input
data when the applet is in CLI_ST_END state were never consumed or dropped,
leading to a wakeup loop.

The CLI applet implements its own snd_buf callback function. It is important
it consumes all pending input data. Otherwise, the applet is woken up in
loop until it empties the request buffer. Another way to fix the issue would
be to report an error. But in that case, it seems reasonnable to drop these
data.

The issue can be observed on reload, in master/worker mode, because of issue
about the last ACK message which was never consummed by the _getsocks()
command.

This patch should fix the issue #2862. It must be backported to 3.1 with the
commit above.
2025-02-17 15:31:07 +01:00
William Lallemand
ab2fa95bdd BUG/MINOR: startup: hap_register_feature() fix for partial feature name
In patch 2fe4cbd8e ("MINOR: startup: allow hap_register_feature() to
enable a feature in the list"), the ability to overwrite a '-' in the
feature list was added. However the code was not tokenizing correctly
the string, and partial feature name found in the name could result in
having the same feature name multiple time.

This patch rewrites the lookup of the string by tokenizing it correctly.
2025-02-17 14:56:09 +01:00
William Lallemand
7268e9c249 BUG/MINOR: startup: leave at first post_section_parser which fails
Since we are now iterating on post_section_parser() for a same keyword,
we need to exit at the first ERR_ABORT.

The post_section_parser() is called when parsing a new section, but also
at the end of the file to be called for the last section.

The changes in 4de86bb ("MEDIUM: initcall: allow to register mutiple
post_section_parser per section") should have added tests on the
ERR_ABORT value.

Also pcs->post_section_parser() must be called instead of
cs->post_section_parser() because we could have a NULL ptr.

This bug does not affect anything since we don't use
REGISTER_CONFIG_POST_SECTION() yet.
2025-02-17 11:21:20 +01:00
Amaury Denoyelle
32691e7c25 MINOR: quic: support frame type as a varint
QUIC frame type is encoded as a variable-length integer. Thus, 64-bit
integer should be used for them. Currently, this was not the case as
type was represented as a 1-byte char inside quic_frame structure. This
does not cause any issue with QUIC from RFC9000, as all frame types fit
in this range. Furthermore, a QUIC implementation is required to use the
smallest size varint when encoding a frame type.

However, the current code is unable to accept QUIC extension with bigger
frame types. This is notably the case for quic-on-streams draft. Thus,
this commit readjusts quic_frame architecture to be able to support
higher frame type values.

First, type field of quic_frame is changed to a 64-bits variable. Both
encoding and decoding frame functions uses variable-length integer
helpers to manipulate the frame type field.

Secondly, the quic_frame builders/parsers infrastructure is still
preserved. However, it could be impossible to define new large frame
type as an index into quic_frame_builders / quic_frame_parsers arrays.
Thus, wrapper functions are now provided to access the builders and
parsers. Both qf_builder() and qf_parser() wrappers can then be extended
to return custom builder/parser instances for larger frame type.

Finally, unknown frame type detection also uses the new wrapper
quic_frame_is_known(). As with builders/parsers, for large frame type,
this function must be manually completed to support a new type value.
2025-02-14 09:00:05 +01:00
William Lallemand
2fe4cbd8e5 MINOR: startup: allow hap_register_feature() to enable a feature in the list
This patch allows hap_register_feature() to enable a feature in the list
which was already registered and marked disabled.

This way we could enable automatically some features under certain
condition without the need of the USE argument with make and correctly
report its activation.
2025-02-14 00:09:17 +01:00
William Lallemand
7034f2ca48 MINOR: ssl: store the filenames resulting from a lookup in ckch_conf
With this patch, files resulting from a lookup (*.key, *.ocsp,
*.issuer etc) are now stored in the ckch_conf.

It allows to see the original filename from where it was loaded in "show
ssl cert <filename>"
2025-02-13 17:44:00 +01:00
Willy Tarreau
a4d65c9cc8 DOC: watchdog: document the sequence of the watchdog and panic
Each time we go into the watchdog and panic code, it's super hard to
figure who calls what since signals are involved to bounce between
threads. Let's document the main principles and sequences to ease the
journey next time.
2025-02-13 16:45:07 +01:00
William Lallemand
0c0b38d64c MINOR: ssl/cli: display more filenames in 'show ssl cert'
"show ssl cert <file>" only displays a unique filename, which is the
key used in the ckch_store tree. This patch extends it by displaying
every filenames from the ckch_conf that can be configured with the
crt-store.

In order to be more consistant, some changes are needed in the future:
- we need to store the complete path in the ckch_conf (meaning with
  crt-path or key-path)
- we need to fill a ckch_conf in cases the files are autodiscovered
2025-02-13 16:18:06 +01:00
William Lallemand
5a7cbb8d81 BUG/MINOR: ssl/cli: "show ssl crt-list" lacks sigals
1d3c8223 ("MINOR: ssl: allow to change the server signature algorithm")
mplemented the sigals keyword in the crt-list but never the dump of the
keyword over the CLI.

Must be backported as far as 2.8.
2025-02-12 17:16:50 +01:00
William Lallemand
037d2e5498 BUG/MINOR: ssl/cli: "show ssl crt-list" lacks client-sigals
b6ae2aafde43 ("MINOR: ssl: allow to change the signature algorithm for
client authentication") implemented the client-sigals keyword in the
crt-list but never the dump of the keyword over the CLI.

Must be backported as far as 2.8.
2025-02-12 17:16:50 +01:00
Willy Tarreau
561319bd1c BUG/MEDIUM: fd: mark FD transferred to another process as FD_CLONED
The crappy epoll API stroke again with reloads and transferred FDs.
Indeed, when listening sockets are retrieved by a new worker from a
previous one, and the old one finally stops listening on them, it
closes the FDs. But in this case, since the sockets themselves were
not closed, epoll will not unregister them and will continue to
report new activity for these in the old process, which can only
observe, count an fd_poll_drop event and not unregister them since
they're not reachable anymore.

The unfortunate effect is that long-lasting old processes are woken
up at the same rate as the new process when accepting new connections,
and can waste a lot of CPU. Accept rates divided by 8 were observed on
a small test involving a slow transfer on 10 connections facing a reload
every second so that 10 processes were busy dealing with them while
another process was hammering the service with new connections.

Fortunately, years ago we implemented a flag FD_CLONED exactly for
similar purposes. Let's simply mark transferred FDs with FD_CLONED
so that the process knows that these ones require special treatment
and have to be manually unregistered before being closed. This does
the job fine, now old processes correctly unregister the FD before
closing it and no longer receive accept events for the new process.

This needs to be backported to all stable versions. It only affects
epoll, as usual, and this time in combination with transferred FDs
(typically reloads in master-worker mode). Thanks to Damien Claisse
for providing all detailed measurements and statistics allowing to
understand and reproduce the problem.
2025-02-12 16:35:01 +01:00
Amaury Denoyelle
e2744d23be MINOR: quic: refactor CRYPTO encoding and splitting
This patch is the direct follow-up of the previous one which refactor
STREAM frame encoding. Reuse the newly defined quic_strm_frm_fillbuf()
and quic_strm_frm_split() functions for CRYPTO frame encoding.

The code for CRYPTO and STREAM frames encoding should now be clearer as
it is mostly identical.
2025-02-12 15:10:54 +01:00
Amaury Denoyelle
f96af8e463 MINOR: quic: refactor STREAM encoding and splitting
CRYPTO and STREAM frames encoding is similar. If payload is too large,
frame will be splitted and only the first payload part will be written
in the output QUIC packet. This process is complexified by the presence
of a variable-length integer Length field prior to the payload.

This commit aims at refactor these operations. Define two functions to
simplify the code :
* quic_strm_frm_fillbuf() which is used to calculate the optimal frame
  length of a STREAM/CRYPTO frame with its payload in a buffer
* quic_strm_frm_split() which is used to split the frame payload if
  buffer is too small

With this patch, both functions are now implemented for STREAM encoding.
2025-02-12 15:10:03 +01:00
William Lallemand
0b9a75e878 REGTESTS: stop using truncated.vtc on freebsd
We never succeed to make the truncated.vtc reg-test work constantly on
the Cirrus FreeBSD CI.

Let's exclude it from the FreeBSD tests so the CI don't break randomly.
2025-02-12 13:34:40 +01:00
William Lallemand
0b47e5fa20 DOC: initcall: name correctly REGISTER_CONFIG_POST_SECTION()
REGISTER_CONFIG_POST_SECTION() was not named correctly.
2025-02-12 13:27:44 +01:00
William Lallemand
6097938209 CI: cirrus-ci: bump FreeBSD image to 14-2
FreeBSD CI since to be broken for a while, try to upgrade the image to
the latest 14.2 version.
2025-02-12 13:18:55 +01:00
William Lallemand
4de86bbbfc MEDIUM: initcall: allow to register mutiple post_section_parser per section
Before this patch, REGISTER_CONFIG_SECTION() allowed to register one and only
one callback (<post>) called after the parsing of a section.

It was limitating because you couldn't register a post callback from anywhere
else in the code.

This patch introduces the new REGISTER_CONFIG_SECTION_POST() macros which allows
to register a new post callback for a section keyword from anywhere.

This patch introduces the feature by allowing `struct cfg_section` entries that
does not have a `section_parser`, and then iterating on all cfg_section with a
post_section_parser for a keyword.
2025-02-12 12:52:41 +01:00
William Lallemand
5c2039b5b8 CLEANUP: mworker: "program" section does not have a post_section_parser anymore
The "program" section does not have a post_section_parser anymore so no
need to make an exception for it.
2025-02-12 12:37:01 +01:00
William Lallemand
313eeae7db BUG/MINOR: mworker: post_section_parser for the last section in discovery
Previous patch 2c270a05f ("BUG/MINOR: mworker: section ignored in
discovery after a post_section_parser") needs an adjustment for the last
section of the file.

Indeed the post_section_parser of the last section must not be called in
discovery mode.

Must be backported in 3.1.
2025-02-12 12:34:57 +01:00
William Lallemand
2c270a05f0 BUG/MINOR: mworker: section ignored in discovery after a post_section_parser
When a new section is discovered, the post_section_parser of the
previous section is called. However in the new master-worker mode the
discovery mode will skip the post_section_parser. But instead of
trying to parse the current section keyword after that, it would skip
completely the current line.

This is a minor bug since there isn't a lot of section with
post_section_parser, and not a lot of section to parse in discovery
mode.

But this could be reproduced like this:

	global
	        expose-deprecated-directives

	resolvers res
		parse-resolv-conf

	program foo
	        command sleep 10

	program bar
	       command sleep 10

Ths 'resolvers' section has a post_section_parser which will be ignored
in discovery mode with the consequence of ignoring the first program
section.

This must be backported in 3.1.
2025-02-12 12:18:17 +01:00
Amaury Denoyelle
731340afbd MINOR: quic: simplify length calculation for STREAM/CRYPTO frames
STREAM and CRYPTO frames have a similar encoding format. In particular,
both of them have a variable-length integer Length field just before the
frame payload.

It is complex to determine the optimal Length value before copying the
payload data in the remaining buffer space. As such, helper functions
were implemented to calculate this. However, CRYPTO and STREAM frames
encoding implementation were not completely aligned, which renders the
code harder to follow.

The purpose of this commit is to simplify CRYPTO and STREAM frames
encoding. First, a new helper quic_int_cap_length() is defined which is
useful to determine the optimal buffer room available if prefixed by a
variable-length integer as Length field. Then, processing of both CRYPTO
and STREAM frames is now nearly identical, based on this new helper
function. Functions max_available_room() and max_stream_data_size() are
now unused and are removed.
2025-02-12 11:51:09 +01:00
Amaury Denoyelle
e6a223542a BUG/MINOR: quic: fix CRYPTO payload size calcul for encoding
Function max_stream_data_size() is used to determine the payload length
of a CRYPTO frame. It takes into account that the CRYPTO length field is
a variable length integer.

Implemented calcul was incorrect as it reserved too much space as a
frame header. This error is mostly due because max_stream_data_size()
reuses max_available_room() which also reserve space for a variable
length integer. This results in CRYPTO frames shorter of 1 to 2 bytes
than the maximum achievable value, which produces in the end datagram
shorter than the MTU.

Fix max_stream_data_size() implementation. It is now merely a wrapper on
max_available_room(). This ensures that CRYPTO frame encoding is now
properly optimized to use the MTU available.

This should be backported up to 2.6.
2025-02-12 11:51:09 +01:00
Amaury Denoyelle
63747452a3 BUG/MINOR: quic: reserve length field for long header encoding
Long header packets have a mandatory Length field, which contains the
size of Packet number and payload, encoded as a variable-length integer.
Its value can thus only be determined after the payload size is known,
which depends on the remaining buffer space after this variable-length
field.

Packet payload are encoded in two steps. First, a list of input frames
is processed until the packet buffer is full. CRYPTO and STREAM frames
payload can be splitted if need to fill the buffer. Real encoding is
then performed as a second stage operation, first with Length field,
then with the selected frames themselves.

Before this patch, no space was reserved in the buffer for Length field
when attaching the frames to the packet. This could result in a error as
the packet payload would be too large for the remaining space.

In practice, this issue was rarely encounted, mostly as a side-effect
from another issue linked to CRYPTO frame encoding. Indeed, a wrong
calculation is performed on CRYPTO splitting, which results in frame
payload shorter by a few bytes than expected. This however ensured there
would be always enough room for the Length field and payload during
encoding. As CRYPTO frames are the only big enough content emitted with
a Long header packet, this renders the current issue mostly non
reproducible.

Fix the original issue by reserving some space for Length field prior to
frame payload calculation, using a maximum value based on the remaining
room space. Packet length is then reduced if needed when encoding is
performed, which ensures there is always enough room for the selected
frames.

Note that the other issue impacting CRYPTO frame encoding is not yet
fixed. This could result in datagrams with Long header packets not
completely extended to the full MTU. The issue will be addressed in
another patch.

This should be backported up to 2.6.
2025-02-12 11:51:09 +01:00
Willy Tarreau
627280e15f MAJOR: leastconn: postpone the server's repositioning under contention
When leastconn is used under many threads, there can be a lot of
contention on leastconn, because the same node has to be moved around
all the time (when picking it and when releasing it). In GH issue #2861
it was noticed that 46 threads out of 64 were waiting on the same lock
in fwlc_srv_reposition().

In such a case, the accuracy of the server's key becomes quite irrelevant
because nobody cares if the same server is picked twice in a row and the
next one twice again.

While other approaches in the past considered using a floating key to
avoid moving the server each time (which was not compatible with the
round-robin rule for equal keys), here a more drastic solution is needed.
What we're doing instead is that we turn this lock into a trylock. If we
can grab it, we do the job. If we can't, then we just wake up a server's
tasklet dedicated to this. That tasklet will then try again slightly
later, knowing that during this short time frame, the server's position
in the queue is slightly inaccurate. Note that any thread touching the
same server will also reposition it and save that work for next time.
Also if multiple threads wake the tasklet up, then that's fine, their
calls will be merged and a single lock will be taken in the end.

Testing this on a 24-core EPYC 74F3 showed a significant performance
boost from 382krps to 610krps. The performance profile reported by
perf top dropped from 43% to 2.5%:

Before:
  Overhead  Shared Object             Symbol
    43.46%  haproxy-master-inlineebo  [.] fwlc_srv_reposition
    21.20%  haproxy-master-inlineebo  [.] fwlc_get_next_server
     0.91%  haproxy-master-inlineebo  [.] process_stream
     0.75%  [kernel]                  [k] ice_napi_poll
     0.51%  [kernel]                  [k] tcp_recvmsg
     0.50%  [kernel]                  [k] ice_start_xmit
     0.50%  [kernel]                  [k] tcp_ack

After:
  Overhead  Shared Object             Symbol
    30.37%  haproxy                   [.] fwlc_get_next_server
     2.51%  haproxy                   [.] fwlc_srv_reposition
     1.91%  haproxy                   [.] process_stream
     1.46%  [kernel]                  [k] ice_napi_poll
     1.36%  [kernel]                  [k] tcp_recvmsg
     1.04%  [kernel]                  [k] tcp_ack
     1.00%  [kernel]                  [k] skb_release_data
     0.96%  [kernel]                  [k] ice_start_xmit
     0.91%  haproxy                   [.] conn_backend_get
     0.82%  haproxy                   [.] connect_server
     0.82%  haproxy                   [.] run_tasks_from_lists

Tested on an Ampere Altra with 64 aarch64 cores dedicated to haproxy,
the gain is even more visible (3.6x):

  Before: 311-323k rps, 3.16-3.25ms, 6400% CPU
  Overhead  Shared Object     Symbol
    55.69%  haproxy-master    [.] fwlc_srv_reposition
    33.30%  haproxy-master    [.] fwlc_get_next_server
     0.89%  haproxy-master    [.] process_stream
     0.45%  haproxy-master    [.] h1_snd_buf
     0.34%  haproxy-master    [.] run_tasks_from_lists
     0.32%  haproxy-master    [.] connect_server
     0.31%  haproxy-master    [.] conn_backend_get
     0.31%  haproxy-master    [.] h1_headers_to_hdr_list
     0.24%  haproxy-master    [.] srv_add_to_idle_list
     0.23%  haproxy-master    [.] http_request_forward_body
     0.22%  haproxy-master    [.] __pool_alloc
     0.21%  haproxy-master    [.] http_wait_for_response
     0.21%  haproxy-master    [.] h1_send

  After: 1.21M rps, 0.842ms, 6400% CPU
  Overhead  Shared Object     Symbol
    17.44%  haproxy           [.] fwlc_get_next_server
     6.33%  haproxy           [.] process_stream
     4.40%  haproxy           [.] fwlc_srv_reposition
     3.64%  haproxy           [.] conn_backend_get
     2.75%  haproxy           [.] connect_server
     2.71%  haproxy           [.] h1_snd_buf
     2.66%  haproxy           [.] srv_add_to_idle_list
     2.33%  haproxy           [.] run_tasks_from_lists
     2.14%  haproxy           [.] h1_headers_to_hdr_list
     1.56%  haproxy           [.] stream_set_backend
     1.37%  haproxy           [.] http_request_forward_body
     1.35%  haproxy           [.] http_wait_for_response
     1.34%  haproxy           [.] h1_send

And at similar loads, the CPU usage considerably drops (3.55x), as
well as the response time (10x):

  After: 320k rps, 0.322ms, 1800% CPU
  Overhead  Shared Object     Symbol
     7.62%  haproxy           [.] process_stream
     4.64%  haproxy           [.] h1_headers_to_hdr_list
     3.09%  haproxy           [.] h1_snd_buf
     3.08%  haproxy           [.] h1_process_demux
     2.22%  haproxy           [.] __pool_alloc
     2.14%  haproxy           [.] connect_server
     1.87%  haproxy           [.] h1_send
   > 1.84%  haproxy           [.] fwlc_srv_reposition
     1.84%  haproxy           [.] run_tasks_from_lists
     1.77%  haproxy           [.] sock_conn_iocb
     1.75%  haproxy           [.] srv_add_to_idle_list
     1.66%  haproxy           [.] http_request_forward_body
     1.65%  haproxy           [.] wake_expired_tasks
     1.59%  haproxy           [.] h1_parse_msg_hdrs
     1.51%  haproxy           [.] http_wait_for_response
   > 1.50%  haproxy           [.] fwlc_get_next_server

The cost of fwlc_get_next_server() naturally increases as the server
count increases, but now has no visible effect on updates. The load
distribution remains unchanged compared to the previous approach,
the weight still being respected.

For further improvements to the fwlc algo, please consult github
issue #881 which centralizes everything related to this algorithm.
2025-02-12 11:48:10 +01:00
Willy Tarreau
b6a8318cc2 MEDIUM: server: allocate a tasklet for asyncronous requeuing
This creates a tasklet that only expects to be called when the LB
algorithm is under contention when trying to reposition the server
in its tree. Indeed, that's one of the operations that usually
requires to take a write lock on a highly contended area, often
for very little benefits under contention; indeed, under load, if
a server keeps its previous position for a few extra microseconds,
usually there's no harm. Thus this new tasklet can be woken up by
the LB algo to ask the server to later call lbprm.server_requeue().
It does nothing else.
2025-02-11 17:24:09 +01:00
Willy Tarreau
20b8c4ddba MINOR: lbprm: add a new callback ->server_requeue to the lbprm
This callback will be used to reposition a server to its expected
position regardless of the fact that it was taken or dropped. It
will only be used by supporting LB algos. For now, only fwlc defines
it and assigns it to fwlc_srv_reposition(). At the moment it's not
used yet.
2025-02-11 17:16:14 +01:00
Willy Tarreau
eced1d6d8a DEBUG: thread: reduce the struct lock_stat to store only 30 buckets
Storing only 30 buckets means we only keep 256 bytes per label. This
further simplifies address calculation and reduces the memory used
without complicating the locking code. It means we won't measure wait
times larger than a second but we're not supposed to face this as it
would trigger the watchdog anyway. It may become a little bit just if
measuring using rdtsc() instead of now_mono_time() though (typically
the limit would be around 350ms for a 3 GHz CPU).
2025-02-10 18:34:43 +01:00
Willy Tarreau
c2f2d6fd3c DEBUG: thread: make lock_stat per operation instead of for all operations
It's more convenient (and more readable) to have the lock stats arranged
by operation type (read, seek, write). It will also allow to later simplify
the structure format and the bucket address calculation. Now lock_stat[]
got split into lock_stats_rd[], lock_stats_sk[], lock_stats_wr[].
2025-02-10 18:34:43 +01:00
Willy Tarreau
4168d1278c DEBUG: thread: don't keep the redundant _locked counter
Now that we have our sums by bucket, the _locked counter is redundant
since it's always equal to the sum of all entries. Let's just get rid
of it and replace its consumption with a loop over all buckets, this
will reduce the overhead of taking each lock at the expense of a tiny
extra effort when dumping all locks, which we don't care about.
2025-02-10 18:34:43 +01:00
Willy Tarreau
a22550fbd7 DEBUG: thread: report the wait time buckets for lock classes
In addition to the total/average wait time, we now also store the
wait time in 2^N buckets. There are 32 buckets for each type (read,
seek, write), allowing to store wait times from 1-2ns to 2.1-4.3s,
which is quite sufficient, even if we'd want to switch from NS to
CPU cycles in the future. The counters are only reported for non-
zero buckets so as not to visually pollute the output.

This significantly inflates the lock_stat struct, which is now
aligned to 256 bytes and rounded up to 1kB. But that's not really
a problem, given that there's only one per lock label.
2025-02-10 18:34:43 +01:00
Willy Tarreau
0b849c59fb DEBUG: thread: make lock time computation more consistent
The lock time computation was a bit inconsistent between functions,
particularly those using a try_lock. Some of them would count the lock
as taken without counting the time, others would simply not count it.
This is essentially due to the way the time is retrieved, as it was
done inside the atomic increment.

Let's instead always use start_time to carry the elapsed time, by
presetting it to the negative time before the event and addinf the
positive time after, so that it finally contains the duration. Then
depending on the try lock's success, we add the result or not. This
was generalized to all lock functions for consistency, and because
this will be handy for future changes.
2025-02-10 18:34:43 +01:00
Willy Tarreau
99a88ee904 DEBUG: thread: report the spin lock counters as seek locks
Technically speaking, spin locks use a seek lock, not a write lock,
so better count them appropriately for consistency (lock time, or
function calls count).
2025-02-10 18:34:43 +01:00
Willy Tarreau
7ddcdff33f BUG/MEDIUM: debug: close a possible race between thread dump and panic()
The rework of the thread dumping mechanism in 2.8 with commit 9a6ecbd590
("MEDIUM: debug: simplify the thread dump mechanism") opened a small
race, which is that a thread in the process of dumping other ones may
block the other one from panicing while it's looping at the end of
ha_thread_dump_fill(), or any other sequence involving the currently
dumped one.

This was emphasized in 3.1 with commit 148eb5875f ("DEBUG: wdt: better
detect apparently locked up threads and warn about them") that allowed
to emit warnings about long-stuck threads, because in this case, what
happens is that sometimes a thread starts to emit a warning (or a set
of warnings), and while the warning is being awaited for, a panic
finally happens and interrupts either the dumping thread, which never
finishes and waits for the target's pointer to become NULL which will
never happen since it was supposed to do it itself, or the currently
dumped thread which could wait for the dumping thread to become ready
while this one has not released the former.

In order to address this, first we now make sure never to dump a thread
that is already in the process of dumping another one. We're adding a
new thread flag to know this situation, that is set in ha_thread_dump_fill()
and cleared in ha_thread_dump_done(). And similarly, we don't trigger
the watchdog on a thread waiting for another one to finish its dump,
as it's likely a case of warning (and maybe even a panic) that makes
them wait for each other and we don't want such cases to be reentrant.
Finally, we check in the main polling loop that the flag never accidentally
leaked (e.g. wrong flag manipulation) as this would be difficult to spot
with bad consequences.

This should be backported at least to 2.8, and should resolve github
issue #2860. Thanks to Chris Staite for the very informative backtrace
that exhibited the problem.
2025-02-10 18:34:26 +01:00
Willy Tarreau
37e84676c7 [RELEASE] Released version 3.2-dev5
Released version 3.2-dev5 with the following main changes :
    - BUG/MINOR: ssl: put ssl_sock_load_ca under SSL_NO_GENERATE_CERTIFICATES
    - CLEANUP: ssl: rename ssl_sock_load_ca to ssl_sock_gencert_load_ca
    - CLEANUP: ssl: move ssl_sock_gencert_load_ca declaration in ssl_gencert.h
    - CLEANUP: tree-wide: define and use acl_match_cond() helper
    - MINOR: epoll: permit to mask certain specific events
    - MINOR: proxies: Add a per-thread group field to struct proxy.
    - MINOR: Add fields to the per-thread group field in struct server.
    - MINOR: proxies/servers: Calculate queueslength and use it.
    - MEDIUM: servers/proxies: Switch to using per-tgroup queues.
    - BUG/MINOR: stream: Properly handle "on-marked-up shutdown-backup-sessions"
    - MEDIUM: stream: Map task wake up reasons to dedicated stream events
    - MEDIUM: stream: No longer use TASK_F_UEVT* to shut a stream down
    - BUILD: tools: fix build on BSD by dropping the ETIME check
    - MINOR: queues: use __ha_cpu_relax() on failed CAS.
    - BUILD: queues: Use unsigned int when needed
    - BUILD: ssl: allow to build without the renegotiation API of WolfSSL
    - BUILD: ssl: more cleaner approach to WolfSSL without renegotiation
    - BUG/MEDIUM: chunk: make sure to flush the trash pool before resizing
    - MINOR: quic: remove references to burst in quic-cc-algo parsing
    - MINOR: quic: allow BBR testing without pacing
    - MINOR: quic: transform pacing settings into a global option
    - MAJOR: quic: mark pacing as stable and enable it by default
    - MINOR: quic: mark BBR as stable
    - MINOR: quic: define quic_tune
    - BUILD: quic: fix overflow in global tune
    - DEBUG: fd: add a counter of takeovers of an FD since it was last opened
    - MINOR: fd: add a generation number to file descriptors
    - DEBUG: epoll: store and compare the FD's generation count with reported event
    - MEDIUM: epoll: skip reports of stale file descriptors
    - MINOR: mux-h1: Add masks to group H1S DEMUX and MUX errors
    - BUG/MINOR: mux-h1: Only report a SE error on demux error
    - MINOR: tevt: Add the termination events log's fundations
    - MINOR: tevt/stconn: Add a termination events log in the SE descriptor
    - MINOR: tevt/mux-h1: Report termination events for the H1C and H1S
    - MINOR: tevt/mux-h2: Report termination events for the H2C
    - MINOR: tevt/stream/stconn: Report termination events for stream and sc
    - MINOR: tevt/conn: Report intercepted event for L4 rules
    - MINOR: tevt/mux-h1/mux-h2: Add termination events log when dumping mux info
    - MINOR: tevt/muxes: Add CTL and SCTL command to get the termination event logs
    - MINOR: tevt/mux-pt: Add support for termination event logs
    - MINOR: tevt/connection: Add dedicated termination events for lower locations
    - MEDIUM: tevt/muxes: Add dedicated termination events for muxc/se locations
    - MINOR: tevt/stconn: Be more accurate to report shutw events
    - MEDIUM: tevt/stconn/stream: Add dedicated termination events for stream location
    - MINOR: tevt: Don't duplicate termination event during reporting
    - MINOR: tevt/applet:  Add limited support for termination event logs for applets
    - MINOR: tevt: Add a sample to get termination events for all locations
    - MINOR: tevt: Improve function to convert a termination events log to string
    - REORG: tevt/connection: Move enums at the end of the header file
    - MINOR: tevt/dev: Add term_events tool
    - MINOR: tevt/connection: Add support for POLL_HUP/POLL_ERR events
    - MINOR: tevt/dev: Parse tuple of termination events
    - BUG/MEDIUM: htx: wrong count computation in htx_xfer_blks()
    - DOC: htx: clarify <mark> parameter for htx_xfer_blks()
    - BUILD: quic: remove GCC undefined error in qc_release_lost_pkts()
    - MEDIUM: htx: prevent <mark> to copy incomplete headers in htx_xfer_blks()
    - BUG/MEDIUM: mux-fcgi: Properly handle read0 on partial records
    - BUG/MINOR: tevt/http-ana: Remove badly placed event reports
    - DEBUG: http-ana: Remove debug counters from HTTP analyzers
    - DEBUG: mux-h1: Remove some debug counters
    - BUG/MINOR: tcp-rules: Don't forward close during tcp-response content rules eval
    - MEDIUM: stream: interrupt costly rulesets after too many evaluations
    - BUG/MINOR: http-check: Don't pretend a C-L heeader is set before adding it
    - BUILD: ssl: remove a boringssl definition defined by recent boringssl libs
    - BUG/MINOR: tevt/mux-h2: Set truncated receive/eos events at SE level on error
    - BUG/MEDIUM: flt-spoe: Set/test applet flags instead of SE flags from I/O handler
    - BUG/MEDIUM: applet: Don't pretend to have more data to handle EOI/EOS/ERROR
    - BUG/MEDIUM: flt-spoe: Properly handle end of stream from the SPOE applet
    - MINOR: flt-spoe: Report end of input immediately after applet init
    - MINOR: mux-spop: Report EOI on the SE when a ACK is received for a stream
    - MINOR: mux-spop: Set SPOP_CF_ERROR flag on connection error only
    - MINOR: tevt/mux-spop:  Report termination events for the SPOP connect/stream
    - CLEANUP: mux-spop: Remove useless comments
    - MINOR: mux-spop: Dump info about connections and streams in dedicated functions
    - MINOR: mux-spop: Implement .show_sd callback function
    - MEDIUM: mux-fcgi: Add a function to propagate termination flags from fstrm to SE
    - BUG/MEDIUM: mux-fcgi: Propagate flags to SE in fcgi_strm_wake_one_stream
    - MINOR: tevt/mux-fcgi:  Report termination events for the FCGI connect/stream
    - MINOR: mux-fcgi: Dump info about connections and streams in dedicated functions
    - MINOR: mux-spop/mux-fcgi: Add support of the debug string for logs
    - BUG/MINOR: cli: Don't set SE flags from the cli applet
    - BUG/MINOR: cli: Fix memory leak on error for _getsocks command
    - BUG/MINOR: cli: Fix a possible infinite loop in _getsocks()
    - BUG/MINOR: config/userlist: Support one 'users' option for 'group' directive
    - BUG/MINOR: auth: Fix a leak on error path when parsing user's groups
    - BUG/MINOR: flt-trace: Support only one name option
    - MINOR: filters: Improve errors formating during filters parsing
    - BUG/MINOR: stats-json: Define JSON_INT_MAX as a signed integer
    - DOC: option redispatch should mention persist options
    - BUG/MINOR: debug: make "debug dev sched" accept a negative TID
    - BUG/MINOR: debug: make sure the "debug dev sched" tasks don't block stopping
    - IMPORT: plock: export the uninlined version of the lock wait function
    - IMPORT: plock: give higher precedence to W than S
    - IMPORT: plock: lower the slope of the exponential back-off
    - IMPORT: plock: use cpu_relax() for a shorter time in EBO
    - Revert "IMPORT: plock: export the uninlined version of the lock wait function"
    - BUG/MEDIUM: ssl: chosing correct certificate using RSA-PSS with TLSv1.3
v3.2-dev5
2025-02-08 05:53:40 +01:00
William Lallemand
3912780b1e BUG/MEDIUM: ssl: chosing correct certificate using RSA-PSS with TLSv1.3
The clienthello callback was written when TLSv1.3 was not yet out, and
signatures algorithm changed since then.

With TLSv1.2, the least significant byte was used to determine the
SignatureAlgorithm, which could be rsa(1), dsa(2), ecdsa(3).
https://datatracker.ietf.org/doc/html/rfc5246#section-7.4.1.4.1

This was used to chose which type of certificate to push to the client.

But TLSv1.3 changed that, and introduced new RSA-PSS algorithms that
does not have the least sinificant byte to 1.
https://datatracker.ietf.org/doc/html/rfc8446#section-4.2.3

This would result in chosing the wrong certificate when an RSA an ECDSA
ones are in the configuration for the same SNI or default entry.

This patch fixes the issue by parsing bothe hash and signature field to
check the RSA-PSS signature scheme.

This must fix issue #2852.

This must be backported in every stable versions. The code was moved
from ssl_sock.c to ssl_clienthello in recent versions.
2025-02-07 20:56:42 +01:00