Several settings control stream multiplexing and the associated
receive window. Previously, all of these settings were configured
using the "tune.quic.frontend." prefix, despite being applied blindly
on both sides.
Fix this by duplicating these settings so that the frontend and
backend sides can be configured separately. Options are also renamed
to use the standardized "tune.quic.[be|fe].stream." prefix notation.
Also, each option is individually renamed to better reflect its purpose
and hide technical details related to QUIC transport parameter naming:
* max-data-size -> stream.rxbuf
* max-streams-bidi -> stream.max-concurrent
* stream-data-ratio -> stream.data-ratio
No need to backport.
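For illustration, the new scheme could look like this in a
configuration (the values are arbitrary examples):

    global
        tune.quic.fe.stream.rxbuf 262144
        tune.quic.fe.stream.max-concurrent 100
        tune.quic.be.stream.data-ratio 30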
Streamline the max-idle-timeout option. Rename it to use the newer
cohesive 'tune.quic.fe|be.' naming scheme.
Two different fields were already defined in the global struct. These
fields are moved into quic_tune along with the other QUIC settings.
However, no parser was defined for the backend option; this commit
fixes that.
No need to backport this.
On frontend side, a quic_conn can have a dedicated FD or use the
listener one. These different modes can be activated via a global QUIC
tune setting.
This patch adjusts the option. First, it is renamed to the more
meaningful 'tune.quic.fe.sock-per-conn'. Also, its arguments are now
either 'default-on' or 'force-off'. The objective is to better
highlight the relationship with the 'quic-socket' bind option.
The older option is deprecated and will be removed in 3.5.
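As an illustration, the renamed option combined with the related bind
keyword might look like this (certificate path and values are
placeholders):

    global
        tune.quic.fe.sock-per-conn default-on

    frontend quic_in
        bind quic4@:443 ssl crt /etc/haproxy/cert.pem quic-socket listener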
A QUIC global tune setting is defined to be able to force Retry emission
prior to handshake. By definition, this ability is only supported by
QUIC servers, hence it is a frontend option only.
Rename the option to use the "fe" prefix. The old option name is
deprecated and will be removed in 3.5.
QUIC global memory can be limited across the entire process via a
global tune setting. Previously, this setting used the misleading
"frontend" prefix. As the limit applies to the sum over all QUIC
connections, from both frontend and backend sides, remove the prefix.
The new option name is "tune.quic.mem.tx-max".
The older option name is deprecated and will be removed in 3.5.
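For example (the value is an arbitrary illustration):

    global
        tune.quic.mem.tx-max 64m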
This patch is similar to the previous one, except that it is focused on
Tx QUIC settings. It is now possible to toggle GSO and pacing on
frontend and backend sides independently.
As with the previous patch, options are renamed to use the unified
"fe/be" prefixes. This is part of the current series of commits which
unify QUIC settings. The older options are deprecated and will be
removed in the 3.5 release.
Various settings can be configured related to the QUIC congestion
controller.
This patch duplicates them to be able to set independent values on
frontend and backend sides.
As with the previous patch, options are renamed to use the unified
"fe/be" prefixes. This is part of the current series of commits which
unify QUIC settings. The older options are deprecated and will be
removed in the 3.5 release.
Previously, QUIC glitches support was only implemented on the frontend
side. Extend this so that the option can be specified separately on the
frontend and backend sides. The function _qcc_report_glitch() now
retrieves the relevant max value based on the connection side.
In addition, the option has been renamed to use the "fe/be" prefixes.
This is part of the current series of commits which unify QUIC
settings. The older options are deprecated and will be removed in the
3.5 release.
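As a rough sketch, the per-side lookup in _qcc_report_glitch() may look
like this (the quic_tune fields are hypothetical; conn_is_back() is the
existing helper):

    /* hypothetical sketch: pick the glitch limit for this connection side */
    static int qcc_glitch_max(const struct qcc *qcc)
    {
            return conn_is_back(qcc->conn) ? quic_tune.be.glitches :
                                             quic_tune.fe.glitches;
    }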
Rename the option to quickly enable/disable all QUIC listeners. It now
takes an on/off argument. The documentation is extended to reflect the
fact that QUIC backends are not impacted by this option.
The older keyword is simply removed. Deprecation is considered
unnecessary as this setting is only useful during debugging.
Remove the parsing code for tune.quic.frontend.conn-tx-buffers.limit.
This option had been deprecated for some time; it was in fact a no-op
and was no longer mentioned in the documentation.
When using a wildcard DNS domain in the ACME configuration, for example
*.example.com, one might think that the challenge_ready command needs
to be used with this domain. But that's not the case: the
challenge_ready command takes the domain asked by the ACME server,
which is stripped of the wildcard.
To avoid confusion, the log message now shows the exact command the
user should send.
The dns-01-record field in the dpapi sink outputs the authentication
token which is needed in the TXT record in order to validate the DNS-01
challenge.
Before waking up the expiration task again at the end of it, make sure
the next date is set. If there's nothing left to do, then task_exp will
be TASK_ETERNITY and we don't want to be woken up again.
As reported by @TimWolla on GH #3168, there was a typo in the shm stats
file BUG_ON() reporting that the size of shm_stats_file_object changed.
No backport needed.
The previous commit switched from ncbuf to ncbmbuf as storage for
received CRYPTO frames. The latter ensures that buffering such frames
cannot fail anymore due to gap sizes.
Previously, extra mechanisms were implemented in the QUIC frame parsing
functions to overcome the ncbuf limitation on gap sizes. Before
insertion, CRYPTO frames were stored in a temporary tree to order their
insertion. As this is not necessary anymore, this commit removes the
temporary tree.
This commit is closely associated with the previous bug fix. As it
provides a neat optimization and code simplification, it can be
backported with it, but not in the next immediate release, to allow
time to spot potential regressions.
In QUIC, TLS handshake messages such as ClientHello are encapsulated in
CRYPTO frames. Each QUIC implementation can split the content into
several frames of random sizes. In fact, this feature is now used by
several clients, based on Chrome's so-called "chaos protection"
mechanism:
https://quiche.googlesource.com/quiche/+/cb6b51054274cb2c939264faf34a1776e0a5bab7
To support this, haproxy uses an ncbuf storage to store received CRYPTO
frames before passing them to the SSL library. However, this storage
suffers from a limitation: gaps between two filled blocks cannot be
smaller than 8 bytes. Thus, depending on the size of received CRYPTO
frames and their order, ncbuf may not be sufficient. Over time, several
mechanisms were implemented in haproxy's QUIC frame parsing to overcome
the ncbuf limitation.
However, recent reports highlight that haproxy is not able to deal with
CRYPTO frame reception from some clients. In particular, this is the
case with the latest ngtcp2 release, which implements a similar chaos
protection mechanism via the following patch. It also seems that this
impacts haproxy's interaction with Firefox.
  commit 89c29fd8611d5e6d2f6b1f475c5e3494c376028c
  Author: Tatsuhiro Tsujikawa <tatsuhiro.t@gmail.com>
  Date:   Mon Aug 4 22:48:06 2025 +0900

      Crumble Client Initial CRYPTO (aka chaos protection)
To fix haproxy's CRYPTO frame buffering once and for all, an
alternative non-contiguous buffer named ncbmbuf has been recently
implemented. This type does not suffer from the gap size limitation,
albeit at the cost of a small reduction in the size available for data
storage.
Thus, the purpose of the current patch is to replace ncbuf with the
newer ncbmbuf for QUIC CRYPTO frame parsing. Now, ncbmb_add() is used
to buffer received frames, and it is guaranteed to succeed. The only
remaining error case is when a received frame's offset plus length
exceeds the ncbmbuf data storage, which results in a
CRYPTO_BUFFER_EXCEEDED error code.
A notable behavior change when switching to the ncbmbuf implementation
is that the NCB_ADD_COMPARE mode cannot be used anymore during add.
Instead, CRYPTO frame content received at the same offset will be
overwritten.
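As a sketch, the add path conceptually becomes something like this (the
size-check helper name is illustrative, not the actual code):

    /* buffering cannot fail anymore because of gap sizes; the only
     * remaining error is a frame exceeding the storage area */
    if (offset + len > ncbmb_size(buf))
            return CRYPTO_BUFFER_EXCEEDED;
    /* cannot fail; overlapping data is simply overwritten */
    ncbmb_add(buf, offset, data, len, NCB_ADD_OVERWRT);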
A final note regarding STREAM frame parsing: for now, it is considered
unnecessary to switch from ncbuf in this case. Indeed, QUIC clients do
not perform aggressive fragmentation on STREAM frames. Keeping ncbuf
ensures that the data storage size remains bigger than the equivalent
ncbmbuf area.
This should fix github issue #3141.
This patch must be backported up to 2.6. It is first necessary to pick
the relevant commits for the ncbmbuf implementation prior to it.
Write some tests for the ncbmbuf type. These tests should be run each
time the ncbmbuf implementation is adjusted. Use the following command:
$ gcc -g -DSTANDALONE -I./include -o ncbmbuf src/ncbmbuf.c && ./ncbmbuf
As with the previous patch, this commit must be backported prior to the
fix to come on QUIC CRYPTO frame parsing.
Implement the ncbmb_advance() function for the ncbmbuf type. It allows
removing bytes at the front of the buffer, regardless of the existing
gaps.
This is implemented by resetting the corresponding bits of the bitmap.
As with the previous patch, this commit must be backported prior to the
fix to come on QUIC CRYPTO frame parsing.
Implement the ncbmb_data() function for the ncbmbuf type. Its purpose
is similar to its ncbuf counterpart: it returns the size in bytes of
the data starting at a specific offset until the next gap.
As with the previous patch, this commit must be backported prior to the
fix to come on QUIC CRYPTO frame parsing.
Extend the private API for the ncbmbuf type by defining an iterator
type for the buffer bitmap handling. The purpose is to provide a simple
method to iterate over the bitmap one byte at a time, with a proper
bitmask set to hide irrelevant bits.
This internal type is unused for now, but will become useful when
implementing the ncbmb_data() and ncbmb_advance() functions.
As with the previous patch, this commit must be backported prior to the
fix to come on QUIC CRYPTO frame parsing.
This patch implements the add operation for the ncbmbuf type.
This function is simpler than its ncbuf counterpart. Indeed, for now
only the NCB_ADD_OVERWRT mode is supported. This compromise was chosen
as ncbmbuf will first be used for QUIC CRYPTO frame handling, which
does not require comparing existing filled blocks during insertion.
As with the previous patch, this commit must be backported prior to the
fix to come on QUIC CRYPTO frame parsing.
Define ncbmbuf which is an alternative non-contiguous buffer
implementation. "bm" abbreviation stands for bitmap, which reflects how
gaps and filled blocks are encoded. The main purpose of this
implementation is to get rid of the ncbuf limitation regarding the
minimal size for gaps between two blocks of data.
This commit adds the new module ncbmbuf. Along with it, some utility
functions such as ncbmb_make(), ncbmb_init() and ncbmb_is_empty() are
defined. The public API of ncbmbuf will be extended in the following
patches.
This patch is not considered a bug fix. However, it will be required to
fix an issue encountered in QUIC CRYPTO frame parsing. Thus, it will be
necessary to backport the current patch prior to the fix to come.
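To illustrate the principle (a simplified standalone sketch, not the
actual haproxy layout): one bit of bitmap tracks one byte of storage,
so filled blocks and gaps of any size can be encoded, at the price of a
small share of the area being spent on metadata:

    #include <stddef.h>

    /* simplified model: bit (i % 8) of bitmap[i / 8] tells whether data
     * byte <i> is filled (1) or part of a gap (0) */
    struct bm_sketch {
            unsigned char *data;   /* payload storage */
            unsigned char *bitmap; /* one bit per payload byte */
            size_t size;           /* usable payload size in bytes */
    };

    /* mark <len> bytes starting at <off> as filled; gaps of any size,
     * even a single byte, can remain around the block */
    static void bm_fill(struct bm_sketch *b, size_t off, size_t len)
    {
            for (size_t i = off; i < off + len && i < b->size; i++)
                    b->bitmap[i / 8] |= (unsigned char)(1 << (i % 8));
    }

    /* return the number of contiguous filled bytes at <off>, i.e. the
     * data available until the next gap */
    static size_t bm_data(const struct bm_sketch *b, size_t off)
    {
            size_t i = off;

            while (i < b->size && (b->bitmap[i / 8] & (1 << (i % 8))))
                    i++;
            return i - off;
    }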
The doc in commit 977feb5617 ("DOC: api: update the pools API with the
alignment and typed declarations") says that alignment of zero means
the type's alignment. And this is followed by the DECLARE_TYPED_POOL()
macro. Yet this is not what is done in create_pool_from_reg(), which
only raises the alignment to that of a void* if lower, while it should
start from the type's. The effect is haproxy refusing to start on some
32-bit platforms since that commit, displaying an error such as:
"BUG in the code: at src/mux_h2.c:454, requested creation of pool
'h2s' aligned to 4 while type requires alignment of 8! Please
report to developers. Aborting."
Let's just apply the type's alignment by default.
Thanks to @tianon for reporting this in GH issue #3168. No backport is
needed since aligned pools are 3.3-only.
Recently, proper support for forwarding interim responses to HTTP/3
clients has been implemented. However, there was still an issue when
two responses are both encoded in the same snd_buf() iteration.
The issue is caused by the H3 HEADERS frame encoding method: 5 bytes
are reserved in front of the buffer to encode both the H3 frame type
and the varint length field. After proper headers encoding, the output
buffer head is adjusted so that the length can be encoded using the
minimal varint size.
However, if the buffer is not empty due to a previous response already
encoded but not yet emitted, messing with the buffer head will corrupt
the entire H3 message. This only happens when encoding of both responses
is done in the same snd_buf() iteration, or at least without emission
to the quic_conn layer in between.
The result of this bug is that the HTTP/3 client will be unable to
parse the response, most of the time reporting a formatting error. This
can be reproduced using the following netcat command as an HTTP/1
server behind haproxy:
$ while sleep 0.2; do \
printf "HTTP/1.1 100 continue\r\n\r\nHTTP/1.1 200 ok\r\nContent-length: 5\r\nConnection: close\r\n\r\nblah\n" | nc -lp8002
done
To fix this, only adjust the buffer head if the buffer is empty. If
this is not the case, the frame length is simply encoded as a 4-byte
varint so that messages remain contiguous in the buffer.
This must be backported up to 2.6.
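For reference, the fixed-size encoding mentioned above corresponds to
the 4-byte form of a QUIC variable-length integer (RFC 9000); a
standalone sketch, not the haproxy encoder:

    #include <stdint.h>

    /* encode <len> as a fixed 4-byte QUIC varint: the 0b10 two-bit
     * prefix selects the 4-byte form, carrying values up to 2^30 - 1 */
    static void varint_enc4(uint8_t out[4], uint32_t len)
    {
            out[0] = 0x80 | ((len >> 24) & 0x3f);
            out[1] = (len >> 16) & 0xff;
            out[2] = (len >> 8) & 0xff;
            out[3] = len & 0xff;
    }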
1xx informational messages are part of the HTTP response. It is not
expected to have the HTX_FL_EOM flag set after parsing such messages
when received from a server. It is especially important when an
informational message is processed on the client side while the final
response has not been received yet, in order not to erroneously detect
the end of the message.
The HTTP multiplexers seem to ignore the HTX_FL_EOM flag for
informational
messages, but it remains an error from the HTX specification point of
view. So it must be fixed.
While it should theoretically be backported as far as 3.0, it is a good
idea not to do so for now because no bug was reported and regressions
may happen.
stktable_trash_oldest() goes through all the shards, trying to free a
number of entries. Going through each shard is expensive, as we have to
take the shard lock, so stop as soon as we freed at least one entry, as
the function is only called when we want to make room for one entry.
In stksess_new(), if the table is full, we call stktable_trash_oldest()
to remove a few entries so that we have some room for a new one.
It is unlikely, but possible, that stktable_trash_oldest() will fail. If
so, just give up and do not add the new entry, instead of adding it
anyway.
Instead of having per-table expiration tasks, just use one per shard.
The task will now go through all the tables to expire entries. When a
table gets an expiration earlier than the one previously known, it will
be put in an mt-list, and the task will be responsible for putting it
into an eb32 tree, ordered based on the next expiration.
Each per-shard task will run on a different thread, so it should lead to
a better load distribution than the per-table tasks.
Add a new initcall stage, STG_INIT_2, for work to be run after
step_init_2() is called, i.e. once we know for sure that
global.nbthread is set.
Modify the stick-tables stkt_late_init() to run at STG_INIT_2 instead
of STG_INIT, in anticipation of it being enhanced to need
global.nbthread.
Since commit 20ec1de214 ("MAJOR: cli: Refacor parsing and execution of
pipelined commands"), command not returning any response (e.g. "quit")
don't pass through the free_trash_chunk() call, possibly leaking the
cmdline buffer. A typical way to reproduce it is to loop on "quit" on
the CLI, though it very likely affects other specific commands.
Let's make sure in the release handler that we always release that
chunk in any case. This must be backported to 3.2.
This bug impacts only the backends.
The validity of the ->conn member (pointer to struct connection) of the
ssl_sock_ctx struct was not checked before it was dereferenced, leading
to possible crashes in qc_ssl_do_handshake() during the handshake.
This was reported in GH issue #3163.
No need to backport because the QUIC backend support arrived with 3.3.
Thread groups can be assigned arbitrary thread ranges, but if the
mentioned threads do not exist, this causes crashes in listener_accept()
or some connections to be ignored. The reason is that the calculated
mask is derived from the thread group's enabled threads count. Examples:
  global
      nbthread 2
      thread-groups 2
      thread-group 1 1-64
      thread-group 2 65-128

  frontend f-crash
      bind :8001 thread 1/all

  frontend f-freeze
      bind :8002 thread 2/all
This commit removes the missing threads, emits a warning when the
thread group simply has fewer threads than requested, and an error when
it is left with no threads at all.
This must be backported to 3.1 since the issue is present there already.
If users start to enable expose-experimental-directives for the purpose
of testing one specific feature, there are chances that the option remains
forever and hides the experimental status of other options.
Let's emit a warning if the option appears and is not used. This will
remind users that they can now drop it, and help keep configs safe for
future upgrades.
We normally taint the process when using experimental directives, but
a handful of places were missed so we don't always know that they are
in use. Let's fix these places (hint for future directives, just look
for places checking for "experimental_directives_allowed", and add
"mark_tainted(TAINTED_CONFIG_EXP_KW_DECLARED);").
The option was turned off by default in 2.8 with commit 2f7c82bfd
("BUG/MINOR: haproxy: Fix option to disable the fast-forward"), however
at the same time it should have dropped its experimental status, since
the feature is enabled by default. The only goal of the option is to
debug something, like many other tune.xxx options. The option should
still normally not be used unless invited to do so by developers
looking for something specific, though.
This could be backported if desired to simplify debugging, though this
has never been needed for now.
The SSL counters were not handled at all for QUIC connections. This
patch implements ssl_sock_update_counters(), extracting the code from
ssl_sock.c, and calls this function where applicable in both the
TLS/TCP and QUIC parts.
Must be backported as far as 2.8.
This bug impacts only the backends.
When entering the closing state, a quic_conn_closed is used to replace
the quic_conn. In this state, the ->fd value was reset to -1 by calling
qc_init_fd(). This value is used by qc_may_use_saddr(), which assumes
it cannot be -1 for a backend: when qc_test_fd() is false, qc->li ends
up being dereferenced, which is legal only for a listener, leading to a
possible crash.
This patch prevents such fd value resetting for backends.
No need to backport because the QUIC backends support arrived with 3.3.
A quic_conn_closed struct is initialized to replace the quic_conn when
the connection enters the closing state, to reduce the connection's
memory footprint.
The ->max_udp_payload field of quic_conn_closed was not initialized,
leading to possible BUG_ON()s in qc_rcv_buf() when comparing the RX
buffer size to this payload. The ->cntrs counters were also not
initialized, the only consequence being wrong values for these
counters.
Must be backported as far as 2.9.
Emeric reported that he can't build haproxy anymore since 9bc6a034
("BUG/MINOR: ssl: Free global_ssl structure contents during deinit").
src/ssl_sock.c:7020:40: error: comparison with string literal results in unspecified behavior [-Werror=address]
7020 | if (global_ssl.listen_default_ciphers != LISTEN_DEFAULT_CIPHERS)
| ^~
src/ssl_sock.c:7023:41: error: comparison with string literal results in unspecified behavior [-Werror=address]
7023 | if (global_ssl.connect_default_ciphers != CONNECT_DEFAULT_CIPHERS)
| ^~
src/ssl_sock.c: At top level:
Indeed the mentioned patch checks the pointers in order to free
something freeable, but that can't work because these constants are
string literals provided by the compiler, not allocated pointers.
Also the test is not useful, because these strings are strdup()'ed in
__ssl_sock_init(), so they can be freed directly.
Must be backported in every stable branches with 9bc6a034.
Fix the quic_tx unittest module by adding an explicit define for the
<mtu> const member of quic_cc_path.
This should fix coverity report from github issue #3162.
This can be backported up to 3.2.
Ensure the applet_putchk() return value is checked when outputting the
CLI 'show quic' header line.
This is only to align with other usages of the same function, as the
trash output buffer should always be large enough for it. As such, the
command is simply aborted if this is not the case.
This should fix coverity report from github issue #3139.
This could be backported up to 2.8.
In stksess_new(), if we failed to allocate memory for the new stksess,
don't forget to decrement the table entry count, as nobody else will
do it for us.
An artificially high count could lead, at the very least, to purging
entries while there is no need to.
This should be backported up to 2.8.
A subtle regression was introduced in 3.0 by commit faa8c3e02 ("MEDIUM:
lb-chash: Deterministic node hashes based on server address"). When keys
are calculated from the server's ID (which is the default), due to the
reorganisation of the code, the key ended up being hashed twice instead
of being multiplied by the scaling range.
While most users will never notice it, it is blocking some large cache
users from upgrading from 2.8 to 3.0 or 3.2 because the keys are
redistributed.
After a check with users on the mailing list [1], it was estimated that
keeping the current situation is the worst choice, because those who
have not yet upgraded will face the problem, while by fixing it, those
who have already upgraded, and for whom it happened smoothly, will
handle it just as smoothly again.
As such this fix must be backported to 3.0 without waiting (in order
to preserve those who upgrade from two redistributions). Please note
that only configurations featuring "hash-type consistent" and not
having "hash-key" present with a value other than "id" are affected,
others are not (e.g. "hash-key addr" is unaffected).
[1] https://www.mail-archive.com/haproxy@formilux.org/msg46115.html
With the fix in commit 982805e6a3 ("BUG/MINOR: pools: Fix the dump of
pools info to deal with buffers limitations"), the max count is now
compared to the number of dumped pools instead of the configured
number, and keeping >= is no longer valid because maxcnt is by default
set to the same value when not set. This means that since this patch
we're always displaying "limited to the first X entries", where X is
the number of dumped entries, even in the absence of any limitation.
Let's just fix the comparison to only show this when the limit is
lower.
This must be backported to 3.2 where the patch above already is.
The truncation of the pools output that was addressed in commit
982805e6a3 ("BUG/MINOR: pools: Fix the dump of pools info to deal with
buffers limitations") required splitting the pools filling from the
dumping. However there is a problem when a limit lower than the number
of pools is passed, or if a pool name is specified, or if pool caches
are disabled: in these cases the number of filled slots will be lower
than the initially allocated one, and empty entries will be visited,
either by the sort functions when filling the entries if "byxxx" is
specified, or by the dump function after the last entry. None of these
functions was expecting to be passed a NULL entry.
Let's just re-adjust nbpools to match the number of filled entries at
the end. Anyway the totals are calculated on the number of dumped
entries.
This must be backported to 3.2 since the fix above was backported there
as well.
The third parameter passed to b_quic_dec_int() is uninitialized. This
is not a bug, but it disturbs coverity for an unknown reason, as
revealed by GH issue #3154. This patch takes the opportunity to pass
NULL instead, avoiding such an unneeded third parameter.
Should be backported to 3.2 where this unit test was introduced.
The pcre2 matching requires an array of matches for grouping, that is
allocated when executing the rule by pre-processing it, and that is
immediately freed after use. This is quite inefficient and results in
annoying patterns in "show profiling" that attribute the allocations
to libpcre2 and the releases to haproxy.
A good suggestion from Dragan is to pre-allocate these per thread,
since the entry is not specific to a regex. In addition we're already
limited to MAX_MATCH matches so we don't even have the problem of
having to grow it while parsing or processing.
The current patch adds a per-thread pair of init/deinit functions to
allocate a thread-local entry for that, and gets rid of the dynamic
allocations. It will result in cleaner memory management patterns and
slightly higher performance (+2.5%) when using pcre2.
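A minimal sketch of the idea (the wiring and names are illustrative,
not the actual haproxy patch; pcre2_match_data_create() and
pcre2_match_data_free() are the real pcre2 calls):

    #define PCRE2_CODE_UNIT_WIDTH 8
    #include <pcre2.h>

    #define MAX_MATCH 10  /* bound on match groups, as in haproxy */

    /* one pre-allocated match-data block per thread, reused by all rules */
    static __thread pcre2_match_data *match_data;

    /* per-thread init: allocate once instead of on each rule execution */
    static int alloc_regex_match_data(void)
    {
            match_data = pcre2_match_data_create(MAX_MATCH, NULL);
            return match_data != NULL;
    }

    /* per-thread deinit */
    static void free_regex_match_data(void)
    {
            pcre2_match_data_free(match_data);
            match_data = NULL;
    }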
'ctx' might be NULL when we exit 'ssl_sock_handshake', so it can't be
dereferenced without a check in the trace macro.
This was found by Coverity and raised in GitHub #3113.
This patch should be backported up to 3.2.