Many tests use the A128KW algorithm, which is not supported by AWS-LC.
Instead of removing those tests, we simply set a hardcoded value by
default in this case.
This converter checks the validity and decrypts the content of a JWE
token that has an asymmetric "alg" algorithm (RSA). In such a case, we
must provide a path to an already loaded certificate and private key
that has the "jwt" option set to "on".
This converter checks the validity and decrypts the content of a JWE
token that has a symmetric "alg" algorithm. In such a case, we only
require a secret as a parameter in order to decrypt the token.
This test mimics what was already done for the aes_gcm converters. Some
data is encrypted and then directly decrypted, and we ensure that the
output is unchanged.
Those converters allow encrypting or decrypting data with AES in Cipher
Block Chaining mode. They work the same way as the already existing
aes_gcm_enc/_dec ones, apart from the AEAD tag notion, which is not
supported in CBC mode.
The parameter parsing and processing and the actual crypto part of the
aes_gcm converter are interleaved. This patch puts the crypto parts in a
dedicated function for better reuse in the upcoming JWE processing.
A recent patch has introduced a new state for proxies: unpublished
backends. Such backends won't be eligible for traffic, thus
use_backend/default_backend rules which target them won't match and
content switching rules processing will continue.
This patch defines a new frontend keyword 'force-be-switch'. This
keyword allows ignoring the unpublished or disabled state. Thus,
use_backend/default_backend will match even if the target backend is
unpublished or disabled. This is useful to be able to test a backend
instance before exposing it externally.
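For illustration, a minimal configuration sketch (the exact syntax, in
particular whether a condition is accepted, is hypothetical here):

  frontend fe_main
      bind :8080
      # hypothetical: keep switching to new_be even while it is unpublished
      force-be-switch if { src 10.0.0.0/8 }
      use_backend new_be if { path_beg /beta }
      default_backend old_be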
This new keyword is converted into a persist rule of the new type
PERSIST_TYPE_BE_SWITCH, stored in the proxy's persist_rules list member.
This is the only persist rule applicable to the frontend side. Prior to
this commit, the persist_rules list of pure frontend proxies was always
empty.
This new feature requires an adjustment in process_switching_rules().
Now, when a use_backend/default_backend rule matches a non-eligible
backend, the frontend's persist_rules are inspected to detect whether a
force-be-switch rule is present, so that the backend may be selected.
The utility function warnif_cond_conflicts() is used when parsing an ACL.
Previously, the function directly called ha_warning() to report an error.
Change the function so that it now takes the error message as an argument.
The caller can then output it as desired.
This change is necessary to use the function when parsing a keyword
registered as cfg_kw_list. The next patch will reuse it.
A previous patch defines a new proxy status: unpublished backends. This
patch extends this by changing the proxy status reported in stats. If
unpublished is set, an extra "(UNPUB)" is appended to the field.
The HTML stats page is also slightly updated. If a backend is up but
unpublished, its status will be reported in orange.
Define a new set of CLI commands: publish/unpublish backend <be>. The
objective is to be able to change the status of a backend to
unpublished. Such a backend is considered ineligible for traffic: this
allows skipping use_backend rules which target it.
Note that contrary to disabled/stopped proxies, an unpublished backend
still has server checks running on it.
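For example, from the stats CLI socket (illustrative commands; the backend
name and socket path are placeholders):

  echo "unpublish backend app_be" | socat stdio /tmp/sock1
  echo "publish backend app_be" | socat stdio /tmp/sock1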
Internally, a new proxy flag PR_FL_BE_UNPUBLISHED is defined. The CLI
command handlers "publish backend" and "unpublish backend" are executed
under thread isolation. This guarantees that the flag can safely be set
or removed in the CLI handlers, and read during content-switching
processing.
A proxy can be marked as disabled using the keyword with the same name.
The doc mentions that it won't process any traffic. However, this is not
really the case for backends as they may still be selected via switching
rules during stream processing.
In fact, currently, accesses to disabled backends will proceed up to
assign_server(). However, no eligible server is found at this stage,
resulting in a connection closure or an HTTP 503, which is expected. So
in the end, servers in disabled backends won't receive any traffic. But
this is only because post-parsing steps are not performed on such
backends. Thus, this can be considered functional, but only through
side effects.
This patch clarifies the handling of disabled backends, so that they are
never selected via switching rules. Now, process_switching_rules() will
ignore disabled backends and continue rules evaluation.
As this is a behavior change, this patch is labelled as medium. The
documentation manual for use_backend is updated accordingly.
Create a new test to ensure that switching rules selection behaves
correctly. Currently, this checks that dynamic backend switching works as
expected. If a matching rule resolves to a nonexistent backend, the
default backend is used instead.
This regtest should be useful as switching-rules will be extended in a
future set of patches to add new abilities on backends, linked to
dynamic backend support.
This commit rewrites the process_switching_rules() function. The
objective is to simplify backend selection so that a single unified
stream_set_backend() call is kept, for both the regular and default
backend cases.
This patch will be useful to add new capabilities on backends, in the
context of dynamic backend support implementation.
The force-persist proxy keyword is converted into a persist_rule, stored
in the proxy's persist_rules list member. Each new rule is dynamically
allocated during parsing.
This commit fixes the memory leak on deinit due to a missing free on
persist_rules list entries. This is done via a deinit_proxy()
modification. Each rule in the list is freed, along with its associated
ACL condition.
This can be backported to every stable version.
If we want to be able to have more than 64 thread groups, we can no
longer store thread group masks in a single long.
One remaining place where this is done is in struct thread_set. However,
it is not really used as a mask anywhere; all we want is a thread group
counter, so convert that mask to a counter.
Fix the arithmetic when pre-filling non_empty_tgids while we still have
more than 32/64 thread groups left: to get the right index, we of course
have to divide the number of thread groups by the number of bits in a
long.
This bug was introduced by commit
7e1fed4b7a8b862bf7722117f002ee91a836beb5, but hopefully was not hit
because it requires having at least as many thread groups as there are
bits in a long, which is impossible on 64-bit machines, as MAX_TGROUPS
is still 32.
Now that it is unused, eliminate all_tgroups_mask, as we can't use
64-bit masks to represent thread groups if we want to be able to have
more than 64 thread groups.
In order to be able to have more than 64 thread groups, turn
non_empty_tgids into a long array, so that we have enough bits to
represent every thread group, and manipulate it with the ha_bit_*
functions.
As reported by GH user @Lzq-001 on issue #3245, the config below would
cause haproxy to SEGFAULT after having reported an error:
frontend 0000000
http-request set-map %[hdr(0000)0_
The root cause is simple: in parse_http_set_map(), we define the release
function (which is responsible for clearing the lf_expr expressions used
by the action) prior to initializing the expressions, while the release
function assumes the expressions are always initialized.
For all similar actions, we already perform the init prior to setting
the related release function, but this was not the case for
parse_http_set_map(). We fix the bug by initializing the expressions
earlier.
Thanks to @Lzq-001 for having reported the issue and provided a simple
reproducer.
It should be backported to all stable versions. Note that for versions
prior to 3.0, lf_expr_init() should be replaced by LIST_INIT(), see
6810c41 ("MEDIUM: tree-wide: add logformat expressions wrapper").
Contrary to what was previously believed, there are corner cases where
the counters may not be allocated, and we may want to make them optional
at a later date, so we have to check if those counters are there.
However, just checking that shared.tg is non-NULL is enough; we can then
assume that shared.tg[tgid - 1] has been properly allocated too.
Also modify the various COUNTER_SHARED_* macros to make sure they check
for that too.
ACK frames are either of type 0x02 or 0x03. The latter is an indication
that it contains extra ECN related fields. In haproxy QUIC stack, this
is considered as a different frame type, set to QUIC_FT_ACK_ECN, with
its own set of builder/parser functions.
This patch fixes the ACK ECN parsing function, which suffered from two
issues. First, the 'first ACK range' and 'ACK ranges' fields were
inverted. Then, the three remaining ECN fields were simply ignored by
the parsing function.
This issue can cause desynchronization in the frame parsing code, which
may produce various outcomes. Most of the time, the connection will be
aborted by haproxy due to an invalid frame content read.
Note that this issue was not detected earlier as most clients do not
enable ECN support if the peer is not able to emit ACK ECN frames first,
which haproxy currently never does. Nevertheless, this is not the case
for every client implementation, thus correct ACK ECN parsing is
mandatory for proper QUIC stack support.
Fix this by adjusting quic_parse_ack_ecn_frame() function. The remaining
ECN fields are parsed to ensure correct packet parsing. Currently, they
are not used by the congestion controller.
This must be backported up to 2.6.
The code to parse the "thread" keyword on bind lines was changed to
check the thread numbers against the value provided with
max-threads-per-group, if any. However, at the time those "thread"
keywords are parsed, max-threads-per-group may not yet have been set,
and that breaks the feature. So revert to checking against
MAX_THREADS_PER_GROUP instead; it should have no major impact.
Before updating counters, a few tests are made to check if the counters
exist. But those counters should always exist at this point, so just
remove them.
This commit should have no impact, but can easily be reverted with no
functional impact if various crashes appear.
Instead of statically allocating the per-thread group counters,
based on the max number of thread groups available, allocate
them dynamically, based on the number of thread groups actually
used. That way we can increase the maximum number of thread
groups without using an unreasonable amount of memory.
The IPv6 header contains a payload length that excludes the 40 bytes of
the IPv6 packet header, which differs from IPv4's total length, which
includes it. As a result, the parser was wrong and would only see the IP
part and not the TCP one unless sufficient options were present to cover
it.
This issue came in 3.4-dev2 with recent commit e88e03a6e4 ("MINOR:
net_helper: add ip.fp() to build a simplified fingerprint of a SYN"),
so no backport is needed.
As reported by GH user @kanashimia in GH #3241, providing anything other
than a table to the Patref:add_bulk() method could cause a segfault
because we were calling lua_next() on the lua object without ensuring it
actually is a table.
Let's add the missing lua_istable() check on the stack object before
calling the lua_next() function on it.
It should be backported up to 3.2 with 884dc62 ("MINOR: hlua_fcn:
add Patref:add_bulk()")
In GH #3241, GH user @kanashimia reported that the Patref:add_bulk()
method would raise a Lua exception when called with more than 101
elements at once.
As identified by @kanashimia, there was an error in the way the
add_bulk() method was forced to yield after exactly 101 elements.
The yield is there to ensure Lua doesn't consume too many resources at
once and doesn't impact haproxy's core responsiveness, but the check
for the yield was misplaced, resulting in improper stack content upon
resume.
Thanks to user @kanashimia who even provided a reproducer which helped
a lot to troubleshoot the issue.
This fix should be backported up to 3.2 with 884dc62 ("MINOR: hlua_fcn:
add Patref:add_bulk()") where the bug was introduced.
Now that the tgid stored in the stats file has been increased to 16bits
by commit 022cb3ab7fdce74de2cf24bea865ecf7015e5754, don't forget to
increase the variable size when reading it from the file, too.
This should have no impact given the maximum thread group limit is still
32.
Increase the size of the stored tgid in the stats file from 8 bits to
16 bits, so that we can have more than 256 thread groups. 65536 should
be enough for some time.
This bumps the stats file minor version, as the structure changes.
Instead of always allocating MAX_TGROUPS members, allocate them
dynamically, using the number of thread groups we'll use, so that
increasing MAX_TGROUPS will not have a huge impact on the structure
size.
This flag is used as of commit dcce9369129f6ca9b8eed6b451c0e20c226af2e3
("MINOR: connections: Add a new CO_FL_SSL_NO_CACHED_INFO flag"). This patch
should be backported to 3.3. Apparently dcce9369129 has been backported
to 3.2 and 3.1 already, with that change already applied, so no need for a
backport there.
The fc_xxx info that is retrieved via tcp_info could previously not be
accessed before a stream was created, due to a test that verified the
existence of a stream. The rationale here was that the function works
both for the frontend and the backend. Let's always retrieve this info
from the session in the frontend case, so that it now becomes possible
to set variables at connection/session time. The doc did not mention
this limitation, so this could almost be considered a bug.
Some timers, like the handshake timer, are stored in the session and are
only copied to the logs struct when a stream is created. But this means
we can't measure them without a stream, nor store them once and for all
in a variable at session creation time. Let's extend the sample fetch
function to retrieve them from the session when no stream is present.
The doc did not mention this limitation, so this could almost be
considered a bug.
It was reported in GH #2956 and more recently in GH #3235 that some
hashes are way too slow. The former triggers watchdog warnings during
checks, the latter sees the config parsing take 20 seconds. This is
always due to the use of hash algorithms that are not suitable for use
in low-latency environments like the web. They might be fine for local
auth though. The difficulty, as explained by Philipp Hossner, is that
developers are not aware of this cost and adopt such algorithms without
suspecting any side effect.
The proposal here is to measure the crypt() call time and emit a warning
if it takes more than 10ms (which is already extreme). This was tested
by Philipp and confirmed to catch his case.
This is marked medium as it might start to report warnings on configs
that have been suffering from this problem without ever detecting it
until now.
Patch dba4fd24 ("MEDIUM: ssl/ech: config and load keys") introduced
ECH configuration for bind lines, but the QUIC configuration parser
still suffers from not using the same code as the TCP/TLS one, so the
init for QUIC was missed.
This must be backported to 3.3.
The ECH job still fails to compile since the OpenSSL 4.0 deprecated
functions were not removed yet. Let's remove ERR=1 temporarily.
We do know that there's a regression in OpenSSL 4.0 with these
reg-tests though:
Error: # top TEST reg-tests/ssl/set_ssl_crlfile.vtc FAILED (0.219) exit=2
Error: # top TEST reg-tests/ssl/set_ssl_cafile.vtc FAILED (0.236) exit=2
Error: # top TEST reg-tests/quic/set_ssl_crlfile.vtc FAILED (0.196) exit=2
OpenSSL changed the output from "Server Temp Key" in prior versions to
"Peer Temp Key" in recent ones.
a39dc27c25
It looks like it affects OpenSSL >=3.5.0
This broke the reg-test for e.g. Debian 13 builds, using OpenSSL 3.5.1
Fixes bug #3238
Could be backported to every branch.
Signed-off-by: Christian Ruppert <idl0r@qasl.de>
Discussed in issue #3187: the CLI help is confusing for the "show table"
command, as it makes the argument look mandatory.
This patch adds the arguments between square brackets to remove the
confusion.
In GH issue #3226, Sergey Fedorov (@barracuda156) reported that since
commit 10c14a1ed0 ("MINOR: proto_sockpair: send_fd_uxst: init iobuf,
cmsghdr, cmsgbuf to zeros"), macOS 10.6.8 with gcc 14.3.0 doesn't build
anymore:
src/proto_sockpair.c: In function 'send_fd_uxst':
src/proto_sockpair.c:246:49: error: variable-sized object may not be initialized except with an empty initializer
246 | char cmsgbuf[CMSG_SPACE(sizeof(int))] = {0};
| ^
src/proto_sockpair.c:247:45: error: variable-sized object may not be initialized except with an empty initializer
247 | char buf[CMSG_SPACE(sizeof(int))] = {0};
| ^
Upon investigation, it appears that the CMSG_SPACE() macro on this OS
looks too complex for gcc to consider it a constant, so it treats these
buffers as variable-length arrays and cannot initialize them.
Let's move to a simple memset() instead, which Sergey confirmed fixes
the problem.
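For reference, the change essentially amounts to replacing the initializer
with an explicit memset(), along these lines:

  char cmsgbuf[CMSG_SPACE(sizeof(int))];
  memset(cmsgbuf, 0, sizeof(cmsgbuf));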
This needs to be backported as far as 3.1. Thanks to Sergey for the
report, the bisect and testing the fix.
This patch covers issue https://github.com/haproxy/haproxy/issues/3221.
The parser for the "userlist" section did not use the standard keyword
registration mechanism. Instead, it relied on a series of strcmp()
comparisons to identify keywords such as "group" and "user".
This had two main drawbacks:
1. The keywords were not discoverable by the "-dKall" dump option,
making it difficult for users to see all available keywords for the
section.
2. The implementation was inconsistent with the parsers for other
sections, which have been progressively refactored to use the
standard cfg_kw_list infrastructure.
This patch refactors the userlist parser to align it with the project's
standard conventions.
The parsing logic for the "group" and "user" keywords has been extracted
from the if/else block in cfg_parse_users() into two new dedicated
functions:
- cfg_parse_users_group()
- cfg_parse_users_user()
These two keywords are now registered via a dedicated cfg_kw_list,
making them visible to the rest of the HAProxy ecosystem, including the
-dKall dump.
When an unknown keyword was used in the "userlist" section, the error
mentioned the "users" section instead of "userlist".
Could be backported to every branch.
New gcc and clang versions from Fedora rawhide seem to use the C23
standard by default. This version changes the definition of some
string.h functions, which now return a const char * instead of a char *.
src/tools.c: In function ‘fgets_from_mem’:
src/tools.c:7200:17: warning: assignment discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
7200 | new_pos = memchr(*position, '\n', size);
| ^
Strangely, -Wdiscarded-qualifiers does not seem to catch all the
memchr calls.
Should fix issue #3228.
This could be backported in previous versions.
New gcc and clang versions from Fedora rawhide seem to use the C23
standard by default. This version changes the definition of some
string.h functions, which now return a const char * instead of a char *.
src/ssl_sock.c: In function ‘SSL_CTX_keylog’:
src/ssl_sock.c:4475:17: error: assignment discards ‘const’ qualifier from pointer target type [-Werror=discarded-qualifiers]
4475 | lastarg = strrchr(line, ' ');
Strangely, -Wdiscarded-qualifiers does not seem to catch all the
strrchr calls.
Should fix issue #3228.
This could be backported in previous versions.
Released version 3.4-dev2 with the following main changes :
- BUG/MEDIUM: mworker/listener: ambiguous use of RX_F_INHERITED with shards
- BUG/MEDIUM: http-ana: Properly detect client abort when forwarding response (v2)
- BUG/MEDIUM: stconn: Don't report abort from SC if read0 was already received
- BUG/MEDIUM: quic: Don't try to use hystart if not implemented
- CLEANUP: backend: Remove useless test on server's xprt
- CLEANUP: tcpcheck: Remove useless test on the xprt used for healthchecks
- CLEANUP: ssl-sock: Remove useless tests on connection when resuming TLS session
- REGTESTS: quic: fix a TLS stack usage
- REGTESTS: list all skipped tests including 'feature cmd' ones
- CI: github: remove openssl no-deprecated job
- CI: github: add a job to test the master branch of OpenSSL
- CI: github: openssl-master.yml misses actions/checkout
- BUG/MEDIUM: backend: Do not remove CO_FL_SESS_IDLE in assign_server()
- CI: github: use git prefix for openssl-master.yml
- BUG/MEDIUM: mux-h2: synchronize all conditions to create a new backend stream
- REGTESTS: fix error when no test are skipped
- MINOR: cpu-topo: Turn the cpu policy configuration into a struct
- MEDIUM: cpu-topo: Add a "threads-per-core" keyword to cpu-policy
- MEDIUM: cpu-topo: Add a "cpu-affinity" option
- MEDIUM: cpu-topo: Add a new "max-threads-per-group" global keyword
- MEDIUM: cpu-topo: Add the "per-thread" cpu_affinity
- MEDIUM: cpu-topo: Add the "per-ccx" cpu_affinity
- BUG/MINOR: cpu-topo: fix -Wlogical-not-parentheses build with clang
- DOC: config: fix number of values for "cpu-affinity"
- MINOR: tools: add a secure implementation of memset
- MINOR: mux-h2: add missing glitch count for non-decodable H2 headers
- MINOR: mux-h2: perform a graceful close at 75% glitches threshold
- MEDIUM: mux-h1: implement basic glitches support
- MINOR: mux-h1: perform a graceful close at 75% glitches threshold
- MEDIUM: cfgparse: acknowledge that proxy ID auto numbering starts at 2
- MINOR: cfgparse: remove useless checks on no server in backend
- OPTIM/MINOR: proxy: do not init proxy management task if unused
- MINOR: patterns: preliminary changes for reorganization
- MEDIUM: patterns: reorganize pattern reference elements
- CLEANUP: patterns: remove dead code
- OPTIM: patterns: cache the current generation
- MINOR: tcp: add new bind option "tcp-ss" to instruct the kernel to save the SYN
- MINOR: protocol: support a generic way to call getsockopt() on a connection
- MINOR: tcp: implement the get_opt() function
- MINOR: tcp_sample: implement the fc_saved_syn sample fetch function
- CLEANUP: assorted typo fixes in the code, commits and doc
- BUG/MEDIUM: cpu-topo: Don't forget to reset visited_ccx.
- BUG/MAJOR: set the correct generation ID in pat_ref_append().
- BUG/MINOR: backend: fix the conn_retries check for TFO
- BUG/MINOR: backend: inspect request not response buffer to check for TFO
- MINOR: net_helper: add sample converters to decode ethernet frames
- MINOR: net_helper: add sample converters to decode IP packet headers
- MINOR: net_helper: add sample converters to decode TCP headers
- MINOR: net_helper: add ip.fp() to build a simplified fingerprint of a SYN
- MINOR: net_helper: prepare the ip.fp() converter to support more options
- MINOR: net_helper: add an option to ip.fp() to append the TTL to the fingerprint
- MINOR: net_helper: add an option to ip.fp() to append the source address
- DOC: config: fix the length attribute name for stick tables of type binary / string
- MINOR: mworker/cli: only keep positive PIDs in proc_list
- CLEANUP: mworker: remove duplicate list.h include
- BUG/MINOR: mworker/cli: fix show proc pagination using reload counter
- MINOR: mworker/cli: extract worker "show proc" row printer
- MINOR: cpu-topo: Factorize code
- MINOR: cpu-topo: Rename variables to better fit their usage
- BUG/MEDIUM: peers: Properly handle shutdown when trying to get a line
- BUG/MEDIUM: mux-h1: Take care to update <kop> value during zero-copy forwarding
- MINOR: threads: Avoid using a thread group mask when stopping.
- MINOR: hlua: Add support for lua 5.5
- MEDIUM: cpu-topo: Add an optional directive for per-group affinity
- BUG/MEDIUM: mworker: can't use signals after a failed reload
- BUG/MEDIUM: stconn: Move data from <kip> to <kop> during zero-copy forwarding
- DOC: config: fix a few typos and refine cpu-affinity
- MINOR: receiver: Remove tgroup_mask from struct shard_info
- BUG/MINOR: quic: fix deprecated warning for window size keyword
QUIC configuration was cleaned up in the previous release. Several
global keyword names were changed to unify the configuration. For each
of them the older keyword is marked as deprecated, with a warning to
mention the newer alternative.
This patch fixes the warning for 'tune.quic.frontend.default-max-size'
as the alternative proposed was not correct. The proper value now is
'tune.quic.fe.cc.max-win-size'.
This must be backported up to 3.3.
The only purpose of tgroup_mask seems to be to calculate how many
tgroups share the same shard, but this is information we can compute
differently: we just have to increment the count when a new receiver is
added to the shard, and decrement it when one is detached from the
shard. Removing thread group masks will allow us to increase the maximum
number of thread groups past 64.
There were two typos in the recently updated parts about per-group.
Also, change the commas to ':' after the option values, as they could
sometimes be confusing. Last, place quotes around keyword names so that
they're explicitly referred to as language keywords. No backport is
needed.
The producer's <kip> was not forwarded to the consumer's <kop> when
zero-copy data forwarding was tried. Because of this issue, the chunking
of emitted H1 messages could be invalid.
To fix the bug, sc_ep_fwd_kip() must be called at this stage.
This fix is related to the previous one (529a8dbfb "BUG/MEDIUM: mux-h1: Take
care to update <kop> value during zero-copy forwarding"). Both are required
to fully fix the issue #3230.
This patch must be backported to 3.3.
In issue #3229 it was reported that the master couldn't reload after a
failed reload following a wrong configuration.
It is still possible to do a reload using the "reload" command of the
master CLI, but all signals are blocked.
The problem was introduced in 709cde6d0 ("BUG/MEDIUM: mworker: signals
inconsistencies during startup and reload") which fixes the blocking of
signals during the reload.
However, the patch missed a case: run_master_in_recovery_mode() is not
called when the worker fails to parse the configuration; it is only
called when the master itself fails.
To handle this case, the mworker_unblock_signals() function must be
called from mworker_on_new_child_failure(). But since the latter is
called in a haproxy signal handler, it would mess with the signals.
Instead, the patch adds a task which is started by the signal handler,
and restores the signals outside of it.
This must be backported as far as 3.1.
When using per-group affinity, add an optional new directive. It accepts
two values: "auto", the new default, where, when multiple thread groups
are created, the available CPUs are split equally across the groups; and
"loose", the old default, where all groups are bound to all available
CPUs.
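A sketch of what this could look like in the global section (the exact
directive syntax is hypothetical here):

  global
      cpu-affinity per-group loose   # hypothetical; "auto" is the new default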
Lua 5.5 adds an extra argument to lua_newstate(). Since there are
already a few other ifdefs in hlua.c checking for the Lua version,
and there's a single call place, let's do the same here. This should
be safe for backporting if needed.
Signed-off-by: Mike Lothian <mike@fireburn.co.uk>
Remove the "stopped_tgroup_mask" variable, that indicated which thread
groups were stopping, and instead just use "stopped_tgroups", a counter
indicating how many thread groups are stopping. We want to remove all
thread group masks, so that we can increase the maximum number of thread
groups past 64.
Since the extra field was removed from the HTX structure, a regression was
introduced in the forwarding of chunked messages. The <kop> value was not
decreased as it should have been when data were sent via zero-copy
forwarding. Because of this bug, it was possible to announce a chunk size
larger than the chunk data sent.
To fix the bug, a helper function was added to properly update the <kop>
value when a chunk size is emitted. This function is now called when a new
chunk is announced, including during zero-copy forwarding.
As a workaround, "tune.disable-zero-copy-forwarding" or just
"tune.h1.zero-copy-fwd-send off" can be set in the global section.
This patch should fix the issue #3230. It must be backported to 3.3.
When a shutdown was reported to a peer applet, the event was not properly
handled if it failed to receive data. The function responsible for getting
data was exiting too early if the applet buffer was empty, without testing
the sedesc status. Because of this issue, it was possible to have frozen
peer applets. For instance, it happened on client timeout. With too many
frozen applets, it was possible to reach the maxconn.
This patch should fix the issue #3234. It must be backported to 3.3.
Rename "visited_tsid" and "visited_ccx" to "touse_tsid" and
"touse_ccx". They are not there to remember which tsid/ccx we
alreaday visited, contrarily to visited_ccx_set and
visited_cl_set, they are there to know which tsid/ccx we should
use, so make that clear.
Introduce cli_append_worker_row() to centralize formatting of a single
worker row. Also, replace duplicated row-printing code in both current
and old workers loops with the helper. Motivation: Reduces LOC and
improves readability by removing duplication.
After commit 594408cd612b5 ("BUG/MINOR: mworker/cli: 'show proc' is limited
by buffer size"), related to ticket #3204, the "show proc" logic
has been fixed to be able to print more than 202 processes. However, this
fix can lead to the omission of entries in case they have the same
timestamp.
To fix this, we use the unique reload counter instead of the timestamp.
On partial flush, set ctx->next_reload = child->reloads.
On resume, skip entries with child->reloads >= ctx->next_reload.
Finally, we clear ctx->next_reload at the end of a complete dump so that
subsequent "show proc" calls start from the top.
Could be backported to all stable branches.
Change mworker_env_to_proc_list() to check if (child->pid > 0) before
LIST_APPEND, avoiding invalid PIDs (0/-1) in the process list.
This has no functional impact beyond stricter validation, and it aligns
with existing kill safeguards.
The stick-table doc was reworked and moved in 3.2 with commit da67a89f3
("DOC: config: move stick-tables and peers to their own section"), however
the optional length attribute for binary/string types was mistakenly
spelled "length" while it's "len".
This must be backported to 3.2.
It can make sense to support extra components in the fingerprint to ease
configuration, so let's change the 0/1 value to a bit field. We also turn
the current 1 (TCP options list) to 2 so that we'll reuse 1 for the TTL.
Here we collect all the stuff that depends on the sender's settings,
such as the TOS, IP version, TTL range, presence of the DF bit or IP
options, presence of DATA in the SYN, CWR+ECE flags, TCP header length,
wscale, initial window and MSS, as well as the list of TCP extension
kinds. It's obviously fairly limited but can help avoid blacklisting
certain valid clients sharing the same IP address as a misbehaving one.
It supports both a short and a long mode depending on the argument.
These can be used with the tcp-ss bind option. The doc was updated
accordingly.
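A hypothetical usage, assuming the saved SYN copy starts at the IP header
and is fed through the fc_saved_syn fetch described further below:

  # illustrative only: store a short fingerprint of the client's SYN
  http-request set-var(txn.syn_fp) fc_saved_syn,ip.fp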
This adds the following converters, used to decode fields
in an incoming TCP header:
tcp.dst, tcp.flags, tcp.seq, tcp.src, tcp.win,
tcp.options.mss, tcp.options.tsopt, tcp.options.tsval,
tcp.options.wscale, tcp.options_list
These can be used with the tcp-ss bind option. The doc was updated
accordingly.
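An illustrative chain (hypothetical, assuming the saved SYN copy starts at
the IP header):

  # extract the MSS advertised in the client's SYN
  http-request set-var(txn.syn_mss) fc_saved_syn,ip.data,tcp.options.mss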
This adds a few converters that help decode parts of IP packets:
- ip.data : returns the next header (typically TCP)
- ip.df : returns the dont-fragment flags
- ip.dst : returns the destination IPv4/v6 address
- ip.hdr : returns only the IP header
- ip.proto: returns the upper level protocol (udp/tcp)
- ip.src : returns the source IPv4/v6 address
- ip.tos : returns the TOS / TC field
- ip.ttl : returns the TTL/HL value
- ip.ver : returns the IP version (4 or 6)
These can be used with the tcp-ss bind option. The doc was updated
accordingly.
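For example (hypothetical, assuming the saved SYN copy starts at the IP
header):

  # record the IP-level TTL and source address seen in the SYN
  http-request set-var(txn.syn_ttl) fc_saved_syn,ip.ttl
  http-request set-var(txn.syn_src) fc_saved_syn,ip.src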
This adds a few converters that help decode parts of ethernet frame
headers:
- eth.data : returns the next header (typically IP)
- eth.dst : returns the destination MAC address
- eth.hdr : returns only the ethernet header
- eth.proto: returns the ethernet proto
- eth.src : returns the source MAC address
- eth.vlan : returns the VLAN ID when present
These can be used with the tcp-ss bind option. The doc was updated
accordingly.
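For example (hypothetical, assuming the saved SYN copy includes the L2
header):

  # record the source MAC address observed on the client's SYN
  http-request set-var(txn.src_mac) fc_saved_syn,eth.src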
In 2.6, do_connect_server() was introduced by commit 0a4dcb65f ("MINOR:
stream-int/backend: Move si_connect() in the backend scope") and changed
the approach to work with a stream instead of a stream-interface. However,
si_oc(si) was wrongly turned into &s->res instead of &s->req, which breaks
TFO by always inspecting the response channel to figure out whether there
are data pending.
This fix can be backported to all versions down to 2.6.
In 2.6, the retries counter on a stream was changed from retries left
to retries done by commit 731c8e6cf ("MINOR: stream: Simplify retries
counter calculation"). However, one comparison fell through the cracks:
the one used to detect whether or not we can use TFO (only on the first
attempt), resulting in TFO never working anymore.
This may be backported to all versions down to 2.6.
We want to reset visited_ccx, as introduced by commit
8aef5bec1ef57eac449298823843d6cc08545745, each time we run the loop,
otherwise the chances of its content being correct are very low, and
will likely end up being bound to the wrong threads.
This was reported in github issue #3224.
This function retrieves the copy of a SYN packet that the system has
kept for us when the bind option "tcp-ss" was set to 1 or above. It's
recommended to copy it to a local variable because it will be freed
after being read. It allows inspecting all parts of an incoming SYN
packet, provided that it was preserved (e.g. not possible with SYN
cookies). The doc provides examples of how to use it.
It's regularly needed to call getsockopt() on a connection, but each
time the calling code has to do all the work by itself. This commit adds
a "get_opt()" callback on the protocol struct, that directly calls
getsockopt() on the connection's FD. A generic implementation for
standard sockets is provided, though QUIC would likely require a
different approach, or maybe a mapping. Due to the overlap between
IP/TCP/socket option values, it is necessary for the caller to indicate
both the level and the option. An abstraction of the level could be
done, but the caller would nonetheless have to know the optname, which
is generally defined in the same include files. So for now we'll
consider that this callback is only for very specific use.
The levels and optnames are purposely passed as signed ints so that it
is possible to further extend the API by using negative levels for
internal namespaces.
This option enables TCP_SAVE_SYN on the listening socket, which will
cause the kernel to try to save a copy of the SYN packet header (L2,
IP and TCP are supported). This can permit checking the source MAC
address of a client, or finding certain TCP options such as a source
address encapsulated using RFC7974. It could also be used as an
alternative approach to retrieving the source and destination addresses
and ports. For now, only setting the option is implemented; sample fetch
functions and converters will be needed to extract the info.
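An illustrative bind line (the exact semantics of option values beyond
enabling the feature are hypothetical):

  frontend fe_main
      bind :8080 tcp-ss 1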
This makes a significant difference when loading large files and during
commit and clear operations, thanks to improved cache locality. In the
measurements below, master refers to the code before any of the changes
to the patterns code, not the code before this one commit.
Timing the replacement of 10M entries from the CLI with this command
which also reports timestamps at start, end of upload and end of clear:
$ (echo "prompt i"; echo "show activity"; echo "prepare acl #0";
awk '{print "add acl @1 #0",$0}' < bad-ip.map; echo "show activity";
echo "commit acl @1 #0"; echo "clear acl @0 #0";echo "show activity") |
socat -t 10 - /tmp/sock1 | grep ^uptim
master, on a 3.7 GHz EPYC, 3 samples:
uptime_now: 6.087030
uptime_now: 25.981777 => 21.9 sec insertion time
uptime_now: 29.286368 => 3.3 sec commit+clear
uptime_now: 5.748087
uptime_now: 25.740675 => 20.0s insertion time
uptime_now: 29.039023 => 3.3 s commit+clear
uptime_now: 7.065362
uptime_now: 26.769596 => 19.7s insertion time
uptime_now: 30.065044 => 3.3s commit+clear
And after this commit:
uptime_now: 6.119215
uptime_now: 25.023019 => 18.9 sec insertion time
uptime_now: 27.155503 => 2.1 sec commit+clear
uptime_now: 5.675931
uptime_now: 24.551035 => 18.9s insertion
uptime_now: 26.652352 => 2.1s commit+clear
uptime_now: 6.722256
uptime_now: 25.593952 => 18.9s insertion
uptime_now: 27.724153 => 2.1s commit+clear
Now timing the startup time with a 10M entries file (on another machine)
on master, 20 samples:
Standard Deviation, s: 0.061652677408033
Mean: 4.217
And after this commit:
Standard Deviation, s: 0.081821371548669
Mean: 3.78
Situations where we are iterating over elements and find one with a
different generation ID cannot arise anymore since the elements are kept
per-generation.
Instead of a global list (and tree) of pattern reference elements, we
now have an intermediate pat_ref_gen structure and store the elements in
those. This simplifies the logic of some operations such as commit and
clear, and improves performance in some cases - numbers to be provided
in a subsequent commit after one important optimization is added.
A lot of the changes are due to adding an extra level of indirection,
changing many cases where we iterate over all elements into an outer loop
iterating over the generations and an inner one iterating over the
elements of the current generation. It is therefore easier to read this
patch using 'git diff -w'.
Safe and non-functional changes that only add currently unused
structures, fields, functions and macros, in preparation for larger
changes that alter the way pattern reference elements are stored.
This includes code to create and lookup generation objects, and
macros to iterate over the generations of a pattern reference.
Each proxy has its own task for internal purposes. Currently, it is
only used either by frontends or if a stick-table is present.
This commit renders the task allocation optional, restricting it to the
cases that require it. Thus, it is no longer allocated for backend-only
proxies without a stick-table.
A legacy check could be activated at compile time to reject backends
without servers. In practice this is not used anymore and does not have
much sense with the introduction of dynamic servers.
Each frontend/backend/listen proxy is assigned a unique ID. It can
either be set explicitly via the 'id' keyword, or automatically assigned
during post-parsing depending on the available values.
It was expected that the first automatically assigned value would start
at '1'. However, due to a legacy bug this is not the case, as this value
is always skipped. Thus, automatically assigned proxies always start at
'2' or more.
To avoid breaking the current existing state, this situation is now
acknowledged with the current patch. The code is rewritten with an
explicit warning to ensure that this won't be fixed without knowing the
current status. A new regtest also ensures this.
We now count glitches for each parsing error, including those that
have been accepted via accept-unsafe-violations-*. Front and back
are considered, and the connection gets killed on error once the
threshold is reached or passed and the CPU usage is beyond the
configured limit (0 by default). This was tested with:
curl -ivH "host : blah" 0:4445{,,,,,,,,,}
which sends 10 requests to a configuration having a threshold of 5.
The global keywords are named similarly to H2 and quic:
tune.h1.be.glitches-threshold xxxx
tune.h1.fe.glitches-threshold xxxx
The glitches count of each connection is also reported when non-null
in the connection dumps (e.g. "show fd").
This avoids hitting the hard wall for connections with non-compliant
peers that would accumulate errors over long connections. We now
permit recycling the connection early enough to reset the connection
counter.
This was tested artificially by adding this to h2c_frt_handle_headers():
h2c_report_glitch(h2c, 1, "new stream");
or this to h2_detach():
h2c_report_glitch(h2c, 1, "detaching");
and injecting using h2load -c 1 -n 1000 0:4445 on a config featuring
tune.h2.fe.glitches-threshold 1000:
finished in 8.74ms, 85802.54 req/s, 686.62MB/s
requests: 1000 total, 751 started, 751 done, 750 succeeded, 250 failed, 250 errored, 0 timeout
status codes: 750 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 6.00MB (6293303) total, 132.57KB (135750) headers (space savings 29.84%), 5.86MB (6144000) data
min max mean sd +/- sd
time for request: 9us 178us 10us 6us 99.47%
time for connect: 139us 139us 139us 0us 100.00%
time to 1st byte: 339us 339us 339us 0us 100.00%
req/s : 87477.70 87477.70 87477.70 0.00 100.00%
The failures are due to h2load not supporting reconnection.
One rare error case that could produce a protocol error on the stream,
namely not being able to decode response headers, wasn't being accounted
as a glitch, so let's fix it.
This guarantees that the compiler will not optimize away the memset()
call if it detects a dead store.
Use this to clear SSL passphrases.
No backport needed.
src/cpu_topo.c:1325:15: warning: logical not is only applied to the left hand side of this bitwise operator [-Wlogical-not-parentheses]
1325 | } else if (!cpu_policy_conf.flags & CPU_POLICY_ONE_THREAD_PER_CORE)
| ^ ~
src/cpu_topo.c:1325:15: note: add parentheses after the '!' to evaluate the bitwise operator first
1325 | } else if (!cpu_policy_conf.flags & CPU_POLICY_ONE_THREAD_PER_CORE)
| ^
| ( )
src/cpu_topo.c:1325:15: note: add parentheses around left hand side expression to silence this warning
1325 | } else if (!cpu_policy_conf.flags & CPU_POLICY_ONE_THREAD_PER_CORE)
| ^
| ( )
src/cpu_topo.c:1533:15: warning: logical not is only applied to the left hand side of this bitwise operator [-Wlogical-not-parentheses]
1533 | } else if (!cpu_policy_conf.flags & CPU_POLICY_ONE_THREAD_PER_CORE)
| ^ ~
src/cpu_topo.c:1533:15: note: add parentheses after the '!' to evaluate the bitwise operator first
1533 | } else if (!cpu_policy_conf.flags & CPU_POLICY_ONE_THREAD_PER_CORE)
| ^
| ( )
src/cpu_topo.c:1533:15: note: add parentheses around left hand side expression to silence this warning
1533 | } else if (!cpu_policy_conf.flags & CPU_POLICY_ONE_THREAD_PER_CORE)
| ^
| ( )
No backport needed.
Add a new cpu-affinity keyword, "per-thread".
If used, each thread will be bound to only one hardware thread of the
thread group.
If used in conjunction with the "threads-per-core 1" cpu-policy, then
each thread will be bound to a different core.
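For example (illustrative global-section snippet):

  global
      cpu-policy performance threads-per-core 1
      cpu-affinity per-thread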
Add a new global keyword, max-threads-per-group. It sets the maximum number of
threads a thread group can contain. Unless the number of thread groups
is fixed with "thread-groups", haproxy will just create more thread
groups as needed.
The default and maximum value is 64.
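For example (illustrative):

  global
      max-threads-per-group 32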
Add a new global option, "cpu-affinity", which controls how threads are
bound.
It currently accepts three values: "per-core", which will bind one thread
to each hardware thread of a given core; "per-group", which will use all
the available hardware threads of the thread group; and "auto", the
default, which will use "per-group" unless "threads-per-core 1" has
been specified in cpu-policy, in which case it will use "per-core".
Add a new, optional keyword to "cpu-policy", "threads-per-core".
It takes one argument, "1" or "auto". If "1" is used, then only one
thread per core will be created, no matter how many hardware threads each
core has. If "auto" is used, then one thread will be created per
hardware thread, as is the case by default.
For example: cpu-policy performance threads-per-core 1
Turn the cpu policy configuration into a struct. Right now it just
contains an int, that represents the policy used, but will get more
information soon.
Since commit 1ed2c9d ("REGTESTS: list all skipped tests including
'feature cmd' ones"), the script emits an error when trying to display
the list of skipped tests while there are none.
No backport needed.
In H2 the conditions to create a new stream differ for a client and a
server when a GOAWAY was exchanged. While on the server, any stream
whose ID is lower than or equal to the one advertised in GOAWAY is
valid, for a client it's forbidden to create any stream after receipt
of a GOAWAY, even if its ID is lower than or equal to the last one,
despite the server not being able to tell the difference from the
number of streams in flight.
Unfortunately, the logic in the code did not always reflect this
specificity of the client (the backend code in our case), and most
often considered that it was still permitted to create a new stream
until the max_id was greater than or equal to the advertised last_id.
This is for example what h2c_is_dead() and h2c_streams_left() do. In
other places, such as h2_avail_streams(), the rule is properly taken
into account. Very often the advertised last_id is the same, and this
is also what haproxy does (which explains why it's impossible to
reproduce the issue by chaining two haproxy layers), but a server may
wish to advertise any ID including 2^31-1 as mentioned in the spec,
and in this case the functions would behave differently.
This discrepancy results in a corner case where a GOAWAY received on
an idle connection will cause the next stream creation to be initially
accepted but then rejected via h2_avail_streams(), and the connection
left in a bad state, still attached to the session due to http-reuse
safe, but not reinserted into idle list, since the backend code
currently is not able to properly recover from this situation. Worse,
the idle flags are no longer on it but TASK_F_USR1 still is, and this
makes the recently added BUG_ON() rightfully trigger since this case
is not supposed to happen.
Admittedly more of the backend recovery code needs to be reworked,
however the mux must consistently decide whether or not a connection
may be reused or needs to be released.
This commit fixes the affected logic by introducing a new function
"h2c_reached_last_stream()" which says if a connection has reached its
last stream, regardless of the side, and using this one everywhere
max_id was compared to last_id. This is sufficient to address the
corner case that be_reuse_connection() currently cannot recover from.
This is in relation to GH issue #3215 and it should be sufficient to
fix the issue there. Thanks to Chris Staite for reporting the issue
and kudos to Amaury for spotting the events sequence that can lead
to this situation.
This patch must be backported to 3.3 first, then to older versions
later. It's worth noting that it's much more difficult to observe
the issue before 3.3 because the BUG_ON() is not there, and the
possibly non-released connection might end up being killed for other
reasons (timeouts etc). But one possible visible effect might be the
impossibility to delete a server (which Chris observed in 3.3).
Back in the mists of time, commit e91a526c8f decided that if we were trying
to stay on the same server as the previous request, and if there was
a connection available in the session, we'd remove its CO_FL_SESS_IDLE flag.
The reason for doing that has long been lost; it probably fixed a bug at some
point, but it was most probably not the right place to do that. And starting
with 3.3, this triggers a BUG_ON() because that flag is expected later on.
So just revert the commit; if the ancient bug shows up again, it will be
fixed another way.
This should be backported to 3.3. There is little reason to backport it
to previous versions, unless other patches depend on it.
vtest.yml only builds the releases of OpenSSL for now, there's no way to
check if we still have issues with the API before a pre-release version
is released.
This job builds the master branch of OpenSSL.
It is run every day at 3 AM.
Remove the openssl no-deprecated job which was used for 1.1.0 API.
It's not useful anymore since it uses the OpenSSL version of the
distributions.
Checking deprecations in the API is still useful when using the newest
version of the library. A job for the OpenSSL master branch would be
more useful than that.
The script for running regression tests is modified to improve the
visibility of skipped tests.
Previously, the reasons for skipping tests were only visible during the
test discovery phase when grepping the vtc (REQUIRE, EXCLUDE, etc).
But reg-tests skipped by vtest with the 'feature cmd' keywords were not
listed.
This change introduces the following:
- vtest does not remove the logs itself anymore, because it is not
able to leave the logs available when a test is skipped. So the -L
parameter is now always passed to vtest
- All skipped tests during the discovery phase are now logged to a
'skipped.log' file within the test directory
- The script now parses vtest logs to find tests that were skipped
due to missing features (via the 'feature cmd' in .vtc files)
and adds them to the skipped list.
This issue was reported in GH #3214 where the quic/tls13_ssl_crt-list_filters.vtc
QUIC reg test was run without haproxy QUIC support because the OPENSSL_AWSLC
feature was enabled.
This is due to the fact that when ssl/tls13_ssl_crt-list_filters.vtc was
ported to QUIC, the feature(OPENSSL) check was carelessly replaced by
feature(QUIC), leading the script to be run even without QUIC support if the
OR'ed OPENSSL_AWSLC feature is enabled.
A good method to port these feature() commands to QUIC would have been
to add a feature(QUIC) command separated from the one used for the supported
TLS stacks identified by the original underlying ssl reg tests (in reg-tests/ssl).
This is what is done by this patch.
Thank you to @idl0r for having reported this issue.
In ssl_sock_srv_try_reuse_sess(), the connection is always defined, for both
TCP and QUIC connections. There is no reason to test it. Because it is not so
obvious for the QUIC part, a BUG_ON() could be added here. For now, just
remove the useless tests.
This patch should fix a Coverity report from #3213.
The xprt used to perform a healthcheck is always defined and cannot be NULL.
So there is no reason to test it. It could lead to wrong assumptions later
in the code.
This patch should fix a Coverity report from #3213.
The server's xprt is always defined and cannot be NULL. So there is no
reason to test it. It could lead to wrong assumptions later in the code.
This patch should fix a Coverity report from #3213.
Not every CC algo implements hystart, so only call the method if it is
actually there. Failure to do so will cause crashes if hystart is on
and the algo doesn't implement it.
This should fix github issue #3218.
This should be backported up to 3.0.
The SC_FL_ABRT_DONE flag should never be set when SC_FL_EOS was already
set. Both flags were introduced to replace the old CF_SHUTR and to
have a flag for shuts driven by the stream and a flag for the read0
received by the mux. So both flags must not be seen at the same time on
a SC. It is especially important because some processing is performed
when these flags are set, and wrong decisions may be made.
This patch must be backported as far as 2.8.
The first attempt to fix this issue (c672b2a29 "BUG/MINOR: http-ana:
Properly detect client abort when forwarding the response") was not fully
correct and could be responsible for false reports of client aborts during
the response forwarding. I guess it is possible to truncate the response.
Instead, we must also take care that the client closed on its side, by
checking the SC_FL_EOS flag on the front SC. Indeed, if the client has
aborted, this flag should be set.
This patch should be backported as far as 2.8.
The RX_F_INHERITED flag was ambiguous, as it was used to mark both
listeners inherited from the parent process and listeners duplicated
from another local receiver. This could lead to incorrect behavior
concerning socket unbinding and suspension.
This commit refactors the handling of inherited listeners by splitting
the RX_F_INHERITED flag into two more specific flags:
- RX_F_INHERITED_FD: Indicates a listener inherited from the parent
process via its file descriptor. These listeners should not be unbound
by the master.
- RX_F_INHERITED_SOCK: Indicates a listener that shares a socket with
another one, either by being inherited from the parent or by being
duplicated from another local listener. These listeners should not be
suspended or resumed individually.
Previously, the sharding code was unconditionally using RX_F_INHERITED
when duplicating a file descriptor. In HAProxy versions prior to 3.1,
this led to a file descriptor leak for duplicated unix stats sockets in
the master process. This would eventually cause the master to crash with
a BUG_ON in fd_insert() once the file descriptor limit was reached.
This must be backported as far as 3.0. Branches earlier than 3.0 are
affected but would need a different patch as the logic is different.
Released version 3.4-dev1 with the following main changes :
- BUG/MINOR: jwt: Missing "case" in switch statement
- DOC: configuration: ECH support details
- Revert "MINOR: quic: use dynamic cc_algo on bind_conf"
- MINOR: quic: define quic_cc_algo as const
- MINOR: quic: extract cc-algo parsing in a dedicated function
- MINOR: quic: implement cc-algo server keyword
- BUG/MINOR: quic-be: Missing keywords array NULL termination
- REGTESTS: ssl enable tls12_reuse.vtc for AWS-LC
- REGTESTS: ssl: split tls*_reuse in stateless and stateful resume tests
- BUG/MEDIUM: connection: fix "bc_settings_streams_limit" typo
- BUG/MEDIUM: config: ignore empty args in skipped blocks
- DOC: config: mention clearer that the cache's total-max-size is mandatory
- DOC: config: reorder the cache section's keywords
- BUG/MINOR: quic/ssl: crash in ClientHello callback ssl traces
- BUG/MINOR: quic-be: handshake errors without connection stream closure
- MINOR: quic: Add useful debugging traces in qc_idle_timer_do_rearm()
- REGTESTS: ssl: Move all the SSL certificates, keys, crt-lists inside "certs" directory
- REGTESTS: quic/ssl: ssl/del_ssl_crt-list.vtc supported by QUIC
- REGTESTS: quic: dynamic_server_ssl.vtc supported by QUIC
- REGTESTS: quic: issuers_chain_path.vtc supported by QUIC
- REGTESTS: quic: new_del_ssl_cafile.vtc supported by QUIC
- REGTESTS: quic: ocsp_auto_update.vtc supported by QUIC
- REGTESTS: quic: set_ssl_bug_2265.vtc supported by QUIC
- MINOR: quic: avoid code duplication in TLS alert callback
- BUG/MINOR: quic-be: missing connection stream closure upon TLS alert to send
- REGTESTS: quic: set_ssl_cafile.vtc supported by QUIC
- REGTESTS: quic: set_ssl_cert_noext.vtc supported by QUIC
- REGTESTS: quic: set_ssl_cert.vtc supported by QUIC
- REGTESTS: quic: set_ssl_crlfile.vtc supported by QUIC
- REGTESTS: quic: set_ssl_server_cert.vtc supported by QUIC
- REGTESTS: quic: show_ssl_ocspresponse.vtc supported by QUIC
- REGTESTS: quic: ssl_client_auth.vtc supported by QUIC
- REGTESTS: quic: ssl_client_samples.vtc supported by QUIC
- REGTESTS: quic: ssl_default_server.vtc supported by QUIC
- REGTESTS: quic: new_del_ssl_crlfile.vtc supported by QUIC
- REGTESTS: quic: ssl_frontend_samples.vtc supported by QUIC
- REGTESTS: quic: ssl_server_samples.vtc supported by QUIC
- REGTESTS: quic: ssl_simple_crt-list.vtc supported by QUIC
- REGTESTS: quic: ssl_sni_auto.vtc code provision for QUIC
- REGTESTS: quic: ssl_curve_name.vtc supported by QUIC
- REGTESTS: quic: add_ssl_crt-list.vtc supported by QUIC
- REGTESTS: add ssl_ciphersuites.vtc (TCP & QUIC)
- BUG/MINOR: quic: do not set first the default QUIC curves
- REGTESTS: quic/ssl: Add ssl_curves_selection.vtc
- BUG/MINOR: ssl: Don't allow to set NULL sni
- MEDIUM: quic: Add connection as argument when qc_new_conn() is called
- MINOR: ssl: Add a function to hash SNIs
- MINOR: ssl: Store hash of the SNI for cached TLS sessions
- MINOR: ssl: Compare hashes instead of SNIs when a session is cached
- MINOR: connection/ssl: Store the SNI hash value in the connection itself
- MEDIUM: tcpcheck/backend: Get the connection SNI before initializing SSL ctx
- BUG/MEDIUM: ssl: Don't reuse TLS session if the connection's SNI differs
- MEDIUM: ssl/server: No longer store the SNI of cached TLS sessions
- BUG/MINOR: log: Dump good %B and %U values in logs
- BUG/MEDIUM: http-ana: Don't close server connection on read0 in TUNNEL mode
- DOC: config: Fix description of the spop mode
- DOC: config: Improve spop mode documentation
- MINOR: ssl: Split ssl_crt-list_filters.vtc in two files by TLS version
- REGTESTS: quic: tls13_ssl_crt-list_filters.vtc supported by QUIC
- BUG/MEDIUM: h3: do not access QCS <sd> if not allocated
- CLEANUP: mworker/cli: remove useless variable
- BUG/MINOR: mworker/cli: 'show proc' is limited by buffer size
- BUG/MEDIUM: ssl: Always check the ALPN after handshake
- MINOR: connections: Add a new CO_FL_SSL_NO_CACHED_INFO flag
- BUG/MEDIUM: ssl: Don't store the ALPN for check connections
- BUG/MEDIUM: ssl: Don't resume session for check connections
- CLEANUP: improvements to the alignment macros
- CLEANUP: use the automatic alignment feature
- CLEANUP: more conversions and cleanups for alignment
- BUG/MEDIUM: h3: fix access to QCS <sd> definitely
- MINOR: h2/trace: emit a trace of the received RST_STREAM type
Right now we don't get any state trace when receiving an RST_STREAM, and
this is not convenient because RST_STREAM(0) is not visible at all, except
at developer level where only the function entry and exit appear.
Let's extract the RST code first and always log it using TRACE_PRINTF()
(along with h2c/h2s) so that it's possible to detect certain codes being
used.
The previous patch tried to fix access to QCS <sd> member, as the latter
is not always allocated anymore on the frontend side.
a15f0461a016a664427f5aaad2227adcc622c882
BUG/MEDIUM: h3: do not access QCS <sd> if not allocated
In particular, access was prevented after HEADERS parsing in case
h3_req_headers_to_htx() returned an error, which indicates that the
stream-endpoint allocation was not performed. However, this still is not
enough when QCS instance is already closed at this step. Indeed, in this
case, h3_req_headers_to_htx() returns OK but stream-endpoint allocation
is skipped as an optimization as no data exchange will be performed.
To definitely fix this kind of problem, add checks on the qcs <sd> member
before accessing it in the H3 layer. This method is the safest one to ensure
there is no NULL dereference.
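As a rough illustration of the pattern (simplified, not the exact haproxy
code; the helper name is made up):

/* the stream-endpoint is optional, so the H3 layer must check it
 * before any dereference */
static inline int h3_has_sd(const struct qcs *qcs)
{
        return qcs->sd != NULL;
}

/* usage: if (h3_has_sd(qcs)) { ... update the known input length ... } */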
This should fix github issue #3211.
This must be backported along with the above mentioned patch.
- Convert additional cases to use the automatic alignment feature for
the THREAD_ALIGN(ED) macros. This includes some cases that are less
obviously correct where it seems we wanted to align only in the
USE_THREAD case but were not using the thread specific macros.
- Also move some alignment requirements to the structure definition
instead of having it on variable declaration.
- Use the automatic alignment feature instead of hardcoding 64 all over
the code.
- This also converts a few bare __attribute__((aligned(X))) to using the
ALIGNED macro.
- It is now possible to use the THREAD_ALIGN and THREAD_ALIGNED macros
without a parameter. In this case, we automatically align on the cache
line size (a short usage sketch follows this list).
- The cache line size is set to 64 by default to match the current code,
but it can be overridden on the command line.
- This required moving the DEFVAL/DEFNULL/DEFZERO macros to compiler.h
instead of tools-t.h, to avoid namespace pollution if we included
tools-t.h from compiler.h.
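A short usage sketch of the parameterless form described above (the struct
and variable names are only illustrative):

struct my_ctx { unsigned long counter; };

/* before: the cache line size was hardcoded on the declaration */
static struct my_ctx ctx_old __attribute__((aligned(64)));

/* after: the macro picks the cache line size (64 by default) */
static struct my_ctx ctx_new THREAD_ALIGNED();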
Don't attempt to use stored sessions when creating new check
connections, as the check SSL parameters might be different from the
server's ones.
This has not been proven to be a problem yet, but it doesn't mean it
can't be, and this should be backported up to 2.8 along with
dcce9369129f6ca9b8eed6b451c0e20c226af2e3 if it is.
When establishing check connections, do not store the negotiated ALPN
into the server's path_param if the connection is a check connection, as
it may use different SSL parameters than the regular connections. To do
so, only store it if the CO_FL_SSL_NO_CACHED_INFO flag is not set.
Otherwise, the check ALPN may be stored, and the wrong mux can be used
for regular connections, which will end up generating 502s.
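A rough sketch of the check (store_negotiated_alpn() is a hypothetical
helper standing for the path_param update):

static void maybe_cache_alpn(struct connection *conn, struct server *srv,
                             const char *alpn, int alpn_len)
{
        /* check connections carry CO_FL_SSL_NO_CACHED_INFO: never cache
         * their ALPN, otherwise regular traffic may pick the wrong mux */
        if (conn->flags & CO_FL_SSL_NO_CACHED_INFO)
                return;

        store_negotiated_alpn(srv, alpn, alpn_len);
}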
This should fix GitHub issue #3207.
This should be backported to 3.3.
Add a new flag to connections, CO_FL_SSL_NO_CACHED_INFO, and set it for
checks.
It lets the ssl layer know that it should not use cached information,
such as the ALPN stored in the server, or cached sessions.
This will be used for checks, as checks may target different servers, or
use a different SSL configuration, so we can't assume the stored
information is correct.
This should be backported to 3.3, and may be backported up to 2.8 if the
attempt to do session resumption from checks is proven to be a problem.
Move the code that is responsible for checking the ALPN, and updating
the one stored in the server's path_param, from after we created the
mux, to after we performed a handshake. Once this has been done once, the mux will not
be created by the ssl code anymore, as when we know which mux to use
thanks to the ALPN, it will be done earlier in connect_server(), so in
the unlikely event it changes, we would not detect it anymore, and we'd
keep on creating the wrong mux.
This can be reproduced by doing a first request, and then changing the
ALPN of the server without haproxy noticing (ie without haproxy noticing
that the server went down).
This should be backported to 3.3.
In ticket #3204, it was reported that "show proc" is not able to display
more than 202 processes. Indeed the bufsize is 16k by default in the
master, and can't be changed anymore since 3.1.
This patch allows 'show proc' to resume dumping when the buffer
is full, based on the timestamp of the last PID it attempted to dump.
Using pointers or counting the number of processes might not be a good idea
since the list can change between calls.
Could be backported to all stable branches.
Since the following commit, allocation of QCS stream-endpoint on FE side
has been delayed. The objective is to allocate it only for QCS attached
to an upper stream object. Stream-endpoint allocation is now performed
on qcs_attach_sc() called during HEADERS parsing.
commit e6064c561684d9b079e3b5725d38dc3b5c1b5cd5
OPTIM: mux-quic: delay FE sedesc alloc to stream creation
Also, stream-endpoint is accessed through the QCS instance after HEADERS
or DATA frames parsing, to update the known input payload length. The
above patch triggered regressions as in some code paths, <sd> field is
dereferenced while still being NULL.
This patch fixes this by guarding access to the <sd> field with stricter
conditions.
First, after HEADERS parsing, the known input length is only updated if
h3_req_headers_to_htx() previously returned a success value, which
guarantees that qcs_attach_sc() has been executed.
After DATA parsing, <sd> is only accessed after the frame validity
check. This ensures that HEADERS were already parsed, thus guaranteeing
that the stream-endpoint is allocated.
This should fix github issue #3211.
This must be backported up to 3.3. This is sufficient, unless above
patch is backported to previous releases, in which case the current one
must be picked with it.
ssl/tls13_ssl_crt-list_filters.vtc was renamed to ssl/tls13_ssl_crt-list_filters.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then tls13_ssl_crt-list_filters.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
Split ssl_crt-list_filters.vtc, which supports the TLS 1.2 and 1.3 versions,
to produce tls12_ssl_crt-list_filters.vtc and tls13_ssl_crt-list_filters.vtc.
The spop mode description was a bit confusing. So let's improve it.
Thanks to @NickMRamirez.
This patch should fix issue #3206. It could be backported as far as 3.1.
It was mentioned that the spop mode turned the backend into a "log"
backend. That is obviously wrong: it turns the backend into a spop backend.
This patch should be backported as far as 3.1.
It is a very old bug (2012), dating from the introduction of the keep-alive
support in HAProxy. When a request is fully received, the SC on the backend
side is switched to NOHALF mode. It means that when the read0 is received
from the server, the server connection is immediately closed. It is expected
to do so at the end of a classical request. However, it must not be performed
if the session is switched to the TUNNEL mode (after an HTTP/1 upgrade or a
CONNECT). The client may still have data to send to the server, and brutally
closing the server connection this way will be handled as an error on the
client side.
This bug is especially visible with an H2 connection on the client side
because a RST_STREAM is emitted and a "SD--" is reported in logs.
Thanks to @chrisstaite
This patch should fix the issue #3205. It must be backported to all stable
versions.
When per-stream "bytes_in" and "bytes_out" counters were replaced in 3.3,
the wrong counters were used for %B and %U values in logs. In the
configuration manual and the commit message, it was specified that
"bytes_in" was replaced by "req_in" and "bytes_out" by "res_in", but in the
code, the wrong counters were used. It is now fixed.
This patch should fix the issue #3208. It must be backported to 3.3.
Thanks to the previous patch, "BUG/MEDIUM: ssl: Don't reuse TLS session
if the connection's SNI differs", it is now useless to store the SNI of
cached TLS sessions. This SNI is no longer tested and new connections
reusing a session must have the same SNI.
The main change here is for the ssl_sock_set_servername() function. It is no
longer possible to compare the SNI of the reused session with the one of the
new connection. So, the SNI is always set, with no other processing. Mainly,
the session is not destroyed when SNIs don't match. It means the commit
119a4084bf ("BUG/MEDIUM: ssl: for a handshake when server-side SNI changes")
is implicitly reverted.
It is good to note that it is unclear to me when and why the reused session
should be destroyed, because I'm unable to reproduce any issue fixed by the
commit above.
This patch could be backported as far as 3.0 with the commit above.
When a new SSL server connection is created, if no SNI is set, it may
inherit the one of the reused TLS session. The bug was
introduced by the commit 95ac5fe4a ("MEDIUM: ssl_sock: always use the SSL's
server name, not the one from the tid"). The mixup is possible between
regular connections but also with health-check connections.
But it is only the visible part of the bug. If the SNI of the cached TLS
session does not match the one of the new connection, no reuse must be
performed at all.
To fix the bug, the hash of the SNI of the reused session is compared with the
one of the new connection. The TLS session is reused only if the hashes are
the same.
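As a rough sketch of the reuse condition (the cached-session structure and
the reuse_cached_session() helper are hypothetical stand-ins; <sni_hash> on
the connection comes from the series listed below):

if (conn->sni_hash == cached->sni_hash)
        reuse_cached_session(ssl, cached);   /* same SNI: reuse is safe */
/* otherwise a fresh session is negotiated for this SNI */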
This patch should fix the issue #3195. It must be slowly backported as far
as 3.0. It relies on the following series:
* MEDIUM: tcpcheck/backend: Get the connection SNI before initializing SSL ctx
* MINOR: connection/ssl: Store the SNI hash value in the connection itself
* MEDIUM: ssl: Store hash of the SNI for cached TLS sessions
* MINOR: ssl: Add a function to hash SNIs
* MEDIUM: quic: Add connection as argument when qc_new_conn() is called
* BUG/MINOR: ssl: Don't allow to set NULL sni
The SNI of a new connection is now retrieved earlier, before the
initialization of the SSL context. So, concretely, it is now performed
before calling conn_prepare(). The SNI is then set just after.
When a SNI is set on a new connection, its hash is now saved in the
connection itself. To do so, a dedicated field was added into the connection
structure, called sni_hash. For now, this value is only used when the TLS
session is cached.
This patch relies on the commit "MINOR: ssl: Store hash of the SNI for
cached TLS sessions". We now use the hash of the SNIs instead of the SNIs
themselves to know if we must update the cached SNI or not.
For cached TLS sessions, in addition to the SNI itself, its hash is now also
saved. No changes are expected here because this hash is not used for now.
This commit relies on:
* MINOR: ssl: Add a function to hash SNIs
This patch only adds the function ssl_sock_sni_hash() that can be used to
get the hash value corresponding to an SNI. A global seed, sni_hash_seed, is
used.
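A minimal sketch of the idea (not the actual implementation):

#include <stdint.h>
#include <string.h>
#include <import/xxhash.h>

extern uint64_t sni_hash_seed;   /* global seed, assumed set at boot */

static inline uint64_t ssl_sock_sni_hash(const char *sni)
{
        /* hash the SNI with a process-wide random seed so that only
         * hashes need to be stored and compared later */
        return XXH64(sni, strlen(sni), sni_hash_seed);
}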
This patch reverts the commit efe60745b ("MINOR: quic: remove connection arg
from qc_new_conn()"). The connection will be mandatory when the QUIC
connection is created on backend side to fix an issue when we try to reuse a
TLS session.
So, the connection is again an argument of qc_new_conn(), the 4th
argument. It is NULL for frontend QUIC connections but there is no special
check on it.
The ssl_sock_set_servername() function was documented to support a NULL sni
to unset it. However, the man page of SSL_get_servername() does not mention
whether it is supported or not. And it is in fact not supported by WolfSSL
and leads to a crash if we do so.
For now, this function is never called with a NULL sni, so it is better and
safer to forbid this case. Now, if the sni is NULL, the function does
nothing.
This patch could be backported to all stable versions.
This reg test ensures the curves may be correctly set for frontends
and backends by "ssl-default-bind-curves" and "ssl-default-server-curves"
as global options or with "curves" options on "bind" and "server" lines.
This patch impacts both the QUIC frontends and listeners.
Note that "ssl-default-bind-ciphersuites", "ssl-default-bind-curves",
are not ignored by QUIC by the frontend. This is also the case for the
backends with "ssl-default-server-ciphersuites" and "ssl-default-server-curves".
These settings are set by ssl_sock_prepare_ctx() for the frontends and
by ssl_sock_prepare_srv_ssl_ctx() for the backends. But ssl_quic_initial_ctx()
first sets the default QUIC frontends (see <quic_ciphers> and <quic_groups>)
before these ssl_sock.c function are called, leading some TLS stack to
refuse them if they do not support them. This is the case for some OpenSSL 3.5
stack with FIPS support. They do not support X25519.
To fix this, set the default QUIC ciphersuites and curves only if not already
set by the settings mentioned above.
Rename <quic_ciphers> global variable to <default_quic_ciphersuites>
and <quic_groups> to <default_quic_curves> to reflect the OpenSSL API naming.
These options are taken into account by ssl_quic_initial_ctx()
which inspects these four variables before calling SSL_CTX_set_ciphersuites()
with <default_quic_ciphersuites> as parameter and SSL_CTX_set_curves() with
<default_quic_curves> as parameter if needed, that is to say, if no ciphersuites
and curves were set by "ssl-default-bind-ciphersuites", "ssl-default-bind-curves"
as global options or "ciphersuites", "curves" as "bind" line options.
Note that the bind_conf struct is not modified when no "ciphersuites" or
"curves" option are used on "bind" lines.
On backend side, rely on ssl_sock_init_srv() to set the server ciphersuites
and curves. This function is modified to use respectively <default_quic_ciphersuites>
and <default_quic_curves> if no ciphersuites and curves were set by
"ssl-default-server-ciphersuites", "ssl-default-server-curves" as global options
or "ciphersuites", "curves" as "server" line options.
Thanks to @rwagoner for reporting this issue in GH #3194 when using
an OpenSSL 3.5.4 stack with FIPS support.
Must be backported as far as 2.6.
This reg test ensures the ciphersuites may be correctly set for frontends
and backends by "ssl-default-bind-ciphersuites" and "ssl-default-server-ciphersuites"
as global options or with "ciphersuites" options on "bind" and "server" lines.
ssl/add_ssl_crt-list.vtc was renamed to ssl/add_ssl_crt-list.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then add_ssl_crt-list.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/ssl_curve_name.vtc was renamed to ssl/ssl_curve_name.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then ssl_curve_name.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
Note that this script works by chance for QUIC because the curves
selection matches the default ones used by QUIC.
ssl/ssl_sni_auto.vtc was renamed to ssl/ssl_sni_auto.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then ssl_sni_auto.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
Mark the test as broken for QUIC
ssl/ssl_simple_crt-list.vtc was renamed to ssl/ssl_simple_crt-list.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then ssl_simple_crt-list.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/ssl_server_samples.vtc was renamed to ssl/ssl_server_samples.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then ssl_server_samples.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/ssl_frontend_samples.vtc was renamed to ssl/ssl_frontend_samples.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then ssl_frontend_samples.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/new_del_ssl_crlfile.vtc was renamed to ssl/new_del_ssl_crlfile.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then new_del_ssl_crlfile.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/ssl_default_server.vtc was renamed to ssl/ssl_default_server.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then ssl_default_server.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/ssl_client_samples.vtc was renamed to ssl/ssl_client_samples.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then ssl_client_samples.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/ssl_client_auth.vtc was renamed to ssl/ssl_client_auth.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then ssl_client_auth.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/show_ssl_ocspresponse.vtc was renamed to ssl/show_ssl_ocspresponse.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then show_ssl_ocspresponse.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/set_ssl_server_cert.vtc was renamed to ssl/set_ssl_server_cert.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then set_ssl_server_cert.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/set_ssl_crlfile.vtc was renamed to ssl/set_ssl_crlfile.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then set_ssl_crlfile.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/set_ssl_cert.vtc was renamed to ssl/set_ssl_cert.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then set_ssl_cert.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/set_ssl_cert_noext.vtc was renamed to ssl/set_ssl_cert_noext.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then set_ssl_cert_noext.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/set_ssl_cafile.vtc was renamed to ssl/set_ssl_cafile.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then set_ssl_cafile.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
This is the same issue as the one fixed by this commit:
BUG/MINOR: quic-be: handshake errors without connection stream closure
But this time it happens when the client has to send an alert to the server.
The fix consists in creating the mux after having set the handshake connection
error flag and error_code.
This bug was revealed by ssl/set_ssl_cafile.vtc reg test.
Depends on this commit:
MINOR: quic: avoid code duplication in TLS alert callback
Must be backported to 3.3
The OpenSSL QUIC API TLS alert callback ha_quic_ossl_alert() does exactly
the same thing as the one for the quictls API, even if the parameters have
different types.
Call the ha_quic_send_alert() quictls callback from the ha_quic_ossl_alert()
OpenSSL QUIC API callback to avoid such code duplication.
ssl/set_ssl_bug_2265.vtc was renamed to ssl/set_ssl_bug_2265.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then set_ssl_bug_2265.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/ocsp_auto_update.vtc was renamed to ssl/ocsp_auto_update.vtci
to produce a common part runnable both for QUIC and TCP listeners.
Then ocsp_auto_update.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC listeners and "stream" for TCP listeners);
ssl/new_del_ssl_cafile.vtc was renamed to ssl/new_del_ssl_cafile.vtci
to produce a common part runnable both for QUIC and TCP connections.
Then new_del_ssl_cafile.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC connection and "stream" for TCP connections);
ssl/issuers_chain_path.vtc was renamed to ssl/issuers_chain_path.vtci
to produce a common part runnable both for QUIC and TCP connections.
Then issuers_chain_path.vtc files were created both under ssl and quic directories
to call this .vtci file with correct VTC_SOCK_TYPE environment values
("quic" for QUIC connection and "stream" for TCP connections);
ssl/dynamic_server_ssl.vtc was renamed to ssl/dynamic_server_ssl.vtci
to produce a common part runnable both for QUIC and TCP connections.
Then dynamic_server_ssl.vtc files were created both under ssl and quic directories
to call the .vtci file with the correct VTC_SOCK_TYPE environment value.
Note that VTC_SOCK_TYPE may be resolved in haproxy -cli { } sections.
Extract from ssl/del_ssl_crt-list.vtc the common part to produce
ssl/del_ssl_crt-list.vtci which may be reused by QUIC and TCP
from respectively quic/del_ssl_crt-list.vtc and ssl/del_ssl_crt-list.vtc
thanks to "include" VTC command and VTC_SOCK_TYPE special vtest environment
variable.
Move all these files, and others for OCSP tests, found in reg-tests/ssl
to reg-tests/ssl/certs and adapt all the VTC files which use them.
This patch is needed by other tests which have to include the SSL tests.
Indeed, some VTC commands contain paths to these files which cannot
be customized with environment variables, depending on the location the VTC file
is run from, because VTC does not resolve the environment variables. Only macros
such as ${testdir} can be resolved.
For instance this command, run from a VTC file in the reg-tests/ssl directory,
cannot be reused from another directory, except if we add a symbolic link for
each cert, key, etc.
haproxy h1 -cli {
send "del ssl crt-list ${testdir}/localhost.crt-list ${testdir}/common.pem:1"
}
This is not what we want. Instead, we add a symbolic link to reg-tests/ssl/certs
in the test directory and modify the command above as follows:
haproxy h1 -cli {
send "del ssl crt-list ${testdir}/certs/localhost.crt-list ${testdir}/certs/common.pem:1"
}
Traces were missing in this function.
Also add information about the connection struct from qc->conn, when it is
initialized, to all the traces.
Should be easily backported as far as 2.6.
This bug was revealed on the backend side by reg-tests/ssl/del_ssl_crt-list.vtc
when run with QUIC connections. As expected by the test, a TLS alert is
generated on the server side. The latter sends a CONNECTION_CLOSE frame with a
CRYPTO error (>= 0x100). In this case the client closes its QUIC connection.
But the stream connection was not informed. This leads the connection to be
closed only after the server timeout expiration, while it should be closed
asap. This is the reason why reg-tests/ssl/del_ssl_crt-list.vtc could succeed
or fail, but only after a 5 second delay.
To fix this, mimic ssl_sock_io_cb() for TCP/SSL connections. Call the same
code this patch implements with ssl_sock_handle_hs_error() to correctly
handle the handshake errors. Note that some SSL counters were not incremented
for both the backends and frontends. After such errors, ssl_sock_io_cb()
starts the mux after the connection has been flagged in error. This has the
side effect of closing the stream in conn_create_mux().
Must be backported to 3.3, only for the backends. It is not sure at this time
whether this bug may impact the frontends.
Such crashes may occur for QUIC frontends only when the SSL traces are enabled.
The ssl_sock_switchctx_cbk() ClientHello callback may be called without any
initialized connection (<conn>) for QUIC connections, leading to crashes when
passing conn->err_code to TRACE_ERROR().
Modify the TRACE_ERROR() statement to pass this parameter only when <conn> is
initialized.
Must be backported as far as 3.2.
Probably due to historical accumulation, keywords were in a random
order that doesn't help when looking them up. Let's just reorder them
in alphabetical order like other sections. This can be backported.
As reported by Christian Ruppert in GH issue #3203, we're having an
issue with checks for empty args in skipped blocks: the check is
performed after the line is tokenized, without considering the case
where it's disabled due to outer false .if/.else conditions. Because
of this, a test like this one:
.if defined(SRV1_ADDR)
server srv1 "$SRV1_ADDR"
.endif
will fail when SRV1_ADDR is empty or not set, saying that this will
result in an empty arg on the line.
The solution consists in postponing this check after the conditions
evaluation so that disabled lines are already skipped. And for this
to be possible, we need to move "errptr" one level above so that it
remains accessible there.
This will need to be backported to 3.3 and wherever commit 1968731765
("BUG/MEDIUM: config: solve the empty argument problem again") is
backported. As such it is also related to GH issue #2367.
The keyword was correct in the doc but in the code it was spelled
with a missing 's' after 'settings', making it unavailable. Since
there was no other way to find this but reading the code, it's safe
to simply fix it and assume nobody relied on the wrong spelling.
In the worst case for older backports it can also be duplicated.
This must be backported to 3.0.
Simplify ssl_reuse.vtci so it can be started with variables:
- SSL_CACHESIZE allows specifying the size of the session cache for
the frontend
- NO_TLS_TICKETS allows specifying the "no-tls-tickets" option on the bind line
It introduces these files:
- ssl/tls12_resume_stateful.vtc
- ssl/tls12_resume_stateless.vtc
- ssl/tls13_resume_stateless.vtc
- ssl/tls13_resume_stateful.vtc
- quic/tls13_resume_stateless.vtc
- quic/tls13_resume_stateful.vtc
- quic/tls13_0rtt_stateful.vtc
- quic/tls13_0rtt_stateless.vtc
stateful files have "no-tls-tickets" + tune.tls.cachesize 20000
stateless files have "tls-tickets" + tune.tls.cachesize 0
This allows enabling AWS-LC on TCP TLS1.2 and TCP TLS1.3+tickets.
TLS1.2+stateless does not seem to work on WolfSSL.
The TLS resume test was never started with AWS-LC because the TLS1.3
part was not working. Since we split the reg-tests with a TLS1.2 part
and a TLS1.3 part, we can enable the TLS1.2 part for AWS-LC.
Extend QUIC server configuration so that congestion algorithm and
maximum window size can be set on the server line. This can be achieved
using quic-cc-algo keyword with a syntax similar to a bind line.
This should be backported up to 3.3 as this feature is considered
necessary for full QUIC backend support. Note that this relies on the
series of previous commits which should be picked first.
Extract code from bind_parse_quic_cc_algo() related to pure parsing of
quic-cc-algo keyword. The objective is to be able to quickly duplicate
this option on the server line.
This may need to be backported to support QUIC congestion control
algorithm support on the server line in version 3.3.
Each QUIC congestion algorithm is defined as a structure with callbacks
in it. Every quic_conn has a member pointing to the configured
algorithm, inherited from the bind-conf keyword or set to the default CUBIC
value.
Convert all these definitions to const. This ensures that there never
will be an accidental modification of a globally shared structure. This
also requires marking the quic_cc_algo field in bind_conf and quic_cc as
const.
This reverts commit a6504c9cfb6bb48ae93babb76a2ab10ddb014a79.
Each supported QUIC algo is associated with a set of callbacks defined
in a structure quic_cc_algo. Originally, bind_conf would use a constant
pointer to one of these definitions.
During pacing implementation, this field was transformed into a
dynamically allocated value copied from the original definition. The
idea was to be able to tweak settings at the listener level. However,
this was never used in practice. As such, revert to the original model.
This may need to be backported to support QUIC congestion control
algorithm support on the server line in version 3.3.
Because of a missing "case" keyword in front of the values in a switch
statement, the values were interpreted as goto labels and the switch
statement became useless.
This patch should fix GitHub issue #3200.
The fix should be backported up to 2.8.
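A minimal illustration of the bug pattern (not the actual code): without
the "case" keyword, the enum values are parsed as goto labels, so the
switch never matches anything.

enum unit { UNIT_SEC, UNIT_MIN };

static int to_seconds(enum unit u)
{
        switch (u) {
        UNIT_SEC:          /* should be "case UNIT_SEC:" */
                return 1;
        UNIT_MIN:          /* should be "case UNIT_MIN:" */
                return 60;
        }
        return 0;          /* always reached: the switch body is skipped */
}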
Released version 3.3.0 with the following main changes :
- BUG/MINOR: acme: better challenge_ready processing
- BUG/MINOR: acme: warning ‘ctx’ may be used uninitialized
- MINOR: httpclient: complete the https log
- BUG/MEDIUM: server: do not use default SNI if manually set
- BUG/MINOR: freq_ctr: Prevent possible signed overflow in freq_ctr_overshoot_period
- DOC: ssl: Document the restrictions on 0RTT.
- DOC: ssl: Note that 0rtt works fork QUIC with QuicTLS too.
- BUG/MEDIUM: quic: do not prevent sending if no BE token
- BUG/MINOR: quic/server: free quic_retry_token on srv drop
- MINOR: quic: split global CID tree between FE and BE sides
- MINOR: quic: use separate global quic_conns FE/BE lists
- MINOR: quic: add "clo" filter on show quic
- MINOR: quic: dump backend connections on show quic
- MINOR: quic: mark backend conns on show quic
- BUG/MINOR: quic: fix uninit list on show quic handler
- BUG/MINOR: quic: release BE quic_conn on connect failure
- BUG/MINOR: server: fix srv_drop() crash on partially init srv
- BUG/MINOR: h3: do no crash on forwarding multiple chained response
- BUG/MINOR: h3: handle properly buf alloc failure on response forwarding
- BUG/MEDIUM: server/ssl: Unset the SNI for new server connections if none is set
- BUG/MINOR: acme: fix ha_alert() call
- Revert "BUG/MEDIUM: server/ssl: Unset the SNI for new server connections if none is set"
- BUG/MINOR: sock-inet: ignore conntrack for transparent sockets on Linux
- DEV: patchbot: prepare for new version 3.4-dev
- DOC: update INSTALL with the range of gcc compilers and openssl versions
- MINOR: version: mention that 3.3 is stable now
As reported in github issue #3192, in certain situations with transparent
listeners, it is possible to get the incoming connection's destination
wrong via SO_ORIGINAL_DST. Two cases were identified thus far:
- incorrect conntrack configuration where NOTRACK is used only on
incoming packets, resulting in reverse connections being created
from response packets. It's then mostly a matter of timing, i.e.
whether or not the connection is confirmed before the source is
retrieved, but in this case the connection's destination address
as retrieved by SO_ORIGINAL_DST is the client's address.
- late outgoing retransmit that recreates a just expired conntrack
entry, in reverse direction as well. It's possible that combinations
of RST or FIN might play a role here in speeding up conntrack eviction,
as well as the rollover of source ports on the client whose new
connection matches an older one and simply refreshes it due to
nf_conntrack_tcp_loose being set by default.
TPROXY doesn't require conntrack, only REDIRECT, DNAT etc do. However
the system doesn't offer any option to know how a conntrack entry was
created (i.e. normally or via a response packet) to let us know that
it's pointless to check the original destination, nor does it permit
to access the local vs peer addresses in opposition to src/dst which
can be wrong in this case.
One alternate approach could consist in only checking SO_ORIGINAL_DST
for listening sockets not configured with the "transparent" option,
but the problem here is that our low-level API only works with FDs
without knowing their purpose, so it's unknown there that the fd
corresponds to a listener, let alone in transparent mode.
A (slightly more expensive) variant of this approach here consists in
checking on the socket itself that it was accepted in transparent mode
using IP_TRANSPARENT, and skip SO_ORIGINAL_DST if this is the case.
This does the job well enough (no more client addresses appearing in
the dst field) and remains a good compromise. A future improvement of
the API could permit to pass the transparent flag down the stack to
that function.
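A minimal sketch of the idea (simplified, not the actual haproxy code):

#include <sys/socket.h>
#include <netinet/in.h>

#ifndef SO_ORIGINAL_DST
#define SO_ORIGINAL_DST 80   /* from linux/netfilter_ipv4.h */
#endif

/* skip the SO_ORIGINAL_DST lookup when the socket was accepted in
 * transparent mode, since the local address is then already correct */
static int get_original_dst(int fd, struct sockaddr_in *dst)
{
        int transparent = 0;
        socklen_t len = sizeof(transparent);

        if (getsockopt(fd, IPPROTO_IP, IP_TRANSPARENT, &transparent, &len) == 0 &&
            transparent)
                return -1;  /* TPROXY socket: do not trust SO_ORIGINAL_DST */

        len = sizeof(*dst);
        return getsockopt(fd, IPPROTO_IP, SO_ORIGINAL_DST, dst, &len);
}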
This should be backported to stable versions after some observation
in latest -dev.
For reference, here are some links to older conversations on that topic
that Lukas found during this analysis:
https://lists.openwall.net/netdev/2019/01/12/34
https://discourse.haproxy.org/t/send-proxy-not-modifying-some-traffic-with-proxy-ip-port-details/3336/9
https://www.mail-archive.com/haproxy@formilux.org/msg32199.html
https://lists.openwall.net/netdev/2019/01/23/114
This reverts commit de29000e602bda55d32c266252ef63824e838ac0.
The fix was in fact invalid. First, it is not supported by WolfSSL to call
SSL_set_tlsext_host_name with a NULL hostname. Then, it is not specified
as supported by other SSL libraries either.
But, by reviewing the root cause of this bug, it appears there is an issue
with the reuse of TLS sessions. It must not be performed if the SNI does not
match. A TLS session created with a SNI must not be reused with another
SNI. The side effects are not clear but functionally speaking, it is
invalid.
So, for now, the commit above was reverted because it is invalid and it
crashes with WolfSSL. Then the init of the SSL connection must be reworked
to get the SNI earlier, to be able to reuse or not an existing TLS
session.
When a new SSL server connection is created, if no SNI is set, it may
inherit the one of the reused TLS session. The bug was
introduced by the commit 95ac5fe4a ("MEDIUM: ssl_sock: always use the SSL's
server name, not the one from the tid"). The mixup is possible between
regular connections but also with health-check connections.
To fix the issue, when no SNI is set, for regular server connections and for
health-check connections, the SNI must explicitly be disabled by calling
ssl_sock_set_servername() with the hostname set to NULL.
Many thanks to Lukas for his detailed bug report.
This patch should fix the issue #3195. It must be backported as far as 3.0.
Replace BUG_ON() for buffer alloc failure on h3_resp_headers_to_htx() by
proper error handling. An error status is reported which should be
sufficient to initiate connection closure.
No need to backport.
h3_resp_headers_to_htx() is the function used to convert an HTTP/3
response into a HTX message. It was introduced on this release for QUIC
backend support.
A BUG_ON() would occur if multiple responses are forwarded
simultaneously on a stream without rcv_buf in between. Fix this by
removing it. Instead, if the QCS HTX buffer is not empty when handling
a new response, prefer to pause the demux operation. This is restarted when
the buffer has been read and emptied by the upper stream layer.
No need to backport.
A recent patch has introduced free operation for QUIC tokens stored in a
server. These values are located in <per_thr> server array.
However, a server instance may be released prior to its full
initialization in case of a failure during "add server" CLI command. The
mentioned patch would cause a srv_drop() crash due to an invalid usage
of the NULL <per_thr> member.
Fix this by adding a check on <per_thr> prior to dereferencing it in
srv_drop().
No need to backport.
If quic_connect_server() fails, the quic_conn FD will remain unopened, set
to -1. Backend connections do not have a fallback socket for future
exchanges, contrary to frontend ones which can use the listener FD. As
such, it is better to release these connections early.
This patch adjusts such failure by extending quic_close(). This function
is called by the upper layer immediately after a connect issue. In this
case, release immediately a quic_conn backend instance if the FD is
unset, which means that connect has previously failed.
Also, quic_conn_release() is extended to ensure that such faulty
connections are immediately freed and not converted into a
quic_conn_closed instance.
Prior to this patch, a backend quic_conn without any FD would remain
allocated and possibly active. If its tasklet is executed, this resulted
in a crash due to access to an invalid FD.
No need to backport.
A recent patch has extended "show quic" capability. It is now possible
to list a specific list of connections, either active frontend, closing
frontend or backend connections.
An issue was introduced as the list is local storage. As this command is
reentrant, show quic context must be extended so that the currently
inspected list is also saved.
This issue was reported by GCC, which warns about an uninitialized value
depending on branching conditions.
Add an extra "(B)" marker when displaying a backend connection during a
"show quic". This is useful to differentiate them with the frontend side
when displaying all connections.
Add a new "be" filter to "show quic". Its purpose is to be able to
display backend connections. These connections can also be listed using
"all" filter.
Each quic_conn instance is stored in a global list. Its purpose is to be
able to loop over all known connections during "show quic".
Split this into two separate lists for frontend and backend usage.
Another change is that closing backend connections do not move into
quic_conns_clo list. They remain instead in their original list. The
objective of this patch is to reduce the contention between the two
sides.
Note that this prevents backend connections from being listed in "show quic"
for now. This will be adjusted in a future patch.
QUIC CIDs are stored in a global tree. Prior to this patch, CIDs used on
both frontend and backend sides were mixed together.
This patch implements CID storage separation between the FE and BE sides. The
original quic_cid_trees tree is split into
quic_fe_cid_trees/quic_be_cid_trees.
This patch should reduce contention between frontend and backend usages.
Also, it should reduce the risk of random CID collision.
A recent patch has implemented caching of QUIC token received from a
NEW_TOKEN frame into the server cache. This value is stored per thread
into a <quic_retry_token> field.
This field is an ist, first set to an empty string. Via
qc_try_store_new_token(), it is reallocated to fit the size of the newly
stored token. Prior to this patch, the field was never freed, which
caused a memory leak.
Fix this by using istfree() on <quic_retry_token> field during
srv_drop().
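A rough sketch of the fix (simplified):

/* release the per-thread cached tokens when the server is dropped */
for (int i = 0; i < global.nbthread; i++)
        istfree(&srv->per_thr[i].quic_retry_token);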
No need to backport.
For QUIC client support, a token may be emitted along with INITIAL
packets during the handshake. The token is encoded during emission via
qc_enc_token() called by qc_build_pkt().
The token may be provided from different sources. First, it can be
retrieved via <retry_token> quic_conn member when a Retry packet was
received. If not present, a token may be reused from the server cache,
populated from a NEW_TOKEN received from a previous connection.
Prior to this patch, the last method may cause an issue. If the upper
connection instance is released prior to the handshake completion, this
prevents access to a possible server token. This is considered an error
by qc_enc_token(). The error is reported up to the calling functions,
preventing any emission from being performed. In the end, this prevented
either the full quic_conn release or its conversion into a quic_conn_closed
instance until the idle timeout completion (30s by default). With abortonclose
set now by default on HTTP frontends, early client shutdowns can easily
cause excessive memory consumption.
To fix this, change qc_enc_token() so that if the connection is closed, no
token is encoded but also no error is reported. This allows emission to
continue and permits an early connection release.
No need to backport.
Document that with QUIC, 0RTT only works with OpenSSL >= 3.5.2 and
AWS-LC, and for TLS/TCP, it only works with OpenSSL, and frontends
require that an ALPN be sent by the client to use the early data before
the handshake.
All of the other bandwidth-limiting code stores limits and intermediate
(byte) counters as unsigned integers. The exception here is
freq_ctr_overshoot_period which takes in unsigned values but returns a
signed value. While this has the benefit of letting the caller know how
far away from overshooting they are, this is not currently leveraged
anywhere in the codebase, and it has the downside of halving the positive
range of the result.
More concretely though, returning a signed integer when all intermediate
values are unsigned (and boundaries are not checked) could result in an
overflow, producing values that are at best unexpected. In the case of
flt_bwlim (the only usage of freq_ctr_overshoot_period in the codebase at
the time of writing), an overflow could cause the filter to wait for a
large number of milliseconds when in fact it shouldn't wait at all.
This is a niche possibility, because it requires that a bandwidth limit is
defined in the range [2^31, 2^32). In this case, the raw limit value would
not fit into a signed integer, and close to the end of the period, the
`(elapsed * freq)/period` calculation could produce a value which also
doesn't fit into a signed integer.
If at the same time `curr` (the number of events counted so far in the
current period) is small, then we could get a very large negative value
which overflows. This is undefined behaviour and could produce surprising
results. The most obvious outcome is flt_bwlim sometimes waiting for a
large amount of time in a case where it shouldn't wait at all, thereby
incorrectly slowing down the flow of data.
Converting just the return type from signed to unsigned (and checking for
the overflow) prevents this undefined behaviour. It also makes the range
of valid values consistent between the input and output of
freq_ctr_overshoot_period and with the input and output of other freq_ctr
functions, thereby reducing the potential for surprise in intermediate
calculations: now everything supports the full 0 - 2^32 range.
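A minimal sketch of the idea (not the actual freq_ctr implementation, which
works on the freq_ctr structure itself):

/* budget for the elapsed part of the period computed in 64 bits; the
 * overshoot is returned unsigned and clamped to 0 instead of letting
 * the subtraction wrap around (period is assumed non-zero) */
static unsigned int overshoot_sketch(unsigned int curr, unsigned int freq,
                                     unsigned int elapsed, unsigned int period)
{
        unsigned long long budget = (unsigned long long)elapsed * freq / period;

        if (curr <= budget)
                return 0;
        return (unsigned int)(curr - budget);
}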
A new server feature "sni-auto" has been introduced recently. The
objective is to automatically set the SNI value to the host header if no
SNI is explicitly set.
668916c1a2fc2180028ae051aa805bb71c7b690b
MEDIUM: server/ssl: Base the SNI value to the HTTP host header by default
There is an issue with it: the server SNI is currently always overwritten,
even if explicitly set in the configuration file. Adjust
check_config_validity() to ensure the default value is only used if
<sni_expr> is NULL.
This issue was detected because a memory leak on <sni_expr> was reported when
the SNI is explicitly set on a server line.
This patch is related to github feature request #3081.
No need to backport, unless the above patch is.
The httpsclient_log_format variable lacks a few values in the TLS fields
that are now available as fetches.
On the backend side we have:
"%[fc_err]/%[ssl_fc_err,hex]/%[ssl_c_err]/%[ssl_c_ca_err]/%[ssl_fc_is_resumed] %[ssl_fc_sni]/%sslv/%sslc"
We now have enough sample fetches to have this equivalent in the
httpclient:
"%[bc_err]/%[ssl_bc_err,hex]/%[ssl_c_err]/%[ssl_c_ca_err]/%[ssl_bc_is_resumed] %[ssl_bc_sni]/%[ssl_bc_protocol]/%[ssl_bc_cipher]"
Instead of the current:
"%[bc_err]/%[ssl_bc_err,hex]/-/-/%[ssl_bc_is_resumed] -/-/-"
Please the compiler about the following maybe-uninitialized warning:
src/acme.c: In function ‘cli_acme_chall_ready_parse’:
include/haproxy/task.h:215:9: error: ‘ctx’ may be used uninitialized [-Werror=maybe-uninitialized]
215 | _task_wakeup(t, f, MK_CALLER(WAKEUP_TYPE_TASK_WAKEUP, 0, 0))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/acme.c:2903:17: note: in expansion of macro ‘task_wakeup’
2903 | task_wakeup(ctx->task, TASK_WOKEN_MSG);
| ^~~~~~~~~~~
src/acme.c:2862:26: note: ‘ctx’ was declared here
2862 | struct acme_ctx *ctx;
| ^~~
Backport to 3.2.
Improve the challenge_ready processing:
- do a lookup directly instead of looping in the task tree
- only do a task_wakeup when every challenge is ready, to avoid starting
the task and stopping it just after
- compute the number of remaining challenges to setup
- output a message giving the number of remaining challenges to setup
and whether the task was started again.
Backport to 3.2.
Released version 3.3-dev14 with the following main changes :
- MINOR: stick-tables: Rename stksess shards to use buckets
- MINOR: quic: do not use quic_newcid_from_hash64 on BE side
- MINOR: quic: support multiple random CID generation for BE side
- MINOR: quic: try to clarify quic_conn CIDs fields direction
- MINOR: quic: refactor qc_new_conn() prototype
- MINOR: quic: remove <ipv4> arg from qc_new_conn()
- MEDIUM: mworker: set the mworker-max-reloads to 50
- BUG/MEDIUM: quic-be: prevent use of MUX for 0-RTT sessions without secrets
- CLEANUP: startup: move confusing msg variable
- BUG/MEDIUM: mworker: signals inconsistencies during startup and reload
- BUG/MINOR: mworker: wrong signals during startup
- BUG/MINOR: acme: P-256 doesn't work with openssl >= 3.0
- REGTESTS: ssl: split the SSL reuse test into TLS 1.2/1.3
- BUILD: Makefile: make install with admin tools
- CI: github: make install-bin instead of make install
- BUG/MINOR: ssl: remove dead code in ssl_sock_from_buf()
- BUG/MINOR: mux-quic: implement max-reuse server parameter
- MINOR: quic: fix trace on quic_conn_closed release
- BUG/MINOR: quic: do not decrement jobs for backend conns
- BUG/MINOR: quic: fix FD usage for quic_conn_closed on backend side
- BUILD: Makefile: remove halog from install-admin
- REGTESTS: ssl: add basic 0rtt tests for TLSv1.2, TLSv1.3 and QUIC
- REGTESTS: ssl: also verify that 0-rtt properly advertises early-data:1
- MINOR: quic/flags: add missing QUIC flags for flags dev tool.
- MINOR: quic: uneeded xprt context variable passed as parameter
- MINOR: limits: keep a copy of the rough estimate of needed FDs in global struct
- MINOR: limits: explain a bit better what to do when fd limits are exceeded
- BUG/MEDIUM: quic-be/ssl_sock: TLS callback called without connection
- BUG/MINOR: acme: alert when the map doesn't exist at startup
- DOC: acme: add details about the DNS-01 support
- DOC: acme: explain how to dump the certificates
- DOC: acme: configuring acme needs a crt file
- DOC: acme: add details about key pair generation in ACME section
- BUG/MEDIUM: queues: Don't forget to unlock the queue before exiting
- MINOR: muxes: Support an optional ALPN string when defining mux protocols
- MINOR: config: Do proto detection for listeners before checks about ALPN
- BUG/MEDIUM: config: Use the mux protocol ALPN by default for listeners if forced
- DOC: config: Add a note about conflict with ALPN/NPN settings and proto keyword
- MINOR: quic: store source address for backend conns
- BUG/MINOR: quic: flag conn with CO_FL_FDLESS on backend side
- ADMIN: dump-certs: let dry-run compare certificates
- BUG/MEDIUM: connection/ssl: also fix the ssl_sock_io_cb() regarding idle list
- DOC: http: document 413 response code
- MINOR: limits: display the computed maxconn using ha_notice()
- BUG/MEDIUM: applet: Fix conditions to detect spinning loop with the new API
- BUG/MEDIUM: cli: State the cli have no more data to deliver if it yields
- MINOR: h3: adjust sedesc update for known input payload len
- BUG/MINOR: mux-quic: fix sedesc leak on BE side
- OPTIM: mux-quic: delay FE sedesc alloc to stream creation
- BUG/MEDIUM: quic-be: quic_conn_closed buffer overflow
- BUG/MINOR: mux-quic: check access on qcs stream-endpoint
- BUG/MINOR: acme: handle multiple auth with the same name
- BUG/MINOR: acme: prevent creating map entries with dns-01
In case of the dns-01 challenge, it is possible to have a domain
"example.com" and "*.example.com" in the same request. This will create
2 different auth objects, which need 2 different challenges.
However the associated domain is "example.com" for both auth objects.
When doing a "challenge_ready", the algorithm will break at the first
domain found. But since you can have the same domain multiple times in
this case, breaking at the first one prevents having all auth objects in
a ready state.
This patch just removes the break so we can loop on every auth object.
Must be backported to 3.2.
Since the following commit, allocation of stream-endpoint has been
delayed. The objective is to allocate it only for QCS attached to an
upper stream object.
commit e6064c561684d9b079e3b5725d38dc3b5c1b5cd5
OPTIM: mux-quic: delay FE sedesc alloc to stream creation
However, some MUX functions are unsafe as qcs->sd is dereferenced
without any check on it which will result in a crash. Fix this by
testing that qcs->sd is allocated before using it.
This does not need to be backported, unless the above patch is.
This bug impacts only the backends.
Recent commits have modified quic_rx_pkt_parse() for the QUIC backend to handle the
retry token, and version negotiation. This function is called for the quic_conn
even when in closing state (so for the quic_conn_closed struct). The quic_conn
struct and quic_conn_closed struct share some members thanks to the leading
QUIC_CONN_COMMON struct. The recent modification impacts some members which do not
exist for the quic_conn_closed struct, leading to buffer overflows if modified.
For the backends only this patch:
1- silently drops the Retry packet (received/parsed only by backends)
2- silently drops the Initial packets received in closing state
This is safe for the Initial packets because in closing state the datagrams
are entirely skipped thanks to qc_rx_check_closing() in quic_dgram_parse().
No backport needed because the backend support arrived with the current dev.
On the frontend side, a stream-endpoint is allocated on every qcs_new()
invocation. However, this is only used for bidirectional request
streams.
This patch delays stream-endpoint allocation to qcs_attach_sc(), just
prior to the instantiation of the upper stream object. This does not bring
any behavior change but is a nice optimization.
On backend side, streams are instantiated prior to their QCS MUX
counterpart. Thus, QCS can reuse the stream-endpoint already allocated
with the streams, either on qmux_init() or attach operation.
However, a stream-endpoint is also always allocated in every qcs_new()
invocation. For backend QCS, it is thus overwritten on
qmux_init()/attach operation. This causes a memleak.
Fix this by restricting allocation of stream-endpoint only for frontend
connection.
This does not need to be backported.
A regression was introduced in the commit 2d7e3ddd4 ("BUG/MEDIUM: cli: do
not return ACKs one char at a time"). When the CLI is processing a command
line, we no longer send the response immediately. It is especially useful for
clients sending a bunch of commands with very short responses.
However, in that state, the CLI applet must state it has no more data to
deliver. Otherwise it will be woken up again and again because data are
found in its output buffer with no blocking conditions. In worst cases, if
the command rate is really high, this can trigger the watchdog.
This patch must be backported where the patch above is, so probably as far
as 3.0.
There was a mixup between read/send events and ability for an applet to
receive and send. The fix seems obvious by reading it. The call-rate must be
incremented when nothing was received from the applet while it was allowed
and nothing was sent to the applet while it was allowed.
This patch must be backported as far as 3.0.
The computed maxconn was only displayed in verbose or debug modes. This
is too bad because lots of users just don't know what they're starting
with and can be trapped when an environment changes. Let's use ha_notice()
instead of a conditional fprintf() so that it gets displayed right after
the other startup messages, hoping that users will get used to seeing it
and more easily spot anomalies. See github issue #3191 for more context.
The fix in commit 9481cef948 ("BUG/MEDIUM: connection: do not reinsert a
purgeable conn in idle list") is also needed for ssl_sock_io_cb() which
can also release an idle connection and must perform the same checks.
This fix must be backported to all stable versions containing the fix
above.
Let the --dry-run mode connect to the socket and compare the
certificates. It exits the process just before trying to move
the previous certificate and replace it.
This allows having the "[NOTICE] (1234) XXX is already up to date" message
with dry-run.
The connection struct defines a handle which can point to either an FD or a
quic_conn. In the latter case, CO_FL_FDLESS must be set. This is already
the case on frontend side.
This patch fixes QUIC backend support. Before setting connection handle
member to a quic_conn instance, ensure that CO_FL_FDLESS flag is set on
the connection.
Prior to this patch, crash can occur in "show sess all".
No need to backport.
quic_conn has a local_addr member which is used to store the connection
source address. On backend side, this member is initialized to NULL as
the address is not yet known prior to connect. With this patch,
quic_connect_server() is extended so that local_addr is updated after
connect() success.
Also, quic_sock_get_src() is completed for the backend side which now
returns local_addr member. This step is necessary to properly support
fetches bc_src/bc_src_port.
If a mux protocol is forced and incompatible ALPN or NPN settings are
used, connection errors may be experienced. There is no check performed
during HAProxy startup and it is not necessarily obvious. So a note is added
to warn users about this usage.
Since the commit 5003ac7fe ("MEDIUM: config: set useful ALPN defaults for
HTTPS and QUIC"), the ALPN is set by default to "h2,http/1.1" for HTTPS
listeners. However, it is in conflict with the forced mux protocol, if
any. Indeed, with the "proto" keyword, the mux can be forced. In that case, some
combinations with the default ALPN will trigger connection errors.
For instance, by setting "proto h2", it will not be possible to use the H1
multiplexer. So we must take care to not advertise it in the ALPN. Worse,
since the commit above, most modern HTTP clients will try to use the H2
because it is advertised in the ALPN. Setting "proto h1" on the bind line
will make all the traffic be rejected in error.
To fix the issue, and thanks to previous commits, if it is defined, we are
now relying on the ALPN defined by the mux protocol by default. The H1
multiplexer (only the one that can be forced) defines it to "http/1.1" while
the H2 multiplexer defines it to "h2". So by default, if one or another of
these muxes is forced, and if no ALPN is set, the mux ALPN is used.
Other multiplexers are not defining any default ALPN for now, because it is
useless. In addition, only the listeners are concerned because there is no
default ALPN on the server side. Finally, there are no tests performed if the
ALPN is forced on the bind line. It is the user's responsibility to properly
configure their listeners (at least for now).
This patch depends on:
* MINOR: config: Do proto detection for listeners before checks about ALPN
* MINOR: muxes: Support an optional ALPN string when defining mux protocols
The series must be backported as far as 2.8.
The verification of any forced mux protocol, via the "proto" keyword, for
listeners is now performed before any tests on the ALPN. It will be
mandatory to be able to force the default ALPN, if not forced on the bind
line.
This patch will be mandatory for the next fix.
When a multiplexer protocol is defined, it is now possible to specify the
ALPN it supports, in binary format. This info is optional. For now only the
h2 and h1 multiplexers define an ALPN because this will be mandatory for
a fix. But this could be used in the future for other purposes.
This patch will be mandatory for the next fix.
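For illustration only, a sketch of what such a declaration could look like,
assuming a hypothetical <alpn> field in the mux protocol registration; the
ALPN wire format is length-prefixed, hence the leading length byte:

    /* sketch: the H1 mux could advertise "http/1.1" in ALPN wire format,
     * i.e. a one-byte length followed by the protocol name */
    static struct mux_proto_list mux_proto_h1 = {
            .token = IST("h1"),
            .alpn  = IST("\x08http/1.1"),   /* hypothetical new field */
            /* ... other fields unchanged ... */
    };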
In assign_server_and_queue(), there is a rare case where the server was
full, so we created a pendconn; another server was then considered, but in
the meanwhile the pendconn was already unqueued, so we just left the
function. We did so, however, while still holding the queue lock, which
would ultimately lead to a deadlock, and the watchdog would kill the
process.
To fix that, just unlock the queue before leaving.
This should be backported to 3.2.
When configuring an acme section with the 'map' keyword, the user must
use an existing map. If the map doesn't exist, a log will be emitted
when trying to add the challenge to the map.
This patch changes the behavior by checking at startup whether the map exists,
so that haproxy reports the problem and won't start with a non-existing map.
This must be backported in 3.2.
Contrary to TCP, QUIC does not SSL_free() its SSL * object when its ->close()
XPRT callback is called. As a side effect, this triggers some BUG_ON(!conn),
with <conn> being the connection, from TLS callbacks registered at
configuration parsing time, because these callbacks may be invoked after
<conn> has been released.
This is the case for instance with ssl_sock_srv_verifycbk(), whose role is to
add some checks to the built-in server certificate verification process.
This patch prevents <conn> from being dereferenced inside several callbacks
shared between TCP and QUIC.
Thank you to @InputOutputZ for the report in GH #3188.
As the QUIC backend feature arrived with the current 3.3 dev, no need to backport.
As shown in github issue #3191, the error message shown when FD limits
are exceeded is not very useful as-is, since the current hard limit is
not displayed, and no suggestion is made about what to change in the
config. Let's explain about maxconn/ulimit-n/fd-hard-limit, suggest
dropping them or setting them to a context-based value at roughly 49%
of the current limit minus the known used FDs for listeners and checks.
This allows common "large" hard limits to report mostly round maxconns.
Example:
[ALERT] (25330) : [haproxy.main()] Cannot raise FD limit to 4001020,
current limit is 1024 and hard limit is 4096. You may prefer to let
HAProxy adjust the limit by itself; for this, please just drop any
'maxconn' and 'ulimit-n' from the global section, and possibly add
'fd-hard-limit' lower than this hard limit. You may also force a new
'maxconn' value that is a bit lower than half of the hard limit minus
listeners and checks. This results in roughly 1500 here.
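The suggestion itself is simple arithmetic; a hedged sketch of the computation
described above, with hypothetical variable names:

    /* sketch: suggest a maxconn at roughly 49% of the FD hard limit,
     * minus the FDs already reserved for listeners and checks */
    suggested_maxconn = (fd_hard_limit * 49 / 100) - known_used_fds;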
It's always a pain to guess the number of FDs that can be needed by
listeners, checks, threads, pollers etc. We have this estimate in
global.maxsock before calling set_global_maxconn(), but we lose it
the line after. Let's copy it into global.est_fd_usage and keep it.
This will be helpful to try to provide more accurate suggestions for
maxconn.
The quic_conn ->xprt_ctx was passed to qc_send_ppkts(); the quic_conn was
retrieved from this context to be used inside the function, and the context
itself was not used at all otherwise.
This patch simply passes the quic_conn directly to qc_send_ppkts(), which is
all this function needs.
This patch completes the 0-rtt test to verify that early-data:1 is
properly emitted to the server in the relevant situations. We carefully
compare it with the expected values that are computed based on the TLS
version, the client and listener's support for 0-rtt and the resumption
status. A response header "x-early-data-test" is set to OK on success,
or KO on failure and the client tests this. The previous test is kept
as well. This was tested with quictls-1.1.1 and quictls-3.0.1 for TCP,
as well as aws-lc for QUIC.
These tests try all the combinations of {0,1}rtt <-> {0,1}rtt with
stateless and stateful tickets. They take into consideration the TLS
version to decide whether or not 0rtt should work. Since we cannot
use environment variables in the client, the tests are run in haproxy
itself where the frontends set a "x-early-rcvd-test" response header
that the client checks. At this stage, the test only verifies that
*some* early data were received.
Note that the tests are a bit complex because we need 4 listeners
for the various combinations of 0rtt/tickets, then we have to set
expectations based on the TLS version (1.2 vs 1.3), as well as the
session resumption status.
We have to set alpn on the server lines because currently our frontends
expect it for 0-rtt to work.
The dependency on the halog build causes problems when changing CFLAGS and
LDFLAGS, because you're supposed to have the same flags during the build
and the install if there are still things to build.
We probably need to store the flags somewhere to reuse them at another
step, but we need to do it cleanly. In the meantime it's better not to
have this dependency.
On the frontend side, QUIC transfers can be performed either via a
connection-owned FD or multiplexed on the listener one. When a quic_conn
is freed and converted to a quic_conn_closed instance, its FD, if open, is
closed and all exchanges are then multiplexed via the listener FD.
This is different for the backend, as connections can only use their own
FD. Thus, special care must be taken when freeing a
connection and converting it to a quic_conn_closed instance. In this
case, qc_release_fd() is delayed until the quic_conn_closed release.
Furthermore, when the FD is transferred, its iocb and owner fields are
updated to the new quic_conn_closed instance. Without this, a crash would
occur when accessing the freed quic_conn tasklet. A newly dedicated
handler quic_conn_closed_sock_fd_iocb is used to ensure access to
quic_conn_closed members only.
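A minimal sketch of the handover described above, assuming direct access to
the fdtab entry; the real code may go through dedicated helpers and the
variable names are hypothetical:

    /* sketch: transfer the socket FD to the quic_conn_closed instance so
     * that future events no longer reference the freed quic_conn */
    fdtab[fd].owner = qc_closed;
    fdtab[fd].iocb  = quic_conn_closed_sock_fd_iocb;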
jobs is a global counter which serves to account for activity through the
whole process. The soft-stop procedure will wait until this counter is
reset to zero.
jobs is not used for backend connections. Thus, as expected, it is not
incremented when a QUIC backend connection is instantiated. However, the
decrement was performed on all sides during quic_conn_release(). This
caused the counter to wrap.
Fix this by decrementing jobs only for frontend connections. Without
this patch, the soft-stop procedure would hang indefinitely if QUIC backend
connections were in use.
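A sketch of the intended fix, assuming a hypothetical predicate to distinguish
frontend from backend connections:

    /* sketch: in quic_conn_release(), only frontend connections were
     * accounted in <jobs>, so only those must decrement it */
    if (!qc_is_back(qc))            /* hypothetical frontend/backend test */
            HA_ATOMIC_DEC(&jobs);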
Properly implement support for the max-reuse server keyword. This is done by
adding a total count of streams seen for the whole connection. This
value is used in the avail_streams callback.
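As a hedged illustration only (variable names hypothetical, not the actual
implementation), the idea is roughly:

    /* sketch: once a connection has carried max-reuse + 1 streams in
     * total, report no available stream so it is not picked for reuse */
    if (srv->max_reuse >= 0 &&
        total_streams_seen >= (unsigned int)srv->max_reuse + 1)
            return 0;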
When haproxy is compiled with -O0, the SSL_get_max_early_data() symbol is
used in the generated assembly; however -O2 seems to remove this symbol
when optimizing the code.
It happens because `if (conn_is_back(conn))` and `if
(objt_listener(conn->target))` are opposed conditions, which means we
never take the branch where objt_listener(conn->target) is true.
This patch removes the dead code. Bonus: SSL_get_max_early_data() is not
implemented in rustls, and that's the only thing preventing haproxy from
starting with it.
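Schematically, the situation was of this form (simplified illustration, not
the exact code; the local variables are hypothetical):

    if (conn_is_back(conn)) {
            /* ... backend-side handling ... */
            if (objt_listener(conn->target)) {
                    /* dead: conn->target is never a listener on the
                     * backend side, so this call can never be reached */
                    max_early = SSL_get_max_early_data(ssl);
            }
    }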
This can be backported to every stable branch.
make install now has a dependency on install-admin, which has a
dependency on admin/halog/halog.
halog links haproxy .o files together with its own objects, but when built
with ASAN those objects must also be linked with ASAN or it won't be
possible to link the binary.
We don't need an ASAN-ready halog, so let's just do an install-bin
instead that will just install haproxy.
QUIC and TLS don't use the same tests because QUIC only supports
TLS 1.3 while SSL tests both TLS 1.2 and 1.3, which complicates
the test scenarios.
This change extracts the core of the test into a single generic
ssl_reuse.vtci file and creates new high-level tests for TLSv1.2
over TCP, TLSv1.3 over TCP and TLSv1.3 over QUIC, which simply
include this file and set two variables. The test is now cleaner
and simpler.
When trying to use the P-256 curve in the acme configuration with
OpenSSL 3.x, the generation of the account key was failing because OpenSSL
doesn't return a NIST or SECG curve name, but an ANSI X9.62 one.
Since the ANSI X9.62 curve names were not in the list, it couldn't match
anything supported.
This patch fixes the issue by adding both the prime192v1 and prime256v1 names
to the struct curve array which is used during curve parsing.
Must be backported to 3.2.
Since the new master-worker model in 3.1, signals are registered in
step_init_3(). However, those signals were supposed to be registered
only for the worker or the standalone mode. The wrong callback would be
called in the master, even during configuration parsing.
The patch sets the signal handlers to NULL for the master so that they do
nothing until they are really registered.
Must be backported as far as 3.1.
Since haproxy 3.1, the master-worker mode changed to let the worker
parse the configuration instead of the master.
Previously, signals were blocked during configuration parsing and
unblocked before entering the polling loop of the master. This way it
was impossible to start a reload during the configuration parsing.
But with the new model, the polling loop is started in the master before
the configuration parsing is finished, and the signals are still
unblocked at this step. This means that it is possible to start a reload
while the configuration is being parsed.
This patch reintroduces the behavior of blocking the signals during
configuration parsing, adapted to the new model:
- Before the exec() of the reload, signals are blocked.
- When entering the polling loop, SIGCHLD is unblocked because it is
required to catch a configuration parsing failure in the worker.
- Once the configuration is parsed, upon success in _send_status() or
upon failure in run_master_in_recovery_mode(), all signals are unblocked.
This patch must be backported as far as 3.1.
Move the char *msg variable declared in main() into a sub-block since
there are already multiple msg variables in other sub-blocks of this
function.
Also make it const.
The QUIC backend crashes when its peer does not support 0-RTT. In this case,
when the sessions are reused, no early-data level secrets are derived by
the TLS stack. This leads to crashes from qc_send_mux(), which does not expect
that both the early-data level (qc->eel) and application level (qc->ael)
encryption levels could be uninitialized.
To fix this:
- prevent qc_send_mux() from sending data if these two encryption levels are
not initialized. In this case it returns QUIC_TX_ERR_NONE;
- avoid waking up the MUX from the XPRT ->start() callback if the MUX is ready
but there are no early-data level secrets to send data with;
- ensure the MUX is woken up by qc_ssl_do_handshake() after handshake completion
if it is ready, by calling qc_notify_send().
Thank you to @InputOutputZ for having reported this issue in GH #3188.
No need to backport because QUIC backends is a current 3.3 development feature.
There was no mworker-max-reload value by default; it was set to INT_MAX
so it was impossible to reach.
The default value is now 50, which is still high, but no worker should
undergo that many reloads. This means that a worker will be killed with
SIGTERM if it reaches this many reloads.
Remove <ipv4> argument from qc_new_conn(). This parameter is unnecessary
as it can be derived from the family type of the addresses also passed
as argument.
The objective of this patch is to streamline qc_new_conn() usage so that
it is similar for frontend and backend sides.
Previously, several parameters were set only for frontend connections.
These arguments are replaced by a single quic_rx_packet argument, which
represents the INITIAL packet triggering the connection allocation on
the server side. For a QUIC client endpoint, it remains NULL. This usage
is considered more explicit.
As a minor change, <target> is moved as the first argument of the
function. This is considered useful as this argument determines whether
the connection is a frontend or backend entry.
Along with these changes, qc_new_conn() documentation has been reworded
so that it is now up-to-date with the newest usage.
quic_conn has two fields named <dcid> and <scid>. It may cause confusion
as it is not obvious how these fields are related to the connection
direction. Try to improve this by extending the documentation of these
two fields.
When a new backend connection is instantiated, a CID is first randomly
generated. It will serve as the first DCID for incoming packets from the
server. Prior to this patch, if the generated CID caused a collision
with an existing entry from another connection, an error was reported and
the connection could not be allocated.
This patch improves this procedure by implementing retries when a
collision occurs. Now, at most three attempts will be performed before
giving up. This is the same procedure already performed for CIDs
instantiated after RETIRE_CONNECTION_ID frame parsing.
Along with this functional change, qc_new_conn() is refactored for
backend instantiation. The CID generation is extracted from it and the
value is passed as an argument. This is considered cleaner as the code
is more similar between frontend and backend sides.
quic_newcid_from_hash64 is an external callback. If defined, it serves
as a CID generation method, as an alternative to the default random
implementation.
This mechanism was not correctly implemented on the backend side.
Indeed, the <hash64> quic_conn member is only set for frontend
connections. The simplest solution would be to properly define it also
for backend ones. However, quic_newcid_from_hash64 derivation is really
only useful on the frontend side for now. Thus, this patch disables
its use on the backend side in favor of the default random generator.
To implement this, quic_cid_generate() is split into two functions, one for
each CID generation method. It is the responsibility of the
caller to select the proper method. On the backend side, only the random
implementation is now used.
The shard keyword is already used by the peers and on the server lines, and
it is unrelated to the session key distribution. So instead of talking
about shards for the session key hashing, we now use the term "bucket".
Released version 3.3-dev13 with the following main changes :
- BUG/MEDIUM: config: for word expansion, empty or non-existing are the same
- BUG/MINOR: quic: close connection on CID alloc failure
- MINOR: quic: adjust CID conn tree alloc in qc_new_conn()
- MINOR: quic: split CID alloc/generation function
- BUG/MEDIUM: quic: handle collision on CID generation
- MINOR: quic: extend traces on CID allocation
- MEDIUM/OPTIM: quic: alloc quic_conn after CID collision check
- MINOR: stats-proxy: ensure future-proof FN_AGE manipulation in me_generate_field()
- BUG/MEDIUM: stats-file: fix shm-stats-file preload not working anymore
- BUG/MINOR: do not account backend connections into maxconn
- BUG/MEDIUM: init: 'devnullfd' not properly closed for master
- BUG/MINOR: acme: more explicit error when BIO_new_file()
- BUG/MEDIUM: quic-be: do not launch the connection migration process
- MINOR: quic-be: Parse the NEW_TOKEN frame
- MEDIUM: quic-be: Parse, store and reuse tokens provided by NEW_TOKEN
- MINOR: quic-be: helper functions to save/restore transport params (0-RTT)
- MINOR: quic-be: helper quic_reuse_srv_params() function to reuse server params (0-RTT)
- MINOR: quic-be: Save the backend 0-RTT parameters
- MEDIUM: quic-be: modify ssl_sock_srv_try_reuse_sess() to reuse backend sessions (0-RTT)
- MINOR: quic-be: allow the preparation of 0-RTT packets
- MINOR: quic-be: Send post handshake frames from list of frames (0-RTT)
- MEDIUM: quic-be: qc_send_mux() adaptation for 0-RTT
- MINOR: quic-be: discard the 0-RTT keys
- MEDIUM: quic-be: enable the use of 0-RTT
- MINOR: quic-be: validate the 0-RTT transport parameters
- MINOR: quic-be: do not create the mux after handshake completion (for 0-RTT)
- MINOR: quic-be: avoid a useless I/O callback wakeup for 0-RTT sessions
- BUG/MEDIUM: acme: move from mt_list to a rwlock + ebmbtree
- BUG/MINOR: acme: can't override the default resolver
- MINOR: ssl/sample: expose ssl_*c_curve for AWS-LC
- MINOR: check: delay MUX init when SSL ALPN is used
- MINOR: cfgdiag: adjust diag on servers
- BUG/MINOR: check: only try connection reuse for http-check rulesets
- BUG/MINOR: check: fix reuse-pool if MUX inherited from server
- MINOR: check: clarify check-reuse-pool interaction with reuse policy
- DOC: configuration: add missing ssllib_name_startswith()
- DOC: configuration: add missing openssl_version predicates
- MINOR: cfgcond: add "awslc_api_atleast" and "awslc_api_before"
- REGTESTS: ssl: activate ssl_curve_name.vtc for AWS-LC
- BUILD: ech: fix clang warnings
- BUG/MEDIUM: stick-tables: Always return the good stksess from stktable_set_entry
- BUG/MINOR: stick-tables: Fix return value for __stksess_kill()
- CLEANUP: stick-tables: Don't needlessly compute shard number in stksess_free()
- MINOR: h1: h1_release() should return if it destroyed the connection
- BUG/MEDIUM: h1: prevent a crash on HTTP/2 upgrade
- MINOR: check: use auto SNI for QUIC checks
- MINOR: check: ensure QUIC checks configuration coherency
- CLEANUP: peers: remove an unneeded null check
- Revert "BUG/MEDIUM: connections: permit to permanently remove an idle conn"
- BUG/MEDIUM: connection: do not reinsert a purgeable conn in idle list
- DEBUG: extend DEBUG_STRESS to ease testing and turn on extra checks
- DEBUG: add BUG_ON_STRESS(): a BUG_ON() implemented only when DEBUG_STRESS > 0
- DEBUG: servers: add a few checks for stress-testing idle conns
- BUG/MINOR: check: fix QUIC check test when QUIC disabled
- BUG/MINOR: quic-be: missing version negotiation
- CLEANUP: quic: Missing succesful SSL handshake backend trace (OpenSSL 3.5)
- BUG/MINOR: quic-be: backend SSL session reuse fix (OpenSSL 3.5)
- REGTEST: quic: quic/ssl_reuse.vtc supports OpenSSL 3.5 QUIC API
This script is supported with the OpenSSL 3.5 QUIC API since this previous commit:
BUG/MINOR: quic: backend SSL session reuse fix (HAVE_OPENSSL_QUIC)
Should be backported where this commit is backported.
This bug impacts only the QUIC backends when haproxy is compiled against
OpenSSL 3.5 with the QUIC API (HAVE_OPENSSL_QUIC).
The QUIC clients could not reuse their SSL session because the TLS tickets
received from the servers could not be provided to the TLS stack. This should
be done when the stack calls ha_quic_ossl_crypto_recv_rcd()
(OSSL_FUNC_SSL_QUIC_TLS_CRYPTO_RECV_RCD callback).
According to the OpenSSL team, an SSL_read() call must be done after the handshake
completion. It seems the correct location is at the same level as for
SSL_process_quic_post_handshake() for quictls.
Thank you to @mattcaswell, @Sashan and @vdukhovni for having helped in solving
this issue.
Must be backported to 3.1
This very minor issue impacts only the backend when compiled against OpenSSL 3.5
with the QUIC API (HAVE_OPENSSL_QUIC).
The "SSL handshake OK" trace was not dumped by a TRACE() call. This was very
annoying when debugging.
Modify the concerned code section, which was a bit ugly, and simplify it.
The TRACE() call is now done at a single location.
Should be backported to 3.2 to ease any further backport.
This bug impacts only the QUIC clients (or backends). The version negotiation
was not supported at all for them. This was an oversight.
Contrary to the QUIC server, which chooses the negotiated version after having
received the transport parameters (in the ClientHello message), the client
selects the negotiated version from the version field of the first Initial
packet. Indeed, the server transport parameters are inside the ServerHello
messages, encrypted in Handshake packets.
This non-intrusive patch does not impact the QUIC server implementation.
It only selects the negotiated version from the first Initial packet
received from the server and consequently initializes the TLS cipher context.
Thank you to @InputOutputZ for having reported this issue in GH #3178.
No need to backport because the QUIC backends support arrives with 3.3.
Latest commit ef206d441c ("MINOR: check: ensure QUIC checks configuration
coherency") introduced a regression when QUIC is not compiled in. Indeed,
not specifying a check proto sets mux_proto to NULL, which also happens to
be the value of get_mux_proto("QUIC"), so it complains about QUIC. Let's
add a non-null check in addition to this.
No backport is needed.
The latest idle conns fix 9481cef948 ("BUG/MEDIUM: connection: do not
reinsert a purgeable conn in idle list") addresses a very hard-to-hit
case which manifests itself when an attempt to reuse a connection fails
because conn->mux is NULL:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000655410b8642c in conn_backend_get (reuse_mode=4, srv=srv@entry=0x6554378a7140,
sess=sess@entry=0x7cfe140948a0, is_safe=is_safe@entry=0,
hash=hash@entry=910818338996668161) at src/backend.c:1390
1390 if (conn->mux->takeover && conn->mux->takeover(conn, i, 0) == 0) {
However the condition that leads to this situation can be detected
earlier, by the presence of the connection in the toremove_list, whose
race window is much larger and easier to detect.
This patch adds a few BUG_ON_STRESS() at selected places that can detect
this condition. When built with -DDEBUG_STRESS and run under stress with
two distinct processes communicating over H2 over SSL, under a stress of
400-500k req/s, the front process usually crashes in the first 10-30s
triggering in _srv_add_idle() if the fix above is reverted (and it does
not crash with the fix).
This is mainly included to serve as an illustration of how to instrument
the code for seamless stress testing.
The purpose of this new BUG_ON goes beyond BUG_ON_HOT(). While BUG_ON_HOT()
is meant to be light but placed on very hot code paths, BUG_ON_STRESS()
may be heavy and is only used under stress-testing, to try to detect early
that something bad is starting to happen. This one is not even type-checked
when not defined because we don't want to risk the compiler emitting the
slightest piece of code there in production mode, so as to give enough
freedom to the developers.
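A minimal sketch of how such a macro can be laid out; the actual definition
may differ, but it captures the "no type check, no code" property described
above:

    /* sketch: heavy checks compiled in only for stress-testing builds;
     * when disabled, the condition is not even type-checked */
    #if DEBUG_STRESS > 0
    # define BUG_ON_STRESS(cond)  BUG_ON(cond)
    #else
    # define BUG_ON_STRESS(cond)  do { } while (0)
    #endif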
DEBUG_STRESS is currently used only to expose "stress-level". With this
patch, we go a bit further, by automatically forcing DEBUG_STRICT and
DEBUG_STRICT_ACTION to their highest values in order to enable all
BUG_ON levels, and make all of them result in a crash. In addition,
care is taken to always only have 0 or 1 in the macro, so that it can be
tested using "#if DEBUG_STRESS > 0" as well as "if (DEBUG_STRESS) { }"
everywhere.
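The 0/1 normalization mentioned above can be sketched like this (illustrative
only, not the exact code):

    /* sketch: make sure DEBUG_STRESS is always defined as exactly 0 or 1,
     * so that both "#if DEBUG_STRESS > 0" and "if (DEBUG_STRESS)" work */
    #if defined(DEBUG_STRESS) && (DEBUG_STRESS > 0)
    # undef  DEBUG_STRESS
    # define DEBUG_STRESS 1
    #else
    # undef  DEBUG_STRESS
    # define DEBUG_STRESS 0
    #endif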
The goal will be to ease insertion of extra tests for builds dedicated
to stress-testing that enable possibly expensive extra checks on certain
code paths that cannot reasonably be compiled in for production code
right now.
A recent patch was introduced to fix a rare race condition in the idle
connection code which would result in a crash. The issue happens when a MUX IO
handler runs on top of a connection moved into the purgeable list. The
connection would be considered as present in the idle list instead, and
reinserted into it at the end of the handler while still in the purge
list.
096999ee208b8ae306983bc3fd677517d05948d2
BUG/MEDIUM: connections: permit to permanently remove an idle conn
This patch solves the described issue. However, it introduces another
bug as it may clear connection flags when removing a connection from its
parent list. These flags now serve primarily as a status which
indicates that the connection is accounted for by the server. When a backend
connection is freed, server idle/used counters are decremented
according to these flags. With the above patch, an incorrect counter
could be adjusted and thus wrapping could occur.
The first impact of this bug is that it may distort the estimated number
of connections needed by servers, which would result either in poor
reuse rate or too many idle connections kept. Another noticeable impact
is that it may prevent server deletion.
The main problem of the original and current issues is that connection
flags are misinterpreted as telling if a connection is present in the
idle list. As already described above, these flags are in fact solely a
status which indicates that the connection is accounted for in server
counters. Thus, here are the definitive conclusions that can be
drawn:
* (conn->flags & CO_FL_LIST_MASK) == 1:
the connection is accounted by the server
it may or may not be present in the idle list
* (conn->flags & CO_FL_LIST_MASK) == 0
the connection is not accounted and not present in idle list
The discussion above does not mention the session list, but a similar
pattern can be observed when the CO_FL_SESS_IDLE flag is set.
To keep the original issue solved and fix the current one, the IO MUX
handler prologues are rewritten. Now, flags are no longer checked for
list membership and the LIST_INLIST macro is used instead. This is
definitely clearer regarding the purpose of conn_in_list here.
At the end of the IO MUX handlers, conn idle flags may be checked if
conn_in_list was true, to reinsert the connection either into the idle or the
safe list. This is considered safe as no function should modify idle flags
when a connection is not stored in a list, except during the conn_free()
operation.
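A simplified sketch of the new prologue logic, assuming the <idle_list>
member mentioned below; the reinsertion helpers are hypothetical:

    /* sketch: rely on actual list membership rather than on flags */
    int conn_in_list = conn && LIST_INLIST(&conn->idle_list);

    /* ... handler body runs here ... */

    if (conn_in_list) {
            /* flags are only consulted here to choose between the
             * safe and the idle list for reinsertion */
            if (conn->flags & CO_FL_SAFE_LIST)
                    srv_add_to_safe_list(conn);   /* hypothetical helper */
            else
                    srv_add_to_idle_list(conn);   /* hypothetical helper */
    }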
This patch must be backported to every stable version after the revert of
the above commit. It should be applicable up to 3.0 without any issue. On
2.8 and below, <idle_list> connection member does not exist. It should
be safe to check <leaf_p> tree node as a replacement.
The target patch fixes a rare race condition which happens when a MUX IO
handler is working on a connection already moved into the purge list. In
this case, the handler would incorrectly move the connection back into
the idle list.
To fix this, conn_delete_from_tree() was extended to remove flags along
with the connection from the idle list. This was performed when the
connection was moved into the purge list. However, it introduced another
issue related to idle server connection accounting. Thus it is
necessary to revert it prior to the upcoming newer fix.
This patch must be backported to every version where the original commit
is.
Coverity reported in GH #3181 that a NULL test was useless, in
peers_trace(), which is true since the peer always belongs to a
peers section and it was already dereferenced. Let's just remove
the test to avoid the confusion.
QUIC is now supported on the backend side, thus it is possible to use it
with server checks. However, check configuration can be quite
extensive, differing greatly from the server settings.
This patch ensures that QUIC checks are always performed under a
controlled context. The objectives are to avoid any crashes and ensure that
there is no surprise for users with respect to the configuration.
The first part of this patch ensures that QUIC checks can only be
activated on QUIC servers. Indeed, QUIC requires dedicated
initialization steps prior to its usage.
The other part of this patch disables QUIC usage when one or multiple
specific check connection settings are specified in the configuration,
diverging from the server settings. This is the simplest solution for
now and it ensures that there is no hidden behavior for users. This means
that it's currently impossible to perform QUIC checks against endpoints
other than the server itself. However, for now there is no real use-case
for this scenario.
Along with these changes, check-proto documentation is updated to
clarify QUIC checks behavior.
By default, check SNI is set to the Host header when an HTTPS check is
performed. This patch extends this mode so that it is also active when
QUIC checks are executed.
This patch should improve reuse rate with checks. Indeed, SNI is also
already automatically set for normal traffic. The same value must be
used during check so that a connection hash match can be found.
Change h1_process() to return -2 when the mux is destroyed but the
connection is not, so that we can differentiate between "both mux and
connection were destroyed" and "only the mux was destroyed".
It can happen that only the mux gets destroyed, and the connection is
still alive, if we did upgrade it to HTTP/2.
In h1_wake(), if the connection is alive, then return 0, as the wake
methods should only return -1 if the connection is dead.
This fixes a bug where the ssl xprt would consider the connection
destroyed, and thus would consider its tasklet should die, and return
NULL, and its TASK_RUNNING flag would never be removed, leading to an
infinite loop later on. This would happen anytime an HTTP/2 upgrade was
successful.
This should be backported up to 2.8. While the bug was revealed by commit
00f43b7c8b136515653bcb2fc014b0832ec32d61, it was only by chance that it was
not triggered before, and it exists in previous releases too.
h1_release() is called to destroy everything related to the mux h1,
usually even the connection. However, it handles upgrades to HTTP/2 too,
in which case the h1 mux will be destroyed, but the connection will
still be alive. So make it so it returns 0 if everything is destroyed,
and -1 if the connection is still alive.
This should be backported up to 2.8, as a future bugfix will depend on
it.
Since commit 0bda33a3e ("MINOR: stick-tables: remove the uneeded read lock
in stksess_free()"), the lock on the shard is no longer acquired. So it is
useless to still compute the shard number. The result is never used and can
be safely removed.
The commit 9938fb9c7 ("BUG/MEDIUM: stick-tables: Fix race with peers when
killing a sticky session") introduced a regression.
__stksess_kill() must always return 0 if the session cannot be released. But
when the ref_cnt is tested under the update lock, a success is reported if
the session is still in use. 0 must be returned in that case.
This bug is harmless because callers never use the return value of
__stksess_kill() or stksess_kill().
This bug must be backported as far as 3.0.
In stktable_set_entry(), the return value of __stktable_store() is not
tested, while it is possible to get back an existing session with the same key
instead of the one we wanted to insert. This happens when we fail to upgrade
the read lock on the bucket to a write lock. In that case, we release the
lock for a short time to get a write lock.
So, to fix the bug, we must check the session returned by __stktable_store()
and take care to return that one.
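A hedged sketch of the fix; the function signatures are simplified and the
cleanup of the duplicate entry is hypothetical:

    /* sketch: __stktable_store() may return an entry that was inserted
     * by another thread while the lock was released; use that one */
    struct stksess *old = __stktable_store(table, ts);
    if (old != ts) {
            __stksess_free(table, ts);   /* hypothetical: drop our duplicate */
            ts = old;
    }
    return ts;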
The bug was introduced by the commit e62885237c ("MEDIUM: stick-table: make
stktable_set_entry() look up under a read lock"). It must be backported as
far as 2.8.
No impact as the state is either SHOW_ECH_SPECIFIC or SHOW_ECH_ALL but
never anything else.
src/ech.c:240:6: error: variable 'p' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
240 | if (ctx->state == SHOW_ECH_ALL) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
src/ech.c:275:12: note: uninitialized use occurs here
275 | ctx->pp = p;
| ^
src/ech.c:240:2: note: remove the 'if' if its condition is always true
240 | if (ctx->state == SHOW_ECH_ALL) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/ech.c:228:17: note: initialize the variable 'p' to silence this warning
228 | struct proxy *p;
| ^
| = NULL
src/ech.c:240:6: error: variable 'bind_conf' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
240 | if (ctx->state == SHOW_ECH_ALL) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
src/ech.c:276:11: note: uninitialized use occurs here
276 | ctx->b = bind_conf;
| ^~~~~~~~~
src/ech.c:240:2: note: remove the 'if' if its condition is always true
240 | if (ctx->state == SHOW_ECH_ALL) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/ech.c:229:29: note: initialize the variable 'bind_conf' to silence this warning
229 | struct bind_conf *bind_conf;
| ^
| = NULL
2 errors generated.
make: *** [Makefile:1062: src/ech.o] Error 1
AWS-LC features are not easily tested with just the openssl version
constant. AWS-LC uses its own API versioning stored in the
AWSLC_API_VERSION constant.
This patch adds the two awslc_api_atleast and awslc_api_before predicates
that help check the AWS-LC API version.
Add missing openssl_version_atleast() and openssl_version_before()
predicates.
The predicates exist since 3aeb3f9347 ("MINOR: cfgcond: implements
openssl_version_atleast and openssl_version_before").
Must be backported to every stable version.
Add the missing ssllib_name_startswith() predicate in the documentation.
The predicate was introduced with b01179aa9 ("MINOR: ssl: Add
ssllib_name_startswith precondition").
Must be backported as far as 2.6.
check-reuse-pool can only perform as expected if reuse policy on the
backend is set to aggressive or higher. Update the documentation to
reflect this and implement a server diag warning.
Check reuse is only performed if no specific check connect options are
specified in the configuration. This ensures that reuse won't be
performed when intending to use connection parameters different from the
default traffic.
This relies on tcpcheck_use_nondefault_connect(), which indicates whether the
check has any specific connection parameters. One of them is the check
<mux_proto> field. However, this field may be automatically set during
init_srv_check() in some specific conditions without any explicit
configuration, most notably when using http-check rulesets on an HTTP
backend. Thus, it prevented connection reuse for these checks.
This commit fixes this by adjusting tcpcheck_use_nondefault_connect().
Besides checking the check <mux_proto> field, it also detects whether it is
different from the server configuration. This is sufficient to know whether
the value is derived from the configuration or automatically calculated
in init_srv_check().
Note that this patch introduces a small behavior change. Prior to it,
check reuse was never performed if "check-proto" was explicitly
configured. Now, check reuse will be performed if the configured value
is identical to the server MUX protocol. This is considered
acceptable as connection reuse is safe when using the same MUX
protocol.
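Schematically, the adjusted test could look like this; the field names come
from the description above, the exact code is hypothetical:

    /* sketch: an explicitly configured mux proto only makes the check
     * "non default" when it differs from the server's own mux proto */
    if (check->mux_proto && check->mux_proto != srv->mux_proto)
            return 1;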
This must be backported up to 3.2.
In 3.2, a new server keyword "check-reuse-pool" has been introduced. It
allows reusing a connection for a new check, instead of always
initializing a new one. This is only performed if the check does not
rely on specific connection parameters differing from the server.
This patch further restricts check reuse to the case where an HTTP ruleset
is used at the backend level. Indeed, reusing a connection outside of
HTTP is an undefined behavior. The impact of this bug is unknown and
depends on the proxy/server configuration. In the case of an HTTP
backend with non-HTTP checks, check-reuse-pool would probably cause a
drop in reuse rate.
Along with this change, implement a new diagnostic warning on servers to
report that check-reuse-pool cannot apply due to an incompatible check
type.
This must be backported up to 3.2.
Adjust the code dealing with diagnostics performed on servers. The objective
is to extract the check on duplicate cookies into a dedicated function
outside of the proxies/servers loop.
This does not have any noticeable impact. This patch is merely a code
improvement to make it easy to implement new server diagnostics in the future.
When instantiating a new connection for a check, its MUX may be
initialized early. This was not performed, though, if SSL ALPN negotiation
is to be used, except when the check MUX is already fixed.
However, this method of initialization is problematic when the QUIC MUX is
used. Indeed, this multiplexer must only be instantiated after the
application protocol is known, which is derived from the ALPN
negotiation. If this is not the case, a crash will occur in qmux_init().
In fact, a similar problem was already encountered for normal traffic.
Thus, a change was performed in connect_server(): MUX early
initialization is now always skipped if SSL ALPN negotiation is active,
even if the MUX is already fixed. This patch introduces a similar change for
checks.
Without this patch, it is not possible to perform checks on QUIC servers
as expected. Indeed, when an http-check ruleset is active, a crash would
occur.
The underlying SSL_get_negotiated_group function has been backported
into AWS-LC [1], so expose the feature for users of this TLS stack
as well. Note that even though it was actually added in AWS-LC 1.56.0,
we require AWSLC_API_VERSION >= 35 which was released in AWS-LC 1.57.0,
because API version wasn't incremented after this change. As the delta
is one minor version (less than two weeks), I consider this acceptable
to avoid relying on a proxy constant like TLSEXT_nid_unknown which
might be removed at some point.
[1] d6a37244ad
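The gating can be sketched as a preprocessor guard (illustrative; the actual
condition in the source may be organized differently, and the resulting macro
name is hypothetical):

    /* sketch: only expose the curve name fetches on AWS-LC builds whose
     * API version guarantees SSL_get_negotiated_group() is present */
    #if defined(OPENSSL_IS_AWSLC) && (AWSLC_API_VERSION >= 35)
    # define HAVE_SSL_GET_NEGOTIATED_GROUP   /* hypothetical macro name */
    #endif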
httpclient_acme_init() was called in cfg_parse_acme(), which runs during
section parsing. httpclient_acme_init() also calls
httpclient_create_proxy(), which could create a "default" resolvers
section if it doesn't exist.
If one tries to override the default resolvers section after an ACME
section, the resolvers section parsing will fail because the section was
already created by httpclient_create_proxy().
This patch fixes the issue by moving the initialization of the ACME
proxy to a pre_check callback, which is called just before
check_config_validity().
Must be backported in 3.2.
The current ACME scheduler suffers from problems due to the way the
tasks are stored:
- MT_LIST is not scalable when there are a lot of ACME tasks and a
specific one must be looked up.
- the acme_task pointer was stored in the ckch_store in order not to
walk the whole list. But a ckch_store can be updated and the
pointer is then lost with the previous one.
- when a task fails, the pointer in the ckch_store was not removed because
we only work on a copy of the original ckch_store; removing it would
require locking the ckchs_tree and removing this pointer.
This patch fixes the issues by removing the MT_LIST-based architecture,
and replacing it with a simple ebmbtree + rwlock design.
The pointer to the task is no longer stored in the ckch_store; instead
it is stored in the acme_tasks tree. Finding a task is done by
doing a lookup on this tree with a RDLOCK.
Instead of checking whether store->acme_task is not NULL, a lookup is also
done.
This allows removing the stuck "acme_task" pointer from the store, which
was preventing an acme task from being restarted when the previous one
failed for this specific certificate.
Must be backported in 3.2.
For backends and 0-RTT sessions, this patch modifies the ->start() callback to
wake up the I/O callback only if the connection (and the mux) is not ready. Note
that connect_server() has been modified to call this xprt callback just after
having created and installed the mux. Contrary to 1-RTT sessions, for 0-RTT
sessions the connections are always ready before this ->start xprt callback is
called.
This is required for connections with 0-RTT support, to prevent two mux
creations. Indeed, for 0-RTT sessions, the QUIC mux is already started very
early from connect_server() (src/backend.c).
During 0-RTT sessions, some server transport parameters are reused after having
been saved from previous sessions. These parameters must not be reduced
when the server resends them. The client must check this is the case when some
early data are accepted by the server. This is what is implemented by this patch.
Implement qc_early_transport_params_validate() which checks that the new server
parameters are not reduced.
Also implement qc_ssl_early_data_accepted() which was not implemented for TLS
stacks without 0-RTT support (for instance wolfssl). That said, this function
was no longer used, which is why compilation against wolfssl did not fail.
This patch allows the use of the 0-RTT feature on QUIC server lines with the
"allow-0rtt" option. In fact 0-RTT is really enabled only if
ssl_sock_srv_try_reuse_sess() successfully manages to reuse the SSL session and
the chosen application protocol from previous connections.
Note that, at this time, 0-RTT works only with quictls and aws-lc as TLS stacks.
(0-RTT does not work at all, even for QUIC frontends, with libressl.)
When entering this function, a selection is made of the encryption level
to be used to send data. For a client, the early data encryption level
is used to send 0-RTT if this encryption level is initialized.
The Initial encryption level is also registered in the send list for clients if
there is Initial crypto data to send. This allows Initial and 0-RTT packets to
be coalesced into datagrams.
This patch is required to make 0-RTT work. It modifies the prototype of
quic_build_post_handshake_frames() to send post handshake frames from a
list of frames in place of the application encryption level (used
as <qc->ael> local variable).
This patch does not modify the current QUIC stack behavior at all (even for
QUIC frontends). It must be considered as preparation for the upcoming code
about 0-RTT support for QUIC backends.
A QUIC server never sends 0-RTT packets, contrary to the client.
This very simple modification allows the preparation of 0-RTT packets
with early data as encryption level (->eel).
This function is called for both TCP and QUIC connections to reuse SSL sessions
saved by the ssl_sess_new_srv_cb() callback, which is called upon new SSL
session creation.
In addition to this, a QUIC SSL session must reuse the ALPN and some specific
QUIC transport parameters. This is what is added by this patch for QUIC 0-RTT
sessions.
Note that from now on, ssl_sock_srv_try_reuse_sess() may fail for QUIC
connections if it did not manage to reuse the ALPN. The caller must be informed
of such an issue. It must not enable 0-RTT for the current session in this
case. This is impossible without the ALPN, which is required to start a mux.
ssl_sock_srv_try_reuse_sess() is modified to always succeed for TCP connections.
For both TCP and QUIC connections, the ssl_sess_new_srv_cb() callback is
called when a new SSL session is created. Its role is to save the session to
be reused for the next sessions.
This patch modifies this callback to save the QUIC parameters to be reused
for the next 0-RTT sessions (or during SSL session resumption).
The already existing path_params->nego_alpn member is used to store the ALPN,
as is done for TCP, alongside the new path_params->tps
quic_early_transport_params struct used to save the QUIC transport parameters
to be reused for 0-RTT sessions.
Implement quic_reuse_srv_params() whose role is to reuse the ALPN negotiated
during a first connection to a QUIC backend alongside its transport parameters.
Define the new quic_early_transport_params struct for QUIC transport parameters
related to 0-RTT. These parameters must be saved during a first session to
be reused for subsequent 0-RTT sessions.
qc_early_transport_params_cpy() copies the 0-RTT transport parameters to be
saved during a first connection to a backend. The copy is made from
a quic_transport_params struct to a quic_early_transport_params struct.
On the contrary, qc_early_transport_params_reuse() copies the transport
parameters to be reused for a 0-RTT session from a previous one. The copy is
made from a quic_early_transport_params struct to a quic_transport_params
struct.
Also add the QUIC_EV_EARLY_TRANSP_PARAMS trace event to dump such 0-RTT
transport parameters in traces.
Add a per-thread ist struct to the srv_per_thread struct to store the QUIC
token to be reused for subsequent sessions.
Parse these tokens at packet level (from qc_parse_pkt_frms()) and store
them by calling the newly implemented qc_try_store_new_token() function. This
new function does its best (it may fail) to update the tokens.
Modify qc_do_build_pkt() to resend these tokens by calling quic_enc_token(),
implemented by this patch.
Rename the ->data qf_new_token struct field to ->w_data to distinguish it from
the new ->r_data field used to parse the NEW_TOKEN frame. Indeed, to build the
NEW_TOKEN frame we need to write it to a static buffer inside the frame struct.
To parse it we only need to store the address of the token field inside the
RX buffer.
At this time, connection migration is not supported by QUIC backends.
This patch prevents this process from being launched for connections to QUIC
backends.
Furthermore, the connection migration process could be started systematically
when connecting a backend to INADDR_ANY, leading to crashes in
qc_handle_conn_migration() (when referencing qc->li).
Thank you to @InputOutputZ for having reported this issue in GH #3178.
This patch simply checks the connection type (listener or not) before checking if
a connection migration must be started.
No need to backport because support for QUIC backends is available from 3.3.
Replace the error message of BIO_new_file() when the account-key cannot
be created on disk with "acme: cannot create the file '%s'". It was
previously "acme: out of memory.", which was unclear.
Must be backported to 3.2.
Since commit "1ec59d3 MINOR: init: Make devnullfd global and create it
earlier in init" the devnullfd pointing towards /dev/null gets created
early in the init process but it was closed after the call to
"mworker_run_master". The master process never got to the FD closing
code and we had an FD leak.
This patch does not need to be backported.
Remove QUIC backend connections from global actconn accounting. Indeed,
this counter is only used on the frontend side. This is required to
ensure maxconn coherence.
Due to recent commit 5c299dee ("MEDIUM: stats: consider that shared stats
pointers may be NULL"), shm-stats-file preloading suddenly stopped working.
In fact preloading should be considered as an initialization step, so the
counters may be assigned there without checking for NULL first.
Indeed they are supposed to be NULL because preloading occurs before
counters_{fe,be}_shared_prepare(), which takes care of setting the pointers
for counters if they weren't set before.
Obviously this corner-case was overlooked during 5c299dee writing and
testing. Thanks to Nick Ramirez for having reported the issue.
No backport needed, this issue is specific to 3.3.
Commit ad1bdc33 ("BUG/MAJOR: stats-file: fix crash on non-x86 platform
caused by unaligned cast") revealed an ambiguity in me_generate_field()
around FN_AGE manipulation. For now FN_AGE can only be stored as u32 or
s32, but in the future we could also support 64bit FN_AGES, and the
current code assumes 32bits types and performs and explicit unsigned int
cast. Instead we group current 32 bits operations for FF_U32 and FF_S32
formats, and let room for potential future formats for FN_AGE.
Commit ad1bdc33 also suggested that the fix was temporary and the approach
must change, but after a code review it turns out the current approach
(generic types manipulation under me_generate_field()) is legit. The
introduction of shm-stats-file feature didn't change the logic which
was initially implemented in 3.0. It only extended it and since shared
stats are now spread over thread-groups since 3.3, the use of atomic
operations made typecasting errors more visible, and structure mapping
change from d655ed5f14 ("BUG/MAJOR: stats-file: ensure
shm_stats_file_object struct mapping consistency (2nd attempt)") was in
fact the only change to blame for the crash on non-x86 platforms.
With ambiguities removed in me_generate_field(), let's hope we don't face
similar bugs in the future. Indeed, with generic counters, and more
specifically shared ones (which leverage atomic ops), great care must be
taken when changing their underlying types as me_generate_field() solely
relies on stat_col descriptor to know how to read the stat from a generic
pointer, so any breaking change must be reflected in that function as well.
No backport needed.
On Initial packet parsing, a new quic_conn instance is allocated via
qc_new_conn(). Then a CID is allocated with its value derived from the
client ODCID. On CID tree insert, a collision can occur if another
thread was already parsing an Initial packet from the same client. In
this case, the connection is released and the packet will be requeued to
the other thread.
Originally, CID collision check was performed prior to quic_conn
allocation. This was changed by the commit below, as this could cause
an issue on quic_conn alloc failure.
commit 4ae29be18c5b212dd2a1a8e9fa0ee2fcb9dbb4b3
BUG/MINOR: quic: Possible endless loop in quic_lstnr_dghdlr()
However, this procedure is less optimal. Indeed, qc_new_conn() performs
many steps, thus it could be better to skip it on Initial CID collision,
which can happen frequently. This patch restores the older order of
operations, with CID collision check prior to quic_conn allocation.
To ensure this does not cause the same bug again, the CID is removed in
case of quic_conn alloc failure. This should prevent any loop as it
ensures that a CID found in the global tree does not point to a NULL
quic_conn, unless the CID is attached to a foreign thread. When this thread
will parse a re-enqueued packet, either the quic_conn is already
allocated or the CID has been removed, triggering a fresh CID and
quic_conn allocation procedure.
CIDs are provided by haproxy so that the peer can use them as DCID of
its packets. Their value is set via a random generator. It happens on
several occasions during connection lifetime:
* via ODCID derivation if haproxy is the server
* on quic_conn init if haproxy is the client
* during post-handshake if haproxy is the server
* on RETIRE_CONNECTION_ID frame parsing
CIDs are stored in a global tree. On ODCID derivation, a check is
performed to ensure the CID is not a duplicate value. This is mandatory
to properly handle multiple INITIAL packets from the same client on
different threads.
However, for the other cases, no check is performed for CID collision.
As _quic_cid_insert() is silent, the issue is not detected at all. This
results in a CID advertised to the peer but not stored in the global
tree. In the end, this may cause two issues. The first one is that
packets from the client which use the new CID will be rejected by
haproxy, most probably with a STATELESS_RESET. The second issue is that
it can cause a crash during quic_conn release. Indeed, the CID is stored
in the quic_conn local tree and thus an eb_delete() from the global tree
will be performed on release. As the <leaf_p> member is uninitialized, this
results in a segfault.
Note that this issue is pretty rare. It can only be observed if running
with a high number of concurrent connections in parallel, so that the
random generator will provide duplicate values. Patch is still labelled
as MEDIUM as this modifies code paths used frequently.
To fix this, the unsafe _quic_cid_insert() function is completely removed.
Instead, quic_cid_insert() can be used, which reports an error code if a
collision happens. CIDs are then stored in the quic_conn tree only after
a successful global tree insert. Here is the solution for each step if a
collision occurs :
* on init as client: the connection is completely released
* post-handshake: the CID is immediately released. The connection is
kept, but it will miss an extra CID.
* on RETIRE_CONNECTION_ID parsing: a loop is implemented to retry random
generation. If it fails several times, the connection is closed in
error.
A small convenience change is made to quic_cid_insert(). The output
parameter <new_tid> can now be NULL, which is useful as most of the
time callers do not care about it.
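For the RETIRE_CONNECTION_ID case, the retry loop can be sketched as follows;
the helper names are hypothetical and a zero return from quic_cid_insert() is
assumed to mean success:

    /* sketch: retry random CID generation a few times on collision,
     * and give up on the connection if every attempt fails */
    for (i = 0; i < 3; i++) {
            cid = new_random_cid(qc);              /* hypothetical helper */
            if (!cid)
                    break;
            if (quic_cid_insert(cid, NULL) == 0)   /* assumed 0 = success */
                    break;
            release_cid(cid);                      /* hypothetical helper */
            cid = NULL;
    }
    if (!cid)
            qc_close_in_error(qc);                 /* hypothetical helper */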
This must be backported up to 2.6.
Split new_quic_cid() function into multiple ones. This patch should not
introduce any visible change. The objective is to render CID allocation
and generation more modular.
The first advantage of this patch is to bring code simplification. In
particular, conn CID sequence number increment and insertion into the
connection tree are simpler than before. Another improvement is that
errors can now be handled more easily at each step of the CID
init.
This patch is a prerequisite for the fix on CID collision, thus it must
be backported prior to it to every affected version.
Change qc_new_conn() so that the connection CID tree is allocated
earlier in the function. This patch does not introduce a behavior
change. Its objective is to facilitate future evolutions on CIDs
handling.
This patch is a prerequisite for the fix on CID collision, thus it must
be backported prior to it to every affected version.
During RETIRE_CONNECTION_ID frame parsing, a new connection ID is
immediately reallocated after the release of the previous one. This is
done to ensure that the peer will never run out of DCIDs.
Prior to this patch, a CID allocation failure was silently ignored.
This prevented the emission of a new CID, which could prevent the peer from
emitting packets if it had no other CIDs available for use. Now, such an error
is considered fatal to the connection. This is the safest solution as
it's better to close connections when memory is running low.
It must be backported up to 2.8.
Amaury reported a case where "${FOO[*]}" still produces an empty field.
It happens if the variable is defined but does not contain any non-space
characters. The reason is that we special-case word expansion only on
non-existing vars. Let's change the ordering of operations so that word-
expanded vars always pretend the current arg is not an empty quote, so
that we don't make any difference between a non-existing var and an
empty one.
No backport is needed unless commit 1968731765 ("BUG/MEDIUM: config:
solve the empty argument problem again") is.
Released version 3.3-dev12 with the following main changes :
- MINOR: quic: enable SSL on QUIC servers automatically
- MINOR: quic: reject conf with QUIC servers if not compiled
- OPTIM: quic: adjust automatic ALPN setting for QUIC servers
- MINOR: sample: optional AAD parameter support to aes_gcm_enc/dec
- REGTESTS: converters: check USE_OPENSSL in aes_gcm.vtc
- BUG/MINOR: resolvers: ensure fair round robin iteration
- BUG/MAJOR: stats-file: fix crash on non-x86 platform caused by unaligned cast
- OPTIM: backend: skip conn reuse for incompatible proxies
- SCRIPTS: build-ssl: allow to build a FIPS version without FIPS
- OPTIM: proxy: move atomically access fields out of the read-only ones
- SCRIPTS: build-ssl: fix rpath in AWS-LC install for openssl and bssl bin
- CI: github: update to macos-26
- BUG/MINOR: quic: fix crash on client handshake abort
- MINOR: quic: do not set conn member if ssl_sock_ctx
- MINOR: quic: remove connection arg from qc_new_conn()
- BUG/MEDIUM: server: Add a rwlock to path parameter
- BUG/MEDIUM: server: Also call srv_reset_path_parameters() on srv up
- BUG/MEDIUM: mux-h1: fix 414 / 431 status code reporting
- BUG/MEDIUM: mux-h2: make sure not to move a dead connection to idle
- BUG/MEDIUM: connections: permit to permanently remove an idle conn
- MEDIUM: cfgparse: deprecate 'master-worker' keyword alone
- MEDIUM: cfgparse: 'daemon' not compatible with -Ws
- DOC: configuration: deprecate the master-worker keyword
- MINOR: quic: remove <mux_state> field
- BUG/MEDIUM: stick-tables: Make sure we handle expiration on all tables
- MEDIUM: stick-tables: Optimize the expiration process a bit.
- MEDIUM: ssl/ckch: use ckch_store instead of ckch_data for ckch_conf_kws
- MINOR: acme: generate a temporary key pair
- MEDIUM: acme: generate a key pair when no file are available
- BUILD: ssl/ckch: wrong function name in ckch_conf_kws
- BUILD: acme: acme_gen_tmp_x509() signedness and unused variables
- BUG/MINOR: acme: fix initialization issue in acme_gen_tmp_x509()
- BUILD: ssl/ckch: fix ckch_conf_kws parsing without ACME
- MINOR: server: move the lock inside srv_add_idle()
- DOC: acme: crt-store allows you to start without a certificate
- BUG/MINOR: acme: allow 'key' when generating cert
- MINOR: stconn: Add counters to SC to know number of bytes received and sent
- MINOR: stream: Add samples to get number of bytes received or sent on each side
- MINOR: counters: Add req_in/req_out/res_in/res_out counters for fe/be/srv/li
- MINOR: stream: Remove bytes_in and bytes_out counters from stream
- MINOR: counters: Remove bytes_in and bytes_out counter from fe/be/srv/li
- MINOR: stats: Add stats about request and response bytes received and sent
- MINOR: applet: Add function to get amount of data in the output buffer
- MINOR: channel: Remove total field from channels
- DEBUG: stream: Add bytes_in/bytes_out value for both SC in session dump
- MEDIUM: stktables: Limit the number of stick counters to 100
- BUG/MINOR: config: Limit "tune.maxpollevents" parameter to 1000000
- BUG/MEDIUM: server: close a race around ready_srv when deleting a server
- BUG/MINOR: config: emit warning for empty args when *not* in discovery mode
- BUG/MEDIUM: config: solve the empty argument problem again
- MEDIUM: config: now reject configs with empty arguments
- MINOR: tools: add support for ist to the word fingerprinting functions
- MINOR: tools: add env_suggest() to suggest alternate variable names
- MINOR: tools: have parse_line's error pointer point to unknown variable names
- MINOR: cfgparse: try to suggest correct variable names on errors
- IMPORT: cebtree: Replace offset calculation with offsetof to avoid UB
- BUG/MINOR: acme: wrong dns-01 challenge in the log
- MEDIUM: backend: Defer conn_xprt_start() after mux creation
- MINOR: peers: Improve traces for peers
- MEDIUM: peers: No longer ack updates during a full resync
- MEDIUM: peers: Remove commitupdate field on stick-tables
- BUG/MEDIUM: peers: Fix update message parsing during a full resync
- MINOR: sample/stats: Add "bytes" in req_{in,out} and res_{in,out} names
- BUG/MEDIUM: stick-tables: Make sure updates are seen as local
- BUG/MEDIUM: proxy: use aligned allocations for struct proxy
- BUG/MEDIUM: proxy: use aligned allocations for struct proxy_per_tgroup
- BUG/MINOR: acme: avoid a possible crash on error paths
In acme_EVP_PKEY_gen(), an error message is printed if *errmsg is set.
However, since commit 546c67d13 ("MINOR: acme: generate a temporary key
pair"), errmsg is passed as NULL in at least one occurrence, leading
the compiler to issue a NULL deref warning at -O3. And indeed, if such
errors are encountered, a crash will occur. No backport is needed.
In 3.2, commit f879b9a18 ("MINOR: proxies: Add a per-thread group field
to struct proxy") introduced struct proxy_per_tgroup that is declared as
thread_aligned, but is allocated using calloc(). Thus it is at risk of
crashing on machines using instructions requiring 64-byte alignment such
as AVX512. Let's use ha_aligned_zalloc_typed() instead of calloc().
For 3.2, we don't have aligned allocations, so instead the THREAD_ALIGNED()
will have to be removed from the struct definition. Alternately, we could
manually align it as is done for fdtab.
Commit fd012b6c5 ("OPTIM: proxy: move atomically access fields out of
the read-only ones") caused the proxy struct to be 64-byte aligned,
which allows the compiler to use optimizations such as AVX512 to zero
certain fields. However the struct was allocated using calloc() so it
was not necessarily aligned, causing segv on startup on compatible
machines. Let's just use ha_aligned_zalloc_typed() to allocate the
struct.
No backport is needed.
In stktable_touch_with_exp, if it is a local update, add it to the
pending update list even if it's already in the tree as a remote update,
otherwise it will never be communicated to other peers.
It used to work before 3.2 because of the ordering of operations, but
it's been broken by adding an extra step with the pending update list,
so we now have to explicitly check for that.
This should be backported to 3.2.
The number of bytes received or sent by a client or a server is now
saved. Sample fetches and stats fields used to retrieve this information
are renamed to include "bytes" in their names, to avoid any ambiguity
with the number of requests and responses.
The commit 590c5ff2e ("MEDIUM: peers: No longer ack updates during a full
resync") introduced a regression. During a full resync, the ID of an update
message is not parsed at all. Thus, the parsing of the whole message is
desynchronized.
On a full resync the update ID itself is still ignored, so that it is not
acked, but it must be parsed. It is now fixed.
It is a 3.3-specific bug, no backport needed.
This stick-table field was atomically updated with the last update ID
pushed, and dumped on the CLI, but never used otherwise. And all peer
sessions share the same ID because it is stick-table information, so the
info in the peers dump is pretty limited.
So, let's remove it.
ACK messages received by a peer sending updates during a full resync are
ignored. So, on the other side, there is no reason to still send these ACK
messages. Let's skip them.
In addition, the updates received during this stage are not considered as
needing to be acked. It is important to make sure that ACK messages are
properly emitted once the full resync is finished.
Trace messages for peers were only protocol-oriented and the information
provided was quite light. With this patch, the traces are improved:
information about the peer, its applet and the section is dumped. Several
verbosities are now available and messages are emitted at different
levels depending on the context. It should be easier to track issues in
the peers.
In connect_server(), defer the call to conn_xprt_start() until after we
have had a chance to create the mux. The xprt can behave differently
depending on whether a mux is available at this point; if it is, it may
want to wait until some data comes from the mux.
This does not need to be backported.
Since 861fe532046 ("MINOR: acme: add the dns-01-record field to the
sink"), the dns-01 challenge is output in the dns_record trash, instead
of the global trash.
The send_log string was never updated with this change, and dumps some
data from the global trash instead. Since the last data emitted in the
trash seems to be the dns-01 token from the authorization object, it
looks like the response to the challenge.
This must be backported to 3.2.
This is the same as the equivalent fix in ebtree:
The C standard specifies that it's undefined behavior to dereference
NULL (even if you use & right after). The hand-rolled offsetof idiom
&(((s*)NULL)->f) is thus technically undefined. This clutters the
output of UBSan and is simple to fix: just use the real offsetof when
it's available.
This is cebtree commit 2d08958858c2b8a1da880061aed941324e20e748.
When an empty argument comes from the use of a non-existing variable,
we'll now detect the difference with an empty variable (error pointer
points to the variable's name instead), and submit it to env_suggest()
to see if another variable looks likely to be the right one or not.
This can be quite useful to quickly figure out how to fix misspelled
variable names. Currently only sequences of letters, digits and
underscores are attempted to be resolved as a name. A typical example is:
peer "${HAPROXY_LOCAL_PEER}" 127.0.0.1:10000
which produces:
[ALERT] (24231) : config : parsing [bug-argv4.cfg:2]: argument number 1 at position 13 is empty and marks the end of the argument list:
peer "${HAPROXY_LOCAL_PEER}" 127.0.0.1:10000
^
[NOTICE] (24231) : config : Hint: maybe you meant HAPROXY_LOCALPEER instead ?
When an argument is empty, parse_line() currently returns a pointer to
the empty string itself. This is convenient, but it's only actionable by
the user who will see for example "${HAPROXY_LOCALPEER}" and figure out what
is wrong. Here we slightly change the reported pointer so that if an empty
argument results from the evaluation of an empty variable (meaning that
all variables in string are empty and no other char is present), then
instead of pointing to the opening quote, we'll return a pointer to the
first character of the variable's name. This makes it possible to
distinguish an empty variable from an unknown variable, and lets the
caller take action based on this.
I.e. before we would get:
log "${LOG_SERVER_IP}" local0
^
if LOG_SERVER_IP is not set, and now instead we'll get this:
log "${LOG_SERVER_IP}" local0
^
The purpose here is to look in the environment for a variable whose
name looks like the provided one. This will be used to try to auto-
correct misspelled environment variables that would silently be turned
to an empty string.
The word fingerprinting functions are used to compare similar words to
suggest a correctly spelled one that looks like what the user proposed.
Currently the functions only support const char*, but there's no reason
for this, and it would be convenient to support substrings extracted
from random pieces of configuration. Here we're adding new "_with_len"
variants that take such ISTs; they are in fact a slight generalization of
the original functions, which now rely on them.
As prepared during 3.2, we must error on empty arguments because they
mark the end of the line and cause subsequent arguments to be silently
ignored. It was too late in 3.2 to turn that into an error so it's a
warning, but for 3.3 it needed to be an alert.
This patch does that. It doesn't instantly break; instead it counts
one fatal error per violating line. This allows emitting several errors
at once, which can often be caused by the same variable being missed,
or a group of variables sharing a same misspelled prefix for example.
Tests show that it helps locate them better. It also explains what to
look for in the config manual for help with variables expansion.
This mostly reverts commit ff8db5a85 ("BUG/MINOR: config: Stopped parsing
upon unmatched environment variables").
As explained in issue #2367, finally the fix above was incorrect because
it causes other trouble such as this:
log "192.168.100.${NODE}" "local0"
being resolved to this:
log 192.168.100.local0
when NODE does not exist due to the loss of the spaces. In fact, while none
of us was well aware of this, when the user had:
server app 127.0.0.1:80 "${NO_CHECK}" weight 123
in fact they should have written it this way:
server app 127.0.0.1:80 "${NO_CHECK[*]}" weight 123
so that the variable is expanded to zero, one or multiple words, leaving
no empty arg (like in shell). This is supported since 2.3 with commit
fa41cb6 so the right fix is in the config, let's revert the fix and
properly address the issue.
Some changes are necessary however, since after that patch, the in_arg
checks were added and are now inserting an empty argument even for
proper error reporting. For example, the following statement:
acl foo path "/a" "${FOO[*]}" "/b"
would complain about an empty arg at FOO due to in_arg=1, while dropping
this in_arg=1 with the following config:
acl foo path "/a" "${FOO}" "/b"
would silently stop after "/a" instead of complaining about an empty
field. So the approach here consists in noting whether or not something
was written since the quotes were emitted, in order to decide whether
or not to produce an argument. This way, "" continues to be an explicitly
empty arg, just like the same with an unknown variable, while "${FOO[*]}"
is allowed to prevent the creation of an argument if empty.
This should be backported to *some* versions, but the risk that some
configs were altered to rely on the broken fix is not null. At least
recent LTS should be reverted. Note that this requires previous commit:
BUG/MINOR: config: emit warning for empty args when *not* in discovery mode
otherwise this will break again configs relying on HAPROXY_LOCALPEER and
maybe a few other variables set at the end of discovery.
This actually reverses the condition of commit 5f1fad1690 ("BUG/MINOR:
config: emit warning for empty args only in discovery mode"). Indeed,
some variables are not known in discovery mode (e.g. HAPROXY_LOCALPEER),
and statements like:
peer "${HAPROXY_LOCALPEER}" 127.0.0.1:10000
are broken during discovery mode. It turns out that the warning is
currently hidden by commit ff8db5a85d ("BUG/MINOR: config: Stopped
parsing upon unmatched environment variables") since it silently drops
empty args which is sufficient to hide the warning, but it also breaks
other configs and needs to be reverted, which will break configs like
above again.
In issue #2995 we were not fully decided about discovery mode or not,
and already suspected some possible issues without being able to guess
which ones. The only downside of not displaying them in discovery mode
is that certain empty fields on the rare keywords specific to master
mode might remain silent until used. Let's just flip the condition to
check for empty args in normal mode only.
This should be backported to 3.2 after some time of observation.
When a server is being disabled or deleted, in case it matches the
backend's ready_srv, this one is reset. However it's currently done in
a non-atomic way when the server goes down, and that could occasionally
reset the entry matching another server, but more importantly if in
parallel some requests are dequeued for that server, it may re-appear
there after having been removed, leading to a possible crash once it
is fully removed, as shown in issue #3177.
Let's make sure we reset the pointer when detaching the server from
the proxy, and use a CAS in both cases to only reset this server.
This fix needs to be backported to 3.2. There, srv_detach() is in
server.c instead of server.h. Thanks to Basha Mougamadou for the
detailed report and the useful backtraces.
"tune.maxpollevents" global parameter was not limited. It was possible to
set any integer value. But this value is used to allocate the array of
events used by epoll. With a huge value, it seems the allocation silently
fails, making haproxy totally unresponsive.
So let's limit its value to 1 million. It is pretty high and it should
not be an issue to forbid greater values. The documentation was updated
accordingly.
This patch could be backported to all stable branches.
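For illustration only, the setting is capped in the global section; the
value below is an arbitrary example well under the new hard limit:
global
    tune.maxpollevents 10000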
"tune.stick-counters" global parameter was accepting any positive integer
value. But the maximum value is incredibly high. Setting a huge value has
a significant impact on memory and CPU usage. To avoid any issue, this
value is now limited to 100. It should be more than enough for all usages.
It can be seen as a breaking change.
The <total> field in the channel structure is now useless, so it can be
removed. The <bytes_in> field from the SC is used instead.
This patch is related to issue #1617.
The helper function applet_output_data() returns the amount of data in the
output buffer of an applet. For applets using the new API, it is based on
data present in the outbuf buffer. For legacy applets, it is based on input
data present in the input channel's buffer. The HTX version,
applet_htx_output_data(), is also available.
This patch is related to issue #1617.
In previous patches, these counters were added per frontend, backend, server
and listener. With this patch, these counters are reported on stats,
including promex.
Note that the stats file minor version was incremented by one because the
shm_stats_file_object struct size has changed.
This patch is related to issue #1617.
bytes_in and bytes_out counters per frontend, backend, listener and server
were removed; we now rely on the req_in and res_in counters,
respectively.
This patch is related to issue #1617.
The per-stream bytes_in and bytes_out counters were removed and replaced
by req.in and res.in. Corresponding samples still exist but rely on the
new counters.
This patch is related to issue #1617.
Thanks to the previous patch, and based on info available on the stream, it
is now possible to have counters for frontends, backends, servers and
listeners to report number of bytes received and sent on both sides.
This patch is related to issue #1617.
req.in and req.out samples can now be used to get the number of bytes
received from a client and sent to the server. And res.in and res.out
samples can be used to get the number of bytes received from a server and
sent to the client. This information is stored in the logs structure
inside a stream.
This patch is related to issue #1617.
<bytes_in> and <bytes_out> counters were added to SC to count, respectively,
the number of bytes received from an endpoint or sent to an endpoint. These
counters are updated for connections and applets.
This patch is related to issue #1617.
If your acme certificate is declared in a crt-store, and the certificate
file does not exist on the disk, HAProxy will start with a temporary key
pair.
Almost all callers of _srv_add_idle() lock the list then call the
function. It's not the most efficient and it requires the caller to
take care of that lock. Let's change this a little bit by
having srv_add_idle() that takes the lock and calls _srv_add_idle() that
is now inlined. This way callers don't have to handle the lock themselves
anymore, and the lock is only taken around the sensitive parts, not the
function call+return.
Interestingly, perf tests show a small perf increase from 2.28-2.32M RPS
to 2.32-2.37M RPS on a 128-thread system.
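For illustration, here is a minimal standalone sketch of the wrapping
pattern described above, using a pthread mutex and a plain counter as
stand-ins for HAProxy's own lock and idle-connection list (names and
types here are hypothetical, not the actual HAProxy code):

#include <pthread.h>

struct server {
    pthread_mutex_t idle_lock;   /* stand-in for the real idle list lock */
    unsigned int curr_idle;      /* stand-in for the real idle list */
};

/* does the actual work; the caller must already hold the lock */
static inline void _srv_add_idle(struct server *srv)
{
    srv->curr_idle++;
}

/* public entry point: the lock is only held around the sensitive part,
 * not around the whole call+return path of every caller */
static inline void srv_add_idle(struct server *srv)
{
    pthread_mutex_lock(&srv->idle_lock);
    _srv_add_idle(srv);
    pthread_mutex_unlock(&srv->idle_lock);
}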
src/acme.c: In function ‘acme_gen_tmp_x509’:
src/acme.c:2685:15: error: ‘digest’ may be used uninitialized [-Werror=maybe-uninitialized]
2685 | if (!(X509_sign(newcrt, pkey, digest)))
| ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/acme.c:2628:23: note: ‘digest’ was declared here
2628 | const EVP_MD *digest;
| ^~~~~~
When an acme keyword is associated with a crt and key, and the
corresponding files do not exist, HAProxy would not start.
This patch allows configuring acme without pre-generating a keypair before
starting HAProxy. If the files do not exist, it tries to generate a unique
keypair in memory, that will be used for every ACME certificate that
doesn't have a file on the disk yet.
This patch provides two functions acme_gen_tmp_pkey() and
acme_gen_tmp_x509().
These functions generate a unique keypair and X509 certificate that
will be stored in tmp_x509 and tmp_pkey. If the key pair or certificate
was already generated, they will return the existing one.
The key is an RSA2048 and the X509 is generated with an expiration in the
past. The CN is "expired".
These are just placeholders to be used if we don't have files.
This is an API change, instead of passing a ckch_data alone, the
ckch_conf_kws.func() is called with a ckch_store.
This allows the callback to access the whole ckch_store, with the
ckch_conf and the ckch_data. But it requires the ckch_conf to be
actually put in the ckch_store before.
In process_tables_expire(), if the table we're analyzing still has
entries, and thus should be put back into the tree, put it back into the
tree right away instead of placing it in the mt_list to be re-inserted
the next time the task runs.
There is no problem with putting it in the tree right away, as either
the next expiration is in the future, or we handled the maximum number
of expirations per task call and we're about to stop, anyway.
This does not need to be backported.
In process_tables_expire(), when parsing all the tables with an
expiration set to check if any entry expired, make sure we start from the
oldest one; we can't just rely on eb32_first(), because of sign issues
on the timestamp.
Not doing that may mean some tables are not considered for expiration.
This does not need to be backported.
This patch removes <mux_state> field from quic_conn structure. The
purpose of this field was to indicate if MUX layer above quic_conn is
not yet initialized, active, or already released.
It became tedious to properly set it, as the initialization order of the
various quic_conn/conn/MUX layers now differs between the frontend and
backend sides, and also depends on whether 0-RTT is used or not. Recently,
a new change introduced in connect_server() allows initializing the QUIC
MUX earlier if the ALPN is cached on the server structure. This added
another level of complexity.
Thus, this patch removes <mux_state> field completely. Instead, a new
flag QUIC_FL_CONN_XPRT_CLOSED is defined. It is set at a single place
only on close XPRT callback invocation. It can be combined with the new
utility functions qc_wait_for_conn()/qc_is_conn_ready() to determine the
status of conn/MUX layers now without an extra quic_conn field.
Deprecate the 'master-worker' keyword in the global section.
Split the configuration of the 'no-exit-on-failure' subkeyword into
another section which is not deprecated yet and explain that it is only
meant for debugging purposes.
Emit a warning when the 'daemon' keyword is used in master-worker mode
for systemd (-Ws). This never worked and was always ignored by setting
MODE_FOREGROUND during cmdline parsing.
Warn when the 'master-worker' keyword is used without
'no-exit-on-failure'.
Warn when the 'master-worker' keyword is used and -W and -Ws already set
the mode.
There's currently a function conn_delete_from_tree() which is used to
detach an idle connection from the tree it's currently attached to so
that it is no longer found. This function is used in three circumstances:
- when picking a new connection that no longer has any avail stream
- when temporarily working on the connection from an I/O handler,
in which case it's re-added at the end
- when killing a connection
The 2nd case above is quite specific, as it requires to preserve the
CO_FL_LIST_MASK flags so that the connection can be re-inserted into
the proper tree when leaving the handler. However, there's a catch.
When killing a connection, we want to be certain it will not be
reinserted into the tree. The flags preservation is causing a tiny
race if an I/O happens while the connection is in the kill list,
because in this case the I/O handler will note the connection flags,
do its work, then reinsert the connection where it believed it was,
then the connection gets purged, and another user can find it in the
tree.
The issue is very difficult to reproduce. On a 128-thread machine it
happens in H2 around 500k req/s after around 50M requests. In H1 it
happens after around 1 billion requests.
The fix here consists in passing an extra argument to the function to
indicate if the removal is permanent or not. When it's permanent, the
function will clear the associated flags. The callers were adjusted
so that all those dequeuing a connection in order to kill it do it
permanently and all other ones do it only temporarily.
A slightly different approach could have worked: the function could
always remove all flags, and the callers would need to restore them.
But this would require trickier modifications of the various call
places, compared to only passing 0/1 to indicate the permanent status.
This will need to be backported to all stable versions. The issue was
at least reproduced since 3.1 (not tested before). The patch will need
to be adjusted for 3.2 and older, because a 2nd argument "thr" was
added in 3.3, so the patch will not apply to older versions as-is.
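As a rough standalone sketch of the chosen approach (not the actual
HAProxy code; the structure and flag mask below are placeholders), the
permanent flag is what decides whether the list-membership flags survive
the removal:

#include <stdbool.h>

#define LIST_FLAGS_MASK 0x3u   /* placeholder for CO_FL_LIST_MASK */

struct conn {
    unsigned int flags;        /* tree/list attachment details elided */
};

/* <permanent> indicates the connection must never be re-inserted: in
 * that case the membership flags are cleared so that an I/O handler
 * running concurrently cannot put it back into a tree afterwards. */
static void conn_delete_from_tree(struct conn *conn, bool permanent)
{
    /* ... the actual detach from the ebtree would happen here ... */
    if (permanent)
        conn->flags &= ~LIST_FLAGS_MASK;
}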
In h2_detach(), it looks possible to place a dead connection back to
the idle list, and to later call h2_release() on it once detected as
dead. It's not certain that it happens but nothing in the code shows
it is not possible, so better make sure it cannot happen.
This should be preventively backported to all versions.
The more detailed status code reporting introduced with bc967758a2 is
checking against the error state to determine whether it is a too long
URL or too large headers. The check used always returns true which
results in a 414 as the error state is only set at a later point.
This commit adjusts the check to use the current state instead to return
the intended status code.
This patch must be backported as far as 3.1.
Also call srv_reset_path_parameters() when the server changed states,
and got up. It is not enough to do it when the server goes down, because
there's a small race condition, and a connection could get established
just after we did it, and could have set the path parameters.
This does not need to be backported.
Add a rwlock to control the server's path_parameter, to make sure
multiple threads don't set it at the same time, and it can't be seen in
an inconsistent state.
Also don't set the parameters every time; only set them if they have
changed, to prevent needless writes.
This does not need to be backported.
This patch is similar to the previous one, this time dealing with
qc_new_conn(). This function was asymmetric between the frontend and
backend sides, as the connection argument was set only in the latter case.
This was previously required due to the qc_alloc_ssl_sock_ctx() signature.
This has changed with the previous patch, thus qc_new_conn() can also be
realigned on both FE and BE sides. The <conn> member of the quic_conn
instance is always set outside of it, in qc_xprt_start() in the backend
case.
ssl_sock_ctx is a generic object used both on TCP/SSL and QUIC stacks.
Most notably it contains a <conn> member which is a pointer to struct
connection.
On QUIC frontend side, this member is always set to NULL. Indeed,
connection is only created after handshake completion. However, this has
changed for backend side, where the connection is instantiated prior to
its quic_conn counterpart. Thus, ssl_sock_ctx member would be set in
this case as a convenience for use later in qc_ssl_do_hanshake().
However, this method was unsafe as the connection can be released,
without resetting the ssl_sock_ctx member. Thus, the previous patch fixes
this by using the <conn> member of the quic_conn instance, which is
the proper way.
Thus, this patch resets ssl_sock_ctx <conn> member to NULL. This is
deemed the cleanest method as it ensures that both frontend and backend
sides must not use it anymore.
On backend side, a connection can be aborted and released prior to
handshake completion. This causes a crash in qc_ssl_do_hanshake() as
<conn> member of ssl_sock_ctx is not reset in this case.
To fix this, use <conn> member of quic_conn instead. This is safe as it
is properly set to NULL when a connection is released.
No impact on the frontend side as the <conn> member is not accessed.
Indeed, in this case the connection is most of the time allocated after
handshake completion.
No need to be backported.
macOS-15 images seem to have had difficulties running the reg-tests for a
few days, for an unknown reason. Rolling back both VTest2 and haproxy
doesn't seem to fix the problem so this is probably related to a change
in GitHub actions.
This patch switches the image to the new macos-26 images, which seems to
fix the problem.
Perf top showed that h1_snd_buf() was having great difficulties accessing
the proxy's server_id_hdr_name field in the middle of the headers loop.
Moving the assignment out of the loop to a local variable moved the
problem there as well:
| if (!(h1m->flags & H1_MF_RESP) && isttest(h1c->px->server_id_hdr_n
0.10 |20b0: mov -0x120(%rbp),%rdi
1.33 | mov 0x60(%rdi),%r10
0.01 | test %eax,%eax
0.18 | jne 2118
12.87 | mov 0x350(%r10),%rdi
0.01 | test %rdi,%rdi
0.05 | je 2118
| mov 0x358(%r10),%r11
It turns out that there are several atomically accessed fields in its
vicinity, causing the cache line to bounce all the time. Let's collect
the few frequently changed fields and place them together at the end
of the structure, and plug the 32-bit hole with another isolated field.
Doing so also reduced a little bit the cost of decrementing be->be_conn
in process_stream(), and overall the HTTP/1 performance increased by
about 1% both on ARM and x86_64.
build-ssl.sh always prepends a "v" to the version, preventing building a
FIPS version without FIPS enabled.
This patch checks if FIPS is in the version string to choose whether to
add the "v" or not.
Example:
AWS_LC_VERSION=AWS-LC-FIPS-3.0.0 BUILDSSL_DESTDIR=/opt/awslc-3.0.0 ./scripts/build-ssl.sh
When trying to reuse a backend connection, a connection hash is
calculated to match an entry with similar parameters. Previously, this
operation was skipped if the stream content wasn't based on HTTP, as it
would have been incompatible with http-reuse.
With the introduction of SPOP backends, this condition was removed, so
that it can also benefit from connection reuse. However, this means that
the hash calculation is now always performed when connecting to a server,
even for TCP or log backends. This is unnecessary as these proxies cannot
perform connection reuse.
Note also that the reuse mode is reset during post-parsing for incompatible
backends. This at least guarantees that no tree lookup will be performed
via be_reuse_connection(). However, connection lookup is still performed
in the session via session_get_conn() which is another unnecessary
operation.
Thus, this patch restores the condition so that reuse operations are now
entirely skipped if a backend mode is incompatible. This is implemented
via a new utility function named be_supports_conn_reuse().
This could be backported up to 3.1, as this commit could be considered
as a performance regression for tcp/log backend modes.
Since commit d655ed5f14 ("BUG/MAJOR: stats-file: ensure
shm_stats_file_object struct mapping consistency (2nd attempt)"), the
last_state_change field in the counters is a uint (to match how it's
reported). However, it happens that there are explicit casts in function
me_generate_field() to retrieve the value, and which cause crashes on
aarch64 and likely other non-x86 64-bit platforms due to atomically
reading an unaligned 64-bit value, and may even randomly crash other
64-bit platforms when reading past the end of the structure.
The fix for now adapts the cast to match the one used by the accessed
type (i.e. unsigned int), but the approach must change, as there's
nothing there which allows figuring out whether or not the type is correct
by just reading the code. At a minimum, a typeof() on a named field is
needed, but this requires more invasive changes, hence this temporary
fix.
No backport is needed, as stats-file is only in 3.3.
Previous fixes restored round robin iteration, but an imbalance remains
when the response tree contains record types other than A or AAAA. Let's
take the following example: the DNS answers two A records and a CNAME.
The response "tree" (which is actually flat, more like a list) may look
as follows, ordered by hash:
- 1st item: first A record with IP 1
- 2nd item: second A record with IP 2
- 3rd item: CNAME record
As a consequence, resolv_get_ip_from_response will iterate as follows,
while the TTL is still valid:
- 1st call: DNS request is done, response tree is created, iteration
starts at the first item, IP 1 is returned.
- 2nd call: cached response tree is used, iteration starts at the second
item, IP 2 is returned.
- 3rd call: cached response tree is used, iteration starts at the third
item, but it's a CNAME, so we continue to the next item, which restarts
iteration at the first item, and IP 1 is returned.
- 4th call: cached response tree is used and iteration restarts at the
beginning, returning IP 1 again.
The 1-2-1-1-2-1-1-2 sequence will repeat, so IP 1 will be used twice as
often as IP 2, creating a strong imbalance. Even with more IP addresses,
the first one by hashing order in the tree will always receive twice the
traffic of the others.
To fix this, set the next iteration item to the one following the selected
IP record, if any. This ensures we never use the same IP twice in a row.
This commit should be backported where 3023e9819 ("BUG/MINOR: resolvers:
Restore round-robin selection on records in DNS answers") is, so as far
as 2.6.
The aes_gcm_enc() and aes_gcm_dec() sample converters now accept an
optional fifth argument for Additional Authenticated Data (AAD). When
provided, the AAD value is base64-decoded and used during AES-GCM
encryption or decryption. Both string and variable forms are supported.
This enables use cases that require authentication of additional data.
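A hypothetical usage sketch, assuming the variables below hold
base64-encoded values and that the AAD is simply appended as the fifth
argument after the existing bits/nonce/key/tag ones (check the converter
documentation for the exact argument order):
http-response set-header X-Token "%[var(txn.plain),aes_gcm_enc(128,txn.nonce,txn.key,txn.tag,txn.aad)]"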
If a QUIC server is declared without ALPN, "h3" value is automatically
set during _srv_parse_finalize().
This patch adjusts this operation. Instead of relying on
ssl_sock_parse_alpn(), a plain strdup() is used. This is considered more
efficient as the ALPN string is constant in this case. This method is
already used for listeners on the frontend side.
Ensure that QUIC support is compiled into haproxy when a QUIC server is
configured. This check is performed during _srv_parse_finalize() so that
it is detected both on configuration parsing and when adding a dynamic
server via the CLI.
Note that this changes the behavior of srv_is_quic() utility function.
Previously, it always returned false when QUIC support wasn't compiled.
With this new check introduced, it is now guaranteed that a QUIC server
won't exist if compilation support is not active. Hence srv_is_quic()
no longer relies on the USE_QUIC define.
Previously, QUIC servers were rejected if SSL was not explicitly
activated using the 'ssl' configuration keyword.
Change this behavior: now SSL is automatically activated for QUIC
servers when the keyword is missing. A warning is displayed as it is
considered better to explicitly note that SSL is in use.
Released version 3.3-dev11 with the following main changes :
- BUG/MEDIUM: mt_list: Make sure not to unlock the element twice
- BUG/MINOR: quic-be: unchecked connections during handshakes
- BUG/MEDIUM: cli: also free the trash chunk on the error path
- MINOR: initcalls: Add a new initcall stage, STG_INIT_2
- MEDIUM: stick-tables: Use a per-shard expiration task
- MEDIUM: stick-tables: Remove the table lock
- MEDIUM: stick-tables: Stop if stktable_trash_oldest() fails.
- MEDIUM: stick-tables: Stop as soon as stktable_trash_oldest succeeds.
- BUG/MEDIUM: h1-htx: Don't set HTX_FL_EOM flag on 1xx informational messages
- BUG/MEDIUM: h3: properly encode response after interim one in same buf
- BUG/MAJOR: pools: fix default pool alignment
- MINOR: ncbuf: extract common types
- MINOR: ncbmbuf: define new ncbmbuf type
- MINOR: ncbmbuf: implement add
- MINOR: ncbmbuf: implement iterator bitmap utilities functions
- MINOR: ncbmbuf: implement ncbmb_data()
- MINOR: ncbmbuf: implement advance operation
- MINOR: ncbmbuf: add tests as standalone mode
- BUG/MAJOR: quic: use ncbmbuf for CRYPTO handling
- MINOR: quic: remove received CRYPTO temporary tree storage
- MINOR: stats-file: fix typo in shm-stats-file object struct size detection
- MINOR: compiler: add FIXED_SIZE(size, type, name) macro
- MEDIUM: freq-ctr: use explicit-size types for freq-ctr struct
- BUG/MAJOR: stats-file: ensure shm_stats_file_object struct mapping consistency
- BUG/MEDIUM: build: limit excessive and counter-productive gcc-15 vectorization
- BUG/MEDIUM: stick-tables: Don't loop if there's nothing left
- MINOR: acme: add the dns-01-record field to the sink
- MINOR: acme: display the complete challenge_ready command in the logs
- BUG/MEDIUM: mt_lists: Avoid el->prev = el->next = el
- MINOR: quic: remove unused conn-tx-buffers limit keyword
- MINOR: quic: prepare support for options on FE/BE side
- MINOR: quic: rename "no-quic" to "tune.quic.listen"
- MINOR: quic: duplicate glitches FE option on BE side
- MINOR: quic: split congestion controler options for FE/BE usage
- MINOR: quic: split Tx options for FE/BE usage
- MINOR: quic: rename max Tx mem setting
- MINOR: quic: rename retry-threshold setting
- MINOR: quic: rename frontend sock-per-conn setting
- BUG/MINOR: quic: split max-idle-timeout option for FE/BE usage
- BUG/MINOR: quic: split option for congestion max window size
- BUG/MINOR: quic: rename and duplicate stream settings
- BUG/MEDIUM: applet: Improve again spinning loops detection with the new API
- Revert "BUG/MAJOR: stats-file: ensure shm_stats_file_object struct mapping consistency"
- Revert "MEDIUM: freq-ctr: use explicit-size types for freq-ctr struct"
- Revert "MINOR: compiler: add FIXED_SIZE(size, type, name) macro"
- BUG/MAJOR: stats-file: ensure shm_stats_file_object struct mapping consistency (2nd attempt)
- BUG/MINOR: stick-tables: properly index string-type keys
- BUILD: openssl-compat: fix build failure with OPENSSL=0 and KTLS=1
- BUG/MEDIUM: mt_list: Use atomic operations to prevent compiler optims
- MEDIUM: quic: Fix build with openssl-compat
- MINOR: applet: do not put SE_FL_WANT_ROOM on rcv_buf() if the channel is empty
- MINOR: cli: create cli_raw_rcv_buf() from the generic applet_raw_rcv_buf()
- BUG/MEDIUM: cli: do not return ACKs one char at a time
- BUG/MEDIUM: ssl: Crash because of dangling ckch_store reference in a ckch instance
- BUG/MINOR: ssl: Remove unreachable code in CLI function
- BUG/MINOR: acl: warn if "_sub" derivative used with an explicit match
- DOC: config: fix confusing typo about ACL -m ("now" vs "not")
- DOC: config: slightly clarify the ssl_fc_has_early() behavior
- MINOR: ssl-sample: add ssl_fc_early_rcvd() to detect use of early data
- CI: disable fail-fast on fedora rawhide builds
- MINOR: http: fix 405,431,501 default errorfile
- BUG/MINOR: init: Do not close previously created fd in stdio_quiet
- MINOR: init: Make devnullfd global and create it earlier in init
- MINOR: init: Use devnullfd in stdio_quiet calls instead of recreating a fd everytime
- MEDIUM: ssl: Add certificate password callback that calls external command
- MEDIUM: ssl: Add local passphrase cache
- MINOR: ssl: Do not dump decrypted privkeys in 'dump ssl cert'
- BUG/MINOR: resolvers: Apply dns-accept-family setting on additional records
- MEDIUM: h1: Immediately try to read data for frontend
- REGTEST: quic: add ssl_reuse.vtc new QUIC test
- BUG/MINOR: ssl: returns when SSL_CTX_new failed during init
- MEDIUM: ssl/ech: config and load keys
- MINOR: ssl/ech: add logging and sample fetches for ECH status and outer SNI
- MINOR: listener: implement bind_conf_find_by_name()
- MINOR: ssl/ech: key management via stats socket
- CI: github: add USE_ECH=1 to haproxy for openssl-ech job
- DOC: configuration: "ech" for bind lines
- BUG/MINOR: ech: non destructive parsing in cli_find_ech_specific_ctx()
- DOC: management: document ECH CLI commands
- MEDIUM: mux-h2: do not needlessly refrain from sending data early
- MINOR: mux-h2: extract the code to send preface+settings into its own function
- BUG/MINOR: mux-h2: send the preface along with the first request if needed
Tests involving 0-RTT and H2 on the backend show that 0-RTT is being
partially used but does not work. The analysis shows that only the
preface and settings are sent using early-data and the request is sent
separately. As explained in the previous patch, this is caused by the
fact that a wakeup of the iocb is needed just to send the preface, then
a new call to process_stream is needed to try sending again.
Here with this patch, we're making h2_snd_buf() able to send the preface
if it was not yet sent. Thanks to this, the preface, settings and first
request can now leave as a single TCP segment. In case of TLS with 0-RTT,
it now allows all the block to leave in early data.
Even in clear-text H2, we're now seeing a 15% lower context-switch count,
and the number of calls to process_stream() per connection dropped from 3
to 2. The connection rate increased by an extra 9.5%. Compared to without
the last 3 patches, this is a 22% reduction of context-switches, 33%
reduction of process_stream() calls, and 15.7% increase in connection
rate. And more importantly, 0-RTT now really works with H2 on the
backend, saving one full RTT on the first request.
This fix is only for a missed optimization and a non-functional 0-RTT
on the backend. It's worth backporting it, but it doesn't cause enough
harm to hurry a backport. Better wait for it to live a little bit in
3.3 (till at least a week or two after the final release) before
backporting it. It's not sure that it's worth going beyond 3.2 in any
case. It depends on these two previous commits:
MEDIUM: mux-h2: do not needlessly refrain from sending data early
MINOR: mux-h2: extract the code to send preface+settings into its own function
The code that deals with sending preface + settings and changing the
state currently is in h2_process_mux(), but we'll want to do it as
well from h2_snd_buf(), so let's move it to a dedicated function first.
At this point there is no functional change.
The mux currently refrains from sending data before H2_CS_FRAME_H, i.e.
before the peer's SETTINGS frame was received. While it makes sense on
the frontend, it's causing harm on the backend because it forces the
first request to be sent in two halves over an extra RTT: first the
preface and settings, second the request once the settings are received.
This is totally contrary to the philosophy of the H2 protocol, consisting
in permitting the client to send as soon as possible.
Actually what happens is the following:
- process_stream() calls connect_server()
- connect_server() creates a connection, and if the proto/alpn is guessed
or known, the mux is instantiated for the current request.
- the H2 init code wakes the h2 tasklet up and returns
- process_stream() tries to send the request using h2_snd_buf(), but that
one sees that we're before H2_CS_FRAME_H, refrains from doing so and
returns.
- process_stream() subscribes and quits
- the h2 tasklet can now execute to send the preface and settings, which
leave as a first TCP segment. The connection is ready.
- the iocb is woken again once the server's SETTINGS frame is received,
turning the connection to the H2_CS_FRAME_H state, and the iocb wakes
up process_stream().
- process_stream() executes again and can try to send again.
- h2_snd_buf() is called and finally sends the request as a second TCP
segment.
Not only is this inefficient, but it also renders 0-RTT and TFO impossible
on H2 connections. When 0-RTT is used, only the preface and settings leave
as early data (the very first data of that connection), which is totally
pointless.
In order to fix this, we have to go through a few steps:
- first we need to let data be sent to a server immediately after the
SETTINGS frame was sent (i.e. in H2_CS_SETTINGS1 state instead of
H2_CS_FRAME_H). However, some protocol extensions are advertised by
the server using SETTINGS (e.g. RFC8441) and some requests might need
to know the existence of such extensions. For this reason we're adding
a new h2c flag, H2_CF_SETTINGS_NEEDED, which indicates that some
operations were not done because a server's SETTINGS frame is needed.
This is set when trying to send a protocol upgrade or extended CONNECT
during H2_CS_SETTINGS1, indicating that it's needed to wait for
H2_CS_FRAME_H in this case. The flag is always set on frontend
connections. This is what is being done in this patch.
- second, we need to be able to push the preface opportunistically with
the first h2_snd_buf() so that it's not needed to wake the tasklet up
just to send that and wake process_stream() again. This will be in a
separate patch.
By doing the first step, we're at least saving one needless tasklet
wakeup per connection (~9%), which results in ~5% backend connection
rate increase.
cli_find_ech_specific_ctx() parses the <frontend>/<bind_conf> and sets
a \0 in place of the '/'. But the original string is still used to emit
messages in the CLI so we only output the frontend part.
This patch does the parsing in a trash buffer instead.
ECH is an experimental feature which is still a draft, but it already
exists as a feature branch in OpenSSL.
This patch explains how to configure "ech" on bind lines.
This patch extends the ECH support by adding runtime CLI commands to
view and modify ECH configurations.
New commands are added to the HAProxy CLI:
- "show ssl ech [<name>]" displays all ECH configurations or a specific
one.
- "add ssl ech <name> <payload>" adds a new PEM-formatted ECH
configuration.
- "set ssl ech <name> <payload>" replaces all existing ECH
configurations.
- "del ssl ech <name> [<age-in-secs>]" removes ECH configurations,
optionally filtered by age.
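As with any other runtime command, these can be driven through the stats
socket; for instance (the socket path and frontend name below are just
examples):
echo "show ssl ech" | socat stdio /var/run/haproxy.sock
echo "del ssl ech frt1 3600" | socat stdio /var/run/haproxy.sock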
Returns a pointer to the first bind_conf matching <name> in a frontend
<front>.
When name is prefixed by a @ (@<filename>:<linenum>), it tries to look
for the corresponding filename and line of the configuration file.
NULL is returned if no match is found.
This patch adds functions to expose Encrypted Client Hello (ECH) status
and outer SNI information for logging and sample fetching.
Two new helper functions are introduced in ech.c:
- conn_get_ech_status() places the ECH processing status string into a
buffer.
- conn_get_ech_outer_sni() retrieves the outer SNI value if ECH
succeeded.
Two new sample fetch keywords are added:
- "ssl_fc_ech_status" returns the ECH status string.
- "ssl_fc_ech_outer_sni" returns the outer SNI value seen during ECH.
These allow ECH information to be used in HAProxy logs, ACLs, and
captures.
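For example, both values could be appended to a frontend's log line
(purely illustrative; the surrounding format is arbitrary):
log-format "%ci:%cp [%tr] %ft ech=%[ssl_fc_ech_status] outer_sni=%[ssl_fc_ech_outer_sni]"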
This patch introduces the USE_ECH option in the Makefile to enable
support for Encrypted Client Hello (ECH) with OpenSSL.
A new function, load_echkeys, is added to load ECH keys from a specified
directory. The SSL context initialization process in ssl_sock.c is
updated to load these keys if configured.
A new configuration directive, `ech`, is introduced to allow users to
specify the ECH key directory in the listener configuration.
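A minimal sketch of such a listener, assuming the ECH keys live in a
dedicated directory (all paths below are placeholders):
frontend www
    bind :443 ssl crt /etc/haproxy/certs/site.pem ech /etc/haproxy/ech-keys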
In ssl_sock_initial_ctx(), returns when SSL_CTX_new() failed instead of
trying to apply anything on the ctx. This may avoid crashing when
there's not enough memory anymore during configuration parsing.
Could be backported to every haproxy version.
Note that this test does not work with OpenSSL 3.5.0 QUIC API because
the callback set by SSL_CTX_sess_set_new_cb() (ssl_sess_new_srv_cb()) is not
called (at least for QUIC clients).
The role of this new QUIC test is to run the same SSL/TCP test as
reg-tests/ssl/ssl_reuse.vtc but with QUIC connections where applicable (only with
TLSv1.3).
To do so, this QUIC test uses the "include" vtc command to run ssl/ssl_reuse.vtc.
It also sets the VTC_SOCK_TYPE environment variable with the "setenv" command and
"quic" as value. This will ask vtest2 to use QUIC sockets for all "fd@{...}"
addresses prefixed by "${VTC_SOCK_TYPE}+" socket type if VTC_SOCK_TYPE value is "quic".
The SSL/TCP test is modified to set this environment variable with "setenv -ifunset"
from ssl/ssl_reuse.vtc with "stream" as value, if it is not already set.
vtest2 must be used with this patch to support this new QUIC test:
9aa4d498db
Thanks to this latter patch, vtest2 retrieves the VTC_SOCK_TYPE environment variable
value, then it parses the vtc file to retrieve all the fd addresses prefixed by
"${VTC_SOCK_TYPE}+" and creates a QUIC socket or a TCP socket depending on this
variable value.
In h1_init(), if we're a frontend connection, immediately attempt to
read data, if the connection is ready, instead of just subscribing.
There may already be data available, at least if we're using 0RTT.
This may be backported up to 2.8 in a while, after 3.3 is released, so
that if it causes problems, we have a chance to hear about it.
The dns-accept-family setting was only evaluated for responses to A / AAAA
DNS queries. It was ignored when additional records in SRV responses were
parsed.
With this patch, when a SRV response is parsed, additional records not
matching the dns-accept-family setting are ignored, as expected.
This patch must be backported to 3.2.
A private key that is password protected and was decoded during init,
thanks to the password obtained via 'ssl-passphrase-cmd', should not be
dumped via the 'dump ssl cert' CLI command.
Instead of calling the external password command for all loaded
encrypted certificates, we will keep a local password cache.
The passwords won't be stored as plain text, they will be stored
obfuscated into the password cache. The obfuscation is simply based on a
XOR'ing with a random number built during init.
After init is performed, the password cache is overwritten and freed so
that no dangling info allowing to dump the passwords remains.
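A toy standalone illustration of the principle (not HAProxy's actual
implementation): XOR'ing a stored passphrase with a per-process random
key is symmetric, so applying the same function again restores the clear
text right before use:

#include <stddef.h>

/* XOR every byte of <buf> with the corresponding byte of a random key
 * generated once at init time (<keylen> must be non-zero); calling it
 * a second time with the same key de-obfuscates the buffer. */
static void xor_obfuscate(unsigned char *buf, size_t len,
                          const unsigned char *key, size_t keylen)
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] ^= key[i % keylen];
}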
When a certificate is protected by a password, we can provide the
password via the dedicated pem_password_cb param provided to
PEM_read_bio_PrivateKey.
HAProxy will fetch the password automatically during init by calling a
user-defined external command that should dump the right password on its
standard output (see new 'ssl-passphrase-cmd' global option).
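In practice this might look as follows, where the external script simply
prints the passphrase on its standard output (the path is a placeholder;
check the documentation for the exact syntax and the arguments passed to
the command):
global
    ssl-passphrase-cmd /usr/local/bin/get-cert-passphrase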
Since commit "65760d MINOR: init: Make devnullfd global and create it
earlier in init" the devnullfd file descriptor pointing to /dev/null
is created regardless of the process's parameters so we can use it in
all 'stdio_quiet' calls instead of recreating an FD.
The devnull fd might be needed during configuration parsing, if some
options need to fork/exec for instance. So we now create it much
earlier in the init process and without depending on the '-q' or '-d'
parameters.
During init we were calling 'stdio_quiet' and passing the previously
created 'devnullfd' file descriptor. But this fd was also closed by
'stdio_quiet' afterwards, which raised an error (EBADF) when it was
closed again later.
If we keep from closing FDs that were opened outside of the
'stdio_quiet' function we will let the caller manage its FD and avoid
double close calls.
This patch can be backported to all stable branches.
A few typos were present in the default errorfiles for the status codes
above (missing dot at the end of the sentence, extra closing bracket).
This fixes them. This can be backported.
Previously builds were dependent in the sense that if one failed, the
others were stopped. By their nature those builds are independent, so
let's not fail them all together.
We currently have ssl_fc_has_early() which says that early data are still
unconfirmed by a final handshake, but nothing to see if a client has been
able to use early data at all, which is a problem because such mechanisms
generally depend on multiple factors and it's hard to know when they start
to work. This new sample fetch function will indicate that some early data
were seen over that front connection, i.e. this can be used to confirm
that at some point the client was able to push some. This is essentially
a debugging tool that has no practical use case other than debugging.
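A possible way to surface it while debugging, assuming it behaves as a
boolean sample fetch like ssl_fc_has_early (illustrative only):
acl early_seen ssl_fc_early_rcvd
http-request set-header X-Early-Data-Seen 1 if early_seen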
Clarify that it's about handshake *completion*, and also mention that
the action to be used to wait for the handshake is "wait-for-handshake",
which was not mentioned.
This can be backported though it's very minor.
A one-letter typo in the doc update coming with commit 6ea50ba462 ("MINOR:
acl; Warn when matching method based on a suffix is overwritten") inverts
the meaning of the sentence. It was "is not allowed" and not
"is now allowed". Needs to be backported only if the commit above ever is
(unlikely).
Recently, a new warning is displayed when an ACL derivative match method
is overridden with another '-m' method. This is implemented via the
following patch:
6ea50ba462692d6dcf301081f23cab3e0f6086e4
MINOR: acl; Warn when matching method based on a suffix is overwritten
However, this warning was not reported when "_sub" suffix was specified.
Fix this by adding PAT_MATCH_SUB in the warning comparison.
No backport needed except if above commit is.
When updating CAs via the CLI, we need to create new copies of all the
impacted ckch instances (as in referenced in the ckch_inst_link list of
the updated CA) in order to use them instead of the old ones once the
updated is completed. This relies on the ckch_inst_rebuild function that
would set the ckch_store field of the ckch_inst. But we forgot to also
add the newly created instances in the ckch_inst list of the
corresponding ckch_store.
When updating a certificate afterwards, we iterate over all the
instances linked in the ckch_inst list of the ckch_store (which is
missing some instances because of the previous command) and rebuild the
instances before replacing the ckch_store. The previous ckch_store,
still referenced by the dangling ckch instance then gets deleted which
means that the instance keeps a reference to a free'd object.
Then if we were to once again update the CA file, we would iterate over
the ckch instances referenced in the cafile_entry's ckch_inst_link list,
which includes the first mentioned ckch instance with the dead
ckch_store reference. This ends up crashing during the ckch_inst_rebuild
operation.
This bug was raised in GitHub #3165.
This patch should be backported to all stable branches.
Since 3.0 where the CLI started to use rcv_buf, it appears that some
external tools sending chained commands are randomly experiencing
failures. Each time this happens when the whole command is sent as a
single packet, immediately followed by a close. This is not a correct
way to use the CLI but this has been working for ages for simple
netcat-based scripts, so we should at least try to preserve this.
The cause of the failure is that the first LF that acks a command is
immediately sent back to the client and rejected due to the closed
connection. This in turn forwards the error back to the applet which
aborts its processing.
Before 3.0 the responses would be queued into the buffer, then sent
back to the channel, and would all fail at once. This changed when
snd_buf/rcv_buf were implemented because the applets are much more
responsive and since they yield between each command, they can
deliver one ACK at a time that is immediately forwarded down the
chain.
An easy way to observe the problem is to send 5 map updates, a shutdown,
and immediately close via tcploop, and in parallel run a periodic
"show map" to count the number of elements:
$ tcploop -U /tmp/sock1 C S:"add map #0 1 1; add map #0 2 2; add map #0 3 3; add map #0 4 4; add map #0 5 5\n" F K
Before 3.0, there would always be 5 elements. Since 3.0 and before
20ec1de214 ("MAJOR: cli: Refacor parsing and execution of pipelined
commands"), almost always 2. And since that commit above in 3.2, almost
always one. Doing the same using socat or netcat shows almost always 5...
It's entirely timing-dependent, and might even vary based on the RTT
between the client and haproxy!
The approach taken here consists in doing the same principle as MSG_MORE
or Nagle but on the response buffer: the applet doesn't need to send a
single ACK for each command when it has already been woken up and is
scheduled to come back to work. It's fine (and even desirable) that
ACKs are grouped in a single packet as much as possible.
For this reason, this patch implements APPCTX_CLI_ST1_YIELD, a new CLI
flag which indicates that the applet left in yielding condition, i.e.
it has not finished its work. This flag is used by .rcv_buf to hold
pending data. This way we won't return partial responses for no reason,
and we can continue to emulate the previous behavior.
One very nice benefit to this is that it saves huge amounts of CPU on
the client. In the test below that tries to update 1M map entries, the
CPU used by socat went from 100% to 0% and the total transfer time
dropped by 28%:
before:
$ time awk 'BEGIN{ printf "prompt i\n"; for (i=0;i<1000000;i++) { \
printf "add map #0 %d %d\n",i,i,i }}' | socat /tmp/sock1 - >/dev/null
real 0m2.407s
user 0m1.485s
sys 0m1.682s
after:
$ time awk 'BEGIN{ printf "prompt i\n"; for (i=0;i<1000000;i++) { \
printf "add map #0 %d %d\n",i,i,i }}' | socat /tmp/sock1 - >/dev/null
real 0m1.721s
user 0m0.952s
sys 0m0.057s
The difference is also quite visible on the number of syscalls during
the test (for 1k updates):
before:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.071691 0 100001 sendmsg
after:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.000011 1 9 sendmsg
This patch will need to be backported to 3.0, and depends on these two
patches to be backported as well:
MINOR: applet: do not put SE_FL_WANT_ROOM on rcv_buf() if the channel is empty
MINOR: cli: create cli_raw_rcv_buf() from the generic applet_raw_rcv_buf()
This is in preparation for a future fix. For now it's simply a pure
copy of the original function, but dedicated to the CLI. It will
have to be backported to 3.0.
appctx_rcv_buf() prepares all the work to schedule the transfers between
the applet and the channel, and it takes care of setting the various flags
that indicate what condition is blocking the transfer from progressing.
There is one limitation though. In case an applet refrains from sending
data (e.g. rate-limited, prefers to aggregate blocks etc), it will leave
a possibly empty channel buffer, and keep some data in its outbuf. The
data in its outbuf will be seen by the function above as an indication
of a channel full condition, so it will place SE_FL_WANT_ROOM. But later,
sc_applet_recv() will see this flag with a possibly empty channel, and
will rightfully trigger a BUG_ON().
appctx_rcv_buf() should be more accurate in fact. It should only set
SE_FL_RCV_MORE when more data are present in the applet, then it should
either set or clear SE_FL_WANT_ROOM depending on whether the channel is
empty or not.
Right now it doesn't seem possible to trigger this condition in the
current state of applets, but this will become possible with a future
bugfix that will have to be backported, so this patch will need to be
backported to 3.0.
As the QUIC options have been split into backend and frontend, there is
no more GTUNE_QUIC_LISTEN_OFF to be found in global.tune.options, look
for QUIC_TUNE_FE_LISTEN_OFF in quic_tune.fe instead.
This should fix the build with USE_QUIC and USE_QUIC_OPENSSL_COMPAT.
As a follow-up to f40f5401b9f24becc6fdd2e77d4f4578bbecae7f, explicitly
use atomic operations to set the prev and next fields, to make sure the
compiler can't assume anything about it, and just does it.
This should be backported after f40f5401b9 up to 2.8.
The USE_KTLS test is currently being done outside of the USE_OPENSSL
guard so disabling USE_OPENSSL still results in build failures on
libcs built with support for kernels before 4.17, because we enable
KTLS by default on linux. Let's move the KTLS block inside the
USE_OPENSSL guard instead.
No backport is needed since KTLS is only in 3.3.
This is one of the rare pleasant surprises of fixing an almost 16-year-
old bug that remained unnoticed since the feature was implemented. In
1.4-dev7, commit 3bd697e071 ("[MEDIUM] Add stick table (persistence)
management functions and types") introduced stick-tables with multiple
key types, including strings, IP addresses and integers. Entries are
coded in binary and their binary representation is indexed. A special
case was made for strings in order to index them as zero-terminated
strings. However, there's one subtlety. While strings indeed have a
zero appended, they're still indexed using ebmb_insert(), which means
that all the bytes till the configured size are indexed as well. And
while these bytes generally come from a temporary storage that often
contains zeroes, or that is longer than the configured string length
and will result in truncation, it's not always the case and certain
traffic patterns with certain configurations manage to occasionally
present unpadded strings resulting in apparent duplicate keys appearing
in the dump, as shown in GH issue #3161. It seems to be essentially
reproducible at boot, and not to be particularly affected by mixed
patterns. These keys are in fact not exact duplicates in memory, but
everywhere they're used (including during synchronization), they are
equal.
What's interesting is that when this happens, one key can be presented
to a peer with its own data and will be indexed as the only one, possibly
replacing contents from the previous key, which might replace them again
later once updated in turn. This is visible in the dump of the issue
above, where key "localhost:8001" was split into two entries, one with a
request count of one and the other with a request count of 499999, and
indeed, all peers see only that last value, which overwrote the first
one.
This fix must be backported to all stable branches. Special kudos to
Mark Wort for underlining that one.
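One way to picture the requirement (illustration only, not necessarily the
actual patch): before indexing a string sample as a table key, everything
past the terminating zero must be zeroed as well, so that two equal strings
always share the exact same <key_size> bytes once indexed with ebmb_insert():
    #include <string.h>

    static void pad_string_key(char *key, size_t key_size)
    {
        size_t len = strnlen(key, key_size - 1);

        key[len] = '\0';
        memset(key + len + 1, 0, key_size - len - 1);
    }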
This is a second attempt at fixing issues on 32bits systems which would
trigger the following BUG_ON() statement:
FATAL: bug condition "sizeof(struct shm_stats_file_object) != 544" matched at src/stats-file.c:825 shm_stats_file_object struct size changed, is is part of the exported API: ensure all precautions were taken (ie: shm_stats_file version change) before adjusting this
This is a drop-in replacement for d30b88a6c + 4693ee0ff, as suggested by
Willy.
Indeed, on supported platforms unsigned int can be assumed to be 4 bytes
long, and long long can be assumed to be 8 bytes long. As such, the previous
attempt was overkill and added unnecessary maintenance complexity which
could result in bugs if not used properly. Moreover, it would only
partially solve the issue, since on little endian vs big endian
architectures, the provisioned memory areas (originating from the same
shm stats file) could be read differently by the host.
Instead we fix the alignment issues, and this alone helps to ensure
struct memory consistency on 64 vs 32bits platforms. It was tested
on both i386 and i586.
last_change and last_sess counters are now stored as unsigned int, as
it helped to fix the alignment issues and they were found to be used
as 32bits integers anyway.
Thanks to Willy for problem analysis and the patch proposal.
No backport needed.
This reverts commit 466a603b59ed77e9787398ecf1baf77c46ae57b1.
Due to the last 2 commits, this macro is now unused, and will probably
never be used, so let's get rid of that for now.
This reverts commit 4693ee0ff7a5fa4a12ff69b1a33adca142e781ac.
As discussed in GH #3168, this works but it is not the proper way to fix
the issue. See following commits.
This reverts commit d30b88a6cc47d662e92b524ad5818be312401d0e.
As discussed in GH #3168, this works but it is not the proper way to fix
the issue. See following commits.
A first attempt to fix this issue was already pushed (54b7539d6 "BUG/MEDIUM:
apppet: Improve spinning loop detection with the new API"). But it was not
fully accurate. Indeed, we must check if something was received or sent by
the applet before incrementing the call rate. But we must also take care that
the applet is allowed to receive or send data. That is what is performed in
this patch.
This patch must be backported as far as 3.0 with the patch above.
Several settings can be set to control stream multiplexing and
associated receive window. Previously, all of these settings were
configured using prefix "tune.quic.frontend.", despite being applied
blindly on both sides.
Fix this by duplicating these settings specific to frontend and backend
side. Options are also renamed to use the standardized prefix
"tune.quic.[be|fe].stream." notation.
Also, each option is individually renamed to better reflect its purpose
and hide technical details relative to QUIC transport parameter naming :
* max-data-size -> stream.rxbuf
* max-streams-bidi -> stream.max-concurrent
* stream-data-ratio -> stream.data-ratio
No need to backport.
Streamline max-idle-timeout option. Rename it to use the newer cohesive
naming scheme 'tune.quic.fe|be.'.
Two different fields were already defined in global struct. These fields
are moved into quic_tune along with other QUIC settings. However, no
parser was defined for backend option, this commit fixes this.
No need to backport this.
On frontend side, a quic_conn can have a dedicated FD or use the
listener one. These different modes can be activated via a global QUIC
tune setting.
This patch adjusts the option. First, it is renamed to the more
meaningful name 'tune.quic.fe.sock-per-conn'. Also, arguments are now
either 'default-on' or 'force-off'. The objective is to better highlight
the relationship with the 'quic-socket' bind option.
The older option is deprecated and will be removed in 3.5.
A QUIC global tune setting is defined to be able to force Retry emission
prior to handshake. By definition, this ability is only supported by
QUIC servers, hence it is a frontend option only.
Rename the option to use "fe" prefix. The old option name is deprecated
and will be removed in 3.5
QUIC global memory can be limited across the entire process via a global
tune setting. Previously, this setting used the misleading "frontend"
prefix. As this is applied as a sum between all QUIC connections, both
from frontend and backend sides, remove the prefix. The new option name
is "tune.quic.mem.tx-max".
The older option name is deprecated and will be removed in 3.5.
This patch is similar to the previous one, except that it is focused on
Tx QUIC settings. It is now possible to toggle GSO and pacing on
frontend and backend sides independently.
As with the previous patch, options are renamed to use "fe/be" unified
prefixes. This is part of the current series of commits which unify QUIC
settings. Older options are deprecated and will be removed on 3.5
release.
Various settings can be configured related to the QUIC congestion controller.
This patch duplicates them to be able to set independent values on
frontend and backend sides.
As with the previous patch, options are renamed to use "fe/be" unified
prefixes. This is part of the current series of commits which unify QUIC
settings. Older options are deprecated and will be removed on 3.5
release.
Previously, QUIC glitches support was only implemented for frontend
side. Extend this so that the option can be specified separately both on
frontend and backend sides. Function _qcc_report_glitch() now retrieves
the relevant max value based on connection side.
In addition to this, option has been renamed to use "fe/be" prefixes.
This is part of the current series of commits which unify QUIC settings.
Older options are deprecated and will be removed on 3.5 release.
Rename the option to quickly enable/disable every QUIC listeners. It now
takes an argument on/off. The documentation is extended to reflect the
fact that QUIC backend are not impacted by this option.
The older keyword is simply removed. Deprecation is considered
unnecessary as this setting is only useful during debugging.
A major reorganization of QUIC settings is going to be performed. One of
its objective is to clearly define options which can be separately
configured on frontend and backend proxy sides.
To implement this, quic_tune structure is extended to support fe and be
options. A set of macros/functions is also defined : it allows to
retrieve an option defined on both sides with unified code, based on
proxy side of a quic_conn/connection instance.
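A rough sketch of what such a layout and lookup could look like; all names
below are illustrative assumptions, not the actual HAProxy definitions:
    /* per-side settings duplicated for frontend and backend */
    struct quic_tune_side {
        unsigned int max_idle_timeout;
        /* ... other per-side settings ... */
    };

    struct quic_tune {
        struct quic_tune_side fe;   /* frontend (listener) side */
        struct quic_tune_side be;   /* backend (server) side */
    };

    extern struct quic_tune quic_tune;

    /* retrieve the value matching the side of a connection */
    #define QUIC_TUNE_GET(is_back, field) \
        ((is_back) ? quic_tune.be.field : quic_tune.fe.field)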
Remove parsing code for tune.quic.frontend.conn-tx-buffers.limit. This
option was deprecated for some time, was in fact a noop, and was no longer
mentioned in the documentation.
Avoid setting both el->prev and el->next on the same line.
The goal is to set both el->prev and el->next to el, but a naive
compiler, such as when we're using -O0, will set el->next first, then
will set el->prev to the value of el->next, but if we're unlucky,
el->next will have been set to something else by another thread.
So explicitly set both to what we want.
This should be backported up to 2.8.
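A small sketch of the three forms involved, for illustration only
(HA_ATOMIC_STORE is assumed to be the atomic store macro used by the
follow-up commit mentioned earlier in this log):
    /* fragile: at -O0 this may be emitted as two dependent stores, and
     * another thread may modify el->next between them
     */
    el->next = el->prev = el;

    /* split assignments: each store is independent */
    el->next = el;
    el->prev = el;

    /* the follow-up goes further and makes the stores explicitly atomic */
    HA_ATOMIC_STORE(&el->next, el);
    HA_ATOMIC_STORE(&el->prev, el);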
When using a wildcard DNS domain in the ACME configuration, for example
*.example.com, one might think that it needs to use the challenge_ready
command with this domain. But that's not the case, the challenge_ready
command takes the domain asked by the ACME server, which is stripped of
the wildcard.
In order to be clearer, the log message shows exactly the command the
user should send.
The dns-01-record field in the dpapi sink outputs the authentication
token which is needed in the TXT record in order to validate the DNS-01
challenge.
Before waking up the expiration task again at the end of it, make sure
the next date is set. If there's nothing left to do, then task_exp will
be TASK_ETERNITY and we then don't want to be woken up again.
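A minimal sketch of the intended end-of-task handling, assuming the generic
tick and task helpers are used for this (names here are an assumption, not
the actual patch):
    /* at the end of the expiration task */
    t->expire = next_exp;          /* may be the eternity value */
    if (tick_isset(t->expire))
        task_queue(t);             /* only requeue when something remains */
    return t;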
In https://bugs.gentoo.org/964719, Dan Goodliffe reported that using
CFLAGS="-O3 -march=westmere" creates a binary that segfaults on startup
with gcc-15. This could be reproduced here, it is isolated to gcc-15 and
-O3, and is caused by gcc emitting "movdqa" instructions to read unaligned
longs taken from chars that were carefully isolated within ifdefs checking
for support for unaligned integers on the platform...
Some experiments showed that changing all casts all over the code using
either typedef-enforced align(1) or using the packed union trick does
the job, it needs a more in-depth validation since it's obvious that
it doesn't produce the same code at all (at least on more modern
machines).
However, the offending optimization option could be isolated, it's
"-fvect-cost-model=dynamic" which causes this, while -O2 uses
"-fvect-cost-model=very-cheap". Turning it back to very-cheap solves the
issue, reduces the code, and yields an extra 5% performance increase on
the http-request rate (181k vs 172k on a single core)! This could at
least partially explain why it has been observed several times over
the last few years that -O3 yields bigger and slower code than -O2.
It was also verified that the option doesn't change the emitted code
at -O0..-O2,-Os,-Oz, but only at -O3.
This patch detects the compiler's support for this option and sets it
back to very-cheap to address the problem that some distros are facing
after an upgrade to
gcc-15. As such it should be backported to recent LTS and stable
branches. Here, 3.1 was used, so it seems legit to at least target
the last two LTS branches (i.e. go as far as 3.0).
Thanks to Dan Goodliffe for sharing a working reproducer, Sam James
for starting the investigations and Christian Ruppert for bringing
the issue to us.
As reported by @tianon on GH #3168, running haproxy on 32bits i386
platform would trigger the following BUG_ON() statement:
FATAL: bug condition "sizeof(struct shm_stats_file_object) != 544" matched at src/stats-file.c:825
shm_stats_file_object struct size changed, is is part of the exported API: ensure all precautions were taken (ie: shm_stats_file version change) before adjusting this
In fact, some efforts were already taken to ensure shm_stats_file_object
struct size remains consistent on 64 vs 32 bits platforms, since
shm_stats_file_object is part of the public API and directly exposed in
the stats file.
However, some parts were overlooked: some structs that are embedded in
shm_stats_file_object struct itself weren't using fixed-width integers,
and would sometimes be unaligned. The result of this is that it was
up to the compiler (platform-dependent) to choose how to deal with such
ambiguities, which could cause the struct mapping/size to be inconsistent
from one platform to another.
Fortunately this was caught by the BUG_ON() statement and with the precious
help of @tianon.
To fix this, we now use fixed-width integers everywhere for members
(and submembers) of shm_stats_file_object struct, and we use explicit
padding where missing to avoid automatic padding when we don't expect
one. As for the previous commit, we leverage FIXED_SIZE() and
FIXED_SIZE_ARRAY() macro to set the expected width for each integer
without causing build issues on platforms that don't support larger
integers.
No backport needed, this feature was introduced during 3.3-dev.
freq-ctr struct is used by the shm_stats_file API, and more precisely,
it is used in the shm_stats_file_object struct for counters.
shm_stats_file_object struct requires to be platform-independent, thus
we switch to using explicit size types (AKA fixed width integer types)
for freq-ctr, in the attempt to make freq-ctr size and memory mapping
consistent from one platform to another.
We cannot simply use fixed-width integers because some of them are
involved in atomic operations, and forcing a given width could
cause build issues on some platforms where atomic ops are not
implemented for large integers. Instead we leverage the FIXED_SIZE
macro to keep handling the integers as before, but forcing them to
be stored using expected number of bytes (unused bytes will simply
be ignored).
No change of behavior should be expected.
FIXED_SIZE() macro can be used to instruct the compiler that the struct
member named <name>, handled as <type>, must be stored using <size> bytes,
even if the type used is actually smaller than the expected size.
FIXED_SIZE_ARRAY(), similar to FIXED_SIZE() but for arrays: it takes an
extra argument which is the number of members.
They may be used for portability concerns to ensure a structure mapping
remains consistent between platforms.
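One plausible way such macros could be written (the actual HAProxy
definitions may differ): an anonymous union forces the member to occupy
exactly <size> bytes while it is still accessed under its natural type:
    #define FIXED_SIZE(type, name, size) \
        union { type name; char __pad_##name[size]; }

    #define FIXED_SIZE_ARRAY(type, name, nb, size) \
        union { type name[nb]; char __pad_##name[(nb) * (size)]; }

    /* usage example: a struct whose mapping must stay identical on
     * 32- and 64-bit platforms
     */
    struct example_obj {
        FIXED_SIZE(unsigned int, users, 4);
        FIXED_SIZE_ARRAY(unsigned int, ctr, 3, 4);
    };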
As reported by @TimWolla on GH #3168, there was a typo in shm stats file
BUG_ON to report that the size of shm_stats_file_object changed.
No backport needed.
The previous commit switched from ncbuf to ncbmbuf as storage for received
CRYPTO frames. The latter ensures that buffering of such frames cannot
fail anymore due to gaps size.
Previously, extra mechanisms were implemented on QUIC frames parsing
function to overcome the limitation of ncbuf on gaps size. Before
insertion, CRYPTO frames were stored in a temporary tree to order their
insertion. As this is not necessary anymore, this commit removes the
temporary tree insertion.
This commit is closely associated to the previous bug fix. As it
provides a neat optimization and code simplification, it can be backported
with it, but not in the next immediate release to spot potential
regression.
In QUIC, TLS handshake messages such as ClientHello are encapsulated in
CRYPTO frames. Each QUIC implementation can split the content in several
frames of random sizes. In fact, this feature is now used by several
clients, based on Chrome's so-called "Chaos protection" mechanism :
https://quiche.googlesource.com/quiche/+/cb6b51054274cb2c939264faf34a1776e0a5bab7
To support this, haproxy uses a ncbuf storage to store received CRYPTO
frames before passing it to the SSL library. However, this storage
suffers from a limitation as gaps between two filled blocks cannot be
smaller than 8 bytes. Thus, depending on the size of received CRYPTO
frames and their order, ncbuf may not be sufficient. Over time, several
mechanisms were implemented in haproxy QUIC frames parsing to overcome
the ncbuf limitation.
However, reports recently highlight that with some clients haproxy is
not able to deal with CRYPTO frames reception. In particular, this is
the case with the latest ngtcp2 release, which implements a similar
chaos protection mechanism via the following patch. It also seems that
this impacts haproxy interaction with firefox.
commit 89c29fd8611d5e6d2f6b1f475c5e3494c376028c
Author: Tatsuhiro Tsujikawa <tatsuhiro.t@gmail.com>
Date: Mon Aug 4 22:48:06 2025 +0900
Crumble Client Initial CRYPTO (aka chaos protection)
To fix haproxy CRYPTO frames buffering once and for all, an alternative
non-contiguous buffer named ncbmbuf has been recently implemented. This
type does not suffer from gaps size limitation, albeit at the cost of a
small reduction in the size available for data storage.
Thus, the purpose of this current patch is to replace ncbuf with the
newer ncbmbuf for QUIC CRYPTO frames parsing. Now, ncbmb_add() is used
to buffer received frames, which is guaranteed to succeed. The only
remaining case of error is if a received frame offset and length exceed
the ncbmbuf data storage, which would result in a CRYPTO_BUFFER_EXCEEDED
error code.
A notable behavior change when switching to ncbmbuf implementation is
that NCB_ADD_COMPARE mode cannot be used anymore during add. Instead,
crypto frame content received at a similar offset will be overwritten.
A final note regarding STREAM frames parsing. For now, it is considered
unnecessary to switch from ncbuf in this case. Indeed, QUIC clients do
not perform aggressive fragmentation for them. Keeping ncbuf ensures that
the data storage size is bigger than the equivalent ncbmbuf area.
This should fix github issue #3141.
This patch must be backported up to 2.6. It is first necessary to pick
the relevant commits for ncbmbuf implementation prior to it.
Write some tests for the ncbmbuf type. These tests should be run each time
ncbmbuf implementation is adjusted. Use the following command :
$ gcc -g -DSTANDALONE -I./include -o ncbmbuf src/ncbmbuf.c && ./ncbmbuf
As the previous patch, this commit must be backported prior to the fix
to come on QUIC CRYPTO frames parsing.
Implement ncbmb_advance() function for the ncbmbuf type. This allows to
remove bytes in front of the buffer, regardless of the existing gaps.
This is implemented by resetting the corresponding bits of the bitmap.
As the previous patch, this commit must be backported prior to the fix
to come on QUIC CRYPTO frames parsing.
Implement ncbmb_data() function for the ncbmbuf type. Its purpose is
similar to its ncbuf counterpart : it returns the size in bytes of data
starting at a specific offset until the next gap.
As the previous patch, this commit must be backported prior to the fix
to come on QUIC CRYPTO frames parsing.
Extend private API for ncbmbuf type by defining an iterator type for the
buffer bitmap handling. The purpose is to provide a simple method to
iterate over the bitmap one byte at a time, with a proper bitmask set to
hide irrelevant bits.
This internal type is unused for now, but will become useful when
implementing ncbmb_data() and ncbmb_advance() functions.
As the previous patch, this commit must be backported prior to the fix
to come on QUIC CRYPTO frames parsing.
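For illustration, a toy sketch of such an iterator; the names are invented
and the real internal type is certainly richer (in particular a complete
version must also mask the last byte of the range):
    #include <stddef.h>

    /* walk a bitmap one byte at a time; <mask> hides the bits falling
     * outside the requested range in a partial first byte
     */
    struct toy_bm_it {
        const unsigned char *bm;   /* start of the bitmap */
        size_t byte;               /* current byte index */
        unsigned char mask;        /* relevant bits of the current byte */
    };

    static unsigned char toy_bm_it_next(struct toy_bm_it *it)
    {
        unsigned char bits = it->bm[it->byte] & it->mask;

        it->byte++;
        it->mask = 0xff;           /* following bytes are fully relevant */
        return bits;
    }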
This patch implements add operation for ncbmbuf type.
This function is simpler than its ncbuf counterpart. Indeed, for now
only NCB_ADD_OVERWRT mode is supported. This compromise has been chosen
as ncbmbuf will be first used for QUIC CRYPTO frames handling, which
does not mandate to compare existing filled blocks during insertion.
As the previous patch, this commit must be backported prior to the fix
to come on QUIC CRYPTO frames parsing.
Define ncbmbuf which is an alternative non-contiguous buffer
implementation. "bm" abbreviation stands for bitmap, which reflects how
gaps and filled blocks are encoded. The main purpose of this
implementation is to get rid of the ncbuf limitation regarding the
minimal size for gaps between two blocks of data.
This commit adds the new module ncbmbuf. Along with it, some utility
functions such as ncbmb_make(), ncbmb_init() and ncbmb_is_empty() are
defined. Public API of ncbmbuf will be extended in the following
patches.
This patch is not considered a bug fix. However, it will be required to
fix issue encountered on QUIC CRYPTO frames parsing. Thus, it will be
necessary to backport the current patch prior to the fix to come.
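To give an idea of the principle, here is a toy model of a bitmap-backed
non-contiguous buffer (illustrative only; the real ncbmbuf API and layout
differ): one bit per data byte says whether that byte is filled, so gaps of
any size can be represented, at the cost of dedicating roughly 1/9th of the
area to the bitmap.
    #include <stddef.h>

    struct toy_ncbmb {
        unsigned char *bitmap;     /* one bit per data byte */
        unsigned char *data;       /* data storage */
        size_t size;               /* number of usable data bytes */
    };

    /* add: mark <len> bytes starting at <off> as filled (overwrite mode);
     * copying the payload into b->data is omitted here
     */
    static void toy_add(struct toy_ncbmb *b, size_t off, size_t len)
    {
        for (; len; len--, off++)
            b->bitmap[off / 8] |= 1 << (off % 8);
    }

    /* data: number of contiguous filled bytes starting at <off> */
    static size_t toy_data(const struct toy_ncbmb *b, size_t off)
    {
        size_t count = 0;

        while (off + count < b->size &&
               (b->bitmap[(off + count) / 8] & (1 << ((off + count) % 8))))
            count++;
        return count;
    }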
ncbuf is a module which provides a non-contiguous buffer type
implementation. This patch extracts some basic types related to it into
a new file ncbuf_common.h.
This patch will be useful to provide a new non-contiguous buffer
alternative implementation based on a bitmap.
This patch is not a bug fix. However, it is necessary for ncbmbuf
implementation which will be required to fix a QUIC issue on CRYPTO
frames parsing. Thus, it will be necessary to backport the current patch
prior to the fix to come.
The doc in commit 977feb5617 ("DOC: api: update the pools API with the
alignment and typed declarations") says that alignment of zero means
the type's alignment. And this is followed by the DECLARE_TYPED_POOL()
macro. Yet this is not what is done in create_pool_from_reg() which
only raises the alignment to a void* if lower, while it should start
from the type's. The effect is haproxy refusing to start on some 32-bit
platforms since that commit, displaying an error such as:
"BUG in the code: at src/mux_h2.c:454, requested creation of pool
'h2s' aligned to 4 while type requires alignment of 8! Please
report to developers. Aborting."
Let's just apply the default type's alignment.
Thanks to @tianon for reporting this in GH issue #3168. No backport is
needed since aligned pools are 3.3-only.
Recently, proper support for interim responses forwarding to HTTP/3
client has been implemented. However, there was still an issue if two
responses are both encoded in the same snd_buf() iteration.
The issue is caused by the H3 HEADERS frame encoding method : 5 bytes
are reserved in front of the buffer to encode both H3 frame type and
varint length field. After proper headers encoding, output buffer head
is adjusted so that length can be encoded using the minimal varint size.
However, if the buffer is not empty due to a previous response already
encoded but not yet emitted, messing with the buffer head will corrupt
the entire H3 message. This only happens when encoding of both responses
is done in the same snd_buf() iteration, or at least without emission to
quic_conn layer in between.
The result of this bug is that the HTTP/3 client will be unable to parse
the response, most of the time reporting a formatting error. This can
be reproduced using the following netcat as HTTP/1 server to haproxy :
$ while sleep 0.2; do \
printf "HTTP/1.1 100 continue\r\n\r\nHTTP/1.1 200 ok\r\nContent-length: 5\r\nConnection: close\r\n\r\nblah\n" | nc -lp8002
done
To fix this, only adjust buffer head if content is empty. If this is not
the case, frame length is simply encoded as a 4-bytes varint size so
that messages are contiguous in the buffer.
This must be backported up to 2.6.
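A stand-in sketch of the new decision (the helper names here are invented;
the actual code lives in the H3 HEADERS encoding path). QUIC variable-length
integers take 1, 2, 4 or 8 bytes, hence the fixed 4-byte form when the head
cannot be moved:
    if (!b_data(res)) {
        /* output buffer empty: the 5-byte reservation (1 byte of frame
         * type + up to 4 bytes of length) may be shrunk, so move the
         * head and use the minimal varint encoding for the length
         */
        encode_h3_headers_minimal_len(res, hdrs_len);
    } else {
        /* a previous response is still queued: never touch the head,
         * encode the length on a fixed 4-byte varint so that both
         * frames remain contiguous and intact
         */
        encode_h3_headers_4byte_len(res, hdrs_len);
    }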
1xx informational messages are part of the HTTP response. It is not expected
to have the HTX_FL_EOM flag set after parsing such messages when received from
a server. It is especially important when an informational message is
processed on the client side while the final response was not received yet, to
not erroneously detect the end of the message.
The HTTP multiplexers seem to ignore the HTX_FL_EOM flag for informational
messages, but it remains an error from the HTX specification point of
view. So it must be fixed.
While it should theoretically be backported as far as 3.0, it is a good idea
to not do so for now because no bug was reported and regressions may happen.
stktable_trash_oldest() goes through all the shards, trying to free a
number of entries. Going through each shard is expensive, as we have to
take the shard lock, so stop as soon as we free'd at least one entry, as
it is only called when we want to make room for one entry.
In stksess_new(), if the table is full, we call stktable_trash_oldest()
to remove a few entries so that we have some room for a new one.
It is unlikely, but possible, that stktable_trash_oldest() will fail. If
so, just give up and do not add the new entry, instead of adding it
anyway.
Give up if stktable_trash_oldest() fails to free any entry
Instead of having per-table expiration tasks, just use one per shard.
The task will now go through all the tables to expire entries. When a
table gets an expiration earlier than the one previously known, it will
be put in a mt-list, and the task will be responsible to put it into an
eb32, ordered based on the next expiration.
Each per-shard task will run on a different thread, so it should lead to
a better load distribution than the per-table tasks.
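A rough sketch of how such a per-shard task could be organized; all names
below are illustrative and only the general shape follows the description
above:
    struct shard_ctx {
        struct mt_list updt_tables;   /* tables with a refreshed expiration */
        struct eb_root exp_tables;    /* tables ordered by next expiration */
        struct task *exp_task;        /* one task per shard, one thread each */
    };

    static struct task *shard_exp_task(struct task *t, void *ctx, unsigned int state)
    {
        struct shard_ctx *shard = ctx;

        /* 1. drain <shard->updt_tables> and (re)insert those tables in the
         *    eb32 tree keyed on their next expiration date
         * 2. walk the tree and expire the due entries of each table
         * 3. set t->expire to the earliest remaining date, or eternity
         *    if no table has anything left to expire
         */
        return t;
    }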
Add a new initcall stage, STG_INIT_2, for stuff to be called after
step_init_2() is called, so after we know for sure that global.nbthread
will be set.
Modify stick-tables stkt_late_init() to run at STG_INIT_2 instead of
STG_INIT, in anticipation for it to be enhanced and have a need for
global.nbthread.
Since commit 20ec1de214 ("MAJOR: cli: Refacor parsing and execution of
pipelined commands"), command not returning any response (e.g. "quit")
don't pass through the free_trash_chunk() call, possibly leaking the
cmdline buffer. A typical way to reproduce it is to loop on "quit" on
the CLI, though it very likely affects other specific commands.
Let's make sure in the release handler that we always release that
chunk in any case. This must be backported to 3.2.
This bug impacts only the backends.
The ->conn (pointer to struct connection) member validity of the ssl_sock_ctx
struct was not checked before being dereferenced, leading to possible crashes
in qc_ssl_do_hanshake() during handshake.
This was reported in GH issue #3163.
No need to backport because the QUIC backend support arrived with 3.3.
In mt_list_delete(), if the element was not in a list, then n and p will
point to it, and so setting n->prev and n->next will be enough to unlock it.
Don't do it twice, as once it's been done the first time, another thread may
be working with it, and may have added it to a list already, and doing it
a second time can lead to list inconsistencies.
This should be backported up to 2.8.
Released version 3.3-dev10 with the following main changes :
- BUG/MEDIUM: connections: Only avoid creating a mux if we have one
- BUG/MINOR: sink: retry attempt for sft server may never occur
- CLEANUP: mjson: remove MJSON_ENABLE_RPC code
- CLEANUP: mjson: remove MJSON_ENABLE_PRINT code
- CLEANUP: mjson: remove MJSON_ENABLE_NEXT code
- CLEANUP: mjson: remove MJSON_ENABLE_BASE64 code
- CLEANUP: mjson: remove unused defines and math.h
- BUG/MINOR: http-ana: Reset analyse_exp date after 'wait-for-body' action
- CLEANUP: mjson: remove unused defines from mjson.h
- BUG/MINOR: acme: avoid overflow when diff > notAfter
- DEV: patchbot: use git reset+checkout instead of pull
- MINOR: proxy: explicitly permit abortonclose on frontends and clarify the doc
- REGTESTS: fix h2_desync_attacks to wait for the response
- REGTESTS: http-messaging: fix the websocket and upgrade tests not to close early
- MINOR: proxy: only check abortonclose through a dedicated function
- MAJOR: proxy: enable abortonclose by default on HTTP proxies
- MINOR: proxy: introduce proxy_abrt_close_def() to pass the desired default
- MAJOR: proxy: enable abortonclose by default on TLS listeners
- MINOR: h3/qmux: Set QC_SF_UNKNOWN_PL_LENGTH flag on QCS when headers are sent
- MINOR: stconn: Add two fields in sedesc to replace the HTX extra value
- MINOR: h1-htx: Increment body len when parsing a payload with no xfer length
- MINOR: mux-h1: Set known input payload length during demux
- MINOR: mux-fcgi: Set known input payload length during demux
- MINOR: mux-h2: Use <body_len> H2S field for payload without content-length
- MINOR: mux-h2: Set known input payload length of the sedesc
- MINOR: h3: Set known input payload length of the sedesc
- MINOR: stconn: Move data from kip to kop when data are sent to the consumer
- MINOR: filters: Reset knwon input payload length if a data filter is used
- MINOR: hlua/http-fetch: Use <kip> instead of HTX extra field to get body size
- MINOR: cache: Use the <kip> value to check too big objects
- MINOR: compression: Use the <kip> value to check body size
- MEDIUM: mux-h1: Stop to use HTX extra value when formatting message
- MEDIUM: htx: Remove the HTX extra field
- MEDIUM: acme: don't insert acme account key in ckchs_tree
- BUG/MINOR: acme: memory leak from the config parser
- CI: cirrus-ci: bump FreeBSD image to 14-3
- BUG/MEDIUM: ssl: take care of second client hello
- BUG/MINOR: ssl: always clear the remains of the first hello for the second one
- BUG/MEDIUM: stconn: Properly forward kip to the opposite SE descriptor
- MEDIUM: applet: Forward <kip> to applets
- DEBUG: mux-h1: Dump <kip> and <kop> values with sedesc info
- BUG/MINOR: ssl: leak in ssl-f-use
- BUG/MINOR: ssl: leak crtlist_name in ssl-f-use
- BUILD: makefile: disable tail calls optimizations with memory profiling
- BUG/MEDIUM: apppet: Improve spinning loop detection with the new API
- BUG/MINOR: ssl: Free global_ssl structure contents during deinit
- BUG/MINOR: ssl: Free key_base from global_ssl structure during deinit
- MEDIUM: jwt: Remove certificate support in jwt_verify converter
- MINOR: jwt: Add new jwt_verify_cert converter
- MINOR: jwt: Do not look into ckch_store for jwt_verify converter
- MINOR: jwt: Add new "jwt" certificate option
- MINOR: jwt: Add specific error code for known but unavailable certificate
- DOC: jwt: Add doc about "jwt_verify_cert" converter
- MINOR: ssl: Dump options in "show ssl cert"
- MINOR: jwt: Add new "add/del/show ssl jwt" CLI commands
- REGTEST: jwt: Test new CLI commands
- BUG/MINOR: ssl: Potential NULL deref in trace macro
- MINOR: regex: use a thread-local match pointer for pcre2
- BUG/MEDIUM: pools: fix bad freeing of aligned pools in UAF mode
- MEDIUM: pools: detect() when munmap() fails in UAF mode
- TESTS: quic: useless param for b_quic_dec_int()
- BUG/MEDIUM: pools: fix crash on filtered "show pools" output
- BUG/MINOR: pools: don't report "limited to the first X entries" by default
- BUG/MAJOR: lb-chash: fix key calculation when using default hash-key id
- BUG/MEDIUM: stick-tables: Don't forget to dec count on failure.
- BUG/MINOR: quic: check applet_putchk() for 'show quic' first line
- TESTS: quic: fix uninit of quic_cc_path const member
- BUILD: ssl: can't build when using -DLISTEN_DEFAULT_CIPHERS
- BUG/MAJOR: quic: uninitialized quic_conn_closed struct members
- BUG/MAJOR: quic: do not reset QUIC backends fds in closing state
- BUG/MINOR: quic: SSL counters not handled
- DOC: clarify the experimental status for certain features
- MINOR: config: remove experimental status on tune.disable-fast-forward
- MINOR: tree-wide: add missing TAINTED flags for some experimental directives
- MEDIUM: config: warn when expose-experimental-directives is used for no reason
- BUG/MEDIUM: threads/config: drop absent threads from thread groups
- REGTESTS: remove experimental from quic/retry.vtc
Recent commit 8b7a82cd30 ("MEDIUM: config: warn when
expose-experimental-directives is used for no reason") triggered on
this test exactly for the reason it was made for. The tests were just
done without quic on it. Let's drop the unneeded option.
Thread groups can be assigned arbitrary thread ranges, but if the
mentioned threads do not exist, this causes crashes in listener_accept()
or some connections to be ignored. The reason is that the calculated
mask is derived from the thread group's enabled threads count. Examples:
    global
        nbthread 2
        thread-groups 2
        thread-group 1 1-64
        thread-group 2 65-128

    frontend f-crash
        bind :8001 thread 1/all

    frontend f-freeze
        bind :8002 thread 2/all
This commit removes missing threads, emits a warning when the thread
group just has fewer threads than requested, and an error when it is
left with no threads at all.
This must be backported to 3.1 since the issue is present there already.
If users start to enable expose-experimental-directives for the purpose
of testing one specific feature, there are chances that the option remains
forever and hides the experimental status of other options.
Let's emit a warning if the option appears and is not used. This will
remind users that they can now drop it, and help keep configs safe for
future upgrades.
We normally taint the process when using experimental directives, but
a handful of places were missed so we don't always know that they are
in use. Let's fix these places (hint for future directives, just look
for places checking for "experimental_directives_allowed", and add
"mark_tainted(TAINTED_CONFIG_EXP_KW_DECLARED);").
The option was turned to off by default in 2.8 with commit 2f7c82bfd
("BUG/MINOR: haproxy: Fix option to disable the fast-forward"), however
at the same time it should have dropped its experimental status since
the feature is enabled by default. The only goal of the option is to
debug something, like many other tune.xxx options. The option should
still normally not be used without being invited to do so by developers
looking for something specific though.
This could be backported if desired to simplify debugging, though this
has never been needed for now.
Certain features require "expose-experimental-directives" to be set in
the global section. Let's clarify that experimental features are only
maintained in best effort mode, may break during the stable cycle, and
are generally not maintained beyond the release of the next LTS branch
since it is extremely challenging, and early adopters are expected to
upgrade to benefit from improvements anyway.
The SSL counters were not handled at all for QUIC connections. This patch
implements ssl_sock_update_counters(), extracting the code from ssl_sock.c,
and calls this function where applicable both in TLS/TCP and QUIC parts.
Must be backported as far as 2.8.
This bug impacts only the backends.
When entering the closing state, a quic_conn_closed is used to replace the
quic_conn. In this state, the ->fd value was reset to -1 by calling
qc_init_fd(). This value is used by qc_may_use_saddr() which assumes it
cannot be -1 for a backend: when qc_test_fd() is false, qc->li gets
dereferenced, which is legal only for a listener, leading to a possible
crash.
This patch prevents such fd value resets for backends.
No need to backport because the QUIC backends support arrived with 3.3.
A quic_conn_closed struct is initialized to replace the quic_conn when the
connection enters the closing state, to reduce the connection memory footprint.
->max_udp_payload of quic_conn_closed was not initialized, leading to possible
BUG_ON()s in qc_rcv_buf() when comparing the RX buf size to this payload.
->cntrs counters were also not initialized, with the only consequence
being wrong values for these counters.
Must be backported as far as 2.9.
Emeric reported that he can't build haproxy anymore since 9bc6a034
("BUG/MINOR: ssl: Free global_ssl structure contents during deinit").
src/ssl_sock.c:7020:40: error: comparison with string literal results in unspecified behavior [-Werror=address]
7020 | if (global_ssl.listen_default_ciphers != LISTEN_DEFAULT_CIPHERS)
| ^~
src/ssl_sock.c:7023:41: error: comparison with string literal results in unspecified behavior [-Werror=address]
7023 | if (global_ssl.connect_default_ciphers != CONNECT_DEFAULT_CIPHERS)
| ^~
src/ssl_sock.c: At top level:
Indeed the mentioned patch is checking the pointer in order to free
something freeable, but that can't work because these constants are
string literals (possibly passed on the compiler command line), not
pointers to allocated memory.
Also the test is not useful, because these strings are strdup()'ed in
__ssl_sock_init, so they can be freed directly.
Must be backported to every stable branch with 9bc6a034.
Fix quic_tx unittest module by adding an explicit define for <mtu> const
member of quic_cc_path.
This should fix coverity report from github issue #3162.
This can be backported up to 3.2.
Ensure applet_putchk() return value is checked when outputting the
CLI 'show quic' header line.
This is only to align with other usages of the same function, as trash
output buffer should always be large enough for it. As such, the command
is simply aborted if this is not the case.
This should fix coverity report from github issue #3139.
This could be backported up to 2.8.
In stksess_new(), if we failed to allocate memory for the new stksess,
don't forget to decrement the table entry count, as nobody else will
do it for us.
An artificially high count could lead to at least purging entries while
there is no need to.
This should be backported up to 2.8.
A subtle regression was introduced in 3.0 by commit faa8c3e02 ("MEDIUM:
lb-chash: Deterministic node hashes based on server address"). When keys
are calculated from the server's ID (which is the default), due to the
reorganisation of the code, the key ended up being hashed twice instead
of being multiplied by the scaling range.
While most users will never notice it, it is blocking some large cache
users from upgrading from 2.8 to 3.0 or 3.2 because the keys are
redistributed.
After a check with users on the mailing list [1] it was estimated that
keeping the current situation is the worst choice, because those who have
not yet upgraded will face the problem anyway, while by fixing it, those
who already have upgraded and for whom it happened smoothly will handle it
just as well again.
As such this fix must be backported to 3.0 without waiting (in order
to spare those who upgrade from facing two redistributions). Please note
that only configurations featuring "hash-type consistent" and not
having "hash-key" present with a value other than "id" are affected,
others are not (e.g. "hash-key addr" is unaffected).
[1] https://www.mail-archive.com/haproxy@formilux.org/msg46115.html
With the fix in commit 982805e6a3 ("BUG/MINOR: pools: Fix the dump of
pools info to deal with buffers limitations"), the max count is now
compared to the number of dumped pools instead of the configured
number, and keeping >= is no longer valid because maxcnt is set by
default to the same value when not set, so this means that since this
patch we're always displaying "limited to the first X entries" where X
is the number of dumped entries even in the absence of any limitation.
Let's just fix the comparison to only show this when the limit is lower.
This must be backported to 3.2 where the patch above already is.
The truncation of pools output that was addressed in commit 982805e6a3
("BUG/MINOR: pools: Fix the dump of pools info to deal with buffers
limitations") required to split the pools filling from dumping. However
there is a problem when a limit is passed that is lower than the number
of pools or if a pool name is specified or if pool caches are disabled,
because in this case the number of filled slots will be lower than the
initially allocated one, and empty entries will be visited either by the
sort functions when filling the entries if "byxxx" is specified, or by
the dump function after the last entry, but none of these functions was
expecting to be passed a NULL entry.
Let's just re-adjust nbpools to match the number of filled entries at
the end. Anyway the totals are calculated on the number of dumped
entries.
This must be backported to 3.2 since the fix above was backported there
as well.
The third parameter passed to b_quic_dec_int() is uninitialized. This is not a bug.
But this disturbs coverity for an unknown reason, as revealed by GH issue #3154.
This patch takes the opportunity to pass NULL instead, to avoid using such
an unneeded third parameter.
Should be backported to 3.2 where this unit test was introduced.
Better check that munmap() always works, otherwise it means we might
have miscalculated an address, and if it fails silently, it will eat
all the memory extremely quickly. Let's add a BUG_ON() on munmap's
return.
As reported by Christopher, in UAF mode memory release of aligned
objects as introduced in commit ef915e672a ("MEDIUM: pools: respect
pool alignment in allocations") does not work. The padding calculation
in the freeing code is no longer correct since it now depends on the
alignment, so munmap() fails on EINVAL. Fortunately we don't care much
about it since we know it's the low bits of the passed address, which
is much simpler to compute, since all mmaps are page-aligned.
There's no need to backport this, as this was introduced in 3.3.
The pcre2 matching requires an array of matches for grouping, that is
allocated when executing the rule by pre-processing it, and that is
immediately freed after use. This is quite inefficient and results in
annoying patterns in "show profiling" that attribute the allocations
to libpcre2 and the releases to haproxy.
A good suggestion from Dragan is to pre-allocate these per thread,
since the entry is not specific to a regex. In addition we're already
limited to MAX_MATCH matches so we don't even have the problem of
having to grow it while parsing nor processing.
The current patch adds a per-thread pair of init/deinit functions to
allocate a thread-local entry for that, and gets rid of the dynamic
allocations. It will result in cleaner memory management patterns and
slightly higher performance (+2.5%) when using pcre2.
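A minimal sketch of that per-thread pre-allocation, assuming the pcre2 8-bit
API; the function names and the MAX_MATCH value used here are illustrative
assumptions, not the actual HAProxy code:
    #define PCRE2_CODE_UNIT_WIDTH 8
    #include <pcre2.h>

    #define MAX_MATCH 10   /* existing per-regex group limit (value assumed) */

    static __thread pcre2_match_data *match_data;

    /* per-thread init: allocate the match data once */
    static int regex_alloc_match_per_thread(void)
    {
        match_data = pcre2_match_data_create(MAX_MATCH, NULL);
        return match_data != NULL;
    }

    /* per-thread deinit: release it */
    static void regex_free_match_per_thread(void)
    {
        pcre2_match_data_free(match_data);
        match_data = NULL;
    }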
'ctx' might be NULL when we exit 'ssl_sock_handshake', so it can't be
dereferenced without a check in the trace macro.
This was found by Coverity and raised in GitHub issue #3113.
This patch should be backported up to 3.2
The new "add/del ssl jwt <file>" commands allow to change the "jwt" flag
of an already loaded certificate. It allows to delete certificates used
for JWT validation, which was not yet possible.
The "show ssl jwt" command iterates over all the ckch_stores and dumps
the ones that have the option set.
Add information about the new "jwt_verify_cert" converter and update the
existing "jwt_converter" doc to remove mentions of certificates from it.
Add information about the new "jwt" certificate option.
A certificate that does not have the 'jwt' flag enabled cannot be used
for JWT validation. We now raise a specific return value so that such a
case can be identified.
This option can be used to enable the use of a given certificate for JWT
verification. It defaults to 'off' so certificates that are declared in
a crt-store and will be used for JWT verification must have a
"jwt on" option in the configuration.
This converter will be in charge of performing the same operation as the
'jwt_verify' one except that it takes a full-on pem certificate path
instead of a public key path as parameter.
The certificate path can be either provided directly as a string or via
a variable. This allows to use certificates that are not known during
init to perform token validation.
The jwt_verify converter will not take full-on certificates anymore
in favor of a new soon to come jwt_verify_cert. We might end up with a
new jwt_verify_hmac in the future as well which would allow to deprecate
the jwt_verify converter and remove the need for a specific internal
tree for public keys.
The logic to always look into the internal jwt tree by default and
resolve to locking the ckch tree as little as possible will also be
removed. This allows to get rid of the duplicated reference to
EVP_PKEYs, the one in the jwt tree entry and the one in the ckch_store.
The key_base field of the global_ssl structure is a strdup'ed field
(when set) which was never free'd during deinit.
This patch can be backported up to branch 3.0.
Some fields of the global_ssl structure are strings that are strdup'ed
but never freed. There is only one static global_ssl structure so not
much memory is used but we might as well free it during deinit.
This patch can be backported to all stable branches.
Conditions to detect the spinning loop for applets based on the new API are
not accurate. We cannot continue to check the channel's buffers state to
know if an applet has made some progress. At least, we must also check the
applet's buffers.
After digging to find the right way to do it, it was clear that the best is to
use something similar to what is performed for the streams, namely, checking
read and write events. And in fact, it is quite easy to do with the new
API. So let's do so.
This patch must be backported as far as 3.0.
The purpose of memory profiling precisely is to figure what function
allocates and what function frees for specific objects. It turns out
that a non-negligible number of release callbacks basically do nothing
but a free() or pool_free() call and return, which the compiler happily
turns into a jump, making the caller of that callback appear as the
real one. That's how we can see libcrypto release to pools such as
ssl-capture for example, which also makes the per-DSO calls appear
wrong:
10000 0 10720000 0| 0x448c8d ssl_async_fd_free+0x3b9d p_alloc(1072) [pool=ssl-capture]
50000 0 6800000 0| 0x4456b9 ssl_async_fd_free+0x5c9 p_alloc(136) [pool=ssl-keylogf]
10072 0 644608 0| 0x447f14 ssl_async_fd_free+0x2e24 p_alloc(64) [pool=ssl-keylogf]
0 10000 0 1360000| 0x445987 ssl_async_fd_free+0x897 p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x4459b8 ssl_async_fd_free+0x8c8 p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x4459e9 ssl_async_fd_free+0x8f9 p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x445a1a ssl_async_fd_free+0x92a p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x445a4b ssl_async_fd_free+0x95b p_free(-136) [pool=ssl-keylogf]
0 20072 0 11364608| 0x7f5f1397db62 libcrypto:CRYPTO_free_ex_data+0xf2/0x261 p_free(-566) [pool=ssl-keylogf] [locked=72 (0.3 %)]
Worse, as can be seen on the last line above, there can be a single pool
per call place (since we don't release to arbitrary pools), and the stats
are misleading by reporting the first used pool only when a same function
can call multiple release callbacks. This is why the free call totals
10k ssl-capture and 10072 ssl-keylogfile.
Let's just disable tail call optimization when using memory profiling.
The gains from this optimization are only marginal and it complicates the
debugging so much that it's not worth it. Now the output is correct, and no longer claims
that libcrypto is the caller:
10000 0 10720000 0| 0x448c9f ssl_async_fd_free+0x3b9f p_alloc(1072) [pool=ssl-capture]
0 10000 0 10720000| 0x445af0 ssl_async_fd_free+0x9f0 p_free(-1072) [pool=ssl-capture]
50000 0 6800000 0| 0x4456c9 ssl_async_fd_free+0x5c9 p_alloc(136) [pool=ssl-keylogf]
10177 0 1221240 0| 0x45543d ssl_async_fd_handler+0xb51d p_alloc(120) [pool=ssl_sock_ct] [locked=165 (1.6 %)]
10061 0 643904 0| 0x447f1c ssl_async_fd_free+0x2e1c p_alloc(64) [pool=ssl-keylogf]
0 10000 0 1360000| 0x445987 ssl_async_fd_free+0x887 p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x4459b8 ssl_async_fd_free+0x8b8 p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x4459e9 ssl_async_fd_free+0x8e9 p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x445a1a ssl_async_fd_free+0x91a p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x445a4b ssl_async_fd_free+0x94b p_free(-136) [pool=ssl-keylogf]
0 10188 0 1222560| 0x44f518 ssl_async_fd_handler+0x55f8 p_free(-120) [pool=ssl_sock_ct] [locked=176 (1.7 %)]
0 10072 0 644608| 0x445aa6 ssl_async_fd_free+0x9a6 p_free(-64) [pool=ssl-keylogf] [locked=72 (0.7 %)]
An attempt was made to only instrument pool_free() to place a compiler
barrier, but that resulted in much larger code and wouldn't cover
functions ending with a simple "free()" call. "ha_free()" however is
already immune against tail call optimization since it has to write
the NULL when returning from free().
This should be backported to recent stable releases that are still
regularly being debugged.
For now, no applets are using the <kop> value when consuming data. At least,
as far as I know. But it remains a good idea to keep the applet API
compatible. So now, the <kip> of the opposite side is properly forwarded to
applets.
By refactoring the HTX to remove the extra field, a bug was introduced in
the stream-connector part. The <kip> (known input payload) value of a sedesc
was moved to <kop> (known output payload) using the same sedesc. Of course,
this is totally wrong. <kip> value of a sedesc must be forwarded to the
opposite side.
In addition, the operation is performed in sc_conn_send(). In this function,
we manipulate the stream-connectors. So se_fwd_kip() function was changed to
use the stream-connectors directly.
The function sc_ep_fwd_kip() is now called with both
stream-connectors to properly forward <kip> from one side to the opposite
side.
The bug is 3.3-specific. No backport needed.
William rightfully pointed that despite the ssl capture being a
structure, some of its entries are only set for certain contents,
so we need to always zero it before using it so as to clear any
remains of a previous use, otherwise we could possibly report some
entries that were only present in the first hello and not the second
one. No need to clear the data though, since any remains will not be
referenced by the fields.
This must be backported wherever commit 336170007c ("BUG/MEDIUM: ssl:
take care of second client hello") is backported.
For a long time we've been observing some sporadic leaks of ssl-capture
pool entries on haproxy.org without figuring exactly the root cause. All
that was seen was that fewer calls to the free callback were made than
calls to the hello parsing callback, and these were never reproduced
locally.
It recently turned out to be triggered by the presence of "curves" or
"ecdhe" on the "bind" line. Captures have shown the presence of a second
client hello, called "Change Cipher Client Hello" in wireshark traces,
that calls the client hello callback again. That one wasn't prepared for
being called twice per connection, so it allocates an ssl-capture entry
and assigns it to the ex_data entry, possibly overwriting the previous
one.
In this case, the fix is super simple, just reuse the current ex_data
if it exists, otherwise allocate a new one. This completely solves the
problem.
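A sketch of that logic; the ex_data index and pool names are taken from the
capture code as assumptions and may not match it exactly:
    struct ssl_capture *capture;

    capture = SSL_get_ex_data(ssl, ssl_capture_ptr_index);
    if (!capture) {
        /* first client hello on this connection: allocate and attach */
        capture = pool_zalloc(pool_head_ssl_capture);
        if (!capture)
            return 1;
        SSL_set_ex_data(ssl, ssl_capture_ptr_index, capture);
    }
    /* (re)fill <capture> from the hello currently being parsed */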
Other callbacks have been audited for the same issue and are not
affected: ssl_ini_keylog() already performs this check and ignores
subsequent calls, and other ones do not allocate data.
This must be backported to all supported versions.
This patch fixes some memory leaks in the configuration parser:
- deinit_acme() was never called
- add ha_free() before every strdup() for section overwrite
- lacked some free() in deinit_acme()
Don't insert the acme account key in the ckchs_tree anymore. ckch_store
are not made to only include a private key. CLI operations are not
possible with them either. That doesn't make much sense to keep it that
way until we rework the ckch_store.
Thanks to previous changes, it is now possible to remove the <extra> field
from the HTX structure. HTX_FL_ALTERED_PAYLOAD flag is also removed because
it is now unused.
When data are sent to the consumer, the known output payload length is
updated using the known input payload length value and this last one is then
reset. se_fwd_kip() function is used for this purpose.
Set <kip> value when data are transferred to the upper layer, in h3_rcv_buf().
The difference between the known length of the payload before and after a
parsing loop is added to <kip> value. When a content-length is specified in
the message, the h3s <body_len> field is used. Otherwise, it is the h3s
<data_len> field.
Set <kip> value when data are transferred to the upper layer, in h2_rcv_buf().
The new <body_len> field of the H2S is used to increment <kip> value and
then it is reset. The patch relies on the previous one ("MINOR: mux-h2: Save
the known length of the payload").
Before, the <body_len> H2S field was only used to verify that the announced
content-length value was respected. Now, this field is used for all
messages. Messages with a content-length are still handled the same way.
<body_len> is set to the content-length value and decremented by the size of
each DATA frame. For other messages, the value is initialized to ULLONG_MAX
and still decremented by the size of each DATA frame. This change is
mandatory to properly define the known input payload length value of the
sedesc.
Set <kip> value during the response parsing. The difference between the body
length before and after a parsing loop is added. The patch relies on the
previous one ("MINOR: h1-htx: Increment body len when parsing a payload with
no xfer length").
Set <kip> value during the message parsing. The difference between the body
length before and after a parsing loop is added. The patch relies on the
previous one ("MINOR: h1-htx: Increment body len when parsing a payload with
no xfer length").
In the H1 parser, the body length was only incremented when the transfer
length was known. So when the content-length was specified or when the
transfer-encoding value was set to "chunk".
Now for messages with unknown transfer length, it is also incremented. It is
mandatory to be able to remove the extra field from the HTX message.
For now, the HTX extra value is used to specify the known part, in bytes, of
the HTTP payload we will receive. It may concern the full payload if a
content-length is specified or the current chunk for a chunk-encoded
message. The main purpose of this value is to be used on the opposite side
to be able to announce chunks bigger than a buffer. It can also be used to
check the validity of the payload on the sending path, to properly detect
too big or too short payload.
However, setting this information in the HTX message itself is not really
appropriate because the information is lost when the HTX message is consumed
and the underlying buffer released. So the producer must take care to always
add it in all HTX messages. It is especially an issue when the payload is
altered by a filter.
So to fix this design issue, the information will be moved in the sedesc. It
is a persistent area to save the information. In addition, to avoid the
ambiguity between what the producer says and what the consumer sees, the
information will be split into two fields. In this patch, the fields are
added:
* kip : The known input payload length
* kop : The known output payload length
The producer will be responsible to set <kip> value. The stream will be
responsible to decrement <kip> and increment <kop> accordingly. And the
consumer will be responsible to remove consumed bytes from <kop>.
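To fix ideas, a minimal sketch of the forwarding step the stream is
responsible for; the field and function names follow the commit messages
but the real prototypes may differ:
    /* move the producer's known input payload of one side to the known
     * output payload of the opposite side, once data have been passed
     * to the consumer
     */
    static inline void sc_ep_fwd_kip(struct stconn *sci, struct stconn *sco)
    {
        sco->sedesc->kop += sci->sedesc->kip;
        sci->sedesc->kip  = 0;
    }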
QC_SF_UNKNOWN_PL_LENGTH flag is set on the qcs to know that the payload of a
message has an unknown length and not send a RESET_STREAM on shutdown. This flag was
based on the HTX extra field value. However, it is not necessary. When
headers are processed, before sending them, it is possible to check the HTX
start-line to know if the length of the payload is known or not.
So let's do so and stop using the HTX extra field for this purpose.
In the continuity of https://github.com/orgs/haproxy/discussions/3146,
we must also enable abortonclose by default for TLS listeners so as not
to needlessly compute TLS handshakes on dead connections. The change is
very small (just set the default value to 1 in the TLS code when neither
the option nor its opposite were set).
It may possibly cause some TLS handshakes to start failing with 3.3 in
certain legacy environments (e.g. TLS health-checks performed using only
a client hello and closing afterwards), and in this case it is sufficient
to disable the option using "no option abortonclose" in either the
affected frontend or the "defaults" section it derives from.
With this function we can now pass the desired default value for the
abortonclose option when neither the option nor its opposite were set.
Let's also take this opportunity for using it directly from the HTTP
analyser since there's no point in re-checking the proxy's mode there.
As discussed on https://github.com/orgs/haproxy/discussions/3146 and on
the mailing list, there's a marked preference for having abortonclose
enabled by default when relevant. The point being that with todays'
internet, the large majority of requests sent with a closed input
channel are aborted requests, and that it's pointless to waste resources
processing them.
This patch now considers both "option abortonclose" and its opposite
"no option abortonclose" to figure whether abortonclose is enabled or
disabled in a backend. When neither are set (thus not even inherited
from a defaults section), then it considers the proxy's mode, and HTTP
mode implies abortonclose by default.
This may make some legacy services fail starting with 3.3. In this case
it will be sufficient to add "no option abortonclose" in either the
affected backend or the defaults section it derives from. But for
internet-facing proxies it's better to stay with the option enabled.
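A sketch of the resulting decision (the flag and field names below are the
usual HAProxy ones but are given here only for illustration): an explicit
"option abortonclose" wins, then an explicit "no option abortonclose",
otherwise the proxy's mode provides the default.
    static inline int proxy_abrt_close(const struct proxy *px)
    {
        if (px->options & PR_O_ABRT_CLOSE)
            return 1;
        if (px->no_options & PR_O_ABRT_CLOSE)
            return 0;
        return px->mode == PR_MODE_HTTP;
    }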
In order to prepare for changing the way abortonclose works, let's
replace the direct flag check with a similarly named function
(proxy_abrt_close) which returns the on/off status of the directive
for the proxy. For now it simply reflects the flag's state.
By default when building an H2 request, vtest sets the END_STREAM flag
on the HEADERS frame. This is problematic with the websocket and proto
upgrade tests since we're using CONNECT, because it immediately closes
afterwards, which does not correspond to what we're testing. Doing this
in abortonclose mode rightfully produces an error. Let's fix the test
so as not to set the flag on the HEADERS frame. However, doing so means
we'll receive a window update that we must also accept. Now the test
works both with and without abortonclose.
Tests with abortonclose showed a bug with this test where the client
would close the stream immediately after sending the request, without
waiting for the response, causing some random failures on the server
side.
The "abortonclose" option was recently deprecated in frontends because its
action was essentially limited to the backend part (queuing etc). But in
3.3 we started to support it for TLS on frontends, though it would only
work when placed in a defaults section. Let's officially support it in
frontends, and take this opportunity to clarify the documentation on this
topic, which was incomplete regarding frontend and TLS support. Now the
doc tries to better cover the different use cases.
The patchbot stopped on a previous ultra-rare forced push due to wanting
the user's name and e-mail before proceeding. We don't want merges nor
rebases anyway, only to reset the tree to the next one, so let's do that.
The 'wait-for-body' action sets the channel's analyse_exp date to the
configured time. However, when the action is finished, it does not reset
it. This is an issue for some following actions, like 'pause', that also rely
on this date.
To fix the issue, we must take care to reset the analyse_exp date to
TICK_ETERNITY when the 'wait-for-body' action is finished.
This patch should fix the issue #3147. It must be backported to all stable
versions.
Since 9561b9fb6 ("BUG/MINOR: sink: add tempo between 2 connection
attempts for sft servers"), there is a possibility that the tempo we use
to schedule the task expiry may end up being TICK_ETERNITY, because ticks
are added to the tempo with a plain addition that doesn't handle potential
wrapping. When this happens (relatively rare, since now_ms only wraps every
49.7 days, but a forced wrap occurs 20 seconds after haproxy is started, so
it is more likely to happen there), the process_sink_forward() task expiry
is set to TICK_ETERNITY and the task may never be called again; this is
especially true if the ring section only contains a single server.
To fix the issue, we must use the tick_add() helper function to set the tempo
value; this way we ensure that the value will never be TICK_ETERNITY.
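For reference, this is roughly the guarantee that tick_add() provides (a
simplified sketch of the helper, not the exact haproxy code):

    #define TICK_ETERNITY 0

    /* add <timeout> ms to <now>; if the sum wraps to 0 (TICK_ETERNITY),
     * bump it by one so the result is always a real expiration date */
    static inline unsigned int my_tick_add(unsigned int now, unsigned int timeout)
    {
        now += timeout;
        if (!now)
            now++;
        return now;
    }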
It must be backported everywhere 9561b9fb6 was backported (up to 2.6
it seems).
In connect_server(), only avoid creating a mux when we're reusing a
connection, if that connection already has one. We can reuse a
connection with no mux, if we made a first attempt at connecting to the
server and it failed before we could create the mux (or during the mux
creation). The connection will then be reused when trying again.
This fixes a bug where a stream could stall if the first connection
attempt failed before the mux creation. It is easy to reproduce by
creating random memory allocation failure with -dmFail.
This was introduced by commit 4aaf0bfbced22d706af08725f977dcce9845d340,
and thus does not need any backport as long as that commit is not
backported.
Released version 3.3-dev9 with the following main changes :
- BUG/MINOR: acl: Fix error message about several '-m' parameters
- MINOR: server: Parse sni and pool-conn-name expressions in a dedicated function
- BUG/MEDIUM: server: Use sni as pool connection name for SSL server only
- BUG/MINOR: server: Update healthcheck when server settings are changed via CLI
- OPTIM: backend: Don't set SNI for non-ssl connections
- OPTIM: proto_rhttp: Don't set SNI for non-ssl connections
- OPTIM: tcpcheck: Don't set SNI and ALPN for non-ssl connections
- BUG/MINOR: tcpcheck: Don't use sni as pool-conn-name for non-SSL connections
- MEDIUM: server/ssl: Base the SNI value to the HTTP host header by default
- MEDIUM: httpcheck/ssl: Base the SNI value on the HTTP host header by default
- OPTIM: tcpcheck: Reorder tcpcheck_connect structure fields to fill holes
- REGTESTS: ssl: Add a script to test the automatic SNI selection
- MINOR: quic: add useful trace about padding params values
- BUG/MINOR: quic: too short PADDING frame for too short packets
- BUG/MINOR: cpu_topo: work around a small bug in musl's CPU_ISSET()
- BUG/MEDIUM: ssl: Properly initialize msg_controllen.
- MINOR: quic: SSL session reuse for QUIC
- BUG/MEDIUM: proxy: fix crash with stop_proxy() called during init
- MINOR: stats-file: use explicit unsigned integer bitshift for user slots
- CLEANUP: quic: fix typo in quic_tx trace
- TESTS: quic: add unit-tests for QUIC TX part
- MINOR: quic: restore QUIC_HP_SAMPLE_LEN constant
- REGTESTS: ssl: Fix the script about automatic SNI selection
- BUG/MINOR: pools: Fix the dump of pools info to deal with buffers limitations
- MINOR: pools: Don't dump anymore info about pools when purge is forced
- BUG/MINOR: quic: properly support GSO on backend side
- BUG/MEDIUM: mux-h2: Reset MUX blocking flags when a send error is caught
- BUG/MEDIUM: mux-h2: Don't block receives in H2_CS_ERROR and H2_CS_ERROR2 states
- BUG/MEDIUM: mux-h2: Restart reading when mbuf ring is no longer full
- BUG/MINOR: mux-h2: Remove H2_CF_DEM_DFULL flags when the demux buffer is reset
- BUG/MEDIUM: mux-h2: Report RST/error to app-layer stream during 0-copy fwding
- BUG/MEDIUM: mux-h2: Reinforce conditions to report an error to app-layer stream
- BUG/MINOR: hq-interop: adjust parsing/encoding on backend side
- OPTIM: check: do not delay MUX for ALPN if SSL not active
- BUG/MEDIUM: checks: fix ALPN inheritance from server
- BUG/MINOR: check: ensure checks are compatible with QUIC servers
- MINOR: check: reject invalid check config on a QUIC server
- MINOR: debug: report the process id in warnings and panics
- DEBUG: stream: count the number of passes in the connect loop
- MINOR: debug: report the number of loops and ctxsw for each thread
- MINOR: debug: report the time since last wakeup and call
- DEBUG: peers: export functions that use locks
- MINOR: stick-table: permit stksess_new() to temporarily allocate more entries
- MEDIUM: stick-tables: relax stktable_trash_oldest() to only purge what is needed
- MEDIUM: stick-tables: give up on lock contention in process_table_expire()
- MEDIUM: stick-tables: don't wait indefinitely in stktable_add_pend_updates()
- MEDIUM: peers: don't even try to process updates under contention
- BUG/MEDIUM: h1: Allow reception if we have early data
- BUG/MEDIUM: ssl: create the mux immediately on early data
- MINOR: ssl: Add a flag to let it be known we have an ALPN negotiated
- MINOR: ssl: Use the new flag to know when the ALPN has been set.
- MEDIUM: server: Introduce the concept of path parameters
- CLEANUP: backend: clarify the role of the init_mux variable in connect_server()
- CLEANUP: backend: invert the condition to start the mux in connect_server()
- CLEANUP: backend: simplify the complex ifdef related to 0RTT in connect_server()
- CLEANUP: backend: clarify the cases where we want to use early data
- MEDIUM: server: Make use of the stored ALPN stored in the server
- BUILD: ssl: address a recent build warning when QUIC is enabled
- BUG/MINOR: activity: fix reporting of task latency
- MINOR: activity: indicate the number of calls on "show tasks"
- MINOR: tools: don't emit "+0" for symbol names which exactly match known ones
- BUG/MEDIUM: stick-tables: don't loop on non-expirable entries
- DEBUG: stick-tables: export stktable_add_pend_updates() for better reporting
- BUG/MEDIUM: ssl: Fix a crash when using QUIC
- BUG/MEDIUM: ssl: Fix a crash if we failed to create the mux
- MEDIUM: dns: bind the nameserver sockets to the initiating thread
- MEDIUM: resolvers: make the process_resolvers() task single-threaded
- BUG/MINOR: stick-table: make sure never to miss a process_table_expire update
- MEDIUM: stick-table: move process_table_expire() to a single thread
- MEDIUM: peers: move process_peer_sync() to a single thread
- BUG/MAJOR: stream: Force channel analysis on successful synchronous send
- MINOR: quic: get rid of ->target quic_conn struct member
- MINOR: quic-be: make SSL/QUIC objects use their own indexes (ssl_qc_app_data_index)
- MINOR: quic: display build warning for compat layer on recent OpenSSL
- DOC: quic: clarifies limited-quic support
- BUG/MINOR: acme: null pointer dereference upon allocation failure
- BUG/MEDIUM: jws: return size_t in JWS functions
- BUG/MINOR: ssl: Potential NULL deref in trace macro
- BUG/MINOR: ssl: Fix potential NULL deref in trace callback
- BUG/MINOR: ocsp: prototype inconsistency
- MINOR: ocsp: put internal functions as static ones
- MINOR: ssl: set functions as static when no prototypes in the .h
- BUILD: ssl: functions defined but not used
- BUG/MEDIUM: resolvers: Properly cache do-resolv resolution
- BUG/MINOR: resolvers: Restore round-robin selection on records in DNS answers
- MINOR: activity: don't report the lat_tot column for show profiling tasks
- MINOR: activity: add a new lkw_avg column to show profiling stats
- MINOR: activity: collect time spent waiting on a lock for each task
- MINOR: thread: add a lock level information in the thread_ctx
- MINOR: activity: add a new lkd_avg column to show profiling stats
- MINOR: activity: collect time spent with a lock held for each task
- MINOR: activity: add a new mem_avg column to show profiling stats
- MINOR: activity: collect CPU time spent on memory allocations for each task
- MINOR: activity/memory: count allocations performed under a lock
- DOC: proxy-protocol: Add TLS group and sig scheme TLVs
- BUG/MEDIUM: resolvers: Test for empty tree when getting a record from DNS answer
- BUG/MEDIUM: resolvers: Make resolution owns its hostname_dn value
- BUG/MEDIUM: resolvers: Accept to create resolution without hostname
- BUG/MEDIUM: resolvers: Wake resolver task up when unlinking a stream requester
- BUG/MINOR: ocsp: Crash when updating CA during ocsp updates
- Revert "BUG/MINOR: ocsp: Crash when updating CA during ocsp updates"
- BUG/MEDIUM: http_ana: fix potential NULL deref in http_process_req_common()
- MEDIUM: log/proxy: store log-steps selection using a bitmask, not an eb tree
- BUG/MINOR: ocsp: Crash when updating CA during ocsp updates
- BUG/MINOR: resolvers: always normalize FQDN from response
- BUILD: makefile: implement support for running a command in range
- IMPORT: cebtree: import version 0.5.0 to support duplicates
- MEDIUM: migrate the patterns reference to cebs_tree
- MEDIUM: guid: switch guid to more compact cebuis_tree
- MEDIUM: server: switch addr_node to cebis_tree
- MEDIUM: server: switch conf.name to cebis_tree
- MEDIUM: server: switch the host_dn member to cebis_tree
- MEDIUM: proxy: switch conf.name to cebis_tree
- MEDIUM: stktable: index table names using compact trees
- MINOR: proxy: add proxy_get_next_id() to find next free proxy ID
- MINOR: listener: add listener_get_next_id() to find next free listener ID
- MINOR: server: add server_get_next_id() to find next free server ID
- CLEANUP: server: use server_find_by_id() when looking for already used IDs
- MINOR: server: add server_index_id() to index a server by its ID
- MINOR: listener: add listener_index_id() to index a listener by its ID
- MINOR: proxy: add proxy_index_id() to index a proxy by its ID
- MEDIUM: proxy: index proxy ID using compact trees
- MEDIUM: listener: index listener ID using compact trees
- MEDIUM: server: index server ID using compact trees
- CLEANUP: server: slightly reorder fields in the struct to plug holes
- CLEANUP: proxy: slightly reorganize fields to plug some holes
- CLEANUP: backend: factor the connection lookup loop
- CLEANUP: server: use eb64_entry() not ebmb_entry() to convert an eb64
- MINOR: server: pass the server and thread to srv_migrate_conns_to_remove()
- CLEANUP: backend: use a single variable for removed in srv_cleanup_idle_conns()
- MINOR: connection: pass the thread number to conn_delete_from_tree()
- MEDIUM: connection: move idle connection trees to ceb64
- MEDIUM: connection: reintegrate conn_hash_node into connection
- CLEANUP: tools: use the item API for the file names tree
- CLEANUP: vars: use the item API for the variables trees
- BUG/MEDIUM: pattern: fix possible infinite loops on deletion
- CI: scripts: add support for git in openssl builds
- CI: github: add an OpenSSL + ECH job
- CI: scripts: mkdir BUILDSSL_TMPDIR
- Revert "BUG/MEDIUM: pattern: fix possible infinite loops on deletion"
- BUG/MEDIUM: pattern: fix possible infinite loops on deletion (try 2)
- CLEANUP: log: remove deadcode in px_parse_log_steps()
- MINOR: counters: document that tg shared counters are tied to shm-stats-file mapping
- DOC: internals: document the shm-stats-file format/mapping
- IMPORT: ebtree: delete unusable ebpttree.c
- IMPORT: eb32/eb64: reorder the lookup loop for modern CPUs
- IMPORT: eb32/eb64: use a more parallelizable check for lack of common bits
- IMPORT: eb32: drop the now useless node_bit variable
- IMPORT: eb32/eb64: place an unlikely() on the leaf test
- IMPORT: ebmb: optimize the lookup for modern CPUs
- IMPORT: eb32/64: optimize insert for modern CPUs
- IMPORT: ebtree: only use __builtin_prefetch() when supported
- IMPORT: ebst: use prefetching in lookup() and insert()
- IMPORT: ebtree: Fix UB from clz(0)
- IMPORT: ebtree: add a definition of offsetof()
- IMPORT: ebtree: replace hand-rolled offsetof to avoid UB
- MINOR: listener: add the "cc" bind keyword to set the TCP congestion controller
- MINOR: server: add the "cc" keyword to set the TCP congestion controller
- BUG/MEDIUM: ring: invert the length check to avoid an int overflow
- MINOR: trace: don't call strlen() on the thread-id numeric encoding
- MINOR: trace: don't call strlen() on the function's name
- OPTIM: sink: reduce contention on sink_announce_dropped()
- OPTIM: sink: don't waste time calling sink_announce_dropped() if busy
- CLEANUP: ring: rearrange the wait loop in ring_write()
- OPTIM: ring: always relax in the ring lock and leader wait loop
- OPTIM: ring: check the queue's owner using a CAS on x86
- OPTIM: ring: avoid reloading the tail_ofs value before the CAS in ring_write()
- BUG/MEDIUM: sink: fix unexpected double postinit of sink backend
- MEDIUM: stats: consider that shared stats pointers may be NULL
- BUG/MEDIUM: http-client: Fix the test on the response start-line
- MINOR: acme: acme-vars allow to pass data to the dpapi sink
- MINOR: acme: check acme-vars allocation during escaping
- BUG/MINOR: acme/cli: wrong description for "acme challenge_ready"
- CI: move VTest preparation & friends to dedicated composite action
- BUG/MEDIUM: stick-tables: Don't let table_process_entry() handle refcnt
- BUG/MINOR: compression: Test payload size only if content-length is specified
- BUG/MINOR: pattern: Properly flag virtual maps as using samples
- BUG/MINOR: acme: possible overflow on scheduling computation
- BUG/MINOR: acme: possible overflow in acme_will_expire()
- CLEANUP: acme: acme_will_expire() uses acme_schedule_date()
- BUG/MINOR: pattern: Fix pattern lookup for map with opt@ prefix
- CI: scripts: build curl with ECH support
- CI: github: add curl+ech build into openssl-ech job
- BUG/MEDIUM: ssl: ca-file directory mode must read every certificates of a file
- MINOR: acme: provider-name for dpapi sink
- BUILD: acme: fix false positive null pointer dereference
- MINOR: backend: srv_queue helper
- MINOR: backend: srv_is_up converter
- BUILD: halog: misleading indentation in halog.c
- CI: github: build halog on the vtest job
- BUG/MINOR: acme: don't unlink from acme_ctx_destroy()
- BUG/MEDIUM: acme: cfg_postsection_acme() don't init correctly acme sections
- MINOR: acme: implement "reuse-key" option
- ADMIN: haproxy-dump-certs: implement a certificate dumper
- ADMIN: dump-certs: don't update the file if it's up to date
- ADMIN: dump-certs: create files in a tmpdir
- ADMIN: dump-certs: fix lack of / in -p
- ADMIN: dump-certs: use same error format as haproxy
- ADMIN: reload: add a synchronous reload helper
- BUG/MEDIUM: acme: free() of i2d_X509_REQ() with AWS-LC
- ADMIN: reload: introduce verbose and silent mode
- ADMIN: reload: introduce -vv mode
- MINOR: mt_list: Implement MT_LIST_POP_LOCKED()
- BUG/MEDIUM: stick-tables: Make sure not to free a pending entry
- MINOR: sched: let's permit to share the local ctx between threads
- MINOR: sched: pass the thread number to is_sched_alive()
- BUG/MEDIUM: wdt: improve stuck task detection accuracy
- MINOR: ssl: add the ssl_bc_sni sample fetch function to retrieve backend SNI
- MINOR: rawsock: introduce CO_RFL_TRY_HARDER to detect closures on complete reads
- MEDIUM: ssl: don't always process pending handshakes on closed connections
- MEDIUM: servers: Schedule the server requeue target on creation
- MEDIUM: fwlc: Make it so fwlc_srv_reposition works with unqueued srv
- BUG/MEDIUM: fwlc: Handle memory allocation failures.
- DOC: config: clarify some known limitations of the json_query() converter
- BUG/CRITICAL: mjson: fix possible DoS when parsing numbers
- BUG/MINOR: h2: forbid 'Z' as well in header field names checks
- BUG/MINOR: h3: forbid 'Z' as well in header field names checks
- BUG/MEDIUM: resolvers: break an infinite loop in resolv_get_ip_from_response()
The fix in 3023e98199 ("BUG/MINOR: resolvers: Restore round-robin
selection on records in DNS answers") still contained an issue not
addressed by f6dfbbe870 ("BUG/MEDIUM: resolvers: Test for empty tree
when getting a record from DNS answer"). Indeed, if the next element
is the same as the first one, then we can end up with an endless loop
because the test at the end compares the next pointer (possibly null)
with the end one (first).
Let's move the null->first transition to the end. This must be
backported where the patches above were backported (3.2 for now).
The current tests in _h3_handle_hdr() and h3_trailers_to_htx() check
for an interval between 'A' and 'Z' for letters in header field names
that should be forbidden, but mistakenly leave the 'Z' out of the
forbidden range, resulting in it being implicitly valid.
This has no real consequences but should be fixed for the sake of
protocol validity checking.
This must be backported to all relevant versions.
The current tests in h2_make_htx_request(), h2_make_htx_response()
and h2_make_htx_trailers() check for an interval between 'A' and 'Z'
for letters in header field names that should be forbidden, but
mistakenly leave the 'Z' out of the forbidden range, resulting in it
being implicitly valid.
This has no real consequences but should be fixed for the sake of
protocol validity checking.
This must be backported to all relevant versions.
Mjson comes with its own strtod() implementation for portability
reasons and probably also because many generic strtod() versions as
provided by operating systems do not focus on resource preservation
and may call malloc(), which is not welcome in a parser.
The strtod() implementation used here apparently originally comes from
https://gist.github.com/mattn/1890186 and seems to have purposely
omitted a few parts that were considered as not needed in this context
(e.g. skipping white spaces, or setting errno). But when subjected to the
relevant test cases from the file above, the current function provides the
same results.
The aforementioned implementation uses pow() to calculate exponents,
but mjson authors visibly preferred not to introduce a libm dependency
and replaced it with an iterative loop in O(exp) time. The problem is
that the exponent is not bounded and that this loop can take a huge
amount of time. There's even an issue already opened on mjson about
this: https://github.com/cesanta/mjson/issues/59. In the case of
haproxy, fortunately, the watchdog will quickly stop a runaway process
but this remains a possible denial of service.
A first approach would consist in reintroducing pow() like in the
original implementation, but if haproxy is built without Lua nor
51Degrees, -lm is not used so this will not work everywhere.
Anyway here we're dealing with integer exponents, so an easy alternate
approach consists in simply using shifts and squares (exponentiation by
squaring) to compute the exponent in O(log(exp)) time. Not only does it not
introduce any new dependency, it also turns out to be even faster than the
generic pow() (85k req/s per core vs 83.5k on the same machine).
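For illustration, a minimal sketch of the shifts-and-squares idea
(exponentiation by squaring) applied to a decimal exponent; the actual mjson
patch differs in its details:

    /* compute 10^exp in O(log(exp)) multiplications instead of O(exp) */
    static double pow10_by_squaring(unsigned int exp)
    {
        double result = 1.0, base = 10.0;

        while (exp) {
            if (exp & 1)      /* this bit is set: multiply the result */
                result *= base;
            base *= base;     /* square the base for the next bit */
            exp >>= 1;
        }
        return result;
    }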
This must be backported as far as 2.4, where mjson was introduced.
Many thanks to Oula Kivalo for reporting this issue.
CVE-2025-11230 was assigned to this issue.
Oula Kivalo reported that different JSON libraries may process duplicate
keys differently and that most JSON libraries usually decode the stream
before extracting keys, while the current mjson implementation decodes the
contents during extraction instead. Let's document this point so that
users are aware of the limitations and do not rely on the current behavior
and do not use it for what it's not made for (e.g. content sanitization).
This is also the case for jwt_header_query(), jwt_payload_query() and
jwt_verify(), which already refer to this converter for specificities.
Properly handle memory allocation failures by checking the return value
of pool_alloc(), and if it fails, make sure that the caller will take
it into account.
The only use of pool_alloc() in fwlc is to allocate the tree elements in
order to properly queue the server into the ebtree, so if that
allocation fails, just schedule the requeue tasklet, which will try
again until it eventually succeeds.
This should be backported to 3.2.
This should fix github issue #3143.
Modify fwlc_srv_reposition() so that it does not assume that the server
was already queued, and make it work even if s->tree_elt is NULL.
While the server will usually be queued, there is an unlikely
possibility that when the server attempted to get queued as it got up,
it failed due to a memory allocation failure, and it just expects the
server_requeue tasklet to run later to take care of that.
This should be backported to 3.2.
This is part of an attempt to fix github issue #3143
Schedule the server requeue tasklet once the server has been created.
It is possible that when the server went up, it tried to queue itself
into the lb specific code, failed to do so, and expects the tasklet to
run to take care of that.
This should be backported to 3.2.
This is part of an attempt to fix github issue #3143.
If a client aborts a pending SSL connection for whatever reason (timeout
etc) and the listen queue is large, it may inflict a severe load to a
frontend which will waste CPU creating new sessions and then killing the
connection. This is similar to HTTP requests aborted just after being
sent, except that asymmetric crypto is way more expensive.
Unfortunately "option abortonclose" has no effect on this, because it
only applies at a higher level.
This patch ensures that handshakes being received on a frontend having
"option abortonclose" set will be checked for a pending close, and if
this is the case, then the connection will be aborted before the heavy
calculations. The principle is to use recv(MSG_PEEK) to detect the end,
and to destroy the pending handshake data before returning to the SSL
library so that it cannot start computing, notices the error and stops.
We don't do it without abortonclose though, because this can be used for
health checks from other haproxy nodes or even other components which
just want to see a handshake succeed.
This is in relation with GH issue #3124.
Normally, when reading a full buffer, or exactly the requested size, it
is not really possible to know if the peer had closed immediately after,
and usually we don't care. There's a problematic case, though, which is
with SSL: the SSL layer reads in small chunks of a few bytes, and can
consume a client_hello this way, then start computation without knowing
yet that the client has aborted. In order to permit knowing more, we now
introduce a new read flag, CO_RFL_TRY_HARDER, which says that if we've
read up to the permitted limit and the flag is set, then we attempt one
extra byte using MSG_PEEK to detect whether the connection was closed
immediately after that content or not. The first use case will obviously
be related to SSL and client_hello, but it might possibly also make sense
on HTTP responses to detect a pending FIN at the end of a response (e.g.
if a close was already advertised).
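As an illustration, a minimal sketch of the peek-based detection; the real
rawsock code is of course more involved:

    #include <sys/socket.h>

    /* returns 1 if the peer already closed right behind the data we just
     * read, 0 if not (or not known yet) */
    static int fin_pending(int fd)
    {
        char tmp;

        return recv(fd, &tmp, 1, MSG_PEEK | MSG_DONTWAIT) == 0;
    }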
Sometimes in order to debug certain difficult situations it can be useful
to know what SNI was configured on a connection going to a server, for
example to match it against what the server saw or to detect cases where
a server would route on SNI instead of Host. This sample fetch function
simply retrieves the SNI configured on the backend connection, if any.
The fact that the watchdog timer measures the execution time from the
last return from the poller tends to amplify the impact of multiple
bad tasks, and may explain some of the panics reported by Felipe and
Ricardo in GH issues #3084, #3092 and #3101. The problem is that we
check the time if we see that the scheduler appears not to be moving
anymore, but one situation may still arise and catch a bad task:
- one slow task takes so long a time that it triggers the watchdog
twice, emitting a warning the second time (~200ms). The scheduler
is rightfully marked as stuck.
- then it completes and the scheduler is no longer stuck. Many other
tasks run in turn, they all take quite some time but not enough to
trigger a warning. But collectively their cost adds up.
- then a task takes more than the warning time (100ms), and causes
the total execution time to cross the second. The watchdog is
called, sees that we've spent more than 1 second since we left the
poller, and marks the thread as stuck.
- the task is not finished, the watchdog is called again, sees more
than one second with a stuck thread and panics 100ms later.
The total time away from the poller is indeed more than one second,
which is very bad, but no single task caused this individually, and
while the warnings are OK, the watchdog should not panic in this case.
This patch revisits the approach to store the moment the scheduler was
marked as stuck in the wdt context. The idea is that this date will be
used to detect warnings and panics. And by doing so and exploiting the
new is_sched_alive(thr), we can greatly simplify the mechanism so that
the signal handling thread does the strict minimum (mark the scheduler
as possibly stuck and update the stuck_start date), and only bounces to
the reporting thread if the scheduler made no progress since last call.
This means that without even doing computations in the handling thread,
we can continue to avoid all bounces unless a warning is required. Then
when the reporting thread is signaled, it will check the dates from the
last moment the scheduler was marked, and will decide to warn or panic.
The panic decision continues to pass via a TH_FL_STUCK flag to probe the
code so that exceptionally slow code (e.g. live cert generation etc) can
still find a way to avoid the panic if absolutely certain that things
are still moving.
This means that now we have the guarantee that panics will only happen
if a given task spends more than one full second not moving, and that
warnings will be issued for other calls crossing the warn delay boundary.
This was tested using artificially slow operations, and all combinations
which individually took less than a second only resulted in floods of
warnings even if the total reported time in the warning was much higher,
while those above one second provoked the panic.
One improvement could consist in reporting the time since last stuck
in the thread dumps to differentiate the individual task from the whole
set.
This needs to be backported to 3.2 along with the two previous patches:
MINOR: sched: let's permit to share the local ctx between threads
MINOR: sched: pass the thread number to is_sched_alive()
Now it will be possible to query any thread's scheduler state, not
only the current one. This aims at simplifying the watchdog checks
for reported threads. The operation is now a simple atomic xchg.
The watchdog timer has to go through complex operations due to not being
able to check if another thread's scheduler is still ticking. This is
simply because the scheduler status is marked as thread-local while it
could in fact also be an array. Let's do that (and align the array to
avoid false sharing) so that it's now possible to check any scheduler's
status.
There is a race condition: an entry can be free'd by stksess_kill()
between the time stktable_add_pend_updates() gets the entry from the
mt_list, and the time it adds it to the ebtree.
To prevent this, use the newly implemented MT_LIST_POP_LOCKED() to keep
the stksess locked until it is added to the tree. That way,
__stksess_kill() will wait until we're done with it.
This should be backported to 3.2.
Implement MT_LIST_POP_LOCKED(), that behaves as MT_LIST_POP() and
removes the first element from the list, if any, but keeps it locked.
This should be backported to 3.2, as it will be used in a bug fix in the
stick tables that affects 3.2 too.
The -v verbose mode displays the loading messages returned by the master
CLI reload command upon error.
The new -vv mode displays the loading messages even upon success,
showing the content of `show startup-logs` after the reload attempt.
By default haproxy-reload only displays the errors emitted by
haproxy-reload itself, not those emitted by haproxy.
-s silent mode, don't display any error
-v verbose mode, display the loading messages returned by the master CLI
reload command upon error.
When using AWS-LC, the free() of the data ptr resulting from
i2d_X509_REQ() might crash, because it uses the free() of the libc
instead of OPENSSL_free().
It does not seem to be a problem on OpenSSL builds.
Must be backported in 3.2.
Replace error/notice by [ALERT]/[WARNING]/[NOTICE] like it's done in
haproxy.
ALERT means a failure and the program will exit with status 1 just after it.
WARNING lets the execution of the program continue.
NOTICE lets the execution continue as well.
Files dumped from the socket are put in a temporary directory, which is
then removed upon exit.
Variables were renamed to be clearer:
- crt_filename -> prev_crt
- key_filename -> prev_key
- ${crt_filename}.${tmp} -> new_crt
- ${key_filename}.${tmp} -> new_key
Compare the fingerprint of the leaf certificate to the previous file to
check if it needs to be updated or not.
Also skip the check if no file is on the disk.
haproxy-dump-certs is a bash script that connects to your master socket
or your stat socket in order to dump certificates from haproxy memory to
the corresponding files.
The cfg_postsection_acme() redefines its own cur_acme variable, pointing
to the first acme section created. Meaning that the first section would
be initialized multiple times, and the next sections would never be
initialized.
It could result in crashes at the first use of any section that is not
the first one.
Must be backported in 3.2
Unlinking the acme_ctx element from acme_ctx_destroy() requires to have
the element unlocked, because MT_LIST_DELETE() locks the element.
acme_ctx_destroy() frees the data from acme_ctx with the ctx still
linked and unlocked, then locks it to unlink it. So there's a small risk of
accessing acme_ctx from somewhere else. The only way to do that would be
to use the `acme challenge_ready` CLI command at the same time.
Fix the issue by doing a mt_list_unlock_link() and a
mt_list_unlock_self() to unlink the element under the lock, then destroy
the element.
This must be backported in 3.2.
admin/halog/halog.c: In function 'filter_count_url':
admin/halog/halog.c:1685:9: error: this 'if' clause does not guard... [-Werror=misleading-indentation]
1685 | if (unlikely(!ustat))
| ^~
admin/halog/halog.c:1687:17: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'
1687 | if (unlikely(!ustat)) {
| ^~
This patch fixes the indentation.
Must be backported where fbd0fb20a22 ("BUG/MINOR: halog: Add OOM checks
for calloc() in filter_count_srv_status() and filter_count_url()") was
backported.
There is currently an srv_queue converter which is capable of taking the
output of a dynamic name and determining the queue length for a given
server. In addition there is a sample fetcher for whether a server is
currently up. This simply combines the two such that srv_is_up can be
used as a converter too.
Future work might extend this to other sample fetchers for servers, but
this is probably the most useful for acl routing.
In preparation for providing further server converters, split out the code
that finds the server from the sample.
Additionally, update the documentation for srv_queue converter to note
security concerns.
src/acme.c: In function ‘cfg_parse_acme_vars_provider’:
src/acme.c:471:9: error: potential null pointer dereference [-Werror=null-dereference]
471 | free(*dst);
| ^~~~~~~~~~
gcc13 on ubuntu 24.04 detects a false positive when building
3e72a9f ("MINOR: acme: provider-name for dpapi sink").
Indeed dst can't be NULL. Clarify the code so gcc doesn't complain
anymore.
Like "acme-vars", the "provider-name" in the acme section is used in
case of DNS-01 challenge and is sent to the dpapi sink.
This is used to pass the name of a DNS provider in order to choose the
DNS API to use.
This patch implements cfg_parse_acme_vars_provider(), which parses
either the acme-vars or provider-name options and escapes their strings.
Example:
$ ( echo "@@1 show events dpapi -w -0"; cat - ) | socat /tmp/master.sock - | cat -e
<0>2025-09-18T17:53:58.831140+02:00 acme deploy foobpar.pem thumbprint gDvbPL3w4J4rxb8gj20mGEgtuicpvltnTl6j1kSZ3vQ$
acme-vars "var1=foobar\"toto\",var2=var2"$
provider-name "godaddy"$
{$
"identifier": {$
"type": "dns",$
"value": "example.com"$
},$
"status": "pending",$
"expires": "2025-09-25T14:41:57Z",$
[...]
The httpclient is configured with @system-ca by default, which uses the
directory returned by X509_get_default_cert_dir().
On debian/ubuntu systems, this directory contains multiple certificate
files that are loaded successfully. However it seems that on other
systems the files in this directory are the direct result of
ca-certificates instead of its source, meaning that you would only have
a bundle file with every certificate in it.
The loading was not done correctly in case of directory loading, and was
only loading the first certificate of each file.
This patch fixes the issue by using X509_STORE_load_locations() on each
file from the scandir instead of trying to load it manually with BIO.
Note that we can't use X509_STORE_load_locations with the `dir` argument,
which would be simpler, because it uses X509_LOOKUP_hash_dir() which
requires a directory in hash form. That wouldn't be suited for this use
case.
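For illustration, a minimal sketch of the per-file loading, assuming <path>
is one file returned by the scandir (error handling omitted):

    #include <openssl/ssl.h>

    /* load every certificate contained in one file into the SSL_CTX store;
     * X509_STORE_load_locations() reads all certificates of the file, unlike
     * a single PEM read through a BIO which stops at the first one */
    static int load_ca_file(SSL_CTX *ctx, const char *path)
    {
        X509_STORE *store = SSL_CTX_get_cert_store(ctx);

        return X509_STORE_load_locations(store, path, NULL);
    }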
Must be backported to every stable branch.
Fix issue #3137.
Add a script to build curl with ECH support. To specify the path of the
openssl+ECH library, you should set the SSL_LIB variable to the prefix
of the library.
Example:
SSL_LIB=/opt/openssl-ech CURL_DESTDIR=/opt/curl-ech/ ./build-curl.sh
When we look for a map file reference, the file@ prefix is removed because
it may be omitted. The same is true for the opt@ prefix. However this case was
not properly handled in pat_ref_lookup(). Let's do so.
This patch must be backported as far as 3.0.
The date computation in acme_will_expire() and acme_schedule_date() is
the same. Call acme_schedule_date() from acme_will_expire() and make the
functions static. The patch also moves the functions into the right
order.
acme_will_expire() computes the schedule date using notAfter and
notBefore from the certificate. However notBefore could be greater than
notAfter and could result in an overflow.
This is unlikely to happen and would mean an incorrect certificate.
This patch fixes the issue by checking that notAfter > notBefore.
It also replaces the int type with a time_t to avoid an overflow on 64-bit
architectures, which is also unlikely to happen with certificates.
`(date.tv_sec + diff > notAfter)` was also replaced by `if (notAfter -
diff <= date.tv_sec)` to avoid an overflow.
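A minimal sketch of the overflow-safe test, assuming <diff> is the renewal
margin derived from the certificate validity period (illustrative only):

    #include <time.h>

    static int will_expire(time_t notBefore, time_t notAfter, time_t diff, time_t now)
    {
        if (notAfter <= notBefore)
            return 1;             /* bogus certificate, treat as expiring */

        /* "now + diff > notAfter" may overflow; "notAfter - diff <= now"
         * cannot, since diff is bounded by notAfter - notBefore */
        return notAfter - diff <= now;
    }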
Fix issue #3135.
Need to be backported to 3.2.
acme_schedule_date() computes the schedule date using notAfter and
notBefore from the certificate. However notBefore could be greater than
notAfter and could result in an overflow.
This is unlikely to happen and would mean an incorrect certificate.
This patch fixes the issue by checking that notAfter > notBefore.
It also replaces the int type with a time_t to avoid an overflow on 64-bit
architectures, which is also unlikely to happen with certificates.
Fix issue #3136.
Need to be backported to 3.2.
When a map file is loaded, internally, the pattern reference is flagged as
based on a sample. However this is not performed for virtual maps. This flag
is only used during startup to check the map compatibility when it is used at
different places. At runtime this does not change anything, but errors can
be triggered during configuration parsing. For instance, the following valid
config will trigger an error:
http-request set-map(virt@test) foo bar if !{ str(foo),map(virt@test) -m found }
http-request set-var(txn.foo) str(foo),map(virt@test)
The fix is quite obvious: the PAT_REF_SMP flag must be set for virtual maps as
for any other map.
A workaround is to use an optional map (opt@...), making sure the map id cannot
reference an existing file.
This patch must be backported as far as 3.0.
When a minimum size is defined to perform the compression, the message
payload size is tested. To do so, information from the HTX message is used to
determine the message length. However it is performed regardless of whether
the payload length is fully known or not. Concretely, the test must only be
performed when a content-length value was specified or when the message was
fully received (EOM flag set). Otherwise, we are unable to determine the real
payload length.
Because of this bug, compression may be skipped for a large chunked message
because the first chunks received are too small. But this does not mean the
whole message is small.
This patch must be backported to 3.2.
Instead of having table_process_entry() decrement the session's ref
counter, do it outside, from the caller. Some were missed, such as when
an action was invalid, which would lead to the ref counter not being
decremented, and the session not being destroyable.
It makes more sense to do that from the caller, who just obtained the
ref counter, anyway.
This should be backported up to 2.8.
The "acme challenge_ready" command mistakenly use the description of the
"acme status" command. This patch adds the right description.
Must be backported to 3.2.
Handle allocation failures properly during acme-vars parsing.
Check if we have an allocation failure in both the malloc and the
realloc and emit an error if that's the case.
In the case of the dns-01 challenge, the agent that handles the
challenge might need some extra information which depends on the DNS
provider.
This patch introduces the "acme-vars" option in the acme section, which
allows to pass these data to the dpapi sink. The double quotes will be
escaped when printed in the sink.
Example:
global
setenv VAR1 'foobar"toto"'
acme LE
directory https://acme-staging-v02.api.letsencrypt.org/directory
challenge DNS-01
acme-vars "var1=${VAR1},var2=var2"
Would output:
$ ( echo "@@1 show events dpapi -w -0"; cat - ) | socat /tmp/master.sock - | cat -e
<0>2025-09-18T17:53:58.831140+02:00 acme deploy foobpar.pem thumbprint gDvbPL3w4J4rxb8gj20mGEgtuicpvltnTl6j1kSZ3vQ$
acme-vars "var1=foobar\"toto\",var2=var2"$
{$
"identifier": {$
"type": "dns",$
"value": "example.com"$
},$
"status": "pending",$
"expires": "2025-09-25T14:41:57Z",$
[...]
The commit 88aa7a780 ("MINOR: http-client: Trigger an error if first
response block isn't a start-line") introduced a bug. From an endpoint, an
applet or a mux, the <first> index must never be used. It is reserved to the
HTTP analyzers. From an endpoint, this value may be undefined or just point to
any block other than the first one. Instead we must always get the head
block.
In that case, to be sure the first HTX block in a response is a start-line,
we must use the htx_get_head_type() function instead of htx_get_first_type().
Otherwise, we can trigger an error while the response is in fact properly
formatted.
It is a 3.3-specific issue. No backport needed.
This patch looks huge, but it has a very simple goal: protect all
accesses to shared stats pointers (either reads or writes), because
we now consider that these pointers may be NULL.
The reason behind this is that despite all precautions taken to ensure the
pointers shouldn't be NULL when not expected, there are still corner
cases (ie: frontend stats used on a backend with no FE cap and vice
versa) where we could try to access a memory area which is not
allocated. Willy stumbled on such cases while playing with ring
servers upon connection error, which eventually led to process crashes
(since 3.3, when shared stats were implemented).
Also, we may decide later that shared stats are optional and should
be disabled on the proxy to save memory and CPU, and this patch is
a step further towards that goal.
So in essence, this patch ensures shared stats pointers are always
initialized (including NULL), and adds necessary guards before shared
stats pointers are de-referenced. Since we already had some checks
for backends and listeners stats, and the pointer address retrieval
should stay in cpu cache, let's hope that this patch doesn't impact
stats performance much.
Willy experienced an unexpected behavior with the config below:
global
stats socket :1514
ring buf1
server srv1 127.0.0.1:1514
Indeed, haproxy would connect to the ring server twice since commit 23e5f18b
("MEDIUM: sink: change the sink mode type to PR_MODE_SYSLOG"), and one of the
connections would report errors.
The reason behind this is that, despite the above commit saying no change of
behavior is expected, with the sink forward_px proxy now being set with
PR_MODE_SYSLOG, postcheck_log_backend() was being automatically executed in
addition to the manual cfg_post_parse_ring() function for each "ring" section.
The consequence is that sink_finalize() was called twice for a given "ring"
section, which means the connection init would be triggered twice, which in
turn resulted in the behavior described above, plus possible unexpected
side-effects.
To fix the issue, when we create the forward_px proxy, we now set the
PR_CAP_INT capability on it to tell haproxy not to automatically manage the
proxy (ie: to skip the automatic log backend postinit), because we are about
to manually manage the proxy from the sink API.
No backport needed, this bug is specific to 3.3
The load followed by the CAS seems to cause two bus cycles, one to
retrieve the cache line in shared state and a second one to get
exclusive ownership of it. Tests show that on x86 it's much better
to just rely on the previous value and preset it to zero before
entering the loop. We just mask the ring lock in case of failure
so as to challenge it on next iteration and that's done.
This little change brings 2.3% extra performance (11.34M msg/s) on
a 64-core AMD.
In the loop where the queue's leader tries to get the tail lock,
we also need to check if another thread took ownership of the queue
the current thread is currently working for. This is currently done
using an atomic load.
Tests show that on x86, using a CAS for this is much more efficient
because it allows to keep the cache line in exclusive state for a
few more cycles that permit the queue release call after the loop
to be done without having to wait again. The measured gain is +5%
for 128 threads on a 64-core AMD system (11.08M msg/s vs 10.56M).
However, ARM loses about 1% on this, and we cannot afford that on
machines without a fast CAS anyway, so the load is performed using
a CAS only on x86_64. It might not be as efficient on low-end models
but we don't care since they are not the ones dealing with high
contention.
Tests have shown that AMD systems really need to use a cpu_relax()
in these two loops. The performance improves from 10.03 to 10.56M
messages per second (+5%) on a 128-thread system, without affecting
intel nor ARM, so let's do this.
The loop is constructed in a complicated way with a single break
statement in the middle and many continue statements everywhere,
making it hard to better factor between variants. Let's first
reorganize it so as to make it easier to escape when the ring
tail lock is obtained. The sequence of instructions remains the
same; it's only better organized.
If we see that another thread is already busy trying to announce the
dropped counter, there's no point going there, so let's just skip all
that operation from sink_write() and avoid disturbing the other thread.
This results in a boost from 244 to 262k req/s.
perf top shows that sink_announce_dropped() consumes most of the CPU
on a 128-thread x86 system. Digging further reveals that the atomic
fetch_or() on the dropped field used to detect the presence of another
thread is entirely responsible for this. Indeed, the compiler implements
it using a CAS that loops without relaxing and makes all threads wait
until they can synchronize on this one, only to discover later that
another thread is there and they need to give up.
Let's just replace this with a hand-crafted CAS loop that will detect
*before* attempting the CAS if another thread is there. Doing so
achieves the same goal without forcing threads to agree. With this
simple change, the sustained request rate on h1 with all traces on
bumped from 110k/s to 244k/s!
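As an illustration, a minimal sketch of the idea using the gcc builtins (the
real code uses haproxy's atomic macros on the sink's dropped counter, and the
"busy" bit below is made up for the example):

    #include <stdint.h>

    #define ANNOUNCE_BUSY 0x80000000u   /* illustrative "a thread is announcing" bit */

    /* try to become the announcing thread; give up immediately if another
     * thread is already there, instead of fighting for the cache line */
    static int try_announce(uint32_t *dropped)
    {
        uint32_t old = __atomic_load_n(dropped, __ATOMIC_RELAXED);

        do {
            if (old & ANNOUNCE_BUSY)    /* checked *before* attempting the CAS */
                return 0;
        } while (!__atomic_compare_exchange_n(dropped, &old, old | ANNOUNCE_BUSY,
                                              0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
        return 1;
    }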
This should be backported to stable releases where it's often needed
to help debugging.
Currently there's a small mistake in the way the trace function and
macros handle the calling function name. The name is known as a constant
up to the macro and passed as-is to the __trace() function. That one needs
to know its length and will call ist() on it, resulting in a real call
to strlen() while that length was known before the call. Let's use
an ist instead of a const char* for __trace() and __trace_enabled()
so that we can now completely avoid calling strlen() during this
operation. This has significantly reduced the importance of
__trace_enabled() in perf top.
In __trace(), we're building a short string for the thread id, but this one
is passed through strlen() in the call to ist() because it's not a
constant. We do know that it's exactly 3 chars long so we can manage
this using ist2() and pass it the length instead in order to reduce
the number of calls to strlen().
Also let's note that the thread number will no longer be numeric for
thread numbers above 100.
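For illustration, a minimal sketch of the difference using haproxy's ist API
(struct ist pairs a pointer with a length; the 3-char encoding below is the
assumption explained above):

    #include <import/ist.h>   /* haproxy's indirect string type */

    void example(const char *thread_id_str)
    {
        /* ist() must call strlen() because it only receives a pointer */
        struct ist a = ist(thread_id_str);

        /* ist2() takes the already-known length, so no strlen() is needed */
        struct ist b = ist2(thread_id_str, 3);

        (void)a; (void)b;
    }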
Vincent Gramer reported in GH issue #3125 a case of crash on a BUG_ON()
condition in the rings. What happens is that a message that is one byte
less than the maximum ring size is emitted, and it passes all the checks,
but once inflated by the extra +1 for the refcount, it can no longer. But
the check was made based on message size compared to space left, except
that this space left can now be negative, which is a high positive for
size_t, so the check remained valid and triggered a BUG_ON() later.
Let's compute the size the other way around instead (i.e. current +
needed) since we can't have rings as large as half of the memory space
anyway, thus we have no risk of overflow on this one.
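A minimal sketch of the inverted check (variable names are illustrative):

    #include <stddef.h>

    /* returns 1 if a message of <needed> bytes plus 1 byte of refcount fits
     * in a ring of <ring_size> bytes already holding <used> bytes */
    static int msg_fits(size_t used, size_t needed, size_t ring_size)
    {
        /* not: needed + 1 <= ring_size - used  (underflows when used > ring_size
         * and becomes a huge size_t value, making the check pass by accident) */
        return used + needed + 1 <= ring_size;
    }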
This needs to be backported to all versions supporting multi-threaded
rings (3.0 and above).
Thanks to Vincent for the easy and working reproducer.
It is possible on at least Linux and FreeBSD to set the congestion control
algorithm to be used with outgoing connections, among the list of supported
and permitted ones. Let's expose this setting with "cc". Unknown or
forbidden algorithms will be ignored and the default one will continue to
be used.
It is possible on at least Linux and FreeBSD to set the congestion control
algorithm to be used with incoming connections, among the list of supported
and permitted ones. Let's expose this setting with "cc". Permission issues
might be reported (as warnings).
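On Linux this essentially boils down to a setsockopt() call; a minimal sketch
(FreeBSD exposes the same socket option name):

    #include <string.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* try to apply the congestion control algorithm named by the "cc" keyword;
     * unknown or forbidden algorithms are simply ignored */
    static void set_cc(int fd, const char *algo)
    {
        if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo)) < 0)
            ; /* keep the kernel's default, possibly emit a warning */
    }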
The C standard specifies that it's undefined behavior to dereference
NULL (even if you use & right after). The hand-rolled offsetof idiom
&(((s*)NULL)->f) is thus technically undefined. This clutters the
output of UBSan and is simple to fix: just use the real offsetof when
it's available.
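A minimal sketch of the change (the macro name is illustrative):

    #include <stddef.h>

    /* the hand-rolled idiom dereferences NULL, which is technically UB:
     *   #define my_offsetof(s, f)  ((size_t)&(((s *)NULL)->f))
     * prefer the builtin/standard version when available: */
    #if defined(__GNUC__)
    #define my_offsetof(s, f)  __builtin_offsetof(s, f)
    #else
    #define my_offsetof(s, f)  offsetof(s, f)
    #endif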
Note that there's no clear statement about this point in the spec,
only several points which together converge to this:
- From N3220, 6.5.3.4:
A postfix expression followed by the -> operator and an identifier
designates a member of a structure or union object. The value is
that of the named member of the object to which the first expression
points, and is an lvalue.
- From N3220, 6.3.2.1:
An lvalue is an expression (with an object type other than void) that
potentially designates an object; if an lvalue does not designate an
object when it is evaluated, the behavior is undefined.
- From N3220, 6.5.4.4 p3:
The unary & operator yields the address of its operand. If the
operand has type "type", the result has type "pointer to type". If
the operand is the result of a unary * operator, neither that operator
nor the & operator is evaluated and the result is as if both were
omitted, except that the constraints on the operators still apply and
the result is not an lvalue. Similarly, if the operand is the result
of a [] operator, neither the & operator nor the unary * that is
implied by the [] is evaluated and the result is as if the & operator
were removed and the [] operator were changed to a + operator.
=> In short, this is saying that C guarantees these identities:
1. &(*p) is equivalent to p
2. &(p[n]) is equivalent to p + n
As a consequence, &(*p) doesn't result in the evaluation of *p, only
the evaluation of p (and similar for []). There is no corresponding
special carve-out for ->.
See also: https://pvs-studio.com/en/blog/posts/cpp/0306/
After this patch, HAProxy can run without crashing after building w/
clang-19 -fsanitize=undefined -fno-sanitize=function,alignment
This is ebtree commit bd499015d908596f70277ddacef8e6fa998c01d5.
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is ebtree commit 5211c2f71d78bf546f5d01c8d3c1484e868fac13.
We'll use this to improve the definition of container_of(). Let's define
it if it does not exist. We can rely on __builtin_offsetof() on recent
enough compilers.
This is ebtree commit 1ea273e60832b98f552b9dbd013e6c2b32113aa5.
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is ebtree commit 69b2ef57a8ce321e8de84486182012c954380401.
From 'man gcc': passing 0 as the argument to "__builtin_ctz" or
"__builtin_clz" invokes undefined behavior. This triggers UBsan
in HAProxy.
[wt: tested in treebench and verified not to cause any performance
regression with opstime-u32 nor stress-u32]
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is ebtree commit 8c29daf9fa6e34de8c7684bb7713e93dcfe09029.
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is ebtree commit cf3b93736cb550038325e1d99861358d65f70e9a.
While the previous optimizations couldn't be preserved due to the
possibility of out-of-bounds accesses, at least the prefetch is useful.
A test on treebench shows that for 64k short strings, the lookup time
falls from 276 to 199ns per lookup (28% savings), and the insert falls
from 311 to 296ns (4.9% savings), which are pretty respectable, so
let's do this.
This is ebtree commit b44ea5d07dc1594d62c3a902783ed1fb133f568d.
It looks like __builtin_prefetch() appeared in gcc-3.1 as there's no
mention of it in 3.0's doc. Let's replace it with eb_prefetch() which
maps to __builtin_prefetch() on supported compilers and falls back to
the usual do{}while(0) on other ones. It was tested to properly build
with tcc as well as gcc-2.95.
This is ebtree commit 7ee6ede56a57a046cb552ed31302b93ff1a21b1a.
Similar to previous patches, let's improve the insert() descent loop to
avoid discovering mandatory data too late. The change here is even
simpler than previous ones: a prefetch was installed and troot is
calculated before the last instruction in a speculative way. This was enough
to gain +50% insertion rate on random data.
This is ebtree commit e893f8cc4d44b10f406b9d1d78bd4a9bd9183ccf.
This is the same principles as for the latest improvements made on
integer trees. Applying the same recipes made the ebmb_lookup()
function jump from 10.07 to 12.25 million lookups per second on a
10k random values tree (+21.6%).
It's likely that the ebmb_lookup_longest() code could also benefit
from this, though this was neither explored nor tested.
This is ebtree commit a159731fd6b91648a2fef3b953feeb830438c924.
In the loop we can help the compiler build slightly more efficient code
by placing an unlikely() around the leaf test. This shows a consistent
0.5% performance gain both on eb32 and eb64.
This is ebtree commit 6c9cdbda496837bac1e0738c14e42faa0d1b92c4.
This one was previously used to preload from the node and keep a copy
in a register on i386 machines with few registers. With the new more
optimal code it's totally useless, so let's get rid of it. By the way
the 64 bit code didn't use that at all already.
This is ebtree commit 1e219a74cfa09e785baf3637b6d55993d88b47ef.
Instead of shifting the XOR value right and comparing it to 1, which
roughly requires 2 sequential instructions, better test if the XOR has
any bit above the current bit, which means any bit set among those
strictly higher, or in other words that XOR & (-bit << 1) is non-zero.
This is one less instruction in the fast path and gives another nice
performance gain on random keys (in million lookups/s):
eb32 1k: 33.17 -> 37.30 +12.5%
10k: 15.74 -> 17.08 +8.51%
100k: 8.00 -> 9.00 +12.5%
eb64 1k: 34.40 -> 38.10 +10.8%
10k: 16.17 -> 17.10 +5.75%
100k: 8.38 -> 8.87 +5.85%
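A minimal sketch of the two forms, where <bit> is the mask of the node's
split bit (1 << node_bit) and <xor> the XOR between the looked-up key and the
node's key (names are illustrative):

    #include <stdint.h>

    static inline int diverges_old(uint32_t xor, unsigned int node_bit)
    {
        return (xor >> node_bit) > 1;      /* shift, then compare: two dependent ops */
    }

    static inline int diverges_new(uint32_t xor, uint32_t bit)
    {
        /* -(bit << 1) has exactly all bits strictly above the split bit set,
         * so this single mask test checks the same condition */
        return (xor & -(bit << 1)) != 0;
    }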
This is ebtree commit c942a2771758eed4f4584fe23cf2914573817a6b.
The current code calculates the next troot based on a calculation.
This was efficient when the algorithm was developed many years ago
on K6 and K7 CPUs running at low frequencies with few registers and
limited branch prediction units but nowadays with ultra-deep pipelines
and high latency memory that's no longer efficient, because the CPU
needs to have completed multiple operations before knowing which
address to start fetching from. It's sad because we only have two
branches each time but the CPU cannot know it. In addition, the
calculation is performed late in the loop, which does not help the
address generation unit to start prefetching next data.
Instead we should help the CPU by preloading data early from the node
and calculating troot as soon as possible. The CPU will be able to
postpone that processing until the dependencies are available and it
really needs to dereference it. In addition we must absolutely avoid
serializing instructions such as "(a >> b) & 1" because there's no
way for the compiler to parallelize that code nor for the CPU to pre-
process some early data.
What this patch does is relatively simple:
- we try to prefetch the next two branches as soon as the
node is known, which will help dereference the selected node in
the next iteration; it was shown that it only works with the next
changes though, otherwise it can reduce the performance instead.
In practice the prefetching will start a bit later once the node
is really in the cache, but since there's no dependency between
these instructions and any other one, we let the CPU optimize as
it wants.
- we preload all important data from the node (next two branches,
key and node.bit) very early even if not immediately needed.
This is cheap, it doesn't cause any pipeline stall and speeds
up later operations.
- we pre-calculate 1<<bit that we assign into a register, so as
to avoid serializing instructions when deciding which branch to
take.
- we assign the troot based on a ternary operation (or if/else) so
that the CPU knows upfront the two possible next addresses without
waiting for the end of a calculation and can prefetch their contents
every time the branch prediction unit guesses right.
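Put together, the descent step looks roughly like the toy lookup below; it is
a simplified, illustrative sketch whose types and helpers do not match the
real eb32 code:

    #include <stddef.h>
    #include <stdint.h>

    struct tnode {
        struct tnode *branch[2];   /* left/right children */
        uint32_t key;
        unsigned int bit;          /* position of the split bit */
        int leaf;
    };

    static struct tnode *toy_lookup(struct tnode *node, uint32_t x)
    {
        struct tnode *left, *right;
        uint32_t key, bit;

        for (;;) {
            left  = node->branch[0];          /* preload everything early ... */
            right = node->branch[1];
            key   = node->key;
            bit   = 1u << node->bit;          /* ... including the precomputed mask */
            __builtin_prefetch(left);         /* start fetching both candidates */
            __builtin_prefetch(right);

            if (node->leaf)
                return (key == x) ? node : NULL;

            /* ternary: both candidate addresses are known up front, so a
             * correct branch prediction lets the CPU prefetch in time */
            node = (x & bit) ? right : left;
        }
    }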
Just doing this provides significant gains at various tree sizes on
random keys (in million lookups per second):
eb32 1k: 29.07 -> 33.17 +14.1%
10k: 14.27 -> 15.74 +10.3%
100k: 6.64 -> 8.00 +20.5%
eb64 1k: 27.51 -> 34.40 +25.0%
10k: 13.54 -> 16.17 +19.4%
100k: 7.53 -> 8.38 +11.3%
The performance is now much closer to the sequential keys. This was
done for all variants ({32,64}{,i,le,ge}).
Another point, the equality test in the loop improves the performance
when looking up random keys (since we don't need to reach the leaf),
but is counter-productive for sequential keys, which can gain ~17%
without that test. However sequential keys are normally not used with
exact lookups, but rather with lookup_ge() that spans a time frame,
and which does not have that test for this precise reason, so in the
end both use cases are served optimally.
It's interesting to note that everything here is solely based on data
dependencies, and that trying to perform *less* operations upfront
always ends up with lower performance (typically the original one).
This is ebtree commit 05a0613e97f51b6665ad5ae2801199ad55991534.
Since commit 21fd162 ("[MEDIUM] make ebpttree rely solely on eb32/eb64
trees") it was no longer used and no longer builds. The commit message
mentions that the file is no longer needed; probably a rebase failed
and left the file there.
This is ebtree commit fcfaf8df90e322992f6ba3212c8ad439d3640cb7.
Add some documentation about the shm stats file structure to help write
tools that can parse the file to use the shared stats counters.
This file was written for shm stats file version 1.0 specifically; it may
need to be updated when the shm stats file structure changes
in the future.
Let's explicitly mention that fe_counters_shared_tg and
be_counters_shared_tg structs are embedded in shm_stats_file_object
struct so any change in those structs will result in shm stats file
incompatibility between processes, thus extra precaution must be
taken when making changes to them.
Note that the provisioning made in shm_stats_file_object struct could
be used to add members to {fe,be}_counters_shared_tg without changing
shm_stats_file_object struct size if needed in order to preserve
shm stats file version.
When logsteps proxy storage was migrated from eb nodes to bitmasks in
6a92b14 ("MEDIUM: log/proxy: store log-steps selection using a bitmask,
not an eb tree"), some unused eb node related code was left over in
px_parse_log_steps().
Not only is this code unused, it also resulted in wasted memory since
an eb node was allocated for nothing.
This should fix GH #3121
Commit e36b3b60b3 ("MEDIUM: migrate the patterns reference to cebs_tree")
changed the construction of the loops used to look up matching nodes, and
since we don't need two elements anymore, the "continue" statement now
loops on the same element when deleting. Let's fix this to make sure it
passes through the next one.
While this bug is 3.3 only, it turns out that 3.2 is also affected by
the incorrect loop construct in pat_ref_set_from_node(), where it's
possible to run an infinite loop since commit 010c34b8c7 ("MEDIUM:
pattern: consider gen_id in pat_ref_set_from_node()") due to the
"continue" statement being placed before the ebmb_next_dup() call.
As such the relevant part of this fix (pat_ref_set_from_elt) will
need to be backported to 3.2.
This reverts commit 359a829ccb8693e0b29808acc0fa7975735c0353.
The fix is neither sufficient nor correct (it triggers ASAN). Better
redo it cleanly rather than accumulate invalid fixes.
The upcoming ECH feature needs a patched OpenSSL with the "feature/ech"
branch.
This daily job launches an openssl build, as well as haproxy build with
reg-tests.
Add support for git releases downloaded from github in openssl builds:
- GIT_TYPE variable allows you to choose between "branch" or "commit"
- OPENSSL_VERSION variable supports a "git-" prefix
- "git-${commit_id}" is stored in .openssl_version instead of the branch
name for version comparison.
Commit e36b3b60b3 ("MEDIUM: migrate the patterns reference to cebs_tree")
changed the construction of the loops used to look up matching nodes, and
since we don't need two elements anymore, the "continue" statement now
loops on the same element when deleting. Let's fix this to make sure it
passes through the next one.
No backport is needed, this is only 3.3.
The variables trees use the immediate cebtree API; better use the
item one, which is more expressive and safer. The "node" field was
renamed to "name_node" to avoid any ambiguity.
Previously the conn_hash_node was placed outside the connection due
to the big size of the eb64_node that could have negatively impacted
frontend connections. But having it outside also means that one
extra allocation is needed for each backend connection, and that one
memory indirection is needed for each lookup.
With the compact trees, the tree node is smaller (16 bytes vs 40) so
the overhead is much lower. By integrating it into the connection,
we're also eliminating one pointer from the connection to the hash
node and one pointer from the hash node to the connection (in addition
to the extra object bookkeeping). This results in saving at least 24
bytes in total per backend connection, and only inflates connections by
16 bytes (from 240 to 256), which is a reasonable compromise.
Tests on a 64-core EPYC show a 2.4% increase in the request rate
(from 2.08 to 2.13 Mrps).
Idle connection trees currently require a 56-byte conn_hash_node per
connection, which can be reduced to 32 bytes by moving to ceb64. While
ceb64 is theoretically slower, in practice here we're essentially
dealing with trees that almost always contain a single key and many
duplicates. In this case, ceb64 insert and lookup functions become
faster than eb64 ones because all duplicates are a list accessed in
O(1) while it's a subtree for eb64. In tests it is impossible to tell
the difference between the two, so it's worth reducing the memory
usage.
This commit brings the following memory savings to conn_hash_node
(one per backend connection), and to srv_per_thread (one per thread
and per server):
  struct            before   after   delta
  conn_hash_node        56      32     -24
  srv_per_thread        96      72     -24
The delicate part is conn_delete_from_tree(), because we need to
know the tree root the connection is attached to. But thanks to
recent cleanups, it's now clear enough (i.e. idle/safe/avail vs
session are easy to distinguish).
We'll soon need to choose the server's root based on the connection's
flags, and for this we'll need the thread it's attached to, which is
not always the current one. This patch simply passes the thread number
from all callers. They know it because they just set the idle_conns
lock on it prior to calling the function.
Probably due to older code, there's a boolean variable used to set
another one which is then checked. Also the first check is made under
the lock, which is unnecessary. Let's simplify this and use a single
variable. This only makes the code clearer, it doesn't change the output
code.
We'll need to have access to the srv_per_thread element soon from this
function, and there's no particular reason for passing it list pointers
so let's pass the server and the thread so that it is autonomous. It
also makes the calling code simpler.
There were a few leftovers from an earlier version of the conn_hash_node
that was using ebmb nodes. A few calls to ebmb_first() and ebmb_entry()
were still present while acting on an eb64 tree. These are harmless as
one is just eb_first() and the other container_of(), but it's confusing
so let's clean them up.
The connection lookup loop is made of two identical blocks, one looking
in the idle or safe lists and the other one looking into the safe list
only. The second one is skipped if a connection was found or if the request
looks for a safe one (since already done). Also the two are slightly
different due to leftovers from earlier versions in that the second one
checks for safe connections and not the first one, and the second one
sets is_safe which is not used later.
Let's just rationalize all this by placing them in a loop which checks
first from the idle conns and second from the safe ones, or skips the
first step if the request wants a safe connection. This reduces the
code and shortens the time spent under the lock.
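A hypothetical sketch of the resulting shape (names and helpers are
illustrative, not the actual HAProxy code): step 0 scans the idle
connections, step 1 the safe ones, and a request that requires a safe
connection simply starts at step 1.

  #include <stddef.h>

  struct ce_tree;
  struct connection;
  /* hypothetical lookup helper over one of the two trees */
  struct connection *tree_lookup_by_hash(struct ce_tree *t, unsigned long hash);

  static struct connection *reuse_lookup(struct ce_tree *idle, struct ce_tree *safe,
                                         unsigned long hash, int want_safe)
  {
      struct connection *found = NULL;
      int step;

      for (step = want_safe ? 1 : 0; !found && step < 2; step++) {
          struct ce_tree *t = (step == 0) ? idle : safe;

          found = tree_lookup_by_hash(t, hash);
      }
      return found;
  }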
The proxy struct has several small holes that deserved being plugged by
moving a few fields around. Now we're down to 3056 from 3072 previously,
and the remaining holes are small.
At the moment, compared to before this series, we're seeing these
sizes:
  type\size    7d554ca62   current   delta
  listener           752       704     -48 (-6.4%)
  server            4032      3840    -192 (-4.8%)
  proxy             3184      3056    -128 (-4%)
  stktable          3392      3328     -64 (-1.9%)
Configs with many servers have shrunk by about 4% in RAM and configs
with many proxies by about 3%.
The struct server still has a lot of holes and padding that make it
quite big. By moving a few fields around between areas which do not
interact (e.g. boot vs aligned areas), it's quite easy to plug some
of them and/or to arrange larger ones which could be reused later with
a bit more effort. Here we've reduced holes by 40 bytes, allowing the
struct to shrink by one more cache line (64 bytes). The new size is
3840 bytes.
The server ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct server for this node, plus 8 bytes in
struct proxy. The server struct is still 3904 bytes large (due to
alignment) and the proxy struct is 3072.
The listener ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct listener for this node, plus 8 bytes
in struct proxy. The struct listener is now 704 bytes large, and the
struct proxy 3080.
The proxy ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct proxy for this node, plus 8 bytes in
the root (which is static so not much relevant here). Now the proxy is
3088 bytes large.
In srv_parse_id(), there's no point doing all the low-level work with
the tree functions to check for the existence of an ID, we already have
server_find_by_id() which does exactly this, so let's use it.
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated server_get_next_id().
As a bonus it reduces the exposure of the tree's root outside of the functions.
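A hedged sketch of what such a dedicated helper can look like; this is a
naive illustration relying on the existing server_find_by_id() lookup,
whereas the real helper walks the ID tree directly instead of probing one
ID at a time:

  struct proxy;
  struct server;
  struct server *server_find_by_id(struct proxy *bk, int id); /* existing lookup */

  /* return the first ID >= <from> that no server of <bk> uses yet */
  static int next_free_server_id(struct proxy *bk, int from)
  {
      while (server_find_by_id(bk, from))
          from++;
      return from;
  }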
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated listener_get_next_id().
As a bonus it reduces the exposure of the tree's root outside of the functions.
This is used to index the proxy's name and it contains a copy of the
pointer to the proxy's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct proxy, which is now 3112
bytes large.
Here we need to continue to support duplicates since they're still
allowed between type-incompatible proxies.
Interestingly, the use of cebis_next_dup() instead of cebis_next() in
proxy_find_by_name() allows us to get rid of an strcmp() that was
performed for each use_backend rule. A test with a large config
(100k backends) shows that we can get 3% extra performance on a
config involving a static use_backend rule (3.09M to 3.18M rps),
and even 4.5% on a dynamic rule selecting a random backend (2.47M
to 2.59M).
This member is used to index the hostname_dn contents for DNS resolution.
Let's replace it with a cebis_tree to save another 32 bytes (24 for the
node + 8 by avoiding the duplication of the pointer). The struct server is
now at 3904 bytes.
This is used to index the server name and it contains a copy of the
pointer to the server's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct server, which remains 3968
bytes large due to alignment. The proxy struct shrinks by 8 bytes to 3144.
It's worth noting that the current way duplicate names are handled remains
based on the previous mechanism where dups were permitted. Ideally we
should now reject them during insertion and use unique key trees instead.
This contains the text representation of the server's address, for use
with stick-tables with "srvkey addr". Switching them to a compact node
saves 24 more bytes from this structure. The key was moved to an external
pointer "addr_key" right after the node.
The server struct is now 3968 bytes (down from 4032) due to alignment, and
the proxy struct shrinks by 8 bytes to 3152.
The current guid struct size is 56 bytes. Once reduced using compact
trees, it goes down to 32 (almost half). We're not on a critical path
and size matters here, so better switch to this.
It's worth noting that the name part could also be stored in the
guid_node at the end to save 8 extra bytes (no pointer needed anymore),
however the purpose of this struct is to be embedded into other ones,
which is not compatible with having a dynamic size.
Affected struct sizes in bytes:
               Before   After   Diff
  server         4032    4032      0*
  proxy          3184    3160    -24
  listener        752     728    -24
*: struct server is full of holes and padding (176 bytes) and is
64-byte aligned. Moving the guid_node elsewhere such as after sess_conn
reduces it to 3968, or one less cache line. There's no point in moving
anything now because forthcoming patches will arrange other parts.
cebs_tree are 24 bytes smaller than ebst_tree (16B vs 40B), and pattern
references are only used during map/acl updates, so their storage is
pure loss between updates (which most of the time never happen). By
switching their indexing to compact trees, we can save 16 to 24 bytes
per entry depending on alignment (here it's 24 per struct but 16
practical as malloc's alignment keeps 8 unused).
Tested on core i7-8650U running at 3.0 GHz, with a file containing
17.7M IP addresses (16.7M different):
$ time ./haproxy -c -f acl-ip.cfg
This saves 280 MB of RAM for 17.7M IP addresses, and slightly speeds up the
startup (5.8%, from 19.2s to 18.2s), a part of which may possibly be
attributed to having less memory to write. Note that this is on small
strings. On larger ones such as user-agents, ebtree doesn't reread
the whole key and might be more efficient.
Before:
RAM (VSZ/RSS): 4443912 3912444
real 0m19.211s
user 0m18.138s
sys 0m1.068s
Overhead Command Shared Object Symbol
44.79% haproxy haproxy [.] ebst_insert
25.07% haproxy haproxy [.] ebmb_insert_prefix
3.44% haproxy libc-2.33.so [.] __libc_calloc
2.71% haproxy libc-2.33.so [.] _int_malloc
2.33% haproxy haproxy [.] free_pattern_tree
1.78% haproxy libc-2.33.so [.] inet_pton4
1.62% haproxy libc-2.33.so [.] _IO_fgets
1.58% haproxy libc-2.33.so [.] _int_free
1.56% haproxy haproxy [.] pat_ref_push
1.35% haproxy libc-2.33.so [.] malloc_consolidate
1.16% haproxy libc-2.33.so [.] __strlen_avx2
0.79% haproxy haproxy [.] pat_idx_tree_ip
0.76% haproxy haproxy [.] pat_ref_read_from_file
0.60% haproxy libc-2.33.so [.] __strrchr_avx2
0.55% haproxy libc-2.33.so [.] unlink_chunk.constprop.0
0.54% haproxy libc-2.33.so [.] __memchr_avx2
0.46% haproxy haproxy [.] pat_ref_append
After:
RAM (VSZ/RSS): 4166108 3634768
real 0m18.114s
user 0m17.113s
sys 0m0.996s
Overhead Command Shared Object Symbol
38.99% haproxy haproxy [.] cebs_insert
27.09% haproxy haproxy [.] ebmb_insert_prefix
3.63% haproxy libc-2.33.so [.] __libc_calloc
3.18% haproxy libc-2.33.so [.] _int_malloc
2.69% haproxy haproxy [.] free_pattern_tree
1.99% haproxy libc-2.33.so [.] inet_pton4
1.74% haproxy libc-2.33.so [.] _IO_fgets
1.73% haproxy libc-2.33.so [.] _int_free
1.57% haproxy haproxy [.] pat_ref_push
1.48% haproxy libc-2.33.so [.] malloc_consolidate
1.22% haproxy libc-2.33.so [.] __strlen_avx2
1.05% haproxy libc-2.33.so [.] __strcmp_avx2
0.80% haproxy haproxy [.] pat_idx_tree_ip
0.74% haproxy libc-2.33.so [.] __memchr_avx2
0.69% haproxy libc-2.33.so [.] __strrchr_avx2
0.69% haproxy libc-2.33.so [.] _IO_getline_info
0.62% haproxy haproxy [.] pat_ref_read_from_file
0.56% haproxy libc-2.33.so [.] unlink_chunk.constprop.0
0.56% haproxy libc-2.33.so [.] cfree@GLIBC_2.2.5
0.46% haproxy haproxy [.] pat_ref_append
If the addresses are totally disordered (via "shuf" on the input file),
we see both implementations reach exactly 68.0s (slower due to much
higher cache miss ratio).
On large strings such as user agents (1 million here), it's now slightly
slower (+9%):
Before:
real 0m2.475s
user 0m2.316s
sys 0m0.155s
After:
real 0m2.696s
user 0m2.544s
sys 0m0.147s
But such patterns are much less common than short ones, and the memory
savings do still count.
Note that while it could be tempting to get rid of the list that chains
all these pat_ref_elt together and only enumerate them by walking along
the tree to save 16 extra bytes per entry, that's not possible due to
the problem that insertion ordering is critical (think overlapping regex
such as /index.* and /index.html). Currently it's not possible to proceed
differently because patterns are first pre-loaded into the pat_ref via
pat_ref_read_from_file_smp() and later indexed by pattern_read_from_file(),
which has to only redo the second part anyway for maps/acls declared
multiple times.
The support for duplicates is necessary for various use cases related
to config names, so let's upgrade to the latest version which brings
this support. This updates the cebtree code to commit 808ed67 (tag
0.5.0). A few tiny adaptations were needed:
- replace a few ceb_node** with ceb_root** since pointers are now
tagged ;
- replace cebu*.h with ceb*.h since both are now merged in the same
include file. This way we can drop the unused cebu*.h files from
cebtree that are provided only for compatibility.
- rename immediate storage functions to cebXX_imm_XXX() as per the API
change in 0.5 that makes immediate explicit rather than implicit.
This only affects vars and tools.c:copy_file_name().
The tests continue to work.
When running "make range", it would be convenient to support running
reg tests or anything else such as "size", "pahole" or even benchmarks.
Such commands are usually specific to the developer's environment, so
let's just pass a generic variable TEST_CMD that is executed as-is if
not empty.
This way it becomes possible to run "make range RANGE=... TEST_CMD=...".
RFC1034 states the following:
By convention, domain names can be stored with arbitrary case, but
domain name comparisons for all present domain functions are done in a
case-insensitive manner, assuming an ASCII character set, and a high
order zero bit. This means that you are free to create a node with
label "A" or a node with label "a", but not both as brothers; you could
refer to either using "a" or "A".
In practice, most DNS resolvers normalize domain labels (i.e., convert
them to lowercase) before performing searches or comparisons to ensure
this requirement is met.
While HAProxy normalizes the domain name in the request, it currently
does not do so for the response. Commit 75cc653 ("MEDIUM: resolvers:
replace bogus resolv_hostname_cmp() with memcmp()") intentionally
removed the `tolower()` conversion from `resolv_hostname_cmp()` for
safety and performance reasons.
This commit re-introduces the necessary normalization for FQDNs received
in the response. The change is made in `resolv_read_name()`, where labels
are processed as an unsigned char string, allowing `tolower()` to be
applied safely. Since a typical FQDN has only 3-4 labels, replacing
`memcpy()` with an explicit copy that also applies `tolower()` should
not introduce a significant performance degradation.
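A minimal sketch of the per-label copy with normalization (illustrative,
not the exact resolv_read_name() code):

  #include <ctype.h>
  #include <stddef.h>

  /* copy a DNS label while lowercasing it, instead of a plain memcpy() */
  static void copy_label_lower(char *dst, const unsigned char *src, size_t len)
  {
      size_t i;

      for (i = 0; i < len; i++)
          dst[i] = tolower(src[i]);
  }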
This patch addresses the rare edge case, as most resolvers perform this
normalization themselves.
This fixes GitHub issue #3102. This fix may be backported to all stable
versions since 2.5 included.
If an ocsp response is set to be updated automatically and some
certificate or CA updates are performed on the CLI, if the CLI update
happens while the OCSP response is being updated and is then detached
from the update tree, it might be wrongly inserted into the update tree
in 'ssl_sock_load_ocsp', and then reinserted when the update finishes.
The update tree then gets corrupted and we could end up crashing when
accessing other nodes in the ocsp response update tree.
This patch must be backported up to 2.8.
This patch fixes GitHub #3100.
An eb tree was used to anticipate an infinite amount of custom log steps
configured at proxy level. It turns out that it makes no sense to configure
that many logging steps for a proxy, and the cost of the eb tree is non
negligible in terms of memory footprint, especially when used in a default
section.
Instead, let's use a simple bitmask, which allows up to 64 logging steps
configured at proxy level. If we lack space some day (and need more than
64 logging steps to be configured), we could simply modify
"struct log_steps" to spread the bitmask over multiple 64bits integers,
minor some adjustments where the mask is set and checked.
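A hedged sketch of the bitmask approach (names are illustrative, the real
struct is "struct log_steps" mentioned above):

  #include <stdint.h>

  /* up to 64 logging steps encoded as one bit each */
  struct log_steps_sketch {
      uint64_t steps;
  };

  static inline void log_step_set(struct log_steps_sketch *ls, unsigned int step)
  {
      ls->steps |= (uint64_t)1 << step;
  }

  static inline int log_step_is_set(const struct log_steps_sketch *ls, unsigned int step)
  {
      return !!(ls->steps & ((uint64_t)1 << step));
  }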
As reported by @kenballus in GH #3118, a potential NULL-deref was
introduced in 3da1d63 ("BUG/MEDIUM: http_ana: handle yield for "stats
http-request" evaluation")
Indeed, px->uri_auth may be NULL when stats directive is not involved in
the current proxy section.
The bug went unnoticed because it didn't seem to cause any side-effect
so far and valgrind didn't catch it. However ASAN did, so let's fix it
before it causes harm.
It should be backported with 3da1d63.
If an ocsp response is set to be updated automatically and some
certificate or CA updates are performed on the CLI, if the CLI update
happens while the OCSP response is being updated and is then detached
from the update tree, it might be wrongly inserted into the update tree
in 'ssl_sock_load_ocsp', and then reinserted when the update finishes.
The update tree then gets corrupted and we could end up crashing when
accessing other nodes in the ocsp response update tree.
This patch must be backported up to 2.8.
This patch fixes GitHub #3100.
Another regression introduced with the commit 3023e9819 ("BUG/MINOR:
resolvers: Restore round-robin selection on records in DNS answers"). Stream
requesters are unlinked from any threads. So we must not try to queue the
resolver's task here because it is not allowed to do so from another thread
than the task thread. Instead, we can simply wake the resolver's task up. It
is only performed when the last stream requester is unlinked from the
resolution.
This patch should fix the issue #3119. It must be backported with the commit
above.
A regression was introduced by commit 6cf2401ed ("BUG/MEDIUM: resolvers:
Make resolution owns its hostname_dn value"). In fact, it is possible (and
allowed ?!) to create a resolution without a hostname (hostname_dn ==
NULL). It only happens on startup for a server relying on a resolver but
defined with an IP address and not a hostname.
Because of the patch above, an error is triggered during the configuration
parsing when this happens, while it should be accepted.
This patch must be backported with the commit above.
The commit 37abe56b1 ("BUG/MEDIUM: resolvers: Properly cache do-resolv
resolution") introduced a regression. A resolution does not own its
hostname_dn value, it is a pointer on the first request value. But since the
commit above, it is possible to have an orphan resolution, with no
requester. So it is important to modify the resolutions to make them own
their hostname_dn value by duplicating it when they are created.
This patch must be backported with the commit above.
In the previous fix 5d1d93fad ("BUG/MEDIUM: resolvers: Properly handle empty
tree when getting a record from the DNS answer"), I missed the fact the
answer tree can be empty.
So, to avoid crashes, when the answer tree is empty, we immediately exit
from resolv_get_ip_from_response() function with RSLV_UPD_NO_IP_FOUND. In
addition, when a record is removed from the tree, we take care to reset the
next node saved if necessary.
This patch must be backported with the commit above.
This change adds the PP2_SUBTYPE_SSL_GROUP and PP2_SUBTYPE_SSL_SIG_SCHEME
code point reservations in proxy_protocol.txt. The motivation for adding
these two TLVs is for backend visibility into the negotiated TLS key
exchange group and handshake signature scheme.
Demand for visibility is expected to increase as endpoints migrate to use
new Post-Quantum resistant algorithms for key exchange and signatures.
By checking the current thread's locking status, it becomes possible
to know during a memory allocation whether it's performed under a lock
or not. Both pools and memprofile functions were instrumented to check
for this and to increment the memprofile bin's locked_calls counter.
This one, when not zero, is reported on "show profiling memory" with a
percentage of all allocations that such locked allocations represent.
This way it becomes possible to try to target certain code paths that
are particularly expensive. Example:
$ socat - /tmp/sock1 <<< "show profiling memory"|grep lock
20297301 0 2598054528 0| 0x62a820fa3991 sockaddr_alloc+0x61/0xa3 p_alloc(128) [pool=sockaddr] [locked=54962 (0.2 %)]
0 20297301 0 2598054528| 0x62a820fa3a24 sockaddr_free+0x44/0x59 p_free(-128) [pool=sockaddr] [locked=34300 (0.1 %)]
9908432 0 1268279296 0| 0x62a820eb8524 main+0x81974 p_alloc(128) [pool=task] [locked=9908432 (100.0 %)]
9908432 0 554872192 0| 0x62a820eb85a6 main+0x819f6 p_alloc(56) [pool=tasklet] [locked=9908432 (100.0 %)]
263001 0 63120240 0| 0x62a820fa3c97 conn_new+0x37/0x1b2 p_alloc(240) [pool=connection] [locked=20662 (7.8 %)]
71643 0 47307584 0| 0x62a82105204d pool_get_from_os_noinc+0x12d/0x161 posix_memalign(660) [locked=5393 (7.5 %)]
When task profiling is enabled, the pool alloc/free code will measure the
time it takes to perform memory allocation after a cache miss or memory
freeing to the shared cache or OS. The time taken with the thread-local
cache is never measured as measuring that time is very expensive compared
to the pool access time. Here doing so costs around 2% performance at 2M
req/s, only when task profiling is enabled, so this remains reasonable.
The scheduler takes care of collecting that time and updating the
sched_activity entry corresponding to the current task when task profiling
is enabled.
The goal clearly is to track places that are wasting CPU time allocating
and releasing too often, or causing large evictions. This appears like
this in "show profiling tasks aggr":
Tasks activity over 11.428 sec till 0.000 sec ago:
function calls cpu_tot cpu_avg lkw_avg lkd_avg mem_avg lat_avg
process_stream 44183891 16.47m 22.36us 491.0ns 1.154us 1.000ns 101.1us
h1_io_cb 57386064 4.011m 4.193us 20.00ns 16.00ns - 29.47us
sc_conn_io_cb 42088024 49.04s 1.165us - - - 54.67us
h1_timeout_task 438171 196.5ms 448.0ns - - - 100.1us
srv_cleanup_toremove_conns 65 1.468ms 22.58us 184.0ns 87.00ns - 101.3us
task_process_applet 3 508.0us 169.3us - 107.0us 1.847us 29.67us
srv_cleanup_idle_conns 6 225.3us 37.55us 15.74us 36.84us - 49.47us
accept_queue_process 2 45.62us 22.81us - - 4.949us 54.33us
This new column will be used for reporting the average time spent
allocating or freeing memory in a task when task profiling is enabled.
For now it is not updated.
When DEBUG_THREAD > 0 and task profiling enabled, we'll now measure the
time spent with at least one lock held for each task. The time is
collected by locking operations when locks are taken raising the level
to one, or released resetting the level. An accumulator is updated in
the thread_ctx struct that is collected by the scheduler when the task
returns, and updated in the sched_activity entry of the related task.
This allows to observe figures like this one:
Tasks activity over 259.516 sec till 0.000 sec ago:
function calls cpu_tot cpu_avg lkw_avg lkd_avg lat_avg
h1_io_cb 15466589 2.574m 9.984us - - 33.45us <- sock_conn_iocb@src/sock.c:1099 tasklet_wakeup
sc_conn_io_cb 8047994 8.325s 1.034us - - 870.1us <- sc_app_chk_rcv_conn@src/stconn.c:844 tasklet_wakeup
process_stream 7734689 4.356m 33.79us 1.990us 1.641us 1.554ms <- sc_notify@src/stconn.c:1206 task_wakeup
process_stream 7734292 46.74m 362.6us 278.3us 132.2us 972.0us <- stream_new@src/stream.c:585 task_wakeup
sc_conn_io_cb 7733158 46.88s 6.061us - - 68.78us <- h1_wake_stream_for_recv@src/mux_h1.c:3633 tasklet_wakeup
task_process_applet 6603593 4.484m 40.74us 16.69us 34.00us 96.47us <- sc_app_chk_snd_applet@src/stconn.c:1043 appctx_wakeup
task_process_applet 4761796 3.420m 43.09us 18.79us 39.28us 138.2us <- __process_running_peer_sync@src/peers.c:3579 appctx_wakeup
process_table_expire 4710662 4.880m 62.16us 9.648us 53.95us 158.6us <- run_tasks_from_lists@src/task.c:671 task_queue
stktable_add_pend_updates 4171868 6.786s 1.626us - 1.487us 47.94us <- stktable_add_pend_updates@src/stick_table.c:869 tasklet_wakeup
h1_io_cb 2871683 1.198s 417.0ns 70.00ns 69.00ns 1.005ms <- h1_takeover@src/mux_h1.c:5659 tasklet_wakeup
process_peer_sync 2304957 5.368s 2.328us - 1.156us 68.54us <- stktable_add_pend_updates@src/stick_table.c:873 task_wakeup
process_peer_sync 1388141 3.174s 2.286us - 1.130us 52.31us <- run_tasks_from_lists@src/task.c:671 task_queue
stktable_add_pend_updates 463488 3.530s 7.615us 2.000ns 7.134us 771.2us <- stktable_touch_with_exp@src/stick_table.c:654 tasklet_wakeup
Here we see that almost the entirety of stktable_add_pend_updates() is
spent under a lock, that 1/3 of the execution time of process_stream()
was performed under a lock and that 2/3 of it was spent waiting for a
lock (this is related to the 10 track-sc present in this config), and
that the locking time in process_peer_sync() has now significantly
reduced. This is more visible with "show profiling tasks aggr":
Tasks activity over 475.354 sec till 0.000 sec ago:
function calls cpu_tot cpu_avg lkw_avg lkd_avg lat_avg
h1_io_cb 25742539 3.699m 8.622us 11.00ns 10.00ns 188.0us
sc_conn_io_cb 22565666 1.475m 3.920us - - 473.9us
process_stream 21665212 1.195h 198.6us 140.6us 67.08us 1.266ms
task_process_applet 16352495 11.31m 41.51us 17.98us 36.55us 112.3us
process_peer_sync 7831923 17.15s 2.189us - 1.107us 41.27us
process_table_expire 6878569 6.866m 59.89us 9.359us 51.91us 151.8us
stktable_add_pend_updates 6602502 14.77s 2.236us - 2.060us 119.8us
h1_timeout_task 801 703.4us 878.0ns - - 185.7us
srv_cleanup_toremove_conns 347 12.43ms 35.82us 240.0ns 70.00ns 1.924ms
accept_queue_process 142 1.384ms 9.743us - - 340.6us
srv_cleanup_idle_conns 74 475.0us 6.418us 896.0ns 5.667us 114.6us
This new column will be used for reporting the average time spent
in a task with at least one lock held. It will only have a non-zero
value when DEBUG_THREAD > 0. For now it is not updated.
The new lock_level field indicates the number of cumulated locks that
are held by the current thread. It's fed as soon as DEBUG_THREAD is at
least 1. In addition, thread_isolate() adds 128, so that it's even
possible to check for combinations of both. The value is also reported
in thread dumps (warnings and panics).
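A hedged sketch of how such a field can be interpreted (the helper names
and the exact location of the field are assumptions, not the actual code):

  #define LOCK_LEVEL_ISOLATED  128

  static inline int holds_any_lock(unsigned char lock_level)
  {
      return (lock_level & (LOCK_LEVEL_ISOLATED - 1)) != 0;  /* low bits: nested lock count */
  }

  static inline int is_isolated(unsigned char lock_level)
  {
      return (lock_level & LOCK_LEVEL_ISOLATED) != 0;        /* thread_isolate() added 128 */
  }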
When DEBUG_THREAD > 0, and if task profiling is enabled, then each
locking attempt will measure the time it takes to obtain the lock, then
add that time to a thread_ctx accumulator that the scheduler will then
retrieve to update the current task's sched_activity entry. The value
will then appear averaged over the number of calls in the lkw_avg column
of "show profiling tasks", such as below:
Tasks activity over 48.298 sec till 0.000 sec ago:
function calls cpu_tot cpu_avg lkw_avg lat_avg
h1_io_cb 3200170 26.81s 8.377us - 32.73us <- sock_conn_iocb@src/sock.c:1099 tasklet_wakeup
sc_conn_io_cb 1657841 1.645s 992.0ns - 853.0us <- sc_app_chk_rcv_conn@src/stconn.c:844 tasklet_wakeup
process_stream 1600450 49.16s 30.71us 1.936us 1.392ms <- sc_notify@src/stconn.c:1206 task_wakeup
process_stream 1600321 7.770m 291.3us 209.1us 901.6us <- stream_new@src/stream.c:585 task_wakeup
sc_conn_io_cb 1599928 7.975s 4.984us - 65.77us <- h1_wake_stream_for_recv@src/mux_h1.c:3633 tasklet_wakeup
task_process_applet 997609 46.37s 46.48us 16.80us 113.0us <- sc_app_chk_snd_applet@src/stconn.c:1043 appctx_wakeup
process_table_expire 922074 48.79s 52.92us 7.275us 181.1us <- run_tasks_from_lists@src/task.c:670 task_queue
stktable_add_pend_updates 705423 1.511s 2.142us - 56.81us <- stktable_add_pend_updates@src/stick_table.c:869 tasklet_wakeup
task_process_applet 683511 34.75s 50.84us 18.37us 153.3us <- __process_running_peer_sync@src/peers.c:3579 appctx_wakeup
h1_io_cb 535395 198.1ms 370.0ns 72.00ns 930.4us <- h1_takeover@src/mux_h1.c:5659 tasklet_wakeup
It now makes it pretty obvious which tasks (hence call chains) spend their
time waiting on a lock and for what share of their execution time.
This new column will be used for reporting the average time spent waiting
for a lock. It will only have a non-zero value when DEBUG_THREAD > 0. For
now it is not updated.
This column is pretty useless, as the total latency experienced by tasks
is meaningless, what matters is the average per call. Since we'll add more
columns and we need to keep all of this readable, let's get rid of this
column.
Since the commit dcb696cd3 ("MEDIUM: resolvers: hash the records before
inserting them into the tree"), When several records are found in a DNS
answer, the round robin selection over these records is no longer performed.
Indeed, before a list of records was used. To ensure each records was
selected one after the other, at each selection, the first record of the
list was moved at the end. When this list was replaced bu a tree, the same
mechanism was preserved. However, the record is indexed using its key, a
hash of the record. So its position never changes. When it is removed and
reinserted in the tree, its position remains the same. When we walk though
the tree, starting from the root, the records are always evaluated in the
same order. So, even if there are several records in a DNS answer, the same
IP address is always selected.
It is quite easy to trigger the issue with a do-resolv action.
To fix the issue, the node to perform the next selection is now saved. So
instead of restarting from the root each time, we can restart from the next
node of the previous call.
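A hedged sketch of the round-robin restart, assuming the records are
indexed in an eb32 tree and <last> is the saved position (the tree and
field names are assumptions, not the actual resolvers code):

  #include <stddef.h>
  #include <import/eb32tree.h>   /* HAProxy's bundled ebtree */

  /* pick the record after <last>, wrapping to the first one, so that
   * successive calls rotate over the records instead of always restarting
   * from the root.
   */
  static struct eb32_node *pick_next_record(struct eb_root *answer_tree,
                                            struct eb32_node *last)
  {
      struct eb32_node *node;

      node = last ? eb32_next(last) : NULL;
      if (!node)
          node = eb32_first(answer_tree);   /* wrap around */
      return node;
  }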
Thanks to Damien Claisse for the issue analysis and for the reproducer.
This patch should fix the issue #3116. It must be backported as far as 2.6.
As stated by the documentation, when a do-resolv resolution is performed,
the result should be cached for <hold.valid> milliseconds. However, the only
way to cache the result is to always have a requester. When the last
requester is unlinked from the resolution, the resolution is released. So, for
a do-resolv resolution, it means it could only work by chance if the same
FQDN is requested enough to always have at least two streams waiting for the
resolution. And because in that case, the cached result is used, it means
the traffic must be quite high.
In fact, a good approach to fix the issue is to keep orphan resolutions to
be able to cache the result and only release them hold.valid milliseconds
after the last real resolution. The resolver's task already releases orphan
resolutions. So we only need to check the expiration date and take care to
not release the resolution when the last stream is unlinked from it.
This patch should be backported to all stable versions. We can start to
backport it as far as 3.1 and then wait a bit.
Previous patch 50d191b ("MINOR: ssl: set functions as static when no
protypes in the .h") broke the WolfSSL function with unused functions.
This patch add __maybe_unused to ssl_sock_sctl_parse_cbk(),
ssl_sock_sctl_add_cbk() and ssl_sock_msgcbk()
-Wmissing-prototypes lets us check which functions can be made static and
are not used elsewhere.
src/ssl_ocsp.c:1079:5: error: no previous prototype for ‘ssl_ocsp_update_insert_after_error’ [-Werror=missing-prototypes]
1079 | int ssl_ocsp_update_insert_after_error(struct certificate_ocsp *ocsp)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/ssl_ocsp.c:1116:6: error: no previous prototype for ‘ocsp_update_response_stline_cb’ [-Werror=missing-prototypes]
1116 | void ocsp_update_response_stline_cb(struct httpclient *hc)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/ssl_ocsp.c:1127:6: error: no previous prototype for ‘ocsp_update_response_headers_cb’ [-Werror=missing-prototypes]
1127 | void ocsp_update_response_headers_cb(struct httpclient *hc)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/ssl_ocsp.c:1138:6: error: no previous prototype for ‘ocsp_update_response_body_cb’ [-Werror=missing-prototypes]
1138 | void ocsp_update_response_body_cb(struct httpclient *hc)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/ssl_ocsp.c:1149:6: error: no previous prototype for ‘ocsp_update_response_end_cb’ [-Werror=missing-prototypes]
1149 | void ocsp_update_response_end_cb(struct httpclient *hc)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
src/ssl_ocsp.c:2095:5: error: no previous prototype for ‘ocsp_update_postparser_init’ [-Werror=missing-prototypes]
2095 | int ocsp_update_postparser_init()
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Inconsistencies between the .h and the .c can't be caught because the
.h is not included in the .c.
ocsp_update_init() does not have the right prototype and lacks a const
attribute.
Must be backported in all previous stable versions.
'conn' might be NULL in the trace callback so the calls to
conn_err_code_str must be covered by a proper check.
This issue was found by Coverity and raised in GitHub #3112.
The patch must be backported to 3.2.
'ctx' might be NULL when we exit 'ssl_sock_handshake', it can't be
dereferenced without check in the trace macro.
This was found by Coverity and raised in GitHub #3113.
This patch should be backported up to 3.2.
JWS functions are supposed to return 0 upon error or when nothing was
produced. This was done in order to easily put the return value in
trash->data without having to check the return value.
However functions like a2base64url() or snprintf() could return a
negative value, which would be cast to an unsigned int if this happens.
This patch adds checks on the JWS functions to ensure that no negative
value can be returned, and changes the prototype from int to size_t.
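A minimal illustration of the hazard and the guard (hedged, not the exact
JWS code):

  #include <stdio.h>

  /* returns the number of bytes written, or 0 on error, so the result can
   * be stored directly into an unsigned length without sign surprises.
   */
  static size_t emit_field(char *dst, size_t size, const char *val)
  {
      int ret = snprintf(dst, size, "\"%s\"", val);

      if (ret < 0 || (size_t)ret >= size)
          return 0;   /* error or truncation: report "nothing produced" */
      return (size_t)ret;
  }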
This is also related to issue #3114.
Must be backported to 3.2.
Reported in issue #3115:
11. var_compare_op: Comparing task to null implies that task might be null.
681 if (!task) {
682 ret++;
683 ha_alert("acme: couldn't start the scheduler!\n");
684 }
CID 1609721: (#1 of 1): Dereference after null check (FORWARD_NULL)
12. var_deref_op: Dereferencing null pointer task.
685 task->nice = 0;
686 task->process = acme_scheduler;
687
688 task_wakeup(task, TASK_WOKEN_INIT);
689 }
690
Task would be dereferenced upon allocation failure instead of falling
back to the end of the function after the error.
Should be backported in 3.2.
This patch extends the documentation for "limited-quic" global keyword.
It mentions first that it relies on USE_QUIC_OPENSSL_COMPAT=1 build
option.
Compatibility with TLS libraries is now clearly exposed. In particular,
it highlights the fact that it is mostly targeted at OpenSSL versions
prior to 3.5.2, and that it should be disabled if a recent OpenSSL
release is available. It also states that limited-quic does nothing if
USE_QUIC_OPENSSL_COMPAT is not set during compilation.
Build option USE_QUIC_OPENSSL_COMPAT=1 must be set to activate QUIC
support for OpenSSL prior to version 3.5.2. This compiles an internal
compatibility layer, which must be then activated at runtime with global
option limited-quic.
Starting from OpenSSL version 3.5.2, a proper QUIC TLS API is now
exposed. Thus, the compatibility layer is unneeded. However it can still
be compiled against newer OpenSSL releases and activated at runtime,
mostly for test purpose.
As this compatibility layer has some limitations, (no support for QUIC
0-RTT), it's important that users notice this situation and disable it
if possible. Thus, this patch adds a notice warning when
USE_QUIC_OPENSSL_COMPAT=1 is set when building against OpenSSL 3.5.2 and
above. This should be sufficient for users and packagers to understand
that this option is not necessary anymore.
Note that USE_QUIC_OPENSSL_COMPAT=1 is incompatible with other TLS
libraries which expose a QUIC API based on the original BoringSSL patch
set. A build error will prevent the compatibility layer from being built.
The limited-quic option is thus silently ignored.
This index is used to retrieve the quic_conn object from its SSL object, the same
way the connection is retrieved from its SSL object for SSL/TCP connections.
This patch implements two helper functions to avoid the ugly code with such blocks:
#ifdef USE_QUIC
else if (qc) { .. }
#endif
Implement ssl_sock_get_listener() to return the listener from an SSL object.
Implement ssl_sock_get_conn() to return the connection from an SSL object
and optionally a pointer to the ssl_sock_ctx struct attached to the connections
or the quic_conns.
Use these functions where applicable:
- ssl_tlsext_ticket_key_cb() calls ssl_sock_get_listener()
- ssl_sock_infocbk() calls ssl_sock_get_conn()
- ssl_sock_msgcbk() calls ssl_sock_get_ssl_conn()
- ssl_sess_new_srv_cb() calls ssl_sock_get_conn()
- ssl_sock_srv_verifycbk() calls ssl_sock_get_conn()
Also modify qc_ssl_sess_init() to initialize the ssl_qc_app_data_index index for
the QUIC backends.
The ->li (struct listener *) member of quic_conn struct was replaced by a
->target (struct obj_type *) member by this commit:
MINOR: quic-be: get rid of ->li quic_conn member
to abstract the connection type (front or back) when implementing QUIC for the
backends. In these cases, ->target was a pointer to the obj_type of a server
struct. This could not work with the dynamic servers, contrary to the listeners
which are not dynamic.
This patch almost reverts the one mentioned above. The ->target pointer to obj_type member
is replaced by a ->li pointer to listener struct member. As the listeners are not
dynamic, this is easy to do. All one has to do is to replace the
objt_listener(qc->target) statement by qc->li where applicable.
For the backend connection, when needed, this is always qc->conn->target which is
used only when qc->conn is initialized. The only "problematic" case is for
quic_dgram_parse() which takes a pointer to an obj_type as third argument.
But this obj_type is only used to call quic_rx_pkt_parse(). Inside this function
it is used to access the proxy counters of the connection thanks to qc_counters().
So, this obj_type argument may be null from now on with this patch. This is the
reason why qc_counters() is modified to take this into consideration.
This patch reverts commit a498e527b ("BUG/MAJOR: stream: Remove READ/WRITE
events on channels after analysers eval") because of a regression. It was an
attempt to properly detect synchronous sends, even when the stream was woken
up on a write event. However, the fix was wrong because it could mask
shutdowns performed during process_stream() and block the stream.
Indeed, when a shutdown is performed, because an error occurred for
instance, a write event is reported. The commit above could mask this event
while the shutdown prevents any synchronous sends. In such a case, the stream
could remain blocked indefinitely because an I/O event was missed.
So to properly fix the original issue (#3070), the write event must not be
masked before a synchronous send. Instead, we now force the channel analysis
by explicitly setting the CF_WAKE_ONCE flag on the corresponding channel if a
write event is reported after the synchronous send. CF_WRITE_EVENT flag is
removed explicitly just before, so it is quite easy to detect.
This patch must be backported to all stable versions at the same time as the
commit above.
The remaining half of the task_queue() and task_wakeup() contention
is caused by this function when peers are in use, because just like
process_table_expire(), it's created using task_new_anywhere() and
is woken up for local updates. Let's turn it to single thread by
rotating the assigned threads during initialization so that a table
only runs on one thread at a time.
Here we go backwards to assign the threads, so that on small setups
they don't end up on the same CPUs as the ones used by the stick-tables.
This way this will make an even better use of large machines. The
performance remains the same as with previous patch, even slightly
better (1-3% on avg).
At this point there's almost no multi-threaded task activity anymore
(only srv_cleanup_idle_server once in a while). This should improve
the situation described by Felipe in issues #3084 and #3101.
This should be backported to 3.2 after some extended checks.
A big deal of the task_queue() contention is caused by this function
because it's created using task_new_anywhere() and is subject to
heavy updates. Let's turn it to single thread by rotating the assigned
threads during initialization so that a table only runs on one thread
at a time.
However there's a trick: the function used to call task_queue() to
requeue the task if it had advanced its timer (may only happen when
learning an entry from a peer). We can't do that anymore since we can't
queue another thread's task. Thus, when the task needs to be scheduled
earlier than previously planned, we simply perform a wakeup instead.
It will likely do nothing and will self-adjust its next wakeup timer.
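A minimal sketch of the substitution, assuming the HAProxy task API
(task_wakeup(), tick_is_lt()); this is not the exact stick-table code:

  #include <haproxy/task.h>
  #include <haproxy/ticks.h>

  /* when the table's task belongs to another thread, we cannot requeue it
   * with task_queue(), so we just wake it up and let it recompute its own
   * next expiration.
   */
  static void requeue_or_wake(struct task *t, int new_exp)
  {
      if (tick_is_lt(new_exp, t->expire))
          task_wakeup(t, TASK_WOKEN_OTHER);
  }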
Doing so halves the number of multi-thread task wakeups. In addition
the request rate at saturation increased by 12% with 16 peers and 40
tables on 16 8-thread processes. This should improve the situation
described by Felipe in issues #3084 and #3101.
This should be backported to 3.2 after some extended checks.
In stktable_requeue_exp(), there's a tiny race at the beginning during
which we check the task's expiration date to decide whether or not to
wake process_table_expire() up. During this race, the task might just
have finished running on its owner thread and we can miss a task_queue()
opportunity, which probably explains why during testing it seldom happens
that a few entries are left at the end.
Let's perform a CAS to confirm the value is still the same before
leaving. This way we're certain that our value has been seen at least
once.
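A hedged illustration using plain C11 atomics (HAProxy uses its own
HA_ATOMIC_* macros instead): the CAS only publishes the new expiration if
the value we relied on is still in place, otherwise the owner thread
already changed it and will requeue itself.

  #include <stdatomic.h>

  /* returns non-zero if <new_exp> was installed over the still-current
   * <observed> value, zero if someone changed it in between.
   */
  static int try_refresh_expire(_Atomic int *expire, int observed, int new_exp)
  {
      return atomic_compare_exchange_strong(expire, &observed, new_exp);
  }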
This should be backported to 3.2.
This task is sometimes caught triggering the watchdog while waiting for
the infamous resolvers lock, or the scheduler's wait queue lock in
task_queue(). Both are caused by its multi-threaded capability. The
task may indeed start on a thread that's different from the one that
is currently receiving a response and that holds the resolvers lock,
and when being queued back, it requires locking the wait queue. Both
problems disappear when sticking it to a single thread. But for configs
running multiple resolvers sections, it would be suboptimal to run them
all on the same thread. In order to avoid this, we implement a counter
in the resolvers_finalize_config() section that rotates the thread for
each resolvers section.
This was sufficient to further improve the performance here, making the
CPU usage drop to about 7% (from 11 previously or 38 initially) and not
showing any resolvers lock contention anymore in perf top output.
The change was kept fairly minimal to permit a backport once enough
testing is conducted on it. It could address a significant part of
the trouble reported by Felipe in GH issue #3101.
There's still a big architectural limitation in the dns/resolvers code
regarding threads: resolvers run as a task that is scheduled to run
anywhere, and each NS dgram socket is bound to any thread of the same
thread group as the initiating thread. This becomes a big problem when
dealing with multiple nameservers because responses arrive on any thread,
start by locking the resolvers section, and other threads dealing with
responses are just stuck waiting for the lock to disappear. This means
that most of the time is exclusively spent causing contention. The
process_resolvers() function also also suffers from this contention
but apparently less often.
It turns out that the nameserver sockets are created during emission
of the first packet, triggered from the resolvers task. The present
patch exploits this to stick all sockets to the calling thread instead
of any thread. This way there is no longer any contention between
multiple nameservers of a same resolvers section. Tests with a section
having 10 name servers showed that the CPU usage dropped from 38 to
about 10%, or almost by a factor of 4.
Note that TCP resolvers do not offer this possibility because the
tasks that manage the applets are created earlier to run anywhere
during config parsing. This might possibly be refined later, e.g.
by changing the task's affinity when it first runs.
The change was kept fairly minimal to permit a backport once enough
testing is conducted on it. It could address a significant part of
the trouble reported by Felipe in GH issue #3101.
In ssl_sock_io_cb(), if we failed to create the mux, we may have
destroyed the connection, so only attempt to access it to get the ALPN
if conn_create_mux() was successful.
This fixes crashes that may happen when using ssl.
Commit 5ab9954faa9c815425fa39171ad33e75f4f7d56f introduced a new flag in
ssl_sock_ctx, to know that an ALPN was negotiated, however, the way to
get the ssl_sock_ctx was wrong for QUIC. If we're using QUIC, get it
from the quic_conn.
This should fix crashes when attempting to use QUIC.
This function is a tasklet handler used to send peers updates, and it can
happen quite a bit in "show tasks" and "show profiling tasks", so let's
export it so that we don't face a cryptic symbol name:
$ socat - /tmp/haproxy-n10.stat <<< "show tasks"
Running tasks: 43 (8 threads)
function places % lat_tot lat_avg calls_tot calls_avg calls%
process_table_expire 16 37.2 1.072m 4.021s 115831 7239 15.4
task_process_applet 15 34.8 1.072m 4.287s 486299 32419 65.0
stktable_add_pend_updates 8 18.6 - - 89725 11215 12.0
sc_conn_io_cb 3 6.9 - - 5007 1669 0.6
process_peer_sync 1 2.3 4.293s 4.293s 50765 50765 6.7
This should be backported to 3.2 as it participates to debugging the
table+peers processing overhead.
The stick-table expiration of ref-counted entries was insufficiently
addressed by commit 324f0a60ab ("BUG/MINOR: stick-tables: never leave
used entries without expiration"), because now entries are just requeued
where they were, so they're visited over and over for long sessions,
causing process_table_expire() to loop, eating CPU and causing lock
contention.
Here we take care of refreshing their timer when they are met, so
that we don't meet them more than once per stick-table lifetime. It
should address at least a part of the recent degradation that Felipe
noticed in GH #3084.
Since the fix above was marked for backporting to 3.2, this one should
be backported there as well.
resolve_sym_name() knows a number of symbols, but when one exactly matches
(e.g. a task's handler), it systematically displays the offset behind it
("+0"). Let's only show the offset when non-zero. This can be backported
as this is helpful for debugging.
The "show tasks" command can be useful to inspect run queues for active
tasks, but currently it's difficult to distinguish an occasional running
task from a heavily active one. Let's collect the number of calls for
each of them, and report them averaged over the number of instances of each task
as well as a percentage of the total used. This way it even becomes
possible to get a hint about how CPU usage is distributed.
In 2.4, "show tasks" was introduced by commit 7eff06e162 ("MINOR:
activity: add a new "show tasks" command to list currently active tasks")
to expose some info about running tasks. The latency is not correct
because it's a u32 subtracted from a u64. It ought to have been casted
to u32 for the operation, which is what this patch does.
This can be backported to 2.4.
Since commit 5ab9954faa ("MINOR: ssl: Add a flag to let it known we have
an ALPN negociated"), when building with QUIC we get this warning:
src/ssl_sock.c: In function 'ssl_sock_advertise_alpn_protos':
src/ssl_sock.c:2189:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
Let's just move the instructions after the optional declaration. No
backport is needed.
Now that we know which ALPN gets negotiated for a given server, use that to
decide if we can create the mux right away in connect_server(), and use
it in conn_install_mux_be().
That way, we may create the mux soon enough for early data to be sent,
before the handshake has been completed.
This commit depends on several previous commits, and it has not been
deemed important enough to backport.
The conditions to use early data on output are super tricky and
detected later, so that it's difficult to figure how this works. This
patch splits the condition in two parts, the one that can be performed
early that is based on config/client/etc. It is used to clear a variable
that allows early data to be used in case any condition is not satisfied.
It was purposely split into multiple independent and reviewable tests.
The second part remains where it was at the end, and is used to temporarily
clear the handshake flags to let the data layer use early data. This one
being tricky, a large comment explaining the principle was added.
The logic was not changed at all, only the code was made more readable.
Since 3.0 we have HAVE_SSL_0RTT precisely to avoid checking horribly
complicated and unmaintainable conditions to detect support for 0RTT.
Let's just drop the complex condition and use the macro instead.
Instead of trying to switch from delayed start to instant start based
on a single condition, let's do the opposite and preset the condition
to instant start and detect what could cause it to be delayed, thus
falling back to the slow mode. The condition remains exactly the
inverted one and better matches the comment about ALPN being the only
cause of such a delay.
The init_mux variable is currently used in a way that's not super easy
to grasp. It's set a bit too late and requires to know a lot of info at
once. Let's first rename it to "may_start_mux_now" to clarify its role,
as the purpose is not to *force* the mux to be initialized now but to
permit it to do it.
Add a new field in struct server, path parameters. It will contain
connection information for the server that is not expected to change.
For now, just store the ALPN negotiated with the server. Each time a
handshake is done, we'll update it, even though it is not supposed to
change. This will be useful when trying to send early data, that way
we'll know which mux to use.
Each time the server goes down or is disabled, that information is
erased, as we can't be sure those parameters will be the same once the
server is back up.
Now that we have a flag to let us know the ALPN has been set, we no
longer have to call ssl_sock_get_alpn() to know if the ALPN has been
negotiated already.
Remove the call to conn_create_mux() from ssl_sock_handshake(), and just
reuse the one already present in ssl_sock_io_cb() if we have received
early data, and if the flag is set.
Add a new flag to the ssl_sock_ctx, to be set as soon as the ALPN has
been negotiated.
This happens before the handshake has been completed, and that
information will let us know that, when we receive early data, if the
ALPN has been negotiated, then we can immediately create a mux, as the
ALPN will tell us which mux to use.
If we received early data, and an ALPN has been negotiated, then
immediately try to create a mux if we did not have one already.
Generally, at this point we would not have one, as the mux is decided by
the ALPN, however at this point, even if the handshake is not done yet,
we have enough to determine the ALPN, so we can immediately create the
mux.
Doing so makes us able to process the request immediately, without waiting
for the handshake to be done.
This should be backported up to 2.8.
In h1_recv_allowed(), do not forbid receiving if the connection is not yet
fully established, provided we have received early data on it. That way,
we can deal with them right away, instead of waiting for the handshake
to be done.
This should be backported up to 2.8.
Recent fix 2421c3769a ("BUG/MEDIUM: peers: don't fail twice to grab the
update lock") improved the situation a lot for peers under locking
contention but still not enough for situations with many peers and
many entries to expire fast. It's indeed still possible to trigger
warnings at end of injection sessions for 16 peers at 100k req/s each
doing 10 random track-sc when process_table_expire() runs and holds the
update lock if compiled with a high value of STKTABLE_MAX_UPDATES_AT_ONCE
(1000). Better just not insist in this case and postpone the update.
At this point, under load only ebmb_lookup() consumes CPU, other functions
are in the few percent, indicating reasonable contention, and peers remain
updated.
This should be backported to 3.2 after a bit of testing.
This one doesn't need to wait forever; if it cannot work it can postpone
it. When building with a high value of STKTABLE_MAX_UPDATES_AT_ONCE (1000),
it's still possible to trigger warnings in this function on the write lock
that is contended by peers and expiration. Changing it for a trylock resolves
the issue.
This should be backported to 3.2 after a bit of testing.
process_table_expire() can take quite a lot of time running over all
shards. During this time it will hinder track-sc rules and peers, which
will experience an increased latency to do their work, especially peers
where each message will cause a lock, whose cumulated time can exceed
the watchdog's patience.
Here, we proceed just like in stktable_trash_oldest(), which is that
we're using a trylock to detect contention. The first time it happens,
if we hadn't purged anything, we switch to a regular lock to perform
the operation, and next time it happens we abort. This guarantees that
some entries will be expired and that contention will be reduced when
detected.
With this change, various tests didn't manage to produce any warning,
including at the end of the load generation session.
This should be backported to 3.2 after a bit more testing.
stktable_trash_oldest() does insist a lot on purging what was requested,
only limited by STKTABLE_MAX_UPDATES_AT_ONCE. This is called in two
conditions, one to allocate a new stksess, and the other one to purge
entries of a stopping process. The cost of iterating over all shards
is huge, and a shard lock is taken each time before looking up entries.
Moreover, multiple threads can end up doing the same and looking hard for
many entries to purge when only one is needed. Furthermore, all threads
start from the same shard, hence synchronize their locks. All of this
costs a lot to other operations such as access from peers.
This commit simplifies the approach by ignoring the budget, starting
from a random shard number, and using a trylock so as to be able to
give up early in case of contention. The approach chosen here consists
in trying hard to flush at least one entry, but once at least one is
evicted or at least one trylock failed, then a failure on the trylock
will result in finishing.
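For illustration only, the eviction policy described above could be
sketched like this with generic pthread locking (the shard layout and the
evict_one() helper are hypothetical, not the actual stick-table code):

  #include <pthread.h>
  #include <stdlib.h>

  struct shard { pthread_rwlock_t lock; };
  int evict_one(struct shard *s);   /* hypothetical helper: evicts at most one entry */

  /* returns non-zero if at least one entry could be evicted */
  int trash_oldest_sketch(struct shard *shards, int nb_shards)
  {
      int freed = 0, contended = 0;
      int shard = rand() % nb_shards;            /* start from a random shard */

      for (int i = 0; i < nb_shards; i++, shard = (shard + 1) % nb_shards) {
          if (pthread_rwlock_trywrlock(&shards[shard].lock) != 0) {
              /* a failed trylock ends the walk once something was already
               * freed or contention was already met once */
              if (freed || contended)
                  break;
              contended = 1;
              continue;
          }
          freed += evict_one(&shards[shard]);
          pthread_rwlock_unlock(&shards[shard].lock);
      }
      return freed;
  }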
The function now returns a success as long as one entry was freed.
With this, tests no longer show watchdog warnings during tests, though
a few still remain when stopping the tests (which are not related to
this function but to the contention from process_table_expire()).
With this change, under high contention some entries' purge might be
postponed and the table may occasionally contain slightly more entries
than its size (though this already happens since stksess_new() first
increments ->current before decrementing it).
Measures were made on a 64-core system with 8 peers
of 16 threads each, at CPU saturation (350k req/s each doing 10
track-sc) for 10M req, with 3 different approaches:
- this one resulted in 1500 failures to find an entry (0.015%
size overhead), with the lowest contention and the fairest
peers distribution.
- leaving only after a success resulted in 229 failures (0.0029%
size overhead) but doubled the time spent in the function (on
the write lock precisely).
- leaving only when both a success and a failed lock were met
resulted in 31 failures (0.00031% overhead) but the contention
was high enough again so that peers were not all up to date.
Considering that exceeding the configured number of entries by 0.015% on
a saturated machine is pretty minimal, the mechanism is kept.
This should be backported to 3.2 after a bit more testing as it
resolves some watchdog warnings and panics. It requires precedent
commit "MINOR: stick-table: permit stksess_new() to temporarily
allocate more entries" to over-allocate instead of failing in case
of contention.
stksess_new() calls stktable_trash_oldest() to release some entries.
If that fails, however, stksess_new() will fail to allocate an entry. This is a problem
because it doesn't permit stktable_trash_oldest() to be used in best effort
mode, which forces it to impose high contention. There's no problem with
allocating slightly more in practice. In the worst case if all entries are
in use, it's not shocking to temporarily exceed the number of entries by a
few units.
Let's relax this problematic rule. This patch might need to be backported
to 3.2 after a bit more testing in order to support locking relaxation.
The following functions take locks and are often involved in warnings
but are currently not resolved, so let's export them so that they are
properly decoded:
peer_prepare_updatemsg(), peer_send_teachmsgs(),
peer_treat_updatemsg(), peer_send_msgs(), peer_io_handler()
This should be backported to 3.2.
When task profiling is enabled, the current thread knows when the
currently running task was woken up and called, so we can calculate
how long ago it was woken up and called. This is convenient to figure
out whether a warning or panic is caused by this task or by a
previous one, so let's report this info in thread outputs when known.
It would be useful to backport this to 3.2.
When multiple similar warnings are emitted, it can be difficult to know
whether only one task is looping slowly or if many are sharing the CPU.
Let's report the number of context switches and polling loop turns in
thread dumps so that warnings are easier to understand.
This should be backported to 3.2.
Normally the connect loop cannot loop, but some recent traces can easily
convince one of the opposite. Let's add a counter, including in panic
dumps, in order to avoid the repeated long head scratching sessions
starting with "and what if...". In addition, if it's found to loop, this
time it will be certain and will indicate what to zoom in. This should
be backported to 3.2.
Warning and panic messages currently do not report the PID. This is
annoying when trying to reproduce problems because warnings do not
indicate which process to attach to in order to debug, and panics do
not make it possible to know which core dump corresponds to which panic
dump. Let's add the PID to both messages. This should probably be
backported at least to 3.2.
QUIC is now supported on the backend side. The previous commit ensures
that simple checks can be activated on QUIC servers without any issue.
The current patch ensures that check server settings remain compatible
with a QUIC server. Thus, the configuration is now invalid if the check
specifies an explicit MUX proto other than QUIC, disables SSL or tries to
use the PROXY protocol.
Previously, checks were only performed on TCP. However, QUIC is now
supported on the backend side. Prior to this patch, check activation for QUIC
servers would result in a crash.
To ensure compatibility between QUIC servers and checks, adjust
protocol_lookup() performed during check connect step. Instead of using
a hardcoded PROTO_TYPE_STREAM, the value is now derived from server
settings.
This does not need to be backported.
If no specific check settings are defined on a server line, it is
expected that these checks will be performed with the same parameters as
normal connections on the same server.
ALPN must be carefully taken into account for checks. Most notably, MUX
initialization is delayed so that it is performed only after SSL
handshake.
Prior to this patch, MUX init delay was only performed if ALPN was
defined via check settings. Thus, with the following settings, checks
would be performed on HTTP/1.1 without consulting ALPN negotiation
result from the server :
server s1 127.0.0.1:443 ssl crt <...> alpn h2 check
This bug may result in checks reporting failure, for example in case of
a server answering HTTP/2 to ALPN negotiation to the configuration
above. Besides, there is an inconsistency between normal and check
connections, which is not what the documentation specifies.
This patch fixes this code. Now server parameters are also taken into
account. This ensures that checks and normal connections by default
use the same connection method.
This must be backported up to 2.4.
To ensure ALPN is properly applied on checks, MUX initialization is
delayed so that it is created on SSL handshake completion. However, this
does not check if SSL is really active for the connection.
This patch adjusts the condition so that MUX init is not delayed if SSL
is not active for the check connection. A similar process is already
conducted for normal connections via connect_server().
This must be backported up to 2.4. Despite not being a bug, it must be
backported for the following patch which fixes check ALPN inheritance
from server settings.
HTTP/0.9 is available on top of QUIC. This protocol is reserved for
internal use, mostly for interop purposes.
This patch adjusts HTTP/0.9 layer with the following changes :
* version is not emitted anymore on the status line. This is done
because some servers do not parse it correctly.
* status line is set explicitly on the HTX status-line. This ensures the
correct HTTP status code is reported to the upper stream layer.
This does not need to be backported.
This patch relies on the previous one ("BUG/MEDIUM: mux-h2: Report RST/error to
app-layer stream during 0-copy fwding").
When the end of the connection is detected, so when the H2_CF_END_REACHED
flag is set after the shutdown was received and all incoming data were
processed, if a stream is blocked by the flow control (the stream one or the
connection one), an error must be reported to the app-layer stream.
Otherwise, outgoing data won't be sent and the opposite side will handle
this as a lack of room. So the stream will be blocked until the write
timeout is triggered. By reporting the error early, the stream can be
immediately closed.
This patch should be backported to 3.2. For older versions, it is probably a
good idea to wait for a bug report.
In h2_nego_ff(), it is important to report reset and error to app-layer
stream and to send the RST-STREAM frame accordingly. It is not clear if it
is an issue or not. But it is clearly a difference with the classical
forwarding via h2_snd_buf. And it is mandatory for the next fix.
This patch should be backported to 3.2. But it is probably a good idea to
not backport it on older versions, except if a bug is reported in this area.
This only happens when a connection error is detected or when the H2
connection is in ERR/ERR2 state. The demux buffer is explicitly reset. In
that case, it is important to remove the flag reporting this buffer as full.
It is probably worth backporting this patch to 3.2. But it is not mandatory
on older versions because it does not fix any known issue.
When the mbuf ring buffer is full, the flag H2_CF_DEM_MROOM is set on the H2
connection to block any demux. It is important to properly handle ACK
frames. However, we must take care to restart reading when some data were
removed from the mbuf. Otherwise, we may block the demux for no reason. It
is especially an issue if the demux buffer is full. In that case, the H2
connection is blocked, waiting for the timeout.
This patch should be backported to 3.2. But it is probably a good idea to
not backport it on older versions, except if a bug is reported in this area.
The H2 connection is switched to ERR when a GOAWAY must be sent and in ERR2
when it is sent. In these states, no more data can be emitted by the
mux. But there is no reason not to try to process incoming data or not
to try to receive data. It is especially important to be able to get the
shutdown from the TCP connection when an SSL connection was previously
detected. Otherwise, it is possible to block an H2 connection until its
timeout expires before being able to close it.
This patch should be backported to 3.2. But it is probably a good idea to
not backport it on older versions, except if a bug is reported in this
area.
When a send error is detected on the underlying connection, a pending error
is reported to the H2 connection by setting the H2_CF_ERR_PENDING flag. When
this happens, the tail of the mux ring buffer is reset. However some blocking
flags remain set and have no chance to be removed later because of the
pending error, especially the H2_CF_DEM_MROOM flag which blocks data
demultiplexing. Thus, it is possible to block an H2 connection with unparsed
incoming data.
Worse, if a read event is received, it could lead to a wakeup loop between
the H2 connection and the underlying SSL connection. The H2 connection is
unable to convert the pending error to a fatal error because the
demultiplexing is blocked. In the meantime, it tries to receive more data
because of the not-consumed read event. On the underlying connection side,
the error detected earlier blocks the read, but the H2 connection is woken
up to handle the error.
To fix the issue, the blocking flags, H2_CF_MUX_MFULL and H2_CF_DEM_MROOM,
must be removed when a send error is caught. But there is no reason to only
release the tail of the mbuf ring: when a send error is detected, all outgoing
data can be flushed. So, now, in h2_send(), the h2_release_mbuf() function is
called on a pending error. The mbuf ring is fully released and the
H2_CF_MUX_MFULL and H2_CF_DEM_MROOM flags are removed.
Many thanks to Krzysztof Kozłowski for his help in spotting this issue.
This patch could be backported at least as far as 2.8. But it is a bit
sensitive. So, it is probably a good idea to backport it to 3.2 for now and
wait for a bug report on older versions.
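For illustration, the fix described above roughly boils down to something
like the following in h2_send() (a simplified sketch, not the verbatim
patch):

  /* on a pending error, flush all outgoing data and drop the blocking
   * flags so that demultiplexing can resume and the pending error can
   * be turned into a fatal one */
  if (h2c->flags & H2_CF_ERR_PENDING) {
      h2_release_mbuf(h2c);
      h2c->flags &= ~(H2_CF_MUX_MFULL | H2_CF_DEM_MROOM);
  }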
Previously, GSO emission was explicitly disabled on the backend side. This
is no longer true since the following patch, thus GSO can be used, for
example when transferring large POST requests to an HTTP/3 backend.
commit e064e5d46171d32097a84b8f84ccc510a5c211db
MINOR: quic: duplicate GSO unsupp status from listener to conn
However, GSO on the backend side may cause a crash when handling EIO. In
this case, GSO must be completely disabled. Previously, this was
performed by flagging the listener instance. On the backend side, this
would cause a crash as the listener is NULL.
This patch fixes it by supporting a GSO disable flag for servers. Thus, in
qc_send_ppkts(), EIO can be converted either to a listener or a server
flag depending on the quic_conn proxy side. On the backend side, the server
instance is retrieved via <qc.conn.target>. This is enough to guarantee
that the server is not deleted.
This does not need to be backported.
Historically, when the purge of pools was forced by sending a SIGQUIT to
haproxy, information about the pools was first dumped. It is now totally
pointless because this info can be retrieved via the CLI. It is even less
relevant now because the purge is typically forced when there are memory
issues, and dumping pool information requires allocating data.
The dump_pools_info() function was simplified because it is now called only
from an applet. There is no reason to still try to dump info on stderr.
The "show pools" CLI command was not designed to dump information exceeding
the size of a buffer. But there is now much more pools than few years ago
and when detailed information are dumped, we exceeds the buffer limit and
the output is truncated.
To fix the issue, the command must be refactored to be able to stream the
result. To do so, the array containing pools info is now part of the command
context and it is dynamically allocated. A dedicated function was created to
fill all info. In addition, the index of the next pool to dump is saved in
the command context too to properly handle resumption cases. Finally global
information about pools are also stored in the command context for
convenience.
This patch should fix the issue #3067. It must be backported to 3.2. On
older release, the buffer limit is never reached.
First, the barrier to delay the client execution was moved before the client
definition. Otherwise, the connection is established too early and with
short timeouts it could be closed before the requests are sent.
The main purpose of the barrier was to work around slow health-checks. This
is also the reason why the script was flagged as slow. But it can be
significantly sped up by setting a low "inter" value. It is now set to
100ms and the script is no longer slow.
The below patch fixes padding emission for small packets, which is
required to ensure that header protection removal can be performed by
the recipient.
commit d7dea408c64c327cab6aebf4ccad93405b675565
BUG/MINOR: quic: too short PADDING frame for too short packets
In addition to the proper fix, constant QUIC_HP_SAMPLE_LEN was removed
and replaced by QUIC_TLS_TAG_LEN. However, it still makes sense to have
a dedicated constant which represents the size of the sample used for
header protection. Thus, this patch restores it.
Special instructions for backport : above patch mentions that no
backport is needed. However, this is incorrect, as bug is introduced by
another patch scheduled for backport up to 2.6. Thus, it is first
mandatory to schedule d7dea408c64c327cab6aebf4ccad93405b675565 after it.
Then, this patch can also be used for the sake of code clarity.
Define a new "quic_tx" unit-test which is used to test the QUIC TX module.
For the moment, a single test is performed on qc_do_build_pkt(). It
checks that PADDING is correctly added for HP sampling in case of a
small packet.
As reported in GH #3104, there remained a place where (1 << shift) was
used to set or remove bits from the uint64_t users bitfield. It is incorrect
and could lead to bugs for bit positions >= 32.
Instead, let's use 1ULL to ensure the operation remains 64-bit consistent.
No backport needed.
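A minimal C illustration of the difference (the bitfield name is just for
the example):

  #include <stdint.h>

  uint64_t users;

  void set_slot(int slot)
  {
      /* buggy: the constant 1 is an int, so shifting by 32 or more is
       * undefined and bits 32..63 can never be set:
       *   users |= (1 << slot);
       */

      /* fixed: promote the constant to 64 bits before shifting */
      users |= (1ULL << slot);
  }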
Willy reported that the following config would segfault right after the
"removing incomplete section 'peer'" warning is emitted:
peers peers
bind :2300
server n10 127.0.0.1:2310
listen dummy
bind localhost:9999
This is caused by the fact that stop_proxy(), which tries to read shared
counters, is called during early init while shared counters are not yet
initialized. To fix the crash, let's check if we're still in the starting
phase, in which case we assume the counters are not initialized and use a
value of 0 instead.
No backport needed unless 16eb0fab31 ("MAJOR: counters: dispatch counters
over thread groups") is.
Mimic the behavior of SSL/TCP connections to implement SSL session reuse.
Extract the code which tries to reuse the SSL session for SSL/TCP connections
into ssl_sock_srv_try_reuse_sess().
Call this function from the QUIC ->init() xprt callback (qc_conn_init()) as
is done for SSL/TCP connections.
When kTLS is compiled in, make sure msg_controllen is initialized to 0.
If we're not actually using kTLS, then it won't be set, but we'll check
later that it is non-zero to know if we have ancillary data.
This does not need to be backported.
This should fix CID 1620865, as reported in github issue #3106.
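For reference, a minimal sketch of the pattern described above (plain POSIX,
not the actual haproxy code):

  #include <string.h>
  #include <sys/socket.h>

  static void prepare_msg(struct msghdr *msg)
  {
      memset(msg, 0, sizeof(*msg));   /* msg_controllen now starts at 0 */
      /* the kTLS path would then fill msg_control/msg_controllen with a
       * cmsg; the non-kTLS path leaves them untouched, so a later
       * "if (msg->msg_controllen)" check reliably tells whether
       * ancillary data is present */
  }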
As found in GH issue #3103, CPU_ISSET() on musl 1.25 doesn't match the man
page which says it's returning an int. The reason is pretty simple, it's
a macro that operates on the bits directly and returns the result of the
bit field applied to the mask as an unsigned long. Bits above 31 will
simply be dropped if returned as an int, which causes CPUs 32..63 to
appear as absent from cpu_sets.
The fix is trivial, it consists in just comparing the result against zero
(i.e. turning it to a boolean), but before it's merged and deployed we'll
have to face such deployments, so better implement the same workaround
in the code here since we have access to the raw long value.
This workaround should be backported to 3.0.
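For illustration, the workaround amounts to forcing the result into a
boolean before it can be truncated (a sketch, not the exact haproxy code):

  #define _GNU_SOURCE
  #include <sched.h>

  static int cpu_is_present(int cpu, cpu_set_t *set)
  {
      /* on musl, CPU_ISSET() may evaluate to an unsigned long; storing it
       * directly into an int would drop bits above 31 (CPUs 32..63).
       * Turning it into a boolean first keeps the information. */
      return !!CPU_ISSET(cpu, set);
  }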
This bug arrived with this commit:
MINOR: quic: centralize padding for HP sampling on packet building
What was missed is the fact that, at the centralization point where the
PADDING frame is added for too short packets, the <len> payload length
already includes <*pn_len>, the packet number field length.
So when computing the length of the PADDING frame, the packet number field
length must not be added again to the payload length (<len>).
This bug led to too short PADDING frames being added to too short packets.
This was the case most of the time with Application level packets with a
1-byte packet number field followed by a 1-byte PING frame. A 1-byte PADDING
frame was added in this case in place of a correct 2-byte PADDING frame. The
packet header protection of such packets could not be removed by clients,
as for instance ngtcp2 with such traces:
I00001828 0x5a135c81e803f092c74bac64a85513b657 pkt could not decrypt packet number
As the header protection could not be removed, the header keyupdate bit could also
not be read by packet analyzers such as pyshark used during the keyupdate tests.
No need to backport.
The script reg-tests/ssl/ssl_sni_auto.vtc tests the automatic SNI selection
for regular server connections and for health-check ones. It relies on a
3.3-dev8 feature (in fact, it was pushed just after the dev8).
Similarly to the automatic SNI selection for regular SSL traffic, the SNI of
health-check HTTPS connections is now automatically set by default using
the host header value. The "check-sni-auto" and "no-check-sni-auto" server
settings were added to change this behavior.
Only implicit HTTPS health-checks can take advantage of this feature. In
this case, the host header value from the "option httpchk" directive is used
to extract the SNI. It is disabled if http-check rules are used; in that
case, the SNI must still be explicitly specified via an "http-check connect"
rule.
This patch should partially fix issue #3081.
For HTTPS outgoing connections, the SNI is now automatically set using the
Host header value if no other value is already set (via the "sni" server
keyword). This is now the default behavior. It can be disabled with the
"no-sni-auto" server keyword, and the "sni-auto" server keyword may
be used to reset any previous "no-sni-auto" setting. This option can be
inherited from "default-server" settings. Finally, if no connection name is
set via "pool-conn-name" setting, the selected value is used.
The automatic selection of the SNI is enabled by default for all outgoing
connections. But it is concretely used for HTTPS connections only. The
expression used is "req.hdr(host),host_only".
This patch should partially fix issue #3081. It only covers the server
part. Another patch will add the feature for HTTP health-checks.
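As an illustration of the resulting behavior (addresses and names are
placeholders, not from the original patch):

  backend be_https
      # SNI automatically derived from the Host header (new default)
      server s1 192.0.2.10:443 ssl verify none
      # an explicit "sni" expression still takes precedence
      server s2 192.0.2.11:443 ssl verify none sni str(static.example.org)
      # opt out of the automatic selection for this server only
      server s3 192.0.2.12:443 ssl verify none no-sni-auto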
When we try to reuse a connection to perform a health-check, the SNI, from
the tcpcheck connection or the health-check itself, must not be used as the
connection name for non-SSL connections.
This patch must be backported to 3.2.
There is no reason to set the SNI and ALPN for non-ssl connections. It is
not really an issue because the ssl_sock_set_servername() and
ssl_sock_set_alpn() functions will do nothing. But it is cleaner this way
and could avoid bugs in the future.
No backport needed, because there is no bug.
There is no reason to set the SNI for non-ssl connections. It is not really
an issue because the ssl_sock_set_servername() function will do nothing. But
there is no reason to uselessly evaluate an expression.
No backport needed, because there is no bug.
There is no reason to set the SNI for non-ssl connections. It is not really
an issue because the ssl_sock_set_servername() function will do nothing. But
there is no reason to uselessly evaluate an expression.
No backport needed, because there is no bug.
not all changes are concerned. But when SSL is enabled or disabled for a
server, the healthcheck xprt must eventually be updated too. This happens
when the healthcheck relies on the server settings.
In the same spirit, when the healthcheck address and port are updated, we
must fall back on the raw xprt if SSL is not explicitly enabled for the
healthcheck with a "check-ssl" parameter.
This patch should be backported to all stable versions.
By default, for a given server, when no pool-conn-name is specified, the
configured sni is used. However, this must only be done when SSL is in-use
for the server. Of course, it is uncommon to have a sni expression for a
non-ssl server. But this may happen.
In addition, the SSL may be disabled via the CLI. In that case, the
pool-conn-name must be discarded if it was copied from the sni. And, we must
of course take care to set it if the ssl is enabled.
Finally, when the attach-srv action is checked, we now check the
pool-conn-name expression.
This patch should be backported as far as 3.0. It relies on "MINOR: server:
Parse sni and pool-conn-name expressions in a dedicated function" which
should be backported too.
This change is mandatory to fix an issue. The parsing of sni and
pool-conn-name expressions (from string to expression) is now handled in a
dedicated function. This avoids duplicating the same code in different
places.
There is a typo in commit c51ddd5c3 ("MINOR: acl: Only allow one '-m'
matching method"). '*m' was reported in the error message instead of '-m'.
In addition, it is now mentioned that only the last one should be kept if
an old config triggers the error.
No backport needed, except if the commit above is backported.
Released version 3.3-dev8 with the following main changes :
- BUG/MEDIUM: mux-h2: fix crash on idle-ping due to unwanted ABORT_NOW
- BUG/MINOR: quic-be: missing Initial packet number space discarding
- BUG/MEDIUM: quic-be: crash after backend CID allocation failures
- BUG/MEDIUM: ssl: apply ssl-f-use on every "ssl" bind
- BUG/MAJOR: stream: Remove READ/WRITE events on channels after analysers eval
- MINOR: dns: dns_connect_nameserver: fix fd leak at error path
- BUG/MEDIUM: quic: reset padding when building GSO datagrams
- BUG/MINOR: quic: do not emit probe data if CONNECTION_CLOSE requested
- BUG/MAJOR: quic: fix INITIAL padding with probing packet only
- BUG/MINOR: quic: don't coalesce probing and ACK packet of same type
- MINOR: quic: centralize padding for HP sampling on packet building
- MINOR: http_ana: fix typo in http_res_get_intercept_rule
- BUG/MEDIUM: http_ana: handle yield for "stats http-request" evaluation
- MINOR: applet: Rely on applet flag to detect the new api
- MINOR: applet: Add function to test applet flags from the appctx
- MINOR: applet: Add a flag to know an applet is using HTX buffers
- MINOR: applet: Make some applet functions HTX aware
- MEDIUM: applet: Set .rcv_buf and .snd_buf functions on default ones if not set
- BUG/MEDIUM: mux-spop: Reject connection attempts from a non-spop frontend
- REGTESTS: jwt: create dynamically "cert.ecdsa.pem"
- BUG/MEDIUM: spoe: Improve error detection in SPOE applet on client abort
- MINOR: haproxy: abort config parsing on fatal errors for post parsing hooks
- MEDIUM: server: split srv_init() in srv_preinit() + srv_postinit()
- MINOR: proxy: handle shared listener counters preparation from proxy_postcheck()
- DOC: configuration: reword 'generate-certificates'
- BUG/MEDIUM: quic-be: avoid crashes when releasing Initial pktns
- BUG/MINOR: quic: reorder fragmented RX CRYPTO frames by their offsets
- MINOR: ssl: diagnostic warning when both 'default-crt' and 'strict-sni' are used
- MEDIUM: ssl: convert diag to warning for strict-sni + default-crt
- DOC: configuration: clarify 'default-crt' and implicit default certificates
- MINOR: quic: remove ->offset qf_crypto struct field
- BUG/MINOR: mux-quic: trace with non initialized qcc
- BUG/MINOR: acl: set arg_list->kw to aclkw->kw string literal if aclkw is found
- BUG/MEDIUM: mworker: fix startup and reload on macOS
- BUG/MINOR: connection: rearrange union list members
- BUG/MINOR: connection: remove extra session_unown_conn() on reverse
- MINOR: cli: display failure reason on wait command
- BUG/MINOR: server: decrement session idle_conns on del server
- BUG/MINOR: mux-quic: do not access conn after idle list insert
- MINOR: session: document explicitely that session_add_conn() is safe
- MINOR: session: uninline functions related to BE conns management
- MINOR: session: refactor alloc/lookup of sess_conns elements
- MEDIUM: session: protect sess conns list by idle_conns_lock
- MINOR: server: shard by thread sess_conns member
- MEDIUM: server: close new idle conns if server in maintenance
- MEDIUM: session: close new idle conns if server in maintenance
- MINOR: server: cleanup idle conns for server in maint already stopped
- MINOR: muxes: enforce thread-safety for private idle conns
- MEDIUM: conn/muxes/ssl: reinsert BE priv conn into sess on IO completion
- MEDIUM: conn/muxes/ssl: remove BE priv idle conn from sess on IO
- MEDIUM: mux-quic: enforce thread-safety of backend idle conns
- MAJOR: server: implement purging of private idle connections
- MEDIUM: session: account on server idle conns attached to session
- MAJOR: server: do not remove idle conns in del server
- BUILD: mworker: fix ignoring return value of ‘read’
- DOC: unreliable sockpair@ on macOS
- MINOR: muxes: adjust takeover with buf_wait interaction
- OPTIM: backend: set release on takeover for strict maxconn
- DOC: configuration: confuse "strict-mode" with "zero-warning"
- MINOR: doc: add missing statistics column
- MINOR: doc: add missing statistics column
- MINOR: stats: display new curr_sess_idle_conns server counter
- MINOR: proxy: extend "show servers conn" output
- MEDIUM: proxy: Reject some header names for 'http-send-name-header' directive
- BUG/BUILD: stats: fix build due to missing stat enum definition
- DOC: proxy-protocol: Make example for PP2_SUBTYPE_SSL_SIG_ALG accurate
- CLEANUP: quic: remove a useless CRYPTO frame variable assignment
- BUG/MEDIUM: quic: CRYPTO frame freeing without eb_delete()
- BUG/MAJOR: mux-quic: fix crash on reload during emission
- MINOR: conn/muxes/ssl: add ASSUME_NONNULL() prior to _srv_add_idle
- REG-TESTS: map_redirect: Don't use hdr_dom in ACLs with "-m end" matching method
- MINOR: acl: Only allow one '-m' matching method
- MINOR: acl; Warn when matching method based on a suffix is overwritten
- BUG/MEDIUM: server: Duplicate healthcheck's alpn inherited from default server
- BUG/MINOR: server: Duplicate healthcheck's sni inherited from default server
- BUG/MINOR: acl: Properly detect overwritten matching method
- BUG/MINOR: halog: Add OOM checks for calloc() in filter_count_srv_status() and filter_count_url()
- BUG/MINOR: log: Add OOM checks for calloc() and malloc() in logformat parser and dup_logger()
- BUG/MINOR: acl: Add OOM check for calloc() in smp_fetch_acl_parse()
- BUG/MINOR: cfgparse: Add OOM check for calloc() in cfg_parse_listen()
- BUG/MINOR: compression: Add OOM check for calloc() in parse_compression_options()
- BUG/MINOR: tools: Add OOM check for malloc() in indent_msg()
- BUG/MINOR: quic: ignore AGAIN ncbuf err when parsing CRYPTO frames
- MINOR: quic/flags: complete missing flags
- BUG/MINOR: quic: fix room check if padding requested
- BUG/MINOR: quic: fix padding issue on INITIAL retransmit
- BUG/MINOR: quic: pad Initial pkt with CONNECTION_CLOSE on client
- MEDIUM: quic: strengthen BUG_ON() for unpad Initial packet on client
- DOC: configuration: rework the jwt_verify keyword documentation
- BUG/MINOR: haproxy: be sure not to quit too early on soft stop
- BUILD: acl: silence a possible null deref warning in parse_acl_expr()
- MINOR: quic: Add more information about RX packets
- CI: fix syntax of Quic Interop pipelines
- MEDIUM: cfgparse: warn when using user/group when built statically
- BUG/MEDIUM: stick-tables: don't leave the expire loop with elements deleted
- BUG/MINOR: stick-tables: never leave used entries without expiration
- BUG/MEDIUM: peers: don't fail twice to grab the update lock
- MINOR: stick-tables: limit the number of visited nodes during expiration
- OPTIM: stick-tables: exit expiry faster when the update lock is held
- MINOR: counters: retrieve detailed errmsg upon failure with counters_{fe,be}_shared_prepare()
- MINOR: stats-file: introduce shm-stats-file directive
- MEDIUM: stats-file: processes share the same clock source from shm-stats-file
- MINOR: stats-file: add process slot management for shm stats file
- MEDIUM: stats-file/counters: store and preload stats counters as shm file objects
- DOC: config: document "shm-stats-file" directive
- OPTIM: stats-file: don't unnecessarily die hard on shm_stats_file_reuse_object()
- MINOR: compiler: add ALWAYS_PAD() macro
- BUILD: stats-file: fix aligment issues
- MINOR: stats-file: reserve some bytes in exported structs
- MEDIUM: stats-file: add some BUG_ON() guards to ensure exported structs are not changed by accident
- BUG/MINOR: check: ensure check-reuse is compatible with SSL
- BUG/MINOR: check: fix dst address when reusing a connection
- REGTESTS: explicitly use "balance roundrobin" where RR is needed
- MAJOR: backend: switch the default balancing algo to "random"
- BUG/MEDIUM: conn: fix UAF on connection after reversal on edge
- BUG/MINOR: connection: streamline conn detach from lists
- BUG/MEDIUM: quic-be: too early SSL_SESSION initialization
- BUG/MINOR: log: fix potential memory leak upon error in add_to_logformat_list()
- MEDIUM: init: always warn when running as root without being asked to
- MINOR: sample: Add base2 converter
- MINOR: version: add -vq, -vqb, and -vqs flags for concise version output
- BUILD: trace: silence a bogus build warning at -Og
- MINOR: trace: accept trace spec right after "-dt" on the command line
- BUILD: makefile: bump the default minimum linux version to 4.17
As explained during the 3.3-dev7 announcement below:
https://www.mail-archive.com/haproxy@formilux.org/msg46073.html
no regularly maintained distro supports a kernel older than 4.18 anymore,
and KTLS is supported since 4.17. So it's about the right moment to bump
the default minimum kernel version supported by glibc and musl to
automatically cover new features. The linux-glibc-legacy target still
supports 2.6.28 and above.
I continue to mistakenly set the traces using "-dtXXX" and to have to
refer to the doc to figure out that it requires a separate argument and
differs from some other options. Worse, "-dthelp" doesn't say anything
and silently ignores the argument.
Let's make the parser take whatever follows "-dt" as the argument if
present, otherwise take the next one (as it currently does). Doing
this even allows simplifying the code, and makes the syntax easier to
figure out since "-dthelp" now works.
gcc-13.3 at -Og emits an incorrect build warning in trace.c about a
possibly initialized variable:
In file included from include/haproxy/api.h:35,
from src/trace.c:22:
src/trace.c: In function 'trace_parse_cmd':
include/haproxy/bug.h:431:17: warning: 'arg' may be used uninitialized [-Wmaybe-uninitialized]
431 | free(*__x); \
| ^~~~~~~~~~
src/trace.c:1136:9: note: in expansion of macro 'ha_free'
1136 | ha_free(&oarg);
| ^~~~~~~
src/trace.c:1008:15: note: 'arg' was declared here
1008 | char *arg, *oarg;
| ^~~
The warning is obviously wrong since the variable is initialized in one of
the two branches of an "if" whose complementary one returns. But the
compiler doesn't seem to see this because the if is in fact two ifs each
with an opposite condition: "if (arg_src)" then "if (!arg_src)". Let's
just move upwards the default one that returns and eliminate the other
one. Reading the diff with "git diff -b" better shows the tiny change.
It could be backported to 3.0.
This patch introduces three new command line flags to display HAProxy version
info more flexibly:
- `-vqs` outputs the short version string without commit info (e.g., "3.3.1").
- `-vqb` outputs only the branch (major.minor) part of the version (e.g., "3.3").
- `-vq` outputs the full version string with suffixes (e.g., "3.3.1-dev5-1bb975-71").
This allows easier parsing of version info in automation while keeping existing -v and -vv behaviors.
The command line argument parsing now calls `display_version_plain()` with a
display_mode parameter to select the desired output format. The function handles
stripping of commit or patch info as needed, depending on the mode.
Signed-off-by: Nikita Kurashkin <nkurashkin@stsoft.ru>
This commit adds the base2 converter to turn binary input into its
string representation. Each input byte is converted into a series of
eight characters which are either 0 or 1, by bit-wise comparison.
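For instance, the single input byte 0x6C would produce the string
"01101100". A minimal C illustration of that rule, assuming
most-significant-bit-first output (this is not the HAProxy implementation
itself):

  #include <stdio.h>

  /* print each input byte as eight '0'/'1' characters */
  static void base2_print(const unsigned char *in, size_t len)
  {
      for (size_t i = 0; i < len; i++)
          for (int bit = 7; bit >= 0; bit--)
              putchar((in[i] >> bit) & 1 ? '1' : '0');
  }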
Like many exposed network daemons, haproxy does not normally need to run
as root and strongly recommends against this, unless strictly necessary.
On some operating systems, capabilities even totally alleviate this need.
Lately, maybe due to a rise of containerization or automated config
generation or a bit of both, we've observed a resurgence of this bad
practice, possibly because users are just not aware of the conditions
under which they're running their daemon.
Let's add a warning at boot when starting as root without having requested
it using "uid" or "user", and take this opportunity to warn the user
about the existence of capabilities when supported, and to encourage the
use of a chroot.
This is achieved by leaving global.uid set to -1 by default, allowing us
to detect if it was explicitly set or not.
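A minimal sketch of the resulting check (simplified and self-contained; the
configured_uid parameter stands in for global.uid, and the real message and
placement differ):

  #include <unistd.h>
  #include <stdio.h>

  /* conceptually: uid == -1 means "not configured by the user" */
  static void warn_if_root(int configured_uid)
  {
      if (geteuid() == 0 && configured_uid == -1)
          fprintf(stderr, "warning: running as root without 'uid'/'user'; "
                  "consider dropping privileges, capabilities or a chroot.\n");
  }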
As reported on GH #3099, upon a memory error add_to_logformat_list() will
return an error but fails to properly free memory which was allocated
within the function, which could result in a memory leak.
Let's free all relevant variables allocated by the function before returning.
No backport needed unless 22ac1f5ee ("BUG/MINOR: log: Add OOM checks for
calloc() and malloc() in logformat parser and dup_logger()") is.
When an SNI is set on a QUIC server line, ssl_sock_set_servername() is called
from connect_server() (backend.c). This leads to some BUG_ON() being triggered
because the CO_FL_WAIT_L6_CONN | CO_FL_SSL_WAIT_HS flags were not set. This
must be done in the ->init() xprt callback. This patch moves the flag settings
from the ->start() to the ->init() callback.
Indeed, connect_server() calls these functions in this order:
->init(),
ssl_sock_set_servername() # => crash if CO_FL_WAIT_L6_CONN | CO_FL_SSL_WAIT_HS not set
->start()
Furthermore, ssl_sock_set_servername() has a side effect of resetting the
SSL_SESSION object (attached to the SSL object) by calling SSL_set_session(),
leading to crashes as follows:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./haproxy -f quic_srv.cfg'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 tls_process_server_hello (s=0x560c259733b0, pkt=0x7fffac239f20)
at ssl/statem/statem_clnt.c:1624
1624 if (s->session->session_id_length > 0) {
[Current thread is 1 (Thread 0x7fc364e53dc0 (LWP 35514))]
(gdb) bt
#0 tls_process_server_hello (s=0x560c259733b0, pkt=0x7fffac239f20)
at ssl/statem/statem_clnt.c:1624
#1 0x00007fc36540fba4 in ossl_statem_client_process_message (s=0x560c259733b0,
pkt=0x7fffac239f20) at ssl/statem/statem_clnt.c:1042
#2 0x00007fc36540d028 in read_state_machine (s=0x560c259733b0) at ssl/statem/statem.c:646
#3 0x00007fc36540ca70 in state_machine (s=0x560c259733b0, server=0)
at ssl/statem/statem.c:439
#4 0x00007fc36540c576 in ossl_statem_connect (s=0x560c259733b0) at ssl/statem/statem.c:250
#5 0x00007fc3653f1698 in SSL_do_handshake (s=0x560c259733b0) at ssl/ssl_lib.c:3835
#6 0x0000560c22620327 in qc_ssl_do_hanshake (qc=qc@entry=0x560c25961f60,
ctx=ctx@entry=0x560c25963020) at src/quic_ssl.c:863
#7 0x0000560c226210be in qc_ssl_provide_quic_data (len=90, data=<optimized out>,
ctx=0x560c25963020, level=ssl_encryption_initial, ncbuf=0x560c2588bb18)
at src/quic_ssl.c:1071
#8 qc_ssl_provide_all_quic_data (qc=qc@entry=0x560c25961f60, ctx=0x560c25963020)
at src/quic_ssl.c:1123
#9 0x0000560c2260ca5f in quic_conn_io_cb (t=0x560c25962f80, context=0x560c25961f60,
state=<optimized out>) at src/quic_conn.c:791
#10 0x0000560c228255ed in run_tasks_from_lists (budgets=<optimized out>) at src/task.c:648
#11 0x0000560c22825f7a in process_runnable_tasks () at src/task.c:889
#12 0x0000560c22793dc7 in run_poll_loop () at src/haproxy.c:2836
#13 0x0000560c22794481 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3056
#14 0x0000560c2259082d in main (argc=<optimized out>, argv=<optimized out>)
at src/haproxy.c:3667
<s> is the SSL object, and <s->session> is the SSL_SESSION object.
For the client, it is the first call to SSL_do_handshake() which initializes this
SSL_SESSION object from the ->init() xprt callback. Then it is reset by
ssl_sock_set_servername(), then the tls_process_server_hello() TLS stack function
is called with a NULL value for s->session when receiving the ServerHello TLS
message.
To fix this, simply move the first call to SSL_do_handshake() to the ->start
xprt callback (qc_xprt_start()).
No need to backport.
Over their lifetime, connections are attached to different lists. These
lists depend on whether the connection is on the frontend or backend side.
Attach point members are stored via a union in struct connection. The
next commit reorganizes them so that a proper frontend/backend
separation is performed :
commit a96f1286a75246fef6db3e615fabdef1de927d83
BUG/MINOR: connection: rearrange union list members
On conn_free(), the connection instance must be removed from these lists to
ensure there is no use-after-free case. However the code was still shaky
there, despite no real issue. Indeed, <toremove_list> was detached for
all connections, despite being used on the backend side only.
This patch streamlines the freeing of connections. Now, <toremove_list>
detach is performed in conn_backend_deinit(). Moreover, a new helper
conn_frontend_deinit() is defined. It ensures that <stopping_list>
detach is done. Previously it was performed individually by muxes.
Note that a similar procedure is performed when the connection is
reversed. Hence, conn_frontend_deinit() is now used here as well,
rendering reversal from FE to BE or vice versa symmetrical.
As mentioned above, no crash occurred prior to this patch, but the code
was fragile, in particular access to <toremove_list> for frontend
connections. Thus this patch is considered a bug fix worthy of a
backport along with the above-mentioned patch, currently up to 3.0.
When a connection is reversed, some elements must be reset prior to
reusing it. Most notably, the connection must be removed from lists specific
to the frontend/backend sides.
When the reversal was performed from the frontend to the backend side, the
connection was not removed via its <stopping_list> attach point. On previous
releases, this did not cause any issue. However, crashes started to occur
recently, probably due to the recent reorganization of connection list attach
points from the following patch.
commit a96f1286a75246fef6db3e615fabdef1de927d83
BUG/MINOR: connection: rearrange union list members
To fix this, simply ensure that <stopping_list> detach is performed via
conn_reverse().
This patch must be backported up to 3.0 release.
For many years, an unset load balancing algorithm would use "roundrobin".
It was shown several times that "random" with at least 2 draws (the
default) generally provides better performance and fairness in that
it will automatically adapt to the server's load and capacity. This
was further described with numbers in this discussion:
https://www.mail-archive.com/haproxy@formilux.org/msg46011.html
https://github.com/orgs/haproxy/discussions/3042
BTW there was no objection and only support for the change.
The goal of this patch is to change the default algo when none is
specified, from "roundrobin" to "random". This way, users who don't
care and don't set the load balancing algorithm will benefit from a
better one in most cases, while those who have good reasons to prefer
roundrobin (for session affinity or for reproducible sequences like used
in regtests) can continue to specify it.
The vast majority of users should not notice a difference.
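Illustrative snippet: configurations that depend on the old default can
simply make it explicit (names and addresses are placeholders):

  backend app
      balance roundrobin        # keep the previous default explicitly
      server s1 192.0.2.10:80
      server s2 192.0.2.11:80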
A few tests explicitly rely on the server ordering granted by
"balance roundrobin", but didn't specify the balance algorithm.
As the default will change soon, let's make it explicit.
The keyword check-reuse-pool allows to reuse an idle connection to
perform a health check instead of opening a new one. It is implemented
similarly to HTTP transfer reuse : a hash is calculated with a subset of
properties to lookup a connection with the same characteristics.
One of these properties is the destination address. Initially it was
always set to NULL prior to reuse check, as this is necessary to match
connections on a reverse-HTTP server. However, this prevents reuse on
other servers with a proper address configured. Indeed, in this case
destination address is always used as key for connections inserted in
idle pool.
This patch fixes this by properly setting destination address for check
reuse. By default, it reuses the address from the server. The only
exception is if the server is using reverse-HTTP, in which case address
remains NULL.
A new test is also performed prior to trying check reuse, to ensure it is
not performed on a transparent server. Indeed, in this case the server
address would be unset. Anyway, a check cannot reuse a connection in this
case so this is OK. Note that this does not prevent the check from
continuing with a new connection with a NULL address: this should be handled
more properly in another patch.
This must be backported up to 3.2.
SSL may be activated implicitly if a server relies on SSL, even without
check-ssl keyword. This is performed by init_srv_check() function. The
main operation is to change xprt layer for check to SSL.
Prior to this patch, <use_ssl> check member was also set, despite not
strictly necessary. This has a negative side-effect of rendering
check-reuse-pool ineffective. Indeed, reuse on check is only performed
if no specific check configuration has been specified (see
tcpcheck_use_nondefault_connect()).
This patch fixes check reuse with SSL : <use_ssl> is not set in case SSL
is inherited implicitly from the server configuration. Thus, <use_ssl> is
now only set if an explicit check-ssl keyword is set, which disables
connection reuse for check.
This must be backported up to 3.2.
Add two BUG_ON() in shm_stats_file_prepare() which will trigger if
exported structures (shm_stats_file_hdr and shm_stats_file_object) change
in size, because it means that they will become incompatible with older
versions and thus precautions should be taken by the developer to ensure
compatibility with older versions, or at least detect incompatible
versions by changing the version number to prevent bugs resulting
from inconsistent mapping between versions. The BUG_ON() may be
safely adjusted then.
Please note that it doesn't protect against accidental struct member
re-ordering if the resulting struct size is equal.
We may need additional struct members in shm_stats_file_object and
shm_stats_file_hdr, yet since these structs are exported they should
not change in size nor ordering, else it would require a version change
to break compatibility on purpose since the mapping would differ.
Here we reserve 64 additional bytes in shm_stats_file_object, and
128 bytes in shm_stats_file_hdr for future usage.
Document some byte holes and fix some potential alignment issues
between 32- and 64-bit architectures to ensure the shm_stats_file memory
mapping is consistent between operating systems.
same as THREAD_PAD() but doesn't depend on haproxy being compiled with
thread support. It may be useful for memory (or files) that may be
shared between multiple processes.
shm_stats_file_reuse_object() has a non negligible cost, especially if
the shm file contains a lot of objects, because the function scans the
whole shm file to find available slots.
During startup, if no existing objects could be mapped in the shm file,
shm_stats_file_add_object() is called for each object (server, fe, be or
listener) with a GUID set. On large configs this means
shm_stats_file_add_object() could be called many times in a row.
With current implementation, each shm_stats_file_add_object() call
leverages shm_stats_file_reuse_object(), so the more objects are defined
in the config, the slower the startup will be.
To try to optimize startup time a bit with large configs, we don't
systematically call shm_stats_file_reuse_object(), especially when we
know that the previous attempt to reuse objects failed. In this case
we add a small tempo between failed attempts to reuse objects because
we assume the new attempt will probably fail anyway. (For slots to
become available, either an old process has to clean its entries,
or they have to time out which implies that the clock needs to be updated)
Add some documentation for "shm-stats-file" and
"shm-stats-file-max-objects" experimental directives related to the use
of shared memory for storing stats counters (see previous commits for
implementation details)
This is the last patch of the shm stats file series, in this patch we
implement the logic to store and fetch shm stats objects and associate
them to existing shared counters on the current process.
Shm objects are stored in the same memory location as the shm stats file
header. In fact they are stored right after it. All objects (struct
shm_stats_file_object) have the same size (no matter their type), which
allows for easy object traversal without having to check the object's
type, and could permit the use of external tools to scan the SHM in the
future. Each object stores a guid (of GUID_MAX_LEN+1 size) and tgid
which allows matching the corresponding shared counters indexes. Also,
as stated before, each object stores the list of users making use of
it. Objects are never released (the map can only grow), but when an
object has no more users, or no active users are found in object->users,
it is automatically recycled. Also, each object stores its type, which
defines how the object's generic data member should be handled.
Upon startup (or reload), haproxy first tries to scan existing shm to
find objects that could be associated to frontends, backends, listeners
or servers in the current config based on GUID. For associations that
couldn't be made, haproxy will automatically create missing objects in
the SHM during late startup. When haproxy matches with an existing object,
it means the counter from an older process is preserved in the new
process, so multiple processes temporarily share the same counter for as
long as required for older processes to eventually exit.
Now that all processes tied to the same shm stats file share a
common clock source, we introduce the process slot notion in this
patch.
Each living process registers itself in a map at a free index: each slot
stores information about the process' PID and heartbeat. Each process is
responsible for updating its heartbeat, a slot is considered as "free" if
the heartbeat was never set or if the heartbeat is expired (60 seconds of
inactivity). The total number of slots is set to 64, this is on purpose
because it allows to easily store the "users" of a given shm object using
a 64-bit bitmask. Given that when haproxy is reloaded older processes
are supposed to die eventually, it should be large enough (64 simultaneous
processes) to be safe. If we manage to reach this limit someday, more
slots could be added by splitting the "users" bitmask over multiple 64-bit
variables.
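A hedged sketch of the slot/bitmask idea described above (types, names and
the timeout handling are illustrative only, not the actual layout):

  #include <stdint.h>
  #include <sys/types.h>

  #define SHM_MAX_PROCS 64

  struct proc_slot {
      pid_t    pid;
      uint64_t heartbeat;   /* last activity; stale or zero means the slot is free */
  };

  /* each shm object records its users with one bit per process slot */
  static inline void object_add_user(uint64_t *users, int slot)
  {
      *users |= (1ULL << slot);
  }

  static inline void object_del_user(uint64_t *users, int slot)
  {
      *users &= ~(1ULL << slot);
  }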
The use of the "shm-stats-file" directive now implies that all processes
using the same file now share a common clock source, this is required
for consistency regarding time-related operations.
The clock source is stored in the shm stats file header.
When the directive is set, all processes share the same clock
(global_now_ms and global_now_ns both point to variables in the map),
this is required for time-based counters such as freq counters to work
consistently. Since all processes manipulate the global clock exclusively
with atomic operations during runtime, and don't systematically rely
on it (thanks to local now_ms and now_ns), it is pretty much transparent.
add initial support for the "shm-stats-file" directive and
associated "shm-stats-file-max-objects" directive. For now they are
flagged as experimental directives.
The shared memory file is automatically created by the first process.
The file is created using open() so it is up to the user to provide a
relevant path (either on a regular filesystem or a ramfs for performance
reasons). The directive takes only one argument, which is the path of the
shared memory file. It is passed as-is to open().
The maximum number of objects per thread-group (hard limit) that can be
stored in the shm is defined by the "shm-stats-file-max-objects" directive
and defaults to 2k, which means approximately 1MB max per thread group
and should cover most setups. Upon initial creation, the main shm stats
file header is provisioned with the version, which must remain the same
for processes to be compatible. When the limit is reached (during startup)
an error is reported by haproxy which invites the user to increase
"shm-stats-file-max-objects" if desired, but this means more memory will
be allocated. Actual memory usage is low at start, because only the mmap
(mapping) is provisioned for the maximum number of objects to avoid
relocating the memory area during runtime, while the actual shared memory
file is dynamically resized when objects are added (following a half
power of 2 curve, see upcoming commits).
For now only the file is created, further logic will be implemented in
upcoming commits.
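A rough POSIX sketch of that creation path under the stated assumptions
(sizes, names and error handling are illustrative, not the actual code):

  #include <stddef.h>
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static void *open_shm_stats_file(const char *path, size_t initial_size, size_t max_size)
  {
      /* the path is taken verbatim from the "shm-stats-file" directive */
      int fd = open(path, O_RDWR | O_CREAT, 0600);

      if (fd < 0)
          return MAP_FAILED;
      if (ftruncate(fd, initial_size) < 0) {
          close(fd);
          return MAP_FAILED;
      }
      /* reserve the mapping for the maximum object count up front so the
       * memory area never needs to be relocated; the file itself is only
       * grown later, as objects get added */
      return mmap(NULL, max_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  }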
counters_{fe,be}_shared_prepare now take an extra <errmsg> parameter
that contains additional hints about the error in case of failure.
It must be freed accordingly since it is allocated using memprintf().
It helps keep the contention level low: when we hold the update lock
that we know other parts may be relying on (peers, track-sc etc),
we decrease the remaining visit counters 4 times as fast to further
reduce the contention. At this point no more warnings are seen during
intense synchronization (2x64 cores, 1.5M req/s with a track-sc each,
5M entries in use).
As reported by Felipe in GH issue #3084, on large systems it's not
sufficient to leave the expiration process after a certain number of
expired entries, because if they accumulate too fast, it's possible
to still spend some time visiting many (e.g. those still in use),
which takes time.
Thus here we're taking a stricter approach consisting in counting the
number of visited entries, which allows us to leave early if we can't do
the expected work in a reasonable amount of time.
In order to avoid always stopping on first shards and never visiting
last ones, we're always starting from a random shard number and looping
from that one. This way even if we always leave early, all shards will
be handled equally.
This should be backported to 3.2.
When the expire task is running fast (i.e. running almost alone), it's
super hard to grab the update lock and peers can easily trigger the
watchdog because the time it takes to grab this lock is multiplied by
the number of updates to perform. This is easier to trigger at the end
of an injection session where the expire task is omni-present. Let's
just record that we failed once and don't fail a second time in the
loop.
This should be backported to 3.2, but probably not further given that
this area changed significantly in 3.2.
When trying to kill/expire entries, if a ref-counted entry is found,
let's requeue it with its expiration timer instead of leaving it out,
because other ref-counters (e.g. peers) will not purge it otherwise,
leaving it orphan. This one seems trickier to trigger, though it seems
to happen sometimes when peers are late and a long resync is active
and competing with intense calls to process_table_expire() (i.e. when
no other activity is there).
This must be backported to 3.2. It's likely that older versions are
affected as well, but possibly differently since the expiration
mechanism changed between 3.1 and 3.2, so better not take unneeded
risks there.
In 3.2, the table expiration latency was improved by commit 994cc58576
("MEDIUM: stick-tables: Limit the number of entries we expire"), however
it introduced an issue by which it's possible to leave the loop after a
certain number of elements were expired, without requeuing the deleted
elements. The issue it causes is that other places with a non-null ref_cnt
will not necessarily delete it themselves, resulting in orphan elements in
the table. These ones will then pollute it and force recycling old ones
more often which in turn results in an increase of the contention.
Let's check for the expiration counter before deleting the element so
that it can be found upon next visit.
This fix must be backported to 3.2. It is directly related to GH
issue #3084. Thanks to Felipe and Ricardo for sharing precious info
and testing a candidate fix.
In issue #3013, a user observed a crash at startup of haproxy when
building statically and using the "user" keyword in the global section.
This is a known problem of the glibc and the linker even warn about
this:
> warning: Using 'getgrnam' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
> warning: Using 'getpwnam' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
Let's emit a warning when using user/group in this case.
The fix in commit 441cd614f9 ("BUG/MINOR: acl: set arg_list->kw to
aclkw->kw string literal if aclkw is found") involves an unchecked
access to "al" after that one is tested for possibly being NULL. This
rightfully upsets Coverity (GH #3095) and might also trigger warnings
depending on the compilers. However, no known caller to date passes
a NULL arg list here so there's no way to trigger this theoretical
bug.
This should be backported along with the fix above to avoid emitting
warnings, possibly as far as 2.6 since that fix was tagged as such.
The fix in 4a9e3e102e ("BUG/MINOR: haproxy: only tid 0 must not sleep
if got signal") had the nasty side effect of breaking the graceful
reload operations: threads whose id is non-zero could quit too early and
not process incoming traffic, which is visible with broken connections
during reloads. They just need to ignore the stopping condition
until the signal queue is empty. In any case, it's the thread in charge
of the signal queue which will notify them once it receives the signal.
It was verified that connections are no longer broken with this fix,
and that the issue that required it (#2537, looping threads on reload)
does not re-appear with the reproducer, while it still did without the
fix above. Since the fix above was backported to every stable version,
this one will also have to.
Split the documentation into multiple sections:
- Explanation about what it does and how
- <alg> parameter with array of parameters
- <key> parameter with details about certificates and public keys
- Return value
Other changes:
- certificates do not need to be known during configuration parsing
- differences between public key and certificate
To avoid the anti-amplification limit, it is required that Initial packets
are padded to be at least 1.200 bytes long. On the server side, this only
applies to ack-eliciting packets. However, on the client side, this is
mandatory for every packet.
This patch adjusts the qc_txb_store() BUG_ON() statement used to catch too
small Initial packets. On the QUIC client side, the ack-eliciting flag is now
ignored, thus every packet is checked.
This is labelled as MEDIUM as this BUG_ON() is known to be easily
triggered, as the QUIC datagram encoding functions are complex. However,
it's important that a QUIC endpoint respects it, else the peer will drop
the invalid packet and could immediately close the connection.
Currently, when the connection is closing, only a CONNECTION_CLOSE frame is
emitted via qc_prep_pkts()/qc_do_build_pkt(). Also, only the first
registered encryption level is considered while the others are
dismissed. This results in a single packet datagram.
This can cause issues for QUIC client support, as padding is required
for every Initial packet, contrary to server side where only
ack-eliciting packets are eligible. Thus a client must add padding to a
CONNECTION_CLOSE frame on Initial level.
This patch adjusts qc_prep_pkts() to ensure such packet will be
correctly padded on the client side. It sets the <final_packet> variable,
which instructs that, if padding is necessary, it must be applied immediately
on the current encryption level instead of the last one.
It could appear unnecessary to pad a CONNECTION_CLOSE packet, as the
peer will enter the draining state when processing it. However, the RFC
mandates that a too small client Initial packet must be dropped by the
server, so there is a risk that the CONNECTION_CLOSE is simply discarded
prior to its processing if carried in a too small datagram.
No need to backport as this is a QUIC backend issue only.
On loss detection timer expiration, qc_dgrams_retransmit() is used to
reemit lost packets. Different code paths are present depending on the
active encryption level.
If Initial level is still initialized, retransmit is performed both for
Initial and Handshake spaces, by first retrieving the list of lost
frames for each of them.
Prior to this patch, the Handshake level was always registered for emission
after Initial, even if it did not have any frame to reemit. In this
case, most of the time it would result in a datagram containing a packet with
reemitted Initial frames coalesced with a Handshake packet consisting
only of a PADDING frame. This is because padding is only added for the
last registered QEL.
For QUIC backend support, this may cause issues. This is because,
contrary to the QUIC server side, Initial and Handshake level keys are not
derived simultaneously for a QUIC client. Thus, if the latter keys are
unavailable, the Handshake packet cannot be encoded on sending, leaving a
single Initial packet. However, it is then too late to add PADDING.
Thus the resulting datagram is invalid: this triggers the BUG_ON()
assert failure located in qc_txb_store().
This patch fixes this by amending qc_dgrams_retransmit(). Now, Handshake
level is only registered for emission if there are frames to retransmit,
which implies that Handshake keys are already available. Thus, PADDING
will now either be added at Initial or Handshake level as expected.
Note that this issue should not be present on the QUIC frontend side, as
Initial and Handshake keys are derived almost simultaneously there. However,
this should still be backported up to 3.0.
qc_prep_pkts() activates padding when building an Initial packet. This
ensures that the resulting datagram will always be at least 1.200 bytes,
which is mandatory to prevent a deadlock over anti-amplification.
Prior to padding activation, a check is performed to ensure that the output
buffer is big enough for a padded datagram. However, this did not take
into account previously built packets which would be coalesced in the
same datagram. Thus this patch fixes this comparison check.
In theory, prior to this patch, in some cases Initial packets could not
be built despite a datagram of the proper size. Currently, this probably
never happens as Initial packet is always the first encoded in a
datagram, thus there is no coalesced packet prior to it. However, there
is no hard requirement on this, so it's better to reflect this in the
code.
This should be backported up to 2.6.
This fix follows this previous one:
BUG/MINOR: quic: reorder fragmented RX CRYPTO frames by their offsets
which is not sufficient when a client fragments and mixes its CRYPTO frames AND
leaves holes between packets. ngtcp2 (and perhaps chrome) splits its CRYPTO
frames but without holes per packet. In such a case, the CRYPTO parsing leads to
QUIC_RX_RET_FRM_AGAIN errors which cannot be fixed when the peer resends its packets.
Indeed, even if the peer resends its frames in a different order, this does not
help because, since the previous commit, the CRYPTO frames are ordered on haproxy's side.
This issue was detected thanks to the interop tests with quic-go as client. This
client fragments its CRYPTO frames, mixes them and generates holes, most of
the time with the retry test.
To fix this, when a QUIC_RX_RET_FRM_AGAIN error is encountered, the CRYPTO frame
parsing is not stopped. This leaves a chance for the next CRYPTO frames to be parsed.
Must be backported as far as 2.6 as the commit mentioned above.
This patch adds a missing out-of-memory (OOM) check after
the call to `malloc()` in `indent_msg()`. If memory
allocation fails, the function returns NULL to prevent
undefined behavior.
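As a hypothetical illustration of the pattern (not the actual indent_msg()
code, the function and names are invented for the example):

    #include <stdlib.h>
    #include <string.h>

    static char *dup_indented(const char *in, const char *prefix)
    {
        char *out = malloc(strlen(prefix) + strlen(in) + 1);

        if (!out)
            return NULL;  /* allocation failed: report it to the caller */

        strcpy(out, prefix);
        strcat(out, in);
        return out;
    }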
Co-authored-by: Christian Norbert Menges <christian.norbert.menges@sap.com>
This patch adds a missing out-of-memory (OOM) check after
the call to `calloc()` in `parse_compression_options()`. If
memory allocation fails, an error message is set, the function
returns -1, and parsing is aborted to ensure safe handling
of low-memory conditions.
Co-authored-by: Christian Norbert Menges <christian.norbert.menges@sap.com>
This commit adds a missing out-of-memory (OOM) check
after the call to `calloc()` in `cfg_parse_listen()`.
If memory allocation fails, an alert is logged, error
codes are set, and parsing is aborted to prevent
undefined behavior.
Co-authored-by: Christian Norbert Menges <christian.norbert.menges@sap.com>
This patch adds a missing out-of-memory (OOM) check after
the call to `calloc()` in `smp_fetch_acl_parse()`. If
memory allocation fails, an error message is set and
the function returns 0, improving robustness in
low-memory situations.
Co-authored-by: Christian Norbert Menges <christian.norbert.menges@sap.com>
This patch adds missing out-of-memory (OOM) checks after calls
to `calloc()` and `malloc()` in the logformat parser and the
`dup_logger()` function. If memory allocation fails, an error
is reported or NULL is returned, preventing undefined behavior
in low-memory conditions.
Co-authored-by: Christian Norbert Menges <christian.norbert.menges@sap.com>
This patch adds missing out-of-memory (OOM) checks after calls to
calloc() in the functions `filter_count_srv_status()` and `filter_count_url()`.
If memory allocation fails, an error message is printed to stderr
and the process exits with status 1. This improves robustness
and prevents undefined behavior in low-memory situations.
Co-authored-by: Christian Norbert Menges <christian.norbert.menges@sap.com>
A bug was introduced by the commit 6ea50ba46 ("MINOR: acl; Warn when
matching method based on a suffix is overwritten"). The test on the match
function, when defined, was not correct. It is now fixed.
No backport needed, except if the commit above is backported.
It is not really an issue, but the "check-sni" value inherited from a default
server is not duplicated, while the parameter value is duplicated during the
parsing. So here there is a small leak if several "check-sni" parameters are
used on the same server line: the previous value is never released. To
fix this issue, the value inherited from the default server must also be
duplicated. In the end it is safer this way and consistent with the parsing
of the "sni" parameter.
It is harmless so there is no reason to backport this patch.
When "check-alpn" parameter is inherited from the default server, the value
is not duplicated, the pointer of the default server is used. However, when
this parameter is overridden, the old value is released. So the "check-alpn"
value of the default server is released. So it is possible to have a UAF if
if another server inherit from the same the default server.
To fix the issue, the "check-alpn" parameter must be handled the same way
the "alpn" is. The default value is duplicated. So it could be safely
released if it is forced on the server line.
This patch should fix the issue #3096. It must be backported to all stable
versions.
From time to time, issues are reported about string matching based on suffix
(for instance path_beg). Each time, it appears these ACLs are used in
conjunction with a converter or followed by an explicit matching method
(-m).
Unfortunately, it is not an issue but an expected behavior, while it is not
obvious. Matching suffixes can be considered as aliases of the corresponding
'-m' matching method. Thus "path_beg" is equivalent to "path -m beg". When a
converter is used, the original matching method (string) is used and the suffix
is lost. When followed by an explicit matching method, it overwrites the
matching method based on the suffix.
It is expected but confusing. Thus a warning is now emitted because it is a
configuration issue for sure. The following sample fetch functions are concerned:
* base
* path
* req.cook
* req.hdr
* res.hdr
* url
* urlp
The configuration manual was modified to make it less ambiguous.
Several explicit '-m' matching methods were allowed, but only the last one was
really used. There is no reason to specify several matching methods and it is
most probably an error or a lack of understanding of how matching is
performed. So now, an error is triggered during configuration parsing to
avoid any bad usage.
hdr_dom() is an alias of "hdr() -m dom". So using it with another explicit
matching method does not work because the matching on the domain will never
be performed. Only the last matching method is used. The script was working
by chance because no port was set on the Host header values.
The script was fixed by using the "host_only" converter. In addition, Host
header values were changed to now include a port.
When manipulating idle backend connections for input/output processing,
special care is taken to ensure the connection cannot be accessed by
another thread, for example via a takeover. When processing is over,
connection is reinserted in its original list.
A connection can either be attached to a session (private ones) or a
server idle tree. In the latter case, <srv> is guaranteed to be non null
prior to _srv_add_idle() thanks to CO_FL_LIST_MASK comparison with conn
flags. This patch adds an ASSUME_NONNULL() to better reflect this.
This should fix coverity reports from github issue #3095.
MUX QUIC restricts buffer allocation per connection based on the
underlying congestion window. If a QCS instance cannot allocate a new
buffer, it is put in a buf_wait list. Typically, this will cause stream
upper layer to subscribe for sending.
A BUG_ON() was present on snd_buf and nego_ff callback prologue to
ensure that these functions were not called if QCS is already in
buf_wait list. The objective was to guarantee that there is no wake up
on a stream if it cannot allocate a buffer.
However, this BUG_ON() is not correct, as it can be triggered legitimately.
Indeed, the stream layer can retry emission even if no wake-up occurred. This
case can happen on reload. Thus, the BUG_ON() will cause an unexpected
crash.
Fix this by removing these BUG_ON(). Instead, the snd_buf/nego_ff callbacks
ensure that the QCS is not subscribed in the buf_wait list. If it is,
a null value is returned, which is sufficient for the stream layer
to pause emission and subscribe if necessary.
Occurrences of this crash have been reported on the mailing list. It is
also the subject of github issue #3080, which should be fixed with this
patch.
This must be backported up to 3.0.
Since this commit:
BUG/MINOR: quic: reorder fragmented RX CRYPTO frames by their offsets
when they are parsed, the CRYPTO frames are ordered by their offsets into an ebtree.
Then their data are provided to the ncbufs.
But in case of error, when qc_handle_crypto_frm() returns QUIC_RX_RET_FRM_FATAL
or QUIC_RX_RET_FRM_AGAIN, they remain attached to their tree. Then,
from the <err> label, they are removed from the tree and freed (with a
while(node) { eb_delete(); qc_frm_free(); } loop). But before this loop,
these statements directly free the frame without deleting it from its tree,
if this is a CRYPTO frame, leading to a use-after-free when running the loop:
if (frm)
qc_frm_free(qc, &frm);
This issue was detected by the interop tests, with quic-go as client. Weirdly, this
client sends CRYPTO frames per packet with holes.
Must be backported as far as 2.6 as the commit mentioned above.
This modification should have arrived with this commit:
MINOR: quic: remove ->offset qf_crypto struct field
Since this commit, the CRYPTO offset node key assignment is done at parsing time
when calling qc_parse_frm() from qc_parse_pkt_frms().
This useless assignment has been reported in GH #3095 by Coverity.
This patch should be easily backported as far as 2.6 as the one mentioned above
to ease any further backport to come.
The docs call out that this field is the algorithm used to
sign the certificate. However, the example only had the hash portion of
the signature algorithm. This change updates the example to be accurate
based on a value written by HAProxy, which is based on an OID for
signature algorithms. I based the example on a real TLV written by
HAProxy on my machine with all SSL TLVs enabled in the config.
Recently, a new server counter for private idle connections has been
added to the statistics output. However, the patch was missing
ST_I_PX_PRIV_IDLE_CUR enum definition.
No need to backport.
From time to time, we saw the 'http-send-name-header' directive used to
overwrite the Host header to work around limitations of a buggy application.
Most of the time, this led to trouble. This was never officially supported and
each time we strongly discouraged anyone from doing so. We already thought
about deprecating this directive, but it seems to still be used by a few
people. So for now, we decided to strengthen the checks performed on it.
The header name is now checked during configuration parsing to forbid
some risky names. The 'Host', 'Content-Length', 'Transfer-Encoding' and
'Connection' header names are now rejected. But more headers could be added
in the future.
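A minimal sketch of such a parse-time check (the helper name and the exact
list handling are illustrative, not the actual parser code):

    #include <strings.h>

    static int send_name_header_allowed(const char *name)
    {
        static const char *forbidden[] = {
            "host", "content-length", "transfer-encoding", "connection", NULL
        };
        int i;

        for (i = 0; forbidden[i]; i++)
            if (strcasecmp(name, forbidden[i]) == 0)
                return 0;  /* risky header name: reject at parse time */
        return 1;
    }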
CLI command "show servers conn" is used as a debugging tool to monitor
the number of connections per server. This patch extends its output by
adding the content of two server counters.
<served> is the first added column. It represents the number of active
streams on a server. <curr_sess_idle_conns> is the second added column.
This is a recently added value which accounts for private idle connections
referencing a server.
Add a new stats column in proxy stats to display server counter for
private idle connections. This counter has been introduced recently.
The value is displayed on CSV output on the last column before modules.
It is also displayed on the HTML page alongside other idle server counters.
4b10302fd8 ("MINOR: cfgparse: implement a simple if/elif/else/endif
macro block handler") introduces a confusion between "strict-mode" and
"zero-warning".
This patch fixes the issue by replacing "strict-mode" with "zero-warning"
in section 2.4 (Conditional blocks).
Must be backported as far as 2.4.
When strict maxconn is enforced on a server, it may be necessary to kill
an idle connection to never exceed the limit. To be able to delete a
connection from any thread, takeover is first used to migrate it to the
current thread prior to its deletion.
As takeover is performed to delete a connection instead of reusing it,
<release> argument can be set to true. This removes unnecessary
allocations of resources prior to connection deletion. As such, this
patch is a small optimization for strict maxconn implementation.
Note that this patch depends on the previous one which removes any
assumption in takeover implementation that thread isolation is active if
<release> is true.
The takeover operation defines an argument <release>. It's a boolean which,
if set, indicates that connection resources freed during the takeover do
not have to be reallocated on the new thread. Typically, it is set to
false when takeover is performed to reuse a connection. However, when
used to be able to delete a connection from a different thread,
<release> should be set to true.
Previously, <release> was only set in conjunction with the "del server"
handler. This operation was performed under thread isolation, which
guarantees that non thread-safe operations, such as removal from the buf_wait
list, could be performed on takeover if <release> was true. Otherwise,
the takeover operation would fail.
Recently, the "del server" handler has been adjusted to remove idle
connection cleanup via takeover. As such, <release> is never set to
true in the remaining takeover usage.
However, takeover is also used to enforce strict-maxconn on a server.
This is performed to delete a connection from any thread, which is the
primary reason for setting <release> to true. But for the moment, as takeover
implementations consider that thread isolation is active if <release> is
set, this is not yet applicable to the strict-maxconn usage.
Thus, the purpose of this patch is to adjust the takeover implementation and
remove the assumption tying <release> to thread-isolation mode. It's no longer
possible to remove a connection from a buf_wait list during takeover: an
error is returned in any such case.
We discovered that the sockpair@ protocol is unreliable on macOS; this is
the same problem that we fixed in d7f6819. But it's not possible to
implement an acknowledgment once the sockets are in non-blocking mode.
The problem was discovered in issue #3045.
Must be backported to every stable version.
Fix the unused result warning on read()'s return value.
src/haproxy.c: In function ‘main’:
src/haproxy.c:3630:17: error: ignoring return value of ‘read’ declared with attribute ‘warn_unused_result’ [-Werror=unused-result]
3630 | read(sock_pair[1], &c, 1);
| ^~~~~~~~~~~~~~~~~~~~~~~~~
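One way to silence it, shown here as an illustrative sketch rather than the
exact fix, is simply to test the return value:

    #include <stdio.h>
    #include <unistd.h>

    static void wait_for_ack(int fd)
    {
        char c;

        if (read(fd, &c, 1) < 0)  /* don't ignore the result anymore */
            perror("read");
    }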
Must be backported where d7f6819 is backported.
Do not remove idle and purgeable connections directly under the "del server"
handler anymore. The main objective of this patch is to reduce the
amount of work performed under thread isolation. This should improve
"del server" scheduling with other haproxy tasks.
Another objective is to be able to properly support dynamic servers with
QUIC. Indeed, takeover is not yet implemented for this protocol, hence
it is not possible to rely on cleanup of idle connections performed by a
single thread under "del server" handler.
With this change it is not possible anymore to remove a server if there
are still idle connections referencing it. To ensure this,
srv_check_for_deletion() has been extended to check the server
counters for idle and idle private connections.
Server deletion should still remain a viable procedure, as first it is
mandatory to put the targeted server into maintenance. This step forces
the cleanup of its existing idle connections. Thanks to a recent change,
all finishing connections are also removed immediately instead of
becoming idle. In short, this patch transforms idle connections removal
from a synchronous to an asynchronous procedure. However, this should
remain a reliable and quick procedure achievable in less than a second.
This patch is considered major as some users may notice this change when
removing a server. In particular with the following CLI commands
pipeline:
"disable server <X>; shutdown sessions server <X>; del server <X>"
Server deletion will now probably fail, as the idle connections purge cannot
be completed immediately. Thus, it is now highly advised to always issue a
short "wait srv-removable" before "del server" to ensure that the idle
connections purge is executed first.
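For example, the previous pipeline would then become (the delay value is only
illustrative):
"disable server <X>; shutdown sessions server <X>;
 wait 2s srv-removable <X>; del server <X>"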
Along with this change, documentation for "del server" and related
"shutdown sessions server" has been refined, in particular to better
highlight under what conditions a server can be removed.
This patch adds a new member <curr_sess_idle_conns> on the server. It
serves as a counter of idle connections attached to a session instead of
the regular idle/safe trees. This is used only for private connections.
The objective is to provide a method to detect if there is idle
connections still referencing a server.
This will be particularly useful to ensure that a server is removable.
Currently, this is not yet necessary as idle connections are directly
freed via "del server" handler under thread isolation. However, this
procedure will be replaced by an asynchronous mechanism outside of
thread isolation.
Careful: connections attached to a session but not idle will not be
accounted by this counter. These connections can still be detected via
srv_has_streams() so "del server" will be safe.
This counter is maintained during the whole lifetime of a private
connection. This is mandatory to guarantee "del server" safety and is
consistent with other idle server counters. What this means is that the
decrement is performed only when the connection transitions from idle to
in use, or just prior to its deletion. For the first case, this is
covered by session_get_conn(). The second case is trickier. It cannot be
done via session_unown_conn() as a private connection may still live a
little longer after its removal from session, most notably when
scheduled for idle purging.
Thus, conn_free() has been adjusted to handle the final decrement. Now,
conn_backend_deinit() is also called for private connections if
CO_FL_SESS_IDLE flag is present. This results in a call to
srv_release_conn() which is responsible to decrement server idle
counters.
When a server goes into maintenance, or if its IP address is changed,
idle connections attached to it are scheduled for deletion via the purge
mechanism. Connections are moved from server idle/safe list to the purge
list relative to their thread. Connections are freed on their owning
thread by the scheduled purge task.
This patch extends this procedure to also handle private idle
connections stored in sessions instead of servers. This is possible
thanks to the <sess_conns> server list member. A call to the newly
defined function session_purge_conns() is performed on each list
element. This moves private connections from their session to the purge
list alongside other server idle connections.
This change relies on the series of previous commits which ensure that
access to private idle connections is now thread-safe, with idle_conns
lock usage and careful manipulation of private idle conns in
input/output handlers.
The main benefit of this patch is that now all idle connections
targeting a server set in maintenance are removed. Previously, private
connections would remain until their attached sessions were closed.
Complete QUIC MUX for the backend side. Ensure that accesses to idle
connections are performed in a thread-safe way. Even if takeover is not yet
implemented for this protocol, it is at least necessary to ensure that
there won't be any issue with idle connections purging mechanism.
This change will also be necessary to ensure that QUIC servers can
safely be removed via CLI "del server". This is not yet sufficient as
currently server deletion still relies on takeover for idle connections
removal. However, this will be adjusted in a future patch to instead use
idle connections standard purging mechanism.
This is a direct follow-up of previous patch which adjust idle private
connections access via input/output handlers.
This patch implements the handlers' prologue part. Now, private idle
connections require a treatment similar to non-private idle
connections. Thus, private conns are temporarily removed from their
session under the protection of the idle_conns lock.
As locking usage is already performed in input/output handler,
session_unown_conn() cannot be called. Thus, a new function
session_detach_idle_conn() is implemented in session module, which
performs basically the same operation but relies on external locking.
When dealing with input/output on a connection related handler, special
care must be taken prior to accessing the connection if it is considered
idle, as it could be manipulated by another thread. Thus, the connection is
first removed from its idle tree before processing. The connection is
reinserted on processing completion unless it has been freed in the meantime.
Idle private connections are not concerned by this, because takeover is
not applied on them. However, a future patch will implement purging of
these connections along with regular idle ones. As such, it is necessary
to also protect private connections usage now. This is the subject of
this patch and the next one.
With this patch, the input/output handler epilogues of
muxes/SSL/conn_notify_mux() are adjusted. A new code path is able to
deal with a connection attached to a session instead of a server. In
this case, session_reinsert_idle_conn() is used. Contrary to
session_add_conn(), this new function is reserved for idle connections
usage after a temporary removal.
Contrary to _srv_add_idle() used by regular idle connections,
session_reinsert_idle_conn() may fail as an allocation can be required.
If this happens, the connection is immediately destroyed.
This patch has no effect for now. It must be coupled with the next one
which will temporarily remove private idle connections on input/output
handler prologue.
When a backend connection becomes idle, muxes must activate some
protection to mark future accesses to it as dangerous. Indeed, once a
connection is inserted in an idle list, it may be manipulated by another
thread, either via takeover or scheduled for purging.
Private idle connections are stored into a session instead of the server
tree. They are never subject to a takeover for reuse or purge mechanism.
As such, currently they do not require the same level of protection.
However, a new patch will introduce support for private idle connections
purging. Thus, the purpose of this patch is to ensure protection is
activated as well now.
TASK_F_USR1 was already set on them in anticipation of such a need.
Only some extra operations were missing, most notably the xprt_set_idle()
invocation. Also, the return path of the muxes' detach operation is adjusted
to ensure such connections are never accessed after insertion.
When a server goes into maintenance mode, its idle connections are
scheduled for an immediate purge. However, this is not the case if the
server is already in stopped state, for example due to a health check
failure.
Adjust _srv_update_status_adm() to ensure that idle connections are
always scheduled for purge when going into maintenance in both cases.
The main advantage of this patch is to ensure consistent behavior for
server maintenance mode.
Note that it will also become necessary as server deletion will be
adjusted with a future patch. Idle connection closure won't be performed
by "del server" handler anymore, so it's important to ensure that a full
cleanup is always performed prior to executing it, else the server may
not be removable during a certain delay.
Previous patch ensures that a backend connection going into idle state
is rejected and freed if its target server is in maintenance.
This patch introduces a similar change for connections attached in the
session. session_check_idle_conn() now returns an error if the connection's
target server is in maintenance, similarly to when the session max idle conns
limit is reached. This is sufficient to instruct muxes to delete the
connection immediately.
Currently, when a server is set in maintenance mode, its idle connections
are scheduled for purge. However, this does not prevent currently used
connections from becoming idle later on, even if the server is still off.
Change this behavior: an idle connection is now rejected by the server
if it is in maintenance. This is implemented with a new condition in
srv_add_to_idle_list() which returns an error value. In this case, the muxes'
stream detach callback will immediately free the connection.
A similar change is also performed in each MUX and SSL I/O handler and
in conn_notify_mux(). An idle connection is not reinserted in its idle
list if the server is in maintenance; instead it is immediately freed.
Server member <sess_conns> is a mt_list which contains every backend
connection attached to a session which targets this server. These
connections are not present in the idle server trees.
The main utility of this list is to be able to clean up these connections
prior to removing a server via the "del server" CLI command. However, this
procedure will be adjusted by a future patch. As such, the <sess_conns> member
must be moved into the srv_per_thread struct. Effectively, this duplicates the
list for every thread.
This commit does not introduce any functional change. Its goal is to ensure
that these connections are now ordered by their owning thread, which
will allow implementing a purge, similarly to idle connections attached
to servers.
Introduce idle_conns_lock usage to protect manipulation to <priv_conns>
session member. This represents a list of intermediary elements used to
store backend connections attached to a session to prevent their sharing
across multiple clients.
Currently, this patch is unneeded as sessions are only manipulated on a
single thread. Indeed, contrary to idle connections stored in servers,
takeover is not implemented for connections attached to a session.
However, a future patch will introduce purging of these connections,
which is already performed for connections attached to servers. As this
can be executed by any thread, it is necessary to introduce
idle_conns_lock usage to protect their manipulation.
By default backend connections are stored into idle/avail server trees.
However, if such connections cannot be shared between multiple clients,
session serves as the alternative storage.
To be able to quickly reuse a backend conn from a session, they are
indexed by their target, which is either a server or a backend proxy.
This is the purpose of the 'struct sess_priv_conns' intermediary storage
element.
Lookup and allocation of these elements are performed in several session
functions, for example to add, get or remove a backend connection from a
session. The purpose of this patch is to simplify this by providing two
internal functions sess_alloc_sess_conns() and sess_get_sess_conns().
Along with this, a new BUG_ON() is added into session_unown_conn(),
which ensures that the sess_priv_conns element is found when the connection
is removed from the session.
Move from header to source file functions related to session management
of backend connections. These functions are big enough to remove inline
attribute.
A set of recent patches have simplified management of backend connection
attached to sessions. The API is now stricter to prevent any misuse.
One of these changes is the addition of a BUG_ON() in session_add_conn(),
which ensures that a connection is not attached to a session if its
<owner> field points to another entry.
On older haproxy releases, this assertion could not be enforced due to
NTLM, as a connection is turned private during its transfer. When
using a true multiplexed protocol on the backend side, the connection
could be assigned in turn to several sessions. However, NTLM is now only
applied for HTTP/1.1 as it does not make sense if the connection is
already shared.
To better clarify this situation, extend the comment on BUG_ON() inside
session_add_conn().
Once a connection is inserted into the server idle/safe tree during
stream detach, it is not accessed anymore by the muxes without
idle_conns_lock protection. This is because the connection could have
been already stolen by a takeover operation.
Adjust the QUIC MUX detach implementation to follow the same pattern. Note
that no bug can occur due to takeover, as QUIC does not implement it.
However, prior to this patch, there may still exist race-conditions with
idle connection purging.
No backport needed.
When a server is deleted, each of its idle connections is removed. This
is also performed for every private connection stored on sessions which
reference the target server.
As mentioned above, these private connections are idle, which is guaranteed
by srv_check_for_deletion(). A BUG_ON() on CO_FL_SESS_IDLE is already
present to guarantee this. These connections are accounted on the
session to enforce the max-session-srv-conns limit.
However, this counter is not decremented during the private conns cleanup in
the "del server" handler. This patch fixes this by adding a decrement for
every private connection removed via "del server".
This should be backported up to 3.0.
The "wait" CLI command can be used to wait until either a defined timeout or
a specific condition is reached. So far, srv-removable is the only supported
condition. It is implemented via srv_check_for_deletion(), which is
able to report a message describing the reason if the condition is
not met.
Previously, "wait" returned a generic string specifying whether the condition
was met, the timer expired or an immediate error was encountered. In the case
of srv-removable, it did not report the real reason why a server could
not be removed.
This patch improves the wait command with srv-removable. It now displays the
last message returned by srv_check_for_deletion(), either on immediate
error or on timeout. This is implemented by using dynamic string output
with the cli_dynmsg()/cli_dynerr() functions.
When a connection is reversed via rhttp protocol on the edge endpoint,
it migrates from frontend to backend side. This operation is performed
by conn_reverse(). During this transition, the session owning the conn is
freed as it becomes unneeded.
Prior to this patch, session_unown_conn() was also called during
frontend to backend migration. However, this is unnecessary as this
function is only used for backend connection reuse. As such, this patch
removes this unnecessary call.
This does not cause any harm to the process, as session_unown_conn() can
handle a connection not inserted yet. However, for clarity purposes it's
better to backport this patch up to 3.0.
A connection can be stored in several lists, thus there are several
attach points in struct connection. Depending on its proxy side, either
frontend or backend, a single connection will only access some of them
during its lifetime.
As an optimization, these attach points are organized in a union.
However, this split was not correctly aligned with the
frontend/backend side delimitation.
Furthermore, reverse HTTP has recently been introduced. With this
feature, a connection can migrate from frontend to backend side or vice
versa. As such, it becomes even more tedious to ensure that these
members are always accessed in a safe way.
This commit rearranges these fields. First, the union is now clearly split
between frontend-only and backend-only elements. Next, backend elements are
initialized with conn_backend_init(), which is already used during
connection reversal on an edge endpoint. A new function
conn_frontend_init() serves to initialize the other members, called both
on connection first instantiation and on reversal on a dialer endpoint.
This model is much cleaner and should prevent any access to fields from
the wrong side.
Currently, there is no known case of wrong access in the existing code
base. However, this cleanup is considered an improvement which must be
backported up to 3.0 to remove any possible undefined behavior.
Since the mworker rework in haproxy 3.1, the worker needs to tell the
master that it is ready. This is done using the sockpair protocol by
sending a _send_status message to the master.
It seems that the sockpair protocol is buggy on macOS because of a known
issue around fd transfer documented in sendmsg(2):
https://man.freebsd.org/cgi/man.cgi?sendmsg(2) BUGS section
Because sendmsg() does not necessarily block until the data has been
transferred, it is possible to transfer an open file descriptor across
an AF_UNIX domain socket (see recv(2)), then close() it before it has
actually been sent, the result being that the receiver gets a closed
file descriptor. It is left to the application to implement an
acknowledgment mechanism to prevent this from happening.
Indeed the recv side of the sockpair is closed on the send side just
after the send_fd_uxst(), which does not implement an acknowledgment
mechanism. So the master might never receive the _send_status message.
In order to implement an acknowledgment mechanism, a blocking read() is
done before closing the recv fd on the sending side, so we are sure that
the message was read on the other side.
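A toy illustration of that acknowledgment pattern (simplified, not the actual
mworker code):

    #include <stddef.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* sender side: transmit <msg> then block until the peer acknowledges it
     * with one byte, so the data (and any fd passed along with it) is known
     * to have been received before we close our end */
    static int send_with_ack(int sd, const void *msg, size_t len)
    {
        char ack;

        if (send(sd, msg, len, 0) < 0)
            return -1;
        if (read(sd, &ack, 1) != 1)
            return -1;
        return 0;
    }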
This was only reproduced on macOS, meaning the master CLI is also
impacted on macOS. But no solution was found for it there:
implementing an acknowledgment mechanism would complexify the
protocol too much in non-blocking mode.
The problem was reported in ticket #3045, reproduced and analyzed by
@cognet.
Must be backported as far as 3.1.
During configuration parsing, *args can contain different addresses; it
changes from line to line. smp_resolve_args() is called after the
configuration parsing; it uses arg_list->kw to create an error message if a
userlist referenced in some ACL is absent. This leads to wrong keyword names
being reported in such messages, or some garbage being printed.
It does not happen in the case of sample fetches. In this case arg_list->kw is
assigned to a string literal from the sample_fetch struct returned by
find_sample_fetch(). Let's do the same in parse_acl_expr(), when find_acl_kw()
lookup returns a corresponding acl_keyword structure.
This fixes the issue #3088 at GitHub.
This should be backported to all stable versions as far as 2.6.
This issue leads to crashes when the QUIC mux traces are enabled and could be
reproduced with -dMfail. When the qcc allocation fails (qcc_init()), haproxy
crashes in qmux_dump_qcc_info() because the qcc ->conn member is not initialized:
Program terminated with signal SIGSEGV, Segmentation fault.
at src/qmux_trace.c:146
146 const struct quic_conn *qc = qcc->conn->handle.qc;
[Current thread is 1 (LWP 1448960)]
(gdb) p qcc
$1 = (const struct qcc *) 0x7f9c63719fa0
(gdb) p qcc->conn
$2 = (struct connection *) 0x155550508
(gdb)
This patch simply fixes the TRACE() call concerned to avoid <qcc> object
dereferencing when it is NULL.
Must be backported as far as 3.0.
This patch follows this previous bug fix:
BUG/MINOR: quic: reorder fragmented RX CRYPTO frames by their offsets
where an ebtree node has been added to the qf_crypto struct. It has the same
meaning and type as the ->offset_node.key field, ->offset_node being an eb64 tree node.
This patch simply removes ->offset which is no more useful.
This patch should be easily backported as far as 2.6 as the one mentioned above
to ease any further backport to come.
Previous patch emits a diag warning when both 'strict-sni' +
'default-crt' are used on the same bind line.
This patch converts this diagnostic warning to a real warning, so the
previous patch could be backported without breaking configurations.
This was discussed in #3082.
It is possible to use both 'strict-sni' and 'default-crt' on the same bind
line, which does not make much sense.
This patch implements a check which will look for default certificates
in the sni_w tree when strict-sni is used (they are referenced by their
empty SNI ""). default-crt sets the CKCH_INST_EXPL_DEFAULT flag in
ckch_inst->is_default, so it's possible to differentiate explicit
defaults from implicit ones.
Could be backported as far as 3.0.
This was discussed in ticket #3082.
This issue impacts the QUIC listeners. It is the same as the one fixed by this
commit:
BUG/MINOR: quic: repeat packet parsing to deal with fragmented CRYPTO
Like chrome, the ngtcp2 client decided to fragment its CRYPTO frames, but in a
much more aggressive way. This could be fixed with a list local to
qc_parse_pkt_frms() to please chrome, thanks to the commit above. But this is
not sufficient for ngtcp2 which often splits its ClientHello message into more
than 10 fragments, some of them very small. This leads the packet parser to
interrupt the CRYPTO frame parsing due to the ncbuf gap size limit.
To fix this, this patch proceeds approximately the same way but with an
ebtree to reorder the CRYPTO frames by their offsets. These frames are directly
inserted into a local ebtree. Then this ebtree is reused to provide the
reordered CRYPTO data to the underlying ncbuf (non-contiguous buffer). This way
there are far fewer chances for the ncbufs used to store CRYPTO data
to reach an overly fragmented state.
Must be backported as far as 2.6.
This bug arrived with this fix:
BUG/MINOR: quic-be: missing Initial packet number space discarding
leading to crashes when dereferencing ->ipktns.
Such crashes could be reproduced with -dMfail option. To reach them, the
memory allocations must fail. So, this is relatively rare, except on systems
with limited memory.
To fix this, do not call quic_pktns_discard() if ->ipktns is NULL.
No need to backport.
We used to allocate and prepare listener counters from
check_config_validity() all at once. But it isn't correct, since at that
time the listeners' guids are not inserted yet, thus
counters_fe_shared_prepare() cannot work correctly, and neither can
shm_stats_file_preload() which is meant to be called even earlier.
Thus in this commit (and to prepare for upcoming shm shared counters
preloading patches), we handle the shared listener counters prep in
proxy_postcheck(), which means that between the allocation and the
prep there is the proper window for listener's guid insertion and shm
counters preloading.
No change of behavior expected when shm shared counters are not
actually used.
We actually need more granularity to split srv postparsing init tasks:
Some of them are required to be run BEFORE the config is checked, and
some of them AFTER the config is checked.
Thus we push the logic from 368d0136 ("MEDIUM: server: add and use
srv_init() function") a little bit further and split the function
in two distinct ones, one of them executed under check_config_validity()
and the other one using REGISTER_POST_SERVER_CHECK() hook.
SRV_F_CHECKED flag was removed because it is no longer needed,
srv_preinit() is only called once, and so is srv_postinit().
When pre-check and post-check postparsing hooks are evaluated in
step_init_2(), potential fatal errors are ignored during the iteration
and are only taken into account at the end of the loop. This is not ideal
because some errors (i.e. memory errors) could cause multiple alert
messages in a row, which could make troubleshooting harder for the user.
Let's stop as soon as a fatal error is encountered for post-parsing
hooks, as we already do everywhere else.
It is possible to interrupt a SPOE applet without reporting an error. For
instance, when the client of the parent stream aborts. Thanks to this patch,
we take care to report an error on the SPOE applet to be sure to interrupt
the processing. It is especially important if the connection to the agent is
queued. Thanks to 886a248be ("BUG/MEDIUM: mux-spop: Reject connection
attempts from a non-spop frontend"), it is no longer an issue. But there is
no reason to continue to process if the parent stream is gone.
In addition, in the SPOE filter, if the processing is interrupted when the
filter is destroyed, no specific status code was set. It is not a big deal
because it cannot be logged at this stage. But it can be used to notify the SPOE
applet. So better to set it.
This patch should be backported as far as 3.1.
Stop declaring "cert.ecdsa.pem" in a crt-store, and add it dynamically
over the stats socket insted.
This way we fully verify a JWS signature with a certificate which never
existed at HAProxy startup.
It is possible to crash the process by initializing a connection to a SPOP
server from a non-spop frontend. It is of course unexpected and invalid. And
there are some checks to prevent that when the configuration is
loaded. However, it is not possible to handle all cases, especially the
"use_backend" rules relying on log-format strings.
It could be good to improve the backend selection by checking the mode
compatibility (for now, this is only performed for HTTP).
But in the end, this can also be handled by the SPOP multiplexer when it is
initialized. If the opposite SD is not attached to an SPOE agent, we should
fail the mux initialization and return an internal error.
This patch must be backported as far as 3.1.
Based on the applet flags, it is possible to set .rcv_buf and .snd_buf
callback functions if necessary. If these functions are not defined for an
applet using the new API, it means the default functions must be used.
We also take care to choose the raw version or the htx version, depending on
the applet flags.
applet_output_room() and applet_input_data() are now HTX aware. These
functions automatically rely on htx versions if APPLET_FL_HTX flag is set
for the applet.
Multiplexers already explicitly announce their HTX support. Now that it is
possible to set flags on applets, it could be handy to do the same. So, now,
HTX-aware applets must set the APPLET_FL_HTX flag.
The appctx_app_test() function can now be used to test the applet flags using
an appctx. This simplifies tests on applet flags a bit. For now, this function
is used to test the APPLET_FL_NEW_API flag.
Instead of setting a flag on the applet context by checking the defined
callback functions of the applet to know if an applet is using the new API
or not, we can now rely on the applet flags itself. By checking
APPLET_FL_NEW_API flag, it does the job. APPCTX_FL_INOUT_BUFS flag is thus
removed.
stats http-request rules evaluation is handled separately in
http_process_req_common(). Because of that, if a rule requires yielding,
the evaluation is interrupted as (F)YIELD verdict return values are not
handled there.
Since 3.2 with the introduction of costly ruleset interruption in
0846638 ("MEDIUM: stream: interrupt costly rulesets after too many
evaluations"), the issue started being more visible because stats
http-request rules would be interrupted when the evaluation counters
reached tune.max-rules-at-once, but the evaluation would never be
resumed, and the request would continue to be handled as if the
evaluation was complete. Note however that the issue already existed
in the past for actions that could return ACT_RET_YIELD such as
"pause" for instance.
This issue was reported by GH user @Wahnes in #3087, thanks to him for
providing useful repro and details.
To fix the issue, we merge rule verdict handling in
http_process_req_common() so that "stats http-request" evaluation benefits
from all return values already supported for the current ruleset.
It should be backported in 3.2 with 0846638 ("MEDIUM: stream: interrupt
costly rulesets after too many evaluations"), and probably even further
(all stable versions) if the patch adaptation is not too complex (before
HTTP_RULE_RES_FYIELD was introduced) because it is still relevant.
HTTP_RULE_RES_YIELD was used where HTTP_RULE_RES_FYIELD should be used.
Hopefully, aside from debug traces, both return values were treated
equally. Let's fix that to prevent confusion and from causing bugs
in the future.
It may be backported in 3.2 with 0846638 ("MEDIUM: stream: interrupt
costly rulesets after too many evaluations") if it applies easily.
The below patch has simplified INITIAL padding on emission. Now,
qc_prep_pkts() is responsible for activating padding in this case, and
no special case is needed in qc_do_build_pkt() anymore.
commit 8bc339a6ad4702f2c39b2a78aaaff665d85c762b
BUG/MAJOR: quic: fix INITIAL padding with probing packet only
However, qc_do_build_pkt() may still activate padding on its own, to
ensure that a packet is big enough so that header protection decryption
can be performed by the peer. HP decryption is performed by extracting a
sample from the ciphered packet, starting 4 bytes after the PN offset.
The sample length is 16 bytes, as defined by the TLS algorithms used by QUIC.
Thus, a QUIC sender must ensure that the packet number plus payload
fields are at least 4 bytes long. This is enough given that each
packet is completed by a 16-byte AEAD tag which can be part of the HP
sample.
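As a rough illustration of that arithmetic (names invented, not the actual
qc_do_build_pkt() code):

    #include <stddef.h>

    /* minimum padding so that <pn_len> + <payload_len> covers at least the
     * 4 bytes preceding the header protection sample; the 16-byte AEAD tag
     * appended to the packet provides the remainder of the sample */
    static size_t hp_padding_needed(size_t pn_len, size_t payload_len)
    {
        return (pn_len + payload_len < 4) ? 4 - (pn_len + payload_len) : 0;
    }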
This patch simplifies qc_do_build_pkt() by centralizing padding for this
case in a single location. This is performed at the end of the function
after payload is completed. The code is thus simpler.
This is not a bug. However, it may be interesting to backport this patch
up to 2.6, as qc_do_build_pkt() is a tedious function, in particular
when dealing with padding generation, thus it may benefit greatly from
simplification.
Haproxy's QUIC stack suffers from a limitation: it's not possible to emit
a packet which contains both probing data and an ACK frame. Thus, in
case qc_do_build_pkt() is invoked with both values set to true, probing has
priority and the ACK is ignored.
However, this has the undesired side-effect of possibly generating two
coalesced packets of the same type in the same datagram : the first one
with the probing data and the second with an ACK frame. This is caused
by qc_prep_pkts() loop which may call qc_do_build_pkt() multiple times
with the same QEL instance. This case is normally used when a full
datagram has been built but there is still content to emit on the
current encryption level.
To fix this, alter the qc_prep_pkts() loop: if both probing and ACK are
requested, force the datagram to be written after packet encoding. This
will result in a datagram containing the packet with probing data as its
final entry. A new datagram is started for the next packet, which can
contain the ACK frame.
This also has some impact on INITIAL padding. Indeed, if packet must be
the last due to probing emission, qc_prep_pkts() will also activate
padding to ensure final datagram is at least 1.200 bytes long.
Note that coalescing two packets of the same type is not invalid
according to the QUIC RFC. However it could cause issues with some shaky
implementations, so it is considered as a bug.
This must be backported up to 2.6.
A QUIC datagram that contains an INITIAL packet must be padded to 1.200
bytes to prevent any deadlock due to anti-amplification protection. This
is implemented by encoding a PADDING frame on the last packet of the
datagram if necessary.
Previously, qc_prep_pkts() was responsible for activating padding when
calling qc_do_build_pkt(), as it knows which packet is the last to
encode. However, this has the side-effect of preventing PING emission
for probing with no data as this case was handled in an else-if branch
after padding. This was fixed by the below commit
217e467e89d15f3c22e11fe144458afbf718c8a8
BUG/MINOR: quic: fix malformed probing packet building
The above logic was altered to fix the PING case: padding was set to false
explicitly in qc_prep_pkts(). Padding was then added in a specific
block dedicated to the PING case in qc_do_build_pkt() itself for INITIAL
packets.
However, the fix is incorrect if the last QEL used to build a packet is
not the Initial one and probing is used with a PING frame only. In this
case, the specific block in qc_do_build_pkt() does not add padding. This
causes a BUG_ON() crash in qc_txb_store() which catches these packets as
irregularly formed.
To fix this while also properly handling PING emission, revert to the
original padding logic: qc_prep_pkts() is responsible for activating
INITIAL padding. To not interfere with PING emission, qc_do_build_pkt()
body is adjusted so that PING block is moved up in the function and
detached from the padding condition.
The main benefit from this patch is that INITIAL padding decision in
qc_prep_pkts() is clearer now.
Note that padding can also be activated by qc_do_build_pkt(), as packets
should be big enough for header protection deciphering. However, this case
is different from INITIAL padding, so it is not covered by this patch.
This should be backported up to 2.6.
If connection closing is activated, qc_prep_pkts() can only build a
datagram with a single packet. This is because we consider that only a
single CONNECTION_CLOSE frame is relevant at this stage.
This is handled both by qc_prep_pkts(), which ensures that only a single-packet
datagram is built, and by qc_do_build_pkt(), which prevents the
invocation of qc_build_frms() if <cc> is set.
However, there is an incoherency for probing. First, qc_prep_pkts()
deactivates it if connection closing is requested. But qc_do_build_pkt()
may still emit a probing frame as it does not check its <probe> argument
but rather the <pto_probe> QEL field directly. This can result in a packet
mixing PING and CONNECTION_CLOSE frames, which is useless.
Fix this by adjusting qc_do_build_pkt(): the closing argument is also
checked on PING probing emission. Note that the code here is still shaky,
as qc_do_build_pkt() should rely only on the <probe> argument to ensure
this.
This should be backported up to 2.6.
qc_prep_pkts() encodes input data into QUIC packets in a loop, into one
or several datagrams. It supports GSO, which requires building a series of
multiple datagrams of the same length.
Each packet encoding is performed via a call to qc_do_build_pkt(). This
function has an argument to specify if output packet must be completed
with a PADDING frame. This option is activated when qc_prep_pkts()
encodes the last packet of a datagram with at least one INITIAL packet
in it.
Padding is reset each time a new datagram is started. However, this
was not performed if GSO is used to build the next datagram. This patch
fixes it by properly resetting padding in this case too.
The impact of this bug is unknown. It may have several effects, one of
the most obvious being the insertion of unnecessary padding in packets.
It could also potentially trigger an infinite loop in qc_prep_pkts(),
although this has never been encountered so far.
This must be backported up to 3.1.
This fixes the commit 2c7e05f80e3b
("MEDIUM: dns: don't call connect to dest socket for AF_INET*"). If we fail to
bind AF_INET sockets or the address family of the nameserver protocol isn't
something, what we expect, we need to close the fd, obtained by
connect.
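The error path essentially follows this pattern (a generic sketch, not the
actual resolvers code):

    #include <sys/socket.h>
    #include <unistd.h>

    static int open_and_bind(const struct sockaddr *addr, socklen_t len)
    {
        int fd = socket(addr->sa_family, SOCK_DGRAM, 0);

        if (fd < 0)
            return -1;
        if (bind(fd, addr, len) < 0) {
            close(fd);  /* do not leak the descriptor on the error path */
            return -1;
        }
        return fd;
    }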
This fixes GitHub issue #3085.
This must be backported along with the commit 2c7e05f80e3b.
It is possible to miss a synchronous write event in process_stream() if the
stream was woken up on a write event. In that case, it is possible to freeze
the stream until the next I/O event or timeout.
Concretely, the stream is woken up with CF_WRITE_EVENT on a channel. This
flag is removed from the channel when we leave process_stream(). But before
leaving process_stream(), when a synchronous send is tried on this channel,
the flag is removed and eventually set again on success. But this event is
masked by the previous one, and the channel is not resynced as it should be.
To fix the bug, CF_READ_EVENT and CF_WRITE_EVENT flags are removed from a
channel after the corresponding analysers evaluation. This way, we will be
able to detect a successful synchronous send to restart analysers evaluation
based on the new channel state. It is safe (or it should be) to do so
because these flags are only used by analysers and tested to resync the
stream inside process_stream().
It is a very old bug and I guess all versions are affected. It was observed
on 2.9 and higher, and with the master/worker only. But it could affect any
stream. It is tagged as MAJOR because this area is really sensitive to any
change.
This patch should fix the issue #3070. It should probably be backported to
all stable versions, but only after a period of observation and with a
special care because this area is really sensitive to changes. It is
probably reasonnable to backport it as far as 3.0 and wait for older
versions.
Thanks to Valentine for its help on this issue !
This patch introduces a change of behavior in the configuration parsing.
Previously the "ssl-f-use" lines were only applied on "ssl" bind lines
that does not have any "crt" configured.
Since there is no warning and you could mix bind lines with and without
crt, this is really confusing.
This patch applies the "ssl-f-use" lines on every "ssl" bind lines.
This was discussed in ticket #3082.
Must be backported in 3.2.
This bug impacts only the QUIC backends. It arrived with this commit:
MINOR: quic-be: QUIC connection allocation adaptation (qc_new_conn())
which was supposed to be fixed by:
BUG/MEDIUM: quic: crash after quic_conn allocation failures
but this commit was not sufficient.
Such a crash could be reproduced with -dMfail option. To reach it, the
<conn_id> object allocation must fail (from qc_new_conn()). So, this is
relatively rare, except on systems with limited memory.
No need to backport.
A QUIC client must discard the Initial packet number space as soon as it first
sends a Handshake packet.
This patch implements this packet number space discarding, which was missing.
An ABORT_NOW() was used during debugging idle-ping but was not removed
from the final code. This may cause a crash, in particular when mixing
idle-ping with shorter http-request/http-keep-alive values.
Fix this situation by removing ABORT_NOW() statement.
This should fix github issue #3079.
This must be backported up to 3.2.
Released version 3.3-dev7 with the following main changes :
- MINOR: quic: duplicate GSO unsupp status from listener to conn
- MINOR: quic: define QUIC_FL_CONN_IS_BACK flag
- MINOR: quic: prefer qc_is_back() usage over qc->target
- BUG/MINOR: cfgparse: immediately stop after hard error in srv_init()
- BUG/MINOR: cfgparse-listen: update err_code for fatal error on proxy directive
- BUG/MINOR: proxy: avoid NULL-deref in post_section_px_cleanup()
- MINOR: guid: add guid_get() helper
- MINOR: guid: add guid_count() function
- MINOR: clock: add clock_set_now_offset() helper
- MINOR: clock: add clock_get_now_offset() helper
- MINOR: init: add REGISTER_POST_DEINIT_MASTER() hook
- BUILD: restore USE_SHM_OPEN build option
- BUG/MINOR: stick-table: cap sticky counter idx with tune.nb_stk_ctr instead of MAX_SESS_STKCTR
- MINOR: sock: update broken accept4 detection for older hardwares.
- CI: vtest: add os name to OT cache key
- CI: vtest: add Ubuntu arm64 builds
- BUG/MEDIUM: ssl: Fix 0rtt to the server
- BUG/MEDIUM: ssl: fix build with AWS-LC
- MEDIUM: acme: use lowercase for challenge names in configuration
- BUG/MINOR: init: Initialize random seed earlier in the init process
- DOC: management: clarify usage of -V with -c
- MEDIUM: ssl/cli: relax crt insertion in crt-list of type directory
- MINOR: tools: implement ha_aligned_zalloc()
- CLEANUP: fd: make use of ha_aligned_alloc() for the fdtab
- MINOR: pools: distinguish the requested alignment from the type-specific one
- MINOR: pools: permit to optionally specify extra size and alignment
- MINOR: pools: always check that requested alignment matches the type's
- DOC: api: update the pools API with the alignment and typed declarations
- MEDIUM: tree-wide: replace most DECLARE_POOL with DECLARE_TYPED_POOL
- OPTIM: tasks: align task and tasklet pools to 64
- OPTIM: buffers: align the buffer pool to 64
- OPTIM: queue: align the pendconn pools to 64
- OPTIM: connection: align connection pools to 64
- OPTIM: server: start to use aligned allocs in server
- DOC: management: fix typo in commit f4f93c56
- DOC: config: recommend single quoting passwords
- MINOR: tools: also implement ha_aligned_alloc_typed()
- MEDIUM: server: introduce srv_alloc()/srv_free() to alloc/free a server
- MINOR: server: align server struct to 64 bytes
- MEDIUM: ring: always allocate properly aligned ring structures
- CI: Update to actions/checkout@v5
- MINOR: quic: implement qc_ssl_do_hanshake()
- BUG/MEDIUM: quic: listener connection stuck during handshakes (OpenSSL 3.5)
- BUG/MINOR: mux-h1: fix wrong lock label
- MEDIUM: dns: don't call connect to dest socket for AF_INET*
- BUG/MINOR: spoe: Properly detect and skip empty NOTIFY frames
- BUG/MEDIUM: cli: Report inbuf is no longer full when a line is consumed
- BUG/MEDIUM: quic: crash after quic_conn allocation failures
- BUG/MEDIUM: quic-be: do not initialize ->conn too early
- BUG/MEDIUM: mworker: more verbose error upon loading failure
- MINOR: xprt: Add recvmsg() and sendmsg() parameters to rcv_buf() and snd_buf().
- MINOR: ssl: Add a "flags" field to ssl_sock_ctx.
- MEDIUM: xprt: Add a "get_capability" method.
- MEDIUM: mux_h1/mux_pt: Use XPRT_CAN_SPLICE to decide if we should splice
- MINOR: cfgparse: Add a new "ktls" option to bind and server.
- MINOR: ssl: Define HAVE_VANILLA_OPENSSL if openssl is used.
- MINOR: build: Add a new option, USE_KTLS.
- MEDIUM: ssl: Add kTLS support for OpenSSL.
- MEDIUM: splice: Don't consider EINVAL to be a fatal error
- MEDIUM: ssl: Add splicing with SSL.
- MEDIUM: ssl: Add ktls support for AWS-LC.
- MEDIUM: ssl: Add support for ktls on TLS 1.3 with AWS-LC
- MEDIUM: ssl: Handle non-Application data record with AWS-LC
- MINOR: ssl: Add a way to globally disable ktls.
Add a new global option, "noktls", as well as a command line option,
"-dT", to totally disable ktls usage, even if it is activated on servers
or binds in the configuration.
That makes it easier to quickly figure out if a problem is related to
ktls or not.
Handle receiving and sending TLS records that are not application data
records.
When receiving, we ignore new session ticket records, we handle close
notify as a read0, and we consider any other record as a connection
error.
For sending, we're just sending close notify, so that the TLS connection
is properly closed.
AWS-LC added a new API in AWS-LC 1.54 that allows the user to retrieve
the keys for TLS 1.3 connections with SSL_get_read_traffic_secret(), so
use it to be able to use ktls with TLS 1.3 too.
Add ktls support for AWS-LC. As it does not know anything
about ktls, it means extracting keys from the ssl lib, and providing them
to the kernel. At which point we can use regular recvmsg()/sendmsg()
calls.
This patch only provides support for TLS 1.2, AWS-LC provides a
different way to extract keys for TLS 1.3.
Note that this may work with BoringSSL too, but it has not been tested.
Implement the splicing methods to the SSL xprt (which will just call the
raw_sock methods if kTLS is enabled on the socket), and properly report
that a connection supports splicing if kTLS is configured on that
connection.
For OpenSSL, if the upper layer indicated that it wanted to start using
splicing by adding the CO_FL_WANT_SPLICING flag, make sure we don't read
any more data from the socket, and just drain what may be in the
internal OpenSSL buffers, before allowing splicing.
Don't consider that EINVAL is a fatal error, when calling splice().
When doing splicing from a kTLS socket, splice() will set errno to
EINVAL if the next record to be read is not an application data record.
This is not a fatal error, it just means we have to use recvmsg() to
read it, and potentially we can then resume using splicing.
It is unfortunate that EINVAL was used for that case, but we should
never get any other case of receiving EINVAL from splice(), so it should
be safe to treat it as non-fatal.
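For illustration, the described handling boils down to something like this
(simplified sketch, not the actual haproxy code):

ret = splice(src_fd, NULL, pipe_wr, NULL, count,
             SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
if (ret < 0 && errno == EINVAL) {
        /* kTLS socket: next record is not application data. Not fatal,
         * read that record with recvmsg() and possibly resume splicing
         * afterwards.
         */
}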
Modify the SSL code to enable kTLS with OpenSSL.
It mostly requires our internal BIO to be able to handle the various
kTLS-specific controls in ha_ssl_ctrl(), as well as being able to use
recvmsg() and sendmsg() from ha_ssl_read() and ha_ssl_write().
If we're using OpenSSL as our crypto library, add a define,
HAVE_VANILLA_OPENSSL, to make it easier to differentiate between the
various crypto libs.
Add a new "ktls" option to bind and server. Valid values are "on" and
"off".
It currently does nothing, but once kTLS is implemented, it will
enable or disable kTLS for the corresponding sockets.
It is marked as experimental for now.
In both mux_h1 and mux_pt, use the new XPRT_CAN_SPLICE capability to
decide if we should attempt to use splicing or not.
If we receive XPRT_CONN_CAN_MAYBE_SPLICE, add a new flag on the
connection, CO_FL_WANT_SPLICING, to let the xprt know that we'd love to
be able to do splicing, so that it may get ready for that.
This should have no effect right now, and is required work for adding
kTLS support.
Add a new method to xprts, get_capability, that can be used to query if
an xprt supports something or not.
The first capability implemented is XPRT_CAN_SPLICE, to know if the xprt
will be able to use splicing for the provided connection.
The possible answers are XPRT_CONN_CAN_NOT_SPLICE, which indicates
splicing will never be possible for that connection,
XPRT_CONN_COULD_SPLICE, which indicates that splicing is not usable
right now, but may be in the future, and XPRT_CONN_CAN_SPLICE, that
means we can splice right away.
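As an illustration, a caller could use it like this (the exact method
prototype is an assumption, only the capability and answer names come from
this description):

switch (conn->xprt->get_capability(conn, conn->xprt_ctx, XPRT_CAN_SPLICE)) {
case XPRT_CONN_CAN_SPLICE:     /* splicing usable right away */
        break;
case XPRT_CONN_COULD_SPLICE:   /* not usable yet, may become possible later */
        break;
case XPRT_CONN_CAN_NOT_SPLICE: /* never possible for this connection */
        break;
}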
Instead of adding more separate fields in ssl_sock_ctx, add a "flags"
one.
Convert the "can_send_early_data" to the flag SSL_SOCK_F_EARLY_ENABLED.
More flags will be added for kTLS support.
In rcv_buf() and snd_buf(), use sendmsg/recvmsg instead of send and
recv, and add two new optional parameters to provide msg_control and
msg_controllen.
Those are unused for now, but will be used later for kTLS.
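For illustration, a plain POSIX read with an optional control buffer looks
like this (generic sketch, not the actual rcv_buf() implementation):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static ssize_t read_with_cmsg(int fd, void *buf, size_t len,
                              void *msg_control, size_t msg_controllen)
{
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct msghdr msg;

        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = msg_control;       /* NULL when unused */
        msg.msg_controllen = msg_controllen; /* 0 when unused */
        return recvmsg(fd, &msg, 0);
}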
When a worker crashes during its configuration parsing and without
emitting any messages, the master will emit the message "Failed to load
worker!". However that doesn't give us neither the PID of the worker,
nor the status code.
This patch fixes the problem by emitting a more verbose error.
Must be backported as far as 3.1.
This bug arrived with this commit:
BUG/MEDIUM: quic: do not release BE quic-conn prior to upper conn
which added a BUG_ON(qc->conn) statement at the beginning of quic_conn_release().
It is triggered if the connection is not released before releasing the quic_conn.
But this is always the case for a backend quic_conn when its allocation from
qc_new_conn() fails.
Such crashes could be reproduced with -dMfail option. To reach them, the
memory allocations must fail. So, this is relatively rare, except on systems
with limited memory.
To fix this, simply set the ->conn quic_conn struct member to a non-NULL
value (the one passed as parameter) after the quic_conn allocation has
succeeded.
No backport needed.
This regression arrived with this commit:
MINOR: quic-be: QUIC connection allocation adaptation (qc_new_conn())
where qc_new_conn() was modified. The ->cids allocation was moved without
checking if a quic_conn_release() call could lead to crashes due to uninitialized
quic_conn members. Indeed, if qc_new_conn() fails, then quic_conn_release() is
called. This bug could impact both QUIC servers and clients.
Such crashes could be reproduced with -dMfail option. To reach them, the
memory allocations must fail. So, this is relatively rare, except on systems
with limited memory.
This patch ensures all the quic_conn members which could lead to a crash
in quic_conn_release() are initialized before any remaining memory
allocations required for the quic_conn.
The <conn_id> variable allocated by the client is no longer attached to
the connection during its allocation, but only after the ->cids tree is
allocated.
No backport needed.
When the command line parsing was refactored (20ec1de21 "MAJOR: cli: Refacor
parsing and execution of pipelined commands"), a regression was introduced.
When input data are consumed, information about the applet's input buffer
is no longer updated accordingly to state it is no longer full. So it is
possible to freeze the CLI applet. And a spinning loop may be encountered if
a client shutdown is detected in this state.
The fix is obvious. When data are consumed from the applet's input buffer,
the APPCTX_FL_INBLK_FULL flag is removed to notify that the input buffer is
no longer full and more data can be sent to the CLI applet.
This patch should fix the issue #3064. It must be backported to 3.2.
Since the SPOE was refactored, the detection of empty NOTIFY frames is
broken. So it is possible to send a NOTIFY frame to an agent with no
message at all. The bug happens because the frame type is now added to the
buffer before the messages encoding. So the buffer is never really empty.
To fix the issue, the condition to detect empty frames was adapted.
This patch must be backported as far as 3.1.
When we perform a connect call for a datagram socket used to send DNS
requests, we set its default destination address to a given nameserver. Then
we simply use send(), as the destination address is already set. In some use
cases described in GitHub issues #3001 and #2654, this approach becomes
inefficient: nameservers change their IP addresses dynamically, which
triggers DNS resolution errors.
To fix this, let's perform the bind() on the wildcard address for the
datagram AF_INET* client socket. Like this we will allocate a port for it.
Then let's use sendto() instead of send().
If the nameserver is local and is listening on a UNIX domain socket, we
continue to use the existing approach (connect() and then send()).
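For illustration, the bind-on-wildcard plus sendto() approach is essentially
the following (minimal IPv4 sketch, not the actual haproxy code):

#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int dns_sock_prepare(void)
{
        struct sockaddr_in local;
        int fd;

        fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
                return -1;

        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = INADDR_ANY; /* wildcard: kernel picks a port */

        if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}

/* each request is then addressed explicitly, so a nameserver IP change is
 * transparently taken into account:
 *     sendto(fd, query, qlen, 0,
 *            (struct sockaddr *)&ns_addr, sizeof(ns_addr));
 */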
This fixes issues #3001 and #2654.
This may be backported in all stable versions.
Wrong lock label is used when manipulating idle lock on h1_timeout_task.
Fix this by replacing OTHER_LOCK by IDLE_CONNS_LOCK.
This only concerns thread debugging statistics.
This must be backported up to 2.4.
This issue was reported in GH #3071 by @famfo where a wireshark capture
reveals that some handshakes could not complete after having received
two Initial packets. This could happen when the packets were parsed
in two passes, calling qc_ssl_provide_all_quic_data() two times.
This is due to the crypto data stream counter which was incremented two
times from qc_ssl_provide_all_quic_data() (see the cstream->rx.offset += data
statement around line 1223 in quic_ssl.c). One time by the callback
which "receives" the crypto data, and one time by qc_ssl_provide_all_quic_data().
Then when parsing the second crypto data frame, the parser detected
that the crypto data were already provided.
To fix this, one could comment the code which increments the crypto data
stream counter by <data>. That said, when using the OpenSSL 3.5 QUIC API
one should not modify the crypto data stream outside of the OpenSSL 3.5
QUIC API.
So, this patch stops calling qc_ssl_provide_all_quic_data() and
qc_ssl_provide_quic_data() and only calls qc_ssl_do_hanshake() after
having received some crypto data. In addition to this, as these functions
are no longer called when building haproxy against OpenSSL 3.5, this patch
disables their compilation (with #ifndef HAVE_OPENSSL_QUIC).
This patch depends on this previous one:
MINOR: quic: implement qc_ssl_do_hanshake()
Thank you to @famfo for this report.
Must be backported to 3.2.
The rings were manually padded to place the various areas that compose
them into different cache lines, provided that the allocator returned
a cache-aligned address, which until now was not guaranteed. By now
switching to the aligned API we can finally have this guarantee and
hope for more consistent ring performance between tests. Like previously,
the few carefully crafted THREAD_PAD() could simply be replaced by a
generic THREAD_ALIGN() that dictates the type's alignment.
This was the last user of THREAD_PAD() by the way.
Several times recently, it was noticed that some benchmarks would vary
widely depending on the position of certain fields in the server
struct, and this could even vary between runs.
The server struct does have separate areas depending on the user cases
and hot/cold aspect of the members stored there, but the areas are
artificially kept apart using fixed padding instead of real alignment,
which has the first sad effect of artificially inflating the struct,
and the second one of misaligning it.
Now that we have all the necessary tools to keep them aligned, let's
just do it. The struct has shrunk from 4160 to 4032 bytes on 64-bit
systems, 152 of which are still holes or padding.
It happens that we free servers at various places in the code, both
on error paths and at runtime thanks to the "server delete" feature. In
order to switch to an aligned struct, we'll need to change the calloc()
and free() calls. Let's first spot them and switch them to srv_alloc()
and srv_free() instead of using calloc() and either free() or ha_free().
An easy trap to fall into is that some of them are default-server
entries. The new srv_free() function also resets the pointer like
ha_free() does.
This was done by running the following coccinelle script all over the
code:
@@
struct server *srv;
@@
(
- free(srv)
+ srv_free(&srv)
|
- ha_free(&srv)
+ srv_free(&srv)
)
@@
struct server *srv;
expression e1;
expression e2;
@@
(
- srv = malloc(e1)
+ srv = srv_alloc()
|
- srv = calloc(e1, e2)
+ srv = srv_alloc()
)
This is marked medium because despite spotting all call places, we can
never rule out the possibility that some out-of-tree patches would
allocate their own servers and continue to use the old API... at their
own risk.
This one is a macro and will allocate a properly aligned and sized
object. This will help make sure that the alignment promised to the
compiler is respected.
When memstats is used, the type name is passed as a string into the
.extra field so that it can be displayed in "debug dev memstats". Two
tiny mistakes related to memstats macros were also fixed (calloc
instead of malloc for zalloc), and the doc was also added to document
how to use these calls.
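A hypothetical shape for such a macro (the exact expansion and argument
order are assumptions):

/* the type provides both the size and the alignment promised to the compiler */
#define ha_aligned_alloc_typed(type) \
        ((type *)ha_aligned_alloc(__alignof__(type), sizeof(type)))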
This is currently for per-thread arrays like idle conns etc. We're
now cache-aligning the per-thread arrays so as to put an end to false
sharing. A comparative test between no alignment and alignment on a
simple config with round robin between 4 servers showed an average
rate of 1.75M/s vs 1.72M/s before for 100M requests. The gain seems
to be more commonly less than 1% however. This should mostly help
make measurements more reproducible across multiple runs.
The struct connection is used a lot by the muxes during many operations,
particularly at the beginning of the struct (flags, ctrl, xprt and mux).
We definitely want this one not to be falsely shared with another thread,
so let's align the pools to a cache line.
This struct is used by memcpy() and friends, particularly during the
early recv() and send(). By keeping it 64-byte aligned, we let the
underlying libs/kernel use optimal operations (e.g. AVX512) for memory
copies while right now it's just random (buffers are found to be equally
aligned to 32 and 64 in practice).
These structs are intensively used and really must not experience false
sharing, so let's declare them aligned to 64. We don't try to align the
structs themselves, as we don't want the compiler to expand them either.
This will make the pools size and alignment automatically inherit
the type declaration. It was done like this:
sed -i -e 's:DECLARE_POOL(\([^,]*,[^,]*,\s*\)sizeof(\([^)]*\))):DECLARE_TYPED_POOL(\1\2):g' $(git grep -lw DECLARE_POOL src addons)
sed -i -e 's:DECLARE_STATIC_POOL(\([^,]*,[^,]*,\s*\)sizeof(\([^)]*\))):DECLARE_STATIC_TYPED_POOL(\1\2):g' $(git grep -lw DECLARE_STATIC_POOL src addons)
81 replacements were made. The only remaining ones are those which set
their own size without depending on a structure. The few ones with an
extra size were manually handled.
It also means that the requested alignments are now checked against the
type's. Given that none is specified for now, no issue is reported.
It was verified with "show pools detailed" that the definitions are
exactly the same, and that the binaries are similar.
For pool registrations that are created from the type declaration, we
now have the ability to verify that the requested alignment matches
the type's one. Let's not miss this opportunity, as we've met bugs in
the past that were caused by such mismatches. The principle is simple:
if the type alignment is known, we check that the configured alignment
is at least as large as that one, otherwise we refuse to start (since
the code may crash at any moment). Obviously it doesn't crash for now!
The common macros REGISTER_TYPED_POOL(), DECLARE_TYPED_POOL() and
DECLARE_STATIC_TYPED_POOL() will now take two optional arguments,
one being the extra size to be added to the structure, and a second
one being the desired alignment to enforce. This will permit specifying
alignments larger than the default ones promised to the compiler.
We're letting users request an alignment but that can violate one imposed
by a type, especially if we start seeing REGISTER_TYPED_POOL() grow in
adoption, encouraging users to specify alignment on their types. On the
other hand, if we ask the user to always specify the alignment, no control
is possible and the error is easy to make. Let's have a second field in the
pool registration, for the type-specific one. We'll set it to zero when
unknown, and to the type's alignment when known. This way it will become
possible to compare them at startup time to detect conflicts. For now no
macro permits setting both separately so this is not visible.
We've forcefully aligned the fdtab in commit 97ea9c49f1 ("BUG/MEDIUM:
fd: always align fdtab[] to 64 bytes"), but now we don't need such hacks
anymore thanks to ha_aligned_alloc(). Let's use it and get rid of
fdtab_addr.
This one is exactly ha_aligned_alloc() followed by a memset(0), as
it will be convenient for a number of call places as a replacement
for calloc().
Note that ideally we should also have a calloc version that performs
basic multiply overflow checks, but these are essentially used with
numbers of threads times small structs so that's fine, and we already
do the same everywhere in malloc() calls.
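In other words, a minimal sketch of this helper, assuming the align/size
argument order, is:

void *ha_aligned_zalloc(size_t align, size_t size)
{
        void *ret = ha_aligned_alloc(align, size);

        if (ret)
                memset(ret, 0, size);
        return ret;
}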
In previous versions of haproxy, insertions of certificates in a
crt-list from the CLI would require the path of the directory to be
present in the path of the certificate. This helped avoid the case where
the certificate wasn't loaded upon a reload because it was not at the
right place.
However, since version 3.0 and crt-store, the name stored in the tree
could be an alias and not a path, so that does not make sense anymore.
Even when the path is right, the check is not relevant anymore in this
case.
The tool or user inserting the certificate must now check itself that
the certificate was placed at the right spot on the filesystem.
Reported in issue #3053.
Could be backported as far as haproxy 3.0.
In ticket #3065 a user complained that no success message is printed
anymore when using -c. The message does not appear by default since
version 2.9. This patch clarifies the documentation.
Must be backported as far as 2.8.
The random seed used in ha_random functions needs to be first
initialized by calling ha_random_boot. This function was called rather
late in the init process, after the init functions (INITCALLS) are
called and after the configuration parsing for instance, which means that
any ha_random call in an init function would return 0. This was the case
in 'vars_init' and 'cache_init' which tried to build seeds for specific
hash calculations but ended up not being seeded.
This patch can be backported on all stable branches.
Both the RFC and the IANA registry refer to challenge names in
lowercase. If we need to implement more challenges, it's better to
use the correct naming.
In order to keep the compatibility with the previous configurations, the
parsing does a strcasecmp() instead of a strcmp().
Also rename every occurrence in the code and doc to lowercase.
This was discussed in issue #1864
AWS-LC doesn't provide SSL_in_before(), and doesn't provide an easy way
to know if we already started the handshake or not. So instead, just add
a new field in ssl_sock_ctx, "can_write_early_data", that will be
initialized to 1, and will be set to 0 as soon as we start the
handshake.
This should be backported up to 2.8 with
13aa5616c9f99dbca0711fd18f716bd6f48eb2ae.
In order to send early data, we have to make sure no handshake has been
initiated at all. To do that, we remove the CO_FL_SSL_WAIT_HS flag, so
that we won't attempt to start a handshake. However, by removing those
flags, we allow ssl_sock_to_buf() to call SSL_read(), as it's no longer
aware that no handshake has been done, and SSL_read() will begin the
handshake, thus preventing us from sending early data.
The fix is to just call SSL_in_before() to check if no handshake has
been done yet, in addition to checking CO_FL_SSL_WAIT_HS (both are
needed, as CO_FL_SSL_WAIT_HS may come back in case of renegotiation).
In ssl_sock_from_buf(), fix the check to see if we may attempt to send
early data. Use SSL_in_before() instead of SSL_is_init_finished(), as
SSL_is_init_finished() will return 1 if the handshake has been started,
but not terminated, and if the handshake has been started, we can no
longer send early data.
This fixes errors when attempting to send early data (as well as
actually sending early data).
This should be backported up to 2.8.
Cap sticky counter index with tune.nb_stk_ctr instead of MAX_SESS_STKCTR for
sc-add-gpc. Same logic is already implemented for sc-inc-gpc and sc-set-gpt
keywords. So, it seems missed for sc-add-gpc.
This fixes issue #3061 reported on GitHub. Thanks to @ma311 for
reporting and analyzing the issue.
This should be backported to all versions down to 2.8 included.
Some optional features may still require the use of shm_open() in the
future. In this patch we restore the USE_SHM_OPEN build option that
was removed in 143be1b59 ("MEDIUM: errors: get rid of shm_open()") and
should guard the use of shm_open() in the code.
Similar to REGISTER_POST_DEINIT() hook (which is invoked during deinit)
but for master process only, when haproxy was started in master-worker
mode. The goal is to be able to register cleanup functions that will
only run for the master process right before exiting.
Since now_offset is a static variable and is not exposed outside of
clock.c, let's add a helper so that it becomes possible to set its
value from another source file.
post_section_px_cleanup(), which was implemented in abcc73830
("MEDIUM: proxy: register a post-section cleanup function"), is called
for the current section no matter if the parsing was aborted due to
a fatal error. In this case, the curproxy pointer may point to NULL,
yet post_section_px_cleanup() assumes curproxy pointer is always valid,
which could lead to NULL-deref.
For instance, the config below will cause SEGFAULT:
listen toto titi
To fix the issue, let's simply consider that the curproxy pointer may
be NULL in post_section_px_cleanup(), in which case we skip the cleanup
for the curproxy since there is nothing we can do.
No backport needed
When improper arguments are provided on a proxy directive (listen,
frontend or backend), such an alert may be emitted:
"please use the 'bind' keyword for listening addresses"
This was introduced in 6e62fb6405 ("MEDIUM: cfgparse: check section
maximum number of arguments"). However, despite the error being reported
as an alert, the err_code isn't updated accordingly, which could make the
upper parser think there was no error, while it isn't the case.
In practice, since the proxy directive is ignored, the following
proxy-related directives should raise errors, so this didn't cause much
harm, yet better fix that.
It could be backported to all stable versions.
Since 368d01361 (" MEDIUM: server: add and use srv_init() function"), in
case of srv_init() error, we simply increment cfgerr variable and keep
going.
It isn't enough, some treatments occurring later in check_config_validity()
assume that srv_init() succeeded for servers, and may cause undefined
behavior. To fix the issue, let's consider that if (srv_init() & ERR_CODE)
returns true, then we must stop checking the config immediately.
No backport needed unless 368d01361 is.
Previously quic_conn <target> member was used to determine if quic_conn
was used on the frontend (as server) or backend side (as client). A new
helper function can now be used to directly check flag
QUIC_FL_CONN_IS_BACK.
This reduces the dependency between quic_conn and their relative
listener/server instances.
Define a new quic_conn flag, assigned if the connection is used on the
backend side. This is similar to other haproxy components such as the
struct connection and mux elements.
This flag is positioned via qc_new_conn(). Also update quic traces to
mark the proxy side with an 'F' or 'B' suffix.
QUIC emission can use GSO to emit multiple datagrams with a single
syscall invocation. However, this feature relies on several kernel
parameters which are checked on haproxy process startup.
Even if these checks report no issue, GSO may still be unusable due to
the underlying network adapter. Thus, if an EIO occurred on
sendmsg() with GSO, the listener is flagged to mark GSO as unsupported.
This allows every other QUIC connection to share the status and avoid
using GSO on this listener.
Previously, the listener flag was checked for every QUIC emission. This
was done using an atomic operation to prevent races. Improve this by
duplicating the GSO unsupported status at the connection level. This is
done on qc_new_conn() and also on thread rebinding if a new listener
instance is used.
The main benefit from this patch is to reduce the dependency between
quic_conn and listener instances.
Released version 3.3-dev6 with the following main changes :
- MINOR: acme: implement traces
- BUG/MINOR: hlua: take default-path into account with lua-load-per-thread
- CLEANUP: counters: rename counters_be_shared_init to counters_be_shared_prepare
- MINOR: clock: make global_now_ms a pointer
- MINOR: clock: make global_now_ns a pointer as well
- MINOR: mux-quic: release conn after shutdown on BE reuse failure
- MINOR: session: strengthen connection attach to session
- MINOR: session: remove redundant target argument from session_add_conn()
- MINOR: session: strengthen idle conn limit check
- MINOR: session: do not release conn in session_check_idle_conn()
- MINOR: session: streamline session_check_idle_conn() usage
- MINOR: muxes: refactor private connection detach
- BUG/MEDIUM: mux-quic: ensure Early-data header is set
- BUILD: acme: avoid declaring TRACE_SOURCE in acme-t.h
- MINOR: acme: emit a log for DNS-01 challenge response
- MINOR: acme: emit the DNS-01 challenge details on the dpapi sink
- MEDIUM: acme: allow to wait and restart the task for DNS-01
- MINOR: acme: update the log for DNS-01
- BUG/MINOR: acme: possible integer underflow in acme_txt_record()
- BUG/MEDIUM: hlua_fcn: ensure systematic watcher cleanup for server list iterator
- MINOR: sample: Add le2dec (little endian to decimal) sample fetch
- BUILD: fcgi: fix the struct name of fcgi_flt_ctx
- BUILD: compat: provide relaxed versions of the MIN/MAX macros
- BUILD: quic: use _MAX() to avoid build issues in pools declarations
- BUILD: compat: always set _POSIX_VERSION to ease comparisons
- MINOR: implement ha_aligned_alloc() to return aligned memory areas
- MINOR: pools: support creating a pool from a pool registration
- MINOR: pools: add a new flag to declare static registrations
- MINOR: pools: force the name at creation time to be a const.
- MEDIUM: pools: change the static pool creation to pass a registration
- DEBUG: pools: store the pool registration file name and line number
- DEBUG: pools: also retrieve file and line for direct callers of create_pool()
- MEDIUM: pools: add an alignment property
- MINOR: pools: add macros to register aligned pools
- MINOR: pools: add macros to declare pools based on a struct type
- MEDIUM: pools: respect pool alignment in allocations
Now pool_alloc_area() takes the alignment in argument and makes use
of ha_aligned_alloc() instead of malloc(). pool_alloc_area_uaf()
simply applies the alignment before returning the mapped area. The
pool_free() function calls ha_aligned_free() so as to permit using
a specific API for aligned alloc/free like mingw requires.
Note that it's possible to see warnings about mismatched sizes
during pool_free() since we know both the pool and the type. In
pool_free, adding just this is sufficient to detect potential
offenders:
WARN_ON(__alignof__(*__ptr) > pool->align);
DECLARE_TYPED_POOL() and friends take a name, a type and an extra
size (to be added to the size of the element), and will use this
to create the pool. This has the benefit of letting the compiler
automatically adapt sizeof() and alignof() based on the type
declaration.
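For illustration, a declaration following this scheme would look like this
(the pool below is only an example):

/* name, description string, type: size and alignment derive from the type */
DECLARE_TYPED_POOL(pool_head_connection, "connection", struct connection);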
This adds an alignment argument to create_pool_from_loc() and
completes the existing low-level macros with new ones that expose
the alignment, and the new macros permit specifying it. For now
they're not used.
This will be used to declare aligned pools. For now it's not used,
but it's properly set from the various registrations that compose
a pool, and rounded up to the next power of 2, with a minimum of
sizeof(void*).
The alignment is returned in the "show pools" part that indicates
the entry size. E.g. "(56 bytes/8)" means 56 bytes, aligned by 8.
Just like the previous patch, we want to retrieve the location of the caller.
For this we turn create_pool() into a macro that collects __FILE__ and
__LINE__ and passes them to the now renamed function create_pool_with_loc().
Now the remaining ~30 pools also have their location stored.
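The wrapper is presumably of this shape (the exact create_pool() parameter
list is assumed here):

#define create_pool(name, size, flags) \
        create_pool_with_loc((name), (size), (flags), __FILE__, __LINE__)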
When pools are declared using DECLARE_POOL(), REGISTER_POOL etc, we
know where they are and it's trivial to retrieve the file name and line
number, so let's store them in the pool_registration, and display them
when known in "show pools detailed".
Now we're creating statically allocated registrations instead of
passing all the parameters and allocating them on the fly. Not only
this is simpler to extend (we're limited in number of INITCALL args),
but it also leaves all of these in the data segment where they are
easier to find when debugging.
This is already the case as all names are constant so that's fine. If
it ever changes, it's not very hard to just replace it in-situ
via a strdup() and set a flag to mention that it's dynamically
allocated. We just don't need this right now.
One immediately visible effect is in "show pools detailed" where the
names are no longer truncated.
We've recently introduced pool registrations to be able to enumerate
all pool creation requests with their respective parameters, but till
now they were only used for debugging ("show pools detailed"). Let's
go a step further and split create_pool() in two:
- the first half only allocates and sets the pool registration
- the second half creates the pool from the registration
This is what this patch does. This now opens the ability to pre-create
registrations and create pools directly from there.
We have two versions, _safe() which verifies and adjusts alignment,
and the regular one which trusts the caller. There's also a dedicated
ha_aligned_free() due to mingw.
The currently detected OSes are mingw, unixes older than POSIX 200112
which require memalign(), and those post 200112 which will use
posix_memalign(). Solaris 10 reports 200112 (probably through
_GNU_SOURCE since it does not do it by default), and Solaris 11 still
supports memalign() so for all Solaris we use memalign(). The memstats
wrappers are also implemented, and have the exported names. This was
the opportunity for providing a separate free call that lets the caller
specify the size (e.g. for use with pools).
For now this code is not used.
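A rough sketch of the selection logic described above (argument order and
exact guards are assumptions, not the actual implementation):

#include <stdlib.h>
#include <malloc.h>   /* memalign(), _aligned_malloc() where applicable */

void *ha_aligned_alloc(size_t align, size_t size)
{
#if defined(_WIN32)
        /* mingw: must be released with the matching _aligned_free() */
        return _aligned_malloc(size, align);
#elif defined(__sun) || _POSIX_VERSION < 200112L
        /* Solaris and pre-200112 unixes */
        return memalign(align, size);
#else
        void *ret;

        if (posix_memalign(&ret, align, size) != 0)
                return NULL;
        return ret;
#endif
}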
Sometimes we need to compare it to known versions, let's make sure it's
always defined. We set it to zero if undefined so that it cannot match
any comparison.
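The change presumably boils down to:

#ifndef _POSIX_VERSION
#define _POSIX_VERSION 0   /* undefined: cannot match any comparison */
#endif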
With the upcoming pool declaration, we're filling a struct's fields,
while older versions were relying on initcalls which could be turned
to function declarations. Thus the compound expressions that were
usable there are not necessarily usable anymore, as witnessed here with
gcc-5.5 on solaris 10:
In file included from include/haproxy/quic_tx.h:26:0,
from src/quic_tx.c:15:
include/haproxy/compat.h:106:19: error: braced-group within expression allowed only inside a function
#define MAX(a, b) ({ \
^
include/haproxy/pool.h:41:11: note: in definition of macro '__REGISTER_POOL'
.size = _size, \
^
...
include/haproxy/quic_tx-t.h:6:29: note: in expansion of macro 'MAX'
#define QUIC_MAX_CC_BUFSIZE MAX(QUIC_INITIAL_IPV6_MTU, QUIC_INITIAL_IPV4_MTU)
Let's make the macro use _MAX() instead of MAX() since it relies on pure
constants.
In 3.0 the MIN/MAX macros were converted to compound expressions with
commit 0999e3d959 ("CLEANUP: compat: make the MIN/MAX macros more
reliable"). However with older compilers these are not supported out
of code blocks (e.g. to initialize variables or struct members). This
is the case on Solaris 10 with gcc-5.5, where QUIC doesn't compile
anymore with the future pool registration:
In file included from include/haproxy/quic_tx.h:26:0,
from src/quic_tx.c:15:
include/haproxy/compat.h:106:19: error: braced-group within expression allowed only inside a function
#define MAX(a, b) ({ \
^
include/haproxy/pool.h:41:11: note: in definition of macro '__REGISTER_POOL'
.size = _size, \
^
...
include/haproxy/quic_tx-t.h:6:29: note: in expansion of macro 'MAX'
#define QUIC_MAX_CC_BUFSIZE MAX(QUIC_INITIAL_IPV6_MTU, QUIC_INITIAL_IPV4_MTU)
Let's provide the old relaxed versions as _MIN/_MAX for use with constants
in such cases where it's certain that there is no risk. A previous attempt
using __builtin_constant_p() to switch between the variants did not work,
and it's really not worth the hassle of going this far.
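These relaxed variants are presumably the classical single-expression form,
e.g.:

/* only for pure constants: arguments may be evaluated several times */
#define _MIN(a, b) ((a) < (b) ? (a) : (b))
#define _MAX(a, b) ((a) > (b) ? (a) : (b))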
The struct was mistakenly spelled flt_fcgi_ctx() in fcgi_flt_stop()
when it was introduced in 2.1 with commit 78fbb9f991 ("MEDIUM:
fcgi-app: Add FCGI application and filter"), causing build issues
when trying to get the alignment of the object in pool_free() for
debugging purposes. No backport is needed as it's just used to convey
a pointer.
This commit introduces a sample fetch, `le2dec`, to convert
little-endian binary input samples into their decimal representations.
The function converts the input into a string containing unsigned
integer numbers, with each number derived from a specified number of
input bytes. The numbers are separated using a user-defined separator.
This new sample is achieved by adding a parametrized sample_conv_2dec
function, unifying the logic for be2dec and le2dec converters.
Co-authored-by: Christian Norbert Menges <christian.norbert.menges@sap.com>
[wt: tracked as GH issue #2915]
Signed-off-by: Willy Tarreau <w@1wt.eu>
In 358166a ("BUG/MINOR: hlua_fcn: restore server pairs iterator pointer
consistency"), I wrongly assumed that because the iterator was a temporary
object, no specific cleanup was needed for the watcher.
In fact watcher_detach() is not only relevant for the watcher itself, but
especially for its parent list to remove the current watcher from it.
As iterators are temporary objects, failing to remove their watchers from
the server watcher list causes the server watcher list to be corrupted.
On a normal iteration sequence, the last watcher_next() receives NULL
as target so it successfully detaches the last watcher from the list.
However the corner case here is with interrupted iterators: users are
free to break away from the iteration loop when a specific condition is
met for instance from the lua script, when this happens
hlua_listable_servers_pairs_iterator() doesn't get a chance to detach the
last iterator.
Also, Lua doesn't tell us that the loop was interrupted,
so to fix the issue we rely on the garbage collector to force a last
detach right before the object is freed. To achieve that, watcher_detach()
was slightly modified so that it becomes possible to call it without
knowing if the watcher is already detached or not, if watcher_detach() is
called on a detached watcher, the function does nothing. This way it saves
the caller from having to track the watcher state and makes the API a
little more convenient to use. This way we now systematically call
watcher_detach() for server iterators right before they are garbage
collected.
This was first reported in GH #3055. It can be observed when the server
list is browsed more than one time after it was already browsed from Lua
for a given proxy and that iteration was interrupted before the end. As the
watcher list is corrupted, the common symptom is watcher_attach() or
watcher_next() not ending due to the internal mt_list call looping
forever.
Thanks to GH users @sabretus and @sabretus for their precious help.
It should be backported everywhere 358166a was.
a2base64url() can return a negative value if olen is too short to
accept ilen. This is not supposed to happen since the sha256 should
always fit in a buffer. But this is confusing since a2base64()
returns a signed integer which is put in output->data which is unsigned.
Fix the issue by setting ret to 0 instead of -1 upon error, and return
an unsigned integer instead of a signed one.
This patch also checks the return value from the caller in order
to emit an error, instead of setting trash.data which is already done
from the function.
DNS-01 needs an external process which would register a TXT record on a
DNS provider, using a REST API or something else.
To achieve this, the process should read the dpapi sink and wait for
events. With the DNS-01 challenge, HAProxy will put the task to sleep
before asking the ACME server to achieve the challenge. The task then
needs to be woken up, using the command implemented by this patch.
This patch implements the "acme challenge_ready" command which should be
used by the agent once the challenge was configured in order to wake the
task up.
Example:
echo "@1 acme challenge_ready foobar.pem.rsa domain kikyo" | socat /tmp/master.sock -
This commit adds a new message to the dpapi sink which is emitted during
the new authorization request.
One message is emitted per challenge to resolve. The certificate name as
well as the thumbprint of the account key are on the first line of the
message. The JSON response for one challenge is then dumped, and the
message ends with a \0.
The agent consuming these messages MUST NOT access the URLs, and SHOULD
only use the thumbprint, dns and token to configure a challenge.
Example:
$ ( echo "@@1 show events dpapi -w -0"; cat - ) | socat /tmp/master.sock - | cat -e
<0>2025-08-01T16:23:14.797733+02:00 acme deploy foobar.pem.rsa thumbprint Gv7pmGKiv_cjo3aZDWkUPz5ZMxctmd-U30P2GeqpnCo$
{$
"status": "pending",$
"identifier": {$
"type": "dns",$
"value": "foobar.com"$
},$
"challenges": [$
{$
"type": "dns-01",$
"url": "https://0.0.0.0:14000/chalZ/1o7sxLnwcVCcmeriH1fbHJhRgn4UBIZ8YCbcrzfREZc",$
"token": "tvAcRXpNjbgX964ScRVpVL2NXPid1_V8cFwDbRWH_4Q",$
"status": "pending"$
},$
{$
"type": "dns-account-01",$
"url": "https://0.0.0.0:14000/chalZ/z2_WzibwTPvE2zzIiP3BF0zNy3fgpU_8Nj-V085equ0",$
"token": "UedIMFsI-6Y9Nq3oXgHcG72vtBFWBTqZx-1snG_0iLs",$
"status": "pending"$
},$
{$
"type": "tls-alpn-01",$
"url": "https://0.0.0.0:14000/chalZ/AHnQcRvZlFw6e7F6rrc7GofUMq7S8aIoeDileByYfEI",$
"token": "QhT4ejBEu6ZLl6pI1HsOQ3jD9piu__N0Hr8PaWaIPyo",$
"status": "pending"$
},$
{$
"type": "http-01",$
"url": "https://0.0.0.0:14000/chalZ/Q_qTTPDW43-hsPW3C60NHpGDm_-5ZtZaRfOYDsK3kY8",$
"token": "g5Y1WID1v-hZeuqhIa6pvdDyae7Q7mVdxG9CfRV2-t4",$
"status": "pending"$
}$
],$
"expires": "2025-08-01T15:23:14Z"$
}$
^@
This commit emits a log which outputs the TXT entry to create in case of
DNS-01. This is useful in case you want to update your TXT entry
manually.
Example:
acme: foobar.pem.rsa: DNS-01 requires to set the "acme-challenge.example.com" TXT record to "7L050ytWm6ityJqolX-PzBPR0LndHV8bkZx3Zsb-FMg"
Files ending with '-t.h' are supposed to be used for structure
definitions and could be included in the same file to check API
definitions.
This patch removes TRACE_SOURCE from acme-t.h to avoid conflicts with
other TRACE_SOURCE definitions.
QUIC MUX may be initialized prior to handshake completion, when 0-RTT is
used. In this case, connection is flagged with CO_FL_EARLY_SSL_HS, which
is notably used by wait-for-hs http rule.
Early data may be subject to replay attacks. For this reason, haproxy
adds the header 'Early-data: 1' to all requests handled as TLS early
data. Thus the server can reject it if it is deemed unsafe. This header
injection is implemented by http-ana. However, it was not functional
with QUIC due to the missing CO_FL_EARLY_DATA connection flag.
Fix this by ensuring that QUIC MUX sets CO_FL_EARLY_DATA when needed.
This is performed during qcc_recv() for STREAM frame reception. It is
only set if QC_CF_WAIT_HS is set, meaning that the handshake is not yet
completed. After this, the request is considered safe and Early-data
header is not necessary anymore.
This should fix github issue #3054.
This must be backported up to 3.2 at least. If possible, it should be
backported to all stable releases as well. On these versions, the
current patch relies on the following refactoring commit :
commit 0a53a008d032b69377869c8caaec38f81bdd5bd6
MINOR: mux-quic: refactor wait-for-handshake support
Following the latest adjustment on session_add_conn() /
session_check_idle_conn(), detach muxes callbacks were rewritten for
private connection handling.
Nothing really fancy here : some more explicit comments and the removal
of duplicate checks on idle conn status for muxes with true
multiplexing support.
session_check_idle_conn() is called by muxes when a connection becomes
idle. It ensures that the session idle limit is not yet reached. Else,
the connection is removed from the session and it can be freed.
Prior to this patch, session_check_idle_conn() was compatible with a
NULL session argument. In this case, it would return true, considering
that no limit was reached and connection not removed.
However, this renders the function error-prone and subject to future
bugs. This patch streamlines it by ensuring it is never called with a
NULL argument. Thus it can now only return true if the connection is kept
in the session, or false if it was removed, as first intended.
session_check_idle_conn() is called to flag a connection already
inserted in a session list as idle. If the session limit on the number
of idle connections (max-session-srv-conns) is exceeded, the connection
is removed from the session list.
In addition to the connection removal, session_check_idle_conn()
directly calls MUX destroy callback on the connection. This means the
connection is freed by the function itself and should not be used by the
caller anymore.
This is not practical when an alternative connection closure method
should be used, such as a graceful shutdown with QUIC. As such, remove
MUX destroy invocation : this is now the responsibility of the caller to
either close or immediately release the connection.
Add a BUG_ON() on session_check_idle_conn() to ensure the connection is
not already flagged as CO_FL_SESS_IDLE.
This checks that this function is only called one time per connection
transition from active to idle. This is necessary to ensure that session
idle counter is only incremented one time per connection.
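Concretely, the added check is essentially the following (assuming the usual
connection flags field):

BUG_ON(conn->flags & CO_FL_SESS_IDLE);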
session_add_conn() uses three arguments : connection and session
instances, plus a void pointer labelled as target. Typically, it
represents the server, but can also be a backend instance (for example
on dispatch).
In fact, this argument is redundant as <target> is already a member of
the connection. This commit simplifies session_add_conn() by removing
it. A BUG_ON() on target is extended to ensure it is never NULL.
This commit is the first one of a series to refactor the insertion of
backend private connections into the session list.
session_add_conn() is used to attach a connection into a session list.
Previously, this function would report an error if the connection
specified was already attached to another session. However, this case
currently never happens and thus can be considered as buggy.
Remove this check and replace it with a BUG_ON(). This ensures
that session insertion remains consistent. The same check is also
transformed in session_check_idle_conn().
On stream detach on backend side, connection is inserted in the proper
server/session list to be able to reuse it later. If insertion fails and
the connection is idle, the connection can be removed immediately.
If this occurs on a QUIC connection, QUIC MUX implements graceful
shutdown to ensure the server is notified of the closure. However, the
connection instance is not freed. Change this to ensure that both
shutdown and release is performed.
This is preparation work for shared counters between co-processes. As
co-processes will need to share a common date. global_now_ms will be used
for that as it will point to the shm when sharing is enabled.
Thus in this patch we turn global_now_ms into a pointer (and adjust the
places where it is written to and read from; fortunately atomic operations
through a pointer are already used so the change is trivial).
For now global_now_ms points to process-local _global_now_ms which is a
fallback for when sharing through the shm is not enabled.
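An illustrative sketch of the resulting indirection (the exact type of the
variable is an assumption here):

/* process-local fallback, used when shm sharing is not enabled */
static unsigned int _global_now_ms;
unsigned int *global_now_ms = &_global_now_ms;

/* readers and writers then dereference the pointer, typically atomically:
 *     now = HA_ATOMIC_LOAD(global_now_ms);
 */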
75e480d10 ("MEDIUM: stats: avoid 1 indirection by storing the shared
stats directly in counters struct") took care of renaming
counters_fe_shared_init() but we forgot counters_be_shared_init().
Let's fix that for consistency.
As discussed in GH #3051, default-path is not taken into account when
loading files using lua-load-per-thread. In fact, the initial
hlua_load_state() (performed on first thread which parses the config)
is successful, but other threads run hlua_load_state() later based
on config hints which were saved by the first thread, and those config
hints only contain the file path provided on the lua-load-per-thread
config line, not the absolute one. Indeed, `default-path` directive
changes the current working directory only for the thread parsing the
configuration.
To fix the issue, when storing config hints under hlua_load_per_thread()
we now make sure to save the absolute file path for `lua-load-per-thread'
argument.
Thanks to GH user @zhanhb for having reported the issue
It may be backported to all stable versions.
Implement traces for the ACME protocol.
-dt acme:data:complete will dump every input and output buffers,
including decoded buffers before being converted to JWS.
It will also dump certificates in the traces.
-dt acme:user:complete will only dump the state of the task handler.
Released version 3.3-dev5 with the following main changes :
- BUG/MEDIUM: queue/stats: also use stream_set_srv_target() for pendconns
- DOC: list missing global QUIC settings
Complete list of global keywords with missing QUIC entries.
This could be backported to stable versions. This requires to take into
account the version of introduction for each keyword.
* limited-quic, introduced in 2.8
* no-quic, introduced in 2.8
* tune.quic.cc.cubic.min-losses, introduced in 3.1
Following c24de07 ("OPTIM: stats: store fast sharded counters pointers
at session and stream level") some crashes were observed in
connect_server():
#0 0x00000000007ba39c in connect_server (s=0x65117b0) at src/backend.c:2101
2101 _HA_ATOMIC_INC(&s->sv_tgcounters->connect);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-325.el7_9.x86_64 libgcc-4.8.5-44.el7.x86_64 nss-softokn-freebl-3.67.0-3.el7_9.x86_64 pcre-8.32-17.el7.x86_64
(gdb) bt
#0 0x00000000007ba39c in connect_server (s=0x65117b0) at src/backend.c:2101
#1 0x00000000007baff8 in back_try_conn_req (s=0x65117b0) at src/backend.c:2378
#2 0x00000000006c0e9f in process_stream (t=0x650f180, context=0x65117b0, state=8196) at src/stream.c:2366
#3 0x0000000000bd3e51 in run_tasks_from_lists (budgets=0x7ffd592752e0) at src/task.c:655
#4 0x0000000000bd49ef in process_runnable_tasks () at src/task.c:889
#5 0x0000000000851169 in run_poll_loop () at src/haproxy.c:2834
#6 0x0000000000851865 in run_thread_poll_loop (data=0x1a03580 <ha_thread_info>) at src/haproxy.c:3050
#7 0x0000000000852a53 in main (argc=7, argv=0x7ffd592755f8) at src/haproxy.c:3637
Here the crash occurs during the atomic inc of a sv_tgcounters metric from
the stream pointer, which tells us the pointer is likely garbage.
In fact, we assign s->sv_tgcounters each time the stream target is set to
a valid server. For that we use the stream_set_srv_target() helper which
does the assignment for us. By reviewing the code, it turns out we forgot
to call stream_set_srv_target() in pendconn_dequeue(), where the stream
target is set to the server which picked the pendconn.
Let's fix the bug by using stream_set_srv_target() there.
No backport needed unless c24de07 is.
Released version 3.3-dev4 with the following main changes :
- CLEANUP: server: do not check for duplicates anymore in findserver()
- REORG: server: move findserver() from proxy.c to server.c
- MINOR: server: use the tree to look up the server name in findserver()
- CLEANUP: server: rename server_find_by_name() to server_find()
- CLEANUP: server: rename findserver() to server_find_by_name()
- CLEANUP: server: use server_find_by_name() where relevant
- CLEANUP: cfgparse: lookup proxy ID using existing functions
- CLEANUP: stream: lookup server ID using standard functions
- CLEANUP: server: simplify server_find_by_id()
- CLEANUP: server: add server_find_by_addr()
- CLEANUP: stream: use server_find_by_addr() in sticking_rule_find_target()
- CLEANUP: server: be sure never to compare src against a non-existing defsrv
- MEDIUM: proxy: take the defsrv out of the struct proxy
- MINOR: proxy: add checks for defsrv's validity
- MEDIUM: proxy: no longer allocate the default-server entry by default
- MEDIUM: proxy: register a post-section cleanup function
- MINOR: debug: report haproxy and operating system info in panic dumps
- BUG/MEDIUM: h3: do not overwrite interim with final response
- BUG/MINOR: h3: properly realloc buffer after interim response encoding
- BUG/MINOR: h3: ensure that invalid status code are not encoded (FE side)
- MINOR: qmux: change API for snd_buf FIN transmission
- BUG/MEDIUM: h3: handle interim response properly on FE side
- BUG/MINOR: h3: properly handle interim response on BE side
- BUG/MINOR: quic: Wrong source address use on FreeBSD
- MINOR: h3: remove unused outbuf in h3_resp_headers_send()
- BUG/MINOR: applet: Don't trigger BUG_ON if the tid is not on appctx init
- DEV: gdb: add a memprofile decoder to the debug tools
- MINOR: quic: Get rid of qc_is_listener()
- DOC: connection: explain the rules for idle/safe/avail connections
- BUG/MEDIUM: quic-be: CC buffer released from wrong pool
- BUG/MINOR: halog: exit with error when some output filters are set simultaneosly
- MINOR: cpu-topo: split cpu_dump_topology() to show its summary in show dev
- MINOR: cpu-topo: write thread-cpu bindings into trash buffer
- MINOR: debug: align output style of debug_parse_cli_show_dev with cpu_dump_topology
- MINOR: debug: add thread-cpu bindings info in 'show dev' output
- MINOR: quic: Remove pool_head_quic_be_cc_buf pool
- BUILD: debug: add missed guard USE_CPU_AFFINITY to show cpu bindings
- BUG/MEDIUM: threads: Disable the workaround to load libgcc_s on macOS
- BUG/MINOR: logs: fix log-steps extra log origins selection
- BUG/MINOR: hq-interop: fix FIN transmission
- MINOR: ssl: Add ciphers in ssl traces
- MINOR: ssl: Add curve id to curve name table and mapping functions
- MINOR: ssl: Add curves in ssl traces
- MINOR: ssl: Dump ciphers and sigalgs details in trace with 'advanced' verbosity
- MINOR: ssl: Remove ClientHello specific traces if !HAVE_SSL_CLIENT_HELLO_CB
- MINOR: h3: use smallbuf for request header emission
- MINOR: h3: add traces to h3_req_headers_send()
- BUG/MINOR: h3: fix uninitialized value in h3_req_headers_send()
- MINOR: log: explicitly ignore "log-steps" on backends
- BUG/MEDIUM: acme: use POST-as-GET instead of GET for resources
- BUG/MINOR mux-quic: apply correctly timeout on output pending data
- BUG/MINOR: mux-quic: ensure close-spread-time is properly applied
- MINOR: mux-quic: refactor timeout code
- MINOR: mux-quic: correctly implement backend timeout
- MINOR: mux-quic: disable glitch on backend side
- MINOR: mux-quic: store session in QCS instance
- MEDIUM: mux-quic: implement be connection reuse
- MINOR: mux-quic: do not reuse connection if app already shut
- MEDIUM: mux-quic: support backend private connection
- MINOR: acme: remove acme_req_auth() and use acme_post_as_get() instead
- BUG/MINOR: acme: allow "processing" in challenge requests
- CLEANUP: acme: fix wrong spelling of "resources"
- CLEANUP: ssl: Use only NIDs in curve name to id table
- MINOR: acme: add ACME to the haproxy -vv feature list
- BUG/MINOR: hlua: Skip headers when a receive is performed on an HTTP applet
- BUG/MEDIUM: applet: State inbuf is no longer full if input data are skipped
- BUG/MEDIUM: stconn: Fix conditions to know an applet can get data from stream
- BUG/MINOR: applet: Fix applet_getword() to not return one extra byte
- BUG/MEDIUM: Remove sync sends from streams to applets
- MINOR: applet: Add HTX versions for applet_input_data() and applet_output_room()
- MINOR: applet: Improve applet API to take care of inbuf/outbuf alloc failures
- MEDIUM: hlua: Update the tcp applet to use its own buffers
- MINOR: hlua: Fill the request array on the first HTTP applet run
- MINOR: hlua: Use the buffer instead of the HTTP message to get HTTP headers
- MEDIUM: hlua: Update the http applet to use its own buffers
- BUG/MEDIUM: hlua: Report to SC when data were consumed on a lua socket
- BUG/MEDIUM: hlua: Report to SC when output data are blocked on a lua socket
- MEDIUM: hlua: Update the socket applet to use its own buffers
- BUG/MEDIUM: dns: Reset reconnect tempo when connection is finally established
- MEDIUM: dns: Update the dns_session applet to use its own buffers
- CLEANUP: http-client: Remove useless indentation when sending request body
- MINOR: http-client: Try to send request body with headers if possible
- MINOR: http-client: Trigger an error if first response block isn't a start-line
- BUG/MINOR: httpclient-cli: Don't try to dump raw headers in HTX mode
- MINOR: httpclient-cli: Reset httpclient HTX buffer instead of removing blocks
- MEDIUM: http-client: Update the http-client applet to use its own buffers
- MEDIUM: log: Update the log applet to use its own buffers
- MEDIUM: sink: Update the sink applets to use their own buffers
- MEDIUM: peers: Update the peer applet to use its own buffers
- MEDIUM: promex: Update the promex applet to use their own buffers
- MINOR: applet: Add support for flags on applets with a flag about the new API
- MEDIUM: applet: Emit a warning when a legacy applet is spawned
- BUG/MEDIUM: logs: fix sess_build_logline_orig() recursion with options
- MEDIUM: stats: avoid 1 indirection by storing the shared stats directly in counters struct
- CLEANUP: compiler: prefer char * over void * for pointer arithmetic
- CLEANUP: include: replace hand-rolled offsetof to avoid UB
- CLEANUP: peers: remove unused peer_session_target()
- OPTIM: stats: store fast sharded counters pointers at session and stream level
Following commit 75e480d10 ("MEDIUM: stats: avoid 1 indirection by storing
the shared stats directly in counters struct"), in order to minimize the
impact of the recent sharded counters work, we try to push things a bit
further in this patch by storing and using "fast" pointers at the session
and stream levels when available to avoid costly indirections and
systematic "tgid" resolution (which can not be cached by the CPU due to
its THREAD-local nature).
Indeed, we know that a session/stream is tied to a given CPU, thanks to
this we know that the tgid for a given session/stream will never change.
Given that, we are able to store sharded frontend and listener counters
pointer at the session level (namely sess->fe_tgcounters and
sess->li_tgcounters), and once the backend and the server are selected,
we are also able to store backend and server sharded counters
pointer at the stream level (namely s->be_tgcounters and s->sv_tgcounters).
Everywhere we rely on these counters and the stream or session context is
available, we use the fast pointers instead of the indirect pointer path
to make the pointer resolution a bit faster.
This optimization proved to bring a few percent back, and together with
the previous 75e480d10 commit we now fixed the performance regression (we
are back on par with 3.2 stats performance).
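For illustration, here is a minimal sketch of the idea using hypothetical
types and names (not the actual HAProxy structures): the per-thread-group
slot is resolved once when the session is created, and the hot path then
increments through the cached pointer only.

    #include <stddef.h>

    #define MAX_TGROUPS 16

    struct fe_counters_shared {
        unsigned long cum_conn[MAX_TGROUPS];      /* one slot per thread group */
    };

    struct session_sketch {
        struct fe_counters_shared *fe_counters;   /* shared counters structure */
        unsigned long *fe_tgcounters;             /* "fast" pointer to our tgid slot */
    };

    /* done once at session creation: a session never changes thread group */
    static void session_init_fast_counters(struct session_sketch *sess, int tgid)
    {
        sess->fe_tgcounters = &sess->fe_counters->cum_conn[tgid];
    }

    /* hot path: no tgid resolution and one less indirection */
    static inline void count_conn(struct session_sketch *sess)
    {
        __atomic_add_fetch(sess->fe_tgcounters, 1, __ATOMIC_RELAXED);
    }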
Since commit 7293eb68 ("MEDIUM: peers: use server as stream target") the peer
session target always points to the server in order to benefit from existing
server transport options.
Thanks to that, it is no longer necessary to have the peer_session_target()
helper function, because all it does is return the pointer to the
server object. Let's get rid of it.
The C standard specifies that it's undefined behavior to dereference
NULL (even if you use & right after). The hand-rolled offsetof idiom
&(((s*)NULL)->f) is thus technically undefined. This clutters the
output of UBSan and is simple to fix: just use the real offsetof when
it's available.
Note that there's no clear statement about this point in the spec,
only several points which together converge to this:
- From N3220, 6.5.3.4:
A postfix expression followed by the -> operator and an identifier
designates a member of a structure or union object. The value is
that of the named member of the object to which the first expression
points, and is an lvalue.
- From N3220, 6.3.2.1:
An lvalue is an expression (with an object type other than void) that
potentially designates an object; if an lvalue does not designate an
object when it is evaluated, the behavior is undefined.
- From N3220, 6.5.4.4 p3:
The unary & operator yields the address of its operand. If the
operand has type "type", the result has type "pointer to type". If
the operand is the result of a unary * operator, neither that operator
nor the & operator is evaluated and the result is as if both were
omitted, except that the constraints on the operators still apply and
the result is not an lvalue. Similarly, if the operand is the result
of a [] operator, neither the & operator nor the unary * that is
implied by the [] is evaluated and the result is as if the & operator
were removed and the [] operator were changed to a + operator.
=> In short, this is saying that C guarantees these identities:
1. &(*p) is equivalent to p
2. &(p[n]) is equivalent to p + n
As a consequence, &(*p) doesn't result in the evaluation of *p, only
the evaluation of p (and similar for []). There is no corresponding
special carve-out for ->.
See also: https://pvs-studio.com/en/blog/posts/cpp/0306/
After this patch, HAProxy can run without crashing after building w/
clang-19 -fsanitize=undefined -fno-sanitize=function,alignment
This patch changes two instances of pointer arithmetic on void *
to use char * instead, to avoid UB. This is essentially to please
UB analyzers, though.
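As an illustration (generic C, not the HAProxy code itself), this is the kind
of change involved: the hand-rolled macro dereferences NULL and trips UBSan,
while the standard offsetof() and char * arithmetic do not.

    #include <stddef.h>
    #include <stdio.h>

    struct item {
        int  key;
        char name[16];
    };

    /* hand-rolled idiom: technically UB because it dereferences NULL */
    #define OLD_OFFSETOF(s, f) ((size_t)&(((s *)NULL)->f))

    int main(void)
    {
        struct item it = { 1, "x" };
        void *p = &it;

        /* prefer the real offsetof() when it is available */
        printf("offset of name: %zu\n", offsetof(struct item, name));

        /* arithmetic on void * is a GCC extension; cast to char * instead */
        struct item *back = (struct item *)((char *)p + 0);
        printf("key: %d\n", back->key);
        return 0;
    }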
Between 3.2 and 3.3-dev we observed a noticeable performance regression
due to stats handling. After bisecting, Willy found out that recent
work to split stats computing across multiple thread groups (stats
sharding) was responsible for that performance regression. We're looking
at roughly 20% performance loss.
More precisely, it is the added indirections, multiplied by the number
of statistics that are updated for each request, which in the end causes
a significant amount of time being spent resolving pointers.
We noticed that the fe_counters_shared and be_counters_shared structures
which are currently allocated in dedicated memory since a0dcab5c
("MAJOR: counters: add shared counters base infrastructure")
are no longer huge since 16eb0fab31 ("MAJOR: counters: dispatch counters
over thread groups") because they now essentially hold flags plus the
per-thread group id pointer mapping, not the counters themselves.
As such we decided to try merging fe_counters_shared and
be_counters_shared in their parent structures. The cost is slight memory
overhead for the parent structure, but it allows us to get rid of one
pointer indirection. This patch alone yields visible performance gains
and almost restores 3.2 stats performance.
counters_fe_shared_get() was renamed to counters_fe_shared_prepare() and
now returns either failure or success instead of a pointer because we
don't need to retrieve a shared pointer anymore; the function takes care
of initializing the existing pointer.
Since ccc43412 ("OPTIM: log: use thread local lf_buildctx to stop pushing
it on the stack"), recursively calling sess_build_logline_orig(), which
may for instance happen when leveraging %ID (or unique-id fetch) for the
first time, would lead to undefined behavior because the parent
sess_build_logline_orig() build context was shared between recursive calls
(only one build ctx per thread to avoid pushing it on the stack for each
call).
In short, the parent build ctx would be altered by the recursive calls,
which is obviously not expected and could result in log formatting errors.
To fix the issue but still avoid polluting the stack with the large lf_buildctx
struct, let's move the static 256-byte build buffer out of the buildctx
so that the buildctx is now stored on the stack again (each function
invocation has its own dedicated build ctx). On the other hand, it's
acceptable to have only one 256-byte build buffer per thread because the
build buffer is not involved in recursive calls (unlike the build ctx).
Thanks to Willy and Vincent Gramer for spotting the bug and providing a
useful repro.
It should be backported to 3.0 with ccc43412.
To motivate developers to support the new applets API, a warning is now
emitted when a legacy applet is spawned. To not flood users, this warning is
only emitted once per legacy applet. To do so, the applet flag
APPLET_FL_WARNED was added. It is set when the warning is emitted.
Note that the test and set on this flag are not performed via atomic operations.
So it is possible to have more than one warning for a given applet if it is
spawned at the same time on several threads. At worst, there is one warning per
thread.
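A minimal sketch of that warn-once logic, with hypothetical names (the real
flag lives in the applet structure): the read-modify-write is not atomic,
which is the benign race described above.

    #include <stdio.h>

    #define APPLET_FL_WARNED 0x01

    struct applet_sketch {
        const char  *name;
        unsigned int flags;
    };

    static void warn_if_legacy(struct applet_sketch *app)
    {
        if (app->flags & APPLET_FL_WARNED)
            return;
        /* plain (non-atomic) set: at worst one warning per thread */
        app->flags |= APPLET_FL_WARNED;
        fprintf(stderr, "WARNING: applet '%s' still uses the legacy API\n", app->name);
    }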
A new field was added in the applet structure to be able to set flags on the
applets. The first one is related to the new API. APPLET_FL_NEW_API is set
for applets based on the new API. It was set on all HAProxy's applets.
Thanks to this patch, the promex applet is now using its own buffers.
.rcv_buf and .snd_buf callback functions are now defined to use the default
HTX functions. Parts to receive and send data have also been updated to use
the applet API and to remove any dependencies on the stream-connectors and
the channels.
Thanks to this patch, the peer applet is now using its own buffers. .rcv_buf
and .snd_buf callback functions are now defined to use the default raw
functions. The applet API is now used and any dependencies on the
stream-connectors and the channels were removed.
Thanks to this patch, the sink applets are now using their own buffers.
.rcv_buf and .snd_buf callback functions are now defined to use the default
raw functions. The applet API is now used and any dependencies on the
stream-connectors and the channels were removed.
Thanks to this patch, the log applet is now using its own buffers. .rcv_buf
and .snd_buf callback functions are now defined to use the default raw
functions. The applet API is now used and any dependencies on the
stream-connectors and the channels were removed.
Thanks to this patch, the http-client applet is now using its own buffers.
.rcv_buf and .snd_buf callback functions are now defined to use the default
HTX functions. Parts to receive and send data have also been updated to use
the applet API and to remove any dependencies on the stream-connectors and
the channels.
In the CLI I/O handler interacting with the HTTP client, in HTX mode, after
a dump of the HTX message, data must be removed. Instead of removing all
blocks one by one, we can call htx_reset() because the whole message must be
flushed.
In the CLI I/O handler interacting with the HTTP client, we must not try to
push raw headers in HTX mode, because there is no raw data in this
mode. This prevented the HTX dump at the end of the I/O handler.
It is a 3.3-specific issue. No backport needed.
The first HTX block of a response must be a start-line. There is no reason
to wait for something else. And if there are output data in the response
channel buffer, it means the start-line must be found there.
There is no reason to yield after sending the request headers, unless the
request was fully sent. If there is a payload, it is better to send it as
well. However, when the whole request was sent, we can leave the I/O handler.
Thanks to this patch, the dns_session applet is now using its own
buffers. .rcv_buf and .snd_buf callback functions are now defined to use the
default raw functions. Functions to receive and send data have also been
updated to use the applet API and to remove any dependencies on the
stream-connectors and the channels.
The issue was introduced by commit 27236f221 ("BUG/MINOR: dns: add tempo
between 2 connection attempts for dns servers"). In this patch, to delay the
reconnection, a timer is used on the appctx when it is created. This
postpones the appctx initialization. However, once initialized, the
expiration time of the underlying task is not reset. So, it is always
considered as expired and the appctx is woken up in a loop.
The fix is quite simple. In dns_session_init(), the expiration time of the
appctx's task is now always set to TICK_ETERNITY.
This patch must be backported everywhere the commit above was backported. So
as far as 2.8 for now but possibly to all stable versions.
Thanks to this patch, the lua cosocket applet is now using its own
buffers. .rcv_buf and .snd_buf callback functions are now defined to use the
default raw functions. Functions to receive and send data have also been
updated to use the applet API and to remove any dependencies on the
stream-connectors and the channels.
It is a fix similar to the previous one ("BUG/MEDIUM: hlua: Report to SC
when data were consumed on a lua socket"), but for the write side. The
writer must notify the cosocket that it needs more space in the request buffer
to produce more data by calling sc_need_room(). Otherwise, nothing prevents
the cosocket applet from being woken up again and again.
This patch must be backported as far as 2.8, and maybe to 2.6 too.
The lua cosockets are quite strange. There is an applet used to handle the
connection, and writers and readers subscribed to it to write or read
data. Writers and readers are tasks woken up by the cosocket applet when
data can be consumed or produced, depending on the channels buffers
state. Then the cosocket applet is woken up by writers and readers when read
or write events were performed.
It means the cosocket applet has only little information on what was produced
or consumed. It is the writers' and readers' responsibility to notify any
blocking. Among other things, the readers must take care to notify the
stream on top of the cosocket applet that some data was consumed. Otherwise,
it may remain blocked, waiting for a write event (a write event from the
stream point of view is a read event from the cosocket point of view).
This patch must be backported as far as 2.8, and maybe to 2.6 too.
Thanks to this patch, the lua HTTP applet is now using its own buffers.
.rcv_buf and .snd_buf callback functions are now defined to use the default
HTX functions. Functions to receive and send data have also been updated to
use the applet API and to remove any dependencies on the stream-connectors
and the channels.
hlua_http_get_headers() function was using the HTTP message from the stream
TXN to retrieve headers from a message. However, this will be an issue to
update the lua HTTP applet to use its own buffers. Indeed, in that case,
information from the channels will be unavailable. So,
hlua_http_get_headers() is now using a buffer containing an HTX message. It
is just an API change because, internally, the function was already
manipulating an HTX message.
When a lua HTTP applet is created, a "request" object is created, filled
with the request information (method, path, headers...), to be able to
easily retrieve this information from the script. However, this was done
when the appctx was created, retrieving the info from the request channel.
To be able to update the applet to use its own buffer, it is now performed on
the first applet run. Indeed, when the applet is created, the info is not
forwarded yet and should not be accessed. Note that for now, the information
is still retrieved from the channel.
Thanks to this patch, the lua TCP applet is now using its own buffers.
.rcv_buf and .snd_buf callback functions are now defined to use the default
raw functions. Other changes are quite light. Mainly, end of stream and
errors are reported on the appctx instead of the stream-endpoint descriptor.
applet_get_inbuf() and applet_get_outbuf() functions were not testing if the
buffers were available. So, the caller had to check them before calling one
of these functions. That was not really handy. So now, these functions take
care to have a fully usable buffer before returning. Otherwise, NULL is
returned.
It will be useful for HTX applets because available data in the input buffer and
available space in the output buffer are computed from the HTX message and not
the buffer itself. So now, applet_htx_input_data() and applet_htx_output_room()
functions can be used.
When the applet API was reviewed to use dedicated buffers, the support for
sends from the streams to applets was added. Unfortunately, it was not a
good idea because this way it is possible to deliver data to an applet and
release it just after, truncating data. Indeed, the release stage for applets
is related to the stream release itself. However, unlike the multiplexers,
the applets cannot survive a stream for now.
So, for now, the sync sends from the streams are removed for applets, waiting
for a better way to handle the applets release stage.
Note that this only concerns applets using their own buffers. As of now,
the bug is harmless because all refactored applets are on the server side and
consume data first. But this will be an issue with the HTTP client.
This patch should be backported as far as 3.0 after a period of observation.
applet_getword() function is returning one extra byte when a string is
returned because the "ret" variable is not reset before the loop on the
data. The patch also fixes applet_getline().
It is a 3.3-specific issue. No need to backport.
sc_is_send_allowed() function is used to know if an applet is able to
receive data from the stream. But this function was designed for applets
using the channels buffer. It is not adapted to applets using their own
buffers.
When the SE_FL_WAIT_DATA flag is set, it means the applet is waiting for
more data and should not be woken up without new data. For applets using the
channels buffer, just testing the flag is enough because process_stream()
will remove it when more data is available. For applets using their own
buffers, it is more complicated. Some data may be blocked in the output
channel buffer. In that case, and when the applet input buffer can receive
data, the applet can be woken up.
This patch must be backported as far as 3.0 after a period of observation.
When data are skipped from the input buffer of an applet, we must take care
to notify that the input buffer is no longer full. Otherwise, this could prevent
the stream from pushing data to the applet.
It is 3.3-specific. No backport needed.
When an HTTP applet tries to retrieve data, the request headers are still in
the buffer. But, instead of being silently removed, their size is deducted
from the amount of data retrieved. When the request payload is fully
retrieved, it is not an issue. But it is a problem when a length is
specified. The data are shortened by the headers size.
So now, we take care to silently remove the headers.
This patch must be backported to all stable versions.
The curve name to curve id mapping table was built out of multiple
internal tables found in openssl sources, namely the 'nid_to_group'
table found in 'ssl/t1_lib.c' which maps openssl specific NIDs to public
IANA curve identifiers. In this table, there were two instances of
EVP_PKEY_XXX ids being used while all the other ones are NID_XXX
identifiers.
Since the two EVP_PKEY are actually equal to their NID equivalent in
'include/openssl/evp.h' we can use NIDs all along for better coherence.
Allow the "processing" status in the challenge object when requesting
to do the challenge, in addition to "pending".
According to RFC 8555 https://datatracker.ietf.org/doc/html/rfc8555/#section-7.1.6
Challenge objects are created in the "pending" state. They
transition to the "processing" state when the client responds to the
challenge (see Section 7.5.1)
However some CA could respond with a "processing" state without ever
transitioning to "pending".
Must be backported to 3.2.
If a backend connection is private, it should not be reused outside of
its original attached session. As such, on stream detach operation, such
connection is never inserted into server idle/avail list. Instead, it is
stored directly on the session.
The purpose of this commit is to implement proper handling of private
backend connections via QUIC multiplexer.
QUIC connection graceful closure is performed in two steps. First, the
application layer is closed. In the context of HTTP/3, this is done with
a GOAWAY frame emission, which forbids opening of new streams. Then the
whole connection is terminated via CONNECTION_CLOSE which is the final
emitted frame.
This commit ensures that when app layer is shut for a backend
connection, this connection is removed from either idle or avail server
tree. The objective is to prevent the stream layer from trying to reuse a
connection if no new stream can be attached to it.
New BUG_ON checks are inserted in qmux_strm_attach() and h3_attach() to
ensure that this assertion is always true.
Implement support for QUIC connection reuse on the backend side. The
main change is done during detach stream operation. If a connection is
idle, it is inserted in the server list. Else, it is stored in the
server avail tree if there is room for more streams.
For non idle connection, qmux_avail_streams() is reused to detect that
stream flow-control limit is not yet reached. If this is the case, the
connection is not inserted in the avail tree, so it cannot be reused,
even if flow-control is unblocked later by the peer. This latter point
could be improved in the future.
Note that support for QUIC private connections is still missing. Reuse
code will evolve to fully support this case.
Add a new <sess> member into QCS structure. It is used to store the
parent session of the stream on attach operation. This is only done for
backend side.
This new member will become necessary when connection reuse will be
implemented. <owner> member of connection is not suitable as it could be
set to NULL, notably after a session_add_conn() failure.
Also, a single BE conn can be shared across different session instances,
in particular when using aggressive/always reuse mode. Thus it is
necessary to link each QCS instance with its session.
For now, QUIC glitch limit counter is only available on the frontend
side. Thus, disable incrementation on the backend side for now. Also,
session is only available as conn <owner> reliably on the frontend side,
so the session_add_glitch_ctr() operation is also secured.
qcc_refresh_timeout() is the function called on QUIC MUX activity. Its
purpose is to update the timeout by selecting the correct value
depending on the connection state.
Prior to this patch, backend connections were mostly ignored by the
function. However, the default server timeout was selected as a
fallback. This is incompatible with backend connections reuse.
This patch fixes timeout applied on backend connections. Only values
specific to frontend which are http-request and http-keep-alive timeouts
are now ignored for a backend connection. Also, fallback timeout is only
used for frontend connections.
This patch ensures that an idle backend connection won't be deleted due
to server timeout. This is necessary for proper connection reuse which
will be implemented in a future patch.
This commit is a small reorganization of condition used into
qcc_refresh_timeout(). Its objective is to render the code more logical
before the next patch which will ensure that timeout is properly set for
backend connections.
If a connection remains on a proxy currently disabled or stopped, a
special spread timeout is set if active close is configured. For QUIC
MUX, this is set via qcc_refresh_timeout() as with all other timeout
values.
Fix this closing timeout setting: it is now used as an override to any
other timeout that may have been chosen if calculated spread time is
lower than the previously selected value. This is done for backend
connections as well.
This should be backported up to 2.6 after a period of observation.
When no stream is attached, mux layer is responsible to maintain a
timeout. The first criteria is to apply client/server timeout if there
is still data waiting for emission.
Previously, <hreq> qcc member was used to determine this state. However,
this only covers bidirectional streams. Fix this by testing if
<send_list> is empty or not. This is enough to take into account both
bidi and uni streams.
Theoretically, this should be backported to every stable version.
However, send-list is not available on 2.6 and there is no alternative
to quickly determine if there is waiting output data. Thus, it's better
to backport it up to 2.8 only.
The requests that checked the status of the challenge and the retrieval
of the certificate were done using a GET.
This is working with letsencrypt and other CA providers, but it might
not work everywhere. RFC 8555 specifies that only the directory and
newNonce resources MUST work with GET requests, but everything else
must use POST-as-GET.
Must be backported to 3.2.
"log-steps" was already ignored if directly defined in a backend section,
however, when defined in a defaults section it was inherited by all
proxies no matter their capability (ie: including backends).
As configurations often contain more backends than frontends, this would
result in wasted memory given that the log-steps setting is only
considered on frontends.
Let's fix that by preventing the inheritance from defaults section to
anything else than frontends. Also adjust the documentation to mention
that the setting is not relevant for backends.
Due to the introduction of smallbuf usage for HTTP/3 headers emission,
ret variable may be used uninitialized if buffer allocation fails due to
not enough room in QUIC connection window.
Fix this by setting ret value to 0.
Function variable declaration are also adjusted so that the pattern is
similar to h3_resp_headers_send(). Finally, outbuf buffer is also
removed as it is now unused.
No need to backport.
Similarly to HTTP/3 response encoding, a small buffer is first allocated
for the request encoding on the backend side. If this is not sufficient,
the smallbuf is replaced by a standard buffer and encoding is restarted.
This is useful to reduce the window usage over a connection of smaller
requests.
SSL libraries like wolfSSL that don't have the clienthello callback
mechanism enabled do not need to have the traces that are only called
from the said callback.
The code added to parse the ciphers relied on a function that was not
defined in wolfSSL (SSL_CIPHER_find).
The contents of the extensions were only dumped with verbosity
'complete' which meant that the 'advanced' verbosity was pretty much
useless despite what its name implies (it was the same as the 'simple'
one).
The 'advanced' verbosity is now the "maximum" one, using 'complete'
would not add any extra information yet, but it leaves more room for
some actually large traces to be dumped later on (some complete
ClientHello dumps for instance).
The SSL libraries like OpenSSL for instance do not seem to actually
provide a public mapping between IANA defined curve IDs and curve names,
or even a mapping between curve IDs and internal NIDs.
This new table regroups all this information in a single place so that
we can convert curve names (be it SECG or NIST format) to curve IDs or
NIDs.
The previously existing 'curves2nid' function now uses the new table,
and a new 'curveid2str' one is added.
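A hypothetical shape for such a table (illustrative only, not the exact
HAProxy definition; the ids below are the public IANA TLS group codepoints
and the NIDs come from OpenSSL's obj_mac.h):

    #include <openssl/obj_mac.h>

    struct curve_map {
        const char  *secg_name;  /* e.g. "prime256v1" */
        const char  *nist_name;  /* e.g. "P-256" */
        unsigned int curve_id;   /* IANA TLS group id */
        int          nid;        /* OpenSSL NID */
    };

    static const struct curve_map curves[] = {
        { "prime256v1", "P-256",  23, NID_X9_62_prime256v1 },
        { "secp384r1",  "P-384",  24, NID_secp384r1        },
        { "secp521r1",  "P-521",  25, NID_secp521r1        },
        { "x25519",     "X25519", 29, NID_X25519           },
    };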
Since the following patch, the app_ops layer is now responsible for reporting
that the HTX block was the last one transmitted so that the STREAM FIN can be set.
This is mandatory to properly support HTTP 1xx interim responses.
f349df44b4e21d8bf9b575a0aa869056a2ebaa58
MINOR: qmux: change API for snd_buf FIN transmission
This change was correctly implemented in HTTP/3 code, however an issue
appeared on hq-interop transcoder in case zero-copy DATA transfer is
performed when HTX buffer is swapped. If this occurred during the
transfer of the last HTX block, EOM is not detected and thus STREAM FIN
is never set.
Most of the time, the QMUX shut callback is called immediately after. This
results in an emission of a RESET_STREAM to the client, which prevents
the data transfer.
To fix this, use the same method as HTTP/3 : HTX EOM flag status is
checked before any transfer, thus preserving it even after a zero-copy.
The severity of this bug is low as hq-interop is experimental and is mostly
used for interop testing.
This should fix github issue #3038.
This patch must be backported wherever the above one is.
Willy noticed that it was not possible to select extra log origins using
the log-steps directive. Extra origins are the ones registered using
log_orig_register(), such as http-req.
The reason was that the error path was always executed during extra log origin
matching in the log-steps parser, while it should only be executed if no
match was found.
It should be backported to 3.1.
Don't use the workaround to load libgcc_s on macOS. It is not needed
there, and it causes issues, as recent macOS dislikes processes that fork
after threads were created (and the workaround creates a temporary
thread). This fixes crashes on macOS at least when using master-worker,
and using the system resolver.
This should fix Github issue #3035
This should be backported up to 2.8.
Not all platforms support thread-cpu bindings, so let's put
cpu_topo_dump_summary() under USE_CPU_AFFINITY guards.
Only needs to be backported if 1cc0e023ce ("MINOR: debug: add thread-cpu
bindings info in 'show dev' output") is backported.
This patch impacts the QUIC frontends. It reverts this patch
MINOR: quic-be: add a "CC connection" backend TX buffer pool
which adds <pool_head_quic_be_cc_buf> new pool to allocate CC (connection closed state)
TX buffers with bigger object size than the one for <pool_head_quic_cc_buf>.
Indeed, the QUIC backends must be able to send Initial packets of at least 1200 bytes.
From now on, both the QUIC frontends and backends use the same pool with
MAX(QUIC_INITIAL_IPV6_MTU, QUIC_INITIAL_IPV4_MTU)(1252 bytes) as object size.
Align titles style of debug_parse_cli_show_dev() with
cpu_dump_topology(). We will call the latter inside of
debug_parse_cli_show_dev() to show thread-cpu bindings info.
cpu_dump_topology() prints details about each enabled CPU and a summary with
clusters info and thread-cpu bindings. The latter is often useful for
debugging and we want to add it in the 'show dev' output.
So, let's split cpu_dump_topology() in two parts: cpu_topo_debug() to print the
details about each enabled CPU; and cpu_topo_dump_summary() to print only the
summary.
In the next commit we will modify cpu_topo_dump_summary() to write into the local
trash buffer so it can be easily called from debug_parse_cli_show_dev().
Exit with an error if multiple output filters (-ic, -srv, -st, -tc, -u*, etc.)
are used at the same time.
halog is designed to process and display output for only one filter at a time.
Using multiple filters simultaneously can cause a crash because the program is
not designed to manage multiple, separate result sets (e.g., one for
IP counts, another for URLs).
Supporting simultaneous filters would require a redesign to collect entries for
each filter in separate ebtree. This would negatively impact performance and is
not requested for the moment. This patch prevents the crash by checking filter
combinations just after the command line parsing.
This issue was reported in GitHub issue #3031.
This should be backported in all stable versions.
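A minimal sketch of such a guard (hypothetical option handling, not the exact
halog parsing code), run right after command line parsing:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int nb_filters = 0;

        for (int i = 1; i < argc; i++) {
            /* count the mutually exclusive output filters */
            if (!strcmp(argv[i], "-ic") || !strcmp(argv[i], "-srv") ||
                !strcmp(argv[i], "-st") || !strcmp(argv[i], "-u"))
                nb_filters++;
        }
        if (nb_filters > 1) {
            fprintf(stderr, "only one output filter may be used at a time\n");
            exit(1);
        }
        return 0;
    }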
The "connection close state" TX buffer is used to build the datagram with
basically a CONNECTION_CLOSE frame to notify the peer about the connection
closure. It allows the quic_conn memory release and its replacement by a lighter
quic_cc_conn struct.
For the QUIC backend, there is a dedicated pool to build such datagrams from
bigger TX buffers. But in quic_conn_release(), the pool dedicated to the
QUIC frontends was used to release the QUIC backend TX buffers.
This patch simply adds a test about the target of the connection to release
the "connection close state" TX buffers from the correct pool.
No backport needed.
It's super difficult to find the rules that operate idle conns depending
on their idle/safe/avail/private status. Some are in lists, others not.
Some are in trees, others not. Some have a flag set, others not. This
documents the rules before the definitions in connection-t.h. It could
even be backported to help during backport sessions.
Replace all calls to qc_is_listener() (resp. !qc_is_listener()) by calls to
objt_listener() (resp. objt_server()).
Remove the qc_is_listener() implementation and QUIC_FL_CONN_LISTENER, the flag it
relied on.
When an appctx is initialized, there is a BUG_ON() to be sure the appctx is
really initialized on the right thread to avoid bugs on the thread
affinity. However, it is possible to not choose the thread when the appctx
is created and let it start on any thread. In that case, the thread
affinity is set when the appctx is initialized. So, we must take care not to
trigger the BUG_ON() in that case.
For now, we never hit the bug because the thread affinity is always set
during the appctx creation.
This patch must be backported as far as 2.8.
The bug is a listener-only one, and only occurred on FreeBSD.
The FreeBSD issue has been reported here:
https://forums.freebsd.org/threads/quic-http-3-with-haproxy.98443/
where QUIC traces could reveal that sendmsg() calls lead to EINVAL
syscall errnos.
A similar issue could be reproduced from a FreeBSD 14-2 VM
with reg-tests/quic/retry.vtc as reg test.
As noted by Olivier, this issue could be fixed within the VM binding
the listener socket to INADDR_ANY.
That said, the symptoms are not exactly the same as the ones reported by the user.
What could be observed from such a VM is that if the first recvmsg() call
returns the datagram destination address, and if the listener
listening address is bound to a specific address, the calls to
sendmsg() fail because of the IP_SENDSRCADDR ip option value
set by cmsg_set_saddr(). According to the ip(4) FreeBSD manual,
such an IP option must not be used if the listening socket is
bound to a specific address. It is to be noted that in such a VM
the first call to recvmsg() of the first connection does not return the datagram
destination address. This leads the first quic_conn to be initialized without
a ->local_addr value. This is the value which is used by the IP_SENDSRCADDR
ip option. In this case, the sendmsg() calls (without IP_SENDSRCADDR)
never fail. The issue appears with the second connection.
This patch replaces the conditions to use IP_SENDSRCADDR with a call to
qc_may_use_saddr(). The latter also checks the listener listening address
against INADDR_ANY to decide whether the source address may be used.
It is generalized to all the OSes. Indeed, there is no reason to set the source
address when the listener is bound to a specific address.
Must be backported as far as 2.8.
On the backend side, the H3 layer is responsible for decoding an HTTP/3 response
into an HTX message. Multiple responses may be received on a single
stream with interim status codes prior to the final one.
h3_resp_headers_to_htx() is the function used solely on backend side
responsible for H3 response to HTX transcoding. This patch extends it to
be able to properly support interim responses. When such a response is
received, the new flag H3_SF_RECV_INTERIM is set. This is converted to
QMUX qcs flag QC_SF_EOI_SUSPENDED.
The objective of this latter flag is to prevent stream EOI to be
reported during stream rcv_buf callback, even if HTX message contains
EOM and is empty. QC_SF_EOI_SUSPENDED will be cleared when the final
response is finally converted, which unblocks stream EOI notification for
the next rcv_buf invocations. Note however that HTX EOM is untouched: it is
always set for both interim and final response reception.
As a minor adjustment, HTX_SL_F_BODYLESS is always set for interim
responses.
Contrary to frontend interim response handling, a flag is necessary on
QMUX layer. This is because H3 to HTX transcoding and rcv_buf callback
are two distinct operations, called under different context (MUX vs
stream tasklet).
Also note that H3 layer has two distinct flags for interim response
handling, one only used as a server (FE side) and the other as a client
(BE side). It was preferred to use two distinct flags, which is
considered less error-prone, contrary to a single unified flag which
would require always checking the proxy side to ensure it is relevant or
not.
No need to backport.
On the frontend side, the HTTP/3 layer is responsible for transcoding an HTX
response message into an HTTP/3 HEADERS frame. This operation is handled
via h3_resp_headers_send().
Prior to this patch, if HTX EOM was encountered in the HTX message after
response transcoding, <fin> was reported to the QMUX layer. This will in
turn cause FIN stream bit to be set when the response is emitted.
However, this is not correct as a single HTX response can be constituted
of several interim messages, each delimited by an EOM block.
Most of the time, this bug will cause the client to close the connection
as it is invalid to receive an interim response with FIN bit set.
Fix this by now properly differentiating interim and final responses.
During interim response transcoding, the new flag H3_SF_SENT_INTERIM
will be set, which will prevent <fin> from being reported. Thus, <fin> will
only be notified for the final response.
This must be backported up to 2.6. Note that it relies on the previous
patch which also must be taken.
Previous patches have fixed interim response encoding via
h3_resp_headers_send(). However, it is still necessary to adjust the h3
layer state-machine so that several successive HTTP responses are
accepted for a single stream.
Prior to this, QMUX was responsible for deciding that the final HTX message
was encoded so that the STREAM FIN can be emitted. However, with interim
responses, the MUX is in fact unable to properly determine this. As such,
this is the responsibility of the application protocol layer. To reflect
this, the app_ops snd_buf callback is modified so that a new output argument
<fin> is added to it.
Note that for now this commit does not bring any functional change.
However, it will be necessary for the following patch. As such, it
should be backported prior to it to every version as necessary.
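A sketch of what the API change looks like, with assumed types and names (the
exact HAProxy declaration may differ): the callback now fills an output <fin>
flag instead of letting the MUX guess when the last HTX block was sent.

    #include <stddef.h>

    struct qcs;      /* QUIC mux stream, opaque here */
    struct buffer;   /* data buffer, opaque here */

    struct app_ops_sketch {
        /* returns the number of bytes consumed from <buf>; sets *fin to 1
         * only when the encoded message is the final one for this stream */
        size_t (*snd_buf)(struct qcs *qcs, struct buffer *buf,
                          size_t count, char *fin);
    };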
On the frontend side, the H3 layer transcodes the HTX status code into an
HTTP/3 HEADERS frame. This is done by calling qpack_encode_int_status().
Prior to this patch, the latter function was also responsible for rejecting
an invalid value, which guaranteed that only valid codes are encoded
(values between 100 and 999). However, this is not practical as it is
impossible to differentiate between an invalid code error and buffer
room exhaustion.
Change this so that now the HTTP/3 layer first ensures that the HTX code is
valid. The stream is closed with H3_INTERNAL_ERROR if invalid value is
present. Thus, qpack_encode_int_status() will only report an error due
to buffer room exhaustion. If a small buffer is used, a standard buffer
will be reallocated which should be sufficient to encode the response.
The impact of this bug is minimal. Its main benefit is code clarity,
while also removing an unnecessary realloc when confronted with an
invalid HTTP code.
This should be backported at least up to 3.1. Prior to it, smallbuf
mechanism isn't present, hence the impact of this patch is less
important. However, it may still be backported to older versions, which
should facilitate picking patches for HTTP 1xx interim response support.
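The validity check itself is simple; a sketch of the rule now enforced by the
H3 layer before calling the QPACK encoder (illustrative, not the exact
HAProxy code):

    #include <stdbool.h>

    /* an HTTP status code can only be encoded if it fits in three digits */
    static bool h3_status_valid(int status)
    {
        return status >= 100 && status <= 999;
    }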
The previous commit fixes the encoding of several successive HTTP response
messages when interim status codes are first reported. However,
h3_resp_headers_send() was still unable to interrupt encoding if the output
buffer room was not sufficient. This case may be likely because small
buffers are used for headers encoding.
This commit fixes this situation. If output buffer is not empty prior to
response encoding, this means that a previous interim response message
was already encoded before. In this case, and if the remaining space is not
sufficient, use the buffer release mechanism: this allows restarting
response encoding with a newer buffer. This process has already been
used for DATA and trailers encoding.
This must be backported up to 2.6. However, note that buffer release
mechanism is not present for version 2.8 and lower. In this case, qcs
flag QC_SF_BLK_MROOM should be enough as a replacement.
An HTTP response may contain several interim response messages (1xx
status) prior to a final response message (all other status codes). This may
cause issues with h3_resp_headers_send(), called for response encoding,
which assumes that it is only called once per stream, most notably
during output buffer handling.
This commit fixes output buffer handling when h3_resp_headers_send() is
called multiple times due to an interim response. Prior to it, the interim
response was overwritten with the newer response message. Most of the time,
this resulted in an error for the client due to a QPACK decoding failure.
This is now fixed so that each response is encoded one after the other.
Note that if the encoding of several responses is bigger than the output buffer,
an error is reported. This can definitely occur as small buffers are
used during header encoding. This situation will be improved by the next
patch.
This must be backported up to 2.6.
The goal is to help figure out the OS version (kernel and userland), any
virtualization/containers, and the haproxy version and build features.
Sometimes even reporters themselves can be mistaken about the running
version or environment. Also, printing this at the top helps draw a
visual delimitation between warnings and the panic. Now we get something
like this:
PANIC! Thread 1 is about to kill the process.
HAProxy info:
version: 3.3-dev3-c863c0-18
features: +51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY (...)
Operating system info:
virtual machine: no
container: no
kernel: Linux 6.1.131 #1 SMP PREEMPT_DYNAMIC Fri Mar 14 01:04:55 CET 2025 x86_64
userland: Slackware 15.0 x86_64
* Thread 1 : id=0x7f615a8775c0 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/1 stuck=0 prof=0 harmless=0 isolated=0
cpu_ns: poll=1835010197 now=1835066102 diff=55905
(...)
For listen/frontend/backend, we now want to be able to clean up the
default-server directive that's no longer used past the end of the
section. For this we register a post-section function and perform the
cleanup there.
The default-server entry used to always be allocated. Now we'll postpone
its allocation for the first time we need it, i.e. during a "default-server"
directive, or when inheriting a defaults section which has one. The memory
savings are significant, on a large configuration with 100k backends and
no default-server directive, the memory usage dropped from 800MB RSS to
420MB (380 MB saved). It should be possible to also address configs using
default-server by releasing this entry when leaving the proxy section,
which is not done yet.
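A minimal sketch of the lazy-allocation idea with hypothetical names (the
real struct server is much richer): the entry is only allocated on first use
and can be released once the section has been parsed.

    #include <stdlib.h>

    struct server_sketch { char pad[3800]; };   /* stand-in for the ~3.8kB struct */

    struct proxy_sketch {
        struct server_sketch *defsrv;           /* NULL until actually needed */
    };

    /* called when a "default-server" directive (or an inherited one) is seen */
    static struct server_sketch *proxy_get_defsrv(struct proxy_sketch *px)
    {
        if (!px->defsrv)
            px->defsrv = calloc(1, sizeof(*px->defsrv));
        return px->defsrv;
    }

    /* post-section cleanup: the entry is not needed at run time */
    static void proxy_release_defsrv(struct proxy_sketch *px)
    {
        free(px->defsrv);
        px->defsrv = NULL;
    }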
Now we only copy the default server's settings if such a default server
exists, otherwise we only initialize it. At the moment it always exists.
The change is mostly performed in srv_settings_cpy() since that's where
each caller passes through, and there's no point duplicating that test
everywhere.
The server struct has gone huge over time (~3.8kB), and having a copy
of it in the defsrv section of the struct proxy costs a lot of RAM,
that is not needed anymore at run time.
This patch replaces this struct with a dynamically allocated one. The
field is allocated and initialized during alloc_new_proxy() and is
freed when the proxy is destroyed for now. But the goal will be to
support freeing it after parsing the section.
The test in srv_ssl_settings_cpy() comparing src to the server's proxy's
default server does work but it's a subtle trap. Indeed, no check is made
on srv->proxy to be valid, and this only works because the compiler is
comparing pointer offsets. During the boot, it's common to have NULL here
in srv->proxy and of course in this case srv does not match that value
which is NULL plus epsilon. But when trying to turn defsrv to a dynamic
pointer instead, then the compiler is forced to dereference this NULL
srv->proxy and dies during init.
Let's always add the null check for srv->proxy before the test to avoid
this situation.
No backport is needed since the problem cannot happen yet.
At a few places we're seeing some open-coding of the same function, likely
because it looks overkill for what it's supposed to do, due to extraneous
tests that are not needed (e.g. check of the backend's PR_CAP_BE etc).
Let's just remove all these superfluous tests and inline it so that it
feels more suitable for use everywhere it's needed.
The server lookup in sticking_rule_find_target() uses an open-coded tree
search while we have a function for this, server_find_by_id(). In addition,
due to the way it's coded, the stick-table lock also covers the server
lookup by accident instead of being released earlier. This is not a real
problem though since such feature is rarely used nowadays.
Let's clean all this stuff by first retrieving the ID under the lock and
then looking up the corresponding server.
The code used to detect proxy id conflicts uses an open-coded lookup
in the ID tree which is not necessary since we already have functions
for this. Let's switch to that instead.
This function doesn't just look at the name but also the ID when the
argument starts with a '#'. So the name is not correct and explains
why this function is not always used when the name only is needed,
and why the list-based findserver() is used instead. So let's just
call the function "server_find()", and rename its generation-id based
cousin "server_find_unique()".
Let's just use the tree-based lookup instead of walking through the list.
This function is used to find duplicates in "track" statements and a few
such places, so it's important not to waste too much time on large setups.
findserver() used to check for duplicate server names. These are no
longer accepted in 3.3 so let's get rid of that test and simplify the
code. Note that the function still only uses the list instead of the
tree.
Released version 3.3-dev3 with the following main changes :
- BUG/MINOR: quic-be: Wrong retry_source_connection_id check
- MEDIUM: sink: change the sink mode type to PR_MODE_SYSLOG
- MEDIUM: server: move _srv_check_proxy_mode() checks from server init to finalize
- MINOR: server: move send-proxy* incompatibility check in _srv_check_proxy_mode()
- MINOR: mailers: warn if mailers are configured but not actually used
- BUG/MEDIUM: counters/server: fix server and proxy last_change mixup
- MEDIUM: server: add and use a separate last_change variable for internal use
- MEDIUM: proxy: add and use a separate last_change variable for internal use
- MINOR: counters: rename last_change counter to last_state_change
- MINOR: ssl: check TLS1.3 ciphersuites again in clienthello with recent AWS-LC
- BUG/MEDIUM: hlua: Forbid any L6/L7 sample fetche functions from lua services
- BUG/MEDIUM: mux-h2: Properly handle connection error during preface sending
- BUG/MINOR: jwt: Copy input and parameters in dedicated buffers in jwt_verify converter
- DOC: Fix 'jwt_verify' converter doc
- MINOR: jwt: Rename pkey to pubkey in jwt_cert_tree_entry struct
- MINOR: jwt: Remove unused parameter in convert_ecdsa_sig
- MAJOR: jwt: Allow certificate instead of public key in jwt_verify converter
- MINOR: ssl: Allow 'commit ssl cert' with no privkey
- MINOR: ssl: Prevent delete on certificate used by jwt_verify
- REGTESTS: jwt: Add test with actual certificate passed to jwt_verify
- REGTESTS: jwt: Test update of certificate used in jwt_verify
- DOC: 'jwt_verify' converter now supports certificates
- REGTESTS: restrict execution to a single thread group
- MINOR: ssl: Introduce new smp_client_hello_parse() function
- MEDIUM: stats: add persistent state to typed output format
- BUG/MINOR: httpclient: wrongly named httpproxy flag
- MINOR: ssl/ocsp: stop using the flags from the httpclient CLI
- MEDIUM: httpclient: split the CLI from the actual httpclient API
- MEDIUM: httpclient: implement a way to use directly htx data
- MINOR: httpclient/cli: add --htx option
- BUILD: dev/phash: remove the accidentally committed a.out file
- BUG/MINOR: ssl: crash in ssl_sock_io_cb() with SSL traces and idle connections
- BUILD/MEDIUM: deviceatlas: fix when installed in custom locations.
- DOC: deviceatlas build clarifications
- BUG/MINOR: ssl/ocsp: fix definition discrepancies with ocsp_update_init()
- MINOR: proto-tcp: Add support for TCP MD5 signature for listeners and servers
- BUILD: cfgparse-tcp: Add _GNU_SOURCE for TCP_MD5SIG_MAXKEYLEN
- BUG/MINOR: proto-tcp: Take care to initialized tcp_md5sig structure
- BUG/MINOR: http-act: Fix parsing of the expression argument for pause action
- MEDIUM: httpclient: add a Content-Length when the payload is known
- CLEANUP: ssl: Rename ssl_trace-t.h to ssl_trace.h
- MINOR: pattern: add a counter of added/freed patterns
- CI: set DEBUG_STRICT=2 for coverity scan
- CI: enable USE_QUIC=1 for OpenSSL versions >= 3.5.0
- CI: github: add an OpenSSL 3.5.0 job
- CI: github: update the stable CI to ubuntu-24.04
- BUG/MEDIUM: quic: SSL/TCP handshake failures with OpenSSL 3.5
- CI: github: update to OpenSSL 3.5.1
- BUG/MINOR: quic: Missing TLS 1.3 QUIC cipher suites and groups inits (OpenSSL 3.5 QUIC API)
- BUG/MINOR: quic-be: Malformed coalesced Initial packets
- MINOR: quic: Prevent QUIC backend use with the OpenSSL QUIC compatibility module (USE_OPENSS_COMPAT)
- MINOR: reg-tests: first QUIC+H3 reg tests (QUIC address validation)
- MINOR: quic-be: Set the backend alpn if not set by conf
- MINOR: quic-be: TLS version restriction to 1.3
- MINOR: cfgparse: enforce QUIC MUX compat on server line
- MINOR: server: support QUIC for dynamic servers
- CI: github: skip a ssl library version when latest is already in the list
- MEDIUM: resolvers: switch dns-accept-family to "auto" by default
- BUG/MINOR: resolvers: don't lower the case of binary DNS format
- MINOR: resolvers: do not duplicate the hostname_dn field
- MINOR: proto-tcp: Register a feature to report TCP MD5 signature support
- BUG/MINOR: listener: really assign distinct IDs to shards
- MINOR: quic: Prevent QUIC build with OpenSSL 3.5 new QUIC API version < 3.5.1
- BUG/MEDIUM: quic: Crash after QUIC server callbacks restoration (OpenSSL 3.5)
- REGTESTS: use two haproxy instances to distinguish the QUIC traces
- BUG/MEDIUM: http-client: Don't wake http-client applet if nothing was xferred
- BUG/MEDIUM: http-client: Properly inc input data when HTX blocks are xferred
- BUG/MEDIUM: http-client: Ask for more room when request data cannot be xferred
- BUG/MEDIUM: http-client: Test HTX_FL_EOM flag before commiting the HTX buffer
- BUG/MINOR: http-client: Ignore 1XX interim responses in non-HTX mode
- BUG/MINOR: http-client: Reject any 101-switching-protocols response
- BUG/MEDIUM: http-client: Drain the request if an early response is received
- BUG/MEDIUM: http-client: Notify applet has more data to deliver until the EOM
- BUG/MINOR: h3: fix https scheme request encoding for BE side
- MINOR: h1-htx: Add function to format an HTX message in its H1 representation
- BUG/MINOR: mux-h1: Use configured error files if possible for early H1 errors
- BUG/MINOR: h1-htx: Don't forget to init flags in h1_format_htx_msg function
- CLEANUP: assorted typo fixes in the code, commits and doc
- BUILD: adjust scripts/build-ssl.sh to modern CMake system of QuicTLS
- MINOR: debug: add distro name and version in postmortem
Since 2012, systemd-compliant distributions contain an
/etc/os-release file. This file has a standardized format, see details at
https://www.freedesktop.org/software/systemd/man/latest/os-release.html.
Let's read it in feed_post_mortem_linux() to gather more info about the
distribution.
(cherry picked from commit f1594c41368baf8f60737b229e4359fa7e1289a9)
Signed-off-by: Willy Tarreau <w@1wt.eu>
The regression was introduced by commit 187ae28 ("MINOR: h1-htx: Add
function to format an HTX message in its H1 representation"). We must be
sure the flags variable is initialized in the h1_format_htx_msg() function.
This patch must be backported with the commit above.
The H1 multiplexer is able to produce some errors on its own to report early
errors, before the stream is created. In that case, the error files of the
proxy were tested to detect empty files (or /dev/null) but they were not
used to produce the error itself.
But the documentation states that configured error files are used in all
cases. And in fact, it is not really a problem to use these files. We must
just format a full HTX message. Thanks to the previous patch, it is now
possible.
This patch should fix the issue #3032. It should be backported to 3.2. For
older versions, it must be discussed but it should be quite easy to do.
The function h1_format_htx_msg() can now be used to convert a valid HTX
message into its H1 representation. No validity test is performed, the HTX
message must be valid. Only trailers are silently ignored if the message is
not chunked. In addition, the destination buffer must be empty. 1XX interim
responses should be supported. But again, there are no validity tests.
An HTTP/3 request must contain the :scheme pseudo-header. Currently, only
"https" value is expected due to QUIC transport layer in use.
However, https value is incorrectly encoded due to a QPACK index value
mismatch in qpack_encode_scheme(). Fix it to ensure that scheme is now
properly set for HTTP/3 requests on the backend side.
No need to backport this.
When we leave the I/O handler with an unfinished request, we must report the
applet has more data to deliver. Otherwise, when the channel request buffer
is emptied, the http-client applet is not always woken up to forward the
remaining request data.
This issue was probably revealed by commit "BUG/MEDIUM: http-client: Don't
wake http-client applet if nothing was xferred". It is only an issue with
large POSTs, when the payload is streamed.
This patch must be backported as far as 2.6 with the commit above. But on
older versions, the applet API may differ. So be careful.
When a large request is sent, it is possible to have a response before the
end of the request. It is valid from an HTTP perspective but it is an issue
with the current design of the http-client. Indeed, the request and the
response are handled sequentially. So the response will be blocked, waiting
for the end of the request. Most of the time, it is not an issue, except when
the request transfer is blocked. In that case, the applet is blocked.
With the current API, it is not possible to handle an early response and
continue the request transfer. So, this case cannot be handled. In that case,
it seems reasonable to drain the request if a response is received. This
way, the request transfer, from the caller's point of view, is never blocked
and the response can be properly processed.
To do so, the action flag HTTPCLIENT_FA_DRAIN_REQ is added to the
http-client. When it is set, the request payload is just dropped. In that
case, we take care to not report the end of input to properly report the
request was truncated, especially in logs.
It is only an issue with large POSTs, when the payload is streamed.
This patch must be backported as far as 2.6.
Protocol upgrades are not supported by the http-client. So an error is reported
if a 101-switching-protocols response is received. Of course, it is unexpected
because the API is not designed to support upgrades. But it is better to
properly handle this case.
This patch could be backported as far as 2.6. It depends on the commit
"BUG/MINOR: http-client: Ignore 1XX interim responses in non-HTX mode".
When the response is re-formatted in raw message, the 1XX interim responses
must be skipped. Otherwise, information of the first interim response will
be saved (status line and headers) and those from the final response will be
dropped.
Note that for now, in HTX-mode, the interim messages are removed.
This patch must be backported as far as 2.6.
When the htx_to_buf() function is called, if the HTX message is empty, the
buffer is reset. So the HTX flags must not be tested afterwards because the
info may be lost.
So now, we take care to test HTX_FL_EOM flag before calling htx_to_buf().
This patch must be backported as far as 2.8.
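A sketch of the fixed ordering using mock types (not the real HTX API): the
EOM flag is read before the conversion, which may reset an empty message and
thus lose the flags.

    #include <stdbool.h>

    #define HTX_FL_EOM 0x01

    struct htx_sketch {
        unsigned int flags;
        unsigned int data;   /* amount of payload still stored */
    };

    static bool flush_and_check_eom(struct htx_sketch *htx)
    {
        bool eom = (htx->flags & HTX_FL_EOM);   /* read before the conversion */

        if (!htx->data)
            htx->flags = 0;                     /* converting an empty message resets it */
        return eom;
    }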
When the request payload cannot be xferred to the channel because its buffer
is full, we must request for more room by calling sc_need_room(). It is
important to be sure the httpclient applet will not be woken up in loop to
push more data while it is not possible.
It is only an issue with large POSTs, when the payload is streamed.
This patch must be backported as far as 2.6. Note that on 2.6,
sc_need_room() only takes one argument.
When HTX blocks from the requests are transferred into the channel buffer,
the return value of htx_xfer_blks() function must not be used to increment
the channel input value because meta data are counted there while they are
not part of input data. Because of this bug, it is possible to forward more
data than those present in the channel buffer.
Instead, we look at the input data before and after the transfer and the
difference is added.
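A minimal sketch of the accounting change (simplified; variable names are
illustrative, not the actual http-client code):

    /* Count only real payload bytes by comparing the destination HTX data
     * amount around the transfer, instead of trusting the htx_xfer_blks()
     * return value which also includes HTX meta-data. */
    uint32_t before = chn_htx->data;
    htx_xfer_blks(chn_htx, hc_htx, count, HTX_BLK_UNUSED);
    channel_add_input(chn, chn_htx->data - before);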
It is only an issue with large POSTs, when the payload is streamed.
This patch must be backported as far as 2.6.
When data are transferred to or from the http-client, the applet is
systematically woken up, even when no data are transferred. This could lead
to needless wakeups. When called from a lua script, if data are blocked
for a while, this leads to a wakeup ping-pong loop where the http-client
applet is woken up by the lua script which wakes back the script.
To fix the issue, in httpclient_req_xfer() and httpclient_res_xfer()
functions, we now take care to not wake the http-client applet up when no
data are transferred.
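A minimal sketch of the guard (names are illustrative):

    /* Only wake the http-client applet up when bytes actually moved. */
    if (transferred)
        appctx_wakeup(hc->appctx);
    return transferred;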
This patch must be backported as far as 2.6.
The aim of this patch is to identify the QUIC traces between the QUIC frontend
and backend parts. Two haproxy instances are created. The c(1|2) http clients
connect to ha1 with TCP frontends and QUIC backends. ha2 embeds two QUIC listeners
with s1 as TCP backend. When the traces are activated, they are dumped to stderr.
Helpfully, they are prefixed by the haproxy instance name (h1 or h2). This is very
useful to identify the QUIC instances.
Revert this patch which is no longer useful since OpenSSL 3.5.1 to remove the
QUIC server callback restoration after SSL context switch:
MINOR: quic: OpenSSL 3.5 internal QUIC custom extension for transport parameters reset
It was required for 3.5.0. That said, there was no CI for OpenSSL 3.5 at the date
of this commit. The CI recently revealed that the QUIC server side could crash
during QUIC reg tests just after having restored the callbacks as implemented by
the commit above.
Also revert this commit which is no longer useful because it arrived with the commit
above:
BUG/MEDIUM: quic: SSL/TCP handshake failures with OpenSSL 3.
Must be backported to 3.2.
The QUIC listener part was impacted by the 3.5.0 OpenSSL new QUIC API with several
issues which have been fixed by 3.5.1.
Add a #error to prevent such OpenSSL 3.5 new QUIC API use with versions below 3.5.1.
Must be backported to 3.2.
A fix was made in 3.0 for the case where sharded listeners were using
a same ID with commit 0db8b6034d ("BUG/MINOR: listener: always assign
distinct IDs to shards"). However, the fix is incorrect. By checking the
ID of the temporary node instead of the kept one in bind_complete_thread_setup(),
it ends up never inserting the used nodes at this point, thus not reserving
them. The side effect is that assigning too close IDs to subsequent
listeners results in the same ID still being assigned twice since not
reserved. Example:
global
    nbthread 20
frontend foo
    bind :8000 shards by-thread id 10
    bind :8010 shards by-thread id 20
The first one will start a series from 10 to 29 and the second one a
series from 20 to 39. But 20 not being inserted when creating the shards,
it will remain available for the post-parsing phase that assigns all
unassigned IDs by filling holes, and two listeners will have ID 20.
By checking the correct node, the problem disappears. The patch above
was marked for backporting to 2.6, so this fix should be backported that
far as well.
"HAVE_TCP_MD5SIG" feature is now registered if TCP MD5 signature is
supported. This will help the feature detection in the reg-test script
dedicated to this feature.
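A minimal sketch of the registration (illustrative; the exact location and
condition in the code may differ):

    /* Advertise the feature when the platform exposes TCP_MD5SIG, so the
     * dedicated reg-test can require it. */
    #ifdef TCP_MD5SIG
        hap_register_feature("HAVE_TCP_MD5SIG");
    #endif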
The hostdn.key field in the server contains a pure copy of the hostname_dn
since commit 3406766d57 ("MEDIUM: resolvers: add a ref between servers and
srv request or used SRV record") which wanted to lowercase it. Since it's
not necessary, let's drop this useless copy. In addition, the return from
strdup() was not tested, so it could theoretically crash the process under
heavy memory contention.
The server's "hostname_dn" is in Domain Name format, not a pure string, as
converted by resolv_str_to_dn_label(). It is made of lower-case string
components delimited by binary lengths, e.g. <0x03>www<0x07>haproxy<0x03>org.
As such it must not be lowercased again in srv_state_srv_update(), because
1) it's useless on the name components since already done, and 2) because
it would replace component lengths 97 and above by 32-char shorter ones.
Granted, not many domain names have that large components so the risk is
very low but the operation is always wrong anyway. This was brought in
2.5 by commit 3406766d57 ("MEDIUM: resolvers: add a ref between servers
and srv request or used SRV record").
In the same vein, let's fix the confusing strcasecmp() that are applied
to this binary format, and use memcmp() instead. Here there's basically
no risk to incorrectly match the wrong record, but that test alone is
confusing enough to provoke the existence of the bug above.
Finally let's update the comment for that field to mention that it's
in this format and already lower cased.
Better not backport this, the risk of facing this bug is almost zero, and
every time we touch such files something breaks for bad reasons.
Skip the job for the "latest" libssl version when this version is the same
as one already in the list.
This avoids having 2 jobs for OpenSSL 3.5.1 since no new dev version is
available for now and 3.5.1 is already in the list.
To properly support QUIC for dynamic servers, it is required to extend the
"add server" CLI handler :
* ensure conformity between server address and proto
* automatically set proto to QUIC if not specified
* prepare_srv callback must be called to initialize required SSL context
Prior to this patch, crashes may occur when trying to use QUIC with
dynamic servers.
Also, destroy_srv callback must be called when a dynamic server is
deallocated. This ensures that there is no memory leak due to SSL
context.
No need to backport.
Add postparsing checks to control server line conformity regarding QUIC
both on the server address and the MUX protocol. An error is reported in
the following case :
* proto quic is explicitly specified but server address does not
specify quic4/quic6 prefix
* another proto is explicitly specified but server address uses
quic4/quic6 prefix
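As an illustration (addresses and names below are only examples), the check
accepts or rejects server lines such as:

    server srv1 quic4@192.0.2.1:443 ssl proto quic   # accepted: consistent
    server srv2 192.0.2.1:443 ssl proto quic         # rejected: no quic4/quic6 prefix
    server srv3 quic4@192.0.2.1:443 ssl proto h2     # rejected: proto mismatch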
This patch skips the TLS version settings. They have the side effect of adding
all the TLS version extensions to the ClientHello message (TLS 1.0 to TLS 1.3).
QUIC supports only TLS 1.3.
First simple VTC file for QUIC reg tests. Two listeners are configured, one with
Retry enabled and the other without. Two clients simply try to connect to these
listeners to make a basic H3 request.
Make the server line parsing fail when a QUIC backend is configured if haproxy
is built to use the OpenSSL stack compatibility module. The latter does not
support the QUIC client part.
This bug fix completes this patch which was not sufficient:
MINOR: quic-be: Allow sending 1200 bytes Initial datagrams
This patch could not allow the build of well formed Initial packets coalesced to
other (Handshake) packets. Indeed, the <padding> parameter passed to qc_build_pkt()
is deduced from a first <padding> value and must be set to 1 for
the last encryption level. As a client, the last encryption level is always
the Handshake encryption level. But <padding> was always set to 1 for a QUIC
client, leading the first Initial packet to be malformed because it was considered
as the second one in the same datagram.
So, this patch sets the <padding> value passed to qc_build_pkt() to 1 only when there
is no last encryption level at all, to allow the build of Initial only packets
(not coalesced), or when there are frames to send (coalesced packets).
No need to backport.
This bug impacts both QUIC backends and frontends with OpenSSL 3.5 as QUIC API.
The connections to a haproxy QUIC listener from a haproxy QUIC backend could not
work at all without HelloRetryRequest TLS messages emitted by the backend
asking the QUIC client to restart the handshake followed by TLS alerts:
conn. @(nil) OpenSSL error[0xa000098] read_state_machine: excessive message size
Furthermore, the Initial CRYPTO data sent by the client were big (about two 1252-byte
packets) (ClientHello TLS message). After analyzing the packets, a key_share extension
with <unknown> as value was long (more than 1KB). This extension is related to the
groups but does not belong to the groups supported by QUIC.
That said such connections could work with ngtcp2 as backend built against the same
OSSL TLS stack API but with a HelloRetryRequest.
ngtcp2 always sets the QUIC default cipher suites and groups, for all the stacks it
supports, as implemented by this patch.
So this patch configures both QUIC backend and frontend cipher suites and groups
calling SSL_CTX_set_ciphersuites() and SSL_CTX_set1_groups_list() with the correct
arguments, except for SSL_CTX_set1_groups_list() which fails with QUIC TLS for
an unknown reason at this time.
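As an illustration of the kind of setup described above (the exact strings used
by haproxy may differ; the values below are the usual TLS 1.3/QUIC defaults):

    /* Restrict the QUIC SSL contexts to the TLS 1.3 ciphersuites and
     * groups commonly used by QUIC stacks. */
    SSL_CTX_set_ciphersuites(ctx,
        "TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:"
        "TLS_CHACHA20_POLY1305_SHA256");
    SSL_CTX_set1_groups_list(ctx, "X25519:P-256:P-384:P-521");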
The call to SSL_CTX_set_options() is useless from ssl_quic_initial_ctx() for the QUIC
clients. We rely on ssl_sock_prepare_srv_ssl_ctx() to set them from now on.
This patch is effective for all the supported stacks without impact for AWS-LC
and QUIC TLS, and fixes the connections for haproxy QUIC frontends and backends
when built against the OpenSSL 3.5 QUIC API.
A new define HAVE_OPENSSL_QUICTLS has been added to openssl-compat.h to distinguish
the QUIC TLS stack.
Must be backported to 3.2.
This bug arrived with this commit:
MINOR: quic: OpenSSL 3.5 internal QUIC custom extension for transport parameters reset
To make QUIC connection succeed with OpenSSL 3.5 API, a call to quic_ssl_set_tls_cbs()
was needed from several callbacks which call SSL_set_SSL_CTX(). This has the side
effect of setting the QUIC callbacks used by the OpenSSL 3.5 API.
But quic_ssl_set_tls_cbs() was also called for TCP sessions, leading the SSL stack
to run QUIC code if QUIC support is enabled.
To fix this, simply ignore the TCP connections by inspecting the
<ssl_qc_app_data_index> index value, which is NULL for such connections.
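A minimal sketch of the check (simplified from the idea above):

    /* A connection without a quic_conn attached to the SSL ex_data index
     * is a TCP/TLS one: the QUIC-specific callback must do nothing for it. */
    struct quic_conn *qc = SSL_get_ex_data(ssl, ssl_qc_app_data_index);
    if (!qc)
        return 1;   /* TCP connection: nothing to do */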
Must be backported to 3.2.
OpenSSL 3.5.0 introduced experimental support for QUIC. This change enables the
use_quic option when a compatible version of OpenSSL is detected, allowing
QUIC-based functionality to be leveraged where applicable. The feature remains
disabled for earlier versions to ensure compatibility.
Patterns are allocated when loading maps/acls from a file or dynamically
via the CLI, and are released only from the CLI (e.g. "clear map xxx").
These ones do not use pools and are much harder to monitor, e.g. in case
a script adds many and forgets to clear them, etc.
Let's add a new pair of metrics "PatternsAdded" and "PatternsFreed" that
will report the number of added and freed patterns respectively. This
allows both to be easily graphed. The difference between the two normally
represents the number of allocated patterns. If Added grows without
Freed following, it can indicate a faulty script that doesn't perform
the needed cleanup. The metrics are also made available to Prometheus
as patterns_added_total and patterns_freed_total respectively.
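For example (illustrative output; the socket path and values are placeholders),
the counters can be consulted from the stats socket:

    $ echo "show info" | socat stdio /tmp/haproxy.sock | grep Patterns
    PatternsAdded: 12450
    PatternsFreed: 12450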
This introduces a change of behavior in the httpclient API. When
generating a request with a payload buffer, the size of the buffer
payload is known and does not need to be streamed in chunks.
This patch forces the payload buffer to be sent using a Content-Length header
in the request; however, the behavior does not change if a callback is
still used instead of a buffer.
When the "pause" action is parsed, if an expression is used instead of a
static value, the position of the current argument after the expression
evaluation is incremented while it should not be. The sample_parse_expr()
function already takes care of it. However, it should still be incremented
when a time value is parsed.
This patch must be backported to 3.2.
When the TCP MD5 signature is enabled, on a listening socket or an outgoing
one, the tcp_md5sig structure must be initialized first.
It is a 3.3-specific issue. No backport needed.
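A minimal sketch of the underlying Linux API showing why zero-initialization
matters (fragment only; fd, addr and password are assumed to exist):

    struct tcp_md5sig md5;

    memset(&md5, 0, sizeof(md5));                 /* mandatory: no stray bytes */
    memcpy(&md5.tcpm_addr, addr, sizeof(struct sockaddr_storage));
    md5.tcpm_keylen = strlen(password);           /* at most TCP_MD5SIG_MAXKEYLEN */
    memcpy(md5.tcpm_key, password, md5.tcpm_keylen);
    setsockopt(fd, IPPROTO_TCP, TCP_MD5SIG, &md5, sizeof(md5));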
This patch adds the support for RFC2385 (Protection of BGP Sessions via
the TCP MD5 Signature Option) for the listeners and the servers. The
feature is only available on Linux. Keywords are not exposed otherwise.
By setting "tcp-md5sig <password>" option on a bind line, TCP segments of
all connections instantiated from the listening socket will be signed with a
16-byte MD5 digest. The same option can be set on a server line to protect
outgoing connections to the corresponding server.
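An illustrative configuration (names and the password are placeholders):

    frontend bgp_in
        bind :179 tcp-md5sig mysecret
        default_backend bgp_peers

    backend bgp_peers
        server peer1 192.0.2.1:179 tcp-md5sig mysecret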
The primary use case for this option is to allow BGP to protect itself
against the introduction of spoofed TCP segments into the connection
stream. But it can be useful for any very long-lived TCP connections.
A reg-test was added and it will be executed only on Linux. All other
targets are excluded.
Since patch 20718f40b6 ("MEDIUM: ssl/ckch: add filename and linenum
argument to crt-store parsing"), the definition of ocsp_update_init()
and its declaration does not share the same arguments.
Must be backported to 3.2.
We are reusing DEVICEATLAS_INC/DEVICEATLAS_LIB when the DeviceAtlas
library has been compiled and installed with the cmake and make install targets.
This works fine except when ldconfig is unaware of the path, hence adding
cflags/ldflags into the mix.
Ideally, to be backported down to the lowest stable branch.
TRACE_ENTER is crashing in ssl_sock_io_cb() in case an idle connection is
being stolen. Indeed the function could be called with a NULL context
and dereferencing it will crash.
This patch fixes the issue by initializing ctx only once it is usable,
and moving TRACE_ENTER after the initialization.
This must be backported to 3.2.
Commit 41f28b3c53 ("DEV: phash: Update 414 and 431 status codes to phash")
accidentally committed a.out, resulting in build/checkout issues when
locally rebuilt. Let's drop it.
This should be backported to 3.1.
Use the new HTTPCLIENT_O_RES_HTX flag when using the CLI httpclient with
--htx.
It allows the response to be processed directly in HTX; the htx_dump()
function is then used to display a debug output.
Example:
echo "httpclient --htx GET https://haproxy.org" | socat /tmp/haproxy.sock
htx=0x79fd72a2e200(size=16336,data=139,used=6,wrap=NO,flags=0x00000010,extra=0,first=0,head=0,tail=5,tail_addr=139,head_addr=0,end_addr=0)
[0] type=HTX_BLK_RES_SL - size=31 - addr=0 HTTP/2.0 301
[1] type=HTX_BLK_HDR - size=15 - addr=31 content-length: 0
[2] type=HTX_BLK_HDR - size=32 - addr=46 location: https://www.haproxy.org/
[3] type=HTX_BLK_HDR - size=25 - addr=78 alt-svc: h3=":443"; ma=3600
[4] type=HTX_BLK_HDR - size=35 - addr=103 set-cookie: served=2:TLSv1.3+TCP:IPv4
[5] type=HTX_BLK_EOH - size=1 - addr=138 <empty>
Add an HTTPCLIENT_O_RES_HTX flag which allows the HTX data to be stored
directly in the response buffer instead of extracting the data in raw
format.
This is useful when the data need to be reused in another request.
This patch splits the httpclient code to prevent confusion between the
httpclient CLI command and the actual httpclient API.
Indeed there was a confusion between the flags used internally by the
CLI command and the actual httpclient API.
hc_cli_* functions as well as HC_C_F_* defines were moved to
httpclient_cli.c.
The ocsp-update uses the flags from the httpclient CLI, which are not
supposed to be used elsewhere since this is a state for the CLI.
This patch implements HC_OCSP flags for the ocsp-update.
The HC_F_HTTPPROXY flag was wrongly named and does not use the correct
value; indeed this flag was meant to be used for the httpclient API, not
the httpclient CLI.
This patch fixes the problem by introducing HTTPCLIENT_FO_HTTPPROXY
which must be set in hc->flags.
Also add a member 'options' in the httpclient structure, because the
member flags is reinitialized when starting.
Must be backported as far as 3.0.
Add a fourth character to the second column of the "typed output format"
to indicate whether the value results from a volatile or persistent metric
('V' or 'P' characters respectively). A persistent metric means the value
could possibly be preserved across reloads by leveraging a shared memory
between multiple co-processes. Such metrics are identified as "shared" in
the code (since they are possibly shared between multiple co-processes).
Some reg-tests were updated to take that change into account, also, some
outputs in the configuration manual were updated to reflect current
behavior.
In this patch we introduce a new helper function called `smp_client_hello_parse()` to extract
information presented in a TLS client hello handshake message. 7 sample fetches have also been
modified to use this helper function to do the common client hello parsing and use the result
to do further processing of extensions/ciphers.
Fixes: #2532
When threads are enabled and running on a machine with multiple CCX
or multiple nodes, thread groups are now enabled since 3.3-dev2, causing
load-balancing algorithms to randomly fail due to incoming connections
spreading over multiple groups and using different load balancing indexes.
Let's just force "thread-groups 1" into all configs when threads are
enabled to avoid this.
Using certificates in the jwt_verify converter allows making use of the
CLI certificate updates, which is still impossible with public keys (the
legacy behavior).
The jwt_verify converter can now take public certificates as second parameter,
either with an actual certificate path (not previously mentioned), from a
predefined crt-store or from a variable.
A ckch_store used in JWT verification might not have any ckch instances
or crt-list entries linked to it, but we don't want it to be removable via
the CLI anyway since it would make all future jwt_verify calls using
this certificate fail.
The ckch_stores might be used to store public certificates only so in
this case we won't provide private keys when updating the certificate
via the CLI.
If the ckch_store is actually used in a bind or server line an error
will still be raised if the private key is missing.
The 'jwt_verify' converter could only be passed public keys as second
parameter instead of full-on public certificates. This patch allows
proper certificates to be used.
Those certificates can be loaded in ckch_stores like any other
certificate which means that all the certificate-related operations that
can be made via the CLI can now benefit JWT validation as well.
We now have two ways JWT validation can work, the legacy one which only
relies on public keys which could not be stored in ckch_stores without
some in-depth changes in the way the ckch_stores are built. In this
legacy way, the public keys are fully stored in a cache dedicated to JWT
only which does not have any CLI commands and any way to update them
during runtime. It also requires that all the public keys used are
passed at least once explicitly to the 'jwt_verify' converter so that
they can be loaded during init.
The new way uses actual certificates, either already stored in the
ckch_store tree (if predefined in a crt-store or already used previously
in the configuration) or loaded in the ckch_store tree during init if
they are explicitly used in the configuration like so:
var(txn.bearer),jwt_verify(txn.jwt_alg,"cert.pem")
When using a variable (or any other way that can only be resolved during
runtime) in place of the converter's <key> parameter, the first time we
encounter a new value (for which we don't have any entry in the jwt
tree) we will lock the ckch_store tree and try to perform a lookup in
it. If the lookup fails, an entry will still be inserted into the jwt
tree so that any following call with this value avoids performing the
ckch_store tree lookup.
Contrary to what the doc says, the jwt_verify converter only works with
a public key and not a full certificate for certificate based protocols
(everything but HMAC).
This patch should be backported up to 2.8.
When resolving variable values the temporary trash chunks are used so
when calling the 'jwt_verify' converter with two variable parameters
like in the following line, the input would be overwritten by the value
of the second parameter :
var(txn.bearer),jwt_verify(txn.jwt_alg,txn.cert)
Copying the values into dedicated alloc'ed buffers prevents any new call
to get_trash_chunk from erasing the data we need in the converter.
This patch can be backported up to 2.8.
On backend side, an error at connection level during the preface sending was
not properly handled and could lead to a spinning loop on process_stream()
when the h2 stream on client side was blocked, for instance because of h2
flow control.
It appeared that no transition was performed from the PREFACE state to an
ERROR state on the H2 connection when an error occurred on the underlying
connection. In that case, the H2 connection was woken up in loop to try to
receive data, waking up the upper stream at the same time.
To fix the issue, an H2C error must be reported. Most state transitions are
handled by the demux function. So it is the right place to do so. First, in
PREFACE state and on server side, if an error occurred on the TCP
connection, an error is now reported on the H2 connection. REFUSED_STREAM
error code is used in that case. In addition, in that case, we also take
care to properly handle the connection shutdown.
This patch should fix the issue #3020. It must be backported to all stable
versions.
It was already forbidden to use HTTP sample fetch functions from lua
services. An error is triggered if it happens. However, the error must be
extended to any L6/L7 sample fetch functions.
Indeed, a lua service is an applet. It is totally unexpected for an applet to
access input data in a channel's buffer. These data have not been
analyzed yet and are still subject to change. An applet, lua or not,
must never access "not forwarded" data. Only output data are
available. For now, if a lua applet relies on any L6/L7 sample fetch
functions, the behavior is undefined and inconsistent.
So to fix the issue, hlua flag HLUA_F_MAY_USE_HTTP is renamed to
HLUA_F_MAY_USE_CHANNELS_DATA. This flag is used to prevent any lua applet from
using L6/L7 sample fetch functions.
This patch could be backported to all stable versions.
Patch ed9b8fec49 ("BUG/MEDIUM: ssl: AWS-LC + TLSv1.3 won't do ECDSA in
RSA+ECDSA configuration") partly fixed a cipher selection problem with
AWS-LC. However this no longer checked whether the ciphersuites were
available in haproxy, which is still a problem.
The problem was fixed in AWS-LC 1.46.0 with this PR
https://github.com/aws/aws-lc/pull/2092.
This patch allows the TLS13 ciphersuites to be filtered again with recent
versions of AWS-LC. However, since there are no macros to check the
AWS-LC version, it is enabled at the next AWS-LC API version change
following the fix in AWS-LC v1.50.0.
This could be backported where ed9b8fec49 was backported.
Since proxy and server struct already have an internal last_change
variable and we cannot merge it with the shared counter one, let's
rename the last_change counter to be more specific and prevent the
mixup between the two.
last_change counter is renamed to last_state_change, and unlike the
internal last_change, this one is a shared counter so it is expected
to be updated by other processes behind our back.
However, when updating last_state_change counter, we use the value
of the server/proxy last_change as reference value.
Same motivation as previous commit, proxy last_change is "abused" because
it is used for 2 different purposes, one for stats, and the other one
for process-local internal use.
Let's add a separate proxy-only last_change variable for internal use,
and leave the last_change shared (and thread-grouped) counter for
statistics.
last_change server metric is used for 2 separate purposes. First it is
used to report last server state change date for stats and other related
metrics. But it is also used internally, including in sensitive paths,
such as lb related stuff, to take decisions or perform computations
(ie: in srv_dynamic_maxconn()).
Due to last_change counter now being split over thread groups since 16eb0fa
("MAJOR: counters: dispatch counters over thread groups"), reading the
aggregated value has a cost, and we cannot afford to consult last_change
value from srv_dynamic_maxconn() anymore. Moreover, since the value is
used to take decisions for the current process, we don't want the variable
to be updated by another process behind our back.
To prevent performance regression and sharing issues, let's instead add a
separate srv->last_change value, which is not updated atomically (given how
rare the updates are), and only serves for places where the use of the
aggregated last_change counter/stats (split over thread groups) is too
costly.
16eb0fa ("MAJOR: counters: dispatch counters over thread groups")
introduced some bugs: as a result of an improper copy-paste during
COUNTERS_SHARED_LAST() macro introduction, some functions such as
srv_downtime() which used to make use of the server last_change variable
now use the proxy one, which doesn't make sense and will likely cause
unexpected logical errors/bugs.
Let's fix them all at once by properly pointing to the server last_change
variable when relevant.
No backport needed.
Now that native mailers configuration is only usable with Lua mailers,
Willy noticed that we lack a way to warn the user if mailers were
previously configured on an older version but Lua mailers were not loaded,
which could trick the user into thinking mailers keep working when
transitioning to 3.2 while they are not.
In this patch we add the 'core.use_native_mailers_config()' Lua function
which should be called in Lua script body before making use of
'Proxy:get_mailers()' function to retrieve legacy mailers configuration
from haproxy main config. This way haproxy effectively knows that the
native mailers config is actually being used from Lua (which indicates
user correctly migrated from native mailers to Lua mailers), else if
mailers are configured but not used from Lua then haproxy warns the user
about the fact that they will be ignored unless they are used from Lua.
(e.g.: using the provided 'examples/lua/mailers.lua' to ease transition)
_srv_check_proxy_mode() is currently executed during server init (from
_srv_parse_init()). While it used to be fine for current checks, it
seems it occurs a bit too early to be usable for some checks that depend
on server keywords being evaluated, for instance.
As such, to make _srv_check_proxy_mode() more relevant and be extended
with additional checks in the future, let's call it later during server
finalization, once all server keywords were evaluated.
No change of behavior is expected.
This commit broke the QUIC backend connection to servers without address validation
or retry activated:
MINOR: quic-be: address validation support implementation (RETRY)
Indeed the retry_source_connection_id transport parameter was already checked
as if it was required, as if the peer (server) was always using address validation.
Furthermore, relying on ->odcid.len to ensure a retry token was received is not
correct.
This patch ensures the retry_source_connection_id transport parameter is checked
only when a retry token was received (->retry_token != NULL). In this case
it also checks that this transport parameter is present when a retry token
has been received (tx_params->retry_source_connection_id.len != 0).
No need to backport.
Released version 3.3-dev2 with the following main changes :
- BUG/MINOR: config/server: reject QUIC addresses
- MINOR: server: implement helper to identify QUIC servers
- MINOR: server: mark QUIC support as experimental
- MINOR: mux-quic-be: allow QUIC proto on backend side
- MINOR: quic-be: Correct Version Information transp. param encoding
- MINOR: quic-be: Version Information transport parameter check
- MINOR: quic-be: Call ->prepare_srv() callback at parsing time
- MINOR: quic-be: QUIC backend XPRT and transport parameters init during parsing
- MINOR: quic-be: QUIC server xprt already set when preparing their CTXs
- MINOR: quic-be: Add a function for the TLS context allocations
- MINOR: quic-be: Correct the QUIC protocol lookup
- MINOR: quic-be: ssl_sock contexts allocation and misc adaptations
- MINOR: quic-be: SSL sessions initializations
- MINOR: quic-be: Add a function to initialize the QUIC client transport parameters
- MINOR: sock: Add protocol and socket types parameters to sock_create_server_socket()
- MINOR: quic-be: ->connect() protocol callback adaptations
- MINOR: quic-be: QUIC connection allocation adaptation (qc_new_conn())
- MINOR: quic-be: xprt ->init() adapatations
- MINOR: quic-be: add field for max_udp_payload_size into quic_conn
- MINOR: quic-be: Do not redispatch the datagrams
- MINOR: quic-be: Datagrams and packet parsing support
- MINOR: quic-be: Handshake packet number space discarding
- MINOR: h3-be: Correctly retrieve h3 counters
- MINOR: quic-be: Store asap the DCID
- MINOR: quic-be: Build post handshake frames
- MINOR: quic-be: Add the conn object to the server SSL context
- MINOR: quic-be: Initial packet number space discarding.
- MINOR: quic-be: I/O handler switch adaptation
- MINOR: quic-be: Store the remote transport parameters asap
- MINOR: quic-be: Missing callbacks initializations (USE_QUIC_OPENSSL_COMPAT)
- MINOR: quic-be: Make the secret derivation works for QUIC backends (USE_QUIC_OPENSSL_COMPAT)
- MINOR: quic-be: SSL_get_peer_quic_transport_params() not defined by OpenSSL 3.5 QUIC API
- MINOR: quic-be: get rid of ->li quic_conn member
- MINOR: quic-be: Prevent the MUX to send/receive data
- MINOR: quic: define proper proto on QUIC servers
- MEDIUM: quic-be: initialize MUX on handshake completion
- BUG/MINOR: hlua: Don't forget the return statement after a hlua_yieldk()
- BUILD: hlua: Fix warnings about uninitialized variables
- BUILD: listener: fix 'for' loop inline variable declaration
- BUILD: hlua: Fix warnings about uninitialized variables (2)
- BUG/MEDIUM: mux-quic: adjust wakeup behavior
- MEDIUM: backend: delay MUX init with ALPN even if proto is forced
- MINOR: quic: mark ctrl layer as ready on quic_connect_server()
- MINOR: mux-quic: improve documentation for snd/rcv app-ops
- MINOR: mux-quic: define flag for backend side
- MINOR: mux-quic: set expect data only on frontend side
- MINOR: mux-quic: instantiate first stream on backend side
- MINOR: quic: wakeup backend MUX on handshake completed
- MINOR: hq-interop: decode response into HTX for backend side support
- MINOR: hq-interop: encode request from HTX for backend side support
- CLEANUP: quic-be: Add comments about qc_new_conn() usage
- BUG/MINOR: quic-be: CID double free upon qc_new_conn() failures
- MINOR: quic-be: Avoid SSL context unreachable code without USE_QUIC_OPENSSL_COMPAT
- BUG/MINOR: quic: prevent crash on startup with -dt
- MINOR: server: reject QUIC servers without explicit SSL
- BUG/MINOR: quic: work around NEW_TOKEN parsing error on backend side
- BUG/MINOR: http-ana: Properly handle keep-query redirect option if no QS
- BUG/MINOR: quic: don't restrict reception on backend privileged ports
- MINOR: hq-interop: handle HTX response forward if not enough space
- BUG/MINOR: quic: Fix OSSL_FUNC_SSL_QUIC_TLS_got_transport_params_fn callback (OpenSSL3.5)
- BUG/MINOR: quic: fix ODCID initialization on frontend side
- BUG/MEDIUM: cli: Don't consume data if outbuf is full or not available
- MINOR: cli: handle EOS/ERROR first
- BUG/MEDIUM: check: Set SOCKERR by default when a connection error is reported
- BUG/MINOR: mux-quic: check sc_attach_mux return value
- MINOR: h3: support basic HTX start-line conversion into HTTP/3 request
- MINOR: h3: encode request headers
- MINOR: h3: complete HTTP/3 request method encoding
- MINOR: h3: complete HTTP/3 request scheme encoding
- MINOR: h3: adjust path request encoding
- MINOR: h3: adjust auth request encoding or fallback to host
- MINOR: h3: prepare support for response parsing
- MINOR: h3: convert HTTP/3 response into HTX for backend side support
- MINOR: h3: complete response status transcoding
- MINOR: h3: transcode H3 response headers into HTX blocks
- MINOR: h3: use BUG_ON() on missing request start-line
- MINOR: h3: reject invalid :status in response
- DOC: config: prefer-last-server: add notes for non-deterministic algorithms
- CLEANUP: connection: remove unused mux-ops dedicated to QUIC
- BUG/MINOR: mux-quic/h3: properly handle too low peer fctl initial stream
- MINOR: mux-quic: support max bidi streams value set by the peer
- MINOR: mux-quic: abort conn if cannot create stream due to fctl
- MEDIUM: mux-quic: implement attach for new streams on backend side
- BUG/MAJOR: fwlc: Count an avoided server as unusable.
- MINOR: fwlc: Factorize code.
- BUG/MEDIUM: quic: do not release BE quic-conn prior to upper conn
- MAJOR: cfgparse: turn the same proxy name warning to an error
- MAJOR: cfgparse: make sure server names are unique within a backend
- BUG/MINOR: tools: only reset argument start upon new argument
- BUG/MINOR: stream: Avoid recursive evaluation for unique-id based on itself
- BUG/MINOR: log: Be able to use %ID alias at anytime of the stream's evaluation
- MINOR: hlua: emit a log instead of an alert for aborted actions due to unavailable yield
- MAJOR: mailers: remove native mailers support
- BUG/MEDIUM: ssl/clienthello: ECDSA with ssl-max-ver TLSv1.2 and no ECDSA ciphers
- DOC: configuration: add details on prefer-client-ciphers
- MINOR: ssl: Add "renegotiate" server option
- DOC: remove the program section from the documentation
- MAJOR: mworker: remove program section support
- BUG/MINOR: quic: wrong QUIC_FT_CONNECTION_CLOSE(0x1c) frame encoding
- MINOR: quic-be: add a "CC connection" backend TX buffer pool
- MINOR: quic: Useless TX buffer size reduction in closing state
- MINOR: quic-be: Allow sending 1200 bytes Initial datagrams
- MINOR: quic-be: address validation support implementation (RETRY)
- MEDIUM: proxy: deprecate the "transparent" and "option transparent" directives
- REGTESTS: update http_reuse_be_transparent with "transparent" deprecated
- REGTESTS: script: also add a line pointing to the log file
- DOC: config: explain how to deal with "transparent" deprecation
- MEDIUM: proxy: mark the "dispatch" directive as deprecated
- DOC: config: crt-list clarify default cert + cert-bundle
- MEDIUM: cpu-topo: switch to the "performance" cpu-policy by default
- SCRIPTS: drop the HTML generation from announce-release
- BUG/MINOR: tools: use my_unsetenv instead of unsetenv
- CLEANUP: startup: move comment about nbthread where it's more appropriate
- BUILD: qpack: fix a build issue on older compilers
Got this on gcc-4.8:
src/qpack-enc.c: In function 'qpack_encode_method':
src/qpack-enc.c:168:3: error: 'for' loop initial declarations are only allowed in C99 mode
for (size_t i = 0; i < istlen(other); ++i)
^
This came from commit a0912cf914 ("MINOR: h3: complete HTTP/3 request
method encoding"), no backport is needed.
Let's use our own implementation of unsetenv() instead of the one provided
by libc. The implementation from libc may vary depending on the UNIX
distro. The implementation from libc.so.1 ported to Illumos (see the link below) has
caused an infinite loop in clean_env(), where we invoke unsetenv().
(https://github.com/illumos/illumos-gate/blob/master/usr/src/lib/libc/port/gen/getenv.c#L411C1-L456C1)
This is reported at GitHub #3018 and the reporter has proposed the patch, which
we really appreciate! But looking at his fix and at the implementations of
unsetenv() in FreeBSD libc and in Linux glibc 2.31, it seems that the algorithm
of clean_env() will perform better with our my_unsetenv() implementation.
This should be backported in versions 3.1 and 3.2.
It has not been used over the last 5 years or so and systematically
requires manual removal. Let's just stop producing it. Also take
this opportunity to add the missing link to /discussions.
As mentioned during the NUMA series development, the goal is to use
all available cores in the most efficient way by default, which
normally corresponds to "cpu-policy performance". The previous default
choice of "cpu-policy first-usable-node" was only meant to stay 100%
identical to before cpu-policy.
So let's switch the default cpu-policy to "performance" right now.
The doc was updated to reflect this.
Clarify that HAProxy duplicates crt-list entries for multi-cert bundles
which can create unexpected side-effects as only the very first
certificate after duplication is considered as default implicitly.
As mentioned in [1], the "dispatch" directive from haproxy 1.0 has long
outlived its original purpose and still suffers from a number of technical
limitations (no checks, no SSL, no idle conns, etc) and still hinders some
internal evolutions. It's now time to mark it as deprecated, and to remove
it in 3.5 [2]. It was already recommended against in the documentation but
remained popular in raw TCP environments for being shorter to write.
The directive will now cause a warning to be emitted, suggesting an
alternate method involving "server". The warning can be shut using
"expose-deprecated-directives". The rare configs from 1.0 where
"dispatch" is combined with sticky servers using cookies will just
need to set these servers' weights to zero to prevent them from
being selected by the load balancing algorithm. All of this is
explained in the doc with examples.
Two reg tests were using this method, one purposely for this directive,
which now has expose-deprecated-directives, and another one to test the
behavior of idle connections, which was updated to use "server" and
extended to test both "http-reuse never" and "http-reuse always".
[1] https://github.com/orgs/haproxy/discussions/2921
[2] https://github.com/haproxy/wiki/wiki/Breaking-changes
The explanations for the "option transparent" keyword were a bit scarce
regarding deprecation, so let's explain how to replace it with a server
line that does the same.
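As an illustration (backend and server names below are only examples), the
replacement looks like:

    backend be_transparent
        # previously: option transparent
        server transparent-srv 0.0.0.0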
I never counted the number of hours I've spent selecting then
copy-pasting the directory output and manually appending "/LOG" to read
a log file, but it amounts to tens or hundreds. Let's just add a direct
pointer to the log file at the end of the log for a failed run.
With commit e93f3ea3f8 ("MEDIUM: proxy: deprecate the "transparent" and
"option transparent" directives") this one no longer works as the config
either has to be adjusted to use server 0.0.0.0 or to enable the deprecated
feature. The test used to validate a technical limitation ("transparent"
not supporting shared connections), indicated as being comparable to
"http-reuse never". Let's now duplicate the test for "http-reuse never"
and "http-reuse always" and validate both behaviors.
Take this opportunity to fix a few problems in this config:
- use "nbthread 1": depending on the thread where the connection
arrives, the connection may or may not be reused
- add explicit URLs to the clients so that they can be recognized
in the logs
- add comments to make it clearer what to expect for each test
As discussed here [1], "transparent" (already deprecated) and
"option transparent" are horrible hacks which should really disappear
in favor of "server xxx 0.0.0.0" which doesn't rely on hackish code
path. This old feature is now deprecated in 3.3 and will disappear in
3.5, as indicated here [2]. A warning is emitted when used, explaining
how to proceed, and how to silence the warning using the global
"expose-deprecated-directives" if needed. The doc was updated to
reflect this new state.
[1] https://github.com/orgs/haproxy/discussions/2921
[2] https://github.com/haproxy/wiki/wiki/Breaking-changes
- Add ->retry_token and ->retry_token_len new quic_conn struct members to store
the retry tokens. These objects are allocated by quic_rx_packet_parse() and
released by quic_conn_release().
- Add <pool_head_quic_retry_token> new pool for these tokens.
- Implement quic_retry_packet_check() to check the integrity tag of these tokens
upon RETRY packets receipt. quic_tls_generate_retry_integrity_tag() is called
by this new function. It has been modified to pass the address where the tag
must be generated.
- Add <resend> new parameter to quic_pktns_discard(). This function is called
to discard the packet number spaces to which the already transmitted TX packets and
frames are attached. <resend> allows the caller to prevent this function from
releasing the in-flight TX packets/frames. The frames are requeued to be resent.
- Modify quic_rx_pkt_parse() to handle the RETRY packets. What must be done upon
such packets receipt is:
- store the retry token,
- store the new peer SCID as the DCID of the connection. Note that the peer will
modify again its SCID. This is why this SCID is also stored as the ODCID
which must be matched with the peer retry_source_connection_id transport parameter,
- discard the Initial packet number space without flagging it as discarded and
prevent retransmissions calling qc_set_timer(),
- modify the TLS cryptographic cipher contexts (RX/TX),
- wakeup the I/O handler to send new Initial packets asap.
- Modify quic_transport_param_decode() to handle the retry_source_connection_id
transport parameter as a QUIC client. Then its caller is modified to
check this transport parameter matches with the SCID sent by the peer with
the RETRY packet.
This easy-to-understand patch is not intrusive at all and cannot break the QUIC
listeners.
The QUIC client MUST always pad its datagrams with Initial packets. A "!l" (not
a listener) test OR'ed with the existing ones is added to satisfy the condition
to allow the build of such datagrams.
There is no need to limit the size of the TX buffer to QUIC_MIN_CC_PKTSIZE bytes
when the connection is in closing state. There is already a test which limits the
number of bytes to be used from this TX buffer after this useless test removed.
It limits this number of bytes to the size of the TX buffer itself:
    if (end > (unsigned char *)b_wrap(buf))
        end = (unsigned char *)b_wrap(buf);
This is exactly what is needed when the connection is in closing state. Indeed,
the size of the TX buffers is limited to reduce the memory usage. The connection
only needs to send short datagrams with at most 2 packets with CONNECTION_CLOSE*
frames. They are built only one time and backed up into a small TX buffer allocated
from a dedicated pool.
The size of this TX buffer is QUIC_MAX_CC_BUFSIZE which depends on QUIC_MIN_CC_PKTSIZE:
#define QUIC_MIN_CC_PKTSIZE 128
#define QUIC_MAX_CC_BUFSIZE (2 * (QUIC_MIN_CC_PKTSIZE + QUIC_DGRAM_HEADLEN))
This size is smaller than an MTU.
This patch should be backported as far as 2.9 to ease further backports to come.
A QUIC client must be able to close a connection sending Initial packets. But
QUIC client Initial packets must always be at least 1200 bytes long. To reduce
the memory use of TX buffers of a connection when in "closing" state, a pool
was dedicated for this purpose but with a TX buffer size that was reduced too
much (QUIC_MAX_CC_BUFSIZE).
This patch adds a "closing state connection" TX buffer pool with the same role
for QUIC backends.
This is an old bug which was there since this commit:
MINOR: quic: Avoid zeroing frame structures
It seems QUIC_FT_CONNECTION_CLOSE was confused with QUIC_FT_CONNECTION_CLOSE_APP
which does not include a "frame type" field. This field was not initialized
(so it had a random value), which prevented the packet from being built because
the packet builder assumes that packets with such frames are very short.
Must be backported as far as 2.6.
This patch completely removes the support for the program section; the
parsing of the section as well as the internals in the mworker do not
support it anymore.
The program section was considered dysfunctional and not fully
compatible with the "mworker V3" model. Users that want to run an
external program must use their init system.
The documentation is cleaned up in another patch.
This "renegotiate" option can be set on SSL backends to allow secure
renegotiation. It is mostly useful with SSL libraries that disable
secure renegotiation by default (such as AWS-LC).
The "no-renegotiate" one can be used the other way around, to disable
secure renegotiation that could be allowed by default.
Those two options can be set via "ssl-default-server-options" as well.
prefer-client-ciphers does not work exactly the same way when used with
a dual algorithm stack (ECDSA + RSA). This patch details its behavior.
This patch must be backported in every maintained version.
Problem was discovered in #2988.
Patch 23093c72 ("BUG/MINOR: ssl: suboptimal certificate selection with TLSv1.3
and dual ECDSA/RSA") introduced a problem when prioritizing the ECDSA
with TLSv1.3.
Indeed, when a client with TLSv1.3 capabilities announces a list of
ECDSA sigalgs, a list of TLSv1.3 ciphersuites compatible with ECDSA,
but only RSA ciphers for TLSv1.2, and haproxy is configured with
ssl-max-ver TLSv1.2, then haproxy would use the ECDSA keypair, but the
client wouldn't be able to process it because TLSv1.2 was negotiated.
HAProxy would be configured like that:
ssl-default-bind-options ssl-max-ver TLSv1.2
And a client could be used this way:
openssl s_client -connect localhost:8443 -cipher ECDHE-ECDSA-AES128-GCM-SHA256 \
-ciphersuites TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256
This patch fixes the issue by checking if TLSv1.3 was configured before
allowing ECDSA if a TLSv1.3 ciphersuite is in the list.
This could be backported where 23093c72 ("BUG/MINOR: ssl: suboptimal
certificate selection with TLSv1.3 and dual ECDSA/RSA") was backported.
However this is quite sensitive and we should wait a bit before the
backport.
This should fix issue #2988
As mentioned in 2.8 announce on the mailing list [1] and on the wiki [2]
native mailers were deprecated and planned for removal in 3.3. Now is
the time to drop the legacy code for native mailers which is based on a
tcpcheck "hack" and cannot be maintained. Lua mailers should be used as
a drop in replacement. Indeed, "mailers" and associated config directives
are preserved because mailers config is exposed to Lua, which helps smooth
the transition from native mailers to Lua based ones.
As a reminder, to keep mailers configuration working as before without
making changes to the config file, simply add the line below to the global
section:
lua-load examples/lua/mailers.lua
mailers.lua script (provided in the git repository, adjust path as needed)
may be customized by users familiar with Lua, by default it emulates the
behavior of the native (now removed) mailers.
[1]: https://www.mail-archive.com/haproxy@formilux.org/msg43600.html
[2]: https://github.com/haproxy/wiki/wiki/Breaking-changes
As reported by Chris Staite in GH #3002, trying to yield from a Lua
action during a client disconnect causes the script to be interrupted
(which is expected) and an alert to be emitted with the error:
"Lua function '%s': yield not allowed".
While this error is well suited for cases where the yield is not expected
at all (ie: when context doesn't allow it) and results from a yield misuse
in the Lua script, it isn't the case when the yield is exceptionally not
available due to an abort or error in the request/response processing.
Because of that we raise an alert but the user cannot do anything about it
(the script is correct), so it is confusing and polluting the logs.
In this patch we introduce the ACT_OPT_FINAL_EARLY flag which is a
complementary flag to ACT_OPT_FIRST. This flag is set when the
ACT_OPT_FIRST is set earlier than normal (due to error/abort).
hlua_action() then checks for this flag to decide whether an error (alert)
or a simple log message should be emitted when the yield is not available.
It should solve GH #3002. Thanks to Chris Staite (@chrisstaite-menlo) for
having reported the issue and suggested a solution.
In a log-format string, using "%[unique-id]" or "%ID" should be equivalent.
However, for the first one, the unique ID is generated when the sample fetch
function is called. For the alias, it is not true. It that case, the
stream's unique ID is generated when the log message is emitted. Otherwise,
by default, the unique id is automatically generated at the end of the HTTP
request analysis.
So, if the alias "%ID" is used in a log-format string anywhere before the end
of the request analysis, the evaluation fails and the ID is considered
empty. This is not consistent and in contradiction with the "%ID"
documentation.
To fix the issue, instead of evaluating the unique ID when the log message
is emitted, it is now performed on demand when "%ID" format is evaluated.
This patch should fix the issue #3016. It should be backported to all stable
versions. It relies on the following commit:
* BUG/MINOR: stream: Avoid recursive evaluation for unique-id based on itself
There is nothing that prevent a "unique-id-format" to reference itself,
using '%ID' or '%[unique-id]'. If the sample fetch function is used, it
leads to an infinite loop, calling recursively the function responsible to
generate the unique ID.
One solution is to detect it during the configuration parsing to trigger an
error. With this patch, we just inhibit recursive calls by considering the
unique-id as empty during its evaluation. So "id-%[unique-id]" lf string
will be evaluated as "id-".
This patch must be backported to all stable versions.
In issue #2995, Thomas Kjaer reported that empty argument position
reporting had been broken yet again. This time it was broken by this
latest fix: 2b60e54fb1 ("BUG/MINOR: tools: improve parse_line()'s
robustness against empty args"). It turns out that this fix is not
the culprit and it's in fact correct. The culprit was the original
commit of this series, 7e4a2f39ef ("BUG/MINOR: tools: do not create
an empty arg from trailing spaces"), which used to reset arg_start
to outpos for every new char in addition to doing it for every arg.
This resulted in the end of the line to be seen as always being in
error, thus reporting an incorrect position that the caller would
correct in a generic way designating the beginning of the line. It
didn't reveal prior to the upper fix above because the misassigned
value was almost not used by then.
Assigning the value before entering the loop fixes this problem and
doesn't break the series of previous oss-fuzz reproducers. Hopefully
it's the last one again.
This must be backported to 3.2. Thanks to @tkjaer for reporting the
issue along with a reproducer.
There was already a check for this but there used to be an exception
that allowed duplicate server names only in case where their IDs were
explicit and different. This has been emitting a warning since 3.1 and
planned for removal in 3.3, so let's do it now. The doc was updated,
though it never mentioned this unicity constraint, so that was added.
Only the check for the exception was removed, the rest of the code
that is currently made to deal with duplicate server names was not
cleaned yet (e.g. the tree doesn't need to support dups anymore, and
this could be done at insertion time). This may be a subject for future
cleanups.
As warned since 3.1, it's no longer permitted to have a frontend and
a backend under the same name. This causes too many designation issues,
and causes trouble with stick-tables as well. Now each proxy name is
unique.
This commit only changes the check to return an error. Some code parts
currently exist to find the best candidates, these will be able to be
simplified as future cleanup patches. The doc was updated.
For frontend side, quic_conn is only released if MUX wasn't allocated,
either due to handshake abort, in which case upper layer is never
allocated, or after transfer completion when full conn + MUX layers are
already released.
On the backend side, initialization is not performed in the same order.
Indeed, in this case, the connection is first instantiated, then the
quic_conn is created to execute the handshake, while the MUX is still only
allocated on handshake completion. As such, it is not possible anymore
to immediately free the quic_conn on handshake failure. Otherwise, this can
cause a crash if the connection tries to access its transport layer again
after the quic_conn release.
Such crash can easily be reproduced in case of connection error to the
QUIC server. Here is an example of an experienced backtrace.
Thread 1 "haproxy" received signal SIGSEGV, Segmentation fault.
0x0000555555739733 in quic_close (conn=0x55555734c0d0, xprt_ctx=0x5555573a6e50) at src/xprt_quic.c:28
28 qc->conn = NULL;
[ ## gdb ## ] bt
#0 0x0000555555739733 in quic_close (conn=0x55555734c0d0, xprt_ctx=0x5555573a6e50) at src/xprt_quic.c:28
#1 0x00005555559c9708 in conn_xprt_close (conn=0x55555734c0d0) at include/haproxy/connection.h:162
#2 0x00005555559c97d2 in conn_full_close (conn=0x55555734c0d0) at include/haproxy/connection.h:206
#3 0x00005555559d01a9 in sc_detach_endp (scp=0x7fffffffd648) at src/stconn.c:451
#4 0x00005555559d05b9 in sc_reset_endp (sc=0x55555734bf00) at src/stconn.c:533
#5 0x000055555598281d in back_handle_st_cer (s=0x55555734adb0) at src/backend.c:2754
#6 0x000055555588158a in process_stream (t=0x55555734be10, context=0x55555734adb0, state=516) at src/stream.c:1907
#7 0x0000555555dc31d9 in run_tasks_from_lists (budgets=0x7fffffffdb30) at src/task.c:655
#8 0x0000555555dc3dd3 in process_runnable_tasks () at src/task.c:889
#9 0x0000555555a1daae in run_poll_loop () at src/haproxy.c:2865
#10 0x0000555555a1e20c in run_thread_poll_loop (data=0x5555569d1c00 <ha_thread_info>) at src/haproxy.c:3081
#11 0x0000555555a1f66b in main (argc=5, argv=0x7fffffffde18) at src/haproxy.c:3671
To fix this, change the condition prior to calling quic_conn release. If
<conn> member is not NULL, delay the release, similarly to the case when
MUX is allocated. This allows the connection to be freed first, and to detach
from the quic_conn layer through the close xprt operation.
No need to backport.
Always set unusable if we could not use a server, instead of doing it in
each branch
This should be backported to 3.2 after e28e647fef43e5865c87f328832fec7794a423e5
is backported.
When fwlc_get_next_server(), if a server to avoid has been provided, and
we have to ignore it, don't forget to increase the number of unusable
servers, otherwise we may end up ignoring it over and over, never
switching to another server, in an infinite loop until the process gets
killed.
This hopefully fixes Github issues #3004 and #3014.
This should be backported to 3.2.
Implement attach and avail_streams mux-ops callbacks, which are used on
backend side for connection reuse.
Attach operation is used to initiate new streams on the connection
outside of the first one. It simply relies on qcc_init_stream_local() to
instantiate a new QCS instance, which is immediately linked to its
stream data layer.
Outside of attach, it is also necessary to implement avail_streams so
that the stream layer will try to initiate connection reuse. This method
reports the number of bidirectional streams which can still be opened
for the QUIC connection. It depends directly to the flow-control value
advertised by the peer. Thus, this ensures that attach won't cause any
flow control violation.
Prior to initiate first stream on the backend side, ensure that peer
flow-control allows at least that a single bidirectional stream can be
created. If this is not the case, abort MUX init operation.
Before this patch, the flow-control limit was not checked. Hence, if the peer
does not allow any bidirectional stream, haproxy would violate it, which
would then cause the peer to close the connection.
Note that with the current situation, haproxy won't be able to talk to
servers which uses a 0 for initial max bidi streams. A proper solution
could be to pause the request until a MAX_STREAMS is received, under
timeout supervision to ensure the connection is closed if no frame is
received.
Implement support for MAX_STREAMS frame. On frontend, this was mostly
useless as haproxy would never initiate new bidirectional streams.
However, this becomes necessary to control stream flow-control when
using QUIC as a client on the backend side.
Parsing of MAX_STREAMS is implemented via new qcc_recv_max_streams().
This allows to update <ms_uni>/<ms_bidi> QCC fields.
This patch is necessary to achieve QUIC backend connection reuse.
Previously, no check on peer flow-control was implemented prior to opening
a local QUIC stream. This was a small problem for the frontend
implementation, as in this case haproxy, as a server, never opens
bidirectional streams.
On frontend, the only stream opened by haproxy in this case is for
HTTP/3 control unidirectional data. If the peer uses an initial value of
0 for max uni streams, haproxy would violate its flow control, and the
peer would probably close the connection. Note however that RFC 9114
mandates that each peer defines a minimal initial value so that at least
the control stream can be created.
This commit improves the situation of too low initial max uni streams
value. Now, on HTTP/3 layer initialization, haproxy preemptively checks
flow control limit on streams via a new function
qcc_fctl_avail_streams(). If the credit is already exhausted due to a
too small initial value, haproxy preemptively closes the connection using
H3_ERR_GENERAL_PROTOCOL_ERROR. This behavior is better as haproxy is now
the initiator of the connection closure.
This should be backported up to 2.8.
Remove avail_streams_bidi/avail_streams_uni mux_ops. These callbacks
were designed to be specific to QUIC. However, they won't be necessary,
as stream layer only cares about bidirectional streams.
Add some notes on which load-balancing algorithms can be considered
deterministic or non-deterministic, and add some examples of each type.
This was asked on the mailing list to clarify the usage of the
prefer-last-server option.
This can be backported to all stable versions.
Add checks to ensure that :status pseudo-header received in HTTP/3
response is valid. If either the header is not provided, or it isn't a
3-digit number, the response is considered invalid and the stream is
rejected. Also, the glitch counter is now incremented in any of these cases.
This should fix coverity report from github issue #3009.
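A minimal sketch of the validation rule described above (illustrative only,
detached from the real HTX/H3 plumbing): the value must be exactly three
digits.

    #include <ctype.h>
    #include <stddef.h>

    /* Returns 1 if <status> (of length <len>) is a valid :status value,
     * i.e. exactly three digits, else 0. On the real code path an invalid
     * value rejects the stream and increments the glitch counter.
     */
    static int h3_status_is_valid(const char *status, size_t len)
    {
        size_t i;

        if (len != 3)
            return 0;
        for (i = 0; i < len; i++) {
            if (!isdigit((unsigned char)status[i]))
                return 0;
        }
        return 1;
    }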
Convert BUG_ON_HOT() statements to BUG_ON() if HTX start-line is either
missing or duplicated when transcoding into a HTTP/3 request. This
ensures that such abnormal conditions will be detected even on default
builds.
This is linked to coverity report #3008.
Finalize HTTP/3 response transcoding into HTX message. This patch
implements conversion of HTTP/3 headers provided by the server into HTX
blocks.
Special checks have been implemented to reject connection-specific
headers, causing the stream to be shut in error. Also, handling of
content-length requires that the body size is equal to the value
advertized in the header to prevent HTTP desync.
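For illustration, a standalone sketch of the connection-specific header
check (the exact list and helper name in the real code may differ):

    #include <strings.h>

    /* Returns 1 if <name> designates a connection-specific header which must
     * not appear in an HTTP/3 message, else 0. In the response transcoding
     * path, such a header causes the stream to be closed in error.
     */
    static int hdr_is_conn_specific(const char *name)
    {
        static const char * const forbidden[] = {
            "connection", "proxy-connection", "keep-alive",
            "transfer-encoding", "upgrade", NULL,
        };
        int i;

        for (i = 0; forbidden[i]; i++) {
            if (strcasecmp(name, forbidden[i]) == 0)
                return 1;
        }
        return 0;
    }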
On the backend side, the HTTP/3 response from the server is transcoded
into an HTX message. Previously, a fixed value was used for the status
code.
Improve this by extracting the value specified by the server and set it
into the HTX status line. This requires to detect :status pseudo-header
from the HTTP/3 response.
Implement basic support for HTTP/3 response transcoding into
HTX. This is done via a new dedicated function h3_resp_headers_to_htx().
A valid HTX status-line is allocated and stored. Status code is
hardcoded to 200 for now.
Following patches will be added to remove hardcoded status value and
also handle response headers provided by the server.
Refactor HTTP/3 request headers transcoding to HTX done in
h3_headers_to_htx(). Some operations are extracted into dedicated
functions, to check pseudo-headers and headers conformity, and also trim
the value of headers before encoding it in HTX.
The objective will be to simplify implementation of HTTP/3 response
transcoding by reusing these functions.
Also, h3_headers_to_htx() has been renamed to h3_req_headers_to_htx(),
to highlight that it is reserved to frontend usage.
Implement proper encoding of HTTP/3 authority pseudo-header during
request transcoding on the backend side. A pseudo-header :authority is
encoded if a value can be extracted from HTX start-line. A special check
is also implemented to ensure that a host header is not encoded if
:authority already is.
A new function qpack_encode_auth() is defined to implement QPACK
encoding of :authority header using literal field line with name ref.
Previously, HTTP/3 backend request :path was hardcoded to value '/'.
Change this so that we can now encode any path as requested by the
client. Path is extracted from the HTX URI. Also, qpack_encode_path() is
extended to support literal field line with name ref.
Previously, scheme was always set to https when transcoding an HTX
start-line into a HTTP/3 request. Change this so this conversion is now
fully compliant.
If no scheme is specified by the client, which is what happens most of
the time with HTTP/1, https is set for the HTTP/3 request. Else, reuse
the scheme requested by the client.
If either https or http is set, qpack_encode_scheme will encode it using
an entry from the QPACK static table. Else, a full literal field line
with name ref is used instead, and the scheme value is encoded as-is.
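A small sketch of the selection logic described above (helper names are
made up for the example; only the decision itself is shown):

    #include <string.h>

    /* If the client did not provide a scheme (typical with HTTP/1
     * origin-form requests), "https" is assumed; otherwise the client value
     * is kept.
     */
    static const char *select_scheme(const char *client_scheme)
    {
        if (!client_scheme || !*client_scheme)
            return "https";
        return client_scheme;
    }

    /* Only "http" and "https" have a QPACK static table entry; any other
     * scheme needs a literal field line with name reference.
     */
    static int scheme_in_static_table(const char *scheme)
    {
        return strcmp(scheme, "http") == 0 || strcmp(scheme, "https") == 0;
    }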
On the backend side, HTX start-line is converted into a HTTP/3 request
message. Previously, GET method was hardcoded. Implement proper method
conversion, by extracting it from the HTX start-line.
qpack_encode_method() has also been extended, so that it is able to
encode any method, either using a static table entry, or with a literal
field line with name ref representation.
Implement encoding of HTTP/3 request headers during HTX->H3 conversion
on the backend side. This simply relies on h3_encode_header().
Special check is implemented to ensure that connection-specific headers
are ignored. An HTTP/3 endpoint must never generate them, or the peer
will consider the message as malformed.
This commit is the first of a series whose aim is to implement
transcoding of an HTX request into HTTP/3, which is necessary for QUIC
backend support.
Transcoding is implemented via a new function h3_req_headers_send()
when an HTX start-line is parsed. For now, most of the request fields are
hardcoded, using a GET method. This will be adjusted in the
following patches.
On backend side, QUIC MUX needs to initialize the first local stream
during MUX init operation. This is necessary so that the first transfer
can then be performed.
sc_attach_mux() is used to attach the created QCS instance to its stream
data layer. However, return value was not checked, which may cause
issues on allocation error. This patch fixes it by returning an error on
MUX init operation and freeing the QCS instance in case of
sc_attach_mux() error.
This fixes coverity report from github issue #3007.
No need to backport.
When a connection error is reported, we try to collect as much information
as possible on the connection status and the server status is adjusted
accordingly. However, the function does nothing if there is no connection
error and if the healthcheck is not expired yet. It is a problem when an
internal error occurred. It may happen at many places and it is hard to be
sure an error is reported on the connection. And in fact, it is already a
problem when the multiplexer allocation fails. In that case, the healthcheck
is not interrupted as it should be. Concretely, it could only happen when a
connection is established.
It is hard to predict the effects of this bug. It may be unimportant. But it
could probably lead to a crash. To avoid any issue, a SOCKERR status is now
set by default when a connection error is reported. There is no reason to
report a connection error for nothing. So a healthcheck failure must be
reported. There is no "internal error" status. So a socket error is
reported.
This patch must be backported to all stable versions.
This is not especially a bug fix. But APPCTX_FL_EOS and APPCTX_FL_ERROR
flags must be handled first. These flags are set by the applet itself and
mark the end of all processing. So there is no reason to get the
output buffer in the first place.
This patch could be backported as far as 3.0.
The output buffer must be available to process a command, at least to be
able to emit error messages. When this buffer is full or cannot be
allocated, we must wait. In that case, we must take care to notify that the
SE will not consume input data. This is important to avoid waking up in a
loop, especially when the client aborts.
When the output buffer is available again and no longer full, and the CLI
applet is waiting for a command line, it must notify it will consume input
data.
This patch must be backported as far as 3.0.
QUIC support on the backend side has been implemented recently. This has
led to some adjustments to qc_new_conn() to handle both FE and BE sides,
with some of these changes performed by the following commit.
29fb1aee57288a8b16ed91771ae65c2bfa400128
MINOR: quic-be: QUIC connection allocation adaptation (qc_new_conn())
An issue was introduced during some code adjustments. The initialization of
the ODCID was incorrectly performed, which caused haproxy to emit invalid
transport parameters. Most of the clients detected this and immediately
closed the connection.
Fix this by adjusting the qc_lstnr_params_init() invocation: replace
<qc.dcid>, which in fact points to the received SCID, by <qc.odcid>
whose purpose is dedicated to original DCID storage.
This fixes github issue #3006. This issue also caused the majority of
tests in the interop to fail.
No backport needed.
This patch is OpenSSL3.5 QUIC API specific. It fixes
OSSL_FUNC_SSL_QUIC_TLS_got_transport_params_fn() callback (see man(3) SSL_set_quic_tls_cb).
The role of this callback is to store the transport parameters received by the peer.
At this time it is never used by QUIC listeners because there is another callback
which is used to store the transport parameters. This latter callback is not specific
to OpenSSL 3.5 QUIC API. As far as I know, the TLS stack calls only one of
the callbacks which have been set to receive and store the transport
parameters, and only one time.
That said, OSSL_FUNC_SSL_QUIC_TLS_got_transport_params_fn() is called for QUIC
backends to store the server transport parameters.
qc_ssl_set_quic_transport_params() is useless in this callback. It is dedicated
to storing the local transport parameters (which are sent to the peer). Furthermore,
the <server> second parameter of quic_transport_params_store() must be 0 for a
listener (or QUIC server) which calls it, denoting it does not receive the transport
parameters of a QUIC server. It must be 1 for a QUIC backend (a QUIC client which
receives the transport parameters of a QUIC server).
Must be backported to 3.2.
On backend side, HTTP/0.9 response body is copied into stream data HTX
buffer. Properly handle the case where the HTX out buffer space is too
small. In that case, only a partial copy of the HTTP response is performed;
transcoding will be restarted when new room is available.
When QUIC is used on the frontend side, communication with clients using
privileged ports is restricted. This is a simple protection against
DNS/NTP spoofing.
This feature should not be activated on the backend side, as in this
case it is quite frequent to exchange with servers running on privileged
ports. As such, a new parameter is added to quic_recv() so that it is
only active on the frontend side.
Without this patch, it is impossible to communicate with QUIC servers
running on privileged ports, as incoming datagrams would be silently
dropped.
No need to backport.
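A hedged, self-contained sketch of such a check (the real quic_recv()
integration differs; this only illustrates gating the protection on the
frontend side):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <sys/socket.h>

    /* Returns 1 if the datagram received from <saddr> must be dropped. The
     * privileged-port protection only makes sense on the frontend side;
     * servers legitimately listen on such ports.
     */
    static int drop_priv_port(const struct sockaddr_storage *saddr,
                              int is_frontend)
    {
        uint16_t port = 0;

        if (!is_frontend)
            return 0;

        if (saddr->ss_family == AF_INET)
            port = ntohs(((const struct sockaddr_in *)saddr)->sin_port);
        else if (saddr->ss_family == AF_INET6)
            port = ntohs(((const struct sockaddr_in6 *)saddr)->sin6_port);

        return port > 0 && port < 1024;
    }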
The keep-query redirect option must do nothing if there is no query-string.
However, there is a bug: when there is no QS, an error is returned, leading
to a 500 internal error being returned to the client.
To fix the bug, instead of returning 0 when there is no QS, we just skip the
QS processing.
This patch should fix the issue #3005. It must be backported as far as 3.1.
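As an illustration of the fixed behavior, here is a standalone sketch
(names and buffer handling are simplified, not the actual redirect code):
the absence of a query-string is simply a no-op, not an error.

    #include <string.h>

    /* Append the query-string of <uri> (if any) to the location held in
     * <loc> (size <size>, currently <loc_len> bytes). Returns the new length
     * or -1 on a real error (no room). No query-string is not an error.
     */
    static int keep_query(char *loc, size_t size, size_t loc_len, const char *uri)
    {
        const char *qs = strchr(uri, '?');
        size_t qs_len;

        if (!qs)
            return (int)loc_len;       /* nothing to append: skip, don't fail */

        qs_len = strlen(qs);
        if (loc_len + qs_len >= size)
            return -1;                 /* does not fit: genuine error */

        memcpy(loc + loc_len, qs, qs_len + 1);
        return (int)(loc_len + qs_len);
    }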
NEW_TOKEN frame is never emitted by a client, hence parsing was not
tested on frontend side.
On backend side, an issue can occur, as expected token length is static,
based on the token length used internally by haproxy. This is not
sufficient for most server implementations which use larger tokens. This
causes a parsing error, which may cause skipping of following frames in
the same packet. This issue was detected using ngtcp2 as server.
As for now tokens are unused by haproxy, simply discard the test on token
length during NEW_TOKEN frame parsing. The token itself is merely
skipped without being stored. This is sufficient for now to continue
experimenting with the QUIC backend implementation.
This does not need to be backported.
Report an error during server configuration if QUIC is used but SSL is
not activated via the 'ssl' keyword. This is done in _srv_parse_finalize(),
which is both used by static and dynamic servers.
Note that contrary to listeners, an error is reported instead of a
warning, and SSL is not automatically activated if missing. This is
mainly due to the complex server configuration : _srv_parse_finalize()
is ideal to affect every server, including dynamic entries. However, it
is executed after server SSL context allocation performed via
<prepare_srv> XPRT operation. A proper fix would be to move SSL ctx
alloc in _srv_parse_finalize(), but this may have unknown impact. Thus,
for now a simpler solution has been chosen.
QUIC traces in ssl_quic_srv_new_ssl_ctx() are problematic as this
function is called early during startup. If activating traces via -dt
command-line argument, a crash occurs due to stderr sink not yet
available.
Thus, traces from ssl_quic_srv_new_ssl_ctx() are simply removed.
No backport needed.
This commit added a "err" C label reachable only with USE_QUIC_OPENSSL_COMPAT:
MINOR: quic-be: Missing callbacks initializations (USE_QUIC_OPENSSL_COMPAT)
leading coverity to warn this:
*** CID 1611481: Control flow issues (UNREACHABLE)
/src/quic_ssl.c: 802 in ssl_quic_srv_new_ssl_ctx()
796 goto err;
797 #endif
798
799 leave:
800 TRACE_LEAVE(QUIC_EV_CONN_NEW);
801 return ctx;
>>> CID 1611481: Control flow issues (UNREACHABLE)
>>> This code cannot be reached: "err:
SSL_CTX_free(ctx);".
802 err:
803 SSL_CTX_free(ctx);
804 ctx = NULL;
805 TRACE_DEVEL("leaving on error", QUIC_EV_CONN_NEW);
806 goto leave;
807 }
The least intrusive way (without #ifdef) to fix this is to add a "goto err"
statement in the code part which is reachable without USE_QUIC_OPENSSL_COMPAT.
Thank you to @chipitsine for having reported this issue in GH #3003.
This issue may occur when qc_new_conn() fails after having allocated
and attached <conn_cid> to its tree. This is the case when compiling
haproxy against WolfSSL for an unknown reason at this time. In this
case the <conn_cid> is freed by pool_head_quic_connection_id(), then
freed again by quic_conn_release().
This bug arrived with this commit:
MINOR: quic-be: QUIC connection allocation adaptation (qc_new_conn())
So, the aim of this patch is to free <conn_cid> only for QUIC backends
and if it is not attached to its tree. This is the case when <conn_id>
local variable passed with NULL value to qc_new_conn() is then initialized
to the same <conn_cid> value.
This patch should have come with this last commit for the last qc_new_conn()
modifications for QUIC backends:
MINOR: quic-be: get rid of ->li quic_conn member
qc_new_conn() must be passed NULL pointers for several variables as mentioned
by the comment. Some of these local variables are used to avoid too many
code modifications.
Implement transcoding of a HTX request into HTTP/0.9. This protocol is a
simplified version of HTTP. Requests only support the GET method without any
headers. As such, only a request line is written during the snd_buf
operation.
Implement transcoding of a HTTP/0.9 response into a HTX message.
HTTP/0.9 is a really simple subset of the HTTP spec. The response does
not have any status line and contains only the payload body. The response
is finished when the underlying connection/stream is closed.
A status line is generated to be compliant with HTX. This is performed
on the first invocation of rcv_buf for the current stream. The status code
is set to 200. The payload body, if present, is then copied using
htx_add_data().
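A loose, standalone model of this rcv_buf behavior (raw bytes instead of
real HTX blocks, names invented for the example), also showing the partial
copy mentioned in the earlier HTTP/0.9 patch:

    #include <string.h>

    struct h09_resp {
        int started;   /* synthetic status line already emitted? */
    };

    /* On the first call a synthetic status line is emitted (HTTP/0.9 has
     * none), then every received byte is forwarded as payload until the
     * peer closes the stream. Returns the number of bytes produced.
     */
    static size_t h09_rcv_buf(struct h09_resp *r, char *out, size_t out_sz,
                              const char *in, size_t in_len)
    {
        size_t ret = 0;

        if (!r->started) {
            static const char sl[] = "HTTP/1.0 200 OK\r\n\r\n";

            if (sizeof(sl) - 1 > out_sz)
                return 0;              /* retried when more room is available */
            memcpy(out, sl, sizeof(sl) - 1);
            ret += sizeof(sl) - 1;
            r->started = 1;
        }

        if (in_len > out_sz - ret)
            in_len = out_sz - ret;     /* partial copy, transcoding resumes later */
        memcpy(out + ret, in, in_len);
        return ret + in_len;
    }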
This commit is the second and final step to initiate QUIC MUX on the
backend side. On handshake completion, MUX is woken up just after its
creation. This step is necessary to notify the stream layer, via the QCS
instance pre-initialized on MUX init, so that the transfer can be
resumed.
This mode of operation is similar to TCP stack when TLS+ALPN are used,
which forces MUX initialization to be delayed after handshake
completion.
Adjust qmux_init() to handle frontend and backend sides differently.
Most notably, on backend side, the first bidirectional stream is created
preemptively. This step is necessary as MUX layer will be woken up just
after handshake completion.
Stream data layer is notified that data is expected when FIN is
received, which marks the end of the HTTP request. This prepares data
layer to be able to handle the expected HTTP response.
Thus, this step is only relevant on frontend side. On backend side, FIN
marks the end of the HTTP response. No further content is expected, thus
expect data should not be set in this case.
Note that the se_expect_data() invocation via qcs_attach_sc() is not
protected. This is because this function will only be called during
request headers parsing which is performed on the frontend side.
Mux connection is flagged with new QC_CF_IS_BACK if used on the backend
side. For now the only change is during traces, to be able to
differentiate frontend and backend usage.
Complete the documentation for rcv_buf/snd_buf operations. In particular, the
return value is now explicitly defined. For the H3 layer, the associated
functions' documentation is also extended.
Use conn_ctrl_init() on the connection when quic_connect_server()
succeeds. This is necessary so that the connection is considered as
completely initialized. Without this, the connect operation would be called
again if the connection is reused.
On backend side, multiplexer layer is initialized during
connect_server(). However, this step is not performed if ALPN is used,
as the negotiated protocol may be unknown. Multiplexer initialization is
delayed after TLS handshake completion.
There are still exceptions though that force the MUX to be initialized
even if ALPN is used. One of them was if the <mux_proto> server field was
already set at this stage, which is the case when an explicit proto is
selected on the server line configuration. Remove this condition so that
now MUX init is delayed with ALPN even if proto is forced.
The scope of this change should be minimal. In fact, the only impact
concerns server config with both proto and ALPN set, which is pretty
unlikely as it is contradictory.
The main objective of this patch is to prepare QUIC support on the
backend side. Indeed, QUIC proto will be forced on the server if a QUIC
address is used, similarly to bind configuration. However, we still want
to delay MUX initialization after QUIC handshake completion. This is
mandatory to know the selected application protocol, required during
QUIC MUX init.
Change wake callback behavior for QUIC MUX. This operation loops over
each QCS and notify their stream data layer on certain events via
internal helper qcc_wake_some_streams().
Previously, streams were notified only if an error occurred on the
connection. Change this to notify the streams' data layer every time the
wake callback is used. This behavior is now identical to the H2 MUX.
qcc_wake_some_streams() is also renamed to qcc_wake_streams(), as it
better reflects its true behavior.
This change should not have performance impact as wake mux ops should
not be called frequently. Note that qcc_wake_streams() can also be
called directly via qcc_io_process() to ensure a new error is correctly
propagated. As wake callback first uses qcc_io_process(), it will only
call qcc_wake_streams() if no error is present.
No known issue is associated with this commit. However, it could prevent
freezing transfers under certain conditions. As such, it is considered as
a bug fix worthy of backporting.
This should be backported after a period of observation.
It was still failing on Ubuntu-24.04 with GCC+ASAN. So, instead of trying to
understand the code path the compiler followed to report uninitialized
variables, let's init them now.
No backport needed.
commit 16eb0fab3 ("MAJOR: counters: dispatch counters over thread groups")
introduced a build regression on some compilers:
src/listener.c: In function 'listener_accept':
src/listener.c:1095:3: error: 'for' loop initial declarations are only allowed in C99 mode
for (int it = 0; it < global.nbtgroups; it++)
^
src/listener.c:1095:3: note: use option -std=c99 or -std=gnu99 to compile your code
src/listener.c:1101:4: error: 'for' loop initial declarations are only allowed in C99 mode
for (int it = 0; it < global.nbtgroups; it++) {
^
make: *** [src/listener.o] Error 1
make: *** Waiting for unfinished jobs....
Let's fix that.
No backport needed
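The usual portable form of such a loop, shown here only as the general
pattern (whether the actual fix hoisted the declarations or adjusted the
build flags is not stated above):

    /* 'for (int it = 0; ...)' requires C99; with older dialects the
     * declaration must live outside the loop.
     */
    static void walk_tgroups(int nbtgroups)
    {
        int it;

        for (it = 0; it < nbtgroups; it++) {
            /* per-thread-group work */
        }
    }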
In hlua_applet_tcp_recv_try() and hlua_applet_tcp_getline_yield(), GCC 14.2
reports warnings about 'blk2' variable that may be used uninitialized. It is
a bit strange because the code is pretty similar to before. But to make it
happy and to avoid bugs if the API changes in the future, 'blk2' is now used only
when its length is greater than 0.
No need to backport.
In hlua_applet_tcp_getline_yield(), the function may yield if there is no
data available. However we must take care to add a return statement just
after the call to hlua_yieldk(). I don't know the details of the LUA API,
but at least, this return statement fixes a build error about uninitialized
variables that may be used.
It is a 3.3-specific issue. No backport needed.
On backend side, MUX is instantiated after QUIC handshake completion.
This step is performed via qc_ssl_provide_quic_data(). First, connection
flags for handshake completion are reset. Then, the MUX is instantiated
via conn_create_mux() function.
Force QUIC as <mux_proto> for a server if a QUIC address is used. This is
similar to what is already done for bind instances on the frontend
side. This step ensures that conn_create_mux() will select the proper
protocol.
Replace the ->li quic_conn pointer to struct listener member by ->target, which is
an object type enum, and adapt the code.
Use __objt_(listener|server)() where the object type is known. Typically,
this is where the code is specific to one connection type (frontend/backend).
Remove <server> parameter passed to qc_new_conn(). It is redundant with the
<target> parameter.
GSO is not supported at this time for QUIC backend. qc_prep_pkts() is modified
to prevent it from building more than an MTU. As a consequence, qc_send_ppkts()
will not use GSO.
ssl_clienthello.c code is run only by listeners. This is why __objt_listener()
is used in place of ->li.
Disable the code around SSL_get_peer_quic_transport_params() as this was done
for USE_QUIC_OPENSSL_COMPAT because SSL_get_peer_quic_transport_params() is not
defined by OpenSSL 3.5 QUIC API.
quic_tls_compat_keylog_callback() is the callback used by the QUIC OpenSSL
compatibility module to derive the TLS secrets from other secrets provided
by keylog. The <write> local variable to this function is initialized to denote
the direction (write to send, read to receive) the secret is supposed to be used
for. That said, as the QUIC cryptographic algorithms are symmetrical, the
direction is inverted between the peers: a secret which is used to
write/send/cipher data from one peer's point of view is also the secret which
is used by the other peer to read/receive/decipher data. This was confirmed by
the fact that, without this patch, the TLS stack first provides the Handshake
secret used to send/cipher data; the client could not use such a secret to
decipher the Handshake packets received from the server. This patch simply
reverses the direction stored by the <write> variable to make the secrets
derivation work for the QUIC client.
quic_tls_compat_init() function is called from OpenSSL QUIC compatibility module
(USE_QUIC_OPENSSL_COMPAT) to initialize the keylog callback and the callback
which stores the QUIC transport parameters as a TLS extension into the stack.
These callbacks must also be initialized for QUIC backends.
This is done from TLS secrets derivation callback at Application level (the last
encryption level) calling SSL_get_peer_quic_transport_params() to have access
to the TLS transport parameters extension embedded into the Server Hello TLS message.
Then, quic_transport_params_store() is called to store a decoded version of
these transport parameters.
For connections to QUIC servers, this patch modifies the moment when the I/O
handler callback is switched to quic_conn_app_io_cb(). This is no longer
done, as for listeners, just after the handshake has completed, but just after
it has been confirmed.
Discard the Initial packet number space as soon as possible. This is done
during handshakes in quic_conn_io_cb() as soon as a Handshake packet could
be successfully sent.
The initialization of <ssl_app_data_index> SSL user data index is required
to make all the SSL sessions to QUIC servers work as this is done for TCP
servers. The conn object is notably retrieved by SSL callbacks which are
server specific (e.g. ssl_sess_new_srv_cb()).
Store the peer connection ID (SCID) as the connection DCID as soon as an Initial
packet is received.
Stop comparing the packet type to QUIC_PACKET_TYPE_0RTT if it already matched
QUIC_PACKET_TYPE_INITIAL.
A QUIC server must not send too short datagrams with ack-eliciting packets inside.
This cannot be checked from quic_rx_pkt_parse() because one does not know if
there is an ack-eliciting frame in the Initial packets. If the packet must be
dropped, this happens after having parsed it!
Modify quic_dgram_parse() to stop passing it a listener as third parameter.
Instead, the object type address of the connection socket owner is passed
to support haproxy servers with QUIC as transport protocol.
qc_owner_obj_type() is implemented to return this address.
qc_counters() is also implemented to return the QUIC specific counters of
the proxy owning the connection.
quic_rx_pkt_parse(), called by quic_dgram_parse(), is also modified to use
the object type address used by the latter as last parameter. It is
also modified to send Retry packets only from listeners. A QUIC client
(connection to haproxy QUIC servers) must drop Initial packets with a
non-null token length. It is also not supposed to receive 0-RTT packets,
which are dropped.
The QUIC datagram redispatch is there to counter the race condition which
exists only for QUIC connections to listeners where datagrams may arrive
on the wrong socket between the bind() and connect() calls.
Run this code part only for listeners.
Allocate a connection to connect to QUIC servers from qc_conn_init() which is the
->init() QUIC xprt callback.
Also initialize the ->prepare_srv and ->destroy_srv callbacks as this is done
for TCP servers.
For haproxy QUIC servers (or QUIC clients), the peer is considered as validated.
This is a property which is more specific to QUIC servers (haproxy QUIC listeners).
No <odcid> is used for the QUIC client connection. It is used only on the QUIC server side.
The <token_odcid> is also not used on the QUIC client side. It must be embedded into
the transport parameters only on the QUIC server side.
The quic_conn is created before the socket allocation. So, the local address is
zeroed.
Initialize the transport parameters with qc_srv_params_init().
Stop hardcoding the value of the <server> parameter passed to qc_new_isecs() to
correctly initialize the Initial secrets.
Modify quic_connect_server() which is the ->connect() callback for QUIC protocol:
- add a BUG_ON() run when entering this function: the <fd> socket must equal -1
- conn->handle is a union. conn->handle.qc is used for QUIC connections,
conn->handle.fd must not be used to store the fd.
- code alignment fix for setsockopt(fd, SOL_SOCKET, (SO_SNDBUF|SO_RCVBUF))
statements
- remove the section of code which was duplicated from ->connect() TCP callback
- fd_insert() the new socket file descriptor created to connect to the QUIC
server with quic_conn_sock_fd_iocb() as callback for read event.
This patch only adds <proto_type> new proto_type enum parameter and <sock_type>
socket type parameter to sock_create_server_socket() and adapts its callers.
This is to prepare the use of this function by QUIC servers/backends.
Modify qc_alloc_ssl_sock_ctx() to pass the connection object as parameter. It is
NULL for a QUIC listener, not NULL for a QUIC server. This connection object is
set as value for ->conn quic_conn struct member. Initialise the SSL session object from
this function for QUIC servers.
qc_ssl_set_quic_transport_params() is also modified to pass the SSL object as parameter.
This is the only parameter this function needs. The <qc> parameter is used only
for the trace.
SSL_do_handshake() must be called as soon as the SSL object is initialized for
the QUIC backend connection. This triggers the TLS CRYPTO data delivery.
tasklet_wakeup() is also called to send these CRYPTO data asap.
Modify the QUIC_EV_CONN_NEW event trace to dump the potential errors returned by
SSL_do_handshake().
Implement ssl_sock_new_ssl_ctx() to allocate a SSL server context as this is currently
done for TCP servers and also for QUIC servers depending on the <is_quic> boolean value
passed as new parameter. For QUIC servers, this function calls ssl_quic_srv_new_ssl_ctx()
which is specific to QUIC.
From connect_server(), the QUIC protocol could not be retrieved by protocol_lookup()
because of the PROTO_TYPE_STREAM default passed as argument. Instead, to support
QUIC, srv->addr_type.proto_type may be safely passed.
The QUIC servers' xprts have already been set at server line parsing time.
This patch prevents the QUIC servers' xprts from being reset to the <ssl_sock>
value, which is the value used for SSL/TCP connections.
Add ->quic_params new member to server struct.
Also set the ->xprt member of the server being initialized and initialize asap its
transport parameters from _srv_parse_init().
This XPRT callback is called from check_config_validity() after the configuration
has been parsed to initialize all the SSL server contexts.
This patch implements the same thing for the QUIC servers.
Add a little check to verify that the version chosen by the server matches
the client one. Initialize the local transport parameters ->negotiated_version
value with this version if this is the case. If not, 0 is returned.
According to the RFC, a QUIC client must encode the QUIC versions it supports
into the "Available Versions" field of the "Version Information" transport
parameter, ordered by descending preference.
This is done by defining <quic_version_2> and <quic_version_draft_29>, new
variable pointers to the corresponding versions in the <quic_versions> array.
A client announces its available versions as follows: v1, v2, draft29.
Activate QUIC protocol support for MUX-QUIC on the backend side, in
addition to the current frontend support. This change is mandatory to be
able to implement QUIC on the backend side.
Without this modification, it is impossible to explicitly activate the QUIC
protocol on a server line, hence an error is reported:
config : proxy 'xxxx' : MUX protocol 'quic' is not usable for server 'yyyy'
Mark QUIC address support for servers as experimental on the backend
side. Previously, it was allowed but wouldn't function as expected. As
QUIC backend support requires several changes, it is better to declare
it as experimental first.
QUIC is not implemented on the backend side. To prevent any issue, it is
better to reject any configured server which uses it. This is done via
_srv_parse_init() which is used both for static and dynamic servers.
This should be backported to all stable versions.
Released version 3.3-dev1 with the following main changes :
- BUILD: tools: properly define ha_dump_backtrace() to avoid a build warning
- DOC: config: Fix a typo in 2.7 (Name format for maps and ACLs)
- REGTESTS: Do not use REQUIRE_VERSION for HAProxy 2.5+ (5)
- REGTESTS: Remove REQUIRE_VERSION=2.3 from all tests
- REGTESTS: Remove REQUIRE_VERSION=2.4 from all tests
- REGTESTS: Remove tests with REQUIRE_VERSION_BELOW=2.4
- REGTESTS: Remove support for REQUIRE_VERSION and REQUIRE_VERSION_BELOW
- MINOR: server: group postinit server tasks under _srv_postparse()
- MINOR: stats: add stat_col flags
- MINOR: stats: add ME_NEW_COMMON() helper
- MINOR: proxy: collect per-capability stat in proxy_cond_disable()
- MINOR: proxy: add a true list containing all proxies
- MINOR: log: only run postcheck_log_backend() checks on backend
- MEDIUM: proxy: use global proxy list for REGISTER_POST_PROXY_CHECK() hook
- MEDIUM: server: automatically add server to proxy list in new_server()
- MEDIUM: server: add and use srv_init() function
- BUG/MAJOR: leastconn: Protect tree_elt with the lbprm lock
- BUG/MEDIUM: check: Requeue healthchecks on I/O events to handle check timeout
- CLEANUP: applet: Update comment for applet_put* functions
- DEBUG: check: Add the healthcheck's expiration date in the trace messags
- BUG/MINOR: mux-spop: Fix null-pointer deref on SPOP stream allocation failure
- CLEANUP: sink: remove useless cleanup in sink_new_from_logger()
- MAJOR: counters: add shared counters base infrastructure
- MINOR: counters: add shared counters helpers to get and drop shared pointers
- MINOR: counters: add common struct and flags to {fe,be}_counters_shared
- MEDIUM: counters: manage shared counters using dedicated helpers
- CLEANUP: counters: merge some common counters between {fe,be}_counters_shared
- MINOR: counters: add local-only internal rates to compute some maxes
- MAJOR: counters: dispatch counters over thread groups
- BUG/MEDIUM: cli: Properly parse empty lines and avoid crashed
- BUG/MINOR: config: emit warning for empty args only in discovery mode
- BUG/MINOR: config: fix arg number reported on empty arg warning
- BUG/MINOR: quic: Missing SSL session object freeing
- MINOR: applet: Add API functions to manipulate input and output buffers
- MINOR: applet: Add API functions to get data from the input buffer
- CLEANUP: applet: Simplify a bit comments for applet_put* functions
- MEDIUM: hlua: Update TCP applet functions to use the new applet API
- BUG/MEDIUM: fd: Use the provided tgid in fd_insert() to get tgroup_info
- BUG/MINIR: h1: Fix doc of 'accept-unsafe-...-request' about URI parsing
The description of the tests performed on the URI in H1 when the
'accept-unsafe-violations-in-http-request' option is set is wrong. It states
that only characters below 32 and 127 are blocked when this option is set,
suggesting that otherwise, when it is not set, all invalid characters in the
URI, according to the RFC3986, are blocked.
But in fact, it is not true. By default all characters below 32 and above 127
are blocked. And when 'accept-unsafe-violations-in-http-request' option is
set, characters above 127 (excluded) are accepted. But characters in
(33..126) are never checked, independently of this option.
This patch should fix the issue #2906. It should be backported as far as
3.0. For older versions, the documentation could also be clarified because
this part is not really clear.
Note the request URI validation is still under discussion because invalid
characters in (33..126) are never checked and some users request a stricter
parsing.
In fd_insert(), use the provided tgid to get the thread group info,
instead of using the one of the current thread, as we may call
fd_insert() from a thread of another thread group, which will happen at
least when binding the listeners. Otherwise we'd end up accessing the
thread mask containing enabled threads of the wrong thread group, which
can lead to crashes if we're binding on threads not present in the
thread group.
This should fix Github issue #2991.
This should be backported up to 2.8.
The functions responsible for extracting data from the applet input buffer or
for pushing data into the applet output buffer now rely on the newly added
functions in the applet API. This simplifies the code a bit.
There were already functions to push data from the applet to the stream by
inserting it in the right buffer, depending on whether the applet was using
the legacy API or not. Here, functions to retrieve data pushed to the applet
by the stream were added:
* applet_getchar : Gets one character
* applet_getblk : Copies a full block of data
* applet_getword : Copies one text block representing a word using a
custom separator as delimiter
* applet_getline : Copies one text line
* applet_getblk_nc : Get one or two blocks of data
* applet_getword_nc: Gets one or two blocks of text representing a word
using a custom separator as delimiter
* applet_getline_nc: Gets one or two blocks of text representing a line
In this patch, some functions were added to ease input and output buffer
manipulation, regardless of whether the corresponding applet is using its own
buffers or relying on channel buffers. The following functions were added:
* applet_get_inbuf : Get the buffer containing data pushed to the applet
by the stream
* applet_get_outbuf : Get the buffer containing data pushed by the applet
to the stream
* applet_input_data : Return the amount of data in the input buffer
* applet_skip_input : Skips <len> bytes from the input buffer
* applet_reset_input: Skips all bytes from the input buffer
* applet_output_room: Returns the amount of space available in the output
buffer
* applet_need_room : Indicates that the applet has more data to deliver
and it needs more room in the output buffer to do
so
qc_alloc_ssl_sock_ctx() allocates an SSL_CTX object for each connection. It also
allocates an SSL object. When this function failed, only the SSL_CTX object was
freed. The correct way to free both of them is to call qc_free_ssl_sock_ctx().
Must be backported as far as 2.6.
If an empty argument is used in configuration, for example due to an
undefined environment variable, the rest of the line is not parsed. As
such, a warning is emitted to report this.
The warning was not totally correct as it reported the wrong argument
index. This patch fixes that. Note that there is still an issue with
the "^" indicator, but this is not as easy to fix yet.
This is related to github issue #2995.
This should be backported up to 3.2.
Hide warning about empty argument outside of discovery mode. This is
necessary, else the message would be displayed twice, which hampers the
legibility of haproxy output.
This should fix github issue #2995.
This should be backported up to 3.2.
Empty lines were not properly parsed and could lead to crashes because the
last argument was parsed outside of the cmdline buffer. Indeed, the last
argument is parsed to look for an eventual payload pattern. It is started
one character after the newline at the end of the command line. But it is
only valid for a non-empty command line.
So, now, this case is properly detected and we leave when an empty line is
detected.
This patch must be backported to 3.2.
Most fe and be counters are good candidates for being shared between
processes. They are now grouped inside "shared" struct sub member under
be_counters and fe_counters.
Now that they are properly identified, they would greatly benefit from being
shared over thread groups to reduce the cost of atomic operations when
updating them. For this, we take the current tgid into account so each
thread group only updates its own counters. For this to work, it is
mandatory that the "shared" member from {fe,be}_counters is initialized
AFTER global.nbtgroups is known, because each shared counter causes the stat
to be allocated global.nbtgroups times. When updating a counter without
concurrency, the first counter from the array may be updated.
To consult the shared counters (which requires aggregation of per-tgid
individual counters), some helper functions were added to counter.h to
ease code maintenance and avoid computing errors.
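For illustration, a simplified model of a per-thread-group counter and its
aggregating reader (layout and helper names are assumptions, not the actual
counters API; haproxy thread group ids are 1-based):

    #include <stdint.h>

    #define MAX_TGROUPS 16           /* illustrative bound */

    struct shared_counter {
        uint64_t per_tgrp[MAX_TGROUPS];
    };

    /* Updates only touch the local group's slot, so contention stays within
     * one thread group.
     */
    static inline void counter_add(struct shared_counter *c, int tgid, uint64_t v)
    {
        __atomic_add_fetch(&c->per_tgrp[tgid - 1], v, __ATOMIC_RELAXED);
    }

    /* Reading the total requires summing the per-group slots. */
    static inline uint64_t counter_read(const struct shared_counter *c,
                                        int nbtgroups)
    {
        uint64_t total = 0;
        int it;

        for (it = 0; it < nbtgroups; it++)
            total += __atomic_load_n(&c->per_tgrp[it], __ATOMIC_RELAXED);
        return total;
    }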
cps_max (max new connections received per second), sps_max (max new
sessions per second) and http.rps_max (maximum new http requests per
second) all rely on shared counters (namely conn_per_sec, sess_per_sec and
http.req_per_sec). The problem is that shared counters are about to be
distributed over thread groups, and we cannot afford to compute the
total (for all thread groups) each time we update the max counters.
Instead, since such max counters (relying on shared counters) are a very
few exceptions, let's add internal (sess,conn,req) per sec freq counters
that are dedicated to cps_max, sps_max and http.rps_max computing.
Thanks to that, related *_max counters shouldn't be negatively impacted
by the thread-group distribution, yet they will not benefit from it
either. Related internal freq counters are prefixed with "_" to emphasize
the fact that they should not be used for other purpose (the shared ones,
which are about to be distributed over thread groups in upcoming commits
are still available and must be used instead). The internal ones could
eventually be removed at any time if we find another way to compute the
{cps,sps,http.rps}_max counters.
Now that we have a common struct part between fe and be shared counters,
let's perform some cleanup to merge duplicate members into the common
struct part. This will ease code maintenance.
Proxy, listener and server shared counters are now managed via the helpers
added in one of the previous commits.
When guid is not set (ie: when not yet assigned), shared counters pointer
is allocated using calloc() (local memory) and a flag is set on the shared
counters struct to know how to manipulate (and free) it. Else if guid is
set, then it means that the counters may be shared so while for now we
don't actually use a shared memory location the API is ready for that.
The way it works, for proxies and servers (for which guid is not known
during creation), we first call counters_{fe,be}_shared_get with guid not
set, which results in local pointer being retrieved (as if we just
manually called calloc() to retrieve a pointer). Later (during postparsing)
if guid is set we try to upgrade the pointer from local to shared.
Lastly, since the memory location for some objects (proxies and servers
counters) may change from creation to postparsing, let's update
counters->last_change member directly under counters_{fe,be}_shared_get()
so we don't miss it.
No change of behavior is expected, this is only preparation work.
fe_counters_shared and be_counters_shared may share some common members
since they are quite similar, so we add a common struct part shared
between the two. struct counters_shared is added for convenience as
a generic pointer to manipulate common members from fe or be shared
counters pointer.
Also, the first common member is added: shared fe and be counters now
have a flags member.
Create include/haproxy/counters.h and src/counters.c files to anticipate
further helpers, as some counters-specific tasks need to be carried
out and, since counters are shared between multiple object types (ie:
listener, proxy, server..), we need generic helpers.
Add some shared counters helper which are not yet used but will be updated
in upcoming commits.
Shareable counters are now tagged as shared counters and are dynamically
allocated in a separate memory area as a prerequisite for being stored
in a shared memory area. For now, GUID and thread groups are not taken into
account; this is only a first step.
Also, we ensure all counters are now manipulated using atomic operations;
namely, the "last_change" counter is now read from and written to using atomic
ops.
Despite the numerous changes caused by the counters being moved away from
counters struct, no change of behavior should be expected.
As reported by Ilya in GH #2994, some cleanup parts in
sink_new_from_logger() function are not used.
We can actually simplify the cleanup logic to remove dead code, let's
do that by renaming "error_final" label to "error" and only making use
of the "error" label, because sink_free() already takes care of proper
cleanup for all sink members.
When we try to allocate a new SPOP stream, if an error is encountered,
spop_strm_destroy() is called to release the potentially allocated
stream. But it must only be called if a stream was allocated. If the
reported error is an SPOP stream allocation failure, we must just leave to
avoid a null-pointer dereference.
This patch should fix point 1 of the issue #2993. It must be backported as
far as 3.1.
These functions were copied from the channel API and modified to work with
applets using the new API or the legacy one. However, the comments were not
updated accordingly. It is the purpose of this patch.
When a healthcheck is processed, once the first wakeup passed to start the
check, and as long as the expiration timer is not reached, only I/O events
are able to wake it up. It is an issue when there is a check timeout
defined. Especially if the connect timeout is high and the check timeout is
low. In that case, the healthcheck's task is never requeued to handle any
timeout update. When the connection is established, the check timeout is set
to replace the connect timeout. It is thus possible to report a success
while a timeout should be reported.
So, now, when an I/O event is handled, the healthcheck is requeued, except if
a success or an abort is reported.
Thanks to Thierry Fournier for the report and the reproducer.
This patch must be backported to all stable versions.
In fwlc_srv_reposition(), set the server's tree_elt while we still hold
the lbprm read lock. While it was protected from concurrent
fwlc_srv_reposition() calls by the server's lb_lock, it was not protected from
dequeuing/requeuing that could occur if the server goes down/up or its
weight is changed, and that would lead to inconsistencies, and the
watchdog killing the process because it is stuck in an infinite loop in
fwlc_get_next_server().
This hopefully fixes github issue #2990.
This should be backported to 3.2.
Rename the _srv_postparse() internal function to srv_init() and group
srv_init_per_thr() plus the idle conns list init inside it. This way we can
perform some simplifications as srv_init() performs multiple server
init steps after parsing.
SRV_F_CHECKED flag was added, it is automatically set when srv_init()
runs successfully. If the flag is already set and srv_init() is called
again, nothing is done. This permits manually calling srv_init() earlier
than the default POST_CHECK hook when needed, without risking doing things
twice.
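A minimal sketch of the guard described above (flag value and structure
layout are illustrative only): the flag makes srv_init() idempotent, so
calling it early and again from the POST_CHECK hook is harmless.

    #define SRV_F_CHECKED 0x0001     /* illustrative value */

    struct server_sketch {
        unsigned int flags;
    };

    static int srv_init_sketch(struct server_sketch *srv)
    {
        if (srv->flags & SRV_F_CHECKED)
            return 0;                /* already initialized, nothing to do */

        /* ... per-thread init, idle conns lists, requeue/slowstart setup ... */

        srv->flags |= SRV_F_CHECKED;
        return 0;
    }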
While new_server() takes the parent proxy as argument and even assigns
srv->proxy to the parent proxy, it didn't actually insert the server
into the parent proxy's server list on success.
The result is that sometimes we add the server to the list after
new_server() is called, and sometimes we don't.
This is really error-prone, and because of that, hooks such as
REGISTER_POST_SERVER_CHECK(), which is run for all servers listed in
all proxies, may not be relied upon for servers which are not actually
inserted in their parent proxy server list. Plus it feels very strange
to have a server that points to a proxy, but then the proxy doesn't know
about it because it cannot find it in its server list.
To prevent errors and make proxy->srv list reliable, we move the insertion
logic directly under new_server(). This requires knowing if we are called
during parsing or at runtime, to either insert or append the server to
the parent proxy list. For that we use the PR_FL_CHECKED flag from the parent
proxy (if the flag is set, then the proxy was checked, so we are past the
init phase and assume we are called at runtime).
This implies that during startup if new_server() has to be cancelled on
error paths we need to call srv_detach() (which is now exposed in server.h)
before srv_drop().
The consequence of this commit is that REGISTER_POST_SERVER_CHECK() should
now run reliably on all servers created using new_server() (without having
to manually loop on the global servers_list).
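A simplified model of the insertion now performed inside new_server()
(list layout, flag value, and the prepend-vs-append choice per phase are
assumptions made for the example):

    #define PR_FL_CHECKED 0x01       /* illustrative value */

    struct srv_node { struct srv_node *next; };

    struct proxy_sketch {
        unsigned int flags;
        struct srv_node *srv;        /* head of the server list */
    };

    static void proxy_attach_server(struct proxy_sketch *px, struct srv_node *s)
    {
        if (!(px->flags & PR_FL_CHECKED)) {
            /* config parsing time: insert at the head */
            s->next = px->srv;
            px->srv = s;
        }
        else {
            /* runtime (dynamic server): append at the tail */
            struct srv_node **tail = &px->srv;

            while (*tail)
                tail = &(*tail)->next;
            s->next = NULL;
            *tail = s;
        }
    }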
REGISTER_POST_PROXY_CHECK() used to iterate over "main" proxies to run
registered callbacks. This means hidden proxies (and their servers) did
not get a chance to get post-checked and could cause issues if some post-
checks are expected to be executed on all proxies no matter their type.
Instead we now rely on the global proxies list. Another side effect is that
the REGISTER_POST_SERVER_CHECK() now runs as well for servers from proxies
that are not part of the main proxies list.
postcheck_log_backend() checks are executed regardless of whether the proxy
actually has the backend capability, while the checks actually depend
on it.
Let's fix that by adding an extra condition to ensure that the BE
capability is set.
This issue is not tagged as a bug because for now it remains impossible
to have a syslog proxy without BE capability in the main proxy list, but
this may change in the future.
We have the global proxies_list pointer, which is announced as the list of
"all existing proxies", but in fact it only represents regular proxies
declared in the config file through the "listen", "frontend" or "backend"
keywords.
It is ambiguous, and we currently don't have a straightforward method to
iterate over all proxies (either public or internal ones) within haproxy.
Instead we still have to manually iterate over multiple lists (main
proxies, log-forward proxies, peer proxies..) which is error-prone.
In this patch we add a struct list member (8 bytes) inside struct proxy
in order to store every proxy (except default ones) within a global
"proxies" list which is actually representative for all proxies existing
under haproxy process, like we already have for servers.
proxy_cond_disable() collects and prints cumulated connections for be and
fe proxies no matter their type. With shared stats it may cause issues
because depending on the proxy capabilities only fe or be counters may
be allocated.
In this patch we add some checks to ensure we only try to read from
valid memory locations, else we rely on default values (0).
init_srv_requeue() and init_srv_slowstart() functions are called after
initial server parsing via REGISTER_POST_SERVER_CHECK() hook, and they
are also manually called for dynamic servers after the server is
initialized.
This may conflict with _srv_postparse() which is also registered via
REGISTER_POST_SERVER_CHECK() and called during dynamic server creation.
To ensure the functions don't conflict with each other, let's ensure they
are executed in proper order by calling init_srv_requeue() and
init_srv_slowstart() from _srv_postparse() which now becomes the parent
function for server related postparsing stuff. No change of behavior is
expected.
This is no longer used since the migration to the native `haproxy -cc
'version_atleast(X)'` functionality.
see 8727614dc4046e91997ecce421bcb6a5537cac93
see 5efc48dcf1b133dd415c759e83b21d52dc303786
Introduced in:
25bcdb1d9 BUG/MAJOR: h1: Be stricter on request target validation during message parsing
see also:
fbbbc33df REGTESTS: Do not use REQUIRE_VERSION for HAProxy 2.5+
In resolve_sym_name() we declare a few symbols that we want to be able
to resolve. ha_dump_backtrace() was declared with a struct buffer instead
of a pointer to such a struct, which has no effect since we only want to
get the function's pointer, but produces a build warning with LTO, so
let's fix it.
This can be backported to 3.0.
Released version 3.2.0 with the following main changes :
- MINOR: promex: Add agent check status/code/duration metrics
- MINOR: ssl: support strict-sni in ssl-default-bind-options
- MINOR: ssl: also provide the "tls-tickets" bind option
- MINOR: server: define CLI I/O handler for "add server"
- MINOR: server: implement "add server help"
- MINOR: server: use stress mode for "add server help"
- BUG/MEDIUM: server: fix crash after duplicate GUID insertion
- BUG/MEDIUM: server: fix potential null-deref after previous fix
- MINOR: config: list recently added sections with -dKcfg
- BUG/MAJOR: cache: Crash because of wrong cache entry deleted
- DOC: configuration: fix the example in crt-store
- DOC: config: clarify the wording around single/double quotes
- DOC: config: clarify the legacy cookie and header captures
- DOC: config: fix alphabetical ordering of layer 7 sample fetch functions
- DOC: config: fix alphabetical ordering of layer 6 sample fetch functions
- DOC: config: fix alphabetical ordering of layer 5 sample fetch functions
- DOC: config: fix alphabetical ordering of layer 4 sample fetch functions
- DOC: config: fix alphabetical ordering of internal sample fetch functions
- BUG/MINOR: h3: Set HTX flags corresponding to the scheme found in the request
- BUG/MEDIUM: h3: Declare absolute URI as normalized when a :authority is found
- DOC: config: mention in bytes_in and bytes_out that they're read on input
- DOC: config: clarify the basics of ACLs (call point, multi-valued etc)
- REGTESTS: Make the script testing conditional set-var compatible with Vtest2
- REGTESTS: Explicitly allow failing shell commands in some scripts
- MINOR: listeners: Add support for a label on bind line
- BUG/MEDIUM: cli/ring: Properly handle shutdown in "show event" I/O handler
- BUG/MEDIUM: hlua: Properly detect shudowns for TCP applets based on the new API
- BUG/MEDIUM: hlua: Fix getline() for TCP applets to work with applet's buffers
- BUG/MEDIUM: hlua: Fix receive API for TCP applets to properly handle shutdowns
- CI: vtest: Rely on VTest2 to run regression tests
- CI: vtest: Fix the build script to properly work on MaOS
- CI: combine AWS-LC and AWS-LC-FIPS by template
- BUG/MEDIUM: httpclient: Throw an error if an lua httpclient instance is reused
- DOC: hlua: Add a note to warn user about httpclient object reuse
- DOC: hlua: fix a few typos in HTTPMessage.set_body_len() documentation
- DEV: patchbot: prepare for new version 3.3-dev
- MINOR: version: mention that it's 3.2 LTS now.
A few typos were noticed while gathering info for the 3.2 announce
messages, this fixes them, and will probably constitute the last
commit of this release. There's no need to backport it unless commit
94055a5e7 ("MEDIUM: hlua: Add function to change the body length of
an HTTP Message") is backported.
It is not supported to reuse an lua httpclient instance to process several
requests. A new object must be created for each request. Thanks to the
previous patch ("BUG/MEDIUM: httpclient: Throw an error if an lua httpclient
instance is reused"), an error is now reported if this happens. But it is
not obvious for users. So the lua-api documentation was updated accordingly.
This patch is related to issue #2986. It should be backported with the
commit above.
It is not expected/supported to reuse an httpclient instance to process
several requests. A new instance must be created for each request. However,
in lua, there is nothing to prevent a user to create an httpclient object
and use it in a loop to process requests.
That's unfortunate because this will apparently work, the requests will be
sent and a response will be received and processed. However internally some
resources will be allocated and never released. When the next response is
processed, the resources allocated for the previous one are definitively
lost.
In this patch we take care to check that the httpclient object was never
used when a request is sent from a lua script by checking
HTTPCLIENT_FS_STARTED flags. This flag is set when a httpclient applet is
spawned to process a request and never removed after that. In lua, the
httpclient applet is created when the request is sent. So, it is the right
place to do this test.
This patch should fix the issue #2986. It should be backported as far as
2.6.
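A minimal sketch of the guard (structure and return convention are
illustrative; HTTPCLIENT_FS_STARTED is the flag named above):

    #define HTTPCLIENT_FS_STARTED 0x01   /* illustrative value */

    struct httpclient_sketch {
        unsigned int flags;
    };

    /* Refuse to send a request from an httpclient object which was already
     * used once; a new object must be created for each request. The flag is
     * set when the applet is spawned and never cleared afterwards.
     */
    static int httpclient_send_check(struct httpclient_sketch *hc)
    {
        if (hc->flags & HTTPCLIENT_FS_STARTED)
            return -1;
        hc->flags |= HTTPCLIENT_FS_STARTED;
        return 0;
    }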
VTest2 (https://github.com/vtest/VTest2) was released and is a replacement
for VTest. VTest was archived. So let's use the new version now.
If this commit is backported, the 2 following commits must also be
backported:
* 2808e3577 ("REGTESTS: Explicitly allow failing shell commands in some scripts")
* 82c291124 ("REGTESTS: Make the script testing conditional set-var compatible with Vtest2")
An optional timeout was added to AppletTCP.receive() to interrupt calls after a
delay. It was mandatory to be able to implement interactive applets (like
trisdemo). However, this broke the API and made it impossible to differentiate
shutdowns from delay expirations. Indeed, in both cases, an empty
string was returned.
Because historically an empty string was used to notify a connection shutdown,
it should not be changed. So now, 'nil' value is returned when no data was
available before the delay expiration.
The new AppletTCP:try_receive() function was also affected. To fix it, instead
of stating there is no delay when a receive is tried, an expired delay is
set. Concretely TICK_ETERNITY was replaced by now_ms.
Finally, AppletTCP:getline() function is not concerned for now because there
is no way to interrupt it after some delay.
The documentation and trisdemo lua script were updated accordingly.
This patch depends on "BUG/MEDIUM: hlua: Properly detect shudowns for TCP
applets based on the new API". However, it is a 3.2-specific issue, so no
backport is needed.
The commit e5e36ce09 ("BUG/MEDIUM: hlua/cli: Fix lua CLI commands to work
with applet's buffers") fixed the TCP applets API to work with applets using
its own buffers. Howver the getline() function was not updated. It could be
an issue for anyone registering a CLI commands reading lines.
This patch should be backported as far as 3.0.
The internal function responsible for receiving data for TCP applets with
internal buffers is buggy. Indeed, for these applets, the buffer API is used
to get data. So there are no tests on the SE to properly detect connection
shutdowns. So, it must be performed by hand after the call to b_getblk_nc().
This patch must be backported as far as 3.0.
The commit 03dc54d802 ("BUG/MINOR: ring: Fix I/O handler of "show event"
command to not rely on the SC") introduced a regression. By removing
dependencies on the SC, a test to detect client shutdowns was removed. So
now, the CLI applet is no longer released when the client shuts the
connection during a "show event -w".
So of course, we should not use the SC to detect the shutdowns. But the SE
must be used instead.
It is a 3.2-specific issue, so no backport needed.
It is now possible to set a label on a bind line. All sockets attached to
this bind line inherit this label. The idea is to be able to group
sockets. For now, there is no mechanism to create these groups; this must be
done by hand.
VTest2, which should replace VTest in a few months, will reject any failing
command in shell blocks. However, some scripts execute commands while
expecting an error, in order to parse the error output. So now "set +e" is
used in those scripts to explicitly state that failing commands are expected.
It is just used for non-final commands. At the end, the shell block must
still report a success.
VTest2 will replace VTest in a few months. Not many changes are
expected. One of them is that a User-Agent header is added by default to all
requests, except if a custom one is already set or if the "-nouseragent" option
is used. To still be compatible with VTest, it is not possible to use the
option to avoid the header addition. So, a custom user-agent is added in the
last test of "sample_fetches/cond_set_var.vtc" to be sure it will pass with
Vtest and Vtest2. It is mandatory because the request length is tested.
This is essentially in order to address the concerns expressed in
issue #2226 where it is mentioned that the moment they are called is
not clear enough. Admittedly, re-reading the paragraph doesn't make
it obvious on a quick read that they behave like functions. This patch
adds an extra paragraph that makes the parallel with programming
languages' boolean functions and explains the fact that they can be
multi-valued. Hoping this is clearer now.
Issue #2267 suggests that it's unclear what exactly the byte counts mean
(particularly when compression is involved). Let's clarify that the counts
are read on data input and that they also cover headers and a bit of
internal overhead.
Since commit 2c3d656f8 ("MEDIUM: h3: use absolute URI form with
:authority"), the absolute URI form is used when a ':authority'
pseudo-header is found. However, this URI was not declared as normalized
internally. So, when the request is reformatted to be sent to an h1 server,
the absolute-form is used instead of the origin-form. It is unexpected and
may be an issue for some servers that could reject the request.
So, now, we take care to set HTX_SL_F_HAS_AUTHORITY flag on the HTX message
when an authority was found and HTX_SL_F_NORMALIZED_URI flag is set for
"http" or "https" schemes.
No backport needed because the commit above must not be backported. It
should fix a regression reported on the 3.2-dev17 in issue #2977.
This commit depends on "BUG/MINOR: h3: Set HTX flags corresponding to the
scheme found in the request".
When a ":scheme" pseudo-header is found in a h3 request, the
HTX_SL_F_HAS_SCHM flag must be set on the HTX message. And if the scheme is
'http' or 'https', the corresponding HTX flag must also be set. So,
respectively, HTX_SL_F_SCHM_HTTP or HTX_SL_F_SCHM_HTTPS.
It is mainly used to send the right ":scheme" pseudo-header value to H2
server on backend side.
This patch could be backported as far as 2.6.
As reported in issue #2195, cookie captures and header captures are no
longer the recommended way to proceed. Let's mention that this is the
legacy way and provide a few pointers to the recommended functions and
actions to use the modern methods.
As reported in issue #2327, the wording used in the section about quoting
can be read two ways due to the use of the two types of quotes to protect
each other. Better to only use one type of quoting without mixing the two when
mentioning them.
Fix a bad example in the crt-store section. site1 does not use the "web"
crt-store but the global one.
Must be backported as far as 3.0; note that the section was numbered 3.12 in
previous versions.
When "vary" is enabled, we can have multiple entries for a given primary
key in the cache tree. There is a limit to how many secondary entries
can be inserted for a given key. When we try to insert a new secondary
entry, if the limit is already reached, we can try to find expired
entries with the same primary key, and if the limit is still reached we
want to abort the current insertion and to remove the node that was just
inserted.
In commit "a29b073: MEDIUM: cache: Add refcount on cache_entry" though,
a regression was introduced. Instead of removing the entry just inserted
as the comments suggested, we removed the second to last entry and
returned NULL. We then reset the eb.key of the cache_entry in the caller
because we assumed that the entry was already removed from the tree.
This means that some entries with an empty key were wrongly kept in the
tree and the last secondary entry, which keeps track of the number of secondary
entries of a given key, was removed.
This ended up causing some crashes later on when we tried to iterate
over the elements of this given key. The crash could occur in multiple
places, either when trying to retrieve an entry or to add some new ones.
This crash was raised in GitHub issue #2950.
The fix should be backported up to 3.0.
A valid build warning was reported in the CI with latest commit b40ce97ecc
("BUG/MEDIUM: server: fix crash after duplicate GUID insertion"). Indeed,
if the first test in the function fails, we branch to the err label
with guid==NULL and will crash there. Let's just test guid before
dereferencing it for freeing.
This needs to be backported to 3.0 as well since the commit above was
meant to go there.
On "add server", if a GUID is defined, guid_insert() is used to add the
entry into the global GUID tree. If a similar entry already exists, GUID
insertion fails and the server creation is eventually aborted.
A crash could occur in this case because of an invalid memory access via
guid_remove(). The latter is called via free_server() as the server
insertion is rejected. The invalid access occurs on the GUID key.
The issue occurs because of guid_insert(). The function properly
deallocates the GUID key on duplicate insertion, but it fails to reset
<guid.node.key> to NULL. This caused the invalid memory access in
guid_remove(). To fix this, ensure that the key member is properly reset
on the guid_insert() error path.
This must be backported up to 3.0.
Implement stress mode on "add server help". This ensures that the
command is fully reentrant on full output buffer.
For testing, it requires compilation with USE_STRESS and global setting
"stress-level 1".
Implement "help" as a sub-command for "add server" CLI. The objective is
to list all the keywords that are supported for dynamic servers. CLI IO
handler and add_srv_ctx are used to support reentrancy on full output
buffer.
Now that this command is implemented, the outdated keyword list on "add
server" from management documentation can be removed.
Extend "add server" to support an IO handler function named
cli_io_handler_add_server(). A context object is also defined whose
usage will depend on IO handler capabilities.
IO handler is skipped when "add server" is run in default mode, i.e. on
a dynamic server creation. Thus, currently IO handler is unneeded.
However, it will become useful to support sub-commands for "add server".
Note that return value of "add server" parser has been changed on server
creation success. Previously, it was used incorrectly to report if
server was inserted or not. In fact, parser return value is used by CLI
generic code to detect if command processing has been completed, or
should continue to the IO handler. Now, "add server" always returns 1 to
signal that CLI processing is completed. This is necessary to preserve
CLI output emitted by parser, even now that IO handler is defined for
the command. Previously, output was emitted in every situation because no
IO handler was defined. See the code snippet below from cli.c for a better
overview:
    if (kw->parse && kw->parse(args, payload, appctx, kw->private) != 0) {
        ret = 1;
        goto fail;
    }

    /* kw->parse could set its own io_handler or io_release handler */
    if (!appctx->cli_ctx.io_handler) {
        ret = 1;
        goto fail;
    }

    appctx->st0 = CLI_ST_CALLBACK;
    ret = 1;
    goto end;
Currently there is "no-tls-tickets" that is also supported in the
ssl-default-bind-options directive, but there's no way to re-enable
them on a specific "bind" line. This patch simply provides the option
to re-enable them. Note that the flag is inverted because tickets are
enabled by default and the "no-tls-tickets" option sets the flag to
disable them.
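A configuration sketch of the intended usage, assuming the new keyword is the
positive form "tls-tickets" (addresses and certificate paths are placeholders):

    global
        ssl-default-bind-options no-tls-tickets

    frontend fe
        # tickets disabled by default above, re-enabled on this line only
        bind :8443 ssl crt /etc/haproxy/site.pem tls-tickets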
Several users already reported that it would be nice to support
strict-sni in ssl-default-bind-options. However, in order to support
it, we also need an option to disable it.
This patch moves the setting of the option from the strict_sni field
to a flag in the ssl_options field so that it can be inherited from
the default bind options, and adds a new "no-strict-sni" directive to
allow to disable it on a specific "bind" line.
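For example (a configuration sketch; addresses and certificate paths are
placeholders):

    global
        ssl-default-bind-options strict-sni

    frontend fe
        # inherits strict-sni from the default bind options
        bind :8443 ssl crt /etc/haproxy/certs/
        # explicitly relaxes it on this specific line
        bind :8444 ssl crt /etc/haproxy/certs/ no-strict-sni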
The test file "del_ssl_crt-list.vtc" which already tests both options
was updated to make use of the default option and the no- variant to
confirm everything continues to work.
In the Prometheus exporter, the last health check status is already exposed,
with its code and duration in seconds. The server status is also exposed.
But the information about the agent check is not available. It is not
really handy because when a server status is changed because of the agent,
it is not obvious by looking at the Prometheus metrics. Indeed, the server
may be reported as DOWN for instance, while the health check status still
reports a success. Being able to get the agent status in that case could be
valuable.
So now, the last agent check status is exposed, with its code and duration
in seconds. The following metrics can now be grabbed:
* haproxy_server_agent_status
* haproxy_server_agent_code
* haproxy_server_agent_duration_seconds
Note that unlike the other metrics, no per-backend aggregated metric is
exposed.
This patch is related to issue #2983.
Released version 3.2-dev17 with the following main changes :
- DOC: configuration: explicit multi-choice on bind shards option
- BUG/MINOR: sink: detect and warn when using "send-proxy" options with ring servers
- BUG/MEDIUM: peers: also limit the number of incoming updates
- MEDIUM: hlua: Add function to change the body length of an HTTP Message
- BUG/MEDIUM: stconn: Disable 0-copy forwarding for filters altering the payload
- BUG/MINOR: h3: don't insert more than one Host header
- BUG/MEDIUM: h1/h2/h3: reject forbidden chars in the Host header field
- DOC: config: properly index "table and "stick-table" in their section
- DOC: management: change reference to configuration manual
- BUILD: debug: mark ha_crash_now() as attribute(noreturn)
- IMPORT: slz: avoid multiple shifts on 64-bits
- IMPORT: slz: support crc32c for lookup hash on sse4 but only if requested
- IMPORT: slz: use a better hash for machines with a fast multiply
- IMPORT: slz: fix header used for empty zlib message
- IMPORT: slz: silence a build warning on non-x86 non-arm
- BUG/MAJOR: leastconn: do not loop forever when facing saturated servers
- BUG/MAJOR: queue: properly keep count of the queue length
- BUG/MINOR: quic: fix crash on quic_conn alloc failure
- BUG/MAJOR: leastconn: never reuse the node after dropping the lock
- MINOR: acme: renewal notification over the dpapi sink
- CLEANUP: quic: Useless BIO_METHOD initialization
- MINOR: quic: Add useful error traces about qc_ssl_sess_init() failures
- MINOR: quic: Allow the use of the new OpenSSL 3.5.0 QUIC TLS API (to be completed)
- MINOR: quic: implement all remaining callbacks for OpenSSL 3.5 QUIC API
- MINOR: quic: OpenSSL 3.5 internal QUIC custom extension for transport parameters reset
- MINOR: quic: OpenSSL 3.5 trick to support 0-RTT
- DOC: update INSTALL for QUIC with OpenSSL 3.5 usages
- DOC: management: update 'acme status'
- BUG/MEDIUM: wdt: always ignore the first watchdog wakeup
- CLEANUP: wdt: clarify the comments on the common exit path
- BUILD: ssl: avoid possible printf format warning in traces
- BUILD: acme: fix build issue on 32-bit archs with 64-bit time_t
- DOC: management: precise some of the fields of "show servers conn"
- BUG/MEDIUM: mux-quic: fix BUG_ON() on rxbuf alloc error
- DOC: watchdog: update the doc to reflect the recent changes
- BUG/MEDIUM: acme: check if acme domains are configured
- BUG/MINOR: acme: fix formatting issue in error and logs
- EXAMPLES: lua: avoid screen refresh effect in "trisdemo"
- CLEANUP: quic: remove unused cbuf module
- MINOR: quic: move function to check stream type in utils
- MINOR: quic: refactor handling of streams after MUX release
- MINOR: quic: add some missing includes
- MINOR: quic: adjust quic_conn-t.h include list
- CLEANUP: cfgparse: alphabetically sort the global keywords
- MINOR: glitches: add global setting "tune.glitches.kill.cpu-usage"
It was mentioned during the development of glitches that it would be
nice to support not killing misbehaving connections below a certain
CPU usage so that poor implementations that routinely misbehave without
impact are not killed. This is now possible by setting a CPU usage
threshold under which we don't kill them via this parameter. It defaults
to zero so that we continue to kill them by default.
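For example, to only start killing such connections once the process CPU usage
reaches 30% (the value is only illustrative):

    global
        tune.glitches.kill.cpu-usage 30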
Adjust the include list in quic_conn-t.h. This file is included in many QUIC
source files, so it is useful to keep it as lightweight as possible. Note that
connection/QUIC MUX types are turned into forward declarations for better
layer separation.
Insert some missing include statements in QUIC source files. This was
detected after the next commit, which adjusts the include list used in the
quic_conn-t.h file.
quic-conn layer has to handle itself STREAM frames after MUX release. If
the stream was already seen, it is probably only a retransmitted frame
which can be safely ignored. For other streams, an active closure may be
needed.
Thus it's necessary that quic-conn layer knows the highest stream ID
already handled by the MUX after its release. Previously, this was done
via <nb_streams> member array in quic-conn structure.
Refactor this by replacing <nb_streams> by two members called
<stream_max_uni>/<stream_max_bidi>. Indeed, it is unnecessary for
quic-conn layer to monitor locally opened uni streams, as the peer
cannot by definition emit a STREAM frame on it. Also, bidirectional
streams are always opened by the remote side.
Previously, <nb_streams> were set by quic-stream layer. Now,
<stream_max_uni>/<stream_max_bidi> members are only set one time, just
prior to QUIC MUX release. This is sufficient as quic-conn does not use
them if the MUX is available.
Note that previously, IDs were used relatively to their type, thus
incremented by 1, after shifting the original value. For simplification,
use the plain stream ID, which is incremented by 4.
Move general function to check if a stream is uni or bidirectional from
QUIC MUX to quic_utils module. This should prevent unnecessary include
of QUIC MUX header file in other sources.
In current version of the game, there is a "screen refresh" effect: the
screen is cleared before being re-drawn.
I moved the clear right after the connection is opened and removed it
from rendering time.
Stop emitting \n in errmsg for intermediate error messages; this was
emitting multiline logs and was returning to a new line in the middle of
sentences.
We don't need to emit them in acme_start_task() since the errmsg is
output in a send_log which already contains a \n, or on the CLI which
also emits it.
When starting the ACME task with a ckch_conf which does not contain the
domains, the ACME task would segfault because it would try to dereference
a NULL pointer in this case.
The patch fixes the issue by emitting a warning when no domains are
configured. It's not done at configuration parsing time because it is not
easy to emit the warning there: there is no callback system which
gives access to the whole ckch_conf once a line is parsed.
No backport needed.
RX buffer allocation has been reworked in current dev tree. The
objective is to support multiple buffers per QCS to improve upload
throughput.
RX buffer allocation failure is handled simply : the whole connection is
closed. This is done via qcc_set_error(), with INTERNAL_ERROR as error
code. This function contains a BUG_ON() to ensure it is called only one
time per connection instance.
On RX buffer alloc failure, the aforementioned BUG_ON() crashes due to a
double invocation of qcc_set_error(). First by qcs_get_rxbuf(), and
immediately after it by qcc_recv(), which is the caller of the previous
one. This regression was introduced by the following commit.
60f64449fbba7bb6e351e8343741bb3c960a2e6d
MAJOR: mux-quic: support multiple QCS RX buffers
To fix this, simply remove qcc_set_error() invocation in
qcs_get_rxbuf(). On buffer alloc failure, qcc_recv() is responsible for
setting the error.
This does not need to be backported.
As reported in issue #2970, the output of "show servers conn" is not
clear. It was essentially meant as a debugging tool during some changes
to idle connections management, but if some users want to monitor or
graph them, more info is needed. The doc mentions the currently known
list of fields, and reminds that this output is not meant to be stable
over time, but as long as it does not change, it can provide some useful
metrics to some users.
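For example (the socket path is only a placeholder):

    $ echo "show servers conn" | socat /var/run/haproxy.sock -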
The build failed on mips32 with a 64-bit time_t here:
https://github.com/haproxy/haproxy/actions/runs/15150389164/job/42595310111
Let's just turn the "remain" variable used to show the remaining time
into a more portable ullong and use %llu for all format specifiers,
since long remains limited to 32-bit on 32-bit archs.
No backport needed.
When building on MIPS-32 with gcc-9.5 and glibc-2.31, I got this:
src/ssl_trace.c: In function 'ssl_trace':
src/ssl_trace.c:118:42: warning: format '%ld' expects argument of type 'long int', but argument 3 has type 'ssize_t' {aka 'const int'} [-Wformat=]
118 | chunk_appendf(&trace_buf, " : size=%ld", *size);
| ~~^ ~~~~~
| | |
| | ssize_t {aka const int}
| long int
| %d
Let's just cast the type. No backport needed.
With commit a06c215f08 ("MEDIUM: wdt: always make the faulty thread
report its own warnings"), when the TH_FL_STUCK flag was flipped on,
we'd then go to the panic code instead of giving a second chance like
before the commit. This can trigger rare cases that only happen with
moderate loads like was addressed by commit 24ce001771 ("BUG/MEDIUM:
wdt: fix the stuck detection for warnings"). This is in fact due to
the loss of the common "goto update_and_leave" that used to serve
both the warning code and the flag setting for probation, and it's
apparently what hit Christian in issue #2980.
Let's make sure we exit naturally when turning the bit on for the
first time. Let's also update the confusing comment at the end of
the check that was left over by latest change.
Since the first commit was backported to 3.1, this commit should be
backported there as well.
For an unidentified reason, SSL_do_handshake() succeeds at its first call when 0-RTT
is enabled for the connection. This behavior looks very similar to the one encountered
with the AWS-LC stack. That said, it was documented by AWS-LC. This issue leads the
connection to stop sending handshake packets after having released the handshake
encryption level. In fact, no handshake packets could even be sent, leading
the handshake to always fail.
To fix this, this patch simulates a "handshake in progress" state waiting
for the application level read secret to be established by the TLS stack.
This may happen only after the QUIC listener has completed/confirmed the handshake
upon handshake CRYPTO data receipt from the peer.
A QUIC connection must send its transport parameters using a TLS custom extension.
This extension is reset by SSL_set_SSL_CTX(). It can be restored by calling
quic_ssl_set_tls_cbs() (which calls SSL_set_quic_tls_cbs()).
The quic_conn struct is modified for two reasons. The first one is to store
the encoded version of the local transport parameters, as is done for
USE_QUIC_OPENSSL_COMPAT. Indeed, the local transport parameters "should remain
valid until after the parameters have been sent" as mentioned by the
SSL_set_quic_tls_cbs(3) manual. In our case, the buffer is a static buffer
attached to the quic_conn object. qc_ssl_set_quic_transport_params() is the
function whose role is to call SSL_set_tls_quic_transport_params() (aliased by
SSL_set_quic_transport_params()) to set these local transport parameters into
the TLS stack from the buffer attached to the quic_conn struct.
The second quic_conn struct modification is the addition of the new ->prot_level
(SSL protection level) member added to the quic_conn struct to store "the most
recent write encryption level set via the OSSL_FUNC_SSL_QUIC_TLS_yield_secret_fn
callback (if it has been called)" as mentioned by the SSL_set_quic_tls_cbs(3) manual.
This patch finally implements the five remaining callbacks to make the haproxy
QUIC implementation work.
OSSL_FUNC_SSL_QUIC_TLS_crypto_send_fn() (ha_quic_ossl_crypto_send) is easy to
implement. It calls ha_quic_add_handshake_data() after having converted
qc->prot_level TLS protection level value to the correct ssl_encryption_level_t
(boringSSL API/quictls) value.
OSSL_FUNC_SSL_QUIC_TLS_crypto_recv_rcd_fn() (ha_quic_ossl_crypto_recv_rcd())
provides the non-contiguous addresses to the TLS stack, without releasing
them.
OSSL_FUNC_SSL_QUIC_TLS_crypto_release_rcd_fn() (ha_quic_ossl_crypto_release_rcd())
releases these non-contiguous buffers, relying on the fact that the list of
encryption levels (qc->qel_list) is correctly ordered by SSL protection level
secrets establishment order (by the TLS stack).
OSSL_FUNC_SSL_QUIC_TLS_yield_secret_fn() (ha_quic_ossl_yield_secret())
is a simple wrapping function over ha_quic_set_encryption_secrets() which is used
by boringSSL/quictls API.
OSSL_FUNC_SSL_QUIC_TLS_got_transport_params_fn() (ha_quic_ossl_got_transport_params())
role is to store the peer received transport parameters. It simply calls
quic_transport_params_store() and set them into the TLS stack calling
qc_ssl_set_quic_transport_params().
Also add some comments for all the OpenSSL 3.5 QUIC API callbacks.
This patch has no impact on the other uses of the QUIC API provided by the other
TLS stacks.
This patch allows the use of the new OpenSSL 3.5.0 QUIC TLS API when it is
available and detected at compilation time. The detection relies on the presence of the
OSSL_FUNC_SSL_QUIC_TLS_CRYPTO_SEND macro from openssl-compat.h. Indeed this
macro is defined by OpenSSL since 3.5.0 version. It is not defined by quictls.
This helps in distinguishing these two TLS stacks. When the detection succeeds,
HAVE_OPENSSL_QUIC is also defined by openssl-compat.h. Then, it is this new macro
which is used to detect the availability of the new OpenSSL 3.5.0 QUIC TLS API.
Note that this detection is done only if USE_QUIC_OPENSSL_COMPAT is not asked.
So, USE_QUIC_OPENSSL_COMPAT and HAVE_OPENSSL_QUIC are exclusive.
At the same location, in openssl-compat.h, the ssl_encryption_level_t enum is
defined. This enum was defined by quictls and is extensively used by the haproxy
QUIC implementation. SSL_set_quic_transport_params() is replaced by
SSL_set_quic_tls_transport_params(). SSL_set_quic_early_data_enabled() (quictls) is also replaced
by SSL_set_quic_tls_early_data_enabled() (OpenSSL). SSL_quic_read_level() (quictls)
is not defined by OpenSSL. It is only used by the traces to log the current
TLS stack decryption level (read). A macro makes it return -1, which is an
unused value.
Most of the differences between the quictls and OpenSSL QUIC APIs are in quic_ssl.c
where some callbacks must be defined for these two APIs. This is why this
patch modifies quic_ssl.c to define an array of OSSL_DISPATCH structs: <ha_quic_dispatch>.
Each element of this array defines a callback. So, this patch implements these
six callbacks:
- ha_quic_ossl_crypto_send()
- ha_quic_ossl_crypto_recv_rcd()
- ha_quic_ossl_crypto_release_rcd()
- ha_quic_ossl_yield_secret()
- ha_quic_ossl_got_transport_params() and
- ha_quic_ossl_alert().
But at this time, these implementations, which must return an int, return 0, interpreted
as a failure by the OpenSSL QUIC API, except for ha_quic_ossl_alert() which
is implemented the same way as for quictls. The five remaining functions above
will be implemented by the next patches to come.
ha_quic_set_encryption_secrets() and ha_quic_add_handshake_data() have been moved
to be defined for both quictls and OpenSSL QUIC API.
These callbacks are attached to the SSL objects (sessions) by calling the new
qc_ssl_set_cbs() function. The latter calls the correct function to attach the
correct callbacks to the SSL objects (defined by <ha_quic_method> for quictls,
and <ha_quic_dispatch> for OpenSSL).
The calls to SSL_provide_quic_data() and SSL_process_quic_post_handshake()
have also been disabled. These functions are not defined by the OpenSSL QUIC API.
At this time, the functions which call them are still defined when HAVE_OPENSSL_QUIC
is defined.
There were no traces to diagnose qc_ssl_sess_init() failures from QUIC traces.
This patch adds calls to TRACE_DEVEL() into qc_ssl_sess_init() and its caller
(qc_alloc_ssl_sock_ctx()). This was useful at least to diagnose SSL context
initialization failures when porting QUIC to the new OpenSSL 3.5 QUIC API.
Should be easily backported as far as 2.6.
This code has been there since the start of the QUIC implementation. It was supposed
to initialize <ha_quic_meth> as a static BIO_METHOD object. But this
BIO_METHOD is not used at all!
Should be backported as far as 2.6 to help integrate the next patches to come.
Output a sink message when the certificate was renewed by the ACME
client.
The message is emitted on the "dpapi" sink, and ends with \n\0.
Since the message contains this binary character, the right -0 parameter
must be used when consulting the sink over the CLI:
Example:
$ echo "show events dpapi -nw -0" | socat -t9999 /tmp/haproxy.sock -
<0>2025-05-19T15:56:23.059755+02:00 acme newcert foobar.pem.rsa\n\0
When used with the master CLI, @@1 should be used instead of @1 in order
to keep the connection to the worker.
Example:
$ echo "@@1 show events dpapi -nw -0" | socat -t9999 /tmp/master.sock -
<0>2025-05-19T15:56:23.059755+02:00 acme newcert foobar.pem.rsa\n\0
On ARM with 80 cores and a single server, it's sometimes possible to see
a segfault in fwlc_get_next_server() around 600-700k RPS. It seldom
happens as well on x86 with 128 threads with the same config around 1M
rps. It turns out that in fwlc_get_next_server(), before calling
fwlc_srv_reposition(), we have to drop the lock and that one takes it
back again.
The problem is that anything can happen to our node during this time,
and it can be freed. Then when continuing our work, we later iterate
over it and its next to find a node with an acceptable key, and by
doing so we can visit either uninitialized memory or simply nodes that
are no longer in the tree.
A first attempt at fixing this consisted in artificially incrementing
the elements count before dropping the lock, but that turned out to be
even worse because other threads could loop forever on such an element
looking for an entry that does not exist. Maintaining a separate
refcount didn't work well either, and it required to deal with the
memory release while dropping it, which is really not convenient.
Here we're taking a different approach consisting in simply not
trusting this node anymore and going back to the beginning of the
loop, as is done at a few other places as well. This way we can
safely ignore the possibly released node, and the test runs reliably
both on the arm and the x86 platforms mentioned above. No performance
regression was observed either, likely because this operation is quite
rare.
No backport is needed since this appeared with the leastconn rework
in 3.2.
If there is an alloc failure during qc_new_conn(), cleaning is done via
quic_conn_release(). However, since the below commit, an unchecked
dereferencing of <qc.path> is performed in the latter.
e841164a4402118bd7b2e2dc2b5068f21de5d9d2
MINOR: quic: account for global congestion window
To fix this, simply check <qc.path> before dereferencing it in
quic_conn_release(). This is safe as it is properly initialized to NULL
on qc_new_conn() first stage.
This does not need to be backported.
The queue length was moved to its own variable in commit 583303c48
("MINOR: proxies/servers: Calculate queueslength and use it."), however a
few places were missed in pendconn_unlink() and assign_server_and_queue()
resulting in never decreasing counts on aborted streams. This was
reproduced when injecting more connections than the total backend
could stand in TCP mode and letting some of them time out in the
queue. No backport is needed, this is only 3.2.
Since commit 9fe72bba3 ("MAJOR: leastconn; Revamp the way servers are
ordered."), there's no way to escape the loop visiting the mt_list heads
in fwlc_get_next_server if all servers in the list are saturated,
resulting in a watchdog panic. It can be reproduced with this config
and injecting with more than 2 concurrent conns:
balance leastconn
server s1 127.0.0.1:8000 maxconn 1
server s2 127.0.0.1:8000 maxconn 1
Here we count the number of saturated servers that were encountered, and
escape the loop once the number of remaining servers exceeds the number
of saturated ones. No backport is needed since this arrived in 3.2.
Building with clang 16 on MIPS64 yields this warning:
src/slz.c:931:24: warning: unused function 'crc32_uint32' [-Wunused-function]
static inline uint32_t crc32_uint32(uint32_t data)
^
Let's guard it using UNALIGNED_LE_OK which is the only case where it's
used. This saves us from introducing a possibly non-portable attribute.
This is libslz upstream commit f5727531dba8906842cb91a75c1ffa85685a6421.
Calling slz_rfc1950_finish() without emitting any data would result in
incorrectly emitting a gzip header (rfc1952) instead of a zlib header
(rfc1950) due to a copy-paste between the two wrappers. The impact is
almost nonexistent since the zlib format is almost never used in this
context, and compressing totally empty messages is quite rare as well.
Let's take this opportunity for fixing another mistake on an RFC number
in a comment.
This is slz upstream commit 7f3fce4f33e8c2f5e1051a32a6bca58e32d4f818.
The current hash involves 3 simple shifts and additions so that it can
be mapped to a multiply on architectures having a fast multiply. This is
indeed what the compiler does on x86_64. A large range of values was
scanned to try to find more optimal factors on machines supporting such
a fast multiply, and it turned out that new factor 0x1af42f resulted in
smoother hashes that provided on average 0.4% better compression on both
the Silesia corpus and an mbox file composed of very compressible emails
and uncompressible attachments. It's even slightly better than CRC32C
while being faster on Skylake. This patch enables this factor on archs
with a fast multiply.
This is slz upstream commit 82ad1e75c13245a835c1c09764c89f2f6e8e2a40.
If building for sse4 and USE_CRC32C_HASH is defined, then we can use
crc32c to calculate the lookup hash. By default we don't do it because
even on skylake it's slower than the current hash, which only involves
a short multiply (~5% slower). But the gains are marginal (0.3%).
This is slz upstream commit 44ae4f3f85eb275adba5844d067d281e727d8850.
Note: this is not used by default and only merged in order to avoid
divergence between the code bases.
On 64-bit platforms, disassembling the code shows that send_huff() performs
a left shift followed by a right one, which are the result of integer
truncation and zero-extension caused solely by using different types at
different levels in the call chain. By making encode24() take a 64-bit
int on input and send_huff() take one optionally, we can remove one shift
in the hot path and gain 1% performance without affecting other platforms.
This is slz upstream commit fd165b36c4621579c5305cf3bb3a7f5410d3720b.
Building on MIPS64 with clang16 incorrectly reports some uninitialized
value warnings in stats-proxy.c due to some calls to ABORT_NOW() where
the compiler didn't know the code wouldn't return. Let's properly mark
the function as noreturn, and take this opportunity for also marking it
unused to avoid possible warnings depending on the build options (if
ABORT_NOW is not used). No backport needed though it will not harm.
Since e24b77e7 ('DOC: config: move the extraneous sections out of the
"global" definition') the ACME section of the configuration manual was
moved from 3.13 to 12.8.
Change the reference to that section in "acme renew".
Tim reported in issue #2953 that "stick-table" and "table" were not
indexed as keywords. The issue was the indent level. Also let's make
sure to put a box around the "store" arguments as well.
In continuation with 9a05c1f574 ("BUG/MEDIUM: h2/h3: reject some
forbidden chars in :authority before reassembly") and the discussion
in issue #2941, @DemiMarie rightfully suggested that Host should also
be sanitized, because it is sometimes used in concatenation, such as
this:
http-request set-url https://%[req.hdr(host)]%[pathq]
which was proposed as a workaround for h2 upstream servers that require
:authority here:
https://www.mail-archive.com/haproxy@formilux.org/msg43261.html
The current patch then adds the same check for forbidden chars in the
Host header, using the same function as for the patch above, since in
both cases we validate the host:port part of the authority. This way
we won't reconstruct ambiguous URIs by concatenating Host and path.
Just like the patch above, this can be backported after a period of
observation.
Let's make sure we drop extraneous Host headers after having compared
them. That also works when :authority was already present. This way,
like for h1 and h2, we only keep one copy of it, while still making
sure that Host matches :authority. This way, if a request has both
:authority and Host, only one Host header will be produced (from
:authority). Note that due to the different organization of the code
and wording along the evolving RFCs, here we also check that all
duplicates are identical, while h2 ignores them as per RFC7540, but
this will be re-unified later.
This should be backported to stable versions, at least 2.8, though
thanks to the existing checks the impact is probably null.
It is especially a problem with Lua filters, but it is important to disable
the 0-copy forwarding if a filter alters the payload, or at least to be able
to disable it. While the filter is registered on data filtering, it is
not an issue (and it is the common case) because there is no way to
fast-forward data at all. But it may be an issue if a filter decides to
alter the payload and to unregister from data filtering. In that case, the
0-copy forwarding can be re-enabled in a hardly predictable state.
To fix the issue, an SC flag was added to do so. The HTTP compression filter
sets it, and lua filters do too if the body length is changed (via
HTTPMessage.set_body_len()).
Note that this is an issue because of a bad design around the HTX. Much info
about the message is stored in the HTX structure itself. It must be
refactored to move several pieces of info to the stream-endpoint descriptor.
This should ease modifications at the stream level, from filters or TCP/HTTP
rules.
This should be backported as far as 3.0. If necessary, it may be backported
on lower versions, as far as 2.6. In that case, it must be reviewed and
adapted.
There was no function for a lua filter to change the body length of an HTTP
Message. But it is mandatory to be able to alter the message payload. It is
not possible to directly update the message headers because the
internal state of the message must also be updated accordingly.
This is the purpose of the HTTPMessage.set_body_len() function. The new body
length must be passed as argument. If it is an integer, the right
"Content-Length" header is set. If the "chunked" string is used, it forces
the message to be chunked-encoded and in that case the "Transfer-Encoding"
header is set.
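A minimal sketch of how it may be used from a lua filter, assuming the usual
filter class skeleton (registration not shown):

    -- inside a lua filter's http_headers callback
    function MyFilter:http_headers(txn, msg)
        -- force chunked encoding before altering the payload later on
        msg:set_body_len("chunked")
        -- or set an explicit length instead:
        -- msg:set_body_len(1024)
    end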
This patch should fix the issue #2837. It could be backported as far as 2.6.
There's a configurable limit to the number of messages sent to a
peer (tune.peers.max-updates-at-once), but this one is not applied to
the receive side. While it can usually be OK with default settings,
setups involving a large tune.bufsize (1MB and above) regularly
experience high latencies and even watchdogs during reloads because
the full learning process sends a lot of data that manages to fill
the entire buffer, and due to the compactness of the protocol, 1MB
of buffer can contain more than 100k updates, meaning taking locks
etc during this time, which is not workable.
Let's make sure the receiving side also respects the max-updates-at-once
setting. For this it counts incoming updates, and refrains from
continuing once the limit is reached. It's a bit tricky to do because
after receiving updates we still have to send ours (and possibly some
ACKs) so we cannot just leave the loop.
This issue was reported on 3.1 but it should progressively be backported
to all versions having the max-updates-at-once option available.
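For reference, the limit is driven by the existing global keyword (the value
below is only illustrative):

    global
        tune.peers.max-updates-at-once 200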
using "send-proxy" or "send-proxy-v2" option on a ring server is not
relevant nor supported. Worse, on 2.4 it causes haproxy process to
crash as reported in GH #2965.
Let's be more explicit about the fact that this keyword is not supported
under "ring" context by ignoring the option and emitting a warning message
to inform the user about that.
Ideally, we should do the same for peers and log servers. The proper way
would be to check servers options during postparsing but we currently lack
proper cross-type server postparsing hooks. This will come later and thus
will give us a chance to perform the compatibility checks for server
options depending on proxy type. But for now let's simply fix the "ring"
case since it is the only one that's known to cause a crash.
It may be backported to all stable versions.
From the documentation, it wasn't clear enough that "shards" should
be followed by one of the options: a number, "by-thread" or "by-group".
Align it with existing options in the documentation so that it becomes
more explicit.
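For example (a configuration sketch; addresses are placeholders):

    listen l1
        bind :8080 shards by-thread
        bind :8081 shards 4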
Released version 3.2-dev16 with the following main changes :
- BUG/MEDIUM: mux-quic: fix crash on invalid fctl frame dereference
- DEBUG: pool: permit per-pool UAF configuration
- MINOR: acme: add the global option 'acme.scheduler'
- DEBUG: pools: add a new integrity mode "backup" to copy the released area
- MEDIUM: sock-inet: re-check IPv6 connectivity every 30s
- BUG/MINOR: ssl: doesn't fill conf->crt with first arg
- BUG/MINOR: ssl: prevent multiple 'crt' on the same ssl-f-use line
- BUG/MINOR: ssl/ckch: always free() the previous entry during parsing
- MINOR: tools: ha_freearray() frees an array of string
- BUG/MINOR: ssl/ckch: always ha_freearray() the previous entry during parsing
- MINOR: ssl/ckch: warn when the same keyword was used twice
- BUG/MINOR: threads: fix soft-stop without multithreading support
- BUG/MINOR: tools: improve parse_line()'s robustness against empty args
- BUG/MINOR: cfgparse: improve the empty arg position report's robustness
- BUG/MINOR: server: dont depend on proxy for server cleanup in srv_drop()
- BUG/MINOR: server: perform lbprm deinit for dynamic servers
- MINOR: http: add a function to validate characters of :authority
- BUG/MEDIUM: h2/h3: reject some forbidden chars in :authority before reassembly
- MINOR: quic: account Tx data per stream
- MINOR: mux-quic: account Rx data per stream
- MINOR: quic: add stream format for "show quic"
- MINOR: quic: display QCS info on "show quic stream"
- MINOR: quic: display stream age
- BUG/MINOR: cpu-topo: fix group-by-cluster policy for disordered clusters
- MINOR: cpu-topo: add a new "group-by-ccx" CPU policy
- MINOR: cpu-topo: provide a function to sort clusters by average capacity
- MEDIUM: cpu-topo: change "performance" to consider per-core capacity
- MEDIUM: cpu-topo: change "efficiency" to consider per-core capacity
- MEDIUM: cpu-topo: prefer grouping by CCX for "performance" and "efficiency"
- MEDIUM: config: change default limits to 1024 threads and 32 groups
- BUG/MINOR: hlua: Fix Channel:data() and Channel:line() to respect documentation
- DOC: config: Fix a typo in the "term_events" definition
- BUG/MINOR: spoe: Don't report error on applet release if filter is in DONE state
- BUG/MINOR: mux-spop: Don't report error for stream if ACK was already received
- BUG/MINOR: mux-spop: Make the demux stream ID a signed integer
- BUG/MINOR: mux-spop: Don't open new streams for SPOP connection on error
- MINOR: mux-spop: Don't set SPOP connection state to FRAME_H after ACK parsing
- BUG/MEDIUM: mux-spop: Remove frame parsing states from the SPOP connection state
- BUG/MEDIUM: mux-spop: Properly handle CLOSING state
- BUG/MEDIUM: spop-conn: Report short read for partial frames payload
- BUG/MEDIUM: mux-spop: Properly detect truncated frames on demux to report error
- BUG/MEDIUM: mux-spop; Don't report a read error if there are pending data
- DEBUG: mux-spop: Review some trace messages to adjust the message or the level
- DOC: config: move address formats definition to section 2
- DOC: config: move stick-tables and peers to their own section
- DOC: config: move the extraneous sections out of the "global" definition
- CI: AWS-LC(fips): enable unit tests
- CI: AWS-LC: enable unit tests
- CI: compliance: limit run on forks only to manual + cleanup
- CI: musl: enable unit tests
- CI: QuicTLS (weekly): limit run on forks only to manual dispatch
- CI: WolfSSL: enable unit tests
Due to some historic mistakes that have spread to newly added sections,
a number of recently added small sections found themselves described
under section 3 "global parameters" which is specific to "global" section
keywords. This is highly confusing, especially given that sections 3.1,
3.2, 3.3 and 3.10 directly start with keywords valid in the global section,
while others start with keywords that describe a new section.
Let's just create a new chapter "12. other sections" and move them all
there. 3.10 "HTTPclient tuning" however was moved to 3.4 as it's really
a definition of the global options assigned to the HTTP client. The
"programs" that are going away in 3.3 were moved at the end to avoid a
renumbering later.
Another nice benefit is that it moves a lot of text that was previously
keeping the global and proxies sections apart.
As suggested by Tim in issue #2953, stick-tables really deserve their own
section to explain the configuration. And peers have to move there as well
since they're totally dedicated to stick-tables.
Now we introduce a new section "Stick-tables and Peers", explaining the
concepts, and under which there is one subsection for stick-tables
configuration and one for the peers (which mostly keeps the existing
peers section).
Section 2 describes the config file format, variables naming etc, so
there's no reason why the address format used in this file should be
in a separate section, let's bring it into section 2 as well.
Some trace messages were not really accurate, reporting a CLOSED connection
while only an error was reported on it. In addition, a TRACE_ERROR() was
used to report a short read on the HELLO/DISCONNECT frame headers. But it is
not an error; a TRACE_DEVEL() should be used instead.
This patch could be backported to 3.1 to ease future backports.
When a read error is detected, no error must be reported on the SPOP
connection if there are still some data to parse. It is important to be sure
to process all data before reporting the error and to be sure not to truncate
received frames. However, we must also take care to handle the short read case
so as not to wait for data that will never be received.
This patch must be backported to 3.1.
There was no test in the demux part to detect truncated frames and to report
an error at the connection level. The SPOP streams were properly switched to
the half-closed state. But while waiting for the associated SPOE applets to be
woken up and released, the SPOP connection could be woken up several times for
nothing. I never triggered the watchdog in that case, but it is not excluded.
Now, at the end of the demux function, a specific test was added to
detect truncated frames, report an error and close the connection.
This patch must be backported to 3.1.
When a frame was not fully received, a short read must be reported on the
SPOP connection to help the demux to handle truncated frames. This was
performed for frames truncated on the header part but not on the payload
part. It is now properly detected.
This patch must be backported to 3.1.
The CLOSING state was not handled at all by the SPOP multiplexer while it is
mandatory when a DISCONNECT frame was sent and the mux should wait for the
DISCONNECT frame in reply from the agent. Thanks to this patch, it should be
fixed.
In addition, if an error occurs during the AGENT HELLO frame parsing, the
SPOP connection is no longer switched to CLOSED state and remains in ERROR
state instead. It is important to be able to send the DISCONNECT frame to
the agent instead of closing the TCP connection immediately.
This patch depends on following commits:
* BUG/MEDIUM: mux-spop: Remove frame parsing states from the SPOP connection state
* MINOR: mux-spop: Don't set SPOP connection state to FRAME_H after ACK parsing
* BUG/MINOR: mux-spop: Don't open new streams for SPOP connection on error
* BUG/MINOR: mux-spop: Make the demux stream ID a signed integer
All the series must be backported to 3.1.
SPOP_CS_FRAME_H and SPOP_CS_FRAME_P states, that were used to handle frame
parsing, were removed. The demux process now relies on the demux stream ID
to know if it is waiting for the frame header or the frame
payload. Concretely, when the demux stream ID is not set (dsi == -1), the
demuxer is waiting for the next frame header. Otherwise (dsi >= 0), it is
waiting for the frame payload. It is especially important to be able to
properly handle DISCONNECT frames sent by the agents.
The SPOP_CS_RUNNING state is introduced to indicate that the hello handshake is
finished and the SPOP connection is able to open SPOP streams and exchange
NOTIFY/ACK frames with the agents.
It depends on the following fixes:
* MINOR: mux-spop: Don't set SPOP connection state to FRAME_H after ACK parsing
* BUG/MINOR: mux-spop: Make the demux stream ID a signed integer
This change will be mandatory for the next fix. It must be backported to 3.1
with the commits above.
After the ACK frame was parsed, it is useless to set the SPOP connection
state to SPOP_CS_FRAME_H state because this will be automatically handled by
the demux function. It is not an issue, but this will simplify changes
for the next commit.
Until now, only fully closed SPOP connections or those whose TCP connection was
in error were concerned. But available streams could also be reported for SPOP
connections in error or in closing state. In these states, no NOTIFY frames
will be sent and no ACK frames will be parsed. So, no new SPOP streams should be
opened.
This patch should be backported to 3.1.
The demux stream ID of a SPOP connection, used when received frames are
parsed, must be a signed integer because it is set to -1 when the SPOP
connection is initialized. It will be important for the next fix.
This patch must be backported to 3.1.
When a SPOP connection was closed or was in error, an error was
systematically reported on all its SPOP streams. However, SPOP streams that
already received their ACK frame must be excluded. Otherwise, if an agent
sends an ACK and closes immediately, the ACK will be ignored because the SPOP
stream will handle the error first.
This patch must be backported to 3.1.
When the SPOE applet was released, if a SPOE filter context was still
attached to it, an error was reported to the filter. However, there is no
reason to report an error if the ACK message was already received. Because
of this bug, if the ACK message is received and the SPOE connection is
immediately closed, this prevents the ACK message from being processed.
This patch should be backported to 3.1.
When the channel API was revisited, both functions above were added. An
offset can be passed as argument. However, this parameter could be reported
as out of range if not enough input data had been received yet. It
is an issue, especially with a tcp rule, because more data could be
received. If an error is reported too early, this prevents the rule from being
reevaluated later. In fact, an error should only be reported if the offset
is part of the output data.
Another issue is about the conditions to report 'nil' instead of an empty
string. 'nil' was reported when no data was found. But it is not aligned
with the documentation. 'nil' must only be returned if no more data can
be received and there is no input data at all.
This patch should fix the issue #2716. It should be backported as far as 2.6.
A test run on a dual-socket EPYC 9845 (2x160 cores) showed that we'll
be facing new limits during the lifetime of 3.2 with our current 16
groups and 256 threads max:
$ cat test.cfg
global
cpu-policy performance
$ ./haproxy -dc -c -f test.cfg
...
Thread CPU Bindings:
Tgrp/Thr Tid CPU set
1/1-32 1-32 32: 0-15,320-335
2/1-32 33-64 32: 16-31,336-351
3/1-32 65-96 32: 32-47,352-367
4/1-32 97-128 32: 48-63,368-383
5/1-32 129-160 32: 64-79,384-399
6/1-32 161-192 32: 80-95,400-415
7/1-32 193-224 32: 96-111,416-431
8/1-32 225-256 32: 112-127,432-447
Raising the default limit to 1024 threads and 32 groups is sufficient
to buy us enough margin for a long time (hopefully, please don't laugh,
you, reader from the future):
$ ./haproxy -dc -c -f test.cfg
...
Thread CPU Bindings:
Tgrp/Thr Tid CPU set
1/1-32 1-32 32: 0-15,320-335
2/1-32 33-64 32: 16-31,336-351
3/1-32 65-96 32: 32-47,352-367
4/1-32 97-128 32: 48-63,368-383
5/1-32 129-160 32: 64-79,384-399
6/1-32 161-192 32: 80-95,400-415
7/1-32 193-224 32: 96-111,416-431
8/1-32 225-256 32: 112-127,432-447
9/1-32 257-288 32: 128-143,448-463
10/1-32 289-320 32: 144-159,464-479
11/1-32 321-352 32: 160-175,480-495
12/1-32 353-384 32: 176-191,496-511
13/1-32 385-416 32: 192-207,512-527
14/1-32 417-448 32: 208-223,528-543
15/1-32 449-480 32: 224-239,544-559
16/1-32 481-512 32: 240-255,560-575
17/1-32 513-544 32: 256-271,576-591
18/1-32 545-576 32: 272-287,592-607
19/1-32 577-608 32: 288-303,608-623
20/1-32 609-640 32: 304-319,624-639
We can change this default now because it has no functional effect
without any configured cpu-policy, so this will only be an opt-in
and it's better to do it now than to have an effect during the
maintenance phase. A tiny effect is a doubling of the number of
pool buckets and stick-table shards internally, which means that
aside from slightly reducing contention in these areas, a dump of tables
can enumerate keys in a different order (hence the adjustment in the
vtc).
The only really visible effect is a slightly higher static memory
consumption (29->35 MB on a small config), but that difference
remains even with 50k servers so that's pretty much acceptable.
Thanks to Erwan Velu for the quick tests and the insights!
Most of the time, machines made of multiple CPU types use the same L3
for them, and grouping CPUs by frequencies to form groups doesn't bring
any value and, on the contrary, can impair the incoming connection balancing.
The choice of grouping by cluster was made in order to constitute a good
choice on homogeneous machines as well, so better to rely on the per-CCX
grouping than the per-cluster one in this case. This will create fewer
clusters on machines where it counts without affecting other ones.
It doesn't seem necessary to change anything for the "resource" policy
since it selects a single cluster.
This is similar to the previous change to the "performance" policy but
it applies to the "efficiency" one. Here we're changing the sorting
method to sort CPU clusters by average per-CPU capacity, and we evict
clusters whose per-CPU capacity is above 125% of the previous one.
Per-core capacity allows to detect discrepancies between CPU cores,
and to continue to focus on efficient ones as a priority.
Running the "performance" policy on highly heterogenous systems yields
bad choices when there are sufficiently more small than big cores,
and/or when there are multiple cluster types, because on such setups,
the higher the frequency, the lower the number of cores, despite small
differences in frequencies. In such cases, we quickly end up with
"performance" only choosing the small or the medium cores, which is
contrary to the original intent, which was to select performance cores.
This is what happens on boards like the Orion O6 for example where only
the 4 medium cores and 2 big cores are chosen, evicting the 2 biggest
cores and the 4 smallest ones.
Here we're changing the sorting method to sort CPU clusters by average
per-CPU capacity, and we evict clusters whose per-CPU capacity falls
below 80% of the previous one. Per-core capacity allows to detect
discrepancies between CPU cores, and to continue to focus on high
performance ones as a priority.
The current per-capacity sorting function acts on a whole cluster, but
in some setups having many small cores and few big ones, it becomes
easy to observe an inversion of metrics where the many small cores show
a globally higher total capacity than the few big ones. This does not
necessarily fit all use cases. Let's add a new function to sort clusters
by their per-cpu average capacity to cover more use cases.
This cpu-policy will only consider CCX and not clusters. This makes
a difference on machines with heterogeneous CPUs that generally share
the same L3 cache, where it's not desirable to create multiple groups
based on the CPU types, but instead create one with the different CPU
types. The variants "group-by-2/3/4-ccx" have also been added.
Let's also add some text explaining the difference between cluster
and CCX.
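For example (a configuration sketch):

    global
        cpu-policy group-by-ccx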
Some (rare) boards have their clusters in an erratic order. This is
the case for the Radxa Orion O6 where one of the big cores appears as
CPU0 due to booting from it, then followed by the small cores, then the
medium cores, then the remaining big cores. This results in clusters
appearing in this order: 0,2,1,0.
The core in cpu_policy_group_by_cluster() expected ordered clusters,
and performs ordered comparisons to decide whether a CPU's cluster has
already been taken care of. On the board above this doesn't work, only
clusters 0 and 2 appear and 1 is skipped.
Let's replace the cluster number comparison with a cpuset to record
which clusters have been taken care of. Now the groups properly appear
like this:
Tgrp/Thr Tid CPU set
1/1-2 1-2 2: 0,11
2/1-4 3-6 4: 1-4
3/1-6 7-12 6: 5-10
No backport is needed, this is purely 3.2.
Complete stream output for "show quic" by displaying information from
its upper QCS. Note that QCS may be NULL if already released, so a
default output is also provided.
Add a new format for "show quic" command labelled as "stream". This is
an equivalent of "show sess", dedicated to the QUIC stack. Each active
QUIC streams are listed on a line with their related infos.
The main objective of this command is to ensure there is no freeze
streams remaining after a transfer.
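For example, assuming the format name is passed as an argument to "show quic"
(the socket path is only a placeholder):

    $ echo "show quic stream" | socat /var/run/haproxy.sock -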
Add counters to measure Rx buffer usage per QCS. This reuses the newly
defined bdata_ctr type already used for Tx accounting.
Note that for now, the <tot> value of bdata_ctr is not used. This is because
it is not easy to account for data across contiguous buffers.
These values are displayed both on log/traces and "show quic" output.
Add accounting at qc_stream_desc level to be able to report the number
of allocated Tx buffers and the sum of their data. This represents data
ready for emission or already emitted and waiting on ACK.
To simplify this accounting, a new counter type bdata_ctr is defined in
quic_utils.h. This regroups both buffers and data counter, plus a
maximum on the buffer value.
These values are now displayed on QCS info used both on logline and
traces, and also on "show quic" output.
As discussed here:
https://github.com/httpwg/http2-spec/pull/936
https://github.com/haproxy/haproxy/issues/2941
It's important to take care of some special characters in the :authority
pseudo header before reassembling a complete URI, because after assembly
it's too late (e.g. the '/'). This patch does this, both for h2 and h3.
The impact on H2 was measured in the worst case at 0.3% of the request
rate, while the impact on H3 is around 1%, but H3 was about 1% faster
than H2 before and is now on par.
It may be backported after a period of observation, and in this case it
relies on this previous commit:
MINOR: http: add a function to validate characters of :authority
Thanks to @DemiMarie for reviving this topic in issue #2941 and bringing
new potential interesting cases.
As discussed here:
https://github.com/httpwg/http2-spec/pull/936
https://github.com/haproxy/haproxy/issues/2941
It's important to take care of some special characters in the :authority
pseudo header before reassembling a complete URI, because after assembly
it's too late (e.g. the '/').
This patch adds a specific function which checks all such characters
and their ranges on an ist, and benefits from modern compiler
optimizations that arrange the comparisons into an evaluation tree for
faster match. That's the version that gave the most consistent performance
across various compilers, though some hand-crafted versions using bitmaps
stored in register could be slightly faster but super sensitive to code
ordering, suggesting that the results might vary with future compilers.
This one takes on average 1.2ns per character at 3 GHz (3.6 cycles per
char on avg). The resulting impact on H2 request processing time (small
requests) was measured around 0.3%, from 6.60 to 6.618us per request,
which is a bit high but remains acceptable given that the test only
focused on req rate.
The code was made usable both for H2 and H3.
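A hedged sketch of such a per-character validation loop is shown below. It
is not the actual function: the exact set of accepted characters may
differ, and only the overall shape (range comparisons on each byte of an
ist, which the compiler can arrange into an evaluation tree) is meant to
match the description above.

    #include <import/ist.h>   /* ist type; include path assumed */

    /* Hedged sketch: returns 0 if a character that must not appear in
     * :authority is found (e.g. '/', '@', '?', '#', spaces, controls).
     */
    static int authority_is_clean(const struct ist host)
    {
            size_t i;

            for (i = 0; i < host.len; i++) {
                    unsigned char c = host.ptr[i];

                    if (c <= 0x20 || c >= 0x7f ||
                        c == '/' || c == '@' || c == '?' || c == '#')
                            return 0;
            }
            return 1;
    }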
Last commit 7361515 ("BUG/MINOR: server: dont depend on proxy for server
cleanup in srv_drop()") introduced a regression because the lbprm
server_deinit is not evaluated anymore with dynamic servers, possibly
resulting in a memory leak.
To fix the issue, in addition to free_proxy(), the server deinit check
should be manually performed in cli_parse_delete_server() as well.
No backport needed.
In commit b5ee8bebfc ("MINOR: server: always call ssl->destroy_srv when
available"), we made it so srv_drop() doesn't depend on proxy to perform
server cleanup.
It turns out this is now mandatory, because during deinit, free_proxy()
can occur before the final srv_drop(). This is the case when using Lua
scripts for instance.
In 2a9436f96 ("MINOR: lbprm: Add method to deinit server and proxy") we
added a freeing check under srv_drop() that depends on the proxy.
Because of that, a UAF may occur during deinit when using a Lua script
that manipulates server objects.
To fix the issue, let's perform the lbprm server deinit logic under
free_proxy() directly, where the DEINIT server hooks are evaluated.
Also, to prevent similar bugs in the future, let's explicitly document
in srv_drop() that server cleanups should assume that the proxy may
already be freed.
No backport needed unless 2a9436f96 is.
OSS Fuzz found that the previous fix ebb19fb367 ("BUG/MINOR: cfgparse:
consider the special case of empty arg caused by \x00") was incomplete,
as the output can sometimes be larger than the input (due to variable
expansion), in which case the workaround used to try to report a bad arg will
fail. While the parse_line() function has been made more robust now in
order to avoid this condition, let's fix the handling of this special
case anyway by just pointing to the beginning of the line if the supposed
error location is out of the line's buffer.
All details here:
https://oss-fuzz.com/testcase-detail/5202563081502720
No backport is needed unless the fix above is backported.
The fix in 10e6d0bd57 ("BUG/MINOR: tools: only fill first empty arg when
not out of range") was not that good. It focused on protecting against
<arg> becoming out of range to detect we haven't emitted anything, but
it's not the right way to detect this. We're always maintaining arg_start
as a copy of outpos, and that latter one is incremented when emitting a
char, so instead of testing args[arg] against out+arg_start, we should
instead check outpos against arg_start, thereby eliminating the <out>
offset and the need to access args[]. This way we now always know if
we've emitted an empty arg without dereferencing args[].
There's no need to backport this unless the fix above is also backported.
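A hedged fragment of the check described above (not the literal
parse_line() code; the variable names follow the commit message and the
error-reporting mechanism is only illustrative):

    /* Hedged fragment: when closing the current argument, an empty arg is
     * detected by comparing the output position with the position recorded
     * when the arg was started, without touching args[].
     */
    if (outpos == arg_start) {
            /* nothing was emitted for this arg: it is empty */
            if (errptr && !*errptr)
                    *errptr = in;   /* illustrative way to report its position */
    }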
When thread support is disabled ("USE_THREAD=" or "USE_THREAD=0" when
building), soft-stop doesn't work as haproxy never ends after stopping
the proxies.
This used to work fine in the past but suddenly stopped working with
ef422ced91 ("MEDIUM: thread: make stopping_threads per-group and add
stopping_tgroups") because the "break;" instruction under the stopping
condition is never executed when support for multithreading is disabled.
To fix the issue, let's add an "else" block to run the "break;"
instruction when USE_THREAD is not defined.
It should be backported up to 2.8
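A hedged sketch of the shape of the fix inside the polling loop (the
stopping condition name is illustrative, not the real one):

    /* Hedged sketch of the polling-loop exit */
    if (stopping) {
    #ifdef USE_THREAD
            /* multi-threaded build: leave only once all thread groups
             * have reached the stopping state.
             */
            if (all_tgroups_are_stopping())   /* hypothetical condition */
                    break;
    #else
            /* no other threads to wait for: leave the loop right away */
            break;
    #endif
    }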
When using a crt-list or a crt-store, keywords mentioned twice on the
same line overwrite the previous value.
This patch emits a warning when the same keyword is found another time
on the same line.
The ckch_conf_parse() function is the generic function which parses
crt-store keywords from the crt-store section, and also from a
crt-list.
When the same keyword appears multiple times, the previous value is
leaked. This patch ensures that the previous value is always freed
before overwriting it.
This is the same problem as the previous "BUG/MINOR: ssl/ckch: always
free() the previous entry during parsing" patch, however this one
applies to PARSE_TYPE_ARRAY_SUBSTR.
No backport needed.
The ckch_conf_parse() function is the generic function which parses
crt-store keywords from the crt-store section, and also from a crt-list.
When the same keyword appears multiple times, the previous value is
leaked. This patch ensures that the previous value is always freed
before overwriting it.
This patch should be backported as far as 3.0.
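The pattern described in the two patches above boils down to releasing the
previously parsed value before storing the new one. A hedged sketch for a
string-typed keyword (the type name is borrowed from the sibling
PARSE_TYPE_ARRAY_SUBSTR mentioned above; target, args, cur and alloc_error
are illustrative):

    /* Hedged sketch of the string case, not the literal code */
    case PARSE_TYPE_STR:
            free(*(char **)target);          /* drop the value left by a
                                              * previous occurrence */
            *(char **)target = strdup(args[cur + 1]);
            if (!*(char **)target)
                    goto alloc_error;
            break;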
The 'ssl-f-use' implementation doesn't prevent the 'crt' keyword from
appearing multiple times, each occurrence overwriting the previous value.
This lets users think it is possible to use multiple certificates on the
same line, which is not the case.
This patch emits an alert when setting the 'crt' keyword multiple times
on the same ssl-f-use line.
Should fix issue #2966.
No backport needed.
Commit c7f29afc ("MEDIUM: ssl: replace "crt" lines by "ssl-f-use"
lines") forgot to remove an the allocation of the crt field which was
done with the first argument.
Since ssl-f-use takes keywords, this would put the first keyword in
"crt" instead of the certificate name.
IPv6 connectivity might initially be off (e.g. network not fully up when
haproxy starts), so for features like resolvers, it would be nice to
periodically recheck it.
With this change, instead of having the resolvers code rely on a variable
indicating connectivity, it will now call a function that will check for
how long a connectivity check hasn't been run, and will perform a new one
if needed. The age was set to 30s which seems reasonable considering that
the DNS will cache results anyway. There's no saving in spacing it more
since the syscall is very cheap (just a connect() without any packet being
emitted).
The variables remain exported so that we could present them in show info
or anywhere else.
This way, "dns-accept-family auto" will now stay up to date. Warning
though, it does perform some caching so even with a refreshed IPv6
connectivity, an older record may be returned anyway.
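A hedged standalone sketch of the age-based probe described above. The
real code keeps its state in the exported variables and uses haproxy's own
clock; the function name, state variables and probed address below are
purely illustrative:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    #define CONNECTIVITY_MAX_AGE 30  /* seconds, as described above */

    static int ipv6_reachable;
    static time_t ipv6_last_check;

    /* Re-run the cheap probe only if the previous result is older than 30s.
     * A connect() on a UDP socket emits no packet; it merely fails when no
     * IPv6 route exists.
     */
    int ipv6_connectivity_ok(void)
    {
            time_t now = time(NULL);

            if (now - ipv6_last_check >= CONNECTIVITY_MAX_AGE) {
                    struct sockaddr_in6 sa = { .sin6_family = AF_INET6,
                                               .sin6_port = htons(53) };
                    int fd = socket(AF_INET6, SOCK_DGRAM, 0);

                    inet_pton(AF_INET6, "2001:4860:4860::8888", &sa.sin6_addr);
                    ipv6_reachable = (fd >= 0 &&
                                      connect(fd, (struct sockaddr *)&sa,
                                              sizeof(sa)) == 0);
                    if (fd >= 0)
                            close(fd);
                    ipv6_last_check = now;
            }
            return ipv6_reachable;
    }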
This way we can preserve the entire contents of the released area for
later inspection. This automatically enables comparison at reallocation
time as well (like "integrity" does). If used in combination with
integrity, the comparison is disabled but the check of non-corruption
of the area mangled by integrity is still operated.
The automatic scheduler is useful, but sometimes you don't want to use
it, or you prefer to schedule renewals manually.
This patch adds an 'acme.scheduler' option in the global section, which
can be set to either 'auto' or 'off'. (auto is the default value)
This also changes the output of the 'acme status' command so it does not
show scheduled values. The state will be 'Stopped' instead of
'Scheduled'.
The new MEM_F_UAF flag can be set just after a pool's creation to make
this pool UAF for debugging purposes. This allows to maintain a better
overall performance required to reproduce issues while still having a
chance to catch UAF. It will only be used by developers who will manually
add it to areas worth being inspected, though.
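A hedged sketch of how a developer might mark a single suspicious pool,
assuming the usual create_pool() API and pool flags field (the pool name
and struct are only an example):

    /* Hedged sketch: only this pool gets use-after-free checking, every
     * other pool keeps running at full speed.
     */
    pool_head_task = create_pool("task", sizeof(struct task), MEM_F_SHARED);
    if (pool_head_task)
            pool_head_task->flags |= MEM_F_UAF;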
Emission of flow-control frames has recently been modified. Now, each
frame is sent one by one, via a single entry list. If a failure occurs,
emission is interrupted and frame is reinserted into the original
<qcc.lfctl.frms> list.
This code is incorrect as it only checks if qcc_send_frames() returns an
error code to perform the reinsert operation. However, an error here
does not always mean that the frame was not properly emitted by the lower
quic-conn layer. As such, an extra LIST_ISEMPTY() test must be performed
prior to reinserting the frame.
This bug would cause a heap overflow. Indeed, the reinserted frame would
be a random value. A crash would occur as soon as it would be
dereferenced via <qcc.lfctl.frms> list.
This was reproduced by issuing a POST with a big file and interrupting it
after just a few seconds. This results in a crash in about a third of
the tests. Here is an example command using ngtcp2 :
$ ngtcp2-client -q --no-quic-dump --no-http-dump \
-m POST -d ~/infra/html/1g 127.0.0.1 20443 "http://127.0.0.1:20443/post"
Heap overflow was detected via a BUG_ON() statement from qc_frm_free()
via qcc_release() caller :
FATAL: bug condition "!((&((*frm)->reflist))->n == (&((*frm)->reflist)))" matched at src/quic_frame.c:1270
This does not need to be backported.
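A hedged fragment of the fix described above. The call shape is taken from
the commit message, not from the code, so the exact prototype and list
names may differ; LIST_APPEND/LIST_DELETE/LIST_ISEMPTY are haproxy's usual
list macros:

    /* Hedged fragment: only reinsert the frame if it is really still
     * queued, i.e. the lower layer did not consume it despite the error.
     */
    LIST_APPEND(&single, &frm->list);
    ret = qcc_send_frames(qcc, &single);
    if (ret != 0 && !LIST_ISEMPTY(&single)) {
            LIST_DELETE(&frm->list);
            LIST_APPEND(&qcc->lfctl.frms, &frm->list);
    }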
Released version 3.2-dev15 with the following main changes :
- BUG/MEDIUM: stktable: fix sc_*(<ctr>) BUG_ON() regression with ctx > 9
- BUG/MINOR: acme/cli: don't output error on success
- BUG/MINOR: tools: do not create an empty arg from trailing spaces
- MEDIUM: config: warn about the consequences of empty arguments on a config line
- MINOR: tools: make parse_line() provide hints about empty args
- MINOR: cfgparse: visually show the input line on empty args
- BUG/MINOR: tools: always terminate empty lines
- BUG/MINOR: tools: make parseline report the required space for the trailing 0
- DEBUG: threads: don't keep lock label "OTHER" in the per-thread history
- DEBUG: threads: merge successive idempotent lock operations in history
- DEBUG: threads: display held locks in threads dumps
- BUG/MINOR: proxy: only use proxy_inc_fe_cum_sess_ver_ctr() with frontends
- Revert "BUG/MEDIUM: mux-spop: Handle CLOSING state and wait for AGENT DISCONNECT frame"
- MINOR: acme/cli: 'acme status' show the status acme-configured certificates
- MEDIUM: acme/ssl: remove 'acme ps' in favor of 'acme status'
- DOC: configuration: add "acme" section to the keywords list
- DOC: configuration: add the "crt-store" keyword
- BUG/MAJOR: queue: lock around the call to pendconn_process_next_strm()
- MINOR: ssl: add filename and linenum for ssl-f-use errors
- BUG/MINOR: ssl: can't use crt-store some certificates in ssl-f-use
- BUG/MINOR: tools: only fill first empty arg when not out of range
- MINOR: debug: bump the dump buffer to 8kB
- MINOR: stick-tables: add "ipv4" as an alias for the "ip" type
- MINOR: quic: extend return value during TP parsing
- BUG/MINOR: quic: use proper error code on missing CID in TPs
- BUG/MINOR: quic: use proper error code on invalid server TP
- BUG/MINOR: quic: reject retry_source_cid TP on server side
- BUG/MINOR: quic: use proper error code on invalid received TP value
- BUG/MINOR: quic: fix TP reject on invalid max-ack-delay
- BUG/MINOR: quic: reject invalid max_udp_payload size
- BUG/MEDIUM: peers: hold the refcnt until updating ts->seen
- BUG/MEDIUM: stick-tables: close a tiny race in __stksess_kill()
- BUG/MINOR: cli: fix too many args detection for commands
- MINOR: server: ensure server postparse tasks are run for dynamic servers
- BUG/MEDIUM: stick-table: always remove update before adding a new one
- BUG/MEDIUM: quic: free stream_desc on all data acked
- BUG/MINOR: cfgparse: consider the special case of empty arg caused by \x00
- DOC: config: recommend disabling libc-based resolution with resolvers
Using both libc and haproxy resolvers can lead to hard to diagnose issues
when their behaviour diverges; recommend using only one type of resolver.
Should be backported to stable versions.
Link: https://www.mail-archive.com/haproxy@formilux.org/msg45663.html
Co-authored-by: Lukas Tribus <lukas@ltri.eu>
The reporting of the empty arg location added with commit 08d3caf30
("MINOR: cfgparse: visually show the input line on empty args") falls
victim of a special case detected by OSS Fuzz:
https://issues.oss-fuzz.com/issues/415850462
In short, making an argument start with "\x00" doesn't make it empty for
the parser, but still emits an empty string which is detected and
displayed. Unfortunately in this case the error pointer is not set so
the sanitization function crashes.
What we're doing in this case is that we fall back to the position of
the output argument as an estimate of where it was located in the input.
It's clearly inexact (quoting etc) but will still help the user locate
the problem.
No backport is needed unless the commit above is backported.
The following patch simplifies qc_stream_desc_ack(). The qc_stream_desc
instance is not freed anymore, even if all data were acknowledged. As
implied by the commit message, the caller is responsible for performing this
cleaning operation.
f4a83fbb14bdd14ed94752a2280a2f40c1b690d2
MINOR: quic: do not remove qc_stream_desc automatically on ACK handling
However, despite the commit instruction, the qc_stream_desc_free()
invocation was not moved into the caller. This commit fixes this by adding
it after stream ACK handling. This is performed only when a transfer is
completed : all data is acknowledged and qc_stream_desc has been
released by its MUX stream instance counterpart.
This bug may cause a significant increase in memory usage when dealing
with long-running connections. However, there is no memory leak, as every
qc_stream_desc attached to a connection is eventually freed when the
quic_conn instance is released.
This must be backported up to 3.1.
Since commit 388539faa ("MEDIUM: stick-tables: defer adding updates to a
tasklet"), between the entry creation and its arrival in the updates tree,
there is time for scheduling, and it now becomes possible for a stksess
entry to be requeued into the list while it's still in the tree as a remote
one. Only local updates were removed prior to being inserted. In this case
we would re-insert the entry, causing it to appear as the parent of two
distinct nodes or leaves, and to be visited from the first leaf during a
delete() after having already been removed and freed, causing a crash,
as Christian reported in issue #2959.
There's no reason to backport this as this appeared with the commit above
in 3.2-dev13.
commit 29b76cae4 ("BUG/MEDIUM: server/log: "mode log" after server
keyword causes crash") introduced some postparsing checks/tasks for
servers.
Initially they were mainly meant for "mode log" servers postparsing, but
we already have a check dedicated to "tcp/http" servers (ie: only tcp
proto supported)
However when dynamic servers are added they bypass _srv_postparse() since
the REGISTER_POST_SERVER_CHECK() is only executed for servers defined in
the configuration.
To ensure consistency between dynamic and static servers, and ensure no
post-check init routine is missed, let's manually invoke _srv_postparse()
after creating a dynamic server added via the cli.
d3f928944 ("BUG/MINOR: cli: Issue an error when too many args are passed
for a command") added a new check to prevent the command to run when
too many arguments are provided. In this case an error is reported.
However it turns out this check (despite marked for backports) was
ineffective prior to 20ec1de21 ("MAJOR: cli: Refacor parsing and
execution of pipelined commands") as 'p' pointer was reset to the end of
the buffer before the check was executed.
Now since 20ec1de21, the check works, but we have another issue: we may
read past initialized bytes in the buffer because 'p' pointer is always
incremented in a while loop without checking if we increment it past 'end'
(This was detected using valgrind)
To fix the issue introduced by 20ec1de21, let's only increment 'p' pointer
if p < end.
For 3.2 this is it, now for older versions, since d3f928944 was marked for
backport, a slightly different approach is needed:
- conditional p increment must be done in the loop (as in this patch)
- max arg check must be moved above the "fill unused slots" comment where p is
assigned to the end of the buffer
This patch should be backported with d3f928944.
It might be possible not to see the element in the tree, then not to
see it in the update list, thus not to take the lock before deleting.
But an element in the list could have moved to the tree during the
check, and be removed later without the updt_lock.
Let's delete prior to checking the presence in the tree to avoid
this situation. No backport is needed since this arrived in -dev13
with the update list.
In peer_treat_updatemsg(), we call stktable_touch_remote() after
releasing the write lock on the TS, asking it to decrement the
refcnt, then we update ts->seen. Unfortunately this is racy and
causes the issue that Christian reported in issue #2959.
The sequence of events is very hard to trigger manually, but what happens
is the following:
T1. stktable_touch_remote(table, ts, 1);
-> at this point the entry is in the mt_list, and the refcnt is zero.
T2. stktable_trash_oldest() or process_table_expire()
-> these can run, because the refcnt is now zero.
The entry is cleanly deleted and freed.
T1. HA_ATOMIC_STORE(&ts->seen, 1)
-> we dereference freed memory.
A first attempt at a fix was made by keeping the refcnt held during
all the time the entry is in the mt_list, but this is expensive as
such entries cannot be purged, causing lots of skips during
trash_oldest_data(). This managed to trigger watchdogs, and was only
hiding the real cause of the problem.
The correct approach clearly is to maintain the ref_cnt until we
touch ->seen. That's what this patch does. It does not decrement
the refcnt, while calling stktable_touch_remote(), and does it
manually after touching ->seen. With this the problem is gone.
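A hedged sketch of the fixed ordering described in the paragraph above
(the decrefcnt argument is assumed from the function's usual prototype,
and the atomic helpers are haproxy's standard ones):

    /* Keep the reference across the ->seen update so the entry cannot be
     * purged in between, then drop it manually.
     */
    stktable_touch_remote(table, ts, 0);    /* 0: do not drop the refcount */
    HA_ATOMIC_STORE(&ts->seen, 1);          /* safe: we still hold a ref */
    HA_ATOMIC_DEC(&ts->ref_cnt);            /* release it only now */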
Note that a reproducer involves the following:
- a config with 10 stick-ctr tracking the same table with a
random key between 10M and 100M depending on the machine.
- the expiration should be between 10 and 20s. http_req_cnt
is stored and shared with the peers.
- 4 total processes with such a config on the local machine,
each corresponding to a different peer. 3 of the peers are
bound to half of the cores (all threads) and share the same
threads; the last process is bound to the other half with
its own threads.
- injecting at full load, ~256 conn, on the shared listening
port. After ~2x expiration time to 1 minute the lone process
should segfault in pools code due to a corrupted by_lru list.
This problem already exists in earlier versions but the race looks
narrower. Given how difficult it is to trigger on a given machine
in its current form, it's likely that it only happens once in a
while on stable branches. The fix must be backported wherever the
code is similar, and there's no hope to reproduce it to validate
the backport.
Thanks again to Christian for his amazing help!
Add a check on the received max_udp_payload transport parameter. As
defined per RFC 9000, values below 1200 are invalid, and thus the
connection must be closed with TRANSPORT_PARAMETER_ERROR code.
Prior to this patch, an invalid value was silently ignored.
This should be backported up to 2.6. Note that it relies on the previous
patch "MINOR: quic: extend return value on TP parsing".
Checks are implemented on some received transport parameter values,
to reject invalid ones defined per RFC 9000. This is the case for
max_ack_delay parameter.
The check was not properly implemented as it only rejects values strictly
greater than the limit set to 2^14. Fix this by rejecting values of 2^14
and above. Also, the proper error code TRANSPORT_PARAMETER_ERROR is now
set.
This should be backported up to 2.6. Note that it relies on the previous
patch "MINOR: quic: extend return value on TP parsing".
As per RFC 9000, checks must be implemented to reject invalid values for
received transport parameters. Such values are dependent on the
parameter type.
Checks were already implemented for ack_delay_exponent and
active_connection_id_limit, accordingly with the QUIC specification.
However, the connection was closed with an incorrect error code. Fix
this to ensure that TRANSPORT_PARAMETER_ERROR code is used as expected.
This should be backported up to 2.6. Note that it relies on the previous
patch "MINOR: quic: extend return value on TP parsing".
Close the connection on error if retry_source_connection_id transport
parameter is received. This is specified by RFC 9000 as this parameter
must not be emitted by a client. Previously, it was silently ignored.
This should be backported up to 2.6. Note that it relies on the previous
patch "MINOR: quic: extend return value on TP parsing".
This commit is similar to the previous one. It fixes the error code
reported when dealing with invalid received transport parameters. This
time, it handles reception of original_destination_connection_id,
preferred_address and stateless_reset_token which must not be emitted by
the client.
This should be backported up to 2.6. Note that it relies on the previous
patch "MINOR: quic: extend return value on TP parsing".
Handle missing received transport parameter value
initial_source_connection_id / original_destination_connection_id.
Previously, such case would result in an error reported via
quic_transport_params_store(), which triggers a TLS alert converted as
expected as a CONNECTION_CLOSE. The issue is that the error code
reported in the frame was incorrect.
Fix this by returning QUIC_TP_DEC_ERR_INVAL for such conditions. This is
directly handled via quic_transport_params_store() which set the proper
TRANSPORT_PARAMETER_ERROR code for the CONNECTION_CLOSE. However, no
error is reported so the SSL handshake is properly terminated without a
TLS alert. This is enough to ensure that the CONNECTION_CLOSE frame will
be emitted as expected.
This should be backported up to 2.6. Note that it relies on the previous
patch "MINOR: quic: extend return value on TP parsing".
Extend API used for QUIC transport parameter decoding. This is done via
the introduction of a dedicated enum to report the various error
conditions detected. No functional change should occur with this patch,
as the only returned code is QUIC_TP_DEC_ERR_TRUNC, which results in the
connection closure via a TLS alert.
This patch will be necessary to properly reject transport parameters
with the proper CONNECTION_CLOSE error code. As such, it should be
backported up to 2.6 with the following series.
However the doc purposely says the opposite, to encourage migrating away
from "ip". The goal is that in the future we change "ip" to mean "ipv6",
which seems to be what most users naturally expect. But we cannot break
configurations in the LTS version so for now "ipv4" is the alias.
The reason for not changing it in the table is that the type name is
used at a few places (look for "].kw"):
- dumps
- promex
We'd rather not change that output for 3.2, but only do it in 3.3.
This way, 3.2 can be made future-proof by using "ipv4" in the config
without any other side effect.
Please see github issue #2962 for updates on this transition.
Now with the improved backtraces, the lock history and details in the
mux layers, some dumps appear truncated or with some chars alone at
the beginning of the line. The issue is in fact caused by the limited
dump buffer size (2kB for stderr, 4kB for warning), that cannot hold
a complete trace anymore.
Let's bump them to 8kB, this will be plenty for a long time.
In commit 3f2c8af313 ("MINOR: tools: make parse_line() provide hints
about empty args") we've added the ability to record the position of
the first empty arg in parse_line(), but that check requires to
access the args[] array for the current arg, which is not valid in
case we stopped on too large an argument count. Let's just check the
arg's validity before doing so.
This was reported by OSS Fuzz:
https://issues.oss-fuzz.com/issues/415850462
No backport is needed since this was in the latest dev branch.
When declaring a certificate via the crt-store section, this certificate
can then be used 2 ways in a crt-list:
- only by using its name, without any crt-store options
- or by using the exact set of crt-list option that was defined in the
crt-store
Since ssl-f-use is generating a crt-list, this is supposed to behave the
same. To achieve this, ckch_conf_parse() will parse the keywords related
to the ckch_conf on the ssl-f-use line and use ckch_conf_cmp() to
compare it to the previous declaration from the crt-store. This
comparison is only done when ckch_conf keywords are present.
However, ckch_conf_parse() was done for the crt-list, and the crt-list
does not use the "crt" parameter to declare the name of the certificate,
since it's the first element of the line. So when used with ssl-f-use,
ckch_conf_parse() will always see a "crt" keyword which is a ckch_conf
one, and consider that it will always need to have the exact same set of
parameters when using the same crt in a crt-store and an ssl-f-use line.
So a simple configuration like this:
crt-store web
load crt "foo.com.crt" key "foo.com.key" alias "foo"
frontend mysite
bind :443 ssl
ssl-f-use crt "@web/foo" ssl-min-ver TLSv1.2
Would lead to an error like this:
config : '@web/foo' in crt-list '(null)' line 0, is already defined with incompatible parameters:
- different parameter 'key' : previously 'foo.com.key' vs '(null)'
In order to fix the issue, this patch parses the "crt" parameter itself
for ssl-f-use instead of using ckch_conf_parse(), so the keyword would
never be considered as a ckch_conf keyword to compare.
This patch also takes care of setting the CKCH_CONF_SET_CRTLIST flag only
if a ckch_conf keyword was found. This flag is used by ckch_conf_cmp()
to know if it has to compare or not.
No backport needed.
Fill cfg_crt_node with a filename and linenum so the post_section
callback can use it to emit errors.
This way the errors are emitted with the right filename and linenum
where ssl-f-use is used instead of (null):0
The extra call to pendconn_process_next_strm() made in commit cda7275ef5
("MEDIUM: queue: Handle the race condition between queue and dequeue
differently") was performed after releasing the server queue's lock,
which is incompatible with the calling convention for this function.
The result is random corruption of the server's streams list likely
due to picking old or incorrect pendconns from the queue, and in the
end infinitely looping on apparently already locked mt_list objects.
Just adding the lock fixes the problem.
It's very difficult to reproduce, it requires low maxconn values on
servers, stickiness on the servers (cookie), a long enough slowstart
(e.g. 10s), and regularly flipping servers up/down to re-trigger the
slowstart.
No backport is needed as this was only in 3.2.
Add the "crt-store" keyword with its argument in the "3.12" section, so
this could be detected by haproxy-dconv as a keyword and put in the
keywords list.
Must be backported as far as 3.0
Remove the 'acme ps' command which does not seem useful anymore with the
'acme status' command.
The big difference with the 'acme status' command is that 'acme ps' was only
displaying the running tasks instead of the status of all certificates.
The "acme status" command shows the status of every certificate
configured with ACME, not only the running tasks like "acme ps".
The IO handler loops on the ckch_store tree and outputs a line for each
ckch_store which has an acme section set. This is still done under the
ckch_store lock and doesn't support resuming when the buffer is full,
but we need to change that in the future.
This reverts commit 53c3046898633e56f74f7f05fb38cabeea1c87a1.
This patch introduced a regression leading to a loop on the frames
demultiplexing because a frame may be ignored but not consumed.
But outside this regression that can be fixed, there is a design issue that
was not totally fixed by the patch above. The SPOP connection state is mixed
with the status of the frames demultiplexer and this needlessly complexifies
the connection management. Instead of fixing the fix, a better solution is
to revert it and work on a proper solution.
For the record, the idea is to deal with the SPOP connection state only
using the 'state' field and to introduce a new field to handle the frames
demultiplexer state. This should ease the closing state management.
Another issue must also be fixed. We must take care not to abort a SPOP
stream when an error is detected on a SPOP connection or when the connection
is closed, if the ACK frame was already received for this stream. It is not
a common case, but it can be solved by saving the last known stream ID that
received an ACK.
This patch must be backported if the commit above is backported.
proxy_inc_fe_cum_sess_ver_ctr() was implemented in 9969adbc
("MINOR: stats: add by HTTP version cumulated number of sessions and
requests")
As its name suggests, it is meant to be called for frontends, not backends
Also, in 9969adbc, when used under h1_init(), a precaution is taken to
ensure that the function is only called with frontends.
However, this precaution was not applied in h2_init() and qc_init().
Due to this, it remains possible to have proxy_inc_fe_cum_sess_ver_ctr()
being called with a backend proxy as parameter. While it did not cause
known issues so far, it is not expected and could result in bugs in the
future. Better fix this by ensuring the function is only called with
frontends.
It may be backported up to 2.8
Based on the lock history, we can spot some locks that are still held
by checking the last operation that happened on them: if it's not an
unlock, then we know the lock is held. In this case we append the list
after "locked:" with their label and state like below:
U:QUEUE S:IDLE_CONNS U:IDLE_CONNS R:TASK_WQ U:TASK_WQ S:QUEUE S:QUEUE S:QUEUE locked: QUEUE(S)
S:IDLE_CONNS U:IDLE_CONNS S:TASK_RQ U:TASK_RQ S:QUEUE U:QUEUE S:IDLE_CONNS locked: IDLE_CONNS(S)
R:TASK_WQ S:TASK_WQ R:TASK_WQ S:TASK_WQ R:TASK_WQ S:TASK_WQ R:TASK_WQ locked: TASK_WQ(R)
W:STK_TABLE W:STK_TABLE_UPDT U:STK_TABLE_UPDT W:STK_TABLE W:STK_TABLE_UPDT U:STK_TABLE_UPDT W:STK_TABLE W:STK_TABLE_UPDT locked: STK_TABLE(W) STK_TABLE_UPDT(W)
The format is slightly different (label(status)) so as to easily
differentiate them visually from the history.
In order to make the lock history a bit more useful, let's try to merge
adjacent lock/unlock sequences that don't change anything for other
threads. For this we can replace the last unlock with the new operation
on the same label, and even just not store it if it was the same as the
one before the unlock, since in the end it's the same as if the unlock
had not been done.
Now loops that used to be filled with "R:LISTENER U:LISTENER" show more
useful info such as:
S:IDLE_CONNS U:IDLE_CONNS S:PEER U:PEER S:IDLE_CONNS U:IDLE_CONNS R:LISTENER U:LISTENER
U:STK_TABLE W:STK_SESS U:STK_SESS R:STK_TABLE U:STK_TABLE W:STK_SESS U:STK_SESS R:STK_TABLE
R:STK_TABLE U:STK_TABLE W:STK_SESS U:STK_SESS W:STK_TABLE_UPDT U:STK_TABLE_UPDT S:PEER
It's worth noting that it can sometimes induce confusion when recursive
locks of the same label are used (a few exist on peers or stick-tables),
as in such a case the two operations would be needed. However these ones
are already undebuggable, so instead they will just have to be renamed
to make sure they use a distinct label.
Most threads are filled with "R:OTHER U:OTHER" in their history. Since
anything non-important can use OTHER, it's not observable, but it pollutes
the history. Let's just drop OTHER entirely during the recording.
The fix in commit 09a325a4de ("BUG/MINOR: tools: always terminate empty
lines") is insufficient. While it properly addresses the lack of trailing
zero, it doesn't account for it in the returned outlen that is used to
allocate a larger line. This happens at boot if the very first line of
the test file is exactly a sharp with nothing else. In this case it will
return a length 0 and the caller (parse_cfg()) will try to re-allocate an
entry of size zero and will fail, bailing out on a lack of memory. This time
it should really be OK.
It doesn't need to be backported, unless the patch above would be.
Since latest commit 7e4a2f39ef ("BUG/MINOR: tools: do not create an empty
arg from trailing spaces"), an empty line will no longer produce an arg
and no longer append a trailing zero to them. This was not visible because
one is already present in the input string; however all the trailing args
are set to out+outpos-1, which now points one char before the buffer since
nothing was emitted. This was noticed by ASAN and/or when parsing garbage.
Let's make sure to always emit the zero for empty lines as well to address
this issue. No backport is needed unless the patch above gets backported.
Now when an empty arg is found on a line, we emit the sanitized
input line and the position of the first empty arg so as to help
the user figure the cause (likely an empty environment variable).
Co-authored-by: Valentine Krasnobaeva <vkrasnobaeva@haproxy.com>
In order to help parse_line() callers report the position of empty
args to the user, let's decide that if no error is emitted, then
we'll stuff the errptr with the position of the first empty arg
without affecting the return value.
Co-authored-by: Valentine Krasnobaeva <vkrasnobaeva@haproxy.com>
For historical reasons, the config parser relies on the trailing '\0'
to detect the end of the line being parsed. When the lines started to be
tokenized into arguments, this principle has been preserved, and now all
the parsers rely on *args[arg]='\0' to detect the end of a line. But as
reported in issue #2944, while most of the time it breaks the parsing
like below:
http-request deny if { path_dir '' }
it can also cause some elements to be silently ignored like below:
acl bad_path path_sub '%2E' '' '%2F'
This may also subtly happen with environment variables that don't exist
or which are empty:
acl bad_path path_sub '%2E' "$BAD_PATTERN" '%2F'
Fortunately, parse_line() returns the number of arguments found, so it's
easy from the callers to verify if any was empty. The goal of this commit
is not to perform sensitive changes, it's only to mention when parsing a
line that an empty argument was found and alert about its consequences
using a warning. Most of the time when this happens, the config does not
parse. But for examples as the ACLs above, there could be consequences
that are better detected early.
This patch depends on this previous fix:
BUG/MINOR: tools: do not create an empty arg from trailing spaces
Co-authored-by: Valentine Krasnobaeva <vkrasnobaeva@haproxy.com>
Trailing spaces on the lines of the config file create an empty arg
which makes it complicated to detect really empty args. Let's first
address this. Note that it is not user-visible but prevents from
fixing user-visible issues. No backport is needed.
The initial issue was introduced with this fix that already tried to
address it:
8a6767d266 ("BUG/MINOR: config: don't count trailing spaces as empty arg (v2)")
The current patch properly addresses leading and trailing spaces by
only counting arguments if non-lws chars were found on the line. LWS
do not cause a transition to a new arg anymore but they complete the
current one. The whole new code relies on a state machine to detect
when to create an arg (!in_arg->in_arg), and when to close the current
arg. A special care was taken for word expansion in the form of
"${ARGS[*]}" which still continue to emit individual arguments past
the first LWS. This example works fine:
ARGS="100 check inter 1000"
server name 192.168.1."${ARGS[*]}"
It properly results in 6 args:
"server", "name", "192.168.1.100", "check", "inter", "1000"
This fix should not have any visible user impact and is a bit tricky,
so it's best not to backport it, at least for a while.
Co-authored-by: Valentine Krasnobaeva <vkrasnobaeva@haproxy.com>
Previous patch 7251c13c7 ("MINOR: acme: move the acme task init in a dedicated
function") mistakenly returned the wrong error code when "acme renew" parsing
was successful, and tried to emit an error message.
This patch fixes the issue by returning 0 when the acme task was correctly
scheduled to start.
No backport needed.
As reported in GH #2958, commit 6c9b315 caused a regression with sc_*
fetches and tracked counter id > 9.
As such, the below configuration would cause a BUG_ON() to be triggered:
global
log stdout format raw local0
tune.stick-counters 11
defaults
log global
mode http
frontend www
bind *:8080
acl track_me bool(true)
http-request set-var(txn.track_var) str("a")
http-request track-sc10 var(txn.track_var) table rate_table if track_me
http-request set-var(txn.track_var_rate) sc_gpc_rate(0,10,rate_table)
http-request return status 200
backend rate_table
stick-table type string size 1k expire 5m store gpc_rate(1,1m)
While in 6c9b315 the src_fetch logic was removed from
smp_fetch_sc_stkctr(), num > 9 is indeed not expected anymore as the
original num value. But what we didn't consider is that num is effectively
re-assigned for the generic sc_* variants.
Thus the BUG_ON() is misplaced as it should only be evaluated for
non-generic fetches. This explains why it triggers with valid configurations.
Thanks to GH user @tkjaer for his detailed report and bug analysis
No backport needed, this bug is specific to 3.2.
Released version 3.2-dev14 with the following main changes :
- MINOR: acme: retry label always do a request
- MINOR: acme: does not leave task for next request
- BUG/MINOR: acme: reinit the retries only at next request
- MINOR: acme: change the default max retries to 5
- MINOR: acme: allow a delay after a valid response
- MINOR: acme: wait 5s before checking the challenges results
- MINOR: acme: emit a log when starting
- MINOR: acme: delay of 5s after the finalize
- BUG/MEDIUM: quic: Let it be known if the tasklet has been released.
- BUG/MAJOR: tasks: fix task accounting when killed
- CLEANUP: tasks: use the local state, not t->state, to check for tasklets
- DOC: acme: external account binding is not supported
- MINOR: hlua: ignore "tune.lua.bool-sample-conversion" if set after "lua-load"
- MEDIUM: peers: Give up if we fail to take locks in hot path
- MEDIUM: stick-tables: defer adding updates to a tasklet
- MEDIUM: stick-tables: Limit the number of old entries we remove
- MEDIUM: stick-tables: Limit the number of entries we expire
- MINOR: cfgparse-global: add explicit error messages in cfg_parse_global_env_opts
- MINOR: ssl: add function to extract X509 notBefore date in time_t
- BUILD: acme: need HAVE_ASN1_TIME_TO_TM
- MINOR: acme: move the acme task init in a dedicated function
- MEDIUM: acme: add a basic scheduler
- MINOR: acme: emit a log when the scheduler can't start the task
This patch implements a very basic scheduler for the ACME tasks.
The scheduler is a task which is started from the postparser function
when at least one acme section was configured.
The scheduler will loop over the certificates in the ckchs_tree, and for
each certificate will start an ACME task if the notAfter date is past
curtime + (notAfter - notBefore) / 12, or 7 days if notBefore is not
available.
Once the lookup over all certificates is terminated, the task will sleep
and will wake up after 12 hours.
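Reading the rule above as "renew once less than one twelfth of the validity
period remains" (about 7.5 days for a 90-day certificate, which matches the
7-day fallback), the per-certificate decision can be sketched as below.
Only acme_start_task() comes from the description; the time handling and
variable names are illustrative:

    /* Hedged sketch of the per-certificate renewal decision */
    time_t validity = notafter - notbefore;
    time_t margin   = notbefore ? validity / 12 : 7 * 24 * 3600;

    if (notafter <= curtime + margin)
            acme_start_task(store);    /* start renewal for this certificate */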
acme_start_task() is a dedicated function which starts an acme task
for a specified <store> certificate.
The initialization code was moved from the "acme renew" command parser to
this function, in order to be called from a scheduler.
When the env variable name or value is not provided for setenv/presetenv, it's
not clear from the old error message shown on stderr what exactly is missing.
The user needs to search their configuration.
Let's add more explicit error messages about these inconsistencies.
No need to be backported.
In process_table_expire(), limit the number of entries we remove in one
call, and just reschedule the task if there's more to do. Removing
entries requires using the heavily contended update write lock, and we
don't want to hold it for too long.
This helps getting stick tables perform better under heavy load.
Limit the number of old entries we remove in one call of
stktable_trash_oldest(), as we do so while holding the heavily contended
update write lock, so we'd rather not hold it for too long.
This helps getting stick tables perform better under heavy load.
There is a lot of contention trying to add updates to the tree. So
instead of trying to add the updates to the tree right away, just add
them to a mt-list (with one mt-list per thread group, so that the
mt-list does not become the new point of contention that much), and
create a tasklet dedicated to adding updates to the tree, in batchs, to
avoid keeping the update lock for too long.
This helps getting stick tables perform better under heavy load.
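A hedged sketch of the producer/consumer split described above. The list
and field names are illustrative; MT_LIST_APPEND and tasklet_wakeup are
haproxy's usual primitives:

    /* Producer side: on update, just queue the entry on the current
     * thread group's mt-list instead of touching the update tree.
     */
    MT_LIST_APPEND(&table->pend_updts[tgid - 1], &ts->pend_updts);
    tasklet_wakeup(table->updt_task);

    /* Consumer side (the dedicated tasklet) then drains the lists in
     * batches and inserts the entries into the update tree while holding
     * the update write lock, releasing it between batches.
     */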
In peer_send_msgs(), give up in order to retry later if we failed at
getting the update read lock.
Similarly, in __process_running_peer_sync(), give up and just reschedule
the task if we failed to get the peer lock. There is heavy contention
on both those locks, so we could spend a lot of time trying to get them.
This helps getting peers perform better under heavy load.
tune.lua.bool-sample-conversion must be set before any lua-load or
lua-load-per-thread is used for it to be considered. Indeed, lua-load
directives are parsed on the fly and will cause some parts of the scripts
to be executed during init already (script body/init contexts).
As such, we cannot afford to have "tune.lua.bool-sample-conversion" set
after some Lua code was loaded, because it would mean that the setting
would be handled differently for Lua's code executed during or after
config parsing.
To avoid ambiguities, the documentation now states that the setting must
be set before any lua-load(-per-thread) directive, and if the setting
is met after some Lua was already loaded, the directive is ignored and
a warning informs about that.
It should fix GH #2957
It may be backported with 29b6d8af16 ("MINOR: hlua: rename
"tune.lua.preserve-smp-bool" to "tune.lua.bool-sample-conversion"")
There's no point reading t->state to check for a tasklet after we've
atomically read the state into the local "state" variable. Not only it's
more expensive, it's also less clear whether that state is supposed to
be atomic or not. And in any case, tasks and tasklets have their type
forever and the one reflected in state is correct and stable.
After recent commit b81c9390f ("MEDIUM: tasks: Mutualize the TASK_KILLED
code between tasks and tasklets"), the task accounting was no longer
correct for killed tasks due to the decrement of tasks in list that was
no longer done, resulting in infinite loops in process_runnable_tasks().
This just illustrates that this code remains complex and should be further
cleaned up. No backport is needed, as this was in 3.2.
quic_conn_release() may, or may not, free the tasklet associated with
the connection. So make it return 1 if it was, and 0 otherwise, so that
if it was called from the tasklet handler itself, the said handler can
act accordingly and return NULL if the tasklet was destroyed.
This should be backported if 9240cd4a2771245fae4d0d69ef025104b14bfc23
is backported.
The next request was always leaving the task before initializing the
httpclient. This patch optimizes it by jumping to the next step at the
end of the current one. This way, only the httpclient is doing a
task_wakeup() to handle the response. But transiting from response to
the next request does not leave the task.