Fill cfg_crt_node with a filename and linenum so the post_section
callback can use it to emit errors.
This way the errors are emitted with the right filename and linenum
where ssl-f-use is used instead of (null):0
The extra call to pendconn_process_next_strm() made in commit cda7275ef5
("MEDIUM: queue: Handle the race condition between queue and dequeue
differently") was performed after releasing the server queue's lock,
which is incompatible with the calling convention for this function.
The result is random corruption of the server's streams list likely
due to picking old or incorrect pendconns from the queue, and in the
end infinitely looping on apparently already locked mt_list objects.
Just adding the lock fixes the problem.
It's very difficult to reproduce, it requires low maxconn values on
servers, stickiness on the servers (cookie), a long enough slowstart
(e.g. 10s), and regularly flipping servers up/down to re-trigger the
slowstart.
No backport is needed as this was only in 3.2.
Remove the 'acme ps' command which does not seem useful anymore with the
'acme status' command.
The big difference with the 'acme status' command is that it was only
displaying the running tasks instead of the status of all certificate.
The "acme status" command, shows the status of every certificates
configured with ACME, not only the running task like "acme ps".
The IO handler loops on the ckch_store tree and outputs a line for each
ckch_store which has an acme section set. This is still done under the
ckch_store lock and doesn't support resuming when the buffer is full,
but we need to change that in the future.
This reverts commit 53c3046898.
This patch introduced a regression leading to a loop on the frames
demultiplexing because a frame may be ignore but not consumed.
But outside this regression that can be fixed, there is a design issue that
was not totally fixed by the patch above. The SPOP connection state is mixed
with the status of the frames demultiplexer and this needlessly complexify
the connection management. Instead of fixing the fix, a better solution is
to revert it to work a a proper solution.
For the record, the idea is to deal with the spop connection state onlu
using 'state' field and to introduce a new field to handle the frames
demultiplexer state. This should ease the closing state management.
Another issue that must be fixed. We must take care to not abort a SPOP
stream when an error is detected on a SPOP connection or when the connection
is closed, if the ACK frame was already received for this stream. It is not
a common case, but it can be solved by saving the last known stream ID that
recieved a ACK.
This patch must be backported if the commit above is backported.
proxy_inc_fe_cum_sess_ver_ctr() was implemented in 9969adbc
("MINOR: stats: add by HTTP version cumulated number of sessions and
requests")
As its name suggests, it is meant to be called for frontends, not backends
Also, in 9969adbc, when used under h1_init(), a precaution is taken to
ensure that the function is only called with frontends.
However, this precaution was not applied in h2_init() and qc_init().
Due to this, it remains possible to have proxy_inc_fe_cum_sess_ver_ctr()
being called with a backend proxy as parameter. While it did not cause
known issues so far, it is not expected and could result in bugs in the
future. Better fix this by ensuring the function is only called with
frontends.
It may be backported up to 2.8
Based on the lock history, we can spot some locks that are still held
by checking the last operation that happened on them: if it's not an
unlock, then we know the lock is held. In this case we append the list
after "locked:" with their label and state like below:
U:QUEUE S:IDLE_CONNS U:IDLE_CONNS R:TASK_WQ U:TASK_WQ S:QUEUE S:QUEUE S:QUEUE locked: QUEUE(S)
S:IDLE_CONNS U:IDLE_CONNS S:TASK_RQ U:TASK_RQ S:QUEUE U:QUEUE S:IDLE_CONNS locked: IDLE_CONNS(S)
R:TASK_WQ S:TASK_WQ R:TASK_WQ S:TASK_WQ R:TASK_WQ S:TASK_WQ R:TASK_WQ locked: TASK_WQ(R)
W:STK_TABLE W:STK_TABLE_UPDT U:STK_TABLE_UPDT W:STK_TABLE W:STK_TABLE_UPDT U:STK_TABLE_UPDT W:STK_TABLE W:STK_TABLE_UPDT locked: STK_TABLE(W) STK_TABLE_UPDT(W)
The format is slightly different (label(status)) so as to easily
differentiate them visually from the history.
The fix in commit 09a325a4de ("BUG/MINOR: tools: always terminate empty
lines") is insufficient. While it properly addresses the lack of trailing
zero, it doesn't account for it in the returned outlen that is used to
allocate a larger line. This happens at boot if the very first line of
the test file is exactly a sharp with nothing else. In this case it will
return a length 0 and the caller (parse_cfg()) will try to re-allocate an
entry of size zero and will fail, bailing out a lack of memory. This time
it should really be OK.
It doesn't need to be backported, unless the patch above would be.
Since latest commit 7e4a2f39ef ("BUG/MINOR: tools: do not create an empty
arg from trailing spaces"), an empty line will no longer produce an arg
and no longer append a trailing zero to them. This was not visible because
one is already present in the input string, however all the trailing args
are set to out+outpos-1, which now points one char before the buffer since
nothing was emitted, and was noticed by ASAN, and/or when parsing garbage.
Let's make sure to always emit the zero for empty lines as well to address
this issue. No backport is needed unless the patch above gets backported.
Now when an empty arg is found on a line, we emit the sanitized
input line and the position of the first empty arg so as to help
the user figure the cause (likely an empty environment variable).
Co-authored-by: Valentine Krasnobaeva <vkrasnobaeva@haproxy.com>
In order to help parse_line() callers report the position of empty
args to the user, let's decide that if no error is emitted, then
we'll stuff the errptr with the position of the first empty arg
without affecting the return value.
Co-authored-by: Valentine Krasnobaeva <vkrasnobaeva@haproxy.com>
For historical reasons, the config parser relies on the trailing '\0'
to detect the end of the line being parsed. When the lines started to be
tokenized into arguments, this principle has been preserved, and now all
the parsers rely on *args[arg]='\0' to detect the end of a line. But as
reported in issue #2944, while most of the time it breaks the parsing
like below:
http-request deny if { path_dir '' }
it can also cause some elements to be silently ignored like below:
acl bad_path path_sub '%2E' '' '%2F'
This may also subtly happen with environment variables that don't exist
or which are empty:
acl bad_path path_sub '%2E' "$BAD_PATTERN" '%2F'
Fortunately, parse_line() returns the number of arguments found, so it's
easy from the callers to verify if any was empty. The goal of this commit
is not to perform sensitive changes, it's only to mention when parsing a
line that an empty argument was found and alert about its consequences
using a warning. Most of the time when this happens, the config does not
parse. But for examples as the ACLs above, there could be consequences
that are better detected early.
This patch depends on this previous fix:
BUG/MINOR: tools: do not create an empty arg from trailing spaces
Co-authored-by: Valentine Krasnobaeva <vkrasnobaeva@haproxy.com>
Trailing spaces on the lines of the config file create an empty arg
which makes it complicated to detect really empty args. Let's first
address this. Note that it is not user-visible but prevents from
fixing user-visible issues. No backport is needed.
The initial issue was introduced with this fix that already tried to
address it:
8a6767d266 ("BUG/MINOR: config: don't count trailing spaces as empty arg (v2)")
The current patch properly addresses leading and trailing spaces by
only counting arguments if non-lws chars were found on the line. LWS
do not cause a transition to a new arg anymore but they complete the
current one. The whole new code relies on a state machine to detect
when to create an arg (!in_arg->in_arg), and when to close the current
arg. A special care was taken for word expansion in the form of
"${ARGS[*]}" which still continue to emit individual arguments past
the first LWS. This example works fine:
ARGS="100 check inter 1000"
server name 192.168.1."${ARGS[*]}"
It properly results in 6 args:
"server", "name", "192.168.1.100", "check", "inter", "1000"
This fix should not have any visible user impact and is a bit tricky,
so it's best not to backport it, at least for a while.
Co-authored-by: Valentine Krasnobaeva <vkrasnobaeva@haproxy.com>
Previous patch 7251c13c7 ("MINOR: acme: move the acme task init in a dedicated
function") mistakenly returned the wrong error code when "acme renew" parsing
was successful, and tried to emit an error message.
This patch fixes the issue by returning 0 when the acme task was correctly
scheduled to start.
No backport needed.
As reported in GH #2958, commit 6c9b315 caused a regression with sc_*
fetches and tracked counter id > 9.
As such, the below configuration would cause a BUG_ON() to be triggered:
global
log stdout format raw local0
tune.stick-counters 11
defaults
log global
mode http
frontend www
bind *:8080
acl track_me bool(true)
http-request set-var(txn.track_var) str("a")
http-request track-sc10 var(txn.track_var) table rate_table if track_me
http-request set-var(txn.track_var_rate) sc_gpc_rate(0,10,rate_table)
http-request return status 200
backend rate_table
stick-table type string size 1k expire 5m store gpc_rate(1,1m)
While in 6c9b315 the src_fetch logic was removed from
smp_fetch_sc_stkctr(), num > 9 is indeed not expected anymore as
original num value. But what we didn't consider is that num is effectively
re-assigned for generic sc_* variant.
Thus the BUG_ON() is misplaced as it should only be evaluated for
non-generic fetches. It explains why it triggers with valid configurations
Thanks to GH user @tkjaer for his detailed report and bug analysis
No backport needed, this bug is specific to 3.2.
This patch implements a very basic scheduler for the ACME tasks.
The scheduler is a task which is started from the postparser function
when at least one acme section was configured.
The scheduler will loop over the certificates in the ckchs_tree, and for
each certificate will start an ACME task if the notAfter date is past
curtime + (notAfter - notBefore) / 12, or 7 days if notBefore is not
available.
Once the lookup over all certificates is terminated, the task will sleep
and will wakeup after 12 hours.
acme_start_task() is a dedicated function which starts an acme task
for a specified <store> certificate.
The initialization code was move from the "acme renew" command parser to
this function, in order to be called from a scheduler.
When env variable name or value are not provided for setenv/presetenv it's not
clear from the old error message shown at stderr, what exactly is missed. User
needs to search in it's configuration.
Let's add more explicit error messages about these inconsistencies.
No need to be backported.
In process_table_expire(), limit the number of entries we remove in one
call, and just reschedule the task if there's more to do. Removing
entries require to use the heavily contended update write lock, and we
don't want to hold it for too long.
This helps getting stick tables perform better under heavy load.
Limit the number of old entries we remove in one call of
stktable_trash_oldest(), as we do so while holding the heavily contended
update write lock, so we'd rather not hold it for too long.
This helps getting stick tables perform better under heavy load.
There is a lot of contention trying to add updates to the tree. So
instead of trying to add the updates to the tree right away, just add
them to a mt-list (with one mt-list per thread group, so that the
mt-list does not become the new point of contention that much), and
create a tasklet dedicated to adding updates to the tree, in batchs, to
avoid keeping the update lock for too long.
This helps getting stick tables perform better under heavy load.
In peer_send_msgs(), give up in order to retry later if we failed at
getting the update read lock.
Similarly, in __process_running_peer_sync(), give up and just reschedule
the task if we failed to get the peer lock. There is an heavy contention
on both those locks, so we could spend a lot of time trying to get them.
This helps getting peers perform better under heavy load.
tune.lua.bool-sample-conversion must be set before any lua-load or
lua-load-per-thread is used for it to be considered. Indeed, lua-load
directives are parsed on the fly and will cause some parts of the scripts
to be executed during init already (script body/init contexts).
As such, we cannot afford to have "tune.lua.bool-sample-conversion" set
after some Lua code was loaded, because it would mean that the setting
would be handled differently for Lua's code executed during or after
config parsing.
To avoid ambiguities, the documentation now states that the setting must
be set before any lua-load(-per-thread) directive, and if the setting
is met after some Lua was already loaded, the directive is ignored and
a warning informs about that.
It should fix GH #2957
It may be backported with 29b6d8af16 ("MINOR: hlua: rename
"tune.lua.preserve-smp-bool" to "tune.lua.bool-sample-conversion"")
There's no point reading t->state to check for a tasklet after we've
atomically read the state into the local "state" variable. Not only it's
more expensive, it's also less clear whether that state is supposed to
be atomic or not. And in any case, tasks and tasklets have their type
forever and the one reflected in state is correct and stable.
After recent commit b81c9390f ("MEDIUM: tasks: Mutualize the TASK_KILLED
code between tasks and tasklets"), the task accounting was no longer
correct for killed tasks due to the decrement of tasks in list that was
no longer done, resulting in infinite loops in process_runnable_tasks().
This just illustrates that this code remains complex and should be further
cleaned up. No backport is needed, as this was in 3.2.
quic_conn_release() may, or may not, free the tasklet associated with
the connection. So make it return 1 if it was, and 0 otherwise, so that
if it was called from the tasklet handler itself, the said handler can
act accordingly and return NULL if the tasklet was destroyed.
This should be backported if 9240cd4a27
is backported.
The next request was always leaving the task befor initializing the
httpclient. This patch optimize it by jumping to the next step at the
end of the current one. This way, only the httpclient is doing a
task_wakeup() to handle the response. But transiting from response to
the next request does not leave the task.
Add an extra parametre to conn_create_mux(), "closed_connection".
If a pointer is provided, then let it know if the connection was closed.
Callers have no way to determine that otherwise, and we need to know
that, at least in ssl_sock_io_cb(), as if the connection was closed we
need to return NULL, as the tasklet was free'd, otherwise that can lead
to memory corruption and crashes.
This should be backported if 9240cd4a27
is backported too.
The code to handle a task/tasklet when it's been killed before it were
to run is mostly identical, so move it outside of task and tasklet
specific code, and inside the common code.
This commit is just cosmetic, and should have no impact.
TASK_QUEUED was used to mean "the task has been scheduled to run",
TASK_IN_LIST was used to mean "the tasklet has been scheduled to run",
remove TASK_IN_LIST and just use TASK_QUEUED for tasklets instead.
This commit is just cosmetic, and should not have any impact.
There is some code that should run no matter if the task was killed or
not, and was needlessly duplicated, so only use one instance.
This also fixes a small bug when a tasklet that got killed before it
could run would still count as a tasklet that ran, when it should not,
which just means that we'd run one less useful task before going back to
the poller.
This commit is mostly cosmetic, and should not have any impact.
The code that checks if we're currently running, and waits if so, was
identical between tasks and tasklets, so move it in code common to tasks
and tasklets.
This commit is just cosmetic, and should not have any impact.
Use acme_ctx_destroy() instead of a simple free() upon error in the
"acme renew" error handling.
It's better to use this function to be sure than everything has been
been freed.
Implement a way to display the running acme tasks over the CLI.
It currently only displays a "Running" status with the certificate name
and the acme section from the configuration.
The displayed running tasks are limited to the size of a buffer for now,
it will require a backref list later to be called multiple times to
resume the list.
{listener,proxy,server}_get_stats() methods are know to be expensive,
expecially if used under an iteration. Indeed, while automatic yield
is performed every X lua instructions (defaults to 10k), computing an
object's stats 10K times in a single cpu loop is not desirable and
could create contention.
In this patch we leverage hlua_yield_asap() at the end of *_get_stats()
methods in order to force the automatic yield to occur ASAP after the
method returns. Hopefully this should help in similar scenarios as the
one described in GH #2903
When called, this function will try to enforce a yield (if available) as
soon as possible. Indeed, automatic yield is already enforced every X
Lua instructions. However, there may be some cases where we know after
running heavy operation that we should yield already to avoid taking too
much CPU at once.
This is what this function offers, instead of asking the user to manually
yield using "core.yield()" from Lua itself after using an expensive
Lua method offered by haproxy, we can directly enforce the yield without
the need to do it in the Lua script.
The previous commit has implemented a new calcul method for
MAX_STREAM_DATA frame emission. Now, a frame may be emitted as soon as a
buffer was consumed by a QCS instance.
This will probably increase the number of MAX_STREAM_DATA frame
emission. It may even cause a series of frame emitted for the same
stream with increasing values under high load, which is completely
unnecessary.
To improve this, limit the number of MAX_STREAM_DATA frames built to one
per QCS instance. This is implemented by storing a reference to this
frame in QCS structure via a new member <tx.msd_frm>.
Note that to properly reset QCS msd_frm member, emission of flow-control
frames have been changed. Now, each frame is emitted individually. On
one side, it is better as it prevent to emit frames related to different
streams in a single datagram, which is not desirable in case of packet
loss. However, this can also increase sendto() syscall invocation.
Recently, QCS Rx allocation buffer method has been improved. It is now
possible to allocate multiple buffers per QCS instances, which was
necessary to improve HTTP/3 POST throughput.
However, a limitation remained related to the emission of
MAX_STREAM_DATA. These frames are only emitted once at least half of the
receive capacity has been consumed by its QCS instance. This may be too
restrictive when a client need to upload a large payload.
Improve this by adjusting MAX_STREAM_DATA allocation. If QCS capacity is
still limited to 1 or 2 buffers max, the old calcul is still used. This
is necessary when user has limited upload throughput via their
configuration. If QCS capacity is more than 2 buffers, a new frame is
emitted if at least a buffer was consumed.
This patch has reduced number of STREAM_DATA_BLOCKED frames received in
POST tests with some specific clients.
Becaues of a typo, '||' was used instead of '|' to test the SPOP conneciton
flags and decide if the mux is ready or not. The regression was introduced
in the commit fd7ebf117 ("BUG/MEDIUM: mux-spop: Wait end of handshake to
declare a spop connection ready").
This patch must be backported to 3.1 with the commit above.
The newly added SSL traces require an extra 'conn' parameter to
ssl_sock_chose_sni_ctx which was added in the "regular" code but not in
the wolfssl specific one.
Wolfssl also has a different prototype for some getter functions
(SSL_get_servername for instance), which do not expect a const SSL while
openssl version does.
h1_destroy() may be called to release a H1C after a multiplexer upgrade. In
that case, the connection is no longer attached to the H1C. It must not be
used in the h1 trace message because the connection context is no longer a H1C.
Because of this bug, when a H1>H2 upgrade is performed, a crash may be
experienced if the H1 traces are enabled.
This patch must be backport to all stable versions.
When an applicative upgrade of the H1 multiplexer is performed, we must not
pretend the connection was released. Indeed, in that case, a H1 stream is
still their with a stream connector attached on it. It must be detached
first before releasing the H1 connection and the underlying connection. So
it is important to not pretend the connection was already released.
Concretely, in that case h1_process() must return 0 instead of -1. It is
minor error because, AFAIK, it is harmless. But it is not correct. So let's
fix it to avoid futur bugs.
To be clear, this happens when a TCP connection is upgraded to H1 connection
and a H2 preface is detected, leading to a second upgrade from H1 to H2.
This patch may be backport to all stable versions.
In the SPOE specification, when an error occurred on the SPOP connection,
HAProxy must send a DISCONNECT frame and wait for the agent DISCONNECT frame
in return before trully closing the connection.
However, this part was not properly handled by the SPOP multiplexer. In this
case, the SPOP connection should be in the CLOSING state. But this state was
not used at all. Depending on when the error was encountered, the connection
could be closed immediately, without sending any DISCONNECT frame. It was
the case when an early error was detected during the AGENT-HELLO frame
parsing. Or it could be moved from ERROR to FRAME_H state, as if no error
were detected. This case was less dramatic than it seemed because some flags
were also set to prevent any problem. But it was not obvious.
So now, the SPOP connection is properly switch to CLOSING state when an
DISCONNECT is sent to the agent to be able to wait for its DISCONNECT in
reply. spop_process_demux() was updated to parse frames in that state and
some validity checks was added.
This patch must be backport to 3.1.
A SPOP connection must not be considered as ready while the hello handshake
is not finished with success. In addition, no error or shutdown must have
been reported for the underlying connection. Otherwise a freshly openned
spop connexion may be reused while it is in fact dead, leading to a
connection retry.
This patch must be backported to 3.1.
We had to parse the sigAlg extension by hand in order to properly select
the certificate used by the SSL frontends. These traces allow to dump
the allowed sigAlg list sent by the client in its clientHello.
This callback allows to pick the used certificate on an SSL frontend.
The certificate selection is made according to the information sent by
the client in the clientHello. The traces that were added will allow to
better understand what certificate was chosen and why. It will also warn
us if the chosen certificate was the default one.
The actual certificate parsing happens in ssl_sock_chose_sni_ctx. It's
in this function that we actually get the filename of the certificate
used.
If OCSP stapling fails because of a missing or invalid OCSP response we
used to silently disable stapling for the given session. We can now know
a bit more what happened regarding OCSP stapling.
Those traces dump information about the multiple SSL_do_handshake calls
(renegotiation and regular call). Some errors coud also be dumped in
case of rejected early data.
Depending on the chosen verbosity, some information about the current
handshake can be dumped as well (servername, tls version, chosen cipher
for instance).
In case of failed handshake, the error codes and messages will also be
dumped in the log to ease debugging.
Add a dedicated trace for some unlikely allocation failures and async
errors. Those traces will ostly be used to identify the start and end of
a given SSL connection.
This function can be used to convert a TLSv1.3 sigAlg entry (2bytes)
from the signature_agorithms client hello extension into a string.
In order to ease debugging, some TLSv1.2 combinations can also be
dumped. In TLSv1.2 those signature algorithms pairs were built out of a
one byte signature identifier combined to a one byte hash identifier.
In TLSv1.3 those identifiers are two bytes blocs that must be treated as
such.
In relation to issue #2954, it appears that turning some size_t length
calculations to the int that uses my_strndup() upsets coverity a bit.
Instead of dealing with such warnings each time, better address it at
the root. An inspection of all call places show that the size passed
there is always positive so we can safely use an unsigned type, and
size_t will always suit it like for strndup() where it's available.
Since 91e785edc ("MINOR: stream: Rely on a per-stream max connection
retries value"), px->conn_retries may be ignored in the following cases:
* proxy not part of a list which gets properly post-init (ie: main proxy
list, log-forward list, sink list)
* proxy lacking the CAP_FE capability
Documenting such cases where the px->conn_retries is set but effectively
ignored, so that we either remove ignored statements or fix them in
the future if they are really needed. In fact all cases affected here are
automomous applets that already handle the retries themselves so the fact
that 91e785edc made ->conn_retries ineffective should not be a big deal
anyway.
when dns session callback (dns_session_release()) is called upon error
(ie: when some pending queries were not sent), we try our best to
re-create the applet in order to preserve the pending queries and give
them a chance to be retried. This is done at the end of
dns_session_release().
However, doing so exposes to an issue: if the error preventing queries
from being sent is still encountered over and over the dns session could
stay there indefinitely. Meanwhile, other dns sessions may be created on
the same dns_stream_server periodically. If previous failing dns sessions
don't terminate but we also keep creating new ones, we end up accumulating
failing sessions on a given dns_stream_server, which can eventually cause
ressource shortage.
This issue was found when trying to address ("BUG/MINOR: dns: add tempo
between 2 connection attempts for dns servers")
To fix it, we track the number of failed consecutive sessions for a given
dns server. When we reach the threshold (set to 100), we consider that the
link to the dns server is broken (at least temporarily) and we force
dns_session_new() to fail, so that we stop creating new sessions until one
of the existing one eventually succeeds.
A workaround for this fix consists in setting the "maxconn" parameter on
nameserver directive (under resolvers section) to a reasonnable value so
that no more than "maxconn" sessions may co-exist on the same server at
a given time.
This may be backported to all stable versions.
("CLEANUP: dns: remove unused dns_stream_server struct member") may be
backported to ease the backport.
As reported by Lukas Tribus on the mailing list [1], trying to connect to
a nameserver with invalid network settings causes haproxy to retry a new
connection attempt immediately which eventually causes unexpected CPU usage
on the thread responsible for the applet (namely 100% on one CPU will be
observed).
This can be reproduced with the test config below:
resolvers default
nameserver ns1 tcp4@8.8.8.8:53 source 192.168.99.99
listen listen
mode http
bind :8080
server s1 www.google.com resolvers default init-addr none
To fix this the issue, we add a temporisation of one second between a new
connection attempt is retried. We do this in dns_session_create() when we
know that the applet was created in the release callback (when previous
query attempt was unsuccessful), which means initial connection is not
affected.
[1]: https://www.mail-archive.com/haproxy@formilux.org/msg45665.html
This should fix GH #2909 and may be backported to all stable versions.
This patch depends on ("MINOR: applet: add appctx_schedule() macro")
"virt@acme" was the default map used during development, now this must
be configured in the acme section or it won't try to use any map.
This patch removes the references to virt@acme in the comments and the
code.
The stateless mode which was documented previously in the ACME example
is not convenient for all use cases.
First, when HAProxy generates the account key itself, you wouldn't be
able to put the thumbprint in the configuration, so you will have to get
the thumbprint and then reload.
Second, in the case you are using multiple account key, there are
multiple thumbprint, and it's not easy to know which one you want to use
when responding to the challenger.
This patch allows to configure a map in the acme section, which will be
filled by the acme task with the token corresponding to the challenge,
as the key, and the thumbprint as the value. This way it's easy to reply
the right thumbprint.
Example:
http-request return status 200 content-type text/plain lf-string "%[path,field(-1,/)].%[path,field(-1,/),map(virt@acme)]\n" if { path_beg '/.well-known/acme-challenge/' }
Define a new settings tune.quic.frontend.max-tot-window. It contains a
size argument which can be used to set a limit on the sum of all QUIC
connections congestion window. This is applied both on
quic_cc_path_set() and quic_cc_path_inc().
Note that this limitation cannot reduce a congestion window more than
the minimal limit which is set to 2 datagrams.
Use the newly defined cshared type to account for the sum of congestion
window of every QUIC connection. This value is stored in global counter
quic_mem_global defined in proto_quic module.
Congestion window is limit by a minimal and maximum values which can
never be exceeded. Min value is hardcoded to 2 datagrams as recommended
by the specification. Max value is specified via haproxy configuration.
These values must be respected each time the congestion window size is
adjusted. However, in some rare occasions, limit were not always
enforced. Fix this by implementing wrappers to set or increment the
congestion window. These functions ensure limits are always applied
after the operation.
Additionnally, wrappers also ensure that if window reached a new maximum
value, it is saved in <cwnd_last_max> field.
This should be backported up to 2.6, after a brief period of
observation.
Write minor adjustments to QUIC BBR functions. The objective is to
centralize every modification of path cwnd field.
No functional change. This patch will be useful to simplify
implementation of global QUIC Tx memory usage limitation.
There was some possible confusion between fields related to congestion
window size min and max limit which cannot be exceeded, and the maximum
value previously reached by the window.
Fix this by adopting a new naming scheme. Enforced limit are now renamed
<limit_max>/<limit_min>, while the previously reached max value is
renamed <cwnd_last_max>.
This should be backported up to 3.1.
The account creation was mistakenly ending the task instead of being
wakeup for the NewOrder state, it was preventing the creation of the
certificate, however the account was correctly created.
To fix this, only the jump to the end label need to be remove, the
standard leaving codepath of the function will allow to be wakeup.
No backport needed.
TCP_NOTSENT_LOWAT is very convenient as it indicates when to report
EAGAIN on the sending side. It takes a margin on top of the estimated
window, meaning that it's no longer needed to store too many data in
socket buffers. Instead there's just enough to fill the send window
and a little bit of margin to cover the scheduling time to restart
sending. Experiments on a 100ms network have shown a 10-fold reduction
in the memory used by socket buffers by just setting this value to
tune.bufsize, without noticing any performance degradation. Theoretically
the responsiveness on multiplexed protocols such as H2 should also be
improved.
The ping frame's pattern must be written at offset 9 (frame header
length), not 8. This was added in 3.2 with commit 4dcfe098a6 ("MINOR:
mux-h2: prepare to support PING emission"), so no backport is needed.
While humans can find it convenient to enter the worker process in prompt
mode, for external tools it will not be convenient to have to systematically
disable it. A better approach is to replicate the master socket's mode
there, since it has already been configured to suit the user: interactive,
prompt and timed modes are automatically passed to the worker process.
This makes the using the worker commands more natural from the master
process, without having to systematically adapt it for each new connection.
Support the same syntax in master mode as in worker mode in order to
configure the prompt. The only thing is that for now the master doesn't
have a non-interactive mode and it doesn't seem necessary to implement
it, so we only support the interactive and prompt modes. However the code
was written in a way that makes it easy to change this later if desired.
Now the prompt mode can more finely be configured between non-interactive
(default), interactive without prompt, and interactive with prompt. This
will ease the usage from automated tools which are not necessarily
interested in having to consume '> ' after each command nor displaying
"+" on payload lines. This can also be convenient when coming from the
master CLI to keep the same output format.
The CLI's "prompt" command toggles two distinct things:
- displaying or hiding the prompt at the beginning of the line
- single-command vs interactive mode
These are two independent concepts and the prompt mode doesn't
always cope well with tools that would like to upload data without
having to read the prompt on return. Also, the master command line
works in interactive mode by default with no prompt, which is not
consistent (and not convenient for tools). So let's start by splitting
the bit in two, and have a new APPCTX_CLI_ST1_INTER flag dedicated
to the interactive mode. For now the "prompt" command alone continues
to toggle the two at once.
Generate the private key on the account file when the file does not
exists. This generate a private key of the type and parameters
configured in the acme section.
This will display the lock labels and modes for each non-empty step
at the end of "show threads" when these are defined. This allows to
emit up to the last 8 locking operation for each thread on 64 bit
machines.
We now default the value to zero and make sure all tests properly take
care of values above zero. This is in preparation for supporting several
degrees of debugging.
Machines lacking CAS8B/DWCAS and emit a warning in lb_fwlc.c without
threads due to declaration ordering. Let's just move the variable
declaration into the block that uses it as a last variable. No
backport is needed.
Not all systems have strndup(), that's why we have our "my_strndup()",
so let's make use of it here. This fixes the build on Solaris 10. No
backport is needed.
It has been requested to have the current_session_rate exposed at the
frontend level. For now only the per-process value was exposed
(ST_I_INF_SESS_RATE).
Thanks to the work done lately to merge promex and stat_cols_px[]
array, let's simply defined an .alt_name for the ST_I_PX_RATE metric in
order to have promex exposing it as current_session_rate for relevant
contexts.
Add a -t option to 'show ssl sni', allowing to add an offset to the
current date so it would allow to check which certificates are expired
after a certain period of time.
Since we made it possible for a bind_conf to listen to multiple thread
groups with shards in 2.8 with commit 9d360604bd ("MEDIUM: listener:
rework thread assignment to consider all groups"), the per-listener
connection count was not properly transferred to the target listener
with the connection when switching to another thread group. This results
in one listener possibly reaching high values and another one possibly
reaching negative values. Usually it's not visible, unless a maxconn is
set on the bind_conf, in which case comparisons will quickly put an end
to the willingness to accept new connections.
This problem only happens when thread groups are enabled, and it seems
very hard to trigger it normally, it only impacts sockets having a single
shard, hence currently the CLI (or any conf with "bind ... shards 1"),
where it can be reproduced with a config having a very low "maxconn" on
the stats socket directive (here, 4), and issuing a few tens of
socat <<< "show activity" in parallel, or sending HTTP connections to a
single-shared listener. Very quickly, haproxy stops accepting connections
and eats CPU in the poller which tries to get its connections accepted.
A BUG_ON(l->nbconn<0) after HA_ATOMIC_DEC() in listener_release() also
helps spotting them better.
Many thanks to Christian Ruppert who once again provided a very accurate
report in GH #2951 with the required data permitting this analysis.
This fix must be backported to 2.8.
tasklets were originally designed to alway run on only one thread, so it
was not possible to have it run on 2 threads concurrently.
The API has been extended so that another thread may wake the tasklet,
the idea was still that we wanted to have it run on one thread only.
However, the way it's been done meant that unless a tasklet was bound to
a specific tid with tasklet_set_tid(), or we explicitely used
tasklet_wakeup_on() to specify the thread for the target to run on, it
would be scheduled to run on the current thread.
This is in fact a desirable feature. There is however a race condition
in which the tasklet would be scheduled on a thread, while it is running
on another. This could lead to the same tasklet to run on multiple
threads, which we do not want.
To fix this, just do what we already do for regular tasks, set the
"TASK_RUNNING" flag, and when it's time to execute the tasklet, wait
until that flag is gone.
Only one case has been found in the current code, where the tasklet
could run on different threads depending on who wakes it up, in the
leastconn load balancer, since commit
627280e15f.
It should not be a problem in practice, as the function called can be
called concurrently.
If a bug is eventually found in relation to this problem, and this patch
should be backported, the following patches should be backported too :
MEDIUM: quic: Make sure we return the tasklet from quic_accept_run
MEDIUM: quic: Make sure we return NULL in quic_conn_app_io_cb if needed
MEDIUM: quic: Make sure we return the tasklet from qcc_io_cb
MEDIUM: mux_fcgi: Make sure we return the tasklet from fcgi_deferred_shut
MEDIUM: listener: Make sure w ereturn the tasklet from accept_queue_process
MEDIUM: checks: Make sure we return the tasklet from srv_chk_io_cb
In quic_conn_app_io_cb, make sure we return NULL if the tasklet has been
destroyed, so that the scheduler knows. It is not yet needed, but will
be soon.
On systems where the network is not reachable at boot time (certain HA
systems for example, or dynamically addressed test machines), we'll want
to be able to periodically revalidate the IPv6 reachability status. The
current code makes it complicated because it sets the config bits once
for all at boot time. This commit changes this so that the config bits
are not changed, but instead we rely on a static inline function that
relies on sock_inet6_seems_reachable for every test (really cheap). This
also removes the now unneeded resolvers late init code.
This variable for now is still set at boot time but this will ease the
transition later, as the resolvers code is now ready for this.
The new adhoc parser for the '@@' prefix forgot to require the presence
of the LF character marking the end of the line. This is the reason why
entering incomplete commands would display garbage, because the line was
expected to have its LF character replaced with a zero.
The problem is well illustrated by using socat in raw mode:
socat /tmp/master.sock STDIO,raw,echo=0
then entering "@@1 show info" one character at a time would error just
after the second "@". The command must take care to report an incomplete
line and wait for more data in such a case.
This reverts commit 0e94339eaf.
This patch was in fact fixing the symptom, not the cause. The root cause
of the problem is that the parser was processing an incomplete line when
looking for '@@'. When the LF is present, this problem does not exist
as it's properly replaced with a zero. This can be verified using socat
in raw mode:
socat /tmp/master.sock STDIO,raw,echo=0
Then entering "@@1 show info" one character at a time will immediately
fail on "@@" without going further. A subsequent patch will fix this.
No backport is needed.
When the CLI applet was refactord in the commit 20ec1de21 ("MAJOR: cli:
Refacor parsing and execution of pipelined commands"), a regression was
introduced. The applet shutdown was not longer handled when the applet was
waiting for the next command line. It is especially visible when a client
timeout occurred because the client connexion is no longer closed.
To fix the issue, the test on the SE_FL_SHW flag was reintroduced in
CLI_ST_PARSE_CMDLINE state, but only is there is no pending input data.
It is a 3.2-specific issue. No backport needed.
When the ACME task is checking for the status of the challenge, it would
only succeed or retry upon failure.
However that's not the best way to do it, ACME objects contain an
"status" field which could have a final status or a in progress status,
so we need to be able to retry.
This patch adds an acme_ret enum which contains OK, RETRY and FAIL.
In the case of the CHKCHALLENGE, the ACME could return a "pending" or a
"processing" status, which basically need to be rechecked later with the
RETRY. However a "invalid" or "valid" status is final and will return
either a FAIL or a OK.
So instead of retrying in any case, the "invalid" status will ends the
task with an error.
Parse the Retry-After header in response and store it in order to use
the value as the next delay for the next retry, fallback to 3s if the
value couldn't be parse or does not exist.
Instead of always having to force IPv4 or IPv6, let's now also offer
"auto" which will only enable IPv6 if the system has a default gateway
for it. This means that properly configured dual-stack systems will
default to "ipv4,ipv6" while those lacking a gateway will only use
"ipv4". Note that no real connectivity test is performed, so firewalled
systems may still get it wrong and might prefer to rely on a manual
"ipv4" assignment.
In order to ease dual-stack deployments, we could at least try to
check if ipv6 seems to be reachable. For this we're adding a test
based on a UDP connect (no traffic) on port 53 to the base of
public addresses (2001::) and see if the connect() is permitted,
indicating that the routing table knows how to reach it, or fails.
Based on this result we're setting a global variable that other
subsystems might use to preset their defaults.
In order to ease troubleshooting and testing, the new "-4" command line
argument enforces queries and processing of "A" DNS records only, i.e.
those representing IPv4 addresses. This can be useful when a host lack
end-to-end dual-stack connectivity. This overrides the global
"dns-accept-family" directive and is equivalent to value "ipv4".
By default, DNS resolvers accept both IPv4 and IPv6 addresses. This can be
influenced by the "resolve-prefer" keywords on server lines as well as the
family argument to the "do-resolve" action, but that is only a preference,
which does not block the other family from being used when it's alone. In
some environments where dual-stack is not usable, stumbling on an unreachable
IPv6-only DNS record can cause significant trouble as it will replace a
previous IPv4 one which would possibly have continued to work till next
request. The "dns-accept-family" global option permits to enforce usage of
only one (or both) address families. The argument is a comma-delimited list
of the following words:
- "ipv4": query and accept IPv4 addresses ("A" records)
- "ipv6": query and accept IPv6 addresses ("AAAA" records)
When a single family is used, no request will be sent to resolvers for the
other family, and any response for the othe family will be ignored. The
default value is "ipv4,ipv6", which effectively enables both families.
When '@@' alone is sent on the master CLI (no trailing LF), we get an
error that displays anything past these two characters in the buffer
since there's no room for a \0. Let's make sure to limit the length of
the process name in this case. No backport is needed since this was added
with 00c967fac4 ("MINOR: master/cli: support bidirectional communications
with workers").
When a service is initialized, the "use-service" rule that was executed is
now saved in the stream, using "current_rule" field, instead of saving it
into the applet context. It is safe to do so becaues this field is unused at
this stage. To avoid any issue, it is reset after the service
initialization. Doing so, it is no longer necessary to save it in the applet
context. It was the last usage of the rule pointer in the applet context.
The init functions for TCP and HTTP lua services were updated accordingly.
The lua function name was used in error messages of HTTP/TCP lua services
while the applet name can be used. Concretely, this will not change
anything, because when a lua service is regiestered, the lua function name
is used to name the applet. But it is easier, cleaner and more logicial
because it is really the applet name that should be displayed in these error
messages.
Thanks to this change, when a response is delivered from the cache, it is no
longer necessary to get the cache filter configuration from the http
"use-cache" rule saved in the appctx to get the currently used cache. It was
a bit complex to get an info that can be directly and naturally stored in
the cache applet context.
There are several fields in the appctx structure only used by the CLI. To
make things cleaner, all these fields are now placed in a dedicated context
inside the appctx structure. The final goal is to move it in the service
context and add an API for cli commands to get a command coontext inside the
cli context.
Thanks to the CLI refactoring ("MAJOR: cli: Refacor parsing and execution of
pipelined commands"), it is possible to fix "show event" I/O handle function
to no longer use the SC.
When the applet API was refactored to no longer manipulate the channels or
the stream-connectors, this part was missed. However, without the patch
above, it could not be fixed. It is now possible so let's do it.
This patch must not be backported becaues it depends on refactoring of the
CLI applet.
Thanks to the CLI refactoring ("MAJOR: cli: Refacor parsing and execution of
pipelined commands"), it is possible to fix the I/O handler function used by
lua CLI commands to no longer use the SC.
When the applet API was refactored to no longer manipulate the channels or
the stream-connectors, this part was missed. However, without the patch
above, it could not be fixed. It is now possible so let's do it.
This patch must not be backported becaues it depends on refactoring of the
CLI applet.
CLI_ST_GETREQ state was renamed into CLI_ST_PARSE_CMDLINE and CLI_ST_PARSEREQ
into CLI_ST_PROCESS_CMDLINE to reflect the real action performed in these
states.
Before this patch, when pipelined commands were received, each command was
parsed and then excuted before moving to the next command. Pending commands
were not copied in the input buffer of the applet. The major issue with this
way to handle commands is the impossibility to consume inputs from commands
with an I/O handler, like "show events" for instance. It was working thanks
to a "bug" if such commands were the last one on the command line. But it
was impossible to use them followed by another command. And this prevents us
to implement any streaming support for CLI commands.
So we decided to refactor the command line parsing to have something similar
to a basic shell. Now an entire line is parsed, including the payload,
before starting commands execution. The command line is copied in a
dedicated buffer. "appctx->chunk" buffer is used for this purpose. It was an
unsed field, so it is safe to use it here. Once the command line copied, the
commands found on this line are executed. Because the applet input buffer
was flushed, any input can be safely consumed by the CLI applet and is
available for the command I/O handler. Thanks to this change, "show event
-w" command can be followed by a command. And in theory, it should be
possible to implement commands supporting input data streaming. For
instance, the Tetris like lua applet can be used on the CLI now.
Note that the payload, if any, is part of the command line and must be fully
received before starting the commands processing. It means there is still
the limitation to a buffer, but not only for the payload but for the whole
command line. The payload is still necessarily at the end of the command
line and is passed as argument to the last command. Internally, the
"appctx->cli_payload" field was introduced to point on the payload in the
command line buffer.
This patch is quite huge but it cannot easily be splitted. It should not
introduced significant changes.
When a bidirection connection with no command is establisehd with a worker
(so "@@<pid>" alone), a "prompt" command is automatically added to display
the worker's prompt and enter in interactive mode in the worker context.
However, till now, an unfinished command line is sent, with a semicolon
instead of a newline at the end. It is not exactly a bug because this
works. But it is not really expected and could be a problem for future
changes.
So now, a full command line is sent: the "prompt" command finished by a
newline character.
When a command is parsed to split it in an array of arguments, by default,
at most 64 arguments are supported. But no warning was emitted when there
were too many arguments. Instead, the arguments above the limit were
silently ignored. It could be an issue for some commands, like "add server",
because there was no way to know some arguments were ignored.
Now an error is issued when too many arguments are passed and the command is
not executed.
This patch should be backported to all stable versions.
Add an early return to qcc_decode_qcs() if QCC instance is flagged on
error and connection is scheduled for immediate closure.
The main objective is to ensure to not trigger BUG_ON() from
qcc_set_error() : if a stream decoding has set the connection error, do
not try to process decoding on other streams as they may also encounter
an error. Thus, the connection is closed asap with the first encountered
error case.
This should be backported up to 2.6, after a period of observation.
With the support of multiple Rx buffers per QCS instance, stream
decoding in qcc_io_recv() has been reworked for the next haproxy
release. An issue appears in a double while loop : a break statement is
used in the inner loop, which is not sufficient as it should instead
exit from the outer one.
Fix this by replacing break with a goto statement.
No need to backport this.
Remove return statement in h3_rcv_buf() in case of stream/connection
error. Instead, reuse already existing label err. This simplifies the
code path. It also fixes the missing leave trace for these cases.
Use a customized proxy for the ACME client.
The proxy is initialized at the first acme section parsed.
The proxy uses the httpsclient log format as ACME CA use HTTPS.
Add an experimental "https" log-format for the httpclient, it is not
used by the httpclient by default, but could be define in a customized
proxy.
The string is basically a httpslog, with some of the fields replaced by
their backend equivalent or - when not available:
"%ci:%cp [%tr] %ft -/- %TR/%Tw/%Tc/%Tr/%Ta %ST %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r %[bc_err]/%[ssl_bc_err,hex]/-/-/%[ssl_bc_is_resumed] -/-/-"
The 'pause' HTTP action can now be used to suspend for a moment the message
analysis. A timeout, expressed in milliseconds using a time-format
parameter, or an expression can be used. If an expression is used, errors
and invalid values are ignored.
Internally, the action will set the analysis expiration date on the
corresponding channel to the configured value and it will yield while it is
not expired.
The 'pause' action is available for 'http-request' and 'http-response'
rules.
When a SPOP connection is opened, the maximum size for frames is negociated.
This negociated size is properly used when a frame is received and if a too
big frame is detected, an error is triggered. However, the same was not
performed on the sending path. No check was performed on frames sent to the
agent. So it was possible to send frames bigger than the maximum size
supported by the the SPOE agent.
Now, the size of NOTIFY and DISCONNECT frames is checked before sending them
to the agent.
Thanks to Miroslav to have reported the issue.
This patch must be backported to 3.1.
Since the commit "MINOR: hlua/h1: Use http_parse_cont_len_header() to parse
content-length value", this function is no longer used. So it can be safely
removed.
Till now, h1_parse_cont_len_header() was used during the H1 message parsing and
by the lua HTTP applets to parse the content-length header value. But a more
generic function was added some years ago doing exactly the same operations. So
let's use it instead.
Thanks to the commit "MINOR: mux-h1: Don't remove custom "Content-Length: 0"
header in 1xx and 204 messages", we are now sure that 1xx and 204 responses
were sanitized during the parsing. So, if one of these headers are found in
such responses when sent to the client, it means it was added by hand, via a
"set-header" action for instance. In this context, we are able to make an
exception for the "Content-Length: 0" header, and only this one with this
value, to not break leagacy applications.
So now, a user can force the "Content-Length: 0" header to appear in 1xx and
204 responses by adding the right action in hist configuration.
"Transfer-Encoding" headers are still dropped as "Content-Length" headers
with another value than 0. Note, that in practice, only 101 and 204 are
concerned because other 1xx message are not subject to HTTP analysis.
This patch should fix the issue #2888. There is no reason to backport
it. But if we do so, the patch above must be backported too.
According to the RFC9110 and RFC9112, a server must not add 'Content-Length'
or 'Transfer-Encoding' headers into 1xx and 204 responses. So till now,
these headers were dropped from the response when it is sent to the client.
However, it seems more logical to remove it during the message parsing. In
addition to sanitize messages as early as possible, this will allow us to
apply some exception in some cases (This will be the subject of another
patch).
In this patch, 'Content-Length' and 'Transfer-Encoding' headers are removed
from 1xx and 204 responses during the parsing but the same is still
performed during the formatting stage.
In RFC9110, it is stated that trailers could be merged with the
headers. While it should be performed with a speicial care, it may be a
problem for some applications. To avoid any trouble with such applications,
two new options were added to drop trailers during the message forwarding.
On the backend, "http-drop-request-trailers" option can be enabled to drop
trailers from the requests before sending them to the server. And on the
frontend, "http-drop-response-trailers" option can be enabled to drop
trailers from the responses before sending them to the client. The options
can be defined in defaults sections and disabled with "no" keyword.
This patch should fix the issue #2930.
This changes commit d2a9149f0 ("BUG/MINOR: proxy: always detach a proxy
from the names tree on free()") to be cleaner. Aurlien spotted that
the free(p->id) was indeed already done in proxy_free_common(), which is
called before we delete the node. That's still a bit ugly and it only
works because ebpt_delete() does not dereference the key during the
operation. Better play safe and delete the entry before freeing it,
that's more future-proof.
Stephen Farrell reported in issue #2942 that recent haproxy versions
crash if there's no resolv.conf. A quick bisect with his reproducer
showed that it started with commit 4194f75 ("MEDIUM: tree-wide: avoid
manually initializing proxies") which reorders the proxies initialization
sequence a bit. The crash shows a corrupted tree, typically indicating a
use-after-free. With the help of ASAN it was possible to find that a
resolver proxy had been destroyed and freed before the name insertion
that causes the crash, very likely caused by the absence of the needed
resolv.conf:
#0 0x7ffff72a82f7 in free (/usr/local/lib64/libasan.so.5+0x1062f7)
#1 0x94c1fd in free_proxy src/proxy.c:436
#2 0x9355d1 in resolvers_destroy src/resolvers.c:2604
#3 0x93e899 in resolvers_create_default src/resolvers.c:3892
#4 0xc6ed29 in httpclient_resolve_init src/http_client.c:1170
#5 0xc6fbcf in httpclient_create_proxy src/http_client.c:1310
#6 0x4ae9da in ssl_ocsp_update_precheck src/ssl_ocsp.c:1452
#7 0xa1b03f in step_init_2 src/haproxy.c:2050
But free_proxy() doesn't delete the ebpt_node that carries the name,
which perfectly explains the situation. This patch simply deletes the
name node and Stephen confirmed that it fixed the problem for him as
well. Let's also free it since the key points to p->id which is never
freed either in this function!
No backport is needed since the patch above was first merged into
3.2-dev10.
In check_config_validity(), we initialize the proxy in several stages.
We do so for the sink list for stage1, but not for stage2. It may not be
needed right now, but it may become needed in the future, so do it
anyway.
Now that all there is one tree per thread group, all thread groups will
start on the same server. To prevent that, just insert the servers in a
different order for each thread group.
When using the round-robin load balancer, the major source of contention
is the lbprm lock, that has to be held every time we pick a server.
To mitigate that, make it so there are one tree per thread-group, and
one lock per thread-group. That means we now have a lb_fwrr_per_tgrp
structure that will contain the two lb_fwrr_groups (active and backup) as well
as the lock to protect them in the per-thread lbprm struct, and all
fields in the struct server are now moved to the per-thread structure
too.
Those changes are mostly mechanical, and brings good performances
improvment, on a 64-cores AMD CPU, with 64 servers configured, we could
process about 620000 requests par second, and we now can process around
1400000 requests per second.
Move the "next_weight" outside of fwrr_group, and inside struct lb_fwrr
directly, one for the active servers, one for the backup servers.
We will soon have one fwrr_group per thread group, but next_weight will
be global to all of them.
Add a pointer to the server into the struct srv_per_tgroup, so that if
we only have access to that srv_per_tgroup, we can come back to the
corresponding server.
Move the call to initialize the proxy's per-thread structure earlier
than currently done, so that they are usable when we're initializing the
load balancers.
Move the code responsible for calling per-thread server initialization
earlier than it was done, so that per-thread structures are available a
bit later, when we initialize load-balancing.
This patch contains this 4 new fetches and doc changes for the new fetches:
- req.ssl_cipherlist
- req.ssl_sigalgs
- req.ssl_keyshare_groups
- req.ssl_supported_groups
Towards:#2532
Now we can simply call is_sched_alive() on the local thread to verify
that the scheduler is still ticking instead of having to keep a copy of
the ctxsw and comparing it. It's cleaner, doesn't require to maintain
a local copy, doesn't rely on activity[] (whose purpose is mainly for
observation and debugging), and shows how this could be extended later
to cover other use cases. Practically speaking this doesn't change
anything however, the algorithm is still the same.
This verifies that the scheduler is still ticking without having to
access the activity[] array nor keeping local copies of the ctxsw
counter. It just tests and sets a flag that is reset after each
return from a ->process() function.
TH_FL_DUMPING_OTHERS was being used to try to perform exclusion between
threads running "show threads" and those producing warnings. Now that it
is much more cleanly handled, we don't need that type of protection
anymore, which was adding to the complexity of the solution. Let's just
get rid of it.
It has been noticed quite a few times during troubleshooting and even
testing that warnings can happen in avalanches from multiple threads
at the same time, and that their reporting it interleaved bacause the
output is produced in small chunks. Originally, this code inspired by
the panic code aimed at making sure to log whatever could be emitted
in case it would crash later. But this approach was wrong since writes
are atomic, and performing 5 writes in sequence in each dumping thread
also means that the outputs can be mixed up at 5 different locations
between multiple threads. The output of warnings is never very long,
and the stack-based buffer is 4kB so let's just concatenate everything
in the buffer and emit it at once using a single write(). Now there's
no longer this confusion on the output.
Since we no longer call it with a foreign thread, let's simplify its code
and get rid of the special cases that were relying on ha_thread_dump_fill()
and synchronization with a remote thread. We're not only dumping the
current thread so ha_thread_dump_one() is sufficient.
Warnings remain tricky to deal with, especially for other threads as
they require some inter-thread synchronization that doesn't cope very
well with other parallel activities such as "show threads" for example.
However there is nothing that forces us to handle them this way. The
panic for example is already handled by bouncing the WDT signal to the
faulty thread.
This commit rearranges the WDT handler to make a better used of this
existing signal bouncing feature of the WDT handler so that it's no
longer limited to panics but can also deal with warnings. In order not
to bounce on all wakeups, we only bounce when there is a suspicion,
that is, when the warning timer has been crossed. We'll let the target
thread verify the stuck flag and context switch count by itself to
decide whether or not to panic, warn, or just do nothing and update
the counters.
As a bonus, now all warning traces look the same regardless of the
reporting thread:
call trace(16):
| 0x6bc733 <01 00 00 e8 6d e6 de ff]: ha_dump_backtrace+0x73/0x309 > main-0x2570
| 0x6bd37a <00 00 00 e8 d6 fb ff ff]: ha_thread_dump_fill+0xda/0x104 > ha_thread_dump_one
| 0x6bd625 <00 00 00 e8 7b fc ff ff]: ha_stuck_warning+0xc5/0x19e > ha_thread_dump_fill
| 0x7b2b60 <64 8b 3b e8 00 aa f0 ff]: wdt_handler+0x1f0/0x212 > ha_stuck_warning
| 0x7fd7e2cef3a0 <00 00 00 00 0f 1f 40 00]: libpthread:+0x123a0
| 0x7ffc6af9e634 <85 a6 00 00 00 0f 01 f9]: linux-vdso:__vdso_gettimeofday+0x34/0x2b0
| 0x6bad74 <7c 24 10 e8 9c 01 df ff]: sc_conn_io_cb+0x9fa4 > main-0x2400
| 0x67c457 <89 f2 4c 89 e6 41 ff d0]: main+0x1cf147
| 0x67d401 <48 89 df e8 8f ed ff ff]: cli_io_handler+0x191/0xb38 > main+0x1cee80
| 0x6dd605 <40 48 8b 45 60 ff 50 18]: task_process_applet+0x275/0xce9
The goal is to let the caller deal with the pointer so that the function
only has to fill that buffer without worrying about locking. This way,
synchronous dumps from "show threads" are produced and emitted directly
without causing undesired locking of the buffer nor risking causing
confusion about thread_dump_buffer containing bits from an interrupted
dump in progress.
It's only the caller that's responsible for notifying the requester of
the end of the dump by setting bit 0 of the pointer if needed (i.e. it's
only done in the debug handler).
This function was initially designed to dump any threadd into the presented
buffer, but the way it currently works is that it's always called for the
current thread, and uses the distinction between coming from a sighandler
or being called directly to detect which thread is the caller.
Let's simplify all this by replacing thr with tid everywhere, and using
the thread-local pointers where it makes sense (e.g. th_ctx, th_ctx etc).
The confusing "from_signal" argument is now replaced with "is_caller"
which clearly states whether or not the caller declares being the one
asking for the dump (the logic is inverted, but there are only two call
places with a constant).
We don't need to copy the old dump pointer to the thread_dump_pointer
area anymore to indicate a dump is collected. It used to be done as an
artificial way to keep the pointer for the post-mortem analysis but
since we now have this pointer stored separately, that's no longer
needed and it simplifies the mechanim to reset it.
Instead of using the thread dump buffer for post-mortem analysis, we'll
keep a copy of the assigned pointer whenever it's used, even for warnings
or "show threads". This will offer more opportunities to figure from a
core what happened, and will give us more freedom regarding the value of
the thread_dump_buffer itself. For example, even at the end of the dump
when the pointer is reset, the last used buffer is now preserved.
If a thread is dumping itself (warning, show thread etc) and another one
wants to dump the state of all threads (e.g. panic), it may interrupt the
first one during backtrace() and re-enter it from the signal handler,
possibly triggering a deadlock in the underlying libc. Let's postpone
the debug signal delivery at this point until the call ends in order to
avoid this.
If a thread is currently resolving a symbol while another thread triggers
a thread dump, the current thread may enter the debug handler and call
resolve_sym_addr() again, possibly deadlocking if the underlying libc
uses locking. Let's postpone the debug signal delivery in this area
during the call. This will slow the resolution a little bit but we don't
care, it's not supposed to happen often and it must remain rock-solid.
This is an extension of eb41d768f ("MINOR: tools: use only opportunistic
symbols resolution"). It also makes sure we're not calling dladddr() in
parallel to dladdr_and_size(), as a preventive measure against some
potential deadlocks in the inner layers of the libc.
A few functions are used when debugging debug signals and watchdog, but
being static, they're not resolved and are hard to spot in dumps, and
they appear as any random other function plus an offset. Let's just not
mark them static anymore, it only hurts:
- cli_io_handler_show_threads()
- debug_run_cli_deadlock()
- debug_parse_cli_loop()
- debug_parse_cli_panic()
In the following trace trying to abuse the watchdog from the CLI's
"debug dev loop" command running in parallel to "show threads" loops,
it's clear that some re-entrance may happen in ha_thread_dump_fill().
A first minimal fix consists in using a test-and-set on the flag
indicating that the function is currently dumping threads, so that
the one from the signal just returns. However the caller should be
made more reliable to serialize all of this, that's for future
work.
Here's an example capture of 7 threads stuck waiting for each other:
(gdb) bt
#0 0x00007fe78d78e147 in sched_yield () from /lib64/libc.so.6
#1 0x0000000000674a05 in ha_thread_relax () at src/thread.c:356
#2 0x00000000005ba4f5 in ha_thread_dump_fill (thr=2, buf=0x7ffdd8e08ab0) at src/debug.c:402
#3 ha_thread_dump_fill (buf=0x7ffdd8e08ab0, thr=<optimized out>) at src/debug.c:384
#4 0x00000000005baac4 in ha_stuck_warning (thr=thr@entry=2) at src/debug.c:840
#5 0x00000000006a360d in wdt_handler (sig=<optimized out>, si=<optimized out>, arg=<optimized out>) at src/wdt.c:156
#6 <signal handler called>
#7 0x00007fe78d78e147 in sched_yield () from /lib64/libc.so.6
#8 0x0000000000674a05 in ha_thread_relax () at src/thread.c:356
#9 0x00000000005ba4c2 in ha_thread_dump_fill (thr=2, buf=0x7fe78f2d6420) at src/debug.c:426
#10 ha_thread_dump_fill (buf=0x7fe78f2d6420, thr=2) at src/debug.c:384
#11 0x00000000005ba7c6 in cli_io_handler_show_threads (appctx=0x2a89ab0) at src/debug.c:548
#12 0x000000000057ea43 in cli_io_handler (appctx=0x2a89ab0) at src/cli.c:1176
#13 0x00000000005d7885 in task_process_applet (t=0x2a82730, context=0x2a89ab0, state=<optimized out>) at src/applet.c:920
#14 0x0000000000659002 in run_tasks_from_lists (budgets=budgets@entry=0x7ffdd8e0a5c0) at src/task.c:644
#15 0x0000000000659bd7 in process_runnable_tasks () at src/task.c:886
#16 0x00000000005cdcc9 in run_poll_loop () at src/haproxy.c:2858
#17 0x00000000005ce457 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3075
#18 0x0000000000430628 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:3665
As seen in issue #2860, there are some situations where a watchdog could
trigger during the debug signal handler, and where similarly the debug
signal handler may trigger during the wdt handler. This is really bad
because it could trigger some deadlocks inside inner libc code such as
dladdr() or backtrace() since the code will not protect against re-
entrance but only against concurrent accesses.
A first attempt was made using ha_sigmask() but that's not always very
convenient because the second handler is called immediately after
unblocking the signal and before returning, leaving signal cascades in
backtrace. Instead, let's mark which signals to block at registration
time. Here we're blocking wdt/dbg for both signals, and optionally
SIGRTMAX if DEBUG_DEV is used as that one may also be used in this case.
This should be backported at least to 3.1.
The function must make sure to return NULL for foreign threads and
the local buffer for the current thread in this case, otherwise panics
(and sometimes even warnings) will segfault when USE_THREAD_DUMP is
disabled. Let's slightly re-arrange the function to reduce the #if/else
since we have to specifically handle the case of !USE_THREAD_DUMP anyway.
This needs to be backported wherever b8adef065d ("MEDIUM: debug: on
panic, make the target thread automatically allocate its buf") was
backported (at least 2.8).
This commit extends MUX H2 connection reversal step to properly take
into account the new idle-ping feature. It first ensures that h2c task
is properly instantiated/freed depending now on both timers and
idle-ping configuration. Also, h2c_update_timeout() is now called
instead of manually requeuing the task, which ensures the proper timer
value is selected depending on the new connection side.
If idle-ping is activated and h2c task is expired due to missing PING
ACK, consider that the peer is away and the connection can be closed
immediately. GOAWAY emission is thus skipped.
A new test is necessary in h2c_update_timeout() when PING ACK is
currently expected, but the next timer expiration selected is not
idle-ping. This may happen if http-keep-alive/http-request timers are
selected first. In this case, H2_CF_IDL_PING_SENT flag is resetted. This
is necessary to not prevent GOAWAY emission on expiration.
This commit is the counterpart of the previous one, adapted on the
frontend side. "idle-ping" is added as keyword to bind lines, to be able
to refresh client timeout of idle frontend connections.
H2 MUX behavior remains similar as the previous patch. The only
significant change is in h2c_update_timeout(), as idle-ping is now taken
into account also for frontend connection. The calculated value is
compared with http-request/http-keep-alive timeout value. The shorter
delay is then used as expired date. As hr/ka timeout are based on
idle_start, this allows to run them in parallel with an idle-ping timer.
This commit implements support for idle-ping on the backend side. First,
a new server keyword "idle-ping" is defined in configuration parsing. It
is used to set the corresponding new server member.
The second part of this commit implements idle-ping support on H2 MUX. A
new inlined function conn_idle_ping() is defined to access connection
idle-ping value. Two new connection flags are defined H2_CF_IDL_PING and
H2_CF_IDL_PING_SENT. The first one is set for idle connections via
h2c_update_timeout().
On h2_timeout_task() handler, if first flag is set, instead of releasing
the connection as before, the second flag is set and tasklet is
scheduled. As both flags are now set, h2_process_mux() will proceed to
PING emission. The timer has also been rearmed to the idle-ping value.
If a PING ACK is received before next timeout, connection timer is
refreshed. Else, the connection is released, as with timer expiration.
Also of importance, special care is needed when a backend connection is
going to idle. In this case, idle-ping timer must be rearmed. Thus a new
invokation of h2c_update_timeout() is performed on h2_detach().
Adapt the already existing function h2c_ack_ping(). The objective is to
be able to emit a PING request. First, it is renamed as h2c_send_ping().
A new boolean argument <ack> is used to emit either a PING request or
ack.
Reorganize code for timeout calculation in case the connection is idle.
The objective is to better reflect the relations between each timeouts
as follow :
* if GOAWAY already emitted, use shut-timeout, or if unset fallback to
client/server one. However, an already set timeout is never erased.
* else, for frontend connection, http-request or keep-alive timeout is
applied depending on the current demux state. If the selected value is
unset, fallback to client timeout
* for backend connection, no timeout is set to perform http-reuse
This commit is pure refactoring, so no functional change should occur.
Since the following commit, MUX H2 timeout function has been slightly
exetended.
d38d8c6ccb
BUG/MEDIUM: mux-h2: make sure control frames do not refresh the idle timeout
A side-effect of this patch is that now backend idle connection expire
timer is not reset if already defined. This means that if a timer was
registered prior to the connection transition to idle, the connection
would be destroyed on its timeout. If this happens for enough
connection, this may have an impact on the reuse rate.
In practice, this case should be rare, as h2c timer is set to
TICK_ETERNITY while there is active streams. The timer is not refreshed
most of the time before going the transition to idle, so the connection
won't be deleted on expiration.
The only case where it could occur is if there is still pending data
blocked on emission on stream detach. Here, timeout server is applied on
the connection. When the emission completes, the connection goes to
idle, but the timer will still armed, and thus will be triggered on the
idle connection.
To prevent this, explicitely reset h2c timer to TICK_ETERNITY for idle
backend connection via h2c_update_timeout().
This patch is explicitely not scheduled for backport for now, as it is
difficult to estimate the real impact of the previous code state.
GOAWAY emission should not be emitted before preface. Thus, max_id field
from h2c acting as a server is initialized to -1, which prevents its
emission until preface is received from the peer. If acting as a client,
max_id is initialized to a valid value on the first h2s emission.
This causes an issue with reverse HTTP on the active side. First, it
starts as a client, so the peer does not emit a preface but instead a
simple SETTINGS frame. As role are switched, max_id is initialized much
later when the first h2s response is emitted. Thus, if the connection
must be terminated before any stream transfer, GOAWAY cannot be emitted.
To fix this, ensure max_id is initialized to 0 on h2_conn_reverse() for
active connect side. Thus, a GOAWAY indicating that no stream has been
handled can be generated.
Note that passive connect side is not impacted, as it max_id is
initialized thanks to preface reception.
This should be backported up to 2.9.
Active connect on reverse http relies on connect timeout to detect
connection failure. Thus, if this timeout was unset, connection failure
may not be properly detected.
Fix this by fallback on hardcoded value of 1s for connect if timeout is
unset in the configuration. This is considered as a minor bug, as
haproxy advises against running with timeout unset.
This must be backported up to 2.9.
While reviewing HTTP/2 MUX timeout, it seems there is a possibility that
MUX task is requeued via h2c_update_timeout() with an already expired
date. This can happens with idle connections on two cases :
* first with shut timeout, as timer is not refreshed if already set
* second with http-request and keep-alive timers, which are based on
idle_start
Queuing an already expired task is an undefined behavior. Fix this by
using task_wakeup() instead of task_queue() at the end of
h2c_update_timeout() if such case occurs.
This should be backported up to 2.6.
Jacques Heunis from bloomberg reported on the mailing list [1] that
with haproxy 2.8 up to master, yielding from a Lua tcp service while
data was still buffered inside haproxy would eat some data which was
definitely lost.
He provided the reproducer below which turned out to be really helpful:
global
log stdout format raw local0 info
lua-load haproxy_yieldtest.lua
defaults
log global
timeout connect 10s
timeout client 1m
timeout server 1m
listen echo
bind *:9090
mode tcp
tcp-request content use-service lua.print_input
haproxy_yieldtest.lua:
core.register_service("print_input", "tcp", function(applet)
core.Info("Start printing input...")
while true do
local inputs = applet:getline()
if inputs == nil or string.len(inputs) == 0 then
core.Info("closing input connection")
return
end
core.Info("Received line: "..inputs)
core.yield()
end
end)
And the script below:
#!/usr/bin/bash
for i in $(seq 1 9999); do
for j in $(seq 1 50); do
echo "${i}_foo_${j}"
done
sleep 2
done
Using it like this:
./test_seq.sh | netcat localhost 9090
We can clearly see the missing data for every "foo" burst (every 2
seconds), as they are holes in the numbering.
Thanks to the reproducer, it was quickly found that only versions
>= 2.8 were affected, and that in fact this regression was introduced
by commit 31572229e ("MEDIUM: hlua/applet: Use the sedesc to report and
detect end of processing")
In fact in 31572229e 2 mistakes were made during the refaco.
Indeed, both in hlua_applet_tcp_fct() (which is involved in the reproducer
above) and hlua_applet_http_fct(), the request (buffer) is now
systematically consumed when returning from the function, which wasn't the
case prior to this commit: when HLUA_E_AGAIN is returned, it means a
yield was requested and that the processing is not done yet, thus we
should not consume any data, like we did prior to the refacto.
Big thanks to Jacques who did a great job reproducing and reporting this
issue on the mailing list.
[1]: https://www.mail-archive.com/haproxy@formilux.org/msg45778.html
It should be backported up to 2.8 with commit 31572229e
Change the representation of the start-line URI when parsing a HTTP/3
request into HTX. Adopt the same conversion as HTTP/2. If :authority
header is used (default case), the URI is encoded using absolute-form,
with scheme, host and path concatenated. If only a plain host header is
used instead, fallback to the origin form.
This commit may cause some configuration to be broken if parsing is
performed on the URI. Indeed, now most of the HTTP/3 requests will be
represented with an absolute-form URI at the stream layer.
Note that prior to this commit a check was performed on the path used as
URI to ensure that it did not contain any invalid characters. Now, this
is directly performed on the URI itself, which may include the path.
This must not be backported.
Ensure that the HTX start-line generated after parsing an HTTP/3 request
does not contain any invalid character, i.e. control or whitespace
characters.
Note that for now path is used directly as URI. Thus, the check is
performed directly over it. A patch will change this to generate an
absolute-form URI in most cases, but it won't be backported to avoid
configuration breaking in stable versions.
This must be backported up to 2.6.
RFC 9114 specifies some requirements for :path pseudo-header when using
http or https scheme. This commit enforces this by rejecting a request
if needed. Thus, path cannot be empty, and it must either start with a
'/' character or contains only '*'.
This must be backported up to 2.6.
As specified in RFC 9114, connection headers required special care in
HTTP/3. When a request is received with connection headers, the stream
is immediately closed. Conversely, when translating the response from
HTX, such headers are not encoded but silently ignored.
However, "upgrade" was not listed in connection headers. This commit
fixes this by adding a check on it both on request parsing and response
encoding.
This must be backported up to 2.6.
This commit does a similar job than the previous one, but it acts now on
the response path. Any leading or trailing whitespaces characters from a
HTX block header value are removed, prior to the header encoding via
QPACK.
This must be backported up to 2.6.
Remove any leading and trailing whitespace from header field values
prior to inserting a new HTX header block. This is done when parsing a
HEADERS frame, both as headers and trailers.
This must be backported up to 2.6.
Free the acme_ctx task context once the task is done.
It frees everything but the config and the httpclient,
everything else is free.
The ckch_store is freed in case of error, but when the task is
successful, the ptr is set to NULL to prevent the free once inserted in
the tree.
This patch registers the task in the ckch_store so we don't run 2 tasks
at the same time for a given certificate.
Move the task creation under the lock and check if there was already a
task under the lock.
Exponential backoff values was multiplied by 3000 instead of 3 with a
second to ms conversion. Leading to a 9000000ms value at the 2nd
attempt.
Fix the issue by setting the value in seconds and converting the value
in tick_add().
No backport needed.
When receiving the final certificate, it need to be loaded by
ssl_sock_load_pem_into_ckch(). However this function will remove any
existing private key in the struct ckch_store.
In order to fix the issue, the ptr to the key is swapped with a NULL
ptr, and restored once the new certificate is commited.
However there is a discrepancy when there is an error in
ssl_sock_load_pem_into_ckch() fails and the pointer is lost.
This patch fixes the issue by restoring the pointer in the error path.
This must fix issue #2933.
Add a 1MB ring buffer called "dpapi" for communication with the
dataplane API. It would first be used to transmit ACME informations to
the dataplane API but could be used for more.
A server abort must be handled by the request analyzers only when the
response forwarding was already started. Otherwise, it it the responsability
of the response analyzer to detect this event. L7-retires and conditions to
decide to silently close a client conneciotn are handled by this analyzer.
Because a reused server connections closed too early could be detected at
the wrong place, it was possible to get a 502/SH instead of a silent close,
preventing the client to safely retries its request.
Thanks to this patch, we are able to silently close the client connection in
this case and eventually to perform a L7 retry.
This patch must be backported as far as 2.8.
During the response payload forwarding, if the back SC is closed, we try to
figure out if it is because of a client abort or a server abort. However,
the condition was not accurrate, especially when abortonclose option is
set. Because of this issue, a server abort may be reported (SD-- in logs)
instead of a client abort (CD-- in logs).
The right way to detect a client abort when we try to forward the response
is to test if the back SC was shut down (SC_FL_SHUT_DOWN flag set) AND
aborted (SC_FL_ABRT_DONE flag set). When these both flags are set, it means
the back connection underwent the shutdown, which should be converted to a
client abort at this stage.
This patch should be backported as far as 2.8. It should fix last strange SD
report in the issue #2749.
>>> CID 1609049: Code maintainability issues (UNUSED_VALUE)
>>> Assigning value "NULL" to "new_ckchs" here, but that stored value is overwritten before it can be used.
592 struct ckch_store *old_ckchs, *new_ckchs = NULL;
Coverity reported an issue where a variable is initialized to NULL then
directry overwritten with another value. This doesn't arm but this patch
removes the useless initialization.
Must fix issue #2932.
In call traces, we're interested in seeing the code that was executed, not
the code that was not yet. The return address is where the CPU will return
to, so we want to see the bytes that precede this location. In the example
below on x86 we can clearly see a number of direct "call" instructions
(0xe8 + 4 bytes). There are also indirect calls (0xffd0) that cannot be
exploited but it gives insights about where the code branched, which will
not always be the function above it if that one used tail branching for
example. Here's an example dump output:
call ------------,
v
0x6bd634 <64 8b 38 e8 ac f7 ff ff]: debug_handler+0x84/0x95
0x7fa4ea2593a0 <00 00 00 00 0f 1f 40 00]: libpthread:+0x123a0
0x752132 <00 00 00 00 00 90 41 55]: htx_remove_blk+0x2/0x354
0x5b1a2c <4c 89 ef e8 04 07 1a 00]: main+0x10471c
0x5b5f05 <48 89 df e8 8b b8 ff ff]: main+0x108bf5
0x60b6f4 <89 ee 4c 89 e7 41 ff d0]: tcpcheck_eval_send+0x3b4/0x14b2
0x610ded <00 00 00 e8 53 a5 ff ff]: tcpcheck_main+0x7dd/0xd36
0x6c5ab4 <48 89 df e8 5c ab f4 ff]: wake_srv_chk+0xc4/0x3d7
0x6c5ddc <48 89 f7 e8 14 fc ff ff]: srv_chk_io_cb+0xc/0x13
For code dumps, dumping from the return address is pointless, what is
interesting is to dump before the return address to read the machine
code that was executed before branching. Let's just make the function
support negative sizes to indicate that we're dumping this number of
bytes to the address instead of this number from the address. In this
case, in order to distinguish them, we're using a '<' instead of '[' to
start the series of bytes, indicating where the bytes expand and where
they stop. For example we can now see this:
0x6bd634 <64 8b 38 e8 ac f7 ff ff]: debug_handler+0x84/0x95
0x7fa4ea2593a0 <00 00 00 00 0f 1f 40 00]: libpthread:+0x123a0
0x752132 <00 00 00 00 00 90 41 55]: htx_remove_blk+0x2/0x354
0x5b1a2c <4c 89 ef e8 04 07 1a 00]: main+0x10471c
0x5b5f05 <48 89 df e8 8b b8 ff ff]: main+0x108bf5
0x60b6f4 <89 ee 4c 89 e7 41 ff d0]: tcpcheck_eval_send+0x3b4/0x14b2
0x610ded <00 00 00 e8 53 a5 ff ff]: tcpcheck_main+0x7dd/0xd36
0x6c5ab4 <48 89 df e8 5c ab f4 ff]: wake_srv_chk+0xc4/0x3d7
0x6c5ddc <48 89 f7 e8 14 fc ff ff]: srv_chk_io_cb+0xc/0x13
These counters can have a noticeable cost on large machines, though not
dramatic. There's no single good choice to keep them enabled or disabled.
This commit adds multiple choices:
- DEBUG_COUNTERS set to 2 will automatically enable them by default, while
1 will disable them by default
- the global "debug.counters on/off" will allow to change the setting at
boot, regardless of DEBUG_COUNTERS as long as it was at least 1.
- the CLI "debug counters on/off" will also allow to change the value at
run time, allowing to observe a phenomenon while it's happening, or to
disable counters if it's suspected that their cost is too high
Finally, the "debug counters" command will append "(stopped)" at the end
of the CNT lines when these counters are stopped.
Not that the whole mechanism would easily support being extended to all
counter types by specifying the types to apply to, but it doesn't seem
useful at all and would require the user to also type "cnt" on debug
lines. This may easily be changed in the future if it's found relevant.
Till now the per-line glitches counters were only enabled with the
confusingly named DEBUG_GLITCHES (which would not turn glitches off
when disabled). Let's instead change it to DEBUG_COUNTERS and make sure
it's enabled by default (though it can still be disabled with
-DDEBUG_GLITCHES=0 just like for DEBUG_STRICT). It will later be
expanded to cover more counters.
It's easy to leave some trailing \n or even other characters that can
mangle the debug output. Let's verify at boot time that the debug sections
are clean by checking for chars 0x20 to 0x7e inclusive. This is very simple
to do and it managed to find another one in a multi-line message:
[WARNING] (23696) : Invalid character 0x0a at position 96 in description string at src/cli.c:2516 _send_status()
This way new offending code will be spotted before being committed.
These ones were added by mistake during the change of the cfgparse
mechanism in 3.1, but they're corrupting the output of "debug counters"
by leaving stray ']' on their own lines. We could possibly check them
all once at boot but it doens't seem worth it.
This should be backported to 3.1.
Following error is triggered at linker invokation, when we try to compile with
USE_THREAD=0 and -O0.
make -j 8 TARGET=linux-glibc USE_LUA=1 USE_PCRE2=1 USE_LINUX_CAP=1 \
USE_MEMORY_PROFILING=1 OPT_CFLAGS=-O0 USE_THREAD=0
/usr/bin/ld: src/thread.o: warning: relocation against `thread_cpus_enabled_at_boot' in read-only section `.text'
/usr/bin/ld: src/thread.o: in function `thread_detect_count':
/home/vk/projects/haproxy/src/thread.c:1619: undefined reference to `thread_cpus_enabled_at_boot'
/usr/bin/ld: /home/vk/projects/haproxy/src/thread.c:1619: undefined reference to `thread_cpus_enabled_at_boot'
/usr/bin/ld: /home/vk/projects/haproxy/src/thread.c:1620: undefined reference to `thread_cpus_enabled_at_boot'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
collect2: error: ld returned 1 exit status
make: *** [Makefile:1044: haproxy] Error 1
thread_cpus_enabled_at_boot is only available when we compiled with
USE_THREAD=1, which is the default for the most targets now.
In some cases, we need to recompile in mono-thread mode, thus
thread_cpus_enabled_at_boot should be protected with USE_THREAD in
thread_detect_count().
thread_detect_count() is always called during the process initialization
never mind of multi thread support. It sets some defaults in global.nbthread
and global.nbtgroups.
This patch is related to GitHub issue #2916.
No need to be backported as it was added in 3.2-dev9 version.
When receiving the final certificate, it need to be loaded by
ssl_sock_load_pem_into_ckch(). However this function will remove any
existing private key in the struct ckch_store.
In order to fix the issue, the ptr to the key is swapped with a NULL
ptr, and restored once the new certificate is commited.
However there is a discrepancy when there is an error in
ssl_sock_load_pem_into_ckch() fails and the pointer is lost.
This patch fixes the issue by restoring the pointer in the error path.
This must fix issue #2933.
This step is the latest to have a usable ACME certificate in haproxy.
It looks for the previous certificate, locks the "BIG CERTIFICATE LOCK",
copy every instance, deploys new ones, remove the previous one.
This is done in one step in a function which does not yield, so it could
be problematic if you have thousands of instances to handle.
It still lacks the rate limit which is mandatory to be used in
production, and more cleanup and deinit.
Copy the original ckch_store instead of creating a new one. This allows
to inherit the ckch_conf from the previous structure when doing a
ckchs_dup(). The ckch_conf contains the SAN for ACME.
Free the previous PKEY since it a new one is generated.
Handle new members of the ckch_conf in ckchs_dup() and
ckch_conf_clean().
This could be automated at some point since we have the description of
all types in ckch_conf_kws.
Once the Order status is "valid", the certificate URL is accessible,
this patch implements the retrieval of the certificate which is stocked
in ctx->store.
Generate the X509_REQ using the generated private key and the SAN from
the configuration. This is only done once before the task is started.
It could probably be done at the beginning of the task with the private
key generation once we have a scheduler instead of a CLI command.
This patch implements a check on the challenge URL, once haproxy asked
for the challenge to be verified, it must verify the status of the
challenge resolution and if there weren't any error.
This patch sends the "{}" message to specify that a challenge is ready.
It iterates on every challenge URL in the authorization list from the
acme_ctx.
This allows the ACME server to procede to the challenge validation.
https://www.rfc-editor.org/rfc/rfc8555#section-7.5.1
This patch implements the retrieval of the challenges objects on the
authorizations URLs. The challenges object contains a token and a
challenge url that need to be called once the challenge is setup.
Each authorization URLs contain multiple challenge objects, usually one
per challenge type (HTTP-01, DNS-01, ALPN-01... We only need to keep the
one that is relevent to our configuration.
This patch implements the newOrder action in the ACME task, in order to
ask for a new certificate, a list of SAN is sent as a JWS payload.
the ACME server replies a list of Authorization URLs. One Authorization
is created per SAN on a Order.
The authorization URLs are stored in a linked list of 'struct acme_auth'
in acme_ctx, so we can get the challenge URLs from them later.
The location header is also store as it is the URL of the order object.
https://datatracker.ietf.org/doc/html/rfc8555#section-7.4
This patch implements the retrival of the KID (account identifier) using
the pkey.
A request is sent to the newAccount URL using the onlyReturnExisting
option, which allow to get the kid of an existing account.
acme_jws_payload() implement a way to generate a JWS payload using the
nonce, pkey and provided URI.
ACME requests are supposed to be sent with a Nonce, the first Nonce
should be retrieved using the newNonce URI provided by the directory.
This nonce is stored and must be replaced by the new one received in the
each response.
The first request of the ACME protocol is getting the list of URLs for
the next steps.
This patch implements the first request and the parsing of the response.
The response is a JSON object so mjson is used to parse it.
The "acme renew" command launch the ACME task for a given certificate.
The CLI parser generates a new private key using the parameters from the
acme section..
This commit allows to configure the generated private keys, you can
configure the keytype (RSA/ECDSA), the number of bits or the curves.
Example:
acme LE
uri https://acme-staging-v02.api.letsencrypt.org/directory
account account.key
contact foobar@example.com
challenge HTTP-01
keytype ECDSA
curves P-384
Add new acme keywords for the ckch_conf parsing, which will be used on a
crt-store, a crt line in a frontend, or even a crt-list.
The cfg_postparser_acme() is called in order to check if a section referenced
elsewhere really exists in the config file.
Add a configuration parser for the new acme section, the section is
configured this way:
acme letsencrypt
uri https://acme-staging-v02.api.letsencrypt.org/directory
account account.key
contact foobar@example.com
challenge HTTP-01
When unspecified, the challenge defaults to HTTP-01, and the account key
to "<section_name>.account.key".
Section are stored in a linked list containing acme_cfg structures, the
configuration parsing is mostly resolved in the postsection parser
cfg_postsection_acme() which is called after the parsing of an acme section.
Some rare commands in the worker require to keep their input open and
terminate when it's closed ("show events -w", "wait"). Others maintain
a per-session context ("set anon on"). But in its default operation
mode, the master CLI passes commands one at a time to the worker, and
closes the CLI's input channel so that the command can immediately
close upon response. This effectively prevents these two specific cases
from being used.
Here the approach that we take is to introduce a bidirectional mode to
connect to the worker, where everything sent to the master is immediately
forwarded to the worker (including the raw command), allowing to queue
multiple commands at once in the same session, and to continue to watch
the input to detect when the client closes. It must be a client's choice
however, since doing so means that the client cannot batch many commands
at once to the master process, but must wait for these commands to complete
before sending new ones. For this reason we use the prefix "@@<pid>" for
this. It works exactly like "@" except that it maintains the channel
open during the whole execution. Similarly to "@<pid>" with no command,
"@@<pid>" will simply open an interactive CLI session to the worker, that
will be ended by "quit" or by closing the connection. This can be convenient
for the user, and possibly for clients willing to dedicate a connection to
the worker.
Even though spoe agent proxy is statically allocated, it uses the proxy
API and is initialized like a regular proxy, thus specific cleanup is
required upon release. This is not tagged as a bug because as of now this
would only cause some minor memory leak upon deinit.
We check the presence of proxy->id to know if it was initialized since
we cannot rely on a pointer for that.
This is just to make valgrind and friends happy, leverage deinit_proxy()
for checks_fe proxy upon deinit to ensure proper cleanup.
We check the presence of proxy->id to know if it was initialized because
we cannot rely on a pointer for that.
Same as free_proxy(), but does not free the base proxy pointer (ie: the
proxy itself may not be allocated)
Goal is to be able to cleanup statically allocated dummy proxies.
In this patch we try to use the proxy API init functions as much as
possible to avoid code redundancy and prevent proxy initialization
errors. As such, we prefer using alloc_new_proxy() and setup_new_proxy()
instead of manually allocating the proxy pointer and performing the
base init ourselves.
spoe agent frontend is used by the agent internally, but it is not meant
to be directly exposed like user-facing proxies defined in the config.
As such, better mark it as internal using PR_CAP_INT capability to prevent
any mis-use.
CHECKS-FE frontend is a dummy frontend used to create checks sessions
as such, it is internal and should not be exposed to the user.
Better mark it as internal using PR_CAP_INT capability to prevent
proxy API from ever exposing it.
Split alloc_new_proxy() in two functions: the preparing part is now
handled by setup_new_proxy() which can be called individually, while
alloc_new_proxy() takes care of allocating a new proxy struct and then
calling setup_new_proxy() with the freshly allocated proxy.
errmsg is used with memprintf and friends, thus it must be NULL
initialized before being passed to memprintf, else invalid read will
occur.
However in hlua_init() the errmsg value isn't initialized, let's fix that
This is really minor because it would only cause issue on error paths,
yet it may be backported to all stable versions, just in case.
The server's "usesrc" keyword supports among other options "client"
and "clientip". The former means we bind to the client's IP and port
to connect to the server, while the latter means we bind to its IP
only. It's done in two steps, first alloc_bind_address() retrieves
the IP address and port, and second, tcp_connect_server() decides
to either bind to the IP only or IP+port.
The problem comes with idle connection pools, which hash all the
parameters: the hash is calculated before (and ideally withouy) calling
tcp_connect_server(), and it considers the whole struct sockaddr_storage
for the hash, except that both client and clientip entirely fill it with
the client's address. This means that both client and clientip make use
of the source port in the hash calculation, making idle connections
almost not reusable when using "usesrc clientip" while they should for
clients coming from the same source. A work-around is to force the
source port to zero using "tcp-request session set-src-port int(0)" but
it's ugly.
Let's fix this by properly zeroing the port for AF_INET/AF_INET6 addresses.
This can be backported to 2.4. Thanks to Sebastien Gross for providing a
reproducer for this problem.
At the moment it is not supported to produce multi-line events on the
"show events" output, simply because the LF character is used as the
default end-of-event mark. However it could be convenient to produce
well-formatted multi-line events, e.g. in JSON or other formats. UNIX
utilities have already faced similar needs in the past and added
"-print0" to "find" and "-0" to "xargs" to mention that the delimiter
is the NUL character. This makes perfect sense since it's never present
in contents, so let's do exactly the same here.
Thus from now on, "show events <ring> -0" will delimit messages using
a \0 instead of a \n, permitting a better and safer encapsulation.
In order to support delimiting output events with other characters than
just the LF, let's pass the delimiter through the API. The default remains
the LF, used by applet_append_line(), and ignored by the log forwarder.
Aleandro Prudenzano of Doyensec and Edoardo Geraci of Codean Labs
reported a bug in sample_conv_regsub(), which can cause replacements
of multiple back-references to overflow the temporary trash buffer.
The problem happens when doing "regsub(match,replacement,g)": we're
replacing every occurrence of "match" with "replacement" in the input
sample, which requires a length check. For this, a max is applied, so
that a replacement may not use more than the remaining length in the
buffer. However, the length check is made on the replaced pattern and
not on the temporary buffer used to carry the new string. This results
in the remaining size to be usable for each input match, which can go
beyond the temporary buffer size if more than one occurrence has to be
replaced with something that's larger than the remaining room.
The fix proposed by Aleandro and Edoardo is the correct one (check on
"trash" not "output"), and is the one implemented in this patch.
While it is very unlikely that a config will replace multiple short
patterns each with a larger one in a request, this possibility cannot
be entirely ruled out (e.g. mask a known, short IP address using
"XXX.XXX.XXX.XXX"). However when this happens, the replacement pattern
will be static, and not be user-controlled, which is why this patch is
marked as medium.
The bug was introduced in 2.2 with commit 07e1e3c93e ("MINOR: sample:
regsub now supports backreferences"), so it must be backported to all
versions.
Special thanks go to Aleandro and Edoardo for reporting this bug with
a simple reproducer and a fix.
There have been some reports that using %HV logformat alias with CBOR
encoder would produce invalid CBOR payload according to some CBOR
implementations such as "cbor.me". Indeed, with the below log-format:
log-format "%{+cbor}o %(protocol)HV"
And the resulting CBOR payload:
BF6870726F746F636F6C7F48485454502F312E31FFFF
cbor.me would complain with: "bytes/text mismatch (ASCII-8BIT != UTF-8) in
streaming string") error message.
It is due to the version string being first announced as text, while CBOR
encoder actually encodes it as byte string later when lf_encode_chunk()
is used.
In fact it affects all patterns combining LOG_VARTEXT_START() with
lf_encode_chunk() which means %HM, %HU, %HQ, %HPO and %HP are also
affected. To fix the issue, in _lf_encode_bytes() (which is
lf_encode_chunk() helper), we now check if we are inside a VARTEXT (we
can tell it if ctx->in_text is true), in which case we consider that we
already announced the current data as regular text so we keep the same
type to encode the bytes from the chunk to prevent inconsistencies.
It should be backported in 3.0
negative SNI filters on crt-list lines only have a meaning when they
match a positive wildcard filter. This patch adds a warning which
is emitted when trying to use negative filters without any wildcard on
the same line.
This was discovered in ticket #2900.
negative wildcard filters were always a noop, and are not useful for
anything unless you want to use !* alone to remove every name from a
certificate.
This is confusing and the documentation never stated it correctly. This
patch adds a warning during the bind initialization if it founds one,
only !* does not emit a warning.
This patch was done during the debugging of issue #2900.
_lf_cbor_encode_byte() comment was not updated in c33b857df ("MINOR: log:
support true cbor binary encoding") to reflect the new behavior.
Indeed, binary form is now supported. Updating the comment that says
otherwise.
Some notification_* functions were not thread safe by default as they
assumed only one producer would emit events for registered tasks.
While this suited well with the Lua sockets use-case, this proved to
be a limitation with some other event sources (ie: lua Queue class)
instead of having to deal with both the non thread safe and thread
safe variants (_mt suffix), which is error prone, let's make the
entire API thread safe regarding the event list.
Pruning functions still require that only one thread executes them,
with Lua this is always the case because there is one cleanup list
per context.
Queue:alarm() sets a wakeup alarm on the task when new data becomes
available on Queue. It must be re-armed for each event.
Lua documentation was updated
This is the non-blocking variant for AppletTCP:receive(). It doesn't
take any argument, instead it tries to read as much data as available
at once. If no data is available, empty string is returned.
Lua documentation was updated.
core.wait() now accepts optional delay parameter in ms. Passed this delay
the task is woken up if no event woke the task before.
Lua documentation was updated.
Similar to core.yield(), except that the task is not woken up
automatically, instead it waits for events to trigger the task
wakeup.
Lua documentation was updated.
If Queue:pop_wait() excecuted from a stream context and pop_wait() is
aborted due to a Lua or ressource error, then the waiting object pointing
to the task will still be registered, so if the task eventually dissapears,
Queue:push() may try to wake invalid task pointer..
To prevent this bug from happening, we now rely on notification_* API to
deliver waiting signals. This way signals are properly garbage collected
when a lua context is destroyed.
It should be backported in 2.8 with 86fb22c55 ("MINOR: hlua_fcn: add Queue
class").
This patch depends on ("MINOR: task: add thread safe notification_new and
notification_wake variants")
This commit is a direct follow-up of the previous one. It defines a new
server keyword check-pool-conn-name. It is used as the default value for
the name parameter of idle connection hash generation.
Its behavior is similar to server keyword pool-conn-name, but reserved
for checks reuse. If check-pool-conn-name is set, it is used in priority
to match a connection for reuse. If unset, a fallback is performed on
check-sni.
Support for connection reuse during server checks was implemented
recently. This is activated with the server keyword check-reuse-pool.
Similarly to stream processing via connect_backend(), a connection hash
is calculated when trying to perform reuse for checks. This is necessary
to retrieve for a connection which shares the check connect parameters.
However, idle connections can additionnally be tagged using a
pool-conn-name or SNI under connect_backend(). Check reuse does not test
these values, which prevent to retrieve a matching connection.
Improve this by using "check-sni" value as idle connection hash input
for check reuse. be_calculate_conn_hash() API has been adjusted so that
name value can be passed as input, both when using streams or checks.
Even with the current patch, there is still some scenarii which could
not be covered for checks connection reuse. most notably, when using
dynamic pool-conn-name/SNI value. It is however at least sufficient to
cover simpler cases.
Without check-reuse-pool, it is impossible to perform check on server
using @rhttp protocol. This is due to the inherent nature of the
protocol which does not implement an active connect method.
Thus, ensure that check-reuse-pool is always set when a reverse HTTP
server is declared. This reduces server configuration and should prevent
any omission. Note that it is still require to add "check" server
keyword so activate server checks.
Duplicate server check.reuse_pool boolean value in srv_settings_cpy().
This is necessary to ensure that check-reuse-pool value can be set via
default-server or server-template.
This does not need to be backported.
Server instance can be NULL on connect_server(), either when dispatch or
transparent proxy are active. However, in alloc_dst_address() access to
<srv> is safe thanks to SF_ASSIGNED stream flag. Add an ASSUME_NONNULL()
to reflect this state.
This should fix coverity report from github issue #2922.
The new "crt" lines in frontend and listen sections are confusing:
- a filename is mandatory but we could need a syntax without the
filename in the future, if the filename is generated for example
- there is no clue about the fact that its only used on the frontend
side when reading the line
A new "ssl-f-use" line replaces the "crt" line, but a "crt" keyword
can be used on this line. "f" indicates that this is the frontend
configuration, a "ssl-b-use" keyword could be used in the future.
The "crt" lines only appeared in 3.2-dev so this won't change anything
for people using configurations from previous major versions.
This patch sets the expire of the entry to the max value in
configuration if the value showed in the peer update message
is too far in futur.
This should be backported an all supported branches.
At the begining of process_stream(), the flags of the stream connectors and
channels are saved to be able to handle changes performed in sub-functions
(for instance in analyzers). But, some operations were performed before
saving these flags: Synchronous receives and forced shutdowns. While it
seems to safe for now, it is a bit annoying because some events could be
missed.
So, to avoid bugs in the future, the channels and stream connectors flags
are now really saved before any other processing.
When a forced shutdown is performed on a stream, it is possible to freeze it
infinitly because it is performed in an unexpected way from process_stream()
point of view, especially when the stream is waiting for a server
connection. The events sequence is a bit complex but at the end the stream
remains blocked in turn-around state and no event are trriggered to unblock
it.
By trying to fix the issue, we considered it was safer to rethink the
feature. The idea is to quickly shutdown a stream to release resources. For
instance to be able to delete a server. So, instead of scheduling a
shutdown, it is more efficient to trigger an error and detach the stream
from the server, if neecessary. The same code than the one used to deal with
connection errors in back_handle_st_cer() is used.
This patch must be slowly backported as far as 2.6.
Not all systems have strndup(), that's why we have our "my_strndup()",
so let's make use of it here. This fixes the build on Solaris 10.
No backport is needed, this was just merged with commit fdcb97614c
("MINOR: ssl/ckch: add substring parser for ckch_conf").
The UDP GSO code emits a build warning with older toolchains (gcc 5 and 6):
src/quic_sock.c: In function 'cmsg_set_gso':
src/quic_sock.c:683:2: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
*((uint16_t *)CMSG_DATA(c)) = gso_size;
^
Let's just use the write_u16() function that's made for this purpose.
It was verified that for all versions from 5 to 13, gcc produces the
exact same code with the fix (and without the warning). It arrived in
3.1 with commit 448d3d388a ("MINOR: quic: add GSO parameter on quic_sock
send API") so this can be backported there.
Since recent commit ee94a6cfc1 ("MINOR: backend: extract conn reuse
from connect_server()") a build warning "set but not used" on the
"reuse" variable is emitted, because indeed the variable is now only
checked when SSL is in use. Let's just mark it as such.
Implement the possibility to reuse idle connections when performing
server checks. This is done thanks to the recently introduced functions
be_calculate_conn_hash() and be_reuse_connection().
One side effect of this change is that be_calculate_conn_hash() can now
be called with a NULL stream instance. As such, part of the functions
are adjusted accordingly.
Note that to simplify configuration, connection reuse is not performed
if any specific check connection parameters are defined on the server
line or via the tcp-check connect rule. This is performed via newly
defined tcpcheck_use_nondefault_connect().
Define a new server keyword check-reuse-pool, and its counterpart with a
"no" prefix. For the moment, only parsing is implemented. The real
behavior adjustment will be implemented in the next patch.
Adjust newly defined be_reuse_connection() API. The stream argument is
removed. This will allows checks to be able to invoke it without relying
on a stream instance.
Following the previous patch, the part directly related to connection
reuse is extracted from connect_server(). It is now define in a new
function be_reuse_connection().
On connection reuse, a hash is first calculated. It is generated from
various connection parameters, to retrieve a matching connection.
Extract hash calculation from connect_server() into a new dedicated
function be_calculate_conn_hash(). The objective is to be able to
perform connection reuse for checks, without connect_server() invokation
which relies on a stream instance.
The main objective of this patch is to remove the stream instance from
conn_backend_get() parameters. This would allow to perform reuse outside
of stream contexts, for example for checks purpose.
Previously, if a server reached its pool-high-count limit, connection
were killed on connect_server() when reuse was not possible. However,
this is now performed even if reuse is done since the following patch :
b3397367dc
MEDIUM: connections: Kill connections even if we are reusing one.
Thus, adjust the related comment to reflect this state.
On backend connection reuse, a hash is calculated from various
parameters, to ensure the selected connection match the requested
parameters. Notably, destination address is one of these parameters.
However, it is only taken into account if using a transparent server
(server address 0.0.0.0).
This may cause issue where an incorrect connection is reused, which is
not targetted to the correct destination address. This may be the case
if a set-dst/set-dst-port is used with a transparent proxy (proxy option
transparent).
The fix is simple enough. Destination address is now always used as
input to the connection reuse hash.
This must be backported up to 2.6. Note that for reverse HTTP to work,
it relies on the following patch, which ensures destination address
remains NULL in this case.
commit e94baf6ca71cb2319610baa74dbf17b9bc602b18
BUG/MINOR: rhttp: fix incorrect dst/dst_port values
Previously, destination address of backend connection was systematically
always reassigned. However, this step is unnecessary on connection
reuse. Indeed, reuse should only be conducted with connection using the
same destination address matching the stream requirements.
This patch removes this unnecessary assignment. It is now only performed
when reuse cannot be conducted and a new connection is instantiated.
Functionnally speaking, this patch should not change anything in theory,
as reuse is performed in conformance with the destination address.
However, it appears that it was not always properly enforced. The
systematic assignment of the destination address hides these issues, so
it is now remove. The identified bogus cases will then be fixed in the
following patches.would
This should be backported up to all stable versions.
With a @rhttp server, connect is not possible, transfer is only possible
via idle connection reuse. The server does not have any network address.
Thus, it is unnecessary to allocate the stream destination address prior
to connection reuse. This patch adjusts this by fixing
alloc_dst_address() to take this into account.
Prior to this patch, alloc_dst_address() would incorrectly assimilate a
@rhttp server with a transparent proxy mode. Thus stream destination
address would be copied from the destination address. Connection adress
would then be rewrote with this incorrect value. This did not impact
connect or reuse as destination addr is only used in idle conn hash
calculation for transparent servers. However, it causes incorrect values
for dst/dst_port samples.
This should be backported up to 2.9.
It may happen that the server is going down, and fwlc_srv_reposition()
is still called, because streams still attached to the server are
being terminated.
So in fwlc_srv_reposition(), just do nothing if we've been removed from
the tree.
This should fix github issue #2919.
This should not be backported, unless commit
9fe72bba3c is also backported.

"raw" logformat node typecast is a special value (unlike str,bool,int..)
which tells haproxy to completely ignore logformat options (including
encoding ones) and force binary output for the current node only. It is
mainly intended for use with JSON or CBOR encoders in order to generate
nested CBOR or nested JSON by storing intermediate log-formats within
variables and assembling the final object in the parent log-format.
Example:
http-request set-var-fmt(txn.intermediate) "%{+json}o %(lower)[str(value)]"
log-format "%{+json}o %(upper)[str(value)] %(intermediate:raw)[var(txn.intermediate)]"
Would produce:
{"upper": "value", "intermediate": {"lower": "value"}}
src/ssl_ckch.c: In function ‘ckch_conf_parse’:
src/ssl_ckch.c:4852:40: error: potential null pointer dereference [-Werror=null-dereference]
4852 | while (*r) {
| ^~
Add a test on r before using *r.
No backport needed
fdcb97614c ("MINOR: ssl/ckch: add substring parser for ckch_conf")
introduced a leak in the error path when the strndup fails.
This patch fixes issue #2920. No backport needed.
For leastconn, servers used to just be stored in an ebtree.
Each server would be one node.
Change that so that nodes contain multiple mt_lists. Each list
will contain servers that share the same key (typically meaning
they have the same number of connections). Using mt_lists means
that as long as tree elements already exist, moving a server from
one tree element to another does no longer require the lbprm write
lock.
We use multiple mt_lists to reduce the contention when moving
a server from one tree element to another. A list in the new
element will be chosen randomly.
We no longer remove a tree element as soon as they no longer
contain any server. Instead, we keep a list of all elements,
and when we need a new element, we look at that list only if it
contains a number of elements already, otherwise we'll allocate
a new one. Keeping nodes in the tree ensures that we very
rarely have to take the lbrpm write lock (as it only happens
when we're moving the server to a position for which no
element is currently in the tree).
The number of mt_lists used is defined as FWLC_NB_LISTS.
The number of tree elements we want to keep is defined as
FWLC_MIN_FREE_ENTRIES, both in defaults.h.
The value used were picked afrer experimentation, and
seems to be the best choice of performances vs memory
usage.
Doing that gives a good boost in performances when a lot of
servers are used.
With a configuration using 500 servers, before that patch,
about 830000 requests per second could be processed, with
that patch, about 1550000 requests per second are
processed, on an 64-cores AMD, using 1200 concurrent connections.
Add two new methods to lbprm, server_deinit() and proxy_deinit(),
in case something should be done at the lbprm level when
removing servers and proxies.
jwk_thumbprint() is a function which is a function which implements
RFC7368 and emits a JWK thumbprint using a EVP_PKEY.
EVP_PKEY_EC_to_pub_jwk() and EVP_PKEY_RSA_to_pub_jwk() were changed in
order to match what is required to emit a thumbprint (ie, no spaces or
lines and the lexicographic order of the fields)
When first pre-parsing the config to detect the presence or absence of
the master mode, we must not emit messages because they are not supposed
to be visible at this point, otherwise they appear twice each. The
pre-parsing, also called discovery mode, is only for internal use,
thus it should remain silent.
This should be backported to 3.1 where this mode was introduced.
This adds "group-by-{2,3,4}-clusters", which, as its name implies,
create one thread group per X clusters. This can be useful when CPUs
are split into too small clusters, as well as when the total number
of assigned cores is not even between the clusters, to try to spread
the load between less different ones.
When emitting the CPU topology info with -dc, also emit a list of
thread-to-CPU mapping. The group/thread and thread ID are emitted
with the list of their CPUs on each line. The count of CPUs is shown
to ease comparisons, and as much as possible, we try to pack identical
lines within a group by showing thread ranges.
The new function "print_cpu_set()" will print cpu sets in a human-friendly
way, with commas and dashes for intervals. The goal is to keep them compact
enough.
It was previously done in thread_detect_count() but that's not quite
handy because we still don't know about the groups setting. Better do
it slightly later and have all the relevant info instead.
GCC 15 throws the following warning on fixed-size char arrays if they do not
contain terminated NUL:
src/tools.c:2041:25: error: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (17 chars into 16 available) [-Werror=unterminated-string-initialization]
2041 | const char hextab[16] = "0123456789ABCDEF";
We are using a couple of such definitions for some constants. Converting them
to flexible arrays, like: hextab[] = "0123456789ABCDEF" may have consequences,
as enlarged arrays won't fit anymore where they were possibly located due to
the memory alignement constraints.
GCC adds 'nonstring' variable attribute for such char arrays, but clang and
other compilers don't have it. Let's wrap 'nonstring' with our
__nonstring macro, which will test if the compiler supports this attribute.
This fixes the issue #2910.
gcc 15 throws such kind of warnings about initialization of some char arrays:
src/log.c:181:33: error: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (17 chars into 16 available) [-Werror=unterminated-string-initialization]
181 | const char sess_term_cond[16] = "-LcCsSPRIDKUIIII"; /* normal, Local, CliTo, CliErr, SrvTo, SrvErr, PxErr, Resource, Internal, Down, Killed, Up, -- */
| ^~~~~~~~~~~~~~~~~~
src/log.c:182:33: error: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (9 chars into 8 available) [-Werror=unterminated-string-initialization]
182 | const char sess_fin_state[8] = "-RCHDLQT"; /* cliRequest, srvConnect, srvHeader, Data, Last, Queue, Tarpit */
So, let's make it happy by not giving the sizes of these char arrays
explicitly, thus he can accomodate there NUL terminators.
Reported in GitHub issue #2910.
This should be backported up to 2.6.
By default, pools of comparable sizes are merged together. However, the
current algorithm is dumb: it rounds the requested size to the next
multiple of 16 and compares the sizes like this. This results in many
entries which are already multiples of 16 not being merged, for example
1024 and 1032 are separate, 65536 and 65540 are separate, 48 and 56 are
separate (though 56 merges with 64).
This commit changes this to consider not just the entry size but also the
average entry size, that is, it compares the average size of all objects
sharing the pool with the size of the object looking for a pool. If the
object is not more than 1% bigger nor smaller than the current average
size or if it neither 16 bytes smaller nor larger, then it can be merged.
Also, it always respects exact matches in order to avoid merging objects
into larger pools or worse, extending existing ones for no reason, and
when there's a tie, it always avoids extending an existing pool.
Also, we now visit all existing pools in order to spot the best one, we
do not stop anymore at the smallest one large enough. Theoretically this
could cost a bit of CPU but in practice it's O(N^2) with N quite small
(typically in the order of 100) and the cost at each step is very low
(compare a few integer values). But as a side effect, pools are no
longer sorted by size, "show pools bysize" is needed for this.
This causes the objects to be much better grouped together, accepting to
use a little bit more sometimes to avoid fragmentation, without causing
everyone to be merged into the same pool. Thanks to this we're now
seeing 36 pools instead of 48 by default, with some very nice examples
of compact grouping:
- Pool qc_stream_r (80 bytes) : 13 users
> qc_stream_r : size=72 flags=0x1 align=0
> quic_cstrea : size=80 flags=0x1 align=0
> qc_stream_a : size=64 flags=0x1 align=0
> hlua_esub : size=64 flags=0x1 align=0
> stconn : size=80 flags=0x1 align=0
> dns_query : size=64 flags=0x1 align=0
> vars : size=80 flags=0x1 align=0
> filter : size=64 flags=0x1 align=0
> session pri : size=64 flags=0x1 align=0
> fcgi_hdr_ru : size=72 flags=0x1 align=0
> fcgi_param_ : size=72 flags=0x1 align=0
> pendconn : size=80 flags=0x1 align=0
> capture : size=64 flags=0x1 align=0
- Pool h3s (56 bytes) : 17 users
> h3s : size=56 flags=0x1 align=0
> qf_crypto : size=48 flags=0x1 align=0
> quic_tls_se : size=48 flags=0x1 align=0
> quic_arng : size=56 flags=0x1 align=0
> hlua_flt_ct : size=56 flags=0x1 align=0
> promex_metr : size=48 flags=0x1 align=0
> conn_hash_n : size=56 flags=0x1 align=0
> resolv_requ : size=48 flags=0x1 align=0
> mux_pt : size=40 flags=0x1 align=0
> comp_state : size=40 flags=0x1 align=0
> notificatio : size=48 flags=0x1 align=0
> tasklet : size=56 flags=0x1 align=0
> bwlim_state : size=48 flags=0x1 align=0
> xprt_handsh : size=48 flags=0x1 align=0
> email_alert : size=56 flags=0x1 align=0
> caphdr : size=41 flags=0x1 align=0
> caphdr : size=41 flags=0x1 align=0
- Pool quic_cids (32 bytes) : 13 users
> quic_cids : size=16 flags=0x1 align=0
> quic_tls_ke : size=32 flags=0x1 align=0
> quic_tls_iv : size=12 flags=0x1 align=0
> cbuf : size=32 flags=0x1 align=0
> hlua_queuew : size=24 flags=0x1 align=0
> hlua_queue : size=24 flags=0x1 align=0
> promex_modu : size=24 flags=0x1 align=0
> cache_st : size=24 flags=0x1 align=0
> spoe_appctx : size=32 flags=0x1 align=0
> ehdl_sub_tc : size=32 flags=0x1 align=0
> fcgi_flt_ct : size=16 flags=0x1 align=0
> sig_handler : size=32 flags=0x1 align=0
> pipe : size=24 flags=0x1 align=0
- Pool quic_crypto (1032 bytes) : 2 users
> quic_crypto : size=1032 flags=0x1 align=0
> requri : size=1024 flags=0x1 align=0
- Pool quic_conn_r (65544 bytes) : 2 users
> quic_conn_r : size=65536 flags=0x1 align=0
> dns_msg_buf : size=65540 flags=0x1 align=0
On a very unscientific test consisting in sending 1 million H1 requests
and 1 million H2 requests to the stats page, we're seeing an ~6% lower
memory usage with the patch:
before the patch:
Total: 48 pools, 4120832 bytes allocated, 4120832 used (~3555680 by thread caches).
after the patch:
Total: 36 pools, 3880648 bytes allocated, 3880648 used (~3299064 by thread caches).
This should be taken with care however since pools allocate and release
in batches.
When using hash-based load balancing, requests are always assigned to
the server corresponding to the hash bucket for the balancing key,
without taking maxconn or maxqueue into account, unlike in other load
balancing methods like 'first'. This adds a new backend directive that
can be used to take maxconn and possibly maxqueue in that context. This
can be used when hashing is desired to achieve cache locality, but
sending requests to a different server is preferable to queuing for a
long time or failing requests when the initial server is saturated.
By default, affinity is preserved as was the case previously. When
'hash-preserve-affinity' is set to 'maxqueue', servers are considered
successively in the order of the hash ring until a server that does not
have a full queue is found.
When 'maxconn' is set on a server, queueing cannot be disabled, as
'maxqueue=0' means unlimited. To support picking a different server
when a server is at 'maxconn' irrespective of the queue,
'hash-preserve-affinity' can be set to 'maxconn'.
Define a new global configuration tune.quic.frontend.max-data. This
allows users to explicitely set the value for the corresponding QUIC TP
initial-max-data, with direct impact on haproxy memory consumption.
Initial TP value for max-data is automatically calculated to be adjusted
to the maximum number of opened streams over a QUIC connection. This
took into account both max-streams-bidi-remote and uni-streams. By
default, this is equivalent to 100 + 3 = 103 max opened streams.
This patch simplifies the calculation by only using bidirectional
streams. Uni streams are ignored because they are only used for HTTP/3
control exchanges, which should only represents a few bytes. For now,
users can only configure the max number of remote bidi streams, so the
simplified calculation should make more sense to them.
Note that this relies on the assumption that HTTP/3 is used as
application protocol. To support other protocols, it may be necessary to
review this and take into account both local bidi and uni streams.
Adjust initialization of flow-control transport parameters via
quic_transport_params_init().
This is purely cosmetic, with some comments added. It is also a
preparatory step for future patches with addition of new configuration
keywords related to flow-control TP values.
A new structure quic_tune has recently been defined. Its purpose is to
store global options related to QUIC. Previously, only the tunable to
toggle pacing was stored in it.
This commit moves several QUIC related tunable from global to quic_tune
structure. This better centralizes QUIC configuration option and gives
room for future generic options.
By default, create_pool() tries to merge similar pools into one. But when
dealing with certain bugs, it's hard to say which ones were merged together.
We do have the information at registration time, so let's just create a
list of registrations ("pool_registration") attached to each pool, that
will store that information. It can then be consulted on the CLI using
"show pools detailed", where the names, sizes, alignment and flags are
reported.
The goal will be to support other dump options. We don't need 32 bits to
express sorting criteria, let's reserve only 4 bits for them and leave
the remaining ones unused.
For now alt_name is systematically set to NULL. Thanks to this change we
may easily add an altname to existing metrics. Also by requiring explicit
value it offers more visibility for this field.
The following patch fixed a BUG_ON() which could be triggered if RS/SS
emission was scheduled after stream local closure.
7ee1279f4b
BUG/MEDIUM: mux-quic: fix crash on RS/SS emission if already close local
qcc_send_stream() was rewritten as a wrapper around an internal
_qcc_send_stream() used to bypass the faulty BUG_ON(). However, an extra
unnecessary BUG_ON() was added by mistake in _qcc_send_stream().
This should not cause any issue, as the BUG_ON() is only active if <urg>
argument is false, which is not the case for RS/SS emission. However,
this patch is labelled as a bug as this BUG_ON() is unnecessary and may
cause issues in the future.
This should be backported up to 2.8, after the above mentionned patch.
A BUG_ON() is present in qcc_send_stream() to ensure that emission is
never performed with a stream already closed locally. However, this
function is also used for RESET_STREAM/STOP_SENDING emission. No
protection exists to ensure that RS/SS is not scheduled after stream
local closure, which would result in this BUG_ON() crash.
This crash can be triggered with the following QUIC client sequence :
1. SS is emitted to open a new stream. QUIC-MUX schedules a RS emission
by and the stream is locally closed.
2. An invalid HTTP/3 request is sent on the same stream, for example
with duplicated pseudo-headers. The objective is to ensure
qcc_abort_stream_read() is called after stream closure, which results
in the following backtrace.
0x000055555566a620 in qcc_send_stream (qcs=0x7ffff0061420, urg=1, count=0) at src/mux_quic.c:1633
1633 BUG_ON(qcs_is_close_local(qcs));
[ ## gdb ## ] bt
#0 0x000055555566a620 in qcc_send_stream (qcs=0x7ffff0061420, urg=1, count=0) at src/mux_quic.c:1633
#1 0x000055555566a921 in qcc_abort_stream_read (qcs=0x7ffff0061420) at src/mux_quic.c:1658
#2 0x0000555555685426 in h3_rcv_buf (qcs=0x7ffff0061420, b=0x7ffff748d3f0, fin=0) at src/h3.c:1454
#3 0x0000555555668a67 in qcc_decode_qcs (qcc=0x7ffff0049eb0, qcs=0x7ffff0061420) at src/mux_quic.c:1315
#4 0x000055555566c76e in qcc_recv (qcc=0x7ffff0049eb0, id=12, len=0, offset=23, fin=0 '\000',
data=0x7fffe0049c1c "\366\r,\230\205\354\234\301;\2563\335\037k\306\334\037\260", <incomplete sequence \323>) at src/mux_quic.c:1901
#5 0x0000555555692551 in qc_handle_strm_frm (pkt=0x7fffe00484b0, strm_frm=0x7ffff00539e0, qc=0x7fffe0049220, fin=0 '\000') at src/quic_rx.c:635
#6 0x0000555555694530 in qc_parse_pkt_frms (qc=0x7fffe0049220, pkt=0x7fffe00484b0, qel=0x7fffe0075fc0) at src/quic_rx.c:980
#7 0x0000555555696c7a in qc_treat_rx_pkts (qc=0x7fffe0049220) at src/quic_rx.c:1324
#8 0x00005555556b781b in quic_conn_app_io_cb (t=0x7fffe0037f20, context=0x7fffe0049220, state=49232) at src/quic_conn.c:601
#9 0x0000555555d53788 in run_tasks_from_lists (budgets=0x7ffff748e2b0) at src/task.c:603
#10 0x0000555555d541ae in process_runnable_tasks () at src/task.c:886
#11 0x00005555559c39e9 in run_poll_loop () at src/haproxy.c:2858
#12 0x00005555559c41ea in run_thread_poll_loop (data=0x55555629fb40 <ha_thread_info+64>) at src/haproxy.c:3075
The proper solution is to not execute this BUG_ON() for RS/SS emission.
Indeed, it is valid and can be useful to emit these frames, even after
stream local closure.
To implement this, qcc_send_stream() has been rewritten as a mere
wrapper function around the new internal _qcc_send_stream(). The latter
is used only by QMUX for STREAM, RS and SS emission. Application layer
continue to use the original function for STREAM emission, with the
BUG_ON() still in place there.
This must be backported up to 2.8.
While being a generic metric, ST_I_PX_REQ_TOT is handled specifically for
the frontend case. But the frontend capability isn't set for that metric
It is actually quite misleading, because the capability may be checked
to see whether the metric is relevant for a given scope, yet it is
relevant for frontend scope.
In this patch we also add the frontend capability for the metric.
Use stat_col storage for stat_cols_info[] array instead of name_desc.
As documented in 65624876f ("MINOR: stats: introduce a more expressive
stat definition method"), stat_col supersedes name_desc storage but
it remains backward compatible. Here we migrate to the new API to be
able to further extend stat_cols_info[] in following patches.
Goal is to merge promex metrics definition into the main one.
Promex metrics will use the metric capability to know available scopes,
thus only metrics relevant for prometheus were updated.
Further extend logic implemented in 65624876 ("MINOR: stats: introduce a
more expressive stat definition method") and 4e9e8418 ("MINOR: stats:
prepare stats-file support for values other than FN_COUNTER"): we don't
rely anymore on the presence of the capability to know if the metric is
generic or not. This is because it prevents us from setting a capability
on static statistics. Yet it could be useful to set the capability even
on static metrics, thus we add a dedicated .generic bit to tell haproxy
that the metric is generic and can be handled automatically by the API.
Also, ME_NEW_* helpers are not explicitly associated to generic metric
definition (as it was already the case before) to avoid ambiguities.
It may change in the future as we may need to use the new definition
method to define static metrics (without the generic bit set). But for
now it isn't the case as this need definition was implemented for generic
metrics support in the first place. If we want to define static metrics
using the API, we could add a new set of helpers for instance.
On frontend side, when a stream is shut while the response was already fully
sent, it was cancelled by sending a RST_STREAM(CANCEL) frame. However, it is
not accurrate. CANCEL error code must only be used if the response headers
were sent, but not the full response. As stated in the RFC 9113, when the
response was fully sent, to stop the request sending, a RST_STREAM with an
error code of NO_ERROR must be sent.
This patch should solve the issue #1219. It must be backported to all stable
versions.
The ckch_store_load_files() function makes specific processing for
PARSE_TYPE_STR as if it was a type only used for paths.
This patch changes a little bit the way it's done,
PARSE_TYPE_STR is only meant to strdup() a string and stores the
resulting pointer in the ckch_conf structure.
Any processing regarding the path is now done in the callback.
Since the callbacks were basically doing the same thing, they were
transformed into the DECLARE_CKCH_CONF_LOAD() macros which allows to
do some templating of these functions.
The resulting ckch_conf_load_* functions will do the same as before,
except they will also do the path processing instead of letting
ckch_store_load_files() do it, which means we don't need the "base"
member anymore in the struct ckch_conf_kws.
With the SSL configuration, crt-base, key-base are often used, these
keywords concatenates the base path with the path when the path does not
start by '/'.
This is done at several places in the code, so a function to do this
would be better to standardize the code.
Recent commit e5e36ce09 ("BUG/MEDIUM: hlua/cli: Fix lua CLI commands
to work with applet's buffers") revealed a bug in hlua cli applet handling
Indeed, playing with Willy's lua tetris script on the cli, a segfault
would be encountered when forcefully closing the session by sending a
CTRL+C on the terminal.
In fact the crash was caused by a UAF: while the cli applet was already
freed, the lua task responsible for waking it up would still point to it.
Thus hlua_applet_wakeup() could be called even if the applet didn't exist
anymore.
To fix the issue, in hlua_cli_io_release_fct() we must also free the hlua
task linked to the applet, like we already do for
hlua_applet_tcp_release() and hlua_applet_http_release().
While this bug exists on stable versions (where it should be backported
too for precaution), it only seems to be triggered starting with 3.0.
'global.fd_hard_limit' stays uninitialized, if haproxy is started with -m
(global.rlimit_memmax). 'remain' is the MAX between soft and hard process fd
limits. It will be always bigger than 'global.fd_hard_limit' (0) in this case.
So, if we reassign 'remain' to the 'global.fd_hard_limit' unconditionally,
calculated then 'maxconn' will be even negative and the DEFAULT_MAXCONN (100)
will be set as the 'ideal_maxconn'.
During the 'global.maxconn' calculations in set_global_maxconn(), if the
provided 'global.rlimit_memmax' is quite big, system will refuse to calculate
based on its 'global.maxconn' and we will do a fallback to the 'ideal_maxconn',
which is 100.
Same problem for the configs with SSL frontends and backends.
This fixes the issue #2899.
This should be backported to v3.1.0.
Thanks to the previous commits, we now know that "wait srv-removable"
does not require thread isolation, as long as 3372a2ea00 ("BUG/MEDIUM:
queues: Stricly respect maxconn for outgoing connections") and c880c32b16
("MINOR: stream: decrement srv->served after detaching from the list")
are present. Let's just get rid of thread_isolate() here, which can
consume a lot of CPU on highly threaded machines when removing many
servers at once.
This function was marked as requiring thread isolation because its code
was extracted from cli_parse_delete_server() and was running under
isolation. But upon closer inspection, and using atomic loads to check
a few counters, it is actually safe to run without isolation, so let's
reflect that in its description.
However, it remains true that cli_parse_delete_server() continues to call
it under isolation.
Now that thanks to commit c880c32b16 ("MINOR: stream: decrement
srv->served after detaching from the list") we can trust srv->served,
let's use it and no longer loop on threads when checking if a server
still has streams attached to it. This will be much cheaper and will
result in keeping isolation for a shorter time in the "wait" command.
Baptiste reported that using the new optional timeout argument introduced
in 19e48f2 ("MINOR: hlua: add an optional timeout to AppletTCP:receive()")
the following error would occur at some point:
runtime error: file.lua:lineno: bad argument #-2 to 'receive' (number
expected, got light userdata) from [C]: in method 'receive...
In fact this is caused by exp_date being retrieved using relative index -1
instead of absolute index 3. Indeed, while using relative index is fine
most of the time when we trust the stack, when combined with yielding the
top of the stack when resuming from yielding is not necessarily the same
as when the function was first called (ie: if some data was pushed to the
stack in the yieldable function itself). As such, it is safer to use
explicit index to access exp_date variable at position 3 on the stack.
It was confirmed that doing so addresses the issue.
No backport needed unless 19e48f2 is.
In commit 3372a2ea00 ("BUG/MEDIUM: queues: Stricly respect maxconn for
outgoing connections"), it has been ensured that srv->served is held
as long as possible around the periods where a stream is attached to a
server. However, it's decremented early when entering sess_change_server,
and actually just before detaching from that server's list. While there
is theoretically nothing wrong with this, it prevents us from looking at
this counter to know if streams are still using a server or not.
We could imagine decrementing it much later but that wouldn't work with
leastconn, since that algo needs ->served to be final before calling
lbprm.server_drop_conn(). Thus what we're doing here is to detach from
the server, then decrement ->served, and only then call the LB callback
to update the server's position in the tree. At this moment the stream
doesn't know the server anymore anyway (except via this function's
local variable) so it's safe to consider that no stream knows the server
once the variable reaches zero.
In ad0133cc ("MINOR: log: handle log-forward "option host""), we
de-reference saddr without first checking if saddr is NULL. In practise
saddr shouldn't be null, but it may be the case if memory error happens
for tcp syslog handler so we must assume that it can be NULL at some
point.
To fix the bug, we simply check for NULL before de-referencing it
under syslog_io_handler(), as the function comment suggests.
No backport needed unless ad0133cc is.
This patch implements the function EVP_PKEY_to_jws_algo() which returns
a jwt_alg compatible with the private key.
This value can then be passed to jws_b64_protected() and
jws_b64_signature() which modified to take an jwt_alg instead of a char.
TCP services might want to be interactive, and without a timeout on
receive(), the possibilities are a bit limited. Let's add an optional
timeout in the 3rd argument to possibly limit the wait time. In this
case if the timeout strikes before the requested size is complete,
a possibly incomplete block will be returned.
Coverity has reported that cpu2 seems sometimes unused in
cpu_fixup_topology():
*** CID 1593776: Code maintainability issues (UNUSED_VALUE)
/src/cpu_topo.c: 690 in cpu_fixup_topology()
684 continue;
685
686 if (ha_cpu_topo[cpu].cl_gid != curr_id) {
687 if (curr_id >= 0 && cl_cpu <= 2)
688 small_cl++;
689 cl_cpu = 0;
>>> CID 1593776: Code maintainability issues (UNUSED_VALUE)
>>> Assigning value from "cpu" to "cpu2" here, but that stored value is overwritten before it can be used.
690 cpu2 = cpu;
691 curr_id = ha_cpu_topo[cpu].cl_gid;
692 }
693 cl_cpu++;
694 }
695
That's it. 'cpu2' automatic/stack variable is used only in for() loop scopes to
save cpus ID in which we are interested in. In the loop pointed by coverity
this variable is not used for further processing within the loop's scope.
Then it is always reinitialized to 0 in the another following loops.
This fixes GitHUb issue #2895.
This cpu policy keeps the smallest CPU cluster. This can
be used to limit the resource usage to the strict minimum
that still delivers decent performance, for example to
try to further reduce power consumption or minimize the
number of cores needed on some rented systems for a
sidecar setup, in order to scale the system down more
easily. Note that if a single cluster is present, it
will still be fully used.
When started on a 64-core EPYC gen3, it uses only one CCX
with 8 cores and 16 threads, all in the same group.
This cpu policy tries to evict performant core clusters and only
focuses on efficiency-oriented ones. On an intel i9-14900k, we can
get 525k rps using 8 performance cores, versus 405k when using all
24 efficiency cores. In some cases the power savings might be more
desirable (e.g. scalability tests on a developer's laptop), or the
performance cores might be better suited for another component
(application or security component).
This cpu policy tries to evict efficient core clusters and only
focuses on performance-oriented ones. On an intel i9-14900k, we can
get 525k rps using only 8 cores this way, versus 594k when using all
24 cores. The gains from using all these codes are not significant
enough to waste them on this. Also these cores can be much slower
at doing SSL handshakes so it can make sense to evict them. Better
keep the efficiency cores for network interrupts for example.
Also, on a developer's machine it can be convenient to keep all these
cores for the local tasks and extra tools (load generators etc).
When a cluster is too large to fit into a single group, let's split it
into two equal groups, which will still be allowed to use all the CPUs
of the cluster. This allows haproxy to start all the threads with a
minimum number of groups (e.g. 2x40 for 80 cores).
This policy forms thread groups from the CPU clusters, and bind all the
threads in them to all the CPUs of the cluster. This is recommended on
system with bad inter-CCX latencies. It was shown to simply triple the
performance with queuing on a 64-core EPYC without having to manually
assign the cores with cpu-map.
This is now superseded by the default "safe" cpu-policy, and every time
it's used, that code was bypassed anyway since global.nbthread was set.
We can now safely remove it. Note that for other policies which do not
set a thread count nor further restrict CPUs (such as "none", or even
"safe" when finding a single node), we continue to go through the fallback
code that automatically assigns CPUs to threads and counts them.
This now turns the cpu-policy to "first-usable-node" by default, so that
we preserve the current default behavior consisting in binding to the
first node if nothing was forced. If a second node is found,
global.nbthread is set and the previous code will be skipped.
This is a reimplemlentation of the current default policy. It binds to
the first node having usable CPUs if found, and drops CPUs from the
second and next nodes.
We'll need to let the user decide what's best for their workload, and in
order to do this we'll have to provide tunable options. For that, we're
introducing struct ha_cpu_policy which contains a name, a description
and a function pointer. The purpose will be to use that function pointer
to choose the best CPUs to use and now to set the number of threads and
thread-groups, that will be called during the thread setup phase. The
only supported policy for now is "none" which doesn't set/touch anything
(i.e. all available CPUs are used).
These are processed after the topology is detected, and they allow to
restrict binding to or evict CPUs matching the indicated hardware
cluster number(s). It can be used to bind to only some clusters, such
as CCX or different energy efficiency cores. For this reason, here we
use the cluster's local ID (local to the node).
These are processed after the topology is detected, and they allow to
restrict binding to or evict CPUs matching the indicated hardware
core number(s). It can be used to bind to only some clusters as well
as to evict efficient cores whose number is known.