Listener functions must follow a common locking pattern:
1. Get the proxy's lock if necessary
2. Get the protocol's lock if necessary
3. Get the listener's lock if necessary
We must take care to respect this order to avoid any ABBA issue. However, an
issue was introduced in the commit bcad7e631 ("MINOR: listener: add
relax_listener() function"). relax_listener() gets the lisener's lock and if
resume_listener() is called, the proxy's lock is then acquired.
So to fix the issue, the proxy's lock is first acquired in relax_listener(),
if necessary.
This patch should fix the issue #2222. It must be backported as far as 2.4
because the above commit is marked to be backported there.
On a clean installation, users might want to use server-state-file and
the recommended zero-warning option. This caused a problem if
server-state-file was not found, as a warning was emited, causing
startup to fail.
This will allow users to specify nonexistent server-state-file at first,
and dump states to the file later.
Fixes#2190
CF: Technically speaking, this patch can be backported to all stable
versions. But it is better to do so to 2.8 only for now.
Users might want to pre-create an empty file for later dumping
server-states. This commit allows for that by emiting a notice in case
file is empty and a warning if file is not empty, but version is unknown
Fix partially: #2190
CF: Technically speaking, this patch can be backported to all stable
versions. But it is better to do so to 2.8 only for now.
There are cases where there are enough room on the network to send 1200 bytes
into a PING only Initial packets. This may be considered as the last chance
for the connection to complete the handshake. Indeed, the client should
reply with at least a 1200 bytes datagram with an Initial packet inside.
This would give the haproxy endpoint a credit of 3600 bytes to complete
the handshake before reaching the anti-amplification limit again, and so on.
It is hard to analyze the impact of this bug. I guess it could lead a connection
to probe infinitively (with an exponential backoff probe timeout) during an handshake,
but one has never seen such a case.
Add missing parentheses around ->flags of the TX packet built by qc_do_build_pkt()
to detect that this packet embeds ack-eliciting frames. In this case if a probing
packet was needed the ->pto_probe value of the packet number space must be
decremented.
Must be backported as far as 2.6.
quic_conn_io_cb() is the I/O handler used during the handshakes. It called
qc_prep_pkts() after having called qc_treat_rx_pkts() to prepare datagrams
after having parsed incoming ones with branches to "next_level" label depending
on the connection state and if the current TLS session was a 0-RTT session
or not. The code doing that was ugly and not easy to maintain.
As qc_prep_pkts() is able to handle all the encryption levels available for
a connection, there is no need to keep this code.
After simplification, for now on, to be short, quic_conn_io_cb() called only one
time qc_prep_pkts() after having called qc_treat_rx_pkts().
Furthermore, there are more chances that this I/O handler could be reused for
the haproxy server side connections.
The aim of this patch is to allow the building of QUIC datagrams with
as much as packets with different encryption levels inside during handshake.
At this time, this is possible only for at most two encryption levels.
That said, most of the time, a server only needs to use two encryption levels
by datagram, except during retransmissions.
Modify qc_prep_pkts(), the function responsible of building datagrams, to pass
a list of encryption levels as parameter in place of two encryption levels. This
function is also used when retransmitting datagrams. In this case this is a
customized/flexible list of encryption level which is passed to this function.
Add ->retrans new member to quic_enc_level struct, to be used as attach point
to list of encryption level used only during retransmission, and ->retrans_frms
new member which is a pointer to a list of frames to be retransmitted.
This context may be released at the same time as the Initial TLS context.
This is done calling quic_tls_ctx_secs_free() and pool_free() in two code locations.
Implement quic_nictx_free() to do that.
Shorten ->negotiated_ictx quic_conn struct member (->nictx).
This variable is used during version negotiation. Indeed, a connection
may have to support support several QUIC versions of paquets during
the handshake. ->nictx is the QUIC TLS cipher context used for the negotiated
QUIC version.
This patch allows a connection to dynamically allocate this TLS cipher context.
Add a new pool (pool_head_quic_tls_ctx) for such QUIC TLS cipher context object.
Modify qc_new_conn() to initialize ->nictx to NULL value.
quic_tls_ctx_secs_free() frees all the secrets attached to a QUIC TLS cipher context.
Modify it to do nothing if it is called with a NULL TLS cipher context.
Modify to allocate ->nictx from qc_conn_finalize() just before initializing
its secrets. qc_conn_finalize() allocates -nictx only if needed (if a new QUIC
version was negotiated).
Modify qc_conn_release() which release a QUIC connection (quic_conn struct) to
release ->nictx TLS cipher context.
There is no need to keep an encoded version of the QUIC listener transport
parameters attache to the connection.
Remove ->enc_params and ->enc_params_len member of quic_conn struct.
Use variables to build the encoded transport parameter local to
ha_quic_set_encryption_secrets() before they are passed to
SSL_set_quic_transport_params().
Modify qc_ssl_sess_init() prototype. It was expected to be used with
the encoded transport parameters as passed parameter, but they were not
used. Cleanup this function.
During startup, when the "none" method for "init-addr" is evaluated, a
warning is emitted if a resolution failure was previously encountered. The
documentation of the "none" method states it should be used to ignore server
resolution failures and let the server starts in DOWN state. However,
because a warning may be emitted, it is not possible to start HAProxy with
"zero-warning" option.
The same is true when "-dr" command line option is used. It is counter
intuitive and, in a way, this contradict what is specified in the
documentation.
So instead, a notice message is now emitted. At the end, if "-dr" command
line option is used or if "none" method is explicitly used, it means the
admin is agree with server resolution failures. There is no reason to emit a
warning.
This patch should fix the issue #2176. It could be backported to all stable
versions but backporting to 2.8 is probably enough for now.
parse_cpu_set() stopped returning the undocumented -1 which was a
leftover from an earlier attempt, changed from ulong to int since
it only returns a success/failure and no more a mask. Thus it must
not return -1 and its callers must only test for != 0, as is
documented.
This field used to store the cpumap of the first thread in a group, and
was used till 2.4 to hold some default settings, after which it was no
longer used. Let's just drop it.
The per-process CPU affinity settings are only applied during forking,
which means that cpu-map are ignored when running in foreground (e.g.
haproxy started with -db). This is historic due to the original semantics
of a process array, but isn't documented and causes surprises when trying
to debug affinity settings.
Let's make sure the setting is applied to the workers themselves even
in foreground. This may be backported to 2.6 though it is really not
important. If backported, it also depends on previous commit:
BUG/MINOR: cpuset: remove the bogus "proc" from the cpu_map struct
We're currently having a problem with the porting from cpu_map from
processes to thread-groups as it happened in 2.7 with commit 5b09341c0
("MEDIUM: cpu-map: replace the process number with the thread group
number"), though it seems that it has deeper roots even in 2.0 and
that it was progressively made worng over time.
The issue stems in the way the per-process and per-thread cpu-sets were
employed over time. Originally only processes were supported. Then
threads were added after an optional "/" and it was documented that
"cpu-map 1" is exactly equivalent to "cpu-map 1/all" (this was clarified
in 2.5 by commit 317804d28 ("DOC: update references to process numbers
in cpu-map and bind-process").
The reality is different: when processes were still supported, setting
"cpu-map 1" would apply the mask to the process itself (and only when
run in the background, which is not documented either and is also a
bug for another fix), and would be combined with any possible per-thread
mask when calculating the threads' affinity, possibly resulting in empty
sets. However, "cpu-map 1/all" would only set the mask for the threads
and not the process. As such the following:
cpu-map 1 odd
cpu-map 1/1-8 even
would leave no CPU while doing:
cpu-map 1/all odd
cpu-map 1/1-8 even
would allow all CPUs.
While such configs are very unlikely to ever be met (which is why this
bug is tagged minor), this is becoming quite more visible while testing
automatic CPU binding during 2.9 development because due to this bug
it's much more common to end up with incorrect bindings.
This patch fixes it by simply removing the .proc entry from cpu_map and
always setting all threads' maps. The process is no longer arbitrarily
bound to the group 1's mask, but in case threads are disabled, we'll
use thread 1's mask since it contains the configured CPUs.
This fix should be backported at least to 2.6, but no need to insist if
it resists as it's easier to break cpu-map than to fix an unlikely issue.
As documented, the NUMA auto-detection is not supposed to be used when
the CPU affinity was set either by taskset (already checked) or by a
cpu-map directive. However this check was missing, so that configs
having cpu-map entries would still first bind to a single node. In
practice it has no impact on correct configs since bindings will be
replaced. However for those where the cpu-map directive are not
exhaustive it will have the impact of binding those threads to one node,
which disagrees with the doc (and makes future evolutions significantly
more complicated).
This could be backported to 2.4 where numa-cpu-mapping was added, though
if nobody encountered this by then maybe we should only focus on recent
versions that are more NUMA-friendly (e.g. 2.8 only). This patch depends
on this previous commit that brings the function we rely on:
MINOR: cpuset: add cpu_map_configured() to know if a cpu-map was found
Since we'll soon want to adjust the "thread-groups" degree of freedom
based on the presence of cpu-map, we first need to be able to detect
if cpu-map was used. This function scans all cpu-map sets to detect if
any is present, and returns true accordingly.
This adds the "core.get_var()" method allow the reading
of "proc." scoped variables outside of TXN or HTTP/TCPApplet.
Fixes: #2212
Signed-off-by: Daan van Gorkum <djvg@djvg.net>
A FCGI response may contain a "Location" header with no status code. In this
case a 302-Found HTTP response must be returned to the client. However,
while the status code is indeed 302, the reason is wrong. "Found" must be
set instead of "Moved Temporarily".
This patch must be backported as far as 2.2. With the commit e3e4e0006
("BUG/MINOR: http: Return the right reason for 302"), this should fix the
issue #2208.
Calling lual_newstate(Init main lua stack) in the hlua_init_state()
function, the return value of lua_newstate() may be NULL (for example
in case of OOM). In this case, L will be NULL, and then crash happens
in lua_getextraspace(). So, we add a check for lua_newstate.
This should be backported at least to 2.4, maybe further.
Building with gcc-6.5:
src/quic_conn.c: In function 'send_retry':
src/quic_conn.c:6554:2: error: dereferencing type-punned pointer will break
strict-aliasing rules [-Werror=strict-aliasing]
*((uint32_t *)((unsigned char *)&buf[i])) = htonl(qv->num);
This patch use write_n32 to set the value.
This could be backported until v2.6
This bug arrived with this commit:
MEDIUM: quic: Dynamic allocations of QUIC TLS encryption levels
It is possible that haproxy receives a late Initial packet after it has
released its Initial or Handshake encryption levels. In this case
it must not try to retransmit packets from such encryption levels to
speed up the handshake completion.
No need to backport.
Adds a new sample fetch method to get the curve name used in the
key agreement to enable better observability. In OpenSSLv3, the function
`SSL_get_negotiated_group` returns the NID of the curve and from the NID,
we get the curve name by passing the NID to OBJ_nid2sn. This was not
available in v1.1.1. SSL_get_curve_name(), which returns the curve name
directly was merged into OpenSSL master branch last week but will be available
only in its next release.
Because of a cut/paste error, the wrong reason was returned for 302
code. The 301 reason was returned instead. Thus now, "Found" is returned for
302, instead of "Moved Permanently".
This pathc should fix the issue 2208. It must be backported to all stable
versions.
When "add" or "sub" conveters are used, an overflow detection is performed.
When 2 negative integers are added (or a positive integer is substracted to
a positive one), we take care to not exceed the low limit (LLONG_MIN) and
when 2 positive integers are added, we take care to not exceed the high
limit (LLONG_MAX).
However, because of a missing 'else' statement, if there is no overflow in
the first case, we fall back on the second check (the one for positive adds)
and LLONG_MAX is returned. It means that most of time, when 2 negative
integers are added (or a positive integer is substracted to a negative one),
LLONG_MAX is returned.
This patch should solve the issue #2216. It must be backported to all stable
versions.
I assumed that the hlua_yieldk() function used in queue:pop_wait()
function would eventually return when the continuation function would
return.
But this is wrong, the continuation function is simply called back by the
resume after the hlua_yieldk() which does not return in this case. The
caller is no longer the initial calling function, but Lua, so when the
continuation function eventually returns, it does not give the hand back
to the C calling function (queue:pop_wait()), like we're used to, but
directly to Lua which will continue the normal execution of the (Lua)
function that triggered the C-function, effectively bypassing the end
of the C calling function.
Because of this, the queue waiting list cleanup never occurs!
This causes some undesirable effects:
- pop_wait() will slowly leak over the time, because the allocated queue
waiting entry never gets deallocated when the function is finished
- queue:push() will become slower and slower because the wait list will
keep growing indefinitely as a result of the previous leak
- the task that performed at least 1 pop_wait() could suffer from
useless wakeups because it will stay indefinitely in the queue waiting
list, so every queue:push() will try to wake the task, even if the
task is not waiting for new queue items.
- last but not least, if the task that performed at least 1 pop_wait ends
or crashes, the next queue:push() will lead to invalid reads and
process crash because it will try to wakeup a ghost task that doesn't
exist anymore.
To fix this, the pop_wait function was reworked with the assumption that
the hlua_yieldk() with continuation function never returns. Indeed, it is
now the continuation function that will take care of the cleanup, instead
of the parent function.
This must be backported in 2.8 with 86fb22c5 ("MINOR: hlua_fcn: add Queue class")
lua_yieldk ctx argument is of type lua_KContext which is typedefed to
intptr_t when available so it can be used to store pointers.
But the wrapper function hlua_yieldk() passes it as a regular it so it
breaks that promise.
Changing hlua_yieldk() prototype so that ctx argument is of type
lua_KContext.
This bug had no functional impact because ctx argument is not being
actively used so far. This may be backported to all stable versions
anyway.
The internal tick clock was used to export the timestamp int the token
on retry packets. Doing this in cluster mode the nodes don't
understand the timestamp from tokens generated by others.
This patch re-work this using the the real current date (wall-clock time).
Timestamp are also now considered in secondes instead of milleseconds.
This patch should be backported until v2.6
RFC 9000, 17.2.5.1:
"The client MUST use the value from the Source Connection ID
field of the Retry packet in the Destination Connection ID
field of subsequent packets that it sends."
There was no control of this and we could accept a different
dcid on init packets containing a valid token.
The randomized value used as new scid on retry packets is now
added in the aad used to encode the token. This way the token
will appear as invalid if the dcid missmatch the scid of
the previous retry packet.
This should be backported until v2.6
According to rfc 5869 about hkdf, extract function returns a
pseudo random key usable to perform expand using labels to derive keys.
So the intermediate expand on a label is useless, the key should be strong
enought using only one expand.
This patch should be backported until v2.6
Computing the token key and IV, a stronger derived key was used
to compute the key but the weak secret was still used to compute
the IV. This could be used to found the secret.
This patch fix this using the same derived key than the one used
to compute the token key.
This should backport until v2.6
Configuration parsing allow port like 8000/websocket/. This is
a nonsense and allowing this syntax may hide to the user something
not corresponding to its intent.
This patch should not be backported because it could break existing
configurations
In hlua_queue_size(), queue size is loaded as a regular int, but the
queue might be shared by multiple threads that could perform some
atomic pushing or popping attempts in parallel, so we better use an
atomic load operation to guarantee consistent readings.
This could be backported in 2.8.
When errors are encountered in sink_new_from_logsrv() function,
incompetely allocated ressources are freed to prevent memory leaks.
For instance: logsrv implicit server is manually cleaned up on error prior
to returning from the function.
However, since 198e92a8e5 ("MINOR: server: add a global list of all known
servers") every server created using new_server() is registered to the
global list, but unfortunately the manual srv cleanup in
sink_new_from_logsrv() doesn't remove the srv from the global list, so the
freed server will still be referenced there, which can result in invalid
reads later.
Moreover, server API has evolved since, and now the srv_drop() function is
available for that purpose, so let's use it, but make sure that srv is
freed before the proxy because on older versions srv_drop() expects the
srv to be linked to a valid proxy pointer.
This must be backported up to 2.4.
[For 2.4 version, free_server() must be used instead of srv_drop()]
As the example/lua/mailers.lua script does its best to mimic the c-mailer
implementation, it should support the "timeout mail" directive as well.
This could be backported in 2.8.
srv->rid default value is set in _srv_parse_init() after the server is
succesfully allocated using new_server().
This is wrong because new_server() can be used independently so rid value
assignment would be skipped in this case.
Hopefully new_server() allocates server data using calloc() so srv->rid
is already set to 0 in practise. But if calloc() is replaced by malloc()
or other memory allocating function that doesn't zero-initialize srv
members, this could lead to rid being uninitialized in some cases.
This should be backported in 2.8 with 61e3894dfe ("MINOR: server: add
srv->rid (revision id) value")
Multiple error paths (memory,IO related) in cfg_post_parse_ring() were
not implemented correcly and could result in memory leak or undefined
behavior.
Fixing them all at once.
This can be backported in 2.4
sft freeing attempt made in a575421 ("BUG/MINOR: sink: missing sft free in
sink_deinit()") is incomplete, because sink->sft is meant to be used as a
list and not a single sft entry.
Because of that, the previous fix only frees the first sft entry, which
fixes memory leaks for single-server forwarders (this is the case for
implicit rings), but could still result in memory leaks when multiple
servers are configured in a explicit ring sections.
What this patch does: instead of directly freeing sink->sft, it iterates
over every list members to free them.
It must be backported up to 2.4 with a575421.
When leaving cfg_parse_log_forward() on error paths, errmsg which is local
to the function could still point to valid data, and it's our
responsibility to free it.
Instead of freeing it everywhere it is invoved, we free it prior to
leaving the function.
This should be backported as far as 2.4.
Multiple error paths were badly handled in cfg_parse_log_forward():
some errors were raised without interrupting the function execution,
resulting in undefined behavior.
Instead of fixing issues separately, let's fix the whole function at once.
This should be backported as far as 2.4.
"missing name for ip-forward section" is generated instead of "missing
name name for log-forward section" in cfg_parse_log_forward().
This may be backported up to 2.4.
In e709e1e ("MEDIUM: logs: buffer targets now rely on new sink_write")
we started using the sink API instead of using the ring_write function
directly.
But as indicated in the commit message, the maxlen parameter of the log
directive now only applies to the message part and not the complete
payload. I don't know what the original intent was (maybe minimizing code
changes) but it seems wrong, because the doc doesn't mention this special
case, and the result is that the ring->buffer output can differ from all
other log output types, making it very confusing.
One last issue with this is that log messages can end up being dropped at
runtime, only for the buffer target, and even if logsrv->maxlen is
correctly set (including default: 1024) because depending on the generated
header size the payload can grow bigger than the accepted sink size (sink
maxlen is not mandatory) and we have no simple way to detect this at
configuration time.
First, we partially revert e709e1e:
TARGET_BUFFER still leverages the proper sink API, but thanks to
"MINOR: sink: pass explicit maxlen parameter to sink_write()" we now
explicitly pass the logsrv->maxlen to the sink_write function in order
to stop writing as soon as either sink->maxlen or logsrv->maxlen is
reached.
This restores pre-e709e1e behavior with the added benefit from using the
high-level API, which includes automatically announcing dropped message
events.
Then, we also need to take the ending '\n' into account: it is not
explicitly set when generating the logline for TARGET_BUFFER, but it will
be forcefully added by the sink_forward_io_handler function from the tcp
handler applet when log messages from the buffer are forwarded to tcp
endpoints.
In current form, because the '\n' is added later in the chain, maxlen is
not being considered anymore, so the final log message could exceed maxlen
by 1 byte, which could make receiving servers unhappy in logging context.
To prevent this, we sacrifice 1 byte from the logsrv->maxlen to ensure
that the final message will never exceed log->maxlen, even if the '\n'
char is automatically appended later by the forwarding applet.
Thanks to this change TCP (over RING/BUFFER) target now behaves like
FD and UDP targets.
This commit depends on:
- "MINOR: sink: pass explicit maxlen parameter to sink_write()"
It may be backported as far as 2.2
[For 2.2 and 2.4 the patch does not apply automatically, the sink_write()
call must be updated by hand]
sink_write() currently relies on sink->maxlen to know when to stop
writing a given payload.
But it could be useful to pass a smaller, explicit value to sink_write()
to stop before the ring maxlen, for instance if the ring is shared between
multiple feeders.
sink_write() now takes an optional maxlen parameter:
if maxlen is > 0, then sink_write will stop writing at maxlen if maxlen
is smaller than ring->maxlen, else only ring->maxlen will be considered.
[for haproxy <= 2.7, patch must be applied by hand: that is:
__sink_write() and sink_write() should be patched to take maxlen into
account and function calls to sink_write() should use 0 as second argument
to keep original behavior]
A regression was introduced with 5464885 ("MEDIUM: log/sink: re-work
and merge of build message API.").
For UDP targets, a final '\n' is systematically inserted, but with the
rework of the build message API, it is inserted after the maxlen
limitation has been enforced, so this can lead to the final message
becoming maxlen+1. For strict syslog servers that only accept up to
maxlen characters, this could be a problem.
To fix the regression, we take the final '\n' into account prior to
building the message, like it was done before the rework of the API.
This should be backported up to 2.4.
When maxlen parameter exceeds sink size, a warning is generated and maxlen
is enforced to sink size. But the err_code is incorrectly set to ERR_ALERT
Indeed, being a "warning", ERR_WARN should be used here.
This may be backported as far as 2.2
When a ring section defines its size using the "size" directive with a
smaller size than the default one or smaller than the previous one,
a warning is generated to inform the user that the new size will be
ignored.
However the err_code is returned as FATAL, so this cause haproxy to
incorrectly abort the init sequence.
Changing the err_code to ERR_WARN so that this warning doesn't refrain
from successfully starting the process.
This should be backported as far as 2.4
Adding missing free for sft (string_forward_target) in sink_deinit(),
which resulted in minor leak for each declared ring target at deinit().
(either explicit and implicit rings are affected)
This may be backported up to 2.4.
_proxy_http_parse_7239_expr() helper used in proxy_http_parse_7239()
function may return ERR_ABORT in case of memory error. But the error check
used below is insufficient to catch ERR_ABORT so the function could keep
executing prior to returning ERR_ABORT, which may cause undefined
behavior. Hopefully no sensitive handling is performed in this case so
this bug has very limited impact, but let's fix it anyway.
We now use ERR_CODE mask instead of ERR_FATAL to check if err_code is set
to any kind of error combination that should prevent the function from
further executing.
This may be backported in 2.8 with b2bb9257d2 ("MINOR: proxy/http_ext:
introduce proxy forwarded option")
forward proxy server list created from sink_new_from_logsrv() is invalid
Indeed, srv->next is literally assigned to itself. This did not cause
issues during syslog handling because the sft was properly set, but it
will cause the free_proxy(sink->forward_px) at deinit to go wild since
free_proxy() will try to iterate through the proxy srv list to free
ressources, but because of the improper list initialization, double-free
and infinite-loop will occur.
This bug was revealed by 9b1d15f53a ("BUG/MINOR: sink: free forward_px on deinit()")
It must be backported as far as 2.4.
When a s-maxage cache-control directive is present, it overrides any
other max-age or expires value (see section 5.2.2.9 of RFC7234). So if
we have a max-age=0 alongside a strictly positive s-maxage, the response
should be cached.
This bug was raised in GitHub issue #2203.
The fix can be backported to all stable branches.
This bug was introduced by this commit:
MEDIUM: quic: Release encryption levels and packet number spaces asap
Add some checks before derefencing pointers to packet number spaces objects
to dump them from "show quic" command.
No backport needed.
Thierry Fournier reported an annoying side-effect when using the debug()
converter.
Consider the following examples:
[1] http-request set-var(txn.test) bool(true),ipmask(24)
[2] http-request redirect location /match if { bool(true),ipmask(32) }
When starting haproxy with [1] example we get:
config : parsing [test.conf:XX] : error detected in frontend 'fe' while parsing 'http-request set-var(txn.test)' rule : converter 'ipmask' cannot be applied.
With [2], we get:
config : parsing [test.conf:XX] : error detected in frontend 'fe' while parsing 'http-request redirect' rule : error in condition: converter 'ipmask' cannot be applied in ACL expression 'bool(true),ipmask(32)'.
Now consider the following examples which are based on [1] and [2]
but with the debug() sample conv inserted in-between those incompatible
sample types:
[1*] http-request set-var(txn.test) bool(true),debug,ipmask(24)
[2*] http-request redirect location /match if { bool(true),debug,ipmask(32) }
According to the documentation, "it is safe to insert the debug converter
anywhere in a chain, even with non-printable sample types".
Thus we don't expect any side-effect from using it within a chain.
However in current implementation, because of debug() returning SMP_T_ANY
type which is a pseudo type (only resolved at runtime), the sample
compatibility checks performed at parsing time are completely uneffective.
(haproxy will start and no warning will be emitted)
The indesirable effect of this is that debug() prevents haproxy from
warning you about impossible type conversions, hiding potential errors
in the configuration that could result to unexpected evaluation or logic
while serving live traffic. We better let haproxy warn you about this kind
of errors when it has the chance.
With our previous examples, this could cause some inconveniences. Let's
say for example that you are testing a configuration prior to deploying
it. When testing the config you might want to use debug() converter from
time to time to check how the conversion chain behaves.
Now after deploying the exact same conf, you might want to remove those
testing debug() statements since they are no longer relevant.. but
removing them could "break" the config and suddenly prevent haproxy from
starting upon reload/restart. (this is the expected behavior, but it
comes a bit too late because of debug() hiding errors during tests..)
To fix this, we introduce a new output type for sample expressions:
SMP_T_SAME - may only be used as "expected" out_type (parsing time)
for sample converters.
As it name implies, it is a way for the developpers to indicate that the
resulting converter's output type is guaranteed to match the type of the
sample that is presented on the converter's input side.
(converter may alter data content, but data type must not be changed)
What it does is that it tells haproxy that if switching to the converter
(by looking at the converter's input only, since outype is SAME) is
conversion-free, then the converter type can safely be ignored for type
compatibility checks within the chain.
debug()'s out_type is thus set to SMP_T_SAME instead of ANY, which allows
it to fully comply with the doc in the sense that it does not impact the
conversion chain when inserted between sample items.
Co-authored-by: Thierry Fournier <thierry.f.78@gmail.com>
Historically, the ADDR pseudo-type did not exist. So when IPV6 support
was added to existing IPV4 sample fetches (e.g.: src,dst,hdr_ip...) the
expected out_type in related sample definitions was left on IPV4 because
it was required to declare the out_type as the lowest common denominator
(the type that can be casted into all other ones) to make compatibility
checks at parse time work properly.
However, now that ADDR pseudo-type may safely be used as out_type since
("MEDIUM: sample: add missing ADDR=>? compatibility matrix entries"), we
can use ADDR for fetches that may output both IPV4 and IPV6 at runtime.
One added benefit on top of making the code less confusing is that
'haproxy -dKsmp' output will now show "addr" instead of "ipv4" for such
fetches, so the 'haproxy -dKsmp' output better complies with the fetches
signatures from the documentation.
out_ip fetch, which returns an ip according to the doc, was purposely
left as is (returning IPV4) since the smp_fetch_url_ip() implementation
forces output type to IPV4 anyway, and since this is an historical fetch
I prefer not to touch it to prevent any regression. However if
smp_fetch_url_ip() were to be fixed to also return IPV6 in the future,
then its expected out_type may be changed to ADDR as well.
Multiple notes in the code were updated to mention that the appropriate
pseudo-type may be used instead of the lowest common denominator for
out_type when available.
Since 1478aa7 ("MEDIUM: sample: Add IPv6 support to the ipmask converter")
ipmask converter may return IPV6 as well, so the sample definition must
be updated. (the output type is currently explicited to IPV4)
This has no functional impact: one visible effect will be that
converter's visible output type will change from "ipv4" to "addr" when
using 'haproxy -dKcnv'.
SMP_T_ADDR support was added in b805f71 ("MEDIUM: sample: let the cast
functions set their output type").
According to the above commit, it is made clear that the ADDR type is
a pseudo/generic type that may be used for compatibility checks but
that cannot be emitted from a fetch or converter.
With that in mind, all conversions from ADDR to other types were
explicitly disabled in the compatibility matrix.
But later, when map_*_ip functions were updated in b2f8f08 ("MINOR: map:
The map can return IPv4 and IPv6"), we started using ADDR as "expected"
output type for converters. This still complies with the original
description from b805f71, because it is used as the theoric output
type, and is never emitted from the converters themselves (only "real"
types such as IPV4 or IPV6 are actually being emitted at runtime).
But this introduced an ambiguity as well as a few bugs, because some
compatibility checks are being performed at config parse time, and thus
rely on the expected output type to check if the conversion from current
element to the next element in the chain is theorically supported.
However, because the compatibility matrix doesn't support ADDR to other
types it is currently impossible to use map_*_ip converters in the middle
of a chain (the only supported usage is when map_*_ip converters are
at the very end of the chain).
To illustrate this, consider the following examples:
acl test str(ok),map_str_ip(test.map) -m found # this will work
acl test str(ok),map_str_ip(test.map),ipmask(24) -m found # this will raise an error
Likewise, stktable_compatible_sample() check for stick tables also relies
on out_type[table_type] compatibility check, so map_*_ip cannot be used
with sticktables at the moment:
backend st_test
stick-table type string size 1m expire 10m store http_req_rate(10m)
frontend fe
bind localhost:8080
mode http
http-request track-sc0 str(test),map_str_ip(test.map) table st_test # raises an error
To fix this, and prevent future usage of ADDR as expected output type
(for either fetches or converters) from introducing new bugs, the
ADDR=>? part of the matrix should follow the ANY type logic.
That is, ADDR, which is a pseudo-type, must be compatible with itself,
and where IPV4 and IPV6 both support a backward conversion to a given
type, ADDR must support it as well. It is done by setting the relevant
ADDR entries to c_pseudo() in the compatibility matrix to indicate that
the operation is theorically supported (c_pseudo() will never be executed
because ADDR should not be emitted: this only serves as a hint for
compatibility checks at parse time).
This is what's being done in this commit, thanks to this the broken
examples documented above should work as expected now, and future
usage of ADDR as out_type should not cause any issue.
This function is used for ANY=>!ANY conversions in the compatibility
matrix to help differentiate between real NOOP (c_none) and pseudo
conversions that are theorically supported at config parse time but can
never occur at runtime,. That is, to explicit the fact that actual related
runtime operations (e.g.: ANY->IPV4) are not NOOP since they might require
some conversion to be performed depending on the input type.
When checking the conf we don't know the effective out types so
cast[pseudo type][pseudo type] is allowed in the compatibility matrix,
but at runtime we only expect cast[real type][(real type || pseudo type)]
because fetches and converters may not emit pseudo types, thus using
c_none() everywhere was too ambiguous.
The process will crash if c_pseudo() is invoked to help catch bugs:
crashing here means that a pseudo type has been encountered on a
converter's input at runtime (because it was emitted earlier in the
chain), which is not supported and results from a broken sample fetch
or converter implementation. (pseudo types may only be used as out_type
in sample definitions for compatibility checks at parsing time)
Both sample_parse_expr() and parse_acl_expr() implement some code
logic to parse sample conv list after respective fetch or acl keyword.
(Seems like the acl one was inspired by the sample one historically)
But there is clearly code duplication between the two functions, making
them hard to maintain.
Hopefully, the parsing logic between them has stayed pretty much the
same, thus the sample conv parsing part may be moved in a dedicated
helper parsing function.
This is what's being done in this commit, we're adding the new function
sample_parse_expr_cnv() which does a single thing: parse the converters
that are listed right after a sample fetch keyword and inject them into
an already existing sample expression.
Both sample_parse_expr() and parse_acl_expr() were adapted to now make
use of this specific parsing function and duplicated code parts were
cleaned up.
Although sample_parse_expr() remains quite complicated (numerous function
arguments due to contextual parsing data) the first goal was to get rid of
code duplication without impacting the current behavior, with the added
benefit that it may allow further code cleanups / simplification in the
future.
bc_{dst,src} sample fetches were introduced in 7d081f0 ("MINOR:
tcp_samples: Add samples to get src/dst info of the backend connection")
However a copy-pasting error was probably made here because they both
return IP addresses (not integers) like their fc_{src,dst} or src,dst
analogs.
Hopefully, this had no functional impact because integer can be converted
to both IPV4 and IPV6 types so the compatibility checks were not affected.
Better fix this to prevent confusion in code and "haproxy -dKsmp" output.
This should be backported up to 2.4 with 7d081f0.
This is to prevent the build from these gcc warning when compiling with -O1 option:
willy@wtap:haproxy$ sh make-native-quic-memstat src/quic_conn.o CPU_CFLAGS="-O1"
CC src/quic_conn.o
src/quic_conn.c: In function 'qc_prep_pkts':
src/quic_conn.c:3700:44: warning: potential null pointer dereference [-Wnull-dereference]
3700 | frms = &qel->pktns->tx.frms;
| ~~~^~~~~~~
src/quic_conn.c:3626:33: warning: 'end' may be used uninitialized in this function [-Wmaybe-uninitialized]
3626 | if (end - pos < QUIC_INITIAL_PACKET_MINLEN) {
| ~~~~^~~~~
This bug was introduced by this commit:
MINOR: quic: Remove pool_zalloc() from qc_new_conn().
If ->path is not initialized to NULL value, and if a QUIC connection object
allocation has failed (from qc_new_conn()), haproxy could crash in
quic_conn_prx_cntrs_update() when dereferencing this QUIC connection member.
No backport needed.
This bug was reported by GH #2200 (coverity issue) as follows:
*** CID 1516590: Resource leaks (RESOURCE_LEAK)
/src/quic_tls.c: 159 in quic_conn_enc_level_init()
153
154 LIST_APPEND(&qc->qel_list, &qel->list);
155 *el = qel;
156 ret = 1;
157 leave:
158 TRACE_LEAVE(QUIC_EV_CONN_CLOSE, qc);
>>> CID 1516590: Resource leaks (RESOURCE_LEAK)
>>> Variable "qel" going out of scope leaks the storage it points to.
159 return ret;
160 }
161
162 /* Uninitialize <qel> QUIC encryption level. Never fails. */
163 void quic_conn_enc_level_uninit(struct quic_conn *qc, struct quic_enc_level *qel)
164 {
This bug was introduced by this commit which has foolishly assumed the encryption
level memory would be released after quic_conn_enc_level_init() has failed. This
is no more possible because this object is dynamic and no more a static member
of the QUIC connection object.
Anyway, this patch modifies quic_conn_enc_level_init() to ensure this is
no more leak when quic_conn_enc_level_init() fails calling quic_conn_enc_level_uninit()
in case of memory allocation error.
quic_conn_enc_level_uninit() code was moved without modification only to be defined
before quic_conn_enc_level_init()
There is no need to backport this.
Ilya reported in issue #2193 that the latest Fedora complains about us
passing NULL to epoll_wait() in the "debug dev fd" code to try to detect
an epoll FD. That was intentional to get the kernel's verifications and
make sure we're facing a poller, but since such a warning comes from the
libc, it's possible that it plans to replace the syscall with a wrapper
in the near future (e.g. epoll_pwait()), and that just hiding the NULL
(as was confirmed to work) might just postpone the problem.
Let's take another approach, instead we'll create a new dummy FD that
we'll try to remove from the epoll set using epoll_ctl(). Since we
created the FD we're certain it cannot be there. In this case (and
only in this case) epoll_ctl() will return -ENOENT, otherwise it will
typically return EINVAL or EBADF. It was verified that it works and
doesn't return false positives for other FD types. It should be
backported to the branches that contain a backport of the commit which
introduced the feature, apparently as far as 2.4:
5be7c198e ("DEBUG: cli: add a new "debug dev fd" expert command")
Some compiler could complain with such a warning:
src/quic_conn.c:3700:44: warning: potential null pointer dereference [-Wnull-dereference]
3700 | frms = &qel->pktns->tx.frms;
It could not figure out that <qel> could not be NULL at this location.
This is fixed calling qc_prep_hpkts() with a disguise 1st parameter.
This patch allows the low level packet parser to drop packets with type for discarded
packet number spaces. Furthermore, this prevents it from reallocating new encryption
levels and packet number spaces already released/discarded. When a packet number space
is discarded, it MUST NOT be reallocated.
As the packet number space discarding is done asap the type of packet received is
known, some packet number space discarding check may be safely removed from qc_try_rm_hp()
and qc_qel_may_rm_hp() which are called after having parse the packet header, and
is type.
As the packet number spaces and encryption level are dynamically allocated,
the information about the packet number space discarded status must be kept
somewhere else than in these objects.
quic_tls_discard_keys() is no more useful.
Modify quic_pktns_discard() to do the same job: flag the quic_conn object
has having discarded packet number space.
Implement quic_tls_pktns_is_disarded() to check if a packet number space is
discarded. Note the Application data packet number space is never discarded.
There is no need to check that the packet number space associated to the encryption
level to handle the CRYPTO frames is available when entering qc_handle_crypto_frm().
This has already been done by the caller: qc_treat_rx_pkts().
Release the memory allocated for the Initial and Handshake encryption level
under packet number spaces as soon as possible.
qc_treat_rx_pkts() has been modified to do that for the Initial case. This must
be done after all the Initial packets have been parsed and the underlying TLS
secrets have been flagged as to be discarded. As the Initial encryption level is
removed from the list attached to the quic_conn object inside a list_for_each_entry()
loop, this latter had to be converted into a list_for_each_entry_safe() loop.
The memory allocated by the Handshake encryption level and packet number spaces
is released just before leaving the handshake I/O callback (qc_conn_io_cb()).
As ->iel and ->hel pointer to Initial and Handshake encryption are reset to null
value when releasing the encryption levels memory, some check have been added
at several place before dereferencing them. Same thing for the ->ipktns and ->htpktns
pointer to packet number spaces.
Also take the opportunity of this patch to modify qc_dgrams_retransmit() to
use packet number space variables in place of encryption level variables to shorten
some statements. Furthermore this reflects the jobs it has to do: retransmit
UDP datagrams from packet number spaces.
quic_conn_io_cb() is the I/O handler callback used during the handshakes.
quic_conn_app_io_cb() is used after the handshakes. Both call qc_rm_hp_pkts()
before parsing the RX packets ordered by their packet numbers calling qc_treat_rx_pkts().
qc_rm_hp_pkts() is there to remove the header protection to reveal the packet
numbers, then reorder the packets by their packet numbers.
qc_rm_hp_pkts() may be safely directly called by qc_treat_rx_pkts(), which is
itself called by the I/O handlers.
Insert a loop around the existing code of qc_treat_rx_pkts() to handle all
the RX packets by encryption level for all the encryption levels allocated
for the connection, i.e. for all the encryption levels with secrets derived
from the ones provided by the TLS stack through ->set_encryption_secrets().
Replace ->els static array of encryption levels by 4 pointers into the QUIC
connection object, quic_conn struct.
->iel denotes the Initial encryption level,
->eel the Early-Data encryption level,
->hel the Handshaske encryption level and
->ael the Application Data encryption level.
Add ->qel_list to this structure to list the encryption levels after having been
allocated. Modify consequently the encryption level object itself (quic_enc_level
struct) so that it might be added to ->qel_list QUIC connection list of
encryption levels.
Implement qc_enc_level_alloc() to initialize the value of a pointer to an encryption
level object. It is used to initialized the pointer newly added to the quic_conn
structure. It also takes a packet number space pointer address as argument to
initialize it if not already initialized.
Modify quic_tls_ctx_reset() to call it from quic_conn_enc_level_init() which is
called by qc_enc_level_alloc() to allocate an encryption level object.
Implement 2 new helper functions:
- ssl_to_qel_addr() to match and pointer address to a quic_encryption level
attached to a quic_conn object with a TLS encryption level enum value;
- qc_quic_enc_level() to match a pointer to a quic_encryption level attached
to a quic_conn object with an internal encryption level enum value.
This functions are useful to be called from ->set_encryption_secrets() and
->add_handshake_data() TLS stack called which takes a TLS encryption enum
as argument (enum ssl_encryption_level_t).
Replace all the use of the qc->els[] array element values by one of the newly
added ->[ieha]el quic_conn struct member values.
Very simple patch to define and declare a pool for the QUIC TLS encryptions levels.
It will be used to dynamically allocate such objects to be attached to the
QUIC connection object (quic_conn struct) and to remove from quic_conn struct the
static array of encryption levels (see ->els[]).
Add a pool to dynamically handle the memory used for the QUIC TLS packet number spaces.
Remove the static array of packet number spaces at QUIC connection level (struct
quic_conn) and add three new members to quic_conn struc as pointers to quic_pktns
struct, one by packet number space as follows:
->ipktns for Initial packet number space,
->hpktns for Handshake packet number space and
->apktns for Application packet number space.
Also add a ->pktns_list new member (struct list) to quic_conn struct to attach
the list of the packet number spaces allocated for the QUIC connection.
Implement ssl_to_quic_pktns() to map and retrieve the addresses of these pointers
from TLS stack encryption levels.
Modify quic_pktns_init() to initialize these members.
Modify ha_quic_set_encryption_secrets() and ha_quic_add_handshake_data() to
allocate the packet numbers and initialize the encryption level.
Implement quic_pktns_release() which takes pointers to pointers to packet number
space objects to release the memory allocated for a packet number space attached
to a QUIC connection and reset their address values.
Modify qc_new_conn() to allocation only the Initial packet number space and
Initial encryption level.
Modify QUIC loss detection API (quic_loss.c) to use the new ->pktns_list
list attached to a QUIC connection in place of a static array of packet number
spaces.
Replace at several locations the use of elements of an array of packet number
spaces by one of the three pointers to packet number spaces
While HTTP makes no promises of sub-message delivery, haproxy tries to
make reasonable efforts to be friendly to applications relying on this,
particularly though the "option http-no-delay" statement. However, it
was reported that when slz compression is being used, a few bytes can
remain pending for more data to complete them in the SLZ queue when
built on a 64-bit little endian architecture. This is because aligning
blocks on byte boundary is costly (requires to switch to literals and
to send 4 bytes of block size), so incomplete bytes are left pending
there until they reach at least 32 bits. On other architecture, the
queue is only 8 bits long.
Robert Newson from Apache's CouchDB project explained that the heartbeat
used by CouchDB periodically delivers a single LF character, that it used
to work fine before the change enlarging the queue for 64-bit platforms,
but only forwards once every 3 LF after the change. This was definitely
caused by this incomplete byte sequence queuing. Zlib is not affected,
and the code shows that ->flush() is always called. In the case of SLZ,
the called function is rfc195x_flush_or_finish() and when its "finish"
argument is zero, no flush is performed because there was no such flush()
operation.
The previous patch implemented a flush() operation in SLZ, and this one
makes rfc195x_flush_or_finish() call it when finish==0. This way each
delivered data block will now provoke a flush of the queue as is done
with zlib.
This may very slightly degrade the compression ratio, but another change
is needed to condition this on "option http-no-delay" only in order to
be consistent with other parts of the code.
This patch (and the preceeding slz one) should be backported at least to
2.6, but any further change to depend on http-no-delay should not.
In some cases it may be desirable for latency reasons to forcefully
flush the queue even if it results in suboptimal compression. In our
case the queue might contain up to almost 4 bytes, which need an EOB
and a switch to literal mode, followed by 4 bytes to encode an empty
message. This means that each call can add 5 extra bytes in the ouput
stream. And the flush may also result in the header being produced for
the first time, which can amount to 2 or 10 bytes (zlib or gzip). In
the worst case, a total of 19 bytes may be emitted at once upon a flush
with 31 pending bits and a gzip header.
This is libslz upstream commit cf8c4668e4b4216e930b56338847d8d46a6bfda9.
The 32-bits version field of the Retry paquet was inversed by the code. As this
field must be in the network byte order on the wire, this code has supposed that
the sender of the Retry packet will always be little endian. Hopefully this is
often the case on our Intel machines ;)
Must be backported as far as 2.6.
The 4 bits least significant bits of the first byte in a Retry packet must be
random. There are generated calling statistical_prng_range() with 16 as argument.
Must be backported as far as 2.6.
When a stick-table is defined within a peers section, the name is
prefixed with the peers section name. However when checking for
duplicate table names, the check was using the table name without
the prefix, and would thus never match.
Must be backported as far as 2.6.
This patch introduces the "client-sigalgs" keyword for the server line,
which allows to configure the list of server signature algorithms
negociated during the handshake. Also available as
"ssl-default-server-client-sigalgs" in the global section.
This patch introduces the "sigalgs" keyword for the server line, which
allows to configure the list of server signature algorithms negociated
during the handshake. Also available as "ssl-default-server-sigalgs" in
the global section.
This warning happened in 2.9-dev with commit 723c73f8a ("MEDIUM: mux-h1:
Split h1_process_mux() to make code more readable"). It's the usual gcc
crap that relies on comments to disable the warning but which drops these
comments between the preprocessor and the compiler, so using any split
build system (distcc, ccache etc) reintroduces the warning. Use the more
reliable and portable __fallthrough instead. No backport needed.
Return a more acurate error than the previous patch, CO_ER_SSL_EMPTY is
the code for "Connection closed during SSL handshake" which is more
precise than CO_ER_SSL_ABORT ("Connection error during SSL handshake").
No backport needed.
During a SSL_do_handshake(), SSL_ERROR_ZERO_RETURN can be returned in case
the remote peer sent a close_notify alert. Previously this would set the
connection error to CO_ER_SSL_HANDSHAKE, this patch sets it to
CO_ER_SSL_ABORT to have a more acurate error.
This bug was introduced by this commit which was not sufficient:
BUG/MINOR: quic: Possible endless loop in quic_lstnr_dghdlr()
It was revealed by the blackhole interop runner test with neqo as client.
qc_conn_release() could be called after having locke the CID tree when two different
threads was creating the same connection at the same time. Indeed in this case
the last thread which tried to create a new connection for the same an already existing
CID could not manage to insert an already inserted CID in the connection CID tree.
This was expected. It had to destroy the newly created for nothing connection calling
qc_conn_release(). But this function also locks this tree calling free_quic_conn_cids() leading to a deadlock.
A solution would have been to delete the new CID created from its tree before
calling qc_conn_release().
A better solution is to stop inserting the first CID from qc_new_conn(), and to
insert it into the CID tree only if there was not an already created connection.
This is whas is implemented by this patch.
Must be backported as far as 2.7.
Aurelien Darragon found a case of leak when working on ticket #2184.
When a reexec_on_failure() happens *BEFORE* protocol_bind_all(), the
worker is not fork and the mworker_proc struct is still there with
its 2 socketpairs.
The socketpair that is supposed to be in the master is already closed in
mworker_cleanup_proc(), the one for the worker was suppposed to
be cleaned up in mworker_cleanlisteners().
However, since the fd is not bound during this failure, the fd is never
closed.
This patch fixes the problem by setting the fd to -1 in the mworker_proc
after the fork, so we ensure that this it won't be close if everything
was done right, and then we try to close it in mworker_cleanup_proc()
when it's not set to -1.
This could be triggered with the script in ticket #2184 and a `ulimit -H
-n 300`. This will fail before the protocol_bind_all() when trying to
increase the nofile setrlimit.
In recent version of haproxy, there is a BUG_ON() in fd_insert() that
could be triggered by this bug because of the global.maxsock check.
Must be backported as far as 2.6.
The problem could exist in previous version but the code is different
and this won't be triggered easily without other consequences in the
master.
A regression was introduced in 730b983 ("MINOR: proxy: move 'forwardfor'
option to http_ext")
Indeed, when the forwardfor if-none option is specified on the frontend
but forwardfor is not specified at all on the backend: if-none from the
frontend is ignored.
But this behavior conflicts with the historical one, if-none should only
be ignored if forwardfor is also enabled on the backend and if-none is
not set there.
It should fix GH #2187.
This should be backported in 2.8 with 730b983 ("MINOR: proxy: move
'forwardfor' option to http_ext")
h1_append_chunk_size() and h1_prepend_chunk_crlf() functions were marked as
possibly unused to be able to add them in standalone commits. Now these
functions are used, the __maybe_unused statement can be removed.
When the HTX was introduced, we have lost the support for the kernel
splicing for chunked messages. Thanks to this patch set, it is possible
again. Of course, we still need to keep the H1 parser synchronized. Thus
only the chunk content can be spliced. We still need to read the chunk
envelope using a buffer.
There is no reason to backport this feature. But, just in case, this patch
depends on following patches:
* "MEDIUM: filters/htx: Don't rely on HTX extra field if payload is filtered"
* "MINOR: mux-h1: Add function to prepend the chunk crlf to the output buffer"
* "MINOR: mux-h1: Add function to append the chunk size to the output buffer"
* "REORG: mux-h1: Rename functions to emit chunk size/crlf in the output buffer"
* "MEDIUM: mux-h1: Split h1_process_mux() to make code more readable"
If an HTTP data filter is registered on a channel, we must not rely on the
HTX extra field because the payload may be changed and we cannot predict if
this value will change or not. It is too errorprone to let filters deal with
this reponsibility. So we set it to 0 when payload filtering is performed,
but only if the payload length can be determined. It is important because
this field may be used when data are forwarded. In fact, it will be used by
the H1 multiplexer to be able to splice chunk-encoded payload.
h1_process_mux() function was pretty huge a quite hard to debug. So, the
funcion is split in sub-functions. Each sub-function is responsible to a
part of the message (start-line, headers, payload, trailers...). We are
still relying on a HTTP parser to format the message to be sure to detect
errors. Functionnaly speaking, there is no change. But the code is now more
readable.
Replace a "less than" comparison between two tick variable by a call to tick_is_lt()
in quic_loss_pktns(). This bug could lead to a wrong packet loss detection
when the loss time computed values could wrap. This is the case 20 seconds after
haproxy has started.
Must be backported as far as 2.6.
In ticket #2184, HAProxy is crashing in a BUG_ON() after a lot of reload
when the previous processes did not exit.
Each worker has a socketpair which is a FD in the master, when reloading
this FD still exists until the process leaves. But the global.maxconn
value is not incremented for each of these FD. So when there is too much
workers and the number of FD reaches maxsock, the next FD inserted in
the poller will crash the process.
This patch fixes the issue by increasing the maxsock for each remaining
worker.
Must be backported in every maintained version.
This bug was introduced by this commit:
MINOR: quic: Remove pool_zalloc() from qc_new_conn()
The transport parameters was not initialized. This leaded to a crash when
dumping the received ones from TRACE()s.
Also reset the lengths of the CIDs attached to a quic_conn struct to 0 value
to prevent them from being dumped from traces when not already initialized.
No backport needed.
Replace a call to pool_zalloc() by a call to pool_malloc() into quic_dgram_parse
to allocate quic_rx_packet struct objects.
Initialize almost all the members of quic_rx_packet struct.
->saddr is initialized by quic_rx_pkt_retrieve_conn().
->pnl and ->pn are initialized by qc_do_rm_hp().
->dcid and ->scid are initialized by quic_rx_pkt_parse() which calls
quic_packet_read_long_header() for a long packet. For a short packet,
only ->dcid will be initialized.
pool_zalloc() is replaced by pool_alloc() into qc_conn_alloc_ssl_ctx() to allocate
a ssl_sock_ctx struct. ssl_sock_ctx struct member are all initiliazed to null values
excepted ->ssl which is initialized by the next statement: a call to qc_ssl_sess_init().
qc_new_conn() is ued to initialize QUIC connections with quic_conn struct objects.
This function calls quic_conn_release() when it fails to initialize a connection.
quic_conn_release() is also called to release the memory allocated by a QUIC
connection.
Replace pool_zalloc() by pool_alloc() in this function and initialize
all quic_conn struct members which are referenced by quic_conn_release() to
prevent use of non initialized variables in this fonction.
The ebtrees, the lists attached to quic_conn struct must be initialized.
The tasks must be reset to their NULL default values to be safely destroyed
by task_destroy(). This is all the case for all the TLS cipher contexts
of the encryption levels (struct quic_enc_level) and those for the keyupdate.
The packet number spaces (struct quic_pktns) must also be initialized.
->prx_counters pointer must be initialized to prevent quic_conn_prx_cntrs_update()
from dereferencing this pointer.
->latest_rtt member of quic_loss struct must also be initialized. This is done
by quic_loss_init() called by quic_path_init().
This may happen when the initilization of a new QUIC conn fails with qc_new_conn()
when receiving an Initial paquet. This is done after having allocated a CID with
new_quic_cid() called by quic_rx_pkt_retrieve_conn() which stays in the listener
connections tree without a QUIC connection attached to. Then when the listener
receives another Initial packet for the same CID, quic_rx_pkt_retrieve_conn()
returns NULL again (no QUIC connection) but with an thread ID already bound to the
connection, leading the datagram to be requeued in the same datagram handler thread
queue. And so on.
To fix this, the connection is created after having created the connection ID.
If this fails, the connection is deallocated.
During the race condition, when two different threads handle two datagrams for
the same connection, in addition to releasing the newer created connection ID,
the newer QUIC connection must also be released.
Must be backported as far as 2.7.
quic_conn_prx_cntrs_update() may be called from quic_conn_release() with
NULL as value for ->prx_counters member. This is the case when qc_new_conn() fails
when allocating <buf_area>. In this case quic_conn_prx_cntrs_update() BUG_ON().
Must be backported as far as 2.7.
On soft-stop, netns_sig_stop() function is called to purge the shared
namespace tree by iterating over each entries to free them.
However, once an entry is cleaned up and removed from the tree, the entry
itself isn't freed and this results into a minor leak when soft-stopping
because entry was allocated using calloc() in netns_store_insert() when
parsing the configuration.
This could be backported in every stable versions.
When support for 'namespace' keyword was added for the 'default-server'
directive in 22f41a2 ("MINOR: server: Make 'default-server' support
'namespace' keyword."), we forgot to copy the attribute from the parent
to the newly created server.
This resulted in the 'namespace' keyword being parsed without errors when
used from a 'default-server' directive, but in practise the option was
simply ignored.
There's no need to duplicate the netns struct because it is stored in
a shared list, so copying the pointer does the job.
This patch partially fixes GH #2038 and should be backported to all
stable versions.
The local address was dumped as "from" address by dump_quic_full() and
the peer address as "to" address. This patch fixes this issue.
Furthermore, to support the server side (QUIC client) to come, it is preferable
to stop using "from" and "to" labels to dump the local and peer addresses which
is confusing for a QUIC client which uses its local address as "from" address.
To mimic netstat, this is "Local Address" and "Foreign Address" which will
be displayed by "show quic" CLI command and "local_addr" and "foreign_addr"
for "show quic full" command to mention the local addresses and the peer
addresses.
Must be backported as far as 2.7.
This bug arrived with this commit which was supposed to fix another one:
BUG/MINOR: quic: Wrong Application encryption level selection when probing
The aim of this patch was to prevent the Application encryption to be selected
when probing leading to ACK only packets to be sent if the ack delay timer
had fired in the meantime, leading to crashes when no 01-RTT had been sent
because the ack range tree is empty in this case.
This statement is not correct (qc->pktns->flags & QUIC_FL_PKTNS_PROBE_NEEDED)
because qc->pktns is an array of packet number space. But it is equivalent
to (qc->pktns[QUIC_TLS_PKTNS_INITIAL].flags & QUIC_FL_PKTNS_PROBE_NEEDED).
That said, the patch mentionned above is not more useful since this following
which disable the ack time during the handshakes:
BUG/MINOR: quic: Do not use ack delay during the handshakes
This commit revert the first patch mentionned above.
Must be backported as far as 2.6.
It was reported in issue #2181, strange behavior during the new SSL
hanshake failure logs.
Errors were logged with the code 0, which is unknown to OpenSSL.
This patch mades 2 changes:
- It stops using ERR_error_string() when the SSL error code is 0
- It uses ERR_error_string_n() to be thread-safe
Must be backported to 2.8.
When an HTTP applet tries to get request data, we must take care to properly
detect the end of the message. It an empty HTX message with the SC_FL_EOI
flag set on the front SC. However, an issue was introduced during the SC
refactoring performed in the 2.8. The backend SC is tested instead of the
frontend one.
Because of this bug, the receive functions hang because the test on
SC_FL_EOI flag never succeeds. Of course, by checking the frontend SC (the
opposite SC to the one attached to the appctx), it works.
This patch should fix the issue #2180. It must be backported to the 2.8.
proxy default-server is a specific type of server that is not allocated
using new_server(): it is directly stored within the parent proxy
structure. However, since it may contain some default config options that
may be inherited by regular servers, it is also subject to dynamic members
(strings, structures..) that needs to be deallocated when the parent proxy
is cleaned up.
Unfortunately, srv_drop() may not be used directly from p->defsrv since
this function is meant to be used on regular servers only (those created
using new_server()).
To circumvent this, we're splitting srv_drop() to make a new function
called srv_free_params() that takes care of the member cleaning which
originally takes place in srv_drop(). This function is exposed through
server.h, so it may be called from outside server.c.
Thanks to this, calling srv_free_params(&p->defsrv) from free_proxy()
prevents any memory leaks due to dynamic parameters allocated when
parsing a default-server line from a proxy section.
This partially fixes GH #2173 and may be backported to 2.8.
[While it could also be relevant for other stable versions, the patch
won't apply due to architectural changes / name changes between 2.4 => 2.6
and then 2.6 => 2.8. Considering this is a minor fix that only makes
memory analyzers happy during deinit paths (at least for <= 2.8), it might
not be worth the trouble to backport them any further?]
bind->settings.interface hint is allocated when "interface" keyword
is specified on a bind line, but the string isn't explicitly freed in
proxy_free, resulting in minor memory leak on deinit paths when the
keyword is being used.
It partially fixes GH #2173 and may be backported to all stable versions.
[in 2.2 free_proxy did not exist so the patch must be applied directly
in deinit() function from haproxy.c]
When interface keyword is used multiple times within the same bind line,
the previous value isn't checked and is rewritten as-is, resulting in a
small memory leak.
Ensuring the interface name is first freed before assigning it to a new
value.
This may be backported to every stable versions.
[Note for 2.2, the fix must be performed in bind_parse_interface() from
proto_tcp.c, directly within the listener's loop, also ha_free() was
not available so free() must be used instead]
There are several misuses in peers sections that are not detected during the
configuration parsing and that could lead to undefined behaviors or crashes.
First, only one listener is expected for a peers section. If several bind
lines or local peer definitions are used, an error is triggered. However, if
multiple addresses are set on the same bind line, there is no error while
only the last listener is properly configured. On the 2.8, there is no crash
but side effects are hardly predictable. On older version, HAProxy crashes
if an unconfigured listener is used.
Then, there is no check on remote peers name. It is unexpected to have same
name for several remote peers. There is now a test, performed during the
post-parsing, to verify all remote peer names are unique.
Finally, server parsing options for the peers sections are changed to be
sure a port is always defined, and not a port range or a port offset.
This patch fixes the issue #2066. It could be backported to all stable
versions.
When a SPOE appctx is processing frames in sync mode, we must only skip
sending a new frame if it is still waiting for a ACK frame after a receive
attempt. It was performed before the receive attempt. As a consequence, if
the ACK frame was received, the SPOE appctx did not try to process queued
messages immediately. This could increase the queue time and thus slow down
the processing time of the stream.
Thanks to Daniel Epperson for his help to diagnose the bug.
This patch must be backported to every stable versions.
Historically the client-fin and server-fin timeouts were made to allow
a connection closure to be effective quickly if the last data were sent
down a socket and the client didn't close, something that can happen
when the peer's FIN is lost and retransmits are blocked by a firewall
for example. This made complete sense in 1.5 for TCP and HTTP in close
mode. But nowadays with muxes, it's not done at the right layer anymore
and even the description doesn't match what is being done, because what
happens is that the stream will abort the whole transfer after it's done
sending to the mux and this timeout expires.
We've seen in GH issue 2095 that this can happen with very short timeout
values, and while this didn't trigger often before, now that the muxes
(h2 & quic) properly report an end of stream before even the first
sc_conn_sync_recv(), it seems that it can happen more often, and have
two undesirable effects:
- logging a timeout when that's not the case
- aborting the request channel, hence the server-side conn, possibly
before it had a chance to be put back to the idle list, causing
this connection to be closed and not reusable.
Unfortunately for TCP (mux_pt) this remains necessary because the mux
doesn't have a timeout task. So here we're adding tests to only do
this through an HTX mux. But to be really clean we should in fact
completely drop all of this and implement these timeouts in the mux
itself.
This needs to be backported to 2.8 where the issue was discovered,
and maybe carefully to older versions, though that is not sure at
all. In any case, using a higher timeout or removing client-fin in
HTTP proxies is sufficient to make the issue disappear.
Lua's `get_stats` function stopped working in
4cfb0019e6, due to the addition a new field
ST_F_PROTO without a corresponding entry in `stat_fields`.
Fix the issue by adding the entry, like
a46b142e88 did previously for a different field.
This patch fixes GitHub Issue #2174, it should be backported to 2.8.
Add the total number of sent packets for each QUIC connection dumped by
"show quic". Also add the remaining counter values only if not null.
Must be backported to 2.7.
There's a rare case where on long fat pipes, we can see the keep-alive
timeout trigger before the end of the transfer of the last large object,
and the connection closed a bit quickly after the end of the transfer
because a GOAWAY is queued. The data are not destroyed, except that
the WINDOW_UPDATES from the client arriving late while the last data
are being drained by the socket buffers may at some point trigger a
reset, and some clients might choke a bit too early on these. Let's
make sure we only arm the idle_start timestamp once the output buffer
is empty. Of course it will still not cover for the data pending in the
socket buffers but it will at least let those in the buffer leave in
peace. More elaborate options can be used to protect the data in the
kernel buffers, such as the one described in GH issue #5.
It's very likely that this old issue was emphasized by the following
commit in 2.6:
15a4733d5 ("BUG/MEDIUM: mux-h2: make use of http-request and keep-alive timeouts")
and the behavior probably changed again with this one in 2.8, which
was backported to 2.7 and scheduled for 2.6:
d38d8c6cc ("BUG/MEDIUM: mux-h2: make sure control frames do not refresh the idle timeout")
As such this patch should be backported to 2.6 after some observation
period.
This patch is similar to the previous one but for QUIC mux functions
used inside the mux code itself or application layer. Replace all
occurences of qc_* prefix by qcc_* or qcs_*. This should help to better
differentiate code between quic_conn and MUX.
This should be backported up to 2.7.
Rename all QUIC mux function exposed through mux_ops structure. Use the
prefix qmux_* or qmux_strm_*. The objective is to remove qc_* prefix
which should only be used in quic_conn layer.
This should be backported up to 2.7.
Aurlien found a tiny race in thread_isolate() that can allow a thread
that was running under isolation to continue running while another one
enters isolation. The reason is that the check for harmless is only
done before winning the CAS, but since the previously isolated thread
doesn't wait for !rdv_request in thread_release(), it can effectively
continue its activities while the next one believes it's isolated. A
proper solution consists in looping once again in thread_isolate() to
recheck (and wait) for all threads to be isolated once the CAS is won.
The issue was introduced in 2.7 by commit 598cf3f22 ("MAJOR: threads:
change thread_isolate to support inter-group synchronization") so the
fix needs to be backported there.
Recently stconn flags were reviewed for QUIC mux to be conform with
other HTTP muxes. However, a mistake was made when dealing with a proper
stream FIN with both EOI and EOS set. This was done as RESET_STREAM
received after a FIN are ignored by QUIC mux and thus there is no
difference between EOI or EOI+EOS. However, analyzers may interpret EOS
as an interrupted request which result in a 400 HTTP error code.
To fix this, only set EOI on proper stream FIN. EOS is set when input is
interrupted (RESET_STREAM before FIN) or a STOP_SENDING is received
which prevent transfer to complete. In this last case, EOS must be
manually set too if FIN has been received before STOP_SENDING to go
directly from ERR_PENDING to final ERROR state.
This must be backported up to 2.7.
There was a misnaming in stats counter for *_BLOCKED frames in regard to
QUIC rfc convention. This patch fixes it to prevent future ambiguity :
- STREAMS_BLOCKED -> STREAM_DATA_BLOCKED
- STREAMS_DATA_BLOCKED_BIDI -> STREAMS_BLOCKED_BIDI
- STREAMS_DATA_BLOCKED_UNI -> STREAMS_BLOCKED_UNI
This should be backported up to 2.7.
Remove nb_streams field from qcc. It was not used outside of a BUG_ON()
statement to ensure we never have a negative count of streams. However
this is already checked with other fields.
This should be backported up to 2.7.
haproxy does not compile anymore on macOS+clang since 425d7ad ("MINOR:
init: pre-allocate kernel data structures on init"). This is due to
rlim_cur being printed uncasted using %lu format specifier, with rlim_cur
being stored as a rlim_t which is a typedef so its size may vary depending
on the system's architecture.
This is not the first time we need to dump rlim_cur in case of errors,
there are already multiple occurences in the init code. Everywhere this
happens, rlim is casted as a regular int and printed using the '%d'
format specifier, so we do the same here as well to fix the build issue.
No backport needed unless 425d7ad gets backported.
preload_libgcc_s() use pthread_create to create a thread and then call
pthread_join to use it, but it doesn't check if the option is successful.
So add a check to aviod potential crash.
in __ssl_sock_init, BIO_meth_new may failed and return NULL if
OPENSSL_zalloc failed. in this case, ha_meth will be NULL, and then
crash happens in BIO_meth_set_write. So, we add a check for ha_meth.
The Linux kernel maintains data structures to track a processes' open file
descriptors, and it expands these structures as necessary when FD usage grows
(at every FD=2^X starting at 64). However when threading is in use, during
expansion the kernel will pause (observed up to 47ms) while it waits for thread
synchronization (see https://bugzilla.kernel.org/show_bug.cgi?id=217366).
This change addresses the issue and avoids the random pauses by opening the
maximum file descriptor during initialization, so that expansion will not occur
while processing traffic.
When a message is compressed, A "Vary" header is added with
"accept-encoding" value. However, a new header is always added, regardless
there is already a Vary header or not. In addition, if there is already a
Vary header, there is no check on values to be sure "accept-encoding" value
is not already there. So it is possible to have it twice.
To improve this part, we now test Vary header values and "accept-encoding"
is only added if it was not found. In addition, "accept-encoding" value is
appended to the last Vary header found, if any. Otherwise, a new header is
added.
Fixing hlua_lua2smp() usage in hlua's code since it was assumed that
hlua_lua2smp() makes a standalone smp out of lua data, but it is not
the case.
This is especially true when dealing with lua strings (string is
extracted using lua_tolstring() which returns a pointer to lua string
memory location that may be reclaimed by lua at any time when no longer
used from lua's point of view). Thus, smp generated by hlua_lua2smp() may
only be used from the lua context where the call was initially made, else
it should be explicitly duplicated before exporting it out of lua's
context to ensure safe (standalone) usage.
This should be backported to all stable versions.
Add ->sent_pkt counter to quic_conn struct to count the packet at QUIC connection
level. Then, when the connection is released, the ->sent_pkt counter value
is added to the one for the listener.
Must be backported to 2.7.
Add some statistical counters to quic_conn struct from quic_counters struct which
are used at listener level to handle them at QUIC connection level. This avoid
calling atomic functions. Furthermore this will be useful soon when a counter will
be added for the total number of packets which have been sent which will be very
often incremented.
Some counters were not added, espcially those which count the number of QUIC errors
by QUIC error types. Indeed such counters would be incremented most of the time
only one time at QUIC connection level.
Implement quic_conn_prx_cntrs_update() which accumulates the QUIC connection level
statistical counters to the listener level statistical counters.
Must be backported to 2.7.
There is no reason to test <qc> nullity at the end of this function because it is
clearly not null, furthermore the trace handle the case where <qc> is null.
Must be backported to 2.7.
Align the "show quic" help information with all the others command help information.
Furthermore, makes this information match the management documentation.
Must be backported to 2.7.
quic_retry_token_check() must decipher the token sent to and received back from
clients. This token is made of the token format byte, the ODCID prefixed by its one byte
length, the timestamp of its creation, and terminated by an AEAD TAG followed
by the salt used to derive the secret to cipher the token.
So, the length of these data must be between
2 + QUIC_ODCID_MINLEN + sizeof(uint32_t) + QUIC_TLS_TAG_LEN + QUIC_RETRY_TOKEN_SALTLEN
and
2 + QUIC_CID_MAXLEN + sizeof(uint32_t) + QUIC_TLS_TAG_LEN + QUIC_RETRY_TOKEN_SALTLEN.
Must be backported to 2.7 and 2.6.
This bug would never occur because the buffer supplied to quic_generate_retry_token()
to build a Retry token is large enough to embed such a token. Anyway, this patch
fixes quic_generate_retry_token() implementation.
There were two errors: this is the ODCID which is added to the token. Furthermore
the timestamp was not taken into an account.
Must be backported to 2.6 and 2.7.
Add source and destination addresses to QUIC_EV_CONN_RCV trace event. This is
used by datagram/socket level functions (quic_sock.c).
Must be backported to 2.7.
We must evaluate if EOS/EOI/ERR_PENDING/ERROR flags must be set on the SE
when the frontend SC is created because the rxbuf is transferred to the
steeam at this stage. It means the call to h2_rcv_buf() may be skipped on
some circumstances.
And indeed, it happens when HAproxy quickly replies, for instance because of
a deny rule. In this case, depending on the scheduling, the abort may block
the receive attempt from the SC. In this case if SE flags were not properly
set earlier, there is no way to terminate the request and the session may be
freezed.
For now, I can't explain why there is no timeout when this happens but it
remains an issue because here we should not rely on timeouts to close the
stream.
This patch relies on following commits:
* MINOR: mux-h2: Add a function to propagate termination flags from h2s to SE
* MINOR: mux-h2: Set H2_SF_ES_RCVD flag when decoding the HEADERS frame
The issue was encountered on the 2.8 but it seems the bug exists since the
2.4. But it is probably a good idea to only backport the series to 2.7 only
and wait for a bug report on earlier versions.
This patch should solve the issue #2147.
The function h2s_propagate_term_flags() was added to check the H2S state and
evaluate when EOI/EOS/ERR_PENDING/ERROR flags must be set on the SE. It is
not the only place where those flags are set. But it centralizes the synchro
between the H2 stream and the SC.
For now, this function is only used at the end of h2_rcv_buf(). But it will
be used to fix a bug.
The flag H2_SF_ES_RCVD is set on the H2 stream when the ES flag is found in
a frame. On HEADERS frame, it was set in function processing the frame. It
is moved in the function decoding the frame. Fundamentally, this changes
nothing. But it will be useful to have this information earlier when a
client H2 stream is created.
In h2c_frt_stream_new(), H2_SF_BODY_TUNNEL flags was tested on demux frame
flags (h2c->dff) instead of the h2s flags. By chance, it is a noop test
becasue H2_SF_BODY_TUNNEL value, once converted to an int8_t, is 0.
It is a 2.8-specific issue. No backport needed.
A RESET_STREAM is emitted in several occasions :
- protocol error during HTTP/3.0 parsing
- STOP_SENDING reception
In both cases, if a stream-endpoint is attached we must set its ERR
flag. This was correctly done but after some delay as it was only when
the RESET_STREAM was emitted. Change this to set the ERR flag as soon as
one of the upper cases has been encountered. This should help to release
faster streams in error.
This should be backported up to 2.7.
A recent review was done to rationalize ERR/EOS/EOI flags on stream
endpoint. A common definition for both H1/H2/QUIC mux have been written
in the following documentation :
./doc/internals/stconn-close.txt
In QUIC it is possible to close each channels of a stream independently
with RESET_STREAM and STOP_SENDING frames. When a RESET_STREAM is
received, it indicates that the peer has ended its transmission in an
abnormal way. However, it is still ready to receive.
Previously, on RESET_STREAM reception, QUIC MUX set the ERR flag on
stream-endpoint. However, according to the QUIC mechanism, it should be
instead EOS but this was impossible due to a BUG_ON() which prevents EOS
without EOI or ERR. This BUG_ON was only present because this case was
never used before the introduction of QUIC. It was removed in a recent
commit which allows us to now properly set EOS alone on RESET_STREAM
reception.
In practice, this change allows to continue to send data even after
RESET_STREAM reception. However, currently browsers always emit it with
a STOP_SENDING as this is used to abort the whole H3 streams. In the end
this will result in a stream-endpoint with EOS and ERR_PENDING/ERR
flags.
This should be backported up to 2.7.
A recent review was done to rationalize ERR/EOS/EOI flags on stream
endpoint. A common definition for both H1/H2/QUIC mux have been written
in the following documentation :
./doc/internals/stconn-close.txt
Always set EOS with EOI flag to conform to this specification. EOI is
set whenever the proper stream end has been encountered : with QUIC it
corresponds to a STREAM frame with FIN bit. At this step, RESET_STREAM
frames are ignored by QUIC MUX as allowed by RFC 9000. This means we can
always set EOS at the same time with EOI.
This should be backported up to 2.7.
During the refactoring on SC/SE flags, it was stated that SE_FL_EOS flag
should not be set without on of SE_FL_EOI or SE_FL_ERROR flags. In fact, it
is a problem for the QUIC/H3 multiplexer. When a RST_STREAM frame is
received, it means no more data will be received from the peer. And this
happens before the end of the message (RST_STREAM frame received after the
end of the message are ignored). At this stage, it is a problem to report an
error because from the QUIC point of view, it is valid. Data may still be
sent to the peer. If an error is reported, this will stop the data sending
too.
In the same idea, the H1 mulitplexer reports an error when the message is
truncated because of a read0. But only an EOS flag should be reported in
this case, not an error. Fundamentally, it is important to distinguish
errors from shuts for reads because some cases are valid. For instance a H1
client can choose to stop uploading data if it received the server response.
So, relax tests on SE flags by removing BUG_ON_HOT() on SE_FL_EOS flag. For
now, the abort will be handled in the HTTP analyzers.
Output of 'show quic' CLI in oneline mode was not correctly done. This
was caused both due to differing qc pointer size and ports length. Force
proper alignment by using maximum sizes as expected and complete with
blanks if needed.
This should be backported up to 2.7.
qc_prep_app_pkts() is responsible to built several new packets for
sending. It can fail due to memory allocation error. Before this patch,
the Tx buffer was released on error even if some packets were properly
generated.
With this patch, if an error happens on qc_prep_app_pkts(), we still try
to send already built packets if Tx buffer is not empty. The sending
loop is then interrupted and the Tx buffer is released with data
cleared.
This should be backported up to 2.7.
It is expected that quic_packet_encrypt() and
quic_apply_header_protection() never fails as encryption is done in
place. This allows to remove their return value.
This is useful to simplify error handling on sending path. An error can
only be encountered on the first steps when allocating a new packet or
copying its frame content. After a clear packet is successfully built,
no error is expected on encryption.
However, it's still unclear if our assumption that in-place encryption
function never fail. As such, a WARN_ON() statement is used if an error
is detected at this stage. Currently, it's impossible to properly manage
this without data loss as this will leave partially unencrypted data in
the send buffer. If warning are reported a solution will have to be
implemented.
This should be backported up to 2.7.
quic_aead_iv_build() should never fail unless we call it with buffers of
different size. This never happens in the code as every input buffers
are of size QUIC_TLS_IV_LEN.
Remove the return value and add a BUG_ON() to prevent future misusage.
This is especially useful to remove one error handling on the sending
patch via quic_packet_encrypt().
This should be backported up to 2.7.
Complete each useful BUG_ON statements with a comment to explain its
purpose. Also convert BUG_ON_HOT to BUG_ON as they should not have a
big impact.
This should be backported up to 2.7.
Task pointer check in debug_parse_cli_task() computes the theoric end
address of provided task pointer to check if it is valid or not thanks to
may_access() helper function.
However, relative ending address is calculated by adding task size to 't'
pointer (which is a struct task pointer), thus it will result to incorrect
address since the compiler automatically translates 't + x' to
't + x * sizeof(*t)' internally (with sizeof(*t) != 1 here).
Solving the issue by using 'ptr' (which is the void * raw address) as
starting address to prevent automatic address scaling.
This was revealed by coverity, see GH #2157.
No backport is needed, unless 9867987 ("DEBUG: cli: add "debug dev task"
to show/wake/expire/kill tasks and tasklets") gets backported.
When hlua_event_runner() pauses the subscription (ie: if the consumer
can't keep up the pace), hlua_traceback() is used to get the current
lua trace (running context) to provide some info to the user.
However, as hlua_traceback() may raise an error (__LJMP) is set, it is
used within a SET_SAFE_LJMP() / RESET_SAFE_LJMP() combination to ensure
lua errors are properly handled and don't result in unexpected behavior.
But the current usage of SET_SAFE_LJMP() within the function is wrong
since hlua_traceback() will run a second time (unprotected) if the
first (protected) attempt fails. This is undefined behavior and could
even lead to crashes.
Hopefully it is very hard to trigger this code path, thus we can consider
this as a minor bug.
Also using this as an opportunity to enhance the message report to make
it more meaningful to the user.
This should fix GH #2159.
It is a 2.8 specific bug, no backport needed unless c84899c636
("MEDIUM: hlua/event_hdl: initial support for event handlers") gets
backported.
When the process is stopping, the server resolutions are suspended. However
the task is still periodically woken up for nothing. If there is a huge
number of resolution, it may lead to a noticeable CPU consumption for no
reason.
To avoid this extra CPU cost, we stop to schedule the the resolution tasks
during the stopping stage. Of course, it is only true for server
resolutinos. Dynamic ones, via do-resolve actions, are not concerned. These
ones must still be triggered during stopping stage.
Concretly, during the stopping stage, the resolvers task is no longer
scheduled if there is no running resolutions. In this case, if a do-resolve
action is evaluated, the task is woken up.
This patch should partially solve the issue #2145.
When the process is stopping, the health-checks are suspended. However the
task is still periodically woken up for nothing. If there is a huge number
of health-checks and if they are woken up in same time, it may lead to a
noticeable CPU consumption for no reason.
To avoid this extra CPU cost, we stop to schedule the health-check tasks
when the proxy is disabled or stopped.
This patch should partially solve the issue #2145.
When the fcgi configuration is checked and fcgi rules are created, a useless
assignment to NULL is reported by Covertiy. Let's remove it.
This patch should fix the coverity report #2161.
This is a better and more general solution to the problem described in
this commit:
BUG/MINOR: checks: postpone the startup of health checks by the boot time
Now we're updating the now_offset that is used to compute now_ms at the
few points where we update the ready date during boot. This ensures that
now_ms while being stable during all the boot process will be correct
and will start with the boot value right after the boot is finished. As
such the patch above is rolled back (we don't want to count the boot
time twice).
This must not be backported because it relies on the more flexible clock
architecture in 2.8.
Right now there's no way to enforce a specific value of now_ms upon
startup in order to compensate for the time it takes to load a config,
specifically when dealing with the health check startup. For this we'd
need to force the now_offset value to compensate for the last known
value of the current date. This patch exposes a function to do exactly
this.
When health checks are started at boot, now_ms could be off by the boot
time. In general it's not even noticeable, but with very large configs
taking up to one or even a few seconds to start, this can result in a
part of the servers' checks being scheduled slightly in the past. As
such all of them will start groupped, partially defeating the purpose of
the spread-checks setting. For example, this can cause a burst of
connections for the network, or an excess of CPU usage during SSL
handshakes, possibly even causing some timeouts to expire early.
Here in order to compensate for this, we simply add the known boot time
to the computed delay when scheduling the startup of checks. That's very
simple and particularly efficient. For example, a config with 5k servers
in 800 backends checked every 5 seconds, that was taking 3.8 seconds to
start used to show this distribution of health checks previously despite
the spread-checks 50:
3690 08:59:25
417 08:59:26
213 08:59:27
71 08:59:28
428 08:59:29
860 08:59:30
918 08:59:31
938 08:59:32
1124 08:59:33
904 08:59:34
647 08:59:35
890 08:59:36
973 08:59:37
856 08:59:38
893 08:59:39
154 08:59:40
Now with the fix it shows this:
470 08:59:59
929 09:00:00
896 09:00:01
937 09:00:02
854 09:00:03
827 09:00:04
906 09:00:05
863 09:00:06
913 09:00:07
873 09:00:08
162 09:00:09
This should be backported to all supported versions. It depends on
this commit:
MINOR: clock: measure the total boot time
For 2.8 where the internal clock is now totally independent on the human
one, an more generic fix will consist in simply updating now_ms to reflect
the startup time.
Just like we have the uptime in "show info", let's add the boot time.
It's trivial to collect as it's just the difference between the ready
date and the start date, and will allow users to monitor this element
in order to take action before it starts becoming problematic. Here
the boot time is reported in milliseconds, so this allows to even
observe sub-second anomalies in startup delays.
Some huge configs take a significant amount of time to start and this
can cause some trouble (e.g. health checks getting delayed and grouped,
process not responding to the CLI etc). For example, some configs might
start fast in certain environments and slowly in other ones just due to
the use of a wrong DNS server that delays all libc's resolutions. Let's
first start by measuring it by keeping a copy of the most recently known
ready date, once before calling check_config_validity() and then refine
it when leaving this function. A last call is finally performed just
before deciding to split between master and worker processes, and it covers
the whole boot. It's trivial to collect and even allows to get rid of a
call to clock_update_date() in function check_config_validity() that was
used in hope to better schedule future events.
When integrating the number of warnings in "show info" in 2.8 with commit
3c4a297d2 ("MINOR: stats: report the total number of warnings issued"),
the update of the trash buffer used by the Tainted flag got displaced
lower. There's no harm for now util someone adds a new metric requiring
a call to chunk_newstr() and gets both values merged. Let's move the
call to its location now.
In process_chk_conn(), some assignments to NULL are useless and are reported
by Coverity as unused value. while it is harmless, these assignments can be
removed.
This patch should fix the coverity report #2158.
When server is transitionning from UP to DOWN, a log message is generated.
e.g.: "Server backend_name/server_name is DOWN")
However since f71e064 ("MEDIUM: server: split srv_update_status() in two
functions"), the allocated buffer tmptrash which is used to prepare the
log message is not freed after it has been used, resulting in a small
memory leak each time a server goes DOWN because of an operational
change.
This is a 2.8 specific bug, no backport needed unless the above commit
gets backported.
Within srv_update_status subfunctions _op() and _adm(), each time tmptrash
is freed, we assign it to NULL to ensure it will not be reused.
However, within those functions it is not very useful given that tmptrash
is never checked against NULL except upon allocation through
alloc_trash_chunk(), which happens everytime a new log message is
generated, sent, and then freed right away, so there are no code paths
that could lead to tmptrash being checked for reuse (tmptrash is
systematically overwritten since all log messages are independant from
each other).
This was raised by coverity, see GH #2162.
A regression was introduced with the commit cb59e0bc3 ("BUG/MINOR:
tcp-rules: Stop content rules eval on read error and end-of-input"). We
should not shorten the inspect-delay when the EOI flag is set on the SC.
Idea of the inspect-delay is to wait a TCP rule is matching. It is only
interrupted if an error occurs, on abort or if the peer shuts down. It is
also interrupted if the buffer is full. This last case is a bit ambiguous
and discutable. It could be good to add ACLS, like "wait_complete" and
"wait_full" to do so. But for now, we only remove the test on SC_FL_EOI
flag.
This patch must be backported to all stable versions.
This makes use of spread-checks also for the startup of the check tasks.
This provides a smoother load on startup for uneven configurations which
tend to enable only *some* servers. Below is the connection distribution
per second of the SSL checks of a config with 5k servers spread over 800
backends, with a check inter of 5 seconds:
- default:
682 08:00:50
826 08:00:51
773 08:00:52
1016 08:00:53
885 08:00:54
889 08:00:55
825 08:00:56
773 08:00:57
1016 08:00:58
884 08:00:59
888 08:01:00
491 08:01:01
- with spread-checks 50:
437 08:01:19
866 08:01:20
777 08:01:21
1023 08:01:22
1118 08:01:23
923 08:01:24
641 08:01:25
859 08:01:26
962 08:01:27
860 08:01:28
929 08:01:29
909 08:01:30
866 08:01:31
849 08:01:32
114 08:01:33
- with spread-checks 50 + this patch:
680 08:01:55
922 08:01:56
962 08:01:57
899 08:01:58
819 08:01:59
843 08:02:00
916 08:02:01
896 08:02:02
886 08:02:03
846 08:02:04
903 08:02:05
894 08:02:06
178 08:02:07
The load is much smoother from the start, this can help initial health
checks succeed when many target the same overloaded server for example.
This could be backported as it should make border-line configs more
reliable across reloads.
When a full message is received for a stream, MUX is responsible to set
EOI flag. This was done through rcv_buf stream callback by checking if
QCS HTX buffer contained the EOM flag.
This is not correct for HTTP without body. In this case, QCS HTX buffer
is never used. Only a local HTX buffer is used to transfer headers just
as stream endpoint is created. As such, EOI is never transmitted to the
upper layer.
If the transfer occur without any issue, this does not seem to cause any
problem. However, in case the transfer is aborted, the stream is never
released which cause a memory leak and prevent the process soft-stop.
To fix this, also check if EOM is put by application layer during
headers conversion. If true, this is transferred through a new argument
to qc_attach_sc() MUX function which is responsible to set the EOI flag.
This issue was reproduced using h2load with hundred of connections.
h2load is interrupted with a SIGINT which causes streams to never be
closed on haproxy side.
This should be backported up to 2.6.
Uninline and move qc_attach_sc() function to implementation source file.
This will be useful for next commit to add traces in it.
This should be backported up to 2.7.
MUX is responsible to put EOS on stream when read channel is closed.
This happens if underlying connection is closed or a RESET_STREAM is
received. FIN STREAM is ignored in this case.
For connection closure, simply check for CO_FL_SOCK_RD_SH.
For RESET_STREAM reception, a new flag QC_CF_RECV_RESET has been
introduced. It is set when RESET_STREAM is received, unless we already
received all data. This is conform to QUIC RFC which allows to ignore a
RESET_STREAM in this case. During RESET_STREAM processing, input buffer
is emptied so EOS can be reported right away on recv_buf operation.
This should be backported up to 2.7.
Add traces to render each stream transition more explicit. Also, move
ERR_PENDING to ERROR transition after other stream flags are set, as
with the MUX H2 implementation. This is purely a cosmetic change and it
should have no functional impact.
This should be backported up to 2.7.
The following patch introduced proper error management on buffer
allocation failure :
0abde9dee6
BUG/MINOR: mux-quic: properly handle buf alloc failure
However, when decoding an empty STREAM frame with just FIN bit set, this
was not done correctly. Indeed, there is a missing goto statement in
case of a NULL buffer check.
This was reported thanks to coverity analysis. This should fix github
issue #2163.
This must be backported up to 2.6.
Since the following patch
commit 6c501ed23b
BUG/MINOR: mux-quic: differentiate failure on qc_stream_desc alloc
it is not possible to check if Tx buf allocation failed due to a
configured limit exhaustion or a simple memory failure.
This patch fixes it as the condition was inverted. Indeed, if buf_avail
is null, this means that the limit has been reached. On the contrary
case, this is a real memory alloc failure. This caused the flag
QC_CF_CONN_FULL to not be properly used and may have caused disruption
on transfer with several streams or large data.
This was detected due to an abnormal error QUIC MUX traces. Also change
in consequence trace for limit exhaustion to be more explicit.
This must be backported up to 2.6.
Fix the openssl build with older openssl version by disabling the new
ssl_c_r_dn fetch.
This also disable the ssl_client_samples.vtc file for OpenSSL version
older than 1.1.1
Christopher found as part of the analysis of Tim's issue #1891 that commit
15a4733d5 ("BUG/MEDIUM: mux-h2: make use of http-request and keep-alive
timeouts") introduced in 2.6 incompletely addressed a timeout issue in the
H2 mux. The problem was that the http-keepalive and http-request timeouts
were not applied before it. With that commit they are now considered, but
if a GOAWAY is sent (or even attempted to be sent), then they are not used
anymore again, because the way the code is arranged consists in applying
the client-fin timeout (if set) to the current date, and falling back to
the client timeout, without considering the idle_start period. This means
that a config having a "timeout http-keepalive" would still not close the
connection quickly when facing a client that periodically sends PING,
PRIORITY or whatever other frame types.
In addition, after the GOAWAY was attempted to be sent, there was no check
for pending data in the output buffer, meaning that it would be possible
to truncate some responses in configs involving a very short client-fin
timeout.
Finally the spreading of the closures during the soft-stop brought in 2.6
by commit b5d968d9b ("MEDIUM: global: Add a "close-spread-time" option to
spread soft-stop on time window") didn't consider the particular case of
an idle "pre-connect" connection, which would also live long if a browser
failed to deliver a valid request for a long time.
All of this indicates that the conditions must be reworked so as not to
have that level of exclusion between conditions, but rather stick to the
rules from the doc that are already enforced on other muxes:
- timeout client always applies if there are data pending, and is
relative to each new I/O ;
- timeout http-request applies before the first complete request and
is relative to the entry in idle state ;
- timeout http-keepalive applies between idle and the next complete
request and is relative to the entry in idle state ;
- timeout client-fin applies when in idle after a shut was sent (here
the shut is the GOAWAY). The shut may only be considered as sent if
the buffer is empty and the flags indicate that it was successfully
sent (or failed) but not if it's still waiting for some room in the
output buffer for example. This implies that this timeout may then
lower the http-keepalive/http-request ones.
This is what this patch implements. Of course the client timeout still
applies as a fallback when all the ones above are not set or when their
conditions are not met.
It would seem reasoanble to backport this to 2.7 first, then only after
one or two releases to 2.6.
This patch addresses #1514, adds the ability to fetch DN of the root
ca that was in the chain when client certificate was verified during SSL
handshake.
The HTTPCLIENT and the OCSP-UPDATE proxies are internal proxies, we
don't need to display logs of them stopping during the stopping of the
process.
This patch checks if a proxy has the flag PR_CAP_INT so it doesn't
display annoying messages.
When the SC is detached from the endpoint, the xref between the endpoints is
removed. At this stage, the sedesc cannot be undefined. So we can remove the
test on it.
This issue should fix the issue #2156. No backport needed.
In the proxy CLI analyzer, when pcli_parse_request returns -1, the
client was shut to prevent any problem with the master CLI.
This behavior is a little bit excessive and not handy at all in prompt
mode. For example one could have activated multiples mode, then have an
error which disconnect the CLI, and they would have to reconnect and
enter all the modes again.
This patch introduces the pcli_error() function, which only output an
error and flush the input buffer, instead of closing everything.
When encountering a parsing error, this function is used, and the prompt
is written again, without any disconnection.
SSL hanshake error were unable to dump the OpenSSL error string by
default, to do so it was mandatory to configure a error-log-format with
the ssl_fc_err fetch.
This patch implements the session_build_err_string() function which creates
the error log to send during session_kill_embryonic(), a special case is
made with CO_ER_SSL_HANDSHAKE which is able to dump the error string
with ERR_error_string().
Before:
<134>May 12 17:14:04 haproxy[183151]: 127.0.0.1:49346 [12/May/2023:17:14:04.571] frt2/1: SSL handshake failure
After:
<134>May 12 17:14:04 haproxy[183151]: 127.0.0.1:49346 [12/May/2023:17:14:04.571] frt2/1: SSL handshake failure (error:0A000418:SSL routines::tlsv1 alert unknown ca)
qc_init() is used to initialize a QUIC MUX instance. On failure, each
resources are released via a series of goto statements. There is one
issue if the app_ops.init callback fails. In this case, MUX task is not
freed.
This can cause a crash as the task is already scheduled. When the
handler will run, it will crash when trying to access qcc instance.
To fix this, properly destroy qcc task on fail_install_app_ops label.
The impact of this bug is minor as app_ops.init callback succeeds most
of the time. However, it may fail on allocation failure due to memory
exhaustion.
This may fix github issue #2154.
This must be backported up to 2.7.
qc_stream_buf_alloc() can fail for two reasons :
* limit of Tx buffer per connection reached
* allocation failure
The first case is properly treated. A flag QC_CF_CONN_FULL is set on the
connection to interrupt emission. It is cleared when a buffer became
available after in order ACK reception and the MUX tasklet is woken up.
The allocation failure was handled with the same mechanism which in this
case is not appropriate and could lead to a connection transfer freeze.
Instead, prefer to close the connection with a QUIC internal error code.
To differentiate the two causes, qc_stream_buf_alloc() API was changed
to return the number of available buffers to the caller.
This must be backported up to 2.6.
The total number of buffer per connection for sending is limited by a
configuration value. To ensure this, <stream_buf_count> quic_conn field
is incremented on qc_stream_buf_alloc().
qc_stream_buf_alloc() may fail if the buffer cannot be allocated. In
this case, <stream_buf_count> should not be incremented. To fix this,
simply move increment operation after buffer allocation.
The impact of this bug is low. However, if a connection suffers from
several buffer allocation failure, it may cause the <stream_buf_count>
to be incremented over the limit without being able to go back down.
This must be backported up to 2.6.
The function qc_get_ncbuf() is used to allocate a ncbuf content.
Allocation failure was handled using a plain BUG_ON.
Fix this by a proper error management. This buffer is only used for
STREAM frame reception to support out-of-order offsets. When an
allocation failed, close the connection with a QUIC internal error code.
This should be backported up to 2.6.
A convenience function qc_get_buf() is implemented to centralize buffer
allocation on MUX and H3 layers. However, allocation failure was not
handled properly with a BUG_ON() statement.
Replace this by proper error management. On emission, streams is
temporarily skip over until the next qc_send() invocation. On reception,
H3 uses this function for HTX conversion; on alloc failure the
connection will be closed with QUIC internal error code.
This must be backported up to 2.6.
Remove QUIC MUX function qcs_http_handle_standalone_fin(). The purpose
of this function was only used when receiving an empty STREAM frame with
FIN bit. Besides, it was called by each application protocol which could
have different approach and render the function purpose unclear.
Invocation of qcs_http_handle_standalone_fin() have been replaced by
explicit code in both H3 and HTTP/0.9 module. In the process, use
htx_set_eom() to reliably put EOM on the HTX message.
This should be backported up to 2.7, along with the previous patch which
introduced htx_set_eom().
Implement a new HTX utility function htx_set_eom(). If the HTX message
is empty, it will first add a dummy EOT block. This is a small trick
needed to ensure readers will detect the HTX buffer as not empty and
retrieve the EOM flag.
Replace the H2 code related by a htx_set_eom() invocation. QUIC also has
the same code which will be replaced in the next commit.
This should be backported up to 2.7 before the related QUIC patch.
It is possible to receive datagram from other connection on a dedicated
quic-conn socket. This is due to a race condition between bind() and
connect() system calls.
To handle this, an explicit check is done on each datagram. If the DCID
is not associated to the connection which owns the socket, the datagram
is redispatch as if it arrived on the listener socket.
This redispatch step was not properly done because the source address
specified for the redispatch function was incorrect. Instead of using
the datagram source address, we used the address of the socket
quic-conn which received the datagram due to the above race condition.
Fix this simply by using the address from the recvmsg() system call.
The impact of this bug is minor as redispatch on connection socket
should be really rare. However, when it happens it can lead to several
kinds of problems, like for example a connection initialized with an
incorrect peer address. It can also break the Retry token check as this
relies on the peer address.
In fact, Retry token check failure was the reason this bug was found.
When using h2load with thousands of clients, the counter of Retry token
failure was unusually high. With this patch, no failure is reported
anymore for Retry.
Must be backported to 2.7.
A check was missing in parse_logsrv() to make sure that malloc-dependent
variable is checked for non-NULL before using it.
If malloc fails, the function raises an error and stops, like it's already
done at a few other places within the function.
This partially fixes GH #2130.
It should be backported to every stable versions.
usermsgs_buf.size is set without first checking if previous malloc
attempt succeeded.
This could fool the buffer API into assuming that the buffer is
initialized, resulting in unsafe read/writes.
Guarding usermsgs_buf.size assignment with the malloc attempt result
to make the buffer initialization safe against malloc failures.
This partially fixes GH #2130.
It should be backported up to 2.6.
Some malloc resulsts were not checked in standalone ncbuf code.
As this is debug/test code, we don't need to explicitly handle memory
errors, we just add some BUG_ON() to ensure that memory is properly
allocated and prevent unexpected results.
This partially fixes issue GH #2130.
No backport needed.
Commit 986798718 ("DEBUG: cli: add "debug dev task" to show/wake/expire/kill
tasks and tasklets") caused a build failure on 32-bit platforms when parsing
the task's pointer. Let's use strtoul() and not strtoll(). No backport is
needed, unless the commit above gets backported.
httpclient.resolvers.disabled allow to disable completely the resolvers
of the httpclient, prevents the creation of the "default" resolvers
section, and does not insert the http do-resolve rule in the proxies.
Now we can detect the listener associated with a QUIC listener and report
a bit more info (e.g. listening port and frontend name), and provide a bit
more info about connections as well, and filter on both front connections
and listeners using the "l" and "f" flags.
This provides more consistency between the master and the worker. When
"prompt timed" is passed on the master, the timed mode is toggled. When
enabled, for a master it will show the master process' uptime, and for
a worker it will show this worker's uptime. Example:
master> prompt timed
[0:00:00:50] master> show proc
#<PID> <type> <reloads> <uptime> <version>
11940 master 1 [failed: 0] 0d00h02m10s 2.8-dev11-474c14-21
# workers
11955 worker 0 0d00h00m59s 2.8-dev11-474c14-21
# old workers
11942 worker 1 0d00h02m10s 2.8-dev11-474c14-21
# programs
[0:00:00:58] master> @!11955
[0:00:01:03] 11955> @!11942
[0:00:02:17] 11942> @
[0:00:01:10] master>
Entering "prompt timed" toggles reporting of the process' uptime in
the prompt, which will report days, hours, minutes and seconds since
it was started. As discussed with Tim in issue #2145, this can be
convenient to roughly estimate the time between two outputs, as well
as detecting that a process failed to be reloaded for example.
There's something very irritating on the CLI, when just pressing ENTER,
it complains "Unknown command: ''..." and dumps all the help. This
action is often done to add a bit of clearance after a dump to visually
find delimitors later, but this stupid error makes it unusable. This
patch addresses this by just returning on empty command instead of trying
to look up a matching keyword. It will result in an empty line to mark
the end of the empty command and a prompt again.
It's probably not worth backporting this given that nobody seems to have
complained about it yet.
Now that we have free_acl_cond(cond) function that does cond prune then
frees cond, replace all occurences of this pattern:
| prune_acl_cond(cond)
| free(cond)
with:
| free_acl_cond(cond)
http_parse_redirect_rule() doesn't perform enough checks around NULL
returning allocating functions.
Moreover, existing error paths don't perform cleanups. This could lead to
memory leaks.
Adding a few checks and a cleanup path to ensure memory errors are
properly handled and that no memory leaks occurs within the function
(already allocated structures are freed on error path).
It should partially fix GH #2130.
This patch depends on ("MINOR: proxy: add http_free_redirect_rule() function")
This could be backported up to 2.4. The patch is also relevant for
2.2 but "MINOR: proxy: add http_free_redirect_rule() function" would
need to be adapted first.
==
Backport notes:
-> For 2.2 only:
Replace:
(strcmp(args[cur_arg], "drop-query") == 0)
with:
(!strcmp(args[cur_arg],"drop-query"))
-> For 2.2 and 2.4:
Replace:
"expects 'code', 'prefix', 'location', 'scheme', 'set-cookie', 'clear-cookie', 'drop-query', 'ignore-empty' or 'append-slash' (was '%s')",
with:
"expects 'code', 'prefix', 'location', 'scheme', 'set-cookie', 'clear-cookie', 'drop-query' or 'append-slash' (was '%s')",
Adding http_free_redirect_rule() function to free a single redirect rule
since it may be required to free rules outside of free_proxy() function.
This patch is required for an upcoming bugfix.
[for 2.2, free_proxy function did not exist (first seen in 2.4), thus
http_free_redirect_rule() needs to be deducted from haproxy.c deinit()
function if the patch is required]
cookie_str from struct redirect, which may be allocated through
http_parse_redirect_rule() function is not properly freed on proxy
cleanup within free_proxy().
This could be backported to all stable versions.
[for 2.2, free_proxy() did not exist so the fix needs to be performed
directly in deinit() function from haproxy.c]
A xref is added between the endpoint descriptors. It is created when the
server endpoint is attached to the SC and it is destroyed when an endpoint
is detached.
This xref is not used for now. But it will be useful to retrieve info about
an endpoint for the opposite side. It is also the warranty there is still a
endpoint attached on the other side.
A mux must never report it is waiting for room in the channel buffer if this
buffer is empty. Because there is nothing the application layer can do to
unblock the situation. Indeed, when this happens, it means the mux is
waiting for data to progress. It typically happens when all headers are not
received.
In the FCGI mux, if some data remain in the RX buffer but the channel buffer
is empty, it does no longer report it is waiting for room.
This patch should fix the issue #2150. It must be backported as far as 2.6.
When end-of-stream is reported by a FCGI stream, we must take care to also
report an error if end-of-input was not reported. Indeed, it is now
mandatory to set SE_FL_EOI or SE_FL_ERROR flags when SE_FL_EOS is set.
It is a 2.8-specific issue. No backport needed.
When "optioon socket-stats" is used in a frontend, its listeners have
their own stats and will appear in the stats page. And when the stats
page has "stats show-legends", then a tooltip appears on each such
socket with ip:port and ID. The problem is that since QUIC arrived, it
was not possible to distinguish the TCP listeners from the QUIC ones
because no protocol indication was mentioned. Now we add a "proto"
legend there with the protocol name, so we can see "tcp4" or "quic6"
and figure how the socket is bound.
Following previous patch, error notification from quic_conn has been
adjusted to rely on standard connection flags. Most notably, CO_FL_ERROR
on the connection instance when a fatal error is detected.
Check for CO_FL_ERROR is implemented by qc_send(). If set the new flag
QC_CF_ERR_CONN will be set for the MUX instance. This flag is similar to
the local error flag and will abort most of the futur processing. To
ensure stream upper layer is also notified, qc_wake_some_streams()
called by qc_process() will put the stream on error if this new flag is
set.
This should be backported up to 2.7.
When an error is detected at quic-conn layer, the upper MUX must be
notified. Previously, this was done relying on quic_conn flag
QUIC_FL_CONN_NOTIFY_CLOSE set and the MUX wake callback called on
connection closure.
Adjust this mechanism to use an approach more similar to other transport
layers in haproxy. On error, connection flags are updated with
CO_FL_ERROR, CO_FL_SOCK_RD_SH and CO_FL_SOCK_WR_SH. The MUX is then
notified when the error happened instead of just before the closing. To
reflect this change, qc_notify_close() has been renamed qc_notify_err().
This function must now be explicitely called every time a new error
condition arises on the quic_conn layer.
To ensure MUX send is disabled on error, qc_send_mux() now checks
CO_FL_SOCK_WR_SH. If set, the function returns an error. This should
prevent the MUX from sending data on closing or draining state.
To complete this patch, MUX layer must now check for CO_FL_ERROR
explicitely. This will be the subject of the following commit.
This should be backported up to 2.7.
Remove the unnecessary err label for qc_send(). Anyway, this label
cannot be used once some frames are sent because there is no cleanup
part for it.
This should be backported up to 2.7.
Factorize code for send subscribing on the lower layer in a dedicated
function qcc_subscribe_send(). This allows to call the lower layer only
if not already subscribed and print a trace in this case. This should
help to understand when subscribing is really performed.
In the future, this function may be extended to avoid subscribing under
new conditions, such as connection already on error.
This should be backported up to 2.7.
Do not built STREAM frames if MUX is already subscribed for sending on
lower layer. Indeed, this means that either socket currently encountered
a transient error or congestion window is full.
This change is an optimization which prevents to allocate and release a
series of STREAM frames for nothing under congestion.
Note that nothing is done for other frames (flow-control, RESET_STREAM
and STOP_SENDING). Indeed, these frames are not restricted by flow
control. However, this means that they will be allocated for nothing if
send is blocked on a transient error.
This should be backported up to 2.7.
Add traces for when an upper layer stream is woken up by the MUX. This
should help to diagnose frozen stream issues.
This should be backported up to 2.7.
When detach is conducted by stream endpoint layer, a stream is either
freed or just flagged as detached if the transfer is not yet finished.
In the latter case, the stream will be finally freed via
qc_purge_streams() which is called periodically.
A subscribe was done on quic-conn layer if a stream cannot be freed via
qc_purge_streams() as this means FIN STREAM has not yet been sent.
However, this is unnecessary as either HTX EOM was not yet received and
we are waiting for the upper layer, or FIN stream is still in the buffer
but was not yet transmitted due to an incomplete transfer, in which case
a subscribe should have already been done.
This should be backported up to 2.7.
MUX uses qc_send_mux() function to send frames list over a QUIC
connection. On network congestion, the lower layer will reject some
frames and it is the MUX responsibility to free them. There is another
category of error which are when the sendto() fails. In this case, the
lower layer will free the packet and its attached frames and the MUX
should not touch them.
This model was violated by MUX layer for RESET_STREAM and STOP_SENDING
emission. In this case, frames were freed every time by the MUX on
error. This causes a double free error which lead to a crash.
Fix this by always ensuring if frames were rejected by the lower layer
before freeing them on the MUX. This is done simply by checking if frame
list is not empty, as RESET_STREAM and STOP_SENDING are sent
individually.
This bug was never reproduced in production. Thus, it is labelled as
MINOR.
This must be backported up to 2.7.
Since recent modification of MUX error processing, shutw operation was
skipped for a connection reported as on error. However, this can caused
the stream layer to not be notified about error. The impact of this bug
is unknown but it may lead to stream never closed.
To fix this, simply skip over send operations when connection is on
error while keep notifying the stream layer.
This should be backported up to 2.7.
As discussed a few times over the years, it's quite difficult to know
how often we stop accepting connections because the global maxconn was
reached. This is not easy to know because when we reach the limit we
stop accepting but we don't know if incoming connections are pending,
so it's not possible to know how many were delayed just because of this.
However, an interesting equivalent metric consist in counting the number
of times an accepted incoming connection resulted in the limit being
reached. I.e. "we've accepted the last one for now". That doesn't imply
any other one got delayed but it's a factual indicator that something
might have been delayed. And by counting the number of such events, it
becomes easier to know whether some limits need to be adjusted because
they're reached often, or if it's exceptionally rare.
The metric is reported as a counter in show info and on the stats page
in the info section right next to "maxconn".
Now in "show info" we have a TotalWarnings field that reports the total
number of warnings issued since the process started. It's also reported
in the the stats page next to the uptime.
qc_treat_ack_of_ack() must remove ranges of acknowlegments from an ebtree which
have been acknowledged. This is done keeping track of the largest acknowledged
packet number which has been acknowledged and sent with an ack-eliciting packet.
But due to the data structure of the acknowledgement ranges used to build an ACK frame,
one must leave at least one range in such an ebtree which must at least contain
a unique one-element range with the largest acknowledged packet number as element.
This issue was revealed by @Tristan971 in GH #2140.
Must be backported in 2.7 and 2.6.
When pushing a lua object through lua Queue class, a new reference is
created from the object so that it can be safely restored when needed.
Likewise, when popping an object from lua Queue class, the object is
restored at the top of the stack via its reference id.
However, once the object is restored the related queue entry is removed,
thus the object reference must be dropped to prevent reference leak.
queue:pop_wait() was broken during late refactor prior to merge.
(Due to small modifications to ensure that pop() returns nil on empty
queue instead of nothing)
Because of this, pop_wait() currently behaves exactly as pop(), resulting
in 100% active CPU when used in a while loop.
Indeed, _hlua_queue_pop() should explicitly return 0 when the queue is
empty since pop_wait logic relies on this and the pushnil should be
handled directly in queue:pop() function instead.
Adding some comments as well to document this.
During the startup stage, if a proxy was disabled in config, all filters
were released and removed. But it may be an issue if some info are shared
between filters of the same type. Resources may be released too early.
It happens with ACLs defined in SPOE configurations. Pattern expressions can
be shared between filters. To fix the issue, filters for disabled proxies
are no longer released during the startup stage but only when HAProxy is
stopped.
This commit depends on the previous one ("MINOR: spoe: Don't stop disabled
proxies"). Both must be backported to all stable versions.
SPOE register a signal handler to be able to stop SPOE applets ASAP during
soft-stop. Disabled proxies must be ignored at this staged because they are
not fully configured.
For now, it is useless but this change is mandatory to fix a bug.
clang 15 reports unused variables in src/mjson.c:
src/mjson.c:196:21: fatal error: expected ';' at end of declaration
int __maybe_unused n = 0;
and
src/mjson.c:727:17: fatal error: variable 'n' set but not used [-Wunused-but-set-variable]
int sign = 1, n = 0;
An issue was created on the project, but it was not fixed for now:
https://github.com/cesanta/mjson/issues/51
So for now, to fix the build issue, these variables are declared as unused.
Of course, if there is any update on this library, be careful to review this
patch first to be sure it is always required.
This patch should fix the issue #1868. It be backported as far as 2.4.
fixes#2031
quoting Willy Tarreau:
"Originally the listeners were intended to work without a bind_conf
(e.g. for FTP processing) hence these tests, but over time the
bind_conf has become omnipresent"
At the end of process_stream(), if there was any change on request/response
analyzers, we now trigger a resync. It is performed if any analyzer is added
but also removed. It should help to catch internal changes on a stream and
eventually avoid it to be frozen.
There is no reason to backport this patch. But it may be good to keep an eye
on it, just in case.
In process_stream(), after request and response analyzers evaluation,
unhandled errors are processed, if any. In this case, depending on the case,
remaining request or response analyzers may be removed, unlesse the last one
about end of filters. However, auto-close is not reenabled in same
time. Thus it is possible to not forward the shutdown for a side to the
other one while no analyzer is there to do so or at least to make evolved
the situation.
In theory, it is thus possible to freeze a stream if no wakeup happens. And
it seems possible because it explain a freeze we've oberseved.
This patch could be backported to every stable versions but only after a
period of observation and if it may match an unexplained bug. It should not
lead to any loop but at worst and eventually to truncated messages.
When commit ead43fe4f ("MEDIUM: compression: Make it so we can compress
requests as well.") added the test for the direction flags to select the
compression, it implicitly broke compression defined in defaults sections
because the flags from the default proxy were not recopied, hence the
compression was enabled but in no direction.
No backport is needed, that's 2.8 only.
->others member of tp_version_information structure pointed to a buffer in the
TLS stack used to parse the transport parameters. There is no garantee that this
buffer is available until the connection is released.
Do not dump the available versions selected by the client anymore, but displayed the
chosen one (selected by the client for this connection) and the negotiated one.
Must be backported to 2.7 and 2.6.
A recent series of commit have been introduced to rework error
generation on QUIC MUX side. Now, all MUX/APP functions uses
qcc_set_error() to set the flag QC_CF_ERRL on error. Then, this flag is
converted to QC_CF_ERRL_DONE with a CONNECTION_CLOSE emission by
qc_send().
This has the advantage of centralizing the CONNECTION_CLOSE generation
in one place and reduces the link between MUX and quic-conn layer.
However, we must now ensure that every qcc_set_error() call is followed
by a QUIC MUX tasklet to invoke qc_send(). This was not the case, thus
when there is no active transfer, no CONNECTION_CLOSE frame is emitted
and the connection remains opened.
To fix this, add a tasklet_wakeup() directly in qcc_set_error(). This is
a brute force solution as this may be unneeded when already in the MUX
tasklet context. However, it is the simplest solution as it is too
tedious for the moment to list all qcc_set_error() invocation outside of
the tasklet.
This must be backported up to 2.7.
A recent series of patch were introduced to streamline error generation
by QUIC MUX. However, a regression was introduced : every error
generated by the MUX was built as CONNECTION_CLOSE_APP frame, whereas it
should be only for H3/QPACK errors.
Fix this by adding an argument <app> in qcc_set_error. When false, a
standard CONNECTION_CLOSE is used as error.
This bug was detected by QUIC tracker with the following tests
"stop_sending" and "server_flow_control" which requires a
CONNECTION_CLOSE frame.
This must be backported up to 2.7.
This was lost with commit f4258bdf3 ("MINOR: stats: Use the applet API to
write data"). When the buffer is almost full, the stats applet gives up.
When this happens, the applet must require more room. Otherwise, data in the
channel buffer are sent to the client but the applet is not woken up in
return.
It is a 2.8-specific bug, no backport needed.
GCC complains about swapping 2 heads list, one local and one global.
gcc -Iinclude -O2 -g -Wall -Wextra -Wundef -Wdeclaration-after-statement -Wfatal-errors -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wno-string-plus-int -Wno-atomic-alignment -Werror -DDEBUG_STRICT -DDEBUG_MEMORY_POOLS -DUSE_EPOLL -DUSE_NETFILTER -DUSE_POLL -DUSE_THREAD -DUSE_BACKTRACE -DUSE_TPROXY -DUSE_LINUX_TPROXY -DUSE_LINUX_SPLICE -DUSE_LIBCRYPT -DUSE_CRYPT_H -DUSE_GETADDRINFO -DUSE_OPENSSL -DUSE_SSL -DUSE_LUA -DUSE_ACCEPT4 -DUSE_ZLIB -DUSE_CPU_AFFINITY -DUSE_TFO -DUSE_NS -DUSE_DL -DUSE_RT -DUSE_MATH -DUSE_SYSTEMD -DUSE_PRCTL -DUSE_THREAD_DUMP -DUSE_QUIC -DUSE_SHM_OPEN -DUSE_PCRE -DUSE_PCRE_JIT -I/github/home/opt/include -I/usr/include -DCONFIG_HAPROXY_VERSION=\"2.8-dev8-7d23e8d1a6db\" -DCONFIG_HAPROXY_DATE=\"2023/04/24\" -c -o src/ssl_sample.o src/ssl_sample.c
In file included from include/haproxy/pool.h:29,
from include/haproxy/chunk.h:31,
from include/haproxy/dynbuf.h:33,
from include/haproxy/channel.h:27,
from include/haproxy/applet.h:29,
from src/ssl_sock.c:47:
src/ssl_sock.c: In function 'tlskeys_finalize_config':
include/haproxy/list.h:48:88: error: storing the address of local variable 'tkr' in 'tlskeys_reference.p' [-Werror=dangling-pointer=]
48 | #define LIST_INSERT(lh, el) ({ (el)->n = (lh)->n; (el)->n->p = (lh)->n = (el); (el)->p = (lh); (el); })
| ~~~~~~~~^~~~~~
src/ssl_sock.c:1086:9: note: in expansion of macro 'LIST_INSERT'
1086 | LIST_INSERT(&tkr, &tlskeys_reference);
| ^~~~~~~~~~~
compilation terminated due to -Wfatal-errors.
This appears with gcc 13.0.
The fix uses LIST_SPLICE() instead of inserting the head of the local
list in the global list.
Should fix issue #2136 .
Since a recent change on the SC API, a producer must specify the amount of
free space it needs to progress when it is blocked. But, it must take care
to never exceed the maximum size allowed in the buffer. Otherwise, the
stream is freezed because it cannot reach the condition to unblock the
producer.
In this context, there is a bug in the cache applet when it fails to dump a
message. It may request more space than allowed. It happens when the cached
object is too big.
It is a 2.8-specific bug. No backport needed.
As noticed by Miroslav, there was a typo in quic_tls_key_update() which lead
a cipher context for decryption to be initialized and used in place of a cipher
context for encryption. Surprisingly, this did not prevent the key update
from working. Perhaps this is due to the fact that the underlying cryptographic
algorithms used by QUIC are all symetric algorithms.
Also modify incorrect traces.
Must be backported in 2.6 and 2.7.
Most of the function in quic_frame.c and quic_frame.h manipulate <buf> buffer
position variables which have nothing to see with struct buffer variables.
Rename them to <pos>
Should be backported to 2.7.
The build without thread support was broken by commit b30ced3d8 ("BUG/MINOR:
debug: fix incorrect profiling status reporting in show threads") because
it accesses the isolated_thread variable that is not defined when threads
are disabled. In fact both the test on harmless and this one make no sense
without threads, so let's comment out the block and mark the related
variables as unused.
This may have to be backported to 2.7 if the commit above is.
The function that cpu-map uses to parse CPU sets, parse_cpu_set(), was
etended in 2.4 with commit a80823543 ("MINOR: cfgparse: support the
comma separator on parse_cpu_set") to support commas between ranges.
But since it was quite late in the development cycle, by then it was
decided not to add a last-minute surprise and not to magically support
commas in cpu-map, hence the "comma_allowed" argument.
Since then we know that it was not the best choice, because the comma
is silently ignored in the cpu-map syntax, causing all sorts of
surprises in field with threads running on a single node for example.
In addition it's quite common to copy-paste a taskset line and put it
directly into the haproxy configuration.
This commit relaxes this rule an finally allows cpu-map to support
commas between ranges. It simply consists in removing the comma_allowed
argument in the parse_cpu_set() function. The doc was updated to
reflect this.
Add a new output format "oneline" for "show quic" command. This prints
one connection per line with minimal information. The objective is to
have an equivalent of the netstat/ss tools with just enough information
to quickly find connection which are misbehaving.
A legend is printed on the first line to describe the field columns
starting with a dash character.
This should be backported up to 2.7.
Add an extra optional argument for "show quic" to specify desired output
format. Its objective is to control the verbosity per connections. For
the moment, only "full" is supported, which is the already implemented
output with maximum information.
This should be backported up to 2.7.
Adding a new lua class: Queue.
This class provides a generic FIFO storage mechanism that may be shared
between multiple lua contexts to easily pass data between them, as stock
Lua doesn't provide easy methods for passing data between multiple
coroutines.
New Queue object may be obtained using core.queue()
(it works like core.concat() for a concat Class)
Lua documentation was updated (including some usage examples)
Remaining in drain mode after removing one of server admins flags leads
to this message being generated:
"Server name/backend is leaving forced drain but remains in drain mode."
However this is not necessarily true: the server might just be leaving
MAINT with the IDRAIN flag set, so the report is incorrect in this case.
(FDRAIN was not set so it cannot be cleared)
To prevent confusion around this message and to comply with the code
comment above it: we remove the "leaving forced drain" precision to
make the report suitable for multiple transitions.
Since 3157222 ("MEDIUM: hlua/applet: Use the sedesc to report and detect
end of processing"), hlua_socket_handler() might spin loop if the hlua
socket is destroyed and some data was left unconsumed in the applet.
Prior to the above commit, the stream was explicitly KILLED
(when ctx->die == 1) so the app couldn't spinloop on unconsumed data.
But since the refactor this is no longer the case.
To prevent unconsumed data from waking the applet indefinitely, we consume
pending data when either one of EOS|ERROR|SHR|SHW flags are set, as it is
done everywhere else this check is performed in the code. Hence it was
probably overlooked in the first place during the refacto.
This bug is 2.8 specific only, so no backport needed.
Proxy mailers, which are configured using "email-alert" directives
in proxy sections from the configuration, are now being exposed
directly in lua thanks to the proxy:get_mailers() method which
returns a class containing the various mailers settings if email
alerts are configured for the given proxy (else nil is returned).
Both the class and the proxy method were marked as LEGACY since this
feature relies on mailers configuration, and the goal here is to provide
the last missing bits of information to lua scripts in order to make
them capable of sending email alerts instead of relying on the soon-to-
be deprecated mailers implementation based on checks (see src/mailers.c)
Lua documentation was updated accordingly.
Exposing a new hlua function, available from body or init contexts, that
forcefully disables the sending of email alerts even if the mailers are
defined in haproxy configuration.
This will help for sending email directly from lua.
(prevent legacy email sending from intefering with lua)
Exposing SERVER_CHECK event through the lua API.
New lua class named ServerEventCheck was added to provide additional
data for SERVER_CHECK event.
Lua documentation was updated accordingly.
Adding a new event type: SERVER_CHECK.
This event is published when a server's check state ought to be reported.
(check status change or check result)
SERVER_CHECK event is provided as a server event with additional data
carrying relevant check's context such as check's result and health.
Adding a new SERVER event in the event_hdl API.
SERVER_ADMIN is implemented as an advanced server event.
It is published each time the administrative state changes.
(when s->cur_admin changes)
SERVER_ADMIN data is an event_hdl_cb_data_server_admin struct that
provides additional info related to the admin state change, but can
be casted as a regular event_hdl_cb_data_server struct if additional
info is not needed.
Reuse cb_data from STATE event to publish UP and DOWN events.
This saves some CPU time since the event is only constructed
once to publish STATE, STATE+UP or STATE+DOWN depending on the
state change.
Adding a new SERVER event in the event_hdl API.
SERVER_STATE is implemented as an advanced server event.
It is published each time the server's effective state changes.
(when s->cur_state changes)
SERVER_STATE data is an event_hdl_cb_data_server_state struct that
provides additional info related to the server state change, but can
be casted as a regular event_hdl_cb_data_server struct if additional
info is not needed.
add a macro helper to help publish server events to global and
per-server subscription list at once since all server events
support both subscription modes.
Server.get_pend_conn: number of pending connections to the server
Server.get_cur_sess: number of current sessions handled by the server
Lua documentation was updated accordingly.
After a sending attempt, we check the opposite SC to see if it is waiting
for a minimum free space to receive more data. If the condition is
respected, it is unblocked. 0 is special case where the SC is
unconditionally unblocked.
During fast-forward, if the SC is waiting for a minimum free space to
receive more data and some data was sent, it is only unblock is the
condition is respected. 0 is special case where the SC is unconditionally
unblocked.
If the opposite SC is waiting for a minimum free space to receive more data,
it is only unblock is the condition is respected. 0 is a special cases where
the opposite SC is always unblocked.
At the end of process_stream(), in sc_update_rx(), the SC is now unblocked
if it was waiting for room and the free space in the input buffer is large
enough. This patch should fix an issue with the compression filter that can
leave the channel's buffer empty while the endpoint is waiting for room to
progress. Indeed, in this case, because the buffer is empty, there is no
send attempt and no other way to unblock the SE.
This commit depends on following commits:
* MEDIUM: tree-wide: Change sc API to specify required free space to progress
* MINOR: stconn: Add a field to specify the room needed by the SC to progress
* MINOR: peers: Use the applet API to send message
* MINOR: stats: Use the applet API to write data
* MINOR: cli: Use applet API to write output message
It should fix a regression introduced with the commit 341a5783b
("BUG/MEDIUM: stconn: stop to enable/disable reads from streams via
si_update_rx").
It must be backported iff the commit above is also backported. It was not
backported yet and it is thus probably a good idea to not do so to avoid to
backport too many change..
sc_need_room() now takes the required free space to receive more data as
parameter. All calls to this function are updated accordingly. For now, this
value is set but not used. When we are waiting for a buffer, 0 is used. So
we expect to be unblocked ASAP. However this must be reviewed because
SC_FL_NEED_BUF is probably enough in this case and this flag is already set
if the input buffer allocation fails.
When the SC is blocked because it is waiting for room in the input buffer,
it will be responsible to specify the minimum free space required to
progress. In this commit, we only introduce the field in the stconn
structure that will be used to store this value. It is a signed value with
the following meaning:
* -1: The SC is waiting for room but not based on the buffer state. It
will be typically used during splicing when the pipe is full. In
this case, only a successful send can unblock the SC.
* >= 0; The minimum free space in the input buffer to unblock the SC. 0 is
a special value to specify the SC must be unblocked ASAP, by the
stream, at the end of process_stream() or when output data are
consumed on the opposite side.
The peers applet now use the applet API to send message instead of the
channel API. This way, it does not need to take care to request more room if
it fails to put data into the channel's buffer.
stats_putchk() is updated to use the applet API instead of the channel API
to write data. To do so, the appctx is passed as parameter instead of the
channel. This way, the applet does not need to take care to request more
room it it fails to put data into the channel's buffer.
Instead of using the channel API to to write output message from the CLI
applet, we use the applet API. This way, the applet does not need to take
care to request more room it it fails to put its message into the channel's
buffer.
This commit introduces the keyword "client-sigalgs" for the bind line,
which does the same as "sigalgs" but for the client authentication.
"ssl-default-bind-client-sigalgs" allows to set the default parameter
for all the bind lines.
This patch should fix issue #2081.
This patch introduces the "sigalgs" keyword for the bind line, which
allows to configure the list of server signature algorithms negociated
during the handshake. Also available as "ssl-default-bind-sigalgs" in
the default section.
This patch was originally written by Bruno Henc.
Previously it would re-dump all threads to the same trash if the output
buffer was full, which it never was since the trash is of the same size.
Now it dumps one thread, copies it to the buffer and yields until it can
continue. Showing 256 threads works as expected.
Currently large setups cannot dump all their threads because they're
first dumped to the trash buffer, then copied to stderr. Here we can
now change this, instead we dump one thread at a time into the trash
and immediately send it to stderr. We also keep a copy into a local
trash chunk that's assigned to thread_dump_buffer so that a core file
still contains a copy of a large number of threads, which is generally
sufficient for the vast majority of situations.
It was verified that dumping 256 threads now produces ~55kB of output
and all of them are properly dumped.
The thread dump mechanism that is used by "show threads" and by the
panic dump is overly complicated due to an initial misdesign. It
firsts wakes all threads, then serializes their dumps, then releases
them, while taking extreme care not to face colliding dumps. In fact
this is not what we need and it reached a limit where big machines
cannot dump all their threads anymore due to buffer size limitations.
What is needed instead is to be able to dump *one* thread, and to let
the requester iterate on all threads.
That's what this patch does. It adds the thread_dump_buffer to the
struct thread_ctx so that the requester offers the buffer to the
thread that is about to be dumped. This buffer also serves as a lock.
A thread at rest has a NULL, a valid pointer indicates the thread is
using it, and 0x1 (NULL+1) is used by the dumped thread to tell the
requester it's done. This makes sure that a given thread is dumped
once at a time. In addition to this, the calling thread decides
whether it accesses the thread by itself or via the debug signal
handler, in order to get a backtrace. This is much saner because the
calling thread is free to do whatever it wants with the buffer after
each thread is dumped, and there is no dependency between threads,
once they've dumped, they're free to continue (and possibly to dump
for another requester if needed). Finally, when the THREAD_DUMP
feature is disabled and the debug signal is not used, the requester
accesses the thread by itself like before.
For now we still have the buffer size limitation but it will be
addressed in future patches.
When a client H2 stream is waiting for a tunnel establishment, it must state
it expects data from server. It is the second fix that should fix
regressions of the commit 2722c04b ("MEDIUM: mux-h2: Don't expect data from
server as long as request is unfinished")
It is a 2.8-specific bug. No backport needed.
In 2.3, commit 471425f51 ("BUG/MINOR: debug: Don't dump the lua stack
if it is not initialized") introduced the possibility to emit an empty
line when there's no Lua info to dump. The problem is that doing this
on the CLI in "show threads" marks the end of the output, and it may
affect some external tools. We need to make sure that LFs are only
emitted if there's something on the line and that all lines properly
start with the prefix.
This may be backported as far as 2.0 since the commit above was
backported there.
With the change for QUIC MUX local error API, the new flag QC_CF_ERRL is
now checked on qc_detach(). If set, qcs instance is freed even though
transfer is not finished. This should help to quickly release qcs and
eventually all MUX instance resources.
To further accelerate this, a specific check has been added in
qc_shutw(). It is skipped if local error flag is set to prevent noisy
reset stream invocation. In the same way, QUIC MUX is not rescheduled on
qc_recv_buf() operation if local error flag set.
This should be backported up to 2.7.
If an error a detected at the MUX layer, all remaining stream endpoints
should be closed asap with error set. This is now done by checking for
QC_CF_ERRL flag on qc_wake_some_streams() and qc_send_buf(). To complete
this, qc_wake_some_streams() is called by qc_process() if needed.
This should help to quickly release streams as soon as a new error is
detected locally by the MUX or APP layer. This allows to in turn free
the MUX instance itself. Previously, error would not have been
automatically reported until the transport layer closure would occur
on CONNECTION_CLOSE emission.
This should be backported up to 2.7.
When a fatal error is detected by the QUIC MUX or H3 layer, the
connection should be closed with a CONNECTION_CLOSE with an error code
as the reason.
Previously, a direct call was used to the quic_conn layer to try to
close the connection. This API was adjusted to be more flexible. Now,
when an error is detected, the function qcc_set_error() is called. This
set the flag QC_CF_ERRL with the error code stored by the MUX. The
connection will be closed soon so most of the operations are not
conducted anymore. Connection is then finally closed during qc_send()
via quic_conn layer if QC_CF_ERRL is set. This will set the flag
QC_CF_ERRL_DONE which indicates that the MUX instance can be freed.
This model is cleaner and brings the following improvments :
- interaction with quic_conn layer for closure is centralized on a
single function
- CO_FL_ERROR is not set anymore. This was incorrect as this should be
reserved to errors reported by the transport layer to be similar with
other haproxy components. As a consequence, qcc_is_dead() has been
adjusted to check for QC_CF_ERRL_DONE to release the MUX instance.
This should be backported up to 2.7.
When HTX content is transferred from qcs instance to upper stream
endpoint, a wakeup is conducted for MUX tasklet. However, this is only
necessary if demux was interrupted due to a full QCS HTX buffer.
This should be backported up to 2.7.
Add a dedicated trace event QMUX_EV_QCC_ERR. This is used for locally
detected error when a CONNECTION_CLOSE should be emitted.
This should be backported up to 2.7.
When MUX performs a graceful shutdown, quic_conn error code is set to a
"no error" code which depends on the application layer used. However,
this may overwrite a previous error code if quic_conn layer has detected
an error on its side.
In practice, this behavior has not been seen on production. In fact, it
may have undesirable effect only if this error code modification happens
between the quic_conn error detection and the emission of the
CONNECTION_CLOSE, so it should be pretty rare. However, there is still a
tiny possibility it may happen.
To prevent this, first check that quic_conn error code is not set before
setting it. Ideally, transport layer API should be adjusted to be able
to set this without fiddling with the quic_conn directly.
This should be backported up to 2.6.
The commit 2722c04b ("MEDIUM: mux-h2: Don't expect data from server as long
as request is unfinished") introduced a regression in the H2 multiplexer.
The end of the request is not systematically handled to state a H2 stream on
client side now expexts data from the server.
Indeed, while the client is uploading its request, the H2 stream warns it
does not expect data from the server. This way, no server timeout is applied
at this stage. When end of the request is detected, the H2 stream must state
it now expects the server response. This enables the server timeout.
However, it was only performed at one place while the end of the request can
be handled at different places. First, during a zero-copy in
h2_rcv_buf(). Then, when the SC is created with the full request. Because of
this bug, it is possible to totally disable the server timeout for H2
streams.
In h2_rcv_buf(), we now rely on h2s flags to detect the end of the request,
but only when the rxbuf was emptied.
It is a 2.8-specific bug. No backport needed.
Sometimes it's convenient to test the effect of tasks running under
isolation, e.g. to validate the contents of the crash dumps. Let's
add an optional "isolated" keyword to "debug dev loop" for this.
Thread dumps include a field "prof" for each thread that reports whether
task profiling is currently active or not. It turns out that in 2.7-dev1,
commit 680ed5f28 ("MINOR: task: move profiling bit to per-thread")
mistakenly replaced it with a check for the current thread's bit in the
thread dumps, which basically is the only place where another thread is
being watched. The same mistake was done a few lines later by confusing
threads_want_rdv_mask with the profiling mask. This mask disappeared
in 2.7-dev2 with commit 598cf3f22 ("MAJOR: threads: change thread_isolate
to support inter-group synchronization"), though instead we know the ID
of the isolated thread. This commit fixes this and now reports "isolated"
instead of "wantrdv".
This can be backported to 2.7.
16kB buffers are not enough to dump 4096 threads with up to 10 bytes value
on each line. By storing the column number in the applet's context, we can
now restart from the last attempted column. This requires to dump all values
as they are produced, but it doesn't cost that much: a 4096-thread output
from a fesh process produces 300kB of output in ~8ms, or ~400us per call
(19*16kB), most of which are spent in vfprintf(). Given that we don't print
more than needed, it doesn't really change anything.
The main caveat is that when interrupted on such large lines, there's a
great possibility that the total or average on the first column doesn't
match anymore the sum or average of all dumped values. In order to avoid
this whenever possible (typically less than ~1500 threads), we first try
to dump entire lines and only proceed one column at a time when we have
to retry a failed dump. This is already the same for other stats that are
dumped in an interruptible way anyway and there's little that can be done
about it at this point (and not much immediately perceived benefit in
doing this with extreme accuracy for >1500 threads).
When using many threads, it's difficult to see the end of "show activity"
due to the numerous columns which fill the buffer. For example a dump of
a 256-thread, freshly booted process yields around 15kB.
Here by arranging the dump in a loop around a switch/case block where
each case checks the code line number against the current dump position,
we have a restartable counter for free with a granularity of the line of
code, without having to maintain a matching between states and specific
lines. It just requires to reset the trash buffer for each line and to
try to dump it after each line.
Now dumping 256 threads after a few seconds of traffic happily emits 20kB.
Now each line of "show activity" will iterate over n+2 fields, one for
the line header, one for the total, and one per thread. This will soon
allow us to save the current state in a restartable way.
Commit 986798718 ("DEBUG: cli: add "debug dev task" to show/wake/expire/kill
tasks and tasklets") broke the build on windows due to this:
src/debug.c:940:95: error: array subscript has type char [-Werror=char-subscripts]
940 | caller && may_access(caller) && may_access(caller->func) && isalnum(*caller->func) ? caller->func : "0",
| ^~~~~~~~~~~~~
It's classical on platforms which implement ctype.h as macros instead of
functions, let's cast it as uchar. No backport is needed.
The x509_v_err_str converter now outputs the numerical value as a string
when the corresponding constant name was not found.
Must be backported as far as 2.7.
When analyzing certain types of bugs in field, sometimes it would be
nice to be able to wake up a task or tasklet to see how events progress
(e.g. to detect a missing wakeup condition), or expire or kill such a
task. This restricted command shows hte current state of a task or tasklet
and allows to manipulate it like this. However it must be used with extreme
care because while it does verify that the pointers are mapped, it cannot
know if they point to a real task, and performing such actions on something
not a task will easily lead to a crash. In addition, performing a "kill"
on a task has great chances of provoking a deferred crash due to a double
free and/or another kill that is not idempotent. Use with extreme care!
The "show sess" command displays the stream's age in synthetic form,
and also makes it appear in the long version (show sess all). But that
last one uses the wrong origin, it uses accept_date.tv_sec instead of
accept_ts (formerly known as tv_accept). This was introduced in 1.4.2
with the long format, with commit 66dc20a17 ("[MINOR] stats socket: add
show sess <id> to dump details about a session"), while the code that
split the two variables was introduced in 1.3.16 with commit b7f694f20
("[MEDIUM] implement a monotonic internal clock"). This problem was
revealed by recent change ad5a5f677 ("MEDIUM: tree-wide: replace timeval
with nanoseconds in tv_accept and tv_request") that made this value report
random garbage, and generally emphasized by the fact that in 2.8 the two
clocks have sufficiently large an offset for such mistakes to be noticeable
early.
Arguably a difference between date and accept_date could also make sense,
to indicate if the stream had been there for more than 49 days, but this
would introduce instabilities for most sockets (including negative times)
for extremely rare cases while the goal is essentially to see how much
longer than a configured timeout a stream has been there. And that's what
other locations (including the short form) provide.
This patch could be backported but most users will never notice. In case
of backport, tv_accept.tv_sec should be used instead of accept_date.tv_sec.
WolfSSL is enabling by default the CRL checks even if a CRL file wasn't
provided. This patch resets the default X509_STORE flags so this is
not checked by default.
Before this patch, global sending rate was measured on the QUIC lower
layer just after sendto(). This meant that all QUIC frames were
accounted for, including non STREAM frames and also retransmission.
To have a better reflection of the application data transferred, move
the incrementation into the MUX layer. This allows to account only for
STREAM frames payload on their first emission.
This should be backported up to 2.6.
Now that "now" is no more a timeval, there's no point keeping a copy
of it as a timeval, let's also switch start_time to nanoseconds, it
simplifies operations.
This puts an end to the occasional confusion between the "now" date
that is internal, monotonic and not synchronized with the system's
date, and "date" which is the system's date and not necessarily
monotonic. Variable "now" was removed and replaced with a 64-bit
integer "now_ns" which is a counter of nanoseconds. It wraps every
585 years, so if all goes well (i.e. if humanity does not need
haproxy anymore in 500 years), it will just never wrap. This implies
that now_ns is never nul and that the zero value can reliably be used
as "not set yet" for a timestamp if needed. This will also simplify
date checks where it becomes possible again to do "date1<date2".
All occurrences of "tv_to_ns(&now)" were simply replaced by "now_ns".
Due to the intricacies between now, global_now and now_offset, all 3
had to be turned to nanoseconds at once. It's not a problem since all
of them were solely used in 3 functions in clock.c, but they make the
patch look bigger than it really is.
The clock_update_local_date() and clock_update_global_date() functions
are now much simpler as there's no need anymore to perform conversions
nor to round the timeval up or down.
The wrapping continues to happen by presetting the internal offset in
the short future so that the 32-bit now_ms continues to wrap 20 seconds
after boot.
The start_time used to calculate uptime can still be turned to
nanoseconds now. One interrogation concerns global_now_ms which is used
only for the freq counters. It's unclear whether there's more value in
using two variables that need to be synchronized sequentially like today
or to just use global_now_ns divided by 1 million. Both approaches will
work equally well on modern systems, the difference might come from
smaller ones. Better not change anyhting for now.
One benefit of the new approach is that we now have an internal date
with a resolution of the nanosecond and the precision of the microsecond,
which can be useful to extend some measurements given that timestamps
also have this resolution.
Instead we're using ns_to_sec(tv_to_ns(&now)) which allows the tv_sec
part to disappear. At this point, "now" is only used as a timeval in
clock.c where it is updated.
Let's get rid of timeval in storage of internal timestamps so that they
are no longer mistaken for wall clock time. These were exclusively used
subtracted from each other or to/from "now" after being converted to ns,
so this patch removes the tv_to_ns() conversion to use them natively. Two
occurrences of tv_isge() were turned to a regular wrapping subtract.
Instead of operating on {sec, usec} now we convert both operands to
ns then subtract them and convert to ms. This is a first step towards
dropping timeval from these timestamps.
Interestingly, tv_ms_elapsed() and tv_ms_remain() are no longer used at
all and could be removed.
The "show info" help for "Start_time_sec" says "Start time in seconds"
so it's definitely the start date in human format, not the internal one
that is solely used to compute uptime. Since commit 28360dc ("MEDIUM:
clock: force internal time to wrap early after boot"), both are split
apart since the start time takes into account the offset needed to cause
the early wraparound, so we must only use start_date here.
No backport is needed.
The commit a664aa6a6 ("BUG/MINOR: tcpcheck: Be able to expect an empty
response") instroduced a regression for expect rules relying on a custom
function. Indeed, there is no check on the buffer to be sure it is not empty
before calling the custom function. But some of these functions expect to
have data and don't perform any test on the buffer emptiness.
So instead of fixing all custom functions, we just don't eval them if the
buffer is empty.
This patch must be backported but only if the commit above was backported
first.
It was a cut/paste typo during stream-interface to conn-stream
refactoring. sc_have_room() was used instead of sc_need_room().
This patch must be backported as far as 2.6.
It is possible to start too many applets on sporadic burst of events after
an inactivity period. It is due to the way we estimate if a new applet must
be created or not. It is based on a frequency counter. We compare the events
processing rate against the number of events currently processed (in
progress or waiting to be processed). But we should also take care of the
number of idle applets.
We already track the number of idle applets, but it is global and not
per-thread. Thus we now also track the number of idle applets per-thread. It
is not a big deal because this fills a hole in the spoe_agent structure.
Thanks to this counter, we can refrain applets creation if there is enough
idle applets to handle currently processed events.
This patch should be backported to every stable versions.
That's hopefully the last one affected by this. It was a bit trickier
because there's the promise in the doc that the date is monotonous, so
we continue to use now-start_time as the uptime value and add it to
start_date to get the current date. It was also emphasized by commit
28360dc ("MEDIUM: clock: force internal time to wrap early after boot"),
causing core.now() to return a date of Mar 20 on Apr 27. No backport is
needed.
Yet another case where "now" was used instead of "date" for a publicly
visible date that was already incorrect and became worse after commit
28360dc ("MEDIUM: clock: force internal time to wrap early after boot").
No backport is needed.
Since commit 28360dc ("MEDIUM: clock: force internal time to wrap early
after boot") we have a much clearer distinction between 'now' (the internal,
drifting clock) and 'date' (the wall clock time). The calltrace code was
using "now" instead of "date" since the value is displayed to humans.
No backport is needed.
This reverts commit aadcfc9ea6.
The parts affecting the DeviceAtlas addon were wrong actually, the
"now" variable was a local time_t in a file that's not compiled with
the haproxy binary (dadwsch). Only the fix to the calltrace is correct,
so better revert and fix the only one in a separate commit. No backport
is needed.
Another case where "now" was used instead of "date" for a publicly visible
date that was already incorrect and became worse after commit 28360dc
("MEDIUM: clock: force internal time to wrap early after boot"). No
backport is needed.
The debug messages were still emitted with a date taken from "now" instead
of "date", which was not correct a long time ago but which became worse in
2.8 since commit 28360dc ("MEDIUM: clock: force internal time to wrap early
after boot"). Let's fix it. No backport is needed.
Since commit 28360dc ("MEDIUM: clock: force internal time to wrap early
after boot") we have a much clearer distinction between 'now' (the internal,
drifting clock) and 'date' (the wall clock time). There were still a few
places where 'now' was being used for human consumption.
No backport is needed.
Each quic_conn are attached in a global thread-local quic_conns list
used for "show quic" command. During thread rebinding, a connection is
detached from its local list instance and moved to its new thread list.
However this operation is not thread-safe and may cause a race
condition.
To fix this, only remove the connection from its list inside
qc_set_tid_affinity(). The connection is inserted only after in
qc_finalize_affinity_rebind() on the new thread instance thus prevented
a race condition. One impact of this is that a connection will be
invisible during rebinding for "show quic".
A connection must not transition to closing state in between this two
steps or else cleanup via quic_handle_stopping() may not miss it. To
ensure this, this patch relies on the previous commit :
commit d6646dddcc
MINOR: quic: finalize affinity change as soon as possible
This should be backported up to 2.7.
During accept, a quic-conn is rebind to a new thread. This process is
done in two times :
* first on the original thread via qc_set_tid_affinity()
* then on the newly assigned thread via qc_finalize_affinity_rebind()
Most quic_conn operations (I/O tasklet, task and quic_conn FD socket
read) are reactivated ony after the second step. However, there is a
possibility that datagrams are handled before it via quic_dgram_parse()
when using listener sockets. This does not seem to cause any issue but
this may cause unexpected behavior in the future.
To simplify this, qc_finalize_affinity_rebind() will be called both by
qc_xprt_start() and quic_dgram_parse(). Only one invocation will be
performed thanks to the new flag QUIC_FL_CONN_AFFINITY_CHANGED.
This should be backported up to 2.7.
Sometimes it may be necessary to send an empty STREAM frame to signal
clean stream closure with FIN bit set. Prior to this change, a Tx buffer
was allocated unconditionnally even if no data is transferred.
Most of the times, allocation was not performed due to an older buffer
reused. But if data were already acknowledge, a new buffer is allocated.
No memory leak occurs as the buffer is properly released when the empty
frame acknowledge is received. But this allocation is unnecessary and it
consumes a connexion Tx buffer for nothing.
Improve this by skipping buffer allocation if no data to transfer.
qcs_build_stream_frm() is now able to deal with a NULL out argument.
This should be backported up to 2.6.
Previous patch fixes an issue occurring with empty STREAM frames without
payload. The crash was hidden in part because buf/data fields of
qf_stream were set even if no payload is referenced. This was not the
true cause of the crash but to ease future debugging, a STREAM frame
built with no payload now has its buf and data fields set to NULL.
This should be backported up to 2.6.
Sometimes it may be necessary to send empty STREAM frames with only the
FIN bit set. For these frames, memcpy is thus unnecessary as their
payload is empty. However, we did not prevent its invocation inside
quic_build_stream_frame().
Normally, memcpy invocation with length==0 is safe. However, there is an
extra condition in our function to handle data wrapping. For an empty
STREAM frame in the context of MUX emission, this is safe as the frame
points to a valid buffer which causes the wrapping condition to be false
and resulting in a memcpy with 0 length.
However, in the context of retransmission, this may lead to a crash.
Consider the following scenario : two STREAM frames A and B are
produced, one with payload and one empty with FIN set, pointing to the
same stream_desc buffer. If A is acknowledged by the peer, its buffer is
released as no more data is left in it. If B needs to be resent, the
wrapping condition will be messed up to a reuse of a freed buffer. Most
of the times, <wrap> will be a negative number, which results in a
memcpy invocation causing a buffer overflow.
To fix this, simply add an extra condition to skip memcpy and wrapping
check if STREAM frame length is null inside quic_build_stream_frame().
This crash is pretty rare as it relies on a lot of conditions difficult
to reproduce. It seems to be the cause for the latest crashes reported
under github issue #2120. In all the inspected dumps, the segfault
occurred during retransmission with an empty STREAM frame being used as
input. Thanks again to Tristan from Mangadex for his help and
investigation on it.
This should be backported up to 2.6.
Since the following mentioned patch, a send-list mechanism was
implemented to improve streams priorization on sending.
commit 20f2a425ff
MAJOR: mux-quic: rework stream sending priorization
This is done to prevent the same streams to always be used as first ones
on emission. However there is still a flaw on the algorithm. Once put in
the send-list, a streams is not removed until it has sent all of its
content. When a stream transfers a large object, it will remain in the
send-list during all the transfer and will soon monopolize the first
place. the stream does never leave its position until the transfer is
finished and will monopolize the first place. Other streams behind won't
have the opportunity to advance on their own transfers due to a Tx
buffer exhaustion.
This situation is especially problematic if a small timeout client is
used. As some streams won't advance on their transfer for a long period
of time, they will be aborted due to a stream layer timeout client
causing a RESET_STREAM emission.
To fix this, during sending each stream with at least some bytes
transferred from its tx.buf to qc_stream_desc out buffer is put at the
end of the send-list. This ensures that on the next iteration streams
that cannot transfer anything will be used in priority.
This patch improves significantly h2load benchmarks for large objects
with several streams opened in parallel on a single connection. Without
it, errors may be reported by h2load for aborted streams. For example,
this improved the following scenario on a 10mbit/s link with a 10s
timeout client :
$ ./build/bin/h2load --npn-list h3 -t 1 -c 1 -m 30 -n 30 https://198.18.10.11:20443/?s=500k
This fix may help with the github issue #2004 where chrome browser stop
to use QUIC after receiving RESET_STREAM frames.
This should be backported up to 2.7.
Some HTX responses may not always contain a EOM block. For example this
is the case if content-length header is missing from the HTTP server
response. Stream termination is thus signaled to QUIC mux via shutw
callback. However, this is interpreted inconditionnally as an early
close by the mux with a RESET_STREAM emission. Most of the times, QUIC
clients report this as an error.
To fix this, check if htx.extra is set to HTX_UNKOWN_PAYLOAD_LENGTH for
a qcs instance. If true, shutw will never be used to emit a
RESET_STREAM. Instead, the stream will be closed properly with a FIN
STREAM frame. If all data were already transfered, an empty STREAM frame
is sent.
This fix may help with the github issue #2004 where chrome browser stop
to use QUIC after receiving RESET_STREAM frames.
This issue was reported by Vladimir Zakharychev. Thanks to him for his
help and testing. It was also reproduced locally using httpterm with the
query string "/?s=1k&b=0&C=1".
This should be backported up to 2.7.
Make quic_stateless_reset_token_cpy(), quic_derive_cid() and quic_get_cid_tid()
be more readable: there is no struct buffer variable manipulated by these
functions.
Should be backported to 2.7.
There is no <buf> variable passed to this function.
Also rename <buf_end> to <end> to mimic others functions.
Rename <beg> to <first_byte> and <end> to <last_byte>.
Should be backported to 2.7.
Make quic_build_packet_long_header(), quic_build_packet_short_header() and
quic_apply_header_protection() be more readable: there is no struct buffer
variables used by these functions.
Should be backported to 2.7.
Remove the pointer to the connection passed as parameters to qc_purge_tx_buf()
and other similar function which came with qc_purge_tx_buf() implementation.
They were there do track the connection during tests.
Must be backported to 2.7.
Rename all frame variables with the suffix _frm. This helps to
differentiate frame instances from other internal objects.
This should be backported up to 2.7.
Each frame type used in quic_frame union has been renamed with the
following prefix "qf_". This helps to differentiate frame instances from
other internal objects.
This should be backported up to 2.7.
From the idle_timer_task(), the I/O handler must be woken up to send ack. But
there is no reason to do that in draining state or killing state. In draining
state this is even forbidden.
Must be backported to 2.7.
The timer task responsible of triggering probing retransmission did not inspect
the state of the connection before doing its job. But there is no need to
probe the peer when the connection is in draining or killing state. About the
draining state, this is even forbidden.
Must be backported to 2.7 and 2.6.
qc_dgrams_retransmit() prepares two list of frames to be retransmitted into
two datagrams. If the first datagram could not be sent, the TX buffer will
be purged with the prepared packet and its frames, but this was not the case for
the second list of frames.
Must be backported in 2.7.
This bug arrived with this commit which was not sufficient:
BUG/MEDIUM: quic: Missing TX buffer draining from qc_send_ppkts()
Indeed, there were also remaining allocated TX packets to be released and
their TX frames.
Implement qc_purge_tx_buf() to do so which depends on qc_free_tx_coalesced_pkts()
and qc_free_frm_list().
Must be backported to 2.7.
Sharding by-group is exactly identical to by-process for a single
group, and will use the same number of file descriptors for more than
one group, while significantly lowering the kernel's locking overhead.
Now that all special listeners (cli, peers) are properly handled, and
that support for SO_REUSEPORT is detected at runtime per protocol, there
should be no more reason for now switching to by-group by default.
That's what this patch does. It does only this and nothing else so that
it's easy to revert, should any issue be raised.
Testing on an AMD EPYC 74F3 featuring 24 cores and 48 threads distributed
into 8 core complexes of 3 cores each, shows that configuring 8 groups
(one per CCX) is sufficient to simply double the forwarded connection
rate from 112k to 214k/s, reducing kernel locking from 71 to 55%.
This new setting accepts "by-process", "by-group" and "by-thread" and
will dictate how listeners will be sharded by default when nothing is
specified. While the default remains "by-process", "by-group" should be
much more efficient with many threads, while not changing anything for
single-group setups.
Now that we're able to run listeners on any set of groups, we don't need
to maintain a special case about the stats socket anymore. It used to be
forced to group 1 only so as to avoid startup failures in case several
groups were configured, but if it's done now, it will automatically bind
the needed FDs to have one per group so this is no more an issue.
When testing if a protocol supports SO_REUSEPORT, we're now able to
verify if the OS does really support it. While it may be supported at
build time, it may possibly have been blocked in a container for
example so we'd rather know what it's like.
The new function _sock_supports_reuseport() will be used to check if a
protocol type supports SO_REUSEPORT or not. This will be useful to verify
that shards can really work.
The new function protocol_supports_flag() checks the protocol flags
to verify if some features are supported, but will support being
extended to refine the tests. Let's use it to check for REUSEPORT.
Now if multiple shards are explicitly requested, and the listener's
protocol doesn't support SO_REUSEPORT, sharding is disabled, which will
result in the socket being automatically duped if needed. A warning is
emitted when this happens. If "shards by-group" or "shards by-thread"
are used, these will automatically be turned down to 1 since we want
this to be possible easily using -dR on the command line without having
to djust the config. For "by-thread", a diag warning will be emitted to
help troubleshoot possible performance issues.
Some protocol support SO_REUSEPORT and others not. Some have such a
limitation in the kernel, and others in haproxy itself (e.g. sock_unix
cannot support multiple bindings since each one will unbind the previous
one). Also it's really protocol-dependent and not just family-dependent
because on Linux for some time it was supported for TCP and not UDP.
Let's move the definition to the protocols instead. Now it's preset in
tcp/udp/quic when SO_REUSEPORT is defined, and is otherwise left unset.
The enabled() config condition test validates IPv4 (generally sufficient),
and -dR / noreuseport all protocols at once.
We'll use these flags to know if some protocols are supported, and if
so, with what options/extensions. Reuseport will move there for example.
Two functions were added to globally set/clear a flag.
The listeners in peers sections were still not handing the thread
groups fine. Shards were silently ignored and if a listener was bound
to more than one group, it would simply fail. Now we can call the
dedicated function to resolve all this and possibly create the missing
extra listeners.
bind_complete_thread_setup() was adjusted to use the proxy_type_str()
instead of writing "proxy" at the only place where this word was still
hard-coded so that we continue to speak about peers sections when
relevant.
What used to be only two lines to apply a mask in a loop in
check_config_validity() grew into a 130-line block that performs deeply
listener-specific operations that do not have their place there anymore.
In addition it's worth noting that the peers code still doesn't support
shards nor being bound to more than one group, which is a second reason
for moving that code to its own function. Nothing was changed except
recreating the missing variables from the bind_conf itself (the fe only).
In 2.6-dev1, NUMA topology detection was enabled on FreeBSD with commit
f5d48f8b3 ("MEDIUM: cfgparse: numa detect topology on FreeBSD."). But
it suffers from a minor bug which is that it forgets to check for the
number of domains and always emits a confusing warning indicating that
multiple sockets were found while it's not the case.
This can be backported to 2.6.
The lib compatibility checks introduced in 2.8-dev6 with commit c3b297d5a
("MEDIUM: tools: further relax dlopen() checks too consider grouped
symbols") were partially incorrect in that they check at the same time
libcrypto and libssl. But if loading a library that only depends on
libcrypto, the ssl-only symbols will be missing and this might present
an inconsistency. This is what is observed on FreeBSD 13.1 when
libcrypto is being loaded, where it sees two symbols having disappeared.
The fix consists in splitting the checks for libcrypto and libssl.
No backport is needed, unless the patch above finally gets backported.
On FreeBSD 13.1 I noticed that thread balancing using shards was not
always working. Sometimes several threads would work, but most of the
time a single one was taking all the traffic. This is related to how
SO_REUSEPORT works on FreeBSD since version 12, as it seems there is
no guarantee that multiple sockets will receive the traffic. However
there is SO_REUSEPORT_LB that is designed exactly for this, so we'd
rather use it when available.
This patch may possibly be backported, but nobody complained and it's
not sure that many users rely on shards. So better wait for some feedback
before backporting this.
In 2.7-dev2, "stats bind-process" was removed by commit 94f763b5e
("MEDIUM: config: remove deprecated "bind-process" directives from
frontends") and an error message indicates that it's no more supported.
However it says "stats" is not supported instead of "stats bind-process",
making it a bit confusing.
This should be backported to 2.7.
By comparing the local thread's load with the least loaded thread's
load, we can further improve the fairness and at the same time also
improve locality since it allows a small ratio of connections not to
be migrated. This is visible on CPU usage with long connections on
very large thread counts (224) and high bandwidth (200G). The cost
of checking the local thread's load remains fairly low so there's no
reason not to do this. We continue to update the index if we select
the local thread, because it means that the two other threads were
both more loaded so we'd rather find better ones.
One limitation of the current thread index mechanism is that if the
values are assigned multiple times to the same thread and the index
loops, it can match again the old value, which will not prevent a
competing thread from finishing its CAS and assigning traffic to a
thread that's not the optimal one. The probability is low but the
solution is simple enough and consists in implementing an update
counter in the high bits of the index to force a mismatch in this
case (assuming we don't try to cover for extremely unlikely cases
where the update counter loops while the index remains equal). So
let's do that. In order to improve the situation a little bit, we
now set the index to a ulong so that in 32 bits we have 8 bits of
counter and in 64 bits we have 40 bits.
During heavy accept competition, the CAS will occasionally fail and
we'll have to go through all the calculation again. While the first
two loops look heavy, they're almost never taken so they're quite
cheap. However the rest of the operation is heavy because we have to
consult connection counts and queue indexes for other threads, so
better double-check if the index is still valid before continuing.
Tests show that it's more efficient do retry half-way like this.
Instead of seeing each listener use its own thr_idx, let's use the same
for all those from a shard. It should provide more accurate and smoother
thread allocation.
Till now threads were assigned in listener_accept() to other threads of
the same group only, using a single group mask. Now that we have all the
relevant info (array of listeners of the same shard), we can spread the
thr_idx to cover all assigned groups. The thread indexes now contain the
group number in their upper bits, and the indexes run over te whole list
of threads, all groups included.
One particular subtlety here is that switching to a thread from another
group also means switching the group, hence the listener. As such, when
changing the group we need to update the connection's owner to point to
the listener of the same shard that is bound to the target group.
There has always been a race when checking the length of an accept queue
to determine which one is more loaded that another, because the head and
tail are read at two different moments. This is not required, we can merge
them as two 16 bit numbers inside a single 32-bit index that is always
accessed atomically. This way we read both values at once and always have
a consistent measurement.
Now it's possible for a bind line to span multiple thread groups. When
this happens, the first one will become the reference and will be entirely
set up, and the subsequent ones will be duplicated from this reference,
so that they can be registered in distinct groups. The reference is
always setup and started first so it is always available when the other
ones are started.
The doc was updated to reflect this new possibility with its limitations
and impacts, and the differences with the "shards" option.
It's not strictly necessary, but it's still better to avoid setting
up the same socket multiple times when it's being duplicated to a few
FDs. We don't change that for inherited ones however since they may
really need to be set up, so we only skip duplicated ones.
The different protocol's ->bind() function will now check the receiver's
RX_F_MUST_DUP flag to decide whether to bind a fresh new listener from
scratch or reuse an existing one and just duplicate it. It turns out
that the existing code already supports reusing FDs since that was done
as part of the FD passing and inheriting mechanism. Here it's not much
different, we pass the FD of the reference receiver, it gets duplicated
and becomes the new receiver's FD.
These FDs are also marked RX_F_INHERITED so that they are not exported
and avoid being touched directly (only the reference should be touched).
In order to create multiple receivers for one multi-group shard, we'll
need some more info about the shard. Here we store:
- the number of groups (= number of receivers)
- the number of threads (will be used for accept LB)
- pointer to the reference rx (to get the FD and to find all threads)
- pointers to the other members (to iterate over all threads)
For now since there's only one group per shard it remains simple. The
listener deletion code already takes care of removing the current
member from its shards list and moving others' reference to the last
one if it was their reference (so as to avoid o(n^2) updates during
ordered deletes).
Since the vast majority of setups will not use multi-group shards, we
try to save memory usage by only allocating the shard_info when it is
needed, so the principle here is that a receiver shard_info==NULL is
alone and doesn't share its socket with another group.
Various approaches were considered and tests show that the management
of the listeners during boot makes it easier to just attach to or
detach from a shard_info and automatically allocate it if it does not
exist, which is what is being done here.
For now the attach code is not called, but detach is already called
on delete.
This new algorithm for rebalancing incoming connections to multiple
threads is simpler and instead of considering the threads load, it will
only cycle through all of them, offering a fair share of the traffic to
each thread. It may be well suited for short-lived connections but is
also convenient for very large thread counts where it's not always certain
that the least loaded thread will always be found.
There's a li_per_thread array in each listener for use with QUIC
listeners. Since thread groups were introduced, this array can be
allocated too large because global.nbthread is allocated for each
listener, while only no more than MIN(nbthread,MAX_THREADS_PER_GROUP)
may be used by a single listener. This was because the global thread
ID is used as the index instead of the local ID (since a listener may
only be used by a single group). Let's just switch to local ID and
reduce the allocated size.
When migrating a quic_conn to another thread, we may need to also
switch the listener if the thread belongs to another group. When
this happens, the freshly created connection will already have the
target listener, so let's just pick it from the connection and use
it in qc_set_tid_affinity(). Note that it will be the caller's
responsibility to guarantee this.
Adding the possibility to publish an event using a struct wrapper
around existing SERVER events to provide additional contextual info.
Using the specific struct wrapper is not required: it is supported
to cast event data as a regular server event data struct so
that we don't break the existing API.
However, casting event data with a more explicit data type allows
to fetch event-only relevant hints.
Considering that srv_update_status() is now synchronous again since
3ff577e1 ("MAJOR: server: make server state changes synchronous again"),
and that we can easily identify if the update is from an operational
or administrative context thanks to "MINOR: server: pass adm and op cause
to srv_update_status()".
And given that administrative and operational updates cannot be cumulated
(since srv_update_status() is called synchronously and independently for
admin updates and state/operational updates, and the function directly
consumes the changes).
We split srv_update_status() in 2 distinct parts:
Either <type> is 0, meaning the update is an operational update which
is handled by directly looking at cur_state and next_state to apply the
proper transition.
Also, the check to prevent operational state from being applied
if MAINT admin flag is set is no longer needed given that the calling
functions already ensure this (ie: srv_set_{running,stopping,stopped)
Or <type> is 1, meaning the update is an administrative update, where
cur_admin and next_admin are evaluated to apply the proper transition and
deduct the resulting server state (next_state is updated implicitly).
Once this is done, both operations share a common code path in
srv_update_status() to update proxy and servers stats if required.
Thanks to this change, the function's behavior is much more predictable,
it is not an all-in-one function anymore. Either we apply an operational
change, else it is an administrative change. That's it, we cannot mix
the 2 since both code paths are now properly separated.
Operational and administrative state change causes are not propagated
through srv_update_status(), instead they are directly consumed within
the function to provide additional info during the call when required.
Thus, there is no valid reason for keeping adm and op causes within
server struct. We are wasting space and keeping uneeded complexity.
We now exlicitly pass change type (operational or administrative) and
associated cause to srv_update_status() so that no extra storage is
needed since those values are only relevant from srv_update_status().
Fixing function comments for the server state changing function since they
still refer to asynchonous propagation of server state which is no longer
in play.
Moreover, there were some mixups between running/stopping.
This one is greatly inspired by "MINOR: server: change adm_st_chg_cause storage type".
While looking at current srv_op_st_chg_cause usage, it was clear that
the struct needed some cleanup since some leftovers from asynchronous server
state change updates were left behind and resulted in some useless code
duplication, and making the whole thing harder to maintain.
Two observations were made:
- by tracking down srv_set_{running, stopped, stopping} usage,
we can see that the <reason> argument is always a fixed statically
allocated string.
- check-related state change context (duration, status, code...) is
not used anymore since srv_append_status() directly extracts the
values from the server->check. This is pure legacy from when
the state changes were applied asynchronously.
To prevent code duplication, useless string copies and make the reason/cause
more exportable, we store it as an enum now, and we provide
srv_op_st_chg_cause() function to fetch the related description string.
HEALTH and AGENT causes (check related) are now explicitly identified to
make consumers like srv_append_op_chg_cause() able to fetch checks info
from the server itself if they need to.
srv_append_status() has become a swiss-knife function over time.
It is used from server code and also from checks code, with various
inputs and distincts code paths, making it very hard to guess the
actual behavior of the function (resulting string output).
To simplify the logic behind it, we're dividing it in multiple contextual
functions that take simple inputs and do explicit things, making them
more predictable and easier to maintain.
Even though it doesn't look like it at first glance, this is more like
a cleanup than an actual code improvement:
Given that srv->adm_st_chg_cause has been used to exclusively store
static strings ever since it was implemented, we make the choice to
store it as an enum instead of a fixed-size string within server
struct.
This will allow to save some space in server struct, and will make
it more easily exportable (ie: event handlers) because of the
reduced memory footprint during handling and the ability to later get
the corresponding human-readable message when it's explicitly needed.
Now that we have a generic srv_lb_propagate(s) function, let's
use it each time we explicitly wan't to set the status down as
well.
Indeed, it is tricky to try to handle "down" case explicitly,
instead we use srv_lb_propagate() which will call the proper
function that will handle the new server state.
This will allow some code cleanup and will prevent any logic
error.
This commit depends on:
- "MINOR: server: propagate server state change to lb through single function"
Use a dedicated helper function to propagate server state change to
lb algorithms, since it is performed at multiple places within
srv_update_status() function.
Based on "BUG/MINOR: server: don't miss server stats update on server
state transitions", we're also taking advantage of the new centralized
logic to update down_trans server counter directly from there instead
of multiple places.
When restoring from a state file: the server "Status" reports weird values on
the html stats page:
"5s UP" becomes -> "? UP" after the restore
This is due to a bug in srv_state_srv_update(): when restoring the states
from a state file, we rely on date.tv_sec to compute the process-relative
server last_change timestamp.
This is wrong because everywhere else we use now.tv_sec when dealing
with last_change, for instance in srv_update_status().
date (which is Wall clock time) deviates from now (monotonic time) in the
long run.
They should not be mixed, and given that last_change is an internal time value,
we should rely on now.tv_sec instead.
last_change export through "show servers state" cli is safe since we export
a delta and not the raw time value in dump_servers_state():
srv_time_since_last_change = now.tv_sec - srv->last_change
--
While this bug affects all stable versions, it was revealed in 2.8 thanks
to 28360dc ("MEDIUM: clock: force internal time to wrap early after boot")
This is due to the fact that "now" immediately deviates from "date",
whereas in the past they had the same value when starting.
Thus prior to 2.8 the bug is trickier since it could take some time for
date and now to deviate sufficiently for the issue to arise, and instead
of reporting absurd values that are easy to spot it could just result in
last_change becoming inconsistent over time.
As such, the fix should be backported to all stable versions.
[for 2.2 the patch needs to be applied manually since
srv_state_srv_update() was named srv_update_state() and can be found in
server.c instead of server_state.c]
s->last_change and s->down_time updates were manually updated for each
effective server state change within srv_update_status().
This is rather error-prone, and as a result there were still some state
transitions that were not handled properly since at least 1.8.
ie:
- when transitionning from DRAIN to READY: downtime was updated
(which is wrong since a server in DRAIN state should not be
considered as DOWN)
- when transitionning from MAINT to READY: downtime was not updated
(this can be easily seen in the html stats page)
To fix these all at once, and prevent similar bugs from being introduced,
we centralize the server last_change and down_time stats logic at the end
of srv_update_status():
If the server state changed during the call, then it means that
last_change must be updated, with a special case when changing from
STOPPED state which means the server was previously DOWN and thus
downtime should be updated.
This patch depends on:
- "MINOR: server: explicitly commit state change in srv_update_status()"
This could be backported to every stable versions.
backend "down" stats logic has been duplicated multiple times in
srv_update_status(), resulting in the logic now being error-prone.
For example, the following bugfix was needed to compensate for a
copy-paste introduced bug: d332f139
("BUG/MINOR: server: update last_change on maint->ready transitions too")
While the above patch works great, we actually forgot to update the
proxy downtime like it is done for other down->up transitions...
This is simply illustrating that the current design is error-prone,
it is very easy to miss something in this area.
To properly update the proxy downtime stats on the maint->ready transition,
to cleanup srv_update_status() and to prevent similar bugs from being
introduced in the future, proxy/backend stats update are now automatically
performed at the end of the server state change if needed.
Thus we can remove existing updates that were performed at various places
within the function, this simplifies things a bit.
This patch depends on:
- "MINOR: server: explicitly commit state change in srv_update_status()"
This could be backported to all stable versions.
Backport notes:
2.2:
Replace
struct task *srv_cleanup_toremove_conns(struct task *task, void *context, unsigned int state)
by
struct task *srv_cleanup_toremove_connections(struct task *task, void *context, unsigned short state)
As shown in 8f29829 ("BUG/MEDIUM: checks: a down server going to
maint remains definitely stucked on down state."), state changes
that don't result in explicit lb state change, require us to perform
an explicit server state commit to make sure the next state is
applied before returning from the function.
This is the case for server state changes that don't trigger lb logic
and only perform some logging.
This is quite error prone, we could easily forget a state change
combination that could result in next_state, next_admin or next_eweight
not being applied. (cur_state, cur_admin and cur_eweight would be left
with unexpected values)
To fix this, we explicitly call srv_lb_commit_status() at the end
of srv_update_status() to enforce the new values, even if they were
already applied. (when a state changes requires lb state update
an implicit commit is already performed)
Applying the state change multiple times is safe (since the next value
always points to the current value).
Backport notes:
2.2:
Replace
struct task *srv_cleanup_toremove_conns(struct task *task, void *context, unsigned int state)
by
struct task *srv_cleanup_toremove_connections(struct task *task, void *context, unsigned short state)
Report message for tracking servers completely leaving drain is wrong:
The check for "leaving drain .. via" never evaluates because the
condition !(s->next_admin & SRV_ADMF_FDRAIN) is always true in the
current block which is guarded by !(s->next_admin & SRV_ADMF_DRAIN).
For tracking servers that leave inherited drain mode, this results in the
following message being emitted:
"Server x/b is UP (leaving forced drain)"
Instead of:
"Server x/b is UP (leaving drain) via x/a"
To this fix: we check if FDRAIN is currently set, else it means that the
drain status is inherited from the tracked server (IDRAIN)
This regression was introduced with 64cc49cf ("MAJOR: servers: propagate server
status changes asynchronously."), thus it may be backported to every stable
versions.
'when' optional argument is provided to lua event handlers.
It is an integer representing the number of seconds elapsed since Epoch
and may be used in conjunction with lua `os.date()` function to provide
a custom format string.
For advanced async handlers only
(Registered using EVENT_HDL_ASYNC_TASK() macro):
event->when is provided as a struct timeval and fetched from 'date'
haproxy global variable.
Thanks to 'when', related event consumers will be able to timestamp
events, even if they don't work in real-time or near real-time.
Indeed, unlike sync or normal async handlers, advanced async handlers
could purposely delay the consumption of pending events, which means
that the date wouldn't be accurate if computed directly from within
the handler.
Add the ability to provide a cleanup function for event data passed
via the publishing function.
One use case could be the need to provide valid pointers in the safe
section of the data struct.
Cleanup function will be automatically called with data (or copy of data)
as argument when all handlers consumed the event, which provides an easy
way to release some memory or decrement refcounts to ressources that were
provided through the data struct.
data in itself may not be freed by the cleanup function, it is handled
by the API.
This would allow passing large (allocated) data blocks through the data
struct while keeping data struct size under the EVENT_HDL_ASYNC_EVENT_DATA
size limit.
To do so, when publishing an event, where we would currently do:
struct event_hdl_cb_data_new_family event_data;
/* safe data, available from both sync and async contexts
* may not use pointers to short-living resources
*/
event_data.safe.my_custom_data = x;
/* unsafe data, only available from sync contexts */
event_data.unsafe.my_unsafe_data = y;
/* once data is prepared, we can publish the event */
event_hdl_publish(NULL,
EVENT_HDL_SUB_NEW_FAMILY_SUBTYPE_1,
EVENT_HDL_CB_DATA(&event_data));
We could do:
struct event_hdl_cb_data_new_family event_data;
/* safe data, available from both sync and async contexts
* may not use pointers to short-living resources,
* unless EVENT_HDL_CB_DATA_DM is used to ensure pointer
* consistency (ie: refcount)
*/
event_data.safe.my_custom_static_data = x;
event_data.safe.my_custom_dynamic_data = malloc(1);
/* unsafe data, only available from sync contexts */
event_data.unsafe.my_unsafe_data = y;
/* once data is prepared, we can publish the event */
event_hdl_publish(NULL,
EVENT_HDL_SUB_NEW_FAMILY_SUBTYPE_1,
EVENT_HDL_CB_DATA_DM(&event_data, data_new_family_cleanup));
With data_new_family_cleanup func which would look like this:
void data_new_family_cleanup(const void *data)
{
const struct event_hdl_cb_data_new_family *event_data = ptr;
/* some data members require specific cleanup once the event
* is consumed
*/
free(event_data.safe.my_custom_dynamic_data);
/* don't ever free data! it is not ours */
}
Not sure if this feature will become relevant in the future, so I prefer not
to mention it in the doc for now.
But given that the implementation is trivial and does not put a burden
on the existing API, it's a good thing to have it there, just in case.
This commit does nothing that ought to be mentioned, except that
it adds missing comments and slighty moves some function calls
out of "sensitive" code in preparation of some server code refactors.
Changing hlua_event_hdl_cb_data_push_args() return type to void since it
does not return anything useful.
Also changing its name to hlua_event_hdl_cb_push_args() since it does more
than just pushing cb data argument (it also handles event type and mgmt).
Errors catched by the function are reported as lua errors.
Since "MINOR: server/event_hdl: add proxy_uuid to event_hdl_cb_data_server"
we may now use proxy_uuid variable to perform proxy lookups when
handling a server event.
It is more reliable since proxy_uuid isn't subject to any size limitation
Expose proxy_uuid variable in event_hdl_cb_data_server struct to
overcome proxy_name fixed length limitation.
proxy_uuid may be used by the handler to perform proxy lookups.
This should be preferred over lookups relying proxy_name.
(proxy_name is suitable for printing / logging purposes but not for
ID lookups since it has a maximum fixed length)
srv_update_status() function comment says that the function "is designed
to be called asynchronously".
While this used to be true back then with 64cc49cf
("MAJOR: servers: propagate server status changes asynchronously.")
This is not true anymore since 3ff577e ("MAJOR: server: make server state changes
synchronous again")
Fixing the comment in order to better reflect current behavior.
Since 9f903af5 ("MEDIUM: log: slightly refine the output format of
alerts/warnings/etc"), messages generated by ha_{alert,warning,notice}
don't embed date/time information anymore.
Updating some old function comments that kept saying otherwise.
A BUG_ON crash can occur on qc_rcv_buf() if a Rx packet allocation
failed.
To fix this, datagram are marked as consumed even if a fatal error
occured during parsing. For the moment, only a Rx packet allocation
failure could provoke this. At this stage, it's unknown if the datagram
were partially parsed or not at all so it's better to discard it
completely.
This bug was detected using -dMfail argument.
This should be backported up to 2.7.
Properly initialize el_th_ctx member first on qc_new_conn(). This
prevents a segfault if release should be called later due to memory
allocation failure in the function on qc_detach_th_ctx_list().
This should be backported up to 2.7.
Do not emit a CONNECTION_CLOSE on h3s allocation failure. Indeed, this
causes a crash as the calling function qcs_new() will also try to emit a
CONNECTION_CLOSE which triggers a BUG_ON() on qcc_emit_cc().
This was reproduced using -dMfail.
This should be backported up to 2.7.
Previously, if a STREAM frame cannot be allocated for emission, a crash
would occurs due to an ABORT_NOW() statement in _qc_send_qcs().
Replace this by proper error code handling. Each stream were sending
fails are removed temporarily from qcc::send_list to a list local to
_qc_send_qcs(). Once emission has been conducted for all streams,
reinsert failed stream to qcc::send_list. This avoids to reloop on
failed streams on the second while loop at the end of _qc_send_qcs().
This crash was reproduced using -dMfail.
This should be backported up to 2.6.
On MUX initialization, the application layer is setup via
qcc_install_app_ops(). If this function fails MUX is deallocated and an
error is returned.
This code path causes a crash before connection has been registered
prior into the mux_stopping_data::list for stopping idle frontend conns.
To fix this, insert the connection later in qc_init() once no error can
occured.
The crash was seen on the process closing with SUGUSR1 with a segfault
on mux_stopping_process(). This was reproduced using -dMfail.
This regression was introduced by the following patch :
commit b4d119f0c7
BUG/MEDIUM: mux-quic: fix crash on H3 SETTINGS emission
This should be backported up to 2.7.
Again a now_ms variable value used without the ticks API. It is used
to store the generation time of the Retry token to be received back
from the client.
Must be backported to 2.6 and 2.7.
As server, an Initial does not contain a token but only the token length field
with zero as value. The remaining room was not checked before writting this field.
Must be backported to 2.6 and 2.7.
Since this commit:
BUG/MINOR: quic: Possible wrapped values used as ACK tree purging limit.
There are more chances that ack ranges may be removed from their trees when
building a packet. It is preferable to impose a limit to these trees. This
will be the subject of the a next commit to come.
For now on, it is sufficient to stop deleting ack range from their trees.
Remove quic_ack_frm_reduce_sz() and quic_rm_last_ack_ranges() which were
there to do that.
Make qc_frm_len() support ACK frames and calls it to ensure an ACK frame
may be added to a packet before building it.
Must be backported to 2.6 and 2.7.
Overriding global coroutine.create() function in order to link the
newly created subroutine with the parent hlua ctx.
(hlua_gethlua() function from a subroutine will return hlua ctx from the
hlua ctx on which the coroutine.create() was performed, instead of NULL)
Doing so allows hlua_hook() function to support being called from
subroutines created using coroutine.create() within user lua scripts.
That is: the related subroutine will be immune to the forced-yield,
but it will still be checked against hlua timeouts. If the subroutine
fails to yield or finish before the timeout, the related lua handler will
be aborted (instead of going rogue unnoticed like it would be the case prior
to this commit)
When forcing a yield attempt from hlua_hook(), we should perform it on
the known hlua state, not on a potential substate created using
coroutine.create() from an existing hlua state from lua script.
Indeed, only true hlua couroutines will properly handle the yield and
perform the required timeout checks when returning in hlua_ctx_resume().
So far, this was not a concern because hlua_gethlua() would return NULL
if hlua_hook() is not directly being called from a hlua coroutine anyway.
But with this we're trying to make hlua_hook() ready for being called
from a subcoroutine which inherits from a parent hlua ctx.
In this case, no yield attempt will be performed, we will simply check
for hlua timeouts.
Not doing so would result in the timeout checks not being performed since
hlua_ctx_resume() is completely bypassed when yielding from the subroutine,
resulting in a user-defined coroutine potentially going rogue unnoticed.
Not all hlua "time" variables use the same time logic.
hlua->wake_time relies on ticks since its meant to be used in conjunction
with task scheduling. Thus, it should be stored as a signed int and
manipulated using the tick api.
Adding a few comments about that to prevent mixups with hlua internal
timer api which doesn't rely on the ticks api.
The "burst" execution timeout applies to any Lua handler.
If the handler fails to finish or yield before timeout is reached,
handler will be aborted to prevent thread contention, to prevent
traffic from not being served for too long, and ultimately to prevent
the process from crashing because of the watchdog kicking in.
Default value is 1000ms.
Combined with forced-yield default value of 10000 lua instructions, it
should be high enough to prevent any existing script breakage, while
still being able to catch slow lua converters or sample fetches
doing thread contention and risking the process stability.
Setting value to 0 completely bypasses this check. (not recommended but
could be required to restore original behavior if this feature breaks
existing setups somehow...)
No backport needed, although it could be used to prevent watchdog crashes
due to poorly coded (slow/cpu consuming) lua sample fetches/converters.
For non yieldable lua handlers (converters, fetches or yield
incompatible lua functions), current timeout detection relies on now_ms
thread local variable.
But within non-yieldable contexts, now_ms won't be updated if not by us
(because we're momentarily stuck in lua context so we won't
re-enter the polling loop, which is responsible for clock updates).
To circumvent this, clock_update_date(0, 1) was manually performed right
before now_ms is being read for the timeout checks.
But this fails to work consistently, because if no other concurrent
threads periodically run clock_update_global_date(), which do happen if
we're the only active thread (nbthread=1 or low traffic), our
clock_update_date() call won't reliably update our local now_ms variable
Moreover, clock_update_date() is not the right tool for this anyway, as
it was initially meant to be used from the polling context.
Using it could have negative impact on other threads relying on now_ms
to be stable. (because clock_update_date() performs global clock update
from time to time)
-> Introducing hlua multipurpose timer, which is internally based on
now_cpu_time_fast() that provides per-thread consistent clock readings.
Thanks to this new hlua timer API, hlua timeout logic is less error-prone
and more robust.
This allows the timeout detection to work as expected for both yieldable
and non-yieldable lua handlers.
This patch depends on commit "MINOR: clock: add now_cpu_time_fast() function"
While this could theorically be backported to all stable versions,
it is advisable to avoid backports unless we're confident enough
since it could cause slight behavior changes (timing related) in
existing setups.
Same as now_cpu_time(), but for fast queries (less accurate)
Relies on now_cpu_time() and now_mono_time_fast() is used
as a cache expiration hint to prevent now_cpu_time() from being
called too often since it is known to be quite expensive.
Depends on commit "MINOR: clock: add now_mono_time_fast() function"
Same as now_mono_time(), but for fast queries (less accurate)
Relies on coarse clock source (also known as fast clock source on
some systems).
Fallback to now_mono_time() if coarse source is not supported on the system.
Commit 5003ac7fe ("MEDIUM: config: set useful ALPN defaults for HTTPS
and QUIC") revealed a build dependency bug: if QUIC is not enabled,
cfgparse doesn't have any dependency on the SSL stack, so the various
ifdefs that try to check special conditions such as rejecting an H2
config with too small a bufsize, are silently ignored. This was
detected because the default ALPN string was not set and caused the
alpn regtest to fail without QUIC support. Adding openssl-compat to
the list of includes seems to be sufficient to have what we need.
It's unclear when this dependency was broken, it seems that even 2.2
didn't have an explicit dependency on anything SSL-related, though it
could have been inherited through other files (as happens with QUIC
here). It would be safe to backport it to all stable branches. The
impact is very low anyway.
The following commit introduced a regression :
commit 1a5cc19cec
MINOR: quic: adjust Rx packet type parsing
Since this commit, qv variable was left to NULL as version is stored
directly in quic_rx_packet instance. In most cases, this only causes
traces to skip version printing. However, qv is dereferenced when
sending a Retry which causes a segfault.
To fix this, simply remove qv variable and use pkt->version instead,
both for traces and send_retry() invocation.
This bug was detected thanks to QUIC interop runner. It can easily be
reproduced by using quic-force-retry on the bind line.
This must be backported up to 2.7.
This commit makes sure that if three is no "alpn", "npn" nor "no-alpn"
setting on a "bind" line which corresponds to an HTTPS or QUIC frontend,
we automatically turn on "h2,http/1.1" as an ALPN default for an HTTP
listener, and "h3" for a QUIC listener. This simplifies the configuration
for end users since they won't have to explicitly configure the ALPN
string to enable H2, considering that at the time of writing, HTTP/1.1
represents less than 7% of the traffic on large infrastructures. The
doc and regtests were updated. For more info, refer to the following
thread:
https://www.mail-archive.com/haproxy@formilux.org/msg43410.html
While it does not have any effect, it's better not to try to setup an
ALPN callback nor to try to lookup algorithms when the configured ALPN
string is empty as a result of "no-alpn" being used.
It's possible to replace a previously set ALPN but not to disable ALPN
if it was previously set. The new "no-alpn" setting allows to disable
a previously set ALPN setting by preparing an empty one that will be
replaced and freed when the config is validated.
On sending path, a pending error can be promoted to a terminal error at the
endpoint level (SE_FL_ERR_PENDING to SE_FL_ERROR). When this happens, we
must propagate the error on the SC to be able to handle it at the stream
level and eventually forward it to the other side.
Because of this bug, it is possible to freeze sessions, for instance on the
CLI.
It is a 2.8-specific issue. No backport needed.
When compiled in debug mode, HAProxy prints a debug message at the end of
the cli I/O handle. It is pretty annoying and useless because, we can active
applet traces. Thus, just remove it.
When compiled in debug mode, HAProxy prints a debug message at the beginning
of assign_server(). It is pretty annoying and useless because, in debug
mode, we can active stream traces. Thus, just remove it.
The commit 9704797fa ("BUG/MEDIUM: http-ana: Properly switch the request in
tunnel mode on upgrade") fixes the switch in TUNNEL mode, but only
partially. Because both channels are switch in TUNNEL mode in same time on
one side, the channel's analyzers on the opposite side are not updated
accordingly. This prevents the tunnel timeout to be applied.
So instead of updating both sides in same time, we only force the analysis
on the other side by setting CF_WAKE_ONCE flag when a channel is switched in
TUNNEL mode. In addition, we must take care to forward all data if there is
no DATAa TCP filters registered.
This patch is related to the issue #2125. It is 2.8-specific. No backport
needed.
Remove the receiver RX_F_LOCAL_ACCEPT flag. This was used by QUIC
protocol before thread rebinding was supported by the quic_conn layer.
This should be backported up to 2.7 after the previous patch has also
been taken.
Before this patch, QUIC protocol used a custom add_listener callback.
This was because a quic_conn instance was allocated before accept. Its
thread affinity was fixed and could not be changed after. The thread was
derived itself from the CID selected by the client which prevent an even
repartition of QUIC connections on multiple threads.
A series of patches was introduced with a lot of changes. The most
important ones :
* removal of affinity between an encoded CID and a thread
* possibility to rebind a quic_conn on a new thread
Thanks to this, it's possible to suppress the custom add_listener
callback. Accept is conducted for QUIC protocol as with the others. A
less loaded thread is selected on listener_accept() and the connection
stack is bind on it. This operation implies that quic_conn instance is
moved to the new thread using the set_affinity QUIC protocol callback.
To reactivate quic_conn instance after thread rebind,
qc_finalize_affinity_rebind() is called after accept on the new thread
by qc_xprt_start() through accept_queue_process() / session_accept_fd().
This should be backported up to 2.7 after a period of observation.
When a quic_conn instance is rebinded on a new thread its tasks and
tasklet are destroyed and new ones created. Its socket is also migrated
to a new thread which stop reception on it.
To properly reactivate a quic_conn after rebind, wake up its tasks and
tasklet if they were active before thread rebind. Also reactivate
reading on the socket FD. These operations are implemented on a new
function qc_finalize_affinity_rebind().
This should be backported up to 2.7 after a period of observation.
qc_set_timer() function is used to rearm the timer for loss detection
and probing. Previously, timer was always rearm when congestion window
was free due to a wrong interpretation of the RFC which mandates the
client to rearm the timer before handshake completion to avoid a
deadlock related to anti-amplification.
Fix this by removing this code from quic_pto_pktns(). This allows
qc_set_timer() to be reentrant and only activate the timer if needed.
The impact of this bug seems limited. It can probably caused the timer
task to be processed too frequently which could caused too frequent
probing.
This change will allow to reuse easily qc_set_timer() after quic_conn
thread migration. As such, the new timer task will be scheduled only if
needed.
This should be backported up to 2.6.
Implement a new function qc_set_tid_affinity(). This function is
responsible to rebind a quic_conn instance to a new thread.
This operation consists mostly of releasing existing tasks and tasklet
and allocating new instances on the new thread. If the quic_conn uses
its owned socket, it is also migrated to the new thread. The migration
is finally completed with updated the CID TID to the new thread. After
this step, the connection is thus accessible to the new thread and
cannot be access anymore on the old one without risking race condition.
To ensure rebinding is either done completely or not at all, tasks and
tasklet are pre-allocated before all operations. If this fails, an error
is returned and rebiding is not done.
To destroy the older tasklet, its context is set to NULL before wake up.
In I/O callbacks, a new function qc_process() is used to check context
and free the tasklet if NULL.
The thread rebinding can cause a race condition if the older thread
quic_dghdlrs::dgrams list contains datagram for the connection after
rebinding is done. To prevent this, quic_rx_pkt_retrieve_conn() always
check if the packet CID is still associated to the current thread or
not. In the latter case, no connection is returned and the new thread is
returned to allow to redispatch the datagram to the new thread in a
thread-safe way.
This should be backported up to 2.7 after a period of observation.
When QUIC handshake is completed on our side, some frames are prepared
to be sent :
* HANDSHAKE_DONE
* several NEW_CONNECTION_ID with CIDs allocated
This step was previously executed in quic_conn_io_cb() directly after
CRYPTO frames parsing. This patch delays it to be completed after
accept. Special care have been taken to ensure it is still functional
with 0-RTT activated.
For the moment, this patch should have no impact. However, when
quic_conn thread migration on accept will be implemented, it will be
easier to remap only one CID to the new thread. New CIDs will be
allocated after migration on the new thread.
This should be backported up to 2.7 after a period of observation.
Define a new protocol callback set_affinity. This function is used
during listener_accept() to notify about a rebind on a new thread just
before pushing the connection on the selected thread queue. If the
callback fails, accept is done locally.
This change will be useful for protocols with state allocated before
accept is done. For the moment, only QUIC protocol is concerned. This
will allow to rebind the quic_conn to a new thread depending on its
load.
This should be backported up to 2.7 after a period of observation.
Each quic_conn is inserted in an accept queue to allocate the upper
layers. This is done through a listener tasklet in
quic_sock_accept_conn().
This patch interrupts the accept process for a quic_conn in
closing/draining state. Indeed, this connection will soon be closed so
it's unnecessary to allocate a complete stack for it.
This patch will become necessary when thread migration is implemented.
Indeed, it won't be allowed to proceed to thread migration for a closing
quic_conn.
This should be backported up to 2.7 after a period of observation.
TID encoding in CID was removed by a recent change. It is now possible
to access to the <tid> member stored in quic_connection_id instance.
For unknown CID, a quick solution was to redispatch to the thread
corresponding to the first CID byte. This ensures that an identical CID
will always be handled by the same thread to avoid creating multiple
same connection. However, this forces an uneven load repartition which
can be critical for QUIC handshake operation.
To improve this, remove the above constraint. An unknown CID is now
handled by its receiving thread. However, this means that if multiple
packets are received with the same unknown CID, several threads will try
to allocate the same connection.
To prevent this race condition, CID insertion in global tree is now
conducted first before creating the connection. This is a thread-safe
operation which can only be executed by a single thread. The thread
which have inserted the CID will then proceed to quic_conn allocation.
Other threads won't be able to insert the same CID : this will stop the
treatment of the current packet which is redispatch to the now owning
thread.
This should be backported up to 2.7 after a period of observation.
CIDs were moved from a per-thread list to a global list instance. The
TID-encoded is thus non needed anymore.
This should be backported up to 2.7 after a period of observation.
Previously, quic_connection_id were stored in a per-thread tree list.
Datagram were first dispatched to the correct thread using the encoded
TID before a tree lookup was done.
Remove these trees and replace it with a global trees list of 256
entries. A CID is using the list index corresponding to its first byte.
On datagram dispatch, CID is lookup on its tree and TID is retrieved
using new member quic_connection_id.tid. As such, a read-write lock
protects each list instances. With 256 entries, it is expected that
contention should be reduced.
A new structure quic_cid_tree served as a tree container associated with
its read-write lock. An API is implemented to ensure lock safety for
insert/lookup/delete operation.
This patch is a step forward to be able to break the affinity between a
CID and a TID encoded thread. This is required to be able to migrate a
quic_conn after accept to select thread based on their load.
This should be backported up to 2.7 after a period of observation.
Remove <tid> member in quic_conn. This is moved to quic_connection_id
instance.
For the moment, this change has no impact. Indeed, qc.tid reference
could easily be replaced by tid as all of this work was already done on
the connection thread. However, it is planified to support quic_conn
thread migration in the future, so removal of qc.tid will simplify this.
This should be backported up to 2.7.
ODCID are never stored in the CID tree. Instead, we store our generated
CID which is directly derived from the CID using a hash function. This
operation is done via quic_derive_cid().
Previously, generated CID was returned as a 64-bits integer. However,
this is cumbersome to convert as an array of bytes which is the most
common CID representation. Adjust this by modifying return type to a
quic_cid struct.
This should be backported up to 2.7.
qc_parse_hd_form() is the function used to parse the first byte of a
packet and return its type and version. Its API has been simplified with
the following changes :
* extra out paremeters are removed (long_header and version). All infos
are now stored directly in quic_rx_packet instance
* a new dummy version is declared in quic_versions array with a 0 number
code. This can be used to match Version negotiation packets.
* a new default packet type is defined QUIC_PACKET_TYPE_UNKNOWN to be
used as an initial value.
Also, the function has been exported to an include file. This will be
useful to be able to reuse on quic-sock to parse the first packet of a
datagram.
This should be backported up to 2.7.
Two different structs exists for QUIC connection ID :
* quic_connection_id which represents a full CID with its sequence
number
* quic_cid which is just a buffer with a length. It is contained in the
above structure.
To better differentiate them, rename all quic_connection_id variable
instances to "conn_id" by contrast to "cid" which is used for quic_cid.
This should be backported up to 2.7.
Remove quic_conn instance as first parameter of
quic_stateless_reset_token_init() and quic_stateless_reset_token_cpy()
functions. It was only used for trace purpose.
The main advantage is that it will be possible to allocate a QUIC CID
without a quic_conn instance using new_quic_cid() which is requires to
first check if a CID is existing before allocating a connection.
This should be backported up to 2.7.
QUIC_LOCK label is never used. Indeed, lock usage is minimal on QUIC as
every connection is pinned to its owned thread.
This should be backported up to 2.7.
Adjust BUG_ON() statement to allow tasklet_wakeup_after() for tasklets
with tid pinned to -1 (the current thread). This is similar to
tasklet_wakeup().
This should be backported up to 2.6.
For a long time the maximum number of concurrent streams was set once for
both sides (front and back) while the impacts are different. This commit
allows it to be configured separately for each side. The older settings
remains the fallback choice when other ones are not set.
For a long time the initial window size (per-stream size) was set once
for both directions, frontend and backend, resulting in a tradeoff between
upload speed and download fairness. This commit allows it to be configured
separately for each side. The older settings remains the fallback choice
when other ones are not set.
In the same way than for a stream-connector attached to a mux, an EOS is now
propagated from an applet to its stream-connector. To do so, sc_applet_eos()
function is added.
Now there is a SC flag to state the endpoint has reported an end-of-stream,
it is possible to distinguish an EOS from an abort at the stream-connector
level.
sc_conn_read0() function is renamed to sc_conn_eos() and it propagates an
EOS by setting SC_FL_EOS instead of SC_FL_ABRT_DONE. It only concernes
stream-connectors attached to a mux.
SC_FL_EOS flag is added to report the end-of-stream at the SC level. It will
be used to distinguish end of stream reported by the endoint, via the
SE_FL_EOS flag, and the abort triggered by the stream, via the
SC_FL_ABRT_DONE flag.
In this patch, the flag is defined and is systematically tested everywhere
SC_FL_ABRT_DONE is tested. It should be safe because it is never set.
In the syslog applet, when there is no output data, nothing is performed and
the applet leaves by requesting more data. But it is an issue because a
client abort is only handled if it reported with the last bytes of the
message. If the abort occurs after the message was handled, it is ignored.
The session remains opened and inactive until the client timeout is being
triggered. It no such timeout is configured, given that the default maxconn
is 10, all slots can be quickly busy and make the applet unresponsive.
To fix the issue, the best is to always try to read a message when the I/O
handle is called. This way, the abort can be handled. And if there is no
data, we leave as usual.
This patch should fix the issue #2112. It must be backported as far as 2.4.
Since the commit f2b02cfd9 ("MAJOR: http-ana: Review error handling during
HTTP payload forwarding"), during the payload forwarding, we are analyzing a
side, we stop to test the opposite side. It means when the HTTP request
forwarding analyzer is called, we no longer check the response side and vice
versa.
Unfortunately, since then, the HTTP tunneling is broken after a protocol
upgrade. On the response is switch in TUNNEL mode. The request remains in
DONE state. As a consequence, data received from the server are forwarded to
the client but not data received from the client.
To fix the bug, when both sides are in DONE state, both are switched in same
time in TUNNEL mode if it was requested. It is performed in the same way in
http_end_request() and http_end_response().
This patch should fix the issue #2125. It is 2.8-specific. No backport
needed.
As revealed by GH #2120 opened by @Tristan971, there are cases where ACKs
have to be sent without packet to acknowledge because the ACK timer has
been triggered and the connection needs to probe the peer at the same time.
Indeed
Thank you to @Tristan971 for having reported this issue.
Must be backported to 2.6 and 2.7.
It is the last commit on this subject. we stop to use SE_FL_ERROR flag from
the SC, except at the I/O level. Otherwise, we rely on SC_FL_ERROR
flag. Now, there should be a real separation between SE flags and SC flags.
In the same way the previous commit, we stop to use SE_FL_ERROR flag from
analyzers and their sub-functions. We now fully rely on SC_FL_ERROR to do so.
SE_FL_ERROR flag is no longer set when an error is detected durign the
connection establishment. SC_FL_ERROR flag is set instead. So it is safe to
remove test on SE_FL_ERROR to detect connection establishment error.
We can now fully rely on SC_FL_ERROR flag from the stream. The first step is
to stop to set the SE_FL_ERROR flag. Only endpoints are responsible to set
this flag. It was a design limitation. It is now fixed.
From the stream, when SE_FL_ERROR flag is tested, we now also test the
SC_FL_ERROR flag. Idea is to stop to rely on the SE descriptor to detect
errors.
The flag SC_FL_ERROR is added to ack errors on the endpoint. When
SE_FL_ERROR flag is detected on the SE descriptor, the corresponding is set
on the SC. Idea is to avoid, as far as possible, to manipulated the SE
descriptor in upper layers and know when an error in the endpoint is handled
by the SC.
For now, this flag is only set and cleared but never tested.
There is no reason to remove this flag. When the SC endpoint is reset, it is
replaced by a new one. The old one is released. It was useful when the new
endpoint inherited some flags from the old one. But it is no longer
performed. Thus there is no reason still unset this flag.
When an applet is woken up, before calling its io_handler, we pretend it has
no more data to deliver. So, after the io_handler execution, it is a bug if
an applet states it has more data to deliver while the end of input is
reached.
So a BUG_ON() is added to be sure it never happens.
It is not the SC responsibility to report errors on the SE descriptor. It is
the endpoint responsibility. It must switch SE_FL_ERR_PENDING into
SE_FL_ERROR if the end of stream was detected. It can even be considered as
a bug if it is not done by he endpoint.
So now, on sending path, a BUG_ON() is added to abort if SE_FL_EOS and
SE_FL_ERR_PENDING flags are set but not SE_FL_ERROR. It is trully important
to handle this case in the endpoint to be able to properly shut the endpoint
down.
SE_FL_ERR_PENDING is used to report an error on the write side. But it is
not a terminal error. Some incoming data may still be available. In the cli
analyzers, it is important to not close the stream when this flag is
set. Otherwise the response to a command can be truncated. It is probably
hard to observe. But it remains a bug.
While this patch could be backported to 2.7, there is no real reason to do
so, except if someone reports a bug about truncated responses.
Here again, it is just a flag renaming. In SC flags, there is no longer
shutdown for reads but aborts. For now this flag is set when a read0 is
detected. It is of couse not accurate. This will be changed later.
After the flag renaming, it is now the turn for the channel function to be
renamed and moved in the SC scope. channel_shutw_now() is replaced by
sc_schedule_shutdown(). The request channel is replaced by the front SC and
the response is replace by the back SC.
Because shutowns for reads are now considered as aborts, the shudowns for
writes can now be considered as shutdowns. Here it is just a flag
renaming. SC_FL_SHUTW_NOW is renamed SC_FL_SHUT_WANTED.
After the flag renaming, it is now the turn for the channel function to be
renamed and moved in the SC scope. channel_shutr_now() is replaced by
sc_schedule_abort(). The request channel is replaced by the front SC and the
response is replace by the back SC.
Most of calls to channel_abort() are associated to a call to
channel_auto_close(). Others are in areas where the auto close is the
default. So, it is now systematically enabled when an abort is performed on
a channel, as part of channel_abort() function.
First, it is useless to abort the both channel explicitly. For HTTP streams,
http_reply_and_close() is called. This function already take care to abort
processing. For TCP streams, we can rely on stream_retnclose().
To set termination flags, we can also rely on http_set_term_flags() for HTTP
streams and sess_set_term_flags() for TCP streams. Thus no reason to handle
them by hand.
At the end, the error handling after filters evaluation is now quite simple.
We erroneously though that an attempt to receive data was not possible if the SC
was waiting for more room in the channel buffer. A BUG_ON() was added to detect
bugs. And in fact, it is possible.
The regression was added in commit 341a5783b ("BUG/MEDIUM: stconn: stop to
enable/disable reads from streams via si_update_rx").
This patch should fix the issue #2115. It must be backported if the commit
above is backported.
A regression was introduced when stream's timeouts were refactored. Write
timeouts are not testing is the right order. When timeous of the front SC
are handled, we must then test the read timeout on the request channel and
the write timeout on the response channel. But write timeout is tested on
the request channel instead. On the back SC, the same mix-up is performed.
We must be careful to handle timeouts before checking channel flags. To
avoid any confusions, all timeuts are handled first, on front and back SCs.
Then flags of the both channels are tested.
It is a 2.8-specific issue. No backport needed.
There is a bug at begining of process_stream(). The SE_FL_ERROR flag is
tested against backend stream-connector's flags instead of its SE
descriptor's flags. It is an old typo, introduced when the stream-interfaces
were replaced by the conn-streams.
This patch must be backported as far as 2.6.
This bug arrived with this commit:
MEDIUM: quic: Ack delay implementation
After having probed the Handshake packet number space, one must not select the
Application encryption level to continue trying building packets as this is done
when the connection is not probing. Indeed, if the ACK timer has been triggered
in the meantime, the packet builder will try to build a packet at the Application
encryption level to acknowledge the received packet. But there is very often
no 01RTT packet to acknowledge when the connection is probing before the
handshake is completed. This triggers a BUG_ON() in qc_do_build_pkt() which
checks that the tree of ACK ranges to be used is not empty.
Thank you to @Tristan971 for having reported this issue in GH #2109.
Must be backported to 2.6 and 2.7.
qel->pktns->tx.pto_probe is set to 0 after having prepared a probing
datagram. There is no reason to check this parameter. Furthermore
it is always 0 when the connection does not probe the peer.
Must be backported to 2.6 and 2.7.
As reported by @Tristan971 in GH #2116, the congestion control window could be zero
due to an inversion in the code about the reduction factor to be applied.
On a new loss event, it must be applied to the slow start threshold and the window
should never be below ->min_cwnd (2*max_udp_payload_sz).
Same issue in both newReno and cubic algorithm. Furthermore in newReno, only the
threshold was decremented.
Must be backported to 2.6 and 2.7.
Add two missing checks not to substract too big values from another too little
one. In this case the resulted wrapped huge values could be passed to the function
which has to remove the last range of a tree of ACK ranges as encoded limit size
not to go below, cancelling the ACK ranges deletion. The consequence could be that
no ACK were sent.
Must be backported to 2.6 and 2.7.
qc_may_build_pkt() has been modified several times regardless of the conditions
the functions it is supposed to allow to send packets (qc_build_pkt()/qc_do_build_pkt())
really use to finally send packets just after having received others, leading
to contraditions and possible very long loops sending empty packets (PADDING only packets)
because qc_may_build_pkt() could allow qc_build_pkt()/qc_do_build_pkt to build packet,
and the latter did nothing except sending PADDING frames, because from its point
of view they had nothing to send.
For now on, this is the job of qc_may_build_pkt() to decide to if there is
packets to send just after having received others AND to provide this information
to the qc_build_pkt()/qc_do_build_pkt()
Note that the unique case where the acknowledgements are completely ignored is
when the endpoint must probe. But at least this is when sending at most two datagrams!
This commit also fixes the issue reported by Willy about a very low throughput
performance when the client serialized its requests.
Must be backported to 2.7 and 2.6.
This should help in diagnosing issues.
Some adjustments have to be done to avoid deferencing a quic_conn objects from
TRACE_*() calls.
Must be backported to 2.7 and 2.6.
Do not ignore very short RTTs (less than 1ms) before computing the smoothed
RTT initializing it to an "infinite" value (UINT_MAX).
Must be backported to 2.7 and 2.6.
Add the number of packet losts and the maximum congestion control window computed
by the algorithms to "show quic".
Same thing for the traces of existent congestion control algorithms.
Must be backported to 2.7 and 2.6.
In fd_update_events(), we loop until there's no bit in the running_mask
that is not in the thread_mask. Problem is, the thread sets its
running_mask bit before that loop, and so if 2 threads do the same, and
a 3rd one just closes the FD and sets the thread_mask to 0, then
running_mask will always be non-zero, and we will loop forever. This is
trivial to reproduce when using a DNS resolver that will just answer
"port unreachable", but could theoretically happen with other types of
file descriptors too.
To fix that, just don't bother looping if we're no longer in the
thread_mask, if that happens we know we won't have to take care of the
FD, anyway.
This should be backported to 2.7, 2.6 and 2.5.
Setting "shards by-group" will create one shard per thread group. This
can often be a reasonable tradeoff between a single one that can be
suboptimal on CPUs with many cores, and too many that will eat a lot
of file descriptors. It was shown to provide good results on a 224
thread machine, with a distribution that was even smoother than the
system's since here it can take into account the number of connections
per thread in the group. Depending on how popular it becomes, it could
even become the default setting in a future version.
Instead of artificially setting the shards count to MAX_THREAD when
"by-thread" is used, let's reserve special values for symbolic names
so that we can add more in the future. For now we use value -1 for
"by-thread", which requires to turn the type to signed int but it was
already used as such everywhere anyway.
fd_migrate_on() can be used to migrate an existing FD to any thread, even
one belonging to a different group from the current one and from the
caller's. All that is needed is to make sure the FD is still valid when
the operation is performed (which is the case when such operations happen).
This is potentially slightly expensive since it locks the tgid during the
delicate operation, but it is normally performed only from an owning
thread to offer the FD to another one (e.g. reassign a better thread upon
accept()).
We're only checking for 0, 1, or >1 groups enabled there, and we'll soon
need to be more precise and know quickly which groups are non-empty.
Let's just replace the count with a mask of enabled groups. This will
allow to quickly spot the presence of any such group in a set.
Alert when the len argument of a stick table type contains incorrect
characters.
Replace atol by strtol.
Could be backported in every maintained versions.
It was missing from the output but is sometimes convenient to observe
and understand how incoming connections are distributed. The CPU usage
is reported as the instant measurement of 100-idle_pct for each thread,
and the average value is shown for the aggregated value.
This could be backported as it's helpful in certain troublehsooting
sessions.
As the ACK frames are not added to the packet list of ack-eliciting frames,
it could not be traced. But there is a flag to identify such packet.
Let's use it to add this information to the traces of TX packets.
Must be backported to 2.6 and 2.7.
Dump at proto level the packet information when its header protection was removed.
Remove no more use qpkt_trace variable.
Must be backported to 2.7 and 2.6.
It is possible that the handshake was not confirmed and there was no more
packet in flight to probe with. It this case the server must wait for
the client to be unblocked without probing any packet number space contrary
to what was revealed by interop tests as follows:
[01|quic|2|uic_loss.c:65] TX loss pktns : qc@0x7fac301cd390 pktns=I pp=0
[01|quic|2|uic_loss.c:67] TX loss pktns : qc@0x7fac301cd390 pktns=H pp=0 tole=-102ms
[01|quic|2|uic_loss.c:67] TX loss pktns : qc@0x7fac301cd390 pktns=01RTT pp=0 if=1054 tole=-1987ms
[01|quic|5|uic_loss.c:73] quic_loss_pktns(): leaving : qc@0x7fac301cd390
[01|quic|5|uic_loss.c:91] quic_pto_pktns(): entering : qc@0x7fac301cd390
[01|quic|3|ic_loss.c:121] TX PTO handshake not already completed : qc@0x7fac301cd390
[01|quic|2|ic_loss.c:141] TX PTO : qc@0x7fac301cd390 pktns=I pp=0 dur=83ms
[01|quic|5|ic_loss.c:142] quic_pto_pktns(): leaving : qc@0x7fac301cd390
[01|quic|3|c_conn.c:5179] needs to probe Initial packet number space : qc@0x7fac301cd390
This bug was not visible before this commit:
BUG/MINOR: quic: wake up MUX on probing only for 01RTT
This means that before it, one could do bad things (probing the 01RTT packet number
space before the handshake was confirmed).
Must be backported to 2.7 and 2.6.
When end-of-stream is reported by a H2 stream, we must take care to also
report an error is end-of-input was not reported. Indeed, it is now
mandatory to set SE_FL_EOI or SE_FL_ERROR flags when SE_FL_EOS is set.
It is a 2.8-specific issue. No backport needed.
When TCP connection is first upgrade to H1 then to H2, the stream-connector,
created by the PT mux, must be destroyed because the H2 mux cannot inherit
from it. When it is performed, the SE_FL_EOS flag is set but SE_FL_EOI must
also be set. It is now required to never set SE_FL_EOS without SE_FL_EOI or
SE_FL_ERROR.
It is a 2.8-specific issue. No backport needed.
Timeouts for dynamic resolutions are not handled at the stream level but by
the resolvers themself. It means there is no connect, client and server
timeouts defined on the internal proxy used by a resolver.
While it is not an issue for DNS resolution over UDP, it can be a problem
for resolution over TCP. New sessions are automatically created when
required, and killed on excess. But only established connections are
considered. Connecting ones are never killed. Because there is no conncet
timeout, we rely on the kernel to report a connection error. And this may be
quite long.
Because resolutions are periodically triggered, this may lead to an excess
of unusable sessions in connecting state. This also prevents HAProxy to
quickly exit on soft-stop. It is annoying, especially because there is no
reason to not set a connect timeout.
So to mitigate the issue, we now use the "resolve" timeout as connect
timeout for the internal proxy attached to a resolver.
This patch should be backported as far as 2.4.
Thanks to previous commit ("BUG/MEDIUM: dns: Kill idle DNS sessions during
stopping stage"), DNS idle sessions are killed on stopping staged. But the
task responsible to kill these sessions is running every 5 seconds. It
means, when HAProxy is stopped, we can observe a delay before the process
exits.
To reduce this delay, when the resolvers task is executed, all DNS idle
tasks are woken up.
This patch must be backported as far as 2.6.
There is no server timeout for DNS sessions over TCP. It means idle session
cannot be killed by itself. There is a task running peridically, every 5s,
to kill the excess of idle sessions. But the last one is never
killed. During the stopping stage, it is an issue since the dynamic
resolutions are no longer performed (2ec6f14c "BUG/MEDIUM: resolvers:
Properly stop server resolutions on soft-stop").
Before the above commit, during stopping stage, the DNS sessions were killed
when a resolution was triggered. Now, nothing kills these sessions. This
prevents the process to finish on soft-stop.
To fix this bug, the task killing excess of idle sessions now kill all idle
sessions on stopping stage.
This patch must be backported as far as 2.6.
When the log applet is executed while a shut is pending, the remaining
output data must always be consumed. Otherwise, this can prevent the stream
to exit, leading to a spinning loop on the applet.
It is 2.8-specific. No backport needed.
When the stats applet is executed while a shut is pending, the remaining
output data must always be consumed. Otherwise, this can prevent the stream
to exit, leading to a spinning loop on the applet.
It is 2.8-specific. No backport needed.
When the http-client applet is executed while a shut is pending, the
remaining output data must always be consumed. Otherwise, this can prevent
the stream to exit, leading to a spinning loop on the applet.
It is 2.8-specific. No backport needed.
When the cli applet is executed while a shut is pending, the remaining
output data must always be consumed. Otherwise, this can prevent the stream
to exit, leading to a spinning loop on the applet.
This patch should fix the issue #2107. It is 2.8-specific. No backport
needed.
An applet must never set SE_FL_EOS flag without SE_FL_EOI or SE_FL_ERROR
flags. Here, SE_FL_EOI flag was missing for "quit" or "_getsocks"
commands. Indeed, these commands are terminal.
This bug triggers a BUG_ON() recently added.
This patch is related to the issue #2107. It is 2.8-specific. No backport
needed.
When calls to strcpy() were replaced with calls to strlcpy2(), one of them
was replaced wrong, and the source and size were inverted. Correct that.
This should fix issue #2110.
These ones are genenerally harmless on modern compilers because the
compiler checks them. While gcc optimizes them away without even
referencing strcpy(), clang prefers to call strcpy(). Nevertheless they
prevent from enabling stricter checks so better remove them altogether.
They were all replaced by strlcpy2() and the size of the destination
which is always known there.
strcpy() is quite nasty but tolerable to copy constants, but here
it copies a variable path into a node in a code path that's not
trivial to follow given that it takes the node as the result of
a tree lookup. Let's get rid of it and mention where the entry
is retrieved.
As every time strncat() is used, it's wrong, and this one is no exception.
Users often think that the length applies to the destination except it
applies to the source and makes it hard to use correctly. The bug did not
have an impact because the length was preallocated from the sum of all
the individual lengths as measured by strlen() so there was no chance one
of them would change in between. But it could change in the future. Let's
fix it to use memcpy() instead for strings, or byte copies for delimiters.
No backport is needed, though it can be done if it helps to apply other
fixes.
Add code so that compression can be used for requests as well.
New compression keywords are introduced :
"direction" that specifies what we want to compress. Valid values are
"request", "response", or "both".
"type-req" and "type-res" define content-type to be compressed for
requests and responses, respectively. "type" is kept as an alias for
"type-res" for backward compatibilty.
"algo-req" specifies the compression algorithm to be used for requests.
Only one algorithm can be provided.
"algo-res" provides the list of algorithm that can be used to compress
responses. "algo" is kept as an alias for "algo-res" for backward
compatibility.
Make provision for being able to store both compression algorithms and
content-types to compress for both requests and responses. For now only
the responses one are used.
Make provision for storing the compression algorithm and the compression
context twice, one for requests, and the other for responses. Only the
response ones are used for now.
When a trace message for an applet is dumped, if the SC exists, the stream
always exists too. There is no way to attached an applet to a health-check.
So, we can use the unsafe version __sc_strm() to get the stream.
This patch is related to #2106. Not sure it will be enough for
Coverity. However, there is no bug here.
On startup/reload, startup_logs_init() will try to export startup logs shm
filedescriptor through the internal HAPROXY_STARTUPLOGS_FD env variable.
While memprintf() is used to prepare the string to be exported via
setenv(), str_fd argument (first argument passed to memprintf()) could
be non NULL as a result of HAPROXY_STARTUPLOGS_FD env variable being
already set.
Indeed: str_fd is already used earlier in the function to store the result
of getenv("HAPROXY_STARTUPLOGS_FD").
The issue here is that memprintf() is designed to free the 'out' argument
if out != NULL, and here we don't expect str_fd to be freed since it was
provided by getenv() and would result in memory violation.
To prevent any invalid free, we must ensure that str_fd is set to NULL
prior to calling memprintf().
This must be backported in 2.7 with eba6a54cd4 ("MINOR: logs: startup-logs
can use a shm for logging the reload")
People who use HAProxy as a process 1 in containers sometimes starts
other things from the program section. This is still not recommend as
the master process has minimal features regarding process management.
Environment variables are still inherited, even internal ones.
Since 2.7, it could provoke a crash when inheriting the
HAPROXY_STARTUPLOGS_FD variable.
Note: for future releases it should be better to clean the env and sets
a list of variable to be exported. We need to determine which variables
are used by users before.
Must be backported in 2.7.
Previously, ODCID were concatenated with the client address. This was
done to prevent a collision between two endpoints which used the same
ODCID.
Thanks to the two previous patches, first connection generated CID is
now directly derived from the client ODCID using a hash function which
uses the client source address from the same purpose. Thus, it is now
unneeded to concatenate client address to <odcid> quic-conn member.
This change allows to simplify the quic_cid structure management and
reduce its size which is important as it is embedded several times in
various structures such as quic_conn and quic_rx_packet.
This should be backported up to 2.7.
First connection CID generation has been altered. It is now directly
derived from client ODCID since previous commit :
commit 162baaff7a
MINOR: quic: derive first DCID from client ODCID
This patch removes the ODCID tree which is now unneeded. On connection
lookup via CID, if a DCID is not found the hash derivation is performed
for an INITIAL/0-RTT packet only. In case a client has used multiple
times an ODCID, this will allow to retrieve our generated DCID in the
CID tree without storing the ODCID node.
The impact of this two combined patch is that it may improve slightly
haproxy memory footprint by removing a tree node from quic_conn
structure. The cpu calculation induced by hash derivation should only be
performed only a few times per connection as the client will start to
use our generated CID as soon as it received it.
This should be backported up to 2.7.
Change the generation of the first CID of a connection. It is directly
derived from the client ODCID using a 64-bits hash function. Client
address is added to avoid collision between clients which could use the
same ODCID.
For the moment, this change as no functional impact. However, it will be
directly used for the next commit to be able to remove the ODCID tree.
This should be backported up to 2.7.
This is due to this commit:
MINOR: quic: Add trace to debug idle timer task issues
where has been added without having been tested at developer level.
<qc> was dereferenced after having been released by qc_conn_release().
Set qc to NULL value after having been released to forbid its dereferencing.
Add a check for qc->idle_timer_task in the traces added by the mentionned
commit above to prevent its dereferencing if NULL.
Take the opportunity of this patch to modify trace events from
QUIC_EV_CONN_SSLALERT to QUIC_EV_CONN_IDLE_TIMER.
Must be backported to 2.6 and 2.7.
The HTTP message must remains in BODY state during the analysis, to be able
to report accurate termination state in logs. It is also important to know
the HTTP analysis is still in progress. Thus, when we are waiting for the
message payload, the message is no longer switch to DATA state. This was
used to not process "Expect: " header at each evaluation. But thanks to the
previous patch, it is no long necessary.
This patch also fixes a bug in the lua filter api. Some functions must be
called during the message analysis and not during the payload forwarding. It
is not valid to try to manipulate headers during the forward stage because
headers are already forwarded. We rely on the message state to detect
errors. So the api was unusable if a "wait-for-body" action was used.
This patch shoud fix the issue #2093. It relies on the commit:
* MINOR: http-ana: Add a HTTP_MSGF flag to state the Expect header was checked
Both must be backported as far as 2.5.
HTTP_MSGF_EXPECT_CHECKED is now set on the request message to know the
"Expect: " header was already handled, if any. The flag is set from the
moment we try to handle the header to send a "100-continue" response,
whether it was found or not.
This way, when we are waiting for the request payload, thanks to this flag,
we only try to handle "Expect: " header only once. Before it was performed
by changing the message state from BODY to DATA. But this has some side
effects and it is no accurate. So, it is better to rely on a flag to do so.
Now that event_hdl api is properly implemented in hlua, we may add the
per-server event subscription in addition to the global event
subscription.
Per-server subscription allows to be notified for events related to
single server. It is useful to track a server UP/DOWN and DEL events.
It works exactly like core.event_sub() except that the subscription
will be performed within the server dedicated subscription list instead
of the global one.
The callback function will only be called for server events affecting
the server from which the subscription was performed.
Regarding the implementation, it is pretty trivial at this point, we add
more doc than code this time.
Usage examples have been added to the (lua) documentation.
Now that the event handler API is pretty mature, we can expose it in
the lua API.
Introducing the core.event_sub(<event_types>, <cb>) lua function that
takes an array of event types <event_types> as well as a callback
function <cb> as argument.
The function returns a subscription <sub> on success.
Subscription <sub> allows you to manage the subscription from anywhere
in the script.
To this day only the sub->unsub method is implemented.
The following event types are currently supported:
- "SERVER_ADD": when a server is added
- "SERVER_DEL": when a server is removed from haproxy
- "SERVER_DOWN": server states goes from up to down
- "SERVER_UP": server states goes from down to up
As for the <cb> function: it will be called when one of the registered
event types occur. The function will be called with 3 arguments:
cb(<event>,<data>,<sub>)
<event>: event type (string) that triggered the function.
(could be any of the types used in <event_types> when registering
the subscription)
<data>: data associated with the event (specific to each event family).
For "SERVER_" family events, server details such as server name/id/proxy
will be provided.
If the server still exists (not yet deleted), a reference to the live
server is provided to spare you from an additionnal lookup if you need
to have direct access to the server from lua.
<sub> refers to the subscription. In case you need to manage it from
within an event handler.
(It refers to the same subscription that the one returned from
core.event_sub())
Subscriptions are per-thread: the thread that will be handling the
event is the one who performed the subscription using
core.event_sub() function.
Each thread treats events sequentially, it means that if you have,
let's say SERVER_UP, then SERVER_DOWN in a short timelapse, then your
cb function will first be called with SERVER_UP, and once you're done
handling the event, your function will be called again with SERVER_DOWN.
This is to ensure event consitency when it comes to logging / triggering
logic from lua.
Your lua cb function may yield if needed, but you're pleased to process
the event as fast as possible to prevent the event queue from growing up
To prevent abuses, if the event queue for the current subscription goes
over 100 unconsumed events, the subscription will pause itself
automatically for as long as it takes for your handler to catch up.
This would lead to events being missed, so a warning will be emitted in
the logs to inform you about that. This is not something you want to let
happen too often, it may indicate that you subscribed to an event that
is occurring too frequently or/and that your callback function is too
slow to keep up the pace and you should review it.
If you want to do some parallel processing because your callback
functions are slow: you might want to create subtasks from lua using
core.register_task() from within your callback function to perform the
heavy job in a dedicated task and allow remaining events to be processed
more quickly.
Please check the lua documentation for more information.
Adding alternative findserver() functions to be able to perform an
unique match based on name or puid and by leveraging revision id (rid)
to make sure the function won't match with a new server reusing the
same name or puid of the "potentially deleted" server we were initially
looking for.
For example, if you were in the position of finding a server based on
a given name provided to you by a different context:
Since dynamic servers were implemented, between the time the name was
picked and the time you will perform the findserver() call some dynamic
server deletion/additions could've been performed in the mean time.
In such cases, findserver() could return a new server that re-uses the
name of a previously deleted server. Depending on your needs, it could
be perfectly fine, but there are some cases where you want to lookup
the original server that was provided to you (if it still exists).
While working on event handling from lua, the need for a pause/resume
function to temporarily disable a subscription was raised.
We solve this by introducing the EHDL_SUB_F_PAUSED flag for
subscriptions.
The flag is set via _pause() and cleared via _resume(), and it is
checked prior to notifying the subscription in publish function.
Pause and Resume functions are also available for via lookups for
identified subscriptions.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
Use event_hdl_async_equeue_size() in advanced async task handler to
get the near real-time event queue size.
By near real-time, you should understand that the queue size is not
updated during element insertion/removal, but shortly before insertion
and shortly after removal, so the size should reflect the approximate
queue size at a given time but should definitely not be used as a
unique source of truth.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
advanced async mode (EVENT_HDL_ASYNC_TASK) provided full support for
custom tasklets registration.
Due to the similarities between tasks and tasklets, it may be useful
to use the advanced mode with an existing task (not a tasklet).
While the API did not explicitly disallow this usage, things would
get bad if we try to wakeup a task using tasklet_wakeup() for notifying
the task about new events.
To make the API support both custom tasks and tasklets, we use the
TASK_IS_TASKLET() macro to call the proper waking function depending
on the task's type:
- For tasklets: we use tasklet_wakeup()
- For tasks: we use task_wakeup()
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
In _event_hdl_publish(), when publishing an event to async handler(s),
async_data is allocated only once and then relies on a refcount
logic to reuse the same data block for multiple async event handlers.
(this allows to save significant amount of memory)
Because the refcount is first set to 0, there is a small race where
the consumers could consume async data (async data refcount reaching 0)
before publishing is actually over.
The consequence is that async data may be freed by one of the consumers
while we still rely on it within _event_hdl_publish().
This was discovered by chance when stress-testing the API with multiple
async handlers registered to the same event: some of the handlers were
notified about a new event for which the event data was already freed,
resulting in invalid reads and/or segfaults.
To fix this, we first set the refcount to 1, assuming that the
publish function relies on async_data until the publish is over.
At the end of the publish, the reference to the async data is dropped.
This way, async_data is either freed by _event_hdl_publish() itself
or by one of the consumers, depending on who is the last one relying
on it.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
soft-stop was not explicitly handled in event_hdl API.
Because of this, event_hdl was causing some leaks on deinit paths.
Moreover, a task responsible for handling events could require some
additional cleanups (ie: advanced async task), and as the task was not
protected against abort when soft-stopping, such cleanup could not be
performed unless the task itself implements the required protections,
which is not optimal.
Consider this new approach:
'jobs' global variable is incremented whenever an async subscription is
created to prevent the related task from being aborted before the task
acknowledges the final END event.
Once the END event is acknowledged and freed by the task, the 'jobs'
variable is decremented, and the deinit process may continue (including
the abortion of remaining tasks not guarded by the 'jobs' variable).
To do this, a new global mt_list is required: known_event_hdl_sub_list
This list tracks the known (initialized) subscription lists within the
process.
sub_lists are automatically added to the "known" list when calling
event_hdl_sub_list_init(), and are removed from the list with
event_hdl_sub_list_destroy().
This allows us to implement a global thread-safe event_hdl deinit()
function that is automatically called on soft-stop thanks to signal(0).
When event_hdl deinit() is initiated, we simply iterate against the known
subscription lists to destroy them.
event_hdl_subscribe_ptr() was slightly modified to make sure that a sub_list
may not accept new subscriptions once it is destroyed (removed from the
known list)
This can occur between the time the soft-stop is initiated (signal(0)) and
haproxy actually enters in the deinit() function (once tasks are either
finished or aborted and other threads already joined).
It is safe to destroy() the subscription list multiple times as long
as the pointer is still valid (ie: first on soft-stop when handling
the '0' signal, then from regular deinit() path): the function does
nothing if the subscription list is already removed.
We partially reverted "BUG/MINOR: event_hdl: make event_hdl_subscribe thread-safe"
since we can use parent mt_list locking instead of a dedicated lock to make
the check gainst duplicate subscription ID.
(insert_lock is not useful anymore)
The check in itself is not changed, only the locking method.
sizeof(event_hdl_sub_list) slightly increases: from 24 bits to 32bits due
to the additional mt_list struct within it.
With that said, having thread-safe list to store known subscription lists
is a good thing: it could help to implement additional management
logic for subcription lists and could be useful to add some stats or
debugging tools in the future.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
event_hdl_sub_list_init() and event_hdl_sub_list_destroy() don't expect
to be called with a NULL argument (to use global subscription list
implicitly), simply because the global subscription list init and
destroy is internally managed.
Adding BUG_ON() to detect such invalid usages, and updating some comments
to prevent confusion around these functions.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
List insertion in event_hdl_subscribe() was not thread-safe when dealing
with unique identifiers. Indeed, in this case the list insertion is
conditional (we check for a duplicate, then we insert). And while we're
using mt lists for this, the whole operation is not atomic: there is a
race between the check and the insertion.
This could lead to the same ID being registered multiple times with
concurrent calls to event_hdl_subscribe() on the same ID.
To fix this, we add 'insert_lock' dedicated lock in the subscription
list struct. The lock's cost is nearly 0 since it is only used when
registering identified subscriptions and the lock window is very short:
we only guard the duplicate check and the list insertion to make the
conditional insertion "atomic" within a given subscription list.
This is the only place where we need the lock: as soon as the item is
properly inserted we're out of trouble because all other operations on
the list are already thread-safe thanks to mt lists.
A new lock hint is introduced: LOCK_EHDL which is dedicated to event_hdl
The patch may seem quite large since we had to rework the logic around
the subscribe function and switch from simple mt_list to a dedicated
struct wrapping both the mt_list and the insert_lock for the
event_hdl_sub_list type.
(sizeof(event_hdl_sub_list) is now 24 instead of 16)
However, all the changes are internal: we don't break the API.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
core.register_task(function) may now take up to 4 additional arguments
that will be passed as-is to the task function.
This could be convenient to spawn sub-tasks from existing functions
supporting core.register_task() without the need to use global
variables to pass some context to the newly created task function.
The new prototype is:
core.register_task(function[, arg1[, arg2[, ...[, arg4]]]])
Implementation remains backward-compatible with existing scripts.
Server revision ID was recently added to haproxy with 61e3894
("MINOR: server: add srv->rid (revision id) value")
Let's add it to the hlua server class.
Main lua lock is used at various places in the code.
Most of the time it is used from unprotected lua environments,
in which case the locking is mandatory.
But there are some cases where the lock is attempted from protected
lua environments, meaning that lock is already owned by the current
thread. Thus new locking attempt should be skipped to prevent any
deadlocks from occuring.
To address this, "already_safe" lock hint was implemented in
hlua_ctx_init() function with commit bf90ce1
("BUG/MEDIUM: lua: dead lock when Lua tasks are trigerred")
But this approach is not very safe, for 2 reasons:
First reason is that there are still some code paths that could lead
to deadlocks.
For instance, in register_task(), hlua_ctx_init() is called with
already_safe set to 1 to prevent deadlock from occuring.
But in case of task init failure, hlua_ctx_destroy() will be called
from the same environment (protected environment), and hlua_ctx_destroy()
does not offer the already_safe lock hint.. resulting in a deadlock.
Second reason is that already_safe hint is used to completely skip
SET_LJMP macros (which manipulates the lock internally), resulting
in some logics in the function being unprotected from lua aborts in
case of unexpected errors when manipulating the lua stack (the lock
does not protect against longjmps)
Instead of leaving the locking responsibility to the caller, which is
quite error prone since we must find out ourselves if we are or not in
a protected environment (and is not robust against code re-use),
we move the deadlock protection logic directly in hlua_lock() function.
Thanks to a thread-local lock hint, we can easily guess if the current
thread already owns the main lua lock, in which case the locking attempt
is skipped. The thread-local lock hint is implemented as a counter so
that the lock is properly dropped when the counter reaches 0.
(to match actual lock() and unlock() calls)
This commit depends on "MINOR: hlua: simplify lua locking"
It may be backported to every stable versions.
[prior to 2.5 lua filter API did not exist, filter-related parts
should be skipped]
The check on lua state==0 to know whether locking is required or not can
be performed in a locking wrapper to simplify things a bit and prevent
implementation errors.
Locking from hlua context should now be performed via hlua_lock(L) and
unlocking via hlua_unlock(L)
Using hlua_pushref() everywhere temporary lua objects are involved.
(ie: hlua_checkfunction(), hlua_checktable...)
Those references are expected to be cleared using hlua_unref() when
they are no longer used.
Using hlua_ref() everywhere temporary lua objects are involved.
Those references are expected to be cleared using hlua_unref()
when they are no longer used.
Several error paths were leaking function or table references.
(Obtained through hlua_checkfunction() and hlua_checktable() functions)
Now we properly release the references thanks to hlua_unref() in
such cases.
This commit depends on "MINOR: hlua: add simple hlua reference handling API"
This could be backported in every stable versions although it is not
mandatory as such leaks only occur on rare error/warn paths.
[prior to 2.5 lua filter API did not exist, the hlua_register_filter()
part should be skipped]
hlua init function references were not released during
hlua_post_init_state().
Hopefully, this function is only used during startup so the resulting
leak is not a big deal.
Since each init lua function runs precisely once, it is safe to release
the ref as soon as the function is restored on the stack.
This could be backported to every stable versions.
Please note that this commit depends on "MINOR: hlua: add simple hlua reference handling API"
In core.register_task(): we take a reference to the function passed as
argument in order to push it in the new coroutine substack.
However, once pushed in the substack: the reference is not useful
anymore and should be cleared.
Currently, this is not the case in hlua_register_task().
Explicitly dropping the reference once the function is pushed to the
coroutine's stack to prevent any reference leak (which could contribute
to resource shortage)
This may be backported to every stable versions.
Please note that this commit depends on "MINOR: hlua: add simple hlua reference handling API"
hlua_checktable() and hlua_checkfunction() both return the raw
value of luaL_ref() function call.
As luaL_ref() returns a signed int, both functions should return a signed
int as well to prevent any misuse of the returned reference value.
We're doing this in an attempt to simplify temporary lua objects
references handling.
Adding the new hlua_unref() function to release lua object references
created using luaL_ref(, LUA_REGISTRYINDEX)
(ie: hlua_checkfunction() and hlua_checktable())
Failure to release unused object reference prevents the reference index
from being re-used and prevents the referred ressource from being garbage
collected.
Adding hlua_pushref(L, ref) to replace
lua_rawgeti(L, LUA_REGISTRYINDEX, ref)
Adding hlua_ref(L) to replace luaL_ref(L, LUA_REGISTRYINDEX)
The comment for the hlua_ctx_destroy() function states that the "lua"
struct is not freed.
This is not true anymore since 2c8b54e7 ("MEDIUM: lua: remove Lua struct
from session, and allocate it with memory pools")
Updating the function comment to properly report the actual behavior.
This could be backported in every stable versions with 2c8b54e7
("MEDIUM: lua: remove Lua struct from session, and allocate it with memory pools")
Since ("MINOR: hlua_fcn: alternative to old proxy and server attributes"):
- s->name(), s->puid() are superseded by s->get_name() and s->get_puid()
- px->name(), px->uuid() are superseded by px->get_name() and
px->get_uuid()
And considering this is now the proper way to retrieve proxy name/uuid
and server name/puid from lua:
We're now removing such legacy attributes, but for retro-compatibility
purposes we will be emulating them and warning the user for some time
before completely dropping their support.
To do this, we first remove old legacy code.
Then we move server and proxy methods out of the metatable to allow
direct elements access without systematically involving the "__index"
metamethod.
This allows us to involve the "__index" metamethod only when the requested
key is missing from the table.
Then we define relevant hlua_proxy_index and hlua_server_index functions
that will be used as the "__index" metamethod to respectively handle
"name, uuid" (proxy) or "name, puid" (server) keys, in which case we
warn the user about the need to use the new getter function instead the
legacy attribute (to prepare for the potential upcoming removal), and we
call the getter function to return the value as if the getter function
was directly called from the script.
Note: Using the legacy variables instead of the getter functions results
in a slight overhead due to the "__index" metamethod indirection, thus
it is recommended to switch to the getter functions right away.
With this commit we're also adding a deprecation notice about legacy
attributes.
This patch proposes to enumerate servers using internal HAProxy list.
Also, remove the flag SRV_F_NON_PURGEABLE which makes the server non
purgeable each time Lua uses the server.
Removing reg-tests/cli_delete_server_lua.vtc since this test is no
longer relevant (we don't set the SRV_F_NON_PURGEABLE flag anymore)
and we already have a more generic test:
reg-tests/server/cli_delete_server.vtc
Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>
This patch adds new lua methods:
- "Proxy.get_uuid()"
- "Proxy.get_name()"
- "Server.get_puid()"
- "Server.get_name()"
These methods will be equivalent to their old analog Proxy.{uuid,name}
and Server.{puid,name} attributes, but this will be the new preferred
way to fetch such infos as it duplicates memory only when necessary and
thus reduce the overall lua Server/Proxy objects memory footprint.
Legacy attributes (now superseded by the explicit getters) are expected
to be removed some day.
Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>
When HAproxy is loaded with a lot of frontends/backends (tested with 300k),
it is slow to start and it uses a lot of memory just for indexing backends
in the lua tables.
This patch uses the internal frontend/backend index of HAProxy in place of
lua table.
HAProxy startup is now quicker as each frontend/backend object is created
on demand and not at init.
This has to come with some cost: the execution of Lua will be a little bit
slower.
Two lua init function seems to return something useful, but it
is not the case. The function "hlua_concat_init" seems to return
a failure status, but the function never fails. The function
"hlua_fcn_reg_core_fcn" seems to return a number of elements in
the stack, but it is not the case.
register_{init, converters, fetches, action, service, cli, filter} are
meant to run exclusively from body context according to the
documentation (unlike register_task which is designed to work from both
init and runtime contexts)
A quick code inspection confirms that only register_task implements
the required precautions to make it safe out of init context.
Trying to use those register_* functions from a runtime lua task will
lead to a program crash since they all assume that they are running from
the main lua context and with no concurrent runs:
core.register_task(function()
core.register_init(function()
end)
end)
When loaded from the config, the above example would segfault.
To prevent this undefined behavior, we now report an explicit error if
the user tries to use such functions outside of init/body context.
This should be backported in every stable versions.
[prior to 2.5 lua filter API did not exist, the hlua_register_filter()
part should be skipped]
In hlua_process_task: when HLUA_E_ETMOUT was returned by
hlua_ctx_resume(), meaning that the lua task reached
tune.lua.task-timeout (default: none),
we logged "Lua task: unknown error." before stopping the task.
Now we properly handle HLUA_E_ETMOUT to report a meaningful error
message.
In function hlua_hook, a yieldk is performed when function is yieldable.
But the following code in that function seems to assume that the yield
never returns, which is not the case!
Moreover, Lua documentation says that in this situation the yieldk call
must immediately be followed by a return.
This patch adds a return statement after the yieldk call.
It also adds some comments and removes a needless lua_sethook call.
It could be backported to all stable versions, but it is not mandatory,
because even if it is undefined behavior this bug doesn't seem to
negatively affect lua 5.3/5.4 stacks.
srv_drop() function is reponsible for freeing the server when the
refcount reaches 0.
There is one exception: when global.mode has the MODE_STOPPING flag set,
srv_drop() will ignore the refcount and free the server on first
invocation.
This logic has been implemented with 13f2e2ce ("BUG/MINOR: server: do
not use refcount in free_server in stopping mode") and back then doing
so was not a problem since dynamic server API was just implemented and
srv_take() and srv_drop() were not widely used.
Now that dynamic server API is starting to get more popular we cannot
afford to keep the current logic: some modules or lua scripts may hold
references to existing server and also do their cleanup in deinit phases
In this kind of situation, it would be easy to trigger double-frees
since every call to srv_drop() on a specific server will try to free it.
To fix this, we take a different approach and try to fix the issue at
the source: we now properly drop server references involved with
checks/agent_checks in deinit_srv_check() and deinit_srv_agent_check().
While this could theorically be backported up to 2.6, it is not very
relevant for now since srv_drop() usage in older versions is very
limited and we're only starting to face the issue in mid 2.8
developments. (ie: lua core updates)
In srv_drop(), we only call the ssl->destroy_srv() method on
specific conditions.
But this has two downsides:
First, destroy_srv() is reponsible for freeing data that may have been
allocated in prepare_srv(), but not exclusively: it also frees
ssl-related parameters allocated when parsing a server entry, such as
ca-file for instance.
So this is quite error-prone, we could easily miss a condition where
some data needs to be deallocated using destroy_srv() even if
prepare_srv() was not used (since prepare_srv() is also conditional),
thus resulting in memory leaks.
Moreover, depending on srv->proxy to guard the check is probably not
a good idea here, since srv_drop() could be called in late de-init paths
in which related proxy could be freed already. srv_drop() should only
take care of freeing local server data without external logic.
Thankfully, destroy_srv() function performs the necessary checks to
ensure that a systematic call to the function won't result in invalid
reads or double frees.
No backport needed.
Proxies belonging to the cfg_log_forward proxy list are not cleaned up
in haproxy deinit() function.
We add the missing cleanup directly in the main deinit() function since
no other specific function may be used for this.
This could be backported up to 2.4
When a ring section is configured, a new sink is created and forward_px
proxy may be allocated and assigned to the sink.
Such sink-related proxies are added to the sink_proxies_list and thus
don't belong to the main proxy list which is cleaned up in
haproxy deinit() function.
We don't have to manually clean up sink_proxies_list in the main deinit()
func:
sink API already provides the sink_deinit() function so we just add the
missing free_proxy(sink->forward_px) there.
This could be backported up to 2.4.
[in 2.4, commit b0281a49 ("MINOR: proxy: check if p is NULL in free_proxy()")
must be backported first]
In stats_dump_proxy_to_buffer() function, special care was taken when
dealing with servers dump.
Indeed, stats_dump_proxy_to_buffer() can be interrupted and resumed if
buffer space is not big enough to complete dump.
Thus, a reference is taken on the server being dumped in the hope that
the server will still be valid when the function resumes.
(to prevent the server from being freed in the meantime)
While this is now true thanks to:
- "BUG/MINOR: server/del: fix legacy srv->next pointer consistency"
We still have an issue: when resuming, saved server reference is not
dropped.
This prevents the server from being freed when we no longer use it.
Moreover, as the saved server might now be deleted
(SRV_F_DELETED flag set), the current deleted server may still be dumped
in the stats and while this is not a bug, this could be misleading for
the user.
Let's add a px_st variable to detect if the stats_dump_proxy_to_buffer()
is being resumed at the STAT_PX_ST_SV stage: perform some housekeeping
to skip deleted servers and properly drop the reference on the saved
server.
This commit depends on:
- "MINOR: server: add SRV_F_DELETED flag"
- "BUG/MINOR: server/del: fix legacy srv->next pointer consistency"
This should be backported up to 2.6
We recently discovered a bug which affects dynamic server deletion:
When a server is deleted, it is removed from the "visible" server list.
But as we've seen in previous commit
("MINOR: server: add SRV_F_DELETED flag"), it can still be accessed by
someone who keeps a reference on it (waiting for the final srv_drop()).
Throughout this transient state, server ptr is still valid (may be
dereferenced) and the flag SRV_F_DELETED is set.
However, as the server is not part of server list anymore, we have
an issue: srv->next pointer won't be updated anymore as the only place
where we perform such update is in cli_parse_delete_server() by
iterating over the "visible" server list.
Because of this, we cannot guarantee that a server with the
SRV_F_DELETED flag has a valid 'next' ptr: 'next' could be pointing
to a fully removed (already freed) server.
This problem can be easily demonstrated with server dumping in
the stats:
server list dumping is performed in stats_dump_proxy_to_buffer()
The function can be interrupted and resumed later by design.
ie: output buffer is full: partial dump and finish the dump after
the flush
This is implemented by calling srv_take() on the server being dumped,
and only releasing it when we're done with it using srv_drop().
(drop can be delayed after function resume if buffer is full)
While the function design seems OK, it works with the assumption that
srv->next will still be valid after the function resumes, which is
not true. (especially if multiple servers are being removed in between
the 2 dumping attempts)
In practice, this did not cause any crash yet (at least this was not
reported so far), because server dumping is so fast that it is very
unlikely that multiple server deletions make their way between 2
dumping attempts in most setups. But still, this is a problem that we
need to address because some upcoming work might depend on this
assumption as well and for the moment it is not safe at all.
========================================================================
Here is a quick reproducer:
With this patch, we're creating a large deletion window of 3s as soon
as we reach a server named "t2" while iterating over the list.
This will give us plenty of time to perform multiple deletions before
the function is resumed.
| diff --git a/src/stats.c b/src/stats.c
| index 84a4f9b6e..15e49b4cd 100644
| --- a/src/stats.c
| +++ b/src/stats.c
| @@ -3189,11 +3189,24 @@ int stats_dump_proxy_to_buffer(struct stconn *sc, struct htx *htx,
| * Temporarily increment its refcount to prevent its
| * anticipated cleaning. Call free_server to release it.
| */
| + struct server *orig = ctx->obj2;
| for (; ctx->obj2 != NULL;
| ctx->obj2 = srv_drop(sv)) {
|
| sv = ctx->obj2;
| + printf("sv = %s\n", sv->id);
| srv_take(sv);
| + if (!strcmp("t2", sv->id) && orig == px->srv) {
| + printf("deletion window: 3s\n");
| + thread_idle_now();
| + thread_harmless_now();
| + sleep(3);
| + thread_harmless_end();
| +
| + thread_idle_end();
| +
| + goto full; /* simulate full buffer */
| + }
|
| if (htx) {
| if (htx_almost_full(htx))
| @@ -4353,6 +4366,7 @@ static void http_stats_io_handler(struct appctx *appctx)
| struct channel *res = sc_ic(sc);
| struct htx *req_htx, *res_htx;
|
| + printf("http dump\n");
| /* only proxy stats are available via http */
| ctx->domain = STATS_DOMAIN_PROXY;
|
Ok, we're ready, now we start haproxy with the following conf:
global
stats socket /tmp/ha.sock mode 660 level admin expose-fd listeners thread 1-1
nbthread 2
frontend stats
mode http
bind *:8081 thread 2-2
stats enable
stats uri /
backend farm
server t1 127.0.0.1:1899 disabled
server t2 127.0.0.1:18999 disabled
server t3 127.0.0.1:18998 disabled
server t4 127.0.0.1:18997 disabled
And finally, we execute the following script:
curl localhost:8081/stats&
sleep .2
echo "del server farm/t2" | nc -U /tmp/ha.sock
echo "del server farm/t3" | nc -U /tmp/ha.sock
This should be enough to reveal the issue, I easily manage to
consistently crash haproxy with the following reproducer:
http dump
sv = t1
http dump
sv = t1
sv = t2
deletion window = 3s
[NOTICE] (2940566) : Server deleted.
[NOTICE] (2940566) : Server deleted.
http dump
sv = t2
sv = �����U
[1] 2940566 segmentation fault (core dumped) ./haproxy -f ttt.conf
========================================================================
To fix this, we add prev_deleted mt_list in server struct.
For a given "visible" server, this list will contain the pending
"deleted" servers references that point to it using their 'next' ptr.
This way, whenever this "visible" server is going to be deleted via
cli_parse_delete_server() it will check for servers in its
'prev_deleted' list and update their 'next' pointer so that they no
longer point to it, and then it will push them in its
'next->prev_deleted' list to transfer the update responsibility to the
next 'visible' server (if next != NULL).
Then, following the same logic, the server about to be removed in
cli_parse_delete_server() will push itself as well into its
'next->prev_deleted' list (if next != NULL) so that it may still use its
'next' ptr for the time it is in transient removal state.
In srv_drop(), right before the server is finally freed, we make sure
to remove it from the 'next->prev_deleted' list so that 'next' won't
try to perform the pointers update for this server anymore.
This has to be done atomically to prevent 'next' srv from accessing a
purged server.
As a result:
for a valid server, either deleted or not, 'next' ptr will always
point to a non deleted (ie: visible) server.
With the proposed fix, and several removal combinations (including
unordered cli_parse_delete_server() and srv_drop() calls), I cannot
reproduce the crash anymore.
Example tricky removal sequence that is now properly handled:
sv list: t1,t2,t3,t4,t5,t6
ops:
take(t2)
del(t4)
del(t3)
del(t5)
drop(t3)
drop(t4)
drop(t5)
drop(t2)
Set the SRV_F_DELETED flag when server is removed from the cli.
When removing a server from the cli (in cli_parse_delete_server()),
we update the "visible" server list so that the removed server is no
longer part of the list.
However, despite the server being removed from "visible" server list,
one could still access the server data from a valid ptr (ie: srv_take())
Deleted flag helps detecting when a server is in transient removal
state: that is, removed from the list, thus not visible but not yet
purged from memory.
SE_FL_EOS flag must never be set on the SE descriptor without SE_FL_EOI or
SE_FL_ERROR. When a mux or an applet report an end of stream, it must be
able to state if it is the end of input too or if it is an error.
Because all this part was recently refactored, especially the applet part,
it is a bit sensitive. Thus a BUG_ON_HOT() is used and not a BUG_ON().
The purpose of this patch is only a one-to-one replacement, as far as
possible.
CF_SHUTR(_NOW) and CF_SHUTW(_NOW) flags are now carried by the
stream-connecter. CF_ prefix is replaced by SC_FL_ one. Of course, it is not
so simple because at many places, we were testing if a channel was shut for
reads and writes in same time. To do the same, shut for reads must be tested
on one side on the SC and shut for writes on the other side on the opposite
SC. A special care was taken with process_stream(). flags of SCs must be
saved to be able to detect changes, just like for the channels.
Just like for other applets, we now use the SE descriptor instead of the
channel to report error and end-of-stream. Here, the applet is a bit
refactored to handle SE descriptor EOS, EOI and ERROR flags
The state of the opposite SC is already tested to wait the connection is
established before sending messages. So, there is no reason to test it again
before looping on the ring buffer.
Just like for other applets, we now use the SE descriptor instead of the
channel to report error and end-of-stream. We must just be sure to consume
request data when we are waiting the applet to be released.
Just like for other applets, we now use the SE descriptor instead of the
channel to report error and end-of-stream.
Here, the refactoring only reports errors by setting SE_FL_ERROR flag.
There are 3 kinds of applet in lua: The co-sockets, the TCP services and the
HTTP services. The three are refactored to use the SE descriptor instead of
the channel to report error and end-of-stream.
Just like for other applets, we now use the SE descriptor instead of the
channel to report error and end-of-stream. We must just be sure to consume
request data when we are waiting the applet to be released.
This patch is bit different than others because messages handling is
dispatched in several functions. But idea if the same.
It is now the dns turn to be refactored to use the SE descriptor instead of
the channel to report error and end-of-stream. We must just be sure to
consume request data when we are waiting the applet to be released.
The state of the opposite SC is already tested to wait the connection is
established before sending requests. So, there is no reason to test it again
before looping on the ring buffer.
It is the same kind of change than for the cache applet. Idea is to use the SE
desc instead of the channel or the SC to report end-of-input, end-of-stream and
errors.
Truncated commands are now reported on error. Other changes are the same than
for the cache applet. We now set SE_FL_EOS flag instead of calling cf_shutr()
and calls to cf_shutw are removed.
We now try, as far as possible, to rely on the SE descriptor to detect end
of processing. Idea is to no longer rely on the channel or the SC to do so.
First, we now set SE_FL_EOS instead of calling and cf_shutr() to report the
end of the stream. It happens when the response is fully sent (SE_FL_EOI is
already set in this case) or when an error is reported. In this last case,
SE_FL_ERROR is also set.
Thanks to this change, it is now possible to detect the applet must only
consume the request waiting for the upper layer releases it. So, if
SE_FL_EOS or SE_FL_ERROR are set, it means the reponse was fully
handled. And if SE_FL_SHR or SE_FL_SHW are set, it means the applet was
released by upper layer and is waiting to be freed.
Just like for end of input, the end of stream reported by the endpoint
(SE_FL_EOS flag) is now handled in sc_applet_process(). The idea is to have
applets acting as muxes by reporting events through the SE descriptor, as
far as possible.
Thanks to the previous patch, it is now possible for applets to not set the
CF_EOI flag on the channels. On this point, the applets get closer to the
muxes.
The end of input reported by the endpoint (SE_FL_EOI flag), is now handled
in sc_applet_process(). This function is always called after an applet was
called. So, the applets can now only report EOI on the SE descriptor and
have no reason to update the channel too.
EOS is now acknowledge at the end of sc_conn_recv(), even if an error was
encountered. There is no reason to not do so, especially because, if it not
performed here, it will be ack in sc_conn_process().
Note, it is still performed in sc_conn_process() because this function is
also the .wake callback function and can be directly called from the lower
layer.
On truncated message, a parsing error is still reported. But an error on the
SE descriptor is also reported. This will avoid any bugs in future. We are
know sure the SC is able to detect the error, independently on the HTTP
analyzers.
In h1_rcv_pipe(), only the end of stream was reported when a read0 was
detected. However, it is also important to report the end of input or an
error, depending on the message state. This patch does not fix any real
issue for now, but some others, specific to the 2.8, rely on it.
No backport needed.
In the PT multiplexer, the end of stream is also the end of input. Thus
we must report EOI to the stream-endpoint descriptor when the EOS is
reported. For now, it is a bit useless but it will be important to
disginguish an shutdown to an error to an abort.
To be sure to not report an EOI on an error, the errors are now handled
first.
In sc_conn_recv(), if the EOS is reported by the endpoint, it will always be
acknowledged by the SC and a read0 will be performed on the input
channel. Thus there is no reason to still test at the begining of the
function because there is already a test on CF_SHUTR.
When a response is consumed, result for co_getblk() is never checked. It
seems ok because amount of output data is always checked first. But There is
an issue when we try to get the first 2 bytes to read the message length. If
there is only one byte followed by a shutdown, the applet ignore the
shutdown and loop till the timeout to get more data.
So to avoid any issue and improve shutdown detection, the co_getblk() return
value is always tested. In addition, if there is not enough data, the applet
explicitly ask for more data by calling applet_need_more_data().
This patch relies on the previous one:
* BUG/MEDIUM: channel: Improve reports for shut in co_getblk()
Both should be backported as far as 2.4. On 2.5 and 2.4,
applet_need_more_data() must be replaced by si_rx_endp_more().
When co_getblk() is called with a length and an offset to 0, shutdown is
never reported. It may be an issue when the function is called to retrieve
all available output data, while there is no output data at all. And it
seems pretty annoying to handle this case in the caller.
Thus, now, in co_getblk(), -1 is returned when the channel is empty and a
shutdown was received.
There is no real reason to backport this patch alone. However, another fix
will rely on it.
There is a bug in a way the channels flags are checked to set clientfin or
serverfin timeout. Indeed, to set the clientfin timeout, the request channel
must be shut for reads (CF_SHUTR) or the response channel must be shut for
writes (CF_SHUTW). As the opposite, the serverfin timeout must be set when
the request channel is shut for writes (CF_SHUTW) or the response channel is
shut for reads (CF_SHUTR).
It is a 2.8-dev specific issue. No backport needed.
In the commut b08c5259e ("MINOR: stconn: Always report READ/WRITE event on
shutr/shutw"), a return statement was erroneously removed from
sc_app_shutr(). As a consequence, CF_SHUTR flags was never set. Fortunately,
it is the default .shutr callback function. Thus when a connection or an
applet is attached to the SC, another callback is used to performe a
shutdown for reads.
It is a 28-dev specific issue. No backport needed.
It is not possible to successfully match an empty response. However using
regex, it should be possible to reject response with any content. For
instance:
tcp-check expect !rstring ".+"
It may seem a be strange to do that, but it is possible, it is a valid
config. So it must work. Thanks to this patch, it is now really supported.
This patch may be backported as far as 2.2. But only if someone ask for it.
As timestamps based on now_ms values are used to compute the probing timeout,
they may wrap. So, use ticks API to compared them.
Must be backported to 2.7 and 2.6.
This this commit, this is ->idle_expire of quic_conn struct which must
be taken into an account to display the idel timer task expiration value:
MEDIUM: quic: Ack delay implementation
Furthermore, this value was always zero until now_ms has wrapped (20 s after
the start time) due to this commit:
MEDIUM: clock: force internal time to wrap early after boot
Do not rely on the value of now_ms compared to ->idle_expire to display the
difference but use ticks_remain() to compute it.
Must be backported to 2.7 where "show quic" has already been backported.
This bug arrived with this commit:
MEDIUM: quic: Ack delay implementation
It is possible that the idle timer task was already in the run queue when its
->expire field was updated calling qc_idle_timer_do_rearm(). To prevent this
task from running in this condition, one must check its ->expire field value
with this condition to run the task if its timer has really expired:
!tick_is_expired(t->expire, now_ms)
Furthermore, as this task may be directly woken up with a call to task_wakeup()
all, for instance by qc_kill_conn() to kill the connection, one must check this
task has really been woken up when it was in the wait queue and not by a direct
call to task_wakeup() thanks to this test:
(state & TASK_WOKEN_ANY) == TASK_WOKEN_TIMER
Again, when this condition is not fulfilled, the task must be run.
Must be backported where the commit mentionned above was backported.
As found in issue #2089, it's easy to mistakenly paste a colon in a
header name, or other chars (e.g. spaces) when quotes are in use, and
this causes all sort of trouble in field because such chars are rejected
by the peer.
Better try to detect these upfront. That's what we're doing here during
the parsing of the add-header/set-header/early-hint actions, where a
warning is emitted if a non-token character is found in a header name.
A special case is made for the colon at the beginning so that it remains
possible to place any future pseudo-headers that may appear. E.g:
[WARNING] (14388) : config : parsing [badchar.cfg:23] : header name 'X-Content-Type-Options:' contains forbidden character ':'.
This should be backported to 2.7, and ideally it should be turned to an
error in future versions.
As now_ms may be zero, these BUG_ON() could be triggered when its value has wrapped.
These call to BUG_ON() may be removed because the values they was supposed to
check are safely used by the ticks API.
Must be backported to 2.6 and 2.7.
This very old bug is there since the first implementation of newreno congestion
algorithm implementation. This was a very bad idea to put a state variable
into quic_cc_algo struct which only defines the congestion control algorithm used
by a QUIC listener, typically its type and its callbacks.
This bug could lead to crashes since BUG_ON() calls have been added to each algorithm
implementation. This was revealed by interop test, but not very often as there was
not very often several connections run at the time during these tests.
Hopefully this was also reported by Tristan in GH #2095.
Move the congestion algorithm state to the correct structures which are private
to a connection (see cubic and nr structs).
Must be backported to 2.7 and 2.6.
When entering a recovery period, the algo state is set by quic_enter_recovery().
And that's it!. These two lines should have been removed with this commit:
BUG/MINOR: quic: Wrong use of now_ms timestamps (cubic algo)
Take the opportunity of this patch to add a missing TRACE_LEAVE() call in
quic_cc_cubic_ca_cb().
Must be backported to 2.7 and 2.6.
This algorithm does nothing except initializing the congestion control window
to a fixed value. Very smart!
Modify the QUIC congestion control configuration parser to support this new
algorithm. The congestion control algorithm must be set as follows:
quic-cc-algo nocc-<cc window size(KB))
For instance if "nocc-15" is provided as quic-cc-algo keyword value, this
will set a fixed window of 15KB.
Depending on what we're debugging, some FDs can represent pollution in
the "show fd" output. Here we add a set of filters allowing to pick (or
exclude) any combination of listener, frontend conn, backend conn, pipes,
etc. "show fd l" will only list listening connections for example.
In ->srtt quic_loss struct this is 8*srtt which is stored so that not to have to multiply/devide
it to compute the RTT variance (at least). This is where there was a bug in quic_loss_srtt_update():
each time ->srtt must be used, it must be devided by 8 or right shifted by 3.
This bug had a very bad impact for network with non negligeable packet loss.
Must be backported to 2.6 and 2.7.
Reuse the idle timeout task to delay the acknowledgments. The time of the
idle timer expiration is for now on stored in ->idle_expire. The one to
trigger the acknowledgements is stored in ->ack_expire.
Add QUIC_FL_CONN_ACK_TIMER_FIRED new connection flag to mark a connection
as having its acknowledgement timer been triggered.
Modify qc_may_build_pkt() to prevent the sending of "ack only" packets and
allows the connection to send packet when the ack timer has fired.
It is possible that acks are sent before the ack timer has triggered. In
this case it is cancelled only if ACK frames are really sent.
The idle timer expiration must be set again when the ack timer has been
triggered or when it is cancelled.
Must be backported to 2.7.
Dump variables displayed by TRACE_ENTER() or TRACE_LEAVE() by calls to TRACE_PROTO().
No more variables are displayed by the two former macros. For now on, these information
are accessible from proto level.
Add new calls to TRACE_PROTO() at important locations in relation whith QUIC transport
protocol.
When relevant, try to prefix such traces with TX or RX keyword to identify the
concerned subpart (transmission or reception) of the protocol.
Must be backported to 2.7.
This callback was left as not implemented. It should at least display
the algorithm state, the control congestion window the slow start threshold
and the time of the current recovery period. Should be helpful to debug.
Must be backported to 2.7.
This bug was revealed by handshakeloss interop tests (often with quiceh) where one
could see haproxy an Initial packet without TLS ClientHello message (only a padded
PING frame). In this case, as the ->max_idle_timeout was not initialized, the
connection was closed about three seconds later, and haproxy opened a new one with
a new source connection ID upon receipt of the next client Initial packet. As the
interop runner count the number of source connection ID used by the server to check
there were exactly 50 such IDs used by the server, it considered the test as failed.
So, the ->max_idle_timeout of the connection must be at least initialized
to the local "max_idle_timeout" transport parameter value to avoid such
a situation (closing connections too soon) until it is "negotiated" with the
client when receiving its TLS ClientHello message.
Must be backported to 2.7 and 2.6.
This patch is similar to the one for cubic algorithm:
"BUG/MINOR: quic: Wrong use of timestamps with now_ms variable (cubic algo)"
As now_ms may wrap, one must use the ticks API to protect the cubic congestion
control algorithm implementation from side effects due to this.
Furthermore, to make the newreno congestion control algorithm more readable and easy
to maintain, add quic_cc_cubic_rp_cb() new callback for the "in recovery period"
state (QUIC_CC_ST_RP).
Must be backported to 2.7 and 2.6.
Add ->srtt, ->rtt_var, ->rtt_min and ->pto_count values from ->path->loss
struct to "show quic". Same thing for ->cwnd from ->path struct.
Also take the opportunity of this patch to dump the packet number
space information directly from ->pktns[] array in place of ->els[]
array. Indeed, ->els[QUIC_TLS_ENC_LEVEL_EARLY_DATA] and ->els[QUIC_TLS_ENC_LEVEL_APP]
have the same packet number space.
Must be backported to 2.7 where "show quic" implementation has alredy been
backported.
As now_ms may wrap, one must use the ticks API to protect the cubic congestion
control algorithm implementation from side effects due to this.
Furthermore to make the cubic congestion control algorithm more readable and easy
to maintain, adding a new state ("in recovery period" QUIC_CC_ST_RP new enum) helps
in reaching this goal. Implement quic_cc_cubic_rp_cb() which is the callback for
this new state.
Must be backported to 2.7 and 2.6.
If a bundle is used in a crt-list, the ssl-min-ver and ssl-max-ver
options were not taken into account in entries other than the first one
because the corresponding fields in the ssl_bind_conf structure were not
copied in crtlist_dup_ssl_conf.
This should fix GitHub issue #2069.
This patch should be backported up to 2.4.
In some extremely unlikely case (or even impossible for now), we might
exit cli_parse_update_ocsp_response without raising an error but with a
filled 'err' buffer. It was not properly free'd.
It does not need to be backported.
This patch removes dead code from the cli_parse_update_ocsp_response
function. The 'end' label in only used in case of error so the check of
the 'errcode' variable and the errcode variable itself become useless.
This patch does not need to be backported.
It fixes GitHub issue #2077.
During soft-stop, manage_proxy() (p->task) will try to purge
trashable (expired and not referenced) sticktable entries,
effectively releasing the process memory to leave some space
for new processes.
This is done by calling stktable_trash_oldest(), immediately
followed by a pool_gc() to give the memory back to the OS.
As already mentioned in dfe7925 ("BUG/MEDIUM: stick-table:
limit the time spent purging old entries"), calling
stktable_trash_oldest() with a huge batch can result in the function
spending too much time searching and purging entries, and ultimately
triggering the watchdog.
Lately, an internal issue was reported in which we could see
that the watchdog is being triggered in stktable_trash_oldest()
on soft-stop (thus initiated by manage_proxy())
According to the report, the crash seems to only occur since 5938021
("BUG/MEDIUM: stick-table: do not leave entries in end of window during purge")
This could be the result of stktable_trash_oldest() now working
as expected, and thus spending a large amount of time purging
entries when called with a large enough <to_batch>.
Instead of adding new checks in stktable_trash_oldest(), here we
chose to address the issue directly in manage_proxy().
Since the stktable_trash_oldest() function is called with
<to_batch> == <p->table->current>, it's pretty obvious that it could
cause some issues during soft-stop if a large table, assuming it is
full prior to the soft-stop, suddenly sees most of its entries
becoming trashable because of the soft-stop.
Moreover, we should note that the call to stktable_trash_oldest() is
immediately followed by a call to pool_gc():
We know for sure that pool_gc(), as it involves malloc_trim() on
glibc, is rather expensive, and the more memory to reclaim,
the longer the call.
We need to ensure that both stktable_trash_oldest() + consequent
pool_gc() call both theoretically fit in a single task execution window
to avoid contention, and thus prevent the watchdog from being triggered.
To do this, we now allocate a "budget" for each purging attempt.
budget is maxed out to 32K, it means that each sticktable cleanup
attempt will trash at most 32K entries.
32K value is quite arbitrary here, and might need to be adjusted or
even deducted from other parameters if this fails to properly address
the issue without introducing new side-effects.
The goal is to find a good balance between the max duration of each
cleanup batch and the frequency of (expensive) pool_gc() calls.
If most of the budget is actually spent trashing entries, then the task
will immediately be rescheduled to continue the purge.
This way, the purge is effectively batched over multiple task runs.
This may be slowly backported to all stable versions.
[Please note that this commit depends on 6e1fe25 ("MINOR: proxy/pool:
prevent unnecessary calls to pool_gc()")]
This commit adds a new optional argument to smp_fetch_url_param
and smp_fetch_url_param_val that makes the parameter key comparison
case-insensitive.
Now users can retrieve URL parameters regardless of their case,
allowing to match parameters in case insensitive application.
Doc was updated.
This commit adds a new argument to smp_fetch_url_param
that makes the parameter key comparison case-insensitive.
Several levels of callers were modified to pass this info.
In prevision of adding a third parameter to the url_param
sample-fetch function we need to make the second parameter optional.
User can now pass a empty 2nd argument to keep the default delimiter.
Since eb77824 ("MEDIUM: proxy: remove the deprecated "grace" keyword"),
stop_time is never set, so the related code in manage_proxy() is not
relevant anymore.
Removing code that refers to p->stop_time, since it was probably
overlooked.
Under certain soft-stopping conditions (ie: sticktable attached to proxy
and in-progress connections to the proxy that prevent haproxy from
exiting), manage_proxy() (p->task) will wake up every second to perform
a cleanup attempt on the proxy sticktable (to purge unused entries).
However, as reported by TimWolla in GH #2091, it was found that a
systematic call to pool_gc() could cause some CPU waste, mainly
because malloc_trim() (which is rather expensive) is being called
for each pool_gc() invocation.
As a result, such soft-stopping process could be spending a significant
amount of time in the malloc_trim->madvise() syscall for nothing.
Example "strace -c -f -p `pidof haproxy`" output (taken from
Tim's report):
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
46.77 1.840549 3941 467 1 epoll_wait
43.82 1.724708 13 128509 sched_yield
8.82 0.346968 11 29696 madvise
0.58 0.023011 24 951 clock_gettime
0.01 0.000257 10 25 7 recvfrom
0.00 0.000033 11 3 sendto
0.00 0.000021 21 1 rt_sigreturn
0.00 0.000021 21 1 timer_settime
------ ----------- ----------- --------- --------- ----------------
100.00 3.935568 24 159653 8 total
To prevent this, we now only call pool_gc() when some memory is
really expected to be reclaimed as a direct result of the previous
stick table cleanup.
This is pretty straightforward since stktable_trash_oldest() returns
the number of trashed sticky sessions.
This may be backported to every stable versions.
This bug arrived with this commit:
MINOR: quic: Send PING frames when probing Initial packet number space
This may happen when haproxy needs to probe the peer with very short packets
(only one PING frame). In this case, the packet must be padded. There was clearly
a case which was removed by the mentionned commit above. That said, there was
an extra byte which was added to the PADDING frame before the mentionned commit
above. This is no more the case with this patch.
Thank you to @tatsuhiro-t (ngtcp2 manager) for having reported this issue which
was revealed by the keyupdate test (on client side).
Must be backported to 2.7 and 2.6.
When a backend H2 connection is waiting the connection is fully established,
nothing is sent. However, it remains useful to detect connection error at
this stage. It is especially important to release H2 connection on connect
error. Be able to set H2_CF_ERR_PENDiNG or H2_CF_ERROR flags when the
underlying connection is not fully established will exclude the H2C to be
inserted in a idle list in h2_detach().
Without this fix, an H2C in PREFACE state and relying on a connection in
error can be inserted in the safe list. Of course, it will be purged if not
reused. But in the mean time, it can be reused. When this happens, the
connection remains in error and nothing happens. At the end a connection
error is returned to the client. On low traffic, we can imagine a scenario
where this dead connection is the only idle connection. If it is always
reused before being purged, no connection to the server is possible.
In addition, h2c_is_dead() is updated to declare as dead any H2 connection
with a pending error if its state is PREFACE or SETTINGS1 (thus if no
SETTINGS frame was received yet).
This patch should fix the issue #2092. It must be backported as far as 2.6.
In commit c2c043ed4 ("BUG/MEDIUM: stats: Consume the request except when
parsing the POST payload"), a change about applet was pushed too early. The
applet must still call cf_shutr() when the response is fully sent. It is
planned to rely on SE_FL_EOS flag, just like connections. But it is not
possible for now.
However, at first glance, this bug has no visible effect.
It is 2.8-specific. No backport needed.
Previously performing a config check of `.github/h2spec.config` would report a
20 byte leak as reported in GitHub Issue #2082.
The leak was introduced in a6c0a59e9a, which is
dev only. No backport needed.
This patch follows this commit which was not sufficient:
BUG/MINOR: quic: Missing STREAM frame data pointer updates
Indeed, after updating the ->offset field, the bit which informs the
frame builder of its presence must be systematically set.
This bug was revealed by the following BUG_ON() from
quic_build_stream_frame() :
bug condition "!!(frm->type & 0x04) != !!stream->offset.key" matched at src/quic_frame.c:515
This should fix the last crash occured on github issue #2074.
Must be backported to 2.6 and 2.7.
Since 465a6c8 ("BUG/MEDIUM: applet: only set appctx->sedesc on
successful allocation"), sedesc is attached to the appctx after the
task is successfully allocated.
If the task fails to allocate: current sedesc cleanup is performed
on appctx->sedesc which still points to NULL so sedesc won't be
freed.
This is fine when sedesc is provided as argument (!=NULL), but leads
to memory leaks if sedesc is allocated locally.
It was shown in GH #2086 that if sedesc != NULL when passed as
argument, it shouldn't be freed on error paths. This is what 465a6c8
was trying to address.
In an attempt to fix both issues at once, we do as Christopher
suggested: that is moving sedesc allocation attempt at the
end of the function, so that we don't have to free it in case
of error, thus removing the ambiguity.
(We won't risk freeing a sedesc that does not belong to us)
If we fail to allocate sedesc, then the task that was previously
created locally is simply destroyed.
This needs to be backported to 2.6 with 465a6c8 ("BUG/MEDIUM: applet:
only set appctx->sedesc on successful allocation")
[Copy pasting the original backport note from Willy:
In 2.6 the function is slightly
different and called appctx_new(), though the issue is exactly the
same.]
This old bug was revealed because of the commit 407210a34 ("BUG/MEDIUM:
stconn: Don't rearm the read expiration date if EOI was reached"). But it is
still possible to hit it if there is no server timeout. At first glance, the
2.8 is not affected. But the fix remains valid.
When a shutdown for writes if performed the H1 connection must be notified
to be released. If it is subscribed for any I/O events, it is not an
issue. But, if there is no subscription, no I/O event is reported to the H1
connection and it remains alive. If the message was already fully received,
nothing more happens.
On my side, I was able to trigger the bug by freezing the session. Some
users reported a spinning loop on process_stream(). Not sure how to trigger
the loop. To freeze the session, the client timeout must be reached while
the server response was already fully received. On old version (< 2.6), it
only happens if there is no server timeout.
To fix the issue, we must wake up the H1 connection on shutdown for writes
if there is no I/O subscription.
This patch must be backported as far as 2.0. It should fix the issue #2090
and #2088.
The stats applet is designed to consume the request at the end, when it
finishes to send the response. And during the response forwarding, because
the request is not consumed, the applet states it will not consume
data. This avoid to wake the applet up in loop. When it finishes to send the
response, the request is consumed.
For POST requests, there is no issue because the response is small
enough. It is sent in one time and must be processed by HTTP analyzers. Thus
the forwarding is not performed by the applet itself. The applet is always
able to consume the request, regardless the payload length.
But for other requests, it may be an issue. If the response is too big to be
sent in one time and if the requests is not fully received when the response
headers are sent, the applet may be blocked infinitely, not consuming the
request. Indeed, in the case the applet will be switched in infinite forward
mode, the request will not be consumed immediately. At the end, the request
buffer is flushed. But if some data must still be received, the applet is not
woken up because it is still in a "not-consuming" mode.
So, to fix the issue, we must take care to re-enable data consuming when the
end of the response is reached.
This patch must be backported as far as 2.6.
In the syslog applet, when a message was not fully received, we must request
for more data by calling appctx_need_more_data() and not by setting
CF_READ_DONTWAIT flag on the request channel. Indeed, this flag is only used
to only try a read at once.
This patch could be backported as far as 2.4. On 2.5 and 2.4,
applet_need_more_data() must be replaced by si_cant_get().
Replace all BUG_ON() on frame allocation failure by a CONNECTION_CLOSE
sending with INTERNAL_ERROR code. This can happen for the following
cases :
* sending of MAX_STREAM_DATA
* sending of MAX_DATA
* sending of MAX_STREAMS_BIDI
In other cases (STREAM, STOP_SENDING, RESET_STREAM), an allocation
failure will only result in the current operation to be interrupted and
retried later. However, it may be desirable in the future to replace
this with a simpler CONNECTION_CLOSE emission to recover better under a
memory pressure issue.
This should be backported up to 2.7.
Emit a CONNECTION_CLOSE with INTERNAL_ERROR code each time qcs
allocation fails. This can happen in two cases :
* when creating a local stream through application layer
* when instantiating a remote stream through qcc_get_qcs()
In both cases, error paths are already in place to interrupt the current
operation and a CONNECTION_CLOSE will be emitted soon after.
This should be backported up to 2.7.
Add BUG_ON() statements to ensure qcc_emit_cc()/qcc_emit_cc_app() is not
called more than one time for each connection. This should improve code
resilience of MUX-QUIC and H3 and it will ensure that a scheduled
CONNECTION_CLOSE is not overwritten by another one with a different
error code.
This commit relies on the previous one to ensure all QUIC operations are
not conducted as soon as a CONNECTION_CLOSE has been prepared :
commit d7fbf458f8a4c5b09cbf0da0208fbad70caaca33
MINOR: mux-quic: interrupt most operations if CONNECTION_CLOSE scheduled
This should be backported up to 2.7.
Ensure that external MUX operations are interrupted if a
CONNECTION_CLOSE is scheduled. This was already the cases for some
functions. This is extended to the qcc_recv*() family for
MAX_STREAM_DATA, RESET_STREAM and STOP_SENDING.
Also, qcc_release_remote_stream() is skipped in qcs_destroy() if a
CONNECTION_CLOSE is already scheduled.
All of this will ensure we only proceed to minimal treatment as soon as
a CONNECTION_CLOSE is prepared. Indeed, all sending and receiving is
stopped as soon as a CONNECTION_CLOSE is emitted so only internal
cleanup code should be necessary at this stage.
This should prevent a registered CONNECTION_CLOSE error status to be
overwritten by an error in a follow-up treatment.
This should be backported up to 2.7.
HTTP/3 graceful shutdown operation is used to emit a GOAWAY followed by
a CONNECTION_CLOSE with H3_NO_ERROR status. It is used for every
connection on release which means that if a CONNECTION_CLOSE was already
registered for a previous error, its status code is overwritten.
To fix this, skip shutdown operation if a CONNECTION_CLOSE is already
registered at the MUX level. This ensures that the correct error status
is reported to the peer.
This should be backported up to 2.6. Note that qc_shutdown() does not
exists on 2.6 so modification will have to be made directly in
qc_release() as followed :
diff --git a/src/mux_quic.c b/src/mux_quic.c
index 49df0dc418..3463222956 100644
--- a/src/mux_quic.c
+++ b/src/mux_quic.c
@@ -1766,19 +1766,21 @@ static void qc_release(struct qcc *qcc)
TRACE_ENTER(QMUX_EV_QCC_END, conn);
- if (qcc->app_ops && qcc->app_ops->shutdown) {
- /* Application protocol with dedicated connection closing
- * procedure.
- */
- qcc->app_ops->shutdown(qcc->ctx);
+ if (!(qcc->flags & QC_CF_CC_EMIT)) {
+ if (qcc->app_ops && qcc->app_ops->shutdown) {
+ /* Application protocol with dedicated connection closing
+ * procedure.
+ */
+ qcc->app_ops->shutdown(qcc->ctx);
- /* useful if application protocol should emit some closing
- * frames. For example HTTP/3 GOAWAY frame.
- */
- qc_send(qcc);
- }
- else {
- qcc_emit_cc_app(qcc, QC_ERR_NO_ERROR, 0);
+ /* useful if application protocol should emit some closing
+ * frames. For example HTTP/3 GOAWAY frame.
+ */
+ qc_send(qcc);
+ }
+ else {
+ qcc_emit_cc_app(qcc, QC_ERR_NO_ERROR, 0);
+ }
}
if (qcc->task) {
A H3 unidirectional stream is always opened with its stream type first
encoded as a QUIC variable integer. If the STREAM frame contains some
data but not enough to decode this varint, haproxy would crash due to an
ABORT_NOW() statement.
To fix this, ensure we support an incomplete stream type. In this case,
h3_init_uni_stream() returns 0 and the buffer content is not cleared.
Stream decoding will resume when new data are received for this stream
which should be enough to decode the stream type varint.
This bug has never occured on production because standard H3 stream types
are small enough to be encoded on a single byte.
This should be backported up to 2.6.
Instead of reporting the inaccurate "malloc_trim() support" on -vv, let's
report the case where the memory allocator was actively replaced from the
one used at build time, as this is the corner case we want to be cautious
about. We also put a tainted bit when this happens so that it's possible
to detect it at run time (e.g. the user might have inherited it from an
environment variable during a reload operation).
The now unused is_trim_enabled() function was finally dropped.
The runtime detection of the default memory allocator was broken by
Commit d8a97d8f6 ("BUG/MINOR: illegal use of the malloc_trim() function
if jemalloc is used") due to a misunderstanding of its role. The purpose
is not to detect whether we're on non-jemalloc but whether or not the
allocator was changed from the one we booted with, in which case we must
be extra cautious and absolutely refrain from calling malloc_trim() and
its friends.
This was done only to drop the message saying that malloc_trim() is
supported, which will be totally removed in another commit, and could
possibly be removed even in older versions if this patch would get
backported since in the end it provides limited value.
There's a recurring issue regarding shared library loading from Lua. If
the imported library is linked with a different version of openssl but
doesn't use it, the check will trigger and emit a warning. In practise
it's not necessarily a problem as long as the API is the same, because
all symbols are replaced and the library will use the included ssl lib.
It's only a problem if the library comes with a different API because
the dynamic linker will only replace known symbols with ours, and not
all. Thus the loaded lib may call (via a static inline or a macro) a
few different symbols that will allocate or preinitialize structures,
and which will then pass them to the common symbols coming from a
different and incompatible lib, exactly what happens to users of Lua's
luaossl when building haproxy with quictls and without rebuilding
luaossl.
In order to better address this situation, we now define groups of
symbols that must always appear/disappear in a consistent way. It's OK
if they're all absent from either haproxy or the lib, it means that one
of them doesn't use them so there's no problem. But if any of them is
defined on any side, all of them must be in the exact same state on the
two sides. The symbols are represented using a bit in a mask, and the
mask of the group of other symbols they're related to. This allows to
check 64 symbols, this should be OK for a while. The first ones that
are tested for now are symbols whose combination differs between
openssl versions 1.0, 1.1, and 3.0 as well as quictls. Thus a difference
there will indicate upcoming trouble, but no error will mean that we're
running on a seemingly compatible API and that all symbols should be
replaced at once.
The same mechanism could possibly be used for pcre/pcre2, zlib and the
few other optional libs that may occasionally cause runtime issues when
used by dependencies, provided that discriminatory symbols are found to
distinguish them. But in practice such issues are pretty rare, mainly
because loading standard libs via Lua on a custom build of haproxy is
not pretty common.
In the event that further symbol compatibility issues would be reported
in the future, backporting this patch as well as the following series
might be an acceptable solution given that the scope of changes is very
narrow (the malloc stuff is needed so that the malloc/free tests can be
dropped):
BUG/MINOR: illegal use of the malloc_trim() function if jemalloc is used
MINOR: pools: make sure 'no-memory-trimming' is always used
MINOR: pools: intercept malloc_trim() instead of trying to plug holes
MEDIUM: pools: move the compat code from trim_all_pools() to malloc_trim()
MINOR: pools: export trim_all_pools()
MINOR: pattern: use trim_all_pools() instead of a conditional malloc_trim()
MINOR: tools: relax dlopen() on malloc/free checks
Now that we can provide a safe malloc_trim() we don't need to detect
anymore that some dependencies use a different set of malloc/free
functions than ours because they will use the same as those we're
seeing, and we control their use of malloc_trim(). The comment about
the incompatibility with DEBUG_MEM_STATS is not true anymore either
since the feature relies on macros so we're now OK.
This will stop catching libraries linked against glibc's allocator
when haproxy is natively built with jemalloc. This was especially
annoying since dlopen() on a lib depending on jemalloc() tends to
fail on TLS issues.
We already have some generic code in trim_all_pools() to implement the
equivalent of malloc_trim() on jemalloc and macos. Instead of keeping the
logic there, let's just move it to our own malloc_trim() implementation
so that we can unify the mechanism and the logic. Now any low-level code
calling malloc_trim() will either be disabled by haproxy's config if the
user decides to, or will be mapped to the equivalent mechanism if malloc()
was intercepted by a preloaded jemalloc.
Trim_all_pools() preserves the benefit of serializing threads (which we
must not impose to other libs which could come with their own threads).
It means that our own code should mostly use trim_all_pools() instead of
calling malloc_trim() directly.
As reported by Miroslav in commit d8a97d8f6 ("BUG/MINOR: illegal use of
the malloc_trim() function if jemalloc is used") there are still occasional
cases where it's discovered that malloc_trim() is being used without its
suitability being checked first. This is a problem when using another
incompatible allocator. But there's a class of use cases we'll never be
able to cover, it's dynamic libraries loaded from Lua. In order to address
this more reliably, we now define our own malloc_trim() that calls the
previous one after checking that the feature is supported and that the
allocator is the expected one. This way child libraries that would call
it will also be safe.
The function is intentionally left defined all the time so that it will
be possible to clean up some code that uses it by removing ifdefs.
The global option 'no-memory-trimming' was added in 2.6 with commit
c4e56dc58 ("MINOR: pools: add a new global option "no-memory-trimming"")
but there were some cases left where it was not considered. Let's make
is_trim_enabled() also consider it.
Complete traces with information from qcc and qcs instances about
flow-control level. This should help to debug further issue on sending.
This must be backported up to 2.7.
Change the trace from developer to data level whenever the flow control
limitation is updated following a MAX_DATA or MAX_STREAM_DATA reception.
This should be backported up to 2.7.
Add traces for _qc_send_qcs() function. Most notably, traces have been
added each time a qc_stream_desc buffer allocation fails and when stream
or connection flow-level is reached. This should improve debugging for
emission issues.
This must be backported up to 2.7.
Connection flow-control level calculation is a bit complicated. To
ensure it is never exceeded, each time a transfer occurs from a
qcs.tx.buf to its qc_stream_desc buffer it is accounted in
qcc.tx.offsets at the connection level. This value is not decremented
even if the corresponding STREAM frame is rejected by the quic-conn
layer as its emission will be retried later.
In normal cases this works as expected. However there is an issue if a
qcs instance is removed with prepared data left. In this case, its data
is still accounted in qcc.tx.offsets despite being removed which may
block other streams. This happens every time a qcs is reset with
remaining data which will be discarded in favor of a RESET_STREAM frame.
To fix this, if a stream has prepared data in qcc_reset_stream(), it is
decremented from qcc.tx.offsets. A BUG_ON() has been added to ensure
qcs_destroy() is never called for a stream with prepared data left.
This bug can cause two issues :
* transfer freeze as data unsent from closed streams still count on the
connection flow-control limit and will block other streams. Note that
this issue was not reproduced so it's unsure if this really happens
without the following issue first.
* a crash on a BUG_ON() statement in qc_send() loop over
qc_send_frames(). Streams may remained in the send list with nothing
to send due to connection flow-control limit. However, limit is never
reached through qcc_streams_sent_done() so QC_CF_BLK_MFCTL flag is not
set which will allow the loop to continue.
The last case was reproduced after several minutes of testing using the
following command :
$ ngtcp2-client --exit-on-all-streams-close -t 0.1 -r 0.1 \
--max-data=100K -n32 \
127.0.0.1 20443 "https://127.0.0.1:20443/?s=1g" 2>/dev/null
This should fix github issues #2049 and #2074.
In the event that HAProxy is linked with the jemalloc library, it is still
shown that malloc_trim() is enabled when executing "haproxy -vv":
..
Support for malloc_trim() is enabled.
..
It's not so much a problem as it is that malloc_trim() is called in the
pat_ref_purge_range() function without any checking.
This was solved by setting the using_default_allocator variable to the
correct value in the detect_allocator() function and before calling
malloc_trim() it is checked whether the function should be called.
Starting haproxy with -dL helps enumerate the list of libraries in use.
But sometimes in order to go further we'd like to see their address
ranges. This is already supported on the CLI's "show libs" but not on
the command line where it can sometimes help troubleshoot startup issues.
Let's dump them when in verbose mode. This way it doesn't change the
existing behavior for those trying to enumerate libs to produce an archive.
When threads are disabled, the compiler complains that we might be
accessing tg->abs[] out of bounds since the array is of size 1. It
cannot know that the condition to do this is never met, and given
that it's not in a fast path, we can make it more obvious.
qc_notify_send() is used to wake up the MUX layer for sending. This
function first ensures that all sending condition are met to avoid to
wake up the MUX for unnecessarily.
One of this condition is to check if there is room in the congestion
window. However, when probe packets must be sent due to a PTO
expiration, RFC 9002 explicitely mentions that the congestion window
must be ignored which was not the case prior to this patch.
This commit fixes this by first setting <pto_probe> of 01RTT packet
space before invoking qc_notify_send(). This ensures that congestion
window won't be checked anymore to wake up the MUX layer until probing
packets are sent.
This commit replaces the following one which was not sufficient :
commit e25fce03eb
BUG/MINOR: quic: Dysfunctional 01RTT packet number space probing
This should be backported up to 2.7.
On PTO probe timeout expiration, a probe packet must be emitted.
quic_pto_pktns() is used to determine for which packet space the timer
has expired. However, if MUX is already subscribed for sending, it is
woken up without checking first if this happened for the 01RTT packet
space.
It is unsure that this is really a bug as in most cases, MUX is
established only after Initial and Handshake packet spaces are removed.
However, the situation is not se clear when 0-RTT is used. For this
reason, adjust the code to explicitely check for the 01RTT packet space
before waking up the MUX layer.
This should be backported up to 2.6. Note that qc_notify_send() does not
exists in 2.6 so it should be replaced by the explicit block checking
(qc->subs && qc->subs->events & SUB_RETRY_SEND).
If appctx_new_on() fails to allocate a task, it will not remove the
freshly allocated sedesc from the appctx despite freeing it, causing
a UAF. Let's only assign appctx->sedesc upon success.
This needs to be backported to 2.6. In 2.6 the function is slightly
different and called appctx_new(), though the issue is exactly the
same.
In h1c_frt_stream_new() and h1c_bck_stream_new(), if we fail to completely
initialize the freshly allocated h1s, typically because sc_attach_mux()
fails, we must use h1s_destroy() to de-initialize it. Otherwise it stays
attached to the h1c when released, causing use-after-free upon the next
wakeup. This can be triggered upon memory shortage.
This needs to be backported to 2.6.
Using -dMfail alone does nothing unless tune.fail-alloc is set, which
renders it pretty useless as-is, and is not intuitive. Let's change
this so that the filure rate is preset to 1% when the option is set on
the command line. This allows to inject failures without having to edit
the configuration.
If we fail to allocate a new stream in sc_new_from_endp(), and the call
to sc_new() allocated the sedesc itself (which normally doesn't happen),
then it doesn't get released on the failure path. Let's explicitly
handle this case so that it's not overlooked and avoids some head
scratching sessions.
This may be backported to 2.6.
There's an occasional crash that can be triggered in sc_detach_endp()
when calling conn->mux->detach() upon memory allocation error. The
problem in fact comes from sc_attach_mux(), which doesn't reset the
sc type flags upon tasklet allocation failure, leading to an attempt
at detaching an incompletely initialized stconn. Let's just attach
the sc after the tasklet allocation succeeds, not before.
This must be backported to 2.6.
On the allocation error path in h2_init() we may check if
h2c->wait_event.tasklet needs to be released but it has not yet been
zeroed. Let's do this before jumping to the freeing location.
This needs to be backported to all maintained versions.
In h2s_close() we may dereference h2s->sd to get the sc, but this
function may be called on allocation error paths, so we must check
for this specific condition. Let's also update the comment to make
it explicitly permitted.
This needs to be backported to 2.6.
In stream_free() if we fail to allocate s->scb() we go to the path where
we try to free it, and it doesn't like being called with a null at all.
It's easily reproducible with -dMfail,no-cache and "tune.fail-alloc 10"
in the global section.
This must be backported to 2.6.
This bug arrived with this commit:
"MINOR: quic: implement qc_notify_send()".
The ->tx.pto_probe variable was no more set when qc_processt_timer() the timer
task for the connection responsible of detecting packet loss and probing upon
PTO expiration leading to interrupted stream transfers. This was revealed by
blackhole interop failed tests where one could see that qc_process_timer()
was wakeup without traces as follows in the log file:
"needs to probe 01RTT packet number space"
Must be backported to 2.7 and to 2.6 if the commit mentionned above
is backported to 2.6 in the meantime.
The ACK frame range of packets were handled from the largest to the smallest
packet number, leading to big number of ebtree insertions when the packet are
handled in the inverse way they are sent. This was detected a long time ago
but left in the code to stress our implementation. It is time to be more
efficient and process the packet so that to avoid useless ebtree insertions.
Modify qc_ackrng_pkts() responsible of handling the acknowledged packets from an
ACK frame range of acknowledged packets.
Must be backported to 2.7.
Despite having replaced the SSL BIOs to use our own raw_sock layer, we
still didn't exploit the CO_SFL_MSG_MORE flag which is pretty useful to
avoid sending incomplete packets. It's particularly important for SSL
since the extra overhead almost guarantees that each send() will be
followed by an incomplete (and often odd-sided) segment.
We already have an xprt_st set of flags to pass info to the various
layers, so let's just add a new one, SSL_SOCK_SEND_MORE, that is set
or cleared during ssl_sock_from_buf() to transfer the knowledge of
CO_SFL_MSG_MORE. This way we can recover this information and pass
it to raw_sock.
This alone is sufficient to increase by ~5-10% the H2 bandwidth over
SSL when multiple streams are used in parallel.
Traces show that sendto() rarely has MSG_MORE on H2 despite sending
multiple buffers. The reason is that the loop iterating over the buffer
ring doesn't have this info and doesn't pass it down.
But now we know how many buffers are left to be sent, so we know whether
or not the current buffer is the last one. As such we can set this flag
for all buffers but the last one.
Before muxes were used, we used to refrain from reading past the
buffer's reserve. But with muxes which have their own buffer, this
rule was a bit forgotten, resulting in an extraneous read to be
performed just because the rx buffer cannot be entirely transferred
to the stream layer:
sendto(12, "GET /?s=16k HTTP/1.1\r\nhost: 127."..., 84, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 84
recvfrom(12, "HTTP/1.1 200\r\nContent-length: 16"..., 16320, 0, NULL, NULL) = 16320
recvfrom(12, ".123456789.12345", 16, 0, NULL, NULL) = 16
recvfrom(12, "6789.123456789.12345678\n.1234567"..., 15244, 0, NULL, NULL) = 182
recvfrom(12, 0x1e5d5d6, 15062, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
Here the server sends 16kB of payload after a headers block, the mux reads
16320 into the ibuf, and the stream layer consumes 15360 from the first
h1_rcv_buf(), which leaves 960 into the buffer and releases a few indexes.
The buffer cannot be realigned due to these remaining data, and a subsequent
read is made on 16 bytes, then again on 182 bytes.
By avoiding to read too much on the first call, we can avoid needlessly
filling this buffer:
recvfrom(12, "HTTP/1.1 200\r\nContent-length: 16"..., 15360, 0, NULL, NULL) = 15360
recvfrom(12, "456789.123456789.123456789.12345"..., 16220, 0, NULL, NULL) = 1158
recvfrom(12, 0x1d52a3a, 15062, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
This is much more efficient and uses less RAM since the first buffer that
was emptied can now be released.
Note that a further improvement (tested) consists in reading even less
(typically 1kB) so that most of the data are transferred in zero-copy, and
are not read until process_stream() is scheduled. This patch doesn't do that
for now so that it can be backported without any obscure impact.
CertiK Skyfall Team reported that passing an index greater than
QPACK_SHT_SIZE in a qpack instruction referencing a literal field
name with name reference or and indexed field line will cause a
read out of bounds that may crash the process, and confirmed that
this fix addresses the issue.
This needs to be backported as far as 2.5.
sc-add-gpc() was implemented in 5a72d03 ("MINOR: stick-table:
implement the sc-add-gpc() action")
This new action was exposed everywhere sc-inc-gpc() is available,
except for http-after-response.
But there doesn't seem to be a technical constraint that prevents us from
exposing it in http-after-response.
It was probably overlooked, let's add it.
No backport needed, unless 5a72d03 ("MINOR: stick-table: implement the
sc-add-gpc() action") is being backported.
This patch follows this one which was not sufficient:
"BUG/MINOR: quic: Missing STREAM frame length updates"
Indeed, it is not sufficient to update the ->len and ->offset member
of a STREAM frame to move it forward. The data pointer must also be updated.
This is not done by the STREAM frame builder.
Must be backported to 2.6 and 2.7.
Emeric noticed that h2 bit-rate performance was always slightly lower
than h1 when the CPU is saturated. Strace showed that we were always
data in 2kB chunks, corresponding to the max_record size. What's
happening is that when this mechanism of dynamic record size was
introduced, the STREAMER flag at the stream level was relied upon.
Since all this was moved to the muxes, the flag has to be passed as
an argument to the snd_buf() function, but the mux h2 did not use it
despite a comment mentioning it, probably because before the multi-buf
it was not easy to figure the status of the buffer.
The solution here consists in checking if the mbuf is congested or not,
by checking if it has more than one buffer allocated. If so we set the
CO_SFL_STREAMER flag, otherwise we don't. This way moderate size
exchanges continue to be made over small chunks, but downloads will
be able to use the large ones.
While it could be backported to all supported versions, it would be
better to limit it to the last LTS, so let's do it for 2.7 and 2.6 only.
This patch requires previous commit "MINOR: buffer: add br_single() to
check if a buffer ring has more than one buf".
During performance tests, Emeric faced a case where the wakeups of
sc_conn_io_cb() caused by h2_resume_each_sending_h2s() was multiplied
by 5-50 and a lot of CPU was being spent doing this for apparently no
reason.
The culprit is h2_send() not behaving well with congested buffers and
small SSL records. What happens when the output is congested is that
all buffers are full, and data are emitted in 2kB chunks, which are
sufficient to wake all streams up again to ask them to send data again,
something that will obviously only work for one of them at best, and
waste a lot of CPU in wakeups and memcpy() due to the small buffers.
When this happens, the performance can be divided by 2-2.5 on large
objects.
Here the chosen solution against this is to keep in mind that as long
as there are still at least two buffers in the ring after calling
xprt->snd_buf(), it means that the output is congested and there's
no point trying again, because these data will just be placed into
such buffers and will wait there. Instead we only mark the buffer
decongested once we're back to a single allocated buffer in the ring.
By doing so we preserve the ability to deal with large concurrent
bursts while not causing a thundering herd by waking all streams for
almost nothing.
This needs to be backported to 2.7 and 2.6. Other versions could
benefit from it as well but it's not strictly necessary, and we can
reconsider this option if some excess calls to sc_conn_io_cb() are
faced.
Note that this fix depends on this recent commit:
MINOR: buffer: add br_single() to check if a buffer ring has more than one buf
When detaching a stream, if it's the last one and the mbuf is blocked,
we leave without freeing the stream yet. We also refresh the h2c task's
timeout, except that it's possible that there's no such task in case
there is no client timeout, causing a crash. The fix just consists in
doing this when the task exists.
This bug has always been there and is extremely hard to meet even
without a client timeout. This fix has to be backported to all
branches, but it's unlikely anyone has ever met it anyay.
The commit 5e1b0e7bf ("BUG/MEDIUM: connection: Clear flags when a conn is
removed from an idle list") introduced a regression. CO_FL_SAFE_LIST and
CO_FL_IDLE_LIST flags are used when the connection is released to properly
decrement used/idle connection counters. if a connection is idle, these
flags must be preserved till the connection is really released. It may be
removed from the list but not immediately released. If these flags are lost
when it is finally released, the current number of used connections is
erroneously decremented. If means this counter may become negative and the
counters tracking the number of idle connecitons is not decremented,
suggesting a leak.
So, the above commit is reverted and instead we improve a bit the way to
detect an idle connection. The function conn_get_idle_flag() must now be
used to know if a connection is in an idle list. It returns the connection
flag corresponding to the idle list if the connection is idle
(CO_FL_SAFE_LIST or CO_FL_IDLE_LIST) or 0 otherwise. But if the connection
is scheduled to be removed, 0 is also returned, regardless the connection
flags.
This new function is used when the connection is temporarily removed from
the list to be used, mainly in muxes.
This patch should fix#2078 and #2057. It must be backported as far as 2.2.
Some STREAM frame lengths were not updated before being duplicated, built
of requeued contrary to their ack offsets. This leads haproxy to crash when
receiving acknowledgements for such frames with this thread #1 backtrace:
Thread 1 (Thread 0x7211b6ffd640 (LWP 986141)):
#0 ha_crash_now () at include/haproxy/bug.h:52
No locals.
#1 b_del (b=<optimized out>, del=<optimized out>) at include/haproxy/buf.h:436
No locals.
#2 qc_stream_desc_ack (stream=stream@entry=0x7211b6fd9bc8, offset=offset@entry=53176, len=len@entry=1122) at src/quic_stream.c:111
Thank you to @Tristan971 for having provided such traces which reveal this issue:
[04|quic|5|c_conn.c:1865] qc_requeue_nacked_pkt_tx_frms(): entering : qc@0x72119c22cfe0
[04|quic|5|_frame.c:1179] qc_frm_unref(): entering : qc@0x72119c22cfe0
[04|quic|5|_frame.c:1186] qc_frm_unref(): remove frame reference : qc@0x72119c22cfe0 frm@0x72118863d260 STREAM_F uni=0 fin=1 id=460 off=52957 len=1122 3244
[04|quic|5|_frame.c:1194] qc_frm_unref(): leaving : qc@0x72119c22cfe0
[04|quic|5|c_conn.c:1902] qc_requeue_nacked_pkt_tx_frms(): updated partially acked frame : qc@0x72119c22cfe0 frm@0x72119c472290 STREAM_F uni=0 fin=1 id=460 off=53176 len=1122
Note that haproxy has much more chance to crash if this frame is the last one
(fin bit set). But another condition must be fullfilled to update the ack offset.
A previous STREAM frame from the same stream with the same offset but with less
data must be acknowledged by the peer. This is the condition to update the ack offset.
For others frames without fin bit in the same conditions, I guess the stream may be
truncated because too much data are removed from the stream when they are
acknowledged.
Must be backported to 2.6 and 2.7.
There is a bug in the smp_fetch_dport() function which affects the 'f' case,
also known as 'fc_dst_port' sample fetch.
conn_get_src() is used to retrieve the address prior to calling conn_dst().
But this is wrong: conn_get_dst() should be used instead.
Because of that, conn_dst() may return unexpected results since the dst
address is not guaranteed to be set depending on the conn state at the time
the sample fetch is used.
This was reported by Corin Langosch on the ML:
during his tests he noticed that using fc_dst_port in a log-format string
resulted in the correct value being printed in the logs but when he used it
in an ACL, the ACL did not evaluate properly.
This can be easily reproduced with the following test conf:
|frontend test-http
| bind 127.0.0.1:8080
| mode http
|
| acl test fc_dst_port eq 8080
| http-request return status 200 if test
| http-request return status 500 if !test
A request on 127.0.0.1:8080 should normally return 200 OK, but here it
will return a 500.
The same bug was also found in smp_fetch_dst_is_local() (fc_dst_is_local
sample fetch) by reading the code: the fix was applied twice.
This needs to be backported up to 2.5
[both sample fetches were introduced in 2.5 with 888cd70 ("MINOR:
tcp-sample: Add samples to get original info about client connection")]
When a connection error is encountered during a receive, the error is not
immediatly reported to the SE descriptor. We take care to process all
pending input data first. However, when the error is finally reported, a
fatal error is reported only if a read0 was also received. Otherwise, only a
pending error is reported.
With a raw socket, it is not an issue because shutdowns for read and write
are systematically reported too when a connection error is detected. So in
this case, the fatal error is always reported to the SE descriptor. But with
a SSL socket, in case of pure SSL error, only the connection error is
reported. This prevent the fatal error to go up. And because the connection
is in error, no more receive or send are preformed on the socket. The mux is
blocked till a timeout is triggered at the stream level, looping infinitly
to progress.
To fix the bug, during the demux stage, when there is no longer pending
data, the read error is reported to the SE descriptor, regardless the
shutdown for reads was received or not. No more data are expected, so it is
safe.
This patch should fix the issue #2046. It must be backported to 2.7.
Like for connection error code, when FD are dumps, the ssl error code is now
displayed. This may help to diagnose why a connection error occurred.
This patch may be backported to help debugging.
When FD are dumps, the connection error code is now displayed. This may help
to diagnose why a connection error occurred.
This patch may be backported to help debugging.
When HAproxy is stopping, the DNS resolutions must be stopped, except those
triggered from a "do-resolve" action. To do so, the resolutions themselves
cannot be destroyed, the current design is too complex. However, it is
possible to mute the resolvers tasks. The same is already performed with the
health-checks. On soft-stop, the tasks are still running periodically but
nothing if performed.
For the resolvers, when the process is stopping, before running a
resolution, we check all the requesters attached to this resolution. If t
least a request is a stream or if there is a requester attached to a running
proxy, a new resolution is triggered. Otherwise, we ignored the
resolution. It will be evaluated again on the next wakeup. This way,
"do-resolv" action are still working during soft-stop but other resoluation
are stopped.
Of course, it may be see as a feature and not a bug because it was never
performed. But it is in fact not expected at all to still performing
resolutions when HAProxy is stopping. In addution, a proxy option will be
added to change this behavior.
This patch partially fixes the issue #1874. It could be backported to 2.7
and maybe to 2.6. But no further.
On soft-stop, we must properlu stop backends and not only proxies with at
least a listener. This is mandatory in order to stop the health checks. A
previous fix was provided to do so (ba29687bc1 "BUG/MEDIUM: proxy: properly
stop backends"). However, only stop_proxy() function was fixed. When HAproxy
is stopped, this function is no longer used. So the same kind of fix must be
done on do_soft_stop_now().
This patch partially fixes the issue #1874. It must be backported as far as
2.4.
The ocsp-related CLI commands tend to work with OCSP_CERTIDs as well as
certificate paths so the path should also be added to the output of the
"show ssl ocsp-response" command when no certid or path is provided.
In order to increase usability, the "show ssl ocsp-response" also takes
a frontend certificate path as parameter. In such a case, it behaves the
same way as "show ssl cert foo.pem.ocsp".
If the last update before a deinit happens was successful, the pointer
to the httpclient in the ocsp update context was not reset while the
httpclient instance was already destroyed.
Instead of having a dedicated httpclient instance and its own code
decorrelated from the actual auto update one, the "update ssl
ocsp-response" will now use the update task in order to perform updates.
Since the cli command allows to update responses that were never
included in the auto update tree, a new flag was added to the
certificate_ocsp structure so that the said entry can be inserted into
the tree "by hand" and it won't be reinserted back into the tree after
the update process is performed. The 'update_once' flag "stole" a bit
from the 'fail_count' counter since it is the one less likely to reach
UINT_MAX among the ocsp counters of the certificate_ocsp structure.
This new logic required that every certificate_ocsp entry contained all
the ocsp-related information at all time since entries that are not
supposed to be configured automatically can still be updated through the
cli. The logic of the ssl_sock_load_ocsp was changed accordingly.
The dedicated proxy used for OCSP auto update is renamed OCSP-UPDATE
which should be more explicit than the previous HC_OCSP name. The
reference to the underlying httpclient is simply kept in the
documentation.
The certid is removed from the log line since it is not really
comprehensible and is replaced by the path to the corresponding frontend
certificate.
It is more a less a revert of the commit b65af26e1 ("MEDIUM: mux-pt: Don't
always set a final error on SE on the sending path"). The PT multiplexer is
so simple that an error on the sending path is terminal. Unlike other muxes,
there is no connection level here. However, instead of reporting an final
error by setting SE_FL_ERROR, we set SE_FL_EOS flag instead if a read0 was
received on the underlying connection. Concretely, it is always true with
the current design of the raw socket layer. But it is cleaner this way.
Without this patch, it is possible to block a TCP socket if a connection
error is triggered when data are sent (for instance a broken pipe) while the
upper stream does not expect to receive more data.
Note the patch above introduced a regression because errors handling at the
connection level is quite simple. All errors are final. But we must keep in
mind it may change. And if so, this will require to move back on a 2-step
errors handling in the mux-pt.
This patch must be backported to 2.7.
This one is printed as the iocb in the "show fd" output, and arguably
this wasn't very convenient as-is:
293 : st=0x000123(cl heopI W:sRa R:sRA) ref=0 gid=1 tmask=0x8 umask=0x0 prmsk=0x8 pwmsk=0x0 owner=0x7f488487afe0 iocb=0x50a2c0(main+0x60f90)
Let's unstatify it and export it so that the symbol can now be resolved
from the various points that need it.
This bug was revealed by h2load tests run as follows:
h2load -t 4 --npn-list h3 -c 64 -m 16 -n 16384 -v https://127.0.0.1:4443/
This open (-c) 64 QUIC connections (-n) 16384 h3 requets from (-t) 4 threads, i.e.
256 requests by connection. Such tests could not always pass and often ended with
such results displays by h2load:
finished in 53.74s, 38.11 req/s, 493.78KB/s
requests: 16384 total, 2944 started, 2048 done, 2048 succeeded, 14336
failed, 14336 errored, 0 timeout
status codes: 2048 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 25.92MB (27174537) total, 102.00KB (104448) headers (space
savings 1.92%), 25.80MB (27053569) data
UDP datagram: 3883 sent, 24330 received
min max mean sd ± sd
time for request: 48.75ms 502.86ms 134.12ms 75.80ms 92.68%
time for connect: 20.94ms 331.24ms 189.59ms 84.81ms 59.38%
time to 1st byte: 394.36ms 417.01ms 406.72ms 9.14ms 75.00%
req/s : 0.00 115.45 14.30 38.13 87.50%
The number of successful requests was always a multiple of 256.
Activating the traces also shew that some connections were blocked after having
successfully completed their handshakes due to the fact that the mux. The mux
is started upon the acceptation of the connection.
Under heavy load, some connections were never accepted. From the moment where
more than 4 (MAXACCEPT) connections were enqueued before a listener could be
woken up to accept at most 4 connections, the remaining connections were not
accepted ore lately at the second listener tasklet wakeup.
Add a call to tasklet_wakeup() to the accept list tasklet of the listeners to
wake up it if there are remaining connections to accept after having called
listener_accept(). In this case the listener must not be removed of this
accept list, if not at the next call it will not accept anything more.
Must be backported to 2.7 and 2.6.
In environments where SYSTEM_MAXCONN is defined when compiling, the
master will use this value instead of the original minimal value which
was set to 100. When this happens, the master process could allocate
RAM excessively since it does not need to have an high maxconn. (For
example if SYSTEM_MAXCONN was set to 100000 or more)
This patch fixes the issue by using the new define MASTER_MAXCONN which
define a default maxconn of 100 for the master process.
Must be backported as far as 2.5.
The goal is to send signals to random threads at random instants so that
they spin for a random delay in a relax() loop, trying to give back the
CPU to another competing hardware thread, in hope that from time to time
this can trigger in critical areas and increase the chances to provoke a
latent concurrency bug. For now none were observed.
For example, this command starts 64 such tasks waking after random delays
of 0-1ms and delivering signals to trigger such loops on 3 random threads:
for i in {1..64}; do
socat - /tmp/sock1 <<< "expert-mode on;debug dev delay-inj 2 3"
done
This command is only enabled when DEBUG_DEV is set at build time.
As mentioned in commit 237e6a0d6 ("BUG/MAJOR: fd/thread: fix race between
updates and closing FD"), a race was found during stress tests involving
heavy backend connection reuse with many competing closes.
Here the problem is complex. The analysis in commit f69fea64e ("MAJOR:
fd: get rid of the DWCAS when setting the running_mask") that removed
the DWCAS in 2.5 overlooked a few races.
First, a takeover from thread1 could happen just after fd_update_events()
in thread2 validates it holds the tmask bit in the CAS loop. Since thread1
releases running_mask after the operation, thread2 will succeed the CAS
and both will believe the FD is theirs. This does explain the occasional
crashes seen with h1_io_cb() being called on a bad context, or
sock_conn_iocb() seeing conn->subs vanish after checking it. This issue
can be addressed using a DWCAS in both fd_takeover() and fd_update_events()
as it was before the patch above but this is not portable to all archs and
is not easy to adapt for those lacking it, due to some operations still
happening only on individual masks after the thread groups were added.
Second, the checks after fd_clr_running() for the current thread being
the last one is not sufficient: at the exact moment the operation
completes, another thread may also set and drop the running bit and see
itself as alone, and both can call _fd_close_orphan() in parallel. In
order to prevent this from happening, we cannot rely on the absence of
others, we need an explicit flag indicating that the FD must be closed.
One approach that was attempted consisted in playing with the thread_mask
but that was not reliable since it could still match between the late
deletion and the early insertion that follows. Instead, a new FD flag
was added, FD_MUST_CLOSE, that exactly indicates that the call to
_fd_delete_orphan() must be done. It is set by fd_delete(), and
atomically cleared by the first one which checks it, and which is the
only one to call _fd_delete_orphan().
With both points addressed, there's no more visible race left:
- takeover() only happens under the connection list's lock and cannot
compete with fd_delete() since fd_delete() must first remove the
connection from the list before deleting the FD. That's also why it
doesn't need to call _fd_delete_orphan() when dropping its running
bit.
- takeover() sets its running bit then atomically replaces the thread
mask, so that until that's done, it doesn't validate the condition
to end the synchonization loop in fd_update_events(). Once it's OK,
the previous thread's bit is lost, and this is checked for in
fd_update_events()
- fd_update_events() can compete with fd_delete() at various places
which are explained above. Since fd_delete() clears the thread mask
as after setting its running bit and after setting the FD_MUST_CLOSE
bit, the synchronization loop guarantees that the thread mask is seen
before going further, and that once it's seen, the FD_MUST_CLOSE flag
is already present.
- fd_delete() may start while fd_update_events() has already started,
but fd_delete() must hold a bit in thread_mask before starting, and
that is checked by the first test in fd_update_events() before setting
the running_mask.
- the poller's _update_fd() will not compete against _fd_delete_orphan()
nor fd_insert() thanks to the fd_grab_tgid() that's always done before
updating the polled_mask, and guarantees that we never pretend that a
polled_mask has a bit before the FD is added.
The issue is very hard to reproduce and is extremely time-sensitive.
Some tests were required with a 1-ms timeout with request rates
closely matching 1 kHz per server, though certain tests sometimes
benefitted from saturation. It was found that adding the following
slowdown at a few key places helped a lot and managed to trigger the
bug in 0.5 to 5 seconds instead of tens of minutes on a 20-thread
setup:
{ volatile int i = 10000; while (i--); }
Particularly, placing it at key places where only one of running_mask
or thread_mask is set and not the other one yet (e.g. after the
synchronization loop in fd_update_events or after dropping the
running bit) did yield great results.
Many thanks to Olivier Houchard for this expert help analysing these
races and reviewing candidate fixes.
The patch must be backported to 2.5. Note that 2.6 does not have tgid
in FDs, and that it requires a change of output on fd_clr_running() as
we need the previous bit. This is provided by carefully backporting
commit d6e1987612 ("MINOR: fd: make fd_clr_running() return the previous
value instead"). Tests have shown that the lack of tgid is a showstopper
for 2.6 and that unless a better workaround is found, it could still be
preferable to backport the minimum pieces required for fd_grab_tgid()
to 2.6 so that it stays stable long.
In case too many thread groups are needed for the threads, we emit
an error indicating the problem. Unfortunately the threads and groups
counts were reversed.
This can be backported to 2.6.
The NUMA detection code tries not to interfer with any taskset the user
could have specified in init scripts. For this it compares the number of
CPUs available with the number the process is bound to. However, the CPU
count is retrieved after being applied an upper bound of MAX_THREADS, so
if the machine has more than 64 CPUs, the comparison always fails and
makes haproxy think the user has already enforced a binding, and it does
not pin it anymore to a single NUMA node.
This can be verified by issuing:
$ socat /path/to/sock - <<< "show info" | grep thread
On a dual 48-CPU machine it reports 64, implying that threads are allowed
to run on the second socket:
Nbthread: 64
With this fix, the function properly reports 96, and the output shows 48,
indicating that a single NUMA node was used:
Nbthread: 48
Of course nothing is changed when "no numa-cpu-mapping" is specified:
Nbthread: 64
This can be backported to 2.4.
This issue was revealed by "Multiple streams" QUIC tracker test which very often
fails (locally) with a file of about 1Mbytes (x4 streams). The log of QUIC tracker
revealed that from its point of view, the 4 files were never all received entirely:
"results" : {
"stream_0_rec_closed" : true,
"stream_0_rec_offset" : 1024250,
"stream_0_snd_closed" : true,
"stream_0_snd_offset" : 15,
"stream_12_rec_closed" : false,
"stream_12_rec_offset" : 72689,
"stream_12_snd_closed" : true,
"stream_12_snd_offset" : 15,
"stream_4_rec_closed" : true,
"stream_4_rec_offset" : 1024250,
"stream_4_snd_closed" : true,
"stream_4_snd_offset" : 15,
"stream_8_rec_closed" : true,
"stream_8_rec_offset" : 1024250,
"stream_8_snd_closed" : true,
"stream_8_snd_offset" : 15
},
But this in contradiction with others QUIC tracker logs which confirms that haproxy
has really (re)sent the stream at the suspected offset(stream_12_rec_offset):
1152085,
"transport",
"packet_received",
{
"frames" : [
{
"frame_type" : "stream",
"length" : "155",
"offset" : "72689",
"stream_id" : "12"
}
],
"header" : {
"dcid" : "a14479169ebb9dba",
"dcil" : "8",
"packet_number" : "466",
"packet_size" : 190
},
"packet_type" : "1RTT"
}
When detected as losts, the packets are enlisted, then their frames are
requeued in their packet number space by qc_requeue_nacked_pkt_tx_frms().
This was done using a local list which was spliced to the packet number
frame list. This had as bad effect to retransmit the frames in the inverse
order they have been sent. This is something the QUIC tracker go client
does not like at all!
Removing the frame splicing fixes this issue and allows haproxy to pass the
"Multiple streams" test.
Must be backported to 2.7.
Normally the task_wakeup() in sock_conn_io_cb() is expected to
happen on the same thread the FD is attached to. But due to the
way the code was arranged in the past (with synchronous callbacks)
we continue to update connections after the wakeup, which always
makes the reader have to think deeply whether it's possible or not
to call another thread there. Let's just move the tasklet_wakeup()
at the end to make sure there's no problem with that.
This bug arrived with this commit:
b5a8020e9 MINOR: quic: RETIRE_CONNECTION_ID frame handling (RX)
and was revealed by h3 interop tests with clients like s2n-quic and quic-go
as noticed by Amaury.
Indeed, one must check that the CID matching the sequence number provided by a received
RETIRE_CONNECTION_ID frame does not match the DCID of the packet.
Remove useless ->curr_cid_seq_num member from quic_conn struct.
The sequence number lookup must be done in qc_handle_retire_connection_id_frm()
to check the validity of the RETIRE_CONNECTION_ID frame, it returns the CID to be
retired into <cid_to_retire> variable passed as parameter to this function if
the frame is valid and if the CID was not already retired
Must be backported to 2.7.
Since the following commit :
commit fb375574f9
MINOR: quic: mark quic-conn as jobs on socket allocation
quic-conn instances are marked as jobs. This prevent haproxy process to
stop while there is transfer in progress. To not delay process
termination, idle connections are woken up through their MUX instances
to be able to release them immediately.
However, there is no mechanism to wake up quic connections left on
closing or draining state. This means that haproxy process termination
is delayed until every closing quic connections timer has expired.
To improve this, a new function quic_handle_stopping() is called when
haproxy process is stopping. It simply wakes up the idle timer task of
all connections in the global closing list. These connections will thus
be released immediately to not interrupt haproxy process stopping.
This should be backported up to 2.7.