Commit 5ab9954faa9c815425fa39171ad33e75f4f7d56f introduced a new flag in
ssl_sock_ctx, to know that an ALPN was negociated, however, the way to
get the ssl_sock_ctx was wrong for QUIC. If we're using QUIC, get it
from the quic_conn.
This should fix crashes when attempting to use QUIC.
This function is a tasklet handler used to send peers updates, and it can
happen quite a bit in "show tasks" and "show profiling tasks", so let's
export it so that we don't face a cryptic symbol name:
$ socat - /tmp/haproxy-n10.stat <<< "show tasks"
Running tasks: 43 (8 threads)
function places % lat_tot lat_avg calls_tot calls_avg calls%
process_table_expire 16 37.2 1.072m 4.021s 115831 7239 15.4
task_process_applet 15 34.8 1.072m 4.287s 486299 32419 65.0
stktable_add_pend_updates 8 18.6 - - 89725 11215 12.0
sc_conn_io_cb 3 6.9 - - 5007 1669 0.6
process_peer_sync 1 2.3 4.293s 4.293s 50765 50765 6.7
This should be backported to 3.2 as it participates to debugging the
table+peers processing overhead.
The stick-table expiration of ref-counted entries was insufficiently
addresse by commit 324f0a60ab ("BUG/MINOR: stick-tables: never leave
used entries without expiration"), because now entries are just requeued
where they were, so they're visited over and over for long sessions,
causing process_table_expire() to loop, eating CPU and causing lock
contention.
Here we take care of refreshing their timeer when they are met, so
that we don't meet them more than once per stick-table lifetime. It
should address at least a part of the recent degradation that Felipe
noticed in GH #3084.
Since the fix above was marked for backporting to 3.2, this one should
be backported there as well.
resolve_sym_name() knows a number of symbols, but when one exactly matches
(e.g. a task's handler), it systematically displays the offset behind it
("+0"). Let's only show the offset when non-zero. This can be backported
as this is helpful for debugging.
The "show tasks" command can be useful to inspect run queues for active
tasks, but currently it's difficult to distinguish an occasional running
task from a heavily active one. Let's collect the number of calls for
each of them, report them average on the number of instances of each task
as well as a percentage of the total used. This way it even becomes
possible to get a hint about how CPU usage is distributed.
In 2.4, "show tasks" was introduced by commit 7eff06e162 ("MINOR:
activity: add a new "show tasks" command to list currently active tasks")
to expose some info about running tasks. The latency is not correct
because it's a u32 subtracted from a u64. It ought to have been casted
to u32 for the operation, which is what this patch does.
This can be backported to 2.4.
Since commit 5ab9954faa ("MINOR: ssl: Add a flag to let it known we have
an ALPN negociated"), when building with QUIC we get this warning:
src/ssl_sock.c: In function 'ssl_sock_advertise_alpn_protos':
src/ssl_sock.c:2189:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
Let's just move the instructions after the optional declaration. No
backport is needed.
Now that which ALPN gets negociated for a given server, use that to
decide if we can create the mux right away in connect_server(), and use
it in conn_install_mux_be().
That way, we may create the mux soon enough for early data to be sent,
before the handshake has been completed.
This commit depends on several previous commits, and it has not been
deemed important enough to backport.
The conditions to use early data on output are super tricky and
detected later, so that it's difficult to figure how this works. This
patch splits the condition in two parts, the one that can be performed
early that is based on config/client/etc. It is used to clear a variable
that allows early data to be used in case any condition is not satisfied.
It was purposely split into multiple independent and reviewable tests.
The second part remains where it was at the end, and is used to temporarily
clear the handshake flags to let the data layer use early data. This one
being tricky, a large comment explaining the principle was added.
The logic was not changed at all, only the code was made more readable.
Since 3.0 we have HAVE_SSL_0RTT precisely to avoid checking horribly
complicated and unmaintainable conditions to detect support for 0RTT.
Let's just drop the complex condition and use the macro instead.
Instead of trying to switch from delayed start to instant start based
on a single condition, let's do the opposite and preset the condition
to instant start and detect what could cause it to be delayed, thus
falling back to the slow mode. The condition remains exactly the
inverted one and better matches the comment about ALPN being the only
cause of such a delay.
The init_mux variable is currently used in a way that's not super easy
to grasp. It's set a bit too late and requires to know a lot of info at
once. Let's first rename it to "may_start_mux_now" to clarify its role,
as the purpose is not to *force* the mux to be initialized now but to
permit it to do it.
Add a new field in struct server, path parameters. It will contain
connection informations for the server that are not expected to change.
For now, just store the ALPN negociated with the server. Each time an
handhskae is done, we'll update it, even though it is not supposed to
change. This will be useful when trying to send early data, that way
we'll know which mux to use.
Each time the server goes down or is disabled, those informations are
erased, as we can't be sure those parameters will be the same once the
server will be back up.
How that we have a flag to let us know the ALPN has been set, we no
longer have to call ssl_sock_get_alpn() to know if the alpn has been
negociated already.
Remove the call to conn_create_mux() from ssl_sock_handshake(), and just
reuse the one already present in ssl_sock_io_cb() if we have received
early data, and if the flag is set.
Add a new flag to the ssl_sock_ctx, to be set as soon as the ALPN has
been negociated.
This happens before the handshake has been completed, and that
information will let us know that, when we receive early data, if the
ALPN has been negociated, then we can immediately create a mux, as the
ALPN will tell us which mux to use.
If we received early data, and an ALPN has been negociated, then
immediately try to create a mux if we did not have one already.
Generally, at this point we would not have one, as the mux is decided by
the ALPN, however at this point, even if the handshake is not done yet,
we have enough to determine the ALPN, so we can immediately create the
mux.
Doing so makes up able to treat the request immediately, without waiting
for the handshake to be done.
This should be backported up to 2.8.
In h1_recv_allowed(), do not forbid the reception if we are yet to
complete the connection, if we have received early data on it. That way,
we can deal with them right away, instead of waiting for the handshake
to be done.
This should be backported up to 2.8.
Recent fix 2421c3769a ("BUG/MEDIUM: peers: don't fail twice to grab the
update lock") improved the situation a lot for peers under locking
contention but still not enough for situations with many peers and
many entries to expire fast. It's indeed still possible to trigger
warnings at end of injection sessions for 16 peers at 100k req/s each
doing 10 random track-sc when process_table_expire() runs and holds the
update lock if compiled with a high value of STKTABLE_MAX_UPDATES_AT_ONCE
(1000). Better just not insist in this case and postpone the update.
At this point, under load only ebmb_lookup() consumes CPU, other functions
are in the few percent, indicating reasonable contention, and peers remain
updated.
This should be backported to 3.2 after a bit of testing.
This one doesn't need to wait forever, if it cannot work it can postpone
it. When building with a high value of STKTABLE_MAX_UPDATES_AT_ONCE (1000),
it's still possible to trigger warnings in this function on the write lock
that is contended by peers and expiration. Changing it for a trylock resolves
the issue.
This should be backported to 3.2 after a bit of testing.
process_table_expire() can take quite a lot of time running over all
shards. During this time it will hinder track-sc rules and peers, which
will experience an increased latency to do their work, especially peers
where each message will cause a lock, whose cumulated time can exceed
the watchdog's patience.
Here, we proceed just like in stktable_trash_oldest(), which is that
we're using a trylock to detect contention. The first time it happens,
if we hadn't purged anything, we switch to a regular lock to perform
the operation, and next time it happens we abort. This guarantees that
some entries will be expired and that contention will be reduced with
when detected.
With this change, various tests didn't manage to produce any warning,
including at the end of the load generation session.
This should be backported to 3.2 after a bit more testing.
stktable_trash_oldest() does insist a lot on purging what was requested,
only limited by STKTABLE_MAX_UPDATES_AT_ONCE. This is called in two
conditions, one to allocate a new stksess, and the other one to purge
entries of a stopping process. The cost of iterating over all shards
is huge, and a shard lock is taken each time before looking up entries.
Moreover, multiple threads can end up doing the same and looking hard for
many entries to purge when only one is needed. Furthermore, all threads
start from the same shard, hence synchronize their locks. All of this
costs a lot to other operations such as access from peers.
This commit simplifies the approach by ignoring the budget, starting
from a random shard number, and using a trylock so as to be able to
give up early in case of contention. The approach chosen here consists
in trying hard to flush at least one entry, but once at least one is
evicted or at least one trylock failed, then a failure on the trylock
will result in finishing.
The function now returns a success as long as one entry was freed.
With this, tests no longer show watchdog warnings during tests, though
a few still remain when stopping the tests (which are not related to
this function but to the contention from process_table_expire()).
With this change, under high contention some entries' purge might be
postponed and the table may occasionally contain slightly more entries
than their size (though this already happens since stksess_new() first
increments ->current before decrementing it).
Measures were made on a 64-core system with 8 peers
of 16 threads each, at CPU saturation (350k req/s each doing 10
track-sc) for 10M req, with 3 different approaches:
- this one resulted in 1500 failures to find an entry (0.015%
size overhead), with the lowest contention and the fairest
peers distibution.
- leaving only after a success resulted in 229 failures (0.0029%
size overhead) but doubled the time spent in the function (on
the write lock precisely).
- leaving only when both a success and a failed lock were met
resulted in 31 failures (0.00031% overhead) but the contention
was high enough again so that peers were not all up to date.
Considering that a saturated machine might exceed its entries by
0.015% is pretty minimal, the mechanism is kept.
This should be backported to 3.2 after a bit more testing as it
resolves some watchdog warnings and panics. It requires precedent
commit "MINOR: stick-table: permit stksess_new() to temporarily
allocate more entries" to over-allocate instead of failing in case
of contention.
stksess_new() calls stktable_trash_oldest() to release some entries.
If it fails however, it will fail to allocate an entry. This is a problem
because it doesn't permit stktable_trash_oldest() to be used in best effort
mode, which forces it to impose high contention. There's no problem with
allocating slightly more in practice. In the worst case if all entries are
in use, it's not shocking to temporarily exceed the number of entries by a
few units.
Let's relax this problematic rule. This patch might need to be backported
to 3.2 after a bit more testing in order to support locking relaxation.
The following functions take locks and are often involved in warnings
but are currently not resolved, so let's export them so that they are
properly decoded:
peer_prepare_updatemsg(), peer_send_teachmsgs(),
peer_treat_updatemsg(), peer_send_msgs(), peer_io_handler()
This should be backported to 3.2.
When task profiling is enabled, the current thread knows when the
currently running task was woken up and called, so we can calculate
how long ago it was woken up and called. This is convenient to figure
whether or not a warning or panic is caused by this task or by a
previous one, so let's report this info in thread outputs when known.
It would be useful to backport this to 3.2.
When multiple similar warnings are emitted, it can be difficult to know
whether only one task is looping slowly or if many are sharing the CPU.
Let's report the number of context switches and polling loop turns in
thread dumps so that warnings are easier to understand.
This should be backported to 3.2.
Normally the connect loop cannot loop, but some recent traces can easily
convince one of the opposite. Let's add a counter, including in panic
dumps, in order to avoid the repeated long head scratching sessions
starting with "and what if...". In addition, if it's found to loop, this
time it will be certain and will indicate what to zoom in. This should
be backported to 3.2.
Warning and panic messages currently do not report the PID. This is
annoying when trying to reproduce problems because warnings do not
allow know which process to attach to in order to debug, and panics
do not permit to know which core dump corresponds to which dump.
Let's add them in both messages. This should probably be backported
at least to 3.2.
QUIC is now supported on the backend side. The previous commit ensures
that simple checks can be activated on QUIC servers without any issue.
The current patch ensures that check server settings remain compatible
with a QUIC server. Thus, configuration is now invalid if check
specifies an explicit MUX proto other than QUIC, disables SSL or try to
use PROXY protocol.
Previously, checks were only performed on TCP. However, QUIC is now
supported on backend. Prior to this patch, check activation for QUIC
servers would result in a crash.
To ensure compatibility between QUIC servers and checks, adjust
protocol_lookup() performed during check connect step. Instead of using
a hardcoded PROTO_TYPE_STREAM, the value is now derived from server
settings.
This does not need to be backported.
If no specific check settings are defined on a server line, it is
expected that these checks will be performed with the same parameters as
normal connections on the same server.
ALPN must be carefully taken into account for checks. Most notably, MUX
initialization is delayed so that it is performed only after SSL
handshake.
Prior to this patch, MUX init delay was only performed if ALPN was
defined via check settings. Thus, with the following settings, checks
would be performed on HTTP/1.1 without consulting ALPN negotiation
result from the server :
server s1 127.0.0.1:443 ssl crt <...> alpn h2 check
This bug may result in checks reporting failure, for example in case of
a server answering HTTP/2 to ALPN negotiation to the configuration
above. Besides, there is incoherency between normal and check
connections, which is not what the documentation specifies.
This patch fixes this code. Now server parameters are also taken into
account. This ensures that checks and normal connections by default
use the same connection method.
This must be backported up to 2.4.
To ensure ALPN is properly applied on checks, MUX initialization is
delayed so that it is created on SSL handshake completion. However, this
does not check if SSL is really active for the connection.
This patch adjusts the condition so that MUX init is not delayed if SSL
is not active for the check connection. A similar process is already
conducted for normal connections via connect_server().
This must be backported up to 2.4. Despite not being a bug, it must be
backported for the following patch which fixes check ALPN inheritance
from server settings.
HTTP/0.9 is available on top of QUIC. This protocol is reserved for
internal use, mostly interop purpose.
This patch adjusts HTTP/0.9 layer with the following changes :
* version is not emitted anymore on the status line. This is performed
as some servers does not parse it correctly.
* status line is set explicitely on HTX status-line. This ensures the
correct HTTP status code is reported to the upper stream layer.
This does not need to be backported.
This patch relies on the previous one ("BUG/MEDIUM: mux-h2: Report RST/error to
app-layer stream during 0-copy fwding").
When the end of the connection is detected, so when the H2_CF_END_REACHED
flag is set after the shutdown was received and all incoming data were
processed, if a stream is blocked by the flow control (the stream one or the
connection one), an error must be reported to the app-layer stream.
Otherwise, outgoing data won't be sent and the opposite side will handle
this as a lack of room. So the stream will be blocked until the write
timeout is triggerd. By reporting the error early, the stream can be
immediately closed.
This patch should be backported to 3.2. For older versions, it is probably a
good idea to wait for bug report.
In h2_nego_ff(), it is important to report reset and error to app-layer
stream and to send the RST-STREAM frame accordingly. It is not clear if it
is an issue or not. But it is clearly a difference with the classical
forwarding via h2_snd_buf. And it is mandatory for the next fix.
This patch should be backported to 3.2. But is is probably a good idea to
not backport it on older versions, except if a bug is reported in this area.
This only happens when a connection error is detected or when the H2
connection is in ERR/ERR2 state. The demux buffer is explicitly reset. In
that case, it is important to remove the flag reporting this buffer as full.
It is probably worth to backport this patch to 3.2. But it is not mandatory
on older versions because it does not fix any known issue.
When the mbuf ring buffer is full, the flag H2_CF_DEM_MROOM is set on the H2
connection to block any demux. It is important to properly handle ACK
frames. However, we must take care to restart reading when some data were
removed from the mbuf. Otherwise, we may block the demux for no reason. It
is especially an issue if the demux buffer is full. In that case, the H2
connection is blocked, waiting for the timeout.
This patch should be backported to 3.2. But is is probably a good idea to
not backport it on older versions, except if a bug is reported in this area.
The H2 connection is switched to ERR when a GOAWAY must be sent and in ERR2
when it is sent. In these states, no more data can be emitted by the
mux. But there is no reason to not try to process incoming data or to not
try to receive data. It is espcially important to be able to get the
shutdown from the TCP connection when a SSL connection was previously
detected. Otherwise, it is possible to block a H2 connection until its
timeout expiration to be able to close it.
This patch should be backported to 3.2. But is is probably a good idea to
not backport it on older versions, except if a bug is reported in this
area.
When an send error is detected on the underlying connection, a pending error
is reported to the H2 connection by setting H2_CF_ERR_PENDING flag. When
this happen the tail of the mux ring buffer is reset. However some blocking
flags remain set and have no chance to be removed later because of the
pending error. Especially the flag H2_CF_DEM_MROOM which block data
demultiplexing. Thus, it is possible to block a H2 connection with unparsed
incoming data.
Worse, if a read event is received, it could lead to a wakeup loop between
the H2 connection and the underlying SSL connection. The H2 connection is
unable to convert the pending error to a fatal error because the
demultiplexing is blocked. In the mean time, it tries to receive more data
because of the not-consumed read event. On the underlying connection side,
the error detected earlier blocks the read, but the H2 connection is woken
up to handle the error.
To fix the issue, blocking flags must be removed when a send error is caught,
H2_CF_MUX_MFULL and H2_CF_DEM_MROOM flags. But, it is not necessary to only
release the tail of the mbuf ring. When a send error is detected, all outgoing
data can be flushed. So, now, in h2_send(), h2_release_mbuf() function is called
on pending error. The mbuf ring is fully released and H2_CF_MUX_MFULL and
H2_CF_DEM_MROOM flags are removed.
Many thanks to Krzysztof Kozłowski for its help to spot this issue.
This patch could be backported at least as far as 2.8. But it is a bit
sensitive. So, it is probably a good idea to backport it to 3.2 for now and
wait for bug report on older versions.
Previously, GSO emission was explicitely disabled on backend side. This
is not true since the following patch, thus GSO can be used, for example
when transfering large POST requests to a HTTP/3 backend.
commit e064e5d46171d32097a84b8f84ccc510a5c211db
MINOR: quic: duplicate GSO unsupp status from listener to conn
However, GSO on the backend side may cause crash when handling EIO. In
this case, GSO must be completely disabled. Previously, this was
performed by flagging listener instance. In backend side, this would
cause a crash as listener is NULL.
This patch fixes it by supporting GSO disable flag for servers. Thus, in
qc_send_ppkts(), EIO can be converted either to a listener or server
flag depending on the quic_conn proxy side. On backend side, server
instance is retrieved via <qc.conn.target>. This is enough to guarantee
that server is not deleted.
This does not need to be backported.
Historically, when the purge of pools was forced by sending a SIGQUIT to
haproxy, information about the pools were first dumped. It is now totally
pointless because these info can be retrieved via the CLI. It is even less
relevant now because the purge is forced typically when there are memroy
issues and to dump pools information, data must be allocated.
dump_pools_info() function was simplified because it is now called only from
an applet. No reason to still try to dump info on stderr.
The "show pools" CLI command was not designed to dump information exceeding
the size of a buffer. But there is now much more pools than few years ago
and when detailed information are dumped, we exceeds the buffer limit and
the output is truncated.
To fix the issue, the command must be refactored to be able to stream the
result. To do so, the array containing pools info is now part of the command
context and it is dynamically allocated. A dedicated function was created to
fill all info. In addition, the index of the next pool to dump is saved in
the command context too to properly handle resumption cases. Finally global
information about pools are also stored in the command context for
convenience.
This patch should fix the issue #3067. It must be backported to 3.2. On
older release, the buffer limit is never reached.
First, the barrier to delay the client execution was moved before the client
definition. Otherwise, the connection is established too early and with
short timeouts it could be closed before the requests are sent.
The main purpose of the barrier was to workaround slow health-checks. This
is also the reason why the script was flagged as slow. But it can be
significantly speed-up by setting a slow "inter" value. It is now set to
100ms and the script is no longer slow.
The below patch fixes padding emission for small packets, which is
required to ensure that header protection removal can be performed by
the recipient.
commit d7dea408c64c327cab6aebf4ccad93405b675565
BUG/MINOR: quic: too short PADDING frame for too short packets
In addition to the proper fix, constant QUIC_HP_SAMPLE_LEN was removed
and replaced by QUIC_TLS_TAG_LEN. However, it still makes sense to have
a dedicated constant which represent the size of the sample used for
header protection. Thus, this patch restores it.
Special instructions for backport : above patch mentions that no
backport is needed. However, this is incorrect, as bug is introduced by
another patch scheduled for backport up to 2.6. Thus, it is first
mandatory to schedule d7dea408c64c327cab6aebf4ccad93405b675565 after it.
Then, this patch can also be used for the sake of code clarity.
Define a new "quic_tx" unit-test which is used to test QUIC TX module.
For the moment, a single test is performed on qc_do_build_pkt(). It
checks that PADDING is correctly added for HP sampling in case of a
small packet.
As reported in GH #3104, there remained a place where (1 << shift was
used to set or remove bits from uint64_t users bitfield. It is incorrect
and could lead to bugs for values > 32 bits.
Instead, let's use 1ULL to ensure the operation remains 64bits consistent.
No backport needed.
Willy reported that the following config would segfault right after the
"removing incomplete section 'peer' is emitted:
peers peers
bind :2300
server n10 127.0.0.1:2310
listen dummy
bind localhost:9999
This is caused by the fact that stop_proxy(), which tries to read shared
counters, is called during early init while shared counters are not yet
initialized. To fix the crash, let's check if we're still during starting
phase, in which case we assume the counters are not initialized and we
assume 0 value instead.
No backport needed unless 16eb0fab31 ("MAJOR: counters: dispatch counters
over thread groups") is.
Mimic the same behavior as the one for SSL/TCP connetion to implement the
SSL session reuse.
Extract the code which try to reuse the SSL session for SSL/TCP connections
to implement ssl_sock_srv_try_reuse_sess().
Call this function from QUIC ->init() xprt callback (qc_conn_init()) as this
done for SSL/TCP connections.
When kTLS is compiled in, make sure msg_controllen is initialized to 0.
If we're not actually kTLS, then it won't be set, but we'll check that
it is non-zero later to check if we ancillary data.
This does not need to be backported.
This should fix CID 1620865, as reported in github issue #3106.
As found in GH issue #3103, CPU_ISSET() on musl 1.25 doesn't match the man
page which says it's returning an int. The reason is pretty simple, it's
a macro that operates on the bits directly and returns the result of the
bit field applied to the mask as an unsigned long. Bits above 31 will
simply be dropped if returned as an int, which causes CPUs 32..63 to
appear as absent from cpu_sets.
The fix is trivial, it consists in just comparing the result against zero
(i.e. turning it to a boolean), but before it's merged and deployed we'll
have to face such deployments, so better implement the same workaround
in the code here since we have access to the raw long value.
This workaround should be backported to 3.0.