RFC9113 clarified a point regarding the payload from DATA frames sent to
closed streams. It must always be counted against the connection's flow
control. In practice it should really have no practical effect, but if
repeated upload attempts are aborted, this might cause the client's
window to progressively shrink since not being ACKed.
It's probably not necessary to backport this, unless another patch
depends on it.
The fetch will return true if the stream was redispatched: this is a
past action, thus we rename the fetch to better reflect its true
meaning and prevent confusions.
Documentation was updated.
While at it, the fetch was moved from internal states section to Layer 4
section, which is where it belongs.
No backport needed unless 92b2edb (" MINOR: stream: add "txn.redispatch"
fetch") gets backported.
If a certificate that has an OCSP uri is unused and gets added to a
crt-list with the ocsp auto update option "on", it would not have been
inserted into the auto update tree because this insertion was only
working on the first call of the ssl_sock_load_ocsp function.
If the configuration used a crt-list like the following:
cert1.pem *
cert2.pem [ocsp-update on] *
Then calling "del ssl crt-list" on the second line and then reverting
the delete by calling "add ssl crt-list" with the same line, then the
cert2.pem would not appear in the ocsp update list (can be checked
thanks to "show ssl ocsp-updates" command).
This patch ensures that in such a case we still perform the insertion in
the update tree.
This patch can be backported up to branch 2.8.
The ckch_store's free'ing function might end up calling
'ssl_sock_free_ocsp' if the corresponding certificate had ocsp data.
This ocsp cleanup function expects for the 'refcount_instance' member of
the certificate_ocsp structure to be 0, meaning that no live
ckch instance kept a reference on this certificate_ocsp structure.
But since in ckch_store_free we were destroying the ckch_data before
destroying the linked instances, the BUG_ON would fail during a standard
deinit. Reversing the cleanup order fixes the problem.
Must be backported to 2.8.
With the current way OCSP responses are stored, a single OCSP response
is stored (in a certificate_ocsp structure) when it is loaded during a
certificate parsing, and each ckch_inst that references it increments
its refcount. The reference to the certificate_ocsp is actually kept in
the SSL_CTX linked to each ckch_inst, in an ex_data entry that gets
freed when he context is freed.
One of the downside of this implementation is that is every ckch_inst
referencing a certificate_ocsp gets detroyed, then the OCSP response is
removed from the system. So if we were to remove all crt-list lines
containing a given certificate (that has an OCSP response), the response
would be destroyed even if the certificate remains in the system (as an
unused certificate). In such a case, we would want the OCSP response not
to be "usable", since it is not used by any ckch_inst, but still remain
in the OCSP response tree so that if the certificate gets reused (via an
"add ssl crt-list" command for instance), its OCSP response is still
known as well. But we would also like such an entry not to be updated
automatically anymore once no instance uses it. An easy way to do it
could have been to keep a reference to the certificate_ocsp structure in
the ckch_store as well, on top of all the ones in the ckch_instances,
and to remove the ocsp response from the update tree once the refcount
falls to 1, but it would not work because of the way the ocsp response
tree keys are calculated. They are decorrelated from the ckch_store and
are the actual OCSP_CERTIDs, which is a combination of the issuer's name
hash and key hash, and the certificate's serial number. So two copies of
the same certificate but with different names would still point to the
same ocsp response tree entry.
The solution that answers to all the needs expressed aboved is actually
to have two reference counters in the certificate_ocsp structure, one
for the actual ckch instances and one for the ckch stores. If the
instance refcount becomes 0 then we remove the entry from the auto
update tree, and if the store reference becomes 0 we can then remove the
OCSP response from the tree. This would allow to chain some "del ssl
crt-list" and "add ssl crt-list" CLI commands without losing any
functionality.
Must be backported to 2.8.
When deleting a crt-list line through a "del ssl crt-list" call on the
CLI, we ended up free'ing the corresponding ckch instances without fully
clearing their contents. It left some dangling references on other
objects because the attache SSL_CTX was not deleted, as well as all the
ex_data referenced by it (OCSP responses for instance).
This patch can be backported up to branch 2.4.
The only useful information taken out of the ckch_store in order to copy
an OCSP certid into a buffer (later used as a key for entries in the
OCSP response tree) is the ocsp_certid field of the ckch_data structure.
We then don't need to pass a pointer to the full ckch_store to
ckch_store_build_certid or even any information related to the store
itself.
The ckch_store_build_certid is then converted into a helper function
that simply takes an OCSP_CERTID and converts it into a char buffer.
When calling ckchs_dup (during a "set ssl cert" CLI command), if the
modified store had OCSP auto update enabled then the new certificate
would not keep the previous update mode and would not appear in the auto
update list.
This patch can be backported to 2.8.
At the beginning of the 3.0-dev cycle, the zero-copy forwarding support was
added only for the cache applet with an option to disable it. This was a
hack, waiting for a better integration with applets. It is now possible to
implement the zero-copy forwarding for any applets. So the specific option
for the cache applet was renamed to be used for all applets. And this option
is now also checked for the stats applet.
Concretely, 'tune.cache.zero-copy-forwarding' was renamed to
'tune.applet.zero-copy-forwarding'.
This field was introduced when the first implementation of the zero-copy
forwarding was added. It is now useless. However, we must still save the
body-size of the object in the cache.
Default .rcv_buf and .snd_buf functions that applets can use are now
specialized to manipulate raw buffers or HTX buffers.
Thus a TCP applet should use appctx_raw_rcv_buf() and appctx_raw_snd_buf()
while HTTP applet should use appctx_htx_rcv_buf() and appctx_htx_snd_buf().
Note that the appctx is now directly passed to these functions instead of
the SC.
Just like for the cache applet, it is now possible to send response to the
opposite side using the zero-copy forwarding. Internal functions were
slightly updated but there is nothing special to say. Except the requested
size during the nego stage is not exact.
Till now, for chunked messages, the H1 mux used the size requested during
the zero-copy forwarding negotiation as the chunk size. And till now, this
was accurate because the requested size was indeed the chunk size on the
producer side.
But this will be a problem to implement the zero-copy forwarding on some
applets because the content size is not known during the nego but only when
it is produced. Thanks to previous patches, it is now possible to know the
requested size is not exact and we are able to reserve a larger space to
write the chunk size later, in h1_done_ff(), with some padding.
Now, during the zero-copy forwarding negotiation, when the requested size is
exact, we are now able to check if it is bigger than the expected one or
not. If it is indeed bigger than expeceted, the zero-copy forwarding is
disabled, the error will be triggered later on the normal sending path.
It is now possible to use a flag during zero-copy forwarding negotiation to
specify the requested size is exact, it means the producer really expect to
receive at least this amount of data.
It can be used by consumer to prepare some processing at this stage, based
on the requested size. For instance, in the H1 mux, it is used to write the
next chunk size.
It is now possible to impose the length to represent the chunk size in the
function used to prepended the chunk size in a buffer (so before the chunk
itself). It is thus possible to reserve a specific space for an unknown
chunk size and padding it with leading '0' to use all the space and avoid
holes.
During zero-copy forwarding negotiation, a pseudo flag was already used to
notify the consummer if the producer is able to use kernel splicing or not. But
this was not extensible. So, now we use a true bitfield to be able to pass flags
during the negotiation. NEGO_FF_FL_* flags may be used now.
Of course, for now, there is only one flags, the kernel splicing support on
producer side (NEGO_FF_FL_MAY_SPLICE).
The cache applet will be refactored to use its own buffer. Thus, for now,
the zero-copy forwarding support is removed and it will be reintrocuded
later.
The HTTP stat applets and all internal functions was adapted to use its own
buffers instead of the channels ones. The CLI part was not refactored yet,
thus there are still some access to channels in the file. But for the HTTP
part, we no longer use the channels at all.
To do so, the HTTP stats applet now uses default .rcv_buf and .snd_buf
callback function. In addition, it sets appctx flags instead of SE ones.
We no longer test the opposite stream-connector to detect aborted partial
post. Applets must not try to access to info ouside their scope. This make
the code more sensitive to changes and it is a common source of bug.
Tests on the sedesc flags at the begining of the I/O handler should be
enough.
Thanks to this patch, it is possible to an applet to directly send data to
the opposite endpoint. To do so, it must implement <fastfwd> appctx callback
function and set SE_FL_MAY_FASTFWD flag.
Everything will be handled by appctx_fastfwd() function. The applet is only
responsible to transfer data. If it sets <to_forward> value, it is used to
limit the amount of data to forward.
This patch introduces the support for the callback function responsible to
produce data via the zero-copy forwarding mechanism. There is no
implementation for now. But <to_forward> field was added in the appctx
structure to let an applet inform how much data it want to forward. It is
not mandatory but it will be used during the zero-copy forwarding
negociation.
We have indroduced flags to deal with end of input, end of stream and errors
at the applet level. With this patch we make the link with the endpoint
descriptor.
In appctx_rcv_buf(), applet flags are converted to SE flags.
There is no shutdown for reads and send with applets. Both are performed
when the appctx is released. So instead of 2 flags, like for
muxes/connections, only one flag is used. But the idea is the same:
acknowledge the event at the applet level.
The appctx state was never really used as a state. It is only used to know
when an applet should be freed on the next wakeup. This can be converted to
a flag and the state can be removed. This is what this patch does.
Till now, we've extended the appctx state to add some flags. However, the
field name is misleading. So a bitfield was added to handle real flags. And
helper functions to manipulate this bitfield were added.
A dedicated function to run applets was introduced, in addition to the old
one, to deal with applets that use their own buffers. The main differnce
here is that this handler does not use channels at all. It performs a
synchronous send before calling the applet and performs a synchronous
receive just after.
No applets are plugged on this handler for now.
There is no tasklet to handle I/O subscriptions for applets, but functions
to deal with receives and sends from the SC layer were added. it meanse a
function to retrieve data from an applet with this synchronous version and a
function to push data to an applet wit this synchronous version.
It is pretty similar to the functions used for muxes but there are some
differences. So for now, we keep them separated.
Zero-copy forwarding is not supported for now. In addition, there is no
subscription mechanism.
In this patch, we add default functions to copy data from a channel to the
<inbuf> buffer of an applet (appctx_rcv_buf) and another on to copy data
from <outbuf> buffer of an applet to a channel (appctx_snd_buf).
These functions are not used for now, but they will be used by applets to
define their <rcv_buf> and <snd_buf> callback functions. Of course, it will
be possible for a specific applet to implement its own functions but these
ones should be good enough for most of applets. HTX and RAW buffers are
supported.
It is the first patch of a series aimed to align applets on connections.
Here, dedicated buffers are added for applets. For now, buffers are
initialized and helpers function to deal with allocation are added. In
addition, flags to report allocation failures or full buffers are also
introduced. <inbuf> will be used to push data to the applet from the stream
and <outbuf> will be used to push data from the applet to the stream.
sc_attach_applet() was changed to be able to fail and callers were updated
accordingly. For now it cannot fail but if this changes, callers will be
prepared to handle errors.
In sc_attach_applet, an untyped pointer (void *) was used to attach a SC on
an applet. There is no reason to not use the right type here. So now a
pointer on an appctx is explicitly used.
Avoid loss of precision when computing K cubic value.
Same issue when computing the congestion window value from cubic increase function
formula with possible integer varaiable wrap around.
Depends on this commit:
MINOR: quic: Code clarifications for QUIC CUBIC (RFC 9438)
Must be backported as far as 2.6.
The first version of our QUIC CUBIC implementation is confusing because relying on
TCP CUBIC linux kernel implementation and with references to RFC 8312 which is
obsoleted by RFC 9438 (August 2023) after our implementation. RFC 8312 is a little
bit hard to understand. RFC 9438 arrived with much more clarifications.
So, RFC 9438 is about "CUBIC for Fast Long-Distance Networks". Our implementation
for QUIC is not very well documented. As it was difficult to reread this
code, this patch adds only some comments at complicated locations and rename
some macros, variables without logic modifications at all.
So, the aim of this patch is to add first some comments and variables/macros renaming
to avoid embedding too much code modifications in the same big patch.
Some code modifications will come to adapt this CUBIC implementation to this new
RFC 9438.
Rename some macros:
CUBIC_BETA -> CUBIC_BETA_SCALED
CUBIC_C -> CUBIC_C_SCALED
CUBIC_BETA_SCALE_SHIFT -> CUBIC_SCALE_FACTOR_SHIFT (this is the scaling factor
which is used only for CUBIC_BETA_SCALED)
CUBIC_DIFF_TIME_LIMIT -> CUBIC_TIME_LIMIT
CUBIC_ONE_SCALED was added (scaled value of 1).
These cubic struct members were renamed:
->tcp_wnd -> ->W_est
->origin_point -> ->W_target
->epoch_start -> ->t_epoch
->remaining_tcp_inc -> remaining_W_est_inc
Local variables to quic_cubic_update() were renamed:
t -> elapsed_time
diff ->t
delta -> W_cubic_t
Add a grahpic curve about the CUBIC Increase function.
Add big copied & pasted RFC 9438 extracts in relation with the 3 different increase
function regions.
Same thing for the fast convergence.
Fix a typo about the reference to QUIC RFC 9002.
Must be backported as far as 2.6 to ease any further modifications to come.
If we were to enable 'ocsp-update' on a certificate that does not have
an OCSP URI, we would exit ssl_sock_load_ocsp with a negative error code
which would raise a misleading error message ("<cert> has an OCSP URI
and OCSP auto-update is set to 'on' ..."). This patch simply fixes the
error message but an error is still raised.
This issue was raised in GitHub #2432.
It can be backported up to branch 2.8.
Fetch will return true if the stream underwent a redispatch according to
"option redispatch" setting upon retries.
Documentation was added, and the "%rc" logformat alternative now mentions
the new fetch to properly emulate the logformat behavior.
This build issued was introduced by this previous commit which is a bugfix:
BUG/MINOR: quic: Wrong ack ranges handling when reaching the limit.
A BUG_ON() referenced <fist> variable in place of <first>.
Must be backported as far as 2.6 as the previous commit.
Acknowledgements ranges are used to build ACK frames. To avoid allocating too
much such objects, a limit was set to 32(QUIC_MAX_ACK_RANGES) by this commit:
MINOR: quic: Do not allocate too much ack ranges
But there is an inversion when removing the oldest range from its tree.
eb64_first() must be used in place of eb64_last(). Note that this patch
only does this modification in addition to rename <last> variable to <first>.
This bug leads such a h2load command to block when a request ends up not
being acknowledged by haproxy even if correctly served:
/opt/nghttp2/build/bin/h2load --alpn-list h3 -t 1 -c 1 -m 1 -n 100 \
https://127.0.0.1/?s=5m
There is a remaining question to be answered. In such a case, haproxy refuses to
reopen the stream, this is a good thing but should not haproxy ackownledge the
request (because correctly parsed again).
Note that to be easily reproduced, this setting had to be applied to the client
network interface:
tc qdisc add dev eth1 root netem delay 100ms 1s loss random
Must be backported as far as 2.6.
As noticed in this thread, some bogus configurations are not always easy
to spot: https://www.mail-archive.com/haproxy@formilux.org/msg44558.html
Here it was about config keywords being used in ACL patterns where strings
were expected, hence they're always valid.
Since we have the diag mode (-dD) we can perform some extra checks when
it's used, and emit them to suggest the user there might be an issue.
Here we detect a few common words (logic such as "and"/"or"/"||" etc),
C++/JS comments mistakenly used to try to isolate final args, and words
that have the exact name of a sample fetch or an ACL keyword. These checks
are only done in diag mode of course.
Final diags were added in 2.4 by commit 5a6926dcf ("MINOR: diag: create
cfgdiag module"), but it's called too late in the startup process,
because when "-c" is passed, the call is not made, while it's its primary
use case. Let's just move the call earlier.
Note that currently the check in this function is limited to verifying
unicity of server cookies in a backend, so it can be backported as far
as 2.4, but there is little value in insisting if it doesn't backport
easily.
Diag warnings were added in 2.4 by commit 7b01a8dbd ("MINOR: global:
define diagnostic mode of execution") but probably due to the split
function that checks for the mode, they did not reuse the emission of
the version string before the first warning, as was brought in 2.2 by
commit bebd21206 ("MINOR: init: report in "haproxy -c" whether there
were warnings or not"). The effet is that diag warnings are emitted
before the version string if there is no other warning nor error. Let's
just proceed like for the two other ones.
This can be backported to 2.4, though this is of very low importance.
Just like for stick-tables, this patch adds a promex module to dump
resolvers metrics. It adds the "resolver" scope and for now, it dumps
folloowing metrics:
* haproxy_resolver_sent
* haproxy_resolver_send_error
* haproxy_resolver_valid
* haproxy_resolver_update
* haproxy_resolver_cname
* haproxy_resolver_cname_error
* haproxy_resolver_any_err
* haproxy_resolver_nx
* haproxy_resolver_timeout
* haproxy_resolver_refused
* haproxy_resolver_other
* haproxy_resolver_invalid
* haproxy_resolver_too_big
* haproxy_resolver_outdated
It is now possible to selectively retrieve extra counters from stats
modules. H1, H2, QUIC and H3 fill_stats() callback functions are updated to
return a specific counter.
The list of modules registered on the stats to expose extra counters is now
public. It is required to export these counters into the Prometheus
exporter.
set-bc-{mark,tos} actions are pretty similar to set-fc-{mark,tos} to set
mark/tos on packets sent from haproxy to server: set-bc-{mark,tos} actions
act on the whole backend/srv connection: from connect() to connection
teardown, thus they may only be used before the connection to the server
is instantiated, meaning that they are only relevant for request-oriented
rules such as tcp-request or http-request rules. For now their use is
limited to content request rules, because tos and mark informations are
stored directly within the stream, thus it is required that the stream
already exists.
stream flags are used in combination with dedicated stream struct members
variables to pass 'tos' and 'mark' informations so that they are correctly
considered during stream connection assignment logic (prior to connecting
to actually connecting to the server)
'tos' and 'mark' fd sockopts are taken into account in conn hash
parameters for connection reuse mechanism.
The documentation was updated accordingly.
In this patch we add the possibility to use sample expression as argument
for set-fc-{mark,tos} actions. To make it backward compatible with
previous behavior, during parsing we first try to parse the value as
as integer (decimal or hex notation), and then fallback to expr parsing
in case of failure.
The documentation was updated accordingly.
This is a complementary patch to "MINOR: tcp-act: Rename "set-{mark,tos}"
to "set-fc-{mark,tos}"", but for the Lua API.
set_mark and set_tos were kept as aliases for set_fc_mark and set_fc_tos
but they were marked as deprecated.
Using this opportunity to reorder set_mark and set_tos by alphabetical
order.
"set-mark" and "set-tos" only alter packets from haproxy to client
(frontend connection). Since we may add support for equivalent keywords
on server side, we rename them with an explicit name to prevent
confusions.
Thus, we rename:
- "set-mark" to "set-fc-mark"
- "set-tos" to "set-fc-tos"
"set-mark" and "set-tos" were kept as aliases (to "set-fc-mark" and
"set-fc-tos" respectively) for now to prevent config breakage, but they
have been marked as deprecated so they can be removed in future version.
Some CPU time is needlessly wasted in conn_calculate_hash(), because all
params are first copied into a temporary buffer before computing the
hash on the whole buffer. Instead, let's leverage the XXH progressive
hash update functions to avoid expensive memcpys.
A major reorganization of QUIC MUX sending has been implemented. Now
data transfer occur over a single QCS buffer. This has improve
performance but at the cost of restrictions on snd_buf. Indeed, buffer
instances are now shared from stream callback snd_buf up to quic-conn
layer.
As such, snd_buf cannot manipulate freely already present data buffer.
In particular, realign has been completely removed by the previous
patches.
This commit reintroduces a partial realign support. This is only done if
the buffer contains only unsent data, via a new MUX function
qcc_realign_stream_txbuf() which is called during snd_buf.
This commit is a direct follow-up on the major rearchitecture of send
buffering. This patch implements the proper handling of connection pool
buffer temporary exhaustion.
The first step is to be able to differentiate a fatal allocation error
from a temporary pool exhaustion. This is done via a new output argument
on qcc_get_stream_txbuf(). For a fatal error, application protocol layer
will schedule the immediate connection closing. For a pool exhaustion,
QCC is flagged with QC_CF_CONN_FULL and stream sending process is
interrupted. QCS instance is also registered in a new list
<qcc.buf_wait_list>.
A new connection buffer can become available when all ACKs are received
for an older buffer. This process is taken in charge by quic-conn layer.
It uses qcc_notify_buf() function to clear QC_CF_CONN_FULL and to wake
up every streams registered on buf_wait_list to resume sending process.
This commit is a direct follow-up on the major rearchitecture of send
buffering. It allows application protocol to react if current QCS
sending buffer space is too small. In this case, the buffer can be
released to the quic-conn layer. This allows to allocate a new QCS
buffer and retry HTX parsing, unless connection buffer pool is already
depleted.
A new function qcc_release_stream_txbuf() serves as API for app protocol
to release the QCS sending buffer. This operation fails if there is
unsent data in it. In this case, MUX has to keep it to finalize transfer
of unsent data to quic-conn layer. QCS is thus flagged with
QC_SF_BLK_MROOM to interrupt snd_buf operation.
When all data are sent to the quic-conn layer, QC_SF_BLK_MROOM is
cleared via qcc_streams_sent_done() and stream layer is woken up to
restart snd_buf.
Note that a new function qcc_stream_can_send() has been defined. It
allows app proto to check if sending is currently blocked for the
current QCS. For now, it checks QC_SF_BLK_MROOM flag. However, it will
be extended to other conditions with the following patches.
The previous commit was a major rework for QUIC MUX sending process.
Following this, this patch cleans up a few elements that remains but can
be removed as they are duplicated.
Of notable changes, offset fields from QCS and QCC are removed. They are
both equivalent to flow control soft offsets.
A new function qcs_prep_bytes() is implemented. Its purpose is to return
the count of prepared data bytes not yet sent. It also replaces
qcs_need_sending().
Previously, QUIC MUX sending was implemented with data transfered along
two different buffer instances per stream.
The first QCS buffer was used for HTX blocks conversion into H3 (or
other application protocol) during snd_buf stream callback. QCS instance
is then registered for sending via qcc_io_cb().
For each sending QCS, data memcpy is performed from the first to a
secondary buffer. A STREAM frame is produced for each QCS based on the
content of their secondary buffer.
This model is useful for QUIC MUX which has a major difference with
other muxes : data must be preserved longer, even after sent to the
lower layer. Data references is shared with quic-conn layer which
implements retransmission and data deletion on ACK reception.
This double buffering stages was the first model implemented and remains
active until today. One of its major drawbacks is that it requires
memcpy invocation for every data transferred between the two buffers.
Another important drawback is that the first buffer was is allocated by
each QCS individually without restriction. On the other hand, secondary
buffers are accounted for the connection. A bottleneck can appear if
secondary buffer pool is exhausted, causing unnecessary haproxy
buffering.
The purpose of this commit is to completely break this model. The first
buffer instance is removed. Now, application protocols will directly
allocate buffer from qc_stream_desc layer. This removes completely the
memcpy invocation.
This commit has a lot of code modifications. The most obvious one is the
removal of <qcs.tx.buf> field. Now, qcc_get_stream_txbuf() returns a
buffer instance from qc_stream_desc layer. qcs_xfer_data() which was
responsible for the memcpy between the two buffers is also completely
removed. Offset fields of QCS and QCC are now incremented directly by
qcc_send_stream(). These values are used as boundary with flow control
real offset to delimit the STREAM frames built.
As this change has a big impact on the code, this commit is only the
first part to fully support single buffer emission. For the moment, some
limitations are reintroduced and will be fixed in the next patches :
* on snd_buf if QCS sent buffer in used has room but not enough for the
application protocol to store its content
* on snd_buf if QCS sent buffer is NULL and allocation cannot succeeds
due to connection pool exhaustion
One final important aspect is that extra care is necessary now in
snd_buf callback. The same buffer instance is referenced by both the
stream and quic-conn layer. As such, some operation such as realign
cannot be done anymore freely.
qcs_build_stream_frm() is responsible to generate a STREAM frame
pointing to the content of QCS TX buffer.
This patch moves send flow control overflow check from qcs_xfer_data()
to qcs_build_stream_frm(), i.e. from transfer between internal
QCS buffer and qc_stream_desc, to STREAM frame generation.
Flow control is both check at stream and connection level. For
connection flow control, as several frames are built before emission, an
accumulator is used as extra arguments to functions to account the total
length of already built frames.
This patch should not provide any functional changes. Its main purpose
is to prepare for the removal of QCS internal buffer.
Both QCS and QCC have their owned sent offset field. These fields store
the newest offset sent to the quic-conn layer. It is similar to QCS/QCC
flow control real offset. This patch removes them and replaces them by
the latter for code clarification.
MINOR: mux-quic: remove unneeded qcc.tx.sent_offsets field
This commit as a similar purpose as previous, except that it removes QCC
<sent_offsets> field, now equivalent to connection flow control real
offset.
This commit is a direct follow-up on the previous one. This time, it
deals with connection level flow control. Process is similar to stream
level : soft offset is incremented during snd_buf and real offset during
STREAM frame emission.
On MAX_DATA reception, both stream layer and QMUX is woken up if
necessary. One extra feature for conn level is the introduction of a new
QCC list to reference QCS instances. It will store instances for which
snd_buf callback has been interrupted on QCC soft offset reached. Every
stream instances is woken up on MAX_DATA reception if soft_offset is
unblocked.
This patch is the first of two to reimplement flow control emission
limits check. The objective is to account flow control earlier during
snd_buf stream callback. This should smooth transfers and prevent over
buffering on haproxy side if flow control limit is reached.
The current patch deals with stream level flow control. It reuses the
newly defined flow control type. Soft offset is incremented after HTX to
data conversion. If limit is reached, snd_buf is interrupted and stream
layer will subscribe on QCS.
On qcc_io_cb(), generation of STREAM frames is restricted as previously
to ensure to never surpass peer limits. Finally, flow control real
offset is incremented on lower layer send notification. Thus, it will
serve as a base offset for built STREAM frames. If limit is reached,
STREAM frames generation is suspended.
Each time QCS data flow control limit is reached, soft and real offsets
are reconsidered.
Finally, special care is used when flow control limit is incremented via
MAX_STREAM_DATA reception. If soft value is unblocked, stream layer
snd_buf is woken up. If real value is unblocked, qcc_io_cb() is
rescheduled.
Create a new module dedicated to flow control handling. It will be used
to implement earlier flow control update on snd_buf stream callback.
For the moment, only Tx part is implemented (i.e. limit set by the peer
that haproxy must respect for sending). A type quic_fctl is defined to
count emitted data bytes. Two offsets are used : a real one and a soft
one. The difference is that soft offset can be incremented beyond limit
unless it is already in excess.
Soft offset will be used for HTX to H3 parsing. As size of generated H3
is unknown before parsing, it allows to surpass the limit one time. Real
offset will be used during STREAM frame generation : this time the limit
must not be exceeded to prevent protocol violation.
Add a new argument to qcc_send_stream() to specify the count of sent
bytes.
For the moment this argument is unused. This commit is in fact a step to
implement earlier flow control update during stream layer snd_buf.
Ben Kallus kindly reported that we still hadn't blocked the NUL
character from header values as clarified in RFC9110 and that, even
though there's no known issure related to this, it may one day be
used to construct an attack involving another component.
Actually, both Christopher and I sincerely believed we had done it
prior to releasing 2.9, shame on us for missing that one and thanks
to Ben for the reminder!
The change was applied, it was confirmed to properly reject this NUL
byte from both header and trailer values, and it's still possible to
force it to continue to be supported using the usual pair of unsafe
"option accept-invalid-http-{request|response}" for those who would
like to keep it for whatever reason that wouldn't make sense.
This was tagged medium so that distros also remember to apply it as
a preventive measure.
It should progressively be backported to all versions down to 2.0.
Trailers are parsed using a temporary h1m struct, likely due to using
distinct h1 parser states. However, the err_pos field that's used to
decide whether or not to enfore option accept-invalid-http-request (or
response) was not initialized in this struct, resulting in using a
random value that may randomly accept or reject a few bad chars. The
impact is very limited in trailers (e.g. no message size is transmitted
there) but we must make sure that the option is respected, at least for
users facing the need for this option there.
The issue was introduced in 2.0 by commit 2d7c5395ed ("MEDIUM: htx:
Add the parsing of trailers of chunked messages"), and the code moved
from mux_h1.c to h1_htx.c in 2.1 with commit 4f0f88a9d0 ("MEDIUM:
mux-h1/h1-htx: move HTX convertion of H1 messages in dedicated file")
so the patch needs to be backported to all stable versions, and the
file adjusted for 2.0.
Always compile the test of the early_data variable in
"ssl_quic_initial_ctx", this way we can emit a warning about its support
or not.
The test was moved in a more simple preprocessor check which only checks
the new HAVE_SSL_0RTT_QUIC constant.
Could be backported to 2.9 with the 2 previous commits.
However AWS-LC must be excluded of HAVE_SSL_0RTT_QUIC in this version.
It is similar to the previous fix but for the chunk size parsing. But this
one is more annoying because a poorly coded application in front of haproxy
may ignore the last digit before the LF thinking it should be a CR. In this
case it may be out of sync with HAProxy and that could be exploited to
perform some sort or request smuggling attack.
While it seems unlikely, it is safer to forbid LF with CR at the end of a
chunk size.
This patch must be backported to 2.9 and probably to all stable versions
because there is no reason to still support LF without CR in this case.
When the message is chunked, all chunks must ends with a CRLF. However, on
old versions, to support bad client or server implementations, the LF only
was also accepted. Nowadays, it seems useless and can even be considered as
an issue. Just forbid LF only at the end of chunks, it seems reasonnable.
This patch must be backported to 2.9 and probably to all stable versions
because there is no reason to still support LF without CR in this case.
In several places in the source, there was the same block of code that was
used to deinitialize the log buffer. There were even two functions that
did this, but they were called only from the code that is in the same
source file (free_tcpcheck_fmt() in src/tcpcheck.c and free_logformat_list()
in src/proxy.c - they were both static functions).
The function free_logformat_list() was moved from the file src/proxy.c to
src/log.c, and a check of the list before freeing the memory was added to
that function.
A recent fix was introduced to ensure unsent data are deleted when a
QUIC MUX stream releases its qc_stream_desc instance. This is necessary
to ensure all used buffers will be liberated once all ACKs are received.
This is implemented by the following patch :
commit ad6b13d317 (quic-dev/qns)
BUG/MEDIUM: quic: remove unsent data from qc_stream_desc buf
Before this patch, buffer removal was done only on ACK reception. ACK
handling is only done in order from the oldest one. A BUG_ON() statement
is present to ensure this assertion remains valid.
This is however not true anymore since the above patch. Indeed, after
unsent data removal, the current buffer may be empty if it did not
contain yet any sent data. In this case, it is not the oldest buffer,
thus the BUG_ON() statement will be triggered.
To fix this, simply remove this BUG_ON() statement. It should not have
any impact as it is safe to remove buffers in any order.
Note that several conditions must be met to trigger this BUG_ON crash :
* a QUIC MUX stream is destroyed before transmitting all of its data
* several buffers must have been previously allocated for this stream so
it happens only for transfers bigger than bufsize
* latency should be high enough to delay ACK reception
This must be backported wherever the above patch is (currently targetted
to 2.6).
HTTP status codes outside of 100..599 are considered invalid in HTTP
RFC9110. However, it is explicitely stated that range 600..999 is often
used for internal communication so in practice haproxy must be lenient
with it.
Before this patch, QPACK encoder rejected these values. This resulted in
a connection error. Fix this by extending the range of allowed values
from 100 to 999.
This is linked to github issue #2422. Once again, thanks to @yokim-git
for his help here.
This must be backported up to 2.6.
A crash occurs in h3_resp_headers_send() if an invalid response code is
received from the backend side. Fix this by properly flagging the
connection on error. This will cause a CONNECTION_CLOSE.
This should fix github issue #2422.
Big thanks to ygkim (@yokim-git) for his help and reactivity. Initially,
GDB reported an invalid code source location due to heavy functions
inlining inside h3_snd_buf(). The issue was found after using -Og flag.
This must be backported up to 2.6.
Replace h3_debug_printf() by real trace for functions used by stream
layer snd_buf callback. A new event type H3_EV_STRM_SEND is created for
the occasion.
This should be backported up to 2.6 to help investigate H3 issues on
stable releases. Note that h3_nego_ff/h3_done_ff definition are not
available from 2.8.
It has been found that under some rare error circumstances,
SSL_do_handshake() could return with SSL_ERROR_WANT_READ without
even trying to call the read function, causing permanent wakeups
that prevent the process from sleeping.
It was established that this only happens if the retry flags are
not systematically cleared in both directions upon any I/O attempt,
but, given the lack of documentation on this topic, it is hard to
say if this rather strange behavior is expected or not, otherwise
why wouldn't the library always clear the flags by itself before
proceeding?
In addition, this only seems to affect OpenSSL 1.1.0 and above,
and does not affect wolfSSL nor aws-lc.
A bisection on haproxy showed that this issue was first triggered by
commit a8955d57ed ("MEDIUM: ssl: provide our own BIO."), which means
that OpenSSL's socket BIO does not have this problem. And this one
does always clear the flags before proceeding. So let's just proceed
the same way. It was verified that it properly fixes the problem,
does not affect other implementations, and doesn't cause any freeze
nor spurious wakeups either.
Many thanks to Valentn Gutirrez for providing a network capture
showing the incident as well as a reproducer. This is GH issue #2403.
This patch needs to be backported to all versions that include the
commit above, i.e. as far as 2.0.
QCS instances use qc_stream_desc for data buffering on emission. On
stream reset, its Tx channel is closed earlier than expected. This may
leave unsent data into qc_stream_desc.
Before this patch, these unsent data would remain after QCS freeing.
This prevents the buffer to be released as no ACK reception will remove
them. The buffer is only freed when the whole connection is closed. As
qc_stream_desc buffer is limited per connection, this reduces the buffer
pool for other streams of the same connection. In the worst case if
several streams are resetted, this may completely freeze the transfer of
the remaining connection streams.
This bug was reproduced by reducing the connection buffer pool to a
single buffer instance by using the following global statement :
tune.quic.frontend.conn-tx-buffers.limit 1.
Then a QUIC client is used which opens a stream for a large enough
object to ensure data are buffered. The client them emits a STOP_SENDING
before reading all data, which forces the corresponding QCS instance to
be resetted. The client then opens a new request but the transfer is
freezed due to this bug.
To fix this, adjust qc_stream_desc API. Add a new argument <final_size>
on qc_stream_desc_release() function. Its value is compared to the
currently buffered offset in latest qc_stream_desc buffer. If
<final_size> is inferior, it means unsent data are present in the
buffer. As such, qc_stream_desc_release() removes them to ensure the
buffer will finally be freed when all ACKs are received. It is also
possible that no data remains immediately, indicating that ACK were
already received. As such, buffer instance is immediately removed by
qc_stream_buf_free().
This must be backported up to 2.6. As this code section is known to
regression, a period of observation could be reserved before
distributing it on LTS releases.
On ACK reception, data are removed from buffer via qc_stream_desc_ack().
The buffer can be freed if no more data are left. A new slot is also
accounted in buffer connection pool. Extract this operation in a
dedicated private function qc_stream_buf_free().
This change should have no functional change. However it will be useful
for the next patch which needs to remove a buffer from another function.
This patch is necessary for the following bugfix. As such, it must be
backported with it up to 2.6.
Very minor modification to replace a statement with an hardcoded value by a macro.
Should be backported as far as 2.6 to ease any further modification to come.
There is a typo in the statement to initialize this variable when selecting
newreno as cc algo:
const char *newreno = "newrno";
This would have happened if #defines had be used in place of several const char *
variables harcoded values.
Take the opportunity of this patch to use #defines for all the available cc algorithms.
Must be backported to 2.9.
When a cache is "cold" and multiple clients simultaneously try to access
the same resource we must forward all the requests to the server. Next,
every "duplicated" response will be processed in http_action_store_cache
and we will try to cache every one of them regardless of whether this
response was already cached. In order to avoid having multiple entries
for a same primary key, the logic is then to first delete any
preexisting entry from the cache tree before storing the current one.
The actual previous response content will not be deleted yet though
because if the corresponding row is detached from the "avail" list it
might still be used by a cache applet if it actually performed a lookup
in the cache tree before the new response could be received.
This all means that we can end up using a valid row that references a
cache_entry that was already removed from the cache tree. This does not
pose any problem in regular caches (no 'vary' mechanism enabled) because
the applet only works on the data and not the 'cache_entry' information,
but in the "vary" context, when calling 'http_cache_applet_release' we
might call 'delete_entry' on the given entry which in turn tries to
iterate over all the secondary entries to find the right one in which
the secondary entry counter can be updated. We would then call
eb32_next_dup on an entry that was not in the tree anymore which ended
up crashing.
This crash was introduced by "48f81ec09 : MAJOR: cache: Delay cache
entry delete in reserve_hot function" which added the call to
"release_entry" in "http_cache_applet_release" that ended up crashing.
This issue was raised in GitHub #2417.
This patch must be backported to branch 2.9.
This is cleanup patch to address cosmetic issues introduced in f034139bc0
("MINOR: lua: Allow reading "proc." scoped vars from LUA core.")
Also taking this opportunity to prefix the function with __LJMP to
indicate that it may longjump.
No backport needed.
As raised by Coverity in GH #2223, f034139bc0 ("MINOR: lua: Allow reading
"proc." scoped vars from LUA core.") causes uninitialized reads due to
smp being passed to vars_get_by_name() without being initialized first.
Indeed, vars_get_by_name() tries to read smp->sess and smp->strm pointers.
As we're only interested in the PROC var scope, it is safe to call
vars_get_by_name() with sess and strm pointers set to NULL, thus we
simply memset smp prior to calling vars_get_by_name() to fix the issue.
This should be backported in 2.9 with f034139bc0.
This previous commit was not sufficient to completely fix the building issue
in relation with the TLS stack 0-RTT support. LibreSSL was the last TLS
stack to refuse to compile because of undefined a QUIC specific function
for 0-RTT: SSL_set_quic_early_data_enabled().
To get rid of such compilation issues, define HA_OPENSSL_HAVE_0RTT_SUPPORT
only when building against TLS stack with 0-RTT support.
No need to backport.
This commit:
"MINOR: quic: Enable early data at SSL session level (aws-lc)
introduced a build error when using wolfssl as TLS stack
because it references unknown function wolfSSL_set_quic_early_data_enabled()
which is not defined in qc_set_quic_early_data_context() that must not be used
in this case. The compilation of this fonction was enabled for wolfssl when
it should not have by the mentionned commit.
No backport is needed.
Commit f783dd959b ("MINOR: quic: Enable early data at SSL session level
(aws-lc)") introduced a build error when using the openssl compat layer
because it references unknown function SSL_set_quic_early_data_context()
in qc_set_quic_early_data_context() that is not used in this case.
No backport is needed.
The jwt_verify converter was added in 2.5 with commit 130e142ee2
("MEDIUM: jwt: Add jwt_verify converter to verify JWT integrity"). It
takes a string on input and returns an integer. It turns out that by
presetting the return value to zero before processing contents, while
the sample data is a union, it overwrites the beginning of the buffer
struct passed on input. On a 64-bit arch it's not an issue because it's
where the allocated size is stored and it's not used in the operation,
which explains why the regtest works. But on 32-bit, both the size and
the pointer are overwritten, causing a NULL pointer to be passed to
jwt_tokenize() which is not designed to support this, hence crashes.
Let's just use a temporary variable to hold the result and move the
output sample initialization to the end of the function.
This should be backported as far as 2.5.
The initial purpose of CSV stats through CLI was to make it easely
parsable by scripts. But in some specific cases some error or warning
messages strings containing LF were dumped into cells of this CSV.
This made some parsing failure on several tools. In addition, if a
warning or message contains to successive LF, they will be dumped
directly but double LFs tag the end of the response on CLI and the
client may consider a truncated response.
This patch extends the 'csv_enc_append' and 'csv_enc' functions used
to format quoted string content according to RFC with an additionnal
parameter to convert multi-lines strings to one line: CRs are skipped,
and LFs are replaced with spaces. In addition and optionally, it is
also possible to remove resulting trailing spaces.
The call of this function to fill strings into stat's CSV output is
updated to force this conversion.
This patch should be backported on all supported branches (issue was
already present in v2.0)
This patch impacts only the haproxy builds against aws-lc TLS stack (USE_OPENSSL_AWSLC).
As mentionned by the boringssl documentation, SSL_do_handshake() completes as soon
as ClientHello is processed and server flight sent (from the TLS stack to the
server endpoint I guess). Into QUIC, the completion has as side effect to discard
the Handshake packet number space. If this handshake completion is not deffered,
the Handshake level CRYPTO data will not be sent to the peer (because of the
assotiated packet number space discarding). According to the documentation,
SSL_in_early_data() may be used to do that. If it returns 1, this means that
the handshake is still in progress but has enough progressed to send half-RTT
data.
This patch is required to make the haproxy builds against aws-lc TLS stack support 0-RTT.
This patch impacts only haproxy when built against aws-lc TLS stack (OPENSSL_IS_AWSLC).
During the SSL_CTX switching from ssl_sock_switchctx_cbk() callback,
ssl_sock_switchctx_set() is called. This latter calls SSL_set_SSL_CTX()
whose aims is to change the SSL_CTX attached o an SSL object (TLS session).
But the aws-lc (or boringssl) implementation of this function copy
the "early data enabled" setting value (boolean) coming with the SSL_CTX object
into the SSL object. So, if not set in the SSL_CTX object this setting disabled
the one which has been set by configuration into the SSL object
(see qc_set_quic_early_data_enabled(), it calls SSL_set_early_data_enabled()
with an SSL object as parameter).
Fix this enabling the "early data enabled" setting into the SSL_CTX before
setting this latter into the SSL object.
This patch is required to make QUIC 0-RTT work with haproxy built against
aws-lc.
Note that, this patch should also help in early data support for TCP connections.
This patch impacts only the haproxy build against aws-lc TLS stack (USE_OPENSSL_AWSLC).
Implement qc_set_quic_early_data_enabled() new function to enable
early data at session level. To make QUIC O-RTT work, a context string
must be set calling SSL_set_quic_early_data_context(). This is a
subset of the encoded transport parameters which is used for this.
Note that some application level settings should be also added (TODO).
This patch is required to make 0-RTT work for haproxy builds against aws-lc.
Encode the version_information parameter only if the chosen version is provided
to quic_transport_params_encode() whose aim is to encode into a buffer all the
transport parameters passed as parameter (struct quic_params *p) in addition
to the version_information parameter.
This enables the support of transport parameters encoding without
the version_information transport parameter. This is useful for build against TLS stacks
as boringssl, aws-lc where a subset of the listener transport parameters
without version_information must be set as context string for acception
early data (see https://commondatastorage.googleapis.com/chromium-boringssl-docs/ssl.h.html#SSL_set_quic_early_data_context).
This patch is required to make haproxy builds against aws-lc TLS stack
(USE_OPENSSL_AWSLC) support 0-RTT. Does not impact the others builds.
Commit 9b2717e7b ("MINOR: stktable: use {show,set,clear} table with ptr")
stores a pointer in a long long (64bit), which fails the cas to void* on
32-bit platforms:
src/stick_table.c: In function 'table_process_entry_per_ptr':
src/stick_table.c:5136:37: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
5136 | ts = stktable_lookup_ptr(t, (void *)ptr);
On all our supported platforms, longs and pointers are of the same size,
so let's just turn this to a ulong instead.
Now with fc_glitches and bc_glitches we can retrieve the number of
detected glitches on a front or back connection. On the backend it
can indicate a bug in a server that may induce frequent reconnections
hence CPU usage in TLS reconnections, and on the frontend it may
indicate an abusive client that may be trying to attack the stack
or to fingerprint it. Small non-zero values are definitely expected
and can be caused by network glitches for example, as well as rare
bugs in the other component (or maybe even in haproxy). These should
never be considered as alarming as long as they remain low (i.e.
much less than one per request). A reg-test is provided.
There are a lot of H2 events which are not invalid from a protocol
perspective but which are yet anomalies, especially when repeated. They
can come from bogus or really poorly implemlented clients, as well as
purposely built attacks, as we've seen in the past with various waves
of attempts at abusing H2 stacks.
In order to better deal with such situations, it would be nice to be
able to sort out what is correct and what is not. There's already the
HTTP error counter that may even be updated on a tracked connection,
but HTTP errors are something clearly defined while there's an entire
scope of gray area around it that should not fall into it.
This patch introduces the notion of "glitches", which normally correspond
to unexpected and temporary malfunction. And this is exactly what we'd
like to monitor. For example a peer is not misbehaving if a request it
sends fails to decode because due to HPACK compression it's larger than
a buffer, and for this reason such an event is reported as a stream error
and not a connection error. But this causes trouble nonetheless and should
be accounted for, especially to detect if it's repeated. Similarly, a
truncated preamble or settings frame may very well be caused by a network
hiccup but how do we know that in the logs? For such events, a glitch
counter is incremented on the connection.
For now a total of 41 locations were instrumented with this and the
counter is reported in the traces when not null, as well as in
"show sess" and "show fd". This was done using a new function,
"h2c_report_glitch()" so that it becomes easier to extend to more
advanced processing (applying thresholds, producing logs, escalating
to connection error, tracking etc).
A test with h2spec shows it reported in 8545 trace lines for 147 tests,
with some reaching value 3 in a same test (e.g. HPACK errors).
Some places were not instrumented, typically anything that can be
triggered on perfectly valid activity (received data after RST being
emitted, timeouts, etc). Some types of events were thought about,
such as INITIAL_WINDOW_SIZE after the first SETTINGS frame, too small
window update increments, etc. It just sounds too early to know if
those are currently being triggered by perfectly legit clients. Also
it's currently not incremented on timeouts so that we don't do that
repeatedly on short keep-alive timeouts, though it could make sense.
This may change in the future depending on how it's used. For now
this is not exposed outside of traces and debugging.
The test was performed but no trace emitted, which can complicate certain
diagnostics, so let's just add the trace for this rare case. It may safely
be backported though this is really not important.
Commit 7021a8c4d8 ("BUG/MINOR: mux-h2: also count streams for refused
ones") addressed stream counting issues on some error cases but not
completely correctly regarding the conn_err vs stream_err case. Indeed,
contrary to the initial analysis, h2c_dec_hdrs() can set H2_CS_ERROR
when facing some unrecoverable protocol errors, and it's not correct
to send it to strm_err which will only send the RST_STREAM frame and
the subsequent GOAWAY frame is in fact the result of the read timeout.
The difficulty behind this lies on the sequence of output validations
because h2c_dec_hdrs() returns two results at once:
- frame processing status (done/incomplete/failed)
- connection error status
The original ordering requires to write 2 exemplaries of the exact
same error handling code disposed differently, which the patch above
tried to factor to one. After careful inspection of h2c_dec_hdrs()
and its comments, it's clear that it always returns -1 on failure,
*including* connection errors. This means we can rearrange the test
to get rid of the missing data first, and immediately enter the
no-return zone where both the stream and connection errors can be
checked at the same place, making sure to consistently maintain
error counters. This is way better because we don't have to update
stream counters on the error path anymore. h2spec now passes the
test much faster.
This will need to be backported to the same branches as the commit
above, which was already backported to 2.9.
This bug impacts only the QUIC OpenSSL compatibility module (USE_QUIC_OPENSSL_COMPAT)
and it was introduced by this commit:
BUG/MINOR: quic: Wrong keylog callback setting.
quic_tls_compat_keylog_callback() callback was no more set when the SSL keylog was
enabled by tune.ssl.keylog setting. This is the callback which sets the TLS secrets
into haproxy.
Set it again when the SSL keylog is not enabled by configuration.
Thank you to @Greg57070 for having reported this issue in GH #2412.
Must be backported as far as 2.8.
There are a few places where we can reject an incoming stream based on
technical errors such as decoded headers that are too large for the
internal buffers, or memory allocation errors. In this case we send
an RST_STREAM to abort the request, but the total stream counter was
not incremented. That's not really a problem, until one starts to try
to enforce a total stream limit using tune.h2.fe.max-total-streams,
and which will not count such faulty streams. Typically a client that
learns too large cookies and tries to replay them in a way that
overflows the maximum buffer size would be rejected and depending on
how they're implemented, they might retry forever.
This patch removes the stream count increment from h2s_new() and moves
it instead to the calling functions, so that it translates the decision
to process a new stream instead of a successfully decoded stream. The
result is that such a bogus client will now be blocked after reaching
the total stream limit.
This can be validated this way:
global
tune.h2.fe.max-total-streams 128
expose-experimental-directives
trace h2 sink stdout
trace h2 level developer
trace h2 verbosity complete
trace h2 start now
frontend h
bind :8080
mode http
redirect location /
Sending this will fill frames with 15972 bytes of cookie headers that
expand to 16500 for storage+index once decoded, causing "message too large"
events:
(dev/h2/mkhdr.sh -t p;dev/h2/mkhdr.sh -t s;
for sid in {0..1000}; do
dev/h2/mkhdr.sh -t h -i $((sid*2+1)) -f es,eh \
-R "828684410f7777772e6578616d706c652e636f6d \
$(for i in {1..66}; do
echo -n 60 7F 73 433d $(for j in {1..24}; do
echo -n 2e313233343536373839; done);
done) ";
done) | nc 0 8080
Now it properly stops after sending 128 streams.
This may be backported wherever commit 983ac4397 ("MINOR: mux-h2:
support limiting the total number of H2 streams per connection") is
present, since without it, that commit is less effective.
The 'generate-certificates' option does not need its dedicated SSL_CTX
*, it only needs the default SSL_CTX.
Use the default SSL_CTX found in the sni_ctx to generate certificates.
It allows to remove all the specific default_ctx initialization, as
well as the default_ssl_conf and 'default_inst'.
This patch follows the previous one about default certificate selection
("MEDIUM: ssl: allow multiple fallback certificate to allow ECDSA/RSA
selection").
This patch generates '*" SNI filters for the first certificate of a
bind line, it will be used to match default certificates. Instead of
setting the default_ctx pointer in the bind line.
Since the filters are in the SNI tree, it allows to have multiple
default certificate and restore the ecdsa/rsa selection with a
multi-cert bundle.
This configuration:
# foobar.pem.ecdsa and foobar.pem.rsa
bind *:8443 ssl crt foobar.pem crt next.pem
will use "foobar.pem.ecdsa" and "foobar.pem.rsa" as default
certificates.
Note: there is still cleanup needed around default_ctx.
This was discussed in github issue #2392.
This patch changes the default certificate mechanism.
Since the beginning of SSL in HAProxy, the default certificate was the first
certificate of a bind line. This allowed to fallback on this certificate
when no servername extension was sent by the server, or when no SAN nor
CN was available in the certificate.
When using a multi-certificate bundle (ecdsa+rsa), it was possible to
have both certificates as the fallback one, leting openssl chose the
right one. This was possible because a multi-certificate bundle
was generating a unique SSL_CTX for both certificates.
When the haproxy and openssl architecture evolved, we decided to
use multiple SSL_CTX for a multi-cert bundle, in order to simplify the
code and allow updates over the CLI.
However only one default_ctx was allowed, so we lost the ability to
chose between ECDSA and RSA for the default certificate.
This patch allows to use a '*' filter for a certificate, which allow to
lookup between multiple '*' filter, and have one in RSA and another one
in ECDSA. It replaces the default_ctx mechanism in the ClientHello
callback and use the standard algorithm to look for a default cert and
chose between ECDSA and RSA.
/!\ This patch breaks the automatic setting of the default certificate, which
will be introduce in the next patch. So the first certificate of a bind
line won't be used as a defaullt anymore.
To use this feature, one could use crt-list with '*' filters:
$ cat foo.crtlist
foobar.pem.rsa *
foobar.pem.ecdsa *
In order to test the feature, it's easy to send a request without
the servername extension and use ECDSA or RSA compatible ciphers:
$ openssl s_client -connect localhost:8443 -tls1_2 -cipher ECDHE-RSA-AES256-GCM-SHA384
$ openssl s_client -connect localhost:8443 -tls1_2 -cipher ECDHE-ECDSA-AES256-GCM-SHA384
Data emitted by QUIC MUX is restrained by the peer flow control. This is
checked on stream and connection level inside qcc_io_send().
The connection level check was placed early in qcc_io_send() preambule.
However, this also prevents emission of other frames STOP_SENDING and
RESET_STREAM, until flow control limitation is increased by a received
MAX_DATA. Note that local flow control frame emission is done prior in
qcc_io_send() and so are not impacted.
In the worst case, if no MAX_DATA is received for some time, this could
delay significantly streams closure and resource free. However, this
should be rare as other peers should anticipate emission of MAX_DATA
before reaching flow control limit. In the end, this is also covered by
the MUX timeout so the impact should be minimal
To fix this, move the connection level check directly inside QCS sending
loop. Note that this could cause unnecessary looping when connection
flow control level is reached and no STOP_SENDING/RESET_STREAM are
needed.
This should be backported up to 2.6.
The new global keywords "http-err-codes" and "http-fail-codes" allow to
redefine which HTTP status codes indicate a client-induced error or a
server error, as tracked by stick-table counters. This is only done
globally, though everything was done so that it could easily be extended
to a per-proxy mechanism if there was a real need for this (but it would
eat quite more RAM then).
A simple reg-test was added (http-err-fail.vtc).
This drops the hard-coded 4xx and 5xx status codes for err_cnt and
fail_cnt, in favor of the new bit fields that will soon be configurable.
There should be no difference at all since the bit fields are initialized
to the exact same sets (400-499 for err, 500-599 minus 501 and 505 for
fail).
At the moment, http_err_cnt and http_fail_cnt are incremented on a
well-defined set of status codes, which are checked at various places.
Over time, there have been some complains about 404, 401 or 407
triggering errors, or 500 triggering failures in SOAP environments
for example. With a small bit field that fits in a cache line we
can match the presence of a status code from 100 to 599, so that
remains cheap.
This patch adds two such bit fields, one per code class, and the
accompanying functions to set/clear/test the codes. The arrays are
preset at boot time. For now they are not used and it's not possible
to adjust them.
The function does the inverse of http_known_methods[], better rely on
that array with its indices, that makes the code clearer. Note that
we purposely don't use a loop because the compiler is able to build
an evaluation tree of the size checks and content checks that's very
efficient for the most common methods. Moving a few unimportant
entries even simplified the output code a little bit (they're now
groupped by size without changing anything for the first ones).
This function uses a large switch/case, but the problem is that due to the
numerous holes in the range, the compiler implemented a large jump table.
With a bit of experimentations, some trivial perfect-hash code works, and
since the number of entries is 19, it was enlarged to match the nearest
next power of two to avoid a large modulo operation, and fills the holes
with the default return value (HTTP_ERR_500). Jumping to 32 also results
in a lot of valid keys and allows us to pick small values, resulting in
very fast and compact code. The new function, despite keeping a 32-bytes
table, saves slightly more than 800 bytes of code+data compared to the
previous code, and avoids table jumps that affect the CPU's branch history.
Note that another simple hash worked fine and produced exactly 19 codes
(hence no need to pad holes): ((status * 8675725) >> 13) % 19
But it's still about 24 bytes larger in code to save 13 bytes of data
that are aligned anyway, and it was a bit more expensive so that was
definitely not worth it.
The validity of the table was verified with this test code added just after
it:
__attribute__((constructor)) void http_hash_test(void)
{
int i;
for (i = 0; i <= 600; i++)
printf("code %d => %d\n", i, http_get_status_idx(i));
exit(0);
}
And starting haproxy |grep -vw 14 correctly shows all ordered values
(except 500 of course which is 14).
In case new codes would be added, just play again with dev/phash to
updated the table. As long as there are less than 32 effective entries
it will remain easy to update without having to modify phash.
Willy made me realize that tree-based matching may also suffer from
out-of-order mapfile loading, as opposed to what's being said in
b546bb6d ("BUG/MINOR: map: list-based matching potential ordering
regression") and the associated REGTEST.
Indeed, in case of duplicated keys, we want to be sure that only the key
that was first seen in the file will be returned (as long as it is not
removed). The above fix is still valid, and the list-based match regtest
will also prevent regressions for tree-based match since mapfile loading
logic is currently match-type agnostic.
But let's clarify that by making both the code comment and the regtest
more precise.
An unexpected side-effect was introduced by 5fea597 ("MEDIUM: map/acl:
Accelerate several functions using pat_ref_elt struct ->head list")
The above commit tried to use eb tree API to manipulate elements as much
as possible in the hope to accelerate some functions.
Prior to 5fea597, pattern_read_from_file() used to iterate over all
elements from the map file in the same order they were seen in the file
(using list_for_each_entry) to push them in the pattern expression.
Now, since eb api is used to iterate over elements, the ordering is lost
very early.
This is known to cause behavior changes with existing setups (same conf
and map file) when compared with previous versions for some list-based
matching methods as described in GH #2400. For instance, the map_dom()
converter may return a different matching key from the one that was
returned by older haproxy versions.
For IP or STR matching, matching is based on tree lookups for better
efficiency, so in this case the ordering is lost at the name of
performance. The order in which they are loaded doesn't matter because
tree ordering is based on the content, it is not positional.
But with some other types, matching is based on list lookups (e.g.: dom),
and the order in which elements are pushed into the list can affect the
matching element that will be returned (in case of multiple matches, since
only the first matching element in the list will be returned).
Despite the documentation not officially stating that the file ordering
should be preserved for list-based matching methods, it's probably best
to be conservative here and stick to historical behavior. Moreover, there
was no performance benefit from using the eb tree api to iterate over
elements in pattern_read_from_file() since all elements are visited
anyway.
This should be backported to 2.9.
Fix indentation in smp_fetch_ssl_fc_ec() since it is using exclusively
spaces.
This should have been in previous 9a21b4b43 patch but was missed by
accident.
Could be backported if a fix depends on it.
The function `smp_fetch_ssl_fc_ec` gets the curve name used during key
exchange. It currently uses the `SSL_get_negotiated_group`, available
since OpenSSLv3.0 to get the nid and derive the short name of the curve
from the nid. In OpenSSLv3.2, a new function, `SSL_get0_group_name` was
added that directly gives the curve name.
The function `smp_fetch_ssl_fc_ec` has been updated to use
`SSL_get0_group_name` if using OpenSSL>=3.2 and for versions >=3.0 and <
3.2 use the old SSL_get_negotiated_group to get the curve name. Another
change made is to normalize the return value, so that
`smp_fetch_ssl_fc_ec` returns curve name in uppercase.
(`SSL_get0_group_name` returns the curve name in lowercase and
`SSL_get_negotiated_group` + `OBJ_nid2sn` returns curve name in
uppercase).
Can be backported to 2.8.
Released version 3.0-dev1 with the following main changes :
- MINOR: channel: Use dedicated functions to deal with STREAMER flags
- MEDIUM: applet: Handle channel's STREAMER flags on applets size
- MINOR: applets: Use channel's field to compute amount of data received
- MEDIUM: cache: Save body size of cached objects and track it on delivery
- MEDIUM: cache: Add support for endp-to-endp fast-forwarding
- MINOR: cache: Add global option to enable/disable zero-copy forwarding
- MINOR: pattern: Use reference name as filename to read patterns from a file
- MEDIUM: pattern: Add support for virtual and optional files for patterns
- DOC: config: Add section about name format for maps and ACLs
- DOC: management/lua: Update commands about map and acl
- MINOR: promex: Add support for specialized front/back/li/srv metric names
- MINOR: promex: Export active/backup metrics per-server
- BUG/MINOR: ssl: Double free of OCSP Certificate ID
- MINOR: ssl/cli: Add ha_(warning|alert) msgs to CLI ckch callback
- BUG/MINOR: ssl: Wrong OCSP CID after modifying an SSL certficate
- BUG/MINOR: lua: Wrong OCSP CID after modifying an SSL certficate (LUA)
- DOC: configuration: typo req.ssl_hello_type
- MINOR: hq-interop: add fastfwd support
- CLEANUP: mux_quic: rename ffwd function with prefix qmux_strm_
- MINOR: mux-quic: add traces for 0-copy/fast-forward
- BUG/MINOR: mworker/cli: fix set severity-output support
- CLEANUP: mworker/cli: add comments about pcli_find_and_exec_kw()
- BUG/MEDIUM: quic: Possible buffer overflow when building TLS records
- BUILD: ssl: update types in wolfssl cert selection callback
- MINOR: ssl: activate the certificate selection callback for WolfSSL
- CI: github: switch to wolfssl git-c4b77ad for new PR
- BUG/MEDIUM: map/acl: pat_ref_{set,delete}_by_id regressions
- BUG/MINOR: ext-check: cannot use without preserve-env
- CLEANUP: mux-quic: remove unused prototype
- MINOR: mux-quic: clean up qcs Rx buffer allocation API
- MINOR: mux-quic: clean up qcs Tx buffer allocation API
- CLEANUP: mux-quic: clean up app ops callback definitions
- MINOR: mux-quic: factorize QC_SF_UNKNOWN_PL_LENGTH set
- MINOR: h3: complete traces for sending
- MINOR: h3: adjust zero-copy sending related code
- MINOR: hq-interop: use zero-copy to transfer single HTX data block
- BUG/MEDIUM: quic: QUIC CID removed from tree without locking
- BUG/MEDIUM: stconn: Block zero-copy forwarding if EOS/ERROR on consumer side
- BUG/MEDIUM: mux-h1: Cound data from input buf during zero-copy forwarding
- BUG/MEDIUM: mux-h1: Explicitly skip request's C-L header if not set originally
- CLEANUP: mux-h1: Fix a trace message about C-L header addition
- BUG/MEDIUM: mux-h2: Report too large HEADERS frame only when rxbuf is empty
- BUG/MEDIUM: mux-quic: report early error on stream
- DOC: config: add arguments to sample fetch methods in the table
- DOC: config: also add arguments to the converters in the table
- BUG/MINOR: resolvers: default resolvers fails when network not configured
- SCRIPTS: mk-patch-list: produce a list of patches
- DEV: patchbot: add the AI-based bot to pre-select candidate patches to backport
- BUG/MEDIUM: mux-h2: Switch pending error to error if demux buffer is empty
- BUG/MEDIUM: mux-h2: Only Report H2C error on read error if demux buffer is empty
- BUG/MEDIUM: mux-h2: Don't report error on SE if error is only pending on H2C
- BUG/MEDIUM: mux-h2: Don't report error on SE for closed H2 streams
- DOC: config: Update documentation about local haproxy response
- DEV: patchbot: use checked buttons as reference instead of internal table
- DEV: patchbot: allow to show/hide backported patches
- MINOR: h3: remove quic_conn only reference
- BUG/MINOR: server: Use the configured address family for the initial resolution
- MINOR: mux-quic: remove qcc_shutdown() from qcc_release()
- MINOR: mux-quic: use qcc_release in case of init failure
- MINOR: mux-quic: adjust error code in init failure
- MINOR: h3: add traces for connection init stage
- BUG/MINOR: h3: properly handle alloc failure on finalize
- MINOR: h3: use INTERNAL_ERROR code for init failure
- BUG/MAJOR: stconn: Disable zero-copy forwarding if consumer is shut or in error
- MINOR: stats: store the parent proxy in stats ctx (http)
- BUG/MEDIUM: stats: unhandled switching rules with TCP frontend
- MEDIUM: proxy: set PR_O_HTTP_UPG on implicit upgrades
- MINOR: proxy: monitor-uri works with tcp->http upgrades
- OPTIM: server: eb lookup for server_find_by_name()
- OPTIM: server: ebtree lookups for findserver_unique_* functions
- MINOR: server/event_hdl: add server_inetaddr struct to facilitate event data usage
- MINOR: server/event_hdl: update _srv_event_hdl_prepare_inetaddr prototype
- BUG/MINOR: server/event_hdl: propagate map port info through inetaddr event
- MINOR: server: ensure connection cleanup on server addr changes
- CLEANUP: server/event_hdl: remove purge_conn hint in INETADDR event
- MEDIUM: server: merge srv_update_addr() and srv_update_addr_port() logic
- CLEANUP: server: remove unused server_parse_addr_change_request() function
- CLEANUP: resolvers: remove duplicate func prototype
- MINOR: resolvers: add unique numeric id to nameservers
- MEDIUM: server: make server_set_inetaddr() updater serializable
- MINOR: server/event_hdl: expose updater info through INETADDR event
- MINOR: server: add dns hint in server_inetaddr_updater struct
- MEDIUM: server/dns: clear RMAINT when addr resolves again
- BUG/MINOR: server/dns: use server_set_inetaddr() to unset srv addr from DNS
- BUG/MEDIUM: server/dns: perform svc_port updates atomically from SRV records
- MEDIUM: peers: use server as stream target
- CLEANUP: peers: remove unused sock_init_arg struct member
- CLEANUP: peers: remove unused "proto" and "xprt" struct members
- MINOR: peers: rely on srv->addr and remove peer->addr
- DOC: config: add context hint for server keywords
- MINOR: stktable: add table_process_entry helper function
- MINOR: stktable: use {show,set,clear} table with ptr
- MINOR: map: add map_*_key converters to provide the matching key
- DOC: fix typo for fastfwd QUIC option
- BUG/MINOR: mux-quic: always report error to SC on RESET_STREAM emission
- MEDIUM: mux-quic: add BUG_ON if sending on locally closed QCS
- BUG/MINOR: mux-quic: disable fast-fwd if connection on error
- BUG/MINOR: quic: Wrong keylog callback setting.
- BUG/MINOR: quic: Missing call to TLS message callbacks
- MINOR: h3: check connection error during sending
- BUG/MINOR: h3: close connection on header list too big
- BUG/MINOR: h3: close connection on sending alloc errors
- BUG/MINOR: h3: disable fast-forward on buffer alloc failure
- Revert "MINOR: mux-quic: Disable zero-copy forwarding for send by default"
- MINOR: stktable: stktable_data_ptr() cannot fail in table_process_entry()
- CLEANUP: assorted typo fixes in the code and comments
- CI: use semantic version compare for determing "latest" OpenSSL
- CLEANUP: server: remove ambiguous check in srv_update_addr_port()
- CLEANUP: resolvers: remove unused RSLV_UPD_OBSOLETE_IP flag
- CLEANUP: resolvers: remove some more unused RSLV_UDP flags
- MEDIUM: server: simplify snr_set_srv_down() to prevent confusions
- MINOR: backend: export get_server_*() functions
- MINOR: tcpcheck: export proxy_parse_tcpcheck()
- MEDIUM: udp: allow to retrieve the frontend destination address
- MINOR: global: export a way to list build options
- MINOR: debug: add features and build options to "show dev"
- BUG/MINOR: server: fix server_find_by_name() usage during parsing
- REGTESTS: check attach-srv out of order declaration
- CLEANUP: quic: Remaining useless code into server part
- BUILD: quic: Missing quic_ssl.h header protection
- BUG/MEDIUM: h3: fix incorrect snd_buf return value
- MINOR: h3: do not consider missing buf room as error on trailers
- BUG/MEDIUM: stconn: Forward shutdown on write timeout only if it is forwardable
- BUG/MEDIUM: stconn: Set fsb date if zero-copy forwarding is blocked during nego
- BUG/MEDIUM: spoe: Never create new spoe applet if there is no server up
- MINOR: mux-h2: support limiting the total number of H2 streams per connection
- CLEANUP: mux-h2: remove the printfs from previous commit on h2 streams limit.
- DEV: h2: add the ability to emit literals in mkhdr
- DEV: h2: add the preface as well in supported output types
- DEV: h2: support passing raw data for a frame
- IMPORT: ebtree: implement and use flsnz_long() to count bits
- IMPORT: ebtree: switch the sizes and offsets to size_t and ssize_t
- IMPORT: ebtree: rework the fls macros to better deal with arch-specific ones
- IMPORT: ebtree: make string_equal_bits turn back to unsigned char
- IMPORT: ebtree: use unsigned ints for flznz()
- IMPORT: ebtree: make string_equal_bits() return an unsigned
This patch introduces a new setting: tune.h2.fe.max-total-streams. It
sets the HTTP/2 maximum number of total streams processed per incoming
connection. Once this limit is reached, HAProxy will send a graceful GOAWAY
frame informing the client that it will close the connection after all
pending streams have been closed. In practice, clients tend to close as fast
as possible when receiving this, and to establish a new connection for next
requests. Doing this is sometimes useful and desired in situations where
clients stay connected for a very long time and cause some imbalance inside a
farm. For example, in some highly dynamic environments, it is possible that
new load balancers are instantiated on the fly to adapt to a load increase,
and that once the load goes down they should be stopped without breaking
established connections. By setting a limit here, the connections will have
a limited lifetime and will be frequently renewed, with some possibly being
established to other nodes, so that existing resources are quickly released.
The default value is zero, which enforces no limit beyond those implied by
the protocol (2^30 ~= 1.07 billion). Values around 1000 were found to
already cause frequent enough connection renewal without causing any
perceptible latency to most clients. One notable exception here is h2load
which reports errors for all requests that were expected to be sent over
a given connection after it receives a GOAWAY. This is an already known
limitation: https://github.com/nghttp2/nghttp2/issues/981
The patch was made in two parts inside h2_frt_handle_headers():
- the first one, at the end of the function, which verifies if the
configured limit was reached and if it's needed to emit a GOAWAY ;
- the second, just before decoding the stream frame, which verifies if
a previously configured limit was ignored by the client, and closes
the connection if this happens. Indeed, one reason for a connection
to stay alive for too long definitely comes from a stupid bot that
periodically fetches the same resource, scans lots of URLs or tries
to brute-force something. These ones are more likely to just ignore
the last stream ID advertised in GOAWAY than a regular browser, or
a well-behaving client such as curl which respects it. So in order
to make sure we can close the connection we need to enforce the
advertised limit.
Note that a regular client will not face a problem with that because in
the worst case it will have max_concurrent_streams in flight and this
limit is taken into account when calculating the advertised last
acceptable stream ID.
Just a note: it may also be possible to move the first part above to
h2s_frt_stream_new() instead so that it's not processed for trailers,
though it doesn't seem to be more interesting, first because it has
two return points.
This is something that may be backported to 2.9 and 2.8 to offer more
control to those dealing with dynamic infrastructures, especially since
for now we cannot force a connection to be cleanly closed using rules
(e.g. github issues #946, #2146).
This test was already performed when a new message is queued into the
sending queue. However not when the last applet is released, in
spoe_release_appctx(). It is a quite old bug. It was introduced by commit
6f1296b5c7 ("BUG/MEDIUM: spoe: Create a SPOE applet if necessary when the
last one is released").
Because of this bug, new SPOE applets may be created and quickly released
because there is no server up, in loop and while there is at least one
message in the sending queue, consuming all the CPU. It is pretty visible if
the processing timeout is high.
To fix the bug, conditions to create or not a SPOE applet are now
centralized in spoe_create_appctx(). The test about the max connections per
second and about number of active servers are moved in this function.
This patch must be backported to all stable versions.
The commit b9c87f8082 ("BUG/MEDIUM: stconn/stream: Forward shutdown on write
timeout") introduced a regression. In sc_cond_forward_shut(), the write
timeout is considered too early to forward the shutdown. In fact, it is
always considered, even if the shutdown is not forwardable yet. It is of
course unexpected. It is especially an issue when a write timeout is
encountered on server side during the connection establishment. In this
case, if shutdown is forwarded too early on the client side, the connection
is closed before the 503 error sending.
So the write timeout must indeed be considered to forward the shutdown to
the underlying layer, but only if the shutdown is forwardable. Otherwise, we
should do nothing.
This patch should fix the issue #2404. It must be backported as far as 2.2.
Improve h3_resp_trailers_send() return value to be similar with
h3_resp_data_send(). In particular, if QCS Tx buffer has not enough
space for trailer encoding, 0 is returned instead of an error value,
with QC_SF_BLK_MROOM set.
This unify HTTP/3 headers/data/trailers encoding functions. Negative
error codes are limited to fatal error which should cause a connection
closure. Not enough output buffer space is only a transient condition
which is reflect by the QC_SF_BLK_MROOM flag.
h3_resp_data_send() is used to transcode HTX data into H3 data frames.
If QCS Tx buffer is not aligned when first invoked, two separate frames
may be built, first until buffer end, then with remaining space in
front.
If buffer space is not enough for at least the H3 frame header, -1 is
returned with the flag QC_SF_BLK_MROOM set to await for more room. An
issue arises if this occurs for the second frame : -1 is returned even
though HTX data were properly transcoded and removed on the first step.
This causes snd_buf callback to return an incorrect value to the stream
layer, which in the end will corrupt the channel output buffer.
To fix this, stop considering that not enough remaining space is an
error case. Instead, return 0 if this is encountered for the first frame
or the HTX removed block size for the second one. As QC_SF_BLK_MROOM is
set, this will correctly interrupt H3 encoding. Label err is thus only
properly limited to fatal error which should cause a connection closure.
A new BUG_ON() has been added which should prevent similar issues in the
future.
This issue was detected using the following client :
$ ngtcp2-client --no-quic-dump --no-http-dump --exit-on-all-streams-close \
127.0.0.1 20443 -n2 "http://127.0.0.1:20443/?s=50k"
This triggers the following CHECK_IF statement. Note that it may be
necessary to disable fast forwarding to enforce snd_buf usage.
Thread 1 "haproxy" received signal SIGILL, Illegal instruction.
0x00005555558bc48a in co_data (c=0x5555561ed428) at include/haproxy/channel.h:130
130 CHECK_IF_HOT(c->output > c_data(c));
[ ## gdb ## ] bt
#0 0x00005555558bc48a in co_data (c=0x5555561ed428) at include/haproxy/channel.h:130
#1 0x00005555558c1d69 in sc_conn_send (sc=0x5555561f92d0) at src/stconn.c:1637
#2 0x00005555558c2683 in sc_conn_io_cb (t=0x5555561f7f10, ctx=0x5555561f92d0, state=32832) at src/stconn.c:1824
#3 0x000055555590c48f in run_tasks_from_lists (budgets=0x7fffffffdaa0) at src/task.c:596
#4 0x000055555590cf88 in process_runnable_tasks () at src/task.c:876
#5 0x00005555558aae3b in run_poll_loop () at src/haproxy.c:3049
#6 0x00005555558ab57e in run_thread_poll_loop (data=0x555555d9fa00 <ha_thread_info>) at src/haproxy.c:3251
#7 0x00005555558ad053 in main (argc=6, argv=0x7fffffffddd8) at src/haproxy.c:3948
In case CHECK_IF are not activated, it may cause crash or incorrect
transfers.
This was introduced by the following commit
commit 2144d24186
BUG/MINOR: h3: close connection on sending alloc errors
This must be backported wherever the above patch is.
Remove some QUIC definitions of members from server structure as the haproxy QUIC
stack does not support at all the server part (QUIC client) as this time.
Remove the statements in relation with their initializations.
This patch should be backported as far as 2.6 to save memory.
Since below commit, server_find_by_name() now search using
'used_server_id' proxy backend tree :
4bcfe30414
OPTIM: server: eb lookup for server_find_by_name()
This introduces a regression if server_find_by_name() is used via
check_config_validity() during post-parsing. Indeed, used_server_id tree
is populated at the same stage so it's possible to not found an existing
server. This can cause incorrect rejection of previously valid
configuration file.
To fix this, servers are now inserted in used_server_id tree during
parsing via parse_server(). This guarantees that server instances can be
retrieved during post parsing.
A known feature which uses server_find_by_name() during post parsing is
attach-srv tcp-rule used for reverse HTTP. Prior to the current fix, a
config was wrongly rejected if the rule was declared before the server
line.
This should not be backported unless the mentionned commit is.
The "show dev" CLI command is still missing useful elements such as the
build options, SSL version etc. Let's just add the build features and
the build options there so that it's possible to collect all of this
from a running process without having to start the executable with -vv.
This is still dumped all at once from the parsing function since the
output is small. If it were to grow, this would possibly require to be
reworked to support a context.
It might be helpful to backport this to 2.9 since it can help narrow
down certain issues.
The new function hap_get_next_build_opt() will iterate over the list of
build options. This will be used for debugging, so that the build options
can be retrieved from the CLI.
A new flag RX_F_PASS_PKTINFO is now available, whose purpose is to mark
that the destination address is about to be retrieved on some listeners.
The address can be retrieved from the first received datagram, and
relies on the IP_PKTINFO, IP_RECVDSTADDR and IPV6_RECVPKTINFO support.
snr_set_srv_down() (was formely known as snr_update_srv_status()), is
still too ambiguous because it's not clear whether we will be putting
the server under maintenance or not. This is mainly due to the fact that
the function behaves differently if has_no_ip is set or not.
By reviewing the function callers, it has now become clear that
snr_resolution_cb() is always calling the function with a valid resolution
so we only want to put the server under maintenance if we don't have a
valid IP address. On the other hand snr_resolution_error_cb() always
calls the function on error, with either no resolution (for SRV requests)
or with failing resolution (all cases except RSLV_STATUS_VALID), so in
this case we decide whether to put the server under maintenance case by
case (ie: expired? timeout?)
As a result, let's simplify snr_set_srv_down() so that it is only called
when the caller really thinks that the server should be put under
maintenance, which means always for snr_resolution_error_cb(), and only
if the resolution didn't yield usable ip for snr_resolution_cb().
RSLV_UPD_CNAME and RSLV_UPD_NAME_ERROR flags have now become useless since
3cf7f987 ("MINOR: dns: proper domain name validation when receiving DNS
response") as they are never set, but we forgot to remove them.
A leftover check was left by recent patch series about server
addr:svc_port propagation: a check on (msg) being set was performed
in srv_update_addr_port(), but msg is always set, so the check is not
needed and confuses coverity (See GH #2399)
In table_process_entry(), stktable_data_ptr() result is dereferenced
without checking if it's NULL first, which may happen when bad inputs
are provided to the function.
However, data_type and ts arguments were already checked prior to calling
the function, so we know for sure that stktable_data_ptr() will never
return NULL in this case.
However some static code analyzers such as Coverity are being confused
because they think that the result might possibly be NULL.
(See GH #2398)
To make it explicit that we always provide good inputs and expect valid
result, let's switch to the __stktable_data_ptr() unsafe function.
This reverts commit 18f2ccd244.
Found issues related to QUIC fast-forward were resolved (see github
issue #2372). Reenable it by default. If any issue arises, it can be
disabled using the global statement :
tune.quit.zero-copy-fwd-send off
This can be backported to 2.9, but only after a sensible period of
observation.
If QCS Tx buffer cannot be allocated in nego_ff callback, disable
fast-forward for this connection and return immediately. If snd_buf is
later finally used but still no buffer can being allocated, the
connection will be closed on error.
This should fix coverity reported in github issue #2390.
This should be backported up to 2.9.
When encoding new HTTP/3 frames, QCS Tx buffer must be allocated if
currently NULL. Previously, allocation failure was not properly checked,
leaving the connection in an unspecified state, or worse risking a
crash.
Fix this by setting <h3c.err> to H3_INTERNAL_ERROR each time the
allocation fails. This will stop sending and close the connection. In
the future, it may be better to put the connection on pause waiting for
allocation to succeed but this is too complicated to implement for now
in a reliable way.
Along with the current change, return of all HTX parsing functions
(h3_resp_*_send) were set to a negative value in case of error. A new
BUG_ON() in h3_snd_buf() ensures that if such a value is returned,
either a connection error is register (via <h3c.err>) or buffer is
temporarily full (flag QC_SF_BLK_MROOM).
This should fix github issue #2389.
This should be backported up to 2.6. Note that qcc_get_stream_txbuf()
does not exist in 2.9 and below. mux_get_buf() is its equivalent. An
explicit check b_is_null(&qcs.tx.buf) should be used there.
When parsing a HTX response, if too many headers are present, stop
sending and close the connection with error code H3_INTERNAL_ERROR.
Previously, no error was reported despite the interruption of header
parsing. This cause an infinite loop. However, this is considered as
minor as it happens on the response path from backend side.
This should be backported up to 2.6.
It relies on previous commit
"MINOR: h3: check connection error during sending".
If an error occurs during HTX to H3 encoding, h3_snd_buf() should be
interrupted. This commit add this possibility by checking for <h3c.err>
member value. If non null, sending loop is stopped and an error is
reported using qcc_set_error().
This commit does not change any behavior for now, as <h3c.err> is never
set during sending. However, this will change in future commits, most
notably to reject too many headers or handle buffer allocation failure.
As such, this commit should be backported along the following fixes.
Note that in 2.6 qcc_set_error() does not exist and must be replaced by
qcc_emit_cc_app().
This bug impacts only the QUIC OpenSSL compatibility module (USE_QUIC_OPENSSL_COMPAT).
The TLS capture of information from client hello enabled by
tune.ssl.capture-buffer-size could not work with USE_QUIC_OPENSSL_COMPAT. This
is due to the fact the callback set for this feature was replaced by
quic_tls_compat_msg_callback(). In fact this called must be registered by
ssl_sock_register_msg_callback() as this done for the TLS client hello capture.
A call to this function appends the function passed as parameter to a list of
callbacks to be called when the TLS stack parse a TLS message.
quic_tls_compat_msg_callback() had to be modified to return if it is called
for a non-QUIC TLS session.
Must be backported to 2.8.
This bug impacts only the QUIC OpenSSL compatibility module (USE_QUIC_OPENSSL_COMPAT).
To make this module works, quic_tls_compat_keylog_callback() function must be
set as keylog callback, or at least be called by another keylog callback.
This is what SSL_CTX_keylog() was supposed to do. In addition to export the TLS
secrets via sample fetches this latter also calls quic_tls_compat_keylog_callback()
when compiled with USE_QUIC_OPENSSL_COMPAT defined.
Before this patch, SSL_CTX_keylog() was replaced by quic_tls_compat_keylog_callback()
and the TLS secret were no more exported by sample fetches.
Must be backported to 2.8.
Add a check on nego_ff to ensure connection is not on error. If this is
the case, fast-forward is disable to prevent unnecessary sending. If
snd_buf is latter called, stconn will be notified of the error to
interrupt the stream.
This check is necessary to ensure snd_buf and nego_ff are consistent.
Note that previously, if fast-forward was conducted even on connection
error, no sending would occur as qcc_io_send() also check these flags.
However, there is a risk that stconn is never notified of the error
status, thus it is considered as a bug.
Its impact is minimal for now as fast-forward is disable by default on
QUIC. By fixing it, it should be possible to reactive it soon.
This should be backported up to 2.9.
Previously, if snd_buf operation was conducted despite QCS already
locally closed, the input buffer was silently dropped. This situation
could happen if a RESET_STREAM was emitted butemission not reported to
the stream layer. Resetting silently the buffer ensure QUIC MUX remain
compliant with RFC 9000 which forbid emission after RESET_STREAM.
Since previous commit, it is now ensured that RESET_STREAM sending will
always be reported to stream-layer. Thus, there is no need anymore to
silently reset the buffer. A BUG_ON() statement is added to ensure this
assumption will remain valid.
The new code is deemed cleaner as it does not hide a missing error
notification on the stconn-layer. Previously, if an error was missing,
sending would continue unnecessarily with a false success status
reported for the stream.
Note that the BUG_ON() statement was also added into nego_ff callback.
This is necessary to ensure both sending path remains consistent.
This patch is labelled as MEDIUM as issues were already encountered in
snd_buf/nego_ff implementation and it's not easy to cover all occurences
during test. If the BUG_ON() is triggered without any apparent
stream-layer issue, this commit should be reverted.
On RESET_STREAM emission, the stream Tx channel is closed. This event
must be reported to stream-conn layer to interrupt future send
operations.
Previously, se_fl_set_error() was manually invocated before/after
qcc_reset_stream(). Change this by moving se_fl_set_error() invocation
into the latter. This ensures that notification won't be forget, most
notably in HTTP/3 layer.
In most cases, behavior should be identical as both functions were
called together unless not necessary. However, there is one exception
which could cause a RESET_STREAM emission without error notification :
this happens on H3 trailers parsing error. All other H3 errors happen
before the stream-layer creation and thus the error is notified on
stream creation. This regression has been caused by the following patch :
152beeec34
MINOR: mux-quic: report error on stream-endpoint earlier
Thus it should be backported up to 2.7.
Note that the case described above did not cause any crash or protocol
error. This is because currently MUX QUIC snd_buf operation silently
reset buffer on transmission if QCS is already closed locally. This will
however be removed in a future commit so the current patch is necessary
to prevent an invalid behavior.
All map_*_ converters now have an additional output type: key. Such
converters will return the matched entry's key (as found in the map file)
as a string instead of the value.
Consider this example map file:
|example.com value1
|haproxy value2
With the above map file:
str(test.example.com/url),map_dom_key(file.map) will return "example.com"
str(running haproxy),map_sub_key(file.map) will return "haproxy"
This should address GH #1446.
This patchs adds support for optional ptr (0xffff form) instead of key
argument to match against existing sticktable entries, ie: if the key is
empty or cannot be matched on the cli due to incompatible characters.
Lookup is performed using a linear search so it will be slower than key
search which relies on eb tree lookup.
Example:
set table mytable key mykey data.gpc0 1
show table mytable
> 0x7fbd00032bd8: key=mykey use=0 exp=86373242 shard=0 gpc0=1
clear table mytable ptr 0x7fbd00032bd8
This patchs depends on:
- "MINOR: stktable: add table_process_entry helper function"
It should solve GH #2118
Only keep key-related logic in table_process_entry_per_key() function,
and then use table_process_entry() function that takes an entry pointer
as argument to process the entry.
Similarly to the previous commit, we get rid of unused peer member.
peer->addr was only used to save a copy of the sever's addr at parsing
time. But instead of relying on an intermediate variable, we can actually
use server's address directly when initiating the peer session.
As with other streams created from server's settings (tcp/http, log, ring),
we should rely on srv->svc_port for the port part of the address. This
shouldn't change anything for peers since the address is fully resolved
at parsing time and runtime changes are not supported, but this should
help to make the code future-proof.
peer->proto and peer->xprt struct members are now pure legacy: they are
only set during parsing but never used afterwards.
This is due to commit 02efedac ("MINOR: peers: now remove the remote
connection setup code") which made some cleanup in the past, but the
unused proto and xprt members were probably left unused by mistake.
Since we don't have valid uses for them, we remove them.
Also, peer_xprt() helper function was removed since it was related to
peer->xprt struct member.
This was the last missing bit from cd994407a ("BUG/MAJOR: server/addr:
fix a race during server addr:svc_port updates")
Indeed, despite the fix, svc_port updates from resolvers were still
directly performed on the server's struct.
Now they make proper use of the server_set_inetaddr() function so the port
change (+ optional addr change with AR) will be propagated atomically.
This patch depends on:
- "MINOR: server: ensure connection cleanup on server addr changes"
- "CLEANUP: server/event_hdl: remove purge_conn hint in INETADDR event"
- "MEDIUM: server: merge srv_update_addr() and srv_update_addr_port() logic"
- "MEDIUM: server: make server_set_inetaddr() updater serializable"
- "MINOR: server/event_hdl: expose updater info through INETADDR event"
- "MINOR: server: add dns hint in server_inetaddr_updater struct"
- "MEDIUM: server/dns: clear RMAINT when addr resolves again"
While it could be backported in 2.9 with cd994407a ("BUG/MAJOR:
server/addr: fix a race during server addr:svc_port updates") to ensure
addr and svc_port updates performed by resolver's code comply with the
API taking care of pushing the update (and thus avoid any race), some
patch dependencies are quite sensitive so it's probably best to avoid
backporting for no good reason, or at least wait for it to be considered
stable to prevent any breakeages
As seen before, server's addr and svc_port should not be updated directly
during runtime, because even if the update is performed under the lock,
some competing threads might be reading ->addr and ->svc_port without
the lock because they simply cannot afford it.
To prevent races with such competing threads, server's addr and port
should only be updated using server_set_inetaddr() function or similar.
This patch depends on:
- "MINOR: server: ensure connection cleanup on server addr changes"
- "CLEANUP: server/event_hdl: remove purge_conn hint in INETADDR event"
- "MEDIUM: server: merge srv_update_addr() and srv_update_addr_port() logic"
- "MEDIUM: server: make server_set_inetaddr() updater serializable"
- "MINOR: server/event_hdl: expose updater info through INETADDR event"
- "MINOR: server: add dns hint in server_inetaddr_updater struct"
- "MEDIUM: server/dns: clear RMAINT when addr resolves again"
While it could be backported in 2.9 with cd994407a ("BUG/MAJOR:
server/addr: fix a race during server addr:svc_port updates") to ensure
addr and svc_port reset performed by resolver's code comply with the
API taking care of pushing the update (and thus avoid any race), some
patch dependencies are quite sensitive so it's probably best to avoid
backporting for no good reason, or at least wait for it to be considered
stable to prevent any breakeages.
snr_update_srv_status() and srvrq_update_srv_status() will both set or
clear the server RMAINT state depending of the result of the current dns
resolution.
This used to work pretty well in the past, but now that addr:svc_port
changes are changed atomically through a dedicated task, the change is
performed asynchronously, so this can cause some flapping issues if the
server is put out of maintenance while the server's address is still
unassigned.
To prevent errors, the resolver's code is now only allowed to put the
server under maintenance but not to remove it from maintenance:
the decision to remove a server from maintenance is performed by the task
responsible for updating the server's addr: if the addr resolves again
thanks to a valid DNS resolution and the server was previously under
RMAINT, then it cleared from RMAINT state.
srvrq_update_srv_status() was renamed srvrq_set_srv_down(), since it is
only called to put the server in maintenance as a result of a failing
SRV entry.
snr_update_srv_status() was renamed srv_set_srv_down() and slightly
modified so that it only takes care of putting the server under
maintenance when needed.
The cli command "set server x/y addr" does not need to remove the RMAINT
flag anymore.
server_set_inetaddr() updater argument is a simple char * string
containing infos about the caller responsible for the update.
In this patch, we try to make this argument serializable, that is, make
it so that we can easily export it without having to keep the original
pointer passed by the caller or having to work with strings of variable
lengths.
This was a prerequisite for exposing more updater information through
SERVER_INETADDR event (upcoming patch).
Static strings were simply mapped to a fixed ID that can be converted back
to a string when needed using server_inetaddr_updater_by_to_str(). One
special case one made for the SERVER_INETADDR_UPDATER_DNS_RESOLVER updater
since in this case the updater hint has to be generated from the
corresponding resolver id / nameserver id combination. This was achieved
by saving the nameserver id within the updater struct. Knowing that the
resolver id can be guessed from the server struct directly, it was not
exposed through the updater struct.
This patch depends on:
- "MINOR: resolvers: add unique numeric id to nameservers"
No functional change should be expected.
When we want to avoid keeping pointers on a nameserver struct, it's not
always convenient to refer as a nameserver using it's text-based unique
identifier since it's not limited in length thus it cannot be serialized
and deserialized safely.
To address this limitation, we add a new ->puid member in dns_nameserver
struct which is a parent-unique numeric value that can be used to refer
to the dns nameserver within its parent resolver context.
To achieve this, we reused the resolver->nb_nameserver member that wasn't
used. Each time we add a new nameserver to a resolver: we set ns->puid to
the current number of nameservers within the resolver and we increment
this number right away.
Public helper function find_nameserver_by_resolvers_and_id() was added to
help retrieve nameserver pointer from (resolver X nameserver puid)
combination.
server_parse_addr_change_request() was completely replaced by the newer
srv_update_addr_port() function. Considering the function doesn't offer
useful features that srv_update_addr_port() couldn't do, we simply
remove the function.
Both functions are performing the similar tasks, except that the _port()
version is doing a bit more work.
In this patch, we add the server_set_inetaddr() function that works like
the srv_update_addr_port() but it takes parsed inputs instead of raw
strings as arguments.
Then, server_set_inetaddr() is used as underlying helper function for
both srv_update_addr() and srv_update_addr_port() to make them easier
to maintain.
Also, helper functions were added:
- server_set_inetaddr_warn() -> same as server_set_inetaddr() but report
a warning on updates.
- server_get_inetaddr() -> fills a struct server_inetaddr from srv
Since the feedback message generation part was slightly reworked, some
minor changes in the way addr:svc_port updates are reported in the logs
or cli messages should be expected (no loss of information though).
Previously, in srv_update_addr_port(), we forced connection cleanup on
server changes.
This was done in 6318d33ce ("BUG/MEDIUM: connections: force connections
cleanup on server changes").
However, there is no reason we shouldn't have done the same in
srv_update_addr() function, because the end goal is the same: perform
runtime changes on server's address.
The purge_conn hint propagated through the INETADDR server event was
simply there to keep the original behavior (only purge the connection
for events originating from srv_update_addr_port()), but to ensure the
address change is handled the same way for both code paths, we simply
ignore this hint.
server addr:svc_port updates during runtime might set or clear the
SRV_F_MAPPORTS flag. Unfortunately, the flag update is still directly
performed by srv_update_addr_port() function while the addr:svc_port
update is being scheduled for atomic update. Given that existing readers
don't take server's lock to read addr:svc_port, they also check the
SRV_F_MAPPORTS flag right after without the lock.
So we could cause the readers to incorrectly interpret the svc_port from
the server struct because the mapport information is not published
atomically, resulting in inconsistencies between svc_port / mapport flag.
(MAPPORTS flag causes svc_port to be used differently by the reader)
To fix this, we publish the mapport information within the INETADDR server
event and we let the task responsible for updating server's addr and port
position or clear the flag depending on the mapport hint.
This patch depends on:
- MINOR: server/event_hdl: add server_inetaddr struct to facilitate event data usage
- MINOR: server/event_hdl: update _srv_event_hdl_prepare_inetaddr prototype
This should be backported in 2.9 with 683b2ae01 ("MINOR: server/event_hdl:
add SERVER_INETADDR event")
Slightly change _srv_event_hdl_prepare_inetaddr() function prototype to
reduce the input arguments by learning some settings directly from the
server. Also taking this opportunity to make the function static inline
since it's relatively simple and not meant to be used directly.
4e5e2664 ("MINOR: proxy: add findserver_unique_id() and findserver_unique_name()")
added findserver_unique_id() and findserver_unique_name() functions that
were inspired from the historical findserver() function, so unfortunately
they don't perform well when used on large backend farms because they scan
the whole server list linearly.
I was about to provide a patch to optimize such functions when I stumbled
on Baptiste's work:
19a106d24 ("MINOR: server: server_find functions: id, name, best_match")
It turns out Baptiste already implemented helper functions to supersed
the unoptimized findserver() function (at least at runtime when servers
have been assigned their final IDs and inserted in the lookup trees): they
offer more matching options and rely on eb lookups so they are much more
suitable for fast queries. I don't know how I missed that, but they are a
perfect base for the server rid matching functions.
So in this patch, we essentially revert 4e5e2664 to provide the optimized
equivalent functions named server_find_by_id_unique() and
server_find_by_name_unique(), then we force existing findserver_unique_*()
callers to switch to the new functions.
This patch depends on:
- "OPTIM: server: eb lookup for server_find_by_name()"
This could be backported up to 2.8.
server_find_by_name() function was added in 19a106d24 ("MINOR: server:
server_find functions: id, name, best_match").
At that time, only the used_server_id proxy tree was available, thus the
name lookup was performed as a linear search.
However, used_server_name proxy tree was added in 84d6046a ("MINOR: proxy:
Add a "server by name" tree to proxy."), so we may safely rely on it to
perform server name lookups now. This will hopefully make the function
quite faster, especially when performing lookups in huge backend farms.
Currently, we have a check in proxy_cfg_ensure_no_http() that generates a
warning if the monitor-uri is configured on a proxy that doesn't have
mode HTTP enabled.
However, when we give a look at monitor-uri implementation, it's not
100% correct. Indeed, despite the warning message, the directive will
still be evaluated when HTTP upgrade occurs from a TCP frontend.
Thus the error is misleading.
To make the error message comply with the actual behavior, the check was
moved alongside other checks that accept both native HTTP mode or HTTP
upgrades in cfgparse.c.
When a TCP frontend uses an HTTP backend, the stream is automatically
upgraded and it results in a similar behavior as if a switch-mode http
rule was evaluated since stream_set_http_mode() gets called in both
situations and minimal HTTP analyzers are set.
In the current implementation, some postparsing checks are generating
errors or warnings when the frontend is in TCP mode with some HTTP options
set and no upgrade is expected (no switch-rule http). But as you can guess,
unfortunately this leads in issues when such "HTTP" only options are used
in a frontend that has implicit switching rules (that is, when the
frontend uses an HTTP backend for example), because in this case the
PR_O_HTTP_UPG will not be set, so the postparsing checks will consider
that some options are not relevant and will raise some warnings.
Consider the following example:
backend back
mode http
server s1 git.haproxy.org:80
frontend front
mode tcp
bind localhost:8080
http-request set-var(txn.test) str(TRUE),debug(WORKING,stderr)
use_backend back
By starting an haproxy instance with the above example conf, we end up
having this warning:
[WARNING] (400280) : config : 'http-request' rules ignored for frontend 'front' as they require HTTP mode.
However, by making a request on the frontend, we notice that the request
rules are still executed, and that's because the stream is effectively
upgraded as a result of an implicit upgrade:
[debug] WORKING: type=str <TRUE>
So this confirms the previous description: since implicit and explicit
upgrades result in approximately the same behavior on the frontend side,
we should consider them both when doing postparsing checks.
This is what we try to address in the following commit: PR_O_HTTP_UPG
flag is now more generic in the sense that it refers to either implicit
(through default_backend or use_backend rules) or explicit (switch-mode
rules) upgrades. Indeed, everytime an HTTP or dynamic backend (where the
mode cannot be assumed during parsing) is encountered in default_backend
directive or use_backend rules, we explicitly position the upgrade flag
so that further checks that depend on the proxy being in HTTP context
don't report false warnings.
Consider the following configuration:
backend back
mode http
frontend front
mode tcp
bind localhost:8080
stats enable
stats uri /stats
tcp-request content switch-mode http if FALSE
use_backend back
Firing a request to /stats on the haproxy process started with the above
configuration will cause a segfault in http_handle_stats().
The cause for the crash is that in this case, the upgrade doesn't simply
switches to HTTP mode, but also changes the stream backend (changing from
the frontend itself to the targeted HTTP backend).
However, there is an inconsitency in the stats logic between the check
for the stats URI and the actual handling of the stats page.
Indeed, http_stats_check_uri() checks uri parameters from the proxy
undergoing the http analyzers, whereas http_handle_stats() uses s->be
instead.
During stream analysis, from the frontend perspective: s->be defaults to
the frontend. But if the frontend is in TCP mode and the stream is
upgraded to HTTP via backend switching rules, then s->be will be assigned
to the actual HTTP-capable backend in stream_set_backend().
What this means is that when the http analyzer first checks if the current
URI matches the one from the "stats uri" directive, it will check against
the "stats uri" directive from the frontend, but later since the stats
handlers reads the uri from s->be it wil actually use the value from the
backend and the previous safety checks are thus garbage, resulting in
unexpected behavior. (In our test case since the backend didn't define
"stats uri" it is set to NULL, and http_handle_stats() dereferences it)
To fix this, we should ensure that prechecks and actual stats processing
always rely on the same proxy source for stats config directives.
This is what is done in this patch, thanks to the previous commit, since
we can make sure that the stat applet will use ->http_px as its parent
proxy. So here we simply propagate the current proxy being analyzed
through all the stats processing functions.
This patch depends on:
- MINOR: stats: store the parent proxy in stats ctx (http)
It should be backported up to 2.4.
For 2.4: the fix is less trivial since stats ctx was directly stored
within the applet struct at that time, so this alternative patch must be
used instead (without "MINOR: stats: store the parent proxy in stats ctx
(http)" dependency):
diff --git a/include/haproxy/applet-t.h b/include/haproxy/applet-t.h
index 014e01ed9..1d9a63359 100644
--- a/include/haproxy/applet-t.h
+++ b/include/haproxy/applet-t.h
@@ -121,6 +121,7 @@ struct appctx {
* keep the grouped together and avoid adding new ones.
*/
struct {
+ struct proxy *http_px; /* parent proxy of the current applet (only relevant for HTTP applet) */
void *obj1; /* context pointer used in stats dump */
void *obj2; /* context pointer used in stats dump */
uint32_t domain; /* set the stats to used, for now only proxy stats are supported */
diff --git a/src/http_ana.c b/src/http_ana.c
index b557da89d..1025d7711 100644
--- a/src/http_ana.c
+++ b/src/http_ana.c
@@ -63,8 +63,8 @@ static enum rule_result http_req_restrict_header_names(struct stream *s, struct
static void http_manage_client_side_cookies(struct stream *s, struct channel *req);
static void http_manage_server_side_cookies(struct stream *s, struct channel *res);
-static int http_stats_check_uri(struct stream *s, struct http_txn *txn, struct proxy *backend);
-static int http_handle_stats(struct stream *s, struct channel *req);
+static int http_stats_check_uri(struct stream *s, struct http_txn *txn, struct proxy *px);
+static int http_handle_stats(struct stream *s, struct channel *req, struct proxy *px);
static int http_handle_expect_hdr(struct stream *s, struct htx *htx, struct http_msg *msg);
static int http_reply_100_continue(struct stream *s);
@@ -428,7 +428,7 @@ int http_process_req_common(struct stream *s, struct channel *req, int an_bit, s
}
/* parse the whole stats request and extract the relevant information */
- http_handle_stats(s, req);
+ http_handle_stats(s, req, px);
verdict = http_req_get_intercept_rule(px, &px->uri_auth->http_req_rules, s);
/* not all actions implemented: deny, allow, auth */
@@ -3959,16 +3959,16 @@ void http_check_response_for_cacheability(struct stream *s, struct channel *res)
/*
* In a GET, HEAD or POST request, check if the requested URI matches the stats uri
- * for the current backend.
+ * for the current proxy.
*
* It is assumed that the request is either a HEAD, GET, or POST and that the
* uri_auth field is valid.
*
* Returns 1 if stats should be provided, otherwise 0.
*/
-static int http_stats_check_uri(struct stream *s, struct http_txn *txn, struct proxy *backend)
+static int http_stats_check_uri(struct stream *s, struct http_txn *txn, struct proxy *px)
{
- struct uri_auth *uri_auth = backend->uri_auth;
+ struct uri_auth *uri_auth = px->uri_auth;
struct htx *htx;
struct htx_sl *sl;
struct ist uri;
@@ -4003,14 +4003,14 @@ static int http_stats_check_uri(struct stream *s, struct http_txn *txn, struct p
* s->target which is supposed to already point to the stats applet. The caller
* is expected to have already assigned an appctx to the stream.
*/
-static int http_handle_stats(struct stream *s, struct channel *req)
+static int http_handle_stats(struct stream *s, struct channel *req, struct proxy *px)
{
struct stats_admin_rule *stats_admin_rule;
struct stream_interface *si = &s->si[1];
struct session *sess = s->sess;
struct http_txn *txn = s->txn;
struct http_msg *msg = &txn->req;
- struct uri_auth *uri_auth = s->be->uri_auth;
+ struct uri_auth *uri_auth = px->uri_auth;
const char *h, *lookup, *end;
struct appctx *appctx;
struct htx *htx;
@@ -4020,6 +4020,7 @@ static int http_handle_stats(struct stream *s, struct channel *req)
memset(&appctx->ctx.stats, 0, sizeof(appctx->ctx.stats));
appctx->st1 = appctx->st2 = 0;
appctx->ctx.stats.st_code = STAT_STATUS_INIT;
+ appctx->ctx.stats.http_px = px;
appctx->ctx.stats.flags |= uri_auth->flags;
appctx->ctx.stats.flags |= STAT_FMT_HTML; /* assume HTML mode by default */
if ((msg->flags & HTTP_MSGF_VER_11) && (txn->meth != HTTP_METH_HEAD))
diff --git a/src/stats.c b/src/stats.c
index d1f3daa98..1f0b2bff7 100644
--- a/src/stats.c
+++ b/src/stats.c
@@ -2863,9 +2863,9 @@ static int stats_dump_be_stats(struct stream_interface *si, struct proxy *px)
return stats_dump_one_line(stats, stats_count, appctx);
}
-/* Dumps the HTML table header for proxy <px> to the trash for and uses the state from
- * stream interface <si> and per-uri parameters <uri>. The caller is responsible
- * for clearing the trash if needed.
+/* Dumps the HTML table header for proxy <px> to the trash and uses the state from
+ * stream interface <si>. The caller is responsible for clearing the trash if
+ * needed.
*/
static void stats_dump_html_px_hdr(struct stream_interface *si, struct proxy *px)
{
@@ -3015,17 +3015,19 @@ static void stats_dump_html_px_end(struct stream_interface *si, struct proxy *px
* input buffer. Returns 0 if it had to stop dumping data because of lack of
* buffer space, or non-zero if everything completed. This function is used
* both by the CLI and the HTTP entry points, and is able to dump the output
- * in HTML or CSV formats. If the later, <uri> must be NULL.
+ * in HTML or CSV formats.
*/
int stats_dump_proxy_to_buffer(struct stream_interface *si, struct htx *htx,
- struct proxy *px, struct uri_auth *uri)
+ struct proxy *px)
{
struct appctx *appctx = __objt_appctx(si->end);
- struct stream *s = si_strm(si);
struct channel *rep = si_ic(si);
struct server *sv, *svs; /* server and server-state, server-state=server or server->track */
struct listener *l;
+ struct uri_auth *uri = NULL;
+ if (appctx->ctx.stats.http_px)
+ uri = appctx->ctx.stats.http_px->uri_auth;
chunk_reset(&trash);
switch (appctx->ctx.stats.px_st) {
@@ -3045,7 +3047,7 @@ int stats_dump_proxy_to_buffer(struct stream_interface *si, struct htx *htx,
break;
/* match '.' which means 'self' proxy */
- if (strcmp(scope->px_id, ".") == 0 && px == s->be)
+ if (strcmp(scope->px_id, ".") == 0 && px == appctx->ctx.stats.http_px)
break;
scope = scope->next;
}
@@ -3227,10 +3229,16 @@ int stats_dump_proxy_to_buffer(struct stream_interface *si, struct htx *htx,
}
/* Dumps the HTTP stats head block to the trash for and uses the per-uri
- * parameters <uri>. The caller is responsible for clearing the trash if needed.
+ * parameters from the parent proxy. The caller is responsible for clearing
+ * the trash if needed.
*/
-static void stats_dump_html_head(struct appctx *appctx, struct uri_auth *uri)
+static void stats_dump_html_head(struct appctx *appctx)
{
+ struct uri_auth *uri;
+
+ BUG_ON(!appctx->ctx.stats.http_px);
+ uri = appctx->ctx.stats.http_px->uri_auth;
+
/* WARNING! This must fit in the first buffer !!! */
chunk_appendf(&trash,
"<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"\n"
@@ -3345,17 +3353,21 @@ static void stats_dump_html_head(struct appctx *appctx, struct uri_auth *uri)
}
/* Dumps the HTML stats information block to the trash for and uses the state from
- * stream interface <si> and per-uri parameters <uri>. The caller is responsible
- * for clearing the trash if needed.
+ * stream interface <si> and per-uri parameters from the parent proxy. The caller
+ * is responsible for clearing the trash if needed.
*/
-static void stats_dump_html_info(struct stream_interface *si, struct uri_auth *uri)
+static void stats_dump_html_info(struct stream_interface *si)
{
struct appctx *appctx = __objt_appctx(si->end);
unsigned int up = (now.tv_sec - start_date.tv_sec);
char scope_txt[STAT_SCOPE_TXT_MAXLEN + sizeof STAT_SCOPE_PATTERN];
const char *scope_ptr = stats_scope_ptr(appctx, si);
+ struct uri_auth *uri;
unsigned long long bps = (unsigned long long)read_freq_ctr(&global.out_32bps) * 32;
+ BUG_ON(!appctx->ctx.stats.http_px);
+ uri = appctx->ctx.stats.http_px->uri_auth;
+
/* Turn the bytes per second to bits per second and take care of the
* usual ethernet overhead in order to help figure how far we are from
* interface saturation since it's the only case which usually matters.
@@ -3629,8 +3641,7 @@ static void stats_dump_json_end()
* a pointer to the current server/listener.
*/
static int stats_dump_proxies(struct stream_interface *si,
- struct htx *htx,
- struct uri_auth *uri)
+ struct htx *htx)
{
struct appctx *appctx = __objt_appctx(si->end);
struct channel *rep = si_ic(si);
@@ -3650,7 +3661,7 @@ static int stats_dump_proxies(struct stream_interface *si,
px = appctx->ctx.stats.obj1;
/* skip the disabled proxies, global frontend and non-networked ones */
if (!px->disabled && px->uuid > 0 && (px->cap & (PR_CAP_FE | PR_CAP_BE))) {
- if (stats_dump_proxy_to_buffer(si, htx, px, uri) == 0)
+ if (stats_dump_proxy_to_buffer(si, htx, px) == 0)
return 0;
}
@@ -3666,14 +3677,12 @@ static int stats_dump_proxies(struct stream_interface *si,
}
/* This function dumps statistics onto the stream interface's read buffer in
- * either CSV or HTML format. <uri> contains some HTML-specific parameters that
- * are ignored for CSV format (hence <uri> may be NULL there). It returns 0 if
- * it had to stop writing data and an I/O is needed, 1 if the dump is finished
- * and the stream must be closed, or -1 in case of any error. This function is
- * used by both the CLI and the HTTP handlers.
+ * either CSV or HTML format. It returns 0 if it had to stop writing data and
+ * an I/O is needed, 1 if the dump is finished and the stream must be closed,
+ * or -1 in case of any error. This function is used by both the CLI and the
+ * HTTP handlers.
*/
-static int stats_dump_stat_to_buffer(struct stream_interface *si, struct htx *htx,
- struct uri_auth *uri)
+static int stats_dump_stat_to_buffer(struct stream_interface *si, struct htx *htx)
{
struct appctx *appctx = __objt_appctx(si->end);
struct channel *rep = si_ic(si);
@@ -3688,7 +3697,7 @@ static int stats_dump_stat_to_buffer(struct stream_interface *si, struct htx *ht
case STAT_ST_HEAD:
if (appctx->ctx.stats.flags & STAT_FMT_HTML)
- stats_dump_html_head(appctx, uri);
+ stats_dump_html_head(appctx);
else if (appctx->ctx.stats.flags & STAT_JSON_SCHM)
stats_dump_json_schema(&trash);
else if (appctx->ctx.stats.flags & STAT_FMT_JSON)
@@ -3708,7 +3717,7 @@ static int stats_dump_stat_to_buffer(struct stream_interface *si, struct htx *ht
case STAT_ST_INFO:
if (appctx->ctx.stats.flags & STAT_FMT_HTML) {
- stats_dump_html_info(si, uri);
+ stats_dump_html_info(si);
if (!stats_putchk(rep, htx, &trash))
goto full;
}
@@ -3733,7 +3742,7 @@ static int stats_dump_stat_to_buffer(struct stream_interface *si, struct htx *ht
case STATS_DOMAIN_PROXY:
default:
/* dump proxies */
- if (!stats_dump_proxies(si, htx, uri))
+ if (!stats_dump_proxies(si, htx))
return 0;
break;
}
@@ -4112,11 +4121,14 @@ static int stats_process_http_post(struct stream_interface *si)
static int stats_send_http_headers(struct stream_interface *si, struct htx *htx)
{
struct stream *s = si_strm(si);
- struct uri_auth *uri = s->be->uri_auth;
+ struct uri_auth *uri;
struct appctx *appctx = __objt_appctx(si->end);
struct htx_sl *sl;
unsigned int flags;
+ BUG_ON(!appctx->ctx.stats.http_px);
+ uri = appctx->ctx.stats.http_px->uri_auth;
+
flags = (HTX_SL_F_IS_RESP|HTX_SL_F_VER_11|HTX_SL_F_XFER_ENC|HTX_SL_F_XFER_LEN|HTX_SL_F_CHNK);
sl = htx_add_stline(htx, HTX_BLK_RES_SL, flags, ist("HTTP/1.1"), ist("200"), ist("OK"));
if (!sl)
@@ -4166,11 +4178,14 @@ static int stats_send_http_redirect(struct stream_interface *si, struct htx *htx
{
char scope_txt[STAT_SCOPE_TXT_MAXLEN + sizeof STAT_SCOPE_PATTERN];
struct stream *s = si_strm(si);
- struct uri_auth *uri = s->be->uri_auth;
+ struct uri_auth *uri;
struct appctx *appctx = __objt_appctx(si->end);
struct htx_sl *sl;
unsigned int flags;
+ BUG_ON(!appctx->ctx.stats.http_px);
+ uri = appctx->ctx.stats.http_px->uri_auth;
+
/* scope_txt = search pattern + search query, appctx->ctx.stats.scope_len is always <= STAT_SCOPE_TXT_MAXLEN */
scope_txt[0] = 0;
if (appctx->ctx.stats.scope_len) {
@@ -4263,7 +4278,7 @@ static void http_stats_io_handler(struct appctx *appctx)
}
if (appctx->st0 == STAT_HTTP_DUMP) {
- if (stats_dump_stat_to_buffer(si, res_htx, s->be->uri_auth))
+ if (stats_dump_stat_to_buffer(si, res_htx))
appctx->st0 = STAT_HTTP_DONE;
}
@@ -4888,6 +4903,7 @@ static int cli_parse_show_stat(char **args, char *payload, struct appctx *appctx
appctx->ctx.stats.scope_str = 0;
appctx->ctx.stats.scope_len = 0;
+ appctx->ctx.stats.http_px = NULL; // not under http context
appctx->ctx.stats.flags = STAT_SHNODE | STAT_SHDESC;
if ((strm_li(si_strm(appctx->owner))->bind_conf->level & ACCESS_LVL_MASK) >= ACCESS_LVL_OPER)
@@ -4954,7 +4970,7 @@ static int cli_io_handler_dump_info(struct appctx *appctx)
*/
static int cli_io_handler_dump_stat(struct appctx *appctx)
{
- return stats_dump_stat_to_buffer(appctx->owner, NULL, NULL);
+ return stats_dump_stat_to_buffer(appctx->owner, NULL);
}
static int cli_io_handler_dump_json_schema(struct appctx *appctx)
Some HTTP related stats functions need to know the parent proxy, mainly
to get a pointer on the related uri_auth set by the proxy or to check
scope settings.
The current design (probably historical as only the http context existed
by then) took the other approach: it propagates the uri pointer from the
http context deep down the calling stack up to the relevant functions.
For non-http contexts (cli), the pointer is set to NULL.
Doing so is not very pretty and not easy to maintain. Moreover, there were
still some places in the code were the uri pointer was learned directly
from the stream proxy because the argument was not available as argument
from those functions. This is error-prone, because if one day we decide to
change the source proxy in the parent function, we might still have some
functions down the stack that ignore the top most argument and still do
on their own, and we'll probably end up with inconsistencies.
So in this patch, we take a safer approach: the caller responsible for
creating the stats applet should set the http_px pointer so that any stats
function running under the applet that needs to know if it's running in
http context or needs to access parent proxy info may do so thanks to
the dedicated ctx->http_px pointer.
Consider that application layer is responsible to set proper error code
on init or finalize operation failure. In case of H3, use INTERNAL_ERROR
application error code. This allows to remove qcc_set_error() invocation
from qmux_init().
In case application layer would not specify any error code, fallback
INTERNAL_ERROR transport error code would be used thanks to the recent
change introduced for error management in qmux_init().
If H3 control stream Tx buffer cannot be allocated, return a proper
errur through h3_finalize(). This will cause the emission of a
CONNECTION_CLOSE with error H3_INTERNAL_ERROR and closure of the whole
connection.
This should be backported up to 2.6. Note that 2.9 has some difference
which will cause conflict. The main one is that qcc_get_stream_txbuf()
does not exist in this version. Instead the check in h3_control_send()
should be made after mux_get_buf(). Finally, it may be useful to first
pick previous commit (MINOR: h3: add traces for connection init stage)
to improve context similarity.
If QUIC MUX cannot be initialized for any reason, the connection is shut
down with a CONNECTION_CLOSE frame. Previously, no error code was
explicitely specified, resulting in "no error" code.
Change this by always set error code in case of QUIC MUX failure. Use
the already defined QUIC MUX error code or "internal error" if unset.
Call quic_set_connection_close() on error label to register it to the
quic_conn layer.
This should help to improve error reporting in case of MUX
initialization failure.
qmux_init() may fail at different stage. In this case, an error is
returned and QCC allocated element are freed. Previously, extra care was
taken using different label to only liberate already allocate elements.
This patch removes the multi label and uses qcc_release(). This will be
simpler to ensure a QCC is always properly freed. The only important
thing is to ensure that mandatory fields are properly initialized to
NULL or equivalent to be able to use qcc_release() safely.
Render qcc_release() more generic by removing qcc_shutdown(). This
prevents systematic graceful shutdown/CONNECTION_CLOSE emission if only
QCC resource deallocation is necessary.
For now, qcc_shutdown() is used before every qcc_release() invocation.
The only exception is on qmux_destroy stream layer callback.
This commit will be useful to reuse qcc_release() in other contexts to
simply deallocate a QCC instance.
A regression was introduced by the commit c886fb58eb ("MINOR: server/ip:
centralize server ip updates"). The configured address family is lost when the
server address is initialized during the startup, for the resolution based on
the libc or based on the server state-file. Thus, "ipv4@" and "ipv6@" prefixed
are ignored.
To fix the bug, we take care to use the configured address family before calling
str2ip2() in srv_apply_lastaddr() and srv_apply_via_libc() functions.
This patch should fix the issue #2393. It must be backported to 2.9.
H3 uses a direct reference to quic_conn to access the listener instance.
This can be replaced by using qcc->conn->target. This allows to remove
quic_conn-t.h header include from it.
An error on the H2 connection was always reported as an error to the
stream-endpoint descriptor, independently on the H2 stream state. But it is
a bug to do so for closed streams. And indeed, it leads to report "SD--"
termination state for some streams while the response was fully received and
forwarded to the client, at least for the backend side point of view.
Now, errors are no longer reported for H2 streams in closed state.
This patch is related to the three previous ones:
* "BUG/MEDIUM: mux-h2: Don't report error on SE for closed H2 streams"
* "BUG/MEDIUM: mux-h2: Don't report error on SE if error is only pending on H2C"
* "BUG/MEDIUM: mux-h2: Only Report H2C error on read error if demux buffer is empty"
The series should fix a bug reported in issue #2388
(#2388#issuecomment-1855735144). The series should be backported to 2.9 but
only after a period of observation. In theory, older versions are also
affected but this part is pretty sensitive. So don't backport it further
except if someone ask for it.
In h2s_wake_one_stream(), we must not report an error on the stream-endpoint
descriptor if the error is not definitive on the H2 connection. A pending
error on the H2 connection means there are potentially remaining data to be
demux. It is important to not truncate a message for a stream.
This patch is part of a series that should fix a bug reported in issue #2388
(#2388#issuecomment-1855735144). Backport instructions will be shipped in
the last commit of the series.
It is similar to the previous fix ("BUG/MEDIUM: mux-h2: Don't report H2C
error on read error if dmux buffer is not empty"), but on receive side. If
the demux buffer is not empty, an error on the TCP connection must not be
immediately reported as an error on the H2 connection. We must be sure to
have tried to demux all data first. Otherwise, messages for one or more
streams may be truncated while all data were already received and are
waiting to be demux.
This patch is part of a series that should fix a bug reported in issue #2388
(#2388#issuecomment-1855735144). Backport instructions will be shipped in
the last commit of the series.
When an error on the H2 connection is detected when sending data, only a
pending error is reported, waiting for an error or a shutdown on the read
side. However if a shutdown was already received, the pending error is
switched to a definitive error.
At this stage, we must also wait to have flushed the demux
buffer. Otherwise, if some data must still be demux, messages for one or
more streams may be truncated. There is already the flag H2_CF_END_REACHED
to know a shutdown was received and we no longer progress on demux side
(buffer empty or data truncated). On sending side, we should use this flag
instead to report a definitive error.
This patch is part of a series that should fix a bug reported in issue #2388
(#2388#issuecomment-1855735144). Backport instructions will be shipped in
the last commit of the series.
Bug #1740 was opened again, this time a user is complaining about the
"can't create socket for nameserver". This can happen if the resolv.conf
file contains a class of address which was not configured on the
machine, for example IPv6.
The fix does the same as b10b1196b ("MINOR: resolvers: shut the warning
when "default" resolvers is implicit"), and uses the
"resolvers->conf.implicit" variable to emit the error.
Though it is not needed to convert the explicit behavior with a
ERR_WARN, because this is supposed to be an unrecoverable error, unlike
the connect().
Should fix issue #1740.
Must be backported were b10b1196b was backported. (as far as 2.6)
On STOP_SENDING reception, an error is notified to the stream layer as
no more data can be responded. However, this is not done if the stream
instance is not allocated (already freed for example).
The issue occurs if STOP_SENDING is received and the stream instance is
instantiated after it. It happens if a STREAM frame is received after it
with H3 HEADERS, which is valid in QUIC protocol due to UDP packet
reordering. In this case, stream layer is never notified about the
underlying error. Instead, reponse buffers are silently purged by the
MUX in qmux_strm_snd_buf().
This is suboptimal as there is no point in exchanging data from the
server if it cannot be eventually transferred back to the client.
However, aside from this consideration, no other issue occured. However,
this is not the case with QUIC mux-to-mux implementation. Now, if
mux-to-mux is used, qmux_strm_snd_buf() is bypassed and response if
transferred via nego_ff/done_ff callbacks. However, these functions did
not checked if QCS is already locally closed. This causes a crash when
qcc_send_stream() is called via done_ff.
To fix this crash, there is several approach, one of them would be to
adjust nego_ff/done_ff QUIC callbacks. However, another method has been
chosen. Now stream layer is flagged on error just after its
instantiation if the stream is already locally closed. This ensures that
mux-to-mux won't try to emit data as se_nego_ff() check if the opposide
SD is not on error before continuing.
Note that an alternative solution could be to not instantiate at all
stream layer if QCS is already locally closed. This is the most optimal
solution as it reduce unnecessary allocations and task processing.
However, it's not easy to implement so the easier bug fix has been
chosen for the moment.
This patch is labelled as MEDIUM as it can change behavior of all QCS
instances, wheter mux-to-mux is used or not, and thus could reveal other
architecture issues.
This should fix latest crash occurence on github issue #2392.
It should be backported up to 2.6, until a necessary period of
observation.
During HEADERS frames decoding, if a frame is too large to fit in a buffer,
an internal error is reported and a RST_STREAM is emitted. On the other
hand, we wait to have an empty rxbuf to decode the frame because we cannot
retry a failed HPACK decompression.
When we are decoding headers, it is valid to return an error if dbuf buffer
is full because no data can be blocked in the rxbuf (which hosts the HTX
message).
However, during the trailers decoding, it is possible to have some data not
sent yet for the current stream in the rxbug and data for another stream
fully filling the dbuf buffer. In this case, we don't decode the trailers
but we must not return an error. We must wait to empty the rxbuf first.
Now, a HEADERS frame is considered as too large if the dbuf buffer is full
and if the rxbuf is empty (the HTX message to be accurate).
This patch should fix the issue #2382. It must be backported to all stable
versions.
Commit f89ba27caa ("BUG/MEDIUM: mux-h1; Ignore headers modifications about
payload representation") introduced a regression. The Content-Length is no
longer sent to the server for requests without payload but with a
'Content-Lnegth' header explicitly set to 0, like POST request with no
payload. It is of course unexpected. In some cases, depending on the server,
such requests are considered as invalid and a 411-Length-Required is returned.
The above commit is not directly responsible for the bug, it only reveals a too
lax condition to skip the 'Content-Length' header of bodyless requests. We must
only skip this header if none was originally found, during the parsing.
This patch should fix the issue #2386. It must be backported to 2.9.
During zero-copy forwarding, we first try to forward data found in the input
buffer before trying to receive more data. These data must be removed from
the amount of data to forward (the cound variable).
Otherwise, on an internal retry, in h1_fastfwd(), we can be lead to read
more data than expected. It is especially a problem on the end of a
chunk. An error is erroneously reported because more data than announced are
received.
This patch should fix the issue #2382. It must be backported to 2.9.
This bug arrived with this commit:
BUG/MINOR: quic: Wrong RETIRE_CONNECTION_ID sequence number chec
Every connection ID manipulations against the by thread trees used to store the
connection IDs must be done under the trees locks. These trees are accessed by
the low level connection identification code.
When receiving a RETIRE_CONNECTION_ID frame, the concerned connection ID must
be deleted from the its underlying by thread tree but not without locking!
Add a WR lock around ebmb_delete() call to do so.
Must be backported as far as 2.7.
Similarly to H3, hq-interop now uses zero-copy when dealing with a HTX
message with only a single data block. Exchange HTX and QCS buffer, and
use the HTX data block for HTTP payload. This is only possible if QCS
buffer is empty. Contrary to HTTP/3, no extra frame header is needed
before transferring HTTP payload.
hq-interop is only implemented for testing purpose so this change should
not be noticeable by users. However, it will be useful to be able to
test zero-copy transfer on QUIC interop testing.
Adjust HTTP/3 data emission. First, add HTX as argument to the function
as this is used for other frames emission function. Keep the buffer
argument as this is mandatory for zero-copy. Extend comments related to
this, in particular to explain purposes of both HTX and buffer
arguments.
No function change here. This should however be useful to port a code
equivalent to hq-interop protocol.
Add data level traces for each encoded H3 frame. Of notable interest,
traces will be useful to detect if standard emission, zero-copy or fast
forward is used. Also add the generic filter H3_EV_TX_FRAME to be able
to filter these messages.
When dealing with HTTP/1 responses without Content-Length nor chunked
encoding, flag QC_SF_UNKNOWN_PL_LENGTH is set on QCS. This prevent the
emission of a RESET_STREAM on shutw, instead resorting to a proper FIN
emission.
This code was duplicated both in H3 and hq-interop. Move it in common
qcs_http_snd_buf() to factorize it.
qcc_app_ops is a set of callbacks used to unify application protocol
running over QUIC. This commit introduces some changes to clarify its
API :
* write simple comment to reflect each callback purpose
* rename decode_qcs to rcv_buf as this name is more common and is
similar to already existing snd_buf
* finalize is moved up as it is used during connection init stage
All these changes are ported to HTTP/3 layer. Also function comments
have been extended to highlight HTTP/3 special characteristics.
This function is similar to the previous one, but this time for QCS
sending buffer.
Previously, each application layer redefine their own version of
mux_get_buf() which was used to allocate <qcs.tx.buf>. Unify it under a
single function renamed qcc_get_stream_txbuf().
Replaces qcs_get_buf() function which naming does not reflect its
purpose. Add a new function qcc_get_stream_rxbuf() which allocate if
needed <qcs.rx.app_buf> and returns the buffer pointer. This function is
reserved for application protocol layer. This buffer is then accessed by
stconn layer.
For other qcs_get_buf() invocation which was used in effect for a local
buffer, replace these by a plain b_alloc().
Since 1de44da ("MINOR: ext-check: add an option to preserve environment
variables"), it is now possible to provide an extra argument to
"external-check" directive. This allows to support the "preserve-env"
option which differs from the default behavior.
However a mistake was made, because the config parser doesn't allow the
default configuration anymore: using external-check without argument will
trigger an error:
'external-check' only supports 'preserve-env' as an argument, found ''.
This is due to as small mistake in the code that make the check
systematically report an error if the first argument is not equal to
"preserve-env". The check was modified so that the error is only reported
if the argument is provided, so that the default behavior is restored.
This should fix GH #2380 and should be backported on 2.9 and potentially
further (anywhere 1de44da is, because a note about an optional backport
up to the 2.6 was left in the original commit message)
Some regressions were introduced by 5fea59754b ("MEDIUM: map/acl:
Accelerate several functions using pat_ref_elt struct ->head list")
pat_ref_delete_by_id() fails to properly unlink and free the removed
reference because it bypasses the pat_ref_delete_by_ptr() made for
that purpose. This function is normally used everywhere the target
reference is set for removal, such as the pat_ref_delete() function
that matches pattern against a string. The call was probably skipped
by accident during the rewrite of the function.
With the above commit also comes another undesirable change:
both pat_ref_delete_by_id() and pat_ref_set_by_id() directly use the
<refelt> argument as a valid pointer (they do dereference it).
This is wrong, because <refelt> is unsafe and should be handled as an
ID, not a pointer (hence the function name). Indeed, the calling function
may directly pass user input from the CLI as <refelt> argument, so we must
first ensure that it points to a valid element before using it, else it is
probably invalid and we shouldn't touch it.
What this patch essentially does, is that it reverts pat_ref_set_by_id()
and pat_ref_delete_by_id() to pre 5fea59754b behavior. This seems like
it was the only optimization from the patch that doesn't apply.
Hopefully, after reviewing the changes with Fred, it seems that the 2
functions are only being involved in commands for manipulating maps or
acls on the cli, so the "missed" opportunity to improve their performance
shouldn't matter much. Nonetheless, if we wanted to speed up the reference
lookup by ID, we could consider adding an eb64 tree for that specific
purpose that contains all pattern references IDs (ie: pointers) so that
eb lookup functions may be used instead of linear list search.
The issue was raised by Marko Juraga as he failed to perform an an acl
removal by reference on the CLI on 2.9 which was known to work properly
on other versions.
It should be backported on 2.9.
Co-Authored-by: Frédéric Lécaille <flecaille@haproxy.com>
The PR which allows to chose a certificate depending on the ciphers and
the signature algorithms was merged in WolfSSL. Let's activate this
code.
This could be backported in 2.9 only when the next WolfSSL release is
available (5.6.5). It will also need a check on the version.
This bug impacts only the OpenSSL QUIC compatibility module (USE_QUIC_OPENSSL_COMPAT).
This may happen only when the TLS stack has to be provided with more than 1024+1+5+16
bytes of CRYPTO data. In this case several TLS records have to be built in one
call to SSL_provide_quic_data(). A 5-bytes header is created at the head
of these records. This header is used as AAD to cipher the record. But
the length of this AAD was counted two times. One time here in
quic_tls_compat_create_record() (initialization):
adlen = quic_tls_compat_create_header(qc, rec, ad, 0);
and a second time here in the same function after quic_tls_tls_seal() return:
ret = aad_len + outlen;
This addition is useless. Note that this bug could be reproduced when haproxy has
to authenticate the client.
Thank you to @vifino for having reported this issue in GH #2381.
Must be backported to 2.8.
"set severity-output" is one of these command that changes the appctx
state so the next commands are affected.
Unfortunately the master CLI works with pipelining and server close
mode, which means the connection between the master and the worker is
closed after each response, so for the next command this is a new appctx
state.
To fix the problem, 2 new flags are added ACCESS_MCLI_SEVERITY_STR and
ACCESS_MCLI_SEVERITY_NB which are used to prefix each command sent to
the worker with the right "set severity-output" command.
This patch fixes issue #2350.
It could be backported as far as 2.6.
All QUIC MUX functions which are callbacks for stream layer use the
prefix qmux_strm_*. This was not the case for fast forward related
callback which only used qmux_* prefix.
Fix this by reusing the standard prefix to respect QUIC MUX code
convention.
Implement callback for fast forwarding for hq-interop.
This change should not be considered as functionally important. Indeed,
HTTP/0.9 is reserved for QUIC interop testing and should not be used
outside of it. However, implementing fast forwarding in this context is
useful as this will allow to test MUX code sections for fast forward via
QUIC interop.
This bugfix is the same as the following one:
"BUG/MINOR: ssl_ckch: Wrong OCSP CID after modifying an SSL certficate"
where the OCSP CID had to be reset when updating a certificate.
Must be backported to 2.8.
This bug could be reproduced with the "set ssl cert" CLI command to update
a certificate. The OCSP CID is duplicated by ckchs_dup() which calls
ssl_sock_copy_cert_key_and_chain(). It should be computed again by
ssl_sock_load_ocsp(). This may be accomplished resetting the new ckch OCSP CID
returned by ckchs_dup().
This bug may be in relation with GH #2319.
Must be backported to 2.8.
This patch allows cli_io_handler_commit_cert() callback called upon
a "commit ssl cert ..." command to prefix the messages returned by the CLI
to the by the ones built by ha_warining(), ha_alert().
Should be interesting to backport this commit to 2.8.
This bug could be reproduced loading several certificated from "bind" line:
with "server_ocsp.pem" as argument to "crt" setting and updating
the CDSA certificate with the RSA as follows:
echo -e "set ssl cert reg-tests/ssl/ocsp_update/multicert/server_ocsp.pem.ecdsa \
<<\n$(cat reg-tests/ssl/ocsp_update/multicert/server_ocsp.pem.rsa)\n" | socat - /tmp/stats
followed by an "commit ssl cert reg-tests/ssl/ocsp_update/multicert/server_ocsp.pem.ecdsa"
command. This could be detected by libasan as follows:
=================================================================
==507223==ERROR: AddressSanitizer: attempting double-free on 0x60200007afb0 in thread T3:
#0 0x7fabc6fb5527 in __interceptor_free (/usr/lib/x86_64-linux-gnu/libasan.so.1+0x54527)
#1 0x7fabc6ae8f8c in ossl_asn1_string_embed_free (/opt/quictls/lib/libcrypto.so.81.3+0xd4f8c)
#2 0x7fabc6af54e9 in ossl_asn1_primitive_free (/opt/quictls/lib/libcrypto.so.81.3+0xe14e9)
#3 0x7fabc6af5960 in ossl_asn1_template_free (/opt/quictls/lib/libcrypto.so.81.3+0xe1960)
#4 0x7fabc6af569f in ossl_asn1_item_embed_free (/opt/quictls/lib/libcrypto.so.81.3+0xe169f)
#5 0x7fabc6af58a4 in ASN1_item_free (/opt/quictls/lib/libcrypto.so.81.3+0xe18a4)
#6 0x46a159 in ssl_sock_free_cert_key_and_chain_contents src/ssl_ckch.c:723
#7 0x46aa92 in ckch_store_free src/ssl_ckch.c:869
#8 0x4704ad in cli_release_commit_cert src/ssl_ckch.c:1981
#9 0x962e83 in cli_io_handler src/cli.c:1140
#10 0xc1edff in task_run_applet src/applet.c:454
#11 0xaf8be9 in run_tasks_from_lists src/task.c:634
#12 0xafa2ed in process_runnable_tasks src/task.c:876
#13 0xa23c72 in run_poll_loop src/haproxy.c:3024
#14 0xa24aa3 in run_thread_poll_loop src/haproxy.c:3226
#15 0x7fabc69e7ea6 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x7ea6)
#16 0x7fabc6907a2e in __clone (/lib/x86_64-linux-gnu/libc.so.6+0xfba2e)
0x60200007afb0 is located 0 bytes inside of 3-byte region [0x60200007afb0,0x60200007afb3)
freed by thread T3 here:
#0 0x7fabc6fb5527 in __interceptor_free (/usr/lib/x86_64-linux-gnu/libasan.so.1+0x54527)
#1 0x7fabc6ae8f8c in ossl_asn1_string_embed_free (/opt/quictls/lib/libcrypto.so.81.3+0xd4f8c)
previously allocated by thread T2 here:
#0 0x7fabc6fb573f in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.1+0x5473f)
#1 0x7fabc6ae8d77 in ASN1_STRING_set (/opt/quictls/lib/libcrypto.so.81.3+0xd4d77)
Thread T3 created by T0 here:
#0 0x7fabc6f84bba in pthread_create (/usr/lib/x86_64-linux-gnu/libasan.so.1+0x23bba)
#1 0xc04f36 in setup_extra_threads src/thread.c:252
#2 0xa2761f in main src/haproxy.c:3917
#3 0x7fabc682fd09 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23d09)
Thread T2 created by T0 here:
#0 0x7fabc6f84bba in pthread_create (/usr/lib/x86_64-linux-gnu/libasan.so.1+0x23bba)
#1 0xc04f36 in setup_extra_threads src/thread.c:252
#2 0xa2761f in main src/haproxy.c:3917
#3 0x7fabc682fd09 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23d09)
SUMMARY: AddressSanitizer: double-free ??:0 __interceptor_free
==507223==ABORTING
Aborted
The OCSP CID stored in the impacted ckch data were freed but not reset to NULL,
leading to a subsequent double free.
Must be backported to 2.8.
Before this patch, it was not possible to use a list of patterns, map or a
list of acls, without an existing file. However, it could be handy to just
use an ID, with no file on the disk. It is pretty useful for everyone
managing dynamically these lists. It could also be handy to try to load a
list from a file if it exists without failing if not. This way, it could be
possible to make a cold start without any file (instead of empty file),
dynamically add and del patterns, dump the list to the file periodically to
reuse it on reload (via an external process).
In this patch, we uses some prefixes to be able to use virtual or optional
files.
The default case remains unchanged. regular files are used. A filename, with
no prefix, is used as reference, and it must exist on the disk. With the
prefix "file@", the same is performed. Internally this prefix is
skipped. Thus the same file, with ou without "file@" prefix, references the
same list of patterns.
To use a virtual map, "virt@" prefix must be used. No file is read, even if
the following name looks like a file. It is just an ID. The prefix is part
of ID and must always be used.
To use a optional file, ie a file that may or may not exist on a disk at
startup, "opt@" prefix must be used. If the file exists, its content is
loaded. But HAProxy doesn't complain if not. The prefix is not part of
ID. For a given file, optional files and regular files reference the same
list of patterns.
This patch should fix the issue #2202.
It is only a small API refactoring. The filename is no longer used when
pat_ref_read_from_file_smp() or pat_ref_read_from_file() functions are
called. The filename was already used to create the reference on the list of
patterns. Thus, we now directly use info from this reference.
tune.cache.zero-copy-forwarding parameter can now be used to enable or
disable the zero-copy fast-forwarding for the cache applet only. It is
enabled ('on') by default. It can be disabled by setting the parameter to
'off'.
It is now possible to directly forward data to the opposite side from the
cache applet. To do so, dedicated functions were added to fast-forward the
payload part of the cached objects. Of course headers and trailers are still
sent via the channel's buffer, using the HTX.
When an object is delivered from the cache, once the applet reaches the
HTX_CACHE_DATA state, it declares it can fast-forward data. From this point,
all data are directly transferred to the oppposite side.
We now save the body size of cached objets in the cache entry strucutre. In
addition, the cache applet tracks the body part already sent.
This will be mandatory to add support of endpoint-to-endpoint
fast-forwarding in the cache applet.
To be able to support endpoint-to-endpoint fast-forwarding (formerly called
mux-to-mux fast-forwarding), we cannot rely on data in the input channel to
compute amount of data the applet has produced.
The applet API is not really designed to know how many bytes are produced or
received at each call. Till now, it was not a problem because data always
passed through the channels. With E2E fast-frowarding, input data may be
immediately consumed. From the caller point of view (task_run_applet), there
is only the total field of the input channel that will change. So let's use
it now.
Till now, it was not possible to notify an producing applet is streaming
data. It means, it was not possible to set CF_STREAMER and CF_STREAMER_FLAGS
on the input channel of an applet streaming data.
While it is not a big deal for most of applets, it is interesting for the
cache. Because there are now dedicated functions to deal with these flags,
we can use them in task_run_applet() to set/unset these flags on the input
channel.
This patch relies on "MINOR: channel: Use dedicated functions to deal with
STREAMER flags".
For now, CF_STREAMER and CF_STREAMER_FAST flags are set in sc_conn_recv()
function. The logic is moved in dedicated functions.
First, channel_check_idletimer() function is now responsible to check the
channel's last read date against the idle timer value to be sure the
producer is still streaming data. Otherwise, it removes STREAMER flags.
Then, channel_check_xfer() function is responsible to check amount of data
transferred avec a receive, to eventually update STREAMER flags.
In sc_conn_recv(), we now use these functions.
peer_recv_msg() may return because the message is incomplete without
checking if a shutdown is pending for the SC. The function relies on
co_getblk() to detect shutdowns. However, the message length decoding may be
interrupted if the multi-bytes integer is incomplete. In this case, the SC
is not check for shutdowns.
When this happens, this leads to an appctx spinning loop.
This patch should fix the issue #2373. It must be backported to 2.8.
There is at least an bug for now in this part and it is still unstable. Thus
it is better to disable it for now by default. It can be enable by setting
tune.quic.zero-copy-fwd-send to 'on'.
tune.quic.zero-copy-fwd-send can now be used to enable or disable the
zero-copy fast-forwarding for the QUIC mux only, for sends. For now, there
is no option to disable it for receives because it is not supported yet.
It is enabled ('on') by default.
tune.h2.zero-copy-fwd-send can now be used to enable or disable the
zero-copy fast-forwarding for the H2 mux only, for sends. For now, there is
no option to disable it for receives because it is not supported yet.
It is enabled ('on') by default.
tune.h1.zero-copy-fwd-recv and tune.h1.zero-copy-fwd-send can now be used to
enable or disable the zero-copy fast-forwarding for the H1 mux only, for
receives or sends. Unlike the PT mux, there are 2 options here because
client and server sides can use difference muxes.
Both are enabled ('on') by default.
tune.pt.zero-copy-forwarding parameter can now be used to enable or disable
the zero-copy fast-forwarding for the PT mux only. It is enabled ('on') by
default. It can be disabled by setting the parameter to 'off'. In this case,
this disables receive and send side.
Zero-copy fast-forwading feature is a quite new and is a bit sensitive.
There is an option to disable it globally. However, all protocols have not
the same maturity. For instance, for the PT multiplexer, there is nothing
really new. The zero-copy fast-forwading is only another name for the kernel
splicing. However, for the QUIC/H3, it is pretty new, not really optimized
and it will evolved. And soon, the support will be added for the cache
applet.
In this context, it is usefull to be able to enable/disable zero-copy
fast-forwading per-protocol and applet. And when it is applicable, on sends
or receives separately. So, instead of having one flag to disable it
globally, there is now a dedicated bitfield, global.tune.no_zero_copy_fwd.
Building on gcc 4.4 reports "start may be used uninitialized". This is
a classical case of dependency between two variables where the compiler
lost track of their initialization and doesn't know that if one is not
set, the other is. By just moving the second test in the else clause
of the assignment both fixes it and makes the code more efficient, and
this can be simplified as a ternary operator.
It's probably not needed to backport this, unless anyone reports build
warnings with more recent compilers (intermediary optimization levels
such as -O1 can sometimes trigger such warnings).
It is possible that a server's addr family is temporarily set to AF_UNSPEC
even if we're certain to be in INET context (ipv4, ipv6).
Indeed, as soon as IP address resolving is involved, srv->addr family will
be set to AF_UNSPEC when the resolution fails (could happen at anytime).
However, _srv_event_hdl_prepare_inetaddr() wrongly assumed that it would
only be called with AF_INET or AF_INET6 families. Because of that, the
function will handle AF_UNSPEC address as an IPV6 address: not only
we could risk reading from an unititialized area, but we would then
propagate false information when publishing the event.
In this patch we make sure to properly handle the AF_UNSPEC family in
both the "prev" and the "next" part for SERVER_INETADDR event and that
every members are explicitly initialized.
This bug was introduced by 6fde37e046 ("MINOR: server/event_hdl: add
SERVER_INETADDR event"), no backport needed.
Previously an expression like:
path,word(2,/) -m found
always returned `true`.
Bug exists since the `word` converter exists. That is:
c9a0f6d023
The same bug was previously fixed for the `field` converter in commit
4381d26edc.
The fix should be backported to 1.6+.
REX and WEX date are already reported. But if the corresponding SC cannot
expire on read or write, "<NEVER>" is reported instead. The same is reported
if no expiration date is set. It is not really convenient because we cannot
distinguish the two cases.
So, now, for each SC, read and wirte timer (rto/wto) are also reported in
the dump, based on .lra/.fsb dates and the current I/O timeout. The SC I/O
timeout is also reported.
Since b40542000d ("MEDIUM: proxy: Warn about ambiguous use of named
defaults sections") we introduced a new error to prevent user from
having an ambiguous named default section in the config which is both
inherited explicitly using "from" and implicitly by another proxy due to
the default section being the last one defined.
However, despite the error message being presented as a warning the
err_code, the commit message and the documentation, it is actually
reported as a fatal error because ha_alert() was used in place of
ha_warning().
In this patch we make the code comply with the documentation and the
intended behavior by using ha_warning() to report the error message.
This should be backported up to 2.6.
On ubuntu 20.04 and 22.04 with gcc 9.4 and 11.4 respectively, we get
the following warning:
src/server.c: In function 'srv_update_addr_port':
src/server.c:4027:3: warning: 'new_port' may be used uninitialized in this function [-Wmaybe-uninitialized]
4027 | _srv_event_hdl_prepare_inetaddr(&cb_data.addr, &s->addr, s->svc_port,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4028 | ((ip_change) ? &sa : &s->addr),
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4029 | ((port_change) ? new_port : s->svc_port),
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4030 | 1);
| ~~
It's clearly wrong, port_change only changes from 0 to anything else
*after* assigning new_port. Let's just preset new_port to zero instead
of trying to play smart with the compiler.
It's useful to be able to recognize certain functions that are often
present in backtraces as they call lower level functions, and for this
they must not be static. Let's remove "static" in front of these
functions:
sc_notify, sc_conn_recv, sc_conn_send, sc_conn_process,
sc_applet_process, back_establish, stream_update_both_sc,
httpclient_applet_io_handler, httpclient_applet_init,
httpclient_applet_release
When an environment variable could not be matched by getenv(), the
current character to be parsed by parse_line() from <in> variable
is the trailing double quotes. If nothing is done in such a case,
this character is skipped by parse_line(), then the following spaces
are parsed as an empty argument.
To fix this, skip the double quotes character and the following spaces
to make <in> variable point to the next argument to be parsed.
Thank you to @sigint2 for having reported this issue in GH #2367.
Must be backported as far as 2.4.
preferred_address is a transport parameter specify by the server. It
specified both an IPv4 and IPv6 address. These addresses were defined as
plain array in <struct tp_preferred_address>.
Convert these adressees to use the common types in_addr/in6_addr. With
this change, dumping of preferred_address is extended. It now displays
the addresses using inet_ntop() and CID value.
quic_transport_param_dec_pref_addr() is responsible to decode
preferred_address from received transport parameter. There was two
issues with this function :
* address and port location as defined in RFC were inverted for both
IPv4 and IPv6 during decoding
* an invalid check was done to ensure decoded CID length corresponds to
remaining buffer size. It did not take into account the final field
for stateless reset token.
These issues were never encountered as only server can emit
preferred_address transport parameter, so the impact of this bug is
invisible.
This should be backported up to 2.6.
retrieve_qc_conn_from_cid() requires listener as argument whereas it is
unused. This is an artifact from the old architecture where CID trees
where stored on listener instances instead of globally. Remove it to
better reflect this change.
Mark the reverse HTTP feature as experimental. This will allow to adjust
if needed the configuration mechanism with future developments without
maintaining retro-compatibility.
Concretely, each config directives linked to it now requires to specify
first global expose-experimental-directives before. This is the case for
the following directives :
- rhttp@ prefix uses in bind and server lines
- nbconn bind keyword
- attach-srv tcp rule
Each documentation section refering to these keywords are updated to
highlight this new requirement.
Note that this commit has duplicated on several places the code from the
global function check_kw_experimental(). This is because the latter only
work with cfg_keyword type. This is not adapted with bind_kw or
action_kw types. This should be improve in a future patch.
A regression was introduced by commit 9431aa0bdf ("BUG/MEDIUM: cli: Don't
look for payload pattern on empty commands").
On empty commands (really empty or containing spaces and tabs), the number of
arguments set to 0. However we look for the payload pattern without checking
it. The result is an access at the index -1 in the argument array. It is of
course invalid.
To fix the issue, we just skip this part when there is no argument. Note that
the empty command is still sent to the worker.
This patch should solve the issue #2365. No backport needed.
"fc.id" and "bc.id" sample fetches can now be used to get, respectively, the
frontend or the backend stream ID. They rely on ->sctl() callback function
on the mux attached to the corresponding SC.
It means these sample fetches work only for connection, not applets, and
from the time a multiplexer is installed.
All muxes now implements the ->sctl() callback function and are able to
return the stream ID. For the PT multiplexer, it is always 0. For the H1
multiplexer it is the request count for the current H1 connection (added for
this purpose). The FCGI, H2 and QUIC muxes, the stream ID is returned.
The stream ID is returned as a signed 64 bits integer.
Instead of the generic MUX_, we now use MUX_CTL_ prefix for all mux_ctl_type
value. This will avoid any ambiguities with other enums, especially with a
new one that will be added to get information on mux streams.
"txn.conn_retries" can now be used to get the number of connection
retries. This value is only stable once the connection is fully
established. For HTTP sessions, L7-retries must also be passed.
It is now possible to retrieve the session terminate state, using
"txn.sess_term_state". The sample fetch returns the 2-character session
termation state.
Of course, the result of this sample fetch is volatile. It is subject to
change. It is also most of time useless because no termation state is set
except at the end. It should only be useful in http-after-response rule
sets. It may also be used to customize the logs using a log-format
directive.
This patch should fix the issue #2221.
When, for any reason and at any step, HAProxy decides to interrupt an HTTP
transaction, it returns a dedicated responses to the client (possibly empty)
and it sets the stream flags used to produce the session termination state.
These both operation were performed in any order, depending on the code
path. Most of time, the HAPRoxy response is produced first.
With this patch, the stream flags for the termination state are now set
first. This way, these flags become visible from http-after-reponse rule
sets. Only errors when the HAProxy response is generated are reported later.
It was possible get the status code in the HTTP response and the one
received from the server. Thanks to 'txn.status', it is now possible to get
the transaction status code. It is equivalent to '%ST' in log-format.
Most of time, it is the same than 'status', except if the status code of the
HTTP reply does not match the one used to interrupt the transaction. For
instance, an error file use mapped on 400 containing a 404.
The code returned by the "status" sample fetch is the one in the HTTP
response at the moment the sample is evaluated. It may be the status code in
the server response or the one of the HAProxy reply in case of error, deny,
redirect...
However, it could be handy to retrieve the status code returned by the
server, when a HTTP response was really received from it. It is the purpose
of the "server_status" sample fetch. The server status code itself is stored
in the HTTP txn.
Each received HTTP/3 frame is checked to ensure it is valid given the
type of stream and its current status. This was implemented via
h3_is_frame_valid().
Previously, no distinction was made for error code, so every failure
triggered a CONNECTION_CLOSE_APP with code H3_FRAME_UNEXPECTED. However,
this function also ensures that the first frame received on control
frame is of type SETTINGS. If not, the error code to use is
H3_MISSING_SETTINGS.
To support this, adjust the function prototype. Instead of returning a
boolean, 0 is returned for success, or a HTTP/3 error code. The function
is renamed h3_check_frame_valid() to reflects the return type change.
This is not considered as a bug as previously the connection was
correctly closed on a missing SETTINGS, albeit with a non conform error
code. It's not deemed as sufficient to be backported.
The condition for checking PUSH_PROMISE was not correctly interpreted
from the RFC. Initially, it rejects such a frame for every stream
initiated from client side.
In fact, the RFC indicates that PUSH_PROMISE are never sent by a client.
Thus, it can be rejected in any case until HTTP/3 will be implemented on
the backend side.
This should be backported up to 2.6.
HTTP/3 trailers encoding was never working as intended. It's because
h3_trailers_to_htx() manipulate a newly allocated buffer instead of the
already existing channel one. Thus, HTX message handled by the stream
was incomplete as it lacked trailers and EOM.
Fix this by reusing the already allocated channel buffer in
h3_trailers_to_htx().
This bug was detected by simulating TRAILERS emission which generate
CL--- state due to missing request side termination signal. Its impact
is deemed as minimal as trailers are pretty infrequent for now in
HTTP/3.
This must be backported up to 2.7.
When the producer negociate with the QUIC mux to perform a zero-copy
fast-forward, data in the input buffer are first transferred in the H3
buffer. However, after the transfer, if the input buffer is not empty, the
data fast-forwarding must be stopped. In this case, qmux_nego_ff() must
return 0.
No backport needed.
A previous fix was pushed for that (13fb7170be "BUG/MEDIUM: master/cli: Pin
the master CLI on the first thread of the group 1" ). Unfortunately, instead
of the master CLI, it is the sockpairs between the master and the workers
that were pinned to the first thread of the group 1. So the crash is still
there.
So, again, to fix the bug the master CLI is now pinned on the first thread
of the first group.
patch should fix the issue #2259 and must be backported to 2.8.
This bug was introduced in ead43fe4f2 ("MEDIUM: compression: Make it so
we can compress requests as well.")
2 cases where not properly handled, resulting in 2 possible NULL
dereferences leading to crashes in the function at runtime:
- when the backend didn't define any compression options so its comp
pointer is NULL (ie: if only the frontend defines some comp options)
- when both the frontend and the backend didn't set a compression algo
but at least one of the two defined some other comp options (comp
pointer set)
For the first case, we added the missing checks to make sure we don't
read ->comp pointer if it is NULL.
For the second case, we properly return from the function if no
compression algo is defined, because there is no default value that could
be used as a fallback.
This should be backported to 2.8.
In previous log backend implementation, we created a pseudo log target
for each declared log server, and we made the log target's address point
to the actual server address to save some time and prevent unecessary
copies.
But this was done without knowing that when FQDN is involved (more broadly
when dns/resolution is involved), the "port" part of server addr should
not be relied upon, and we should explicitly use ->svc_port for that
purpose.
With that in mind and thanks to the previous commit, some changes were
required: we allocate a dedicated addr within the log target when target
is in DGRAM mode. The addr is first initialized with known values and it
is then updated automatically by _srv_set_inetaddr() during runtime.
(the change is atomic so readers don't need to worry about it)
addr from server "log target" (INET/DGRAM mode) is made of the combination
of server's address (lacking the port part) and server's svc_port.
For inet families (IP4/IP6), it is expected that server's addr/port might
be updated at runtime from DNS, cli or lua for instance.
Such updates were performed under the server's lock.
Unfortunately, most readers such as backend.c or sink.c perform the read
without taking server's lock because they can't afford slowing down their
processing for a type of event which is normally rare. But this could
result in bad values being read for the server addr:svc_port tuple (ie:
during connection etablishment) as a result of concurrent updates from
external components, which can obviously cause some undesirable effects.
Instead of slowing the readers down, as we consider server's addr changes
are relatively rare, we take another approach and try to update the
addr:port atomically by performing changes under full thread isolation
when a new change is requested. The changes are performed by a dedicated
task which takes care of isolating the current thread and doesn't depend
on other threads (independent code path) to protect against dead locks.
As such, server's addr:port changes will now be performed atomically, but
they will not be processed instantly, they will be translated to events
that the dedicated task will pick up from time to time to apply the
pending changes.
This bug existed for a very long time and has never been reported so
far. It was discovered by reading the code during the implementation
of log backend ("mode log" in backends). As it involves changes in
sensitive areas as well as thread isolation, it is probably not
worth considering backporting it for now, unless it is proven that it
will help to solve bugs that are actually encountered in the field.
This patch depends on:
- 24da4d3 ("MINOR: tools: use const for read only pointers in ip{cmp,cpy}")
- c886fb5 ("MINOR: server/ip: centralize server ip updates")
- event_hdl API (which was first seen on 2.8) +
683b2ae ("MINOR: server/event_hdl: add SERVER_INETADDR event") +
BUG/MEDIUM: server/event_hdl: memory overrun in _srv_event_hdl_prepare_inetaddr() +
"MINOR: event_hdl: add global tunables"
Note that the patch may be reworked so that it doesn't depend on
event_hdl API for older versions, the approach would remain the same:
this would result in a larger patch due to the need to manually
implement a global queue of pending updates with its dedicated task
responsible for picking updates and comitting them. An alternative
approach could consist in per-server, lock-protected, temporary
addr:svc_port storage dedicated to "updaters" were only the most
recent values would be kept. The sync task would then use them as
source values to atomically update the addr:svc_port members that the
runtime readers are actually using.
The local variable "event_hdl_async_max_notif_at_once" which was
introduced with the event_hdl API was left as is but with a TODO note
telling that we should make it a global tunable.
Well, we're doing this now. To prepare for upcoming tunables related to
event_hdl API, we add a dedicated struct named event_hdl_tune which is
globally exposed through the event_hdl header file so that it may be used
from everywhere. The struct is automatically initialized in
event_hdl_init() according to defaults.h.
"event_hdl_async_max_notif_at_once" now becomes
"event_hdl_tune.max_events_at_once" with it's dedicated
configuation keyword: "tune.events.max-events-at-once".
We're also taking this opportunity to raise the default value from 10
to 100 since it's seems quite reasonnable given existing async event_hdl
users.
The documentation was updated accordingly.
As reported in GH #2358, #2359, #2360, #2361 and #2362: ipv6 address
handling may cause memory overrun due to struct in6_addr being handled
as sockaddr_in6 which is larger. Moreover, source variable wasn't properly
read from since the raw value was used as a pointer instead of pointing to
the actual variable's address.
This bug was introduced by 6fde37e046
("MINOR: server/event_hdl: add SERVER_INETADDR event")
Unfortunately for us, gcc didn't catch this and, this actually used to
"work" by accident since in6_addr struct is made of array so not passing
pointer explicitly still resolved to the proper starting address..
Hopefully this was caught by coverity so thanks to Ilya for that.
The fix is simple: we simply copy the whole in6_addr struct by accessing
it using a pointer and using the proper struct size for the copy.
Implements the customized payload pattern for the master CLI.
The pattern is stored in the stream in char pcli_payload_pat[8].
The principle is basically the same as the CLI one, it looks for '<<'
then stores what's between '<<' and '\n', and look for it to exit the
payload mode.
The CLI payload syntax has some limitation, it can't handle payloads
with empty lines, which is a common problem when uploading a PEM file
over the CLI.
This patch implements a way to customize the ending pattern of the CLI,
so we can't look for other things than empty lines.
A char cli_payload_pat[8] is used in the appctx to store the customized
pattern. The pattern can't be more than 7 characters and can still empty
to match an empty line.
The cli_io_handler() identifies the pattern and stores it, and
cli_parse_request() identifies the end of the payload.
If the customized pattern between "<<" and "\n" is more than 7
characters, it is not considered as a pattern.
This patch only implements the parser for the 'stats socket', another
patch is needed for the 'master CLI'.
When a stream is interrupted by the client before the full answer is
stored in the cache, we end up with an incomplete entry in the cache
that cannot be overwritten until it "naturally" expires. In such a case,
we call the cache filter's cache_store_strm_deinit callback without ever
calling cache_store_http_end which means that the 'complete' flag is
never set on the concerned cache_entry.
This patch adds a check on the 'complete' flag in the strm_deinit
callback and removes the entry from the cache if it is incomplete.
A way to exhibit this bug is to try to get the same "big" response on
multiple clients at the same time thanks to h2load for instance, and to
interrupt the client side before the answer can be fully stored in the
cache.
This patch can be backported up to 2.4 but it will need some rework
starting with branch 2.8 because of the latest cache changes.
As this function does only a few things with a not very well chosen name,
remove it and replace it by the its statements at the unique location it
is called.
Move qc_notify_send() from quic_tx.c to quic_conn.c. Note that it was already
exported from both quic_conn.h and quic_tx.h. Modify this latter header
to fix the duplication.
Add quic_retry.c new C file for the QUIC retry feature:
quic_saddr_cpy() moved from quic_tx.c,
quic_generate_retry_token_aad() moved from
quic_generate_retry_token() moved from
parse_retry_token() moved from
quic_retry_token_check() moved from
quic_retry_token_check() moved from
This function is in relation with the Initial packet number space which is
more linked to the QUIC TLS specifications. Let's move it to quic_tls.h
to be inlined.
Move quic_path struct from quic_conn-t.h to quic_cc-t.h and rename it to quic_cc_path.
Update the code consequently.
Also some inlined functions in relation with QUIC path to quic_cc.h
Move quic_pkt_type(), quic_saddr_cpy(), quic_write_uint32(), max_available_room(),
max_stream_data_size(), quic_packet_number_length(), quic_packet_number_encode()
and quic_compute_ack_delay_us() to quic_tx.c because only used in this file.
Also move quic_ack_delay_ms() and quic_read_uint32() to quic_tx.c because they
are used only in this file.
Move quic_rx_packet_refinc() and quic_rx_packet_refdec() to quic_rx.h header.
Move qc_el_rx_pkts(), qc_el_rx_pkts_del() and qc_list_qel_rx_pkts() to quic_tls.h
header.
Move quic_cstream struct definition from quic_conn-t.h to quic_tls-t.h.
Its pool is also moved from quic_conn module to quic_tls. Same thing for
quic_cstream_new() and quic_cstream_free().
Move quic_cid and quic_connnection_id from quic_conn-t.h to new quic_cid-t.h header.
Move defintions of quic_stateless_reset_token_init(), quic_derive_cid(),
new_quic_cid(), quic_get_cid_tid() and retrieve_qc_conn_from_cid() to quic_cid.c
new C file.
When a H2 stream is blocked during data fast-forwarding, we must take care
to remove H2_SF_NOTIFIED flag. This was only performed when data
fast-forward was attempted. However, if the H2 stream was blocked for any
reason, this flag was not removed. During our tests, we found it was
possible to infinitely block a connection because one of its streams was in
the send_list with the flag set. In this case, the stream was no longer
woken up to resume the sends, blocking all other streams.
No backport needed.
CONNECTION_CLOSE_APP encoding is broken, which prevents the sending of
every packet with such a frame. This bug was always present in quic
haproxy. However, it was slightly dissimulated by the previous code
which always initialized all frame members to zero, which was sufficient
to ensure CONNECTION_CLOSE_APP encoding was ok. The below patch changes
this behavior by removing this costly initialization step.
4cf784f38e
MINOR: quic: Avoid zeroing frame structures
Now, frames members must always be initialized individually given the
type of frame to used. However, for CONNECTION_CLOSE_APP this was not
done as qc_cc_build_frm() accessed the wrong union member refering to a
CONNECTION_CLOSE instead.
This bug was detected when trying to generate a HTTP/3 error. The
CONNECTION_CLOSE_APP frame encoding failed due to a non-initialized
<reason_phrase_len> which was too big. This was reported by the
following trace :
"frame building error : qc@0x5555561b86c0 idle_timer_task@0x5555561e5050 flags=0x86038058 CONNECTION_CLOSE_APP"
This must be backported up to 2.6. This is necessary even if above
commit is not as previous code is also buggy, albeit with a different
behavior.
It's the exact same as commit 0a7ab7067 ("OPTIM: mux-h2: don't allocate
more buffers per connections than streams"), but for the zero-copy case
this time. Previously it was only done on the regular snd_buf() path, but
this one is needed as well. A transfer on 16 parallel streams now consumes
half of the memory, and a single stream consumes much less.
An alternate approach would be worth investigating in the future, based
on the same principle as the CF_STREAMER_FAST at the higher level: in
short, by monitoring how many mux buffers we write at once before refilling
them, we would get an idea of how much is worth keeping in buffers max,
given that anything beyond would just waste memory. Some tests show that
a single buffer already seems almost as good, except for single-stream
transfers, which is why it's worth spending more time on this.
Add an optional argument for "-dt". This argument is interpreted as a
list of several trace statement separated by comma. For each statement,
a specific trace name can be specifed, or none to act on all sources.
Using double-colon separator, it is possible to add specifications on
the wanted level and verbosity.
Extract conversion of level string argument to integer value in a
dedicated internal function trace_parse_level(). This function is used
to for CLI trace parsing and will also be useful for "-dt" process
argument.
Add '-dt' haproxy process argument. This will automatically activate all
trace sources on stderr with the error level. This could be useful to
troubleshoot issues such as protocol violations.
<pattern> field pointer of pat_ref_elt structure has been by a
zero-length array. As such, it's now unneeded to check for NULL address
before printing it.
This type conversion was done in the following commit :
3ac9912837
OPTIM: pattern: save memory and time using ebst instead of ebis
The current patch is mandatory to fix the following GCC warning :
CC src/map.o
src/map.c: In function ‘cli_io_handler_map_lookup’:
src/map.c:549:54: error: the comparison will always evaluate as ‘true’ for the address of ‘pattern’ will never be NULL [-Werror=address]
549 | if (pat->ref && pat->ref->pattern)
|
No need to backport it unless the above commit is.
In the pat_ref_elt struct, the pattern string is stored outside of the
node element, using a pointer to an strdup(). Not only this needlessly
wastes at least 16-24 bytes per entry (8 for the pointer, 8-16 for the
allocator), it also makes the tree descent less efficient since both
the node and the string have to be visited for each layer (hence at least
two cache lines). Let's use an ebmb storage and place the pattern right
at the end of the pat_ref_elt, making it a variable-sized element instead.
The set-map test below jumps from 173 to 182 kreq/s/core, and the memory
usage drops from 356 MB to 324 MB:
http-request set-map(/dev/null) %[rand(1000000)] 1
This is even more visible with large maps: after loading 16M IP addresses
into a map, the process uses this amount of memory:
- 3.15 GB with haproxy-2.8
- 4.21 GB with haproxy-2.9-dev11
- 3.68 GB with this patch
So that's a net saving of 32 bytes per entry here, which cuts in half the
extra cost of the tree, and loading a large map takes about 20% less time.
It is not possible in H1, but in H2 (and probably H3) it is possible to have
trailers at the end of a message while a Content-Length was announced.
However, depending if the trailers are received with the last HTX DATA block
or the zero-copy forwarding is used or not, an processing error may be
triggered, leading to a 500-internal-error.
To fix the issue, when a content-length is announced and all the payload was
processed, we switch the message to H1_MSG_DONE state only if the
end-of-message was also reported (HTX_FL_EOM flag set). Otherwise, it is
switched to H1_MSG_TRAILERS state to be able to properly ignored the
trailers, if so.
The patch must be backported as far as 2.4. Be careful, this part was highly
refactored. The patch will have to be adapted to be backported.
The mworker mode never had a proper 'hard-stop' (-st) for the reload,
this is a mode which was commonly used with the daemon mode, but it was
never implemented in mworker mode.
This patch fixes the problem by implementing a "hard-reload" command
over the master CLI. It does the same as the "reload" command, but
instead of waiting for the connections to stop in the previous process,
it immediately quits the previous process after binding.
This patch removes the code which selects the SSL certificate in the
OpenSSL Client Hello callback, to use the ssl_sock_chose_sni_ctx()
function which does the same.
The bigger part of the function which remains is the extraction of the
servername, ciphers and sigalgs, because it's done manually by parsing
the TLS extensions.
This is not supposed to change anything functionally.
The certificate selection used in the WolfSSL cert_cb and in the OpenSSL
clienthello callback is the same, the function was duplicate to achieve
the same.
This patch move the selection code to a common function called
ssl_sock_chose_sni_ctx().
The servername string is still lowered in the callback, however the
search for the first dot in the string (wildp) is done in
ssl_sock_chose_sni_ctx()
The function uses the same certificate selection algorithm as before, it
needs to know if you need rsa or ecdsa, the bind_conf to achieve the
lookup, and the servername string.
This patch moves the code for WolSSL only.
PR https://github.com/wolfSSL/wolfssl/pull/6963 implements primitives to
extract ciphers and algorithm signatures.
It allows to chose a certificate depending on the sigals and
ciphers presented by the client (RSA or ECDSA).
Since WolfSSL does not implement the clienthello callback, the patch
uses the certificate callback (SSL_CTX_set_cert_cb())
The callback is inspired by our clienthello callback, however the
extraction of client ciphers and sigalgs is simpler,
wolfSSL_get_sigalg_info() and wolfSSL_get_ciphersuite_info() are used.
This is not enabled by default yet as the PR was not merged.
Maintain proper px->lbprm.tot_weight for log backends. server's weight is
considered as 1 as long as the server is usable.
This will allow the stats page to correctly display the proxy status since
the check currently relies on proxy's lbprm.tot_weight variable.
server_rules declared using "use-server" keyword within a proxy are not
supported inside a log backend (with "mode log" set), so we report a
warning to the user and reset the setting.
Take the px->server_rules freeing part out of free_proxy() and make it
a dedicated helper function so that it becomes possible to use it from
anywhere.
There are multiple places inside free_proxy() where we need to perform
the exact same operation: freeing a logformat list which includes freeing
every member.
To prevent code duplication, we add the free_logformat_list() function
that takes such list as parameter and does all the freeing job on its
own.
This reverts commit 5884e46ec8c8231e73c68e1bdd345c75c9af97a0 since we
cannot perform the test during parsing as the effective proxy mode is
not yet known.
This is a leftover from 1e0093a317 ("MINOR: backend/balance: "balance"
requires TCP or HTTP mode").
Indeed, we cannot perform the test during parsing as the effective proxy
type is not yet known. Moreover, thanks to b61147fd ("MEDIUM: log/balance:
merge tcp/http algo with log ones") we could potentially benefit from
this setting even in log mode, but for now it is ignored by all log
compatible load-balancing algorithms.
In this patch we fix the prototype for ipcmp() and ipcpy() functions so
that input pointers that are used exclusively for reads are used as const
pointers. This way, the compiler can safely assume that those variables
won't be altered by the function.
In this patch we add the support for a new SERVER event in the
event_hdl API.
SERVER_INETADDR is implemented as an advanced server event.
It is published each time the server's ip address or port is
about to change. (ie: from the cli, dns, lua...)
SERVER_INETADDR data is an event_hdl_cb_data_server_inetaddr struct
that provides additional info related to the server inet addr change,
but can be casted as a regular event_hdl_cb_data_server struct if
additional info is not needed.
When possible, we try send DATA frame without copying data. To do so, we
swap the input buffer with QCS tx buffer. It is only possible iff:
* There is only one HTX block of data at the beginning of the message
* Amount of data to send is equal to the size of the HTX data block
* The QCS tx buffer is empty
In this case, both buffers are swapped. The frame metadata are written at
the begining of the buffer, before data and where the HTX structure is
stored.
After giving it some thought, it could pretty well happen that other
protocols benefit from the sticky algorithm that some used to emulate
using a "stick-on int(0)" or things like this previously. So better
rename it to "sticky" right now instead of having to keep that "log-"
prefix forever. It's still limited to logs, of course, only the algo
is renamed in the config.
Thanks to previous commit, a reverse HTTP listener is able to distribute
actively opened connections accross its threads. To be able to exploit
this, allow "thread" keyword for such a listener.
An extra check is added to explicitely forbids a reverse bind to span
multiple thread groups. Without this, multiple listeners instances will
be created, each with its owned "nbconn" value. This may surprise users
so for now, better to deactivate this possibility.
Implement support for active HTTP reverse task migration on listener
threads. This operation is done each time a new reversable connection
will be instantiated. Instead of directly allocate the connection, a
lookup is done among all the listener threads.
A comparison is done to select the thread with the smallest number of
current reverse connection. If the thread found is different from the
current one, the connection allocation is delayed and the task
rescheduled on the chosen thread. The connection will then be created
and pinned on the new thread. This mechanisms allows to balance reverse
HTTP connections accross different threads.
Note that rhttp_set_affinity is still defined to disable thread
migration on accept. This is necessary as it's unsafe to move an
existing connection to another thread. However, active reverse task
migration should be sufficient to distribute connections accross several
threads. Better than that, this design allows to differentiate standard
frontend and reversable connections. The latest are designed to be
long-lived so it's useful to have their repartition solely based on
others reversed connections.
Add a new member <nb_rhttp_conns> in thread_ctx structure. Its purpose
is to count the current number of opened reverse HTTP connections
regarding from their listeners membership.
This patch will be useful to support multi-thread for active reverse
HTTP, in order to select the less loaded thread.
Note that despite access to <nb_rhttp_conns> are only done by the
current thread, atomic operations are used. This is because once
multi-thread support will be added, external threads will also retrieve
values from others.
Previous commit renames 'proto_reverse_connect' module to 'proto_rhttp'.
This commits follows this by replacing various custom prefix by 'rhttp_'
to make the code uniform.
Note that 'reverse_' prefix was kept in connection module. This is
because if a new reversable protocol not based on HTTP is implemented,
it may be necessary to reused the same connection function which are
protocol agnostic.
In the mux-to-mux fast-forwarding, when end-of-input is reached on the producer
side, the consumer side must not set the CO_SFL_MSG_MORE flag on send. It means
the H1C_F_CO_MSG_MORE flag must be removed from the H1 connection.
No backport needed.
In Github issue #2128, @jvincze84 explained the complexity of using
external checks in some advanced setups due to the systematic purge of
environment variables, and expressed the desire to preserve the
existing environment. During the discussion an agreement was found
around having an option to "external-check" to do that and that
solution was tested and confirmed to work by user @nyxi.
This patch just cleans this up, implements the option as
"preserve-env" and documents it. The default behavior does not change,
the environment is still purged, unless "preserve-env" is passed. The
choice of not using "import-env" instead was made so that we could
later use it to name specific variables that have to be imported
instead of keeping the whole environment.
The patch is simple enough that it could be backported if needed (and
was in fact tested on 2.6 first).
Here the idea is to collect components' versions and build options. The
main component is haproxy, but the API is made so that any sub-system
can easily add a component there (for example the detailed version of a
device detection lib, or some info about a lib loaded from Lua).
The elements are stored as a pointer to an array of structs and its count
so that it's sufficient to issue this in gdb to list them all at once:
print *post_mortem.components@post_mortem.nb_components
For now we collect name, version, toolchain, toolchain options, build
options and path. Maybe more could be useful in the future.
Having the libs and their addresses listed in the post_mortem struct
is also helpful. Sometimes it helps notice that one version is not the
expected one, e.g. due to some LD_LIBRARY_PATH. We don't emit it on
"show dev" however since that's already available via "show libs".
The last starting thread now copies the pthread ID and stack top of
each thread into post_mortem. That way it's as easy as issuing
"p post_mortem" in gdb to see all thread IDs and stack frames and more
easily map them to the threads met in a core.
Here we collect the original uid/gid/rlimits for FD and RAM since these
ones do affect behavior and are sometimes different from expected in
containers or when starting as a service.
When the x86 CPU flags show the "hypervisor" flag, we know we're running
inside QEMU, VMware or possibly other flavors of hypervisors. In this
case we'll report either "qemu", "vmware" or "yes" for other ones in
the "virt_techno" field, based on the DMI hardware vendor name,
otherwise "no" when the flag is not found.
The CPU model and type has significant impact on certain bugs, such
as contention issues caused by CPUs having split L3 caches, or stricter
memory models that exhibit some barrier issues. It's complicated though
because the info about the model depends on the arch. For example, x86
reports an SKU name while ARM rather reports the CPU core types, families
and versions for each CPU core. There, the SoC will sometimes be reported
in the device tree or DMI info instead. But we don't really care, it's
essentially useful to know if the code is running on an armv8.0 such as
A53, a 8.2 such as A55/A76/Neoverse etc. For MIPS the model appears to
generally be there, and in addition the SoC is often present in the
"system type" field before the first CPU, and the type of machine in the
"machine" field, to replace the missing DMI and DT, so they are also
collected. Note that only the first CPU is checked and reported, that's
expected to be vastly sufficient, since we're just trying to spot known
incompatibilities or issues.
If we detect we're running inside a container on Linux, let's check if
it seems to be docker. Docker usually creates a /.dockerenv file, which
is easy to check. It's uncertain whether it's always the case, but on the
few tested instances that was true, and we don't really care, what matters
is to place helpful debugging info for developers. When this file is
detected, we report "docker" instead of "yes" in the container techno.
Containers often cause significant trouble depending on how they're
set up, and they're not always trivial for their users to extract info
from. Here we're trying to detect if we're running inside a container
on Linux. There are plenty of approaches and none is perfectly clean
nor reliable, which makes sense since the goal is to remain transparent
enough.
One interesting approach is to rely on the observation that containers
generally do not expose most kernel threads, and that the very firsts
of them are extremely stable across all kernel versions: pid 2 was
called "keventd" in kernel 2.4, became "kthreadd" in kernel 2.6, and
has since not changed. This is true on all architectures tested, even
with highly stripped down kernels such as those found on 15 year-old
OpenWRT images. And this one doesn't appear inside containers. Thus
here we check if we find such a thread via /proc and whether it's
called keventd or kthreadd, to detect a container, and we set the
"cont_techno" variable to "yes" or "no" depending on what is found.
Let's extract some info about the system (board model, vendor etc),
this will indicate some hypervisors, some cloud instances or some
uncommon embedded boards etc. Typically, vmware, qemu and raspberry-pi
are visible here and can help during the troubleshooting session.
The goal here is to accumulate precious debugging information in a
struct that is easy to find in memory. It's aligned to 256-byte as
it also helps. We'll progressively add a lot of info about the
startup conditions, the operating system, the hardware and hypervisor
so as to limit the number of round trips between developers and users
during debugging sessions. Also, opening a core file with an hex editor
should often be sufficient to extract most of the info.
In addition, a new "show dev" command will show these information so
that they can be checked at runtime without having to wait for a crash
(e.g. if a limit is bad in a container, better know it early).
For now the struct only contains utsname that's fed at boot time.
When debugging a core, it's difficult to match a given gdb thread number
against an internal thread. Let's just store the pthread ID and the stack
pointer in each tinfo. This could help in the future by allowing to just
glance over them and pick the right one depending what info is found
first.
When a default-server directive is used in a defaults section, it's never
freed and the "defaults" proxy gets reset without freeing the fields from
that default-server. Normally there are no allocation there, except for
the config file location stored in srv->conf.file form an strdup() since
commit 9394a9444 ("REORG: server: move alert traces in parse_server")
that appeared in 2.4. In addition, if a "default-server" directive
appears multiple times in a defaults section, one more entry will be
leaked per call.
This commit addresses this by checking that we don't overwrite the file
upon multiple calls, and by clearing it when resetting the default proxy.
This should be backported to 2.4.
This bug could be reproduced with -dMfail and h2load generating plenty of connections.
A "show pools" CLI command showed that some memory in relation with RX packet pool
was never release. Furthermore, adding a RX packet counter to each connection
and a BUG_ON() in quic_conn_release() has proved that this unreleased memory
was in relation with RX packet which were not linked to a connection.
The responsible is quic_dgram_parse() which does not release some RX packet
memory before exiting after the connection thread affinity has changed.
Must be backported as far as 2.7.
This bug could be reproduced with -dMfail and detected added a counter of TX packet
to the QUIC connection. When released calling quic_conn_release() the connection
should have a null counter of TX packets. This was not always the case.
This could occur during the handshake step: a first packet was built, then another
one should have followed in the same datagram, but fail due to a memory allocation
issue. As the datagram length and first TX packet were not written in the TX
buffer, this latter could not really be purged by qc_purge_tx_buf() even if
called. This bug occured only when building coalesced packets in the same datagram.
To fix this, write the packet information (datagram length and first packet
address) in the TX buffer before purging it.
Must be backported as far as 2.6.
This bug could be reproduced with -dMfail and dectected by libasan as follows:
$ ASAN_OPTIONS=disable_coredump=0:unmap_shadow_on_exit=1:abort_on_error=f quic-freeze.cfg -dMfail -dMno-cache -dM0x55
=================================================================
==82989==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7ffc 0x560790cc4749 bp 0x7fff8e0e8e30 sp 0x7fff8e0e8e28
WRITE of size 8 at 0x7fff8e0ea338 thread T0
#0 0x560790cc4748 in qc_frm_free src/quic_frame.c:1222
#1 0x560790cc5260 in qc_release_frm src/quic_frame.c:1261
#2 0x560790d1de99 in qc_treat_acked_tx_frm src/quic_rx.c:312
#3 0x560790d1e708 in qc_ackrng_pkts src/quic_rx.c:370
#4 0x560790d22a1d in qc_parse_ack_frm src/quic_rx.c:694
#5 0x560790d25daa in qc_parse_pkt_frms src/quic_rx.c:988
#6 0x560790d2a509 in qc_treat_rx_pkts src/quic_rx.c:1373
#7 0x560790c72d45 in quic_conn_io_cb src/quic_conn.c:906
#8 0x560791207847 in run_tasks_from_lists src/task.c:596
#9 0x5607912095f0 in process_runnable_tasks src/task.c:876
#10 0x560791135564 in run_poll_loop src/haproxy.c:2966
#11 0x5607911363af in run_thread_poll_loop src/haproxy.c:3165
#12 0x56079113938c in main src/haproxy.c:3862
#13 0x7f92606edd09 in __libc_start_main ../csu/libc-start.c:308
#14 0x560790bcd529 in _start (/home/flecaille/src/haproxy/haproxy+0x
Address 0x7fff8e0ea338 is located in stack of thread T0 at offset 1032 i
#0 0x560790d29b52 in qc_treat_rx_pkts src/quic_rx.c:1341
This frame has 2 object(s):
[32, 48) 'ar' (line 1380)
[64, 1088) '_msg' (line 1368) <== Memory access at offset 1032 is inable
HINT: this may be a false positive if your program uses some custom stacnism, swapcontext or vfork
(longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-use-after-scope src/quic_frame.c:1222 i
Shadow bytes around the buggy address:
0x100071c15410: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x100071c15420: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x100071c15430: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x100071c15440: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x100071c15450: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
=>0x100071c15460: f8 f8 f8 f8 f8 f8 f8[f8]f8 f8 f8 f8 f8 f8 f3 f3
0x100071c15470: f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 00 00
0x100071c15480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x100071c15490: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x100071c154a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x100071c154b0: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 04 f3 f3 f3
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==82989==ABORTING
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
Aborted (core dumped)
Note that a coredump could not always be produced with all compilers. This was
always the case with clang 11.
When allocating frames to be retransmitted from qc_dgrams_retransmit(), if they
could not be sent for any reason, they could remain attached to a local list to
qc_dgrams_retransmit() and trigger a crash with libasan when releasing the
original frames they were duplicated from.
To fix this, always release the frames which could not be sent during
retransmissions calling qc_free_frm_list() where needed.
Must be backported as far as 2.6.
This is really boring to not know why some retransmissions could not be done
from qc_prep_hpkts() which allocates frames, prepare packets and send them.
Especially to not know about if frames are not remaining allocated and
attached to list on the stack. This patch already helped in diagnosing
such an issue during "-dMfail" tests.
Building without threads emits two warnings because the proxy pointer
is no longer used (only serves for the lock) since 2.9 commit 9a74a6cb1
("MAJOR: log: introduce log backends"). No backport is needed.
On CONNECTION_CLOSE reception/emission, QUIC connections enter CLOSING
state. At this stage, only CONNECTION_CLOSE can be reemitted and all
other exchanges are stopped.
Previously, on haproxy process stopping, if all QUIC connections were in
CLOSING state, they were released before their closing timer expiration
to not block the process shutdown. However, since a recent commit, the
closing timer has been shorten to a more reasonable delay. It is now
consider viable to respect connections closing state even on process
shutdown. As such, stopping specific code in QUIC connections idle timer
task was removed.
A specific function quic_handle_stopping() was implemented to notify
QUIC connections on shutdown from main() function. It should have been
deleted along the removal in QUIC idle timer task. This patch just does
this.
The connections are flagged as "to be killed" asap when the peer has left
(detected by sendto() "Connection refused" errno) by qc_kill_conn(). This
function has to wakeup the idle timer task to release the connection (and the idle
timer and the idle timer task itself). Then if in the meantime the connection
was flagged as having to process some retransmissions, some packet could lead
to sendto() errors again with a call to qc_kill_conn(), this time with a released
idle timer task.
This bug could be detected by libasan as follows:
.AddressSanitizer:DEADLYSIGNAL
=================================================================
==21018==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x 560b5d898717 bp 0x7f9aaac30000 sp 0x7f9aaac2ff80 T3)
==21018==The signal is caused by a READ memory access.
==21018==Hint: address points to the zero page.
. #0 0x560b5d898717 in _task_wakeup include/haproxy/task.h:209
#1 0x560b5d8a563c in qc_kill_conn src/quic_conn.c:171
#2 0x560b5d97f832 in qc_send_ppkts src/quic_tx.c:636
#3 0x560b5d981b53 in qc_send_app_pkts src/quic_tx.c:876
#4 0x560b5d987122 in qc_send_app_probing src/quic_tx.c:910
#5 0x560b5d987122 in qc_dgrams_retransmit src/quic_tx.c:1397
#6 0x560b5d8ab250 in quic_conn_app_io_cb src/quic_conn.c:712
#7 0x560b5de41593 in run_tasks_from_lists src/task.c:596
#8 0x560b5de4333c in process_runnable_tasks src/task.c:876
#9 0x560b5dd6f2b0 in run_poll_loop src/haproxy.c:2966
#10 0x560b5dd700fb in run_thread_poll_loop src/haproxy.c:3165
#11 0x7f9ab9188ea6 in start_thread nptl/pthread_create.c:477
#12 0x7f9ab90a8a2e in __clone (/lib/x86_64-linux-gnu/libc.so.6+0xfba2e)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV include/haproxy/task.h:209 in _task_wakeup
Thread T3 created by T0 here:
#0 0x7f9ab97ac2a2 in __interceptor_pthread_create ../../../../src/libsaniti zer/asan/asan_interceptors.cpp:214
#1 0x560b5df4f3ef in setup_extra_threads src/thread.c:252 o
#2 0x560b5dd730c7 in main src/haproxy.c:3856
#3 0x7f9ab8fd0d09 in __libc_start_main ../csu/libc-start.c:308 i
==21018==ABORTING
AddressSanitizer:DEADLYSIGNAL
Aborted (core dumped)
To fix, simply reset the connection flag QUIC_FL_CONN_RETRANS_NEEDED to cancel
the retransmission when qc_kill_conn is called.
Note that this new bug arrived with this fix which is correct and flagged as to be
backported as far as 2.6.
BUG/MINOR: quic: idle timer task requeued in the past
Must be backported as far as 2.6.
A quic_conn is instantiated and tied on the first thread which has
received the first INITIAL packet. After handshake completion,
listener_accept() is called. For each quic_conn, a new thread is
selected among the least loaded ones Note that this occurs earlier if
handling 0-RTT data.
This thread connection migration is done in two steps :
* inside listener_accept(), on the origin thread, quic_conn
tasks/tasklet are killed. After this, no quic_conn related processing
will occur on this thread. The connection is flagged with
QUIC_FL_CONN_AFFINITY_CHANGED.
* as soon as the first quic_conn related processing occurs on the new
thread, the migration is finalized. This allows to allocate the new
tasks/tasklet directly on the destination thread.
This last step on the new thread must be done prior to other quic_conn
access. There is two events which may trigger it :
* a packet is received on the new thread. In this case,
qc_finalize_affinity_rebind() is called from quic_dgram_parse().
* the recently accepted connection is popped from accept_queue_ring via
accept_queue_process(). This will called session_accept_fd() as
listener.bind_conf.accept callback. This instantiates a new session
and start connection stack via conn_xprt_start(), which itself calls
qc_xprt_start() where qc_finalize_affinity_rebind() is used.
A condition was recently found which could cause a closing to be used
with qc_finalize_affinity_rebind() which is forbidden with a BUG_ON().
This lat step was not compatible with layer 4 rule such as "tcp-request
connection reject" which closes the connection early. In this case, most
of the body of session_accept_fd() is skipped, including
qc_xprt_start(), so thread migration is not finalized. At the end of the
function, conn_xprt_close() is then called which flags the connection as
CLOSING.
If a datagram is received for this connection before it is released,
this will call qc_finalize_affinity_rebind() which triggers its BUG_ON()
to prevent thread migration for CLOSING quic_conn.
FATAL: bug condition "qc->flags & ((1U << 29)|(1U << 30))" matched at src/quic_conn.c:2036
Thread 3 "haproxy" received signal SIGILL, Illegal instruction.
[Switching to Thread 0x7ffff794f700 (LWP 2973030)]
0x00005555556221f3 in qc_finalize_affinity_rebind (qc=0x7ffff002d060) at src/quic_conn.c:2036
2036 BUG_ON(qc->flags & (QUIC_FL_CONN_CLOSING|QUIC_FL_CONN_DRAINING));
(gdb) bt
#0 0x00005555556221f3 in qc_finalize_affinity_rebind (qc=0x7ffff002d060) at src/quic_conn.c:2036
#1 0x0000555555682463 in quic_dgram_parse (dgram=0x7fff5003ef10, from_qc=0x0, li=0x555555f38670) at src/quic_rx.c:2602
#2 0x0000555555651aae in quic_lstnr_dghdlr (t=0x555555fc4440, ctx=0x555555fc3f78, state=32832) at src/quic_sock.c:189
#3 0x00005555558c9393 in run_tasks_from_lists (budgets=0x7ffff7944c90) at src/task.c:596
#4 0x00005555558c9e8e in process_runnable_tasks () at src/task.c:876
#5 0x000055555586b7b2 in run_poll_loop () at src/haproxy.c:2966
#6 0x000055555586be87 in run_thread_poll_loop (data=0x555555d3d340 <ha_thread_info+64>) at src/haproxy.c:3165
#7 0x00007ffff7b59609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8 0x00007ffff7a7e133 in clone () from /lib/x86_64-linux-gnu/libc.so.6
To fix this issue, ensure quic_conn migration is completed earlier
inside session_accept_fd(), before any tcp rules processing. This is
done by moving qc_finalize_affinity_rebind() invocation from
qc_xprt_start() to qc_conn_init().
This must be backported up to 2.7.
pre-c99 compilers will fail to build the cache since commit 48f81ec09
("MAJOR: cache: Delay cache entry delete in reserve_hot function") due
to an int declaration in the for loop. No backport is needed.
In 2.3, we started to get a cleaner socket unbinding mechanism with
commit f58b8db47 ("MEDIUM: receivers: add an rx_unbind() method in
the protocols"). This mechanism rightfully refrains from unbinding
when sockets are expected to be transferrable to another worker via
"expose-fd listeners", but this is not compatible with ABNS sockets,
which do not support reuseport, unbinding nor being renamed: in short
they will always prevent a new process from binding.
It turns out that this is not much visible because by pure accident,
GTUNE_SOCKET_TRANSFER is only set in the code dealing with master mode
and deamons, so it's never set in foreground mode nor in tests even if
present on the stats socket. However with master mode, it is now always
set even when not present on the stats socket, and will always conflict.
The only reasonable approach seems to consist in marking these abns
sockets as non-suspendable so that the generic sock_unbind() code can
decide to just unbind them regardless of GTUNE_SOCKET_TRANSFER.
This should carefully be backported as far as 2.4.
This bug was forbidding the GTUNE_SOCKET_TRANSFER option to be set
when haproxy is neither in daemon mode nor in mworker mode. So it
basically only impacts the foreground mode.
The fix moves the code outside the 'if (global.mode & (MODE_DAEMON |
MODE_MWORKER | MODE_MWORKER_WAIT))' condition.
Bug was introduced with 7f80eb23 ("MEDIUM: proxy: zombify proxies only
when the expose-fd socket is bound").
Must be backported in every stable version.
We start implementing some postparsing compatibility checks for log
backends.
Here we report a warning if user tries to use tcp-{request,response} rules
with log backend, and we properly ignore such rules when inherited from
defaults section.
add proxy_cfg_ensure_no_log() function (similar to
proxy_cfg_ensure_no_http()) to ensure at the end of proxy parsing that
no log exclusive options are found if the proxy is not in log mode.
"log-balance" directive was recently introduced to configure the
balancing algorithm to use when in a log backend. However, it is
confusing and it causes issues when used in default section.
In this patch, we take another approach: first we remove the
"log-balance" directive, and instead we rely on existing "balance"
directive to configure log load balancing in log backend.
Some algorithms such as roundrobin can be used as-is in a log backend,
and for log-only algorithms, they are implemented as "log-$name" inside
the "backend" directive.
The documentation was updated accordingly.
In 1b8e68e ("MEDIUM: stick-table: Stop handling stick-tables as proxies.")
we forgot to free the table pointer which is now dynamically allocated.
Let's take this opportunity to also fix a missing free in the table itself
(the table expire task wasn't properly destroyed)
This patch depends on:
- "MINOR: stktable: add sktable_deinit function"
It should be backported in every stable versions.
This one reports streams considered as "suspicious", i.e. those with
no expiration dates or dates in the past, or those without a front
endpoint. More criteria could be added in the future.
It's often needed to be able to refine "show sess" when debugging, and
very often a first glance at old streams is performed, but that's a
difficult task in large dumps, and it takes lots of resources to dump
everything.
This commit adds "older <age>" to "show sess" in order to specify the
minimum age of streams that will be dumped. This should simplify the
identification of blocked ones.
Since 2.4-dev2 with commit 15e525f49 ("MINOR: stream: Don't retrieve
anymore timing info from the mux csinfo"), we don't replace the
tv_accept (now accept_ts) anymore with the current request's, so that
it properly reflects the session's accept date and not the request's
date. However, since then we failed to update "show sess" to make use
of the request's timestamp instead of the session's timestamp, resulting
in fantasist values in the "age" field of "show sess" for the task.
Indeed, the session's age is displayed instead of the stream's, which
leads to great confusion when debugging, particularly when it comes to
multiplexed inter-proxy connections which are kept up forever.
Let's fix this now. This must be backported as far as 2.4. However,
for 2.7 and older, the field was named tv_request and was a timeval.
If less connections than threads are established on a reverse-http gateway
and these servers have a non-nul pool-min-conn, then conn_backend_get()
will refrain from picking available connections from other threads. But
this makes no sense for protocols for which there is no ->connect(),
since there's no way the current thread will manage to establish its own
connection. For such situations we should always accept to use another
thread's connection. That's precisely what this patch does.
A dummy connect() function previously had to be installed for the log
server so that a reverse-http address could be referenced on a "server"
line, but after the recent rework of the server line parsing, this is
no longer needed, and this is actually annoying as it makes one believe
there is a way to connect outside, which is not true. Let's now get rid
of this function.
This is the equivalent of the previous "BUG/MEDIUM: mux-h1: fail earlier
on malloc in takeover()".
Connection takeover was implemented for fcgi in 2.2 by commit a41bb0b6c
("MEDIUM: mux_fcgi: Implement the takeover() method."). It does have one
corner case related to memory allocation failure: in case the task or
tasklet allocation fails, the connection gets released synchronously.
Unfortunately the situation is bad there, because the lower layers are
already switched to the new thread while the tasklet is either NULL or
still the old one, and calling fcgi_release() will also result in
touching the thread-local list of buffer waiters, calling unsubscribe(),
There are even code paths where the thread will try to grab the lock of
its own idle conns list, believing the connection is there while it has
no useful effect. However, if the owner thread was doing the same at the
same moment, and ended up trying to pick from the current thread (which
could happen if picking a connection for a different name), the two
could even deadlock.
No tests were made to try to reproduce the problem, but the description
above is sufficient to see that nothing can guarantee against it.
This patch takes a simple but radically different approach. Instead of
starting to migrate the connection before risking to face allocation
failures, it first pre-allocates a new task and tasklet, then assigns
them to the connection if the migration succeeds, otherwise it just
frees them. This way it's no longer needed to manipulate the connection
until it's fully migrated, and as a bonus this means the connection will
continue to exist and the use-after-free condition is solved at the same
time.
This should be backported to 2.2. Thanks to Fred for the initial analysis
of the problem!
This is the h1 equivalent of previous "BUG/MEDIUM: mux-h2: fail earlier
on malloc in takeover()".
Connection takeover was implemented for H1 in 2.2 by commit f12ca9f8f1
("MEDIUM: mux_h1: Implement the takeover() method."). It does have one
corner case related to memory allocation failure: in case the task or
tasklet allocation fails, the connection gets released synchronously.
Unfortunately the situation is bad there, because the lower layers are
already switched to the new thread while the tasklet is either NULL or
still the old one, and calling h1_release() will call some unsubscribe
and and possibly other things whose safety is not guaranteed (and the
ambiguity here alone is sufficient to be careful). There are even code
paths where the thread will try to grab the lock of its own idle conns
list, believing the connection is there while it has no useful effect.
However, if the owner thread was doing the same at the same moment, and
ended up trying to pick from the current thread (which could happen if
picking a connection for a different name), the two could even deadlock.
Contrary to mux-h2, a few tests were not sufficient to try to crash the
process, but there's nothing that indicates it couldn't happen based on
the description above.
This patch takes a simple but radically different approach. Instead of
starting to migrate the connection before risking to face allocation
failures, it first pre-allocates a new task and tasklet, then assigns
them to the connection if the migration succeeds, otherwise it just
frees them. This way it's no longer needed to manipulate the connection
until it's fully migrated, and as a bonus this means the connection will
continue to exist and the use-after-free condition is solved at the same
time.
This should be backported to 2.2. Thanks to Fred for the initial analysis
of the problem!
Connection takeover was implemented for H2 in 2.2 by commit cd4159f03
("MEDIUM: mux_h2: Implement the takeover() method."). It does have one
corner case related to memory allocation failure: in case the task or
tasklet allocation fails, the connection gets released synchronously.
Unfortunately the situation is bad there, because the lower layers are
already switched to the new thread while the tasklet is either NULL or
still the old one, and calling h2_release() will also result in
h2_process() and h2_process_demux() that may process any possibly
pending frames. Even the session remains the old one on the old thread,
so that some sess_log() that are called when facing certain demux errors
will be associated with the previous thread, possibly accessing a number
of elements belonging to another thread. There are even code paths where
the thread will try to grab the lock of its own idle conns list, believing
the connection is there while it has no useful effect. However, if the
owner thread was doing the same at the same moment, and ended up trying
to pick from the current thread (which could happen if picking a connection
for a different name), the two could even deadlock.
The risk is extremely low, but Fred managed to reproduce use-after-free
errors in conn_backend_get() after a takeover() failed by playing with
-dMfail, indicating that h2_release() had been successfully called. In
practise it's sufficient to have h2 on the server side with reuse-always
and to inject lots of request on it with -dMfail.
This patch takes a simple but radically different approach. Instead of
starting to migrate the connection before risking to face allocation
failures, it first pre-allocates a new task and tasklet, then assigns
them to the connection if the migration succeeds, otherwise it just
frees them. This way it's no longer needed to manipulate the connection
until it's fully migrated, and as a bonus this means the connection will
continue to exist and the use-after-free condition is solved at the same
time.
This should be backported to 2.2. Thanks to Fred for the initial analysis
of the problem!
There was still a totally outdated comment speaking about issues
affecting solaris on 1.1.8pre4 (April 2002, 21 year-old)! This
proves that comments in headers are never read, so let's take this
opportunity for also removing the outdated one recommending to read
the "updated" RFC7230.
Adapt session_accept_fd() called on accept() to set the handshake timeout from
"hanshake-timeout" setting if set by configuration. If not set, continue to use
the "client" timeout setting.
This bug arrived with this commit:
MINOR: quic: Avoid zeroing frame structures
Before this latter, the CONNECTION_CLOSE was zeroed, especially the "reason phrase
length".
Restablish this behavior.
No need to backport.
This date is shared between the idle timer and hanshake timeout. So, it should be
useful to dump the expiration date of the idle timer task itself, in place of the
idle timer expiration date. This way, the handshake timeout value will be visible
during the handshake from CLI "show quic full" command.
The idle timer task may be used to trigger the client handshake timeout.
The hanshake timeout expiration date (qc->hs_expire) is initialized when the
connection is allocated. Obviously, this timeout is taken into an account only
during the handshake by qc_idle_timer_do_rearm() whose job is to rearm the idle timer.
The idle timer expiration date could be initialized only one time, then
never updated until the hanshake completes. But this only works if the
handshake timeout is smaller than the idle timer task timeout. If the handshake
timeout is set greater than the idle timeout, this latter may expire before the
handshake timeout.
This patch may have an impact on the L1/C1 interop tests (with heavy packet loss
or corruption). This is why I guess some implementations with a hanshake timeout
support set a big timeout during this test. This is at least the case for ngtcp2
which sets a 180s hanshake timeout! haproxy will certainly have to proceed the
same way if it wants to have a chance to pass this test as before this handshake
timeout.
Add a new timeout for the handshake, on the frontend side only. Such a hanshake
will be typically used for TLS hanshakes during client connections to TLS/TCP or
QUIC frontends.
The shctx lock was changed from a SPINLOCK to a RWLOCK in commit ed35b94
"MEDIUM: cache: Switch shctx spinlock to rwlock and restrict its scope"
but a SPIN_INIT was left behind.
This patch does not need to be backported.
For applets and connection, when a send attempt is performed, we must be
sure to not report a send activity if there was no output data at all before
the attempt.
It is not important for the <fsb> date itself but for the <lra> date for
non-independent stream.
This patch must be backported to 2.8.
Some channel function are used to check if the channel's buffer is full, not
empty or if there are input data. However, functions used are not
HTX-aware. So it is not accurate and may prevent some actions to be
performed (However, not sure there are really issues). Because HTX-aware
versions now exist, use them instead.
This patch may be backported as far as 2.2. It relies on
* "MINOR: channel: Add functions to get info on buffers and deal with HTX streams"
* "MINOR: htx: Use a macro for overhead induced by HTX"
Since the HTX was introduced, the streamer detection is broken for HTX
streams because the HTX overhead was not counted in the test to set
CF_STREAMER and CF_STREAMER_FAST flags.
The consequence was that the consumer side was no longer able to send more
than tune.ssl.maxrecord at a time in SSL.
To fix the issue, we now count the HTX overhead of HTX streams to be able to
set CF_STREAMER/CF_STREAMER_FAST flags on a channel.
This patch relies on folloing commits:
* "MINOR: channel: Add functions to get info on buffers and deal with HTX streams"
* "MINOR: htx: Use a macro for overhead induced by HTX"
The series must be backported as far as 2.2.
The first-send-blocked date was originally designed to save the date of the
first send of a series where some data remain blocked. It was relaxed
recently (3083fd90e "BUG/MEDIUM: stconn: Report a send activity everytime
data were sent") to save the date of the first full blocked send. However,
it is not accurrate.
When all data are sent, the fsb value must be reset to TICK_ETERNITY. When
nothing is sent and if it is not already set, it must be set. But when data
are partially sent, the value must be updated and not reset. Otherwise the
write timeout may be ignored because fsb date is never set.
So, changes brought by the patch above are reverted and
sc_ep_report_blocked_send() was changed to know if some data were sent or
not. This way we are able to update fsb value.
l
This patch must be backported to 2.8.
Some functions are built on the fact that the cache lock must be already
taken by the caller. This patch adds this information in the functions'
descriptions.
This global variable was used to avoid using locks on shared_contexts in
the unlikely case of nbthread==1. Since the locks do not do anything
when USE_THREAD is not defined, it will be more beneficial to simply
remove this variable and the systematic test on its value in the shared
context locking functions.
A reference counter on the cache_entry was added in a previous commit.
Its value is atomically increased and decreased via the retain_entry and
release_entry functions.
This is needed because of the latest cache and shared_context
modifications that introduced two separate locks instead of the
preexisting single shctx_lock one.
With the new logic, we have two main blocks competing for the two locks:
- the one in the http_action_req_cache_use that performs a lookup in the
cache tree (locked by the cache lock) and then tries to remove the
corresponding blocks from the shared_context's 'avail' list until the
response is sent to the client by the cache applet,
- the shctx_row_reserve_hot that traverses the 'avail' list and gives
them back to the caller, while removing previous row heads from the
cache tree
Those two blocks require the two locks but one of them would take the
cache lock first, and the other one the shctx_lock first, which would
end in a deadlock without the current patch.
The way this conflict is resolved in this patch is by ensuring that at
least one of those uses works without taking the two locks at the same
time.
The solution found was to keep taking the two locks in the cache_use
case. We first lock the cache to lookup for an entry and we then take
the shctx lock as well to detach the corresponding blocks from the
'avail' list. The subtlety is that between the cache lookup and the
actual locking of the shctx, another thread might have called the
reserve_hot function in which we only take the shctx lock.
In this function we traverse the 'avail' list to remove blocks that are
then given to the caller. If one of those blocks corresponds to a
previous row head, we call the 'free_blocks' callback that used to
delete the cache entry from the tree.
We now avoid deleting directly the cache entries in reserve_hot and we
rather set the cache entries 'complete' param to 0 so that no other
thread tries to work with this entry. This way, when we release the
shctx lock in reserve_hot, the first thread that had performed the cache
lookup and had found an entry that we just gave to another thread will
see that the 'complete' field is 0 and it won't try to work with this
response.
The actual removal of entries from the cache tree will now be performed
in the new 'reserve_finish' callback called at the end of the
shctx_row_reserve_hot function. It will iterate on all the row head that
were inserted in a dedicated list in the 'free_block' callback and
perform the actual delete.
This patch adds a reserve_finish callback that can be defined by the
subsystems that require a shared_context. It is called at the end of
shctx_row_reserve_hot after the shared_context lock is released.
Descend the shctx_lock calls into the shctx_row_reserve_hot so that the
cases when we don't need to lock anything (enough space in the current
row or not enough space in the 'avail' list) do not take the lock at
all.
In sh_ssl_sess_new_cb the lock had to be descended into
sh_ssl_sess_store in order not to cover the shctx_row_reserve_hot call
anymore.
Add a reference counter on the cache_entry. Its value will be atomically
increased and decreased via the retain_entry and release_entry
functions.
The release_entry function has two distinct versions,
release_entry_locked and release_entry_unlocked that should be called
when the cache lock is already taken in write mode or not
(respectively). In the unlocked case the cache lock will only be taken
in write mode on the last reference of the entry (before calling
delete_entry). This allows to limit the amount of times when we need to
take the cache lock during a release operation.
Since a lock on the cache tree was added in the latest cache changes, we
do not need to use the shared_context's lock to lock more than pure
shared_context related data anymore. This already existing lock will now
only cover the 'avail' list from the shared_context. It can then be
changed to a rwlock instead of a spinlock because we might want to only
run through the avail list sometimes.
Apart form changing the type of the shctx lock, the main modification
introduced by this patch is to limit the amount of code covered by the
shctx lock. This lock does not need to cover any code strictly related
to the cache tree anymore.
After the latest changes in the cache/shared_context mechanism, the
cache and shared_context logic were decorrelated and in some unlikely
cases we might end up using the "show cache" command while some regular
cache processing is occurring (a response being stored in the cache for
instance). In such a case, because we used the same 'trash' buffer in
those two contexts, we could end up with the contents of a response in
the ouput of the "show cache" command.
This patch fixes this problem by allocating a dedicated trash for the
CLI command.
The "hot" list stored in a shared_context was used to keep a reference
to shared blocks that were currently being used and were thus removed
from the available list (so that they don't get reused for another cache
response). This 'hot' list does not ever need to be shared across
threads since every one of them only works on their current row.
The main need behind this 'hot' list was to detach the corresponding
blocks from the 'avail' list and to have a known list root when calling
list_for_each_entry_from in shctx_row_data_append (for instance).
Since we actually never need to iterate over all members of the 'hot'
list, we can remove it and replace the inc_hot/dec_hot logic by a
detach/reattach one.
When looking for a valid entry in the cache tree in
http_action_req_cache_use, we do not need to delete an expired entry at
once because even if an expired entry exists, since the request will be
forwarded to the server, then the expired entry will be overwritten when
the updated response is seen. We can then use a simpler rdlock during
cache_use operation.
Any lookup in the cache tree done through entry_exist or
secondary_entry_exist functions could end up deleting the corresponding
entry if it is expired which prevents from using a rdlock on code paths
that would just perform a lookup on the tree (in
http_action_req_cache_use for instance).
Adding a 'delete_expired' boolean as a parameter allows for "pure"
lookups and thus it will allow to perform operations on the tree that
simply require a rdlock instead of a "heavier" wrlock.
The "show cache" CLI command iterates over all the entries of the cache
tree and it used this opportunity to remove expired entries from the
cache. This behavior was completely undocumented and does not seem that
necessary. By removing it we can take the cache lock in read mode only
which limits the impact on the other threads.
Every use of the cache tree was covered by the shctx lock even when no
operations were performed on the shared_context lists (avail and hot).
This patch adds a dedicated RW lock for the cache so that blocks of code
that work on the cache tree only can use this lock instead of the
superseding shctx one. This is useful for operations during which the
concerned blocks are already in the hot list.
When the two locks need to be taken at the same time, in
http_action_req_cache_use and in shctx_row_reserve_hot, the shctx one
must be taken first.
A new parameter needed to be added to the shared_context's free_block
callback prototype so that cache_free_block can take the cache lock and
release it afterwards.
The shctx_row_reserve_hot relied on two loop levels in order to first
look for the first block of a preused row and then iterate on all the
blocks of this row to reserve them for the new row. This was not the
simplest nor the easiest to read way so this logic could be replaced by
a single iteration on the avail list members.
The two use cases of calling this function with or without a preexisting
"first" member were a bit cumbersome as well and were replaced by a more
straightforward approach.
Instead of iterating over all the elements of a given row when moving it
between the hot and available lists, we can make use of the last_reserved
pointer that already points to the last block of the list to perform the
move in O(1).
Ensure that the last_append pointer is always set to NULL on first block
of rows reserved by the subsystems using the shctx (cache for instance).
This pointer will be used directly in shctx_row_data_append instead of
the 'from' param which will simplify its uses.
A backend connection is inserted in server idle list via
srv_add_to_idle_list(). This function has several conditions which may
cause the connection to be rejected instead.
One of this condition is based on the current estimate count of needed
connections for the server. If the count of idle connections stored has
already reached this estimation, the new connection is rejected. This is
in opposition with the purpose of reverse HTTP. On active reverse,
haproxy can instantiate several connections to properly serve the future
traffic. However, the opposite passive haproxy will have only a low
estimate of needed connection and will reject most of them.
To fix this, simply check CO_FL_REVERSED connection flag on
srv_add_to_idle_list(). If set, the connection is inserted without
checking for estimate count. Note that all other conditions are not
impacted, so it's still possible to reject a connection, for example if
process FD limit is reached.
This commit relies on recent patch which change CO_FL_REVERSED flag for
connection after passive reverse.
On passive reverse, H2 mux is responsible to insert the connection in
the server idle list. This is done via srv_add_to_idle_list(). However,
this function may fail for various reason, such as FD usage limit
reached.
Handle properly this error case. H2 mux flags the connection on error
which will cause its release. Prior to this patch, the connection was
only released on server timeout.
This bug was found inspecting server curr_used_conns counter. Indeed, on
connection reverse, this counter is first incremented. It is decremented
just after on srv_add_to_idle_list() if insertion is validated. However,
if insertion is rejected, the connection was not released which cause
curr_used_conns to remains positive. This has the major downside to
break the reusing of idle connection on rhttp causing spurrious 503
errors.
No need to backport.
Change the flags used for reversed connection :
* CO_FL_REVERSED is now put after reversal for passive connect. For
active connect, it is delayed when accept is completed after reversal.
* CO_FL_ACT_REVERSING replace the old CO_FL_REVERSED. It is put only for
active connect on reversal and removes once accept is done.
This allows to identify a connection as reversed during its whole
lifetime. This should be useful to extend reverse connect.
The commit 5ff7d2276 ("BUG/MEDIUM: stream: Properly handle abortonclose when set
on backend only") introduced a regression. Not all multiplexer implement the
.ctl() callback function. Thus we must be sure this callback function is defined
first to call it.
This patch should fix a crash reported by Tristan in the issue #2095. It must be
backported as far as 2.2, with the commit above.
Since 2.7 and the mcli_reload_bind_conf (56f73b21a5), upon a reload
failure because of a bind error, the mcli_reload_bind_conf go through a
sock_unbind((). This is not supposed to do anything when a listener is
RX_F_INHERITED in the master, but unfortunately this is done too early
and provokes an exit of the master.
We already suspected in the past that setting the 'master' variable this
late could have negative impact.
The fix sets the master variable earlier before the bind.
This must be backported at least to 2.7. This could be backported
earlier but better wait any feedbacks on the fix.
Seeing counters in "show profiling" is not always very helpful without
an indication of how long the analysis lasted nor if it's still active
or not. Let's add a pair of start/stop timers for tasks and memory so
that we can now indicate how long the measurements lasted and when they
ended (or 0 if still running).
Note that for tasks profiling set to "auto", the measurement is considered
enabled since it can automatically switch on and off on a per-thread
basis.
Since the 2.2 and the commit dedd30610 ("MEDIUM: h1: Don't wake the H1 tasklet
if we got the whole request."), we avoid to subscribe for reads if the H1
message is fully received. However, this broke the abortonclose option. To fix
the issue, a CO_RFL flag was added to instruct the mux it should still wait for
read events to properly handle read0. Only the H1 mux was concerned.
But since then, most of time, the option is only handled if it is set on the
frontend proxy because the request is fully received before selecting the
backend. If the backend is selected before the end of the request there is no
issue. But otherwise, because the backend is not known yet, we are unable to
properly handle the option and we miss to subscribe for reads.
Of course the option cannot be set on a frontend proxy. So concretly it means
the option is properly handled if it is enabled in the defaults section (if
common to frontend and backend) or a listen proxy, but it is ignored if it is
set on backend only.
Thanks to previous patches, we can now instruct the mux it should subscribe for
reads if not already done. We use this mechanism in process_stream() when the
connection is set up, ie when backend SC is set to SC_ST_REQ state.
This patch relies on following patches:
* MINOR: connection: Add a CTL flag to notify mux it should wait for reads again
* MEDIUM: mux-h1: Handle MUX_SUBS_RECV flag in h1_ctl() and susbscribe for reads
This patch should be the issue #2344. All the series must be backported as far
as 2.2.
The H1 mux now handle MUX_SUBS_RECV flag in h1_ctl(). If it is not already
subscribed for reads, it does so. This patch will be mandatory to properly
handle abortonclose option.
abortonclose option is a backend option, it should not be handle on frontend
side. Of course a frontend can also be a backend but the option should not
be handled too early because it is not necessarily the selected backend
(think about a listen proxy routing requests to another backend).
It is especially an issue when the abortonclose option is enabled in the
defaults section and disabled by the selected backend. Because in this case,
the option may still be enabled while it should not.
Thus, now we wait the backend connection was set up to handle the option. To
do so, we check the backend SC state. The option is ignored if it is in
ST_CS_INI state. For all other states, it means the backend was already
selected.
This patch could be backported as far as 2.2.
An annoying issue was met when testing the reverse-http mechanism, by
which failed connection attempts would apparently not be attempted again
when there was no connect timeout. It turned out to be more generalized
than the rhttp system, and actually affects all outgoing connections
relying on NPN or ALPN to choose the mux, on which no mux is installed
and for which the subscriber (ssl_sock) must be notified instead.
The problem appeared during 2.2-dev1 development. First, commit
062df2c23 ("MEDIUM: backend: move the connection finalization step to
back_handle_st_con()") broke the error reporting by testing CO_FL_ERROR
only under CO_FL_CONNECTED. While it still worked OK for cases where a
mux was present, it did not for this specific situation because no
single error path would be considered when no mux was present. Changing
the CO_FL_CONNECTED test to also include CO_FL_ERROR did work, until
a few commits later with 477902bd2 ("MEDIUM: connections: Get ride of
the xprt_done callback.") which removed the xprt_done callback that was
used to indicate success or failure of the transport layer setup, since,
as the commit explains, we can report this via the mux. What this last
commit says is true, except when there is no mux.
For this, however, the sock_conn_iocb() function (formerly conn_fd_handler)
is called for such errors, evaluates a number of conditions, none of which
is matched in this error condition case, since sock_conn_check() instantly
reports an error causing a jump to the leave label. There, the mux is
notified if installed, and the function returns. In other error condition
cases, readiness and activity are checked for both sides, the tasklets
woken up and the corresponding subscriber flags removed. This means that
a sane (and safe) approach would consist in just notifying the subscriber
in case of error, if such a subscriber still exists: if still there, it
means the event hasn't been caught earlier, then it's the right moment
to report it. And since this is done after conn_notify_mux(), it still
leaves all control to the mux once it's installed.
This commit should be progressively backported as far as 2.2 since it's
where the problem was introduced. It's important to clearly check the
error path in each function to make sure the fix still does what it's
supposed to.
This bug arrived with this commit:
MINOR: quic: Add a max window parameter to congestion control algorithms
The documentation was been modified with missing/wrong modifications in the code part.
The 'g' suffix must be accepted to parse value in gigabytes. And exctly 4g is
also accepted.
No need to backport.
Make all the congestion support the maximum congestion control window
set by configuration. There is nothing special to explain. For each
each algo, each time the window is incremented it is also bounded.
Add a new ->max_cwnd member to bind_conf struct to store the maximum
congestion control window value for each QUIC binding.
Modify the "quic-cc-algo" keyword parsing to add an optional parameter
to its value: the maximum congestion window value between parentheses
as follows:
ex: quic-cc-algo cubic(10m)
This value must be bounded, greater than 10k and smaller than 1g.
This bug arrived with this commit:
BUG/MINOR: quic: Useless use of non-contiguous buffer for in order CRYPTO data
Before this commit qc->cstream was tested before entering qc_treat_rx_crypto_frms().
This patch restablishes this behavior. Furthermore, it simplyfies
qc_ssl_provide_all_quic_data() which is a little bit ugly: the CRYPTO data
frame may be freed asap in the list_for_each_entry_safe() block after
having store its data pointer and length in local variables.
Also interrupt the CRYPTO data process as soon as qc_ssl_provide_quic_data()
or qc_treat_rx_crypto_frms() fail.
No need to be backported.
Since following commit, quic_conn closes its owned socket before
transition to quic_cc_conn for closing state. This allows to save FDs as
quic_cc_conn could use the listener socket for their I/O.
commit 150c0da889
MEDIUM: quic: release conn socket before using quic_cc_conn
This patch is incomplete as it removes initialization of <fd> member for
quic_cc_conn. Thus, if sending is done on closing state, <fd> value is
undefined which in most cases will result in a crash. Fix this by simply
initializing <fd> member with qc_init_fd() in qc_new_cc_conn().
This bug should fix recent issue from #2095. Thanks to Tristan for its
reporting and then testing of this patch.
No need to backport.
Half open counter is used to comptabilize QUIC connections waiting for
address validation. It was recently reworked to adjust its scope. With
each decrement operation, a BUG_ON() was added to ensure the counter
never wraps.
This BUG_ON() could be triggered if an allocation fails for one of
quic_conn members in qc_new_conn(). This is because half open counter is
incremented at the end of qc_new_conn(). However, in case of alloc
failure, quic_conn_release() is called immediately to ensure the counter
is decremented if a connection is freed before peer address has been
validated.
To fix this, increment half open counter early in qc_new_conn() prior to
every quic_conn members allocations.
This issue was reproduced using -dMfail argument.
This issue has been introduced by
commit 278808915b
MINOR: quic: reduce half open counters scope
No need to backport.
A new counter was recently introduced to comptabilize the current number
of active QUIC handshakes. This counter is stored on the listener
instance.
This counter is incremented at the beginning of qc_new_conn() to check
if limit is not reached prior to quic_conn allocation. If quic_conn or
one of its inner member allocation fails, special care is taken to
decrement the counter as the connection instance is released. However,
it relies on <l> variable which is initialized too late to cover
pool_head_quic_conn allocation failure.
To fix this, simply initialize <l> at the beginning of qc_new_conn().
This issue was reproduced using -dMfail argument.
This issue was introduced by the following commit
commit 3df6a60113
MEDIUM: quic: limit handshake per listener
No need to backport.
This bug was introduced with 969e212 ("MINOR: log: add dup_logsrv() helper
function")
When duplicating an existing log entry, we must take care to inherit from
its original ->ref if it is set, because not doing so would make 28ac0999
("MINOR: log: Keep the ref when a log server is copied to avoid duplicate entries")
ineffective given that global log directives will lose their original
reference when duplicated resursively (at least twice), which is what
happens when global log directives are first inherited to defaults which
are then inherited to a regular proxy at the end of the chain.
This can be easily reproduced using the following configuration:
|global
| log stdout format raw local0
|
|defaults
| log global
|
|frontend test
| log global
| ...
Logs from "test" proxy will be duplicated because test incorrectly
inherited from global "log" directives twice, which 28ac0999 would
normally detect and prevent.
No backport needed unless 969e212 gets backported.
When the bytes converter was improved to be able to use variables (915e48675
["MEDIUM: sample: Enhances converter "bytes" to take variable names as
arguments"]), the behavior of the sample slightly change. A failure is
reported if the given offset is bigger than the sample length. Before, a
empty binary sample was returned.
This patch fixes the converter to restore the original behavior. The
function was also refactored to properly handle failures by removing
SMP_F_MAY_CHANGE flag. Because the converter now handles variables, the
conversion to an integer may fail. In this case SMP_F_MAY_CHANGE flag must
be removed to be sure the caller will not retry.
This patch should fix the issue #2335. No backport needed except if commit
above is backported.
MODE_CHECK does not output "Configuration file is valid" by default
anymore. To display this message the -V option must be used with -c.
However the warning and errors are still output by default if they
exist.
This allows to clean the output of the systemd unit file with is doing a
-c.
The proxy's initialization is rather odd. First, init_new_proxy() is
called to zero all the lists and certain values, except those that can
come from defaults, which are initialized by proxy_preset_defaults().
The default server settings are also only set there.
This results in these settings not to be set for a number of internal
proxies that do not explicitly call proxy_preset_defaults() after
allocation, such as sink and log forwarders.
This was revealed by last commit 79aa63823 ("MINOR: server: always
initialize pp_tlvs for default servers") which crashes in log parsers
when applied to certain proxies which did not initialize their default
servers.
In theory this should be backported, however it would be desirable to
wait a bit before backporting it, in case certain parts would rely on
these elements not being initialized.
In commit 6f4bfed3a ("MINOR: server: Add parser support for
set-proxy-v2-tlv-fmt") a suspicious check for a NULL srv_tlv was placed
in the list_for_each_entry(), that should not be needed. In practice,
it's caused by the list head not being initialized, hence the first
element is NULL, as shown by Alexander's reproducer below which crashes
if the test in the loop is removed:
backend dummy
default-server send-proxy-v2 set-proxy-v2-tlv-fmt(0xE1) %[fc_pp_tlv(0xE1)]
server dummy_server 127.0.0.1:2319
The right place to initialize this field is proxy_preset_defaults().
We'd really need a function to initialize a server :-/
The check in the loop was removed. No backport is needed.
This issue could be reproduced with a TLS client certificate verificatio to
generate enough CRYPTO data between the client and haproxy and with dev/udp/udp-perturb
as network perturbator. Haproxy could crash thanks to a BUG_ON() call as soon as
in disorder data were bufferized into a non-contiguous buffer.
There is no need to pass a non NULL non-contiguous to qc_ssl_provide_quic_data()
from qc_ssl_provide_all_quic_data() which handles in order CRYPTO data which
have not been bufferized. If not, the first call to qc_ssl_provide_quic_data()
to process the first block of in order data leads the non-contiguous buffer
head to be advanced to a wrong offset, by <len> bytes which is the length of the
in order CRYPTO frame. This is detected by a BUG_ON() as follows:
FATAL: bug condition "ncb_ret != NCB_RET_OK" matched at src/quic_ssl.c:620
call trace(11):
| 0x5631cc41f3cc [0f 0b 8b 05 d4 df 48 00]: qc_ssl_provide_quic_data+0xca7/0xd78
| 0x5631cc41f6b2 [89 45 bc 48 8b 45 b0 48]: qc_ssl_provide_all_quic_data+0x215/0x576
| 0x5631cc3ce862 [48 8b 45 b0 8b 40 04 25]: quic_conn_io_cb+0x19a/0x8c2
| 0x5631cc67f092 [e9 1b 02 00 00 83 45 e4]: run_tasks_from_lists+0x498/0x741
| 0x5631cc67fb51 [89 c2 8b 45 e0 29 d0 89]: process_runnable_tasks+0x816/0x879
| 0x5631cc625305 [8b 05 bd 0c 2d 00 83 f8]: run_poll_loop+0x8b/0x4bc
| 0x5631cc6259c0 [48 8b 05 b9 ac 29 00 48]: main-0x2c6
| 0x7fa6c34a2ea7 [64 48 89 04 25 30 06 00]: libpthread:+0x7ea7
| 0x7fa6c33c2a2f [48 89 c7 b8 3c 00 00 00]: libc:clone+0x3f/0x5a
Thank you to @Tristan971 for having reported this issue in GH #2095.
No need to backport.
Since 04276f3d ("MEDIUM: server: split the address and the port into two
different fields") we should not use srv->addr to store server's port
and rely on srv->svc_port instead.
For sink servers, we correctly set >svc_port upon server creation but
we didn't use it when initializing address for the connection.
As a result, FQDN resolution will not work properly with sink servers.
Hopefully, this used to work by accident because sink servers were
resolved using the PA_O_RESOLVE flag in str2sa_range(), which made the
srv->addr contain the port in addition to the address.
But this will fail to work when FQDN resolution is postponed because only
->svc_port will contain the proper server port upon resolution.
For instance, FQDN resolution with servers from log backends (which are
resolved as regular servers, that is, without the PA_O_RESOLVE) will fail
to work because of this.
This may be backported as far as 2.2 even though the bug didn't have
noticeable effects for versions below 2.9
[In 2.2, sink_forward_session_init() didn't exist it should be applied in
sink_forward_session_create()]
This bug was introduced with 29b76ca ("BUG/MEDIUM: server/log: "mode log"
after server keyword causes crash ")
Indeed, we cannot safely rely on addr_proto being set when str2sa_range()
returns in parse_server() (even if SRV_PARSE_PARSE_ADDR is set), because
proto lookup might be bypassed when FQDN addresses are involved.
Unfortunately, the above patch wrongly assumed that proto would always
be set when SRV_PARSE_PARSE_ADDR was passed to parse_server() (so when
str2sa_range() was called), resulting in invalid postparsing checks being
performed, which could as well lead to crashes with log backends
("mode log" set) because some postparsing init was skipped as a result of
proto not being set and this wasn't expected later in the init code.
To fix this, we now make use of the previous patch to perform server's
address compatibility checks on hints that are always set when
str2sa_range() succesfully returns.
For log backend, we're also adding a complementary test to check if the
address family is of expected type, else we report an error, plus we're
moving the postinit logic in log api since _srv_check_proxy_mode() is
only meant to check proxy mode compatibility and we were abusing it.
This patch depends on:
- "MINOR: tools: make str2sa_range() directly return type hints"
No backport required unless 29b76ca gets backported.
str2sa_range() already allows the caller to provide <proto> in order to
get a pointer on the protocol matching with the string input thanks to
5fc9328a ("MINOR: tools: make str2sa_range() directly return the protocol")
However, as stated into the commit message, there is a trick:
"we can fail to return a protocol in case the caller
accepts an fqdn for use later. This is what servers do and in this
case it is valid to return no protocol"
In this case, we're unable to return protocol because the protocol lookup
depends on both the [proto type + xprt type] and the [family type] to be
known.
While family type might not be directly resolved when fqdn is involved
(because family type might be discovered using DNS queries), proto type
and xprt type are already known. As such, the caller might be interested
in knowing those address related hints even if the address family type is
not yet resolved and thus the matching protocol cannot be looked up.
Thus in this patch we add the optional net_addr_type (custom type)
argument to str2sa_range to enable the caller to check the protocol type
and transport type when the function succeeds.
For now, the appctx is removed from the buffer wait list when it is
freed. However, when it is released, it is not necessarily freed
immediately. But it is detached from the SC. If it is still registered in
the buffer wait list, it could then be woken up to get a buffer. At this
stage it is totally unexpected, especially because we must access the SC.
The fix is obvious, the appctx must be removed from the buffer wait list on
release.
Note this bug exists because the appctx was moved at the mux level.
This patch must be backported as far as 2.6.
After emission/reception of a CONNECTION_CLOSE, a connection enters the
CLOSING state. In this state, only minimal exchanges occurs as only the
packets which containted the CONNECTION_CLOSE frame can be reemitted. In
conformance with the RFC, most resources are released and quic_conn
instance is converted to the lighter quic_cc_conn.
Push further this optimization by closing quic_conn socket FD before
switching to a quic_cc_conn. This means that quic_cc_conn will rely on
listener socket for its send/recv operation. This should not impact
performance as as stated input/output are minimal on this state.
This patch should improve FD consumption as prior to this a socket FD
was kept during the closing delay which could cause maxsock to be
reached for other connections.
Note that fd member is kept in QUIC_CONN_COMMON and not removed from
quic_cc_conn. This is because quic_cc_conn relies on qc_snd_buf() which
access this field.
As a side-effect to this change, jobs accounting for quic_conn is also
updated. quic_cc_conn instances are now not counted as jobs. Indeed, the
main objective of jobs is to prevent haproxy process to be stopped with
data truncation. However, this relies on the connection to uses its
owned socket as the listener socket is shut down inconditionaly on
shutdown.
A consequence of the jobs handling change is that haproxy process will
be closed if only quic_cc_conn instances are present, thus preventing to
respect the closing state. In case of a reload, if a client missed a
CONNECTION_CLOSE frame just before process shutdown, it will probably
received a Stateless Reset on sending retry.
This change is considered safe as, for now, haproxy only emits
CONNECTION_CLOSE on error conditions (such as protocol violation or
timeout). It is considered as expected to suffer from data truncation
from this. However, if connection closing is reused by haproxy to
implement clean shutdown, it should be necessary to delay
CONNECTION_CLOSE frame emission to ensure no data truncation happens
here.
Prior to this patch, a special condition was set when idle timer was
rearmed for closing connections during haproxy process stopping. In this
case, the timeout was ditched and the idle task woken up immediatly.
The objective was to release quickly closing connections to not prevent
the process stopping to be too long. However, it is not conform with RFC
9000 recommandations and may cause some clients to miss a
CONNECTION_CLOSE in case of a packet loss.
A recent fix was set to use a shorter timeout for closing state. Now a
connection should only be left in this state for one second or less.
This reduces greatly the importance of stopping special condition. Thus,
this patch removes it completely.
In quic_rx_pkt_retrieve_conn(), err label is now only used if qc is
NULL. Thus, condition on qc can be removed.
No need to backport.
This issue was reported by coverity on github.
This should fix issue #2338.
When an H2 mux works with a slow downstream connection and without the
mux-mux mode, it is possible that a single stream will allocate all 32
buffers in the connection. This is not desirable at all because 1) it
brings no value, and 2) it allocates a lot of memory per connection,
which, in addition to using a lot of memory, tends to degrade performance
due to cache thrashing.
This patch improves the situation by refraining from sending data frames
over a connection when more mbufs than streams are allocated. On a test
featuring 10k connections each with a single stream reading from the
cache, this patch reduces the RAM usage from ~180k buffers to ~20k bufs,
and improves the bandwidth. This may even be backported later to recent
versions to improve memory usage. Note however that it is efficient only
when combined with e16762f8a ("OPTIM: mux-h2: call h2_send() directly
from h2_snd_buf()"), and tends to slightly reduce the single-stream
performance without it, so in case of a backport, the two need to be
considered together.
It's common to see process_stream() being woken up by wake_expired_tasks
in the profiling output, without knowing which timeout was set to cause
this. By making it possible to record the call places of task_queue()
and task_schedule(), and by making wake_expired_tasks() explicitly not
replace it, we'll be able to know which task_queue() or task_schedule()
was triggered for a given wakeup.
For example below:
process_stream 51200 311.4ms 6.081us 34.59s 675.6us <- run_tasks_from_lists@src/task.c:659 task_queue
process_stream 19227 70.00ms 3.640us 9.813m 30.62ms <- sc_notify@src/stconn.c:1136 task_wakeup
process_stream 6414 102.3ms 15.95us 8.093m 75.70ms <- stream_new@src/stream.c:578 task_wakeup
It's visible that it's the run_tasks_from_lists() which in fact applies
on the task->expire returned by the ->process() function itself.
A client may send multiple INITIAL packets if ClientHello is too big for
only one. In case a Retry token is used, the client must reuse it for
every INITIAL packets.
On the haproxy server side, there was an inconsistency to handle these
packets depending on the socket mode :
* when using listener socket, token is always revalidated.
* when using connection socket, token check is bypassed. This is because
quic_conn instance is known through its socket and thus
quic_rx_pkt_retrieve_conn() is not necessary.
RFC 9000 does not seems to mandate retry token validation after the
first INITIAL packet per connection. Thus, this patch chooses to bypass
the check every time the connection instance is known, as this indicates
that a previous token was already validated.
This should be backported up to 2.7.
QUIC connections are pushed manually into a dedicated listener queue
when they are ready to be accepted. This happens after handshake
finalization or on 0-RTT packet reception. Listener is then woken up to
dequeue them with listener_accept().
This patch comptabilizes the number of connections currently stored in
the accept queue. If reaching a certain limit, INITIAL packets are
dropped on reception to prevent further QUIC connections allocation.
This should help to preserve system resources.
This limit is automatically derived from the listener backlog. Half of
its value is reserved for handshakes and the other half for accept
queues. By default, backlog is equal to maxconn which guarantee that
there can't be no more than maxconn connections in handshake or waiting
to be accepted.
Implement a limit per listener for concurrent number of QUIC
connections. When reached, INITIAL packets for new connections are
automatically dropped until the number of handshakes is reduced.
The limit value is automatically based on listener backlog, which itself
defaults to maxconn.
This feature is important to ensure CPU and memory resources are not
consume if too many handshakes attempt are started in parallel.
Special care is taken if a connection is released before handshake
completion. In this case, counter must be decremented. This forces to
ensure that member <qc.state> is set early in qc_new_conn() before any
quic_conn_release() invocation.
Accounting is implemented for half open connections which represent QUIC
connections waiting for handshake completion. When reaching a certain
limit, Retry mechanism is automatically activated prior to instantiate
new connections.
The issue with this behavior is that two notions are mixed : QUIC
connection handshake phase and Retry which is mechanism against
amplification attacks. As such, only peer address validation should be
taken into account to activate Retry protection.
This patch chooses to reduce the scope of half_open_conn. Now only
connection waiting to validate the peer address are now accounted for.
Most notably, connections instantiated with a validated Retry token
check are not accounted.
One impact of this patch is that it should prevent to activate Retry
mechanism too early, in particular in case if multiple handshakes are
too slow. Another limitation should be implemented to protect against
this scenario.
When a new QUIC connection is created, server considers peer address as
not yet validated. The server must limit its sending up to 3 times the
content already received. This is a defensive measure to avoid flooding
a remote host victim of address spoofing.
This patch adjust the condition to consider the peer address as
validated. Two conditions are now considered :
* successful handling of a received HANDSHAKE packet. This was already
done before although implemented in a different way.
* validation of a Retry token. This was not considered prior this patch
despite RFC recommandation.
This patch also adjusts how a connection is internally labelled as using
a validated peer address. Before, above conditions were checked via
quic_peer_validated_addr(). Now, a flag QUIC_FL_CONN_PEER_VALIDATED_ADDR
is set to labelled this. It already existed prior this patch but was
only used for quic_cc_conn. This should now be more explicit.
The commit 4be0c7c65 ("MEDIUM: stconn/muxes: Loop on data fast-forwarding to
forward at least a buffer") introduced a regression. In h1_fastfwd(), if
data fast-forwarding is not supported by the opposite SC, we must exit
without calling se_donn_ff(). Otherwise a BUG_ON() will be triggered because
the opposite mux has no .done_fastfwd() callback function.
No backport needed.
Defining a master CLI without the master-worker mode emits a warning
since version 1.8. This patch enforce the behavior by forbiding the
usage of the -S option without the master-worker mode.
Move the MODE_QUIET and MODE_VERBOSE test in print_message() so we
always output in the startup-logs even with MODE_QUIET.
ha_warning(), ha_alert() and ha_notice() does not check the MODE_QUIET
and MODE_VERBOSE anymore, it is done before doing the fprintf() in
print_message().
ha_alert(), ha_warning() and ha_notice() shouldn't check MODE_STARTING
for log emission. Let's remove the check.
This shouldn't do much since the stdio_quiet() function mute the output
in main().
The commit 08d7169f4 ("MINOR: stconn: Don't queue stream task in past in
sc_notify()") tried to fix issues with epiration date set in past for the
stream in sc_notify(). However it remains some cases where the stream
expiration date may already be expired before recomputing it. This happens
when an event is reported by the mux exactly when a timeout is triggered. In
this case, depending on the scheduling, the SC may be woken up before the
stream. For these cases, we fall into the BUG_ON() preventing to queue in
the past.
So, it remains unexpected to queue a task in the past. The BUG_ON() is
correct at this place. We must just avoid to recompute the stream expiration
date if it is already expired. At worst, the stream will be woken up for
nothing. But it is not really a big deal because it will only happen on
timeouts from time to time. It is so sporadic that we can ignore it from a
performance point of view.
This patch must be backpoted to 2.8. Be careful to remove the BUG_ON() on
the 2.8.
If a TX packet cannot be allocated (by qc_build_pkt()), as it can be coalesced
to another one, this leads the TX buffer to have remaining not sent prepared data.
Then haproxy crashes upon a BUG_ON() triggered by the next call to qc_txb_release().
This may happen only during handshakes.
To fix this, qc_build_pkt() returns a new -3 error to dected such allocation
failures followed which is for now on followed by a call to qc_purge_txbuf() to
send the TX prepared data and purge the TX buffer.
Must be backported as far as 2.6.
This may happen during handshakes when Handshake packets cannot be coalesced
to a first Initial packet because of TX frame allocation failures (from
qc_build_frms()). This leads too short (not padded) Initial packets to be sent.
This is detected by a BUG_ON() in qc_send_ppkts().
To avoid this an Handshake packet without ack-eliciting frames which should have
been built by qc_build_frms() is built.
Must be backported as far as 2.6.
This may happen upon ack ranges allocation failures (from quic_update_ack_ranges_list().
This can lead to empty trees of ack ranges to be used to build ACK frames which
is not good at all. Furthermore this is detected by a BUG_ON() (in qc_do_build_pkt()).
To avoid this, simply update the acknowledgemen state of the connection only if
quic_update_ack_ranges_list() succeeds, as it fails only in case of memory
allocation failures.
Must be backported as far as 2.6.
If the Handshake encryption level could not be allocated, this could lead
to Initial packets to be sent because no Handshake CRYPTO frames were generated.
Furthermore in such an allocation failure case, the connection should be closed
as soon as possible. This is done making ha_quic_set_encryption_secrets() return
0 upon an encryption level allocation failure.
Also fix a typo in the trace in relation to this allocation failure.
No need to be backported.
When the idle timer expired with a still present mux, this task was not freed
and even requeued with a timer in the past.
Fix this issue calling task_destroy() in this case. As the task is freed,
its handler must return NULL setting local <t> variable to NULL in every cases.
Also ensure that this timer task is not armed again after having been released
with a <return> statement when this is the case from qc_idle_timer_do_rearm().
Must be backported as far as 2.6.
This was no reason not to release as soon as possible the TLS/SSL QUIC connection
context from quic_conn_release() before allocating a "closing connection" connection
(quic_cc_conn struct).
This patch sets the handshake task in heavy task mode when receiving in disorder
CRYPTO data which results in in order bufferized CRYPTO data. This is done
thanks to a non-contiguous buffer and from qc_handle_crypto_frm() after having
potentially bufferized CRYPTO data in this buffer.
qc_treat_rx_crypto_frms() is no more called from qc_treat_rx_pkts() but instead
this is where the task is set in heavy task mode. Consequently,
this is the job of qc_ssl_provide_all_quic_data() to call directly
qc_treat_rx_crypto_frms() to provide the in order bufferized CRYPTO data to the
TLS stack. As this function releases the non-contiguous buffer for the CRYPTO
data, if possible, there is no need to do that from qc_treat_rx_crypto_frms()
anymore.
Add a new pool for the CRYPTO data frames received in order.
Add ->rx.crypto_frms list to each encryption level to store such frames
when they are received in order from qc_handle_crypto_frm().
Also set the handshake task (qc_conn_io_cb()) in heavy task mode from
this function after having received such frames. When this task
detects that it is set in heavy mode, it calls qc_ssl_provide_all_quic_data()
newly implemented function to provide the CRYPTO data to the TLS task.
Modify quic_conn_enc_level_uninit() to release these CRYPTO frames
when releasing the encryption level they are in relation with.
IOBUF_FL_EOI iobuf flag is now set by the producer to notify the consumer
that the end of input was reached. Thanks to this flag, we can remove the
ugly ack in h2_done_ff() to test the opposite SE flags.
Of course, for now, it works and it is good enough. But we must keep in mind
that EOI is always forwarded from the producer side to the consumer side in
this case. But if this change, a new CO_RFL_ flag will have to be added to
instruct the producer if it can forward EOI or not.
In the mux-to-mux data forwarding, we now try, as far as possible to send at
least a buffer. Of course, if the consumer side is congested or if nothing
more can be received, we leave. But the idea is to retry to fast-forward
data if less than a buffer was forwarded. It is only performed for buffer
fast-forwarding, not splicing.
The idea behind this patch is to optimise the forwarding, when a first
forward was performed to complete a buffer with some existing data. In this
case, the amount of data forwarded is artificially limited because we are
using a non-empty buffer. But without this limitation, it is highly probable
that a full buffer could have been sent. And indeed, with H2 client, a
significant improvement was observed during our test.
To do so, .done_fastfwd() callback function must be able to deal with
interim forwards. Especially for the H2 mux, to remove H2_SF_NOTIFIED flags
on the H2S on the last call only. Otherwise, the H2 stream can be blocked by
itself because it is in the send_list. IOBUF_FL_INTERIM_FF iobuf flag is
used to notify the consumer it is not the last call. This flag is then
removed on the last call.
In order to limit inter-thread contention on the global pool, in 2.9-dev3
with commit 7bf829ace ("MAJOR: pools: move the shared pool's free_list
over multiple buckets"), it was decided that if the selected bucket had
an empty free list, we would simply give up and fall back to the OS
allocator.
But this causes allocations to be made from the OS for certain threads,
to be released to overloaded pools that are sent back to the OS. One
visible effect is that sending a lot of traffic using h2load with 100
parallel streams over 100 connections causes 5-10k buffers to be
allocated, then reducing the load to only 10 connections doesn't make
these allocations go down, just because some buckets are no longer
visited.
Tests show that giving a second chance to pick another bucket in this
case is sufficient to visit all other buckets and recycle their pending
objects. Now "show pools" that starts at 10k buffers at 100 connections
goes down to about 150 with 1 connection and 100 streams in a fraction
of a second.
No backport is needed, as the issue is only in 2.9.
Since 2.9-dev3 with commit 7bf829ace ("MAJOR: pools: move the shared
pool's free_list over multiple buckets"), the global pool supports
multiple heads to reduce inter-thread contention. However, when
grabbing a freelist head fails because another thread is already
picking from it, we just skip to the next one and try again.
Unfortunately, it still maintains a bit of contention between thread
pairs when for some reasons only a few threads are used. This may
happen for example when running on a 4- or 8- thread system and
the two most active ones end up on adjacent buckets.
A better and much simpler solution consists in visiting a random bucket
instead of the current one. Tests show that the CPU usage spent in
pool_refill_local_from_shared() reduces at low number of connections
(hence threads).
No backport is needed, as the issue is only in 2.9.
The function returning the excess of events over the current period for a
target frequency (the overshoot) has a flaw if the inactivity period is too
long. In this case, the result may overflow. Instead to be negative, a very
high positive value is returned.
This function is used by the bandwidth limitation filter. It means after a
long inactivity period, a huge burst may be detected while it should not.
In fact, the problem arise from the moment we're past the current period. In
this case, we should not report any overshoot and just get the number of
remaining events as usual.
This patch should be backported as far as 2.7.
It is now the turn for the H1 mux to be fix to properly handle http-request
and http-keep-alive timeouts. It is quite surprising but it is broken since
the 2.2. For idle connections on client side, the smallest value between the
client timeout and the http-request/http-keep-alive timeout is used while
the client timeout should only be used if other ones are not defined. So, if
the client timeout is the smallest value, the keep-alive timeout is not
respected.
It is only an issue for idle client connections. The http-request timeout is
respected from the moment part of the next request was received.
This patch should fix the issue #2334. It must be backported as far as 2.2. But
be careful during the backports. The H1 mux had evolved a lot since the 2.2.
Add a special treatment for the IPV4 and IPV6 cases in
table_process_entry_per_key() function so that input string is parsed
in best effort (STR to pseudo type ADDR): input format is first considered
over table type and then let smp_to_stkey() do the type conversion for us
when needed.
This patch heavily depends on:
- "MEDIUM: stktable/cli: simplify entry key handling"
And optionally depends on:
- 72514a44 ("MEDIUM: tools/ip: v4tov6() and v6tov4() rework")
Make use of smp_to_stkey() in table_process_entry_per_key() to simplify
key handling and leverage auto type conversions from sample API.
One noticeable side effect is that integer input checks will be relaxed
given that c_str2int() sample conv is more permissible than the integrated
table_process_entry_per_key() integer parser.
When an ipv4 key is used to filter a CLI command on a stick table
clear/set/show table ...), inetaddr_host+htonl combination was used
with no error checking.
Instead, we now use inet_pton(), which is what we use for ipv6 addresses
since b7c962b0c0 ("BUG/MINOR: stick-table/cli: Check for invalid ipv6 key")
Doing this allows us to easily check for parsing errors: we're trading off
some parsing efficience to better catch input errors and ensure we get
similar behavior between ipv4 and ipv6 addresses handling.
This patch may be backported to all supported versions.
We must take care to release H1 input buffer when it is emptied during the
fast-forwarding nego. Otherwise, it may be kept allocated for a while,
waiting for the next "normal" receive or the H1C release.
No backport needed.
Use backend connect timeout when a new connection is instantiated for
rhttp. This ensures that if connect operation fails after a certain
delay, reverse_connect listener task is woken up. This allows to free
the current connection and retry a new connect.
As a consequence of this change, rev_process() may be woken up even if
connection is not reported with CO_FL_ERROR. This happens if timeout
fired before any network reported issue. Connection freeing is adjusted
as in this case MUX instance is already allocated. Use destroy callback
to release MUX context prior to the connection itself.
This patch is really useful as a side measure for a haproxy bug
impacting connect with SSL for both backend connections and active
reverse connect. This is caused by the delayed allocation of MUX
allocation. Asynchronous connect error detected at the socket layer is
not notified to upper layers. Currently, only connect timeout allows to
release this failed connection.
The commit d6d4abdc3 ("BUILD: mux-h1: Fix build without kernel splicing
support") introduced a regression. The kernel support for the underlying
XPRT is no longer checked. So it is possible to enable the splicing for SSL
connection. This of course leads to a segfault.
This patch restore the test on the xprt rcv_pipe/snd_pipe functions.
This patch should fix a crash reported by Tristan in #2095
(#issuecomment-1788949014). No backport needed.
QUIC connections are accounted inside global sslconns. As with QUIC
actconn, it suffered from a similar issue if an intermediary allocation
failed inside qc_new_conn().
Fix this similarly by moving increment operation inside qc_new_conn().
Increment and error path are now centralized and much easier to
validate.
The consequences are similar to the actconn fix : on memory allocation
global sslconns may wrap, this time blocking any future QUIC or SSL
connections on the process.
This must be backported up to 2.6.
Since the following commit, quic_conn instances are accounted into
global actconn and compared against maxconn.
commit 7735cf3854
MEDIUM: quic: count quic_conn instance for maxconn
Increment is always done prior to real allocation to guarantee minimal
resource consumption. Special care is taken to ensure there will always
be one decrement operation for each increment. To help this, decrement
is centralized in quic_conn_release().
This behaves incorrectly in case of an intermediary allocation failure
inside qc_new_conn(). In this case, quic_conn_release() will decrement
actconn. Then, a NULL qc is returned in quic_rx_pkt_retrieve_conn()
which will also decrement the counter on its own error code path.
To properly fix this, actconn incrementation has been moved directly
inside qc_new_conn(). It is thus easier to cover every cases :
* if alloc failure before or on pool_head_quic_conn, actconn is
decremented manually at the end of qc_new_conn()
* after this step, actconn will be decremented by quic_conn_release()
either on intermediary alloc failure or on proper connection release
This bug happens on memory allocation failure so it should be rare.
However, its impact is not negligeable as if actconn counter is wrapped
it will block any future connection allocation for both QUIC and TCP.
One small downside of this change is that a CID is now always allocated
before quic_conn even if maxconn will be reached. However, this is
considered as of minor importance compared to a more robust code.
This must be backported up to 2.6.
When a EOS or EOI is detected on the endpoint and when the event is reported
at the SC level, a read activity must be reported. It is not really a big
deal because these flags already inhibit any read timeout. But it is
consistent with the <lra> comment. In addition, no read activity is reported
on abort. It is up-down event and it is not an event unblocking the
reads. So there is no reason to report a read activity.
This patch must be backported to 2.8.
A task must never be queued in past. However, in sc_notify(), the stream
task, if not woken up, is queued. Thanks to previous fixes, the stream task
expiration date should be correct. But to prevent any issue, a BUG_ON() is
added to be sure it never happens. I guess a good idea could be to remove it
or change it to BUG_ON_HOT() for the final release.
When receive or send expiration date of a stream-connector is retrieved, we
now automatically check if it may expire. If not, TICK_ETERNITY is returned.
The expiration dates of the frontend and backend stream-connectors are used
to compute the stream expiration date. This operation is performed at 2
places: at the end of process_stream() and in sc_notify() if the stream is
not woken up.
With this patch, there is no special changes for process_stream() because it
was already handled. It make thing a little simpler. However, it fixes
sc_notify() by avoiding to erroneously compute an expiration date in
past. This highly reduce the stream wakeups when there is contention on the
consumer side.
The bug was introduced with the commit 8073094bf ("NUG/MEDIUM: stconn:
Always update stream's expiration date after I/O"). It was an error to
unconditionnaly set the stream expiration data, without testing blocking
conditions on both SC.
This patch must be backported to 2.8.
When data are directly forwarded from a mux to the opposite one, we must not
forget to report send activity when data are successfully sent or report a
blocked send with data are blocked. It is important because otherwise, if
the transfer is quite long, longer than the client or server timeout, an
error may be triggered because the write timeout is reached.
H1, H2 and PT muxes are concerned. To fix the issue, The done_fastword()
callback now returns the amount of data consummed. This way it is possible
to update/reset the FSB data accordingly.
No backport needed.
In commit 6f4bfed3a ("MINOR: server: Add parser support for
set-proxy-v2-tlv-fmt") a few free() calls were made to an element on
error path when it was detected it was NULL. It doesn't have any
effect, however there was one case of use-after-free at the end of
srv_settings_cpy() that was caught by gcc due to attempting to free
the element after freeing its holder.
No backport is needed.
This allows to eliminate full buffers very quickly and to recycle them
much faster, resulting in higher transfer rates and lower memory usage
at the same time. We just wake the tasklet up if it succeeded so that
h2_process() and friends are called to finalize what needs to.
For regular buffer sizes, the performance level becomes quite close to
the one obtained with the zero-copy mechanism (zero-copy remains much
faster with non-default buffer sizes). The memory savings are huge with
default buffer size: at 64c * 100 streams on a single thread, we used
to forward 4.4 Gbps of traffic using 10400 buffers. After the change,
the performance reaches 5.9 Gbps with only 22-24 buffers, since they
are quickly recycled. That's asaving of 160 MB of RAM.
A concern was an increase in the number of syscalls but this is not
the case, the numbers remained exactly the same before and after.
Some experimentations were made to try to cork data and not send
incomplete buffers, and that always voided these changes. One
explanation might be that keeping a first buffer with only headers
frames is sufficient to prevent a zero-copy of the data coming in
a next snd_buf() call. This still needs to be studied anyway.
By calling h2_process(), the code would theoretically make it possible
for a synchronous ->wake() call to provoke an indirect call to h2_snd_buf()
while we're in h2_done_ff(), which could be quite bad. The current
conditions do not permit it right now but this could easily break by
accident. Better use h2_send() and wake the task up if needed. Precise
performance tests showed no change.
There's a subtle issue that results from pat_ref_purge_range() trying
to release memory. Since commit 0d93a8186 ("MINOR: pools: work around
possibly slow malloc_trim() during gc") that was backported to 2.3,
trim_all_pools() now protects itself against concurrent malloc() and
free() by isolating itself. The problem is that pat_ref_purge_range()
must be called under a lock, which is precisely what's done in
cli_io_handler_clear_map(). Thus during a clearing of a map, if
another thread tries to access or update an entry in the same map, it
will wait for the ref->lock to be released, and trim_all_pools() will
wait for all threads to be harmless, thus causing a deadlock. Note
that disabling memory trimming cannot work around the problem here
because it's tested only under isolation.
The solution here consists in moving the call to trim_all_pools() to
the caller, out of the lock.
This must be backported as far as 2.4.
To follow-up the implementation of the new set-proxy-v2-tlv-fmt
keyword in the server, the connection is updated to use the previously
allocated TLVs. If no value was specified, we send out an empty TLV.
As the feature is fully working with this commit, documentation and a
test for the server and default-server are added as well.
This commit introduces a generic server-side parsing of type-value pair
arguments and allocation of a TLV list via a new keyword called
set-proxy-v2-tlv-fmt.
This allows to 1) forward any TLV type with the help of fc_pp_tlv,
2) generally, send out any TLV type and value via a log format expression.
To have this fully working the connection will need to be updated in
a follow-up commit to actually respect the new server TLV list.
default-server support has also been implemented.
In this patch, we add the possibility to declare on a table definition
("table" in peer section, or "stick-table" in proxy section) that we
want the remote/peer updates on that table to be pushed on a local
haproxy table in addition to the source table.
Consider this example:
|peers mypeers
| peer local 127.0.0.1:3334
| peer clust 127.0.0.1:3333
| table t1.local type string size 10m store server_id,server_key expire 30s
| table t1.clust type string size 10m store server_id,server_key write-to mypeers/t1.local expire 30s
With this setup, we consider haproxy uses t1.local as cache/local table
for read and write operations, and that t1.clust is a remote table
containing datas processed from t1.local and similar tables from other
haproxy peers in a cluster setup. The t1.clust table will be used to
refresh the local/cache one via the "write-to" statement.
What will happen, is that every time haproxy will see entry updates for
the t1.clust table: it will overwrite t1.local table with fresh data and
will update the entry expiration timer. If t1.local entry doesn't exist
yet (key doesn't exist), it will automatically create it. Note that only
types that cannot be used for arithmetic ops will be handled, and this
to prevent processed values from the remote table from interfering with
computations based on values from the local table. (ie: prevent
cumulative counters from growing indefinitely).
"write-to" will only push supported types if they both exist in the source
and the target table. Be careful with server_id and server_key storage
because they are often declared implicitly when referencing a table in
sticking rules but it is required to declare them explicitly for them to
be pushed between a remote and a local table through "write-to" option.
Also note that the "write-to" target table should have the same type as
the source one, and that the key length should be strictly equal,
otherwise haproxy will raise an error due to the tables being
incompatibles. A table that is already being written to cannot be used
as a source table for a "write-to" target.
Thanks to this patch, it will now be possible to use sticking rules in
peer cluster context by using a local table as a local cache which
will be automatically refreshed by one or multiple remote table(s).
This commit depends on:
- "MINOR: stktable: stktable_init() sets err_msg on error"
- "MINOR: stktable: check if a type should be used as-is"
stick table types now have an extra bit named 'as_is' that allows us to
check if such type should be used as-is or if it may be involved in
arithmetic operations such as counters. This can be useful since those
types are not common and may require specific handling.
e.g.: stktable_data_types[data_type].as_is will be set to 1 if the type
cannot be used in arithmetic operations.
As a result of copy paste error in 1b8e68e ("MEDIUM: stick-table: Stop
handling stick-tables as proxies."), postparsing stktable_init() failures
were reported as such for named peer tables:
"Proxy 'table_name': failed to initialize stick table."
Now they are correctly reported like this:
"Parsing [file:line]: failed to initialize 'table_name' stick-table."
This should be backported to every stable versions.
When "peers" keyword is encountered within a stick table definition,
peers.name hint gets replaced with a new copy of the provided name using
strdup(). However, there is no detection on whether the name was
previously set or not, so it is currently allowed to reuse the keyword
multiple time to overwrite previous value, but here we forgot to free
previous value for peers.name before assigning it to a new one.
This should be backported to every stable versions.
Simplify stick and store sticktable proxy rules postparsing by adding
a sticking rule entry resolve (postparsing) function.
This will ease code maintenance.
SNI may be specify on a server line for connecting to the remote host.
This requires to manually set it on the connection via
ssl_sock_set_servername().
This step was missing when a server line was used for active reverse
HTTP. Fix this by adding the missing ssl_sock_set_servername()
invocation inside new_reverse_conn().
Note that for the moment, no session is instantiated to carry active
reverse connection. A direct consequence of this is that SNI sample
retrieval may crash depending if it depends on session parameters. This
should be fixed by a later commit. In the meantime, this patch is
sufficient to support simple SNI value such as constant expressions.
No need to backport.
This new fetcher can be used to extract the list of cookie names from
Cookie request header or from Set-Cookie response header depending on
the stream direction. There is an optional argument that can be used
as the delimiter (which is assumed to be the first character of the
argument) between cookie names. The default delimiter is comma (,).
Note that we will treat the Cookie request header as a semi-colon
separated list of cookies and each Set-Cookie response header as
a single cookie and extract the cookie names accordingly.
When an expect rule failed for a tcp-check, information about the expect
rule is dumped in the report. For a check on a binary string, a hexstring is
used in the configuration but the decoded string is dumped. It is an problem
because it can contain special characters. And it is not really handy
because there is no correspondance with the config.
So, now, the hexstring is dumped in the report. This way, we are sure there
is no special characters and it is easy to find it in the configuration.
This patch shoudl solve the issue #2326. It must be backported as far as
2.2.
The patch which fixes the certificate selection uses
SSL_CIPHER_get_id() to skip the SCSV ciphers without checking if cipher
is NULL. This patch fixes the issue by skipping any NULL cipher in the
iteration.
Problem was reported in #2329.
Need to be backported where 23093c72f1 was
backported. No release was made with this patch so the severity is
MEDIUM.
When no client timeout is defined in the configuration, QCC timeout task
is never allocated. However, a NULL timeout task is also used as a
criteria in qcc_is_dead() to consider that the MUX instance should be
released as timeout stroke earlier.
This bug causes every connection to be closed by haproxy side with a
CONNECTION_CLOSE. This is notable when using several streams per
connection with only the first stream completed and the others failed.
To fix this, change timeout task allocation policy. It is now always
allocated. This means that if no timeout is defined, it will never be
run. This is not considered a waste of resource as no timeout in the
configuration is considered as an exception case. However, this has the
advantage to simplify the rest of the code which can now check for the
task instance without having an extra check on the timeout value.
This bug is labelled as minor as it only occurs if no timeout client is
defined which reports warning on startup as it may caused unexpected
behavior.
This bug should be backported up to 2.6.
When using TLSv1.3, the signature algorithms extension is used to chose
the right ECDSA or RSA certificate.
However there was an old test for previous version of TLS (< 1.3) which
was testing if the cipher is compatible with ECDSA when an ECDSA
signature algorithm is used. This test was relying on
SSL_CIPHER_get_auth_nid(cipher) == NID_auth_ecdsa to verify if the
cipher is still good.
Problem is, with TLSv1.3, all ciphersuites are compatible with any
authentication algorithm, but SSL_CIPHER_get_auth_nid(cipher) does not
return NID_auth_ecdsa, but NID_auth_any.
Because of this, with TLSv1.3 when both ECDSA and RSA certificates are
available for a domain, the ECDSA one is not chosen in priority.
This patch also introduces a test on the cipher IDs for the signaling
ciphersuites, because they would always return NID_auth_any, and are not
relevent for this selection.
This patch fixes issue #2300.
Must be backported in all stable versions.
Similar to the previous commit which check for maxconn before allocating
a QUIC connection, this patch checks for maxsslconn at the same step.
This is necessary as a QUIC connection cannot run without a SSL context.
This should be backported up to 2.6. It relies on the following patch :
"BUG/MINOR: ssl: use a thread-safe sslconns increment"
Increment actconn and check maxconn limit when a quic_conn is
instantiated. This is necessary because prior to this patch, quic_conn
instances where not counted. Global actconn was only incremented after
the handshake has been completed and the connection structure is
allocated.
The increment is done using increment_actconn() on INITIAL packet
parsing if a new connection is about to be created. If the limit is
reached, the allocation is cancelled and the INITIAL packet is dropped.
The decrement is done under quic_conn_release(). This means that
quic_cc_conn instances are not taken into account. This seems safe
enough because quic_cc_conn are only used for minimal usage.
The counterpart of this change is that maxconn must not be checked a
second time when listener_accept() is done over a QUIC connection. For
this, a new bind_conf flag BC_O_XPRT_MAXCONN is set for listeners when
maxconn is already counted by the lower layer. For the moment, it is
positionned only for QUIC listeners.
Without this patch, haproxy process could suffer from heavy memory/CPU
load if the number of concurrent handshake is high.
This patch is not considered a bug fix per-se. However, it has a major
benefit to protect against too many QUIC handshakes. As such, it should
be backported up to 2.6. For this, it relies on the following patch :
"MINOR: frontend: implement a dedicated actconn increment function"
Each time a new SSL context is allocated, global.sslconns is
incremented. If global.maxsslconn is reached, the allocation is
cancelled.
This procedure was not entirely thread-safe due to the check and
increment operations conducted at different stage. This could lead to
global.maxsslconn slightly exceeded when several threads allocate SSL
context while sslconns is near the limit.
To fix this, use a CAS operation in a do/while loop. This code is
similar to the actconn/maxconn increment for connection.
A new function increment_sslconn() is defined for this operation. For
the moment, only SSL code is using it. However, it is expected that QUIC
will also use it to count QUIC connections as SSL ones.
This should be backported to all stable releases. Note that prior to the
2.6, sslconns was outside of global struct, so this commit should be
slightly adjusted.
When a new frontend connection is instantiated, actconn global counter
is incremented. If global maxconn value is reached, the connection is
cancelled. This ensures that system limit are under control.
Prior to this patch, the atomic check/increment operations were done
directly into listener_accept(). Move them in a dedicated function
increment_actconn() in frontend module. This will be useful when QUIC
connections will be counted in actconn counter.
When entering closing state, a QUIC connection is maintained during a
certain delay. The principle is to ensure the other peer has received
the CONNECTION_CLOSE frame. In case of packet duplication/reordering,
CONNECTION_CLOSE is reemitted.
QUIC RFC recommends to use at least 3 times the PTO value. However,
prior to this patch, haproxy used instead the max value between 3 times
the PTO and the connection idle timeout. In the default case, idle
timeout is set to 30s which is in most of the times largely superior to
the PTO. This has the downside of keeping the connection in memory for
too long whereas all resources could be released much earlier.
Fix this behavior by using 3 times the PTO on closing or draining state.
This value is limited up to 1s. This ensures that most of connections
are covered by this. If a connection runs with a very high RTT, it must
not impact the whole process and should be released in a reasonable
delay.
This should be backported up to 2.6.
Now when calling ha_panic() with a thread still under malloc_trim(),
we'll set a new tainted flag to easily report it, and the output
trace will report that this condition happened and will suggest to
use no-memory-trimming to avoid it in the future.
William suggested that since we can detect the presence of Lua in the
stack, let's combine it with stuck detection to set a new pair of flags
indicating a stuck Lua context and a stuck Lua shared context.
Now, executing an infinite loop in a Lua sample fetch function with
yield disabled crashes with tainted=0xe40 if loaded from a lua-load
statement, or tainted=0x640 from a lua-load-per-thread statement.
In addition, at the end of the panic dump, we can check if Lua was
seen stuck and emit recommendations about lua-load-per-thread and
the choice of dependencies depending on the presence of threads
and/or shared context.
This will make it easier to know that the panic function was called,
for the occasional case where the dump crashes and/or the stack is
corrupted and not much exploitable. Now at least it will be sufficient
to check the tainted value to know that someone called ha_panic(), and
it will also be usable to condition extra analysis.
Remove some code duplication by introducing a basic helper function
to detach a server from its parent proxy. It is supported to call
the function even if the server is not yet listed in the proxy list.
If the server is not yet listed in the proxy, the function will do
nothing. In delete_server(), we previously performed some BUG_ON()
to ensure that the detach always succeeded given that we were certain
that the server was in the proxy list because it was retrieved through
get_backend_server().
However this test is superfluous, we can safely assume that the operation
will always succeed if get_backend_server() returned != NULL (we're under
full thread isolation), and if it's not the case, then we have a bigger
API issue anyway..
In 304672320e ("MINOR: server: support keyword proto in 'add server' cli")
improper use of conn_get_best_mux_entry() function was made:
First, server's proxy mode was directly passed as "proto_mode" argument
to conn_get_best_mux_entry(), but this is strictly invalid because while
there is some relationship between proto modes and proxy modes, they
don't use the same storage mechanism and cannot be used interchangeably.
Because of this bug, conn_get_best_mux_entry() would not work at all for
TCP because PR_MODE_TCP equals 0, where PROTO_MODE_TCP normally equals 1.
Then another, less sensitive bug, remains:
as its name and description implies, conn_get_best_mux_entry() will try
its best to return something to the user, only using keyword (mux_proto)
input as an hint to return the most relevant mux within the list of
mux that are compatibles with proto_side and proto_mode values.
This means that even if mux_proto cannot be found or is not available
with current proto_side and proto_mode values, conn_get_best_mux_entry()
will most probably fallback to a more generic mux.
However in cli_parse_add_server(), we directly check the result of
conn_get_best_mux_entry() and consider that it will return NULL if the
provided keyword hint for mux_proto cannot be found. This will result in
the function not raising errors as expected, because most of the times if
the expected proto cannot be found, then we'll silently switch to the
fallback one, despite the user providing an explicit proto.
To fix that, we store the result of conn_get_best_mux_entry() to compare
the returned mux proto name with the one we're expecting to get, as it
is originally performed in cfgparse during initial server keyword parsing.
This patch depends on
- "MINOR: connection: add conn_pr_mode_to_proto_mode() helper func")
It must be backported up to 2.6.
This function allows to safely map proxy mode to corresponding proto_mode
This will allow for easier code maintenance and prevent mixups between
proxy mode and proto mode.
In 9a74a6c ("MAJOR: log: introduce log backends"), a mistake was made:
it was assumed that the proxy mode was already known during server
keyword parsing in parse_server() function, but this is wrong.
Indeed, "mode log" can be declared late in the proxy section. Due to this,
a simple config like this will cause the process to crash:
|backend test
|
| server name 127.0.0.1:8080
| mode log
In order to fix this, we relax some checks in _srv_parse_init() and store
the address protocol from str2sa_range() in server struct, then we set-up
a postparsing function that is to be called after config parsing to
finish the server checks/initialization that depend on the proxy mode
to be known. We achieve this by checking the PR_CAP_LB capability from
the parent proxy to know if we're in such case where the effective proxy
mode is not yet known (it is assumed that other proxies which are implicit
ones don't provide this possibility and thus don't suffer from this
constraint).
Only then, if the capability is not found, we immediately perform the
server checks that depend on the proxy mode, else the check is postponed
and it will automatically be performed during postparsing thanks to the
REGISTER_POST_SERVER_CHECK() hook.
Note that we remove the SRV_PARSE_IN_LOG_BE flag because it was introduced
in the above commit and it is no longer relevant.
No backport needed unless 9a74a6c gets backported.
Define a new function srv_add_to_avail_list(). This function is used to
centralize connection insertion in available tree. It reuses a BUG_ON()
statement to ensure the connection is not present in the idle list.
Since the following commit, idle conns are stored in a list as secondary
storage to retrieve them in usage order :
5afcb686b9
MAJOR: connection: purge idle conn by last usage
The list usage has been extended wherever connections lookup are done
both on idle and safe trees. This reduced the code size by replacing a
two tree loops by a single list loop.
LIST_ELEM() is used in this context to retrieve the first idle list
element from the server list head. However, macro usage was wrong due to
an extra '&' operator which returns an invalid connection reference.
This will most of the time caused a crash on conn_delete_from_tree() or
affiliated functions.
This bug only occurs if the FD pool is exhausted and some idle
connections are selected to be killed.
It can be reproduced using the following config and h2load command :
$ h2load -t 8 -c 800 -m 10 -n 800 "http://127.0.0.1:21080/?s=10k"
global
maxconn 100
defaults
mode http
timeout connect 20s
timeout client 20s
timeout server 20s
listen li
bind :21080 proto h2
server nginx 127.99.0.1:30080 proto h1
This bug has been introduced by the above commit. Thus no need to
backport this fix.
Note that LIST_ELEM() macro usage was slightly adjusted also in
srv_migrate_conns_to_remove(). The function used toremove_list instead
of idle_list connection list element. This is not a bug as they are
stored in the same union. However, the new code is clearer as it intends
to move connection from the idle_list only into the toremove_list
mt-list.
Idle connections are both stored in an idle/safe tree and in an idle
list. The list is used as a secondary storage to be able to retrieve
them by usage order.
If a connection is moved into the available tree, it must not be present
in the idle list. A BUG_ON() was written to check this but was placed at
the wrong code section. Fix this by removing the misplaced one and write
new ones for avail_conns tree insertion and lookup.
The impact of this bug is minor as the misplaced BUG_ON() did not seem
to be triggered.
No need to backport.
After making it configurable in previous commit "MINOR: lua: Add flags
to configure logging behaviour", this patch changes the default value
of tune.lua.log.stderr from 'on' (unconditionally forward LUA logs to
stderr) to 'auto' (only forward LUA logs to stderr if logging via a
standard logger is disabled, or none is configured for the current context)
Since this is a change in behaviour, it shouldn't be backported
Until now, messages printed from LUA log functions were sent both to
the any logger configured for the current proxy, and additionally to
stderr (in most cases)
This introduces two flags to configure LUA log handling:
- tune.lua.log.loggers to use standard loggers or not
- tune.lua.log.stderr to use stderr, or not, or only conditionally
This addresses github feature request #2316
This can be backported to 2.8 as it doesn't change previous behaviour.
The configuration parser still adds the 'ca-base' directory when loading
the @system-ca, preventing it to be loaded correctly.
This patch fixes the problem by not adding the ca-base when a file
starts by '@'.
Fix issue #2313.
Must be backported as far as 2.6.
Originally H2 would transfer everything to H1 and parsing errors were
handled there, so that if there was a track-sc rule in effect, the
counters would be updated as well. As we started to add more and more
HTTP-compliance checks at the H2 layer, then switched to HTX, we
progressively lost this ability. It's a bit annoying because it means
we will not maintain accurate error counters for a given source, for
example.
This patch adds the calls to session_inc_http_req_ctr() and
session_inc_http_err_ctr() when needed (i.e. when failing to parse
an HTTP request since all other cases are handled by the stream),
just like mux-h1 does. The same should be done for mux-h3 by the
way.
This can be backported to recent stable versions. It's not exactly a
bug, rather a missing feature in that we had never updated this counter
for H2 till now, but it does make sense to do it especially based on
what the doc says about its usage.
The H2 spec says that a HEADERS frame turns an idle stream to the open
state, and it may then turn to half-closed(remote) on ES, then to close,
all at once, if we respond with RST (e.g. on error). Due to the fact that
we process a complete frame at once since h2_dec_hdrs() may reassemble
CONTINUATION frames until everything is complete, the state was only
committed after the frame was completley valid (otherwise multiple passes
could result in subsequent frames being rejected as the stream ID would
be equal to the highest one).
However this is not correct because it means that a client may retry on
the same ID as a previously failed one, which technically is forbidden
(for example the client couldn't know which of them a WINDOW_UPDATE or
RST_STREAM frame is for).
In practice, due to the error paths, this would only be possible when
failing to decode HPACK while leaving the HPACK stream intact, thus
when the valid decoded HPACK stream cannot be turned into a valid HTTP
representation, e.g. when the resulting headers are too large for example.
The solution to avoid this consists in committing the stream ID on this
error path as well. h2spec continues to be happy.
Thanks to Annika Wickert and Tim Windelschmidt for reporting this issue.
This fix must be backported to all stable versions.
In h2_frt_handle_headers() all failures lead to a generic message saying
"rejected H2 request". It's quite inexpressive while there are a few
distinct tests that are made before jumping there:
- trailers on closed stream
- unparsable request
- refused stream
Let's emit the traces from these call points instead so that we get more
info about what happened. Since these are user-level messages, we take
care of keeping them aligned as much as possible.
For example before it would say:
[04|h2|1|mux_h2.c:2859] rejected H2 request : h2c=0x7f5d58036fd0(F,FRE)
[04|h2|5|mux_h2.c:2860] h2c_frt_handle_headers(): leaving on error : h2c=0x7f5d58036fd0(F,FRE) dsi=1 h2s=0x9fdb60(0,CLO)
And now it says:
[04|h2|1|mux_h2.c:2817] rcvd unparsable H2 request : h2c=0x7f55f8037160(F,FRH) dsi=1 h2s=CLO
[04|h2|5|mux_h2.c:2875] h2c_frt_handle_headers(): leaving on error : h2c=0x7f55f8037160(F,FRE) dsi=1 h2s=CLO
Sometimes it's unclear whether a stream is still open or closed when
certain traces are emitted, for example when the stream was refused,
because the reported pointer and ID in fact correspond to the refused
stream. And for closed streams, no pointer/name is printed, leaving
some confusion about the state. This patch makes the situation easier
to analyse by explicitly reporting "h2s=CLO" on closed/error/refused
streams so that we don't waste time comparing pointers and we instantly
know the stream is closed. Now instead of emitting:
[03|h2|5|mux_h2.c:2874] h2c_frt_handle_headers(): leaving on error : h2c=0x7fdfa8026820(F,FRE) dsi=201 h2s=0x9fdb60(0,CLO)
It will emit:
[03|h2|5|mux_h2.c:2874] h2c_frt_handle_headers(): leaving on error : h2c=0x7fdfa8026820(F,FRE) dsi=201 h2s=CLO
Method now returns the content of Json Arrays, if it is specified in
Json Path as String. The start and end character is a square bracket. Any
complex object in the array is returned as Json, so that you might get Arrays
of Array or objects. Only recommended for Arrays of simple types (e.g.,
String or int) which will be returned as CSV String. Also updated
documentation and fixed issue with parenthesis and other changes from
comments.
This patch was discussed in issue #2281.
Signed-off-by: William Lallemand <wlallemand@haproxy.com>
Reverse HTTP bind is very specific in that in rely on a server to
initiate connection. All connection settings are defined on the server
line and ignored from the bind line.
Before this patch, most of keywords were silently ignored. This could
result in a configuration from doing unexpected things from the user
point of view. To improve this situation, add a new 'rhttp_ok' field in
bind_kw structure. If not set, the keyword is forbidden on a reverse
bind line and will cause a fatal config error.
For the moment, only the following keywords are usable with reverse bind
'id', 'name' and 'nbconn'.
This change is safe as it's already forbidden to mix reverse and
standard addresses on the same bind line.
Previously, maxconn keyword was reused for a specific usage on reverse
HTTP binds to specify the number of active connect to proceed. To avoid
confusion, introduce a new dedicated keyword 'nbconn' which is specific
to reverse HTTP bind.
This new keyword is forbidden for non-reverse listener. A fatal error is
emitted during config parsing if this rule is not respected. It's safe
because it's also forbidden to mix standard and reverse addresses on the
same bind line.
Internally, nbconn value will be reassigned to 'maxconn' member of
bind_conf structure. This ensures that listener layer will automatically
reenable the preconnect task each time a connection is closed.
Reverse HTTP listeners are very specific and share only a very limited
subset of keywords with other listeners. As such, it is probable
meaningless to mix standard and reverse addresses on the same bind line.
This patch emits a fatal error during configuration parsing if this is
the case.
The number of updates sent at once was limited to not loop too long to emit
updates when the buffer size is huge or when the number of sync tables is
huge. The limit can be configured and is set to 200 by default. However,
this fix introduced a bug. It is impossible to syncrhonize two peers if the
number of tables is higher than this limit. Thus by default, it is not
possible to sync two peers if there are more than 200 tables to sync.
Technically speacking, a teaching process is finished if we loop on all tables
with no new update messages sent. Because we are limited at each call, the loop
is splitted on several calls. However the restart point for the next loop is
always the last table for which we emitted an update message. Thus with more
tables than the limit, the loop never reachs the end point.
Worse, in conjunction with the bug fixed by "BUG/MEDIUM: peers: Be sure to
always refresh recconnect timer in sync task", it is possible to trigger the
watchdog because the applets may be woken up in loop and leave requesting
more room while its buffer is empty.
To fix the issue, restart conditions for a teaching loop were changed. If
the teach process is interrupted, we now save the restart point, called
stop_local_table. It is the last evaluated table on the previous loop. This
restart point is reset when the teach process is finished.
In additionn, the updates_sent variable in peer_send_msgs() was renamed to
updates to avoid ambiguities. Indeed, the variable is incremented, whether
messages were sent or not.
This patch must be backported as far as 2.6.
A sync task used to manage reconnect, sessions creation or shutdown and data
synchronization is responsible to refresh reconnect and heartbeat timers for
each remote peers and trigger applets wakeup. These timers are used to
refresh the sync task timeer itself. Thus it is important to take care to
always properly refresh them.
However, when there are some data to push, the reconnect timer is not
checked. It may be expired and not refreshed. In this case, an expired timer
may be used to the sync task, leading to a storm of wakeups. The sync task
is woken up in loop because its timer is in the past, waking up Peer applets
at each time.
To fix the issue, the peer's reconnect timer is now refresh to the default
reconnect timeout, if necessary, when there are some data to push.
This patch must be backported to all stable versions.
Since traces were adapted to support being declared in the global section
in 2.7 with commit c11f1cdf4 ("MINOR: trace: split the CLI "trace" parser
in CLI vs statement"), the method used to return the error message was
unreliable. For example an invalid sink name in the global section would
produce:
[ALERT] (26685) : config : parsing [test-trace.cfg:51] : 'trace': No such sink
[ALERT] (26685) : config : parsing [test-trace.cfg:51] : (null)
[ALERT] (26685) : config : Error(s) found in configuration file : test-trace.cfg
[ALERT] (26685) : config : Fatal errors found in configuration.
The reason is that the trace is emitted manually using ha_error() in
cfg_parse_trace() and -1 is returned without setting the message, and
the caller also prints the empty message. That's quite awkward given
that the API originally comes from the CLI which does support dynamic
strings and that config keywords do as well.
This commit modifies both cli_parse_trace() and cfg_parse_trace() to
return a dynamically allocated message instead, and adapts the central
function trace_parse_statement() to do the same, replacing a few direct
assignments with strdup() or memprintf(). This way the alert is no
longer emitted by the parser function, it just passes the message to
the caller.
A few of the static messages switching to memprintf() also took this
opportunity to report the faulty word:
[ALERT] (26772) : config : parsing [test-trace.cfg:51] : No such trace sink 'stduot'
[ALERT] (26772) : config : Error(s) found in configuration file : test-trace.cfg
[ALERT] (26772) : config : Fatal errors found in configuration.
This may be backported to 2.8 and 2.7.
Stefan Behte reported that since commit f279a2f14 ("BUG/MINOR: mux-h2:
refresh the idle_timer when the mux is empty"), the http-request and
http-keep-alive timeouts don't work anymore on H2. Before this patch,
and since 3e448b9b64 ("BUG/MEDIUM: mux-h2: make sure control frames do
not refresh the idle timeout"), they would only be refreshed after stream
frames were sent (HEADERS or DATA) but the patch above that adds more
refresh points broke these so they don't expire anymore as long as
there's some activity.
We cannot just revert the fix since it also addressed an isse by which
sometimes the timeout would trigger too early and provoque truncated
responses. The right approach here is in fact to only use refresh the
idle timer when the mux buffer was flushed from any such stream frames.
In order to achieve this, we're now setting a flag on the connection
whenever we write a stream frame, and we consider that flag when deciding
to refresh the buffer after it's emptied. This way we'll only clear that
flag once the buffer is empty and there were stream data in it, not if
there were no such stream data. In theory it remains possible to leave
the flag on if some control data is appended after the buffer and it's
never cleared, but in practice it's not a problem as a buffer will always
get sent in large blocks when the window opens. Even a large buffer should
be emptied once in a while as control frames will not fill it as much as
data frames could.
Given the patch above was backported as far as 2.6, this patch should
also be backported as far as 2.6.
tune.rcvbuf.client and tune.rcvbuf.server are not suitable for shared
dgram sockets because they're per connection so their units are not the
same. However, QUIC's listener and log servers are not connected and
take per-thread or per-process traffic where a socket log buffer might
be too small, causing undesirable packet losses and retransmits in the
case of QUIC. This essentially manifests in listener mode with new
connections taking a lot of time to set up under heavy traffic due to
the small queues causing delays. Let's add a few new settings allowing
to set these shared socket sizes on the frontend and backend side (which
reminds that these are per-front/back and not per client/server hence
not per connection).
Instead of speaking of an initialisation stage for each data
fast-forwarding, we now use the negociate term. Thus init_ff/init_fastfwd
functions were renamed nego_ff/nego_fastfwd.
Data fast-forwarding does not build without the kernel splicing support
because counters about splicing don't exist. To make the code more readable,
all code about splicing is disabled if kernel splicing is not supported.
The zero-copy forwarding or the mux-to-mux forwarding is a way to
fast-forward data without using the channels buffers. Data are transferred
from a mux to the other one. The kernel splicing is an optimization of the
zero-copy forwarding. But it can also use normal buffers (but not channels
ones). This way, it could be possible to fast-forward data with muxes not
supporting the kernel splicing (H2 and H3 muxes) but also with applets.
However, this mode can introduce regressions or bugs in future (just like
the kernel splicing). Thus, It could be usefull to disable this optim. To do
so, in configuration, the global tune settting
'tune.disable-zero-copy-forwarding' may be set in a global section or the
'-dZ' command line parameter may be used to start HAProxy. Of course, this
also disables the kernel splicing.
The PT multiplexer now implements callbacks function to produce and consume
fast-forwarded data. Only splicing is support because the mux-pt does not
use its own buffers.
Because channel_is_empty() function does now only check the channel's
buffer, we can remove it and rely on co_data() instead. Of course, all tests
must be inverted.
channel_is_empty() is thus removed.
It is important to split channels and I/O buffers. When data are pushed in
an I/O buffer, we consider them as forwarded. The channel never sees
them. Fast-forwarded data are now handled in the SE only.
The H2 multiplexer now implements callbacks to consume fast-forwarded
data. It is the most usful case: A H2 client getting data from a H1
server. It is also the easiest case to implement. The producer side is
trickier because of multiplexing. It is not obvious this case would be
improved with data fast-forwarding.
When message headers are parsed and an HTX start-line is created, if we
detect the response must not have any payload, a specific flag must be set
on the HTX start-line. It happens for instance for response to HEAD
requests. This flag is useb by the multiplexers to know response payload, if
any, must be silently skipped.
This was not performed when h2 HEADERS frames were decoded. This HTX flag
was specifically added to fix a bug when the splicing is inuse. Thus the H2
multiplexer was not concerned. Because the mux-to-mux fast-forwarding will
be introduced, it is important handle this flag in the H2 multiplexer too.
Just like for the zero-copy, this patch tries to simplify the code
responsible to format the message payload before sending it. But here, we
take care to simplify the loop on the HTX blocks. The result should be
less errorrpone.
In h1_make_data(), the function responsible to format the message payload
before sending it, the code dealing with zero-copy was slighly simplified
(at least for me :).
There is no real change but there is a better split between messages with a
content-length and cunked messages.
This function should be used to send the chunk size, before appending the
chunk payload. It also takes care to add a CRLF to finish a previous chunk,
if necessary. This function will be used to fix the splicing for re-chunk
responses with an unknown length.
When data were sent using the kernel splicing, we tried to send all data
with no restriction. Most of time it is valid. However, because the payload
representation may differ between the producer and the consumer, it is
important to be able to specify how must data to send via the splicing.
Of course, for performance reason, it is important to maximize amount of
data send via splicing at each call. However, on edge-cases, this now can be
limited.
On the sending path, there are 3 states for chunked payload in H1:
* H1_MSG_CHUNK_SIZE: the chunk size must be emitted
* H1_MSH_CHUNK_CRLF: The end of the chunk must be emitted
* H1_MSG_DATA: Chunked data must be emitted
However, some shortcuts were used on the sending path to avoid some
transitions. Especially, outgoing messages were never switched in
H1_MSG_CHUNK_SIZE state.
However, it will be necessary to properly handle all transitions on the payload
to implement mux-to-mux forwarding, to be sure to always known when the chunk
size or the end of the chunk must be emitted.
For now, it is not an issue, but it is safer to explicitly ignore HTX extra
field for responses with unknown length. This will be mandatory to future
fixes, to be able to re-chunk responses with an unknown length..
Now the kernel splicing support was removed, we can add mux-to-mux
fast-forward support. Of course, the splicing support will be reintroduced
in the muxes themselves but this will be transparent.
Changes are mainly located into sc_conn_recv() and sc_conn_send().
Because the kernel splicing support was removed from the stconn, it is
useless to keep it in muxes. In this patch, we remove the kernel splicing
support from the H1 multiplexer. It will be replaced by the mux-to-mux data
fast-forwarding.
Because the kernel splicing support was removed from the stconn, it is
useless to keep it in muxes. In this patch, we remove the kernel splicing
support from the passthough multiplexer. It will be replaced by the
mux-to-mux data fast-forwarding.
mux-to-mux fast-forwarding will be added. To avoid mix with the splicing and
simplify the commits, the kernel splicing support is removed from the
stconn. CF_KERN_SPLICING flag is removed and the support is no longer tested
in process_stream().
In the stconn part, rcv_pipe() callback function is no longer called.
Reg-tests scripts testing the kernel splicing are temporarly marked as
broken.
It is unused for now, but the iobuf structure now owns a pointer to a
buffer. This buffer will be used to perform mux-to-mux fast-forwarding when
splicing is not supported or unusable. This pointer should be filled by an
endpoint to let the opposite one forward data.
Extra fields, in addition to the buffer, are mandatory because the buffer
may already contains some data. the ".offset" field may be used may be used
as the position to start to copy data. Finally, the amount of data copied in
this buffer must be saved in ".data" field.
Some flags are also added to prepare next changes. And helper stconn
fnuctions are updated to also count data in the buffer. For a first
implementation, it is not planned to handle data in the buffer and in the
pipe in same time. But it will be possible to do so.
Instead of talking about kernel splicing at stconn/sedesc level, we now try
to talk about mux-to-mux fast-forwarding. To do so, 2 functions were added
to know if there are fast-forwarded data and to retrieve this amount of
data. Of course, for now, there is only data in a pipe.
In addition, some flags were renamed to reflect this notion. Note the
channel's documentation was not updated yet.
The pipes used to put data when the kernel splicing is in used are moved in
the SE descriptors. For now, it is just a simple remplacement but there is a
major difference with the pipes in the channel. The data are pushed in the
consumer's pipe while it was pushed in the producer's pipe. So it means the
request data are now pushed in the pipe of the backend SE descriptor and
response data are pushed in the pipe of the frontend SE descriptor.
The idea is to hide the pipe from the channel/SC side and to be able to
handle fast-forwading in pipe but also in buffer. To do so, the pipe is
inside a new entity, called iobuf. This entity will be extended.
If a shutw is blocked because the mux is full or busy, we must defer the
shutr. In this case, the H2 stream is not in H2_SS_CLOSED state because the
shutw is also deferred. If the shutr is performed, this will lead to a
error.
Concretly, when the mux is unblocked, a RST_STREAM is sent while in some
cases, an empty DATA frame with ES flag set could be sent.
This patch should be backported to all stable versions.
Redirect responses sent during the HTTP analysis have no payload. However
there is still a "Content-Length" header. It is important to set the
corresponding flag on the HTX start-line to be sure to preserve this header
when the reponse is sent to the client. The same is true with the stats
applet, when it returns a redirect responses.
It is especially important because we no ignore in-fly modifications of
"Content-Length" or "Transfer-Encoding" headers without updating the HTX
start-line flags.
This patch may be backported to all stable versions but it is probably
useless because only the 2.9-dev is affected by the bug.
Since commit 723c73f8a ("MEDIUM: mux-h1: Split h1_process_mux() to make code
more readable"), outgoing H1 chunked messages with no data at all get
delayed by 200ms. It is due to the fact that we end processing too early and
we don't have the opportunity to process trailers in this case.
This fix addresses it by verifying if it's required to emit EOT or trailers,
if any, when retruning from h1_make_data()
No backport is needed, this was in 2.9-dev.
Since last fixes about the lua cosocket, the appctx is no longer initialized
in hlua_socket_new(). The code to deal with error at this stage can be
removed.
This patch should fix the issue #2308.
The two timer handlers qc_process_timer() and qc_idle_timer_task() would
inadvertently return NULL when they don't want to be requeued, instead
of just returning the task itself. The effect of returning NULL for the
scheduler is that it considers the task as freed, so it must not touch
it anymore. As such, the TASK_F_RUNNING flag is never removed from these
tasks, and when quic_conn_release() later tries to release these tasks
using task_destroy(), the latter sees the RUNNING flag and just sets
->process to NULL, hoping that the scheduler will kill them on return,
but there's no longer being executed so this never happens and they are
leaked.
Interestingly, this doesn't seem to happen as much when multi-queue is
set to off, but it's likely because the tasks are being replaced and the
first ones have already been woken up and leaked, while the latter might
only trigger on a timeout or timer renewal.
This should address github issue #2310. Thanks to @hpn0t0ad for the
numerous traces that helped understand this sequence.
This must be backported to 2.7 at least, and adapted for 2.6
(qc_idle_timer_task must return t there).
When looking at "show pools", it's often difficult to know which alloc()
corresponds to which free() since it's not often 1:1. But sometimes we
have all elements available to maintain a link between alloc and free.
Indeed, when the caller is recorded in the allocated area, we can store
the pointer to the just created bin instead of the caller address itself,
since the caller address is already in the memprof bin. By doing so, we
permit the pool_free() call to locate the allocator bin and update its
free count when caller tracing is enabled. This for example allows to
produce outputs like this on "show profiling" and a process started with
-dMcaller:
1391967 1391968 22805987328 22806003712| 0x59f72f process_stream+0x19f/0x3a7a p_alloc(0) [delta=-16384] [pool=buffer]
1391936 1391937 22805479424 22805495808| 0x6e1476 task_run_applet+0x426/0xea2 p_alloc(0) [delta=-16384] [pool=buffer]
1391925 1391925 22805299200 22805299200| 0x58435a main+0xdf07a p_alloc(0) [delta=0] [pool=buffer]
0 2087930 0 34208645120| 0x59b519 stream_release_buffers+0xf9/0x110 p_free(-16384) [pool=buffer]
695993 695992 11403149312 11403132928| 0x66018f main+0x1baeaf p_alloc(0) [delta=16384] [pool=buffer]
0 1391957 0 22805823488| 0x59b47c stream_release_buffers+0x5c/0x110 p_free(-16384) [pool=buffer]
695968 695970 11402739712 11402772480| 0x587b85 h1_io_cb+0x9a5/0xe7c p_alloc(0) [delta=-32768] [pool=buffer]
0 1391923 0 22805266432| 0x57f388 main+0xda0a8 p_free(-16384) [pool=buffer]
695959 695960 11402592256 11402608640| 0x586add main+0xe17fd p_alloc(0) [delta=-16384] [pool=buffer]
0 695978 0 11402903552| 0x59cc58 stream_free+0x178/0x9ea p_free(-16384) [pool=buffer]
(...)
Here it's quickly visible that all of them got properly released.
An interesting issue was met when testing the mux-to-mux forwarding code.
In order to preserve fairness, in h2_snd_buf() if other streams are waiting
in send_list or fctl_list, the stream that is attempting to send also goes
to its list, and will be woken up by h2_process_mux() or h2_send() when
some space is released. But on rare occasions, there are only a few (or
even a single) streams waiting in this list, and these streams are just
quickly removed because of a timeout or a quick h2_detach() that calls
h2s_destroy(). In this case there's no even to wake up the other waiting
stream in its list, and this will possibly resume processing after some
client WINDOW_UPDATE frames or even new streams, so usually it doesn't
last too long and it not much noticeable, reason why it was left that
long. In addition, measures have shown that in heavy network-bound
benchmark, this exact situation happens on less than 1% of the streams
(reached 4% with mux-mux).
The fix here consists in replacing these LIST_DEL_INIT() calls on
h2s->list with a function call that checks if other streams were queued
to the send_list recently, and if so, which also tries to resume them
by calling h2_resume_each_sending_h2s(). The detection of late additions
is made via a new flag on the connection, H2_CF_WAIT_INLIST, which is set
when a stream is queued due to other streams being present, and which is
cleared when this is function is called.
It is particularly difficult to reproduce this case which is particularly
timing-dependent, but in a constrained environment, a test involving 32
conns of 20 streams each, all downloading a 10 MB object previously
showed a limitation of 17 Gbps with lots of idle CPU time, and now
filled the cable at 25 Gbps.
This should be backported to all versions where it applies.
Except if we must silently ignore empty connections by enabling
http-ignore-probes or dontlognull options, when a client connection is
closed before the first request, a 400-bad-request response must be sent
with the corresponding log message. However, that is broken since the commit
fc473a6453 ("MEDIUM: mux-h1: Rely on the H1C to deal with shutdown for
reads").
The bug is subtle. Parsing errors are no longer reported on connection errors
before the first request while it should be.
This patch must be backported where the above commit is (as far as 2.7).
In the same way than for stream-connectors (see "BUG/MEDIUM: stconn: Report
a send activity everytime data were sent" for details), we now report a send
activity everytime something was consumed by an applet, even if some output
data remains blocked into the channel's buffer.
This patch must be backported to 2.8.
When read/write timeouts were refactored in 2.8, we decided to change when a
send activity had to be reported. Before, everytime some data were sent a
send activity were reported. At this time, the channel's wex timer were
updated. During the refactoring, we decided to limit send activity to sends
that ampty te channel's buffer, consuming all outgoing data. Idea behind
this change was to protect haproxy against clients consumming data very
slowly.
However, it is too strict. Some congested muxes but still active can hit the
client or the server timeout. It seems a bit unfair. It is especially
visible with QUIC/H3 but it is probably also possible with H2 if the window
size is small.
The better is to restore the old behavior.
This patch must be backported to 2.8.
"log-bufsize" may now be used for a log server (in a log backend) to
configure the bufsize of implicit ring associated to the server (which
defaults to BUFSIZE).
hash lb algorithm can be configured with the "log-balance hash <cnv_list>"
directive. With this algorithm, the user specifies a converter list with
<cnv_list>.
The produced log message will be passed as-is to the provided converter
list, and the resulting hash will be used to select the log server that
will receive the log message.
split sample_process() in 2 parts in order to be able to only process
the converter part of a sample expression from an existing input sample
struct passed as parameter.
Instead of systematically computing the avalanche hash right after the
gen_hash() call, do it inside the gen_hash() function directly to ensure
avalanche setting is always considered.
Allow the use of the "none" hash-type function so that the key resulting
from the sample expression is directly used as the hash.
This can be useful to do the hashing manually using available hashing
converters, or even custom ones, and then inform haproxy that it can
directly rely on the sample expression result which is explictly handled
as an integer in this case.
In this patch we add basic support for the random algorithm:
random algorithm picks a random server using the result of the
statistical_prng() function as if it was a hash key to then compute the
related server ID.
There is no support for the <draw> parameter (which is implemented for
tcp/http load-balancing), because we don't have the required metrics to
evaluate server's load in log backends for the moment. Plus it would add
more complexity to the __do_send_log_backend() function so we'll keep it
this way for now but this might be needed in the future.
sticky algorithm always tries to send log messages to the first server in
the farm. The server will stay in front during queue and dequeue
operations (no other server can steal its place), unless it becomes
unavailable, in which case it will be replaced by another server from
the tree.
Using "mode log" in a backend section turns the proxy in a log backend
which can be used to log-balance logs between multiple log targets
(udp or tcp servers)
log backends can be used as regular log targets using the log directive
with "backend@be_name" prefix, like so:
| log backend@mybackend local0
A log backend will distribute log messages to servers according to the
log load-balancing algorithm that can be set using the "log-balance"
option from the log backend section. For now, only the roundrobin
algorithm is supported and set by default.
This helper function can be used to create a new sink from an existing
server struct (and thus existing proxy as well), in order to spare some
resources when possible.
implicit rings were automatically forced to the parent logger format, but
this was done upon ring creation.
This is quite restrictive because we might want to choose the desired
format right before generating the log header (ie: when producing the
log message), depending on the logger (log directive) that is
responsible for the log message, and with current logic this is not
possible. (To this day, we still have dedicated implicit ring per log
directive, but this might change)
In ring_write(), we check if the sink->fmt is specified:
- defined: we use it since it is the most precise format
(ie: for named rings)
- undefined: then we fallback to the format from the logger
With this change, implicit rings' format is now set to UNSPEC upon
creation. This is safe because the log header building function
automatically enforces the "raw" format when UNSPEC is set. And since
logger->format also defaults to "raw", no change of default behavior
should be expected.
Introduce log_header struct to easily pass log header data between
functions and use that to simplify the logic around log header
handling.
While at it, some outdated comments were updated as well.
No change in behavior should be expected.
__do_send_log() now takes an extra target parameter to pass an explicit
log target instead of getting it from logger->target.
This will allow __do_send_log() to be called multiple times within a
logger entry containing multiple log targets.
Since a5b325f92 ("MINOR: protocol: add a real family for existing FDs"),
we don't rely anymore on AF_UNSPEC for buffer rings in do_send_log.
But we kept it as a parsing hint to differentiate between implicit and
named rings during ring buffer postparsing.
However it is still a bit confusing and forces us to systematically rely
on target->addr, even for named buffer rings where it doesn't make much
sense anymore.
Now that target->addr was made a pointer in a recent commit, we can
choose not to initialize it when not needed (i.e.: named rings) and use
this as a hint to distinguish implicit rings during init since they rely
on the addr struct to temporarily store the ring's address until the ring
is actually created during postparsing step.
log targets were immediately embedded in logger struct (previously
named logsrv) and could not be used outside of this context.
In this patch, we're introducing log_target type with the associated
helper functions so that it becomes possible to declare and use log
targets outside of loggers scope.
When 'log' directive was implemented, the internal representation was
named 'struct logsrv', because the 'log' directive would directly point
to the log target, which used to be a (UDP) log server exclusively at
that time, hence the name.
But things have become more complex, since today 'log' directive can point
to ring targets (implicit, or named) for example.
Indeed, a 'log' directive does no longer reference the "final" server to
which the log will be sent, but instead it describes which log API and
parameters to use for transporting the log messages to the proper log
destination.
So now the term 'logsrv' is rather confusing and prevents us from
introducing a new level of abstraction because they would be mixed
with logsrv.
So in order to better designate this 'log' directive, and make it more
generic, we chose the word 'logger' which now replaces logsrv everywhere
it was used in the code (including related comments).
This is internal rewording, so no functional change should be expected
on user-side.
Since the following patch :
commit 33c49cec987c1dcd42d216c6d075fb8260058b16
MINOR: quic: Make qc_dgrams_retransmit() return a status.
retransmission process is interrupted as soon as a fatal send error has
been encounted. However, this may leave frames in local list. This cause
several issues : a memory leak and a potential crash.
The crash happens because leaked frames are duplicated of an origin
frame via qc_dup_pkt_frms(). If an ACK arrives later for the origin
frame, all duplicated frames are also freed. During qc_frm_free(),
LIST_DEL_INIT() operation is invalid as it still references the local
list used inside qc_dgrams_retransmit().
This bug was reproduced using the following injection from another
machine :
$ h2load --npn-list h3 -t 8 -c 10000 -m 1 -n 2000000000 \
https://<host>:<port>/?s=4m
Haproxy was compiled using ASAN. The crash resulted in the following
trace :
==332748==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7fff82bf9d78 at pc 0x556facd3b95a bp 0x7fff82bf8b20 sp 0x7fff82bf8b10
WRITE of size 8 at 0x7fff82bf9d78 thread T0
#0 0x556facd3b959 in qc_frm_free include/haproxy/quic_frame.h:273
#1 0x556facd59501 in qc_release_frm src/quic_conn.c:1724
#2 0x556facd5a07f in quic_stream_try_to_consume src/quic_conn.c:1803
#3 0x556facd5abe9 in qc_treat_acked_tx_frm src/quic_conn.c:1866
#4 0x556facd5b3d8 in qc_ackrng_pkts src/quic_conn.c:1928
#5 0x556facd60187 in qc_parse_ack_frm src/quic_conn.c:2354
#6 0x556facd693a1 in qc_parse_pkt_frms src/quic_conn.c:3203
#7 0x556facd7531a in qc_treat_rx_pkts src/quic_conn.c:4606
#8 0x556facd7a528 in quic_conn_app_io_cb src/quic_conn.c:5059
#9 0x556fad3284be in run_tasks_from_lists src/task.c:596
#10 0x556fad32a3fa in process_runnable_tasks src/task.c:876
#11 0x556fad24a676 in run_poll_loop src/haproxy.c:2968
#12 0x556fad24b510 in run_thread_poll_loop src/haproxy.c:3167
#13 0x556fad24e7ff in main src/haproxy.c:3857
#14 0x7fae30ddd0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x240b2)
#15 0x556facc9375d in _start (/opt/haproxy-quic-2.8/haproxy+0x1ea75d)
Address 0x7fff82bf9d78 is located in stack of thread T0 at offset 40 in frame
#0 0x556facd74ede in qc_treat_rx_pkts src/quic_conn.c:4580
This must be backported up to 2.7.
qcs_new() allocates several elements in intermediary steps. All elements
must first be properly initialized to be able to free qcs instance in
case of an intermediary failure.
Previously, qc_stream_desc allocation was done in the middle of
qcs_new() before some elements initializations. In case this fails, a
crash can happened as some elements are left uninitialized.
To fix this, move qc_stream_desc allocation at the end of qcs_new().
This ensures that all qcs elements are initialized first.
This should be backported up to 2.6.
qc_new_conn() allocates several elements in intermediary steps. If one
of the fails, a global free is done on the quic_conn and its elements.
This requires that most elements are first initialized to NULL or
equivalent to ensure freeing operation is done only on proper values.
Once of this element is qc.tx.cc_buf_area. It was initialized too late
which could caused crashes. This is introduced by
9f7cfb0a56
MEDIUM: quic: Allow the quic_conn memory to be asap released.
No need to backport.
Since commit 5afcb686b ("MAJOR: connection: purge idle conn by last usage")
in 2.9-dev4, the test on conn->toremove_list added to conn_get_idle_flag()
in 2.8 by commit 3a7b539b1 ("BUG/MEDIUM: connection: Preserve flags when a
conn is removed from an idle list") becomes misleading. Indeed, now both
toremove_list and idle_list are shared by a union since the presence in
these lists is mutually exclusive. However, in conn_get_idle_flag() we
check for the presence in the toremove_list to decide whether or not to
delete the connection from the tree. This test now fails because instead
it sees the presence in the idle or safe list via the union, and concludes
the element must not be removed. Thus the element remains in the tree and
can be found later after the connection is released, causing crashes that
Tristan reported in issue #2292.
The following config is sufficient to reproduce it with 2 threads:
defaults
mode http
timeout client 5s
timeout server 5s
timeout connect 1s
listen front
bind :8001
server next 127.0.0.1:8002
frontend next
bind :8002
timeout http-keep-alive 1
http-request redirect location /
Sending traffic with a few concurrent connections and some short timeouts
suffices to instantly crash it after ~10k reqs:
$ h2load -t 4 -c 16 -n 10000 -m 1 -w 1 http://0:8001/
With Amaury we analyzed the conditions in which the function is called
in order to figure a better condition for the test and concluded that
->toremove_list is never filled there so we can safely remove that part
from the test and just move the flag retrieval back to what it was prior
to the 2.8 patch above. Note that the patch is not reverted though, as
the parts that would drop the unexpected flags removal are unchanged.
This patch must NOT be backported. The code in 2.8 works correctly, it's
only the change in 2.9 that makes it misbehave.
In conn_delete_from_tree() there remains a cast of the toremove_list
to struct list while the introduction of the union precisely was to
avoid this cast. It's a leftover from the first version of patch
5afcb686b ("MAJOR: connection: purge idle conn by last usage") merged
into in 2.9-dev4, let's fix that.
No backport is needed.
HTTP/3 specification has several requirement when parsing authority or
host header inside a request. However, it was until then only partially
implemented.
This commit fixes this by ensuring the following :
* reject an empty authority/host header
* reject a host header if an authority was found with a different value
* no authority neither host header present
This must be backported up to 2.6.
Support stream opening with an initial max-stream-data of 0.
In normal case, QC_SF_BLK_SFCTL is set when a qcs instance cannot
transfer more data due to flow-control. This flag is set when
transfering data from MUX to quic-conn instance.
However, it's possible to define an initial value of 0 for
max-stream-data. In this case, qcs instance is blocked despite
QC_SF_BLK_SFCTL not set. No STREAM frame is prepared for this stream as
it's not possible to emit any byte, so QC_SF_BLK_SFCTL flag is never
set.
This behavior should cause no harm. However, this can cause a BUG_ON()
crash on qcc_io_send(). Indeed, when sending is retried, it ensures that
only qcs instance waiting for a new qc_stream_buf or with
QC_SF_BLK_SFCTL set is present in the send_list.
To fix this, initialize qcs with 0 value for msd and QC_SF_BLK_SFCTL.
The flag is removed only if transport parameter msd value is non null.
This should be backported up to 2.6.
When receiving a RESET_STREAM on a send-only stream, it is mandatory to
close the connection with an error STREAM_STATE error. However, this was
badly implemented as this caused two invocation of qcc_set_error() which
is forbidden by the mux-quic API.
To fix this, rely on qcc_get_qcs() to properly detect the error. Remove
qcc_set_error() usage from qcc_recv_reset_stream() instead.
This must be backported up to 2.7.
RFC 9000 indicates that a QUIC packet with no frame must trigger a
connection closure with PROTOCOL_VIOLATION error code. Implement this
via an early return inside qc_parse_pkt_frms().
This should be backported up to 2.6.
Move all QUIC trace definitions from quic_conn.h to quic_trace-t.h. Also
remove multiple definition trace_quic macro definition into
quic_trace.h. This forces all QUIC source files who relies on trace to
include it while reducing the size of quic_conn.h.
This bug was detected when compiling haproxy against aws-lc TLS stack
during QUIC interop runner tests. Some algorithms could be negotiated by haproxy
through the TLS stack but not fully supported by haproxy QUIC implentation.
This leaded tls_aead() to return NULL (same thing for tls_md(), tls_hp()).
As these functions returned values were never checked, they could triggered
segfaults.
To fix this, one closes the connection as soon as possible with a
handshake_failure(40) TLS alert. Note that as the TLS stack successfully
negotiates an algorithm, it provides haproxy with CRYPTO data before entering
->set_encryption_secrets() callback. This is why this callback
(ha_set_encryption_secrets() on haproxy side) is modified to release all
the CRYPTO frames before triggering a CONNECTION_CLOSE with a TLS alert. This is
done calling qc_release_pktns_frms() for all the packet number spaces.
Modify some quic_tls_keys_hexdump to avoid crashes when the ->aead or ->hp EVP_CIPHER
are NULL.
Modify qc_release_pktns_frms() to do nothing if the packet number space passed
as parameter is not intialized.
This bug does not impact the QUIC TLS compatibily mode (USE_QUIC_OPENSSL_COMPAT).
Thank you to @ilia-shipitsin for having reported this issue in GH #2309.
Must be backported as far as 2.6.