A dedicated <fin> field was used in quic_stream structure. However, this
info is already encoded in the frame type field as specified by QUIC
protocol.
In fact, only code for packet reception used the <fin> field. On the
sending side, we only checked for the FIN bit. To align both sides,
remove the <fin> field and only used the FIN bit.
This should be backported up to 2.7.
Define a new qcc_app_ops callback named close(). This will be used to
notify app-layer about the closure of a stream by the remote peer. Its
main usage is to ensure that the closure is allowed by the application
protocol specification.
For the moment, close is not implemented by H3 layer. However, this
function will be mandatory to properly reject a STOP_SENDING on the
control stream and preventing a later crash. As such, this commit must
be backported with the next one on 2.6.
This is related to github issue #2006.
proxy http-only options implemented in http_ext were statically stored
within proxy struct.
We're making some changes so that http_ext are now stored in a dynamically
allocated structs.
http_ext related structs are only allocated when needed to save some space
whenever possible, and they are automatically freed upon proxy deletion.
Related PX_O_HTTP{7239,XFF,XOT) option flags were removed because we're now
considering an http_ext option as 'active' if it is allocated (ptr is not NULL)
A few checks (and BUG_ON) were added to make these changes safe because
it adds some (acceptable) complexity to the previous design.
Also, proxy.http was renamed to proxy.http_ext to make things more explicit.
forwarded header option (rfc7239) deals with sample expressions in two
steps: first a sample expression string is extracted from the config file
and later in startup sequence this string is converted into the resulting
sample_expr.
We need to perform these two steps because we cannot compile the expr
too early in the parsing sequence. (or we would miss some context)
Because of this, we have two dinstinct structure members (expr and expr_s)
for each 7239 field supporting sample expressions.
This is not cool, because we're bloating the http forwarded config structure,
and thus, bloating proxy config structure.
To address this, we now merge both expr and expr_s members inside a single
union to regain some space. This forces us to perform some additional logic
to make sure to use the proper structure member at different parsing steps.
Thanks to this, we're also able to free/release related config hints and
sample expression strings as soon as the sample expression
compilation is finished.
Just like forwarded (7239) header and forwardfor header, move parsing,
logic and management of 'originalto' option into http_ext dedicated class.
We're only doing this to standardize proxy http options management.
Existing behavior remains untouched.
Just like forwarded (7239) header, move parsing, logic and management
of 'forwardfor' option into http_ext dedicated class.
We're only doing this to standardize proxy http options management.
Existing behavior remains untouched.
Introducing http_ext class for http extension related work that
doesn't fit into existing http classes.
HTTP extension "forwarded", introduced with 7239 RFC is now supported
by haproxy.
The option supports various modes from simple to complex usages involving
custom sample expressions.
Examples :
# Those servers want the ip address and protocol of the client request
# Resulting header would look like this:
# forwarded: proto=http;for=127.0.0.1
backend www_default
mode http
option forwarded
#equivalent to: option forwarded proto for
# Those servers want the requested host and hashed client ip address
# as well as client source port (you should use seed for xxh32 if ensuring
# ip privacy is a concern)
# Resulting header would look like this:
# forwarded: host="haproxy.org";for="_000000007F2F367E:60138"
backend www_host
mode http
option forwarded host for-expr src,xxh32,hex for_port
# Those servers want custom data in host, for and by parameters
# Resulting header would look like this:
# forwarded: host="host.com";by=_haproxy;for="[::1]:10"
backend www_custom
mode http
option forwarded host-expr str(host.com) by-expr str(_haproxy) for for_port-expr int(10)
# Those servers want random 'for' obfuscated identifiers for request
# tracing purposes while protecting sensitive IP information
# Resulting header would look like this:
# forwarded: for=_000000002B1F4D63
backend www_for_hide
mode http
option forwarded for-expr rand,hex
By default (no argument provided), forwarded option will try to mimic
x-forward-for common setups (source client ip address + source protocol)
The option is not available for frontends.
no option forwarded is supported.
More info about 7239 RFC here: https://www.rfc-editor.org/rfc/rfc7239.html
More info about the feature in doc/configuration.txt
This should address feature request GH #575
Depends on:
- "MINOR: http_htx: add http_append_header() to append value to header"
- "MINOR: sample: add ARGC_OPT"
- "MINOR: proxy: introduce http only options"
This commit is innoffensive but will allow to do some code refactors in
existing proxy http options. Newly created http related proxy options
will also benefit from this.
Just like http_append_header(), but this time to insert new value before
an existing one.
If the header already contains one or multiple values, ',' is automatically
inserted after the new value.
Calling this function as an alternative to http_replace_header_value()
to append a new value to existing header instead of replacing the whole
header content.
If the header already contains one or multiple values: a ',' is automatically
appended before the new value.
This function is not meant for prepending (providing empty ctx value),
in which case we should consider implementing dedicated prepend
alternative function.
Till now pseudo headers were passed as const strings, but having them as
ISTs will be more convenient for traces. This doesn't change anything for
strings which are derived from them (and being constants they're still
zero-terminated).
TRACE_PRINTF() can be used to produce arbitrary trace contents at any
trace level. It uses the exact same arguments as other TRACE_* macros,
but here they are mandatory since they are followed by the format-string,
though they may be filled with zeroes. The reason for the arguments is to
match tracking or filtering and not pollute other non-inspected objects.
It will probably be used inside loops, in which case there are two points
to be careful about:
- output atomicity is only per-message, so competing threads may see
their messages interleaved. As such, it is recommended that the
caller places a recognizable unique context at the beginning of the
message such as a connection pointer.
- iterating over arrays or lists for all requests could be very
expensive. In order to avoid this it is best to condition the call
via TRACE_ENABLED() with the same arguments, which will return the
same decision.
- messages longer than TRACE_MAX_MSG-1 (1023 by default) will be
truncated.
For example, in order to dump the list of HTTP headers between hpack
and h2:
if (outlen > 0 &&
TRACE_ENABLED(TRACE_LEVEL_DEVELOPER,
H2_EV_RX_FRAME|H2_EV_RX_HDR, h2c->conn, 0, 0, 0)) {
int i;
for (i = 0; list[i].n.len; i++)
TRACE_PRINTF(TRACE_LEVEL_DEVELOPER, H2_EV_RX_FRAME|H2_EV_RX_HDR,
h2c->conn, 0, 0, 0, "h2c=%p hdr[%d]=%s:%s", h2c, i,
list[i].n.ptr, list[i].v.ptr);
}
In addition, a lower-level TRACE_PRINTF_LOC() macro is provided, that takes
two extra arguments, the caller's location and the caller's function name.
This will allow to emit composite traces from central functions on the
behalf of another one.
By default, passing a NULL cb to the trace functions will result in the
source's default one to be used. For some cases we won't want to use any
callback at all, not event the default one. Let's define a trace_no_cb()
function for this, that does absolutely nothing.
Sometimes it would be necessary to prepare some messages, pre-process some
blocks or maybe duplicate some contents before they vanish for the purpose
of tracing them. However we don't want to do that for everything that is
submitted to the traces, it's important to do it only for what will really
be traced.
The __trace() function has all the knowledge for this, to the point of even
checking the lockon pointers. This commit splits the function in two, one
with the trace decision logic, and the other one for the trace production.
The first one is now usable through wrappers such as _trace_enabled() and
TRACE_ENABLED() which will indicate whether traces are going to be produced
for the current source, level, event mask, parameters and tracking.
There are ifdefs at several places to only define TRC_ARGS_QCON when
QUIC is defined, but nothing prevents this code from building without.
Let's just remove those ifdefs, the single "if" they avoid is not worth
the extra maintenance burden.
As specified by HTTP3 RFC, SETTINGS frame should be sent as soon as
possible. Before this patch, this was only done on the first qc_send()
invocation. This delay significantly SETTINGS emission until the first
H3 response is ready to be transferred.
This patch fixes this by ensuring SETTINGS is emitted when MUX-QUIC is
being setup.
As a side point, return value of finalize operation is checked. This
means that an error during SETTINGS emission will cause the connection
init to fail.
This should be backported up to 2.7.
thread_harmless_end() needs to wait for rdv_requests to disappear so
that we're certain to respect a harmless promise that possibly allowed
another thread to proceed under isolation. But this doesn't work in a
signal handler because a thread could be interrupted by the debug
handler while already waiting for isolation and with rdv_request>0.
As such this function could cause a deadlock in such a signal handler.
Let's implement a specific variant for this, thread_harmless_end_sig(),
that just resets the thread's bit and doesn't wait. It must of course
not be used past a check point that would allow the isolation requester
to return and see the thread as temporarily harmless then turning back
on its promise.
This will be needed to fix a race in the debug handler.
A few loops waiting for threads to synchronize such as thread_isolate()
rightfully filter the thread masks via the threads_enabled field that
contains the list of enabled threads. However, it doesn't use an atomic
load on it. Before 2.7, the equivalent variables were marked as volatile
and were always reloaded. In 2.7 they're fields in ha_tgroup_ctx[], and
the risk that the compiler keeps them in a register inside a loop is not
null at all. In practice when ha_thread_relax() calls sched_yield() or
an x86 PAUSE instruction, it could be verified that the variable is
always reloaded. If these are avoided (e.g. architecture providing
neither solution), it's visible in asm code that the variables are not
reloaded. In this case, if a thread exists just between the moment the
two values are read, the loop could spin forever.
This patch adds the required _HA_ATOMIC_LOAD() on the relevant
threads_enabled fields. It must be backported to 2.7.
Slighty adjust b_quic_enc_int(). This function is used to encode an
integer as a QUIC varint in a struct buffer.
A new parameter is added to the function API to specify the width of the
encoded integer. By default, 0 should be use to ensure that the minimum
space is used. Other valid values are 1, 2, 4 or 8. An error is reported
if the width is not large enough.
This new parameter will be useful when buffer space is reserved prior to
encode an unknown integer value. The maximum size of 8 bytes will be
reserved and some data can be put after. When finally encoding the
integer, the width can be requested to be 8 bytes.
With this new parameter, a small refactoring of the function has been
conducted to remove some useless internal variables.
This should be backported up to 2.7. It will be mostly useful to
implement H3 trailers encoding.
This function was introduced in OpenSSL 1.1.0. Prior to that, the
ECDSA_SIG structure was public.
This function was used in commit 5a8f02ae "BUG/MEDIUM: jwt: Properly
process ecdsa signatures (concatenated R and S params)".
This patch needs to be backported up to branch 2.5 alongside commit
5a8f02ae.
Commit 5a8f02ae6 ("BUG/MEDIUM: jwt: Properly process ecdsa signatures
(concatenated R and S params)") makes use of ECDSA_SIG_set0() which only
appeared in openssl-1.1.0 and libressl 2.7, and breaks the build before.
Let's just do what it minimally does (only assigns the two fields to the
destination).
This will need to be backported where the commit above is, likely 2.5.
Add "no-quic" to "global" section to disable the use of QUIC transport protocol
by all configured QUIC listeners. This is listeners with QUIC addresses on their
"bind" lines. Internally, the socket addresses binding is skipped by
protocol_bind_all() for receivers with <proto_quic4> or <proto_quic6> as
protocol (see protocol struct).
Add information about "no-quic" global option to the documentation.
Must be backported to 2.7.
In se_fl_set_error() we used to switch to SE_FL_ERROR only when there
is already SE_FL_EOS indicating that the read side is closed. But that
is not sufficient, we need to consider all cases where no more reads
will be performed on the connection, and as such also include SE_FL_EOI.
Without this, some aborted connections during a transfer sometimes only
stop after the timeout, because the ERR_PENDING is never promoted to
ERROR.
This must be backported to 2.7 and requires previous patch "CLEANUP:
stconn: always use se_fl_set_error() to set the pending error".
When the payload length cannot be determined, the htx extra field is set to
the magical vlaue ULLONG_MAX. It is not obvious. This a dedicated HTX value
is now used. Now, HTX_UNKOWN_PAYLOAD_LENGTH must be used in this case,
instead of ULLONG_MAX.
There is already a function to set termination flags but it is not well
suited for HTTP streams. So a function, dedicated to the HTTP analysis, was
added. This way, this new function will be called for HTTP analysers on
error. And if the error is not caugth at this stage, the generic function
will still be called from process_stream().
Here, by default a PRXCOND error is reported and depending on the stream
state, the reson will be set accordingly:
* If the backend SC is in INI state, SF_FINST_T is reported on tarpit and
SF_FINST_R otherwise.
* SF_FINST_Q is the server connection is queued
* SF_FINST_C in any connection attempt state (REQ/TAR/ASS/CONN/CER/RDY).
Except for applets, a SF_FINST_R is reported.
* Once the server connection is established, SF_FINST_H is reported while
HTTP_MSG_DATA state on the response side.
* SF_FINST_L is reported if the response is in HTTP_MSG_DONE state or
higher and a client error/timeout was reported.
* Otherwise SF_FINST_D is reported.
During multiple tests we've already noticed that shared stats counters
have become a real bottleneck under large thread counts. With QUIC it's
pretty visible, with qc_snd_buf() taking 2.5% of the CPU on a 48-thread
machine at only 25 Gbps, and this CPU is entirely spent in the atomic
increment of the byte count and byte rate. It's also visible in H1/H2
but slightly less since we're working with larger buffers, hence less
frequent updates. These counters are exclusively used to report the
byte count in "show info" and the byte rate in the stats.
Let's move them to the thread_ctx struct and make the stats reader
just collect each thread's stats when requested. That's way more
efficient than competing on a single cache line.
After this, qc_snd_buf has totally disappeared from the perf profile
and tests made in h1 show roughly 1% performance increase on small
objects.
When a STOP_SENDING or RESET_STREAM must be send, its corresponding qcs
is inserted into <qcc.send_list> via qcc_reset_stream() or
qcc_abort_stream_read().
This allows to remove the iteration on full qcs tree in qc_send().
Instead, STOP_SENDING and RESET_STREAM is done in the loop over
<qcc.send_list> as with STREAM frames. This should improve slightly the
performance, most notably when large number of streams are opened.
This must be backported up to 2.7.
Complete qcc_send_stream() function to allow to specify if the stream
should be handled in priority. Internally this will insert the qcs
instance in front of <qcc.send_list> to be able to treat it before other
streams.
This functionality is useful when some QUIC streams should be sent
before others. Most notably, this is used to guarantee that H3 SETTINGS
is done first via the control stream.
This must be backported up to 2.7.
Implement a mechanism to register streams ready to send data in new
STREAM frames. Internally, this is implemented with a new list
<qcc.send_list> which contains qcs instances.
A qcs can be registered safely using the new function qcc_send_stream().
This is done automatically in qc_send_buf() which covers most cases.
Also, application layer is free to use it for internal usage streams.
This is currently the case for H3 control stream with SETTINGS sending.
The main point of this patch is to handle stream sending fairly. This is
in stark contrast with previous code where streams with lower ID were
always prioritized. This could cause other streams to be indefinitely
blocked behind a stream which has a lot of data to transfer. Now,
streams are handled in an order scheduled by se_desc layer.
This commit is the first one of a serie which will bring other
improvments which also relied on the send_list implementation.
This must be backported up to 2.7 when deemed sufficiently stable.
In applets, we stop processing when a write error (CF_WRITE_ERROR) or a shutdown
for writes (CF_SHUTW) is detected. However, any write error leads to an
immediate shutdown for writes. Thus, it is enough to only test if CF_SHUTW is
set.
When a read error (CF_READ_ERROR) is reported, a shutdown for reads is
always performed (CF_SHUTR). Thus, there is no reason to check if
CF_READ_ERROR is set if CF_SHUTR is also checked.
CF_READ_ATTACHED flag is only used in input events for stream analyzers,
CF_MASK_ANALYSER. A read event can be reported instead and this flag can be
removed. We must only take care to report a read event when the client
connection is upgraded from TCP to HTTP.
It appears CF_ANA_TIMEOUT is flag only used in CF_MASK_ANALYSER. All
analyzer timeout relies on the analysis expiration date (chn->analyse_exp).
Worst, once set, this flag is never removed. Thus this flag can be removed
and replaced by a read event (CF_READ_EVENT).
Thanks to previous changes, CF_WRITE_ACTIVITY flags can be removed.
Everywhere it was used, its value is now directly used
(CF_WRITE_EVENT|CF_WRITE_ERROR).
Thanks to previous changes, CF_READ_ACTIVITY flags can be removed.
Everywhere it was used, its value is now directly used
(CF_READ_EVENT|CF_READ_ERROR).
Just like CF_READ_PARTIAL, CF_WRITE_PARTIAL is now merged with
CF_WRITE_EVENT. There a subtlety in sc_notify(). The "connect" event
(formely CF_WRITE_NULL) is now detected with
(CF_WRITE_EVENT + sc->state < SC_ST_EST).
CF_READ_PARTIAL flag is now merged with CF_READ_EVENT. It means
CF_READ_EVENT is set when a read0 is received (formely CF_READ_NULL) or when
data are received (formely CF_READ_ACTIVITY).
There is nothing special here, except conditions to wake the stream up in
sc_notify(). Indeed, the test was a bit changed to reflect recent
change. read0 event is now formalized by (CF_READ_EVENT + CF_SHUTR).
As for CF_READ_NULL, it appears CF_WRITE_NULL and other write events on a
channel are mainly used to wake up the stream and may be replace by on write
event.
In this patch, we introduce CF_WRITE_EVENT flag as a replacement to
CF_WRITE_EVENT_NULL. There is no breaking change for now, it is just a
rename. Gradually, other write events will be merged with this one.
CF_READ_NULL flag is not really useful and used. It is a transient event
used to wakeup the stream. As we will see, all read events on a channel may
be resumed to only one and are all used to wake up the stream.
In this patch, we introduce CF_READ_EVENT flag as a replacement to
CF_READ_NULL. There is no breaking change for now, it is just a
rename. Gradually, other read events will be merged with this one.
This action increments the General Purpose Counter at the index <idx> of
the array associated to the sticky counter designated by <sc-id> by the
value of either integer <int> or the integer evaluation of expression
<expr>. Integers and expressions are limited to unsigned 32-bit values.
If an error occurs, this action silently fails and the actions evaluation
continues. <idx> is an integer between 0 and 99 and <sc-id> is an integer
between 0 and 2. It also silently fails if the there is no GPC stored at
this index. The entry in the table is refreshed even if the value is zero.
The 'gpc_rate' is automatically adjusted to reflect the average growth
rate of the gpc value.
The main use of this action is to count scores or total volumes (e.g.
estimated danger per source IP reported by the server or a WAF, total
uploaded bytes, etc).
The number of stick-counter entries usable by track-sc rules is currently
set at build time. There is no good value for this since the vast majority
of users don't need any, most need only a few and rare users need more.
Adding more counters for everyone increases memory and CPU usages for no
reason.
This patch moves the per-session and per-stream arrays to a pool of a size
defined at boot time. This way it becomes possible to set the number of
entries at boot time via a new global setting "tune.stick-counters" that
sets the limit for the whole process. When not set, the MAX_SESS_STR_CTR
value still applies, or 3 if not set, as before.
It is also possible to lower the value to 0 to save a bit of memory if
not used at all.
Note that a few low-level sample-fetch functions had to be protected due
to the ability to use sample-fetches in the global section to set some
variables.
There is a bug in b_slow_realign() function when wrapping output data are
copied in the swap buffer. block1 and block2 sizes are inverted. Thus blocks
with a wrong size are copied. It leads to data mixin if the first block is
in reality larger than the second one or to a copy of data outside the
buffer is the first block is smaller than the second one.
The bug was introduced when the buffer API was refactored in 1.9. It was
found by a code review and seems never to have been triggered in almost 5
years. However, we cannot exclude it is responsible of some unresolved bugs.
This patch should fix issue #1978. It must be backported as far as 2.0.
Due to the previous SSL exception we coudln't restrict the collected
CFLAGS/LDFLAGS to those of enabled options, so all of them were
considered if set. The problem is that it would prevent simply
disabling a build option without unsetting its xxx_CFLAGS or _LDFLAGS
values if those had incompatible values (e.g. -lfoo).
Now that only existing options are listed in collect_opts_flags, we
can safely check that the option is set and only consider its settings
in this case. Thus OT_LDFLAGS will not be used if USE_OT is not set
for example.
By creating USE_SSL and enabling it when USE_OPENSSL is set, we can
get rid of the special case that was made with it regarding cflags
collect and when resetting options. The option doesn't need to be
manually set, though in the future it might prove useful if other
non-openssl API are supported.
It's getting complicated to configure includes and lib dirs for
OpenSSL API variants such as WolfSSL, because some settings are
common and others are specific but carry a prefix that doesn't
match the USE_* rule scheme.
This patch simplifies everything by considering that all SSL libs
will use SSL_INC, SSL_LIB, SSL_CFLAGS and SSL_LDFLAGS. That's much
more convenient. This works thanks to the settings collector which
explicitly checks the SSL_* settings. When USE_OPENSSL_WOLFSSL is
set, then USE_OPENSSL is implied, so that there's no need to
duplicate maintenance effort.
The new function collect_opts_flags now scans all USE_* options defined
in use_opts and appends the corresponding *_CFLAGS and *_LDFLAGS to
OPTIONS_{C,LD}FLAGS respectively. This will be useful to get rid of all
the individual concatenations to these variables.
A lot of _SRC, _INC, _LIB etc variables are set and expected to be
initialized to an empty string by default. However, an in-depth
review of all of them showed that WOLFSSL_{INC,LIB}, SSL_{INC,LIB},
LUA_{INC,LIB}, and maybe others were not always initialized and could
sometimes leak from the environment and as such cause strange build
issues when running from cascaded scripts that had exported them.
The approach taken here consists in iterating over all USE_* options
and unsetting any _SRC, _INC, _LIB, _CFLAGS and _LDFLAGS that follows
the same name. For the few variable names options that don't exactly
match the build option (SSL & WOLFSSL), these ones are specifically
added to the list. The few that were explicitly cleared in their own
sections were just removed since not needed anymore. Note that an
"undefine" command appeared in GNU make 3.82 but since we support
older ones we can only initialize the variables to an empty string
here. It's not a problem in practice.
We're now certain that these variables are empty wherever they are
used, and that it is possible to just append to them, or use them
as-is.
Some macros and functions are barely understandable and are only used
to iterate over known options from the use_opts list. Better assign
them a name and move them into a dedicated file to clean the makefile
a little bit. Now at least "use_opts" only appears once, where it is
defined. This also allowed to completely remove the BUILD_FEATURES
macro that caused some confusion until previous commit.
Implement STOP_SENDING. This is divided in two main functions :
* qcc_abort_stream_read() which can be used by application protocol to
request for a STOP_SENDING. This set the flag QC_SF_READ_ABORTED.
* qcs_send_reset() is a static function called after the preceding one.
It will send a STOP_SENDING via qcc_send().
QC_SF_READ_ABORTED flag is now properly used : if activated on a stream
during qcc_recv(), <qcc.app_ops.decode_qcs> callback is skipped. Also,
abort reading on unknown unidirection remote stream is now fully
supported with the emission of a STOP_SENDING as specified by RFC 9000.
This commit is part of implementing H3 errors at the stream level. This
will allows the H3 layer to request the peer to close its endpoint for
an error on a stream.
This should be backported up to 2.7.
Implement RESET_STREAM reception by mux-quic. On reception, qcs instance
will be mark as remotely closed and its Rx buffer released. The stream
layer will be flagged on error if still attached.
This commit is part of implementing H3 errors at the stream level.
Indeed, on H3 stream errors, STOP_SENDING + RESET_STREAM should be
emitted. The STOP_SENDING will in turn generate a RESET_STREAM by the
remote peer which will be handled thanks to this patch.
This should be backported up to 2.7.
Implement mux_ops shutw operation for QUIC mux. A RESET_STREAM is
emitted unless the stream is already closed due to all data or
RESET_STREAM already transmitted.
This operation is notably useful when upper stream layer wants to close
the connection early due to an error.
This was tested by using a HTTP server which listens with PROXY protocol
support. The corresponding server line on haproxy configuration
deliberately not specify send-proxy. This causes the server to close
abruptly the connection. Without this patch, nothing was done on the QUIC
stream which was kept open until the whole connection is closed. Now, a
proper RESET_STREAM is emitted to report the error.
This should be backported up to 2.7.
The same change was already performed for the cli. The stats applet and the
prometheus exporter are also concerned. Both use the stats API and rely on
pool functions to get total pool usage in bytes. pool_total_allocated() and
pool_total_used() must return 64 bits unsigned integer to avoid any wrapping
around 4G.
This may be backported to all versions.
This patch contains the main function of the ocsp auto update mechanism
as well as an init and destroy function of the task used for this.
The task is not created in this patch but in a later one.
The function has two distinct parts and the branching to one or the
other is completely based on the fact that the cur_ocsp pointer of the
ssl_ocsp_task_ctx member is set.
If the pointer is not set, we need to look at the first item of the
update tree and see if it needs to be updated. If it does not we simply
wait until the time is right and let the task asleep. If it does need to
be updated, we simply build and send the corresponding ocsp request
thanks to the http_client. The task is then sent to sleep with an expire
time set to infinity. The http_client will wake it back up once the
response is received (or a timeout occurs). Just note that during this
whole process the cetificate_ocsp object corresponding to the entry
being updated is taken out of the update tree and only stored in the
ssl_ocsp_task_ctx context.
Once the task is waken up by the http_client, it branches on the
response processing part of the function which basically checks that the
response is valid and inserts it into the ocsp_response tree. The task
then goes back to sleep until another entry needs to be updated.
The 'ocsp-update' option is parsed at the same time as all the other
bind line options but it does not actually have anything to do with the
bind line since it concerns the frontend certificate instead. For that
reason, we should have a mean to identify inconsistencies in the
configuration and raise an error when a given certificate has two
different ocsp-update modes specified in one or more crt-lists.
The simplest way to do it is to store the ocsp update mode directly in
the ckch and not only in the ssl_bind_conf.
This option will define how the ocsp update mechanism behaves. The
option can either be set to 'on' or 'off' and can only be specified in a
crt-list entry so that we ensure that it concerns a single certificate.
The 'off' mode is the default one and corresponds to the old behavior
(no automatic update).
When the option is set to 'on', we will try to get an ocsp response
whenever an ocsp uri can be found in the frontend's certificate. The
only limitation of this mode is that the certificate's issuer will have
to be known in order for the OCSP certid to be built.
This patch only adds the parsing of the option. The full functionality
will come in a later commit.
The ocsp_response member of the cert_key_and_chain structure is only
used temporarily. During a standard init process where an ocsp response
is provided, this ocsp file is first copied into the ocsp_response
buffer without any ocsp-related parsing (see
ssl_sock_load_ocsp_response_from_file), and then the contents are
actually interpreted and inserted into the actual ocsp tree
(cert_ocsp_tree) later in the process (see ssl_sock_load_ocsp). If the
response was deemed valid, it is then copied into the actual
ocsp_response structure's 'response' field (see
ssl_sock_load_ocsp_response). From this point, the ocsp_response field
of the cert_key_and_chain object could be discarded since actual ocsp
operations will be based of the certificate_ocsp object.
The only remaining runtime use of the ckch's ocsp_response field was in
the CLI, and more precisely in the 'show ssl cert' mechanism.
This constraint could be removed by adding an OCSP_CERTID directly in
the ckch because the buffer was only used to get this id.
This patch then adds the OCSP_CERTID pointer in the ckch, it clears the
ocsp_response buffer early and simplifies the ckch_store_build_certid
function.
This helper function will check that an OCSP response is valid, meaning
that the proper "Content-Type: application/ocsp-response" header is
present and the data itself is a proper OCSP_RESPONSE that can be
checked thanks to the issuer certificate.
The tree that contains OCSP responses is never locked despite being used
at runtime for OCSP stapling as well as the CLI through "set ssl cert"
and "set ssl ocsp-response" commands.
Everything works though because the certificate_ocsp structure is
refcounted and the tree's entries are cleaned up when SSL_CTXs are
destroyed (thanks to an ex_data entry in which the certificate_ocsp
pointer is stored).
This new lock will come to use when the OCSP auto update mechanism is
fully implemented because this new feature will be based on another tree
that stores the same certificate_ocsp members and updates their contents
periodically.
Shards were completely forgotten in commit f5a0c8abf ("MEDIUM: quic:
respect the threads assigned to a bind line"). The thread mask is
taken from the bind_conf, but since shards were introduced in 2.5,
the per-listener mask is held by the receiver and can be smaller
than the bind_conf's mask.
The effect here is that the traffic is not distributed to the
appropriate thread. At first glance it's not dramatic since it remains
one of the threads eligible by the bind_conf, but it still means that
in some contexts such as "shards by-thread", some concurrency may
persist on listeners while they're expected to be alone. One identified
impact is that it requires more rxbufs than necessary, but there may
possibly be other not yet identified side effects.
This must be backported to 2.7 and everywhere the commit above is
backported.
qcs instances for bidirectional streams are inserted in
<qcc.opening_list>. It is removed from the list once a full HTTP request
has been parsed. This is required to implement http-request timeout.
In case a stream is deleted before receiving full HTTP request, it also
must be removed from <qcc.opening_list>. This was not the case on first
implementation but has been fixed by the following patch :
641a65ff3c
BUG/MINOR: mux-quic: remove qcs from opening-list on free
This means that now a stream can be deleted from the list in two
different functions. Sadly, as LIST_DELETE was used in both cases,
nothing prevented a double-deletion from the list, even though
LIST_INLIST was used. Both calls are replaced with LIST_DEL_INIT which
is idempotent.
This bug causes memory corruption which results in most cases in a
segfault, most of times outside of mux-quic code itself. It has been
found first by gabrieltz who reported it on the github issue #1903. Big
thanks to him for his testing.
This bug also causes failures on several 'M' transfer testcase of QUIC
interop-runner. The s2n-quic client is particularly useful in this case
as segfaults triggers were most of the times on the LIST_DELETE
operation itself. This is probably due to its encapsulating of HEADERS
frame with fin bit delayed in a following empty STREAM frame.
This must be backported wherever the above patch is, up to 2.6.
Some uses of swrate_add() only consist in getting a rough estimate of
a frequency. There are cases where speed matters more than accuracy
(e.g. pools). For such use cases, let's just stop looping on the CAS,
if the update fails, another thread is already providing input, and
it's not dramatic to lose the race. All these functions are now
suffixed with "_opportunistic".
Till now it was only possible to change the thread local hot cache size
at build time using CONFIG_HAP_POOL_CACHE_SIZE. But along benchmarks it
was sometimes noticed a huge contention in the lower level memory
allocators indicating that larger caches could be beneficial, especially
on machines with large L2 CPUs.
Given that the checks against this value was no longer on a hot path
anymore, there was no reason for continuing to force it to be tuned at
build time. So this patch allows to set it by tune.memory-hot-size.
It's worth noting that during the boot phase the value remains zero so
that it's possible to know if the value was set or not, which opens the
possibility that we try to automatically adjust it based on the per-cpu
L2 cache size or the use of certain protocols (none of this is done yet).
Performance profiling on a 48-thread machine showed a lot of time spent
in pool_free(), precisely at the point where pool->limit was retrieved.
And the reason is simple. Some parts of the pool_head are heavily updated
only when facing a cache miss ("allocated", "used", "needed_avg"), while
others are always accessed (limit, flags, size). The fact that both
entries were stored into the same cache line makes it very difficult for
each thread to access these precious info even when working with its own
cache.
By just splitting the fields apart, a test on QUIC (which stresses pools
a lot) more than doubled performance from 42 Gbps to 96 Gbps!
Given that the patch only reorders fields and addresses such a significant
contention, it should be backported to 2.7 and 2.6.
peers-t.h uses "struct stktable" as well as STKTABLE_DATA_TYPES which
are defined in stick-table-t.h. It works by accident because
stick-table-t.h was always included before. But could provoke build
issue with EXTRA code.
To be backported as far as 2.2.
Add a new value in stats ctx: field.
Implement field support in line dumping parent functions
stats_print_proxy_field_json() and stats_dump_proxy_to_buffer().
This will allow child dumping functions to support partial line dumping
when needed. ie: when dumping buffer is exhausted: do a partial send and
wait for a new buffer to finish the dump. Thanks to field ctx, the function
can start dumping where it left off on previous (unterminated) invokation.
Extract function h2_parse_cont_len_header() in the generic HTTP module.
This allows to reuse it for all HTTP/x parsers. The function is now
available as http_parse_cont_len_header().
Most notably, this will be reused in the next bugfix for the H3 parser.
This is necessary to check that content-length header match the length
of DATA frames.
Thus, it must be backported to 2.6.
qc_new_conn() is used to allocate a quic_conn instance and its various
internal members. If one allocation fails, quic_conn_release() is used
to cleanup things.
For the moment, pool_zalloc() is used which ensures that all content is
null. However, some members must be initialized to a special values
to be able to use quic_conn_release() safely. This is the case for
quic_conn lists and its tasklet.
Also, some quic_conn internal allocation functions were doing their own
cleanup on failure without reset to NULL. This caused an issue with
quic_conn_release() which also frees this members. To fix this, these
functions now only return an error without cleanup. It is the caller
responsibility to free the allocated content, which is done via
quic_conn_release().
Without this patch, allocation failure in qc_new_conn() would often
result in segfault. This was reproduced easily using fail-alloc at 10%.
This should be backported up to 2.6.
This patch introduces haproxy_backend_agg_check_status metric
as we wanted in 42d7c402d but with the right data source.
This patch could be backported as far as 2.4.
haproxy_backend_agg_server_check_status currently aggregates
haproxy_server_status instead of haproxy_server_check_status.
We deprecate this and create a new one,
haproxy_backend_agg_server_status to clarify what it really
does.
This patch could be backported as far as 2.4.
Since the massive pools cleanup that happened in 2.6, the pools
architecture was made quite more hierarchical and many alternate code
blocks could be moved to runtime flags set by -dM. One of them had not
been converted by then, DEBUG_UAF. It's not much more difficult actually,
since it only acts on a pair of functions indirection on the slow path
(OS-level allocator) and a default setting for the cache activation.
This patch adds the "uaf" setting to the options permitted in -dM so
that it now becomes possible to set or unset UAF at boot time without
recompiling. This is particularly convenient, because every 3 months on
average, developers ask a user to recompile haproxy with DEBUG_UAF to
understand a bug. Now it will not be needed anymore, instead the user
will only have to disable pools and enable uaf using -dMuaf. Note that
-dMuaf only disables previously enabled pools, but it remains possible
to re-enable caching by specifying the cache after, like -dMuaf,cache.
A few tests with this mode show that it can be an interesting combination
which catches significantly less UAF but will do so with much less
overhead, so it might be compatible with some high-traffic deployments.
The change is very small and isolated. It could be helpful to backport
this at least to 2.7 once confirmed not to cause build issues on exotic
systems, and even to 2.6 a bit later as this has proven to be useful
over time, and could be even more if it did not require a rebuild. If
a backport is desired, the following patches are needed as well:
CLEANUP: pools: move the write before free to the uaf-only function
CLEANUP: pool: only include pool-os from pool.c not pool.h
REORG: pool: move all the OS specific code to pool-os.h
CLEANUP: pools: get rid of CONFIG_HAP_POOLS
DEBUG: pool: show a few examples in -dMhelp
This one was set in defaults.h only when neither DEBUG_NO_POOLS nor
DEBUG_UAF were set. This was not the most convenient location to look
for it, and it was only used in pool.c to decide on the initial value
of POOL_DBG_NO_CACHE.
Let's just use DEBUG_NO_POOLS || DEBUG_UAF directly on this flag and
get rid of the intermediary condition. This also has the benefit of
removing a double inversion, which is always nice for understanding.
Till now pool-os used to contain a mapping from pool_{alloc,free}_area()
to pool_{alloc,free}_area_uaf() in case of DEBUG_UAF, or the regular
malloc-based function. And the *_uaf() functions were in pool.c. But
since 2.4 with the first cleanup of the pools, there has been no more
calls to pool_{alloc,free}_area() from anywhere but pool.c, from exactly
one place each. As such, there's no more need to keep *_uaf() apart in
pool.c, we can inline it into pool-os.h and leave all the OS stuff there,
with pool.c calling either based on DEBUG_UAF. This is cleaner with less
round trips between both files and easier to find.
There's no need for the low-level pool functions to be known from all
callers anymore, they're only used by pool.c. Let's reduce the amount
of header files processed.
We get a build error in ncbuf.c when building for ARMv8.2-a because ncbuf
has minimal includes and among them bug.h which includes atomic.h. Atomic.h
may use "forceinline" without including compiler.h, hence the build error.
It was verified that adding it doesn't inflate the total headers.
Since all other C files include api.h which already covers this, there's
no real need to bapkport this. The issue was already there in 2.3 though.
With previous commit, 9e080bf ("BUG/MINOR: checks: make sure fastinter is used
even on forced transitions"), on-error mark-down|sudden-death|fail-check are
now working as expected.
However, on-error fastinter remains broken because srv_getinter(), used in
the above commit to check the expiration date, won't return fastinter interval
if server health is maxed out (which is the case with on-error fastinter mode).
To fix this, we introduce a check flag named CHK_ST_FASTINTER.
This flag is set when on-error is triggered. This way we can force
srv_getinter() to return fastinter interval whenever the flag is set.
The flag is automatically cleared as soon as the new check task expiry is
recalculated in process_chk_conn().
This restores original behavior prior to d114f4a ("MEDIUM: checks: spread the
checks load over random threads").
It must be backported to 2.7 along with the aforementioned commits.
We're using srv_update_status() as the only event source or UP/DOWN server
events in an attempt to simplify the support for these 2 events.
It seems srv_update_status() is the common path for server state changes anyway
Tested with server state updated from various sources:
- the cli
- server-state file (maybe we could disable this or at least don't publish
in global event queue in the future if it ends in slower startup for setups
relying on huge server state files)
- dns records (ie: srv template)
(again, could be fined tuned to only publish in server specific subscriber
list and no longer in global subscription list if mass dns update tend to
slow down srv_update_status())
- normal checks and observe checks (HCHK_STATUS_HANA)
(same as above, if checks related state update storms are expected)
- lua scripts
- html stats page (admin mode)
Basic support for ADD and DEL server events are added through this commit:
SERVER_ADD is published on dynamic server addition through cli.
SERVER_DEL is published on dynamic server deletion through cli.
This work depends on:
"MINOR: event_hdl: add event handler base api"
"MINOR: server: add srv->rid (revision id) value"
Make use of the new srv->rid value in stats.
Stat is referred as ST_F_SRID, it is now used in stats_fill_sv_stats
function in order to be included in csv and json stats dumps.
Moreover, "rid: $value" will be displayed next to server puid
in html stats page if "stats show-legend" is specified in the stats frontend.
(mouse hovering tooltip)
Depends on the following commit:
"MINOR: server: add srv->rid (revision id) value"
With current design, we could not distinguish between
previously existing deleted server and a new server reusing
the deleted server name/id.
This can cause some confusion when auditing stats/events/logs,
because the new server will look similar to the old
one.
To address this, we're adding a new value in server structure: rid
rid (revision id) value is an unsigned 32bits value that is set upon
server creation. Value is derived from a global counter that starts
at 0 and is incremented each time one or multiple server deletions are
followed by a server addition (meaning that old name/id reuse could occur).
Thanks to this revision id, it is now easy to tell whether the server
we're looking at is the same as before or if it has been deleted and
re-added in the meantime.
(combining server name/id + server revision id yields a process-wide unique
identifier)
UDP addresses may change over time for a QUIC connection. When using
quic-conn owned socket, we have to detect address change to break the
bind/connect association on the socket.
For the moment, on change detected, QUIC connection socket is closed and
a new one is opened. In the future, we may improve this by trying to
keep the original socket and reexecute only bind/connect syscalls.
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
This change is the second part for reception on QUIC connection socket.
All operations inside the FD handler has been delayed to quic-conn
tasklet via the new function qc_rcv_buf().
With this change, buffer management on reception has been simplified. It
is now possible to use a local buffer inside qc_rcv_buf() instead of
quic_receiver_buf().
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
Try to use the quic-conn socket for reception if it is allocated. For
this, the socket is inserted in the fdtab. This will call the new
handler quic_conn_io_cb() which is responsible to process the recv()
system call. It will reuse datagram dispatch for simplicity. However,
this is guaranteed to be called on the quic-conn thread, so it will be
more efficient to use a dedicated buffer. This will be implemented in
another commit.
This patch should improve performance by reducing contention on the
receiver socket. However, more gain can be obtained when the datagram
dispatch operation will be skipped.
Older quic_sock_fd_iocb() is renamed to quic_lstnr_sock_fd_iocb() to
emphasize its usage for the receiver socket.
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
Allocate quic-conn owned socket if possible. This requires that this is
activated in haproxy configuration. Also, this is done only if local
address is known so it depends on the support of IP_PKTINFO.
For the moment this socket is not used. This causes QUIC support to be
broken as received datagram are not read. This commit will be completed
by a following patch to support recv operation on the newly allocated
socket.
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
To be able to use individual sockets for QUIC connections, we rely on
the OS network stack which must support UDP sockets binding on the same
local address.
Add a detection code for this feature executed on startup. When the
first QUIC listener socket is binded, a test socket is created and
binded on the same address. If the bind call fails, we consider that
it's impossible to use individual socket for QUIC connections.
A new global option GTUNE_QUIC_SOCK_PER_CONN is defined. If startup
detect fails, this value is resetted from global options. For the
moment, there is no code to activate the option : this will be in a
follow-up patch with the introduction of a new configuration option.
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
Detect connection migration attempted by the client. This is done by
comparing addresses stored in quic-conn with src/dest addresses of the
UDP datagram.
A new function qc_handle_conn_migration() has been added. For the
moment, no operation is conducted and the function will be completed
during connection migration implementation. The only notable things is
the increment of a new counter "quic_conn_migration_done".
This should be backported up to 2.7.
Complete ipcmp() function with a new argument <check_port>. If this
argument is true, the function will compare port values besides IP
addresses and return true only if both are identical.
This commit will simplify QUIC connection migration detection. As such,
it should be backported to 2.7.
Extract individual datagram parsing code outside of datagrams list loop
in quic_lstnr_dghdlr(). This is moved in a new function named
quic_dgram_parse().
To complete this change, quic_lstnr_dghdlr() has been moved into
quic_sock source file : it belongs to QUIC socket lower layer and is
directly called by quic_sock_fd_iocb().
This commit will ease implementation of quic-conn owned socket.
New function quic_dgram_parse() will be easily usable after a receive
operation done on quic-conn IO-cb.
This should be backported up to 2.7.
quic_rx_packet struct had a reference to the quic_conn instance. This is
useless as qc instance is always passed through function argument. In
fact, pkt.qc is used only in qc_pkt_decrypt() on key update, even though
qc is also passed as argument.
Simplify this by removing qc field from quic_rx_packet structure
definition. Also clean up qc_pkt_decrypt() documentation and interface
to align it with other quic-conn related functions.
This should be backported up to 2.7.
Rename the structure "cert_key_and_chain" to "ckch_data" in order to
avoid confusion with the store whcih often called "ckchs".
The "cert_key_and_chain *ckch" were renamed "ckch_data *data", so we now
have store->data instead of ckchs->ckch.
Marked medium because it changes the API.
Adding base code to provide subscribe/publish API for internal
events processing.
event_hdl provides two complementary APIs, both are implemented
in src/event_hdl.c and include/haproxy/event_hdl{-t.h,.h}:
One API targeting developers that want to register event handlers
that will be notified on specific events.
(SUBSCRIBE)
One API targeting developers that want to notify registered handlers
about an event.
(PUBLISH)
This feature is being considered to address the following scenarios:
- mailers code refactoring (getting rid of deprecated
tcp-check ruleset implementation)
- server events from lua code (registering user defined
lua function that is executed with relevant data when a
server is dynamically added/removed or on server state change)
- providing a stable and easy to use API for upcoming
developments that rely on specific events to perform actions.
(e.g: ressource cleanup when a server is deleted from haproxy)
At this time though, we don't have much use cases in mind in addition to
server events handling, but the API is aimed at being multipurpose
so that new event families, with their own particularities, can be
easily implemented afterwards (and hopefully) without requiring breaking
changes to the API.
Moreover, you should know that the API was not designed to cope well
with high rate event publishing.
Mostly because publishing means iterating over unsorted subscriber list.
So it won't scale well as subscriber list increases, but it is intended in
order to keep the code simple and versatile.
Instead, it is assumed that events implemented using this API
should be periodic events, and that events related to critical
io/networking processing should be handled using
dedicated facilities anyway.
(After all, this is meant to be a general purpose event API)
Apart from being easily extensible, one of the main goals of this API is
to make subscriber code as simple and safe as possible.
This is done by offering multiple event handling modes:
- SYNC mode:
publishing code directly
leverages handler code (callback function)
and handler code has a direct access to "live" event data
(pointers mostly, alongside with lock hints/context
so that accessing data pointers can be done properly)
- normal ASYNC mode:
handler is executed in a backward compatible way with sync mode,
so that it is easy to switch from and to SYNC/ASYNC mode.
Only here the handler has access to "offline" event data, and
not "live" data (ptrs) so that data consistency is guaranteed.
By offline, you should understand "snapshot" of relevant data
at the time of the event, so that the handler can consume it
later (even if associated ressource is not valid anymore)
- advanced ASYNC mode
same as normal ASYNC mode, but here handler is not a function
that is executed with event data passed as argument: handler is a
user defined tasklet that is notified when event occurs.
The tasklet may consume pending events and associated data
through its own message queue.
ASYNC mode should be considered first if you don't rely on live event
data and you wan't to make sure that your code has the lowest impact
possible on publisher code. (ie: you don't want to break stuff)
Internal API documentation will follow:
You will find more details about the notions we roughly approached here.
WolfSSL does not implement the TLS1_3_CK_AES_128_CCM_SHA256 cipher as
well as the SSL_ERROR_WANT_ASYNC, SSL_ERROR_WANT_ASYNC_JOB and
SSL_ERROR_WANT_CLIENT_HELLO_CB error codes.
This patch disables them for WolfSSL.
Signed-off-by: William Lallemand <wlallemand@haproxy.org>
The function used to calculate the shard number currently requires a
stktable_key on input for this. Unfortunately, it happens that peers
currently miss this calculation and they do not provide stktable_key
at all, instead they're open-coding all the low-level stick-table work
(hence why it's missing). Thus we'll need to be able to calculate the
shard number in keys coming from peers as well but the current API does
not make it possible.
This commit addresses this by inverting the order where the length and
the shard number are used. Now the low-level function is independent on
stksess and stktable_key, it takes a table, pointer and length and does
all the job. The upper function takes care of the type and key to get
the its length, and is for use only from stick-table code.
This doesn't change anything except that the low-level one will be usable
from outside (hence why it's exported now).
ncbuf API relies on lot of small functions. Mark these functions as
inline to reduce call invocations and facilitate compiler optimizations
to reduce code size.
This should be backported up to 2.6.
Instead of using memcpy() to concatenate the table's name to the key when
allocating an stksess, let's compute once for all a per-table seed at boot
time and use it to calculate the key's hash. This saves two memcpy() and
the usage of a chunk, it's always nice in a fast path.
When tested under extreme conditions with a 80-byte long table name, it
showed a 1% performance increase.
With an OpenSSL library which use the wrong OPENSSLDIR, HAProxy tries to
load the OPENSSLDIR/certs/ into @system-ca, but emits a warning when it
can't.
This patch fixes the issue by allowing to shut the error when the SSL
configuration for the httpclient is not explicit.
Must be backported in 2.6.
This adds a USE_OPENSSL_WOLFSSL option, wolfSSL must be used with the
OpenSSL compatibility layer. This must be used with USE_OPENSSL=1.
WolfSSL build options:
./configure --prefix=/opt/wolfssl --enable-haproxy
HAProxy build options:
USE_OPENSSL=1 USE_OPENSSL_WOLFSSL=1 WOLFSSL_INC=/opt/wolfssl/include/ WOLFSSL_LIB=/opt/wolfssl/lib/ ADDLIB='-Wl,-rpath=/opt/wolfssl/lib'
Using at least the commit 54466b6 ("Merge pull request #5810 from
Uriah-wolfSSL/haproxy-integration") from WolfSSL. (2022-11-23).
This is still to be improved, reg-tests are not supported yet, and more
tests are to be done.
Signed-off-by: William Lallemand <wlallemand@haproxy.org>
A number of internal flags started to be exposed to external programs
at the location of their definition since commit 77acaf5af ("MINOR:
flags: add a new file to host flag dumping macros"). This allowed the
"flags" utility to decode many more of them and always correctly. The
condition to expose them was to rely on the preliminary definition of
EOF that indicates that stdio is already included. But this was a
wrong approach. It only guarantees that snprintf() can safely be used
but still causes large functions to be built. But stdio is often
included before some of these includes, so these heavy inline functions
actually have to be compiled in many cases. The result is that the
build time significantly increased, especially with fast compilers
like gcc -O0 which took +50% or TCC which took +100%!
This patch addresses the problem by instead relying on an explicit
macro HA_EXPOSE_FLAGS that the calling program must explicitly define
before including these files. flags.c does this and that's all. The
previous build time is now restored with a speed up of 20 to 50%
depending on the build options.
These includes brought by commit 9c76637ff ("MINOR: anon: add new macros
and functions to anonymize contents") resulted in an increase of exactly
20% of the number of lines to build. These include are not needed there,
only tools.c needs xxhash.h.
Building with TCC caused a warning on __attribute__() being redefined,
because we do define it on compilers that don't have it, but we didn't
include the compiler's definitions first to leave it a chance to expose
its definitions. The correct way to do this would be to include
sys/cdefs.h but we currently don't include it explicitly and a few
reports on the net mention some platforms where it could be missing
by default. Let's use inttypes.h instead, it always causes it (or its
equivalent) to be included and we know it's present on supported
platforms since we already depend on it.
No backport is needed.
The issue addressed by commit fbb934da9 ("BUG/MEDIUM: stick-table: fix
a race condition when updating the expiration task") is still present
when thread groups are enabled, but this time it lies in the scheduler.
What happens is that a task configured to run anywhere might already
have been queued into one group's wait queue. When updating a stick
table entry, sometimes the task will have to be dequeued and requeued.
For this a lock is taken on the current thread group's wait queue lock,
but while this is necessary for the queuing, it's not sufficient for
dequeuing since another thread might be in the process of expiring this
task under its own group's lock which is different. This is easy to test
using 3 stick tables with 1ms expiration, 3 track-sc rules and 4 thread
groups. The process crashes almost instantly under heavy traffic.
One approach could consist in storing the group number the task was
queued under in its descriptor (we don't need 32 bits to store the
thread id, it's possible to use one short for the tid and another
one for the tgrp). Sadly, no safe way to do this was figured, because
the race remains at the moment the thread group number is checked, as
it might be in the process of being changed by another thread. It seems
that a working approach could consist in always having it associated
with one group, and only allowing to change it under this group's lock,
so that any code trying to change it would have to iterately read it
and lock its group until the value matches, confirming it really holds
the correct lock. But this seems a bit complicated, particularly with
wait_expired_tasks() which already uses upgradable locks to switch from
read state to a write state.
Given that the shared tasks are not that common (stick-table expirations,
rate-limited listeners, maybe resolvers), it doesn't seem worth the extra
complexity for now. This patch takes a simpler and safer approach
consisting in switching back to a single wq_lock, but still keeping
separate wait queues. Given that shared wait queues are almost always
empty and that otherwise they're scanned under a read lock, the
contention remains manageable and most of the time the lock doesn't
even need to be taken since such tasks are not present in a group's
queue. In essence, this patch reverts half of the aforementionned
patch. This was tested and confirmed to work fine, without observing
any performance degradation under any workload. The performance with
8 groups on an EPYC 74F3 and 3 tables remains twice the one of a
single group, with the contention remaining on the table's lock first.
No backport is needed.
In order to evenly pick idle connections from other threads, there is
a "next_takeover" index in the server, that is incremented each time
a connection is picked from another thread, and indicates which one to
start from next time.
With thread groups this doesn't work well because the index is the same
regardless of the group, and if a group has more threads than another,
there's even a risk to reintroduce an imbalance.
This patch introduces a new per-tgroup storage in servers which, for now,
only contains an instance of this next_takeover index. This way each
thread will now only manipulate the index specific to its own group, and
the takeover will become fair again. More entries may come soon.
In 2.2, some idle conns usage metrics were added by commit cf612a045
("MINOR: servers: Add a counter for the number of currently used
connections."), which mentioned that the operation doesn't need to be
atomic since we're not seeking exact values. This is true but at least
we should use atomic stores to make sure not to cause invalid values
to appear on archs that wouldn't guarantee atomicity when writing an
int, such as writing two 16-bit words. This is pretty unlikely on our
targets but better keep the code safe against this.
This may be backported as far as 2.2.
The "show pools" command is used a lot for debugging but didn't get much
love over the years. This patch brings new capabilities:
- sorting the output by pool names to ese their finding ("byname").
- sorting the output by reverse item size to spot the biggest ones("bysize")
- sorting the output by reverse number of allocated bytes ("byusage")
The last one (byusage) also omits displaying the ones with zero allocation.
In addition, an optional max number of output entries may be passed so as
to dump only the N most relevant ones.
This previous patch was not sufficient to prevent haproxy from
crashing when some Handshake packets had to be inspected before being possibly
retransmitted:
"BUG/MAJOR: quic: Crash upon retransmission of dgrams with several packets"
This patch introduced another issue: access to packets which have been
released because still attached to others (in the same datagram). This was
the case for instance when discarding the Initial packet number space before
inspecting an Handshake packet in the same datagram through its ->prev or
member in our case.
This patch implements quic_tx_packet_dgram_detach() which detaches a packet
from the adjacent ones in the same datagram to be called when ackwowledging
a packet (as done in the previous commit) and when releasing its memory. This
was, we are sure the released packets will not be accessed during retransmissions.
Thank you to @gabrieltz for having reported this issue in GH #1903.
Must be backported to 2.6.
As revealed by some traces provided by @gabrieltz in GH #1903 issue,
there are clients (chrome I guess) which acknowledge only one packet among others
in the same datagram. This is the case for the first datagram sent by a QUIC haproxy
listener made an Initial packet followed by an Handshake one. In this identified
case, this is the Handshake packet only which is acknowledged. But if the
client is able to respond with an Handshake packet (ACK frame) this is because
it has successfully parsed the Initial packet. So, why not also acknowledging it?
AFAIK, this is mandatory. On our side, when restransmitting this datagram, the
Handshake packet was accessed from the Initial packet after having being released.
Anyway. There is an issue on our side. Obviously, we must not expect an
implementation to respect the RFC especially when it want to build an attack ;)
With this simple patch for each TX packet we send, we also set the previous one
in addition to the next one. When a packet is acknowledged, we detach the next one
and the next one in the same datagram from this packet, so that it cannot be
resent when resending these packets (the previous one, in our case).
Thank you to @gabrieltz for having reported this issue.
Must be backported to 2.6.
In diag mode, the section position is checked and a warning is emitted if a
global section is defined after any non-global one. Now, this check is
always performed. But the warning is still only emitted in diag mode. In
addition, the result of this check is now stored in a global variable, to be
used from anywhere.
The aim of this patch is to be able to restrict usage of some global
directives to the very first global sections. It will be useful to avoid
undefined behaviors. Indeed, some config parts may depend on global settings
and it is a problem if these settings are changed after.
h1_process_mux() is written to allow partial headers formatting. For now,
all headers are forwarded in one time. But it is still good to keep this
ability at the H1 mux level. So we must rely on a H1S flag instead of a
local variable to know a WebSocket key was found in headers to be able to
generate a key if necessary.
There is no reason to backport this patch.
Similarly to the H1 and H2 multiplexers, FCFI_CF_ERR_PENDING is now used to
report an error when we try to send data and FCGI_CF_ERROR to report an
error when we try to read data. In other funcions, we rely on these flags
instead of connection ones. Only FCGI_CF_ERROR is considered as a final
error. FCGI_CF_ERR_PENDING does not block receive attempt.
In addition, FCGI_CF_EOS flag was added. we rely on it to test if a read0
was received or not.
Some fields in h2c structures are not used: .mfl, .mft and .mff. Just remove
them.
.msi field is also removed. It is tested but never set, except when a H2
connection is initialized. It also means h2c_mux_busy() function is useless
because it always returns 0 (.msi is always -1). And thus, by transitivity,
H2_CF_DEM_MBUSY is also useless because it is never set. So .msi field,
h2c_mux_busy() function and H2C_MUX_BUSY flag are removed.
Similarly to the H1 multiplexer, H2_CF_ERR_PENDING is now used to report an
error when we try to send data and H2_CF_ERROR to report an error when we
try to read data. In other funcions, we rely on these flags instead of
connection ones. Only H2_CF_ERROR is considered as a final error.
H2_CF_ERR_PENDING does not block receive attempt.
In addition, we rely on H2_CF_RCVD_SHUT flag to test if a read0 was received
or not.
When the H1 connection is aborted, we no longer set a final error. To do so,
the flag H1C_F_ABORTED was added. For now, it is only set when a error is
detected on the H1 stream. Idea is to use ERR_PENDING/ERROR for upgoing
errors and ABRT_PENDING/ABRTED for downgoing errors.
read0 is now handled with a H1 connection flag (H1C_F_EOS). Corresponding
flag was removed on the H1 stream and we fully rely on the SE descriptor at
the stream level.
Concretly, it means we rely on the H1 connection flags instead of the
connection one. H1C_F_EOS is only set in h1_recv() or h1_rcv_pipe() after a
read if a read0 was detected.
A new error is added on H1 stream to deal with internal errors. For now,
this error is only reported when we fail to create a stream-connector. This
way, the error is reported at the H1 stream level and not the H1 connection
level.
H1C_F_ERR_PENDING flags will be used to refactor error handling at the H1
connection level. It will be used to notify error during sends. Thus, the
flag to notify an error must be sent before closing the connection is now
named H1C_F_ABRT_PENDING.
This introduce a naming convertion: ERROR must be used to notify upper layer
of an event at the lower ones while ABORT must be used in the opposite
direction.
The H1 connection state is now handled in a dedicated state. H1C_F_ST_*
flags are removed. All states are now exclusives. It is easier to know the
H1 connection states. It is alive, or usable, if it is not CLOSING or
CLOSED. It is CLOSING if it should be closed ASAP but a stream is still
attached and/or the output buffer is not empty. CLOSED is used when the H1
connection is ready to be closed. Other states are quite easy to understand.
There is no special changes in the H1 connection behavior. Except in
h1_send(). When a CLOSING connection is CLOSED, the function now reports an
activity. In addition, when an embryonic H1 stream is aborted, it is
destroyed. This way, the H1 connection can be switched to CLOSED state.
The H1 connection state will be handled is a dedicated field. To do so,
h1_cs enum was added. The different states are more or less equivalent to
H1C_F_ST_* flags:
* H1_CS_IDLE <=> H1C_F_ST_IDLE
* H1_CS_EMBRYONIC <=> H1C_F_ST_EMBRYONIC
* H1_CS_UPGRADING <=> H1C_F_ST_ATTACHED && !H1C_F_ST_READY
* H1_CS_RUNNING <=> H1C_F_ST_ATTACHED && H1C_F_ST_READY
* H1_CS_CLOSING <=> H1C_F_ST_SHUTDOWN && (H1C_F_ST_ATTACHED || b_data(&h1c->ibuf))
* H1_CS_CLOSED <=> H1C_F_ST_SHUTDOWN && !H1C_F_ST_ATTACHED && !b_data(&h1c->ibuf)
In addition, in this patch, the h1_is_alive() and h1_close() function are
added. The first one will be used to know if a H1 connection is alive or
not. The second one will be used to set the connection in CLOSING or CLOSED
state, depending on the output buffer state and if there is still a H1
stream or not.
For now, the H1 connection state is not used.
There's quite a large barely readable functions block in the makefile
dedicated to compiler option support. It provides no value here and
makes it harder to find user-configurable stuff, so let's move it to
include/make/compiler.mk to keep the makefile a bit cleaner. It's better
to keep the options themselves in the makefile however.
It's better to see "make" entering a subdir than seeing nothing, so
let's use a command name for make. Since make 3.81, "+" needs to be
prepended in front of the command to pass the job server to the subdir.
The $(Q), $(V), $(cmd_xx) handling needs to be reused in sub-project
makefiles and it's a pain to maintain inside the main makefile. Let's
just move that into a new subdir include/make/ with a dedicated file
"verbose.mk". It slightly cleans up the makefile in addition.
When building with DEBUG_MEM_STATS, we only see b_alloc() and b_free() as
users of the "buffer" pool, because all call places rely on these more
convenient functions. It's annoying because it makes it very hard to see
which parts of the code are consuming buffers.
By switching the b_alloc() and b_free() inline functions to macros, we
can now finally track the users of struct buffer, e.g:
mux_h1.c:513 P_FREE size: 1275002880 calls: 38910 size/call: 32768 buffer
mux_h1.c:498 P_ALLOC size: 1912438784 calls: 58363 size/call: 32768 buffer
stream.c:763 P_FREE size: 4121493504 calls: 125778 size/call: 32768 buffer
stream.c:759 P_FREE size: 2061697024 calls: 62918 size/call: 32768 buffer
stream.c:742 P_ALLOC size: 3341123584 calls: 101963 size/call: 32768 buffer
stream.c:632 P_FREE size: 1275068416 calls: 38912 size/call: 32768 buffer
stream.c:631 P_FREE size: 637435904 calls: 19453 size/call: 32768 buffer
channel.h:850 P_ALLOC size: 4116480000 calls: 125625 size/call: 32768 buffer
channel.h:850 P_ALLOC size: 720896 calls: 22 size/call: 32768 buffer
dynbuf.c:55 P_FREE size: 65536 calls: 2 size/call: 32768 buffer
Let's do this since it doesn't change anything for the output code
(beyond adding the call places). Interestingly the code even got
slightly smaller now.
This macro just serves as an intermediary for __pool_alloc() and forwards
the flag. When DEBUG_MEM_STATS is set, it will be used to collect all
pool allocations including those which need to pass an explicit flag.
It's now used by b_alloc() which previously couldn't be tracked by
DEBUG_MEM_STATS, causing some free() calls to have no corresponding
allocations.
Similarly to the previous patch, it's better to keep a local copy of
the new node's key instead of accessing it every time. This slightly
reduces the code's size in the descent and further improves the load
time to 7.45s.
looking at a perf profile while loading a conf with a huge map, it
appeared that there was a hot spot on the access to the new node's
prefix, which is unexpectedly being reloaded for each visited node
during the tree descent. Better keep a copy of it because with large
trees that don't fit into the L3 cache the memory bandwidth is scarce.
Doing so reduces the load time from 8.0 to 7.5 seconds.
Once in a while we spot a bug in the deinit code that is complex,
especially when it has to deal with incomplete initializations, and the
ability to bypass this step has regularly been raised. In addition for
fast-reloading setups it could theoretically save some time. Tests have
shown that very large configs can barely save ~100-150ms by skipping the
deinit step. However the ability not to crash if a bug is encountered can
occasionally help.
This patch adds an option to do exactly this. It's obviously not enabled
by default and the documentation discourages from using it, but this might
be useful in the future.
The ->exp_next field of the stick-table was probably useful in 1.5 but
it currently only carries a copy of what the future value of the table's
task's expire value will be, while it's systematically copied over there
immediately after being assigned. As such it provides exactly a local
variable. Let's remove it, as it costs atomic operations.
Coverity raised a potential overflow issue in these new functions that
work on unsigned long long objects. They were added in commit 9b25982
"BUG/MEDIUM: ssl: Verify error codes can exceed 63".
This patch needs to be backported alongside 9b25982.
When the code is preprocessed first and compiled later, such as when
built under distcc, a lot of fallthrough warnings are emitted because
the preprocessor has already stripped the comments.
As an alternative, a "fallthrough" attribute was added with the same
compilers as those which started to emit those warnings. However it's
not portable to older compilers. Let's just define a __fallthrough
statement that corresponds to this attribute on supported compilers
and only switches to the classical empty do {} while (0) on other ones.
This way the code will support being cleaned up using __fallthrough.
It happens that gcc since 5.x has this macro which is only mentioned
once in the doc, associated with __builtin_has_attribute(). Clang had
it at least since 3.0. In addition it validates #ifdef when present,
so it's easy to detect it. Here we're providing a fallback to another
macro __has_attribute_<name> so that it's possible to define that macro
to the value 1 for older compilers when the attribute is supported.