When xprt_qstrm layer is completed, MUX layer is started. Rx buffer from
the XPRT layer is transferred to the MUX so that it can handle any extra
data following the transport parameters first frame.
Since previous commit, QCC Rx buffer is dynamically allocated only when
needed. However, qmux_init() must still allocate it when there is data
to be transferred from the XPRT layer. As a result, the code had become
overly complex in order to keep supporting this case.
This patch simplifies xprt_qstrm API for the Rx buffer transfer. Buffer
content and remaining record length can now be retrieved via the single
function xprt_qstrm_xfer_rxbuf(). If the buffer is empty, nothing is
performed and XPRT layer will release it. If not empty, MUX will take
ownership of the buffer from the XPRT layer.
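The ownership rule above can be sketched as follows. This is only an
illustration: the ex_buf type, the ex_xfer_rxbuf() name and its signature
are hypothetical and do not reflect the real xprt_qstrm_xfer_rxbuf()
prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal buffer, standing in for the real buffer type. */
struct ex_buf {
	char *area;   /* storage, NULL when not allocated */
	size_t size;  /* allocated size */
	size_t data;  /* number of bytes currently stored */
};

/* Hands <src> and the remaining record length over to the MUX side.
 * Returns 0 and leaves <src> untouched when it holds no data, in which
 * case the caller (XPRT) stays in charge of releasing it. Returns 1 when
 * <dst> took ownership; <src> is then reset so it cannot be released twice.
 */
static int ex_xfer_rxbuf(struct ex_buf *src, uint64_t src_rlen,
                         struct ex_buf *dst, uint64_t *dst_rlen)
{
	assert(dst->area == NULL); /* MUX Rx buffer must not be allocated yet */

	if (!src->data)
		return 0;

	*dst = *src;
	*dst_rlen = src_rlen;
	src->area = NULL;
	src->size = src->data = 0;
	return 1;
}
```

With such a helper, the MUX side only ends up with an Rx buffer when
there is something to demux, which matches the on-demand allocation
described above.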
3 new enum values and a mask were added in latest -dev with commit
24e05fe33a ("MINOR: stream: Use a pcli transaction to replace pcli_*
members"), unfortunately the entries needed by the "flags" command were
forgotten.
No backport is needed.
In 3.4-dev8, commit e264523112 ("MINOR: servers: Don't update last_sess
if it did not change") adjusted the last_sess date to avoid writing to
the same cache line all the time, however a typo makes it pick the wrong
second because it uses now_ms instead of now_ns (so the date would roughly
change every 12 days).
No backport needed.
In task_schedule(), before attempting to set the new task expiration
date, make sure it is not running by trying to set the TASK_RUNNING
flag, and waiting if it is already there. Having the flag set will
ensure that the task won't be running while we're modifying it.
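A minimal sketch of this "own the running flag before touching expire"
idea, using C11 atomics. The ex_* names and the flag value are
hypothetical, and the real code uses HAProxy's own atomic helpers and
queueing functions rather than what is shown here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define EX_TASK_RUNNING 0x0001u /* hypothetical value */

struct ex_task {
	_Atomic uint32_t state; /* task flags */
	int expire;             /* expiration date, in ticks */
};

/* Waits until the RUNNING flag is clear, then sets it atomically. */
static void ex_task_grab_running(struct ex_task *t)
{
	uint32_t old = atomic_load(&t->state);

	for (;;) {
		if (old & EX_TASK_RUNNING) {
			/* someone else runs the task: wait and retry */
			old = atomic_load(&t->state);
			continue;
		}
		/* on failure, <old> is refreshed with the current value */
		if (atomic_compare_exchange_weak(&t->state, &old,
		                                 old | EX_TASK_RUNNING))
			return;
	}
}

/* Releases the RUNNING flag taken by ex_task_grab_running(). */
static void ex_task_drop_running(struct ex_task *t)
{
	assert(atomic_load(&t->state) & EX_TASK_RUNNING);
	atomic_fetch_and(&t->state, ~EX_TASK_RUNNING);
}

/* Updates the expiration date while nobody else can run the task. */
static void ex_task_schedule(struct ex_task *t, int when)
{
	ex_task_grab_running(t);
	t->expire = when; /* the real code would also (re)queue the task */
	ex_task_drop_running(t);
}
```

While the flag is held, the task handler cannot run concurrently, so it
cannot change the expiration date between the assignment and the
queueing step.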
There is a very rare race condition, where the expire would be set by
task_schedule(), then the running task might set it to something else,
and if it sets it to TICK_ETERNITY before task_schedule() calls
__task_queue(), then we will hit a BUG_ON() there.
This is very hard to reproduce, but has been reported a few times,
including in GitHub issue #3327, which should now be fixed.
This should be backported as far back as 2.8.
WIP: Make sure the task is not running before changing expire
This pointer was used during the appctx refactoring performed in 2.6. The
ctx union was still there and this pointer was used as the "shadow" of the
svcctx pointer used by most commands. In 2.7, the union was removed, making
the shadow pointer useless. Let's remove it now.
A new type of transaction was introduced for master-cli streams. So
SF_TXN_PCLI flag and functions to allocate and destroy PCLI transactions
were added.
In the stream structure, all pcli_* members were moved in the pcli
transaction and the txn union was updated accordingly.
When it was ambiguous, a test on the transaction type was performed, for
instance to destroy the transaction.
To be able to deal with different types of transaction for a stream, new
stream flags were added to know the transaction type when allocated. For now
only HTTP transactions can be allocated, so only SF_TXN_HTTP was
introduced. The mask SF_TXN_MASK must be used to get the transaction type.
The transaction type is set when it is allocated and removed when it is
destroyed.
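The flag handling described above may be pictured with the sketch below.
Only the SF_TXN_HTTP and SF_TXN_MASK names come from this message: the
numeric values, the ex_* types and the helper functions are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical values, for illustration only. */
#define EX_SF_TXN_HTTP 0x1u
#define EX_SF_TXN_MASK 0x3u

struct ex_http_txn { int status; };

struct ex_stream {
	unsigned int flags;
	struct ex_http_txn *txn;
};

/* Allocates the HTTP transaction and records its type in the flags. */
static struct ex_http_txn *ex_http_txn_alloc(struct ex_stream *s)
{
	assert(!(s->flags & EX_SF_TXN_MASK)); /* no transaction yet */

	s->txn = calloc(1, sizeof(*s->txn));
	if (s->txn)
		s->flags |= EX_SF_TXN_HTTP;
	return s->txn;
}

/* Destroys the transaction, if any, and clears its type. */
static void ex_txn_destroy(struct ex_stream *s)
{
	if ((s->flags & EX_SF_TXN_MASK) == EX_SF_TXN_HTTP)
		free(s->txn);
	s->txn = NULL;
	s->flags &= ~EX_SF_TXN_MASK;
}
```

Testing the masked value rather than a single bit keeps the check valid
once other transaction types share the same mask.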
The HTTP transaction is moved into a union. For now, it is the only possible
transaction that can be allocated. But that will change. Thanks to this
commit and the next one, it will be possible to deal with different kinds of
transactions for a stream.
This patch looks quite huge, but it is more or less a renaming of all
accesses to "txn" field by "txn.http".
The maximum size allowed for the payload pattern was increased to 64 bytes
(65 bytes because of the trailing \0), to be able to use a sha256 of random
data for instance. It could be useful to prevent any data smuggling on the
payload.
Note that on the CLI, it could be possible to have only the buffer size as a
limit, because the command line is only consumed once all commands are
executed. The payload pattern is only a pointer in the buffer where the
command line was copied. However, for the master CLI, the data are streamed
to the worker, so we must keep a copy of the payload pattern. This is why we
must limit its size.
It is now possible to deal with a payload too big to fit in a buffer, without
changing the buffer size. By default, a payload up to 128 KB can be
dynamically allocated. "tune.cli.max-payload-size" global parameter can be
used to change this value, with some caution for huge values.
For CLI command handler functions, there is no change at all. A pointer on
the payload is still passed as parameter. Internally, an area is allocated
for the payload only if it is too big.
The payload pattern used to detect the end of the payload is part of the
allocated area.
The payload is now saved as a buffer in the CLI context instead of a simple
pointer. It is mandatory to be able to reallocate the payload if it is too
big.
Instead of copying the payload pattern in the CLI context, we now only save
a pointer on this pattern. It is possible because the command line is copied
in the CLI context. Arguments are already handled this way when the command
is processed.
The single-threaded build is currently broken in development since commit
0af603f46f ("MEDIUM: threads: change the default max-threads-per-group
value to 16") because it doesn't set the default for the non-threaded
build. Let's set it to 1.
No backport is needed.
Commit 7d40b3134 ("MEDIUM: sched: do not run a same task multiple times
in series") required to slightly reorder a few fields in struct tasklet
and task in order to reuse an existing hole and keep tree nodes aligned.
The problem is that nice+expire were placed in struct task just before
rq, and that a 48-bit hole replaces them in struct tasklet on 64-bit
platforms, just before the struct list. However, on 32-bit platforms,
the hole is only 16-bit and preserves nice, but expire is overwritten
by the first pointer of the list element. This is not a problem for
real tasklets which do not use these fields, but it definitely is a
problem for tasks that are cast to tasklets in the run queues, because
the expire field can be overwritten when the task is woken up, and if
requeued as-is, it will expire at a completely random date.
This is what caused certain regtests to fail on i386 and 32-bit arm
machines.
This fix needs to be backported wherever the patch above was backported.
The bug has no effect on 64-bit platforms. The fix doesn't inflate
structs on 64-bit, but will raise struct tasklet from 40 to 44 bytes on
32-bit platforms.
Thanks to William for spotting the problem, bisecting it and providing
a working reproducer.
The ACME Profiles extension (draft-ietf-acme-profiles) allows a client
to request a specific certificate profile by including a "profile" field
in the newOrder request. This lets the CA select the appropriate
certificate issuance policy (e.g. "classic", "shortlived") for a given
order.
A new "profile" keyword is added to the acme section. When set, its
value is included in the newOrder JSON payload sent to the CA.
This patch is related to the previously reported issue on QMux record
length parsing.
QCC rx.rlen is used to store the decoded record length. Convert it into
a plain 64-bit integer instead of a size_t. This ensures it is
large enough to hold the decoded record length, even with an increase of
the max_record_length value (not currently implemented).
This should fix GitHub build issue #3334 for 32-bit architectures.
No need to backport.
Implement the reception of a HTTP/3 GOAWAY frame. This is performed via
the new function h3_parse_goaway_frm(). The advertised ID is stored in
new <id_shut_r> h3c member. It serves to ensure that a bigger ID is not
advertised when receiving multiple GOAWAY frames.
GOAWAY frame reception is only really useful on the backend side for
haproxy. When this occurs, h3c is now flagged with H3_CF_GOAWAY_RECV.
QCC is also updated with the new flag QC_CF_CONN_SHUT. This flag
indicates that no new stream may be opened on the connection. Callback
avail_streams() is thus edited to report 0 in this case.
QUIC stream IDs are encoded as 62-bit integers and an ID cannot be reused
within a connection. It is necessary to take this limitation into account
for backend connections.
This patch implements this via qmux_avail_streams() callback. In the
case where the connection is approaching the encoding limit, reduce the
advertised value until the limit is reached. Note that this is very
unlikely to happen as the value is pretty high.
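The clamping may be sketched as below, assuming client-initiated
bidirectional streams whose IDs grow by 4, as specified by QUIC. The ex_*
names and the exact computation are hypothetical and are not taken from
qmux_avail_streams().

```c
#include <assert.h>
#include <stdint.h>

/* Largest value that fits in a 62-bit QUIC varint. */
#define EX_QUIC_MAX_STREAM_ID ((1ULL << 62) - 1)

/* Number of streams of a given type that may still be opened when the
 * next usable ID of that type is <next_id> (IDs grow by 4).
 */
static uint64_t ex_remaining_streams(uint64_t next_id)
{
	if (next_id > EX_QUIC_MAX_STREAM_ID)
		return 0;
	return (EX_QUIC_MAX_STREAM_ID - next_id) / 4 + 1;
}

/* Caps <wanted>, the value otherwise reported to the upper layer, by what
 * the ID space still allows.
 */
static int ex_avail_streams(uint64_t next_id, int wanted)
{
	uint64_t left = ex_remaining_streams(next_id);

	assert(wanted >= 0);
	return left < (uint64_t)wanted ? (int)left : wanted;
}
```

Far from the limit the usual value is reported unchanged; only the very
last IDs reduce it, down to 0 once the ID space is exhausted.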
This should be backported up to 3.3.
Since commit 7d40b31 ("MEDIUM: sched: do not run a same task multiple
times in series") I noticed that any valid config, once haproxy was
started, would produce uninitialised reads on valgrind:
[NOTICE] (243490) : haproxy version is 3.4-dev9-0af603-2
[NOTICE] (243490) : path to executable is /tmp/haproxy/haproxy
[WARNING] (243490) : missing timeouts for proxy 'test'.
| While not properly invalid, you will certainly encounter various problems
| with such a configuration. To fix this, please ensure that all following
| timeouts are set to a non-zero value: 'client', 'connect', 'server'.
[NOTICE] (243490) : Automatically setting global.maxconn to 491.
==243490== Thread 4:
==243490== Conditional jump or move depends on uninitialised value(s)
==243490== at 0x44DBD7: run_tasks_from_lists (task.c:567)
==243490== by 0x44E99E: process_runnable_tasks (task.c:913)
==243490== by 0x395A41: run_poll_loop (haproxy.c:2981)
==243490== by 0x396178: run_thread_poll_loop (haproxy.c:3211)
==243490== by 0x4E2DAA3: start_thread (pthread_create.c:447)
==243490== by 0x4EBAA63: clone (clone.S:100)
==243490==
Looking at it, it is caused by the fact that the task->last_run member,
which was introduced and used by the commit above, is never assigned a
default value, so the first time it is used, reading from it causes an
uninitialised read.
To fix the issue, we simply ensure last_run task member gets a default
value when the task or tasklet is created. We use '0' as default value,
as the value itself is of minor importance because the member is used
to detect if the task has already been executed for the current loop cycle
so it will self-correct in any case.
No backport needed, unless 7d40b31 is
A lot of our subsystems start to be shared by thread groups now
(listeners, queues, stick-tables, stats, idle connections, LB algos).
This has made it possible to recover the performance that used to be out
of reach on loosely shared platforms (typically AMD EPYC systems), but in
parallel other large unified systems (Xeon and large Arm in general)
still suffer from the remaining contention when placing too many
threads in a group.
A first test running on a 64-core Neoverse-N1 processor with a single
backend with one server and no LB algo specified shows 1.58 Mrps with
64 threads per group, and 1.71 Mrps with 16 threads per group. The
difference is essentially spent updating stats counters everywhere.
Another test is the connection:close mode, delivering 85 kcps with
64 threads per group, and 172 kcps (202%) with 16 threads per group.
In this case it's mostly the more numerous listeners which improve
the situation as the change is mostly in the kernel:
max-threads-per-group 64:
# perf top
Samples: 244K of event 'cycles', 4000 Hz, Event count (approx.): 61065854708 los
Overhead Shared Object Symbol
10.41% [kernel] [k] queued_spin_lock_slowpath
10.36% [kernel] [k] _raw_spin_unlock_irqrestore
2.54% [kernel] [k] _raw_spin_lock
2.24% [kernel] [k] handle_softirqs
1.49% haproxy [.] process_stream
1.22% [kernel] [k] _raw_spin_lock_bh
# h1load
time conns tot_conn tot_req tot_bytes err cps rps bps ttfb
1 1024 84560 83536 4761666 0 84k5 83k5 38M0 11.91m
2 1024 168736 167713 9559698 0 84k0 84k0 38M3 11.98m
3 1024 253865 252841 14412165 0 85k0 85k0 38M7 11.84m
4 1024 339143 338119 19272783 0 85k1 85k1 38M8 11.80m
5 1024 424204 423180 24121374 0 84k9 84k9 38M7 11.86m
max-threads-per-group 16:
# perf top
Samples: 1M of event 'cycles', 4000 Hz, Event count (approx.): 375998622679 lost
Overhead Shared Object Symbol
15.20% [kernel] [k] queued_spin_lock_slowpath
4.31% [kernel] [k] _raw_spin_unlock_irqrestore
3.33% [kernel] [k] handle_softirqs
2.54% [kernel] [k] _raw_spin_lock
1.46% haproxy [.] process_stream
1.12% [kernel] [k] _raw_spin_lock_bh
# h1load
time conns tot_conn tot_req tot_bytes err cps rps bps ttfb
1 1020 172230 171211 9759255 0 172k 171k 78M0 5.817m
2 1024 343482 342460 19520277 0 171k 171k 78M0 5.875m
3 1021 515947 514926 29350953 0 172k 172k 78M5 5.841m
4 1024 689972 688949 39270207 0 173k 173k 79M2 5.783m
5 1024 863904 862881 49184274 0 173k 173k 79M2 5.795m
So let's change the default value to 16. It also happens to match what's
used by default on EPYC systems these days.
This change was marked MEDIUM as it will increase the number of listening
sockets on some systems, to match their counterparts from other vendors,
which is easier for capacity planning.
When the opportunistic initial DNS check (ACME_INITIAL_RSLV_READY) fails,
the state machine was incorrectly transitioning to ACME_RSLV_RETRY_DELAY
instead of ACME_CLI_WAIT. This caused the challenge to enter the DNS retry
loop rather than falling back to the normal cond_ready flow that waits for
the CLI signal.
Also reorder ACME_CLI_WAIT in the state enum and trace switch to reflect
the actual execution order introduced in the previous commit: it comes after
ACME_INITIAL_RSLV_READY, not before ACME_INITIAL_RSLV_TRIGGER.
No backport needed.
For dns-persist-01, the "_validation-persist.<domain>" TXT record is set once
and never changes between renewals. Add an initial opportunistic DNS check
(ACME_INITIAL_RSLV_TRIGGER / ACME_INITIAL_RSLV_READY states) that runs before
the challenge-ready conditions are evaluated. If all domains already have the
TXT record, the challenge is submitted immediately without going through the
cli/delay/dns challenge-ready steps, making renewals faster once the record is
in place.
The new ACME_RDY_INITIAL_DNS flag is automatically set for
dns-persist-01 in cond_ready.
Till now, threads were all started one at a time from thread 1. This
will soon cause us limitations once we want to reduce shared stuff
between thread groups.
Let's slightly change the startup sequence so that the first thread
starts one initial thread for each group, and that each of these
threads then starts all other threads from their group before switching
to the final task. Since it requires an intermediary step, we need to
store the threads' start function to access it from the group, so it
was put into the tgroup_info which still has plenty of room available.
It could also theoretically speed up the boot sequence, though in
practice it doesn't change anything because each thread's initialization
is made one at a time to avoid races during the early boot. However
there is now a function in charge of starting all extra threads of a
group, and which is called from this group.
Implement a new setting to limit the total number of bidirectional
streams that the client may use on a single connection. By default, it
is set to 0 which means it is not limited at all.
If a positive value is configured, the client can only open a fixed
number of request streams per QUIC connection. Internally, this is
implemented in two steps:
* First, MAX_STREAMS_BIDI flow control advertising will be reduced when
approaching the limit before being completely turned off when reaching
it. This guarantees that the client cannot exceed the limit without
violating the flow control.
* Second, when attaching the latest stream with ID matching max-total
setting, connection graceful shutdown is initiated. In HTTP/3, this
results in a GOAWAY emission. This allows the remaining streams to be
completed before the connection becomes completely idle.
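The two steps above can be sketched as follows. All names are
hypothetical, and the way the real code tracks closed streams and the
flow-control window is an assumption made for the sake of the example.

```c
#include <assert.h>
#include <stdint.h>

/* Cumulative MAX_STREAMS_BIDI value to advertise. <closed> is the number
 * of streams already released, <window> the number of concurrent streams
 * normally granted, and <max_total> the configured limit (0 = unlimited).
 */
static uint64_t ex_max_streams_bidi(uint64_t closed, uint64_t window,
                                    uint64_t max_total)
{
	uint64_t limit;

	assert(window > 0);
	limit = closed + window;
	if (max_total && limit > max_total)
		limit = max_total; /* never grant credit beyond the total */
	return limit;
}

/* Returns non-zero when the stream which was just attached, the
 * <opened>-th one, is the last one allowed, so that a graceful shutdown
 * must be initiated.
 */
static int ex_must_shut(uint64_t opened, uint64_t max_total)
{
	return max_total && opened >= max_total;
}
```

With a limit of 1000 and a window of 100, the advertised value stops
growing at 1000, and attaching the 1000th stream triggers the shutdown.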
Since thread groups were enabled by default in 3.3, it has become an
important element of diagnostic that we're missing in "show info". Let's
add it under "NbThreadGroups".
__htx_blkinfo_type() and __htx_blkinfo_size() functions were added to return,
respectively, the type and the size from the block info field. The main
usage for these functions is internal to the htx code.
The lack of mjson_next() prevents easy iteration and forces a hack that
loops over snprintf() + $.field[XXX] combined with mjson_find().
This reintroduces mjson_next() so that we can iterate without having to
build the string.
The patch does not reintroduce MJSON_ENABLE_NEXT so it could be used
without having to define it.
This implementation is directly modeled after `stream_generate_unique_id()` and
the corresponding `unique_id` field on `struct stream`.
It will be used in a future commit to enable the use of the `%[unique-id]`
fetch in check rules.
With the introduction of the `generate_unique_id()` helper, the actual
complicated logic is sitting in a different file. Allow inlining of
`stream_generate_unique_id()`, so that callers can benefit from an abstraction
without hiding away the access of `strm->unique_id` behind a function call.
This new function will handle the actual generation of the unique ID according
to a format. The caller is responsible to check that no unique ID is stored
yet.
Add challenge_type parameter to acme_rslv_start() to select the correct
DNS lookup prefix: _validation-persist.<domain> for dns-persist-01 and
_acme-challenge.<domain> for dns-01.
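As an illustration of the prefix selection, here is a minimal sketch. The
ex_acme_dns_name() helper and its signature are hypothetical; only the
two prefixes and the challenge type names come from this message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds the TXT record name to look up for <domain> into <out>.
 * Returns the name length, or -1 if the challenge type is not DNS-based
 * or if <out> is too small.
 */
static int ex_acme_dns_name(const char *challenge_type, const char *domain,
                            char *out, size_t size)
{
	const char *prefix;
	int ret;

	assert(out && size > 0);

	if (strcmp(challenge_type, "dns-persist-01") == 0)
		prefix = "_validation-persist";
	else if (strcmp(challenge_type, "dns-01") == 0)
		prefix = "_acme-challenge";
	else
		return -1;

	ret = snprintf(out, size, "%s.%s", prefix, domain);
	if (ret < 0 || (size_t)ret >= size)
		return -1; /* error or truncated output */
	return ret;
}
```

For "example.com", this yields "_acme-challenge.example.com" with dns-01
and "_validation-persist.example.com" with dns-persist-01.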
Default cond_ready to ACME_RDY_DNS|ACME_RDY_DELAY for dns-persist-01.
Extend ACME_CLI_WAIT to cover dns-persist-01 alongside dns-01.
In ACME_RSLV_READY, check only TXT record existence for dns-persist-01
since the resolver cannot parse multiple strings within a single TXT entry.
`sess_build_logline_orig()` takes a `size_t maxsize` as input and accordingly
should also return `size_t` instead of `int` as the resulting length. In
practice most of the callers already stored the result in a `size_t` anyways.
The few places that used an `int` were adjusted.
This Coccinelle patch was used to check for completeness:
@@
type T != size_t;
T var;
@@
(
* var = build_logline(...)
|
* var = build_logline_orig(...)
|
* var = sess_build_logline(...)
|
* var = sess_build_logline_orig(...)
)
Reviewed-by: Volker Dusch <github@wallbash.com>
The OpenTelemetry (OTel) filter enables distributed tracing of requests
across service boundaries, export of metrics such as request rates,
latencies and error counts, and structured logging tied to trace context,
giving operators a unified view of HAProxy traffic through any
OpenTelemetry-compatible backend.
The OTel filter is implemented using the standard HAProxy stream filter
API. Stream filters attach to proxies and intercept traffic at each stage
of processing: they receive callbacks on stream creation and destruction,
channel analyzer events, HTTP header and payload processing, and TCP data
forwarding. This allows the filter to collect telemetry data at every
stage of the request/response lifecycle without modifying the core proxy
logic.
This commit added the minimum set of files required for the filter to
compile: the addon Makefile with pkg-config-based detection of the
opentelemetry-c-wrapper library, header files with configuration
constants, utility macros and type definitions, and the source files
containing stub filter operation callbacks registered through
flt_otel_ops and the "opentelemetry" keyword parser entry point.
The filter uses the opentelemetry-c-wrapper library from HAProxy
Technologies, which provides a C interface to the OpenTelemetry C++ SDK.
This wrapper allows HAProxy, a C codebase, to leverage the full
OpenTelemetry observability pipeline without direct C++ dependencies
in the HAProxy source tree.
https://github.com/haproxytech/opentelemetry-c-wrapper
https://github.com/open-telemetry/opentelemetry-cpp
Build options:
USE_OTEL - enable the OpenTelemetry filter
OTEL_DEBUG - compile the filter in debug mode
OTEL_INC - force the include path to the C wrapper
OTEL_LIB - force the library path to the C wrapper
OTEL_RUNPATH - add the C wrapper RUNPATH to the executable
Example build with OTel and debug enabled:
make -j8 USE_OTEL=1 OTEL_DEBUG=1 TARGET=linux-glibc
This reverts commit 8056117e988a3fde05d46ecc71b2d1a3d802977d.
Moving haterm init from haproxy is not the right way to fix the issue
because it should be possible to use a haterm configuration in haproxy.
So let's revert the commit above.
This patch implements the new QMux record layer parsing for xprt_qstrm.
This is mostly similar to the MUX code from the previous patch.
Along with this change, a new xprt_qstrm layer accessor exposes the
possible remaining record length after Transport parameters parsing.
This can only occur when xprt_qstrm Rx buffer is not completely emptied
due to other following frames. If stored in the same record, MUX layer
has to know the remaining record length.
Thus, xprt_qstrm_rxrlen() is now used in qmux_init() to preinitialize
<rx.rlen> QCC field.
This is the first patch of a series which aims to support the new Record
layer defined by draft 01 of the QMux protocol.
https://www.ietf.org/archive/id/draft-ietf-quic-qmux-01.html#name-qmux-records
This patch deals with QMux reception at the MUX layer. The function
qcc_qstrm_recv() is adapted to read record headers before frame parsing.
This requires keeping the last record length read in a new QCC field
named <rx.rlen>.
Frames are only parsed once a full record is received. One of the
advantages of the record layer is that a record can only contain whole
frames, without truncation.
This patch implements proper connection error handling for xprt_qstrm
layer. Basically, processing is interrupted if CO_FL_ERROR is
encountered after either rcv_buf or snd_buf operations. Connection
error is set to the newly defined value CO_ER_QSTRM.
Layer xprt_qstrm is responsible to read the initial QMux transport
parameters frame. However, it could receive more data if some other
frames follow it. This extra content can only be handled by the MUX
layer once initialized.
Theoretically, it could have been implemented via MSG_PEEK. However, this
flag is currently ignored by the SSL layer. Besides, it is tedious to
implement safely. A new approach has been preferred where the MUX layer
is responsible to retrieve remaining data via xprt_qstrm_rxbuf()
accessor function during its initialization.
Thus, qmux_init() now may retrieve the buffer from xprt_qstrm layer.
This is performed via b_xfer() which will result in a zero copy
transfer. If this happens, tasklet is immediately scheduled to start
demuxing.
Samples of type SMP_T_METH were not properly handled in smp_dup(),
smp_is_safe() and smp_is_rw(). For "other" methods, for instance PATCH, a
fallback was performed on the SMP_T_STR type. Only the buffer to consider
differs: "smp->data.u.meth.str" should be used for the SMP_T_METH samples
while smp->data.u.str should be used for SMP_T_STR samples. However, in
smp_dup(), the result was stored in wrong buffer, the string one instead of
the method one. In smp_is_safe() and smp_is_rw(), the method buffer was not
used at all.
We now take care to use the right buffer.
This patch must be backported to all stable versions.
decode_varint() has no iteration cap and accepts varints decoding to
any uint64_t value. When sz is large enough that p + sz wraps modulo
2^64, the check "p + sz > end" passes, *buf is set to the wrapped
pointer, and the caller's parsing loop continues from an arbitrary
relative offset before the demux buffer.
A malicious SPOE agent sending an AGENT_HELLO frame with a key-name
length varint of 0xfffffffffffff000 causes spop_conn_handle_hello()
to dereference memory ~64KB before the dbuf allocation, resulting in
SIGSEGV (DoS) or, if the read lands on live heap data, parser
confusion. The relative offset is fully attacker-controlled and
ASLR-independent.
Compare against the remaining length instead of computing p + sz.
Since p <= end is guaranteed after a successful decode_varint(),
end - p is non-negative.
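The corrected comparison can be sketched as below. The ex_skip_bytes()
helper is hypothetical and only illustrates the check; it is not the
actual SPOP parsing code.

```c
#include <assert.h>
#include <stdint.h>

/* Advances <*buf> by <sz> bytes if that many bytes remain before <end>.
 * Returns 0 on success, -1 if <sz> exceeds the remaining length. The
 * length is compared against (end - p) so that no pointer beyond <end>
 * is ever computed, and thus nothing can wrap.
 */
static int ex_skip_bytes(const char **buf, const char *end, uint64_t sz)
{
	const char *p = *buf;

	assert(p <= end); /* guaranteed after a successful varint decode */

	if (sz > (uint64_t)(end - p))
		return -1;

	*buf = p + sz;
	return 0;
}
```

With this form, a length such as 0xfffffffffffff000 is simply rejected
because it is larger than the remaining data, whatever the buffer address.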
This patch must be backported to all stable versions.
Storing the protocol directly into the check was not a good idea,
because the protocol may not be determined until after a DNS resolution
on the server, and may even change at runtime, if the DNS changes.
What we can, however, figure out at start up, is the net_addr_type,
which will contain all that we need to find out which protocol to use
later.
Also revert the changes made by commit 07edaed1918a6433126b4d4d61b7f7b0e9324b30
that would not reuse the server xprt if a different alpn is set for
checks. The alpn is just a string, and should not influence the choice
of the xprt.
We'll now make sure to use the server xprt, unless an address is
provided, in which case we'll use whatever xprt matches that address, or
a port, in which case we'll assume we want TCP, and use check_ssl to
know whether we want the SSL xprt or not.
Now that the check contains all that is needed to know which protocol to
look up, always just use that when creating a new check connection if it
is the default check connection, and for now, always use TCP when a
tcp-check or http-check connect rule is used (which means those can't be
used for QUIC so far).
This should hopefully fix github issue #3324.
Commit 1b0dfff552713274b95c81594b153104e215ec81 attempted to make it so
the mux would expect a QUIC-like protocol or not. However, it only ensured
that we would not instantiate a non-QUIC mux on a QUIC protocol, not that
we would not instantiate a QUIC mux on a non-QUIC protocol, so fix that.