Right now we're taking the stick-tables update lock for reads just for
the sake of checking if the update index is past it or not. That's
costly because even taking the read lock is sufficient to provoke a
cache line write, while under load or attack it's frequent that the
update has not yet been propagated and nothing would be needed anyway.
This commit brings a new field to the stksess, "seen", which is zeroed
when the entry is updated, and set to one as soon as at least one peer
starts to consult it. This way it will reflect that the entry must be
updated again so that this peer can see it. Otherwise no update will
be necessary. For now the flag is only set/reset but not exploited.
Great care is taken to avoid writes whenever possible.
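As a rough illustration of the intent (apart from "seen", the surrounding
names are assumptions and not haproxy's exact code):

    /* in struct stksess: 0 = not consulted by any peer since last update */
    struct stksess {
        /* ... existing fields ... */
        int seen;
    };

    /* on a local update, mark the entry as not seen yet */
    HA_ATOMIC_STORE(&ts->seen, 0);

    /* when a peer starts to consult the entry, only write when needed so
     * that the cache line is not dirtied for nothing
     */
    if (!HA_ATOMIC_LOAD(&ts->seen))
        HA_ATOMIC_STORE(&ts->seen, 1);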
Some bwlim error messages at parsing time were missing the trailing '\n'
in commit 2b6777021d ("MEDIUM: bwlim: Add support of bandwith limitation
at the stream level"). This commit can be backported wherever the commit
above is (likely as far as 2.7).
Thanks to previous commit, we can now build with USE_SYSTEMD=1 on any
system without requiring any parts from systemd. It just turns out that
there was one remaining include in haproxy.c that needed to be replaced
with haproxy/systemd.h to build correctly. That's what this commit does.
Given the xz drama which allowed liblzma to be linked to openssh, let's remove
libsystemd to get rid of useless dependencies.
The sd_notify API seems to be stable and is now documented. This patch replaces
the sd_notify() and sd_notifyf() functions with a reimplementation inspired by
the systemd documentation.
This should not change anything functionally. The function will be built when
haproxy is built using USE_SYSTEMD=1.
References:
https://github.com/systemd/systemd/issues/32028
https://www.freedesktop.org/software/systemd/man/devel/sd_notify.html#Notes
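For reference, a minimal libsystemd-free sd_notify() can be sketched like
this, following the sd_notify(3) documentation (an illustrative sketch, not
haproxy's exact implementation in haproxy/systemd.h):

    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* send <message> (e.g. "READY=1") to the socket found in NOTIFY_SOCKET */
    int sd_notify(int unset_environment, const char *message)
    {
        const char *path = getenv("NOTIFY_SOCKET");
        struct sockaddr_un sun = { .sun_family = AF_UNIX };
        socklen_t addrlen;
        int fd, ret = 1;

        if (!path || !*path)
            return 0;                       /* not started by systemd */
        if (strlen(path) >= sizeof(sun.sun_path))
            return -1;

        memcpy(sun.sun_path, path, strlen(path));
        addrlen = offsetof(struct sockaddr_un, sun_path) + strlen(path);
        if (sun.sun_path[0] == '@')         /* abstract socket namespace */
            sun.sun_path[0] = 0;

        fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
        if (fd < 0)
            return -1;
        if (connect(fd, (struct sockaddr *)&sun, addrlen) < 0 ||
            write(fd, message, strlen(message)) < 0)
            ret = -1;
        close(fd);

        if (unset_environment)
            unsetenv("NOTIFY_SOCKET");
        return ret;
    }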
Before:
wla@kikyo:~% ldd /usr/sbin/haproxy
linux-vdso.so.1 (0x00007ffcfaf65000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x000074637fef4000)
libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x000074637fe4f000)
libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x000074637f400000)
liblua5.4.so.0 => /lib/x86_64-linux-gnu/liblua5.4.so.0 (0x000074637fe0d000)
libsystemd.so.0 => /lib/x86_64-linux-gnu/libsystemd.so.0 (0x000074637f92a000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x000074637f365000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000074637f000000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000074637f27a000)
libcap.so.2 => /lib/x86_64-linux-gnu/libcap.so.2 (0x000074637fdff000)
libgcrypt.so.20 => /lib/x86_64-linux-gnu/libgcrypt.so.20 (0x000074637eeb8000)
liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x000074637fdcd000)
libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x000074637ee01000)
liblz4.so.1 => /lib/x86_64-linux-gnu/liblz4.so.1 (0x000074637fda8000)
/lib64/ld-linux-x86-64.so.2 (0x000074637ff5d000)
libgpg-error.so.0 => /lib/x86_64-linux-gnu/libgpg-error.so.0 (0x000074637f904000)
After:
wla@kikyo:~% ldd /usr/sbin/haproxy
linux-vdso.so.1 (0x00007ffd51901000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f758d6c0000)
libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007f758d61b000)
libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007f758ca00000)
liblua5.4.so.0 => /lib/x86_64-linux-gnu/liblua5.4.so.0 (0x00007f758d5d9000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007f758d365000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f758d5ba000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f758c600000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f758c915000)
/lib64/ld-linux-x86-64.so.2 (0x00007f758d729000)
A backport to all stable versions could be considered at some point.
Since faa8c3e ("MEDIUM: lb-chash: Deterministic node hashes based on
server address") the following configuration will cause haproxy to crash:
backend test1
mode http
balance hash int(1)
server s1 haproxy.org:80
This is because the lbprm.update_server_eweight() method is now systematically
called in _srv_set_inetaddr_port() upon srv addr/port change (and with the
above config it happens during startup after initial dns resolution).
However, depending on the chosen lbprm algo, update_server_eweight function
may not be set (it is not a mandatory method, some lb implementations don't
define it).
Thus, using 'balance hash' with map-based hashing or 'balance sticky' will
cause a crash due to a NULL de-reference in _srv_set_inetaddr_port(). To
fix the issue, we first check that the update_server_eweight() method is
set before using it.
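The fix boils down to something like this (a simplified sketch; variable
names are assumptions):

    /* only propagate the weight change if the LB algo implements it */
    if (srv->proxy->lbprm.update_server_eweight)
        srv->proxy->lbprm.update_server_eweight(srv);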
No backport needed unless faa8c3e ("MEDIUM: lb-chash: Deterministic node
hashes based on server address") gets backported.
This issue arrived with this commit:
"MINOR: quic: HyStart++ implementation (RFC 9406)"
Thanks to @chipitsine for having reported this issue in GH #2513.
Should be backported where the previous commit will be backported.
In peer_send_msg(), we take a lock before calling
peer_send_teach_process_msgs() because of the check on the flags and update
indexes, and the function then drops it and takes it again just to resume
in the same situation, so that on return we can drop it again! Not only
is this absurd because it doubles the cost of taking the lock, it's also
totally inefficient because it takes a write lock while the only use
made of it is to read the indexes! Let's drop the lock from
peer_send_teach_process_msgs() and move it explicitly into its only caller
around the condition, and turn it into a read lock only.
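The caller-side change can be sketched as follows (the lock label and field
names are assumptions, not haproxy's exact code):

    int updates_to_push, ret;

    /* take the update lock for reading only, just to check whether this
     * peer still has updates to push for this shared table
     */
    HA_RWLOCK_RDLOCK(STK_TABLE_UPDT_LOCK, &st->table->updt_lock);
    updates_to_push = (st->last_pushed != st->table->update);
    HA_RWLOCK_RDUNLOCK(STK_TABLE_UPDT_LOCK, &st->table->updt_lock);

    if (updates_to_push)
        ret = peer_send_teach_process_msgs(appctx, peer, st);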
The MAX() macro was used to limit the count of bytes to be transferred
in appctx_raw_rcv_buf() by commit ee53d8421f ("MEDIUM: applet: Simplify
a bit API to exchange data with applets") instead of MIN(). It didn't
seem to have any consequences until commit f37ddbeb4b ("MAJOR: cli:
Update the CLI applet to handle its own buffers") that triggers a BUG_ON()
in __b_putblk() when the other side is slow to read, because we're trying
to append a full buffer on top of a non-empty one. A way to reproduce it
is to dump a heavy stick table on the CLI while the screen is scrolling.
No backport is needed since this was introduced in 3.0-dev3 and revealed
after dev5 only.
In 2.9, the stick-tables' locking was split between the lock used to
manipulate the contents (->lock) and the lock used to manipulate the
list of updates and the update indexes (->updt_lock). This was done
with commit 87e072eea5 ("MEDIUM: stick-table: use a distinct lock for
the updates tree"). However a part was overlooked in the peers code,
the parts that consult (and update) the indexes use the table's lock
instead of the update lock. It's surprising that it hasn't caused more
trouble. It's likely due to the fact that the tree nodes are not often
immediately freed and that their memory area remains connected to valid
nodes in the tree during peer_stksess_lookup(), while other parts only
check or update indexes, thus are not that critical.
This needs to be backported wherever the commit above is, thus logically
2.9.
It is only an issue when the kernel splicing is used. The zero-copy
forwarding via the buffers is not affected. When a shutdown is received on
the producer side and some data are blocked in the pipe for a while, the
shutdown may be forwarded to the other side. Usually, in this case, the
shutdown must be scheduled, waiting for all output data (from the channel
and the consumer's iobuf) to be sent. But only the channel was considered.
The bug was introduced by commit 20c463955d ("MEDIUM: channel: don't look at
iobuf to report an empty channel"). To fix the issue, we must also check
data blocked in the consumer's iobuf.
This patch should solve the issue #2505. It must be backported to 2.9.
This is a simple algorithm to replace the classic slow start phase of the
congestion control algorithms. It should reduce the high packet loss during
this step.
Implemented only for Cubic.
According to the documentation, "option redispatch 0" is expected to
disable redispatch just like "no option redispatch", but due to the
fact that it keeps PR_O_REDISP set, it doesn't actually work. Let's
make sure value 0 is properly handled and drops PR_O_REDISP. This can
be backported to all versions since it seems it has been broken since
its introduction in 1.6 with commit 726ab7145c ("MEDIUM: backend: Allow
redispatch on retry intervals").
As a workaround, "no option redispatch" does work though.
In 2.7 we addressed a race condition in the stick tables expiration task
with commit fbb934d ("BUG/MEDIUM: stick-table: fix a race condition when
updating the expiration task"). The issue was that the task could be
running on another thread which would destroy its expiration timer
while another thread had just recalculated it and was preparing to queue it, causing
a bug due to the attempt to queue an expired task. The fix consisted in
enclosing the change into the stick-table's lock, which had a very low
cost since it's done only after having checked that the date changed,
i.e. no more than once every millisecond.
But as reported by Ricardo and Felipe from Taghos in github issue #2508,
a tiny race remained after the fix: the unlock() was done before the call
to task_queue(), leaving a tiny window for another thread to run between
unlock() and task_queue() and erase the timer. As confirmed, it's
sufficient to also protect the task_queue() call.
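In other words, the fix keeps task_queue() inside the protected section,
roughly like this (a simplified sketch around the names mentioned above;
the exact surrounding code differs):

    HA_RWLOCK_WRLOCK(STK_TABLE_LOCK, &t->lock);
    if (!tick_isset(t->exp_task->expire) ||
        tick_is_lt(new_exp, t->exp_task->expire)) {
        t->exp_task->expire = new_exp;
        task_queue(t->exp_task);       /* now done before unlocking */
    }
    HA_RWLOCK_WRUNLOCK(STK_TABLE_LOCK, &t->lock);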
But overall this raises a point regarding the task_queue() API on tasks
that may run anywhere. A while ago an attempt was made at removing the
timer for woken up tasks, but something like this would deserve to be
revisited with more atomicity on the timer manipulation (e.g. maybe
atomically use task_schedule() instead). This should be backported to all
stable branches.
Motivation: When services are discovered through DNS resolution, the order in
which DNS records get resolved and assigned to servers is arbitrary. Therefore,
even though two HAProxy instances using chash balancing might agree that a
particular request should go to server3, it is likely the case that they have
assigned different IPs and ports to the server in that slot.
This patch adds a server option, "hash-key <key>" which can be set to "id" (the
existing behaviour, default), "addr", or "addr-port". By deriving the keys for
the chash tree nodes from a server's address and port we ensure that independent
HAProxy instances will agree on routing decisions. If an address is not known
then the key is derived from the server's puid as it was previously.
When adjusting a server's weight, we now check whether the server's hash has
changed. If it has, we have to remove all its nodes first, since the node keys
will also have to change.
A compilation error occurs when using DEBUG_MEM_STATS due to a variable
now being unused in debug_iohandler_memstats() :
src/debug.c: In function ‘debug_iohandler_memstats’:
src/debug.c:1862:24: error: unused variable ‘sc’ [-Werror=unused-variable]
1862 | struct stconn *sc = appctx_sc(appctx);
| ^~
This has been caused by the following commit:
94b8ed446f
MEDIUM: cli/applet: Stop to test opposite SC in I/O handler of CLI commands
This must not be backported.
log load-balancing implementation was not seamlessly integrated within
lbprm API. The consequence is that it could become harder to maintain
over time since it added some specific cases just for the log backend.
Moreover, it resulted in some code duplication since balance algorithms
that are common to logs and regular (tcp, http) backends were specifically
rewritten for log backends.
Thanks to the previous commit, we now have all the prerequisites to make
log load-balancing fully leverage lbprm logic. Thus in this patch we make
__do_send_log_backend() use existing lbprm algorithms, and we no longer
require log-specific lbprm initialization in cfgparse.c and in
postcheck_log_backend().
As a bonus, for log backends this allows weighted algorithms to properly
support weights (i.e. roundrobin, random and log-hash) since we now
leverage the same lb algorithms that we use for tcp/http backends
(doc was updated).
As previously mentioned in cd352c0db ("MINOR: log/balance: rename
"log-sticky" to "sticky""), let's define a sticky algorithm that may be
used from any protocol. Sticky algorithm sticks on the same server as
long as it remains available.
The documentation was updated accordingly.
b61147fd ("MEDIUM: log/balance: merge tcp/http algo with log ones")
introduced some ambiguities, because while it shares some algos with the
ones from mode {tcp,http}, we forgot to report an error when the user tries
to use an algorithm that is not available in this mode (as per the doc).
Because of that, haproxy would silently drop log messages during runtime.
To fix that, we ensure that algo is one of the supported ones during log
backend postparsing. If the algo is not supported, we raise an error.
This should be backported to 2.9 with b61147fd.
The CLI applet is now using its own snd_buf callback function. Instead of
copying as much output data as possible, only one command is copied at a
time.
To do so, a new state CLI_ST_PARSEREQ is added for the CLI applet. In this
state, the CLI I/O handler knows a full command was copied into its input
buffer and it must parse this command to evaluate it.
This flag can be used by endpoints to know that the data to send, via the
.snd_buf callback function, are the last ones. It is useful to know a
shutdown is pending but cannot be delivered as long as the data being sent
are not consumed.
It is now the responsibility of the applets' .snd_buf callback function to
notify that the input buffer is full. This will allow the applets to not
consume all data while waiting for more data. Of course, it is only useful
for applets using a custom .snd_buf callback function.
It is the third applet to be refactored to use its own buffers. In addition to
the CLI applet, some I/O handlers of CLI commands were also updated, especially
the stats ones.
Some command I/O handlers were updated to use applet's buffers instead of
channels ones.
It is a harmless bug for now because only the stats and cache applets are using
their own buffers and it is not possible to trigger this bug with these
applets. However, it remains important to try a receive if EOI, EOS or ERROR
is reached by the applet while no data was produced. Otherwise, it is not
possible to ack these events at the SE level.
No backport needed.
The main CLI I/O handler is responsible for interrupting the processing on
shutdown/abort. It is not the responsibility of the I/O handlers of CLI
commands to take care of it.
Instead of using connection versions, we now use generic versions. It means
we will also perform sync receives and sync sends on applets too, but only
for applets using their own buffers. Old applets are not concerned.
sc_sync_recv() and sc_sync_send() were added to use connection or applet
versions, depending on the endpoint type. For now these functions are not
used. But this will be used by process_stream() to replace the connection
version.
A command larger than a buffer was silently rejected by the CLI
applet. It was handled as an error and the connection was closed, but no
error message was reported to notify the user. Now an error is reported
before closing. It is only displayed if the chunk buffer used by the CLI
applet is full and no delimiter (\n or ;) is found to mark the end of the
command. It works for a simple command but also for a command with a huge
payload.
This patch could be backported to all stable versions.
This commit allows "cookie" keyword for dynamic servers. After code
review, nothing was found which could prevent a dynamic server to use
it. An extra warning is added under cli_parse_add_server() if cookie
value is ignored due to a non HTTP backend.
This patch is not considered a bugfix. However, it may be backported if
needed as its impact seems minimal.
When adding a server dynamically, we observe that when a backend has a
dynamic persistence cookie, the new server has no cookie as we receive
the following HTTP header:
set-cookie: test-cookie=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/
Whereas we were expecting to receive something like the following, which
is what we receive for a server added in the config file:
set-cookie: test-cookie=abcdef1234567890; path=/
After investigating code path, srv_set_dyncookie() is never called when
adding a server through CLI, it is only called when parsing config file
or using "set server bkd1/srv1 addr".
To fix this, call srv_set_dyncookie() inside cli_parse_add_server().
This patch must be backported up to 2.4.
Since their first implementation, dynamic servers are created into
maintenance state. This has been done purposely to avoid immediate
activation of a newly inserted server.
However, this principle is incompatible if "enabled" keyword is used on
"add server". The newly created instance will be unreacheable as proxy
load-balancing algorithm is not informed of its presence via
srv_lb_propagate(). The new server could be unblocked by toggling its
state with "disable server" / "enable server" commands, which will
trigger srv_lb_propagate() invocation.
To avoid this unexpected state, simply forbid "enabled" keyword for
dynamic servers. In the long term, it could be possible to re-authorize
it but at least this requires to call srv_lb_propagate() on dynamic
server creation.
This should fix github issue #2497.
This patch should not be backported as-is, to avoid breaking dynamic
servers API on stable versions. "enabled" should instead be ignored for
them. This will be implemented in a dedicated patch on top of 2.9.
This option can be used to set a default ocsp-update mode for all
certificates of a given conf file. It allows to activate ocsp-update on
certificates without the need to create separate crt-lists. It can still
be superseded by the crt-list 'ocsp-update' option. It takes either "on"
or "off" as value and defaults to "off".
Since setting this new parameter to "on" would mean that we try to
enable ocsp-update on any certificate, and also certificates that don't
have an OCSP URI, the checks performed in ssl_sock_load_ocsp were
softened. We don't systematically raise an error when trying to enable
ocsp-update on a certificate that does not have an OCSP URI, be it via
the global option or the crt-list one. We will still raise an error when
a user tries to load a certificate that does have an OCSP URI but a
missing issuer certificate (if ocsp-update is enabled).
The inconsistencies in 'ocsp-update' parameter were only checked when
parsing a crt-list line so if a certificate was used on a bind line
after being used in a crt-list with 'ocsp-update' set to 'on', then no
error would be raised. This patch helps detect such inconsistencies.
This patch can be backported up to branch 2.8.
In a crt-list such as the following:
foo.pem [ocsp-update off] foo.com
foo.pem bar.com
we would get a wrong "Incompatibilities found in OCSP update mode ..."
error message during init when the two lines are actually saying the
same thing since the default for 'ocsp-update' option is 'off'.
This patch can be backported up to branch 2.8.
A recent issue was uncovered by the CI which started to randomly report
segfaults on a few tests, and more systematically on FreeBSD. It turns
out that it was introduced by recent commit 03816ccfa9 ("MAJOR: ring:
insert an intermediary ring_storage level"), which overlooked the munmap()
path of the sink and startup logs: once the ring and its storage were
split, it was no longer correct to munmap() the ring, only its storage
area needs to be unmapped, and the ring must always be freed separately.
Thanks to Christopher and William for their help at trying to reproduce
it and figure the circumstances that triggers it.
No backport is needed.
Sebastien Gross reported that 'interface' keyword ('source' subargument)
is silently ignored when used from 'default-server' directive despite the
documentation implicitly stating that the keyword should be supported
there.
When support for 'source' keyword was added to 'default-server' directive
in dba97077 ("MINOR: server: Make 'default-server' support 'source'
keyword."), we properly duplicated the conn iface_name from the default-
server but we forgot to copy the conn iface_len which must be set as well
since it is used as setsockopt()'s 'optlen' argument in
tcp_connect_server().
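The fix is essentially a one-liner in the default-server settings copy
(a sketch assuming the usual conn_src layout):

    /* copy both the interface name and its length from the default server */
    if (defsrv->conn_src.iface_name)
        srv->conn_src.iface_name = strdup(defsrv->conn_src.iface_name);
    srv->conn_src.iface_len = defsrv->conn_src.iface_len;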
It should be backported to all stable versions.
OpenSSL 3.2 triggers the code part added by commit 25da217 ("MINOR: ssl:
Update ssl_fc_curve/ssl_bc_curve to use SSL_get0_group_name") which
contains a variable declaration in the for() statement and breaks on
older compilers, as reported in GH issue #2501.
Let's just declare it normally to fix the problem. This must be
backported wherever the commit above is (at least 2.9).
We don't care in what order the threads are released, so we can write
their sent value using relaxed atomic stores. This brings a 3-5% perf
boost on ARM with 80 cores, reaching 7.25M/s, and doesn't change
anything on x86 since it keeps using strict ordering.
It has been found that performing a first pass consisting in copying
all messages, and a second one to notify about releases is more efficient
on AMD than updating all of them on the fly using a CAS, despite making
writers wait longer to be released.
Maybe it's related to the ability for the CPU to prefetch the contents
during a simple load while it wouldn't do it for an XCHG, it's unsure
at this point. This will also later permit the use of relaxed stores to
release threads.
On ARM the performance increased to 7.0M/s. If this patch is applied
before the dropping of the intermediary step, instead it drops to
3.9M/s. This shows the dependency between such changes that strive to
limit the number of writes on the fast path.
On x86_64, the EPYC at 3C6T saw a small drop from 4.57M to 4.45M, but
the 24C48T setup saw a nice 33% boost from 3.33M to 4.44M, i.e. we
get stable perf at 3 and 24 cores, despite having 8 CCX involved and
fighting with each other.
Other possibilities are:
- use of HA_ATOMIC_XCHG() instead of FETCH_OR()
=> slightly faster (4.62/7.37 vs 4.58/7.34). Pb: requires to
modify the readers to wait much longer since the tail value
won't be valid in this case during updates, and it will have
to wait by looping over it.
- use other conditions to release a cell
=> to be tested
Archs relying on CAS benefit from a read prior to FETCH_OR, so it's
not just x86 that benefits from this. Let's just change the condition
to only exclude __ARM_FEATURE_ATOMICS which is the only one faster
without.
The loop was cleaned up a little bit so that the inner loops are more
readable and that the ifdef'd parts are whole blocks and not just an
"if" condition. A few conditions were adjusted to benefit from "break"
and "continue".
This is mostly a cleanup in that it turns the two-level loop into a
single one, but it also simplifies the code a little bit and brings
some performance savings again, which are mostly noticeable on ARM,
but don't change anything for x86.
x86_64 doesn't have a native atomic FETCH_OR(), it's implemented using
a CAS, which will always cause a write cycle. Here we know we can just
wait as long as the lock bit is held so better loop on a load, and only
attempt the CAS on success. This requires a tiny ifdef and brings nice
benefits. This brings the performance back from 3.33M to 3.75M at 24C48T
while doing no change at 3C6T.
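The idea looks roughly like this (an illustrative fragment; RING_TAIL_LOCK
and tail_ofs are assumed names, not haproxy's exact code):

    uint64_t curr;    /* tail_ofs is the shared tail offset word */

    while (1) {
    #if defined(__x86_64__)
        /* FETCH_OR is a CAS loop here, so wait on a cheap load first and
         * only attempt the CAS once the lock bit appears to be released
         */
        while ((curr = HA_ATOMIC_LOAD(&tail_ofs)) & RING_TAIL_LOCK)
            __ha_cpu_relax();
        if (HA_ATOMIC_CAS(&tail_ofs, &curr, curr | RING_TAIL_LOCK))
            break;
    #else
        curr = HA_ATOMIC_FETCH_OR(&tail_ofs, RING_TAIL_LOCK);
        if (!(curr & RING_TAIL_LOCK))
            break;
        __ha_cpu_relax();
    #endif
    }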
By doing that and placing the cpu_relax at the right places, the ARM
reaches 6.0M/s on 80 threads. On x86_64, at 3C6T the EPYC sees a small
increase from 4.45M to 4.57M but at 24C48T it sees a drop from 3.82M
to 3.33M due to the write contention hidden behind the CAS that
implements the FETCH_OR(), that we'll address next.
The queue-based approach consists in forcing threads to wait away from
the work area so as not to disturb the current writer, and to prepare
the work by grouping them in a queue. The last arrived takes the head
of the queue by placing its preinitialized ring cell there, becomes the
queue's leader, informs itself about the amount of previously accumulated
bytes so that when its turn comes, it immediately knows how much room is
needed to be released.
It can then take the whole queue with it, leaving an empty one for new
threads to come while it's releasing the room needed to copy everything.
By doing so we're cascading contention areas so that multiple parts can
work in parallel.
Note that we must never leave a write counter set to 0xFF at tail, and
this happens when a message cannot fit and we give up, because in this
case we're writing back tail_ofs, and only later we restore the counter.
The solution here is to make a special case when we're going to drop
the messages, and to write the readers count before restoring tail.
This already shows a tremendous performance gain on ARM (385k -> 4.8M),
thanks to the fact that now all waiting threads wait on the queue's
head instead of polluting the tail lock. On x86_64, the EPYC sees a big
boost at 24C48T (1.88M -> 3.82M) and a slowdown at 3C6T (6.0->4.45)
though this one is much less of a concern as so few threads need less
bandwidth than bigger counts.
Now the rings have one wait queue per group. This should limit the
contention on systems such as EPYC CPUs where the performance drops
dramatically when using more than one CCX.
Tests were run with different numbers and it was shown that value
6 outperforms all other ones at 12, 24, 48, 64 and 80 threads on an
EPYC, a Xeon and an Ampere CPU. Value 7 sometimes comes close and
anything around these values degrades quickly. The value has been
left tunable in the global section.
This commit only introduces everything needed to set up the queue count
so that it's easier to adjust it in the forthcoming patches, but it was
initially added after the series, making it harder to compare.
It was also shown that trying to group the threads in queues by their
thread groups is counter-productive and that it was more efficient to
do that by applying a modulo on the thread number. As surprising as it
seems, it does have the benefit of well balancing any number of threads.
Code disassembly shows that ring->storage->tail and ring->queue are
accessed a lot and reloaded a lot due to aliasing. Let's just have
variables for them in the local stack. It makes the code smaller and
slightly faster.
It's inefficient and counter-productive that each ring writer iterates
over all readers to wake them up. Let's just have one in charge of this,
it strongly limits contention. The only thing is that since the thread
is iterating over a list, we want to be sure that if the first readers
have already completed their job, they will be woken up again. For this
we keep a counter of messages delivered after the wakeup started, and
the waking thread will check it before going back to sleep. In order to
avoid looping forever, it will also drop its waking flag soon enough to
possibly let another one take it.
There used to be a few watchdog reports on this list iteration on a
24-core AMD EPYC platform before this change; those never appeared
anymore afterwards.
The perf has dropped a bit on 3C6T on the EPYC, from 6.61 to 6.0M but
remains unchanged at 24C48T.
If there's nothing to read, it's pointless for a reader to try to update
the offset pointer, that's two atomic ops to replace a value by itself
twice. Let's just stop this.
It was only used to protect the list which is now an mt_list so it
doesn't provide any required protection anymore. It obviously also
used to provide strict ordering between the writer and the reader
when the writer started to update the messages, but that's now
covered by the ordered tail updates and updates to the readers
count to protect the area.
The message rate on small thread counts (up to 12) saw a boost of
roughly 5%, while for large counts it lost about 2% due to some
contention now becoming visible elsewhere.
Typical measures are 6.13M -> 6.61M at 3C6T, and 1.88 -> 1.92M at
24C48T on the EPYC.
The writer is using tags 0xFF instead of readers count at the front of
messages that are undergoing an update, while the tail has already been
updated. The reader needs to take care of this because it can face these
messages and mistakenly parse data that's still being written, leading
to corruption (especially if this happens while the size is changing).
Let's just stop reading when facing reserved codes, since they indicate
that the end of usable messages was reached.
Since we're going to remove the lock, there's no more way to prevent the
ring from being fed while we're attaching a client to it. We need to
freeze the buffer while looking at its head so that we can attach there
and have a trustable one. We could do it by setting the lock bit on the
tail offset but quite frankly we don't need to bother with that, attaching
a client is rare enough to permit a thread_isolate().
Rings are keeping a lock only for the list, which apparently doesn't
need anything more than an mt_list, so let's first turn it into that
before dropping the lock. There should be no visible effect.
There's no point looking for freshly attached readers if there are none,
taking this lock requires an atomic write to a shared area, something we
clearly want to avoid.
A general test with 213-byte messages on different thread counts shows
how the performance degrades across CCX and how this patch improves the
situation:
Before After
3C6T/1CCX: 6.39 Mmsg/s 6.35 Mmsg/s
6C12T/2CCX: 2.90 Mmsg/s 3.16 Mmsg/s
12C24T/4CCX: 2.14 Mmsg/s 2.33 Mmsg/s
24C48T/8CCX: 1.75 Mmsg/s 1.92 Mmsg/s
This tends to confirm that the queues will really be needed and that
they'll have to be per-ccx hence per thread-group. They will amortize
the number of updates on head & tail (one per multiple messages).
We know we can continue to protect the message area so we can unlock the
tail as soon as we know its new value. Now we're seeing ~6.4M msg/s vs
5.4M previously on 3C6T of a 3rd gen EPYC, and 1.88M vs 1.54M for 24C48T
threads, which is a significant gain!
This requires to carefully write the new head counter before releasing
the writers, and to change the calculation of the work area from
tail..head to tail...new_tail while writing the message.
Now the lock is only taken around the readers list. With careful
ordering of writes to head/tail, the ring remains protected.
The perf is a bit better, though (1.54M msg/s vs 1.4M at 48T on
a 3rd gen EPYC, and 5.4M vs 5.3M for a 3C6T setup).
We're now locking the tail while looking for some room in the ring. In
fact it's still while writing to it, but the goal definitely is to get
rid of the lock ASAP. For this we reserve the topmost bit of the tail
as a lock, which may have as a possible visible effect that buffers will
be limited to 2GB instead of 4GB on 32-bit machines (though in practice,
good luck allocating more than 2GB contiguous on 32-bit), but in
practice since the size is read with atol() and some operating systems
limit it to LONG_MAX unless passing negative numbers, the limit is
already there.
For now the impact on x86_64 is significant (drop from 2.35 to 1.4M/s
on 48 threads on EPYC 24 cores) but this situation is only temporary
so that changes can be reviewable and bisectable.
Other approaches were attempted, such as using XCHG instead, which is
slightly faster on x86 with low thread counts (but causes more write
contention), and forces readers to stall under heavy traffic because
they can't access a valid value for the queue anymore. A CAS requires
preloading the value and is less good on ARMv8.1. XADD could also be
considered with 12-13 upper bits of the offset dedicated to locking,
but that looks overkill.
The reader now needs to protect the positions it's reading. This is
already done via the readers counter at the beginning of messages,
but as long as the lock is present, this counter is decremented
before starting to parse messages, and incremented at the end.
We must now do that in reverse, first protect the end of the messages,
and only then remove ourselves from the already processed messages, so
that at no point could a writer pass over and possibly overwrite data
we're currently watching.
The goal here is to start to protect the writing area inside the area
itself so that we'll later be able to release the ring's lock. We're not
there yet, but at least the tail is marked as protected for as long as the
message is not fully written.
We'll want to reserve some special values for the readers count to
temporary lock the following message, but for this it will be mandatory
that readers check for them before incrementing/decrementing the counter.
Let's do that using a CAS. The readers' performance is not as critical as
the writer's anyway so the slight overhead is not a problem.
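A minimal sketch of the reader-side CAS (the macro name for the reserved
values is an assumption):

    /* ptr points to the single-byte readers count in front of the message;
     * only grab a reader slot if it is not one of the values reserved by
     * writers to lock the area
     */
    uint8_t cnt = HA_ATOMIC_LOAD(ptr);
    do {
        if (cnt >= RING_MAX_READERS)    /* reserved: area being written */
            break;
    } while (!HA_ATOMIC_CAS(ptr, &cnt, (uint8_t)(cnt + 1)));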
The purpose is to store a head and a tail that are independent so that
we can further improve the API to update them independently from each
other.
The struct was arranged like the original one so that as long as a ring
has its head set to zero (i.e. no recycling) it will continue to work.
The new format is already detectable thanks to the "rsvd" field which
indicates the number of reserved bytes at the beginning. It's located
where the buffer's area pointer previously was, so that older versions
of haring can continue to open the ring in repair mode, and newer ones
can use the fact that the upper bits of that variable are zero to guess
that it's working with the new format instead of the old one. Also let's
keep in mind that the layout will further change to place some alignment
constraints.
The haring tool is thus updated based on this: it detects that the
rsvd field is smaller than a page and that the sum of it and the size
equals the mapped size, in which case it uses the new dump_v2() function
instead of dump_v1(). The new function also creates a buffer from the
ring's area, size, head and tail and calls the generic one so that no
other code had to be adapted.
The code now looks cleaner and more easily shows what still needs to be
addressed. There are not that many changes in practice, these are mostly
mechanical, essentially hiding the buffer from the callers.
This is the start of the replacement of the buffer API calls. Only the
ring_write() function was touched. Instead of manipulating a buffer all
along, we now extract the ring buffer's head and tail upon entry, store
them locally and use them using the vec<->ring API until the last moment
where we can update the buffer with the new values. One subtle point is
that we must never fill the buffer past the last byte otherwise the
vec-to-ring conversion gets lost and there's no more possibility to know
where's the beginning nor the end (just like when dealing with head+tail
in fact), because it then becomes impossible to distinguish between an
empty and a full buffer.
In ring_resize() we used to check if the new ring was at least as large
as the previous one before resizing it, but what counts is that it's as
large as the previous one's contents. Initially it was thought this
would not really matter, but given that rings are initially created as
BUFSIZE, it's currently not possible to shrink them for debugging
purposes. Now with this change it is.
The ring resizing was already quite tricky, but when facing atomic
writes it will no longer be possible and we definitely do not want to
have to deal with a lock there. Since it's only done at boot time, and
possibly later from the CLI, let's just do it under thread isolation.
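Something along these lines (a sketch; the isolation could equally well
live inside the resize function itself):

    /* stop all other threads while the ring storage is being swapped */
    thread_isolate();
    ring_resize(ring, new_size);
    thread_release();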
We'll need to add more complex structures in the ring, such as wait
queues. That's far too much to be stored into the area in case of
file-backed contents, so let's split the ring definition and its
storage once for all.
This patch introduces a struct ring_storage which is assigned to
ring->storage, which contains minimal information to represent the
storage layout, i.e. for now only the buffer, and all the rest
remains in the ring itself. The storage is appended immediately after
it and the buffer's pointer always points to that area. It has the
benefit of remaining 100% compatible with the existing file-backed
layout. In memory, the allocation loses the size of a struct buffer.
It's not even certain it's worth placing the size there, given that it's
constant and that a dump of a ring wouldn't really need it (the file size
is sufficient). But for now everything comes with the struct buffer, and
later this will change once split into head and tail. Also this area may
be completed with more information in the future (e.g. storage version,
format, endianness, word size etc).
Till now we used to rely on a heuristic pointer comparison to check if
a ring was mapped or allocated. Better assign a flag to clarify this
because it's going to become difficult otherwise.
Some open-coded constructs were updated to make use of the ring accessors
instead. This allows to remove some direct dependencies on the buffers
API a bit more.
In startup_logs_dup() we currently need to reference the ring's buffer,
better not do this as it will complicate operations when switching to
other types.
The ring_write() function uses confusing variable names: totlen is in
fact the length of the message, not the total length that is going to
be written. Let's rename it msglen and have a real "needed" that
corresponds to the total size we're going to write. We also add a
BUG_ON_HOT() to catch mistakes causing discrepancies.
In order to support concurrent writers we'll need to lock areas in the
buffer. For this we'll use one special value of the single-byte readers
count. Let's reserve it now and use the macro instead of the hardcoded
255.
The goal is to remove references to the buffer's head and tail in the
fast path so that we can release the lock during some reads. This means
no more comparisons with b_data() nor operations relative to b_head()
will be possible anymore. As a first step we need to have an absolute
offset in the buffer, and to use b_getblk_ofs() in the applet callbacks
to retrieve the data based on this.
This code becomes even simpler and almost does not need any knowledge
of the structure of the ring anymore. It even highlighted that an old
race had not been fixed due to code duplication, but that's now done.
This new function is made around the loop that scans a ring for new
messages and dispatches them to a message handler. It also takes
ring flags (WAIT, NEW, etc) and offset pointers that the caller will
use to initialize/reuse/update the current processing offset. The
caller is still responsible for presetting it to ~0 before the
first call if it wants the function to automatically adjust it (or set
it to the correct value). The function may also return the last_ofs
that was known before releasing the lock so that the caller knows
what to compare against and if it needs to restart processing or not.
The context remains a void* so that should not necessarily depend on
an appctx.
The current "show ring" code was ported to this and it continues to
work as expected.
A ring is used for the DNS code but slightly differently from the generic
one, which prevents some important changes from being made to the generic
code without breaking DNS. As the use cases differ, it's better to just
split them apart for now and have the DNS code use its own ring that we
rename dns_ring and let the generic code continue to live on its own.
The unused parts such as CLI registration were dropped, resizing and
allocation from a mapped area were dropped. dns_ring_detach_appctx() was
kept despite not being used, so as to stay consistent with the comments
that say it must be called, despite the DNS code explicitly mentioning
that it skips it for now (i.e. this may change in the future).
Hopefully after the generic rings are converted the DNS code can migrate
back to them, though this is really not necessary.
The ring reader code was duplicated as-is in 2.2 for the ring forwarding
code in commits 494c505703 ("MEDIUM: ring: add server statement to forward
messages from a ring") and 975564784f ("MEDIUM: ring: add new srv statement
to support octet counting forward") (which only differs by using a prefix
instead of a suffix to delimit messages).
Unfortunately, that makes it almost impossible to rework the core ring
code because all these parts rely on it. This first commit aims at
restoring a common structure for the core loop by just calling a distinct
function based on the use case. The functions are either
applet_append_line() when a whole line is to be emitted followed by an LF
character, or syslog_applet_appent_event() when trying to send a TCP
syslog line prepended with its size in decimal.
There is no functional change beyond this.
This function takes a buffer on input, and offset and a length, and
consumes the block from that buffer to send it to the appctx's output
buffer. Contrary to its sibling applet_append_line(), instead of just
appending an LF at the end of the line, it prepends the message size
in decimal and a space before the message, as expected by syslog TCP
implementations. This will be used to simplify the ring reader code.
This function takes a buffer on input, and offset and a length, and
consumes the block from that buffer to send it to the appctx's output
buffer. This will be used to simplify the ring reader code.
This new command, enabled only with "DEBUG_DEV", sends 2 or 20 traces
per task wakeup (depending on the verbosity level), and stops after 1M
wakeups per thread in order not to have to stop/start the process each
time it's fired.
We have two small messages and 18 larger ones from 20 to 270 bytes
each, so that the average size is approx 213 bytes counting headers
(the header adds approx 82 bytes), which matches what's generally
observed on average when traces are enabled in all muxes.
Typical figures show variations between 5.7M and 6.2M msg/s on an EPYC
in a 3C6T setup (single CCX), and 2.12M - 2.22M in a 24C48T setup
(across 8 CCX, with 8 thread groups).
In http_7239_extract_{ipv4,ipv6}, we declare a local buffer in order to
use inet_pton() since it requires a valid destination argument (cannot be
NULL). Then, if the caller provided <ip> argument, we copy inet_pton()
result (from local buffer to <ip>).
In fact when the caller provides <ip>, we may directly use <ip> as
inet_pton() dst argument to avoid a useless copy. Thus the local buffer
is only relevant when the user doesn't provide <ip>.
While at it, let's add a missing testcase for the rfc7239_n2nn converter
(to check that http_7239_extract_ipv4() with <ip> provided works properly).
This could be backported to 2.8 with b2bb925 ("MINOR: proxy/http_ext:
introduce proxy forwarded option").
Willy reported that since 3ac79b504 ("MEDIUM: server:
make server_set_inetaddr() updater serializable"), haproxy fails to
compile on some older compilers such as gcc-4.4 with this kind of error:
src/server.c: In function 'snr_resolution_cb':
src/server.c:4471: error: unknown field 'dns_resolver' specified in initializer
compilation terminated due to -Wfatal-errors.
make: *** [Makefile:1006: src/server.o] Error 1
This is due to referencing a member inside an anonymous union from a compound
literal assignment. Apparently such use of anonymous unions wasn't properly
supported back then on older compilers. To fix the issue, we give the name "u"
to the parent union and use this name to explicitly refer to the union where
relevant in the code (only a few changes fortunately).
The fix itself was verified to restore build compatibility with gcc 4.4
(and even 4.2).
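The underlying compiler limitation can be illustrated with a tiny standalone
example (hypothetical types, unrelated to haproxy's own structures):

    /* On gcc < ~4.6, a designated initializer cannot reference a member of
     * an anonymous union; naming the union "u" and going through it works.
     */
    struct updater {
        int type;
        union {
            int dns_ns_id;
            int cli_token;
        } u;                               /* previously anonymous */
    };

    struct updater up = { .type = 1, .u.dns_ns_id = 42 };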
As 3ac79b504 is used as a prerequisite for 64c9c8ef3 ("BUG/MINOR:
server/dns: use server_set_inetaddr() to unset srv addr from DNS"), please
consider backporting this patch too if 64c9c8ef3 happens to be backported
in 2.9.
Trailers are skipped by the FCGI multiplexer. However empty chunked messages
are not properly handled. It may be a chunked H1 request with no payload or
an H2/H3 POST request with no payload. In that case, the EOT HTX block is
just ignored. The issue is that the EOM flag is thus ignored too. It means
no empty STDIN record is sent to mark the end of the request to the server.
To fix the issue, when an EOT HTX block is found and it is the last HTX block
of the message (and it should be), the EOM flag is tested. If it is found,
an empty STDIN record is emitted.
This patch should fix the issue #2499. It must be backported as far as 2.4.
QUIC MUX is freed via qcc_release(). This in turn releases all the
remaining QCS instances. For each one of them, their corresponding
stream-desc is released via qc_stream_desc_release().
This last function may itself notify QUIC MUX when new buffers are
available. This is useful when QCS are closed individually without the
whole connection. However, when the connection is closed through
qcc_release(), this may cause issues as some elements of QUIC MUX are
already freed.
In 2.9.6, a bug was detected directly linked to this. Indeed, QCC
instance may be woken up on stream-desc release. If called through
qcc_release(), this is an issue because QCC tasklet is freed before QCS
instances. However, this bug is not systematic and relies on prior
conditions : in particular, QUIC MUX must be under Tx buffers exhaustion
prior to the qcc_release() invocation.
The current dev tree is not impacted by this bug, thanks to QUIC MUX
refactoring. Indeed, notification across layers has changed and now
stream-desc release notifies individual QCS instances instead of the QCC
element, which is a safer mechanism. However, to simplify backport
process, bugfix is introduced in the current dev tree as it does not
have any impact.
Note that a proper fix would be to set quic-conn MUX state to
QC_MUX_RELEASED. However, it is not possible to call quic_close()
without having released all stream-desc elements first. The simpler
solution was chosen to prevent other breaking issues during backports.
This should fix github issue #2494.
It should be backported up to 2.6. Note that prior to 2.7 qcc_release()
was named qc_release().
This commit is similar to the following one:
65ae241dcfe710e1cdd3ec4e7a9bde38d2e4c116
MEDIUM: server: close idle conn before server deletion
This patch implements a similar logic, this time to close private idle
connections stored in sessions. The principle is identical to the above
commit : conn_release() is used on idle connections after a takeover to
ensure thread safety.
An extra change was required to be able to execute takeover on such
connections. Their original thread ID was unknown, contrary to non
private connections which are stored in sharded lists. As such, a new
tid member has been added under sess_priv_conns chaining element.
When a backend connection is marked as idle, a special flag TASK_F_USR1
is set on MUX tasklet. When MUX tasklet is reactivated, extra checks are
executed under this flag to ensure no takeover occurred in the meantime.
Previously, only non-private connections could be targeted by a
takeover. However, this will change when implementing private idle
connections closure on "delete server" CLI handler. As such, TASK_F_USR1
is now also set for private connections in MUX detach callbacks.
To be able to delete a server, a number of preconditions must be
validated to ensure it is not in use anymore. Previously, if idle
connections were stored in the server, the deletion was cancelled. No
action was implemented to force idle connection closure, the only
solution was to wait for the periodic purging to be achieved.
This is an extra burden to be able to delete a server. Indeed, idle
connections are by definition inactive and can be closed prior to deleting
a server. This is the exact purpose of this patch.
Idle connections removal is implemented inside "delete server" handler,
once it has been determined that the server can be freely removed. A
simple loop is run to call conn_release() over each idle connection.
Takeover is also executed before conn_release() to ensure tasks/tasklets
or any other sensible elements are not deleted from a foreign thread.
This patch should reduce the occurrence of rejected "delete server"
execution, especially when connection reuse is high.
Extend takeover API both for MUX and XPRT with a new boolean argument
<release>. Its purpose is to signal if the connection will be freed
immediately after the takeover, rendering new resource allocation
unnecessary.
For the moment, the release argument is always false. However, it will be
set to true on delete server CLI handler to proactively close server
idle connections.
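The extended prototype looks like this (a sketch; the exact declaration in
the mux_ops/xprt_ops structures may differ slightly):

    /* <release> is non-zero when the connection will be freed right after
     * the takeover, so the callee may skip allocating replacement resources
     */
    int (*takeover)(struct connection *conn, int orig_tid, int release);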
Several places reuse the same code to ensure a connection is properly
freed, either via its MUX or by calling the proper set of functions.
Factorize all of this in a new function conn_release().
This new function is now called via session_free() and
session_accept_fd(). It will also be reused on delete server to
proactively close idle connections.
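A rough sketch of what such a helper looks like (the exact body in
src/connection.c may differ):

    /* Free a connection, going through its mux when there is one so that
     * all attached resources are released consistently.
     */
    void conn_release(struct connection *conn)
    {
        if (conn->mux) {
            /* the mux is responsible for destroying everything */
            conn->mux->destroy(conn->ctx);
        }
        else {
            conn_stop_tracking(conn);
            conn_full_close(conn);
            conn_free(conn);
        }
    }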
Those fetches were undocumented and were just there so that the
ocsp-update log could be made through a regular log format. But since
the logging is now "handmade" (since BUG/MEDIUM: ssl: Fix crash in
ocsp-update log function), we don't need those anymore.
Since commit "BUG/MEDIUM: ssl: Fix crash in ocsp-update log function",
some information in the log line is "faked" because it cannot actually be
retrieved anymore (or never could). We should then remove it from the log
line altogether instead of providing some useless fields.
We then only keep pure OCSP-update information in the log line:
"<certname> <status> <status str> <fail count> <success count>"
The ocsp-update logging mechanism was built around the 'sess_log'
function which required to keep a pointer to the said session until the
logging function could be called. This was made by keeping a pointer to
the appctx returned by the 'httpclient_start' function. But this appctx
lives its life on its own and might be destroyed before
'ssl_ocsp_send_log' is called, which could result in a crash (UAF).
Fixing this crash requires to stop using the 'sess_log' function to emit
the ocsp-update logs. The log line will then need to be built by hand
out of the information actually available when 'ssl_ocsp_send_log' is
called. Since we don't use the "regular" logging functions anymore, we
don't need to use the error_logformat anymore. In order to keep a
behavior consistent with before, we will keep the same format for the
logs but replace the fields that required a 'sess' pointer with fake
values (the %ci:%cp for instance, which was never filled anyway).
This crash was raised in GitHub issue #2442.
It should be backported up to branch 2.8.
The CLI command "update ssl ocsp-response" was forcefully removing an
OCSP response from the update tree regardless of whether it used to be
in it beforehand or not. But since the main OCSP update task works by
removing the entry being currently updated from the update tree and then
reinserting it when the update process is over, it meant that in the CLI
command code we were modifying a structure that was already being used.
These concurrent accesses were not properly locked on the "regular"
update case because it was assumed that once an entry was removed from
the update tree, the update task was the only one able to work on it.
Rather than locking the whole update process, an "updating" flag was
added to the certificate_ocsp in order to prevent the "update ssl
ocsp-response" command from trying to update a response already being
updated.
An easy way to reproduce this crash was to perform two "simultaneous"
calls to "update ssl ocsp-response" on the same certificate. It would
then crash on an eb64_delete call in the main ocsp update task function.
This patch can be backported up to 2.8. Wait a little bit before
backporting.
With the current way OCSP responses are stored, a single OCSP response
is stored (in a certificate_ocsp structure) when it is loaded during a
certificate parsing, and each SSL_CTX that references it increments its
refcount. The reference to the certificate_ocsp is kept in the SSL_CTX
linked to each ckch_inst, in an ex_data entry that gets freed when the
context is freed.
One of the downsides of this implementation is that if every ckch_inst
referencing a certificate_ocsp gets destroyed, then the OCSP response is
removed from the system. So if we were to remove all crt-list lines
containing a given certificate (that has an OCSP response), and if all
the corresponding SSL_CTXs were destroyed (no ongoing connection using
them), the OCSP response would be destroyed even if the certificate
remains in the system (as an unused certificate).
In such a case, we would want the OCSP response not to be "usable",
since it is not used by any ckch_inst, but still remain in the OCSP
response tree so that if the certificate gets reused (via an "add ssl
crt-list" command for instance), its OCSP response is still known as
well.
But we would also like such an entry not to be updated automatically
anymore once no instance uses it. An easy way to do it could have been
to keep a reference to the certificate_ocsp structure in the ckch_store
as well, on top of all the ones in the ckch_instances, and to remove the
ocsp response from the update tree once the refcount falls to 1, but it
would not work because of the way the ocsp response tree keys are
calculated. They are decorrelated from the ckch_store and are the actual
OCSP_CERTIDs, which is a combination of the issuer's name hash and key
hash, and the certificate's serial number. So two copies of the same
certificate but with different names would still point to the same ocsp
response tree entry.
The solution that answers all the needs expressed above is actually
to have two reference counters in the certificate_ocsp structure, one
actual reference counter corresponding to the number of "live" pointers
on the certificate_ocsp structure, incremented for every SSL_CTX using
it, and one for the ckch stores.
If the ckch_store reference counter falls to 0, the corresponding
certificate must have been removed via CLI calls ('set ssl cert' for
instance).
If the actual refcount falls to 0, then no live SSL_CTX uses the
response anymore. It could happen if all the corresponding crt-list
lines were removed and there are no live SSL sessions using the
certificate anymore.
If any of the two refcounts becomes 0, we will always remove the
response from the auto update tree, because there's no point in spending
time updating an OCSP response that no new SSL connection will be able
to use. But the certificate_ocsp object won't be removed from the tree
unless both refcounts are 0.
Must be backported up to 2.8. Wait a little bit before backporting.
By default, backend connections are accounted by the server. This allows
to determine the number of idle connections to keep. A backend
connection can also be marked as private to prevent its reuse. It is
thus moved from the server lists into the session list. As such, a private
connection is not accounted into server : conn_set_private() uses
srv_release_conn() to ensure this.
When using HTTP/2 on backend side with default http-reuse safe, the
above principles are mixed. Indeed, when a connection is first used, or
switches from idle to used, it is moved into the session list but it is
not flagged as private. This is done to prevent its sharing by different
clients to prevent head-of-line blocking issue. When all streams are
closed, the connection becomes idle again and is reinserted in the
server list. This has been introduced by the following patch :
0d21deaded
MEDIUM: backend: add reused conn to sess if mux marked as HOL blocking
When freeing a backend connection, special care is taken to ensure
server used counter is decremented. This is implemented into
conn_backend_deinit(). However, this function does this only if the
connection is not present in a session list. This is valid for private
connections. However, if a connection is non-private and present only
temporarily into a session list, the decrement operation won't be
executed despite the connection being accounted by the server.
This bug has several impacts. The server used counter won't be able to
reach its initial null value, even when all its connections are closed.
This can result in a wrong estimation of necessary idle connections,
which may cause unnecessary new connection usage. Also, this will
prevent definitely the server from being removed via "delete server" CLI
command.
This should be backported up to 2.4. Note that conn_backend_deinit() was
introduced in 2.9. For lesser versions, the change should be done
directly into conn_free().
Backend connections can be marked as private to prevent their sharing by
multiple clients. Now, this has become an exception as only two reasons
for data traffic can trigger this (checks are ignored here) :
* http-reuse never
* HTTP response with NTLM header
The first case is easy to manage as the connection is flagged as private
since its inception. However, the second case is dynamic as the
connection can be flagged anytime during its lifetime. When using a
backend protocol such as HTTP/2 with reuse mode aggressive or always, we
face a design issue as the connection would be marked as private,
despite potentially being shared by several clients at the same time.
This is conceptually invalid, but worse, it can trigger crashes on MUX
stream detach callback depending on the order of release of the streams,
by calling session_check_idle_conn() with a NULL session. It could also
be possible to have several NTLM responses on a single connection for
different sessions. In this case, connection owner is still being
updated without attaching the connection to its correct session, which
ultimately would cause a crash on session_check_idle_conn with an
invalid session.
Here are two backtrace examples from GDB for such cases:
Thread 1 (Thread 0x7ff73e9fc700 (LWP 648859)):
#0 session_check_idle_conn (conn=0x7ff72f597800, sess=0x0) at include/haproxy/session.h:209
#1 h2_detach (sd=<optimized out>) at src/mux_h2.c:4520
#2 0x000056151742be24 in sc_detach_endp (scp=scp@entry=0x7ff73e9f0f18) at src/stconn.c:376
#3 0x000056151742c208 in sc_destroy (sc=<optimized out>) at src/stconn.c:444
#4 0x0000561517370871 in stream_free (s=s@entry=0x7ff72a2dbd80) at src/stream.c:728
#5 0x000056151737541f in process_stream (t=t@entry=0x7ff72d5e2620, context=0x7ff72a2dbd80, state=<optimized out>) at src/stream.c:2645
#6 0x0000561517456cbb in run_tasks_from_lists (budgets=budgets@entry=0x7ff73e9f10d0) at src/task.c:632
#7 0x00005615174576b9 in process_runnable_tasks () at src/task.c:876
#8 0x000056151742275a in run_poll_loop () at src/haproxy.c:2996
#9 0x0000561517422db1 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3195
#10 0x00007ff789e081ca in start_thread () from /lib64/libpthread.so.0
#11 0x00007ff789a39e73 in clone () from /lib64/libc.so.6
(gdb)
Thread 1 (Thread 0x7ff52e7fc700 (LWP 681458)):
#0 0x0000556ebd6e7e69 in session_check_idle_conn (conn=0x7ff5787ff100, sess=0x7ff51d2539a0) at include/haproxy/session.h:209
#1 h2_detach (sd=<optimized out>) at src/mux_h2.c:4520
#2 0x0000556ebd7f3e24 in sc_detach_endp (scp=scp@entry=0x7ff52e7f0f18) at src/stconn.c:376
#3 0x0000556ebd7f4208 in sc_destroy (sc=<optimized out>) at src/stconn.c:444
#4 0x0000556ebd738871 in stream_free (s=s@entry=0x7ff520e28200) at src/stream.c:728
#5 0x0000556ebd73d41f in process_stream (t=t@entry=0x7ff565783700, context=0x7ff520e28200, state=<optimized out>) at src/stream.c:2645
#6 0x0000556ebd81ecbb in run_tasks_from_lists (budgets=budgets@entry=0x7ff52e7f10d0) at src/task.c:632
#7 0x0000556ebd81f6b9 in process_runnable_tasks () at src/task.c:876
#8 0x0000556ebd7ea75a in run_poll_loop () at src/haproxy.c:2996
#9 0x0000556ebd7eadb1 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3195
#10 0x00007ff5752081ca in start_thread () from /lib64/libpthread.so.0
#11 0x00007ff574e39e73 in clone () from /lib64/libc.so.6
(gdb)
To solve this issue, simply ignore NTLM responses when using a
multiplexer with stream support and the connection is not already
attached to the session. The connection is not marked as private and
will continue to be shared freely across clients. This is considered
conceptually valid as NTLM usage (RFC 4559) with HTTP is broken and was
designed only with HTTP/1.1 in mind. A side-effect of the change is that
SESS_FL_PREFER_LAST is also not set anymore on NTLM detection, which
allows the following requests to be load-balanced across several server
instances.
The original behavior is kept for HTTP/1 or if the connection is already
attached to the session. This last case happens when using HTTP/2 with
default http-reuse safe mode since the following patch :
0d21deaded
MEDIUM: backend: add reused conn to sess if mux marked as HOL blocking
This should be backported to all stable releases. Down to 2.4, it can be
taken as-is. For older versions, the above patch is not present. In this
case the condition should be restricted to HTTP/1 usage only:
if (srv_conn && strcmp(srv_conn->mux->name, "H1") == 0) {
Frames with a too small size must be detected on receive and an error must
be triggered. It is especially important for frames of size 0. Otherwise,
because the frame length is used as the return value, the frame is ignored
(0 is the return value stating the frame must be ignored). It is an issue
because in this case the outgoing data, the 4 bytes representing the frame
size, are never consumed. If the agent also closes the connection, this
leads to a wakeup loop because outgoing data are stuck and a shutdown is
pending.
In addition, all pending outgoing data are systematically skipped when the
applet is in SPOE_APPCTX_ST_END state.
The patch should fix the issue #2490. It must be backported to all stable
versions.
It is the first deprecated directive exposed via the
'expose-deprecated-directives' global option. This way, it is possible to
silence the warning about SPOE usage.
Similarly to the "expose-experimental-directives" option, there is now a
global option to expose some deprecated directives. The idea is to have a way
to silence warnings about deprecated directives when there is no alternative
solution.
Of course, deprecated directives covered by this option are not listed and
may change. It is only a best effort to let users upgrade smoothly.
As announced on the ML a few weeks (months?) ago and on several GH issues,
the SPOE is now deprecated. Sadly, this filter should be refactored to work
properly. It was implemented as a functional PoC for 1.7 and since then, no
time was invested to improve it and make it truly maintainable over time.
Worse, other parts of HAProxy evolve, especially the applets part, making
maintenance ever more expensive.
Instead of keeping the SPOE filter in this state and always replying to users
encountering issues or limitations that it is far from perfect but that we
cannot work on it for now, we decided to deprecate it.
We can still change our mind before the 3.0.0 release if the situation
evolves. Otherwise the filter will be removed or marked as unmaintained for
the 3.1. If the situation does not change, it means the 3.0 will be the last
version with true SPOE support.
On soft-stop, we try, as far as possible, to process all pending messages
before closing SPOE applets. However, in sync mode, when an applet waiting
for a response receives the ACK frame, it is switched to the IDLE state
without checking if it may be closed. In this case, we will wait for the
idle timeout before closing the applet, delaying the soft-stop.
To reduce this delay, on soft-stop, IDLE applets are woken up. On the next
wakeup, the applet will try to process pending messages or will be
closed.
This patch should be backported to all stable versions.
On stream side, the SPOE filter relied on the stream's expiration date to be
woken up and be able to detect processing timeout. However, the stream
expiration date must not be updated this way. Mainly because it may be
overwritten at the end of process_stream(). In the worst case, it is set to
TICK_ETERNITY for any reason. In this case, it is impossible to detect that
the SPOE filter must time out and abort the processing.
The right way to do it is to set an analysis expiration date on the
corresponding channel, depending on the direction. This expiration date will
be used to compute the stream's expiration date at the end of
process_stream().
This patch may be related to issue #2478. It must be backported to all
stable versions.
A server can only be deleted if no element references it anymore.
This is taken care of via srv_check_for_deletion(), most notably for active
and idle connections.
A special case occurs for connections directly managed by a session.
This is for so-called private connections, when using http-reuse never
or H2 + http-reuse safe for example. In this case, the server does not
account these connections into its idle lists. This caused a bug as the
server could be deleted despite the session still being able to access it.
To properly fix this, add a new referencing element into the server for
these session connections. An mt_list has been chosen for this. With the
default http-reuse, private connections are typically not used so it
won't make any difference. When using H2 servers, or more generally when
dealing with private connections, insert/delete should typically occur
only once per session lifetime, so the impact on performance should be
minimal.
This should be backported up to 2.4. Note that srv_check_for_deletion()
was introduced in 3.0 dev tree. On backport, the extra condition in it
should be placed in cli_parse_delete_server() instead.
By default, backend connections are attached to a server instance. This
makes it possible to implement connection reuse. However, in some particular
cases, a connection cannot be shared across several clients. These
connections are considered private and are attached to the session instance
instead.
These private connections are also indexed by the target server to not
mix them. All of this is implemented via a dedicated structure
previously named struct sess_srv_list.
Rename it to better reflect its usage to struct sess_priv_conns. Also
rename its internal members and all of the associated functions.
This commit is only a renaming, thus no functional impact is expected.
A null pointer dereference was reported by Coverity in the listener_release()
function. Indeed, we must not try to schedule a frontend without a task when
a limit is still blocking the frontend. This issue was introduced by commit
65ae1347c7 ("BUG/MINOR: listener: Wake proxy's mngmt task up if necessary on
session release").
This patch should fix issue #2488. It must be backported to all stable
versions with the commit above.
When a session is released, the listener_release() function is called to
notify the listener. It is an opportunity to resume limited/full listeners.
We first try to resume the listener owning the released session, then all
limited listeners in the global queue and finally all limited listeners in
the frontend's waiting queue. This last step is only performed if there is
no limit applied on the frontend. Nothing is performed if the session rate
is still limited. And it is an issue because if this happens for the last
listener's session, there is no other event to wake the frontend's
management task up and the listener remains in the limited state.
To fix the issue, when a limit is still applied on the frontend, we must
compute the new wake-up date from the session rate and schedule the
frontend's management task.
It is easy to reproduce the issue in SSL by setting a maxconn and a rate
limit on sessions.
This patch should fix the issue #2476. It must be backported to all stable
versions.
-dI allows enabling "insecure-fork-wanted" directly from the command line,
which is useful when you want to run ASAN with addr2line on a lot of
configuration files without editing them.
While trying to reproduce another crash case involving lua filters
reported by @bgrooot on GH #2467, we found out that mixing filters loaded
from different contexts ('lua-load' vs 'lua-load-per-thread') for the same
stream isn't supported and may even cause the process to crash.
Historically, mixing lua-load and lua-load-per-threads for a stream wasn't
supported, but this changed thanks to 0913386 ("BUG/MEDIUM: hlua: streams
don't support mixing lua-load with lua-load-per-thread").
However, the above fix didn't consider the lua filters' use-case properly:
unlike lua fetches, actions or even services, lua filters don't simply
use the stream hlua context as a "temporary" hlua running context to
process some hlua code. For fetches, actions... hlua executions are
processed sequentially, so we simply reuse the hlua context from the
previous action/fetch to run the next one (this bypasses memory
allocations and initialization, thus increasing performance), unless
we need to run on a different hlua state-id, in which case we perform a
reset of the hlua context.
But this cannot work with filters: indeed, once registered, a filter will
last for the whole stream duration. It means that the filter will rely
on the stream hlua context from ->attach() to ->detach(). And here is the
catch, if for the same stream we register 2 lua filters from different
contexts ('lua-load' + 'lua-load-per-thread'), then we have an issue,
because the hlua stream will be re-created each time we switch between
runtime contexts, which means each time we switch between the filters (may
happen for each stream processing step), and since lua filters rely on the
stream hlua to carry context between filtering steps, this context will be
lost upon a switch. Given that the lua filters code was not designed with
that in mind, it would confuse the code and cause unexpected behaviors
ranging from lua errors to a crashing process.
So here we take another approach: instead of re-creating the stream hlua
context each time we switch between "global" and "per-thread" runtime
context, let's have both of them inside the stream directly as initially
suggested by Christopher back then when we talked about the original issue.
For this we leverage hlua_stream_ctx_prepare() and hlua_stream_ctx_get()
helper functions which return the proper hlua context for a given stream
and state_id combination.
As for the debugging info reported after ha_panic(), we check both hlua
runtime contexts to see if one of them was active when the panic occurred
(only 1 runtime ctx per stream may be active at a given time).
This should be backported to all stable versions with 0913386
("BUG/MEDIUM: hlua: streams don't support mixing lua-load with lua-load-per-thread")
This commit depends on:
- "DEBUG: lua: precisely identify if stream is stuck inside lua or not"
[for versions < 2.9 the ha_thread_dump_one() part should be skipped]
- "MINOR: hlua: use accessors for stream hlua ctx"
For 2.4, the filters API didn't exist. However it may be a good idea to
backport it anyway because ->set_priv()/->get_priv() from tcp/http lua
applets may also be affected by this bug, plus it will ease code
maintenance. Of course, filters-related parts should be skipped in this
case.
Change hlua_stream_ctx_prepare() prototype so that it now returns the
proper hlua ctx on success instead of returning a boolean.
Add hlua_stream_ctx_get() to retrieve hlua ctx out of a given stream.
This way we may easily change the storage mechanism for hlua stream in
the future without extensive code changes.
No backport needed unless a commit depends on it.
When ha_panic() is called by the watchdog, we try to guess from
ha_task_dump() and ha_thread_dump_one() if the thread was stuck while
executing lua from the stream context. However we consider this is the
case by simply checking if the stream hlua context was set, but this is
not very precise because if the hlua context is set, then it simply means
that at least one lua instruction was executed at the stream level, not
that the thread was currently executing lua when the panic occurred.
This is especially true with filters: one could simply register a lua
filter that does nothing, but this will still end up initializing the
stream hlua context for each stream. If the thread ends up being stuck
during the stream handling, then the debug dumping functions will report
that the stream was stuck while handling lua, which is not necessarily
true, and could in fact confuse us even more.
So here we take another approach: we add the BUSY flag to the hlua context.
This flag is set by hlua_ctx_resume() around the lua_resume() call, this way
we can precisely tell if the thread was handling lua when it was
interrupted, and we rely on this flag in debug functions to check if the
thread was effectively stuck inside lua or not while processing the stream.
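Conceptually, the change looks like the following sketch (flag handling
heavily simplified, Lua 5.4 lua_resume() signature assumed):
|/* in hlua_ctx_resume(), around the resume call */
|lua->flags |= HLUA_BUSY;
|ret = lua_resume(lua->T, from, lua->nargs, &nres);
|lua->flags &= ~HLUA_BUSY;
|
|/* in the panic/debug dump path */
|if (stream_hlua && (stream_hlua->flags & HLUA_BUSY))
|        /* the thread really was interrupted inside lua */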
No backport needed unless a commit depends on it.
hlua_filter_delete() calls hlua_unref() on the stream hlua stack, but
we should own the lock prior to manipulating the stack.
This should be backported up to 2.6.
This is a complementary patch to 8670db7 ("BUG/MAJOR: hlua: improper lock
usage with hlua_ctx_resume()") for hlua_filter_new().
Indeed, the HLUA_E_ERRMSG case still relies on the lua stack but didn't
take the lock to do so.
This should be backported up to 2.6.
Trying to register the same lua filter from global and per-thread context
(using 'lua-load' + 'lua-load-per-thread') causes a segmentation fault in
hlua_post_init().
This is due to a simple copy paste error as we try to print the function
name in the error message (like we do when loading the same lua function
from different contexts) instead of the filter name.
This should be backported up to 2.6.
The new "ssl-security-level" option allows one to change the OpenSSL
security level without having to change the openssl.cnf global file of
your distribution. This directive applies to every SSL_CTX context.
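For example, assuming the directive simply takes the numeric OpenSSL level
(0 to 5):
global
    ssl-security-level 2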
People sometimes change their security level directly in the ciphers
directive, however there are some cases when the security level change
is not applied in the right order (for example when applying a DH
param).
Before this patch, it was only possible to work around this by using a
specific openssl.cnf file and starting haproxy this way:
OPENSSL_CONF=./openssl.cnf ./haproxy -f bug-2468.cfg
Values for the security level can be found there:
https://www.openssl.org/docs/man1.1.1/man3/SSL_CTX_set_security_level.html
This was discussed in github issue #2468.
In issue #2448, users are complaining that FIPS is not working correctly
since the removal of SSL_library_init().
This was removed because SSL_library_init() is deprecated with OpenSSL
3.x and emits a warning. But the initialization was not needed anymore
because it is done at the first openssl API call.
However in some cases it is needed. SSL_library_init() is now a define
to OPENSSL_init_ssl(0, NULL). This patch adds OPENSSL_init_ssl(0, NULL)
to the init.
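A minimal sketch of the kind of call that was added (the wrapper name is
illustrative; the actual init code in haproxy differs):

#include <openssl/ssl.h>

/* equivalent of the now-deprecated SSL_library_init() */
static int ssl_explicit_init(void)
{
        return OPENSSL_init_ssl(0, NULL); /* returns 0 on failure */
}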
This could be backported to every stable branch, however let's wait
before backporting it.
3.0-dev1 introduced a small regression with commit b4db3be86e ("BUG/MINOR:
server: fix server_find_by_name() usage during parsing"). By changing the
way servers are indexed and moving it into the server template loop, the
first one is no longer indexed because the loop starts at low+1 since it
focuses on duplication. Let's index the first one explicitly now.
This should not be backported, unless the commit above is backported.
This was not useful and was using an uninitialized value. Introduced with
the commit 08ac28237 ("MINOR: Add aes_gcm_enc converter").
Must be backported wherever the commit 08ac28237 was backported.
The issue was introduced with the commit c31499d74 ("MINOR: ssl: Add
aes_gcm_dec converter").
This must be backported to all stable branches where the above converter
is present, but it may need to be adjusted for older branches because of
code refactoring.
Where possible (FreeBSD 13+), use the public, documented interface to
the ELF auxiliary argument vector: elf_aux_info().
__elf_aux_vector is a private interface exported so that the runtime
linker can set its value during process startup and not intended for
public consumption. In FreeBSD 15 it has been removed from libc and
moved to libsys.
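A sketch of the assumed usage (not the exact haproxy patch):

#if defined(__FreeBSD__)
#include <sys/auxv.h>

/* getauxval()-like helper built on the documented interface */
static unsigned long read_auxval(int type)
{
        unsigned long val = 0;

        if (elf_aux_info(type, &val, sizeof(val)) != 0)
                return 0; /* entry not found or not supported */
        return val;
}
#endif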
This commit removes qc_treat_rx_crypto_frms(). This function was used in
a single place inside qc_ssl_provide_all_quic_data(). Besides, its
naming was confusing as conceptually it is directly linked to the quic_ssl
module instead of quic_rx.
Thus, the body of qc_treat_rx_crypto_frms() is inlined directly inside
qc_ssl_provide_all_quic_data(). Also, qc_ssl_provide_quic_data() is now
only used inside quic_ssl, so its scope is set to static. Overall, the API
for CRYPTO frame handling is now cleaner.
On CRYPTO frames reception, tasklet is rescheduled with TASK_HEAVY to
limit CPU consumption. This commit slightly simplifies this by regrouping
TASK_HEAVY setting and tasklet_wakeup() instructions in a single
location in qc_handle_crypto_frm(). All other unnecessary
tasklet_wakeup() are removed.
Till now it was still needed to write rules to eliminate badly behaving
H2 clients, while most of the time it would be desirable to just be able
to set a threshold on the level of anomalies on a connection.
This is what this patch does. By setting a glitches threshold for frontend
and backend, it becomes possible to automatically switch a connection to the
error state when the threshold is reached so that the connection dies by itself
without having to write possibly complex rules.
One subtlety is that we still have the error state being exclusive to the
parser's state so this requires the h2c_report_glitches() function to return
a status indicating if the threshold was reached or not so that processing
can instantly stop and bypass the state update, otherwise the state could
be turned back to a valid one (e.g. after parsing CONTINUATION); we should
really contemplate the possibility to use H2_CF_ERROR for this. Fortunately
there were very few places where a glitch was reported outside of an error
path so the changes are quite minor.
Now by setting the front value to 1000, a client flooding with short
CONTINUATION frames is instantly stopped.
The function aims at centralizing counter measures but due to the fact
that it only increments the counter by one unit, sometimes it was not
used and the value was calculated directly. Let's pass the increment as an
argument so that it can be used everywhere.
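The counting logic then boils down to something like this self-contained
sketch (struct and field names are illustrative, not the actual h2c members):

struct glitch_ctr {
        unsigned int count;
        unsigned int limit;  /* 0 = no threshold configured */
};

/* account <inc> glitches at once; returns non-zero if the threshold was
 * reached and the connection should be switched to the error state.
 */
static int report_glitches(struct glitch_ctr *g, unsigned int inc)
{
        g->count += inc;
        return g->limit && g->count >= g->limit;
}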
Compilation on Solaris fails because of usage of names reserved on that
platform, i.e. 'queue' and 's_addr'.
This patch redefines 'queue' as '_queue' and renames 's_addr' to
'srv_addr' which fixes compilation for now.
Future plan: rename 'queue' in code base so define can be removed again.
Backporting: 2.9, 2.8
The sink lock was made to prevent event producers from passing while
there were other threads trying to print a "dropped" message, in order
to guarantee the absence of reordering. It has a serious impact however,
which is that all threads need to take the read lock when producing a
regular trace even when there's no reader.
This patch takes a different approach. The drop counter is shifted left
by one so that the lowest bit is used to indicate that one thread is
already taking care of trying to dump the counter. Threads only read
this value normally, and will only try to change it if it's non-null,
in which case they'll first check if they are the first ones trying to
dump it, otherwise they will simply count another drop and leave. This has
a large benefit. First, it will avoid the locking that causes stalls
as soon as a slow reader is present. Second, it avoids any write on the
fast path as long as there's no drop. And it remains very lightweight
since we just need to add +2 or subtract 2*dropped in operations, while
offering the guarantee that the sink_write() has succeeded before
unlocking the counter.
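The idea can be summarized with the following standalone sketch (simplified,
using C11 atomics instead of haproxy's own primitives):

#include <stdatomic.h>

static _Atomic unsigned long dropped; /* (count << 1) | "dump in progress" bit */

static void account_drop(void)
{
        atomic_fetch_add(&dropped, 2);  /* +2 == one more dropped event */
}

static void try_report_drops(void)
{
        unsigned long curr = atomic_load(&dropped);

        if (!curr)
                return;  /* fast path: nothing dropped, no write at all */

        if ((curr & 1) || !atomic_compare_exchange_strong(&dropped, &curr, curr | 1))
                return;  /* another thread is already reporting */

        /* ... emit the "dropped (curr >> 1) events" message; only once it was
         * successfully written do we release the counter:
         */
        atomic_fetch_sub(&dropped, curr + 1);  /* -2*reported, and clear bit 0 */
}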
While a reader was previously limiting the traffic to 11k RPS under
4C/8T, now we reach 36k RPS vs 14k with no reader, so readers will no
longer slow the traffic down and will instead even speed it up due to
avoiding the contention down the chain in the ring. The locking cost
dropped from ~75% to ~60% now (it's in ring_write now).
When a reader doesn't read fast enough and causes drops, subsequent
threads try to produce a "dropped" message. But it takes time to
produce and emit this message, in part due to the use of chunk_printf()
that relies on vfprintf() which has to parse the printf format, and
during this time other threads may continue to increment the counter.
This is the reason why this is currently performed in a loop. When
reading what is received, it's common to see a large count followed
by one or two single-digit counts, indicating that we could possibly
have improved that by writing faster.
Let's improve the situation a little bit. First we're now using a
static message prefixed with enough space to write the digits, and a
call to ultoa_r() fills these digits from right to left so that we
don't have to process a format string nor perform a copy of the message.
Second, we now re-check the counter immediately after having prepared
the message so that we still get an opportunity for updating it. In
order to avoid too long loops, this is limited to 10 iterations.
Tests show that the number of single-digit "dropped" counters on output
now dropped roughly by 15-30%. Also, it was observed that with 8 threads,
there's almost never more than one retry.
The previous patch fixed the handling of in-order CRYPTO frames which
requires the usage of a new buffer for these data as their handling is
delayed to run under TASK_HEAVY.
In fact, as all CRYPTO frame handling must now be delayed, it can be
unified. This is the purpose of this commit, which removes
the just introduced new buffer. Now, all CRYPTO frames are buffered
inside the ncbuf. Unused elements such as crypto_frms member for
encryption level are also removed.
This commit is not a bugfix but is a direct follow-up to the last one.
As such, it can probably be backported with it to 2.9 to reduce code
differences between these versions.
QUIC relies on SSL_do_handshake() to be able to validate the handshake. As
this function is computation heavy, it is since 2.9 called only under
TASK_HEAVY. This has been implemented by the following patch :
94d20be138
MEDIUM: quic: Heavy task mode during handshake
Instead of handling CRYPTO frames immediately during reception, this
patch delays the process to run under TASK_HEAVY tasklet. A frame copy
is stored in the qel.rx.crypto_frms list. However, this frame still
references the receive buffer. If the receive buffer is cleared before
the tasklet is rescheduled, it will point to garbage data, resulting in
haproxy decryption error. This happens if a fair amount of data is
received constantly to preempt the quic_conn tasklet execution.
This bug can be reproduced with a fair amount of clients. It is
exhibited by 'show quic full' which can report connections blocked on
handshake. Using the following command results in h2load not being able to
complete the last connections.
$ h2load --alpn-list h3 -t 8 -c 800 -m 10 -w 10 -n 8000 "https://127.0.0.1:20443/?s=10k"
Also, haproxy QUIC listener socket mode was active to trigger the issue.
This forces several connections to share the same reception buffer,
making the bug even more likely to occur. It should be possible to
reproduce it with connection sockets by increasing the number of clients.
To fix this bug, define a new buffer under quic_cstream. It is used
exclusively to copy CRYPTO data for in-order frame if ncbuf is empty.
This ensures data remains accessible even if receive buffer is cleared.
Note that this fix is only a temporary step. Indeed, a ncbuf is also
already used for out-of-order data. It should be possible to unify its
usage for both in and out-of-order data, rendering this new buffer
instance unnecessary. In this case, several unneeded elements will
become obsolete such as qel.rx.crypto_frms list. This will be done in a
future refactoring patch.
This must be backported up to 2.9.
The converter can be used to encrypt the raw byte input using the
AES-GCM algorithm, using provided nonce and key.
Co-authored-by: Dragan Dosen (ddosen@haproxy.com)
When a parsing error occurs inside a log-format-sd expression, we report
the location of the log-format directive (which may not be set) instead
of reporting the proper log-format-sd directive location where the parsing
error occurred.
1|listen test
2| log-format "%B" # no error
3| log-format-sd "%bad" # error
| [ALERT] (322261) : config : Parsing [empty.conf:2]: failed to parse log-format-sd : no such format variable 'bad'. If you wanted to emit the '%' character verbatim, you need to use '%%'.
The fix consists in using the config hints dedicated to log-format-sd
directive instead of the log-format one.
The bug was introduced in 8a4e4420 ("MEDIUM: log-format: Use standard
HAProxy log system to report errors").
This should be backported to every stable versions.
httpclient_precheck(), ssl_ocsp_update_precheck(), and
resolvers_create_default() functions are registered through
REGISTER_PRE_CHECK() macro to be called by haproxy during init from the
pre_check_list list. When calling functions registered in pre_check_list,
haproxy expects ERR_* return values. However those 3 functions currently
use raw return values, so we better use explicit ERR_* macros to prevent
breakage in the future if ERR_* values mapping were to change.
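For reference, a pre-check callback is expected to look roughly like this
(the callback and its init step are hypothetical examples):

static int my_subsystem_precheck(void)
{
        if (!my_subsystem_alloc())            /* hypothetical init step */
                return ERR_ALERT | ERR_FATAL; /* abort startup with an alert */
        return ERR_NONE;
}

REGISTER_PRE_CHECK(my_subsystem_precheck);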
Since 833cc794 ("MEDIUM: sample: handle comma-delimited converter list")
logformat expressions now support having a comma-delimited converter list
right after the fetch. Let's remove a leftover comment from the initial
implementation that says otherwise.
A remote unidirectional stream can be aborted prematurely if application
layers cannot identify its type. In this case, a STOP_SENDING frame is
emitted.
Since QUIC MUX refactoring, a crash would occur in this scenario due to
2 specific characteristics of remote uni streams :
* qcs.tx.fctl was not initialized completely. This caused a crash due to
a BUG_ON() statement inside qcs_destroy().
* qcs.stream is never allocated. This caused qcs_prep_bytes() to crash
inside qcc_io_send().
This bug is considered minor as it happens only on very specific QUIC
clients. It was detected when using s2n-quic over interop.
This does not need to be backported.
After handshake completion, QUIC server is responsible to emit
HANDSHAKE_DONE frame. Some clients wait for it to begin STREAM
transfers.
Previously, there was no explicit tasklet_wakeup() after handshake
completion, which is necessary to emit post-handshake frames. In most
cases, this was undetected as most clients continue emission, which will
reschedule the tasklet. However, as there is no tasklet_wakeup(), this
is not a consistent behavior. If this bug occurs, it causes a connection
freeze, preventing the client from emitting any request. The connection is
finally closed on idle timeout.
To fix this, add an explicit tasklet_wakeup() after handshake
completion. It sounds simple enough but in fact it's difficult to find
the correct location for the tasklet_wakeup() invocation, as post-handshake
is directly linked to connection accept, with different orderings.
Notably, if 0-RTT is used, the connection can be accepted prior to handshake
completion. Another major point is that along with the HANDSHAKE_DONE frame,
a series of NEW_CONNECTION_ID frames are emitted. However, these new CID
allocations must occur after the connection is migrated to its new thread as
these CIDs are tied to it. A BUG_ON() is present to check this in
qc_set_tid_affinity().
With all this in mind, 2 locations were selected for the necessary
tasklet_wakeup():
* on qc_xprt_start() : this is useful for standard case without 0-RTT.
This ensures that this is done only after connection thread migration.
* on qc_ssl_provide_all_quic_data() : this is done on handshake
completion with 0-RTT used. In this case only, connection is already
accepted and migrated, so tasklet_wakeup() is safe.
Note that as a side-change, quic_accept_push_qc() API has evolved to
better reflect differences between standard and 0-RTT usages. It is now
forbidden to call it multiple times on a single quic_conn instance. A
BUG_ON() has been added.
This issue is labelled as medium even though it seems pretty rare. It
was only reproducible using QUIC interop runner, with haproxy compiled
with LibreSSL and quic-go as the client. However, the affected code parts
are pretty sensitive, which justifies the chosen severity.
This should fix github issue #2418.
It should be backported up to 2.6, after a brief period of observation.
Note that the extra comment added in qc_set_tid_affinity() can be
removed in 2.6 as thread migration is not implemented for this version.
Other parts should apply without conflict.
In resolvers.c:rslv_promex_next_ts() and in
stick-tables.c:stk_promex_next_ts(), an unused argument was mistakenly
called "unsued" instead of "unused". Let's fix this in a separate patch
so that it can be omitted from backports if this causes build problems.
This is the 39th iteration of typo fixes.
The naming issue on the argument called "unsued" instead of "unused"
in two functions from resolvers and stick-tables was put into a second
patch so that it can be omitted if it were to cause backport issues.
That's exactly the same as commit 53bfab080c ("BUG/MINOR: sink: fix a race
condition between the writer and the reader") that went into 2.7 and was
backported as far as 2.4, except that since the code was duplicated, the
second instance was not noticed, leaving the race present. The race has
a limited impact, if a forwarder reaches the end of the logs and a new
message arrives before it leaves, the forwarder will only wake up after
yet another new message is sent. In practice it remains unnoticeable
because for the race to trigger, one needs to have a steady flow of logs,
which means the wakeup will happen anyway.
This should be backported, but no need to insist on it if it resists.
Instead of reporting lua errors using ha_alert(), let's use SEND_ERR()
helper which will also try to generate a log message according to lua
log settings.
hlua_event_subscribe() is meant to be called from a protected lua env
during init and/or runtime. As such, only hlua_event_sub() makes use
of it: when an error happens, hlua_event_sub() will already raise a Lua
exception. Thus it's not relevant to use ha_alert() there as it could
generate log pollution (the error is relevant from the Lua script's point
of view, not from haproxy's).
This could be backported in 2.8.
hlua_ctx_resume() itself can safely be used as-is in a multithreading
context because it takes care of taking the lua lock.
However, when hlua_ctx_resume() returns, the lock is released and it is
thus the caller's responsibility to ensure it owns the lock prior to
performing additional manipulations on the Lua stack. Unfortunately, since
early haproxy lua implementation, we used to do it wrong:
The most common hlua_ctx_resume() pattern we can find in the code (because
it was duplicated again and again over time) is the following:
|ret = hlua_ctx_resume()
|switch (ret) {
| case HLUA_E_OK:
| break;
| case HLUA_E_ERRMSG:
| break;
| [...]
|}
Problem is: for some of the switch cases, we still perform lua stack
manipulations. This is the case for the HLUA_E_ERRMSG for instance where
we often use lua_tostring() to retrieve last lua error message on the top
of the stack, or sometimes for the HLUA_E_OK case, when we need to perform
some lua cleanup logic once the resume ended. But all of this is done
WITHOUT the lua lock, so this means that the main lua stack could be
accessed simultaneously by concurrent threads when a script was loaded
using 'lua-load'.
While it is not critical for switch-cases dedicated to error handling,
(those are not supposed to happen very often), it can be very problematic
for stack manipulations occurring in the HLUA_E_OK case under heavy load
for instance. In this case, main lua stack corruptions will eventually
happen. This is especially true inside hlua_filter_new(), where this bug
was known to cause lua stack corruptions under load, leading to lua errors
and even crashing the process as reported by @bgrooot in GH #2467.
The fix is relatively simple, once hlua_ctx_resume() returns: we should
consider that ANY lua stack access should be lua-lock protected. If the
related lua calls may raise lua errors, then (RE)SET_SAFE_LJMP
combination should be used as usual (it locks the lua stack and catches
lua exceptions at the same time), else hlua_{lock,unlock} may be
used if no exceptions are expected.
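For instance, the HLUA_E_ERRMSG branch of the pattern shown earlier
conceptually becomes (simplified sketch):
|ret = hlua_ctx_resume()
|switch (ret) {
| case HLUA_E_ERRMSG:
|        hlua_lock(hlua);
|        error = hlua_tostring_safe(hlua->T, -1);
|        SEND_ERR(NULL, "*: %s.\n", error);
|        hlua_unlock(hlua);
|        break;
| [...]
|}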
This patch should fix GH #2467.
It should be backported to all stable versions.
[ada: some ctx adj will be required for older versions as event_hdl
doesn't exist prior to 2.8 and filters were implemented in 2.5, thus
some chunks won't apply]
When we want to perform some unsafe lua stack manipulations from an
unprotected lua environment, we use SET_SAFE_LJMP() RESET_SAFE_LJMP()
combination to lock lua stack and catch potential lua exceptions that
may occur between the two.
Hence, we regularly find this pattern (duplicated over and over):
|if (!SET_SAFE_LJMP(hlua)) {
| const char *error;
|
| if (lua_type(hlua->T, -1) == LUA_TSTRING)
| error = hlua_tostring_safe(hlua->T, -1);
| else
| error = "critical error";
| SEND_ERR(NULL, "*: %s.\n", error);
|}
This is wrong because when SET_SAFE_LJMP() returns false (meaning that an
exception was caught), then the lua lock was released already, thus the
caller is not expected to perform lua stack manipulations (because the
main lua stack may be shared between multiple threads). In the pattern
above we only want to retrieve the lua exception message which may be
found at the top of the stack; to do so, we now explicitly take the lua
lock before accessing the lua stack. Note that hlua_lock() doesn't catch
lua exceptions so only safe lua functions are expected to be used there
(lua functions that may NOT raise exceptions).
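The corrected pattern thus looks like this (simplified):
|if (!SET_SAFE_LJMP(hlua)) {
|        const char *error;
|
|        hlua_lock(hlua);
|        if (lua_type(hlua->T, -1) == LUA_TSTRING)
|                error = hlua_tostring_safe(hlua->T, -1);
|        else
|                error = "critical error";
|        SEND_ERR(NULL, "*: %s.\n", error);
|        hlua_unlock(hlua);
|}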
It should be backported to all stable versions.
[ada: some ctx adj will be required for older versions as event_hdl
doesn't exist prior to 2.8 and filters were implemented in 2.5, thus
some chunks won't apply, but other fixes should stay relevant]
In hlua_filter_new(), after each hlua resume, we systematically try to
empty the stack by calling lua_settop(). However we're doing this without
locking the lua context, so it is unsafe in multithreading context if the
script is loaded using 'lua-load'. To fix the issue, we protect the call
with hlua_{lock,unlock}() helpers.
This should be backported up to 2.6.
In hlua_filter_callback(), some lua stack work is performed under
SET_SAFE_LJMP() guard which also takes care of locking the hlua context
when needed. However, a lua_gettop() call is performed out of the guard,
thus it is unsafe in multithreading context if the script is loaded using
'lua-load' because in this case the main lua stack is shared between
threads and each access to a lua stack must be performed under the lock,
thus we move the lua_gettop() call under the lock.
It should be backported up to 2.6.
hlua_filter_new() handles memory allocation errors by jumping to the
"end:" cleanup label in case of errors. Such errors may happen when the
system is heavily loaded for instance.
In hlua_filter_new(), we try to allocate two hlua contexts in a row before
checking if one of them failed (in which case we jump to the cleanup part
of the function), and only then we initialize them both.
If a memory allocation failure happens for only one out of the two
flt_ctx->hlua[] contexts pair, we still jump to the cleanup part.
It means that the hlua context that was successfully allocated and wasn't
initialized yet will be passed to hlua_ctx_destroy(), resulting in invalid
reads in the cleanup function, which may ultimately cause the process to
crash.
To fix the issue: we make sure flt_ctx hlua contexts are initialized right
after they are allocated, that is before any error handling condition that
may force the cleanup.
This bug was discovered when trying to reproduce GH #2467 with haproxy
started with "-dMfail" argument.
It should be backported up to 2.6.
As per lua documentation, lua_tostring() may raise a memory error.
However, we're often using it to fetch the error message at the top of
the stack (ie: after a failing lua call) from unprotected environments.
In practice, lua_tostring() has rare chances of failing, but still, if
it happens to be the case, it could crash the process and we better not
risk it.
So here, we add hlua_tostring_safe() function, which works exactly as
lua_tostring(), but the function cannot LJMP as it will catch
lua_tostring() exceptions to return NULL instead.
Everywhere lua_tostring() was used to retrieve error string from such
unprotected contexts, we now rely on hlua_tostring_safe().
This should be backported to all stable versions.
[ada: ctx adj will be required, for versions prior to 2.8 event_hdl
API didn't exist so some chunks won't apply, and prior to 2.5 filters
API didn't exist either, so again, some chunks should be ignored]
Lua documentation says that lua_tostring() returns a pointer that remains
valid as long as the object is not removed from the stack.
However there are some places where we use the returned string AFTER the
corresponding object is removed from the stack. In practice this doesn't
seem to cause visible bugs (probably because the pointer remains valid
waiting for a GC cycle), but let's fix that to comply with the
documentation and avoid undefined behavior.
It should be backported in all stable versions.
Thomas Baroux reported a very interesting issue. "balance random" would
systematically assign the same server first upon restart. That comes from
its use of statistical_prng() which is only seeded with the thread number,
and since at low loads threads are assigned to incoming connections in
round robin order, practically speaking, the same thread always gets the
same request and will produce the same random number.
We already have a much better RNG that's also way more expensive, but we
can use it at boot time to seed the PRNG instead of using the thread ID
only.
This needs to be backported to 2.4.
Add core.silent (-1) value to be able to disable logging via
TXN:set_loglevel() call. Otherwise, there is no way to do so and it may be
handy. This special value cannot be used with TXN:log() function.
This patch may be backported if necessary.
When the log level is changed in lua, by calling TXN:set_loglevel function,
it must be incremented by one because it is decremented in strm_log()
function.
This patch must be backported to all stable versions.
PROXY protocol is not supported on QUIC for now. Thus return an error during
configuration parsing if the 'accept-proxy' option is used for a QUIC
listener.
This patch should fix the issue #2186. It should be backported as far as 2.6.
2 return values are specified in the h2s_make_data() function comment. Both
are more or less equivalent but the latter is probably more accurate. So,
keep the right one and remove the other one.
This patch should fix the issue #2175.
Extend "show quic" to be able to dump MUX related information. This is
done via the new function qcc_show_quic(). This replaces the old streams
dumping list which was incomplete.
This information is displayed on the full output or by specifying the "mux" field.
Add the possibility to customize show quic full output with only a
specific set of printed fields. This is specified as a comma-separated
list. Here are the currently supported values:
* tp: transport parameters
* sock: connection addresses and socket FD
* pktns: packet number space with ack ranges and in flight bytes
* cc: congestion controller and loss information
Note that the streams output is not filtered by this mechanism. This is
because it will soon be replaced by an output generated from the MUX which
will use its own field names.
Add the possibility to restrict show quic output to only a single
connection. This is done by specifying a quic_conn address pointer.
Default format selection has evolved with it. Indeed, it seems more
fitting to use full format by default when filtering on a connection.
However, it's still possible to revert to the original oneline format
with it by specifying it explicitly.
When a response was returned by HAProxy, a dedicated HTX flag was
set. Thanks to this flag, it was possible to add a "connection: close"
header to the response if the request was not fully received and to close
the connection. In the same way, when a redirect rule was applied,
keep-alive was forcefully disabled for unfinished requests.
All these mechanisms are now useless because the H1 mux is able to drain the
request. So the HTX_FL_PROXY_RESP flag is removed and no special processing is
performed on HAProxy response when the request is unfinished.
Unlike H2 and H3, there is no mechanism in H1 to notify the client that it
must stop uploading data when a response is sent before the end of the
request, short of closing the connection. There is no RST_STREAM frame
equivalent.
Thus, there are only two ways to deal with this situation: closing the
connection or draining the request. Until now, HAProxy didn't support
draining H1 messages. Closing the connection in this case however has a
major drawback. It leads to sending a TCP reset, dropping this way all
in-flight data. There is no guarantee the client has fully received the
response.
Draining H1 messages was never implemented because in old versions it was a
bit tricky to implement. However, it is now far simpler to support this
feature because it is possible to have an H1 stream without any applicative
stream. It is the purpose of this patch. Now, when a shutdown is requested
and the stream is detached from the connection, if the request is unfinished
while the response was fully sent, the request is drained.
To do so, in this case the shutdown and the detach are delayed. From the
upper layer point of view, there are no changes. The endpoint is shut down
and detached as usual. But from the H1 mux point of view, the H1 stream is
still alive and is able to drain data. However the stream-endpoint
descriptor is orphan. Once the request is fully received (and drained), the
connection is shut down if it cannot be reused for a new transaction and the
H1 stream is destroyed.
All code from the h1_detach() function was moved into an internal function,
h1s_finish_detach(). It will be used to defer the detach and be able to
drain the requests payload.
Checks performed in h1_shutw() to determine if the connection must be
shut down now or not were moved into a dedicated function. This will be used to
be able to drain the requests payload.
During a zero-copy forwarding negotiation, if the H1 mux is blocked for any
reason, the IOBUF_FL_FF_BLOCKED flag must be set on its iobuf to notify the
producer it must wait. However, there were two places where it was not
performed: when the output buffer allocation failed and when the chunk
formatting failed.
This patch fixes the issue. It must be backported to 2.9.
There is still an issue with zero-copy forwarding of chunks with an unknown
size. It is possible for a producer to fill the space reserved for the CRLF
at the end of the chunk. The root cause is that this space is not accounted
in the iobuf offset. So, from the producer point of view, the space may be
used. We can also argue the current design for iobuf is not well suited for
this case. Instead of using a pointer on the consumer's buffer, it could be
easier to use a custom buffer built on top of the consumer one, via a call
to b_make(), with the size, head and data fields reflecting the available
space the producer can use.
By the way, because of this bug, it is possible to trigger a BUG_ON() when
we try to write the CRLF at the end of the chunk because the buffer is
full. It is unexpected. Only the stats applet may hit this bug.
To fix the issue, instead of writing this CRLF when the current chunk is
consumed, it is written before consuming the next one. This way, all space
reserved to create the chunk formatting is always placed before forwarding
data.
No backport needed.
This is a followup of the previous commit: GH user @songliumeng initially
reported an issue with the GPL license version for event_hdl source file
which was fixed by the previous commit. It turns out the same mistake was
made in http_ext source file: due to a mixup between LGPL and GPL, GPL
version '2.1' was referenced instead of '2'.
Again, clarify that this is indeed GPL by making use of the banner
provided in doc/gpl.txt
This should be backported in 2.8 with b2bb925 ("MINOR: proxy/http_ext:
introduce proxy forwarded option")
As spotted by user @songliumeng in GH #2463, there was a mixup between
LGPL and GPL in event_hdl source file: GPL version '2.1' was referenced
instead of '2'. Clarify that this is indeed GPL by making use of the
banner provided in doc/gpl.txt.
This should be backported in 2.8 with 68e692d ("MINOR: event_hdl: add
event handler base api")
Since 23cab33 ("BUG/MINOR: ssl: Clear the ckch instance when deleting a
crt-list line"), LIST_DELETE is done twice, one time in
cli_parse_del_crtlist() and another time in ckch_inst_free().
It could trigger a crash with -DDEBUG_LIST.
This isn't a major problem since the ptr is not freed in the meantime so
it will only trigger with the debug.
This patch removes the LIST_DELETE as well as the loop done on link_ref
which is also done in ckch_inst_free().
Could be backported as far as 2.4. 2.4 version does not have a link_ref
loop.
Contrary to static servers, dynamic servers do not initialize their
settings from a default server instance. As such, _srv_parse_init() was
responsible to set a set of minimal values to have a correct behavior.
However, some settings were not properly initialized. This caused
dynamic servers to not behave as static ones without explicit
parameters.
Currently, the main issue detected is connection reuse which was
completely impossible. This is due to incorrect pool_purge_delay and
max_reuse settings incompatible with srv_add_to_idle_list().
To fix the connection reuse, but also more generally to ensure dynamic
servers are aligned with other server instances, define a new function
srv_settings_init(). This is used to set initial values for both default
servers and dynamic servers. For static servers, srv_settings_cpy() is
kept instead, using their default server as reference.
This patch could have unexpected effects on dynamic server behavior as
it restores proper initial settings. Previously, they were set to 0 via
calloc() invocation from new_server().
This should be backported up to 2.6, after a brief period of
observation.
This patch reverts 2 fixes that were made in an attempt to fix the
ocsp-update feature used with the 'commit ssl cert' command.
The patches crash the worker when doing a soft-stop when the 'set ssl
ocsp-response' command was used, or during runtime if the ocsp-update
was used.
This was reported in issue #2462 and #2442.
The last patch reverted is the associated reg-test.
Revert "BUG/MEDIUM: ssl: Fix crash when calling "update ssl ocsp-response" when an update is ongoing"
This reverts commit 5e66bf26ec.
Revert "BUG/MEDIUM: ocsp: Separate refcount per instance and per store"
This reverts commit 04b77f84d1b52185fc64735d7d81137479d68b00.
Revert "REGTESTS: ssl: Add OCSP related tests"
This reverts commit acd1b85d3442fc58164bd0fb96e72f3d4b501d15.
In appctx_htx_rcv_buf(), HTX blocks found in the appctx output buffer are
copied into the channel buffer. At the end, the state of the underlying
buffer must be updated. If everything was copied, the buffer is reset. This
way, it will be released later, at the end of the applet process function.
However, here there was a typo. We do it on the input buffer instead of the
output buffer. As a side effect, an empty HTX message remained stuck in the
appctx output buffer, blocking the applet and leading to a blocked session
with no expiration date.
No backport needed.
Due to the possibility of calling a control process after adding CRLs, the
ssl_commit_crlfile_cb variable was added. It is actually a pointer to the
callback function, which is called if defined after initial loading of CRL
data from disk and after committing CRL data via CLI command
'commit ssl crl-file ..'.
If the callback function returns an error, then the CLI commit operation
is terminated.
Also, one case was added to the CLI context used by "commit cafile" and
"commit crlfile": CACRL_ST_CRLCB in which the callback function is called.
Signed-off-by: William Lallemand <wlallemand@haproxy.com>
Output of 'show quic' is messed up since the introduction of reordered
packets counter in the following commit. The new counter is mixed up
with the first stream line. This is due to the wrong placement of the
newline delimiter.
167e38e0e0
MINOR: quic: Add a counter for reordered packets
This should be backported up to 2.6.
The issue was described in commit "BUG/MEDIUM: cli: Warn if pipelined commands
are delimited by a \n". In non-interactive mode, it was possible to use a
newline character as delimiter for pipelined commands. As a consequence, it was
possible to stop command processing in the middle.
With the above commit, a warning is emitted to notify users. With this one,
we restore the expected behavior, as documented in the management guide.
Only the first line of commands is parsed. This commit will not be
backported to avoid breaking changes on stable versions.
This commit has of course some visible effects. All scripts using a newline
character as delimiter to pipeline commands in non-interactive mode will
stop working. Only the first command will be evaluated, all others will be
ignored. Pipelined commands MUST now be separated by a semi-colon.
But there is a more subtle and probably more annoying change. It is no
longer possible to pipeline commands with a payload! A command with a
payload will always be the last one evaluated because it must be finished by
a newline (possibly preceded by a custom pattern).
It is really annoying to introduce such a breaking change. But, in the long
term, it is mandatory. 2.8 will be the last LTS version supporting the old
behavior (with some warnings however). This will give users 4 years to adapt
their scripts.
No backport needed.
This was broken since commit 0011c25144 ("BUG/MINOR: cli: avoid O(bufsize)
parsing cost on pipelined commands"). It is not really a bug fix but it is
labelled as such to make it more visible.
Before, a full line was first retrieved from the request buffer before
extracting the first command to eval it. Now, only one command is retrieved.
But we rely on the request buffer state to interrupt processing in
non-interactive mode. After processing a command, if the output of the
request buffer is empty, we leave. Before the above commit, this was not a
problem. But since then, it is obviously wrong. First because some input
data may still be there. It is not true today, but it might change. Then,
there is no guarantee that all commands are received at the same time. For a
small list of commands, it will be the case most of the time, but it is a
dangerous assumption. For a long list of commands, it is almost always false.
To be an issue, commands must be chunked exactly between two commands. But
in this case, remaining commands are skipped. A good way to reproduce the
issue is to wait a bit between two commands, for instance:
(printf "show info;"; sleep 2; printf "show stat\n") | socat ...
In fact, to properly fix the issue, we should exit on the first command
finished by a newline. Indeed, as stated in the documentation, in
non-interactive mode, a single line is processed. To pipeline commands,
commands must be separated by a semi-colon. Unfortunately, the above commit
introduced another change. It is possible to pipeline commands delimited by
a newline. It was pushed 2 years ago and backported to all stable versions.
Several scripts may rely on this behavior.
So, on stable versions, the bug will not be fixed. However a warning will be
emitted to notify users that their scripts don't respect the documentation
and that they must adapt them. Mainly because the cli behavior on this point
will be changed in 3.0 to stick to the doc. This warning will only be emitted
once over the whole worker process life. The idea is to not flood the logs
with the same warning for every offending command.
This commit should probably be backported to all stable versions. But with
some cautions because the CLI was often modified.
This loop was added to detect pipelined commands when only co_getline() was
used to get commands. Now, co_getdelim() is used and the semi-colon is also
considered as a command delimiter.
As a side effect, the last semi-colon, if any, is no longer replaced by a
newline. Thus, we must take care to adapt the test to detect partial
commands.
On qcs_destroy(), a BUG_ON() statement checks that the QCS does not have
any prepared data anymore. This is to ensure connection flow control is
always coherent and to prevent transfer freezes.
However, this BUG_ON() may cause a spurious crash in case the QCC is
considered in error. Indeed, in this case, all transfers are interrupted
and qmux_strm_detach() will proceed to immediate QCS free before
releasing the connection. In this situation, connection flow control is
irrelevant so the BUG_ON() should be ignored.
This crash occurs since the MUX refactoring via the following patch.
Previously, a similar BUG_ON() was used but it was incorrectly
implemented, rendering it immune even to the targeted cause.
3fe3251593
MEDIUM: mux-quic: simplify sending API
This should fix github issue #2456.
This does not need to be backported.
Before a dynamic server can be deleted, a set of preconditions must be
validated to ensure it is not referenced anymore by a stream or a
connection. This is implemented in srv_check_for_deletion().
The various criteria specified were incomplete. This allowed a server
instance to be deleted while still being referenced by a stream and a
connection.
This bug was reproduced by using ASAN compilation. A script was used to
add and delete a server every second, while using h2load to generate
traffic with download of 1k objects. Here is the ASAN error.
==140916==ERROR: AddressSanitizer: heap-use-after-free on address 0x520000020080 at pc 0x63cb25679537 bp 0x701529ff5070 sp 0x701529ff5060
READ of size 1 at 0x520000020080 thread T7
#0 0x63cb25679536 in objt_server include/haproxy/obj_type.h:99
#1 0x63cb2568f465 in process_stream src/stream.c:1823
#2 0x63cb25a4a4a2 in run_tasks_from_lists src/task.c:632
#3 0x63cb25a4bf62 in process_runnable_tasks src/task.c:876
#4 0x63cb2596a220 in run_poll_loop src/haproxy.c:3050
#5 0x63cb2596b192 in run_thread_poll_loop src/haproxy.c:3252
#6 0x701539aa9559 (/usr/lib/libc.so.6+0x8b559) (BuildId: c0caa0b7709d3369ee575fcd7d7d0b0fc48733af)
#7 0x701539b26a3b (/usr/lib/libc.so.6+0x108a3b) (BuildId: c0caa0b7709d3369ee575fcd7d7d0b0fc48733af)
To fix this, add <curr_used_conns> to the counters checked in
srv_check_for_deletion().
Outside of this bug, one case which remains sensitive is for SF_DIRECT
streams which reference a server instance early in process_stream()
before connect_server(). This occurs with the use-server directive,
force-persist rule or cookie persistence. However, after code
reexamination, the code is considered reliable as process_stream() is
not rescheduled before connect_server() invocation. These observations
have been saved in sess_change_server() documentation to ensure it
remains valid in the future.
This must be backported up to 2.6.
Server refcount is used to protect from server deletion while dumping a
server instance, for stats dump on both CLI and HTTP applet. However,
dump can be aborted prematurely before reaching the end. In this case,
server refcount is never decremented.
This bug can cause an inconsistency on server refcounts, preventing servers
from being deleted even after "del server" succeeds.
To fix this, implement release handler for both stats CLI and HTTP
applet. Drop server reference if dump was interrupted during servers
loop.
This should be backported up to 2.6.
Recent commit 2ed6068 ("MINOR: log: custom name for logformat node")
introduced a potential memory leak because when custom name is provided,
lf->name value is allocated using strdup(), thus is expected to be freed
alongside the node when the node is released.
However lf->name was only freed in some common places within log.c
cleanups and helper functions, but in reality there are still cases where
lf nodes are manually freed without making use of the freeing helpers.
That's what this patch does: it makes sure all lf freeing places now
leverage the free_logformat_node() helper function that takes care of
freeing all known allocated elements within the node, including the custom
name.
This commit depends on:
- "MINOR: log: add free_logformat_node() helper function"
No backport needed unless 2ed6068 gets backported.
This is a follow up for 24a5e42db6 ("CLEANUP: log: deinitialization of
the log buffer in one function") as there was another opportunity to
make use of the new cleanup function.
A chunked message transferred via zero-copy forwarding in H1 may be
corrupted. This only happens when the chunk size is not known during the
nego stage and when there is nothing to forward when h1_done_ff() is
called. In this case, we always emit a chunk. Because there is nothing to
forward, a 0-CRLF is emitted in the middle of the message.
The issue occurred with the HTTP stats applet only.
A simple fix is to check the size of the data in the iobuf before emitting a
new chunk in h1_done_ff(). However, we still try to send outgoing data
because when this happens, it is most of the time because the H1 output
buffer is almost full.
This patch should fix the issue #2453. No backport needed.
Previously, the msghdr struct used for sendmsg was memset to 0. This was
updated for performance reasons, with each member individually defined.
This is done by the following commit :
commit 107d6d7546
OPTIM: quic: improve slightly qc_snd_buf() internal
msg_flags is the only member left unset, as the sendmsg manual page reports
that it is unused. However, this caused a coverity report. In the end, it is
better to explicitly set it to 0 to avoid any future questions, compiler
warnings or even portability issues.
This should fix coverity report from github issue #2455.
No need to backport unless above patch is.
The to_forward field was added to the debugging output of applets with commit
62a81cb6a ("MINOR: applet: Add callback function to deal with zero-copy
forwarding"), though it's a size_t printed as %lu, which causes complaints
on 32-bit archs. Let's just cast it so that it matches the %lu format.
No backport is needed.
This patch is the direct followup of the previous one :
MINOR: quic: remove sendto() usage variant
This finalizes the qc_snd_buf() simplification by removing the send() syscall
usage for the quic-conn owned socket. Syscall invocation is merged into a
single code location using the sendmsg() variant.
The only difference for the owned socket is that the destination address for
sendmsg() is set to NULL. This usage is documented in man 2 sendmsg as
valid for connected sockets. This allows maximum performance by avoiding
unnecessary lookups on kernel socket address tables.
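For illustration only (this is not the actual qc_snd_buf() code, and the
helper name is made up), the connected-socket case described above boils
down to leaving the destination out of the msghdr:

  #include <stddef.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Sketch: emit <len> bytes from <buf> on an already connected socket.
   * msg_name/msg_namelen are left NULL/0 so the kernel reuses the
   * destination set at connect() time, skipping the address lookup.
   */
  static ssize_t send_on_connected_fd(int fd, const void *buf, size_t len)
  {
          struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
          struct msghdr msg = {
                  .msg_name    = NULL,
                  .msg_namelen = 0,
                  .msg_iov     = &iov,
                  .msg_iovlen  = 1,
                  /* remaining members (incl. msg_flags) stay zero */
          };

          return sendmsg(fd, &msg, 0);
  }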
As the previous patch, no functional change should happen here. However,
it will be simpler to extend qc_snd_buf() for GSO usage.
qc_snd_buf() is a wrapper around emission syscalls. Depending on the QUIC
configuration, a different variant is used. When using a connection
socket, send() is the only one used. For listener sockets, sendmsg() and
sendto() are possible. The first one is used only if the local address has
been retrieved beforehand. This allows to pin it on sending to guarantee the
source address selection. Finally, sendto() is used for systems which do
not support local address retrieval.
All of these variants render the code too complex. As such, this patch
simplifies this by removing sendto() alternative. Now, sendmsg() is
always used for listener sockets. Source address is then specified only
if supported by the system.
This patch should not exhibit functional behavior changes. It will be
useful when implementing GSO as the code is now simpler.
When using a listener socket, the source address for emission is explicitly
set using ancillary data for sendmsg(). This is useful to guarantee the
correct address is used when binding on a non-explicit address.
This code was implemented directly under qc_snd_buf(). However, it is
quite complex due to portability issues. For IPv4, two parallel
implementations coexist, defined under IP_PKTINFO or IP_RECVDSTADDR. For
IPv6, another option is defined under IPV6_RECVPKTINFO. Each variant
uses its own distinct names, which increases the code complexity.
Extract the ancillary data filling into a dedicated function named
cmsg_set_saddr(). This greatly reduces the body of qc_snd_buf(). Such
functions can be replicated when other ancillary data types are
implemented. This will notably be useful for the GSO implementation.
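As an illustration of the idea only (the real cmsg_set_saddr() handles the
IP_RECVDSTADDR and IPV6_RECVPKTINFO variants as well, and its exact
prototype may differ), the IPv4/IP_PKTINFO case can be sketched as:

  #define _GNU_SOURCE /* struct in_pktinfo on glibc */
  #include <netinet/in.h>
  #include <string.h>
  #include <sys/socket.h>

  /* Sketch: force the source address used by sendmsg(). Assumes
   * msg->msg_control points to a buffer of at least
   * CMSG_SPACE(sizeof(struct in_pktinfo)) bytes and msg->msg_controllen
   * was set to that size beforehand.
   */
  static void cmsg_set_saddr_v4(struct msghdr *msg, const struct in_addr *saddr)
  {
  #ifdef IP_PKTINFO
          struct cmsghdr *cmsg = CMSG_FIRSTHDR(msg);
          struct in_pktinfo *pktinfo;

          cmsg->cmsg_level = IPPROTO_IP;
          cmsg->cmsg_type  = IP_PKTINFO;
          cmsg->cmsg_len   = CMSG_LEN(sizeof(*pktinfo));
          pktinfo = (struct in_pktinfo *)CMSG_DATA(cmsg);
          memset(pktinfo, 0, sizeof(*pktinfo));
          pktinfo->ipi_spec_dst = *saddr;  /* source address selection */
          msg->msg_controllen = CMSG_SPACE(sizeof(*pktinfo));
  #endif
  }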
qc_snd_buf() is a wrapper for the sendmsg() syscall (or its derivatives)
used for all QUIC emissions. This patch aims at removing several
non-optimal code sections:
* fd_send_ready() for connected sockets is only checked in the function
preamble instead of inside the emission loop
* zeroing the msghdr structure for unconnected sockets is removed. This is
unnecessary as all fields are then properly initialized.
* extra memcpy/memset invocations when using IP_PKTINFO/IPV6_RECVPKTINFO
are removed by setting directly the address value into cmsg buffer
Binding on multiple addresses for QUIC is safe only if IP_PKTINFO or an
equivalent is available. Otherwise, the behavior may be undefined as the
system is responsible for choosing the network interface and source address
of the response.
This commit adds a warning on boot if no or partial support for
IP_PKTINFO or equivalent is detected and configuration contains UDP
binding on multiple addresses.
This should be backported up to 2.6. Special backport recommendations:
* change ha_warning() to ha_diag_warning() to ensure no spurious
warnings will be triggered on stable releases
* IP_PKTINFO usage was introduced in 2.7. For 2.6, multiple-address
QUIC binding is always unreliable. As such, the preprocessor condition
must simply be removed so that the warning is always active regardless
of the system. The warning message should also be truncated to remove the
IP_PKTINFO reference.
Make it so that string array construction is performed by dedicated macro
helpers instead of manual char insertion between string members.
The goal is to easily be able to support multiple forms of array
construction depending on the data encoding format (raw, json..).
Only %hrl and %hsl logformats are concerned.
Some log variables may be prefixed with specific chars that carry extra
information relevant to the value but not directly part of the "raw" value
itself.
e.g.: a '+' char is prepended to some values when "option logasap" is
used, to indicate that the value has not yet reached its final value.
However, as those "metadata" are printed using the general purpose
LOGCHAR() printing helper, it's not easy to tell if they are part of the
base value or not.
In this patch we add the LOGMETACHAR() helper that is a wrapper for
LOGCHAR(). The goal is to prepare for adding some logic to prevent such
additional infos from being generated when not relevant or needed.
Quote building for some log formats is directly performed under each
switch-case statement, so it would become painful to add other conditions
to prevent the quotes from being generated when they are not supported by
the data encoding format, for instance (e.g.: JSON).
Let's centralize and simplify quotes handling by adding LOGQUOTE_START()
and LOGQUOTE_END() helper macros. If a quotation is started and not
explicitly ended, it will be automatically ended at the end of the current
logformat node:
LOGQUOTE_START() sets 'quote' variable to 1, this way LOGQUOTE_END() only
prints the ending quote when needed. LOGQUOTE_END() is systematically
called after each node switch-case (after each value). LOGQUOTE_START()
does nothing if LOG_OPT_QUOTE isn't set, and neither does LOGQUOTE_END().
Some rare cases such as %hsl (list of captured headers) required special
handling: in this case multiple quoted texts are generated for the same
field value so explicit LOGQUOTE_START() + LOGQUOTE_END() combination was
needed.
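As a rough standalone sketch (the actual definitions in log.c may differ;
LOGCHAR() is simplified to putchar() and <options>/<quote> stand for the
locals of the log building loop):

  #include <stdio.h>

  #define LOGCHAR(c)     putchar(c)
  #define LOG_OPT_QUOTE  0x01

  /* open the quote once, only when quoting is enabled */
  #define LOGQUOTE_START() do {                        \
          if ((options & LOG_OPT_QUOTE) && !quote) {   \
                  LOGCHAR('"');                        \
                  quote = 1;                           \
          }                                            \
  } while (0)

  /* close the quote only if one was opened */
  #define LOGQUOTE_END() do {                          \
          if (quote) {                                 \
                  LOGCHAR('"');                        \
                  quote = 0;                           \
          }                                            \
  } while (0)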
last_isspace variable is explicitly set to 0 in all cases except
LOG_FMT_SEPARATOR case. So we can actually simplify the code by setting
last_isspace to 0 by default and skipping the assignment for the
LOG_FMT_SEPARATOR case.
Add the ability to manually specify desired output type after a custom
field name for logformat nodes. Forcing the type can be useful to ensure
value is stored with the proper type representation (e.g.: forcing a
numerical value to string to work around the limited resolution of JS number
types).
By default, type is set to SMP_T_SAME, which means the original type will
be preserved.
Currently supported types are: bool, str, sint
type_to_smp(type) does the reverse operation of smp_to_type[smp]: it takes
a type name as input string and tries to return the corresponding SMP_T_*
smp type or SMP_TYPES if not found.
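A minimal sketch of such a lookup, assuming smp_to_type[] is the existing
array mapping SMP_T_* values to their names, could be:

  #include <string.h>

  /* Sketch: return the SMP_T_* value whose name matches <type>, or
   * SMP_TYPES when no type matches.
   */
  static int type_to_smp(const char *type)
  {
          int it;

          for (it = 0; it < SMP_TYPES; it++)
                  if (strcmp(type, smp_to_type[it]) == 0)
                          return it;    /* found */
          return SMP_TYPES;             /* not found */
  }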
Add the ability to specify custom name (will be used for representation
in verbose output types such as json) to logformat nodes.
For now, a custom name should be composed of characters matching [a-zA-Z0-9-_]*
Transient send errors are handled differently depending on whether a
connection or a listener socket is used for QUIC transfers. In the first
case, proper poller subscription is used via fd_cant_send()/fd_want_send().
For the listener socket case, the error is ignored by the qc_snd_buf()
caller and the retransmission mechanism will allow to re-emit the data.
For the listener socket, transient error code handling is buggy. It blindly
uses fd_cant_send() with the <qc.fd> member, which is set to -1 for listener
socket usage. This results in an invalid fdtab access, with a possible
crash or a modification of a totally unrelated FD.
This bug is simply fixed by using qc_test_fd() before using
fd_cant_send()/fd_want_send(). This ensures <qc.fd> is used only if
initialized which is only the case when using connection socket.
No crash was reported yet for this bug. However, it is reproducible by
using ASAN compilation and the following strace sendmsg() errno command
injection :
# strace -qq -yy -p $(pgrep haproxy) -f -e trace=%network \
-e inject=sendto,sendmsg:error=EAGAIN:when=20+20
This must be backported up to 2.7.
If some data are received for a lua socket while the lua script responsible
for consuming these data is not ready to do so, for instance because it is
sleeping, the applet is woken up in a loop because it never states that it
will not consume these data yet.
To fix the issue, in the applet I/O handler, when there are outgoing data, we
always pretend the applet will not consume them. It is the responsibility of
the lua script to reactivate receives by calling the Socket.receive()
function.
This patch must be backported to every stable version. For 2.4 and older,
si_want_get()/si_cant_get() must be used instead of
applet_will_consume()/applet_wont_consume().
It is possible to create a lua socket without performing any connect. In
this case, the lua socket is released by the garbage collector.
However, the garbage collector does not release the applet, it wakes it
up. Since commit 751b59c40b ("BUG/MEDIUM: hlua: Initialize appctx used by a
lua socket on connect only"), the applet initialization is performed on
connect. So, here, it is possible to wake an uninitialized applet. It is an
unexpected case for the applet's I/O handler, leading to a segfault because
some resources are not initialized (the stream's target in this case).
So, now, in the lua socket GC function, we take care to immediately release
uninitialized applets. At worst, the release itself is delayed. But it is
safe because we are sure the applet's I/O handler will never be executed.
In addition, we take care to increment the GC counter when the lua socket is
created. This way, uninitialized lua sockets are released more quickly.
This patch should fix the issue #2451. It must be backported as far as 2.6.
When an error is triggered during the applet initialization, a dedicated
function is called to release it. Indeed, in this case, because the applet
was not initialized, the ->release callback must not be called. However,
because the init stage may be delayed and performed during the first
applet wakeup, we must also take care not to rely on the default
appctx_free() function, and to immediately release the applet. Otherwise, if
the error happens in a delayed init stage, the applet is never released.
This patch partially fixes the issue #2451. It must be backported as far as
2.6.
Currently haproxy does not implement dynamic table support for QPACK. As
such, the dynamic table capacity advertised via H3 SETTINGS is 0. When
receiving a non-null Set Dynamic Table Capacity instruction, close
immediately the connection using QPACK_ENCODER_STREAM_ERROR.
Prior to this patch, such instructions were simply ignored. This does not
conform to the QUIC specification.
This should be backported up to 2.6. Note that on 2.6 qcc_set_error()
must be replaced by function qcc_emit_cc_app().
Close the connection using QPACK_DECODER_STREAM_ERROR when receiving an
invalid insert count increment. As haproxy does not use dynamic table,
this instruction must never be emitted by the peer.
Prior to this patch, haproxy silently ignored such an instruction, which does
not conform to the QUIC specification.
This should be backported up to 2.6. Note that on 2.6 qcc_set_error()
must be replaced by function qcc_emit_cc_app().
As specified in RFC 9000, a client must never emit a HANDSHAKE_DONE
frame. If this happens, the server must close the connection with error
PROTOCOL VIOLATION.
Previously, such a frame was silently discarded on the server side. The
connection remained open, which does not conform to the specification.
This should be backported up to 2.6.
Ensure every frame type is handled in qc_parse_pkt_frms(). Add an
ABORT_NOW on the default case. This is safe as an unknown frame must have
been rejected earlier via qc_parse_frm().
As specified by RFC 9000, connection is closed on error if an unknown
QUIC frame type is received.
Previously, a frame with an unknown type was silently discarded. The
connection remained open, which does not conform to the specification.
This should be backported up to 2.6.
Global options to disable zero-copy forwarding are now tested outside the
callbacks responsible for performing the forwarding itself. It is cleaner
this way because we don't try zero-copy forwarding at all if at least one
side does not support it. It is equivalent to what was performed before, but
it is simpler this way.
There is a nego stage when a producer is ready to forward data to the other
side. At this stage, the zero-copy forwarding may be disabled if the
consumer does not support it. However, there is a flaw with this way of
proceeding. If the channel buffer is not empty, we delay the zero-copy
forwarding to flush all data from the channel first. During this delay,
receives on the endpoint (at the connection level for muxes) are blocked to
be sure to have the opportunity to switch to zero-copy forwarding. It is a
problem if the consumer cannot flush data from the channel's buffer, waiting
for more data for instance.
It is especially annoying with the CLI applet, because this scenario can
happen if a command is partially received. For instance without the LF at
the end. In this case, the CLI applet is blocked because it waits more
data. The frontend connection is also blocked because the channel's data
must be flushed before trying to receive more data. Worse, this happens at a
point where no timeout is armed. Thus the session is stuck indefinitely,
client aborts cannot be detected because receives are blocked, and the
applet cannot abort on its side because there are pending outgoing data. It
is clearly a situation where it is easy to consume all CLI slots.
To fix the issue, thanks to previous commits, we now check zero-copy
forwarding support on both sides before proceeding.
This patch relies on the following commits:
* MINOR: muxes: Announce support for zero-copy forwarding on consumer side
* MINOR: stconn: Add SE flag to announce zero-copy forwarding on consumer side
* MINOR: stconn: Rename SE_FL_MAY_FASTFWD and reorder bitfield
* CLEANUP: stconn: Move SE flags set by app layer at the end of the bitfield
All the series must be backported to 2.9.
It is unused for now, but the muxes announce their support of zero-copy
forwarding on the consumer side. All muxes, except the FCGI one, support
it.
To fix a bug, a flag to announce the capability to support zero-copy
forwarding on the consumer side will be added on the SE descriptor. So the
old flag SE_FL_MAY_FASTFWD is renamed to indicate it concerns the producer
side. It is now SE_FL_MAY_FASTFWD_PROD. And to prepare the addition of the
new flag, the bitfield is a bit reordered.
This reverts commits 0b93ff8c87 ("BUG/MEDIUM: stconn: Wake applets on
sending path if there is a pending shutdown") and 9e394d34e0 ("BUG/MINOR:
stconn: Don't report blocked sends during connection establishment") because
they were not the right fixes.
We must not wake an applet up when a shutdown is pending because it means
some output data are still blocked in the channel buffer. The applet does
not necessarily consume these data. In this case, the applet may be woken up
indefinitely, unless it explicitly reports that it won't consume data yet.
This patch must be backported as far as 2.8. For older versions, as far as
2.2, it may be backported. If so, a previous fix must be pushed to prevent
an HTTP applet from being stuck. In http_ana.c, in http_end_request() and
http_end_response(), the call to channel_htx_truncate() on the request
channel in case of MSG_ERROR must be replaced by a call to
channel_htx_erase().
Thanks to the previous patch ("MEDIUM: applet: Add notion of shutdown for
write for applets"), it is no longer necessary to check SC flags to detect
shutdowns to interrupt the wait command. It is possible to remove this ugly
workaround. In addition, we only test the SE for shutdown because end of
stream and error are already checked by the CLI I/O handler. And it is no
longer necessary to remove output data from the channel's buffer because
shutdowns are not reported if there are remaining outgoing data.
Of course, if the "wait" command is backported, the commit above and this
one must be backported too.
In fact there are already flags on the SE to state that a shutdown for reads
or writes was performed. But for applets, this notion does not exist. Both
flags are set at the same time when the applet is released. But at the SC
level, there are functions to perform a shutdown (formerly the shutw) and an
abort (formerly the shutr). For applets, when a shutdown is performed on the
SC, if the applet is not immediately released, nothing is acknowledged at
the SE level.
With the old way to implement applets, this was not a real issue until
recently because applets accessed the channel/SC flags. It was thus possible
to catch the shutdowns. But the "wait" command on the CLI reveals the
flaw. Indeed, when this command is executed, nothing is read or sent. So, it
is not possible to detect the shutdowns. As a workaround, a dedicated test
on the SC flags was added at the end of the wait command I/O handler. But it
is pretty ugly.
With the new way to implement applets, there is no longer access to the
channel or the SC. So we must add a way to acknowledge shutdowns in the SE.
This patch solves both sides of the issue. The shutw notion is added for
applets. Its only purpose is to set the SE_FL_SHWN flag. This flag is tested
by all applets, so it solves the issue quite simply.
Note that it is described as a bug fix but there is no real issue, just a
design flaw. However, if the "wait" command is backported, this patch must
be backported too. Unfortunately it will require an adaptation because there
are no appctx flags on older versions.
Both of these flags are set after releasing the applet, in
appctx_shut(). Concretely, it means the applet is shut down for reads and
writes. Once they are set, the applet's I/O handler is no longer called.
Tests on these flags are thus useless: there is no chance to match them.
This case does not exist yet with the H1 multiplexer, but applets may decide to
not produce data if there is not enough room in the destination buffer (the
applet's outbuf or the opposite SE buffer). It is true for the stats applets for
instance. However this case is not properly handled when the zero-copy
forwarding is in-use.
To fix the issue, the se_done_ff() function was modified to return the
number of bytes really forwarded and to subscribe for sends if nothing was
forwarded while the zero-copy forwarding was blocked by the producer. On the
applet side, we take care to block the zero-copy forwarding if the applet
requests more room. At the end, zero-copy forwarding is unblocked if
something was forwarded.
This way, it is now possible for the stats applet to report a full buffer and
block the zero-copy forwarding, even if the buffer is not really full, by
requesting more room.
No backport needed.
An issue was introduced when zero-copy forwarding was added to the stats and
cache applets. There is no test to be sure the upper layer is ready to use
the zero-copy forwarding. So these applets refuse to deliver the response
into the applet's output buffer if the zero-copy forwarding is supported by
the opposite endpoint. It is especially an issue when a filter, like the
compression, is in-use on the response channel.
Because of this bug, the response is not delivered and the applet is woken
up in loop to produce data.
To fix the issue, an appctx flag was added, APPCTX_FL_FASTFWD, to know when
the zero-copy forwarding is in-use. We rely on this flag to not fill the
outbuf in the applet's I/O handler.
No backport needed.
This simplifies a bit the stats applet. Because the CLI part was not
refactored yet to use the applet's buffers, there are 3 ways to produce
data:
* the HTX message for the HTTP stats when zero-copy forwarding is not
used
* raw data in the opposite endpoint buffer for the HTTP stats when
zero-copy forwarding is used
* the channel buffer when the CLI "show stat" command is evaluated
There is already a dedicated function taking care of copying data at the
right place. There is now also a dedicated function to check whether the
output buffer is almost full.
Commit 91b77c1632 ("MEDIUM: mux-h1: Support zero-copy forwarding for chunks with
an unknown size") was recently pushed but it contains 3 bugs. The first one is
during the nego. The extra size reserved for the CRLF at the end of the chunk
must not be added to the offset value. Indeed, the CRLF will be appended after
the data and not prepended to them.
The second one, still during the nego, is an integer overflow when the available
room in the output buffer is computed.
Finally, the last one is when the chunk itself is formatted. This part was
totally buggy if the output buffer was not empty at the beginning.
No backport needed.
A packet is considered as reordered when it is detected as lost because its
packet number is above the largest acknowledged packet number by at least
the packet reordering threshold value.
Add a new ->nb_reordered_pkt quic_loss struct member, at the same location
as the number of lost packets, to count such packets.
Should be backported to 2.6.
Let's say that the largest packet number acknowledged by the peer is #10.
When inspecting the not yet acknowledged packets to detect whether they are
lost, a packet is deemed lost at least when the difference between this
largest packet number and its packet number is greater than or equal to the
packet reordering threshold as defined by RFC 9002. The latter must not be
less than QUIC_LOSS_PACKET_THRESHOLD (3). With such a value, packets #7 and
older are detected as lost if not acknowledged, contrary to packets #8 or
#9.
So, the packet loss detection is very sensitive to such a network
characteristic where non-acknowledged packets are distant from each other by
their packet number differences.
Do not use this static value anymore for the packet reordering threshold
which is used as a criterion to detect packet loss. Instead, make it depend
on the difference between the number of the last transmitted packet and the
number of the oldest one among the packets which are still in flight, before
they are inspected to be deemed as lost.
Add a new tune.quic.reorder-ratio setting to apply a ratio in percent to
this dynamic packet reordering threshold.
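A hedged sketch of the resulting computation (variable and setting names
are illustrative, not the actual quic_loss code):

  #include <stdint.h>

  #define QUIC_LOSS_PACKET_THRESHOLD 3

  /* Sketch: derive the reordering threshold from the distance between the
   * newest transmitted packet number and the oldest one still in flight,
   * scaled by tune.quic.reorder-ratio (in percent), never below the
   * RFC 9002 minimum. A packet is then deemed lost when
   * largest_acked - pn >= threshold.
   */
  static uint64_t pkt_reorder_threshold(uint64_t last_tx_pn,
                                        uint64_t oldest_inflight_pn,
                                        unsigned int reorder_ratio)
  {
          uint64_t thres = (last_tx_pn - oldest_inflight_pn) * reorder_ratio / 100;

          return thres < QUIC_LOSS_PACKET_THRESHOLD ?
                 QUIC_LOSS_PACKET_THRESHOLD : thres;
  }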
Should be backported to 2.6.
The new formula for K CUBIC which arrives with RFC 9438 is as follows:
K = cubic_root((W_max - cwnd_epoch) / C)
Note that W_max is c->last_w_max, and cwnd_epoch is c->cwnd when entering
quic_cubic_update() just after a congestion event.
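For clarity, a standalone sketch of this computation, with K computed in
seconds and only converted to milliseconds at the end (names follow the
commit message, not necessarily the actual code):

  #include <math.h>

  /* K = cubic_root((W_max - cwnd_epoch) / C), with W_max and cwnd_epoch
   * in bytes and C in bytes/s^3; the result is returned in milliseconds.
   */
  static unsigned int cubic_K_ms(double W_max, double cwnd_epoch, double C)
  {
          double K = cbrt((W_max - cwnd_epoch) / C);  /* seconds */

          return (unsigned int)(K * 1000.0);          /* milliseconds */
  }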
Must be backported as far as 2.6.
The formula for K CUBIC calculation is as follows:
K = cubic_root(W_max * (1 - beta_quic) / C).
Note that this does not match the comment. But the aim of this patch is to not
hide a bug inside another patch to update this K CUBIC calculation.
The unit of C is bytes/s^3 (or segments/s^3). And we want to store K as
milliseconds. So, the conversion inside the cubic_root() to convert seconds in
milliseconds is wrong. The unit used here is bytes/(ms/1000)^3 or
bytes*1000^3/ms^3. That said, it is preferable to compute K as seconds, then
convert to milliseconds as done by this patch.
Must be backported as far as 2.6.
The CLI command "update ssl ocsp-response" was forcefully removing an
OCSP response from the update tree regardless of whether it used to be
in it beforehand or not. But since the main OCSP update task works by
removing the entry being currently updated from the update tree and then
reinserting it when the update process is over, it meant that in the CLI
command code we were modifying a structure that was already being used.
These concurrent accesses were not properly locked on the "regular"
update case because it was assumed that once an entry was removed from
the update tree, the update task was the only one able to work on it.
Rather than locking the whole update process, an "updating" flag was
added to the certificate_ocsp in order to prevent the "update ssl
ocsp-response" command from trying to update a response already being
updated.
An easy way to reproduce this crash was to perform two "simultaneous"
calls to "update ssl ocsp-response" on the same certificate. It would
then crash on an eb64_delete call in the main ocsp update task function.
This patch can be backported up to 2.8.
As reported by github user @JB0925 in issue #2427, there is a possible
crash in pool_flush(). The problem is that if the free_list is not empty
in the first test, and is empty at the moment the xchg() is performed,
for example because another thread called it in parallel, we place a
POOL_BUSY there that is never removed later, causing the next thread to
wait forever.
This was introduced in 2.5 with commit 2a4523f6f ("BUG/MAJOR: pools: fix
possible race with free() in the lockless variant"). It has probably
very rarely been detected, because:
- pool_flush() is only called when stopping is set
- the function does nothing if global pools are disabled, which is
the case on most modern systems with a fast memory allocator.
It's possible to reproduce it by modifying __task_free() to call
pool_flush() on 1% of the calls instead of only when stopping.
The fix is quite simple, it consists in moving the zeroing of the
entry in the break path after verifying that the entry was not already
busy.
This must be backported wherever commit 2a4523f6f is.
In issue #2427 Ilya reports that gcc-14 rightfully complains about
sizeof() being placed in the left term of calloc(). There's no impact
but it's a bad pattern that gets copy-pasted over time. Let's fix the
few remaining occurrences (debug.c, halog, udp-perturb).
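Purely for illustration, the pattern in question and its fix look like
this (hypothetical variable names):

  /* before: sizeof() sits in the element-count slot */
  entries = calloc(sizeof(*entries), nb_entries);

  /* after: count first, element size second, as calloc() expects */
  entries = calloc(nb_entries, sizeof(*entries));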
This can be backported to all branches, and the irrelevant parts dropped.
The "wait" command now supports a condition, "srv-unused", which waits
for the designated server to become totally unused, indicating that it
is removable. Upon each wakeup it calls srv_check_for_deletion() to
verify whether the conditions are met and, if not, whether the situation is
recoverable or not, and proceeds accordingly, never waiting for a
final decision longer than the configured delay.
The purpose is to make it possible to remove servers from the CLI after
waiting for their sessions to be terminated:
$ socat -t5 /path/to/socket - <<< "
disable server px/srv1
shutdown sessions server px/srv1
wait 2s srv-unused px/srv1
del server px/srv1"
Or even wait for connections to terminate themselves:
$ socat -t70 /path/to/socket - <<< "
disable server px/srv1
wait 1m srv-unused px/srv1
del server px/srv1"
Conditions will need to have context, arguments etc from the command line.
Since these will vary with time (otherwise we wouldn't wait), let's just
pass them as text (possibly pre-processed). We're starting with 4 strings
that are expected to be allocated by strdup() and are always sent to free()
upon release.
Since we'll support waiting for an action to succeed or permanently
fail, we need the ability to return an unrecoverable failure. Let's
add CLI_WAIT_ERR_FAIL for this. A static error message may be placed
into ctx->msg to report to the user why the failure is unrecoverable.
We'll need to be able to verify whether or not a server may be deleted.
For now, both the verification and the action are performed in the same
function, at once under thread isolation. The goal here is to extract
the verification code into a new function that will perform these checks,
return a status between success/recoverable/non-recoverable failure, and
will also return a message for the caller.
When an applet is using its own buffers, it is important to release them, if
empty, after processing to recycle unused buffers. It is not a leak because
these buffers are necessarily released when the applet is released. But this
leads to an excess of buffer allocations.
No need to backport.
This allows to insert delays between commands, e.g. to collect the same
set of metrics at a fixed interval. E.g:
$ socat -t20 /path/to/socket <<< "show activity; wait 10s; show activity"
The goal will be to extend the feature to optionally support waiting on
certain conditions. For this reason the struct definitions and enums were
placed into cli-t.h.
The CLI applet doesn't make use of its timeout at all, only the stream
does. That's convenient because it allows any command's I/O handler to
trivially set a wakeup timer by simply touching the task's ->expire
field, and the I/O handler will automatically be woken up again. The
only condition for this is that we properly take care of clearing that
timeout whenever we finish processing a command and switch back to the
PROMPT state. That's what this patch does.
If a release handler produces a final message, it's currently left
pending in the CLI context and needs another I/O event to be dumped
because immediately after calling ->release, we check for states
OUTPUT and above and we wait until more data arrives.
This patch adds a continue statement to go back to the loop immediately
after leaving the release handler in order to attempt to emit the
output message.
At this point it's not sure whether any release handlers are producing
messages, so it's probably not needed to backport this.
Some commands are still missing their trailing LF, and very few were even
already spotted in the past emitting more than one. The risk of missing
this LF is particularly high, especially when tests are run in non-
interactive mode where the output looks good at first glance. The problem
is that once run in interactive mode, the missing empty line makes the
command not being complete, and scripts can wait forever.
Let's tackle the problem at its root: messages emitted at the end must
always end with an LF and we know some miss it. Thus, in cli_output_msg()
we now start by removing the trailing LFs from the string, and we always
add exactly one. This way the trailing LFs added by correct functions are
silently ignored and all functions are now correct.
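The normalization amounts to something like the following standalone
sketch (not the actual cli_output_msg() body):

  #include <string.h>

  /* Strip any trailing LFs from <msg> in place and return the new length;
   * the caller then emits the message followed by exactly one '\n'.
   */
  static size_t cli_trim_lf(char *msg)
  {
          size_t len = strlen(msg);

          while (len && msg[len - 1] == '\n')
                  msg[--len] = 0;
          return len;
  }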
This would need to be progressively backported to all supported versions
in order to address them all at once, though the risk of breaking a legacy
script relying on the wrong output is never zero. At first it should at
least go as far as the latest LTS (2.8), and maybe another one depending
on user demands. Note that it also requires previous patch ("BUG/MINOR:
vars/cli: fix missing LF after "get var" output") because it fixes a test
for a bogus output for "get var" in a VTC.
"get var" on the CLI was also missing an LF, and the vtest as well, so
that fixing only the code breaks the vtest. This must be backported to
2.4 as the issue was brought with commit c35eb38f1d ("MINOR: vars/cli:
add a "get var" CLI command to retrieve global variables").
Some cli_err(), cli_msg() or even ha_error() etc are missing the trailing
LF, which breaks the continuity of the CLI parsing: the extra LF that serves
to mark the end of the command is in fact taken as the missing LF and no
extra one is added.
This patch adds the missing LF on identified messages. It might be worth
trying to proceed in a more generic way with this, given the amount of
code that is possibly at risk.
We now update the session's tracked counters with the observed glitches.
In order to avoid incurring a high cost, e.g. if many small frames contain
issues, we batch the updates around h2_process_demux() by directly passing
the difference. Indeed, for now all functions that increment glitches are
called from h2_process_demux(). If that were to change, we'd just need to
keep the value of the last synced counter in the h2c struct instead of the
stack.
The regtest was updated to verify that the 3rd client that does not cause
issue still sees the counter resulting from client 2's mistakes. The rate
is also verified, considering it shouldn't fail since the period is very
long (1m).
This adds a new pair of stored types in the stick-tables:
- glitch_cnt
- glitch_rate
These keep count of the number of glitches reported on a front connection,
in order to decide how to act with a badly defective client or a potential
attacker. For now nothing updates these counters, but all the infrastructure
needed to configure, update and retrieve them was added, including the doc.
No regtest was added yet since they're not filled yet.
It's quite uncommon for a client to decide to change the connection's
initial window size after the settings exchange phase, unless it tries
to increase it. One of the impacts is that it updates all
streams, so it can be expensive, depending on the stacks, and may even
be used to construct an attack. For this reason, we now count a glitch
when this happens.
A test with h2spec shows that it triggers 9 across a full test.
Here we consider that if a HEADERS frame is made of more than 4 fragments
whose average size is lower than 1kB, that's very likely an abuse so we
count a glitch per 16 fragments, which means 1 glitch per 1kB frame in a
16kB buffer. This means that an abuser sending 1600 1-byte frames would
increase the counter by 100, and that sending 100 headers per request in
individual frames each results in a count of ~7 to be added per request.
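An illustrative sketch of this heuristic (not the actual mux-h2 code, and
the rounding is an assumption chosen to match the figures above):

  #include <stddef.h>

  /* Count glitches for a HEADERS frame received as <frags> fragments
   * totalling <bytes> bytes: more than 4 fragments averaging less than
   * 1kB each yields one glitch per 16 fragments (1600 one-byte fragments
   * => 100, 101 fragments => 7).
   */
  static unsigned int headers_frag_glitches(unsigned int frags, size_t bytes)
  {
          if (frags > 4 && bytes / frags < 1024)
                  return (frags + 15) / 16;
          return 0;
  }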
A test consisting in sending 100M requests made of 101 frames each over
a connection resulted in ~695M glitches to be counted for this connection.
Note that no special care is taken to avoid wrapping since it already takes
a very long time to reach 100M and there's no particular impact of wrapping
here (roughly 1M/s).
RFC9113 clarified a point regarding the payload from DATA frames sent to
closed streams. It must always be counted against the connection's flow
control. In practice it should have no visible effect, but if
repeated upload attempts are aborted, this might cause the client's
window to progressively shrink since that data is not being ACKed.
It's probably not necessary to backport this, unless another patch
depends on it.
The fetch will return true if the stream was redispatched: this is a
past action, thus we rename the fetch to better reflect its true
meaning and prevent confusions.
Documentation was updated.
While at it, the fetch was moved from internal states section to Layer 4
section, which is where it belongs.
No backport needed unless 92b2edb ("MINOR: stream: add "txn.redispatch"
fetch") gets backported.
If a certificate that has an OCSP uri is unused and gets added to a
crt-list with the ocsp auto update option "on", it would not have been
inserted into the auto update tree because this insertion was only
working on the first call of the ssl_sock_load_ocsp function.
If the configuration used a crt-list like the following:
cert1.pem *
cert2.pem [ocsp-update on] *
Then calling "del ssl crt-list" on the second line and then reverting
the delete by calling "add ssl crt-list" with the same line, then the
cert2.pem would not appear in the ocsp update list (can be checked
thanks to "show ssl ocsp-updates" command).
This patch ensures that in such a case we still perform the insertion in
the update tree.
This patch can be backported up to branch 2.8.
The ckch_store's free'ing function might end up calling
'ssl_sock_free_ocsp' if the corresponding certificate had ocsp data.
This ocsp cleanup function expects the 'refcount_instance' member of
the certificate_ocsp structure to be 0, meaning that no live
ckch instance kept a reference on this certificate_ocsp structure.
But since in ckch_store_free() we were destroying the ckch_data before
destroying the linked instances, the BUG_ON would trigger during a standard
deinit. Reversing the cleanup order fixes the problem.
Must be backported to 2.8.
With the current way OCSP responses are stored, a single OCSP response
is stored (in a certificate_ocsp structure) when it is loaded during a
certificate parsing, and each ckch_inst that references it increments
its refcount. The reference to the certificate_ocsp is actually kept in
the SSL_CTX linked to each ckch_inst, in an ex_data entry that gets
freed when the context is freed.
One of the downsides of this implementation is that if every ckch_inst
referencing a certificate_ocsp gets destroyed, then the OCSP response is
removed from the system. So if we were to remove all crt-list lines
containing a given certificate (that has an OCSP response), the response
would be destroyed even if the certificate remains in the system (as an
unused certificate). In such a case, we would want the OCSP response not
to be "usable", since it is not used by any ckch_inst, but still remain
in the OCSP response tree so that if the certificate gets reused (via an
"add ssl crt-list" command for instance), its OCSP response is still
known as well. But we would also like such an entry not to be updated
automatically anymore once no instance uses it. An easy way to do it
could have been to keep a reference to the certificate_ocsp structure in
the ckch_store as well, on top of all the ones in the ckch_instances,
and to remove the ocsp response from the update tree once the refcount
falls to 1, but it would not work because of the way the ocsp response
tree keys are calculated. They are decorrelated from the ckch_store and
are the actual OCSP_CERTIDs, which is a combination of the issuer's name
hash and key hash, and the certificate's serial number. So two copies of
the same certificate but with different names would still point to the
same ocsp response tree entry.
The solution that answers all the needs expressed above is actually
to have two reference counters in the certificate_ocsp structure, one
for the actual ckch instances and one for the ckch stores. If the
instance refcount becomes 0 then we remove the entry from the auto
update tree, and if the store reference becomes 0 we can then remove the
OCSP response from the tree. This would allow to chain some "del ssl
crt-list" and "add ssl crt-list" CLI commands without losing any
functionality.
Must be backported to 2.8.
When deleting a crt-list line through a "del ssl crt-list" call on the
CLI, we ended up freeing the corresponding ckch instances without fully
clearing their contents. This left some dangling references on other
objects because the attached SSL_CTX was not deleted, as well as all the
ex_data referenced by it (OCSP responses for instance).
This patch can be backported up to branch 2.4.
The only useful information taken out of the ckch_store in order to copy
an OCSP certid into a buffer (later used as a key for entries in the
OCSP response tree) is the ocsp_certid field of the ckch_data structure.
We then don't need to pass a pointer to the full ckch_store to
ckch_store_build_certid or even any information related to the store
itself.
The ckch_store_build_certid is then converted into a helper function
that simply takes an OCSP_CERTID and converts it into a char buffer.
When calling ckchs_dup (during a "set ssl cert" CLI command), if the
modified store had OCSP auto update enabled then the new certificate
would not keep the previous update mode and would not appear in the auto
update list.
This patch can be backported to 2.8.
At the beginning of the 3.0-dev cycle, the zero-copy forwarding support was
added only for the cache applet with an option to disable it. This was a
hack, waiting for a better integration with applets. It is now possible to
implement the zero-copy forwarding for any applets. So the specific option
for the cache applet was renamed to be used for all applets. And this option
is now also checked for the stats applet.
Concretely, 'tune.cache.zero-copy-forwarding' was renamed to
'tune.applet.zero-copy-forwarding'.
This field was introduced when the first implementation of the zero-copy
forwarding was added. It is now useless. However, we must still save the
body-size of the object in the cache.
Default .rcv_buf and .snd_buf functions that applets can use are now
specialized to manipulate raw buffers or HTX buffers.
Thus a TCP applet should use appctx_raw_rcv_buf() and appctx_raw_snd_buf()
while HTTP applet should use appctx_htx_rcv_buf() and appctx_htx_snd_buf().
Note that the appctx is now directly passed to these functions instead of
the SC.
Just like for the cache applet, it is now possible to send response to the
opposite side using the zero-copy forwarding. Internal functions were
slightly updated but there is nothing special to say. Except the requested
size during the nego stage is not exact.
Till now, for chunked messages, the H1 mux used the size requested during
the zero-copy forwarding negotiation as the chunk size. And till now, this
was accurate because the requested size was indeed the chunk size on the
producer side.
But this will be a problem to implement the zero-copy forwarding on some
applets because the content size is not known during the nego but only when
it is produced. Thanks to previous patches, it is now possible to know the
requested size is not exact and we are able to reserve a larger space to
write the chunk size later, in h1_done_ff(), with some padding.
Now, during the zero-copy forwarding negotiation, when the requested size is
exact, we are able to check whether it is bigger than the expected one or
not. If it is indeed bigger than expected, the zero-copy forwarding is
disabled and the error will be triggered later on the normal sending path.
It is now possible to use a flag during zero-copy forwarding negotiation to
specify that the requested size is exact, which means the producer really
expects at least this amount of data to be received.
It can be used by consumer to prepare some processing at this stage, based
on the requested size. For instance, in the H1 mux, it is used to write the
next chunk size.
It is now possible to impose the length used to represent the chunk size in
the function that prepends the chunk size in a buffer (so before the chunk
itself). It is thus possible to reserve a specific space for an unknown
chunk size and to pad it with leading '0' characters to use all the space
and avoid holes.
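As an illustration of the idea (not the actual H1 helper), reserving a
fixed-width chunk-size field padded with leading zeroes can be done like
this:

  #include <stddef.h>
  #include <stdio.h>

  /* Write the chunk header "<size>CRLF" using exactly <width> hex digits,
   * left-padded with '0', so a size learned after the data was produced
   * still fits the room reserved during the nego stage.
   */
  static int h1_write_chunk_size(char *out, size_t outlen,
                                 size_t chunk_size, int width)
  {
          return snprintf(out, outlen, "%0*zx\r\n", width, chunk_size);
  }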
During the zero-copy forwarding negotiation, a pseudo flag was already used
to notify the consumer whether the producer is able to use kernel splicing or
not. But this was not extensible. So, now we use a true bitfield to be able
to pass flags during the negotiation. NEGO_FF_FL_* flags may be used now.
Of course, for now, there is only one flag, the kernel splicing support on
the producer side (NEGO_FF_FL_MAY_SPLICE).
The cache applet will be refactored to use its own buffer. Thus, for now,
the zero-copy forwarding support is removed and it will be reintroduced
later.
The HTTP stats applet and all internal functions were adapted to use its own
buffers instead of the channel ones. The CLI part was not refactored yet,
thus there are still some accesses to channels in the file. But for the HTTP
part, we no longer use the channels at all.
To do so, the HTTP stats applet now uses default .rcv_buf and .snd_buf
callback function. In addition, it sets appctx flags instead of SE ones.
We no longer test the opposite stream-connector to detect aborted partial
posts. Applets must not try to access info outside their scope: this makes
the code more sensitive to changes and it is a common source of bugs. Tests
on the sedesc flags at the beginning of the I/O handler should be
enough.
Thanks to this patch, it is possible for an applet to directly send data to
the opposite endpoint. To do so, it must implement the <fastfwd> appctx
callback function and set the SE_FL_MAY_FASTFWD flag.
Everything will be handled by the appctx_fastfwd() function. The applet is
only responsible for transferring data. If it sets the <to_forward> value,
it is used to limit the amount of data to forward.
This patch introduces the support for the callback function responsible to
produce data via the zero-copy forwarding mechanism. There is no
implementation for now. But a <to_forward> field was added to the appctx
structure to let an applet state how much data it wants to forward. It is
not mandatory but it will be used during the zero-copy forwarding
negotiation.
We have introduced flags to deal with end of input, end of stream and errors
at the applet level. With this patch we make the link with the endpoint
descriptor.
In appctx_rcv_buf(), applet flags are converted to SE flags.
There is no shutdown for reads and sends with applets. Both are performed
when the appctx is released. So instead of 2 flags, like for
muxes/connections, only one flag is used. But the idea is the same:
acknowledge the event at the applet level.
The appctx state was never really used as a state. It is only used to know
when an applet should be freed on the next wakeup. This can be converted to
a flag and the state can be removed. This is what this patch does.
Till now, we've extended the appctx state to add some flags. However, the
field name is misleading. So a bitfield was added to handle real flags. And
helper functions to manipulate this bitfield were added.
A dedicated function to run applets was introduced, in addition to the old
one, to deal with applets that use their own buffers. The main difference
here is that this handler does not use channels at all. It performs a
synchronous send before calling the applet and a synchronous
receive just after.
No applets are plugged on this handler for now.
There is no tasklet to handle I/O subscriptions for applets, but functions
to deal with receives and sends from the SC layer were added. It means a
function to retrieve data from an applet in a synchronous way and a
function to push data to an applet in a synchronous way.
It is pretty similar to the functions used for muxes but there are some
differences. So for now, we keep them separated.
Zero-copy forwarding is not supported for now. In addition, there is no
subscription mechanism.
In this patch, we add default functions to copy data from a channel to the
<inbuf> buffer of an applet (appctx_rcv_buf) and another one to copy data
from the <outbuf> buffer of an applet to a channel (appctx_snd_buf).
These functions are not used for now, but they will be used by applets to
define their <rcv_buf> and <snd_buf> callback functions. Of course, it will
be possible for a specific applet to implement its own functions but these
ones should be good enough for most applets. HTX and RAW buffers are
supported.
It is the first patch of a series aimed to align applets on connections.
Here, dedicated buffers are added for applets. For now, buffers are
initialized and helper functions to deal with allocation are added. In
addition, flags to report allocation failures or full buffers are also
introduced. <inbuf> will be used to push data to the applet from the stream
and <outbuf> will be used to push data from the applet to the stream.
sc_attach_applet() was changed to be able to fail and callers were updated
accordingly. For now it cannot fail but if this changes, callers will be
prepared to handle errors.
In sc_attach_applet, an untyped pointer (void *) was used to attach a SC on
an applet. There is no reason to not use the right type here. So now a
pointer on an appctx is explicitly used.
Avoid loss of precision when computing the K CUBIC value.
The same issue exists when computing the congestion window value from the
cubic increase function formula, with possible integer variable wrap-around.
Depends on this commit:
MINOR: quic: Code clarifications for QUIC CUBIC (RFC 9438)
Must be backported as far as 2.6.
The first version of our QUIC CUBIC implementation is confusing because it
relies on the TCP CUBIC Linux kernel implementation and refers to RFC 8312,
which was obsoleted by RFC 9438 (August 2023) after our implementation. RFC
8312 is a little bit hard to understand. RFC 9438 arrived with many more
clarifications.
So, RFC 9438 is about "CUBIC for Fast Long-Distance Networks". Our
implementation for QUIC is not very well documented. As it was difficult to
reread this code, this patch adds only some comments at complicated
locations and renames some macros and variables without any logic
modification at all.
So, the aim of this patch is to first add some comments and variable/macro
renamings to avoid embedding too many code modifications in the same big
patch.
Some code modifications will come to adapt this CUBIC implementation to this new
RFC 9438.
Rename some macros:
CUBIC_BETA -> CUBIC_BETA_SCALED
CUBIC_C -> CUBIC_C_SCALED
CUBIC_BETA_SCALE_SHIFT -> CUBIC_SCALE_FACTOR_SHIFT (this is the scaling factor
which is used only for CUBIC_BETA_SCALED)
CUBIC_DIFF_TIME_LIMIT -> CUBIC_TIME_LIMIT
CUBIC_ONE_SCALED was added (scaled value of 1).
These cubic struct members were renamed:
->tcp_wnd -> ->W_est
->origin_point -> ->W_target
->epoch_start -> ->t_epoch
->remaining_tcp_inc -> remaining_W_est_inc
Local variables to quic_cubic_update() were renamed:
t -> elapsed_time
diff -> t
delta -> W_cubic_t
Add a graphic curve of the CUBIC increase function.
Add large RFC 9438 extracts, copied and pasted, in relation with the 3
different increase function regions.
Same thing for the fast convergence.
Fix a typo in the reference to QUIC RFC 9002.
Must be backported as far as 2.6 to ease any further modifications to come.
If we were to enable 'ocsp-update' on a certificate that does not have
an OCSP URI, we would exit ssl_sock_load_ocsp with a negative error code
which would raise a misleading error message ("<cert> has an OCSP URI
and OCSP auto-update is set to 'on' ..."). This patch simply fixes the
error message but an error is still raised.
This issue was raised in GitHub #2432.
It can be backported up to branch 2.8.
Fetch will return true if the stream underwent a redispatch according to
"option redispatch" setting upon retries.
Documentation was added, and the "%rc" logformat alternative now mentions
the new fetch to properly emulate the logformat behavior.
This build issue was introduced by this previous commit, which is a bugfix:
BUG/MINOR: quic: Wrong ack ranges handling when reaching the limit.
A BUG_ON() referenced <fist> variable in place of <first>.
Must be backported as far as 2.6 as the previous commit.
Acknowledgement ranges are used to build ACK frames. To avoid allocating too
many such objects, a limit was set to 32 (QUIC_MAX_ACK_RANGES) by this
commit:
MINOR: quic: Do not allocate too much ack ranges
But there is an inversion when removing the oldest range from its tree.
eb64_first() must be used in place of eb64_last(). Note that this patch
only does this modification, in addition to renaming the <last> variable to
<first>.
This bug leads such an h2load command to block when a request ends up not
being acknowledged by haproxy even though it was correctly served:
/opt/nghttp2/build/bin/h2load --alpn-list h3 -t 1 -c 1 -m 1 -n 100 \
https://127.0.0.1/?s=5m
There is a remaining question to be answered. In such a case, haproxy
refuses to reopen the stream, which is a good thing, but shouldn't haproxy
acknowledge the request (because it was correctly parsed again)?
Note that to be easily reproduced, this setting had to be applied to the client
network interface:
tc qdisc add dev eth1 root netem delay 100ms 1s loss random
Must be backported as far as 2.6.
As noticed in this thread, some bogus configurations are not always easy
to spot: https://www.mail-archive.com/haproxy@formilux.org/msg44558.html
Here it was about config keywords being used in ACL patterns where strings
were expected, hence they're always valid.
Since we have the diag mode (-dD) we can perform some extra checks when
it's used, and emit them to suggest the user there might be an issue.
Here we detect a few common words (logic such as "and"/"or"/"||" etc),
C++/JS comments mistakenly used to try to isolate final args, and words
that have the exact name of a sample fetch or an ACL keyword. These checks
are only done in diag mode of course.
Final diags were added in 2.4 by commit 5a6926dcf ("MINOR: diag: create
cfgdiag module"), but it's called too late in the startup process,
because when "-c" is passed, the call is not made, while it's its primary
use case. Let's just move the call earlier.
Note that currently the check in this function is limited to verifying
unicity of server cookies in a backend, so it can be backported as far
as 2.4, but there is little value in insisting if it doesn't backport
easily.
Diag warnings were added in 2.4 by commit 7b01a8dbd ("MINOR: global:
define diagnostic mode of execution") but probably due to the split
function that checks for the mode, they did not reuse the emission of
the version string before the first warning, as was brought in 2.2 by
commit bebd21206 ("MINOR: init: report in "haproxy -c" whether there
were warnings or not"). The effet is that diag warnings are emitted
before the version string if there is no other warning nor error. Let's
just proceed like for the two other ones.
This can be backported to 2.4, though this is of very low importance.
Just like for stick-tables, this patch adds a promex module to dump
resolvers metrics. It adds the "resolver" scope and for now, it dumps
the following metrics:
* haproxy_resolver_sent
* haproxy_resolver_send_error
* haproxy_resolver_valid
* haproxy_resolver_update
* haproxy_resolver_cname
* haproxy_resolver_cname_error
* haproxy_resolver_any_err
* haproxy_resolver_nx
* haproxy_resolver_timeout
* haproxy_resolver_refused
* haproxy_resolver_other
* haproxy_resolver_invalid
* haproxy_resolver_too_big
* haproxy_resolver_outdated
It is now possible to selectively retrieve extra counters from stats
modules. H1, H2, QUIC and H3 fill_stats() callback functions are updated to
return a specific counter.
The list of modules registered on the stats to expose extra counters is now
public. It is required to export these counters into the Prometheus
exporter.
set-bc-{mark,tos} actions are pretty similar to set-fc-{mark,tos}, except
that they set the mark/tos on packets sent from haproxy to the server. They
act on the whole backend/srv connection: from connect() to connection
teardown, thus they may only be used before the connection to the server
is instantiated, meaning that they are only relevant for request-oriented
rules such as tcp-request or http-request rules. For now their use is
limited to content request rules, because tos and mark information is
stored directly within the stream, thus it is required that the stream
already exists.
Stream flags are used in combination with dedicated stream struct members
to pass the 'tos' and 'mark' information so that it is correctly considered
during the stream connection assignment logic (prior to actually connecting
to the server).
The 'tos' and 'mark' fd sockopts are taken into account in the conn hash
parameters for the connection reuse mechanism.
The documentation was updated accordingly.
In this patch we add the possibility to use a sample expression as argument
for set-fc-{mark,tos} actions. To make it backward compatible with the
previous behavior, during parsing we first try to parse the value as an
integer (decimal or hex notation), and then fall back to expr parsing in
case of failure.
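A minimal sketch of this parse-integer-first-then-expression fallback (the
expression parser call is only a hypothetical placeholder here):

#include <stdlib.h>
#include <errno.h>

/* Hypothetical stand-in for haproxy's sample expression parser. */
extern void *parse_sample_expr_stub(const char *str);

/* Accept either a plain integer (decimal or hex) or an expression. */
static void *parse_mark_tos_arg(const char *arg, long *value)
{
	char *end;

	errno = 0;
	*value = strtol(arg, &end, 0); /* base 0 accepts both 10 and 0x10 forms */
	if (!errno && end != arg && *end == '\0')
		return NULL;           /* valid integer: no expression needed */

	return parse_sample_expr_stub(arg); /* fallback to expression parsing */
}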
The documentation was updated accordingly.
This is a complementary patch to "MINOR: tcp-act: Rename "set-{mark,tos}"
to "set-fc-{mark,tos}"", but for the Lua API.
set_mark and set_tos were kept as aliases for set_fc_mark and set_fc_tos
but they were marked as deprecated.
We use this opportunity to reorder set_mark and set_tos in alphabetical
order.
"set-mark" and "set-tos" only alter packets from haproxy to client
(frontend connection). Since we may add support for equivalent keywords
on the server side, we rename them with more explicit names to prevent
confusion.
Thus, we rename:
- "set-mark" to "set-fc-mark"
- "set-tos" to "set-fc-tos"
"set-mark" and "set-tos" were kept as aliases (to "set-fc-mark" and
"set-fc-tos" respectively) for now to prevent config breakage, but they
have been marked as deprecated so they can be removed in a future version.
Some CPU time is needlessly wasted in conn_calculate_hash(), because all
params are first copied into a temporary buffer before computing the
hash on the whole buffer. Instead, let's leverage the XXH progressive
hash update functions to avoid expensive memcpys.
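For reference, the streaming ("progressive") XXH64 API lets each parameter be
fed directly into the hash state; a simplified sketch of the approach
(assuming XXH64 and haproxy's bundled <import/xxhash.h>, not the exact
conn_calculate_hash() code):

#include <stddef.h>
#include <import/xxhash.h>

/* Hash two parameters without first memcpy'ing them into a temporary
 * buffer, by feeding them one by one to the XXH64 streaming state.
 */
static XXH64_hash_t hash_two_params(const void *p1, size_t len1,
                                    const void *p2, size_t len2)
{
	XXH64_state_t *state = XXH64_createState();
	XXH64_hash_t hash = 0;

	if (!state)
		return 0;
	XXH64_reset(state, 0);          /* seed = 0 */
	XXH64_update(state, p1, len1);
	XXH64_update(state, p2, len2);
	hash = XXH64_digest(state);
	XXH64_freeState(state);
	return hash;
}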
A major reorganization of QUIC MUX sending has been implemented. Now
data transfer occurs over a single QCS buffer. This has improved
performance but at the cost of restrictions on snd_buf. Indeed, buffer
instances are now shared from the stream callback snd_buf up to the
quic-conn layer.
As such, snd_buf can no longer freely manipulate a buffer whose data is
already present. In particular, realignment has been completely removed
by the previous patches.
This commit reintroduces a partial realign support. This is only done if
the buffer contains only unsent data, via a new MUX function
qcc_realign_stream_txbuf() which is called during snd_buf.
This commit is a direct follow-up on the major rearchitecture of send
buffering. This patch implements the proper handling of connection pool
buffer temporary exhaustion.
The first step is to be able to differentiate a fatal allocation error
from a temporary pool exhaustion. This is done via a new output argument
on qcc_get_stream_txbuf(). For a fatal error, application protocol layer
will schedule the immediate connection closing. For a pool exhaustion,
QCC is flagged with QC_CF_CONN_FULL and stream sending process is
interrupted. QCS instance is also registered in a new list
<qcc.buf_wait_list>.
A new connection buffer can become available when all ACKs are received
for an older buffer. This process is taken in charge by quic-conn layer.
It uses the qcc_notify_buf() function to clear QC_CF_CONN_FULL and to wake
up every stream registered on buf_wait_list to resume the sending process.
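A rough sketch of what this notification amounts to (stub types and fields
for illustration only, using haproxy's list primitives):

#include <haproxy/list.h>

#define STUB_CF_CONN_FULL 0x1

struct stub_qcs {
	struct list el;               /* attached to stub_qcc.buf_wait_list */
	void (*wake)(struct stub_qcs *qcs);
};

struct stub_qcc {
	unsigned int flags;
	struct list buf_wait_list;    /* streams waiting for a tx buffer */
};

/* A connection buffer became available: clear the "full" flag and wake
 * every waiting stream so it can resume its sending process.
 */
static void stub_notify_buf(struct stub_qcc *qcc)
{
	struct stub_qcs *qcs, *back;

	qcc->flags &= ~STUB_CF_CONN_FULL;
	list_for_each_entry_safe(qcs, back, &qcc->buf_wait_list, el) {
		LIST_DEL_INIT(&qcs->el);
		qcs->wake(qcs);
	}
}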
This commit is a direct follow-up on the major rearchitecture of send
buffering. It allows the application protocol to react if the current QCS
sending buffer space is too small. In this case, the buffer can be
released to the quic-conn layer. This allows a new QCS buffer to be
allocated and HTX parsing to be retried, unless the connection buffer
pool is already depleted.
A new function qcc_release_stream_txbuf() serves as API for app protocol
to release the QCS sending buffer. This operation fails if there is
unsent data in it. In this case, MUX has to keep it to finalize transfer
of unsent data to quic-conn layer. QCS is thus flagged with
QC_SF_BLK_MROOM to interrupt snd_buf operation.
When all data are sent to the quic-conn layer, QC_SF_BLK_MROOM is
cleared via qcc_streams_sent_done() and stream layer is woken up to
restart snd_buf.
Note that a new function qcc_stream_can_send() has been defined. It
allows app proto to check if sending is currently blocked for the
current QCS. For now, it checks QC_SF_BLK_MROOM flag. However, it will
be extended to other conditions with the following patches.
The previous commit was a major rework for QUIC MUX sending process.
Following this, this patch cleans up a few elements that remain but can
be removed as they are duplicated.
Most notably, the offset fields from QCS and QCC are removed. They are
both equivalent to flow control soft offsets.
A new function qcs_prep_bytes() is implemented. Its purpose is to return
the count of prepared data bytes not yet sent. It also replaces
qcs_need_sending().
Previously, QUIC MUX sending was implemented with data transferred across
two different buffer instances per stream.
The first QCS buffer was used for HTX blocks conversion into H3 (or
other application protocol) during snd_buf stream callback. QCS instance
is then registered for sending via qcc_io_cb().
For each sending QCS, data memcpy is performed from the first to a
secondary buffer. A STREAM frame is produced for each QCS based on the
content of their secondary buffer.
This model is useful for the QUIC MUX which has a major difference with
other muxes: data must be preserved longer, even after being sent to the
lower layer. Data references are shared with the quic-conn layer which
implements retransmission and data deletion on ACK reception.
This double-buffering stage was the first model implemented and remained
active until today. One of its major drawbacks is that it requires a
memcpy invocation for every data transferred between the two buffers.
Another important drawback is that the first buffer is allocated by
each QCS individually without restriction. On the other hand, secondary
buffers are accounted for the connection. A bottleneck can appear if
the secondary buffer pool is exhausted, causing unnecessary haproxy
buffering.
The purpose of this commit is to completely break this model. The first
buffer instance is removed. Now, application protocols will directly
allocate their buffer from the qc_stream_desc layer. This completely
removes the memcpy invocation.
This commit has a lot of code modifications. The most obvious one is the
removal of <qcs.tx.buf> field. Now, qcc_get_stream_txbuf() returns a
buffer instance from qc_stream_desc layer. qcs_xfer_data() which was
responsible for the memcpy between the two buffers is also completely
removed. Offset fields of QCS and QCC are now incremented directly by
qcc_send_stream(). These values are used as boundary with flow control
real offset to delimit the STREAM frames built.
As this change has a big impact on the code, this commit is only the
first part to fully support single buffer emission. For the moment, some
limitations are reintroduced and will be fixed in the next patches :
* on snd_buf if the QCS send buffer in use has room but not enough for the
application protocol to store its content
* on snd_buf if the QCS send buffer is NULL and allocation cannot succeed
due to connection pool exhaustion
One final important aspect is that extra care is necessary now in
snd_buf callback. The same buffer instance is referenced by both the
stream and the quic-conn layer. As such, some operations such as realign
can no longer be done freely.
qcs_build_stream_frm() is responsible to generate a STREAM frame
pointing to the content of QCS TX buffer.
This patch moves send flow control overflow check from qcs_xfer_data()
to qcs_build_stream_frm(), i.e. from transfer between internal
QCS buffer and qc_stream_desc, to STREAM frame generation.
Flow control is checked at both the stream and connection levels. For
connection flow control, as several frames are built before emission, an
accumulator is used as an extra argument to functions to account for the
total length of already built frames.
This patch should not provide any functional changes. Its main purpose
is to prepare for the removal of QCS internal buffer.
Both QCS and QCC have their own sent offset field. These fields store
the newest offset sent to the quic-conn layer. It is similar to QCS/QCC
flow control real offset. This patch removes them and replaces them by
the latter for code clarification.
MINOR: mux-quic: remove unneeded qcc.tx.sent_offsets field
This commit has a similar purpose as the previous one, except that it
removes the QCC <sent_offsets> field, now equivalent to the connection
flow control real offset.
This commit is a direct follow-up on the previous one. This time, it
deals with connection level flow control. Process is similar to stream
level : soft offset is incremented during snd_buf and real offset during
STREAM frame emission.
On MAX_DATA reception, both the stream layer and QMUX are woken up if
necessary. One extra feature for the conn level is the introduction of a
new QCC list to reference QCS instances. It will store instances for which
the snd_buf callback has been interrupted because the QCC soft offset was
reached. Every stream instance is woken up on MAX_DATA reception if the
soft offset is unblocked.
This patch is the first of two to reimplement flow control emission
limits check. The objective is to account flow control earlier during
snd_buf stream callback. This should smooth transfers and prevent over
buffering on haproxy side if flow control limit is reached.
The current patch deals with stream level flow control. It reuses the
newly defined flow control type. Soft offset is incremented after HTX to
data conversion. If limit is reached, snd_buf is interrupted and stream
layer will subscribe on QCS.
On qcc_io_cb(), generation of STREAM frames is restricted as previously
to ensure to never surpass peer limits. Finally, flow control real
offset is incremented on lower layer send notification. Thus, it will
serve as a base offset for built STREAM frames. If limit is reached,
STREAM frames generation is suspended.
Each time QCS data flow control limit is reached, soft and real offsets
are reconsidered.
Finally, special care is used when flow control limit is incremented via
MAX_STREAM_DATA reception. If soft value is unblocked, stream layer
snd_buf is woken up. If real value is unblocked, qcc_io_cb() is
rescheduled.
Create a new module dedicated to flow control handling. It will be used
to implement earlier flow control update on snd_buf stream callback.
For the moment, only Tx part is implemented (i.e. limit set by the peer
that haproxy must respect for sending). A type quic_fctl is defined to
count emitted data bytes. Two offsets are used : a real one and a soft
one. The difference is that soft offset can be incremented beyond limit
unless it is already in excess.
The soft offset will be used for HTX to H3 parsing. As the size of the
generated H3 is unknown before parsing, it allows the limit to be surpassed
one time. The real offset will be used during STREAM frame generation: this
time the limit must not be exceeded to prevent a protocol violation.
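A minimal sketch of the idea behind such a two-offset counter (names are
illustrative, not necessarily the real quic_fctl API):

#include <stdint.h>

struct fctl_sketch {
	uint64_t limit;    /* max offset advertised by the peer */
	uint64_t off_soft; /* accounted during snd_buf, may pass the limit once */
	uint64_t off_real; /* emitted in STREAM frames, must never pass the limit */
};

/* Soft increment: allowed to go beyond the limit, but only when not
 * already in excess, since the generated H3 size is unknown beforehand.
 */
static int fctl_sketch_sinc(struct fctl_sketch *f, uint64_t bytes)
{
	if (f->off_soft > f->limit)
		return 0;        /* already in excess: refuse further increments */
	f->off_soft += bytes;
	return 1;
}

/* Real increment: exceeding the limit would be a protocol violation. */
static int fctl_sketch_rinc(struct fctl_sketch *f, uint64_t bytes)
{
	if (f->off_real + bytes > f->limit)
		return 0;
	f->off_real += bytes;
	return 1;
}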
Add a new argument to qcc_send_stream() to specify the count of sent
bytes.
For the moment this argument is unused. This commit is in fact a step to
implement earlier flow control update during stream layer snd_buf.
Ben Kallus kindly reported that we still hadn't blocked the NUL
character from header values as clarified in RFC9110 and that, even
though there's no known issue related to this, it may one day be
used to construct an attack involving another component.
Actually, both Christopher and I sincerely believed we had done it
prior to releasing 2.9, shame on us for missing that one and thanks
to Ben for the reminder!
The change was applied, it was confirmed to properly reject this NUL
byte from both header and trailer values, and it's still possible to
force it to continue to be supported using the usual pair of unsafe
"option accept-invalid-http-{request|response}" for those who would
like to keep it for whatever reason that wouldn't make sense.
This was tagged medium so that distros also remember to apply it as
a preventive measure.
It should progressively be backported to all versions down to 2.0.
Trailers are parsed using a temporary h1m struct, likely due to using
distinct h1 parser states. However, the err_pos field that's used to
decide whether or not to enforce option accept-invalid-http-request (or
response) was not initialized in this struct, resulting in using a
random value that may randomly accept or reject a few bad chars. The
impact is very limited in trailers (e.g. no message size is transmitted
there) but we must make sure that the option is respected, at least for
users facing the need for this option there.
The issue was introduced in 2.0 by commit 2d7c5395ed ("MEDIUM: htx:
Add the parsing of trailers of chunked messages"), and the code moved
from mux_h1.c to h1_htx.c in 2.1 with commit 4f0f88a9d0 ("MEDIUM:
mux-h1/h1-htx: move HTX convertion of H1 messages in dedicated file")
so the patch needs to be backported to all stable versions, and the
file adjusted for 2.0.
Always compile the test of the early_data variable in
"ssl_quic_initial_ctx"; this way we can emit a warning about whether or
not it is supported.
The test was moved into a simpler preprocessor check which only checks
the new HAVE_SSL_0RTT_QUIC constant.
Could be backported to 2.9 with the 2 previous commits.
However AWS-LC must be excluded from HAVE_SSL_0RTT_QUIC in this version.
It is similar to the previous fix but for the chunk size parsing. But this
one is more annoying because a poorly coded application in front of haproxy
may ignore the last digit before the LF thinking it should be a CR. In this
case it may be out of sync with HAProxy and that could be exploited to
perform some sort of request smuggling attack.
While it seems unlikely, it is safer to forbid a LF that is not preceded by
a CR at the end of a chunk size.
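The check itself boils down to refusing a LF whose preceding byte is not a
CR; a standalone sketch of the idea (not the actual h1 parser code):

/* <lf> points to the LF ending the chunk-size line, <start> to its
 * first byte: the LF is only valid when directly preceded by a CR.
 */
static inline int chunk_size_end_is_valid(const char *start, const char *lf)
{
	return (lf > start && lf[-1] == '\r');
}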
This patch must be backported to 2.9 and probably to all stable versions
because there is no reason to still support LF without CR in this case.
When the message is chunked, all chunks must end with a CRLF. However, on
old versions, to support bad client or server implementations, a LF alone
was also accepted. Nowadays, it seems useless and can even be considered as
an issue. Just forbid a LF alone at the end of chunks, it seems reasonable.
This patch must be backported to 2.9 and probably to all stable versions
because there is no reason to still support LF without CR in this case.
In several places in the source, there was the same block of code that was
used to deinitialize the log buffer. There were even two functions that
did this, but they were called only from the code that is in the same
source file (free_tcpcheck_fmt() in src/tcpcheck.c and free_logformat_list()
in src/proxy.c - they were both static functions).
The function free_logformat_list() was moved from the file src/proxy.c to
src/log.c, and a check of the list before freeing the memory was added to
that function.
A recent fix was introduced to ensure unsent data are deleted when a
QUIC MUX stream releases its qc_stream_desc instance. This is necessary
to ensure all used buffers will be released once all ACKs are received.
This is implemented by the following patch :
commit ad6b13d317 (quic-dev/qns)
BUG/MEDIUM: quic: remove unsent data from qc_stream_desc buf
Before this patch, buffer removal was done only on ACK reception. ACK
handling is only done in order from the oldest one. A BUG_ON() statement
is present to ensure this assertion remains valid.
This is however not true anymore since the above patch. Indeed, after
unsent data removal, the current buffer may be empty if it did not
contain yet any sent data. In this case, it is not the oldest buffer,
thus the BUG_ON() statement will be triggered.
To fix this, simply remove this BUG_ON() statement. It should not have
any impact as it is safe to remove buffers in any order.
Note that several conditions must be met to trigger this BUG_ON crash :
* a QUIC MUX stream is destroyed before transmitting all of its data
* several buffers must have been previously allocated for this stream so
it happens only for transfers bigger than bufsize
* latency should be high enough to delay ACK reception
This must be backported wherever the above patch is (currently targeted
to 2.6).
HTTP status codes outside of 100..599 are considered invalid in HTTP
RFC9110. However, it is explicitly stated that the range 600..999 is often
used for internal communication so in practice haproxy must be lenient
with it.
Before this patch, QPACK encoder rejected these values. This resulted in
a connection error. Fix this by extending the range of allowed values
from 100 to 999.
This is linked to github issue #2422. Once again, thanks to @yokim-git
for his help here.
This must be backported up to 2.6.
A crash occurs in h3_resp_headers_send() if an invalid response code is
received from the backend side. Fix this by properly flagging the
connection on error. This will cause a CONNECTION_CLOSE.
This should fix github issue #2422.
Big thanks to ygkim (@yokim-git) for his help and reactivity. Initially,
GDB reported an invalid source code location due to heavy function
inlining inside h3_snd_buf(). The issue was found after using the -Og flag.
This must be backported up to 2.6.
Replace h3_debug_printf() by real trace for functions used by stream
layer snd_buf callback. A new event type H3_EV_STRM_SEND is created for
the occasion.
This should be backported up to 2.6 to help investigate H3 issues on
stable releases. Note that the h3_nego_ff/h3_done_ff definitions are not
available from 2.8.
It has been found that under some rare error circumstances,
SSL_do_handshake() could return with SSL_ERROR_WANT_READ without
even trying to call the read function, causing permanent wakeups
that prevent the process from sleeping.
It was established that this only happens if the retry flags are
not systematically cleared in both directions upon any I/O attempt,
but, given the lack of documentation on this topic, it is hard to
say if this rather strange behavior is expected or not, otherwise
why wouldn't the library always clear the flags by itself before
proceeding?
In addition, this only seems to affect OpenSSL 1.1.0 and above,
and does not affect wolfSSL nor aws-lc.
A bisection on haproxy showed that this issue was first triggered by
commit a8955d57ed ("MEDIUM: ssl: provide our own BIO."), which means
that OpenSSL's socket BIO does not have this problem. And this one
does always clear the flags before proceeding. So let's just proceed
the same way. It was verified that it properly fixes the problem,
does not affect other implementations, and doesn't cause any freeze
nor spurious wakeups either.
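The pattern is simply the one used by OpenSSL's socket BIO: clear the retry
flags at the beginning of every I/O attempt, and set them again only when
the operation would block. A simplified sketch of a custom BIO read callback
doing this (not the actual haproxy callbacks; it assumes the file descriptor
is stored as the BIO data):

#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <openssl/bio.h>

static int sketch_bio_read(BIO *bio, char *buf, int len)
{
	int fd = (int)(intptr_t)BIO_get_data(bio);
	int ret;

	/* always drop stale retry flags before attempting the I/O */
	BIO_clear_retry_flags(bio);

	ret = read(fd, buf, len);
	if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		BIO_set_retry_read(bio); /* only now flag the would-block case */

	return ret;
}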
Many thanks to Valentín Gutiérrez for providing a network capture
showing the incident as well as a reproducer. This is GH issue #2403.
This patch needs to be backported to all versions that include the
commit above, i.e. as far as 2.0.
QCS instances use qc_stream_desc for data buffering on emission. On
stream reset, its Tx channel is closed earlier than expected. This may
leave unsent data in the qc_stream_desc.
Before this patch, these unsent data would remain after the QCS freeing.
This prevents the buffer from being released as no ACK reception will
remove them. The buffer is only freed when the whole connection is closed.
As qc_stream_desc buffers are limited per connection, this reduces the
buffer pool for other streams of the same connection. In the worst case,
if several streams are reset, this may completely freeze the transfer of
the remaining connection streams.
This bug was reproduced by reducing the connection buffer pool to a
single buffer instance by using the following global statement :
tune.quic.frontend.conn-tx-buffers.limit 1.
Then a QUIC client is used which opens a stream for a large enough
object to ensure data are buffered. The client then emits a STOP_SENDING
before reading all data, which forces the corresponding QCS instance to
be reset. The client then opens a new request but the transfer is
frozen due to this bug.
To fix this, adjust qc_stream_desc API. Add a new argument <final_size>
on qc_stream_desc_release() function. Its value is compared to the
currently buffered offset in the latest qc_stream_desc buffer. If
<final_size> is lower, it means unsent data are present in the
buffer. As such, qc_stream_desc_release() removes them to ensure the
buffer will finally be freed when all ACKs are received. It is also
possible that no data remains at all, indicating that the ACKs were
already received. In that case, the buffer instance is immediately
removed by qc_stream_buf_free().
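Put differently, the release path now trims anything beyond the final size so
that pending ACKs are enough to free the buffer; a rough sketch of that logic
with illustrative types (not the real qc_stream_desc structures):

#include <stdint.h>

struct sbuf_sketch {
	uint64_t buffered_off; /* offset following the last buffered byte */
	uint64_t acked_off;    /* offset fully acknowledged so far */
};

/* Called on stream release: drop data which will never be sent, and free
 * the buffer at once if nothing remains to be acknowledged.
 */
static int release_trim(struct sbuf_sketch *buf, uint64_t final_size)
{
	if (final_size < buf->buffered_off)
		buf->buffered_off = final_size; /* remove unsent data */

	if (buf->buffered_off <= buf->acked_off)
		return 1; /* nothing left in flight: free the buffer right away */

	return 0;         /* keep it until the remaining ACKs are received */
}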
This must be backported up to 2.6. As this code section is known to be
prone to regressions, a period of observation could be reserved before
distributing it on LTS releases.
On ACK reception, data are removed from buffer via qc_stream_desc_ack().
The buffer can be freed if no more data are left. A new slot is also
accounted in buffer connection pool. Extract this operation in a
dedicated private function qc_stream_buf_free().
This change should have no functional impact. However it will be useful
for the next patch which needs to remove a buffer from another function.
This patch is necessary for the following bugfix. As such, it must be
backported with it up to 2.6.
Very minor modification to replace a hardcoded value in a statement by a macro.
Should be backported as far as 2.6 to ease any further modification to come.
There is a typo in the statement to initialize this variable when selecting
newreno as cc algo:
const char *newreno = "newrno";
This would not have happened if #defines had been used in place of the
hardcoded const char * values.
Take the opportunity of this patch to use #defines for all the available cc algorithms.
Must be backported to 2.9.
When a cache is "cold" and multiple clients simultaneously try to access
the same resource we must forward all the requests to the server. Next,
every "duplicated" response will be processed in http_action_store_cache
and we will try to cache every one of them regardless of whether this
response was already cached. In order to avoid having multiple entries
for a same primary key, the logic is then to first delete any
preexisting entry from the cache tree before storing the current one.
The actual previous response content will not be deleted yet though
because if the corresponding row is detached from the "avail" list it
might still be used by a cache applet if it actually performed a lookup
in the cache tree before the new response could be received.
This all means that we can end up using a valid row that references a
cache_entry that was already removed from the cache tree. This does not
pose any problem in regular caches (no 'vary' mechanism enabled) because
the applet only works on the data and not the 'cache_entry' information,
but in the "vary" context, when calling 'http_cache_applet_release' we
might call 'delete_entry' on the given entry which in turn tries to
iterate over all the secondary entries to find the right one in which
the secondary entry counter can be updated. We would then call
eb32_next_dup on an entry that was not in the tree anymore which ended
up crashing.
This crash was introduced by "48f81ec09 : MAJOR: cache: Delay cache
entry delete in reserve_hot function" which added the call to
"release_entry" in "http_cache_applet_release" that ended up crashing.
This issue was raised in GitHub #2417.
This patch must be backported to branch 2.9.
This is a cleanup patch to address cosmetic issues introduced in f034139bc0
("MINOR: lua: Allow reading "proc." scoped vars from LUA core.")
Also taking this opportunity to prefix the function with __LJMP to
indicate that it may longjump.
No backport needed.
As raised by Coverity in GH #2223, f034139bc0 ("MINOR: lua: Allow reading
"proc." scoped vars from LUA core.") causes uninitialized reads due to
smp being passed to vars_get_by_name() without being initialized first.
Indeed, vars_get_by_name() tries to read smp->sess and smp->strm pointers.
As we're only interested in the PROC var scope, it is safe to call
vars_get_by_name() with sess and strm pointers set to NULL, thus we
simply memset smp prior to calling vars_get_by_name() to fix the issue.
This should be backported in 2.9 with f034139bc0.
This previous commit was not sufficient to completely fix the building issue
in relation with the TLS stack 0-RTT support. LibreSSL was the last TLS
stack to refuse to compile because of an undefined QUIC-specific function
for 0-RTT: SSL_set_quic_early_data_enabled().
To get rid of such compilation issues, define HA_OPENSSL_HAVE_0RTT_SUPPORT
only when building against TLS stack with 0-RTT support.
No need to backport.
This commit:
"MINOR: quic: Enable early data at SSL session level (aws-lc)
introduced a build error when using wolfssl as TLS stack
because it references unknown function wolfSSL_set_quic_early_data_enabled()
which is not defined in qc_set_quic_early_data_context() that must not be used
in this case. The compilation of this fonction was enabled for wolfssl when
it should not have by the mentionned commit.
No backport is needed.
Commit f783dd959b ("MINOR: quic: Enable early data at SSL session level
(aws-lc)") introduced a build error when using the openssl compat layer
because it references unknown function SSL_set_quic_early_data_context()
in qc_set_quic_early_data_context() that is not used in this case.
No backport is needed.
The jwt_verify converter was added in 2.5 with commit 130e142ee2
("MEDIUM: jwt: Add jwt_verify converter to verify JWT integrity"). It
takes a string on input and returns an integer. It turns out that by
presetting the return value to zero before processing contents, while
the sample data is a union, it overwrites the beginning of the buffer
struct passed on input. On a 64-bit arch it's not an issue because it's
where the allocated size is stored and it's not used in the operation,
which explains why the regtest works. But on 32-bit, both the size and
the pointer are overwritten, causing a NULL pointer to be passed to
jwt_tokenize() which is not designed to support this, hence crashes.
Let's just use a temporary variable to hold the result and move the
output sample initialization to the end of the function.
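The underlying problem is easy to reproduce with any union that overlays an
integer result and a buffer descriptor: writing the integer member first
clobbers buffer fields that still need to be read, and which fields are hit
depends on the pointer size. A self-contained illustration with simplified
types (not haproxy's real struct sample):

#include <stdio.h>

struct fake_buffer {
	size_t size;  /* on 64-bit, only this field overlaps a long long */
	char  *area;  /* on 32-bit, this pointer gets overwritten as well */
};

union fake_sample_data {
	long long sint;
	struct fake_buffer str;
};

int main(void)
{
	char area[8] = "token";
	union fake_sample_data d;

	d.str.size = sizeof(area);
	d.str.area = area;

	d.sint = 0; /* presetting the integer result clobbers the union... */

	/* ...so reading the input string afterwards is broken: on 32-bit
	 * both size and area are gone, and area is now a NULL pointer.
	 */
	printf("area=%p size=%zu\n", (void *)d.str.area, d.str.size);
	return 0;
}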
This should be backported as far as 2.5.
The initial purpose of CSV stats through the CLI was to make them easily
parsable by scripts. But in some specific cases, some error or warning
message strings containing LFs were dumped into cells of this CSV.
This caused parsing failures in several tools. In addition, if a
warning or message contains two successive LFs, they will be dumped
directly, but a double LF tags the end of the response on the CLI and the
client may consider the response truncated.
This patch extends the 'csv_enc_append' and 'csv_enc' functions used
to format quoted string content according to the RFC with an additional
parameter to convert multi-line strings to one line: CRs are skipped,
and LFs are replaced with spaces. In addition and optionally, it is
also possible to remove resulting trailing spaces.
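A standalone sketch of that flattening (skip CRs, turn LFs into spaces,
optionally trim trailing spaces), independent of the real csv_enc code:

#include <stdio.h>

/* Flatten a possibly multi-line message into a single line. <dst> must be
 * at least as large as <src>.
 */
static void oneline(char *dst, const char *src, int trim_trailing)
{
	char *out = dst;

	for (; *src; src++) {
		if (*src == '\r')
			continue;               /* CRs are skipped */
		*out++ = (*src == '\n') ? ' ' : *src; /* LFs become spaces */
	}
	while (trim_trailing && out > dst && out[-1] == ' ')
		out--;                          /* optional trailing-space removal */
	*out = '\0';
}

int main(void)
{
	char buf[64];

	oneline(buf, "first line\r\nsecond line\n\n", 1);
	printf("[%s]\n", buf); /* prints "[first line second line]" */
	return 0;
}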
The calls to these functions used to fill strings in the stats CSV output
are updated to force this conversion.
This patch should be backported on all supported branches (issue was
already present in v2.0)
This patch impacts only the haproxy builds against aws-lc TLS stack (USE_OPENSSL_AWSLC).
As mentioned in the boringssl documentation, SSL_do_handshake() completes as soon
as the ClientHello is processed and the server flight sent (from the TLS stack to
the server endpoint I guess). In QUIC, this completion has the side effect of
discarding the Handshake packet number space. If this handshake completion is not
deferred, the Handshake level CRYPTO data will not be sent to the peer (because of
the associated packet number space discarding). According to the documentation,
SSL_in_early_data() may be used to do that. If it returns 1, this means that
the handshake is still in progress but has progressed enough to send half-RTT
data.
This patch is required to make the haproxy builds against aws-lc TLS stack support 0-RTT.
This patch impacts only haproxy when built against aws-lc TLS stack (OPENSSL_IS_AWSLC).
During the SSL_CTX switching from the ssl_sock_switchctx_cbk() callback,
ssl_sock_switchctx_set() is called. The latter calls SSL_set_SSL_CTX(),
whose aim is to change the SSL_CTX attached to an SSL object (TLS session).
But the aws-lc (or boringssl) implementation of this function copies
the "early data enabled" setting value (boolean) coming with the SSL_CTX object
into the SSL object. So, if not set in the SSL_CTX object, this setting disables
the one which has been set by configuration on the SSL object
(see qc_set_quic_early_data_enabled(), it calls SSL_set_early_data_enabled()
with an SSL object as parameter).
Fix this by enabling the "early data enabled" setting on the SSL_CTX before
attaching the latter to the SSL object.
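A minimal sketch of the fix's shape (assuming the aws-lc/boringssl functions
SSL_CTX_set_early_data_enabled() and SSL_set_SSL_CTX(); it does not apply to
plain OpenSSL):

#include <openssl/ssl.h>

/* With aws-lc/boringssl, SSL_set_SSL_CTX() copies the "early data
 * enabled" flag from the SSL_CTX into the SSL object, so enable it on
 * the new SSL_CTX first, then switch the session's context.
 */
static void switchctx_keep_early_data(SSL *ssl, SSL_CTX *new_ctx)
{
	SSL_CTX_set_early_data_enabled(new_ctx, 1);
	SSL_set_SSL_CTX(ssl, new_ctx);
}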
This patch is required to make QUIC 0-RTT work with haproxy built against
aws-lc.
Note that, this patch should also help in early data support for TCP connections.
This patch impacts only the haproxy build against aws-lc TLS stack (USE_OPENSSL_AWSLC).
Implement a new function, qc_set_quic_early_data_enabled(), to enable
early data at the session level. To make QUIC 0-RTT work, a context string
must be set by calling SSL_set_quic_early_data_context(). This is a
subset of the encoded transport parameters which is used for this.
Note that some application level settings should be also added (TODO).
This patch is required to make 0-RTT work for haproxy builds against aws-lc.