This code was there because the timer task was not running on the same thread
as the one which parse the QUIC packets. Now that this is no more the case,
we can wake up this task directly.
Must be backported to 2.7.
Move quic_rx_pkts_del() out of quic_conn.h to make it benefit from the TRACE API.
Add traces which already already helped in diagnosing an issue encountered with
ngtcp2 which sent too much 1RTT packets before the handshake completion. This
has been fixed here after having discussed with Tasuhiro on QUIC dev slack:
https://github.com/ngtcp2/ngtcp2/pull/663
Must be backported to 2.7.
Some counters could potentially be incremented even if send*() syscall returned
no error when ret >= 0 and ret != sz. This could be the case for instance if
a first call to send*() returned -1 with errno set to EINTR (or any previous syscall
which set errno to a non-null value) and if the next call to send*() returned
something positive and smaller than <sz>.
Must be backported to 2.7 and 2.6.
Add traces inside h3_decode_qcs(). Every error path has now its
dedicated trace which should simplify debugging. Each early returns has
been converted to a goto invocation.
To complete the demux tracing, demux frame type and length are now
printed using the h3s instance whenever its possible on trace
invocation. A new internal value H3_FT_UNINIT is used as a frame type to
mark demuxing as inactive.
This should be backported up to 2.7.
Since the recent changes on the clocks, now.tv_sec is not to be used
between processes because it's a clock which is local to the process and
does not contain a real unix timestamp. This patch fixes the issue by
using "data.tv_sec" which is the wall clock instead of "now.tv_sec'.
It prevents having incoherent timestamps.
It also introduces some checks on negatives values in order to never
displays a netative value if it was computed from a wrong value set by a
previous haproxy version.
It must be backported as far as 2.0.
Implement support for clients that emit the stream FIN with an empty
STREAM frame. For that, qcc_recv() offset comparison has been adjusted.
If offset has already been received but the FIN bit is now transmitted,
do not skip the rest of the function and call application layer
decode_qcs() callback.
Without this, streams will be kept open forever as HTX EOM is never
transfered to the upper stream layer.
This behavior was observed with mvfst client prior to its patch
38c955a024aba753be8bf50fdeb45fba3ac23cfd
Fix hq-interop (HTTP 0.9 over QUIC)
This notably caused the interop multiplexing test to fail as unclosed
streams on haproxy side prevented the emission of new MAX_STREAMS frame
to the client.
This shoud be backported up to 2.6. It also relies on previous commit :
381d8137e3
MINOR: h3/hq-interop: handle no data in decode_qcs() with FIN set
Properly handle a STREAM frame with no data but the FIN bit set at the
application layer. H3 and hq-interop decode_qcs() callback have been
adjusted to not return early in this case.
If the FIN bit is accepted, a HTX EOM must be inserted for the upper
stream layer. If the FIN is rejected because the stream cannot be
closed, a proper CONNECTION_CLOSE error will be triggered.
A new utility function qcs_http_handle_standalone_fin() has been
implemented in the qmux_http module. This allows to simply add the HTX
EOM on qcs HTX buffer. If the HTX buffer is empty, a EOT is first added
to ensure it will be transmitted above.
This commit will allow to properly handle FIN notify through an empty
STREAM frame. However, it is not sufficient as currently qcc_recv() skip
the decode_qcs() invocation when the offset is already received. This
will be fixed in the next commit.
This should be backported up to 2.6 along with the next patch.
Several times during debugging it has been difficult to find a way to
reliably indicate if a thread had been started and if it was still
running. It's really not easy because the elements we look at are not
necessarily reliable (e.g. harmless bit or idle bit might not reflect
what we think during a signal). And such notions can be subjective
anyway.
Here we define two thread flags, TH_FL_STARTED which is set as soon as
a thread enters run_thread_poll_loop() and drops the idle bit, and
another one, TH_FL_IN_LOOP, which is set when entering run_poll_loop()
and cleared when leaving it. This should help init/deinit code know
whether it's called from a non-initialized thread (i.e. tid must not
be trusted), or shared functions know if they're being called from a
running thread or from init/deinit code outside of the polling loop.
As reported in github issue #1881, there are situations where an excess
of TLS handshakes can cause a livelock. What's happening is that normally
we process at most one TLS handshake per loop iteration to maintain the
latency low. This is done by tagging them with TASK_HEAVY, queuing these
tasklets in the TL_HEAVY queue. But if something slows down the loop, such
as a connect() call when no more ports are available, we could end up
processing no more than a few hundred or thousands handshakes per second.
If the llmit becomes lower than the rate of incoming handshakes, we will
accumulate them and at some point users will get impatient and give up or
retry. Then a new problem happens: the queue fills up with even more
handshake attempts, only one of which will be handled per iteration, so
we can end up processing only outdated handshakes at a low rate, with
basically nothing else in the queue. This can for example happen in
parallel with health checks that don't require incoming handshakes to
succeed to continue to cause some activity that could maintain the high
latency stuff active.
Here we're taking a slightly different approach. First, instead of always
allowing only one handshake per loop (and usually it's critical for
latency), we take the current situation into account:
- if configured with tune.sched.low-latency, the limit remains 1
- if there are other non-heavy tasks, we set the limit to 1 + one
per 1024 tasks, so that a heavily loaded queue of 4k handshakes
per thread will be able to drain them at ~4 per loops with a
limited impact on latency
- if there are no other tasks, the limit grows to 1 + one per 128
tasks, so that a heavily loaded queue of 4k handshakes per thread
will be able to drain them at ~32 per loop with still a very
limited impact on latency since only I/O will get delayed.
It was verified on a 56-core Xeon-8480 that this did not degrade the
latency; all requests remained below 1ms end-to-end in full close+
handshake, and even 500us under low-lat + busy-polling.
This must be backported to 2.4.
There's a per-thread "long_rq" counter that is used to indicate how
often we leave the scheduler with tasks still present in the run queue.
The purpose is to know when tune.runqueue-depth served to limit latency,
due to a large number of tasks being runnable at once.
However there's a bug there, it's not always set: if after the first
run, one heavy task was processed and later only heavy tasks remain,
we'll loop back to not_done_yet where we try to pick more tasks, but
none are eligible (since heavy ones have already run) so we directly
return without incrementing the counter. This is what causes ultra-low
values on long_rq during massive SSL handshakes, that are confusing
because they make one believe that tl_class_mask doesn't have the HEAVY
flag anymore. Let's just fix that by not returning from the middle of
the function.
This can be backported as far as 2.4.
In 2.7, the method used to check for a sleeping thread changed with
commit e7475c8e7 ("MEDIUM: tasks/fd: replace sleeping_thread_mask with
a TH_FL_SLEEPING flag"). Previously there was a global sleeping mask
and now there is a flag per thread. The commit above partially broke
the watchdog by looking at the current thread's flags via th_ctx
instead of the reported thread's flags, and using an AND condition
instead of an OR to update and leave. This can cause a wrong thread
to be killed when the load is uneven. For example, when enabling
busy polling and sending traffic over a single connection, all
threads have their run time grow, and if the one receiving the
signal is also processing some traffic, it will not match the
sleeping/harmless condition and will set the stuck flag, then die
upon next invocation. While it's reproducible in tests, it's unlikely
to be met in field.
This fix should be backported to 2.7.
A feature command was added to detect if infinite forward is disabled to be
able to skip the script. Unfortunately, it is no supported to evaluate such
expression. Thus remove it. For now, reg-tests must not be executed with
"-dF" option.
The -dF option can now be used to disable data fast-forward. It does the
same than the global option "tune.fast-forward off". Some reg-tests may rely
on this optim. To detect the feature and skip such script, the following
vtest command must be used:
feature cmd "$HAPROXY_PROGRAM -cc '!(globa.tune & GTUNE_NO_FAST_FWD)'"
The new global option "tune.fast-forward" can be set to "off" to disable the
data fast-forward. It is an debug option, thus it is internally marked as
experimental. The directive "expose-experimental-directives" must be set
first to use this one. By default, the data fast-forward is enable.
It could be usefull to force to wake the stream up when data are
received. To be sure, evreything works fine in this case. The data
fast-forward is an optim. It must work without it. But some code may rely on
the fact the stream will not be woken up. With this option, it is possible
to spot some hidden bugs.
At the stream level, the read expiration date is unset if a shutr was
received but not if the end of input was reached. If we know no more data
are excpected, there is no reason to let the read expiration date armed,
except to respect clientfin/serverfin timeout on some circumstances.
This patch could slowly be backported as far as 2.2.
During the payload forwarding, since the commit f2b02cfd9 ("MAJOR: http-ana:
Review error handling during HTTP payload forwarding"), when an error
occurred on one side, we don't rely anymore on a specific HTTP message state
to detect it on the other side. However, nothing was added to detect the
error. Thus, when this happens, a spinning loop may be experienced and an
abort because of the watchdog.
To fix the bug, we must detect the opposite side is closed by checking the
opposite SC state. Concretly, in http_end_request() and http_end_response(),
we wait for the other side iff the HTTP message state is lower to
HTTP_MSG_DONE (the message is not finished) and the SC state is not
SC_ST_CLO (the opposite side is not closed). In these function, we don't
care if there was an error on the opposite side. We only take care to detect
when we must stop waiting the other side.
This patch should fix the issue #2042. No backport needed.
The ssl_bind_kw structure is exclusively used for crt-list keyword, it
must be named otherwise to remove the confusion.
The structure was renamed ssl_crtlist_kws.
Released version 2.8-dev4 with the following main changes :
- BUG/MINOR: stats: fix source buffer size for http dump
- BUG/MEDIUM: stats: fix resolvers dump
- BUG/MINOR: stats: fix ctx->field update in stats_dump_proxy_to_buffer()
- BUG/MINOR: stats: fix show stats field ctx for servers
- BUG/MINOR: stats: fix STAT_STARTED behavior with full htx
- MINOR: quic: Update version_information transport parameter to draft-14
- BUG/MINOR: stats: Prevent HTTP "other sessions" counter underflows
- BUG/MEDIUM: thread: fix extraneous shift in the thread_set parser
- BUG/MEDIUM: listener/thread: bypass shards setting on failed thread resolution
- BUG/MINOR: ssl/crt-list: warn when a line is malformated
- BUG/MEDIUM: stick-table: do not leave entries in end of window during purge
- BUG/MINOR: clock: do not mix wall-clock and monotonic time in uptime calculation
- BUG/MEDIUM: cache: use the correct time reference when comparing dates
- MEDIUM: clock: force internal time to wrap early after boot
- BUILD: ssl/ocsp: ssl_ocsp-t.h depends on ssl_sock-t.h
- MINOR: ssl/ocsp: add a function to check the OCSP update configuration
- MINOR: cfgparse/server: move (min/max)conn postparsing logic into dedicated function
- BUG/MINOR: server/add: ensure minconn/maxconn consistency when adding server
- BUG/MEDIUM: stconn: Schedule a shutw on shutr if data must be sent first
- BUG/MEDIUM: quic: fix crash when "option nolinger" is set in the frontend
- MINOR: quic: implement a basic "show quic" CLI handler
- MINOR: quic: display CIDs and state in "show quic"
- MINOR: quic: display socket info on "show quic"
- MINOR: quic: display infos about various encryption level on "show quic"
- MINOR: quic: display Tx stream info on "show quic"
- MINOR: quic: filter closing conn on "show quic"
- BUG/MINOR: quic: fix filtering of closing connections on "show quic"
- BUG/MEDIUM: stconn: Don't needlessly wake the stream on send during fast-forward
- BUG/MINOR: quic: fix type bug on "show quic" for 32-bits arch
- BUG/MINOR: mworker: fix uptime for master process
- BUG/MINOR: clock/stats: also use start_time not start_date in HTML info
- BUG/MEDIUM: stconn: stop to enable/disable reads from streams via si_update_rx
- BUG/MEDIUM: quic: Buffer overflow when looking through QUIC CLI keyword list
- DOC: proxy-protocol: fix wrong byte in provided example
- MINOR: ssl-ckch: Stop to test CF_WRITE_ERROR to commit CA/CRL file
- MINOR: bwlim: Remove useless test on CF_READ_ERROR to detect the last packet
- BUG/MINOR: http-ana: Fix condition to set LAST termination flag
- BUG/MINOR: mux-h1: Don't report an H1C error on client timeout
- BUG/MEDIUM: spoe: Don't set the default traget for the SPOE agent frontend
- BUG/MINOR: quic: Wrong datagram dispatch because of qc_check_dcid()
- BUG/CRITICAL: http: properly reject empty http header field names
The HTTP header parsers surprizingly accepts empty header field names,
and this is a leftover from the original code that was agnostic to this.
When muxes were introduced, for H2 first, the HPACK decompressor needed
to feed headers lists, and since empty header names were strictly
forbidden by the protocol, the lists of headers were purposely designed
to be terminated by an empty header field name (a principle that is
similar to H1's empty line termination). This principle was preserved
and generalized to other protocols migrated to muxes (H1/FCGI/H3 etc)
without anyone ever noticing that the H1 parser was still able to deliver
empty header field names to this list. In addition to this it turns out
that the HPACK decompressor, despite a comment in the code, may
successfully decompress an empty header field name, and this mistake
was propagated to the QPACK decompressor as well.
The impact is that an empty header field name may be used to truncate
the list of headers and thus make some headers disappear. While for
H2/H3 the impact is limited as haproxy sees a request with missing
headers, and headers are not used to delimit messages, in the case of
HTTP/1, the impact is significant because the presence (and sometimes
contents) of certain sensitive headers is detected during the parsing.
Thus, some of these headers may be seen, marked as present, their value
extracted, but never delivered to upper layers and obviously not
forwarded to the other side either. This can have for consequence that
certain important header fields such as Connection, Upgrade, Host,
Content-length, Transfer-Encoding etc are possibly seen as different
between what haproxy uses to parse/forward/route and what is observed
in http-request rules and of course, forwarded. One direct consequence
is that it is possible to exploit this property in HTTP/1 to make
affected versions of haproxy forward more data than is advertised on
the other side, and bypass some access controls or routing rules by
crafting extraneous requests. Note, however, that responses to such
requests will normally not be passed back to the client, but this can
still cause some harm.
This specific risk can be mostly worked around in configuration using
the following rule that will rely on the bug's impact to precisely
detect the inconsistency between the known body size and the one
expected to be advertised to the server (the rule works from 2.0 to
2.8-dev):
http-request deny if { fc_http_major 1 } !{ req.body_size 0 } !{ req.hdr(content-length) -m found } !{ req.hdr(transfer-encoding) -m found } !{ method CONNECT }
This will exclusively block such carefully crafted requests delivered
over HTTP/1. HTTP/2 and HTTP/3 do not need content-length, and a body
that arrives without being announced with a content-length will be
forwarded using transfer-encoding, hence will not cause discrepancies.
In HAProxy 2.0 in legacy mode ("no option http-use-htx"), this rule will
simply have no effect but will not cause trouble either.
A clean solution would consist in modifying the loops iterating over
these headers lists to check the header name's pointer instead of its
length (since both are zero at the end of the list), but this requires
to touch tens of places and it's very easy to miss one. Functions such
as htx_add_header(), htx_add_trailer(), htx_add_all_headers() would be
good starting points for such a possible future change.
Instead the current fix focuses on blocking empty headers where they
are first inserted, hence in the H1/HPACK/QPACK decoders. One benefit
of the current solution (for H1) is that it allows "show errors" to
report a precise diagnostic when facing such invalid HTTP/1 requests,
with the exact location of the problem and the originating address:
$ printf "GET / HTTP/1.1\r\nHost: localhost\r\n:empty header\r\n\r\n" | nc 0 8001
HTTP/1.1 400 Bad request
Content-length: 90
Cache-Control: no-cache
Connection: close
Content-Type: text/html
<html><body><h1>400 Bad request</h1>
Your browser sent an invalid request.
</body></html>
$ socat /var/run/haproxy.stat <<< "show errors"
Total events captured on [10/Feb/2023:16:29:37.530] : 1
[10/Feb/2023:16:29:34.155] frontend decrypt (#2): invalid request
backend <NONE> (#-1), server <NONE> (#-1), event #0, src 127.0.0.1:31092
buffer starts at 0 (including 0 out), 16334 free,
len 50, wraps at 16336, error at position 33
H1 connection flags 0x00000000, H1 stream flags 0x00000810
H1 msg state MSG_HDR_NAME(17), H1 msg flags 0x00001410
H1 chunk len 0 bytes, H1 body len 0 bytes :
00000 GET / HTTP/1.1\r\n
00016 Host: localhost\r\n
00033 :empty header\r\n
00048 \r\n
I want to address sincere and warm thanks for their great work to the
team composed of the following security researchers who found the issue
together and reported it: Bahruz Jabiyev, Anthony Gavazzi, and Engin
Kirda from Northeastern University, Kaan Onarlioglu from Akamai
Technologies, Adi Peleg and Harvey Tuch from Google. And kudos to Amaury
Denoyelle from HAProxy Technologies for spotting that the HPACK and
QPACK decoders would let this pass despite the comment explicitly
saying otherwise.
This fix must be backported as far as 2.0. The QPACK changes can be
dropped before 2.6. In 2.0 there is also the equivalent code for legacy
mode, which doesn't suffer from the list truncation, but it would better
be fixed regardless.
CVE-2023-25725 was assigned to this issue.
There was a parenthesis placed in the wrong place for a memcmp().
As a consequence, clients could not reuse a UDP address for a new connection.
Must be backported to 2.7.
The commit d5983cef8 ("MINOR: listener: remove the useless ->default_target
field") revealed a bug in the SPOE. No default-target must be defined for
the SPOE agent frontend. SPOE applets are used on the frontend side and a
TCP connection is established on the backend side.
Because of this bug, since the commit above, the stream target is set to the
SPOE applet instead of the backend connection, leading to a spinning loop on
the applet when it is released because are unable to close the backend side.
This patch should fix the issue #2040. It only affects the 2.8-DEV but to
avoid any future bug, it should be backported to all stable versions.
When a client timeout is reported by the H1 mux, it is not an error but an
abort. Thus, H1C_F_ERROR flag must not be set. It is espacially important to
not inhibit the send. Because of this bug, a 408-Request-time-out is
reported in logs but the error message is not sent to the client.
This patch must be backported to 2.7.
We should not report LAST data in log if the response is in TUNNEL mode on
client close/timeout because there is no way to be sure it is the last
data. It means, it can only be reported in DONE, CLOSING or CLOSE states.
No backport needed.
There is already a test on CF_EOI and CF_SHUTR. The last one is always set
when a read error is reported. Thus there is no reason to check
CF_READ_ERROR.
This change was performed on all applet I/O handlers but one was missed. In
the CLI I/O handler used to commit a CA/CRL file, we can remove the test on
CF_WRITE_ERROR because there is already a test on CF_SHUTW.
There was a mistake in the example of proxy-proto frame
provided, it cannot end with 0x02 but only 0x20 or 0x21
since the version is in the upper 4 bits and the lower ones
are 0 for LOCAL or 1 for PROXY, hence the example should be:
\x0D\x0A\x0D\x0A\x00\x0D\x0A\x51\x55\x49\x54\x0A\x20
Thanks to Bram Grit for reporting this mistake.
This has been detected by libasan as follows:
=================================================================
==3170559==ERROR: AddressSanitizer: global-buffer-overflow on address 0x55cf77faad08 at pc 0x55cf77a87370 bp 0x7ffc01bdba70 sp 0x7ffc01bdba68
READ of size 8 at 0x55cf77faad08 thread T0
#0 0x55cf77a8736f in cli_find_kw src/cli.c:335
#1 0x55cf77a8a9bb in cli_parse_request src/cli.c:792
#2 0x55cf77a8c385 in cli_io_handler src/cli.c:1024
#3 0x55cf77d19ca1 in task_run_applet src/applet.c:245
#4 0x55cf77c0b6ba in run_tasks_from_lists src/task.c:634
#5 0x55cf77c0cf16 in process_runnable_tasks src/task.c:861
#6 0x55cf77b48425 in run_poll_loop src/haproxy.c:2934
#7 0x55cf77b491cf in run_thread_poll_loop src/haproxy.c:3127
#8 0x55cf77b4bef2 in main src/haproxy.c:3783
#9 0x7fb8b0693d09 in __libc_start_main ../csu/libc-start.c:308
#10 0x55cf7764f4c9 in _start (/home/flecaille/src/haproxy-untouched/haproxy+0x1914c9)
0x55cf77faad08 is located 0 bytes to the right of global variable 'cli_kws' defined in 'src/quic_conn.c:7834:27' (0x55cf77faaca0) of size 104
SUMMARY: AddressSanitizer: global-buffer-overflow src/cli.c:335 in cli_find_kw
Shadow bytes around the buggy address:
According to cli_find_kw() code and cli_kw_list struct definition, the second
member of this structure ->kw[] must be a null-terminated array.
Add a last element with default initializers to <cli_kws> global variable which
is impacted by this bug.
This bug arrived with this commit:
15c74702d MINOR: quic: implement a basic "show quic" CLI handler
Must be backported to 2.7 where this previous commit has been already
backported.
It is not really a bug because it does not fix any known issue. And it is
flagged as MEDIUM because it is sensitive. But if there are some extra calls
to process_stream(), it can be an issue because, in si_update_rx(), we may
disable reading for the SC when outgoing data are blocked in the input
channel. But it is not really the process_stream() job to take care of
that. This may block data receipt.
It is an old code, mainly here to avoid wakeup in loop on the stats
applet. Today, it seems useless and can lead to bugs. An endpoint is
responsible to block the SC if it waits for some room and the opposite
endpoint is responsible to unblock it when some data are sent. The stream
should not interfere on this part.
This patch could be backported to 2.7 after a period of observation. And it
should only be backported to lower versions if an issue is reported.
For an unknown reason in the change of uptime calculation for the HTML
page didn't make it to commit 6093ba47c ("BUG/MINOR: clock: do not mix
wall-clock and monotonic time in uptime calculation"). Let's address it
as well otherwise the stats page will display an incorrect uptime.
No backport needed unless the patch above is backported.
Uptime calculation for master process was incorrect as it used
<start_date> as its timestamp base time. Fix this by using the scheduler
time <start_time> for this.
The impact of this bug is minor as timestamp base time is only used for
"show proc" CLI output. it was highlighted by the following commit.
which caused a negative value to be displayed for the master process
uptime on "show proc" output.
28360dc53f
MEDIUM: clock: force internal time to wrap early after boot
This should be backported up to 2.0.
Incorrect printf format specifier "%lu" was used on "show quic" handler
for uint64_t. This breaks build on 32-bits architecture. To fix this
portability issue, force an explicit cast to unsigned long long with
"%llu" specifier.
This must be backported up to 2.7.
With a connection, when data are received, if these data are sent to the
opposite side because the fast-forwarding is possible, the stream may be
woken up on some conditions (at the end of sc_app_chk_snd_conn()):
* The channel is shut for write
* The SC is not in the "established" state
* The stream must explicitly be woken up on write and all data was sent
* The connection was just established.
A bug on the last condition was introduced with the commit d89884153
("MEDIUM: channel: Use CF_WRITE_EVENT instead of CF_WRITE_PARTIAL"). The
stream is now woken up on any write events.
This patch fixes this issue and restores the original behavior. No backport
is needed.
Filtering of closing/draining connections on "show quic" was not
properly implemented. This causes the extra argument "all" to display
all connections to be without effect. This patch fixes this and restores
the output of all connections.
This must be backported up to 2.7.
Reduce default "show quic" output by masking connection on
closing/draing state due to a CONNECTION_CLOSE emission/reception. These
connections can still be displayed using the special argument "all".
This should be backported up to 2.7.
Complete "show quic" handler by displaying information about
quic_stream_desc entries. These structures are used to emit stream data
and store them until acknowledgment is received.
This should be backported up to 2.7.
Complete "show quic" handler by displaying various information related
to each encryption level and packet number space. Most notably, ack
ranges and bytes in flight are present to help debug retransmission
issues.
This should be backported up to 2.7.
Complete "show quic" handler by displaying information related to the
quic_conn owned socket. First, the FD is printed, followed by the
address of the local and remote endpoint.
This should be backported up to 2.7.
Complete "show quic" handler. Source and destination CIDs are printed
for every connection. This is complete by a state info to reflect if
handshake is completed and if a CONNECTION_CLOSE has been emitted or
received and the allocation status of the attached MUX. Finally the idle
timer expiration is also printed.
This should be backported up to 2.7.
Implement a basic "show quic" CLI handler. This command will be useful
to display various information on all the active QUIC frontend
connections.
This work is heavily inspired by "show sess". Most notably, a global
list of quic_conn has been introduced to be able to loop over them. This
list is stored per thread in ha_thread_ctx.
Also add three CLI handlers for "show quic" in order to allocate and
free the command context. The dump handler runs on thread isolation.
Each quic_conn is referenced using a back-ref to handle deletion during
handler yielding.
For the moment, only a list of raw quic_conn pointers is displayed. The
handler will be completed over time with more information as needed.
This should be backported up to 2.7.
Commit 0aba11e9e ("MINOR: quic: remove unnecessary quic_session_accept()")
overlooked one problem, in session_accept_fd() at the end, there's a bunch
of FD-specific stuff that either sets up or resets the socket at the TCP
level. The tests are mostly performed for AF_INET/AF_INET6 families but
they're only for one part (i.e. to avoid setting up TCP options on UNIX
sockets). Other pieces continue to configure the socket regardless of its
family. All of this directly acts on the FD, which is not correct since
the FD is not valid here, it corresponds to the QUIC handle. The issue
is much more visible when "option nolinger" is enabled in the frontend,
because the access to fdatb[cfd].state immediately crashes on the first
connection, as can be seen in github issue #2030.
This patch bypasses this setup for FD-less connections, such as QUIC.
However some of them could definitely be relevant to the QUIC stack, or
even to UNIX sockets sometimes. A better long-term solution would consist
in implementing a setsockopt() equivalent at the protocol layer that would
be used to configure the socket, either the FD or the QUIC conn depending
on the case. Some of them would not always be implemented but that would
allow to unify all this code.
This fix must be backported everywhere the commit above is backported,
namely 2.6 and 2.7.
Thanks to github user @twomoses for the nicely detailed report.
The commit 7f59d68fe ("BUG/MEDIIM: stconn: Flush output data before
forwarding close to write side") introduced a regression. When the read side
is closed, the close is not forwarded to the write side if there are some
pending outgoind data. The idea is to foward data first and the close the
write side. However, when fast-forwarding is enabled and last data block is
received with the read0, the close is never forwarded.
We cannot revert the commit above because it really fix an issue. However,
we can schedule the shutdown for write by setting CF_SHUTW_NOW flag on the
write side. Indeed, it is the purpose of this flag.
To not replicate ugly and hardly maintainable code block at different places
in stconn.c, an helper function is used. Thus, sc_cond_forward_shutw() must
be called to know if the close can be fowarded or not. It returns 1 if it is
possible. In this case, the caller is responsible to forward the close to
the write side. Otherwise, if the close cannot be forwarded, 0 is
returned. It happens when it should not be performed at all. Or when it
should only be delayed, waiting for the input channel to be flushed. In this
last case, the CF_SHUTW_NOW flag is set in the output channel.
This patch should fix the issue #2033. It must be backported with the commit
above, thus at least as far as 2.2.
When a new server was added through the cli using "server add" command,
the maxconn/minconn consistency check historically implemented in
check_config_validity() for static servers was missing.
As a result, when adding a server with the maxconn parameter without the
minconn set, the server was unable to handle any connection because
srv_dynamic_maxconn() would always return 0.
Consider the following reproducer:
| global
| stats socket /tmp/ha.sock mode 660 level admin expose-fd listeners
|
| defaults
| timeout client 5s
| timeout server 5s
| timeout connect 5s
|
| frontend test
| mode http
| bind *:8081
| use_backend farm
|
| listen dummyok
| bind localhost:18999
| mode http
| http-request return status 200 hdr test "ok"
|
| backend farm
| mode http
Start haproxy and perform the following :
echo "add server farm/t1 127.0.0.1:18999 maxconn 100" | nc -U /tmp/ha.sock
echo "enable server farm/t1" | nc -U /tmp/ha.sock
curl localhost:8081 # -> 503 after 5s connect timeout
Thanks to ("MINOR: cfgparse/server: move (min/max)conn postparsing logic into
dedicated function"), we are now able to perform the consistency check after
the new dynamic server has been parsed.
This is enough to fix the issue documented here that was reported by
Thomas Pedoussaut on the ML.
This commit depends on:
- ("MINOR: cfgparse/server: move (min/max)conn postparsing logic into
dedicated function")
It must be backported to 2.6 and 2.7
In check_config_validity() function, we performed some consistency checks to
adjust minconn/maxconn attributes for each declared server.
We move this logic into a dedicated function named srv_minmax_conn_apply()
to be able to perform those checks later in the process life when needed
(ie: dynamic servers)
Deduplicate the code which checks the OCSP update in the ckch_store and
in the crtlist_entry.
Also, jump immediatly to error handling when the ERR_FATAL is catched.
GH issue #2034 clearly indicates yet another case of time roll-over
that went badly. Issues that happen only once every 50 days are hard
to detect and debug, and are usually reported more or less synchronized
from multiple sources. This patch finally does what had long been planned
but never done yet, which is to force the time to wrap early after boot
so that any such remaining issue can be spotted quicker. The margin delay
here is 20s (it may be changed by setting BOOT_TIME_WRAP_SEC to another
value). This value seems sufficient to permit failed health checks to
succeed and traffic to come in and possibly start to update some time
stamps (accept dates in logs, freq counters, stick-tables expiration
dates etc).
It could theoretically be helpful to have this in 2.7, but as can be
seen with the two patches below, we've already had incorrect use cases
of the internal monotonic time when the wall-clock one was needed, so
we could expect to detect other ones in the future. Note that this will
*not* induce bugs, it will only make them happen much faster (i.e. no
need to wait for 50 days before seeing them). If it were to eventually
be backported, these two previous patches must also be backported:
BUG/MINOR: clock: use distinct wall-clock and monotonic start dates
BUG/MEDIUM: cache: use the correct time reference when comparing dates
The cache makes use of dates advertised by external components, such
as "last-modified" or "date". As such these are wall-clock dates, and
not internal dates. However, all comparisons are mistakenly made based
on the internal monotonic date which is designed to drift from the wall
clock one in order to catch up with stolen time (which can sometimes be
intense in VMs). As such after some run time some objects may fail to
validate or fail to expire depending on the direction of the drift. This
is particularly visible when applying an offset to the internal time to
force it to wrap soon after startup, as it will be shifted up to 49.7
days in the future depending on the current date; what happens in this
case is that the reg-test "cache_expires.vtc" fails on the 3rd test by
returning stale contents from the cache at the date of this commit.
It is really important that all external dates are compared against
"date" and not "now" for this reason.
This fix needs to be backported to all versions.
We've had a start date even before the internal monotonic clock existed,
but once the monotonic clock was added, the start date was not updated
to distinguish the wall clock time units and the internal monotonic time
units. The distinction is important because both clocks do not necessarily
progress at the same speed. The very rare occurrences of the wall-clock
date are essentially for human consumption and communication with third
parties (e.g. report the start date in "show info" for monitoring
purposes). However currently this one is also used to measure the distance
to "now" as being the process' uptime. This is actually not correct. It
only works because for now the two dates are initialized at the exact
same instant at boot but could still be wrong if the system's date shows
a big jump backwards during startup for example. In addition the current
situation prevents us from enforcing an abritrary offset at boot to reveal
some heisenbugs.
This patch adds a new "start_time" at boot that is set from "now" and is
used in uptime calculations. "start_date" instead is now set from "date"
and will always reflect the system date for human consumption (e.g. in
"show info"). This way we're now sure that any drift of the internal
clock relative to the system date will not impact the reported uptime.
This could possibly be backported though it's unlikely that anyone has
ever noticed the problem.
At some moments expired stick table records stop being removed. This
happens when the internal time wraps around the 32-bit limit, or every
49.7 days. What precisely happens is that some elements that are collected
close to the end of the time window (2^32 - table's "expire" setting)
might have been updated and will be requeued further, at the beginning
of the next window. Here, three bad situations happen:
- the incorrect integer-based comparison that is not aware of wrapping
will result in the scan to restart from the freshly requeued element,
skipping all those at the end of the window. The net effect of this
is that at each wakeup of the expiration task, only one element from
the end of the window will be expired, and other ones will remain
there for a very long time, especially if they have to wait for all
the predecessors to be picked one at a time after slow wakeups due
to a long expiration ; this is what was observed in issue #2034
making the table fill up and appear as not expiring at all, and it
seems that issue #2024 reports the same problem at the same moment
(since such issues happen for everyone roughly at the same time
when the clock doesn't drift too much).
- the elements that were placed at the beginning of the next window
are skipped as well for as long as there are refreshed entries at
the end of the previous window, so these ones participate to filling
the table as well. This is cause by the restart from the current,
updated node that is generally placed after most other less recently
updated elements.
- once the last element at the end of the window is picked, suddenly
there is a large amount of expired entries at the beginning of the
next window that all have to be requeued. If the expiration delay
is large, the number can be big and it can take a long time, which
can very likely explain the periodic crashes reported in issue #2025.
Limiting the batch size as done in commit dfe79251d ("BUG/MEDIUM:
stick-table: limit the time spent purging old entries") would make
sense for process_table_expire() as well.
This patch addresses the incorrect tree scan algorithm to make sure that:
- there's always a next element to compare against, even when dealing
with the last one in the tree, the first one must be used ;
- time comparisons used to decide whether to restart from the current
element use tick_is_lt() as it is the only case where we know the
current element will be placed before any other one (since the tree
respects insertion ordering for duplicates)
In order to reproduce the issue, it was found that injecting traffic on
a random key that spans over half of the size of a table whose expiration
is set to 15s while the date is going to wrap in 20s does exhibit an
increase of the table's size 5s after startup, when entries start to be
pushed to the next window. It's more effective when a second load
generator constantly hammers a same key to be certain that none of them
is ready to expire. This doesn't happen anymore after this patch.
This fix needs to be backported to all stable versions. The bug has been
there for as long as the stick tables were introduced in 1.4-dev7 with
commit 3bd697e07 ("[MEDIUM] Add stick table (persistence) management
functions and types"). A cleanup could consists in deduplicating that
code by having process_table_expire() call __stktable_trash_oldest(),
with that one improved to support an optional time check.