This one is printed as the iocb in the "show fd" output, and arguably
this wasn't very convenient as-is:
293 : st=0x000123(cl heopI W:sRa R:sRA) ref=0 gid=1 tmask=0x8 umask=0x0 prmsk=0x8 pwmsk=0x0 owner=0x7f488487afe0 iocb=0x50a2c0(main+0x60f90)
Let's unstatify it and export it so that the symbol can now be resolved
from the various points that need it.
This bug was revealed by h2load tests run as follows:
h2load -t 4 --npn-list h3 -c 64 -m 16 -n 16384 -v https://127.0.0.1:4443/
This open (-c) 64 QUIC connections (-n) 16384 h3 requets from (-t) 4 threads, i.e.
256 requests by connection. Such tests could not always pass and often ended with
such results displays by h2load:
finished in 53.74s, 38.11 req/s, 493.78KB/s
requests: 16384 total, 2944 started, 2048 done, 2048 succeeded, 14336
failed, 14336 errored, 0 timeout
status codes: 2048 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 25.92MB (27174537) total, 102.00KB (104448) headers (space
savings 1.92%), 25.80MB (27053569) data
UDP datagram: 3883 sent, 24330 received
min max mean sd ± sd
time for request: 48.75ms 502.86ms 134.12ms 75.80ms 92.68%
time for connect: 20.94ms 331.24ms 189.59ms 84.81ms 59.38%
time to 1st byte: 394.36ms 417.01ms 406.72ms 9.14ms 75.00%
req/s : 0.00 115.45 14.30 38.13 87.50%
The number of successful requests was always a multiple of 256.
Activating the traces also shew that some connections were blocked after having
successfully completed their handshakes due to the fact that the mux. The mux
is started upon the acceptation of the connection.
Under heavy load, some connections were never accepted. From the moment where
more than 4 (MAXACCEPT) connections were enqueued before a listener could be
woken up to accept at most 4 connections, the remaining connections were not
accepted ore lately at the second listener tasklet wakeup.
Add a call to tasklet_wakeup() to the accept list tasklet of the listeners to
wake up it if there are remaining connections to accept after having called
listener_accept(). In this case the listener must not be removed of this
accept list, if not at the next call it will not accept anything more.
Must be backported to 2.7 and 2.6.
This patch completes the previous one with poller subscribe of quic-conn
owned socket on sendto() error. This ensures that mux-quic is notified
if waiting on sending when a transient sendto() error is cleared. As
such, qc_notify_send() is called directly inside socket I/O callback.
qc_notify_send() internal condition have been thus completed. This will
prevent to notify upper layer until all sending condition are fulfilled:
room in congestion window and no transient error on socket FD.
This should be backported up to 2.7.
On sendto() transient error, prior to this patch sending was simulated
and we relied on retransmission to retry sending. This could hurt
significantly the performance.
Thanks to quic-conn owned socket support, it is now possible to improve
this. On transient error, sending is interrupted and quic-conn socket FD
is subscribed on the poller for sending. When send is possible,
quic_conn_sock_fd_iocb() will be in charge of restart sending.
A consequence of this change is on the return value of qc_send_ppkts().
This function will now return 0 on transient error if quic-conn has its
owned socket. This is used to interrupt sending in the calling function.
The flag QUIC_FL_CONN_TO_KILL must be checked to differentiate a fatal
error from a transient one.
This should be backported up to 2.7.
EBADF on sendto() is considered as a fatal error. As such, it is removed
from the list of the transient errors. The connection will be killed
when encountered.
For the record, EBADF can be encountered on process termination with the
listener socket.
This should be backported up to 2.7.
Send is conducted through qc_send_ppkts() for a QUIC connection. There
is two types of error which can be encountered on sendto() or affiliated
syscalls :
* transient error. In this case, sending is simulated with the remaining
data and retransmission process is used to have the opportunity to
retry emission
* fatal error. If this happens, the connection should be closed as soon
as possible. This is done via qc_kill_conn() function. Until this
patch, only ECONNREFUSED errno was considered as fatal.
Modify the QUIC send API to be able to differentiate transient and fatal
errors more easily. This is done by fixing the return value of the
sendto() wrapper qc_snd_buf() :
* on fatal error, a negative error code is returned. This is now the
case for every errno except EAGAIN, EWOULDBLOCK, ENOTCONN, EINPROGRESS
and EBADF.
* on a transient error, 0 is returned. This is the case for the listed
errno values above and also if a partial send has been conducted by
the kernel.
* on success, the return value of sendto() syscall is returned.
This commit will be useful to be able to handle transient error with a
quic-conn owned socket. In this case, the socket should be subscribed to
the poller and no simulated send will be conducted.
This commit allows errno management to be confined in the quic-sock
module which is a nice cleanup.
On a final note, EBADF should be considered as fatal. This will be the
subject of a next commit.
This should be backported up to 2.7.
The send*() syscall which are responsible of such ICMP packets reception
fails with ECONNREFUSED as errno.
man(7) udp
ECONNREFUSED
No receiver was associated with the destination address.
This might be caused by a previous packet sent over the socket.
We must kill asap the underlying connection.
Must be backported to 2.7.
Some counters could potentially be incremented even if send*() syscall returned
no error when ret >= 0 and ret != sz. This could be the case for instance if
a first call to send*() returned -1 with errno set to EINTR (or any previous syscall
which set errno to a non-null value) and if the next call to send*() returned
something positive and smaller than <sz>.
Must be backported to 2.7 and 2.6.
When receiving a QUIC datagram, destination address is retrieved via
recvmsg() and stored in quic-conn as qc.local_addr. This address is then
reused when using the quic-conn owned socket.
When listener socket mode is preferred, send operation did not specify
the source address of the emitted datagram. If listener socket is bound
on a wildcard address, the kernel is free to choose any address assigned
to the local machine. This may be different from the address selected by
the client on its first datagram which will prevent the client to emit
next replies.
To address this, this patch fixes the UDP source address via sendmsg().
This process is similar to the reception and relies on ancillary
message, so the socket is left untouched after the operation. This is
heavily platform specific and may not be supported by some kernels.
This change has only an impact if listener socket only is used for QUIC
communications. This is the default behavior for 2.7 branch but not
anymore on 2.8. Use tune.quic.socket-owner set to listener to ensure set
it.
This should be backported up to 2.7.
During multiple tests we've already noticed that shared stats counters
have become a real bottleneck under large thread counts. With QUIC it's
pretty visible, with qc_snd_buf() taking 2.5% of the CPU on a 48-thread
machine at only 25 Gbps, and this CPU is entirely spent in the atomic
increment of the byte count and byte rate. It's also visible in H1/H2
but slightly less since we're working with larger buffers, hence less
frequent updates. These counters are exclusively used to report the
byte count in "show info" and the byte rate in the stats.
Let's move them to the thread_ctx struct and make the stats reader
just collect each thread's stats when requested. That's way more
efficient than competing on a single cache line.
After this, qc_snd_buf has totally disappeared from the perf profile
and tests made in h1 show roughly 1% performance increase on small
objects.
Shards were completely forgotten in commit f5a0c8abf ("MEDIUM: quic:
respect the threads assigned to a bind line"). The thread mask is
taken from the bind_conf, but since shards were introduced in 2.5,
the per-listener mask is held by the receiver and can be smaller
than the bind_conf's mask.
The effect here is that the traffic is not distributed to the
appropriate thread. At first glance it's not dramatic since it remains
one of the threads eligible by the bind_conf, but it still means that
in some contexts such as "shards by-thread", some concurrency may
persist on listeners while they're expected to be alone. One identified
impact is that it requires more rxbufs than necessary, but there may
possibly be other not yet identified side effects.
This must be backported to 2.7 and everywhere the commit above is
backported.
UDP addresses may change over time for a QUIC connection. When using
quic-conn owned socket, we have to detect address change to break the
bind/connect association on the socket.
For the moment, on change detected, QUIC connection socket is closed and
a new one is opened. In the future, we may improve this by trying to
keep the original socket and reexecute only bind/connect syscalls.
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
There is a small race condition when QUIC connection socket is
instantiated between the bind() and connect() system calls. This means
that the first datagram read on the sockets may belong to another
connection.
To detect this rare case, we compare the DCID for each QUIC datagram
read on the QUIC socket. If it does not match the connection CID, the
datagram is requeue using quic_receiver_buf to be able to handle it on
the correct thread.
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
This change is the second part for reception on QUIC connection socket.
All operations inside the FD handler has been delayed to quic-conn
tasklet via the new function qc_rcv_buf().
With this change, buffer management on reception has been simplified. It
is now possible to use a local buffer inside qc_rcv_buf() instead of
quic_receiver_buf().
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
Try to use the quic-conn socket for reception if it is allocated. For
this, the socket is inserted in the fdtab. This will call the new
handler quic_conn_io_cb() which is responsible to process the recv()
system call. It will reuse datagram dispatch for simplicity. However,
this is guaranteed to be called on the quic-conn thread, so it will be
more efficient to use a dedicated buffer. This will be implemented in
another commit.
This patch should improve performance by reducing contention on the
receiver socket. However, more gain can be obtained when the datagram
dispatch operation will be skipped.
Older quic_sock_fd_iocb() is renamed to quic_lstnr_sock_fd_iocb() to
emphasize its usage for the receiver socket.
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
If quic-conn has a dedicated socket, use it for sending over the
listener socket. This should improve performance by reducing contention
over the shared listener socket.
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
Allocate quic-conn owned socket if possible. This requires that this is
activated in haproxy configuration. Also, this is done only if local
address is known so it depends on the support of IP_PKTINFO.
For the moment this socket is not used. This causes QUIC support to be
broken as received datagram are not read. This commit will be completed
by a following patch to support recv operation on the newly allocated
socket.
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
Extract individual datagram parsing code outside of datagrams list loop
in quic_lstnr_dghdlr(). This is moved in a new function named
quic_dgram_parse().
To complete this change, quic_lstnr_dghdlr() has been moved into
quic_sock source file : it belongs to QUIC socket lower layer and is
directly called by quic_sock_fd_iocb().
This commit will ease implementation of quic-conn owned socket.
New function quic_dgram_parse() will be easily usable after a receive
operation done on quic-conn IO-cb.
This should be backported up to 2.7.
After reading a datagram, it is enqueud for the thread attached to the
DCID. This is done via quic_lstnr_dgram_dispatch() function. If this
step fails, we remove the datagram from the buffer of quic_receiver_buf.
This step is faulty because we use b_del() instead of b_sub(). If
quic_receiver_buf was not empty, we will remove content from another
datagram while leaving the content of the last read datagram. This
probably produces valid datagram dropping and may even result in crash.
As stated, this bug can only happen if qc_lstnr_dgram_dispatch() fails
which happen on two occaions :
* on quic_dgram allocation failure, which should be pretty rare
* on datagram labelled as invalid for QUIC protocol. This may happen
more frequently depending on the network conditions. Thus, this bug
has been labelled as a medium one.
This should be backported up to 2.6.
Each datagram is received by a random thread and dispatch to its
destination thread linked to the connection. Then, the datagram is
handled by the connection thread. Once this is done, datagram buffer
pointer is atomically set to NULL to mark it as consumed.
Consumed datagrams are purged before recvfrom() invocation on random
receiver threads. The check for NULL buffer must thus be done
atomically. This was not the case before this patch, which may have
triggered race conditions.
This bug has been introduced by commit
91b2305ad7
MINOR: quic: implement datagram cleanup for quic_receiver_buf
This should be backported up to 2.6 after previously mentionned commit.
Add a new counter "quic_rxbuf_full". It is incremented each time
quic_sock_fd_iocb() is interrupted on full buffer.
This should help to debug github issue #1903. It is suspected that
QUIC receiver buffers are full which in turn cause quic_sock_fd_iocb()
to be called repeatedly resulting in a high CPU consumption.
A specialized listener accept was previously used for QUIC. This is now
unneeded and we can revert to the default one session_accept_fd().
One change of importance is that the call order between
conn_xprt_start() and conn_complete_session() is now reverted to the
default one. This means that MUX instance is now NULL during
qc_xprt_start() and its app-ops layer cannot be set here. This operation
has been delayed to qc_init() to prevent a segfault.
This should be backported up to 2.6.
Remove ABORT_NOW() statement on unhandled sendto error. Instead use a
dedicated counter sendto_err_unknown to report these cases.
If we detect increment of this counter, strace can be used to detect
errno value :
$ strace -p $(pidof haproxy) -f -e trace=sendto -Z
This should be backported up to 2.6.
This should help to debug github issue #1903.
Right now the QUIC thread mapping derives the thread ID from the CID
by dividing by global.nbthread. This is a problem because this makes
QUIC work on all threads and ignores the "thread" directive on the
bind lines. In addition, only 8 bits are used, which is no more
compatible with the up to 4096 threads we may have in a configuration.
Let's modify it this way:
- the CID now dedicates 12 bits to the thread ID
- on output we continue to place the TID directly there.
- on input, the value is extracted. If it corresponds to a valid
thread number of the bind_conf, it's used as-is.
- otherwise it's used as a rank within the current bind_conf's
thread mask so that in the end we still get a valid thread ID
for this bind_conf.
The extraction function now requires a bind_conf in order to get the
group and thread mask. It was better to use bind_confs now as the goal
is to make them support multiple listeners sooner or later.
Each time data is read on QUIC receiver socket, we try to reuse the
first datagram of the currently used quic_receiver_buf instead of
allocating a new one. This algorithm is suboptimal if there is several
unused datagrams as only the first one is tested and its buffer removed
from quic_receiver_buf.
If QUIC traffic is quite substential, this can lead to an important
number of quic_dgram occurences allocated from pool_head_quic_dgram and
a lack of free space in allocated quic_receiver_buf buffers.
To improve this, each time we want to reuse a datagram, we pop elements
until a non-yet released datagram is found or the list is empty. All
intermediary elements are freed and the last found datagram can be
reused. This operation has been extracted in a dedicated function named
quic_rxbuf_purge_dgrams().
This should improve memory consumption incured by quic_dgram instances under heavy
QUIC traffic. Note that there is still room for improvement as if the
first datagram is still in use, it may block several unused datagram
after him. However this requires to support removal of datagrams out of
order which is currently not possible.
This should be backported up to 2.6.
QUIC datagrams are read from a random thread. They are then redispatch
to the connection thread according to the first packet DCID. These
operations are implemented through a special buffer designed to avoid
locking.
Refactor this code with the following changes :
* <rxbuf> type is renamed <quic_receiver_buf>. Its list element is also
renamed to highligh its attach point to a receiver.
* <quic_dgram> and <quic_receiver_buf> definition are moved to
quic_sock-t.h. This helps to reduce the size of quic_conn-t.h.
* <quic_dgram> list elements are renamed to highlight their attach point
into a <quic_receiver_buf> and a <quic_dghdlr>.
This should be backported up to 2.6.
Retrieve the frontend destination address for a QUIC connection. This
address is retrieve from the first received datagram and then stored in
the associated quic-conn.
This feature relies on IP_PKTINFO or affiliated flags support on the
socket. This flag is set for each QUIC listeners in
sock_inet_bind_receiver(). To retrieve the destination address,
recvfrom() has been replaced by recvmsg() syscall. This operation and
parsing of msghdr structure has been extracted in a wrapper quic_recv().
This change is useful to finalize the implementation of 'dst' sample
fetch. As such, quic_sock_get_dst() has been edited to return local
address from the quic-conn. As a best effort, if local address is not
available due to kernel non-support of IP_PKTINFO, address of the
listener is returned instead.
This should be backported up to 2.6.
xprt_quic module was too large and did not reflect the true architecture
by contrast to the other protocols in haproxy.
Extract code related to XPRT layer and keep it under xprt_quic module.
This code should only contains a simple API to communicate between QUIC
lower layer and connection/MUX.
The vast majority of the code has been moved into a new module named
quic_conn. This module is responsible to the implementation of QUIC
lower layer. Conceptually, it overlaps with TCP kernel implementation
when comparing QUIC and HTTP1/2 stacks of haproxy.
This should be backported up to 2.6.
Clean up quic sources by adjusting headers list included depending
on the actual dependency of each source file.
On some occasion, xprt_quic.h was removed from included list. This is
useful to help reducing the dependency on this single file and cleaning
up QUIC haproxy architecture.
This should be backported up to 2.6.
The following task/tasklet handlers often appear in "show profiling tasks"
but were not resolved since static:
qc_io_cb, quic_conn_app_io_cb, process_timer,
quic_accept_run, qc_idle_timer_task
This commit simply exports them so they can be resolved now. "process_timer"
which was a bit too generic and renamed to qc_process_timer.
These fake datagrams are only used by the low level I/O handler. They
are not provided to the "by connection" datagram handlers. This
is why they are not MT_LIST_APPEND()ed to the listner RX buffer list
(see &quic_dghdlrs[cid_tid].dgrams in quic_lstnr_dgram_dispatch().
Replace the call to pool_zalloc() to by the lighter call to pool_malloc()
and initialize only the ->buf and ->length members. This is safe because
only these fields are inspected by the low level I/O handler.
The function processes packets sent by other threads in the current
thread's queue. But if, for any reason, other threads write faster
than the current one processes, this can lead to a situation where
the function never returns.
It seems that it might be what's happening in issue #1808, though
unfortunately, this function is one of the rare without traces. But
the amount of calls to functions like qc_lstnr_pkt_rcv() on a single
thread seems to indicate this possibility.
Thanks to Tristan for his efforts in collecting extremely precious
traces!
This likely needs to be backported to 2.6.
qc_snd_buf returned a size_t which means that it was never negative
despite its documentation. Thus the caller who checked for this was
never informed of a sendto error.
Clean this by changing the return value of qc_snd_buf() to an integer.
A 0 is returned on success. Every other values are considered as an
error.
This commit should be backported up to 2.6. Note that to not cause
malfunctions, it must be backported after the previous patch :
906b058954
MINOR: quic: explicitely ignore sendto error
This is to ensure that a sendto error does not cause send to be
interrupted which may cause a stalled transfer without a proper retry
mechanism.
The impact of this bug seems null as caller explicitely ignores sendto
error. However this part of code seems to be subject to strange issues
and it may fix them in part. It may be of interest for github issue #1808.
There is a remaining loop in this ugly qc_snd_buf() function which could
lead haproxy to send truncated UDP datagrams. For now on, we send
a complete UDP datagram or nothing!
Must be backported to 2.6.
First we add a loop around recfrom() into the most low level I/O handler
quic_sock_fd_iocb() to collect as most as possible datagrams before during
its tasklet wakeup with a limit: we recvfrom() at most "maxpollevents"
datagrams. Furthermore we add a local task list into the datagram handler
quic_lstnr_dghdlr() which is passed to the first datagrams parser qc_lstnr_pkt_rcv().
This latter parser only identifies the connection associated to the datagrams then
wakeup the highest level packet parser I/O handlers (quic_conn.*io_cb()) after
it is done, thanks to the call to tasklet_wakeup_after() which replaces from now on
the call to tasklet_wakeup(). This should reduce drastically the latency and the
chances to fulfil the RX buffers at the QUIC connections level as reported in
GH #1737 by Tritan.
These modifications depend on this commit:
"MINOR: task: Add tasklet_wakeup_after()"
Must be backported to 2.6 with the previous commit.
This previous commit:
"BUG/MAJOR: Big RX dgrams leak when fulfilling a buffer"
partially fixed an RX dgram memleak. There is a missing break in the loop which
looks for the first datagram attached to an RX buffer dgrams list which may be
reused (because consumed by the connection thread). So when several dgrams were
consumed by the connection thread and are present in the RX buffer list, some are
leaked because never reused for ever. They are removed for their list.
Furthermore, as commented in this patch, there is always at least one dgram
object attached to an RX dgrams list, excepted the first time we enter this
I/O handler function for this RX buffer. So, there is no need to use a loop
to lookup and reuse the first datagram in an RX buffer dgrams list.
This isssue was reproduced with quiche client with plenty of POST requests
(100000 streams):
cargo run --bin quiche-client -- https://127.0.0.1:8080/helloworld.html
--no-verify -n 100000 --method POST --body /var/www/html/helloworld.html
and could be reproduce with GET request.
This bug was reported by Tristan in GH #1749.
Must be backported to 2.6.
When entering quic_sock_fd_iocb() I/O handler which is responsible
of recvfrom() datagrams, the first thing which is done it to try
to reuse a dgram object containing metadata about the received
datagrams which has been consumed by the connection thread.
If this object could not be used for any reason, so when we
"goto out" of this function, we must release the memory allocated
for this objet, if not it will leak. Most of the time, this happened
when we fulfilled a buffer as reported in GH #1749 by Tristan. This is why we
added a pool_free() call just before the out label. We mark <new_dgram> as NULL
when it successfully could be used.
Thank you for Tristan and Willy for their participation on this issue.
Must be backported to 2.6.
After having fulfilled a buffer, then marked it as full, we must
consume the remaining space. But to do that, and not to erase the
already existing data, we must check there is not remaining data in
after the tail of the buffer (between the tail and the head).
This is done adding a condition to test that adding the number of
bytes from the remaining contiguous space to the tail does not
pass the wrapping postion in the buffer.
Must be backported to 2.6.
If an unlisted errno is reported, abort the process. If a crash is
reported on this condition, we must determine if the error code is a
bug, should interrupt emission on the fd or if we can retry the syscall.
If sendto returns an error, we should not retry the call and break from
the sending loop. An exception is made for EINTR which allows to retry
immediately the syscall.
This bug caused an infinite loop reproduced when the process is in the
closing state by SIGUSR1 but there is still QUIC data emission left.
As we do not have any task to be wake up by the poller after sendto() error,
we add an sendto() error counter to the quic_conn struct.
Dump its values from qc_send_ppkts().
Just like for the conn_stream, now that these addresses are dynamically
allocated, there is no single case where the pointer is set without the
corresponding flag, and the flag is used as a permission to dereference
the pointer. Let's just replace the test of the flag with a test of the
pointer and remove all flag assignment. This makes the code clearer
(especially in "if" conditions) and saves the need for future code to
think about properly setting the flag after setting the pointer.
Some older systems may routinely return EWOULDBLOCK for some syscalls
while we tend to check only for EAGAIN nowadays. Modern systems define
EWOULDBLOCK as EAGAIN so that solves it, but on a few older ones (AIX,
VMS etc) both are different, and for portability we'd need to test for
both or we never know if we risk to confuse some status codes with
plain errors.
There were few entries, the most annoying ones are the switch/case
because they require to only add the entry when it differs, but the
other ones are really trivial.