Go to file
Willy Tarreau 6ed242ece6 BUG/MEDIUM: connection: close a rare race between idle conn close and takeover
The takeover of idle conns between threads is particularly tricky, for
two reasons:
  - there's no way to atomically synchronize kernel-side polling with
    userspace activity, so late events will always be reported for some
    FDs just migrated ;

  - upon error, an FD may be immediately reassigned to whatever other
    thread since it's process-wide.

The current model uses the FD's thread_mask to figure if an FD still
ought to be reported or not, and a per-thread idle connection queue
from which eligible connections are atomically added/picked. I/Os
coming from the bottom for such a connection must remove it from the
list so that it's not elected. Same for timeout tasks and iocbs. And
these last ones check their context under the idle_conn lock to judge
if they're still allowed to run.

One rare case was omitted: the wake() callback. This one is rare, it
may serve to notify about finalized connect() calls that are not being
polled, as well as unhandled shutdowns and errors. This callback was
not protected till now because it wasn't seen as sensitive, but there
exists a particular case where it may be called without protectoin in
parallel to a takeover. This happens in the following sequence:

  - thread T1 wants to establish an outgoing connection
  - the connect() call returns EINPROGRESS
  - the poller adds it using epoll_ctl()
  - epoll_wait() reports it, connect() is done. The connection is
    not being marked as actively polled anymore but is still known
    from the poller.
  - the request is sent over that connection using send(), which
    queues to system buffers while data are being delivered
  - the scheduler switches to other tasks
  - the request is physically sent
  - the server responds
  - the stream is notified that send() succeeded, and makes progress,
    trying to recv() from that connection
  - the recv() succeeds, the response is delivered
  - the poller doesn't need to be touched (still no active polling)
  - the scheduler switches to other tasks
  - the server closes the connection
  - the poller on T1 is notified of the SHUTR and starts to call mux->wake()
  - another thread T2 takes over the connection
  - T2 continues to run inside wake() and releases the connection
  - T2 is just dereferencing it.
  - BAM.

The most logical solution here is to surround the call to wake() with an
atomic removal/insert of the connection from/into the idle conns lists.
This way, wake() is guaranteed to run alone. Any other poller reporting
the FD will not have its tid_bit in the thread_mask si will not bother
it. Another thread trying a takeover will not find this connection. A
task or tasklet being woken up late will either be on the same thread,
or be called on another one with a NULL context since it will be the
consequence of previous successful takeover, and will be ignored. Note
that the extra cost of a lock and tree access here have a low overhead
which is totally amortized given that these ones roughly happen 1-2
times per connection at best.

While it was possible to crash the process after 10-100k req using H2
and a hand-refined configuration achieving perfect synchronism between
a long (20+) chain of proxies and a short timeout (1ms), now with that
fix this never happens even after 10M requests.

Many thanks to Olivier for proposing this solution and explaining why
it works.

This should be backported as far as 2.2 (when inter-thread takeover was
introduced). The code in older versions will be found in conn_fd_handler().
A workaround consists in disabling inter-thread pool sharing using:

    tune.idle-pool.shared off
2021-07-30 08:34:38 +02:00
.github DOC: Replace issue templates by issue forms 2021-06-24 04:15:04 +02:00
addons BUG/MEDIUM: opentracing: initialization before establishing daemon and/or chroot mode 2021-06-10 06:45:39 +02:00
admin BUG/MINOR: systemd: must check the configuration using -Ws 2021-07-26 11:03:54 +02:00
dev CLEANUP: dev/flags: remove useless test in the stdin number parser 2021-04-03 15:29:10 +02:00
doc MEDIUM: connection: Add option to disable legacy error log 2021-07-29 15:40:45 +02:00
examples MEDIUM: proxy: remove long-broken 'option http_proxy' 2021-07-18 19:35:32 +02:00
include BUG/MEDIUM: connection: close a rare race between idle conn close and takeover 2021-07-30 08:34:38 +02:00
reg-tests REGTESTS: ssl: ssl_errors.vtc does not work with old openssl version 2021-07-29 16:00:24 +02:00
scripts CI: ssl: keep the old method for ancient OpenSSL versions 2021-06-17 15:40:53 +02:00
src MEDIUM: connection: Add option to disable legacy error log 2021-07-29 15:40:45 +02:00
tests MINOR: config: reject long-deprecated "option forceclose" 2021-06-11 16:57:34 +02:00
.cirrus.yml CI: introduce scripts/build-vtest.sh for installing VTest 2021-05-18 10:48:30 +02:00
.gitattributes MINOR: Configure the cpp userdiff driver for *.[ch] in .gitattributes 2021-02-22 18:17:57 +01:00
.gitignore ADDONS: make addons/ discoverable by git via .gitignore 2021-05-07 16:48:14 +02:00
.travis.yml CI: introduce scripts/build-vtest.sh for installing VTest 2021-05-18 10:48:30 +02:00
BRANCHES DOC: fix some spelling issues over multiple files 2021-01-08 14:53:47 +01:00
CHANGELOG [RELEASE] Released version 2.5-dev2 2021-07-17 12:35:11 +02:00
CONTRIBUTING CLEANUP: contrib: remove the last references to the now dead contrib/ directory 2021-04-21 15:13:58 +02:00
INSTALL CLEANUP: shctx: remove the different inter-process locking techniques 2021-06-15 16:52:42 +02:00
LICENSE LICENSE: add licence exception for OpenSSL 2012-09-07 13:52:26 +02:00
MAINTAINERS CONTRIB: move spoa_example out of the tree 2021-04-21 09:39:06 +02:00
Makefile REORG: config: move the condition preprocessing code to its own file 2021-07-16 19:18:41 +02:00
README DOC: create a BRANCHES file to explain the life cycle 2019-06-15 22:00:14 +02:00
ROADMAP DOC: update the outdated ROADMAP file 2019-06-15 21:59:54 +02:00
SUBVERS BUILD: use format tags in VERDATE and SUBVERS files 2013-12-10 11:22:49 +01:00
VERDATE [RELEASE] Released version 2.5-dev2 2021-07-17 12:35:11 +02:00
VERSION [RELEASE] Released version 2.5-dev2 2021-07-17 12:35:11 +02:00

The HAProxy documentation has been split into a number of different files for
ease of use.

Please refer to the following files depending on what you're looking for :

  - INSTALL for instructions on how to build and install HAProxy
  - BRANCHES to understand the project's life cycle and what version to use
  - LICENSE for the project's license
  - CONTRIBUTING for the process to follow to submit contributions

The more detailed documentation is located into the doc/ directory :

  - doc/intro.txt for a quick introduction on HAProxy
  - doc/configuration.txt for the configuration's reference manual
  - doc/lua.txt for the Lua's reference manual
  - doc/SPOE.txt for how to use the SPOE engine
  - doc/network-namespaces.txt for how to use network namespaces under Linux
  - doc/management.txt for the management guide
  - doc/regression-testing.txt for how to use the regression testing suite
  - doc/peers.txt for the peers protocol reference
  - doc/coding-style.txt for how to adopt HAProxy's coding style
  - doc/internals for developer-specific documentation (not all up to date)