Go to file
Willy Tarreau cd8914bc52 BUG/MAJOR: fd/threads: close a race on closing connections after takeover
As mentioned in commit 237e6a0d6 ("BUG/MAJOR: fd/thread: fix race between
updates and closing FD"), a race was found during stress tests involving
heavy backend connection reuse with many competing closes.

Here the problem is complex. The analysis in commit f69fea64e ("MAJOR:
fd: get rid of the DWCAS when setting the running_mask") that removed
the DWCAS in 2.5 overlooked a few races.

First, a takeover from thread1 could happen just after fd_update_events()
in thread2 validates it holds the tmask bit in the CAS loop. Since thread1
releases running_mask after the operation, thread2 will succeed the CAS
and both will believe the FD is theirs. This does explain the occasional
crashes seen with h1_io_cb() being called on a bad context, or
sock_conn_iocb() seeing conn->subs vanish after checking it. This issue
can be addressed using a DWCAS in both fd_takeover() and fd_update_events()
as it was before the patch above but this is not portable to all archs and
is not easy to adapt for those lacking it, due to some operations still
happening only on individual masks after the thread groups were added.

Second, the checks after fd_clr_running() for the current thread being
the last one is not sufficient: at the exact moment the operation
completes, another thread may also set and drop the running bit and see
itself as alone, and both can call _fd_close_orphan() in parallel. In
order to prevent this from happening, we cannot rely on the absence of
others, we need an explicit flag indicating that the FD must be closed.
One approach that was attempted consisted in playing with the thread_mask
but that was not reliable since it could still match between the late
deletion and the early insertion that follows. Instead, a new FD flag
was added, FD_MUST_CLOSE, that exactly indicates that the call to
_fd_delete_orphan() must be done. It is set by fd_delete(), and
atomically cleared by the first one which checks it, and which is the
only one to call _fd_delete_orphan().

With both points addressed, there's no more visible race left:

- takeover() only happens under the connection list's lock and cannot
  compete with fd_delete() since fd_delete() must first remove the
  connection from the list before deleting the FD. That's also why it
  doesn't need to call _fd_delete_orphan() when dropping its running
  bit.

- takeover() sets its running bit then atomically replaces the thread
  mask, so that until that's done, it doesn't validate the condition
  to end the synchonization loop in fd_update_events(). Once it's OK,
  the previous thread's bit is lost, and this is checked for in
  fd_update_events()

- fd_update_events() can compete with fd_delete() at various places
  which are explained above. Since fd_delete() clears the thread mask
  as after setting its running bit and after setting the FD_MUST_CLOSE
  bit, the synchronization loop guarantees that the thread mask is seen
  before going further, and that once it's seen, the FD_MUST_CLOSE flag
  is already present.

- fd_delete() may start while fd_update_events() has already started,
  but fd_delete() must hold a bit in thread_mask before starting, and
  that is checked by the first test in fd_update_events() before setting
  the running_mask.

- the poller's _update_fd() will not compete against _fd_delete_orphan()
  nor fd_insert() thanks to the fd_grab_tgid() that's always done before
  updating the polled_mask, and guarantees that we never pretend that a
  polled_mask has a bit before the FD is added.

The issue is very hard to reproduce and is extremely time-sensitive.
Some tests were required with a 1-ms timeout with request rates
closely matching 1 kHz per server, though certain tests sometimes
benefitted from saturation. It was found that adding the following
slowdown at a few key places helped a lot and managed to trigger the
bug in 0.5 to 5 seconds instead of tens of minutes on a 20-thread
setup:

    { volatile int i = 10000; while (i--); }

Particularly, placing it at key places where only one of running_mask
or thread_mask is set and not the other one yet (e.g. after the
synchronization loop in fd_update_events or after dropping the
running bit) did yield great results.

Many thanks to Olivier Houchard for this expert help analysing these
races and reviewing candidate fixes.

The patch must be backported to 2.5. Note that 2.6 does not have tgid
in FDs, and that it requires a change of output on fd_clr_running() as
we need the previous bit. This is provided by carefully backporting
commit d6e1987612 ("MINOR: fd: make fd_clr_running() return the previous
value instead"). Tests have shown that the lack of tgid is a showstopper
for 2.6 and that unless a better workaround is found, it could still be
preferable to backport the minimum pieces required for fd_grab_tgid()
to 2.6 so that it stays stable long.
2023-03-09 14:01:48 +01:00
.github CI: Reformat matrix.py using black 2023-01-03 16:28:34 +01:00
addons MINOR: stconn: Always report READ/WRITE event on shutr/shutw 2023-02-22 15:59:16 +01:00
admin BUILD: halog: fix missing double-quote at end of help line 2022-11-25 11:11:41 +01:00
dev MEDIUM: ring: make the offset relative to the head/tail instead of absolute 2023-02-24 09:26:30 +01:00
doc MINOR: jwt: Add support for RSA-PSS signatures (PS256 algorithm) 2023-03-08 10:43:04 +01:00
examples EXAMPLES: remove completely outdated acl-content-sw.cfg 2022-05-30 18:14:24 +02:00
include BUG/MAJOR: fd/threads: close a race on closing connections after takeover 2023-03-09 14:01:48 +01:00
reg-tests MINOR: jwt: Add support for RSA-PSS signatures (PS256 algorithm) 2023-03-08 10:43:04 +01:00
scripts SCRIPTS: run-regtests: add a version check 2022-11-30 18:44:33 +01:00
src BUG/MAJOR: fd/threads: close a race on closing connections after takeover 2023-03-09 14:01:48 +01:00
tests TESTS: add a unit test for one_among_mask() 2022-06-21 20:29:57 +02:00
.cirrus.yml CI: cirrus-ci: bump FreeBSD image to 13-1 2022-09-09 13:30:17 +02:00
.gitattributes MINOR: Configure the cpp userdiff driver for *.[ch] in .gitattributes 2021-02-22 18:17:57 +01:00
.gitignore CLEANUP: exclude udp-perturb with .gitignore 2022-09-16 15:47:04 +02:00
.mailmap DOC: update Tim's address in .mailmap 2021-09-16 09:14:14 +02:00
.travis.yml CI: travis-ci: temporarily disable arm64 builds 2021-08-07 07:28:15 +02:00
BRANCHES DOC: fix some spelling issues over multiple files 2021-01-08 14:53:47 +01:00
CHANGELOG [RELEASE] Released version 2.8-dev4 2023-02-14 16:55:17 +01:00
CONTRIBUTING CLEANUP: assorted typo fixes in the code and comments 2021-08-16 12:37:59 +02:00
INSTALL MINOR: version: mention that it's development again 2022-12-01 15:24:10 +01:00
LICENSE LICENSE: add licence exception for OpenSSL 2012-09-07 13:52:26 +02:00
MAINTAINERS CLEANUP: assorted typo fixes in the code and comments 2022-11-30 14:02:36 +01:00
Makefile BUILD: makefile: fix PCRE overriding specific lib path 2023-02-03 09:42:49 +01:00
README DOC: create a BRANCHES file to explain the life cycle 2019-06-15 22:00:14 +02:00
SUBVERS BUILD: use format tags in VERDATE and SUBVERS files 2013-12-10 11:22:49 +01:00
VERDATE [RELEASE] Released version 2.8-dev4 2023-02-14 16:55:17 +01:00
VERSION [RELEASE] Released version 2.8-dev4 2023-02-14 16:55:17 +01:00

The HAProxy documentation has been split into a number of different files for
ease of use.

Please refer to the following files depending on what you're looking for :

  - INSTALL for instructions on how to build and install HAProxy
  - BRANCHES to understand the project's life cycle and what version to use
  - LICENSE for the project's license
  - CONTRIBUTING for the process to follow to submit contributions

The more detailed documentation is located into the doc/ directory :

  - doc/intro.txt for a quick introduction on HAProxy
  - doc/configuration.txt for the configuration's reference manual
  - doc/lua.txt for the Lua's reference manual
  - doc/SPOE.txt for how to use the SPOE engine
  - doc/network-namespaces.txt for how to use network namespaces under Linux
  - doc/management.txt for the management guide
  - doc/regression-testing.txt for how to use the regression testing suite
  - doc/peers.txt for the peers protocol reference
  - doc/coding-style.txt for how to adopt HAProxy's coding style
  - doc/internals for developer-specific documentation (not all up to date)