Willy Tarreau 21447b1dd4 BUG/MAJOR: stick-tables: fix race with peers in entry expiration
In 2.9 with commit 7968fe3889 ("MEDIUM: stick-table: change the ref_cnt
atomically") we significantly relaxed the stick-tables locking when
dealing with peers by adjusting the ref_cnt atomically and moving it
out of the lock.

However it opened a tiny window that became problematic in 3.0-dev7
when the table's contention was lowered by commit 1a088da7c2 ("MAJOR:
stktable: split the keys across multiple shards to reduce contention").

What happens is that some peers may access the entry for reading at
the moment it's about to expire, and while the read accesses to push
the data remain unnoticed (possibly that from time to time we push
crap), but the releasing of the refcount causes a new write that may
damage anything else. The scenario is the following:

  process_table_expire()               peer_send_teachmsgs()

                                       RDLOCK(&updt_lock);
     tick_is_expired() != 0

     ebmb_delete(ts->key);

     if (ts->upd.node.leaf_p) {
                                       HA_ATOMIC_INC(&ts->ref_cnt);
                                       RDUNLOCK(&updt_lock);
          WRLOCK(&updt_lock);
          eb32_delete(&ts->upd);
     }
     __stksess_free(t, ts);
                                       peer_send_updatemsg(ts);
                                       RDLOCK(&updt_lock);
                                       HA_ATOMIC_DEC(&ts->ref_cnt);

Here it's clear that the bottom part of peer_send_teachmsgs() believes
to be protected but may act on freed data.

This is more visible when enabling -dMtag,no-merge,integrity because
the ATOMIC_DEC(&ref_cnt) decrements one byte in the area, that makes
the eviction check fail while the tag has the address of the left
__stksess_free(), proving a completed pool_free() before the decrement,
and the anomaly there is pretty visible in the crash dump. Changing
INC()/DEC() with ADD(2)/DEC(2) shows that the byte is now off by two,
confirming that the operation happened there.

The solution is not very hard, it consists in checking for the ref_cnt
on the left after grabbing the lock, and doing both before deleting the
element, so that we have the guarantee that either the peer will not
take it or that it has already started taking it.

This was proven to be sufficient, as instead of crashing after 3s of
injection with 4 peers, 16 threads and 130k RPS, it survived for 15mn.

In order to stress the setup, a config involving 4+ peers, tracking
HTTP request with randoms and applying a bwlim-out filter with a
random key, with a client made of 160 h2 conns downloading 10 streams
of 4MB objects in parallel managed to trigger it within a few seconds:

  frontend ft
    http-request track-sc0 rand(100000) table tbl
    filter bwlim-out lim-out limit 2047m key rand(100000000),ipmask(32) min-size 1 table tbl
    http-request set-bandwidth-limit lim-out
    use_backend bk

  backend bk
    server s1 198.18.0.30:8000
    server s2 198.18.0.34:8000

  backend tbl
        stick-table type ip size 1000k expire 1s store http_req_cnt,bytes_in_rate(1s),bytes_out_rate(1s) peers peers

This seems to be very dependent on the timing and setup though.

This will need to be backported to 2.9. This part of the code was
reindented with shards but the block should remain mostly unchanged.
The logic to apply is the same.
2024-04-12 18:00:13 +02:00
2021-09-16 09:14:14 +02:00
2024-04-06 17:02:07 +02:00
2024-04-06 17:02:07 +02:00
2024-04-06 17:02:07 +02:00

The HAProxy documentation has been split into a number of different files for
ease of use.

Please refer to the following files depending on what you're looking for :

  - INSTALL for instructions on how to build and install HAProxy
  - BRANCHES to understand the project's life cycle and what version to use
  - LICENSE for the project's license
  - CONTRIBUTING for the process to follow to submit contributions

The more detailed documentation is located into the doc/ directory :

  - doc/intro.txt for a quick introduction on HAProxy
  - doc/configuration.txt for the configuration's reference manual
  - doc/lua.txt for the Lua's reference manual
  - doc/SPOE.txt for how to use the SPOE engine
  - doc/network-namespaces.txt for how to use network namespaces under Linux
  - doc/management.txt for the management guide
  - doc/regression-testing.txt for how to use the regression testing suite
  - doc/peers.txt for the peers protocol reference
  - doc/coding-style.txt for how to adopt HAProxy's coding style
  - doc/internals for developer-specific documentation (not all up to date)
Description
No description provided
Readme 51 MiB
Languages
C 98.1%
Shell 0.8%
Makefile 0.5%
Lua 0.2%
Python 0.2%