MAJOR: leastconn: postpone the server's repositioning under contention

When leastconn is used under many threads, there can be a lot of
contention on the lbprm tree lock, because the same node has to be
moved around all the time (when picking the server and when releasing
it). In GH issue #2861 it was noticed that 46 threads out of 64 were
waiting on the same lock in fwlc_srv_reposition().
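
For context, both the pick path and the release path funnel into the
same repositioning function. A minimal sketch of that shape (the
callback names below match lb_fwlc.c, but treat this as an
illustration rather than the exact code):

  /* Every connection taken from or returned to a server moves the
   * same eb32 node under the same lbprm lock.
   */
  static void fwlc_srv_take_conn(struct server *s)
  {
      fwlc_srv_reposition(s);  /* key grows: one more active connection */
  }

  static void fwlc_srv_drop_conn(struct server *s)
  {
      fwlc_srv_reposition(s);  /* key shrinks: one connection released */
  }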

In such a case, the accuracy of the server's key becomes quite irrelevant
because nobody cares if the same server is picked twice in a row and the
next one twice again.

While other approaches in the past considered using a floating key to
avoid moving the server each time (which was not compatible with the
round-robin rule for equal keys), here a more drastic solution is needed.
What we do instead is turn this lock acquisition into a trylock. If we
can grab it, we do the job. If we can't, we just wake up a tasklet
dedicated to this on the server. That tasklet will then try again
slightly later, knowing that during this short time frame, the
server's position in the queue is slightly inaccurate. Note that any
thread touching the same server will also reposition it, saving that
work for the next time. And if multiple threads wake the tasklet up,
that's fine: their calls will be merged and a single lock will be
taken in the end.
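
The tasklet's handler is not part of the excerpt below, so here is
only a hypothetical sketch of its shape (the name fwlc_srv_requeue and
the body are assumptions; only s->requeue_tasklet and tasklet_wakeup()
appear in the patch):

  /* Hypothetical handler for s->requeue_tasklet: multiple wakeups
   * while it is queued merge into a single run, which takes the lock
   * for real and performs the reposition the fast path skipped.
   */
  static struct task *fwlc_srv_requeue(struct task *t, void *ctx,
                                       unsigned int state)
  {
      struct server *s = ctx;

      HA_RWLOCK_WRLOCK(LBPRM_LOCK, &s->proxy->lbprm.lock);
      /* recompute the key and move s->lb_node in the tree here */
      HA_RWLOCK_WRUNLOCK(LBPRM_LOCK, &s->proxy->lbprm.lock);
      return NULL;
  }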

Testing this on a 24-core EPYC 74F3 showed a significant performance
boost, from 382krps to 610krps. The share of fwlc_srv_reposition() in
the profile reported by perf top dropped from 43% to 2.5%:

Before:
  Overhead  Shared Object             Symbol
    43.46%  haproxy-master-inlineebo  [.] fwlc_srv_reposition
    21.20%  haproxy-master-inlineebo  [.] fwlc_get_next_server
     0.91%  haproxy-master-inlineebo  [.] process_stream
     0.75%  [kernel]                  [k] ice_napi_poll
     0.51%  [kernel]                  [k] tcp_recvmsg
     0.50%  [kernel]                  [k] ice_start_xmit
     0.50%  [kernel]                  [k] tcp_ack

After:
  Overhead  Shared Object             Symbol
    30.37%  haproxy                   [.] fwlc_get_next_server
     2.51%  haproxy                   [.] fwlc_srv_reposition
     1.91%  haproxy                   [.] process_stream
     1.46%  [kernel]                  [k] ice_napi_poll
     1.36%  [kernel]                  [k] tcp_recvmsg
     1.04%  [kernel]                  [k] tcp_ack
     1.00%  [kernel]                  [k] skb_release_data
     0.96%  [kernel]                  [k] ice_start_xmit
     0.91%  haproxy                   [.] conn_backend_get
     0.82%  haproxy                   [.] connect_server
     0.82%  haproxy                   [.] run_tasks_from_lists

On an Ampere Altra with 64 aarch64 cores dedicated to haproxy, the
gain is even more visible (3.6x):

  Before: 311-323k rps, 3.16-3.25ms, 6400% CPU
  Overhead  Shared Object     Symbol
    55.69%  haproxy-master    [.] fwlc_srv_reposition
    33.30%  haproxy-master    [.] fwlc_get_next_server
     0.89%  haproxy-master    [.] process_stream
     0.45%  haproxy-master    [.] h1_snd_buf
     0.34%  haproxy-master    [.] run_tasks_from_lists
     0.32%  haproxy-master    [.] connect_server
     0.31%  haproxy-master    [.] conn_backend_get
     0.31%  haproxy-master    [.] h1_headers_to_hdr_list
     0.24%  haproxy-master    [.] srv_add_to_idle_list
     0.23%  haproxy-master    [.] http_request_forward_body
     0.22%  haproxy-master    [.] __pool_alloc
     0.21%  haproxy-master    [.] http_wait_for_response
     0.21%  haproxy-master    [.] h1_send

  After: 1.21M rps, 0.842ms, 6400% CPU
  Overhead  Shared Object     Symbol
    17.44%  haproxy           [.] fwlc_get_next_server
     6.33%  haproxy           [.] process_stream
     4.40%  haproxy           [.] fwlc_srv_reposition
     3.64%  haproxy           [.] conn_backend_get
     2.75%  haproxy           [.] connect_server
     2.71%  haproxy           [.] h1_snd_buf
     2.66%  haproxy           [.] srv_add_to_idle_list
     2.33%  haproxy           [.] run_tasks_from_lists
     2.14%  haproxy           [.] h1_headers_to_hdr_list
     1.56%  haproxy           [.] stream_set_backend
     1.37%  haproxy           [.] http_request_forward_body
     1.35%  haproxy           [.] http_wait_for_response
     1.34%  haproxy           [.] h1_send

And at a similar load, the CPU usage drops considerably (3.55x), as
does the response time (10x):

  After: 320k rps, 0.322ms, 1800% CPU
  Overhead  Shared Object     Symbol
     7.62%  haproxy           [.] process_stream
     4.64%  haproxy           [.] h1_headers_to_hdr_list
     3.09%  haproxy           [.] h1_snd_buf
     3.08%  haproxy           [.] h1_process_demux
     2.22%  haproxy           [.] __pool_alloc
     2.14%  haproxy           [.] connect_server
     1.87%  haproxy           [.] h1_send
     1.84%  haproxy           [.] fwlc_srv_reposition
     1.84%  haproxy           [.] run_tasks_from_lists
     1.77%  haproxy           [.] sock_conn_iocb
     1.75%  haproxy           [.] srv_add_to_idle_list
     1.66%  haproxy           [.] http_request_forward_body
     1.65%  haproxy           [.] wake_expired_tasks
     1.59%  haproxy           [.] h1_parse_msg_hdrs
     1.51%  haproxy           [.] http_wait_for_response
     1.50%  haproxy           [.] fwlc_get_next_server

The cost of fwlc_get_next_server() naturally increases as the server
count increases, but it now has no visible effect on updates. The load
distribution remains unchanged compared to the previous approach, with
server weights still being respected.
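
The weights are honored through the key itself; as a sketch of the
idea (the exact expression and the SRV_EWGHT_MAX scale are
version-dependent details, so this is illustrative only):

  /* A server with twice the effective weight needs twice the inflight
   * connections to reach the same key, so always picking the smallest
   * key spreads the load proportionally to the weights.
   */
  static unsigned int fwlc_key(unsigned int inflight, unsigned int eweight)
  {
      return inflight ? (inflight + 1) * SRV_EWGHT_MAX / eweight : 0;
  }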

For further improvements to the fwlc algo, please consult GitHub
issue #881, which centralizes everything related to this algorithm.
Author: Willy Tarreau
Date:   2025-02-11 17:24:19 +01:00
Parent: b6a8318cc2
Commit: 627280e15f

@@ -15,6 +15,7 @@
 #include <haproxy/backend.h>
 #include <haproxy/queue.h>
 #include <haproxy/server-t.h>
+#include <haproxy/task.h>
 
 /* Remove a server from a tree. It must have previously been dequeued. This
@@ -80,7 +81,19 @@ static void fwlc_srv_reposition(struct server *s)
 	if (s->lb_node.node.leaf_p && eweight && s->lb_node.key == new_key)
 		return;
 
-	HA_RWLOCK_WRLOCK(LBPRM_LOCK, &s->proxy->lbprm.lock);
+	if (HA_RWLOCK_TRYWRLOCK(LBPRM_LOCK, &s->proxy->lbprm.lock) != 0) {
+		/* there's already some contention on the tree's lock, there's
+		 * no point insisting. Better wake up the server's tasklet that
+		 * will let this or another thread retry later. For the time
+		 * being, the server's apparent load is slightly inaccurate but
+		 * we don't care, if there is contention, it will self-regulate.
+		 */
+		if (s->requeue_tasklet)
+			tasklet_wakeup(s->requeue_tasklet);
+		return;
+	}
+
+	/* below we've got the lock */
 	if (s->lb_tree) {
 		/* we might have been waiting for a while on the lock above
 		 * so it's worth testing again because other threads are very