From a727c6eaa54f95e72a45b98c2d5ff9d89ac54448 Mon Sep 17 00:00:00 2001
From: Willy Tarreau
Date: Thu, 18 Sep 2025 15:08:12 +0200
Subject: [PATCH] OPTIM: ring: check the queue's owner using a CAS on x86

In the loop where the queue's leader tries to get the tail lock, we
also need to check if another thread took ownership of the queue the
current thread is working for. This is currently done using an atomic
load.

Tests show that on x86, using a CAS for this is much more efficient,
because it keeps the cache line in exclusive state for a few more
cycles, which lets the queue release call after the loop complete
without having to wait again. The measured gain is +5% for 128 threads
on a 64-core AMD system (11.08M msg/s vs 10.56M).

However, ARM loses about 1% on this, and we cannot afford that on
machines without a fast CAS anyway, so the load is performed using a
CAS only on x86_64. It might not be as efficient on low-end models,
but we don't care since those are not the ones dealing with high
contention.
---
 src/ring.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/src/ring.c b/src/ring.c
index 8a97b37c0..79f023aa1 100644
--- a/src/ring.c
+++ b/src/ring.c
@@ -275,7 +275,18 @@ ssize_t ring_write(struct ring *ring, size_t maxlen, const struct ist pfx[], siz
 	 */
 	while (1) {
-		if ((curr_cell = HA_ATOMIC_LOAD(ring_queue_ptr)) != &cell)
+#if defined(__x86_64__)
+		/* read using a CAS on x86, as it will keep the cache line
+		 * in exclusive state for a few more cycles that will allow
+		 * us to release the queue without waiting after the loop.
+		 */
+		curr_cell = &cell;
+		HA_ATOMIC_CAS(ring_queue_ptr, &curr_cell, curr_cell);
+#else
+		curr_cell = HA_ATOMIC_LOAD(ring_queue_ptr);
+#endif
+		/* give up if another thread took the leadership of the queue */
+		if (curr_cell != &cell)
 			goto wait_for_flush;
 
 		/* OK the queue is locked, let's attempt to get the tail lock.
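
As a standalone illustration of the load-via-CAS trick, here is a minimal
C11 sketch; atomic_compare_exchange_strong() stands in for HA_ATOMIC_CAS(),
and the still_owner() helper and struct cell type are illustrative only,
they do not appear in src/ring.c:

#include <stdatomic.h>
#include <stdbool.h>

struct cell { int dummy; };

/* Returns true while *queue_head still points to our own cell, i.e.
 * while no other thread has taken over the queue.
 */
static bool still_owner(_Atomic(struct cell *) *queue_head,
                        struct cell *my_cell)
{
#if defined(__x86_64__)
	/* CAS with expected == desired == my_cell: when we still own the
	 * queue, the CAS stores the value that is already there, but in
	 * doing so it acquires the cache line in exclusive state, so the
	 * release store that follows the loop does not stall on another
	 * coherency round-trip. When ownership has changed, the CAS
	 * fails and `expected` receives the current owner, exactly like
	 * a plain load would.
	 */
	struct cell *expected = my_cell;
	atomic_compare_exchange_strong(queue_head, &expected, my_cell);
	return expected == my_cell;
#else
	/* on architectures without a cheap CAS (e.g. some ARM cores),
	 * a plain atomic load remains the faster option
	 */
	return atomic_load(queue_head) == my_cell;
#endif
}

The successful CAS still performs a store, even though it writes back the
value already present; that write intent is what pulls the line into
exclusive state ahead of the release, and also why the patch gates the
trick on __x86_64__, where CAS is cheap.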