MEDIUM: ring: significant boost in the loop by checking the ring queue ptr first

By checking the ring queue pointer before attempting to grab the tail
lock, and by placing the cpu_relax() calls at the right places, the ARM
machine reaches 6.0M/s on 80 threads. On x86_64, at 3C6T the EPYC sees
a small increase from 4.45M to 4.57M, but at 24C48T it sees a drop from
3.82M to 3.33M due to the write contention hidden behind the CAS that
implements the FETCH_OR(), which we'll address next.
commit 30a659c355
parent 1e2311edbc
Author: Willy Tarreau
Date:   2024-03-17 10:20:56 +01:00


@@ -272,21 +272,23 @@ ssize_t ring_write(struct ring *ring, size_t maxlen, const struct ist pfx[], siz
 		 * we must detect a new leader ASAP so that the fewest possible
 		 * threads check the tail.
 		 */
-		while ((tail_ofs = HA_ATOMIC_LOAD(tail_ptr)) & RING_TAIL_LOCK) {
-			next_cell = HA_ATOMIC_LOAD(ring_queue_ptr);
-			if (next_cell != &cell)
-				goto wait_for_flush; // another thread arrived, we should go to wait now
-			__ha_cpu_relax_for_read();
-		}
-
-		/* the tail is available again and we're still the leader, try
-		 * again.
-		 */
-		if (HA_ATOMIC_LOAD(ring_queue_ptr) != &cell)
-			goto wait_for_flush; // another thread arrived, we should go to wait now
+		while (1) {
+			next_cell = HA_ATOMIC_LOAD(ring_queue_ptr);
+			if (next_cell != &cell)
+				goto wait_for_flush; // FIXME: another thread arrived, we should go to wait now
+			__ha_cpu_relax_for_read();
+
+			tail_ofs = HA_ATOMIC_FETCH_OR(tail_ptr, RING_TAIL_LOCK);
+			if (!(tail_ofs & RING_TAIL_LOCK))
+				break;
+			__ha_cpu_relax_for_read();
+		}
 
-		/* OK the queue is locked, let's attempt to get the tail lock */
-		tail_ofs = HA_ATOMIC_FETCH_OR(tail_ptr, RING_TAIL_LOCK);
-
-		/* did we get it ? */
-		if (!(tail_ofs & RING_TAIL_LOCK)) {