From c633607c06697b8fb09d94b55b239b9bd7888fc6 Mon Sep 17 00:00:00 2001
From: Willy Tarreau
Date: Fri, 31 Jan 2020 06:26:39 +0100
Subject: [PATCH] OPTIM: task: refine task classes default CPU bandwidth ratios

Measures with unbounded execution ratios under 40000 concurrent
connections at 100 Gbps showed the following CPU bandwidth
distribution between task classes depending on traffic scenarios:

          scenario     TC0 TC1 TC2   observation
   -------------------+---+---+----+---------------------------
   TCP conn rate      : 29, 48, 23   221 kcps
   HTTP conn rate     : 29, 47, 24   200 kcps
   TCP byte rate      :  3,  5, 92   53 Gbps
   splicing byte rate :  5, 10, 85   70 Gbps
   H2 10k object      : 10, 21, 74   client-limited
   mixed traffic      :  4,  7, 89   2*1m+1*0: 11 kcps, 36 Gbps

Thus it seems that we always need a few bulk tasks even for short
connections, which seems to imply suboptimal processing somewhere,
and that there are roughly twice as many tasks (TC1=normal) as
regular tasklets (TC0=urgent). This ratio stands even when data
forwarding increases. So at first glance it looks reasonable to
enforce the following ratios by default:

  - 16% for TL_URGENT
  - 33% for TL_NORMAL
  - 50% for TL_BULK

With this, the TCP conn rate climbs to ~225 kcps, and the mixed
traffic pattern shows a more balanced 17 kcps + 35 Gbps with a 35 ms
CLI request time instead of 11 kcps + 36 Gbps and a 400 ms response
time. The byte rate tests (1M objects) are not affected at all. This
setting looks "good enough" to allow immediate merging, and could be
refined later.

It's worth noting that it resists massive increases of run queue
depth and maxpollevents very well: with the run queue depth raised
from 200 to 10000 and maxpollevents set to 10000 as well, the CLI's
request time is back to the previous ~400 ms, but the mixed traffic
test reaches 52 Gbps + 7500 cps, which was never met with the
previous scheduling model, where the CLI used to show a ~1 minute
response time. The reason is that in the bulk class it becomes
possible to perform multiple rounds of recv+send and eliminate
objects at once, increasing the L3 cache hit ratio and keeping the
connection count low, without degrading latency too much.

Another test with mixed traffic involving 2/3 splicing on huge
objects and 1/3 on empty objects, without touching any setting,
reports 51 Gbps + 5300 cps and a 35 ms CLI request time.
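To make the arithmetic behind these ratios concrete, here is a minimal
standalone C sketch (illustration only, not part of the patch) that
evaluates the same integer expressions the patch below installs, for a
sample budget of 200 credits:

    /* Illustration only: shows how the integer divisions in the patch
     * below split a budget of max_processed credits into roughly
     * 16%/33%/50% for urgent/normal/bulk when every class has work.
     */
    #include <stdio.h>

    int main(void)
    {
            int max_processed = 200;  /* sample budget, e.g. the run queue depth */

            /* TL_URGENT: ~1/6 of the budget, rounded up */
            int urgent = (max_processed + 5) / 6;
            max_processed -= urgent;

            /* TL_NORMAL: ~2/5 of what remains, i.e. ~1/3 of the original */
            int normal = 2 * (max_processed + 4) / 5;
            max_processed -= normal;

            /* TL_BULK: whatever is left, ~1/2 of the original */
            int bulk = max_processed;

            /* prints: urgent=34 normal=68 bulk=98, i.e. 17%/34%/49% */
            printf("urgent=%d normal=%d bulk=%d\n", urgent, normal, bulk);
            return 0;
    }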
---
 src/task.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/task.c b/src/task.c
index 2219262b5..3eaa9b4f6 100644
--- a/src/task.c
+++ b/src/task.c
@@ -436,15 +436,15 @@ void process_runnable_tasks()
 	if (likely(niced_tasks))
 		max_processed = (max_processed + 3) / 4;
 
-	/* run up to 3*max_processed/4 urgent tasklets */
-	done = run_tasks_from_list(&tt->tasklets[TL_URGENT], 3*(max_processed + 1) / 4);
+	/* run up to max_processed/6 urgent tasklets */
+	done = run_tasks_from_list(&tt->tasklets[TL_URGENT], (max_processed + 5) / 6);
 	max_processed -= done;
 
-	/* pick up to (max_processed-done+1)/2 regular tasks from prio-ordered run queues */
+	/* pick up to max_processed/3 (~=0.4*(max_processed-done)) regular tasks from prio-ordered run queues */
 	/* Note: the grq lock is always held when grq is not null */
-	while (tt->task_list_size < (max_processed + 1) / 2) {
+	while (tt->task_list_size < 2 * (max_processed + 4) / 5) {
 		if ((global_tasks_mask & tid_bit) && !grq) {
 #ifdef USE_THREAD
 			HA_SPIN_LOCK(TASK_RQ_LOCK, &rq_lock);
@@ -506,11 +506,11 @@ void process_runnable_tasks()
 		grq = NULL;
 	}
 
-	/* run between max_processed/8 and max_processed/2 regular tasks */
-	done = run_tasks_from_list(&tt->tasklets[TL_NORMAL], (max_processed + 1) / 2);
+	/* run between max_processed/3 and max_processed/2 regular tasks */
+	done = run_tasks_from_list(&tt->tasklets[TL_NORMAL], 2 * (max_processed + 4) / 5);
 	max_processed -= done;
 
-	/* run between max_processed/8 and max_processed bulk tasklets */
+	/* run between max_processed/2 and max_processed bulk tasklets */
 	done = run_tasks_from_list(&tt->tasklets[TL_BULK], max_processed);
 	max_processed -= done;
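A note on the "run between X and Y" bounds in the comments above: each
class's budget is recomputed from whatever credit remains after the
classes before it, so a class with an empty queue donates its unused
share to the classes below. The following sketch illustrates that
effect under the simplifying assumption that a class runs
min(queued, budget) items; run_class is a hypothetical stand-in for
run_tasks_from_list, not the real function:

    #include <stdio.h>

    /* hypothetical stand-in for run_tasks_from_list(): the class runs
     * min(queued, budget) items and reports how many it processed */
    static int run_class(int queued, int budget)
    {
            return queued < budget ? queued : budget;
    }

    int main(void)
    {
            int m, done;

            /* every class saturated: ~16%/33%/50% as intended */
            m = 200;
            done = run_class(1000, (m + 5) / 6);     m -= done; /* urgent: 34 */
            done = run_class(1000, 2 * (m + 4) / 5); m -= done; /* normal: 68 */
            printf("bulk gets %d\n", m);                        /* bulk:   98 */

            /* urgent queue empty: its share flows down to normal and bulk */
            m = 200;
            done = run_class(0, (m + 5) / 6);        m -= done; /* urgent:  0 */
            done = run_class(1000, 2 * (m + 4) / 5); m -= done; /* normal: 81 */
            printf("bulk gets %d\n", m);                        /* bulk:  119 */
            return 0;
    }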