From c633607c06697b8fb09d94b55b239b9bd7888fc6 Mon Sep 17 00:00:00 2001
From: Willy Tarreau
Date: Fri, 31 Jan 2020 06:26:39 +0100
Subject: [PATCH] OPTIM: task: refine task classes default CPU bandwidth ratios

Measures with unbounded execution ratios under 40000 concurrent
connections at 100 Gbps showed the following CPU bandwidth
distribution between task classes depending on traffic scenarios:

          scenario     TC0 TC1 TC2   observation
   -------------------+---+---+----+---------------------------
   TCP conn rate      : 29, 48, 23   221 kcps
   HTTP conn rate     : 29, 47, 24   200 kcps
   TCP byte rate      :  3,  5, 92   53 Gbps
   splicing byte rate :  5, 10, 85   70 Gbps
   H2 10k object      : 10, 21, 74   client-limited
   mixed traffic      :  4,  7, 89   2*1m+1*0: 11 kcps, 36 Gbps

Thus it seems that we always need a few bulk tasks even for short
connections, which seems to imply suboptimal processing somewhere,
and that there are roughly twice as many tasks (TC1=normal) as
regular tasklets (TC0=urgent). This ratio stands even when data
forwarding increases. So at first glance it looks reasonable to
enforce the following ratios by default:

  - 16% for TL_URGENT
  - 33% for TL_NORMAL
  - 50% for TL_BULK

With this, the TCP conn rate climbs to ~225 kcps, and the mixed
traffic pattern shows a more balanced 17 kcps + 35 Gbps with a 35 ms
CLI request time instead of 11 kcps + 36 Gbps and a 400 ms response
time. The byte rate tests (1M objects) are not affected at all. This
setting looks "good enough" to allow immediate merging, and could be
refined later.

It's worth noting that it resists massive increases of run queue
depth and maxpollevents very well: with the run queue depth raised
from 200 to 10000 and maxpollevents set to 10000 as well, the CLI's
request time is back to the previous ~400 ms, but the mixed traffic
test reaches 52 Gbps + 7500 cps, which was never met with the
previous scheduling model, where the CLI used to show a ~1 minute
response time. The reason is that in the bulk class it becomes
possible to perform multiple rounds of recv+send and eliminate
objects at once, increasing the L3 cache hit ratio and keeping the
connection count low, without degrading latency too much.

Another test with mixed traffic involving 2/3 splicing on huge
objects and 1/3 on empty objects, without touching any setting,
reports 51 Gbps + 5300 cps and a 35 ms CLI request time.
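To make the arithmetic behind these ratios concrete, here is a minimal
standalone C sketch (illustration only, not part of the patch) that
evaluates the same integer expressions the patch below installs, for a
sample budget of 200 credits:

    /* Illustration only: shows how the integer divisions in the patch
     * below split a budget of max_processed credits into roughly
     * 16%/33%/50% for urgent/normal/bulk when every class has work.
     */
    #include <stdio.h>

    int main(void)
    {
            int max_processed = 200;  /* sample budget, e.g. the run queue depth */

            /* TL_URGENT: ~1/6 of the budget, rounded up */
            int urgent = (max_processed + 5) / 6;
            max_processed -= urgent;

            /* TL_NORMAL: ~2/5 of what remains, i.e. ~1/3 of the original */
            int normal = 2 * (max_processed + 4) / 5;
            max_processed -= normal;

            /* TL_BULK: whatever is left, ~1/2 of the original */
            int bulk = max_processed;

            /* prints: urgent=34 normal=68 bulk=98, i.e. 17%/34%/49% */
            printf("urgent=%d normal=%d bulk=%d\n", urgent, normal, bulk);
            return 0;
    }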
---
 src/task.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/task.c b/src/task.c
index 2219262b5..3eaa9b4f6 100644
--- a/src/task.c
+++ b/src/task.c
@@ -436,15 +436,15 @@ void process_runnable_tasks()
 	if (likely(niced_tasks))
 		max_processed = (max_processed + 3) / 4;
 
-	/* run up to 3*max_processed/4 urgent tasklets */
-	done = run_tasks_from_list(&tt->tasklets[TL_URGENT], 3*(max_processed + 1) / 4);
+	/* run up to max_processed/6 urgent tasklets */
+	done = run_tasks_from_list(&tt->tasklets[TL_URGENT], (max_processed + 5) / 6);
 	max_processed -= done;
 
-	/* pick up to (max_processed-done+1)/2 regular tasks from prio-ordered run queues */
+	/* pick up to max_processed/3 (~=0.4*(max_processed-done)) regular tasks from prio-ordered run queues */
 	/* Note: the grq lock is always held when grq is not null */
-	while (tt->task_list_size < (max_processed + 1) / 2) {
+	while (tt->task_list_size < 2 * (max_processed + 4) / 5) {
 		if ((global_tasks_mask & tid_bit) && !grq) {
 #ifdef USE_THREAD
 			HA_SPIN_LOCK(TASK_RQ_LOCK, &rq_lock);
@@ -506,11 +506,11 @@ void process_runnable_tasks()
 		grq = NULL;
 	}
 
-	/* run between max_processed/8 and max_processed/2 regular tasks */
-	done = run_tasks_from_list(&tt->tasklets[TL_NORMAL], (max_processed + 1) / 2);
+	/* run between max_processed/3 and max_processed/2 regular tasks */
+	done = run_tasks_from_list(&tt->tasklets[TL_NORMAL], 2 * (max_processed + 4) / 5);
 	max_processed -= done;
 
-	/* run between max_processed/8 and max_processed bulk tasklets */
+	/* run between max_processed/2 and max_processed bulk tasklets */
 	done = run_tasks_from_list(&tt->tasklets[TL_BULK], max_processed);
 	max_processed -= done;
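A note on the "run between X and Y" bounds in the comments above: each
class's budget is recomputed from whatever credit remains after the
classes before it, so a class with an empty queue donates its unused
share to the classes below. The following sketch illustrates that
effect under the simplifying assumption that a class runs
min(queued, budget) items; run_class is a hypothetical stand-in for
run_tasks_from_list, not the real function:

    #include <stdio.h>

    /* hypothetical stand-in for run_tasks_from_list(): the class runs
     * min(queued, budget) items and reports how many it processed */
    static int run_class(int queued, int budget)
    {
            return queued < budget ? queued : budget;
    }

    int main(void)
    {
            int m, done;

            /* every class saturated: ~16%/33%/50% as intended */
            m = 200;
            done = run_class(1000, (m + 5) / 6);     m -= done; /* urgent: 34 */
            done = run_class(1000, 2 * (m + 4) / 5); m -= done; /* normal: 68 */
            printf("bulk gets %d\n", m);                        /* bulk:   98 */

            /* urgent queue empty: its share flows down to normal and bulk */
            m = 200;
            done = run_class(0, (m + 5) / 6);        m -= done; /* urgent:  0 */
            done = run_class(1000, 2 * (m + 4) / 5); m -= done; /* normal: 81 */
            printf("bulk gets %d\n", m);                        /* bulk:  119 */
            return 0;
    }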