haproxy

mirror of https://git.haproxy.org/git/haproxy.git/ synced 2025-08-07 15:47:01 +02:00

Author	SHA1	Message	Date
Willy Tarreau	4781b1521a	CLEANUP: atomic/tree-wide: replace single increments/decrements with inc/dec This patch replaces roughly all occurrences of an HA_ATOMIC_ADD(&foo, 1) or HA_ATOMIC_SUB(&foo, 1) with the equivalent HA_ATOMIC_INC(&foo) and HA_ATOMIC_DEC(&foo) respectively. These are 507 changes over 45 files.	2021-04-07 18:18:37 +02:00
Willy Tarreau	1db427399c	CLEANUP: atomic: add an explicit _FETCH variant for add/sub/and/or Currently our atomic ops return a value but it's never known whether the fetch is done before or after the operation, which causes some confusion each time the value is desired. Let's create an explicit variant of these operations suffixed with _FETCH to explicitly mention that the fetch occurs after the operation, and make use of it at the few call places.	2021-04-07 18:18:37 +02:00
Willy Tarreau	1691ba3693	MINOR: task: give the scheduler a bit more flexibility in the runqueue size Instead of setting a hard-limit on runqueue-depth and keeping it short to maintain fairness, let's allow the scheduler to automatically cut the existing one in two equal halves if its size is between the configured size and its double. This will allow to increase the default value while keeping a low latency.	2021-03-10 11:15:34 +01:00
Willy Tarreau	018251667e	CLEANUP: config: make the cfg_keyword parsers take a const for the defproxy The default proxy was passed as a variable to all parsers instead of a const, which is not without risk, especially when some timeout parsers used to make some int pointers point to the default values for comparisons. We want to be certain that none of these parsers will modify the defaults sections by accident, so it's important to mark this proxy as const. This patch touches all occurrences found (89).	2021-03-09 10:09:43 +01:00
Willy Tarreau	b7e0c633e8	BUILD: task: fix build at -O0 with threads disabled grq_total was incremented when picking tasks from the global run queue, but this variable was not defined with threads disabled, and the code was optimized away at -O2. No backport is needed.	2021-03-09 10:01:01 +01:00
Willy Tarreau	6fa8bcdc78	MINOR: task: add an application specific flag to the state: TASK_F_USR1 This flag will be usable by any application. It will be preserved across wakeups so the application can use it to do various stuff. Some I/O handlers will soon benefit from this.	2021-03-05 08:30:08 +01:00
Willy Tarreau	144f84a09d	MEDIUM: task: extend the state field to 32 bits It's been too short for quite a while now and is now full. It's still time to extend it to 32-bits since we have room for this without wasting any space, so we now gained 16 new bits for future flags. The values were not reassigned just in case there would be a few hidden u16 or short somewhere in which these flags are placed (as it used to be the case with stream->pending_events). The patch is tagged MEDIUM because this required to update the task's process() prototype to use an int instead of a short, that's quite a bunch of places.	2021-03-05 08:30:08 +01:00
Willy Tarreau	db4e238938	MINOR: task: stop abusing the nice field to detect a tasklet It's cleaner to use a flag from the task's state to detect a tasklet and it's even cheaper. One of the best benefits is that this will allow to get the nice field out of the common part since the tasklet doesn't need it anymore. This commit uses the last task bit available but that's temporary as the purpose of the change is to extend this.	2021-03-05 08:30:08 +01:00
Willy Tarreau	76390dac06	MINOR: task: only limit TL_HEAVY tasks but not others The preliminary approach to dealing with heavy tasks forced us to quit the poller after meeting one. Now instead we process at most one per poll loop and ignore the next ones, so that we get more bandwidth to process all other classes. Doing so further reduced the induced HTTP request latency at 100k req/s under the stress of 1000 concurrent SSL handshakes in the following proportions:

             |  default   | low-latency
    ---------+------------+--------------
      before |   2.75 ms  |   2.0 ms
      after  |   1.38 ms  |   0.98 ms

In both cases, the latency is roughly halved. It's worth noting that both values are now exactly 10 times better than in 2.4-dev9. Even the percentiles have much improved. For 16 HTTP connections (1 per thread) competing with 1000 SSL handshakes, we're seeing these long-tail latencies (in milliseconds):

               |  99.5%  |  99.9%  |  100%
    -----------+---------+---------+--------
     2.4-dev9  |  48.4   |  58.1   |  78.5
     previous  |   6.2   |  11.4   |  67.8
     this patch|   2.8   |   2.9   |   6.1

The task latency profiling report now shows this in default mode:

    $ socat - /tmp/sock1 <<< "show profiling"
    Per-task CPU profiling : on      # set profiling tasks {on|auto|off}
    Tasks activity:
      function                       calls   cpu_tot   cpu_avg   lat_tot   lat_avg
      si_cs_io_cb                  3061966   2.224s    726.0ns   42.03s    13.72us
      h1_io_cb                     3061960   6.418s    2.096us   18.76m    367.6us
      process_stream               3059982   9.137s    2.985us   15.52m    304.3us
      ssl_sock_io_cb                602657   4.265m    424.7us   4.736h    28.29ms
      h1_timeout_task               202973   -         -         6.254s    30.81us
      accept_queue_process          135547   1.179s    8.699us   16.29s    120.1us
      srv_cleanup_toremove_conns        81   15.64ms   193.1us   30.87ms   381.1us
      task_run_applet                   10   758.7us   75.87us   51.77us   5.176us
      srv_cleanup_idle_conns             4   375.3us   93.83us   54.52us   13.63us

And this in low-latency mode, showing that both si_cs_io_cb() and process_stream() have significantly benefitted from the improvement, with values 50 to 200 times smaller than 2.4-dev9:

    $ socat - /tmp/sock1 <<< "show profiling"
    Per-task CPU profiling : on      # set profiling tasks {on|auto|off}
    Tasks activity:
      function                       calls   cpu_tot   cpu_avg   lat_tot   lat_avg
      h1_io_cb                     6407006   11.86s    1.851us   31.14m    291.6us
      process_stream               6403890   18.40s    2.873us   2.134m    20.00us
      si_cs_io_cb                  6403866   4.139s    646.0ns   1.773m    16.61us
      ssl_sock_io_cb                894326   6.407m    429.9us   7.326h    29.49ms
      h1_timeout_task               301189   -         -         8.440s    28.02us
      accept_queue_process          211989   1.691s    7.977us   21.48s    101.3us
      srv_cleanup_toremove_conns       220   23.46ms   106.7us   65.61ms   298.2us
      task_run_applet                   16   1.219ms   76.17us   181.7us   11.36us
      srv_cleanup_idle_conns            12   713.3us   59.44us   168.4us   14.03us

The changes are slightly more invasive than previous ones and depend on recent patches so they are not likely well suited for backporting.	2021-02-26 12:00:53 +01:00
Willy Tarreau	826fa87246	MINOR: task: place the heavy elements in TL_HEAVY Instead of placing heavy tasklets into the TL_BULK queue, we now place them into the TL_HEAVY one, which is assigned a default weight of ~1% load at once. This way heavy tasks will not block TL_BULK anymore.	2021-02-26 12:00:53 +01:00
Willy Tarreau	401135cee6	MINOR: task: add one extra tasklet class: TL_HEAVY This class will be used exclusively for heavy processing tasklets. It will be cleaner than mixing them with the bulk ones. For now it's allocated ~1% of the CPU bandwidth. The largest part of the patch consists in re-arranging the fields in the task_per_thread structure to preserve a clean alignment with one more list head. Since we're now forced to increase the struct past a second cache line, it now uses 4 cache lines (for easy multiplying) with the first two ones being exclusively used by local operations and the third one mostly by atomic operations. Interestingly, this better arrangement causes less stress and reduced the response time by 8 microseconds at 1 million requests per second.	2021-02-26 12:00:53 +01:00
Willy Tarreau	74dea8caea	MINOR: task: limit the number of subsequent heavy tasks with flag TASK_HEAVY While the scheduler is priority-aware and class-aware, and consistently tries to maintain fairness between all classes, it doesn't make use of a fine execution budget to compensate for high-latency tasks such as TLS handshakes. This can result in many subsequent calls adding multiple milliseconds of latency between the various steps of other tasklets that don't even depend on this. An ideal solution would be to add a 4th queue, have all tasks announce their estimated cost upfront and let the scheduler maintain an auto- refilling budget to pick from the most suitable queue. But it turns out that a very simplified version of this already provides impressive gains with very tiny changes and could easily be backported. The principle is to reserve a new task flag "TASK_HEAVY" that indicates that a task is expected to take a lot of time without yielding (e.g. an SSL handshake typically takes 700 microseconds of crypto computation). When the scheduler sees this flag when queuing a tasklet, it will place it into the bulk queue. And during dequeuing, we accept only one of these in a full round. This means that the first one will be accepted, will not prevent other lower priority tasks from running, but if a new one arrives, then the queue stops here and goes back to the polling. This will allow to collect more important updates for other tasks that will be batched before the next call of a heavy task. Preliminary tests consisting in placing this flag on the SSL handshake tasklet show that response times under SSL stress fell from 14 ms before the patch to 3.0 ms with the patch, and even 1.8 ms if tune.sched.low-latency is set to "on".	2021-02-26 00:25:51 +01:00
Willy Tarreau	2a54ffbf43	MINOR: task: make tasklet wakeup latency measurements more accurate First, we don't want to measure wakeup times if the call date had not been set before profiling was enabled at run time. And second, we may only collect the value before clearing the TASK_IN_LIST bit, otherwise another wakeup might happen on another thread and replace the call date we're about to use, hence artificially lower the wakeup times.	2021-02-25 09:44:16 +01:00
Willy Tarreau	b2285de049	MINOR: tasks: also compute the tasklet latency when DEBUG_TASK is set It is extremely useful to be able to observe the wakeup latency of some important I/O operations, so let's accept to inflate the tasklet struct by 8 extra bytes when DEBUG_TASK is set. With just this we have enough to get live reports like this:

    $ socat - /tmp/sock1 <<< "show profiling"
    Per-task CPU profiling : on      # set profiling tasks {on|auto|off}
    Tasks activity:
      function                        calls     cpu_tot   cpu_avg   lat_tot   lat_avg
      si_cs_io_cb                     8099492   4.833s    596.0ns   8.974m    66.48us
      h1_io_cb                        7460365   11.55s    1.548us   2.477m    19.92us
      process_stream                  7383828   22.79s    3.086us   18.39m    149.5us
      h1_timeout_task                 4157      -         -         348.4ms   83.81us
      srv_cleanup_toremove_connections751       39.70ms   52.86us   10.54ms   14.04us
      srv_cleanup_idle_connections    21        1.405ms   66.89us   30.82us   1.467us
      task_run_applet                 16        1.058ms   66.13us   446.2us   27.89us
      accept_queue_process            7         34.53us   4.933us   333.1us   47.58us	2021-02-25 09:44:16 +01:00
Willy Tarreau	45499c56d3	MINOR: task: make grq_total atomic to move it outside of the grq_lock Instead of decrementing grq_total once per task picked from the global run queue, let's do it at once after the loop like we do for other counters. This simplifies the code everywhere. It is not expected to bring noticeable improvements however, since global tasks tend to be less common nowadays.	2021-02-25 09:44:16 +01:00
Willy Tarreau	c9afbb10f5	MINOR: task: don't decrement then increment the local run queue Now we don't need to decrement rq_total when we pick a task in the tree to immediately increment it again after installing it into the local list. Instead, we simply add to the local queue count the number of globally picked tasks. Avoiding this shows ~0.5% performance gains at 1Mreq/s (2M task switches/s).	2021-02-25 09:44:16 +01:00
Willy Tarreau	2b363ac092	MINOR: task: do not use __task_unlink_rq() from process_runnable_tasks() As indicated in previous commit, this function tries to guess which tree the task is in to figure what counters to update, while we already have that info in the caller. Let's just pick the relevant parts to place them in the caller.	2021-02-25 09:44:16 +01:00
Willy Tarreau	e7923c1d22	MINOR: task: split the counts of local and global tasks picked In process_runnable_tasks() we're still calling __task_unlink_rq() to pick a task, and this function tries to guess where to pick the task from and which counter to update while the caller's context already has everything. Worse, the number of local tasks is decremented then recredited, doubling the operations. In order to avoid this we first need to keep separate counters for local and global tasks that were picked. This is what this patch does.	2021-02-25 09:44:16 +01:00
Willy Tarreau	9c6dbf0eea	CLEANUP: task: split the large tasklet_wakeup_on() function in two This function has become large with the multi-queue scheduler. We need to keep the fast path and the debugging parts inlined, but the rest now moves to task.c just like was done for task_wakeup(). This has reduced the code size by 6kB due to less inlining of large parts that are always context-dependent, and as a side effect, has increased the overall performance by 1%.	2021-02-24 17:55:58 +01:00
Willy Tarreau	955a11ebfa	MINOR: task: move the allocated tasks counter to the per-thread struct The nb_tasks counter was still global and gets incremented and decremented for each task_new()/task_free(), and was read in process_runnable_tasks(). But it's only used for stats reporting, so doing this so often is pointless and expensive. Let's move it to the task_per_thread struct and have the stats sum it when needed.	2021-02-24 17:42:04 +01:00
Willy Tarreau	eeffb3df41	MINOR: task: limit the remote thread wakeup to the global runqueue only The test in __task_wakeup() to figure if the remote threads are sleeping doesn't make sense outside of the global runqueue test, since there are only two possibilities here: local runqueue or global runqueue, hence a sleeping thread is another one and can only happen when sending to the global run queue. Let's move the test inside the "if" block.	2021-02-24 17:42:04 +01:00
Willy Tarreau	018564eaa2	CLEANUP: task: move the tree root detection from __task_wakeup() to task_wakeup() Historically we used to call __task_wakeup() with a known tree root but this is not the case and the code has remained needlessly complicated with the root calculation in task_wakeup() passed in argument to __task_wakeup() which compares it again. Let's get rid of this and just move the detection code there. This eliminates some ifdefs and allows to simplify the test conditions quite a bit.	2021-02-24 17:42:04 +01:00
Willy Tarreau	1f3b1417b8	CLEANUP: tasks: use a less confusing name for task_list_size This one is systematically misunderstood due to its unclear name. It is in fact the number of tasks in the local tasklet list. Let's call it "tasks_in_list" to remove some of the confusion.	2021-02-24 17:42:04 +01:00
Willy Tarreau	2c41d77ebc	MINOR: tasks: do not maintain the rqueue_size counter anymore This one is exclusively used as a boolean nowadays and is non-zero only when the thread-local run queue is not empty. Better check the root tree's pointer and avoid updating this counter all the time.	2021-02-24 17:42:04 +01:00
Willy Tarreau	9c7b8085f4	MEDIUM: task: remove the tasks_run_queue counter and have one per thread This counter is solely used for reporting in the stats and is the hottest thread contention point to date. Moving it to the scheduler and having a separate one for the global run queue dramatically improves the performance, showing a 12% boost on the request rate on 16 threads! In addition, the thread debugging output which used to rely on rqueue_size was not totally accurate as it would only report task counts. Now we can return the exact thread's run queue length. It is also interesting to note that there are still a few other task/tasklet counters in the scheduler that are not efficiently updated because some cover a single area and others cover multiple areas. It looks like having a distinct counter for each of the following entries would help and would keep the code a bit cleaner:
  - global run queue (tree)
  - per-thread run queue (tree)
  - per-thread shared tasklets list
  - per-thread local lists
Maybe even splitting the shared tasklets lists between pure tasklets and tasks instead of having the whole and tasks would simplify the code because there remain a number of places where several counters have to be updated.	2021-02-24 17:42:04 +01:00
Willy Tarreau	c6ba9a0b9b	MINOR: sched: have one runqueue ticks counter per thread The runqueue_ticks counts the number of task wakeups and is used to position new tasks in the run queue, but since we've had per-thread run queues, the values there are not very relevant anymore and the nice value doesn't apply well if some threads are more loaded than others. In addition, letting all threads compete over a shared counter is not smart as this may cause some excessive contention. Let's move this index close to the run queues themselves, i.e. one per thread and a global one. In addition to improving fairness, this has increased global performance by 2% on 16 threads thanks to the lower contention on rqueue_ticks. Fairness issues were not observed, but if any were to be, this patch could be backported as far as 2.0 to address them.	2021-02-20 13:03:37 +01:00
Willy Tarreau	4e2282f9bf	MEDIUM: tasks/activity: collect per-task statistics when profiling is enabled Now when the profiling is enabled, the scheduler will update per-function task-level statistics on number of calls, cpu usage and latency, that could later be checked using "show profiling". This will immediately make it obvious what functions are responsible for others' high latencies or which ones are suffering from others, and should help spot issues like undesired wakeups. For now the stats are only collected but not reported (though they are readable from sched_activity[] under gdb).	2021-01-29 12:10:33 +01:00
Willy Tarreau	4d6c594998	BUG/MEDIUM: task: close a possible data race condition on a tasklet's list link In issue #958 Ashley Penney reported intermittent crashes on AWS's ARM nodes which would not happen on x86 nodes. After investigation it turned out that the Neoverse N1 CPU cores used in the Graviton2 CPU are much more aggressive than the usual Cortex A53/A72/A55 or any x86 regarding memory ordering. The issue that was triggered there is that if a tasklet_wakeup() call is made on a tasklet scheduled to run on a foreign thread and that tasklet is just being dequeued to be processed, there can be a race at two places:
  - if MT_LIST_TRY_ADDQ() happens between MT_LIST_BEHEAD() and LIST_SPLICE_END_DETACHED() if the tasklet is alone in the list, because the emptiness test matches ;
  - if MT_LIST_TRY_ADDQ() happens during LIST_DEL_INIT() in run_tasks_from_lists(), then depending on how LIST_DEL_INIT() ends up being implemented, it may even corrupt the adjacent nodes while they're being reused for the in-tree storage.
This issue was introduced in 2.2 when support for waking up remote tasklets was added. Initially the attachment of a tasklet to a list was enough to know its status and this used to be stable information. Now it's not sufficient to rely on this anymore, thus we need to use different information. This patch solves this by adding a new task flag, TASK_IN_LIST, which is atomically set before attaching a tasklet to a list, and is only removed after the tasklet is detached from a list. It is checked by tasklet_wakeup_on() so that it may only be done while the tasklet is out of any list, and is cleared during the state switch when calling the tasklet. Note that the flag is not set for pure tasks as it's not needed. However this introduces a new special case: the function tasklet_remove_from_tasklet_list() needs to keep both states in sync and cannot check both the state and the attachment to a list at the same time. 
This function is already limited to being used by the thread owning the tasklet, so in this case the test remains reliable. However, just like its predecessors, this function is wrong by design and it should probably be replaced with a stricter one, a lazy one, or be totally removed (it's only used in checks to avoid calling a possibly scheduled event, and when freeing a tasklet). Regardless, for now the function exists so the flag is removed only if the deletion could be done, which covers all cases we're interested in regarding the insertion. This removal is safe against a concurrent tasklet_wakeup_on() since MT_LIST_DEL() guarantees the atomic test, and will ultimately clear the flag only if the task could be deleted, so the flag will always reflect the last state. This should be carefully backported as far as 2.2 after some observation period. This patch depends on previous patch "MINOR: task: remove __tasklet_remove_from_tasklet_list()".	2020-11-30 18:17:59 +01:00
Willy Tarreau	2da4c316c2	MINOR: task: remove __tasklet_remove_from_tasklet_list() This function is only used at a single place directly within the scheduler in run_tasks_from_lists() and it really ought not be called by anything else, regardless of what its comment says. Let's delete it, move the two lines directly into the call place, and take this opportunity to factor the atomic decrement on tasks_run_queue. A comment was added on the remaining one tasklet_remove_from_tasklet_list() to mention the risks in using it.	2020-11-30 18:17:44 +01:00
Willy Tarreau	c309dbdd99	MINOR: task: perform atomic counter increments only once per wakeup In process_runnable_tasks(), we walk the run queue and pick tasks to insert them into the local list. And for each of these operations we perform a few increments, some of which are atomic, and they're even performed under the runqueue's lock. This is useless inside the loop, better do them at the end, since we don't use these values inside the loop and they're not used anywhere else either during this time. The only one is task_list_size which is accessed in parallel by other threads performing remote tasklet wakeups, but it's already approximate and is used to decide to get out of the loop when the limit is reached. So now we compute it first as an initial budget instead.	2020-11-30 18:17:44 +01:00
Willy Tarreau	a868c2920b	MINOR: task: remove tasklet_insert_into_tasklet_list() This function is only called at a single place and adds more confusion than it removes. It also makes one think it could be used outside of the scheduler while it must absolutely not. Let's just move its two lines to the call place, making the code more readable there. In addition this clearly shows that the preliminary LIST_INIT() is useless since the entry is immediately overwritten.	2020-11-30 18:17:44 +01:00
Willy Tarreau	69a7b8fc6c	CLEANUP: task: remove the unused and mishandled global_rqueue_size This counter is only updated and never used, and in addition it's done without any atomicity so it's very unlikely to be correct on multi-CPU systems! Let's just remove it since it's not used.	2020-10-19 14:08:13 +02:00
Willy Tarreau	d48ed6643b	MEDIUM: task: use an upgradable seek lock when scanning the wait queue Right now when running a configuration with many global timers (e.g. many health checks), there is a lot of contention on the global wait queue lock because all threads queue up in front of it to scan it. With 2000 servers checked every 10 milliseconds (200k checks per second), after 23 seconds running on 8 threads, the lock stats were this high:

    Stats about Lock TASK_WQ:
        write lock  : 9872564
        write unlock: 9872564 (0)
        wait time for write     : 9208.409 msec
        wait time for write/lock: 932.727 nsec
        read lock   : 240367
        read unlock : 240367 (0)
        wait time for read      : 149.025 msec
        wait time for read/lock : 619.991 nsec

i.e. ~5% of the total runtime spent waiting on this specific lock. With upgradable locks we don't need to work like this anymore. We can just try to upgrade the read lock to a seek lock before scanning the queue, then upgrade the seek lock to a write lock for each element we want to delete there and immediately downgrade it to a seek lock. The benefit is double:
  - all other threads which need to call next_expired_task() before polling won't wait anymore since the seek lock is compatible with the read lock ;
  - all other threads competing on trying to grab this lock will fail on the upgrade attempt from read to seek, and will let the current lock owner finish collecting expired entries.
Doing only this has reduced the wake_expired_tasks() CPU usage in a very large servers test from 2.15% to 1.04% as reported by perf top, and increased by 3% the health check rate (all threads being saturated). This is expected to help against (and possibly solve) the problem described in issue #875.	2020-10-16 17:15:54 +02:00
Willy Tarreau	3cfaa8d1e0	BUG/MEDIUM: task: bound the number of tasks picked from the wait queue at once There is a theoretical problem in the wait queue, which is that with many threads, one could spend a lot of time looping on the newly expired tasks, causing a lot of contention on the global wq_lock and on the global rq_lock. This initially sounds benign, but if another thread does just a task_schedule() or task_queue(), it might end up waiting for a long time on this lock, and this wait time will count on its execution budget, degrading the end user's experience and possibly risking to trigger the watchdog if that lasts too long. The simplest (and backportable) solution here consists in bounding the number of expired tasks that may be picked from the global wait queue at once by a thread, given that all other ones will do it as well anyway. We don't need to pick more than global.tune.runqueue_depth tasks at once as we won't process more, so this counter is updated for both the local and the global queues: threads with more local expired tasks will pick less global tasks and conversely, keeping the load balanced between all threads. This will guarantee a much lower latency if/when wakeup storms happen (e.g. hundreds of thousands of synchronized health checks). Note that some crashes have been witnessed with 1/4 of the threads in wake_expired_tasks() and, while the issue might or might not be related, not having reasonable bounds here definitely justifies why we can spend so much time there. This patch should be backported, probably as far as 2.0 (maybe with some adaptations).	2020-10-16 15:18:48 +02:00
Willy Tarreau	6ce0232a78	BUILD: task: work around a bogus warning in gcc 4.7/4.8 at -O1 As reported in issue #816, when building task.o at -O1 with gcc 4.7 or 4.8, we get the following warning:

      CC      src/task.o
    In file included from include/haproxy/proxy.h:31:0,
                     from include/haproxy/cfgparse.h:27,
                     from src/task.c:19:
    src/task.c: In function 'next_timer_expiry':
    include/haproxy/ticks.h:121:10: warning: 'key' may be used uninitialized in this function [-Wmaybe-uninitialized]
    src/task.c:349:2: note: 'key' was declared here

It is wrong since the condition to use 'key' is exactly the same as the one used to set it. This warning disappears at -O2 and disappeared from gcc 5 and above. Let's just initialize 'key' there, it only adds 16 bytes of code and remains cheap enough for this function. This should be backported to 2.2.	2020-08-21 05:54:00 +02:00
Willy Tarreau	e5d79bccc0	MINOR: tasks/debug: add a few BUG_ON() to detect use of wrong timer queue This aims at catching calls to task_unlink_wq() performed by the wrong thread based on the shared status for the task, as well as calls to __task_queue() with the wrong timer queue being used based on the task's capabilities. This will at least help eliminate some hypothesis during debugging sessions when suspecting that a wrong thread has attempted to queue a task at the wrong place.	2020-07-22 14:42:52 +02:00
Willy Tarreau	783afbe93b	BUG/MAJOR: tasks: don't requeue global tasks into the local queue A bug was introduced by commit `77015abe0` ("MEDIUM: tasks: clean up the front side of the wait queue in wake_expired_tasks()"): front tasks that are not yet expired were incorrectly requeued into the local wait queue instead of the global one. Because of this, the same task could be found by the same thread on next invocation and be unlinked without locking, allowing another thread to requeue it in parallel, and conversely another thread could unlink it while the task was being walked over, causing all sorts of crashes and endless loops in wake_expired_tasks() and affiliates. This bug can easily be triggered by stressing the do_resolve action in multi-thread (after applying the fixes required to get do_resolve to work with threads). It certainly is the cause of issue #758. This must be backported to 2.2 only.	2020-07-22 14:12:45 +02:00
Willy Tarreau	273aea479d	BUG/MAJOR: tasks: make sure to always lock the shared wait queue if needed In run_tasks_from_task_list() we may free some tasks that have been killed. Before doing so we unlink them from the wait queue. But if such a task is in the global wait queue, the queue isn't locked so this can result in corrupting the global task list and causing loops or crashes. It's very likely one cause of issue #758. This must be backported to 2.2. For 2.1 there doesn't seem to be any case where a task could be freed this way while in the global queue, but it doesn't cost much to apply the same change (the code is in process_runnable_task there).	2020-07-17 14:37:51 +02:00
Willy Tarreau	950954f5f7	MINOR: tasks: use MT_LIST_ADDQ() when killing tasks. A bug in task_kill() was fixed by commit `54d31170a` ("BUG/MAJOR: sched: make sure task_kill() always queues the task") which added a list initialization before adding an element. But in fact an unconditional addition would have done the same and been simpler than first initializing then checking the element was initialized. Let's use MT_LIST_ADDQ() there to add the task to kill into the shared queue and kill the dirty LIST_INIT().	2020-07-10 08:52:13 +02:00
Willy Tarreau	de4db17dee	MINOR: lists: rename some MT_LIST operations to clarify them Initially when mt_lists were added, their purpose was to be used with the scheduler, where anyone may concurrently add the same tasklet, so it sounded natural to implement a check in MT_LIST_ADD{,Q}. Later their usage was extended and MT_LIST_ADD{,Q} started to be used in situations where the element to be added was exclusively owned by the one performing the operation so a conflict was impossible. This became more obvious with the idle connections and the new macro was called MT_LIST_ADDQ_NOCHECK. But this remains confusing and at many places it's not expected that an MT_LIST_ADD could possibly fail, and worse, at some places we start by initializing it before adding (and the test is superfluous) so let's rename them to something more conventional to denote the presence of the check or not:
  MT_LIST_ADD{,Q}    : unconditional operation, the caller owns the element, and doesn't care about the element's current state (exactly like LIST_ADD)
  MT_LIST_TRY_ADD{,Q}: only perform the operation if the element is not already added or in the process of being added.
This means that the previously "safe" MT_LIST_ADD{,Q} are not "safe" anymore. This also means that in case of backport mistakes in the future causing this to be overlooked, the slower and safer functions will still be used by default. Note that the missing unchecked MT_LIST_ADD macro was added. The rest of the code will have to be reviewed so that a number of callers of MT_LIST_TRY_ADDQ are changed to MT_LIST_ADDQ to remove the unneeded test.	2020-07-10 08:50:41 +02:00
Willy Tarreau	4f58926352	BUG/MAJOR: sched: make it work also when not building with DEBUG_STRICT Sadly, the fix from commit `54d31170a` ("BUG/MAJOR: sched: make sure task_kill() always queues the task") broke the builds without DEBUG_STRICT as, in order to be careful, it placed a BUG_ON() around the previously failing condition to check for any new possible failure, but this BUG_ON strips the condition when DEBUG_STRICT is not set. We don't want BUG_ON to evaluate any condition either as some debugging code calls possibly expensive ones (e.g. in htx_get_stline). Let's just drop the useless BUG_ON(). No backport is needed, this is 2.2-dev.	2020-07-02 17:17:42 +02:00
Willy Tarreau	54d31170a9	BUG/MAJOR: sched: make sure task_kill() always queues the task task_kill() may fail to queue a task if this task has never run, because its equivalent (tasklet->list) member has never been "emptied" since it didn't pass through the LIST_DEL_INIT() that's performed by run_tasks_from_lists(). This results in these tasks never being freed. It happens during the mux takeover since the target task usually is the timeout task which, by definition, has never run yet. This fixes commit `eb8c2c69f` ("MEDIUM: sched: implement task_kill() to kill a task") which was introduced after 2.2-dev11 and doesn't need to be backported.	2020-07-02 14:14:00 +02:00
Willy Tarreau	eb8c2c69fa	MEDIUM: sched: implement task_kill() to kill a task task_kill() may be used by any thread to kill any task with less overhead than a regular wakeup. In order to achieve this, it bypasses the priority tree and inserts the task directly into the shared tasklets list, cast as a tasklet. The task_list_size is updated to make sure it is properly decremented after execution of this task. The task will thus be picked by process_runnable_tasks() after checking the tree and sent to the TL_URGENT list, where it will be processed and killed. If the task is bound to more than one thread, its first thread will be the one notified. If the task was already queued or running, nothing is done, only the flag is added so that it gets killed before or after execution. Of course it's the caller's responsibility to make sure any resources allocated by this task were already cleaned up or taken over.	2020-07-01 16:35:53 +02:00
Willy Tarreau	8a6049c268	MEDIUM: sched: create a new TASK_KILLED task flag This flag, when set, will be used to indicate that the task must die. At the moment this may only be placed by the task itself or by the scheduler when placing it into the TL_NORMAL queue.	2020-07-01 16:35:49 +02:00
Willy Tarreau	d99177f86d	MINOR: sched: make sched->task_list_size atomic We'll need to update it from foreign threads in order to throw killed tasks and maintain correct accounting, so let's make it atomic.	2020-07-01 16:35:41 +02:00
Willy Tarreau	1553b6657d	BUG/MINOR: sched: properly cover for a rare MT_LIST_ADDQ() race In commit `3ef7a190b` ("MEDIUM: tasks: apply a fair CPU distribution between tasklet classes") we compute a total weight to be used to split the CPU time between queues. There is a mention that the total cannot be null, which is based on the fact that we only get there if thread_has_task() returns non-zero. But there is a very small race which can break this assumption: if two threads conflict on MT_LIST_ADDQ() on an empty shared list and both roll back before trying again, there is the possibility that a first call to MT_LIST_ISEMPTY() sees the first thread install itself, then the second call will see the list empty when both roll back. Thus we could proceed with the queue while it's temporarily empty and compute max lengths using a divide by zero. This case is very hard to trigger, it seldom happens on 16 threads at 400k req/s. Let's simply test for max_total and leave the loop when we've not found any work. No backport is needed, that's 2.2-only.	2020-06-30 14:06:19 +02:00
Willy Tarreau	e7723bddd7	MEDIUM: tasks: add a tune.sched.low-latency option Now that all tasklet queues are scanned at once by run_tasks_from_lists(), it becomes possible to always check for lower priority classes and jump back to them when they exist. This patch adds tune.sched.low-latency global setting to enable this behavior. What it does is stick to the lowest ranked priority list in which tasks are still present with an available budget, and leave the loop to refill the tasklet lists if the trees got new tasks or if new work arrived into the shared urgent queue. Doing so allows to cut the latency in half when running with extremely deep run queues (10k-100k), thus allowing forwarding of small and large objects to coexist better. It remains off by default since it does have a small impact on large traffic by default (shorter batches).	2020-06-24 12:21:26 +02:00
Willy Tarreau	59153fef86	MINOR: tasks: make run_tasks_from_lists() scan the queues itself Now process_runnable_tasks is responsible for calculating the budgets for each queue, dequeuing from the tree, and calling run_tasks_from_lists(). This latter one scans the queues, picking tasks there and respecting budgets. Note that its name was updated with a plural "s" for this reason.	2020-06-24 12:21:26 +02:00
Willy Tarreau	ba48d5c8f9	MINOR: tasks: pass the queue index to run_task_from_list() Instead of passing it a pointer to the queue, pass it the queue's index so that it can perform all the work around current_queue and tl_class_mask.	2020-06-24 12:21:26 +02:00
Willy Tarreau	49f90bf148	MINOR: tasks: add a mask of the queues with active tasklets It is neither convenient nor scalable to check each and every tasklet queue to figure whether it's empty or not while we often need to check them all at once. This patch introduces a tasklet class mask which gets a bit 1 set for each queue representing one class of service. A single test on the mask allows to figure whether there's still some work to be done. It will later be usable to better factor the runqueue code. Bits are set when tasklets are queued. They're cleared when queues are emptied. It is possible that a queue is empty but has a bit if a tasklet was added then removed, but this is not a problem as this is properly checked for in run_tasks_from_list().	2020-06-24 12:21:26 +02:00
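
The tasklet class mask described in commit 49f90bf148 can be illustrated with a minimal single-threaded sketch. Every name below (`sched_sketch`, `sketch_enqueue`, `Q_URGENT`, and so on) is hypothetical and chosen for illustration; none of it is HAProxy's actual code, and the counters only stand in for the real per-class lists. The sketch shows the two properties the message mentions: one test on the mask tells whether any class may still hold work, and a bit may remain set on an empty queue until that queue is scanned.

```c
#include <assert.h>

/* Hypothetical queue classes, named for illustration only. */
enum { Q_URGENT = 0, Q_NORMAL, Q_BULK, Q_COUNT };

struct sched_sketch {
    unsigned int class_mask;      /* bit i set => queue i may hold work */
    unsigned int queued[Q_COUNT]; /* stand-in for the per-class lists */
};

/* Queue one item in class q and flag that class as active. */
static void sketch_enqueue(struct sched_sketch *s, int q)
{
    assert(q >= 0 && q < Q_COUNT);
    s->queued[q]++;
    s->class_mask |= 1U << q;
}

/* Remove one item without touching the mask: this may leave a stale bit. */
static void sketch_remove(struct sched_sketch *s, int q)
{
    if (s->queued[q])
        s->queued[q]--;
}

/* Scan class q: process whatever is there (possibly nothing), then
 * clear the bit since the queue is now known to be empty.
 */
static unsigned int sketch_drain(struct sched_sketch *s, int q)
{
    unsigned int done = s->queued[q];

    s->queued[q] = 0;
    s->class_mask &= ~(1U << q);
    return done;
}

/* A single test covers all classes at once. */
static int sketch_has_work(const struct sched_sketch *s)
{
    return s->class_mask != 0;
}
```

A stale bit only costs one extra scan of an empty queue, which is why the commit message treats it as harmless. A multi-threaded version would need atomic updates of the mask, which this sketch deliberately leaves out.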

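The pitfall behind commit 4f58926352 (a debug-only check that silently discards a condition with side effects) can be reproduced in a few lines. This is a simplified sketch under stated assumptions: `SKETCH_BUG_ON`, `SKETCH_DEBUG_STRICT` and the functions below are hypothetical stand-ins, not HAProxy's real `BUG_ON()` definition or scheduler code.

```c
#include <stdlib.h>

/* Simplified stand-in for a debug-only check. When SKETCH_DEBUG_STRICT
 * is not defined, the condition is dropped and never evaluated.
 */
#ifdef SKETCH_DEBUG_STRICT
#define SKETCH_BUG_ON(cond) do { if (cond) abort(); } while (0)
#else
#define SKETCH_BUG_ON(cond) do { } while (0)
#endif

static int sketch_queued; /* counts how many times the operation ran */

/* Hypothetical operation with a side effect; returns 1 on success. */
static int sketch_try_queue(void)
{
    sketch_queued++;
    return 1;
}

/* Fragile: the call vanishes entirely in non-strict builds. */
static void sketch_kill_fragile(void)
{
    SKETCH_BUG_ON(!sketch_try_queue());
}

/* Robust: perform the operation first, then check only its result. */
static void sketch_kill_robust(void)
{
    int ok = sketch_try_queue();

    SKETCH_BUG_ON(!ok);
    (void)ok; /* avoids an unused-variable warning in non-strict builds */
}
```

Built without `SKETCH_DEBUG_STRICT`, the fragile variant never queues anything while the robust one does. The commit itself took the simpler route of dropping the check altogether; the robust variant here is only one possible alternative.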

227 Commits