haproxy

mirror of https://git.haproxy.org/git/haproxy.git/ synced 2025-08-09 00:27:08 +02:00

Author	SHA1	Message	Date
Willy Tarreau	c633607c06	OPTIM: task: refine task classes default CPU bandwidth ratios Measures with unbounded execution ratios under 40000 concurrent connections at 100 Gbps showed the following CPU bandwidth distribution between task classes depending on traffic scenarios:

  scenario             TC0  TC1  TC2   observation
  -------------------+----+----+----+---------------------------
  TCP conn rate      :  29,  48,  23   221 kcps
  HTTP conn rate     :  29,  47,  24   200 kcps
  TCP byte rate      :   3,   5,  92   53 Gbps
  splicing byte rate :   5,  10,  85   70 Gbps
  H2 10k object      :  10,  21,  74   client-limited
  mixed traffic      :   4,   7,  89   21m+10: 11kcps, 36 Gbps

Thus it seems that we always need a bit of bulk tasks even for short connections, which seems to imply a suboptimal processing somewhere, and that there are roughly twice as many tasks (TC1=normal) as regular tasklets (TC0=urgent). This ratio stands even when data forwarding increases. So at first glance it looks reasonable to enforce the following ratio by default:
  - 16% for TL_URGENT
  - 33% for TL_NORMAL
  - 50% for TL_BULK
With this, the TCP conn rate climbs to ~225 kcps, and the mixed traffic pattern shows a more balanced 17kcps + 35 Gbps with 35ms CLI request time instead of 11kcps + 36 Gbps and 400 ms response time. The byte rate tests (1M objects) are not affected at all. This setting looks "good enough" to allow immediate merging, and could be refined later. It's worth noting that it resists very well to massive increase of run queue depth and maxpollevents: with the run queue depth changed from 200 to 10000 and maxpollevents to 10000 as well, the CLI's request time is back to the previous ~400ms, but the mixed traffic test reaches 52 Gbps + 7500 CPS, which was never met with the previous scheduling model, while the CLI used to show ~1 minute response time.
The reason is that in the bulk class it becomes possible to perform multiple rounds of recv+send and eliminate objects at once, increasing the L3 cache hit ratio, and keeping the connection count low, without degrading the latency too much. Another test with mixed traffic involving 2/3 splicing on huge objects and 1/3 on empty objects without touching any setting reports 51 Gbps + 5300 cps and 35ms CLI request time.	2020-01-31 07:09:10 +01:00
Willy Tarreau	a62917b890	MEDIUM: tasks: implement 3 different tasklet classes with their own queues We used to mix high latency tasks and low latency tasklets in the same list, and to even refill bulk tasklets there, causing some unfairness in certain situations (e.g. poll-less transfers between many connections saturating the machine with similarly-sized in and out network interfaces). This patch changes the mechanism to split the load into 3 lists depending on the task/tasklet's desired classes : - URGENT: this is mainly for tasklets used as deferred callbacks - NORMAL: this is for regular tasks - BULK: this is for bulk tasks/tasklets Arbitrary ratios of max_processed are picked from each of these lists in turn, with the ability to complete in one list from what was not picked in the previous one. After some quick tests, the following setup gave apparently good results both for raw TCP with splicing and for H2-to-H1 request rate: - 0 to 75% for urgent - 12 to 50% for normal - 12 to what remains for bulk Bulk is not used yet.	2020-01-30 18:59:33 +01:00
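The ratio-based picking described in a62917b890 can be sketched as follows. This is only one possible reading of the "0 to 75% / 12 to 50% / 12 to what remains" figures, not HAProxy's actual code: `split_budget()`, `struct picks` and the queue-length parameters are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Number of entries to run from each class in one scheduler pass.
 * Hypothetical structure, not part of HAProxy.
 */
struct picks {
    unsigned int urgent, normal, bulk;
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Split <max_processed> between the three classes, assuming:
 *  - urgent may take up to 75% of the budget,
 *  - normal may take up to 50%, but leaves ~12% for pending bulk work,
 *  - bulk takes whatever was not consumed by the two others.
 * q_* are the numbers of entries currently waiting in each list.
 */
struct picks split_budget(unsigned int max_processed,
                          unsigned int q_urgent,
                          unsigned int q_normal,
                          unsigned int q_bulk)
{
    struct picks p;
    unsigned int left = max_processed;
    unsigned int bulk_reserve = min_u(q_bulk, max_processed * 12 / 100);

    p.urgent = min_u(q_urgent, max_processed * 3 / 4);
    left -= p.urgent;

    /* urgent is capped at 75%, so at least 25% is still available here */
    assert(left >= bulk_reserve);
    p.normal = min_u(q_normal, min_u(max_processed / 2, left - bulk_reserve));
    left -= p.normal;

    /* bulk completes from what the previous classes did not pick */
    p.bulk = min_u(q_bulk, left);
    return p;
}
```

With a budget of 200 and all three lists saturated, this sketch yields 150 urgent, 26 normal and 24 bulk entries; when the urgent list is empty, the unused share flows to the two other classes.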
Willy Tarreau	4ffa0b526a	MINOR: tasks: move the list walking code to its own function New function run_tasks_from_list() will run over a tasklet list and will run all the tasks and tasklets it finds there within a limit of <max> that is passed in argument. This is preliminary work for scheduler QoS improvements.	2020-01-30 18:13:13 +01:00
Willy Tarreau	dd0e89a084	BUG/MAJOR: task: add a new TASK_SHARED_WQ flag to fix foreign requeuing Since 1.9 with commit `b20aa9eef3` ("MAJOR: tasks: create per-thread wait queues") a task bound to a single thread will not use locks when being queued or dequeued because the wait queue is assumed to be the owner thread's. But there exists a rare situation where this is not true: the health check tasks may be running on one thread waiting for a response, and may in parallel be requeued by another thread calling health_adjust() after detecting a response error in traffic when "observe l7" is set, and "fastinter" is lower than "inter", requiring to shorten the running check's timeout. In this case, the task being requeued was present in another thread's wait queue, thus opening a race during task_unlink_wq(), and gets requeued into the calling thread's wait queue instead of the running one's, opening a second race here. This patch aims at protecting against the risk of calling task_unlink_wq() from one thread while the task is queued on another thread, hence unlocked, by introducing a new TASK_SHARED_WQ flag. This new flag indicates that a task's position in the wait queue may be adjusted by other threads than the one currently executing it. This means that such WQ manipulations must be performed under a lock. There are two types of such tasks: - the global ones, using the global wait queue (technically speaking, those whose thread_mask has at least 2 bits set). - some local ones, which for now will be placed into the global wait queue as well in order to benefit from its lock. The flag is automatically set on initialization if the task's thread mask indicates more than one thread. The caller must also set it if it intends to let other threads update the task's expiration delay (e.g. delegated I/Os), or if it intends to change the task's affinity over time as this could lead to the same situation.
Right now only the situation described above seems to be affected by this issue, and it is very difficult to trigger, and even then, will often have no visible effect beyond stopping the checks for example once the race is met. On my laptop it is feasible with the following config, chained to httpterm:

  global
      maxconn 400    # provoke FD errors, calling health_adjust()

  defaults
      mode http
      timeout client 10s
      timeout server 10s
      timeout connect 10s

  listen px
      bind :8001
      option httpchk /?t=50
      server sback 127.0.0.1:8000 backup
      server-template s 0-999 127.0.0.1:8000 check port 8001 inter 100 fastinter 10 observe layer7

This patch will automatically address the case for the checks because check tasks are created with multiple threads bound and will get the TASK_SHARED_WQ flag set. If in the future more tasks need to rely on this (multi-threaded muxes for example) and the use of the global wait queue becomes a bottleneck again, then it should not be too difficult to place locks on the local wait queues and queue the task on its bound thread. This patch needs to be backported to 2.1, 2.0 and 1.9. It depends on previous patch "MINOR: task: only check TASK_WOKEN_ANY to decide to requeue a task". Many thanks to William Dauchy for providing detailed traces allowing to spot the problem.	2019-12-19 14:42:22 +01:00
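The rule stated in dd0e89a084 (set the flag automatically when the thread mask has more than one bit) can be illustrated with a minimal sketch. The flag's numeric value, `struct task_sketch` and the function names below are assumptions made for the example and do not reproduce HAProxy's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit value; the real flag's value is not given in the log. */
#define TASK_SHARED_WQ 0x0008u

/* Minimal stand-in for a task: only the fields needed by the example. */
struct task_sketch {
    unsigned int state;
    uint64_t thread_mask;   /* one bit per thread allowed to run the task */
};

/* A mask has at least two bits set when clearing its lowest set bit
 * still leaves something behind.
 */
int mask_has_many_threads(uint64_t mask)
{
    return (mask & (mask - 1)) != 0;
}

/* Initialize the task and set TASK_SHARED_WQ automatically for
 * multi-thread masks, as the commit message describes.
 */
void task_init_sketch(struct task_sketch *t, uint64_t thread_mask)
{
    assert(thread_mask != 0);
    t->state = 0;
    t->thread_mask = thread_mask;
    if (mask_has_many_threads(thread_mask))
        t->state |= TASK_SHARED_WQ;
}

/* Wait-queue updates on such a task must be done under a lock. */
int wq_needs_lock(const struct task_sketch *t)
{
    return (t->state & TASK_SHARED_WQ) != 0;
}
```

A caller that lets other threads adjust a single-thread task's expiration would have to set the flag itself after initialization, which is the second case the message mentions.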
Willy Tarreau	8fe4253bf6	MINOR: task: only check TASK_WOKEN_ANY to decide to requeue a task After processing a task, its RUNNING bit is cleared and at the same time we check for other bits to decide whether to requeue the task or not. It happens that we only want to check the TASK_WOKEN_* bits, because : - TASK_RUNNING was just cleared - TASK_GLOBAL and TASK_QUEUE cannot be set yet as the task was running, preventing it from being requeued It's important not to catch yet undefined flags there because it would prevent addition of new task flags. This also shows more clearly that waking a task up with flags 0 is not something safe to do as the task will not be woken up if it's already running.	2019-12-19 14:42:22 +01:00
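The test described in 8fe4253bf6 can be sketched with C11 atomics. The flag values and the function name are illustrative assumptions, and `TASK_FUTURE_FLAG` merely stands for any flag that might be added later.

```c
#include <stdatomic.h>

/* Illustrative bit values, assumed for this example only. */
#define TASK_RUNNING     0x0001u
#define TASK_FUTURE_FLAG 0x0010u   /* placeholder for a not yet defined flag */
#define TASK_WOKEN_IO    0x0100u
#define TASK_WOKEN_ANY   0xff00u   /* mask covering all wakeup reasons */

/* Clear TASK_RUNNING once the handler has returned and report whether
 * the task was woken up again while it was running. Only the wakeup
 * bits are inspected, so unrelated flags cannot trigger a requeue.
 */
int must_requeue_after_run(atomic_uint *state)
{
    unsigned int prev = atomic_fetch_and(state, ~TASK_RUNNING);

    return (prev & TASK_WOKEN_ANY) != 0;
}
```

Testing the whole state for non-zero instead would requeue a task as soon as any new flag is set, which is the hazard the commit message warns about.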
Willy Tarreau	c49ba52524	MINOR: tasks: split wake_expired_tasks() in two parts to avoid useless wakeups We used to have wake_expired_tasks() wake up tasks and return the next expiration delay. The problem this causes is that we have to call it just before poll() in order to consider latest timers, but this also means that we don't wake up all newly expired tasks upon return from poll(), which thus systematically requires a second poll() round. This is visible when running any scheduled task like a health check, as there are systematically two poll() calls, one with the interval, nothing is done after it, and another one with a zero delay, and the task is called:

  listen test
      bind *:8001
      server s1 127.0.0.1:1111 check

  09:37:38.200959 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8696843}) = 0
  09:37:38.200967 epoll_wait(3, [], 200, 1000) = 0
  09:37:39.202459 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8712467}) = 0
  >> nothing run here, as the expired task was not woken up yet.
  09:37:39.202497 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8715766}) = 0
  09:37:39.202505 epoll_wait(3, [], 200, 0) = 0
  09:37:39.202513 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8719064}) = 0
  >> now the expired task was woken up
  09:37:39.202522 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
  09:37:39.202537 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
  09:37:39.202565 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
  09:37:39.202577 setsockopt(7, SOL_TCP, TCP_QUICKACK, [0], 4) = 0
  09:37:39.202585 connect(7, {sa_family=AF_INET, sin_port=htons(1111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
  09:37:39.202659 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=7}}) = 0
  09:37:39.202673 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8814713}) = 0
  09:37:39.202683 epoll_wait(3, [{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}], 200, 1000) = 1
  09:37:39.202693 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8818617}) = 0
  09:37:39.202701 getsockopt(7, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
  09:37:39.202715 close(7) = 0

Let's instead split the function in two parts:
  - the first part, wake_expired_tasks(), called just before process_runnable_tasks(), wakes up all expired tasks; it doesn't compute any timeout.
  - the second part, next_timer_expiry(), called just before poll(), only computes the next timeout for the current thread.

Thanks to this, all expired tasks are properly woken up when leaving poll, and each poll call's timeout remains up to date:

  09:41:16.270449 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10223556}) = 0
  09:41:16.270457 epoll_wait(3, [], 200, 999) = 0
  09:41:17.270130 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10238572}) = 0
  09:41:17.270157 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
  09:41:17.270194 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
  09:41:17.270204 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
  09:41:17.270216 setsockopt(7, SOL_TCP, TCP_QUICKACK, [0], 4) = 0
  09:41:17.270224 connect(7, {sa_family=AF_INET, sin_port=htons(1111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
  09:41:17.270299 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=7}}) = 0
  09:41:17.270314 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10337841}) = 0
  09:41:17.270323 epoll_wait(3, [{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}], 200, 1000) = 1
  09:41:17.270332 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10341860}) = 0
  09:41:17.270340 getsockopt(7, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
  09:41:17.270367 close(7) = 0

This may be backported to 2.1 and 2.0 though it's unlikely to bring any user-visible improvement except to clarify debugging.	2019-12-11 09:42:58 +01:00
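The two-part split from c49ba52524 can be sketched on a toy timer queue. The sorted array, `struct timer_queue`, `TIMER_NONE` and both function names are simplifications invented for this example; HAProxy's real functions operate on its own wait-queue trees.

```c
#include <stddef.h>

#define TIMER_NONE (-1)   /* hypothetical marker: no timer is armed */

/* Toy wait queue: expiration dates in milliseconds, sorted ascending. */
struct timer_queue {
    int expiry[8];
    size_t count;
};

/* Part 1, run before processing runnable tasks: "wake up" (here, simply
 * remove) every entry whose date has passed. It computes no timeout.
 * Returns the number of entries woken up.
 */
size_t wake_expired_sketch(struct timer_queue *q, int now)
{
    size_t woken = 0;
    size_t i;

    while (woken < q->count && q->expiry[woken] <= now)
        woken++;
    for (i = woken; i < q->count; i++)
        q->expiry[i - woken] = q->expiry[i];
    q->count -= woken;
    return woken;
}

/* Part 2, run just before poll(): only compute how long the poller may
 * sleep until the next timer fires.
 */
int next_timer_delay_sketch(const struct timer_queue *q, int now)
{
    if (q->count == 0)
        return TIMER_NONE;
    return q->expiry[0] > now ? q->expiry[0] - now : 0;
}
```

Because the wakeup pass runs right after the poller returns, an expired entry is handled in the same loop iteration and no extra zero-delay poll is needed, which matches the second trace in the message.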
Olivier Houchard	06910464dd	MEDIUM: task: Split the tasklet list into two lists. As using an mt_list for the tasklet list is costly, instead use a regular list, but add an mt_list for tasklets woken up by other threads, to be run on the current thread. At the beginning of process_runnable_tasks(), we just take the new list, and merge it into the task_list. This should give us performance comparable to before we started using a mt_list, but allow us to use tasklet_wakeup() from other threads.	2019-10-11 16:37:41 +02:00
Olivier Houchard	07308677dd	BUG/MEDIUM: tasks: Don't forget to decrement tasks_run_queue. When executing tasks, don't forget to decrement tasks_run_queue once we popped one task from the task_list. tasks_run_queue used to be decremented by __tasklet_remove_from_tasklet_list(), but we now call MT_LIST_POP().	2019-10-03 14:55:40 +02:00
Willy Tarreau	d022e9c98b	MINOR: task: introduce a thread-local "sched" variable for local scheduler stuff The aim is to gather all scheduler information related to the current thread. It simply points to task_per_thread[tid] without having to perform the operation each time. We save around 1.2 kB of code on performance sensitive paths and increase the request rate by almost 1%.	2019-09-24 11:23:30 +02:00
Willy Tarreau	d66d75656e	MINOR: task: split the tasklet vs task code in process_runnable_tasks() There are a number of tests there which are enforced on tasklets while they will never apply (various handlers, destroyed task or not, arguments, results, ...). Instead let's have a single TASK_IS_TASKLET() test and call the tasklet processing function directly, skipping all the rest. It now appears visible that the only unneeded code is the update to curr_task that is never used for tasklets, except for opportunistic reporting in the debug handler, which can only catch si_cs_io_cb, which in practice doesn't appear in any report so the extra cost incurred there is pointless. This change alone removes 700 bytes of code, mostly in process_runnable_tasks() and increases the performance by about 1%.	2019-09-24 11:23:30 +02:00
Willy Tarreau	4c1e1ad6a8	CLEANUP: task: cache the task_per_thread pointer In process_runnable_tasks() we perform a lot of dereferences to task_per_thread[tid] but tid is thread_local and the compiler cannot know that it doesn't change so this results in making lots of thread local accesses and array dereferences. By just keeping a copy pointer of this, we let the compiler optimize the code. Just doing this has reduced process_runnable_tasks() by 124 bytes in the fast path. Doing the same in wake_expired_tasks() results in 16 extra bytes saved.	2019-09-24 11:23:30 +02:00
Willy Tarreau	9b48c629f2	CLEANUP: task: remove impossible test In process_runnable_tasks(), after the task's process() function returns, we used to check if the return is not NULL and is not a tasklet, to update profiling measurements. This is useless since only tasks can return non-null here. Let's remove this useless test.	2019-09-24 11:23:30 +02:00
Olivier Houchard	ff1e9f39b9	MEDIUM: tasklets: Make the tasklet list a struct mt_list. Change the tasklet code so that the tasklet list is now a mt_list. That means that tasklets now have an associated tid, for the thread they are expected to run on, and any thread can now call tasklet_wakeup() for that tasklet. One can change the associated tid with tasklet_set_tid().	2019-09-23 18:16:08 +02:00
Olivier Houchard	859dc80f94	MEDIUM: list: Separate "locked" list from regular list. Instead of using the same type for regular linked lists and "autolocked" linked lists, use a separate type, "struct mt_list", for the autolocked one, and introduce a set of macros, similar to the LIST_* macros, with the MT_ prefix. When we use the same entry for both regular list and autolocked list, as is done for the "list" field in struct connection, we now have to explicitly cast it to struct mt_list when using MT_ macros.	2019-09-23 18:16:08 +02:00
Willy Tarreau	64e6012eb9	MINOR: task: introduce work lists Sometimes we need to delegate some list processing to a function running on another thread. In this case the list element will simply be queued into a dedicated self-locked list and the task responsible for this list will be woken up, calling the associated function which will run over the list. This is what work_list does. Such lists will be dedicated to a limited type of work but will significantly ease such remote handling. A function is provided to create these per-thread lists, their tasks and to properly bind each task to a distinct thread, so that the caller only has to store the resulting pointer to the start of the structure. These structures should not be abused though as each head will consume 4 pointers per thread, hence 32 bytes per thread or 2 kB for 64 threads.	2019-07-12 09:07:48 +02:00
Willy Tarreau	bd20a9dd4e	BUG: tasks: fix bug introduced by latest scheduler cleanup In commit `86eded6c6` ("CLEANUP: tasks: rename task_remove_from_tasklet_list() to tasklet_remove_*") which consisted in removing the casts between tasks and tasklet, I was a bit too fast to believe that we only saw tasklets in this function since process_runnable_tasks() also uses it with tasks under a cast. So removing the bookkeeping on task_list_size was not appropriate. Bah, the joy of casts which hide the real thing... This patch does two things at once to address this mess once and for all: - it restores the decrement of task_list_size when it's a real task, but moves it to process_runnable_tasks() since it's the only place where it's allowed to call it with a task - it moves the increment there as well and renames task_insert_into_tasklet_list() to tasklet_insert_into_tasklet_list() for obvious consistency reasons. This way the increment/decrement of task_list_size is made at the only places where the cast is enforced, so it is less likely to be missed. The comments on top of these functions were updated to reflect that they are only supposed to be used with tasklets and that the caller is responsible for keeping task_list_size up to date if it decides to enforce a task there. Now we don't have to worry anymore about how these functions work outside of the scheduler, which is better longterm-wise. Thanks to Christopher for spotting this mistake. No backport is needed.	2019-06-14 18:16:19 +02:00
Willy Tarreau	86eded6c69	CLEANUP: tasks: rename task_remove_from_tasklet_list() to tasklet_remove_* The function really only operates on tasklets, its arguments are always tasklets cast as tasks to match the function's type, to be cast back to a struct tasklet. Let's rename it to tasklet_remove_from_tasklet_list(), take a struct tasklet, and get rid of the undesired task casts.	2019-06-14 14:57:03 +02:00
Willy Tarreau	5598d171b3	BUILD: task: fix a build warning when threads are disabled The __decl_hathreads() macro will leave a lone semicolon marking the end of variable declarations, resulting in a warning if threads are disabled. Let's simply swap it with the last variable. Thanks to Ilya Shipitsin for reporting this issue. No backport is needed.	2019-06-04 17:18:40 +02:00
Olivier Houchard	cfbb3e6560	MEDIUM: tasks: Get rid of active_tasks_mask. Remove the active_tasks_mask variable, we can deduce if we have work to do by other means, and it is costly to maintain. Instead, introduce a new function, thread_has_tasks(), that returns non-zero if there are tasks scheduled for the thread, zero otherwise.	2019-05-29 21:53:37 +02:00
Willy Tarreau	1e928c074b	MEDIUM: task: don't grab the WR lock just to check the WQ When profiling locks, it appears that the WQ's lock has become the most contended one, despite the WQ being split by thread. The reason is that each thread takes the WQ lock before checking if it does have something to do. In practice the WQ almost only contains health checks and rare tasks that can be scheduled anywhere, so this is a real waste of resources. This patch proceeds differently. Now that the WQ's lock was turned to RW lock, we proceed in 3 phases :
  1) locklessly check for the queue's emptiness
  2) take an R lock to retrieve the first element and check if it is expired. This way most visits are performed with an R lock to find and return the next expiration date.
  3) if one expiration is found, we perform the WR-locked lookup as usual.
As a result, on a one-minute test involving 8 threads and 64 streams at 1.3 million ctxsw/s, before this patch the lock profiler reported this :

  Stats about Lock TASK_WQ:
      # write lock  : 1125496
      # write unlock: 1125496 (0)
      # wait time for write     : 263.143 msec
      # wait time for write/lock: 233.802 nsec
      # read lock   : 0
      # read unlock : 0 (0)
      # wait time for read      : 0.000 msec
      # wait time for read/lock : 0.000 nsec

And after :

  Stats about Lock TASK_WQ:
      # write lock  : 173
      # write unlock: 173 (0)
      # wait time for write     : 0.018 msec
      # wait time for write/lock: 103.988 nsec
      # read lock   : 1072706
      # read unlock : 1072706 (0)
      # wait time for read      : 60.702 msec
      # wait time for read/lock : 56.588 nsec

Thus the contention was divided by 4.3.	2019-05-28 19:15:44 +02:00
Willy Tarreau	ef28dc11e3	MINOR: task: turn the WQ lock to an RW_LOCK For now it's exclusively used as a write lock though, thus it remains 100% equivalent to the spinlock it replaces.	2019-05-28 19:15:44 +02:00
Willy Tarreau	e6a02fa65a	MINOR: threads: add a "stuck" flag to the thread_info struct This flag is constantly cleared by the scheduler and will be set by the watchdog timer to detect stuck threads. It is also set by the "show threads" command so that it is easy to spot if the situation has evolved between two subsequent calls : if the first "show threads" shows no stuck thread and the second one shows such a stuck thread, it indicates that this thread didn't manage to make any forward progress since the previous call, which is extremely suspicious.	2019-05-22 11:50:48 +02:00
Willy Tarreau	01f3489752	MINOR: task: put barriers after each write to curr_task This one may be watched by signal handlers, we don't want the compiler to optimize its assignment away at the end of the loop and leave some wandering pointers there.	2019-05-17 17:16:20 +02:00
Willy Tarreau	bc13bec548	MINOR: activity: report context switch counts instead of rates It's not logical to report context switch rates per thread in show activity because everything else is a counter and it's not even possible to compare values. Let's only report counts. Further, this simplifies the scheduler's code.	2019-04-30 14:55:18 +02:00
Willy Tarreau	d9add3acc8	MINOR: activity: make the profiling status per thread and not global In order to later support automatic profiling turn on/off, we need to have it per-thread. We're keeping the global option to know whether to turn it on or off, but the profiling status is now set per thread. We're updating the status in activity_count_runtime() which is called before entering poll(). The reason is that we'll extend this with run time measurement when deciding to automatically turn it on or off.	2019-04-25 17:26:19 +02:00
Willy Tarreau	0212fadd65	MINOR: tasks/activity: report the context switch and task wakeup rates It's particularly useful to spot runaway tasks to see this. The context switch rate covers all tasklet calls (tasks and I/O handlers) while the task wakeups only covers tasks picked from the run queue to be executed. High values there will indicate either intense traffic or a bug that makes a task go wild.	2019-04-24 16:04:23 +02:00
Olivier Houchard	ed1a6a0d8a	MEDIUM: tasks: Use __ha_barrier_store after modifying global_tasks_mask. Now that we no longer use atomic operations to update global_tasks_mask, as it's always modified while holding the TASK_RQ_LOCK, we have to use __ha_barrier_store() instead of __ha_barrier_atomic_store() to ensure any modification of global_tasks_mask is seen before modifying active_tasks_mask. This should be backported to 1.9.	2019-04-18 14:14:10 +02:00
Olivier Houchard	1cfac37b65	MEDIUM: tasks: Don't account a destroyed task as a runned task. In process_runnable_tasks(), if the task we're about to run has been destroyed, and should be freed, don't account for it in the number of tasks we ran. We're only allowed a maximum number of tasks to run per call to process_runnable_tasks(), and freeing one shouldn't take the slot of a valid task.	2019-04-18 10:11:13 +02:00
Olivier Houchard	3f795f76e8	MEDIUM: tasks: Merge task_delete() and task_free() into task_destroy(). task_delete() was never used without calling task_free() just after, and task_free() was only used on error paths to destroy a just-created task, so merge them into task_destroy(), that will remove the task from the wait queue, and make sure the task is either destroyed immediately if it's not in the run queue, or destroyed when it's supposed to run.	2019-04-18 10:10:04 +02:00
Willy Tarreau	03dd029a5b	CLEANUP: task: remain consistent when using the task's handler A pointer "process" is assigned the task's handler in process_runnable_tasks(), we have no reason to use t->process right after it is assigned.	2019-04-17 22:32:27 +02:00
Olivier Houchard	0c7a4b6371	MINOR: tasks: Don't set the TASK_RUNNING flag when adding in the tasklet list. Now that TASK_QUEUED is enforced, there's no need to set TASK_RUNNING when removing the task from the runqueue to add it to the tasklet list. The flag will only be set right before we run the task.	2019-04-17 19:28:01 +02:00
Olivier Houchard	de82aeaa26	BUG/MEDIUM: tasks: Make sure we modify global_tasks_mask with the rq_lock. When modifying global_tasks_mask, make sure we hold the rq_lock, or we might remove the bit while it has been re-set by somebody else, and we may not be woken up when needed.	2019-04-17 19:28:01 +02:00
Willy Tarreau	b038007ae8	BUG/MEDIUM: tasks: Make sure we set TASK_QUEUED before adding a task to the rq. Make sure we set TASK_QUEUED in every case before adding the task to the run queue. task_wakeup() now checks if either TASK_QUEUED or TASK_RUNNING is set, and if neither is set, add TASK_QUEUED and effectively add the task to the runqueue. No longer use __task_wakeup() anywhere except in task_wakeup(), always use task_wakeup() instead. With the old code, process_runnable_tasks() may re-add a task in the runqueue without setting the TASK_QUEUED flag, and there were race conditions that could lead to a task having the TASK_QUEUED flag but not in the runqueue, thus being unschedulable. This should be backported to 1.9.	2019-04-17 19:28:01 +02:00
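The check-then-set step described in b038007ae8 has to be atomic so that two threads waking the same task cannot both insert it. A minimal sketch with a C11 compare-and-swap loop follows; the flag values and `try_mark_queued()` are assumptions for illustration, and recording the wakeup reason is deliberately left out.

```c
#include <stdatomic.h>

/* Illustrative bit values, assumed for this example only. */
#define TASK_RUNNING 0x0001u
#define TASK_QUEUED  0x0004u

/* Try to claim the right to insert the task into the run queue.
 * Returns 1 if neither TASK_QUEUED nor TASK_RUNNING was set and
 * TASK_QUEUED has just been set by this caller, 0 otherwise.
 */
int try_mark_queued(atomic_uint *state)
{
    unsigned int old = atomic_load(state);

    do {
        if (old & (TASK_QUEUED | TASK_RUNNING))
            return 0;   /* already queued, or will requeue itself */
        /* on failure, <old> is reloaded with the current value */
    } while (!atomic_compare_exchange_weak(state, &old, old | TASK_QUEUED));

    return 1;           /* the caller must now really insert the task */
}
```

Only the caller that gets 1 performs the insertion, so the flag and the actual presence in the run queue cannot diverge in the way the commit message describes.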
Willy Tarreau	3466e3cdcb	BUILD: task/thread: fix single-threaded build of task.c As expected, commit `cde7902ac` ("MEDIUM: tasks: improve fairness between the local and global queues") broke the build with threads disabled, and I forgot to rerun this test before committing. No backport is needed.	2019-04-15 18:52:40 +02:00
Willy Tarreau	c8da044b41	MINOR: tasks: restore the lower latency scheduling when niced tasks are present In the past we used to reduce the number of tasks consulted at once when some niced tasks were present in the run queue. This was dropped in 1.8 when the scheduler started to take batches. With the recent fixes it now becomes possible to restore this behaviour which guarantees a better latency between tasks when niced tasks are present. Thanks to this, with the default number of 200 for tune.runqueue-depth, with a parasitic load of 14000 requests per second, nice 0 gives 14000 rps, nice 1024 gives 12000 rps and nice -1024 gives 16000 rps. The amplitude widens if the runqueue depth is lowered.	2019-04-15 09:50:56 +02:00
Willy Tarreau	2d1fd0a0d2	MEDIUM: tasks: only base the nice offset on the run queue depth The offset calculated for the nice value used to be wrong for a long time and got even worse when the improved multi-thread scheduler was implemented because it continued to rely on the run queue size, which became irrelevant given that we extract tasks in batches, so the run queue size moves following a sawtooth form. However the offset much better reflects insertion positions in the queue, so it's worth dropping this rq_size component of the equation. Last point, due to the batches made of runqueue-depth entries at once, the higher the depth, the lower the effect of the nice setting since values are picked together in batches and placed into a list. An intuitive approach consists in multiplying the nice value with the batch size to allow tasks to participate to a different batch. And experimentation shows that this works pretty well. With a runqueue-depth of 16 and a parasitic load of 16000 requests per second on 100 streams, a default nice of 0 shows 16000 requests per second for nice 0, 22000 for nice -1024 and 10000 for nice 1024. The difference is even bigger with a runqueue depth of 5. At 200 however it's much smoother (16000-22000).	2019-04-15 09:50:56 +02:00
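The "nice value multiplied by the batch size" idea from 2d1fd0a0d2 amounts to a one-line computation, shown below as a sketch. The function name, the use of a plain `long` key and the `rq_ticks` parameter are assumptions; the real scheduler uses its own key type and tick counter.

```c
/* Compute a run-queue insertion key: the current insertion counter plus
 * an offset proportional to the nice value and to the batch size, so that
 * a niced task lands in an earlier or later batch. Hypothetical helper.
 */
long rq_key_sketch(long rq_ticks, int nice, int runqueue_depth)
{
    return rq_ticks + (long)nice * (long)runqueue_depth;
}
```

With a depth of 16, a task at nice -1024 is placed 16384 positions ahead of a nice 0 task inserted at the same tick, and a task at nice 1024 the same distance behind.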
Willy Tarreau	cde7902ac9	MEDIUM: tasks: improve fairness between the local and global queues Tasks allowed to run on multiple threads, as well as those scheduled by one thread to run on another one pass through the global queue. The local queues only see tasks scheduled by one thread to run on itself. The tasks extracted from the global queue are transferred to the local queue when they're picked by one thread. This causes a priority issue because the global tasks experience a priority contest twice while the local ones experience it only once. Thus if a task returns still running, it's immediately reinserted into the local run queue and runs much faster than the ones coming from the global queue. Till 1.9 the tasks going through the global queue were mostly : - health checks initialization - queue management - listener dequeue/requeue These ones are moderately sensitive to unfairness so it was not that big an issue. Since 2.0-dev2 with the multi-queue accept, tasks are scheduled to remote threads on most accept() and it becomes fairly visible under load that the accept slows down, even for the CLI. This patch remedies this by consulting both the local and the global run queues in parallel and by always picking the task whose deadline is the earliest. This guarantees to maintain an excellent fairness between the two queues and removes the cascade effect experienced by the global tasks. Now the CLI always continues to respond quickly even in presence of expensive tasks running for a long time. This patch may possibly be backported to 1.9 if some scheduling issues are reported but at this time it doesn't seem necessary.	2019-04-15 09:50:56 +02:00
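Consulting both run queues in parallel and always taking the earliest deadline, as cde7902ac9 describes, behaves like one step of a two-way merge. The sketch below uses sorted arrays instead of trees and ignores key wrapping; `struct rq_sketch` and `pick_next()` are hypothetical names.

```c
#include <stddef.h>

/* Toy run queue: deadlines sorted ascending, <next> is the read cursor. */
struct rq_sketch {
    const int *deadline;
    size_t count;
    size_t next;
};

/* Pick the entry with the earliest deadline among the heads of the local
 * and global queues. Returns 0 when it came from the local queue, 1 when
 * it came from the global queue, and -1 when both are empty. The chosen
 * deadline is stored into <out>.
 */
int pick_next(struct rq_sketch *local, struct rq_sketch *global, int *out)
{
    int has_l = local->next < local->count;
    int has_g = global->next < global->count;

    if (!has_l && !has_g)
        return -1;

    if (has_l && (!has_g ||
                  local->deadline[local->next] <= global->deadline[global->next])) {
        *out = local->deadline[local->next++];
        return 0;
    }

    *out = global->deadline[global->next++];
    return 1;
}
```

Since each pick compares the two heads, a global task never has to wait behind a whole batch of later local tasks, which removes the double priority contest mentioned in the message.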
Willy Tarreau	24f382f555	CLEANUP: task: do not export rq_next anymore This one hasn't been used anymore since the scheduler changes after 1.8 but it kept being exported and maintained up to date while it's always reset when scanning the trees. Let's stop exporting it and updating it.	2019-04-15 09:50:56 +02:00
Willy Tarreau	587a8130b1	BUG/MINOR: tasks: make sure the first task to be queued keeps its nice value The run queue offset computed from the nice value depends on the run queue size, but for the first task to enter the run queue, this size is zero and the task gets queued just as if its nice value was zero as well. This is problematic for example for the CLI socket if another higher priority task gets queued immediately after as it can steal its place. This patch simply adds one to the rq_size value to make sure the nice is never multiplied by zero. The way the offset is calculated is questionable anyway these days, since with the newer scheduler it seems that just using the nice value as an offset should work (possibly damped by the task's number of calls). This fix must be backported to 1.9. It may possibly be backported to older versions if it proves to make the CLI more interactive.	2019-04-12 15:54:02 +02:00
Willy Tarreau	f8bce3125e	BUG/MEDIUM: task/threads: address a fairness issue between local and global tasks It is possible to hit a fairness issue in the scheduler when a local task runs for a long time (i.e. process_stream() returns running), and a global task wants to run on the same thread and remains in the global queue. What happens in this case is that the condition to extract tasks from the global queue will rarely be satisfied for very low task counts since whatever non-null queue size multiplied by a thread count >1 is always greater than the small remaining number of tasks in the queue. In theory another thread should pick the task but we do have some mono threaded tasks in the global queue as well during inter-thread wakeups. Note that this can only happen with task counts lower than the thread counts, typically one task in each queue for more than two threads. This patch works around the problem by allowing a very small unfairness, making sure that we can always pick at least one task from the global queue even if there is already one in the local queue. A better approach will consist in scanning the two trees in parallel and always pick the best task. This will be more complex and will constitute a separate patch. This fix must be backported to 1.9.	2019-04-12 15:53:43 +02:00
Willy Tarreau	e73256fd2a	BUG/MEDIUM: task/h2: add an idempotent task removal function Previous commit `3ea351368` ("BUG/MEDIUM: h2: Remove the tasklet from the task list if unsubscribing.") uncovered an issue which needs to be addressed in the scheduler's API. The function task_remove_from_task_list() was initially designed to remove a task from the running tasklet list from within the scheduler, and had to be used in h2 to abort pending I/O events. However this function was not designed to be idempotent, occasionally causing a double removal from the tasklet list, with the second doing nothing but affecting the apparent tasks count and making haproxy use 100% CPU on some tests consisting in stopping the client during some transfers. The h2_unsubscribe() function can sometimes be called upon stream exit after an error where the tasklet was possibly already removed, so it. This patch does 2 things : - it renames task_remove_from_task_list() to __task_remove_from_tasklet_list() to discourage users from calling it. Also note the fix in the naming since it's a tasklet list and not a task list. This function is still used from the scheduler. - it adds a new, idempotent, task_remove_from_tasklet_list() function which does nothing if the task is already not in the tasklet list. This patch will need to be backported where the commit above is backported.	2019-03-25 18:02:54 +01:00
Olivier Houchard	1b32790324	BUG/MEDIUM: tasks: Make sure we wake sleeping threads if needed. When waking a task on a remote thread, we currently check 1) if this thread was sleeping, and 2) if it was already marked as active before writing to its pipe. Unfortunately this doesn't always work as desired because only one thread from the mask is woken up, while the active_tasks_mask indicates all eligible threads for this task. As a result, if one multi-thread task (e.g. a health check) wakes up to run on any thread, then an accept() dispatches an incoming connection on thread 2, this thread will already have its bit set in active_tasks_mask because of the previous wakeup and will not be woken up. This is easily noticeable on 2.0-dev by injecting on a multi-threaded listener with a single connection at a time while health checks are running quickly in the background : the injection runs slowly with random response times (the poll timeouts). In 1.9 it affects the dequeuing of server connections, which occasionally experience pauses if multiple threads share the same queue. The correct solution consists in adjusting the sleeping_thread_mask when waking another thread up. This mask reflects threads that are sleeping, hence that need to be signaled to wake up. Threads with a bit in active_tasks_mask already don't have their sleeping_thread_mask bit set before polling so the principle remains consistent. And by doing so we can remove the old_active_mask field. This should be backported to 1.9.	2019-03-15 14:09:39 +01:00
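The split between a raw removal and an idempotent wrapper described in e73256fd2a can be illustrated with a generic doubly-linked list. Every name below (`tl_remove_unchecked`, `tl_remove`, `struct tl`) is hypothetical; this is a minimal sketch of the idea and not the scheduler's actual code.

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *prev, *next; };
struct tl   { struct node head; int count; };

static void tl_init(struct tl *l)     { l->head.prev = l->head.next = &l->head; l->count = 0; }
static void node_init(struct node *n) { n->prev = n->next = n; } /* self-linked means detached */
static int  node_is_queued(const struct node *n) { return n->next != n; }

/* Append a detached node at the tail of the list. */
static void tl_add(struct tl *l, struct node *n)
{
    assert(!node_is_queued(n));
    n->prev = l->head.prev;
    n->next = &l->head;
    l->head.prev->next = n;
    l->head.prev = n;
    l->count++;
}

/* Raw removal: the caller must guarantee that the node is queued,
 * otherwise the count is decremented once too often. */
static void tl_remove_unchecked(struct tl *l, struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    node_init(n);
    l->count--;
}

/* Idempotent removal: does nothing if the node is already detached. */
static void tl_remove(struct tl *l, struct node *n)
{
    if (node_is_queued(n))
        tl_remove_unchecked(l, n);
}
```

Calling `tl_remove()` twice on the same node leaves the count untouched the second time, which is the property the commit message asks for.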
Olivier Houchard	4c28328572	MEDIUM: task: Use the new _HA_ATOMIC_* macros. Use the new _HA_ATOMIC_* macros and add barriers where needed.	2019-03-11 17:02:37 +01:00
Olivier Houchard	d2b5d16187	MEDIUM: various: Use __ha_barrier_atomic* when relevant. When protecting data modified by atomic operations, use __ha_barrier_atomic* to avoid unneeded barriers on x86.	2019-03-11 17:02:37 +01:00
Willy Tarreau	155acffc13	BUG/MINOR: task: close a tiny race in the inter-thread wakeup __task_wakeup() takes care of a small race that exists between threads, but it uses a store barrier that is not sufficient since apparently the state read after clearing the leaf_p pointer sometimes is incorrect. This results in missed wakeups between threads competing at a high rate. Let's use a full barrier instead to serialize the operations. This may be backported to 1.9 though it's extremely unlikely that this bug will ever manifest itself there.	2019-02-04 14:21:35 +01:00
Willy Tarreau	1ee55fddea	MEDIUM: tasks: check the global task mask instead of the thread number When deciding whether to scan the global run queue or not, we currently check the configured threads number, and if it's 1 we skip the queue since it's not supposed to be used. However when running with a master process and multiple threads in the workers, the master will turn this number back to 1 while some task wakeups might possibly have set bits in the global tasks mask, thus causing active_tasks_mask to have one bit permanently set, preventing the process from sleeping. Instead of checking global.nbthread, let's check for the current thread's bit in global_tasks_mask. First it will make this part of the code more consistent, working like a test and set operation, it will solve the issue with master+nbthread and as a bonus it will save a lock/unlock for each scheduler call when the thread doesn't have a task in the global run queue.	2018-12-14 15:49:45 +01:00
William Lallemand	b582339079	BUG/MEDIUM: mworker: fix several typos in mworker_cleantasks() Commit `27f3fa5` ("BUG/MEDIUM: mworker: stop every tasks in the master") used MAX_THREADS as a mask instead of MAX_THREADS_MASK to clean the global run queue, and used rq_next (global variable) instead of next_rq. Renamed next_rq as tmp_rq and next_wq as tmp_wq to avoid confusion. No backport needed.	2018-12-06 15:38:24 +01:00
William Lallemand	27f3fa56f5	BUG/MEDIUM: mworker: stop every tasks in the master The master is not supposed to run (at the moment) any task before the polling loop, the created tasks should be run only in the workers but in the master they should be disabled or removed. No backport needed.	2018-12-06 14:12:58 +01:00
Willy Tarreau	b6b3df3ed3	MEDIUM: initcall: use initcalls for a few initialization functions signal_init(), init_log(), init_stream(), and init_task() all used to only preset some values and lists. This needs to be done very early to provide a reliable interface to all other users. The calls used to be explicit in haproxy.c:init(). Now they're placed in initcalls at the STG_PREPARE stage. The functions are not exported anymore.	2018-11-26 19:50:32 +01:00
Willy Tarreau	8ceae72d44	MEDIUM: init: use initcall for all fixed size pool creations This commit replaces the explicit pool creations that are made in constructors with a pool registration. Not only does this simplify the pool declarations (it can be done on a single line after the head is declared), but it also removes references to pools from within constructors. The only remaining create_pool() calls are those performed in init functions after the config is parsed, so there is no more user of a potentially uninitialized pool now. This was the opportunity to remove no less than 12 constructors and 6 init functions.	2018-11-26 19:50:32 +01:00
Willy Tarreau	86abe44e42	MEDIUM: init: use self-initializing spinlocks and rwlocks This patch replaces a number of __decl_hathread() followed by HA_SPIN_INIT or HA_RWLOCK_INIT by the new __decl_spinlock() or __decl_rwlock() which automatically registers the lock for initialization during the STG_LOCK init stage. A few static modifiers were lost in the process, but since they were not essential at all it was not worth extending the API to provide such a variant.	2018-11-26 19:50:32 +01:00
Willy Tarreau	9efd7456e0	MEDIUM: tasks: collect per-task CPU time and latency Right now we measure for each task the cumulated time spent waiting for the CPU and using it. The timestamp uses a 64-bit integer to report a nanosecond-level date. This is only enabled when "profiling.tasks" is enabled, and consumes less than 1% extra CPU on x86_64 when enabled. The cumulated processing time and wait time are reported in "show sess". The task's counters are also reset when an HTTP transaction is reset since the HTTP part pretends to restart on a fresh new stream. This will make sure we always report correct numbers for each request in the logs.	2018-11-22 15:44:21 +01:00
Joseph Herlant	cf92b6d332	CLEANUP: Fix typos in the task subsystem Fix typos in the code comments of the task subsystem.	2018-11-18 22:26:42 +01:00
Willy Tarreau	8d8747abe0	OPTIM: tasks: group all tree roots per cache line Currently we have per-thread arrays of trees and counts, but these ones unfortunately share cache lines and are accessed very often. This patch moves the task-specific stuff into a structure taking a multiple of a cache line, and has one such per thread. Just doing this has reduced the cache miss ratio from 19.2% to 18.7% and increased the 12-thread test performance by 3%. It starts to become visible that we really need a process-wide per-thread storage area that would cover more than just these parts of the tasks. The code was arranged so that it's easy to move the pieces elsewhere if needed.	2018-10-15 19:06:13 +02:00
Willy Tarreau	b20aa9eef3	MAJOR: tasks: create per-thread wait queues Now we still have a main contention point with the timers in the main wait queue, but the vast majority of the tasks are pinned to a single thread. This patch creates a per-thread wait queue and queues a task to the local wait queue without any locking if the task is bound to a single thread (the current one) otherwise to the shared queue using locking. This significantly reduces contention on the wait queue. A test with 12 threads showed 11 ms spent in the WQ lock compared to 4.7 seconds in the same test without this change. The cache miss ratio decreased from 19.7% to 19.2% on the 12-thread test, and its performance increased by 1.5%. Another indirect benefit is that the average queue size is divided by the number of threads, which roughly removes log(nbthreads) levels in the tree and further speeds up lookups.	2018-10-15 19:04:40 +02:00
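The queue selection rule from b20aa9eef3 (local wait queue when the task is bound to the current thread only, shared queue otherwise) can be sketched as a single comparison. The enum and function names are hypothetical and the masks are plain integers here.

```c
#include <assert.h>

enum wq_kind { WQ_LOCAL, WQ_SHARED };

/* Pick the wait queue for a task. thread_mask has one bit per thread
 * allowed to run the task; tid_bit is the calling thread's own bit. */
static enum wq_kind pick_wait_queue(unsigned long thread_mask, unsigned long tid_bit)
{
    /* tid_bit must have exactly one bit set */
    assert(tid_bit != 0 && (tid_bit & (tid_bit - 1)) == 0);

    /* Bound to the current thread only: no lock is needed. */
    if (thread_mask == tid_bit)
        return WQ_LOCAL;

    /* Bound to several threads, or to another one: use the locked queue. */
    return WQ_SHARED;
}
```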
Willy Tarreau	0b25d5e99f	MEDIUM: task: perform a single tree lookup per run queue batch The run queue is designed to perform a single tree lookup and to use multiple passes to eb32sc_next(). The scheduler rework took a conservative approach first but this is not needed anymore and it increases the processing cost of process_runnable_tasks() and even the time during which the RQ lock is held if the global queue is heavily loaded. Let's simply move the initial lookup to the entry of the loop like the previous scheduler used to do. This has reduced by a factor of 5.5 the number of calls to eb32sc_lookup_get() there.	2018-10-10 16:42:46 +02:00
Olivier Houchard	19bdf2428d	MINOR: tasks: Don't special-case when nbthreads == 1 Instead of checking if nbthreads == 1, just AND thread_mask with all_threads_mask to know if we're supposed to add the task to the local or the global runqueue.	2018-08-17 14:50:37 +02:00
Olivier Houchard	d8b7a4701d	BUG/MEDIUM: tasks: Don't insert in the global rqueue if nbthread == 1 Make sure we don't insert a task in the global run queue if nbthread == 1, as, as an optimisation, we avoid reading from it if nbthread == 1.	2018-08-16 19:25:46 +02:00
Willy Tarreau	85d9b84eb1	BUILD/MINOR: threads: unbreak build with threads disabled Depending on the optimization level, gcc may complain that wake_thread() uses an invalid array index for poller_wr_pipe[] when called from __task_wakeup(). Normally the condition to get there never happens, but it's simpler to ifdef out this part of the code which is only used to wake other threads up. No backport is needed, this was brought by the recent introduction of the ability to wake a sleeping thread.	2018-07-27 17:18:22 +02:00
Olivier Houchard	79321b95a8	MINOR: pollers: Add a way to wake a thread sleeping in the poller. Add a new pipe, one per thread, so that we can write on it to wake a thread sleeping in a poller, and use it to wake threads supposed to take care of a task, if they are all sleeping.	2018-07-26 19:09:50 +02:00
Olivier Houchard	eba0c0b51d	MINOR: tasks: Make global_tasks_mask volatile. In order to make sure modifications are noticed by other threads when needed, make global_tasks_mask volatile.	2018-07-26 19:09:50 +02:00
Olivier Houchard	9b03c0c9a7	MINOR: tasks: Make active_tasks_mask volatile. To be sure we have the relevant information, make active_tasks_mask volatile.	2018-07-26 19:09:50 +02:00
Olivier Houchard	77551ee8a7	BUG/MEDIUM: tasks: make __task_unlink_rq responsible for the rqueue size. As __task_wakeup() is responsible for increasing rqueue_local[tid]/global_rqueue_size, make __task_unlink_rq responsible for decreasing it, as process_runnable_tasks() isn't the only one that removes tasks from runqueues.	2018-07-26 16:33:29 +02:00
Olivier Houchard	76e45181b2	MINOR: tasks: Add a flag that tells if we're in the global runqueue. Now that we have bits available in task->state, add a flag that tells if we're in the global runqueue or not.	2018-07-26 16:33:10 +02:00
Olivier Houchard	c4aac9effe	BUG/MEDIUM: tasks: Make sure there's no task left before considering inactive. We may remove the thread's bit in active_tasks_mask despite tasks for that thread still being present in the global runqueue. To fix that, introduce global_tasks_mask, and set the corresponding bits when we add a task to the runqueue.	2018-07-26 15:40:22 +02:00
Willy Tarreau	189ea856a7	BUG/MEDIUM: tasks: use atomic ops for active_tasks_mask We don't have the lock anymore so we need to protect it.	2018-07-26 15:16:43 +02:00
Olivier Houchard	e85ee7b663	BUG/MEDIUM: tasks: Decrement rqueue_size at the right time. We need to decrement rqueue_size when we remove a task from rqueue_local, not when we remove it from the task list, or we'd also decrement it for any tasklet, that was never in the rqueue in the first place.	2018-07-26 15:00:58 +02:00
Willy Tarreau	9a77186cb0	BUG/MEDIUM: tasks: make sure we pick all tasks in the run queue Commit `09eeb76` ("BUG/MEDIUM: tasks: Don't forget to increase/decrease tasks_run_queue.") addressed a count issue in the run queue and uncovered another issue with the way the tasks are dequeued from the global run queue. The number of tasks to pick is computed using an integral divide, which results in up to nbthread-1 tasks never being run. The fix simply consists in getting rid of the divide and checking the task count in the loop. No backport is needed, this is 1.9-specific.	2018-07-26 14:24:46 +02:00
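The truncation problem fixed by 9a77186cb0 is plain integer arithmetic: dividing a small task count by the number of threads can yield zero. The sketch below is a simplified model with hypothetical function names; the real dequeuing logic is more involved.

```c
#include <assert.h>

/* Old approach: a per-thread quota obtained by integer division.
 * With fewer tasks than threads the quota is 0 and nothing is picked. */
static unsigned int quota_with_divide(unsigned int grq_size, unsigned int nbthread)
{
    assert(nbthread > 0);
    return grq_size / nbthread;
}

/* New approach: no division, the loop itself checks the remaining task
 * count and the processing budget. Returns the number of tasks picked. */
static unsigned int pick_with_loop(unsigned int *grq_size, unsigned int budget)
{
    unsigned int picked = 0;

    while (picked < budget && *grq_size > 0) {
        (*grq_size)--;   /* stands for moving one task out of the queue */
        picked++;
    }
    return picked;
}
```

With 3 queued tasks and 4 threads, the divided quota is 0 for every thread, so the 3 tasks (at most nbthread-1) would never run, whereas the loop picks them all.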
Olivier Houchard	9db0fedb59	BUG/MINOR: tasklets: Just make sure we don't pass a tasklet to the handler. We can't just set t to NULL if it's a tasklet, or we'd have a hard time accessing to t->process, so just make sure we pass NULL as the first parameter of t->process if it's a tasklet. This should be a non-issue at this point, as tasklets aren't used yet.	2018-06-14 18:57:26 +02:00
Olivier Houchard	b1ca58b245	MINOR: tasks: Don't define rqueue if we're building without threads. To make sure we don't inadvertently insert task in the global runqueue, while only the local runqueue is used without threads, make its definition and usage conditional on USE_THREAD.	2018-06-06 16:35:12 +02:00
David Carlier	cc0a957a50	MINOR: task: Fix compiler warning. Waking up task, when checking if it is a valid entry. Similarly to commit `caa8a37ffe`, casting explicitly to void pointer as HA_ATOMIC_CAS needs.	2018-06-05 13:55:57 +02:00
Olivier Houchard	082627af77	MINOR: task: Also consider the task list size when getting global tasks. We're taking tasks from the global runqueue based on the number of tasks the thread already have in its local runqueue, but now that we have a task list, we also have to take that into account.	2018-05-28 15:20:59 +02:00
Olivier Houchard	736ea41c6c	BUG/MEDIUM: task: Don't forget to decrement max_processed after each task. When the task list was introduced, we bogusly lost max_processed--, that means we would execute as many tasks as present in the list, and we would never set active_tasks_mask, so the thread would go to sleep even if more tasks were to be executed. 1.9-dev only, no backport is needed.	2018-05-28 15:20:57 +02:00
Olivier Houchard	1599b80360	MINOR: tasks: Make the number of tasks to run at once configurable. Instead of hardcoding 200, make the number of tasks to be run configurable using tune.runqueue-depth. 200 is still the default.	2018-05-26 20:03:24 +02:00
Olivier Houchard	b0bdae7b88	MAJOR: tasks: Introduce tasklets. Introduce tasklets, lightweight tasks. They have no notion of priority, they are just run as soon as possible, and will probably be used for I/O later. For the moment they're used to replace the temporary thread-local list that was used in the scheduler. The first part of the struct is common with tasks so that tasks can be cast to tasklets and queued in this list. Once a task is in the tasklet list, it has its leaf_p set to 0x1 so that it cannot accidentally be confused as not in the queue. Pure tasklets are identifiable by their nice value of -32768 (which is normally not possible).	2018-05-26 20:03:19 +02:00
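The two ideas in b0bdae7b88 (a common leading part shared by tasks and tasklets, and a reserved nice value marking pure tasklets) can be sketched as below. The struct layouts, field names and the accepted nice range for regular tasks are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TASKLET_NICE INT16_MIN   /* -32768, outside the range used by tasks */

/* Common leading part, so that a pointer to either type can be handled
 * through a pointer to struct common. */
struct common    { int16_t nice; void *context; };
struct t_task    { struct common c; int expire; };
struct t_tasklet { struct common c; };

static void task_init(struct t_task *t, int nice, void *ctx)
{
    assert(nice >= -1024 && nice <= 1024);   /* assumed valid range */
    t->c.nice = (int16_t)nice;
    t->c.context = ctx;
    t->expire = 0;
}

static void tasklet_init(struct t_tasklet *tl, void *ctx)
{
    tl->c.nice = TASKLET_NICE;
    tl->c.context = ctx;
}

/* A pure tasklet is recognized by its impossible nice value. */
static int is_tasklet(const struct common *c)
{
    return c->nice == TASKLET_NICE;
}
```

Because `struct common` is the first member of both types, converting a `struct t_task *` or `struct t_tasklet *` to `struct common *` is well defined in C.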
Olivier Houchard	f6e6dc12cd	MAJOR: tasks: Create a per-thread runqueue. A lot of tasks are run on one thread only, so instead of having them all in the global runqueue, create a per-thread runqueue which doesn't require any locking, and add all tasks belonging to only one thread to the corresponding runqueue. The global runqueue is still used for non-local tasks, and is visited by each thread when checking its own runqueue. The nice parameter is thus used both in the global runqueue and in the local ones. The rare tasks that are bound to multiple threads will have their nice value used twice (once for the global queue, once for the thread-local one).	2018-05-26 19:27:29 +02:00
Olivier Houchard	9f6af33222	MINOR: tasks: Change the task API so that the callback takes 3 arguments. In preparation for thread-specific runqueues, change the task API so that the callback takes 3 arguments, the task itself, the context, and the state, those were retrieved from the task before. This will allow these elements to change atomically in the scheduler while the application uses the copied value, and even to have NULL tasks later.	2018-05-26 19:23:57 +02:00
Olivier Houchard	9b36cb4a41	BUG/MEDIUM: task: Don't free a task that is about to be run. While running a task, we may try to delete and free a task that is about to be run, because it's part of the local tasks list, or because rq_next points to it. So flag any task that is in the local tasks list to be deleted, instead of run, by setting t->process to NULL, and re-make rq_next a global, thread-local variable, that is modified if we attempt to delete that task. Many thanks to PiBa-NL for reporting this and analysing the problem. This should be backported to 1.8.	2018-05-04 20:11:04 +02:00
Willy Tarreau	d80cb4ee13	MINOR: global: add some global activity counters to help debugging A number of counters have been added at special places helping better understanding certain bug reports. These counters are maintained per thread and are shown using "show activity" on the CLI. The "clear counters" commands also reset these counters. The output is sent as a single write(), which currently produces up to about 7 kB of data for 64 threads. If more counters are added, it may be necessary to write into multiple buffers, or to reset the counters. To backport to 1.8 to help collect more detailed bug reports.	2018-01-23 15:38:33 +01:00
Willy Tarreau	a24d1d0be4	MINOR: task: align the rq and wq locks We really don't want them to share the same cache line as they are expected to be used in parallel. Adding a 64-byte alignment here shows a performance increase of about 4.5% on task-intensive workloads with 2 to 4 threads.	2017-11-26 11:10:51 +01:00
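The 64-byte alignment mentioned in a24d1d0be4 can be expressed with C11 `alignas`. The lock type below is a stand-in (a plain `int`) and the struct is hypothetical; only the alignment technique is the point of this sketch.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64

typedef int fake_lock_t;   /* stand-in for a real spinlock type */

/* Each lock starts on its own cache line, so two threads spinning on
 * different locks do not invalidate each other's line. */
struct sched_locks {
    alignas(CACHE_LINE) fake_lock_t rq_lock;
    alignas(CACHE_LINE) fake_lock_t wq_lock;
};

/* Checked at compile time: the two locks are at least one line apart. */
static_assert(offsetof(struct sched_locks, wq_lock) -
              offsetof(struct sched_locks, rq_lock) >= CACHE_LINE,
              "locks must not share a cache line");

/* Tell whether two offsets inside a line-aligned object share a line. */
static int same_cache_line(size_t a, size_t b)
{
    return a / CACHE_LINE == b / CACHE_LINE;
}
```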
Willy Tarreau	6d1222ce73	MINOR: task: keep a pointer to the currently running task Very often when debugging, the current task's pointer isn't easy to recover (eg: from a core file). Let's keep a copy of it, it will likely help, especially with threads.	2017-11-26 11:10:50 +01:00
Willy Tarreau	bafbe01028	CLEANUP: pools: rename all pool functions and pointers to remove this "2" During the migration to the second version of the pools, the new functions and pool pointers were all called "pool_something2()" and "pool2_something". Now there's no more pool v1 code and it's a real pain to still have to deal with this. Let's clean this up now by removing the "2" everywhere, and by renaming the pool heads "pool_head_something".	2017-11-24 17:49:53 +01:00
Willy Tarreau	51753458c4	BUG/MAJOR: threads/task: dequeue expired tasks under the WQ lock There is a small unprotected window for a task between the wait queue and the run queue where a task could be woken up and destroyed at the same time. What typically happens is that a timeout is reached at the same time an I/O completes and wakes it up, and the I/O terminates the task, causing a use after free in wake_expired_tasks() possibly causing a crash and/or memory corruption : thread 1 (wake_expired_tasks) calls HA_SPIN_UNLOCK(TASK_WQ_LOCK, &wq_lock); thread 2 (stream_int_notify) then calls task_wakeup(task, TASK_WOKEN_IO), ... process_stream(), stream_free(), task_free(), pool_free(task); finally thread 1 calls task_wakeup(task, TASK_WOKEN_TIMER) on the freed task. This case is reasonably easy to reproduce with a config using very short server timeouts (100ms) and client timeouts (10ms), while injecting on httpterm requesting medium sized objects (5kB) over SSL. All this is easier done with more threads than allocated CPUs so that pauses can happen anywhere and last long enough for process_stream() to kill the task. This patch inverts the lock and the wakeup(), but requires some changes in process_runnable_tasks() to ensure we never try to grab the WQ lock while having the RQ lock held. This means we have to release the RQ lock before calling task_queue(), so we can't hold the RQ lock during the loop and must take and drop it. It seems that a different approach with the scope-aware trees could be easier, but it would possibly not cover situations where a task is allowed to run on multiple threads. The current solution covers it and doesn't seem to have any measurable performance impact.	2017-11-23 18:47:04 +01:00
Christopher Faulet	8a48f67526	MAJOR: polling: Use active_tasks_mask instead of tasks_run_queue tasks_run_queue is the run queue size. It is a global variable. So it is underoptimized because we may be led to consider there are active tasks for a thread while in fact all active tasks are assigned to the other threads. So, in such cases, the polling loop will be evaluated many more times than necessary. Instead, we now check if the thread id is set in the bitfield active_tasks_mask. Another change has been made in process_runnable_tasks. Now, we always limit the number of tasks processed to 200. This is specific to threads, no backport is needed.	2017-11-16 11:19:46 +01:00
Christopher Faulet	3911ee85df	MINOR: tasks: Use a bitfield to track tasks activity per-thread A bitfield has been added to know if there are runnable tasks for a thread. When a task is woken up, the bits corresponding to its thread_mask are set. When all tasks for a thread have been evaluated without any wakeup, the thread is removed from active ones by unsetting its tid_bit from the bitfield.	2017-11-16 11:19:46 +01:00
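The per-thread activity bitfield from 3911ee85df can be sketched with C11 atomics. The variable and function names are hypothetical, and C11 `<stdatomic.h>` is used here in place of the project's own atomic macros.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* One bit per thread: set while the thread may have runnable tasks. */
static _Atomic unsigned long active_mask;

/* On wakeup: flag every thread allowed to run the task. */
static void mark_active(unsigned long thread_mask)
{
    atomic_fetch_or(&active_mask, thread_mask);
}

/* After a full pass without any wakeup: clear the thread's own bit. */
static void mark_idle(unsigned int tid)
{
    assert(tid < sizeof(unsigned long) * CHAR_BIT);
    atomic_fetch_and(&active_mask, ~(1UL << tid));
}

/* Polling loop check: does this thread have something to run? */
static int has_work(unsigned int tid)
{
    assert(tid < sizeof(unsigned long) * CHAR_BIT);
    return (int)((atomic_load(&active_mask) >> tid) & 1UL);
}
```

A thread whose bit is clear can sleep in the poller instead of spinning because some other thread's tasks are pending.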
Christopher Faulet	919b739862	CLEANUP: tasks: Remove useless double test on rq_next No backport is needed, this is purely 1.8-specific.	2017-11-14 18:11:34 +01:00
Christopher Faulet	9dcf9b6f03	MINOR: threads: Use __decl_hathreads to declare locks This macro should be used to declare variables or struct members depending on the USE_THREAD compile option. It avoids the encapsulation of such declarations between #ifdef/#endif. It is used to declare all lock variables.	2017-11-13 11:38:17 +01:00
Willy Tarreau	9e45b33f7e	BUG/MAJOR: threads/tasks: fix the scheduler again My recent change in commit `ce4e0aa` ("MEDIUM: task: change the construction of the loop in process_runnable_tasks()") was bogus as it used to keep the rq_next across an unlock/lock sequence, occasionally leading to crashes for tasks that are eligible to any thread. We must use the lookup call for each new batch instead. The problem is easily triggered with such a configuration : global nbthread 4 listen check mode http bind 0.0.0.0:8080 redirect location / option httpchk GET / server s1 127.0.0.1:8080 check inter 1 server s2 127.0.0.1:8080 check inter 1 Thanks to Olivier for diagnosing this one. No backport is needed.	2017-11-08 14:05:19 +01:00
Christopher Faulet	2a944ee16b	BUILD: threads: Rename SPIN/RWLOCK macros using HA_ prefix This removes any name conflicts, especially on Solaris.	2017-11-07 11:10:24 +01:00
Willy Tarreau	f0c531ab55	MEDIUM: tasks: implement a lockless scheduler for single-thread usage The scheduler is complex and uses local queues to amortize the cost of locks. But all this comes with a cost that is quite observable with single-thread workloads. The purpose of this patch is to reimplement the much simpler scheduler for the case where threads are not used. The code is very small and simple. It doesn't impact the multi-threaded performance at all, and provides a nice 10% performance increase in single-thread by reaching 606kreq/s on the tests that showed 550kreq/s before.	2017-11-06 11:20:11 +01:00
Willy Tarreau	9d4b56b88e	MINOR: tasks: only visit filled task slots after processing them process_runnable_tasks() needs to requeue or wake up tasks after processing them in batches. By only refilling the existing ones, we avoid revisiting all the queue. The performance gain is measurable starting with two threads, where the request rate climbs to 657k/s compared to 644k.	2017-11-06 11:20:11 +01:00
Willy Tarreau	ce4e0aa7f3	MEDIUM: task: change the construction of the loop in process_runnable_tasks() This patch slightly rearranges the loop to pack the locked code a little bit, and to try to concentrate accesses to the tree together to benefit more from the cache. It also fixes how the loop handles the right margin : now that it is guaranteed that the retrieved nodes are filtered to only match the current thread, we don't need to rewind every 16 entries. Instead we can rewind each time we reach the right margin again. With this change, we now achieve the following performance for 10 H2 conns each containing 100 streams : 1 thread : 550kreq/s, 2 threads : 644kreq/s, 3 threads : 598kreq/s.	2017-11-06 11:20:11 +01:00
Willy Tarreau	b992ba16ef	MINOR: task: simplify wake_expired_tasks() to avoid unlocking in the loop This function is sensitive, let's make it shorter by factoring out the unlock and leave code. This reduced the function's size by a few tens of bytes and increased the overall performance by about 1%.	2017-11-06 11:20:11 +01:00
Willy Tarreau	8d38805d3d	MAJOR: task: make use of the scope-aware ebtree functions Currently the task scheduler suffers from an O(n) lookup when skipping tasks that are not for the current thread. The reason is that eb32_lookup_ge() has no information about the current thread so it always revisits many tasks for other threads before finding its own tasks. This is particularly visible with HTTP/2 since the number of concurrent streams created at once causes long series of tasks for the same stream in the scheduler. With only 10 connections and 100 streams each, by running on two threads, the performance drops from 640kreq/s to 11.2kreq/s! Lookup metrics show that for only 200000 task lookups, 430 million skips had to be performed, which means that on average, each lookup leads to 2150 nodes to be visited. This commit backports the principle of scope lookups for ebtrees from the ebtree_v7 development tree. The idea is that each node contains a mask indicating the union of the scopes for the nodes below it, which is fed during insertion, and used during lookups. Then during lookups, branches that do not contain any leaf matching the requested scope are simply ignored. This perfectly matches a thread mask, allowing a thread to only extract the tasks it cares about from the run queue, and to always find them in O(log(n)) instead of O(n). Thus the scheduler uses tid_bit and task->thread_mask as the ebtree scope here. Doing this has recovered most of the performance, as can be seen on the test below with two threads, 10 connections, 100 streams each, and 1 million requests total : test duration : 89.6s before, 4.73s after (x19) ; HTTP requests/s (DEBUG) : 11200 before, 211300 after (x19) ; HTTP requests/s (PROD) : 15900 before, 447000 after (x28) ; spin_lock time : 85.2s before, 0.46s after (/185) ; time per lookup : 13us before, 40ns after (/325). Even when going to 6 threads (on 3 hyperthreaded CPU cores), the performance stays around 284000 req/s, showing that the contention is much lower.
A test showed that there's no benefit in using this for the wait queue though.	2017-11-06 11:20:11 +01:00
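The scope principle described in 8d38805d3d (each node carries the union of the scopes below it, and lookups skip branches with no matching scope) can be shown on a much simpler structure. The sketch below uses a plain unbalanced binary search tree as a stand-in for the ebtree; all names are hypothetical and none of this reflects the real ebtree API.

```c
#include <assert.h>
#include <stddef.h>

struct snode {
    struct snode *left, *right;
    unsigned int key;
    unsigned long self;    /* scope (thread mask) of this node */
    unsigned long scope;   /* union of self and of all descendants' scopes */
};

/* Insert a node, feeding its scope into every node crossed on the way down. */
static void s_insert(struct snode **root, struct snode *n)
{
    assert(n->self != 0);
    n->left = n->right = NULL;
    n->scope = n->self;
    while (*root) {
        (*root)->scope |= n->self;
        root = (n->key < (*root)->key) ? &(*root)->left : &(*root)->right;
    }
    *root = n;
}

/* Return the node with the smallest key whose own scope matches <scope>.
 * Branches whose aggregated scope does not match are never entered.
 * <visited> counts the nodes that were actually entered. */
static struct snode *s_first(struct snode *t, unsigned long scope, unsigned int *visited)
{
    struct snode *r;

    if (!t || !(t->scope & scope))
        return NULL;                 /* whole branch pruned */
    (*visited)++;
    r = s_first(t->left, scope, visited);
    if (r)
        return r;
    if (t->self & scope)
        return t;
    return s_first(t->right, scope, visited);
}
```

In a tree where the whole left side belongs to another thread, a lookup for the current thread skips that side at its root instead of walking every node in it.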
Willy Tarreau	f65610a83d	CLEANUP: threads: rename process_mask to thread_mask It was a leftover from the last cleaning session; this mask applies to threads and calling it process_mask is a bit confusing. It's the same in fd, task and applets.	2017-10-31 16:06:06 +01:00
Willy Tarreau	5f4a47b701	CLEANUP: threads: replace the last few 1UL<<tid with tid_bit There were a few occurrences left, better replace them now.	2017-10-31 15:59:32 +01:00
Emeric Brun	c60def8368	MAJOR: threads/task: handle multithread on task scheduler 2 global locks have been added to protect, respectively, the run queue and the wait queue. A process mask has also been added to each task. Like for FDs, this mask is used to know which threads are allowed to process a task. For many tasks, all threads are granted. This must be your first intention when you create a new task, unless you have a good reason to make a task sticky on some threads. It is then the responsibility of the process callback to lock whatever has to be locked in the task context. Nevertheless, all tasks linked to a session must be sticky on the thread creating the session. It is important that I/O handlers processing session FDs and these tasks run on the same thread to avoid conflicts.	2017-10-31 13:58:30 +01:00
Thierry FOURNIER	d697596c6c	MINOR: tasks: Move Lua notification from Lua to tasks These notification management functions and structs are generic, so it is better to move them to the common parts. Their names contained some "lua" references because they were written for Lua. This patch also removes these references.	2017-09-11 18:59:40 +02:00
Emeric Brun	0194897e54	MAJOR: task: task scheduler rework. In order to allow task_wakeup to be called on a running task: - from within the task handler itself. - in the future, from another thread. The lookups on runqueue and waitqueue are re-worked to prepare multithread stuff. If task_wakeup is called on a running task, the woken message flags are saved in the 'pending_state' attribute of the task. The real wakeup is postponed at the end of the handler process and the woken messages are copied from pending_state to the state attribute of the task. It's important to note that this change will cause a very minor (though measurable) performance loss but it is necessary to make forward progress on a multi-threaded scheduler. Most users won't ever notice.	2017-06-27 14:38:02 +02:00
Christopher Faulet	34c5cc98da	MINOR: task: Rename run_queue and run_queue_cur counters <run_queue> is used to track the number of tasks in the run queue and <run_queue_cur> is a copy used for the reporting purpose. These counters have been renamed, respectively, <tasks_run_queue> and <tasks_run_queue_cur>. So the naming is consistent between tasks and applets. [wt: needed for next fixes, backport to 1.7 and 1.6]	2016-12-12 19:10:54 +01:00
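The postponed wakeup described in 0194897e54 can be modelled with two flag words. The flag values, the struct and the function names are hypothetical; this is a single-threaded sketch of the principle and not the scheduler's implementation.

```c
#include <assert.h>

#define T_RUNNING     0x01u   /* hypothetical flag values */
#define T_WOKEN_IO    0x02u
#define T_WOKEN_TIMER 0x04u

struct mini_task {
    unsigned int state;           /* flags seen by the handler */
    unsigned int pending_state;   /* wakeups received while running */
    int queued;                   /* 1 when sitting in the run queue */
};

/* Wake a task up. If it is currently running, only record the reason. */
static void mini_wakeup(struct mini_task *t, unsigned int why)
{
    if (t->state & T_RUNNING) {
        t->pending_state |= why;
        return;
    }
    t->state |= why;
    t->queued = 1;
}

/* Run the handler, then replay any wakeup that arrived meanwhile. */
static void mini_run(struct mini_task *t, void (*handler)(struct mini_task *))
{
    assert(t->queued);
    t->queued = 0;
    t->state |= T_RUNNING;
    handler(t);
    t->state = 0;                 /* woken flags were consumed by the handler */
    if (t->pending_state) {
        unsigned int why = t->pending_state;
        t->pending_state = 0;
        mini_wakeup(t, why);      /* the real wakeup happens only now */
    }
}

/* Example handler that wakes its own task up while it is running. */
static void self_waking_handler(struct mini_task *t)
{
    mini_wakeup(t, T_WOKEN_TIMER);
}
```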

201 Commits