haproxy

mirror of https://git.haproxy.org/git/haproxy.git/ synced 2025-08-07 23:56:57 +02:00

Author	SHA1	Message	Date
Willy Tarreau	1ed238101a	CLEANUP: tasks: use the local state, not t->state, to check for tasklets There's no point reading t->state to check for a tasklet after we've atomically read the state into the local "state" variable. Not only is it more expensive, it's also less clear whether that state is supposed to be atomic or not. And in any case, tasks and tasklets have their type forever and the one reflected in state is correct and stable.	2025-05-02 11:09:28 +02:00
Willy Tarreau	45e83e8e81	BUG/MAJOR: tasks: fix task accounting when killed After recent commit `b81c9390f` ("MEDIUM: tasks: Mutualize the TASK_KILLED code between tasks and tasklets"), the task accounting was no longer correct for killed tasks due to the decrement of tasks in list that was no longer done, resulting in infinite loops in process_runnable_tasks(). This just illustrates that this code remains complex and should be further cleaned up. No backport is needed, as this was in 3.2.	2025-05-02 11:09:28 +02:00
Olivier Houchard	b81c9390f4	MEDIUM: tasks: Mutualize the TASK_KILLED code between tasks and tasklets The code to handle a task/tasklet when it's been killed before it was to run is mostly identical, so move it outside of task and tasklet specific code, and inside the common code. This commit is just cosmetic, and should have no impact.	2025-04-30 17:09:14 +02:00
Olivier Houchard	2bab043c8c	MEDIUM: tasks: Remove TASK_IN_LIST and use TASK_QUEUED instead. TASK_QUEUED was used to mean "the task has been scheduled to run", TASK_IN_LIST was used to mean "the tasklet has been scheduled to run", remove TASK_IN_LIST and just use TASK_QUEUED for tasklets instead. This commit is just cosmetic, and should not have any impact.	2025-04-30 17:08:57 +02:00
Olivier Houchard	35df7cbe34	MEDIUM: tasks: More code factorization There is some code that should run no matter if the task was killed or not, and was needlessly duplicated, so only use one instance. This also fixes a small bug when a tasklet that got killed before it could run would still count as a tasklet that ran, when it should not, which just means that we'd run one less useful task before going back to the poller. This commit is mostly cosmetic, and should not have any impact.	2025-04-30 17:08:57 +02:00
Olivier Houchard	438c000e9f	MEDIUM: tasks: Mutualize code between tasks and tasklets. The code that checks if we're currently running, and waits if so, was identical between tasks and tasklets, so move it in code common to tasks and tasklets. This commit is just cosmetic, and should not have any impact.	2025-04-30 17:08:57 +02:00
Olivier Houchard	9240cd4a27	BUG/MAJOR: tasklets: Make sure the tasklet can't run twice Tasklets were originally designed to always run on only one thread, so it was not possible to have it run on 2 threads concurrently. The API has been extended so that another thread may wake the tasklet, the idea was still that we wanted to have it run on one thread only. However, the way it's been done meant that unless a tasklet was bound to a specific tid with tasklet_set_tid(), or we explicitly used tasklet_wakeup_on() to specify the thread for the target to run on, it would be scheduled to run on the current thread. This is in fact a desirable feature. There is however a race condition in which the tasklet would be scheduled on a thread, while it is running on another. This could lead to the same tasklet running on multiple threads, which we do not want. To fix this, just do what we already do for regular tasks, set the "TASK_RUNNING" flag, and when it's time to execute the tasklet, wait until that flag is gone. Only one case has been found in the current code, where the tasklet could run on different threads depending on who wakes it up, in the leastconn load balancer, since commit `627280e15f`. It should not be a problem in practice, as the function called can be called concurrently. If a bug is eventually found in relation to this problem, and this patch should be backported, the following patches should be backported too:
- MEDIUM: quic: Make sure we return the tasklet from quic_accept_run
- MEDIUM: quic: Make sure we return NULL in quic_conn_app_io_cb if needed
- MEDIUM: quic: Make sure we return the tasklet from qcc_io_cb
- MEDIUM: mux_fcgi: Make sure we return the tasklet from fcgi_deferred_shut
- MEDIUM: listener: Make sure we return the tasklet from accept_queue_process
- MEDIUM: checks: Make sure we return the tasklet from srv_chk_io_cb	2025-04-25 16:14:26 +02:00
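The "set a running flag, and only run when it was not already set" idea from the commit above can be sketched as follows. This is a minimal illustration using C11 atomics, not HAProxy's code: the `sk_` names and the flag value are hypothetical, and where the commit says the scheduler waits until the flag is gone, this sketch only reports whether the claim succeeded.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag value, chosen for this sketch only. */
#define SK_TASK_RUNNING 0x0001u

struct sk_tasklet {
    atomic_uint state;   /* bit field of SK_* flags */
};

/* Try to claim the tasklet for execution on the calling thread.
 * Returns true if the caller set the flag (nobody else was running it),
 * false if another thread already holds it. */
static bool sk_tasklet_try_claim(struct sk_tasklet *t)
{
    unsigned int prev = atomic_fetch_or(&t->state, SK_TASK_RUNNING);

    return !(prev & SK_TASK_RUNNING);
}

/* Drop the flag once the handler has returned. */
static void sk_tasklet_release(struct sk_tasklet *t)
{
    atomic_fetch_and(&t->state, ~SK_TASK_RUNNING);
}
```

A caller that fails the claim would retry or wait before invoking the handler, so the handler never executes on two threads at once.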
Willy Tarreau	36ec70c526	MINOR: sched: add a new function is_sched_alive() to report scheduler's health This verifies that the scheduler is still ticking without having to access the activity[] array nor keeping local copies of the ctxsw counter. It just tests and sets a flag that is reset after each return from a ->process() function.	2025-04-17 16:25:47 +02:00
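A test-and-set liveness check of the kind described above might look like the sketch below. All names (`sk_thread_flags`, `SK_TH_FL_CHECKED`, the two functions) are hypothetical and the real is_sched_alive() may differ; the point is only that the watcher sets a flag and the scheduler clears it after each handler returns.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag: set by the watcher, cleared by the scheduler. */
#define SK_TH_FL_CHECKED 0x01u

static atomic_uint sk_thread_flags;

/* Called by the scheduler after each return from a ->process() function. */
static void sk_sched_after_process(void)
{
    atomic_fetch_and(&sk_thread_flags, ~SK_TH_FL_CHECKED);
}

/* Returns true if the flag was clear, i.e. the scheduler made progress
 * since the previous check (or this is the first check). The flag is set
 * again so that the next check observes fresh progress only. */
static bool sk_is_sched_alive(void)
{
    unsigned int prev = atomic_fetch_or(&sk_thread_flags, SK_TH_FL_CHECKED);

    return !(prev & SK_TH_FL_CHECKED);
}
```

Two consecutive checks with no handler return in between report "alive" then "not alive", without reading any counter.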
Willy Tarreau	e7510d6230	CLEANUP: task: move the barrier after clearing th_ctx->current There's a barrier after releasing the current task in the scheduler. However it's improperly placed, it's done after pool_free() while in fact it must be done immediately after resetting the current pointer. Indeed, the purpose is to make sure that nobody sees the task as valid when it's in the process of being released. This is something that could theoretically happen if interrupted by a signal in the inlined code of pool_free() if the compiler decided to postpone the write to ->current. In practice since nothing fancy is done in the inlined part of the function, there's currently no risk of reordering. But it could happen if the underlying __pool_free() were to be inlined for example, and in this case we could possibly observe th_ctx->current pointing to something currently being destroyed. With the barrier between the two, there's no risk anymore.	2025-02-21 18:31:46 +01:00
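The ordering described above (clear the pointer, place the barrier, then free) can be sketched like this. The `sk_` names are hypothetical, and `atomic_signal_fence()` is used here as a portable compiler barrier; the source does not say which barrier primitive HAProxy uses.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct sk_task {
    int dummy;
};

/* Hypothetical per-thread "current task" pointer that a signal handler
 * running on the same thread might inspect. */
static struct sk_task *sk_current;

static void sk_release_current(void)
{
    struct sk_task *t = sk_current;

    sk_current = NULL;
    /* Compiler barrier placed right after the reset: the store above may
     * not be moved past this point, so code interrupting the release
     * below cannot observe a pointer to a task being destroyed. */
    atomic_signal_fence(memory_order_seq_cst);
    free(t);
}
```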
Willy Tarreau	c5052bad8a	MINOR: sched: add TASK_F_WANTS_TIME to make the scheduler update the call date Currently tasks being profiled have th_ctx->sched_call_date set to the current nanosecond in monotonic time. But there's no other way to have this, despite the scheduler being capable of it. Let's just declare a new task flag, TASK_F_WANTS_TIME, that makes the scheduler take the time just before calling the handler. This way, a task that needs nanosecond resolution on the call date will be able to be called with an up-to-date date without having to abuse now_mono_time() if not needed. In addition, if CLOCK_MONOTONIC is not supported (now_mono_time() always returns 0), the date is set to the most recently known now_ns, which is guaranteed to be atomic and is only updated once per poll loop. This date can be more conveniently retrieved using task_mono_time(). This can be useful, e.g. for pacing. The code was slightly adjusted so as to merge the common parts between the profiling case and this one.	2024-11-19 20:13:41 +01:00
Ilya Shipitsin	80813cdd2a	CLEANUP: assorted typo fixes in the code and comments This is the 37th iteration of typo fixes	2023-11-23 16:23:14 +01:00
Willy Tarreau	a13f8425f0	MINOR: task/debug: make task_queue() and task_schedule() possible callers It's common to see process_stream() being woken up by wake_expired_tasks in the profiling output, without knowing which timeout was set to cause this. By making it possible to record the call places of task_queue() and task_schedule(), and by making wake_expired_tasks() explicitly not replace it, we'll be able to know which task_queue() or task_schedule() was triggered for a given wakeup. For example below:
  process_stream 51200 311.4ms 6.081us 34.59s 675.6us <- run_tasks_from_lists@src/task.c:659 task_queue
  process_stream 19227 70.00ms 3.640us 9.813m 30.62ms <- sc_notify@src/stconn.c:1136 task_wakeup
  process_stream 6414 102.3ms 15.95us 8.093m 75.70ms <- stream_new@src/stream.c:578 task_wakeup
It's visible that it's the run_tasks_from_lists() which in fact applies on the task->expire returned by the ->process() function itself.	2023-11-09 17:24:00 +01:00
Amaury Denoyelle	c361937d51	BUG/MINOR: task: allow to use tasklet_wakeup_after with tid -1 Adjust BUG_ON() statement to allow tasklet_wakeup_after() for tasklets with tid pinned to -1 (the current thread). This is similar to tasklet_wakeup(). This should be backported up to 2.6.	2023-04-18 16:20:47 +02:00
Willy Tarreau	ba4c7a1597	BUG/MEDIUM: sched: allow a bit more TASK_HEAVY to be processed when needed As reported in github issue #1881, there are situations where an excess of TLS handshakes can cause a livelock. What's happening is that normally we process at most one TLS handshake per loop iteration to maintain the latency low. This is done by tagging them with TASK_HEAVY, queuing these tasklets in the TL_HEAVY queue. But if something slows down the loop, such as a connect() call when no more ports are available, we could end up processing no more than a few hundred or a few thousand handshakes per second. If the limit becomes lower than the rate of incoming handshakes, we will accumulate them and at some point users will get impatient and give up or retry. Then a new problem happens: the queue fills up with even more handshake attempts, only one of which will be handled per iteration, so we can end up processing only outdated handshakes at a low rate, with basically nothing else in the queue. This can for example happen in parallel with health checks that don't require incoming handshakes to succeed to continue to cause some activity that could maintain the high latency stuff active. Here we're taking a slightly different approach. First, instead of always allowing only one handshake per loop (and usually it's critical for latency), we take the current situation into account:
- if configured with tune.sched.low-latency, the limit remains 1
- if there are other non-heavy tasks, we set the limit to 1 + one per 1024 tasks, so that a heavily loaded queue of 4k handshakes per thread will be able to drain them at ~4 per loop with a limited impact on latency
- if there are no other tasks, the limit grows to 1 + one per 128 tasks, so that a heavily loaded queue of 4k handshakes per thread will be able to drain them at ~32 per loop with still a very limited impact on latency since only I/O will get delayed.
It was verified on a 56-core Xeon-8480 that this did not degrade the latency; all requests remained below 1ms end-to-end in full close+ handshake, and even 500us under low-lat + busy-polling. This must be backported to 2.4.	2023-02-17 16:01:34 +01:00
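The three rules above amount to a small per-iteration budget computation, sketched below. The function and parameter names are hypothetical, and the sketch assumes the "one per 1024" and "one per 128" terms count the queued heavy tasklets, which is how the 4k-handshake example in the message reads.

```c
/* Hypothetical helper: how many TASK_HEAVY tasklets may be processed in
 * one scheduler loop iteration.
 *   heavy_queued : heavy tasklets currently queued on this thread
 *   low_latency  : non-zero when tune.sched.low-latency is configured
 *   other_tasks  : non-zero when non-heavy tasks are also waiting */
static unsigned int sk_heavy_budget(unsigned int heavy_queued,
                                    int low_latency, int other_tasks)
{
    if (low_latency)
        return 1;                        /* strict: one per loop */
    if (other_tasks)
        return 1 + heavy_queued / 1024;  /* mild growth, protects latency */
    return 1 + heavy_queued / 128;       /* nothing else to run: drain faster */
}
```

With 4096 queued handshakes this yields 5 per loop when other tasks are waiting and 33 per loop otherwise, in line with the "~4" and "~32" figures quoted in the commit message.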
Willy Tarreau	2e270cf0b0	BUG/MINOR: sched: properly report long_rq when tasks remain in the queue There's a per-thread "long_rq" counter that is used to indicate how often we leave the scheduler with tasks still present in the run queue. The purpose is to know when tune.runqueue-depth served to limit latency, due to a large number of tasks being runnable at once. However there's a bug there, it's not always set: if after the first run, one heavy task was processed and later only heavy tasks remain, we'll loop back to not_done_yet where we try to pick more tasks, but none are eligible (since heavy ones have already run) so we directly return without incrementing the counter. This is what causes ultra-low values on long_rq during massive SSL handshakes, that are confusing because they make one believe that tl_class_mask doesn't have the HEAVY flag anymore. Let's just fix that by not returning from the middle of the function. This can be backported as far as 2.4.	2023-02-17 16:01:34 +01:00
Willy Tarreau	5ec79f1a04	BUILD: sched: fix build with DEBUG_THREAD with the previous commit The build with DEBUG_THREAD was broken by commit `fc50b9dd1` ("BUG/MAJOR: sched: protect task during removal from wait queue"). It took me a while to figure how to declare an aligned and initialized rwlock that wasn't static, but it turns out that __decl_aligned_rwlock() does exactly this, so that we don't have to assign an integer value when a struct is expected in case of debugging. No backport is needed.	2022-11-22 10:24:07 +01:00
Willy Tarreau	fc50b9dd14	BUG/MAJOR: sched: protect task during removal from wait queue The issue addressed by commit `fbb934da9` ("BUG/MEDIUM: stick-table: fix a race condition when updating the expiration task") is still present when thread groups are enabled, but this time it lies in the scheduler. What happens is that a task configured to run anywhere might already have been queued into one group's wait queue. When updating a stick table entry, sometimes the task will have to be dequeued and requeued. For this a lock is taken on the current thread group's wait queue lock, but while this is necessary for the queuing, it's not sufficient for dequeuing since another thread might be in the process of expiring this task under its own group's lock which is different. This is easy to test using 3 stick tables with 1ms expiration, 3 track-sc rules and 4 thread groups. The process crashes almost instantly under heavy traffic. One approach could consist in storing the group number the task was queued under in its descriptor (we don't need 32 bits to store the thread id, it's possible to use one short for the tid and another one for the tgrp). Sadly, no safe way to do this was figured out, because the race remains at the moment the thread group number is checked, as it might be in the process of being changed by another thread. It seems that a working approach could consist in always having it associated with one group, and only allowing to change it under this group's lock, so that any code trying to change it would have to iteratively read it and lock its group until the value matches, confirming it really holds the correct lock. But this seems a bit complicated, particularly with wake_expired_tasks() which already uses upgradable locks to switch from read state to a write state. Given that the shared tasks are not that common (stick-table expirations, rate-limited listeners, maybe resolvers), it doesn't seem worth the extra complexity for now.
This patch takes a simpler and safer approach consisting in switching back to a single wq_lock, but still keeping separate wait queues. Given that shared wait queues are almost always empty and that otherwise they're scanned under a read lock, the contention remains manageable and most of the time the lock doesn't even need to be taken since such tasks are not present in a group's queue. In essence, this patch reverts half of the aforementioned patch. This was tested and confirmed to work fine, without observing any performance degradation under any workload. The performance with 8 groups on an EPYC 74F3 and 3 tables remains twice the one of a single group, with the contention remaining on the table's lock first. No backport is needed.	2022-11-22 09:10:08 +01:00
Willy Tarreau	3d4cdb198c	MEDIUM: tasks/activity: combine the called function with the caller Now instead of getting aggregate stats per called function, we have them per function AND per call place. The "byaddr" sort considers the function pointer first, then the call count, so that dominant callers of a given callee are instantly spotted. This allows to get sorted outputs like this:
Tasks activity:
  function calls cpu_tot cpu_avg lat_tot lat_avg
  h1_io_cb 17357952 40.91s 2.357us 4.849m 16.76us <- sock_conn_iocb@src/sock.c:869 tasklet_wakeup
  sc_conn_io_cb 10357182 6.297s 607.0ns 27.93m 161.8us <- sc_app_chk_rcv_conn@src/stconn.c:762 tasklet_wakeup
  process_stream 9891131 1.809m 10.97us 53.61m 325.2us <- sc_notify@src/stconn.c:1209 task_wakeup
  process_stream 9823934 1.887m 11.52us 48.31m 295.1us <- stream_new@src/stream.c:563 task_wakeup
  sc_conn_io_cb 9347863 16.59s 1.774us 6.143m 39.43us <- h1_wake_stream_for_recv@src/mux_h1.c:2600 tasklet_wakeup
  h1_io_cb 501344 1.848s 3.686us 6.544m 783.2us <- conn_subscribe@src/connection.c:732 tasklet_wakeup
  sc_conn_io_cb 239717 492.3ms 2.053us 3.213m 804.3us <- qcs_notify_send@src/mux_quic.c:529 tasklet_wakeup
  h2_io_cb 173019 4.204s 24.30us 40.95s 236.7us <- h2_snd_buf@src/mux_h2.c:6712 tasklet_wakeup
  h2_io_cb 149487 424.3ms 2.838us 14.63s 97.87us <- h2c_restart_reading@src/mux_h2.c:856 tasklet_wakeup
  other 101893 4.626s 45.40us 14.84s 145.7us
  quic_lstnr_dghdlr 94389 614.0ms 6.504us 30.54s 323.6us <- quic_lstnr_dgram_dispatch@src/quic_sock.c:255 tasklet_wakeup
  quic_conn_app_io_cb 92205 3.735s 40.51us 390.9ms 4.239us <- qc_lstnr_pkt_rcv@src/xprt_quic.c:6184 tasklet_wakeup_after
  qc_io_cb 50355 19.01s 377.5us 10.65s 211.4us <- qc_treat_acked_tx_frm@src/xprt_quic.c:1695 tasklet_wakeup
  h1_io_cb 44427 155.0ms 3.489us 21.50s 484.0us <- h1_takeover@src/mux_h1.c:4085 tasklet_wakeup
  qc_io_cb 9018 4.924s 546.0us 3.084s 342.0us <- qc_stream_desc_ack@src/quic_stream.c:128 tasklet_wakeup
  h1_timeout_task 3236 1.172ms 362.0ns 1.119s 345.9us <- h1_release@src/mux_h1.c:1087 task_wakeup
  h1_io_cb 2804 7.974ms 2.843us 1.980s 706.0us <- sock_conn_iocb@src/sock.c:849 tasklet_wakeup
  sc_conn_io_cb 2804 33.44ms 11.92us 2.597s 926.2us <- h1_wake_stream_for_send@src/mux_h1.c:2610 tasklet_wakeup
  qc_io_cb 2623 2.669s 1.017ms 1.347s 513.5us <- h3_snd_buf@src/h3.c:1084 tasklet_wakeup
  qc_process_timer 662 526.4us 795.0ns 1.081s 1.633ms <- wake_expired_tasks@src/task.c:344 task_wakeup
  quic_conn_app_io_cb 648 12.62ms 19.47us 225.7ms 348.2us <- qc_process_timer@src/xprt_quic.c:4635 tasklet_wakeup
  accept_queue_process 286 1.571ms 5.494us 72.55ms 253.7us <- listener_accept@src/listener.c:1099 tasklet_wakeup
  process_resolvers 176 157.8us 896.0ns 7.835ms 44.52us <- wake_expired_tasks@src/task.c:429 task_drop_running
  qc_io_cb 167 10.71ms 64.12us 32.47ms 194.4us <- qc_process_timer@src/xprt_quic.c:4602 tasklet_wakeup
  sc_conn_io_cb 123 80.05us 650.0ns 50.35ms 409.4us <- qcs_notify_recv@src/mux_quic.c:519 tasklet_wakeup
  h2_timeout_task 32 30.69us 958.0ns 9.038ms 282.4us <- h2_release@src/mux_h2.c:1191 task_wakeup
  task_run_applet 24 33.79ms 1.408ms 5.838ms 243.3us <- sc_applet_create@src/stconn.c:489 appctx_wakeup
  accept_queue_process 17 56.34us 3.314us 7.505ms 441.5us <- accept_queue_process@src/listener.c:165 tasklet_wakeup
  srv_cleanup_toremove_conns 16 1.133ms 70.81us 5.685ms 355.3us <- srv_cleanup_idle_conns@src/server.c:5948 task_wakeup
  srv_cleanup_idle_conns 16 74.57us 4.660us 2.797ms 174.8us <- wake_expired_tasks@src/task.c:429 task_drop_running
  quic_conn_app_io_cb 12 786.9us 65.58us 2.042ms 170.1us <- qc_process_timer@src/xprt_quic.c:4589 tasklet_wakeup
  sc_conn_io_cb 9 20.55us 2.283us 2.475ms 275.0us <- sock_conn_iocb@src/sock.c:869 tasklet_wakeup
  h2_io_cb 8 34.12us 4.265us 1.784ms 223.0us <- h2_do_shutw@src/mux_h2.c:4656 tasklet_wakeup
  task_run_applet 4 6.615ms 1.654ms 2.306us 576.0ns <- sc_app_chk_snd_applet@src/stconn.c:996 appctx_wakeup
  quic_conn_io_cb 4 4.278ms 1.069ms 6.469us 1.617us <- qc_lstnr_pkt_rcv@src/xprt_quic.c:6184 tasklet_wakeup_after
  qc_io_cb 2 20.81us 10.40us 4.943us 2.471us <- qc_init@src/mux_quic.c:2057 tasklet_wakeup
  quic_conn_app_io_cb 2 752.9us 376.4us 63.97us 31.99us <- qc_xprt_start@src/xprt_quic.c:7122 tasklet_wakeup
  quic_accept_run 2 13.84us 6.920us 172.8us 86.42us <- quic_accept_push_qc@src/quic_sock.c:458 tasklet_wakeup
  qc_idle_timer_task 2 295.0us 147.5us 8.761us 4.380us <- wake_expired_tasks@src/task.c:344 task_wakeup
  qc_io_cb 1 867.1us 867.1us 812.8us 812.8us <- qcs_consume@src/mux_quic.c:800 tasklet_wakeup
... and calls sorted by address like this:
Tasks activity:
  function calls cpu_tot cpu_avg lat_tot lat_avg
  task_run_applet 23 32.73ms 1.423ms 5.837ms 253.8us <- sc_applet_create@src/stconn.c:489 appctx_wakeup
  task_run_applet 4 6.615ms 1.654ms 2.306us 576.0ns <- sc_app_chk_snd_applet@src/stconn.c:996 appctx_wakeup
  accept_queue_process 285 1.566ms 5.495us 72.49ms 254.3us <- listener_accept@src/listener.c:1099 tasklet_wakeup
  accept_queue_process 17 56.34us 3.314us 7.505ms 441.5us <- accept_queue_process@src/listener.c:165 tasklet_wakeup
  sc_conn_io_cb 10357182 6.297s 607.0ns 27.93m 161.8us <- sc_app_chk_rcv_conn@src/stconn.c:762 tasklet_wakeup
  sc_conn_io_cb 9347863 16.59s 1.774us 6.143m 39.43us <- h1_wake_stream_for_recv@src/mux_h1.c:2600 tasklet_wakeup
  sc_conn_io_cb 239717 492.3ms 2.053us 3.213m 804.3us <- qcs_notify_send@src/mux_quic.c:529 tasklet_wakeup
  sc_conn_io_cb 2804 33.44ms 11.92us 2.597s 926.2us <- h1_wake_stream_for_send@src/mux_h1.c:2610 tasklet_wakeup
  sc_conn_io_cb 123 80.05us 650.0ns 50.35ms 409.4us <- qcs_notify_recv@src/mux_quic.c:519 tasklet_wakeup
  sc_conn_io_cb 9 20.55us 2.283us 2.475ms 275.0us <- sock_conn_iocb@src/sock.c:869 tasklet_wakeup
  process_resolvers 159 145.9us 917.0ns 7.823ms 49.20us <- wake_expired_tasks@src/task.c:429 task_drop_running
  srv_cleanup_idle_conns 16 74.57us 4.660us 2.797ms 174.8us <- wake_expired_tasks@src/task.c:429 task_drop_running
  srv_cleanup_toremove_conns 16 1.133ms 70.81us 5.685ms 355.3us <- srv_cleanup_idle_conns@src/server.c:5948 task_wakeup
  process_stream 9891130 1.809m 10.97us 53.61m 325.2us <- sc_notify@src/stconn.c:1209 task_wakeup
  process_stream 9823933 1.887m 11.52us 48.31m 295.1us <- stream_new@src/stream.c:563 task_wakeup
  h1_io_cb 17357952 40.91s 2.357us 4.849m 16.76us <- sock_conn_iocb@src/sock.c:869 tasklet_wakeup
  h1_io_cb 501344 1.848s 3.686us 6.544m 783.2us <- conn_subscribe@src/connection.c:732 tasklet_wakeup
  h1_io_cb 44427 155.0ms 3.489us 21.50s 484.0us <- h1_takeover@src/mux_h1.c:4085 tasklet_wakeup
  h1_io_cb 2804 7.974ms 2.843us 1.980s 706.0us <- sock_conn_iocb@src/sock.c:849 tasklet_wakeup
  h1_timeout_task 3236 1.172ms 362.0ns 1.119s 345.9us <- h1_release@src/mux_h1.c:1087 task_wakeup
  h2_timeout_task 32 30.69us 958.0ns 9.038ms 282.4us <- h2_release@src/mux_h2.c:1191 task_wakeup
  h2_io_cb 173019 4.204s 24.30us 40.95s 236.7us <- h2_snd_buf@src/mux_h2.c:6712 tasklet_wakeup
  h2_io_cb 149487 424.3ms 2.838us 14.63s 97.87us <- h2c_restart_reading@src/mux_h2.c:856 tasklet_wakeup
  h2_io_cb 8 34.12us 4.265us 1.784ms 223.0us <- h2_do_shutw@src/mux_h2.c:4656 tasklet_wakeup
  qc_io_cb 50355 19.01s 377.5us 10.65s 211.4us <- qc_treat_acked_tx_frm@src/xprt_quic.c:1695 tasklet_wakeup
  qc_io_cb 9018 4.924s 546.0us 3.084s 342.0us <- qc_stream_desc_ack@src/quic_stream.c:128 tasklet_wakeup
  qc_io_cb 2623 2.669s 1.017ms 1.347s 513.5us <- h3_snd_buf@src/h3.c:1084 tasklet_wakeup
  qc_io_cb 167 10.71ms 64.12us 32.47ms 194.4us <- qc_process_timer@src/xprt_quic.c:4602 tasklet_wakeup
  qc_io_cb 2 20.81us 10.40us 4.943us 2.471us <- qc_init@src/mux_quic.c:2057 tasklet_wakeup
  qc_io_cb 1 867.1us 867.1us 812.8us 812.8us <- qcs_consume@src/mux_quic.c:800 tasklet_wakeup
  qc_idle_timer_task 2 295.0us 147.5us 8.761us 4.380us <- wake_expired_tasks@src/task.c:344 task_wakeup
  quic_conn_io_cb 4 4.278ms 1.069ms 6.469us 1.617us <- qc_lstnr_pkt_rcv@src/xprt_quic.c:6184 tasklet_wakeup_after
  quic_conn_app_io_cb 92205 3.735s 40.51us 390.9ms 4.239us <- qc_lstnr_pkt_rcv@src/xprt_quic.c:6184 tasklet_wakeup_after
  quic_conn_app_io_cb 648 12.62ms 19.47us 225.7ms 348.2us <- qc_process_timer@src/xprt_quic.c:4635 tasklet_wakeup
  quic_conn_app_io_cb 12 786.9us 65.58us 2.042ms 170.1us <- qc_process_timer@src/xprt_quic.c:4589 tasklet_wakeup
  quic_conn_app_io_cb 2 752.9us 376.4us 63.97us 31.99us <- qc_xprt_start@src/xprt_quic.c:7122 tasklet_wakeup
  quic_lstnr_dghdlr 94389 614.0ms 6.504us 30.54s 323.6us <- quic_lstnr_dgram_dispatch@src/quic_sock.c:255 tasklet_wakeup
  qc_process_timer 662 526.4us 795.0ns 1.081s 1.633ms <- wake_expired_tasks@src/task.c:344 task_wakeup
  quic_accept_run 2 13.84us 6.920us 172.8us 86.42us <- quic_accept_push_qc@src/quic_sock.c:458 tasklet_wakeup
  other 101892 4.626s 45.40us 14.84s 145.7us
It already becomes visible that some tasks have very different costs depending on where they're called (e.g. process_stream). The method used to wake them up is also shown. Applets are handled specially and shown as appctx_wakeup.	2022-09-08 16:21:22 +02:00
Willy Tarreau	a9a2384612	CLEANUP: sched: remove duplicate code in run_tasks_from_list() Now that ->wake_date is common to tasks and tasklets, we don't need anymore to carry a duplicate control block to read and update it for tasks and tasklets. And given that this code was present early in the if/else fork between tasks and tasklets, taking it out of the block allows to move the task part into a more visible "else" branch that also allows to factor the epilogue that resets th_ctx->current and updates profile_entry->cpu_time, which also used to be duplicated. Overall, doing just that saved 253 bytes in the function, or ~1/6, which is not bad considering that it's on a hot path. And the code got much more readable.	2022-09-08 14:30:38 +02:00
Willy Tarreau	6a28a30efa	MINOR: tasks: do not keep cpu and latency times in struct task It was a mistake to put these two fields in the struct task. This was added in 1.9 via commit `9efd7456e` ("MEDIUM: tasks: collect per-task CPU time and latency"). These fields are used solely by streams in order to report the measurements via the lat_ns* and cpu_ns* sample fetch functions when task profiling is enabled. For the rest of the tasks, this is pure CPU waste when profiling is enabled, and memory waste 100% of the time, as the point where these latencies and usages are measured is in the profiling array. Let's move the fields to the stream instead, and have process_stream() retrieve the relevant info from the thread's context. The struct task is now back to 120 bytes, i.e. almost two cache lines, with 32 bits still available.	2022-09-08 14:19:15 +02:00
Willy Tarreau	1efddfa6bf	MINOR: sched: store the current profile entry in the thread context The profile entry that corresponds to the current task/tasklet being profiled is now stored into the thread's context. This will allow it to be accessed from the tasks themselves. This is needed for an upcoming fix.	2022-09-08 14:19:15 +02:00
Willy Tarreau	62b5b96bcc	BUG/MINOR: sched: properly account for the CPU time of dying tasks When task profiling is enabled, the scheduler can measure and report the cumulated time spent in each task and their respective latencies. But this was wrong for tasks with few wakeups as well as for self-waking ones, because the call date needed to measure how long it takes to process the task is retrieved in the task itself (->wake_date was turned to the call date), and we could face two conditions:
- a new wakeup while the task is executing would reset the ->wake_date field before returning and cause abnormally low values to be reported; that was likely the case for task_run_applet for self-waking applets;
- when the task dies, NULL is returned and the call date couldn't be retrieved, so that CPU time was not being accounted for. This was particularly visible with process_stream() which is usually called only twice per request, and whose time was systematically halved.
The cleanest solution here is to keep in mind that the scheduler already uses quite a bit of local context in th_ctx, and place the intermediary values there so that they cannot vanish. The wake_date has to be reset immediately once read, and only its copy is used along the function. Note that this must be done both for tasks and tasklets, and that until recently tasklets were also able to report wrong values due to their sole dependency on TH_FL_TASK_PROFILING between tests. One nice benefit for future improvements is that such information will now be available from the task without having to be stored into the task itself anymore. Since the tasklet part was computed on wrapping 32-bit arithmetic and the task one was on 64-bit, the values were now consistently moved to 32-bit as it's already largely sufficient (4s spent in a task is more than twice what the watchdog would tolerate). Some further cleanups might be necessary, but the patch aimed at staying minimal.
Task profiling output after 1 million HTTP requests previously looked like this:
Tasks activity:
  function calls cpu_tot cpu_avg lat_tot lat_avg
  h1_io_cb 2012338 4.850s 2.410us 12.91s 6.417us
  process_stream 2000136 9.594s 4.796us 34.26s 17.13us
  sc_conn_io_cb 2000135 1.973s 986.0ns 30.24s 15.12us
  h1_timeout_task 137 - - 2.649ms 19.34us
  accept_queue_process 49 152.3us 3.107us 321.7yr 6.564yr
  main+0x146430 7 5.250us 750.0ns 25.92us 3.702us
  srv_cleanup_idle_conns 1 559.0ns 559.0ns 918.0ns 918.0ns
  task_run_applet 1 - - 2.162us 2.162us
Now it looks like this:
Tasks activity:
  function calls cpu_tot cpu_avg lat_tot lat_avg
  h1_io_cb 2014194 4.794s 2.380us 13.75s 6.826us
  process_stream 2000151 20.01s 10.00us 36.04s 18.02us
  sc_conn_io_cb 2000148 2.167s 1.083us 32.27s 16.13us
  h1_timeout_task 198 54.24us 273.0ns 3.487ms 17.61us
  accept_queue_process 52 158.3us 3.044us 409.9us 7.882us
  main+0x1466e0 18 16.77us 931.0ns 63.98us 3.554us
  srv_cleanup_toremove_conns 8 282.1us 35.26us 546.8us 68.35us
  srv_cleanup_idle_conns 3 149.2us 49.73us 8.131us 2.710us
  task_run_applet 3 268.1us 89.38us 11.61us 3.871us
Note the two-fold difference on process_stream(). This feature is essentially used for debugging so it has extremely limited impact. However it's used quite a bit more in bug reports and it would be desirable that at least 2.6 gets this fix backported. It depends on at least these two previous patches which will then also have to be backported:
- MINOR: task: permanently enable latency measurement on tasklets
- CLEANUP: task: rename ->call_date to ->wake_date	2022-09-08 14:19:15 +02:00
Willy Tarreau	04e50b3d32	CLEANUP: task: rename ->call_date to ->wake_date This field is misnamed because its real and important content is the date the task was woken up, not the date it was called. It temporarily holds the call date during execution but this remains confusing. In fact before the latency measurements were possible it was indeed a call date. Thus it will now be called wake_date. This change is necessary because a subsequent fix will require the introduction of the real call date in the thread ctx.	2022-09-08 14:19:15 +02:00
Willy Tarreau	768c2c5678	MINOR: task: permanently enable latency measurement on tasklets When tasklet latency measurement was enabled in 2.4 with commit `b2285de04` ("MINOR: tasks: also compute the tasklet latency when DEBUG_TASK is set"), the feature was conditioned on DEBUG_TASK because the field would add 8 bytes to the struct tasklet. This approach was not a very good idea because the struct ends on an int anyway thus it does finish with a 32-bit hole regardless of the presence of this field. What is true however is that adding it turned a 64-byte struct to 72-byte when caller debugging is enabled. This patch revisits this with a minor change. Now only the lowest 32 bits of the call date are stored, so they always fit in the remaining hole, and this allows to remove the dependency on DEBUG_TASK. With debugging off, we're now seeing a 48-byte struct, and with debugging on it's exactly 64 bytes, thus still exactly one cache line. 32 bits allow a latency of 4 seconds on a tasklet, which already indicates a completely dead process, so there's no point storing the upper bits at all. And even in the event it would happen once in a while, the lost upper bits do not really add any value to the debug reports. Also, now one tasklet wakeup every 4 billion will not be sampled due to the test on the value itself. Similarly we just don't care, it's statistics and the measurements are not 9-digit accurate anyway.	2022-09-08 14:19:15 +02:00
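Keeping only the low 32 bits of a nanosecond timestamp still yields correct latencies below about 4.29 seconds, because unsigned subtraction wraps modulo 2^32. The sketch below illustrates this under hypothetical names (`sk_wake_stamp`, `sk_latency`); it also shows why a stamp whose low 32 bits happen to be zero is not sampled, as the message notes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Keep only the lowest 32 bits of a 64-bit nanosecond date. */
static uint32_t sk_wake_stamp(uint64_t now_ns)
{
    return (uint32_t)now_ns;
}

/* Compute the latency since wake_date. A zero stamp means "not set",
 * so a wakeup whose low 32 bits are exactly zero is simply skipped.
 * Returns true and fills *lat_ns when a measurement is available. */
static bool sk_latency(uint32_t wake_date, uint64_t now_ns, uint32_t *lat_ns)
{
    if (!wake_date)
        return false;
    /* Unsigned wrap-around makes this valid across a 2^32 boundary. */
    *lat_ns = (uint32_t)now_ns - wake_date;
    return true;
}
```

For instance, a wakeup stamped at 0x1FFFFFFF0 ns and called 0x30 ns later crosses the 32-bit boundary, yet the subtraction still returns 0x30.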
Willy Tarreau	91a7c164b4	MINOR: task: move the niced_tasks counter to the thread group context This one is only used as a hint to improve scheduling latency, so there is no more point in keeping it global since each thread group handles its own run q	2022-07-15 19:43:10 +02:00
Willy Tarreau	b0e7712fb2	MEDIUM: task/thread: move the task shared wait queues per thread group Their migration was postponed for convenience only but now's time for having the shared wait queues per thread group and not just per process, otherwise the WQ lock uses a huge amount of CPU alone.	2022-07-15 19:43:10 +02:00
Willy Tarreau	bdcd32598f	MINOR: thread: only use atomic ops to touch the flags The thread flags are touched a little bit by other threads, e.g. the STUCK flag may be set by other ones, and they're watched a little bit. As such we need to use atomic ops only to manipulate them. Most places were already using them, but here we generalize the practice. Only ha_thread_dump() does not change because it's run under isolation.	2022-07-01 19:15:14 +02:00
Willy Tarreau	f3efef4d60	MINOR: thread: make wake_thread() take care of the sleeping threads mask Almost every call place of wake_thread() checks for sleeping threads and clears the sleeping mask itself, while the function is solely used for that. Let's move the check and the clearing of the bit inside the function itself. Note that updt_fd_polling() still performs the check because its rules are a bit different.	2022-07-01 19:15:14 +02:00
Willy Tarreau	319d136ff9	MEDIUM: task: use regular eb32 trees for the run queues Since we don't mix tasks from different threads in the run queues anymore, we don't need to use the eb32sc_ trees and we can switch to the regular eb32 ones. This uses cheaper lookup and insert code, and a 16-thread test on the queues shows a performance increase from 570k RPS to 585k RPS.	2022-07-01 19:15:14 +02:00
Willy Tarreau	c958c70ec8	MINOR: task: replace global_tasks_mask with a check for tree's emptiness This bit field used to be a per-thread cache of the result of the last lookup of the presence of a task for each thread in the shared cache. Since we now know that each thread has its own shared cache, a test of emptiness is now sufficient to decide whether or not the shared tree has a task for the current thread. Let's just remove this mask.	2022-07-01 19:15:14 +02:00
Willy Tarreau	da195e8aab	MINOR: task: remove grq_total and use rq_total instead grq_total was only used to know how many tasks were being queued in the global runqueue for stats purposes, and that was transferred to the per thread rq_total counter once assigned. We don't need this anymore since we know where they are, so let's just directly update rq_total and drop that one.	2022-07-01 19:15:14 +02:00
Willy Tarreau	b17dd6cc19	MEDIUM: task: replace the global rq_lock with a per-rq one There's no point having a global rq_lock now that we have one shared RQ per thread, let's have one lock per runqueue instead.	2022-07-01 19:15:14 +02:00
Willy Tarreau	6f78038d72	MEDIUM: task: move the shared runqueue to one per thread Since we only use the shared runqueue to put tasks only assigned to known threads, let's move that runqueue to each of these threads. The goal will be to arrange an N*(N-1) mesh instead of a central contention point. The global_rqueue_ticks had to be dropped (for good) since we'll now use the per-thread rqueue_ticks counter for both trees. A few points to note: - the rq_lock still remains the global one for now so there should not be any gain in doing this, but should this trigger any regression, it is important to detect whether it's related to the lock or to the tree. - there's no more reason for using the scope-based version of the ebtree now, we could switch back to the regular eb32_tree. - it's worth checking if we still need TASK_GLOBAL (probably only to delete a task in one's own shared queue maybe).	2022-07-01 19:15:14 +02:00
Willy Tarreau	a4fb79b4a2	MINOR: task: make rqueue_ticks atomic The runqueue ticks counter is per-thread and wasn't initially meant to be shared. We'll soon have to share it so let's make it atomic. It's only updated when waking up a task, and no performance difference was observed. It was moved in the thread_ctx struct so that it doesn't pollute the local cache line when it's later updated by other threads.	2022-07-01 19:15:14 +02:00
Willy Tarreau	fc5de15baa	CLEANUP: task: remove the now unused TASK_GLOBAL flag TASK_GLOBAL was exclusively used by task_unlink_rq(), as such it can be dropped.	2022-07-01 19:15:14 +02:00
Willy Tarreau	159e3acf5d	MEDIUM: task: remove TASK_SHARED_WQ and only use t->tid TASK_SHARED_WQ was set upon task creation and never changed afterwards. Thus if a task was created to run anywhere (e.g. a check or a Lua task), all its timers would always pass through the shared timers queue with a lock. Now we know that tid<0 indicates a shared task, so we can use that to decide whether or not to use the shared queue. The task might be migrated using task_set_affinity() but it's always dequeued first so the check will still be valid. Not only does this remove a flag that's difficult to keep synchronized with the thread ID, but it should significantly lower the load on systems with many checks. A quick test with 5000 servers and fast checks that were saturating the CPU shows that the check rate increased by 20% (hence the CPU usage dropped by 17%). It's worth noting that run_task_lists() almost no longer appears in perf top now.	2022-07-01 19:15:14 +02:00
Willy Tarreau	c44d08ebc4	MAJOR: task: replace t->thread_mask with 1<<t->tid when thread mask is needed A few places still need the task's thread mask. Now we know that it's always either one bit or all bits of all_threads_mask, so we can replace it with either 1<<tid or all_threads_mask depending on what's expected. It's worth noting that the global_tasks_mask is still set this way and that it's reaching its limits. Similarly, the task_new() API would deserve an update to stop using a thread mask and use a thread number instead. Similarly, task_set_affinity() should be updated to directly take a thread number. At this point the task's thread mask is not used anymore.	2022-07-01 19:15:14 +02:00
Willy Tarreau	29ffe26733	MAJOR: task: use t->tid instead of ffsl(t->thread_mask) to take the thread ID At several places we need to figure the ID of the first thread allowed to run a task. Till now this was performed using my_ffsl(t->thread_mask) but since we now have the thread ID stored into the task, let's use it instead. This is tagged major because it starts to assume that tid<0 is strictly equivalent to atleast2(thread_mask), and that as such, among the allowed threads is the current one.	2022-07-01 19:15:14 +02:00
Frédéric Lécaille	ad548b54a7	MINOR: task: Add tasklet_wakeup_after() We want to be able to schedule a tasklet onto a thread after the current tasklet is done. What we have to do is to insert this tasklet at the head of the thread task list. Furthermore, we would like to serialize the tasklets. They must be run in the same order as the order in which they have been scheduled. This is implemented by passing a list of tasklets as parameter (see <head> parameters) which must be reused for subsequent calls. _tasklet_wakeup_after_on() is implemented to accomplish this job. tasklet_wakeup_after_on() and tasklet_wakeup_after() are only wrapper macros around _tasklet_wakeup_after_on(). tasklet_wakeup_after_on() does exactly the same thing as _tasklet_wakeup_after_on() without having to pass the filename and line in the filename as parameters (useful when DEBUG_TASK is enabled). tasklet_wakeup_after() also hides the usage of the thread parameter which is <tl> tasklet thread ID.	2022-06-30 14:24:04 +02:00
Willy Tarreau	9b3aa63df7	BUG/MINOR: task: fix thread assignment in tasklet_kill() tasklet_kill() was introduced in 2.5-dev4 with commit `7b368339a` ("MEDIUM: task: implement tasklet kill"), but a comparison error there makes tasklets killed on thread 1 assigned to the killing thread. Fortunately, the function was finally not used so there's no harm right now, hence the minor tag, but this must be fixed and backported in case a later fix relies on it. This should be backported to 2.5.	2022-06-16 18:17:44 +02:00
Willy Tarreau	f5aef027ce	OPTIM: task: do not consult shared WQ when we're already full If we've stopped consulting the local wait queue due to too many tasks (max_processed <= 0), there's no point starting to lock the shared WQ, check the first task's expiration date, upgrading the lock just to refrain from doing the work because of the limit. All this does is increase contention on an already contended system. Note that there is still a fairness issue in this WQ dequeuing code. If each thread is busy with expired tasks, no thread will dequeue the global ones. In practice it doesn't make much sense and should quickly resorb, but it could be nice to have an alternating flag indicating where to start from on next call to improve this.	2022-06-14 16:15:15 +02:00
Willy Tarreau	3ccb14d60d	MINOR: thread: get rid of MAX_THREADS_MASK This macro was used both for binding and for lookups. When binding tasks or FDs, using all_threads_mask instead is better as it will later be per group. For lookups, ~0UL always does the job. Thus in practice the macro was already almost not used anymore since the rest of the code could run fine with a constant of all ones there.	2022-06-14 11:18:40 +02:00
Willy Tarreau	680ed5f28b	MINOR: task: move profiling bit to per-thread Instead of having a global mask of all the profiled threads, let's have one flag per thread in each thread's flags. They are never accessed more than one at a time and are better located inside the threads' contexts for both performance and scalability.	2022-06-14 10:38:03 +02:00
Willy Tarreau	6c8babf6c4	BUG/MAJOR: sched: prevent rare concurrent wakeup of multi-threaded tasks Since the relaxation of the run-queue locks in 2.0 there has been a very small but existing race between expired tasks and running tasks: a task might be expiring and being woken up at the same time, on different threads. This is protected against via the TASK_QUEUED and TASK_RUNNING flags, but just after the task finishes executing, it releases its TASK_RUNNING bit and only then it may go to task_queue(). This one will do nothing if the task's ->expire field is zero, but if the field turns to zero between this test and the call to __task_queue() then three things may happen: - the task may remain in the WQ until the 24 next days if it's in the future; - the task may prevent any other task after it from expiring during the 24 next days once it's queued; - if DEBUG_STRICT is set on 2.4 and above, an abort may happen; - since 2.2, if the task got killed in between, then we may even requeue a freed task, causing random behaviour next time it's found there, or possibly corrupting the tree if it gets reinserted later. The peers code is one call path that easily reproduces the case with the ->expire field being reset, because it starts by setting it to TICK_ETERNITY as the first thing when entering the task handler. But other code parts also use multi-threaded tasks and rightfully expect to be able to touch their expire field without causing trouble. No trivial code path was found that would destroy such a shared task at runtime, which already limits the risks. This must be backported to 2.0.	2022-02-14 20:10:43 +01:00
Willy Tarreau	cc5cd5b8d8	BUILD: task: use list_to_mt_list() instead of casting list to mt_list There were a few casts of list* to mt_list* that were upsetting some old compilers (not sure about the effect on others). We had created list_to_mt_list() purposely for this, let's use it instead of applying this cast.	2022-01-28 19:04:02 +01:00
Willy Tarreau	3193eb9907	BUG/MINOR: task: do not set TASK_F_USR1 for no reason This application-specific flag was added in 2.4-dev by commit `6fa8bcdc7` ("MINOR: task: add an application specific flag to the state: TASK_F_USR1") to help preserve the idle connections status across wakeup calls. While the code to do this was OK for tasklets, it was wrong for tasks, as in an effort not to lose it when setting the RUNNING flag (that tasklets don't have), it ended up being unconditionally set. It just happens that for now no regular tasks use it, only tasklets. This fix makes sure we always atomically perform (state & flags | running) there, using a CAS. It also does it for tasklets because it was possible to lose some such flags if set by another thread, even though this should not happen with current code. In order to make the code more readable (and avoid the previous mistake of repeated flags in the bit field), a new TASK_PERSISTENT aggregate was declared in task.h for this. In practice the CAS is cheap here because task states are stable or convergent so the loop will almost never be taken. This should be backported to 2.4.	2021-10-21 16:17:29 +02:00
Willy Tarreau	a0b99536c8	REORG: thread/sched: move the thread_info flags to the thread_ctx The TI_FL_STUCK flag is manipulated by the watchdog and scheduler and describes the apparent life/death of a thread so it changes all the time and it makes sense to move it to the thread's context for an active thread.	2021-10-08 17:22:26 +02:00
Willy Tarreau	1a9c922b53	REORG: thread/sched: move the task_per_thread stuff to thread_ctx The scheduler contains a lot of stuff that is thread-local and not exclusively tied to the scheduler. Other parts (namely thread_info) contain similar thread-local context that ought to be merged with it but that is even less related to the scheduler. However moving more data into this structure isn't possible since task.h is high level and cannot be included everywhere (e.g. activity) without causing include loops. In the end, it appears that the task_per_thread represents most of the per-thread context defined with generic types and should simply move to tinfo.h so that everyone can use them. The struct was renamed to thread_ctx and the variable "sched" was renamed to "th_ctx". "sched" used to be initialized manually from run_thread_poll_loop(), now it's initialized by ha_set_tid() just like ti, tid, tid_bit. The memset() in init_task() was removed in favor of a bss initialization of the array, so that other subsystems can put their stuff in this array. Since the tasklet array has TL_CLASSES elements, the TL_* definitions were moved there as well, but it's not a problem. The vast majority of the change in this patch is caused by the renaming of the structures.	2021-10-08 17:22:26 +02:00
Willy Tarreau	f9d5e1079c	REORG: clock: move the updates of cpu/mono time to clock.c The entering_poll/leaving_poll/measure_idle functions that were hard to classify and used to move to various locations have now been placed into clock.c since it's precisely about time-keeping. The functions were renamed to clock_*. The samp_time and idle_time values are now static since there is no reason for them to be read from outside.	2021-10-08 17:22:26 +02:00
Willy Tarreau	5554264f31	REORG: time: move time-keeping code and variables to clock.c There is currently a problem related to time keeping. We're mixing the functions to perform calculations with the os-dependent code needed to retrieve and adjust the local time. This patch extracts from time.{c,h} the parts that are solely dedicated to time keeping. These are the "now" or "before_poll" variables for example, as well as the various now_() functions that make use of gettimeofday() and clock_gettime() to retrieve the current time. The "tv_" functions moved there were also more appropriately renamed to "clock_*". Other parts used to compute stolen time are in other files, they will have to be picked next.	2021-10-08 17:22:26 +02:00
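
The 32-bit call date described in commit 768c2c5678 above can be illustrated with a minimal sketch. This is not the HAProxy code: the struct, field, and function names below are hypothetical, and the nanosecond unit is an assumption inferred from the "4 seconds" figure in the commit message. The sketch only shows why keeping the lowest 32 bits is enough: unsigned subtraction wraps modulo 2^32, so the latency comes out right as long as it is shorter than about 4.29 seconds, and a stored value of zero is read as "not sampled".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a tasklet, reduced to the field of interest. */
struct demo_tasklet {
    uint32_t wake_date; /* low 32 bits of the wakeup date (assumed ns); 0 = not sampled */
};

/* Record the wakeup date, keeping only its lowest 32 bits. */
static void demo_record_wakeup(struct demo_tasklet *tl, uint64_t now_ns)
{
    assert(tl != NULL);
    tl->wake_date = (uint32_t)now_ns;
}

/* Return the latency in ns, or 0 when the wakeup was not sampled.
 * The unsigned subtraction wraps modulo 2^32, so the result is exact
 * whenever the real latency is below 2^32 ns, even if the 64-bit
 * clock crossed a 32-bit boundary in between.
 */
static uint32_t demo_latency(const struct demo_tasklet *tl, uint64_t now_ns)
{
    if (!tl->wake_date)
        return 0; /* truncated date was exactly zero: treated as unsampled */
    return (uint32_t)now_ns - tl->wake_date;
}
```

A wakeup whose truncated date happens to be exactly zero is indistinguishable from an unsampled one, which matches the "one tasklet wakeup every 4 billion" remark in the commit message.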


287 Commits