From 9debe0fb2754c0c29203d50069389d2c55ca3fae Mon Sep 17 00:00:00 2001 From: Willy Tarreau Date: Thu, 19 Jan 2023 18:52:31 +0100 Subject: [PATCH] BUG/MEDIUM: debug/thread: make the debug handler not wait for !rdv_requests The debug handler may deadlock with some threads waiting for isolation. This may happend during a "show threads" command or even during a panic. The reason is the call to thread_harmless_end() which waits for rdv_requests to turn to zero before releasing its position in thread_dump_state, while that one may not progress if another thread was interrupted in thread_isolate() and is waiting for that thread to drop thread_dump_state. In order to address this, we now use thread_harmless_end_sig() introduced by previous commit: MINOR: threads: add a thread_harmless_end() version that doesn't wait However there's a catch: since commit f7afdd910 ("MINOR: debug: mark oneself harmless while waiting for threads to finish"), there's a second pair of thread_harmless_now()/thread_harmless_end() that surround the loop around thread_dump_state. Marking a thread harmless before this loop and dropping that without checking rdv_requests there could break the harmless promise made to the other thread if it returns first and proceeds with its isolated work. Hence we just drop this pair which was only preventive for other signal handlers, while as indicated in that patch's commit message, other signals are handled asynchronously and do not require that extra protection. This fix must be backported to 2.7. The problem can be seen by running "show threads" in fast loops (100/s) while reloading haproxy very quickly (10/s) and sending lots of traffic to it (100krps, 15 Gbps). In this case the soft stop calls pool_gc() which isolates a lot and manages to race with the dumps after a few tens of seconds, leaving the process with all threads at 100%. --- src/debug.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/src/debug.c b/src/debug.c index 05c7a5cf9..44f9755fe 100644 --- a/src/debug.c +++ b/src/debug.c @@ -1562,7 +1562,7 @@ void debug_handler(int sig, siginfo_t *si, void *arg) ha_thread_relax(); if (!harmless) - thread_harmless_end(); + thread_harmless_end_sig(); /* dump if needed */ if (thread_dump_buffer) @@ -1598,18 +1598,17 @@ void debug_handler(int sig, siginfo_t *si, void *arg) } /* now wait for all others to finish dumping: the lowest part will turn - * to zero. Then all others decrement the done part. + * to zero. Then all others decrement the done part. We must not change + * the harmless status anymore because one of the other threads might + * have been interrupted in thread_isolate() waiting for all others to + * become harmless, and changing the situation here would break that + * promise. */ - if (!harmless) - thread_harmless_now(); /* wait for everyone to finish*/ while (HA_ATOMIC_LOAD(&thread_dump_state) & THREAD_DUMP_PMASK) ha_thread_relax(); - if (!harmless) - thread_harmless_end(); - /* we're gone. Past this point anything can happen including another * thread trying to re-trigger a dump, so thread_dump_buffer and * thread_dump_tid may become invalid immediately after this call.