From 5405c9cdf3127f74201dd6a812fc45689407fe38 Mon Sep 17 00:00:00 2001
From: Willy Tarreau
Date: Fri, 17 Feb 2023 14:55:41 +0100
Subject: [PATCH] BUG/MEDIUM: wdt: fix wrong thread being checked for sleeping

In 2.7, the method used to check for a sleeping thread changed with commit
e7475c8e7 ("MEDIUM: tasks/fd: replace sleeping_thread_mask with a
TH_FL_SLEEPING flag"). Previously there was a global sleeping mask and now
there is a flag per thread. The commit above partially broke the watchdog
by looking at the current thread's flags via th_ctx instead of the reported
thread's flags, and using an AND condition instead of an OR to update and
leave.

This can cause a wrong thread to be killed when the load is uneven. For
example, when enabling busy polling and sending traffic over a single
connection, all threads have their run time grow, and if the one receiving
the signal is also processing some traffic, it will not match the
sleeping/harmless condition and will set the stuck flag, then die upon next
invocation. While it's reproducible in tests, it's unlikely to be met in
field.

This fix should be backported to 2.7.
---
 src/wdt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/wdt.c b/src/wdt.c
index 9f5b81729..865bb7b25 100644
--- a/src/wdt.c
+++ b/src/wdt.c
@@ -83,7 +83,7 @@ void wdt_handler(int sig, siginfo_t *si, void *arg)
 		if (!p || n - p < 1000000000UL)
 			goto update_and_leave;
 
-		if ((_HA_ATOMIC_LOAD(&th_ctx->flags) & TH_FL_SLEEPING) &&
+		if ((_HA_ATOMIC_LOAD(&ha_thread_ctx[thr].flags) & TH_FL_SLEEPING) ||
 		    (_HA_ATOMIC_LOAD(&ha_tgroup_ctx[tgrp-1].threads_harmless) & thr_bit)) {
 			/* This thread is currently doing exactly nothing
 			 * waiting in the poll loop (unlikely but possible),