Commit Graph

285 Commits

Author SHA1 Message Date
Valentine Krasnobaeva
ff461efc59 MINOR: debug: align output style of debug_parse_cli_show_dev with cpu_dump_topology
Align titles style of debug_parse_cli_show_dev() with
cpu_dump_topology(). We will call the latter inside of
debug_parse_cli_show_dev() to show thread-cpu bindings info.
2025-07-17 19:08:06 +02:00
Willy Tarreau
110625bdb2 MINOR: debug: report haproxy and operating system info in panic dumps
The goal is to help figure the OS version (kernel and userland), any
virtualization/containers, and the haproxy version and build features.
Sometimes even reporters themselve can be mistaken about the running
version or environment. Also printing this at the top hepls draw a
visual delimitation between warnings and panic. Now we get something
like this:

  PANIC! Thread 1 is about to kill the process.

  HAProxy info:
    version: 3.3-dev3-c863c0-18
    features: +51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY (...)

  Operating system info:
    virtual machine: no
    container: no
    kernel: Linux 6.1.131 #1 SMP PREEMPT_DYNAMIC Fri Mar 14 01:04:55 CET 2025 x86_64
    userland: Slackware 15.0 x86_64

  * Thread 1 : id=0x7f615a8775c0 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
        1/1    stuck=0 prof=0 harmless=0 isolated=0
               cpu_ns: poll=1835010197 now=1835066102 diff=55905
               (...)
2025-07-15 17:18:29 +02:00
Valentine Krasnobaeva
0c63883be1 MINOR: debug: add distro name and version in postmortem
Since 2012, systemd compliant distributions contain
/etc/os-release file. This file has some standardized format, see details at
https://www.freedesktop.org/software/systemd/man/latest/os-release.html.

Let's read it in feed_post_mortem_linux() to gather more info about the
distribution.

(cherry picked from commit f1594c41368baf8f60737b229e4359fa7e1289a9)
Signed-off-by: Willy Tarreau <w@1wt.eu>
2025-07-11 11:48:19 +02:00
Willy Tarreau
697a531516 MINOR: debug: bump the dump buffer to 8kB
Now with the improved backtraces, the lock history and details in the
mux layers, some dumps appear truncated or with some chars alone at
the beginning of the line. The issue is in fact caused by the limited
dump buffer size (2kB for stderr, 4kB for warning), that cannot hold
a complete trace anymore.

Let's jump bump them to 8kB, this will be plenty for a long time.
2025-05-07 10:02:58 +02:00
Willy Tarreau
3bb6eea6d5 DEBUG: threads: display held locks in threads dumps
Based on the lock history, we can spot some locks that are still held
by checking the last operation that happened on them: if it's not an
unlock, then we know the lock is held. In this case we append the list
after "locked:" with their label and state like below:

  U:QUEUE S:IDLE_CONNS U:IDLE_CONNS R:TASK_WQ U:TASK_WQ S:QUEUE S:QUEUE S:QUEUE locked: QUEUE(S)
  S:IDLE_CONNS U:IDLE_CONNS S:TASK_RQ U:TASK_RQ S:QUEUE U:QUEUE S:IDLE_CONNS locked: IDLE_CONNS(S)
  R:TASK_WQ S:TASK_WQ R:TASK_WQ S:TASK_WQ R:TASK_WQ S:TASK_WQ R:TASK_WQ locked: TASK_WQ(R)
  W:STK_TABLE W:STK_TABLE_UPDT U:STK_TABLE_UPDT W:STK_TABLE W:STK_TABLE_UPDT U:STK_TABLE_UPDT W:STK_TABLE W:STK_TABLE_UPDT locked: STK_TABLE(W) STK_TABLE_UPDT(W)

The format is slightly different (label(status)) so as to easily
differentiate them visually from the history.
2025-05-06 05:20:37 +02:00
Willy Tarreau
d9a659ed96 MINOR: threads/cli: display the lock history on "show threads"
This will display the lock labels and modes for each non-empty step
at the end of "show threads" when these are defined. This allows to
emit up to the last 8 locking operation for each thread on 64 bit
machines.
2025-04-28 16:50:34 +02:00
Willy Tarreau
874ba2afed CLEANUP: debug: no longer set nor use TH_FL_DUMPING_OTHERS
TH_FL_DUMPING_OTHERS was being used to try to perform exclusion between
threads running "show threads" and those producing warnings. Now that it
is much more cleanly handled, we don't need that type of protection
anymore, which was adding to the complexity of the solution. Let's just
get rid of it.
2025-04-17 16:25:47 +02:00
Willy Tarreau
513397ac82 MINOR: debug: make ha_stuck_warning() print the whole message at once
It has been noticed quite a few times during troubleshooting and even
testing that warnings can happen in avalanches from multiple threads
at the same time, and that their reporting it interleaved bacause the
output is produced in small chunks. Originally, this code inspired by
the panic code aimed at making sure to log whatever could be emitted
in case it would crash later. But this approach was wrong since writes
are atomic, and performing 5 writes in sequence in each dumping thread
also means that the outputs can be mixed up at 5 different locations
between multiple threads. The output of warnings is never very long,
and the stack-based buffer is 4kB so let's just concatenate everything
in the buffer and emit it at once using a single write(). Now there's
no longer this confusion on the output.
2025-04-17 16:25:47 +02:00
Willy Tarreau
c16d5415a8 MINOR: debug: make ha_stuck_warning() only work for the current thread
Since we no longer call it with a foreign thread, let's simplify its code
and get rid of the special cases that were relying on ha_thread_dump_fill()
and synchronization with a remote thread. We're not only dumping the
current thread so ha_thread_dump_one() is sufficient.
2025-04-17 16:25:47 +02:00
Willy Tarreau
b24d7f248e MINOR: pass a valid buffer pointer to ha_thread_dump_one()
The goal is to let the caller deal with the pointer so that the function
only has to fill that buffer without worrying about locking. This way,
synchronous dumps from "show threads" are produced and emitted directly
without causing undesired locking of the buffer nor risking causing
confusion about thread_dump_buffer containing bits from an interrupted
dump in progress.

It's only the caller that's responsible for notifying the requester of
the end of the dump by setting bit 0 of the pointer if needed (i.e. it's
only done in the debug handler).
2025-04-17 16:25:47 +02:00
Willy Tarreau
5ac739cd0c MINOR: debug: remove unused case of thr!=tid in ha_thread_dump_one()
This function was initially designed to dump any threadd into the presented
buffer, but the way it currently works is that it's always called for the
current thread, and uses the distinction between coming from a sighandler
or being called directly to detect which thread is the caller.

Let's simplify all this by replacing thr with tid everywhere, and using
the thread-local pointers where it makes sense (e.g. th_ctx, th_ctx etc).
The confusing "from_signal" argument is now replaced with "is_caller"
which clearly states whether or not the caller declares being the one
asking for the dump (the logic is inverted, but there are only two call
places with a constant).
2025-04-17 16:25:47 +02:00
Willy Tarreau
5646ec4d40 MINOR: debug: always reset the dump pointer when done
We don't need to copy the old dump pointer to the thread_dump_pointer
area anymore to indicate a dump is collected. It used to be done as an
artificial way to keep the pointer for the post-mortem analysis but
since we now have this pointer stored separately, that's no longer
needed and it simplifies the mechanim to reset it.
2025-04-17 16:25:47 +02:00
Willy Tarreau
6d8a523d14 MINOR: tinfo: keep a copy of the pointer to the thread dump buffer
Instead of using the thread dump buffer for post-mortem analysis, we'll
keep a copy of the assigned pointer whenever it's used, even for warnings
or "show threads". This will offer more opportunities to figure from a
core what happened, and will give us more freedom regarding the value of
the thread_dump_buffer itself. For example, even at the end of the dump
when the pointer is reset, the last used buffer is now preserved.
2025-04-17 16:25:47 +02:00
Willy Tarreau
d20e9cad67 MINOR: debug: protect ha_dump_backtrace() against risks of re-entrance
If a thread is dumping itself (warning, show thread etc) and another one
wants to dump the state of all threads (e.g. panic), it may interrupt the
first one during backtrace() and re-enter it from the signal handler,
possibly triggering a deadlock in the underlying libc. Let's postpone
the debug signal delivery at this point until the call ends in order to
avoid this.
2025-04-17 16:25:47 +02:00
Willy Tarreau
5b5960359f MINOR: debug: do not statify a few debugging functions often used with wdt/dbg
A few functions are used when debugging debug signals and watchdog, but
being static, they're not resolved and are hard to spot in dumps, and
they appear as any random other function plus an offset. Let's just not
mark them static anymore, it only hurts:
  - cli_io_handler_show_threads()
  - debug_run_cli_deadlock()
  - debug_parse_cli_loop()
  - debug_parse_cli_panic()
2025-04-17 16:25:47 +02:00
Willy Tarreau
47f8397afb BUG/MINOR: debug: detect and prevent re-entrance in ha_thread_dump_fill()
In the following trace trying to abuse the watchdog from the CLI's
"debug dev loop" command running in parallel to "show threads" loops,
it's clear that some re-entrance may happen in ha_thread_dump_fill().

A first minimal fix consists in using a test-and-set on the flag
indicating that the function is currently dumping threads, so that
the one from the signal just returns. However the caller should be
made more reliable to serialize all of this, that's for future
work.

Here's an example capture of 7 threads stuck waiting for each other:
  (gdb) bt
  #0  0x00007fe78d78e147 in sched_yield () from /lib64/libc.so.6
  #1  0x0000000000674a05 in ha_thread_relax () at src/thread.c:356
  #2  0x00000000005ba4f5 in ha_thread_dump_fill (thr=2, buf=0x7ffdd8e08ab0) at src/debug.c:402
  #3  ha_thread_dump_fill (buf=0x7ffdd8e08ab0, thr=<optimized out>) at src/debug.c:384
  #4  0x00000000005baac4 in ha_stuck_warning (thr=thr@entry=2) at src/debug.c:840
  #5  0x00000000006a360d in wdt_handler (sig=<optimized out>, si=<optimized out>, arg=<optimized out>) at src/wdt.c:156
  #6  <signal handler called>
  #7  0x00007fe78d78e147 in sched_yield () from /lib64/libc.so.6
  #8  0x0000000000674a05 in ha_thread_relax () at src/thread.c:356
  #9  0x00000000005ba4c2 in ha_thread_dump_fill (thr=2, buf=0x7fe78f2d6420) at src/debug.c:426
  #10 ha_thread_dump_fill (buf=0x7fe78f2d6420, thr=2) at src/debug.c:384
  #11 0x00000000005ba7c6 in cli_io_handler_show_threads (appctx=0x2a89ab0) at src/debug.c:548
  #12 0x000000000057ea43 in cli_io_handler (appctx=0x2a89ab0) at src/cli.c:1176
  #13 0x00000000005d7885 in task_process_applet (t=0x2a82730, context=0x2a89ab0, state=<optimized out>) at src/applet.c:920
  #14 0x0000000000659002 in run_tasks_from_lists (budgets=budgets@entry=0x7ffdd8e0a5c0) at src/task.c:644
  #15 0x0000000000659bd7 in process_runnable_tasks () at src/task.c:886
  #16 0x00000000005cdcc9 in run_poll_loop () at src/haproxy.c:2858
  #17 0x00000000005ce457 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3075
  #18 0x0000000000430628 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:3665
2025-04-17 16:25:47 +02:00
Willy Tarreau
ebf1757dc2 BUG/MINOR: wdt/debug: avoid signal re-entrance between debugger and watchdog
As seen in issue #2860, there are some situations where a watchdog could
trigger during the debug signal handler, and where similarly the debug
signal handler may trigger during the wdt handler. This is really bad
because it could trigger some deadlocks inside inner libc code such as
dladdr() or backtrace() since the code will not protect against re-
entrance but only against concurrent accesses.

A first attempt was made using ha_sigmask() but that's not always very
convenient because the second handler is called immediately after
unblocking the signal and before returning, leaving signal cascades in
backtrace. Instead, let's mark which signals to block at registration
time. Here we're blocking wdt/dbg for both signals, and optionally
SIGRTMAX if DEBUG_DEV is used as that one may also be used in this case.

This should be backported at least to 3.1.
2025-04-17 16:25:47 +02:00
Willy Tarreau
0b56839455 BUG/MINOR debug: fix !USE_THREAD_DUMP in ha_thread_dump_fill()
The function must make sure to return NULL for foreign threads and
the local buffer for the current thread in this case, otherwise panics
(and sometimes even warnings) will segfault when USE_THREAD_DUMP is
disabled. Let's slightly re-arrange the function to reduce the #if/else
since we have to specifically handle the case of !USE_THREAD_DUMP anyway.

This needs to be backported wherever b8adef065d ("MEDIUM: debug: on
panic, make the target thread automatically allocate its buf") was
backported (at least 2.8).
2025-04-17 16:25:47 +02:00
Willy Tarreau
3cbbf41cd8 MINOR: debug: detect call instructions and show the branch target in backtraces
In backtraces, sometimes it's difficult to know what was called by a
given point, because some functions can be fairly long making one
doubt about the correct pointer of unresolved ones, others might
just use a tail branch instead of a call + return, etc. On common
architectures (x86 and aarch64), it's not difficult to detect and
decode a relative call, so let's do it on both of these platforms
and show the branch location after a '>'. Example:

x86_64:
   call trace(19):
   |       0x6bd644 <64 8b 38 e8 ac f7 ff ff]: debug_handler+0x84/0x95 > ha_thread_dump_one
   | 0x7feb3e5383a0 <00 00 00 00 0f 1f 40 00]: libpthread:+0x123a0
   | 0x7feb3e53748b <c0 b8 03 00 00 00 0f 05]: libpthread:__close+0x3b/0x8b
   |       0x7619e4 <44 89 ff e8 fc 97 d4 ff]: _fd_delete_orphan+0x1d4/0x1d6 > main-0x2130
   |       0x743862 <8b 7f 68 e8 8e e1 01 00]: sock_conn_ctrl_close+0x12/0x54 > fd_delete
   |       0x5ac822 <c0 74 05 4c 89 e7 ff d0]: main+0xff512
   |       0x5bc85c <48 89 ef e8 04 fc fe ff]: main+0x10f54c > main+0xff150
   |       0x5be410 <4c 89 e7 e8 c0 e1 ff ff]: main+0x111100 > main+0x10f2c0
   |       0x6ae6a4 <28 00 00 00 00 ff 51 58]: cli_io_handler+0x31524
   |       0x6aeab4 <7c 24 08 e8 fc fa ff ff]: sc_destroy+0x14/0x2a4 > cli_io_handler+0x31430
   |       0x6c685d <48 89 ef e8 43 82 fe ff]: process_chk_conn+0x51d/0x1927 > sc_destroy

aarch64:
   call trace(15):
   | 0xaaaaad0c1540 <60 6a 60 b8 c3 fd ff 97]: debug_handler+0x9c/0xbc > ha_thread_dump_one
   | 0xffffa8c177ac <c2 e0 3b d5 1f 20 03 d5]: linux-vdso:__kernel_rt_sigreturn
   | 0xaaaaad0b0964 <c0 03 5f d6 d2 ff ff 97]: cli_io_handler+0x28e44 > sedesc_new
   | 0xaaaaad0b22a4 <00 00 80 d2 94 f9 ff 97]: sc_new_from_strm+0x1c/0x54 > cli_io_handler+0x28dd0
   | 0xaaaaad0167e8 <21 00 80 52 a9 6e 02 94]: stream_new+0x258/0x67c > sc_new_from_strm
   | 0xaaaaad0b21f8 <e1 03 13 aa e7 90 fd 97]: sc_new_from_endp+0x38/0xc8 > stream_new
   | 0xaaaaacfda628 <21 18 40 f9 e7 5e 03 94]: main+0xcaca8 > sc_new_from_endp
   | 0xaaaaacfdb95c <42 c0 00 d1 02 f3 ff 97]: main+0xcbfdc > main+0xc8be0
   | 0xaaaaacfdd3f0 <e0 03 13 aa f5 f7 ff 97]: h1_io_cb+0xd0/0xb90 > main+0xcba40
2025-04-14 20:06:48 +02:00
Willy Tarreau
9740f15274 MINOR: debug: in call traces, dump the 8 bytes before the return address, not after
In call traces, we're interested in seeing the code that was executed, not
the code that was not yet. The return address is where the CPU will return
to, so we want to see the bytes that precede this location. In the example
below on x86 we can clearly see a number of direct "call" instructions
(0xe8 + 4 bytes). There are also indirect calls (0xffd0) that cannot be
exploited but it gives insights about where the code branched, which will
not always be the function above it if that one used tail branching for
example. Here's an example dump output:

         call ------------,
                          v
       0x6bd634 <64 8b 38 e8 ac f7 ff ff]: debug_handler+0x84/0x95
 0x7fa4ea2593a0 <00 00 00 00 0f 1f 40 00]: libpthread:+0x123a0
       0x752132 <00 00 00 00 00 90 41 55]: htx_remove_blk+0x2/0x354
       0x5b1a2c <4c 89 ef e8 04 07 1a 00]: main+0x10471c
       0x5b5f05 <48 89 df e8 8b b8 ff ff]: main+0x108bf5
       0x60b6f4 <89 ee 4c 89 e7 41 ff d0]: tcpcheck_eval_send+0x3b4/0x14b2
       0x610ded <00 00 00 e8 53 a5 ff ff]: tcpcheck_main+0x7dd/0xd36
       0x6c5ab4 <48 89 df e8 5c ab f4 ff]: wake_srv_chk+0xc4/0x3d7
       0x6c5ddc <48 89 f7 e8 14 fc ff ff]: srv_chk_io_cb+0xc/0x13
2025-04-14 19:28:22 +02:00
Willy Tarreau
b708345c17 DEBUG: counters: add the ability to enable/disable updating the COUNT_IF counters
These counters can have a noticeable cost on large machines, though not
dramatic. There's no single good choice to keep them enabled or disabled.
This commit adds multiple choices:
  - DEBUG_COUNTERS set to 2 will automatically enable them by default, while
    1 will disable them by default
  - the global "debug.counters on/off" will allow to change the setting at
    boot, regardless of DEBUG_COUNTERS as long as it was at least 1.
  - the CLI "debug counters on/off" will also allow to change the value at
    run time, allowing to observe a phenomenon while it's happening, or to
    disable counters if it's suspected that their cost is too high

Finally, the "debug counters" command will append "(stopped)" at the end
of the CNT lines when these counters are stopped.

Not that the whole mechanism would easily support being extended to all
counter types by specifying the types to apply to, but it doesn't seem
useful at all and would require the user to also type "cnt" on debug
lines. This may easily be changed in the future if it's found relevant.
2025-04-14 19:02:13 +02:00
Willy Tarreau
61d633a3ac DEBUG: rename DEBUG_GLITCHES to DEBUG_COUNTERS and enable it by default
Till now the per-line glitches counters were only enabled with the
confusingly named DEBUG_GLITCHES (which would not turn glitches off
when disabled). Let's instead change it to DEBUG_COUNTERS and make sure
it's enabled by default (though it can still be disabled with
-DDEBUG_GLITCHES=0 just like for DEBUG_STRICT). It will later be
expanded to cover more counters.
2025-04-14 19:02:13 +02:00
Willy Tarreau
a8148c313a DEBUG: init: report invalid characters in debug description strings
It's easy to leave some trailing \n or even other characters that can
mangle the debug output. Let's verify at boot time that the debug sections
are clean by checking for chars 0x20 to 0x7e inclusive. This is very simple
to do and it managed to find another one in a multi-line message:

  [WARNING]  (23696) : Invalid character 0x0a at position 96 in description string at src/cli.c:2516 _send_status()

This way new offending code will be spotted before being committed.
2025-04-14 19:02:13 +02:00
William Lallemand
a647839954 DEBUG: init: add a way to register functions for unit tests
Doing unit tests with haproxy was always a bit difficult, some of the
function you want to test would depend on the buffer or trash buffer
initialisation of HAProxy, so building a separate main() for them is
quite hard.

This patch adds a way to register a function that can be called with the
"-U" parameter on the command line, will be executed just after
step_init_1() and will exit the process with its return value as an exit
code.

When using the -U option, every keywords after this option is passed to
the callback and could be used as a parameter, letting the capability to
handle complex arguments if required by the test.

HAProxy need to be built with DEBUG_UNIT to activate this feature.
2025-03-03 12:43:32 +01:00
Willy Tarreau
fb7874c286 MINOR: tinfo: split the signal handler report flags into 3
While signals are not recursive, one signal (e.g. wdt) may interrupt
another one (e.g. debug). The problem this causes is that when leaving
the inner handler, it removes the outer's flag, hence the protection
that comes with it. Let's just have 3 distinct flags for regular signals,
debug signal and watchdog signal. We add a 4th definition which is an
aggregate of the 3 to ease testing.
2025-02-24 13:37:52 +01:00
Willy Tarreau
ddd173355c MINOR: tinfo: add a new thread flag to indicate a call from a sig handler
Signal handlers must absolutely not change anything, but some long and
complex call chains may look innocuous at first glance, yet result in
some subtle write accesses (e.g. pools) that can conflict with a running
thread being interrupted.

Let's add a new thread flag TH_FL_IN_SIG_HANDLER that is only set when
entering a signal handler and cleared when leaving them. Note, we're
speaking about real signal handlers (synchronous ones), not deferred
ones. This will allow some sensitive call places to act differently
when detecting such a condition, and possibly even to place a few new
BUG_ON().
2025-02-21 17:41:38 +01:00
Willy Tarreau
7ddcdff33f BUG/MEDIUM: debug: close a possible race between thread dump and panic()
The rework of the thread dumping mechanism in 2.8 with commit 9a6ecbd590
("MEDIUM: debug: simplify the thread dump mechanism") opened a small
race, which is that a thread in the process of dumping other ones may
block the other one from panicing while it's looping at the end of
ha_thread_dump_fill(), or any other sequence involving the currently
dumped one.

This was emphasized in 3.1 with commit 148eb5875f ("DEBUG: wdt: better
detect apparently locked up threads and warn about them") that allowed
to emit warnings about long-stuck threads, because in this case, what
happens is that sometimes a thread starts to emit a warning (or a set
of warnings), and while the warning is being awaited for, a panic
finally happens and interrupts either the dumping thread, which never
finishes and waits for the target's pointer to become NULL which will
never happen since it was supposed to do it itself, or the currently
dumped thread which could wait for the dumping thread to become ready
while this one has not released the former.

In order to address this, first we now make sure never to dump a thread
that is already in the process of dumping another one. We're adding a
new thread flag to know this situation, that is set in ha_thread_dump_fill()
and cleared in ha_thread_dump_done(). And similarly, we don't trigger
the watchdog on a thread waiting for another one to finish its dump,
as it's likely a case of warning (and maybe even a panic) that makes
them wait for each other and we don't want such cases to be reentrant.
Finally, we check in the main polling loop that the flag never accidentally
leaked (e.g. wrong flag manipulation) as this would be difficult to spot
with bad consequences.

This should be backported at least to 2.8, and should resolve github
issue #2860. Thanks to Chris Staite for the very informative backtrace
that exhibited the problem.
2025-02-10 18:34:26 +01:00
Willy Tarreau
8d63dc50ab BUG/MINOR: debug: make sure the "debug dev sched" tasks don't block stopping
When "debug dev sched" is used to pop up background tasks, these tasks
are never stopped, so we must be careful to stop them when the stopping
flag is set, otherwise they can prevent the process from stopping when
sufficiently numerous (tests went as far as 100 million tasks, leading
the run queue never being completely purged in one poll round).

No backport is needed since this is only used when debugging and tuning
the scheduler.
2025-02-07 18:04:29 +01:00
Willy Tarreau
6765a32eb4 BUG/MINOR: debug: make "debug dev sched" accept a negative TID
The TID passed to "debug dev sched" is used to pin the task to a given
thread. A negative value normally means the task is unpinned and goes
to the shared wait queue and run queue. However due to the type of the
variable, negative values were mapped as highly positive values and were
set to the current thread. Let's add the proper cast to fix this.

No backport is needed since this is only used to experiment with the
scheduler and measure its performance.
2025-02-07 18:04:29 +01:00
Valentine Krasnobaeva
8620ae7962 MINOR: debug: show boot and runtime process settings in table
Let's reformat output of "show dev" in order to show some boot and runtime
process settings in a table. This makes the output less crowded.
2025-01-24 09:54:57 +01:00
Valentine Krasnobaeva
df7f16d960 MINOR: debug: debug_parse_cli_show_dev: use errname
Let's use errname, introduced in the previous commit in the output of
"show dev". This output is destined to engineers. So, no need to provide a
long descriptions of errnos given by strerror.
2025-01-24 09:54:57 +01:00
Ilia Shipitsin
6524fbfb70 BUG/MINOR: debug: handle a possible strdup() failure
This defect was found by the coccinelle script "unchecked-strdup.cocci".
It can be backported to all supported branches.
2024-12-25 12:42:33 +01:00
Willy Tarreau
f486f976c7 BUILD: limits: make normalize_rlim() take an rlim_t to fix build on m68k
As can be seen here, the build fails on m68k since commit 665dde648
("MINOR: debug: use LIM2A to show limits") in 3.1:

  https://github.com/haproxy/haproxy/actions/runs/12440234399/job/34735360177

The reason is the comparison between a ulong limit and RLIM_INFINITY.
Indeed, on m68k, rlim_t is an unsigned long long. Let's just change
the function's input type to take an rlim_t instead. This also allows
to get rid of the casts in the call place.

This can be backported to 3.1 though it's not important given the low
prevalence of this platform for such use cases.
2024-12-25 12:33:06 +01:00
Willy Tarreau
4710ab5604 BUILD: debug: only dump/reset glitch counters when really defined
If neither DEBUG_GLITCHES nor DEBUG_STRICT is set, we end up with
no dbg_cnt section, resulting in debug_parse_cli_counters not
building due to __stop_dbg_cnt and __start_dbg_cnt not being defined.
Let's just condition the end of the function to these conditions.
An alternate approach (less elegant) is to always declare a dummy
entry of type DBG_COUNTER_TYPES in debug.c.

This must be backported to 3.1 since it was brought with glitches.
2024-12-17 16:46:25 +01:00
Willy Tarreau
1151fe6818 BUG/MEDIUM: debug: don't set the STUCK flag from debug_handler()
Since 2.0 with commit e6a02fa65a ("MINOR: threads: add a "stuck" flag
to the thread_info struct"), the TH_FL_STUCK flag was set by the
debugger to flag that a thread was stuck and report it in the output.

However, two commits later (2bfefdbaef "MAJOR: watchdog: implement a
thread lockup detection mechanism"), this flag was used to detect that
a thread had already been reported as stuck. The problem is that it
seldom happens that a "show threads" command instantly crashes because
it calls debug_handler(), which sets the flag, and if the watchdog timer
was about to trigger before going back to the scheduler, the watchdog
believes that the thread has been stuck for a while and will kill the
process.

The issue was magnified in 3.1 with the lower-delay warning, because
it's possible for a thread to die on the next wakeup after the first
warning (which calls debug_handler() hence sets the STUCK flag).

One good approach would have been to use two distinct flags, one for
"stuck" as reported by the debug handler, and one for "stuck" as seen
by the watchdog. However, one could also argue that since the second
commit, given that the wdt monitors the threads, there's no point any
more for the debug handler to set the flag itself. Removing this code
means that two consecutive "show threads" will not report "stuck" until
the watchdog sets it, which aligns better with expectations.

This can be backported to all stable releases. This code has changed a
bit over time, the "if" block and the harmless variables just need to
be removed.
2024-11-21 19:58:05 +01:00
Willy Tarreau
4420939fcd MINOR: debug/cli: replace "debug dev counters" with "debug counters"
"debug dev" commands are not meant to be used by end-users, and are
purposely not documented. Yet due to their usefulness in troubleshooting
sessions, users are increasingly invited by developers to use some of
them.

"debug dev counters" is one of them. Better move it to "debug counters"
and document it so that users can check them even if the output can look
cryptic at times. This, combined with DEBUG_GLITCHES, can be convenient
to observe suspcious activity. The doc however precises that the format
may change between versions and that new entries/types might appear
within a stable branch.
2024-11-15 16:26:01 +01:00
Willy Tarreau
808a7cc777 BUG/MINOR: debug: do not set task expiration to TICK_ETERNITY
Using "debug task", it's possible to change a task's expiration, but
we must be careful not to set it to TICK_ETERNITY. Let's use tick_add()
instead. The risk is basically nul since it's a debugging command, so
no backport is needed.
2024-11-15 15:39:00 +01:00
Willy Tarreau
502790ed7e MINOR: debug: add a new counter type for glitches
COUNT_GLITCH() will implement an unconditional counter on its declaration
line when DEBUG_GLITCHES is set, and do nothing otherwise. The output will
be reported as "GLT" and can be filtered as "glt" on the CLI. The purpose
is to help figure what's happening if some glitches counters start going
through the roof. The macro supports an optional string argument to
describe the cause of the glitch (e.g. "truncated header"), which is then
reported in the dump.

For now this is conditioned by DEBUG_GLITCHES but if it turns out to be
light enough, maybe we'll keep it enabled full time. In this case it
might have to be moved away from debug dev, or at least documented (or
done as debug counters maybe so that dev can remain undocumented and
updatable within a branch?).
2024-11-14 08:49:38 +01:00
Willy Tarreau
e119095290 MINOR: debug: explicitly permit the counter condition to be empty
In order to count new event types, we'll need to support empty conditions
so that we don't have to fake if (1) that would pollute the output. This
change checks if #cond is an empty string before concatenating it with
the optional var args, and avoids dumping the colon on the dump if the
whole description is empty.
2024-11-14 08:47:00 +01:00
Willy Tarreau
5dcf2012fc MINOR: debug: move the "recover now" warn message after the optional notes
At the end of the too long processing warning added by commit 0950778b3a
("MINOR: debug: add a function to dump a stuck thread"), there can be some
optional notes about lua and memory trimming. However it's a bit awkward
that they appear after the "trying to recover now" message. Let's just move
that message after the notes.
2024-11-07 07:56:13 +01:00
Willy Tarreau
84dd05e7d8 DEBUG: wdt: add a stats counter "BlockedTrafficWarnings" in show info
Every time a warning is issued about traffic being blocked, let's
increment a global counter so that we can check for this situation
in "show info".
2024-11-06 18:35:42 +01:00
Willy Tarreau
6127e5a4e9 DEBUG: wdt: make the blocked traffic warning delay configurable
The new global "warn-blocked-traffic-after" allows one to configure
after how much time a warning should be emitted when traffic is blocked.
2024-11-06 18:35:42 +01:00
Willy Tarreau
7337c42224 DEBUG: cli: make it possible for "debug dev loop" to trigger warnings
A new argument "warn" allows to force the emission of a warning while
stuck in the loop by making the internal state inconsistent.
2024-11-06 18:35:42 +01:00
Willy Tarreau
148eb5875f DEBUG: wdt: better detect apparently locked up threads and warn about them
In order to help users detect when threads are behaving abnormally, let's
try to emit a warning when one is no longer making any progress. This will
allow to catch faulty situations more accurately, instead of occasionally
triggering just after the long task. It will also let users know that there
is something wrong with their configuration, and inspect the call trace to
figure whether they're using excessively long rules or Lua for example (the
usual warnings about lua-load vs lua-load-per-thread are still reported).

The warning will only be emitted for threads not yet marked as stuck so
as not to interfere with panic dumps and avoid sending a warning just
before a panic. A tainted flag is set when this happens however (0x2000).
2024-11-06 18:35:42 +01:00
Willy Tarreau
0950778b3a MINOR: debug: add a function to dump a stuck thread
There's currently no way to just emit a warning informing that a thread
is stuck without crashing. This is a problem because sometimes users
would benefit from this info to clean up their configuration (e.g. abuse
of map_regm, lua-load etc).

This commit adds a new function ha_stuck_warning() that will emit a
warning indicating that the designated thread has been stuck for XX
milliseconds, with a number of streams blocked, and will make that
thread dump its own state. The warning will then be sent to stderr,
along with some reminders about the impacts of such situations to
encourage users to fix their configuration.

In order not to disrupt operations, a local 4kB buffer is allocated
in the stack. This should be quite sufficient.

For now the function is not used.
2024-11-06 18:35:42 +01:00
Willy Tarreau
0f1d37a479 DEBUG: cli: support closing "hard" using close() in addition to fd_delete()
"debug dev close <fd>" currently closes that FD using fd_delete() after
checking that it's known from the fdtab. Sometimes we also want to just
perform a pure close() of FDs not in the fdtab (pollers, etc) in order
to provoke certain error cases. The optional "hard" argument to the
command will make it use a plain close() instead of fd_delete() and skip
the fd owner check. The main visible effect when closing a traffic socket
with it is that instead of dying from a double fd_delete() by seeing that
fd.owner is already 0, it will die during the next fd_insert() seeing that
fd.owner was not 0.
2024-11-05 18:57:43 +01:00
Willy Tarreau
52240680f1 MINOR: debug: remove the redundant process.thread_info array from post_mortem
That one is huge and unneeded since we now have the pointer to the
whole thread_info[] array, which does contain the freshest version
of these info and many more. Let's just get rid of it entirely.
2024-10-28 17:14:48 +01:00
Willy Tarreau
da5cf52173 MINOR: debug: also add fdtab and acitvity to struct post_mortem
These ones are often used as well when trying to analyse sequences of
events, let's add them.
2024-10-28 17:14:48 +01:00
Willy Tarreau
2f04ebe14a MINOR: debug: also add a pointer to struct global to post_mortem
The pointer to struct global is also an important element to have in
post_mortem given that it's used a lot to take decisions in the code.
Let's just add it. It's worth noting that we could get rid of argc/argv
at this point since they're also present in the global struct, but they
don't cost much there anyway.
2024-10-26 11:33:09 +02:00
William Lallemand
944a224358 MINOR: cli: remove non-printable characters from 'debug dev fd'
When using 'debug dev fd', the output of laddr and raddr can contain
some garbage.

This patch replaces any control or non-printable character by a '.'.
2024-10-24 16:45:11 +02:00