Let's extract some info about the system (board model, vendor etc),
this will indicate some hypervisors, some cloud instances or some
uncommon embedded boards etc. Typically, vmware, qemu and raspberry-pi
are visible here and can help during the troubleshooting session.
The goal here is to accumulate precious debugging information in a
struct that is easy to find in memory. It's aligned to 256-byte as
it also helps. We'll progressively add a lot of info about the
startup conditions, the operating system, the hardware and hypervisor
so as to limit the number of round trips between developers and users
during debugging sessions. Also, opening a core file with an hex editor
should often be sufficient to extract most of the info.
In addition, a new "show dev" command will show these information so
that they can be checked at runtime without having to wait for a crash
(e.g. if a limit is bad in a container, better know it early).
For now the struct only contains utsname that's fed at boot time.
Now when calling ha_panic() with a thread still under malloc_trim(),
we'll set a new tainted flag to easily report it, and the output
trace will report that this condition happened and will suggest to
use no-memory-trimming to avoid it in the future.
William suggested that since we can detect the presence of Lua in the
stack, let's combine it with stuck detection to set a new pair of flags
indicating a stuck Lua context and a stuck Lua shared context.
Now, executing an infinite loop in a Lua sample fetch function with
yield disabled crashes with tainted=0xe40 if loaded from a lua-load
statement, or tainted=0x640 from a lua-load-per-thread statement.
In addition, at the end of the panic dump, we can check if Lua was
seen stuck and emit recommendations about lua-load-per-thread and
the choice of dependencies depending on the presence of threads
and/or shared context.
This will make it easier to know that the panic function was called,
for the occasional case where the dump crashes and/or the stack is
corrupted and not much exploitable. Now at least it will be sufficient
to check the tainted value to know that someone called ha_panic(), and
it will also be usable to condition extra analysis.
Since commit c185bc465 ("MEDIUM: stream: now provide full stream dumps
in case of loops"), the stuck threads show the stream's pointer in the
margin since it appears immediately after a line feed. Let's add it after
the prefix and "stream=" to make the output more readable.
There used to be two working modes for this function, a single-line one
and a multi-line one, the difference being made on the "eol" argument
which could contain either a space or an LF (and with the prefix being
adjusted accordingly). Let's get rid of the single-line mode as it's
what limits the output contents because it's difficult to produce
exploitable structured data this way. It was only used in the rare case
of spinning streams and applets and these are the ones lacking info. Now
a spinning stream produces:
[ALERT] (3511) : A bogus STREAM [0x227e7b0] is spinning at 5581202 calls per second and refuses to die, aborting now! Please report this error to developers:
strm=0x227e7b0,c4a src=127.0.0.1 fe=public be=public dst=s1
txn=0x2041650,3000 txn.req=MSG_DONE,4c txn.rsp=MSG_RPBEFORE,0
rqf=1840000 rqa=8000 rpf=80000000 rpa=1400000
scf=0x24af280,EST,482 scb=0x24af430,EST,1411
af=(nil),0 sab=(nil),0
cof=0x7fdb28026630,300:H1(0x24a6f60)/RAW((nil))/tcpv4(33)
cob=0x23199f0,10000300:H1(0x24af630)/RAW((nil))/tcpv4(32)
filters={}
call trace(11):
(...)
Ilya reported in issue #2193 that the latest Fedora complains about us
passing NULL to epoll_wait() in the "debug dev fd" code to try to detect
an epoll FD. That was intentional to get the kernel's verifications and
make sure we're facing a poller, but since such a warning comes from the
libc, it's possible that it plans to replace the syscall with a wrapper
in the near future (e.g. epoll_pwait()), and that just hiding the NULL
(as was confirmed to work) might just postpone the problem.
Let's take another approach, instead we'll create a new dummy FD that
we'll try to remove from the epoll set using epoll_ctl(). Since we
created the FD we're certain it cannot be there. In this case (and
only in this case) epoll_ctl() will return -ENOENT, otherwise it will
typically return EINVAL or EBADF. It was verified that it works and
doesn't return false positives for other FD types. It should be
backported to the branches that contain a backport of the commit which
introduced the feature, apparently as far as 2.4:
5be7c198e ("DEBUG: cli: add a new "debug dev fd" expert command")
Task pointer check in debug_parse_cli_task() computes the theoric end
address of provided task pointer to check if it is valid or not thanks to
may_access() helper function.
However, relative ending address is calculated by adding task size to 't'
pointer (which is a struct task pointer), thus it will result to incorrect
address since the compiler automatically translates 't + x' to
't + x * sizeof(*t)' internally (with sizeof(*t) != 1 here).
Solving the issue by using 'ptr' (which is the void * raw address) as
starting address to prevent automatic address scaling.
This was revealed by coverity, see GH #2157.
No backport is needed, unless 9867987 ("DEBUG: cli: add "debug dev task"
to show/wake/expire/kill tasks and tasklets") gets backported.
Commit 986798718 ("DEBUG: cli: add "debug dev task" to show/wake/expire/kill
tasks and tasklets") caused a build failure on 32-bit platforms when parsing
the task's pointer. Let's use strtoul() and not strtoll(). No backport is
needed, unless the commit above gets backported.
The build without thread support was broken by commit b30ced3d8 ("BUG/MINOR:
debug: fix incorrect profiling status reporting in show threads") because
it accesses the isolated_thread variable that is not defined when threads
are disabled. In fact both the test on harmless and this one make no sense
without threads, so let's comment out the block and mark the related
variables as unused.
This may have to be backported to 2.7 if the commit above is.
Previously it would re-dump all threads to the same trash if the output
buffer was full, which it never was since the trash is of the same size.
Now it dumps one thread, copies it to the buffer and yields until it can
continue. Showing 256 threads works as expected.
Currently large setups cannot dump all their threads because they're
first dumped to the trash buffer, then copied to stderr. Here we can
now change this, instead we dump one thread at a time into the trash
and immediately send it to stderr. We also keep a copy into a local
trash chunk that's assigned to thread_dump_buffer so that a core file
still contains a copy of a large number of threads, which is generally
sufficient for the vast majority of situations.
It was verified that dumping 256 threads now produces ~55kB of output
and all of them are properly dumped.
The thread dump mechanism that is used by "show threads" and by the
panic dump is overly complicated due to an initial misdesign. It
firsts wakes all threads, then serializes their dumps, then releases
them, while taking extreme care not to face colliding dumps. In fact
this is not what we need and it reached a limit where big machines
cannot dump all their threads anymore due to buffer size limitations.
What is needed instead is to be able to dump *one* thread, and to let
the requester iterate on all threads.
That's what this patch does. It adds the thread_dump_buffer to the
struct thread_ctx so that the requester offers the buffer to the
thread that is about to be dumped. This buffer also serves as a lock.
A thread at rest has a NULL, a valid pointer indicates the thread is
using it, and 0x1 (NULL+1) is used by the dumped thread to tell the
requester it's done. This makes sure that a given thread is dumped
once at a time. In addition to this, the calling thread decides
whether it accesses the thread by itself or via the debug signal
handler, in order to get a backtrace. This is much saner because the
calling thread is free to do whatever it wants with the buffer after
each thread is dumped, and there is no dependency between threads,
once they've dumped, they're free to continue (and possibly to dump
for another requester if needed). Finally, when the THREAD_DUMP
feature is disabled and the debug signal is not used, the requester
accesses the thread by itself like before.
For now we still have the buffer size limitation but it will be
addressed in future patches.
In 2.3, commit 471425f51 ("BUG/MINOR: debug: Don't dump the lua stack
if it is not initialized") introduced the possibility to emit an empty
line when there's no Lua info to dump. The problem is that doing this
on the CLI in "show threads" marks the end of the output, and it may
affect some external tools. We need to make sure that LFs are only
emitted if there's something on the line and that all lines properly
start with the prefix.
This may be backported as far as 2.0 since the commit above was
backported there.
Sometimes it's convenient to test the effect of tasks running under
isolation, e.g. to validate the contents of the crash dumps. Let's
add an optional "isolated" keyword to "debug dev loop" for this.
Thread dumps include a field "prof" for each thread that reports whether
task profiling is currently active or not. It turns out that in 2.7-dev1,
commit 680ed5f28 ("MINOR: task: move profiling bit to per-thread")
mistakenly replaced it with a check for the current thread's bit in the
thread dumps, which basically is the only place where another thread is
being watched. The same mistake was done a few lines later by confusing
threads_want_rdv_mask with the profiling mask. This mask disappeared
in 2.7-dev2 with commit 598cf3f22 ("MAJOR: threads: change thread_isolate
to support inter-group synchronization"), though instead we know the ID
of the isolated thread. This commit fixes this and now reports "isolated"
instead of "wantrdv".
This can be backported to 2.7.
Commit 986798718 ("DEBUG: cli: add "debug dev task" to show/wake/expire/kill
tasks and tasklets") broke the build on windows due to this:
src/debug.c:940:95: error: array subscript has type char [-Werror=char-subscripts]
940 | caller && may_access(caller) && may_access(caller->func) && isalnum(*caller->func) ? caller->func : "0",
| ^~~~~~~~~~~~~
It's classical on platforms which implement ctype.h as macros instead of
functions, let's cast it as uchar. No backport is needed.
When analyzing certain types of bugs in field, sometimes it would be
nice to be able to wake up a task or tasklet to see how events progress
(e.g. to detect a missing wakeup condition), or expire or kill such a
task. This restricted command shows hte current state of a task or tasklet
and allows to manipulate it like this. However it must be used with extreme
care because while it does verify that the pointers are mapped, it cannot
know if they point to a real task, and performing such actions on something
not a task will easily lead to a crash. In addition, performing a "kill"
on a task has great chances of provoking a deferred crash due to a double
free and/or another kill that is not idempotent. Use with extreme care!
The purpose of this patch is only a one-to-one replacement, as far as
possible.
CF_SHUTR(_NOW) and CF_SHUTW(_NOW) flags are now carried by the
stream-connecter. CF_ prefix is replaced by SC_FL_ one. Of course, it is not
so simple because at many places, we were testing if a channel was shut for
reads and writes in same time. To do the same, shut for reads must be tested
on one side on the SC and shut for writes on the other side on the opposite
SC. A special care was taken with process_stream(). flags of SCs must be
saved to be able to detect changes, just like for the channels.
The goal is to send signals to random threads at random instants so that
they spin for a random delay in a relax() loop, trying to give back the
CPU to another competing hardware thread, in hope that from time to time
this can trigger in critical areas and increase the chances to provoke a
latent concurrency bug. For now none were observed.
For example, this command starts 64 such tasks waking after random delays
of 0-1ms and delivering signals to trigger such loops on 3 random threads:
for i in {1..64}; do
socat - /tmp/sock1 <<< "expert-mode on;debug dev delay-inj 2 3"
done
This command is only enabled when DEBUG_DEV is set at build time.
These timers are related to the I/O. Thus it is logical to move them into
the SE descriptor. The patch is a bit huge but it is just a
replacement. However it is error-prone.
From the stconn or the stream, helper functions are used to get, set or
reset these timers. This simplify the timers manipulations.
The debug handler may deadlock with some threads waiting for isolation.
This may happend during a "show threads" command or even during a panic.
The reason is the call to thread_harmless_end() which waits for rdv_requests
to turn to zero before releasing its position in thread_dump_state,
while that one may not progress if another thread was interrupted in
thread_isolate() and is waiting for that thread to drop thread_dump_state.
In order to address this, we now use thread_harmless_end_sig() introduced
by previous commit:
MINOR: threads: add a thread_harmless_end() version that doesn't wait
However there's a catch: since commit f7afdd910 ("MINOR: debug: mark
oneself harmless while waiting for threads to finish"), there's a second
pair of thread_harmless_now()/thread_harmless_end() that surround the
loop around thread_dump_state. Marking a thread harmless before this
loop and dropping that without checking rdv_requests there could break
the harmless promise made to the other thread if it returns first and
proceeds with its isolated work. Hence we just drop this pair which was
only preventive for other signal handlers, while as indicated in that
patch's commit message, other signals are handled asynchronously and do
not require that extra protection.
This fix must be backported to 2.7.
The problem can be seen by running "show threads" in fast loops (100/s)
while reloading haproxy very quickly (10/s) and sending lots of traffic
to it (100krps, 15 Gbps). In this case the soft stop calls pool_gc()
which isolates a lot and manages to race with the dumps after a few
tens of seconds, leaving the process with all threads at 100%.
In applets, we stop processing when a write error (CF_WRITE_ERROR) or a shutdown
for writes (CF_SHUTW) is detected. However, any write error leads to an
immediate shutdown for writes. Thus, it is enough to only test if CF_SHUTW is
set.
Commit f0c86ddfe ("BUG/MEDIUM: debug: fix parallel thread dumps again")
added a clearing of the TH_FL_STUCK flag before dumping threads in case
of parallel dumps, but that was in part a sort of workaround for some
remains of the commit that introduced the flag in 2.0 before the watchdog
existed, and which would set it after dumping a thread: e6a02fa65 ("MINOR:
threads: add a "stuck" flag to the thread_info struct"), and in part an
attempt to avoid that a thread waiting for too long during the dump would
get the flag set. But that is not possible, a thread waiting for being
dumped has the harmless bit set and doesn't get the stuck bit. What happens
in fact is that issuing "show threads" in fast loops ends up causing some
threads to keep their STUCK bit that was set at the end of "show threads",
and confuses the output.
The problem with doing this is that the flag is cleared before the thread
is dumped, and since this flag is used to decide whether to show a backtrace
or not, we don't get backtraces anymore of stuck threads since the commit
above in 2.7.
This patch just removes the two points where the flag was cleared by the
commit above. It should be backported to 2.7.
When digging into suspected memory leaks, it's cumbersome to count the
number of allocations and free calls. Here we're adding a summary at the
end of the sum of allocs minus the sum of frees, excluding realloc since
we can't know how much it releases upon each call. This means that when
doing many realloc+free the count may be negative but in practice there
are very few reallocs so that's not a problem. Also the size/call is signed
and corresponds to the average size allocated (e.g. leaked) per call.
It seems to work reasonably well for now:
> debug dev memstats match buf
quic_conn.c:2978 P_FREE size: 1239547904 calls: 75656 size/call: 16384 buffer
quic_conn.c:2960 P_ALLOC size: 1239547904 calls: 75656 size/call: 16384 buffer
mux_quic.c:393 P_ALLOC size: 9112780800 calls: 556200 size/call: 16384 buffer
mux_quic.c:383 P_ALLOC size: 17783193600 calls: 1085400 size/call: 16384 buffer
mux_quic.c:159 P_FREE size: 8935833600 calls: 545400 size/call: 16384 buffer
mux_quic.c:142 P_FREE size: 9112780800 calls: 556200 size/call: 16384 buffer
h3.c:776 P_ALLOC size: 8935833600 calls: 545400 size/call: 16384 buffer
quic_stream.c:166 P_FREE size: 975241216 calls: 59524 size/call: 16384 buffer
quic_stream.c:127 P_FREE size: 7960592384 calls: 485876 size/call: 16384 buffer
stream.c:772 P_FREE size: 8798208 calls: 537 size/call: 16384 buffer
stream.c:768 P_FREE size: 2424832 calls: 148 size/call: 16384 buffer
stream.c:751 P_ALLOC size: 8852062208 calls: 540287 size/call: 16384 buffer
stream.c:641 P_FREE size: 8849162240 calls: 540110 size/call: 16384 buffer
stream.c:640 P_FREE size: 8847360000 calls: 540000 size/call: 16384 buffer
channel.h:850 P_ALLOC size: 2441216 calls: 149 size/call: 16384 buffer
channel.h:850 P_ALLOC size: 5914624 calls: 361 size/call: 16384 buffer
dynbuf.c:55 P_FREE size: 32768 calls: 2 size/call: 16384 buffer
Total BALANCE size: 0 calls: 5606906 size/call: 0 (excl. realloc)
Let's see how useful this becomes over time.
Sometimes when debugging it's convenient to be able to focus only on
certain pools. Just like we did for "show pools", let's add a filter
based on a prefix on "debug dev memstats match <prefix>".
These two have absolutely zero impact on the process and do not need to
be restricted to the expert mode. The first one calculates a string hash
that can be used by anyone when checking a dump; the second one may be
used by anyone tracking a memory leak, and is cumbersome to use due to
the "expert-mode on" that needs to be prepended. In addition this gives
bad habits to users and needlessly taints the process. So let's drop
this restriction for these two commands.
This command is used to hash a section name using the current anon key,
it was brought in 2.7 by commit 54966dffd ("MINOR: anon: store the
anonymizing key in the CLI's appctx"). However the help message only
says "return msg hashed" which is misleading because if anon mode is
not enabled, it returns the string as-is. Let's just mention this
condition in the help message, and also fix the alphabetical ordering
and alignment on the line.
"debug dev memstats" supports various options but silently ignores the
unknown ones. Let's make sure it returns indications about what it
expects, as the help message is quite limited otherwise.
In order to allow users to dump internal states using a specific key
without changing the global one, we're introducing a key in the CLI's
appctx. This key is preloaded from the global one when "set anon on"
is used (and if none exists, a random one is assigned). And the key
can optionally be assigned manually for the whole CLI session.
A "show anon" command was also added to show the anon state, and the
current key if the users has sufficient permissions. In addition, a
"debug dev hash" command was added to test the feature.
The memstats code currently defines its own file/function/line number,
type and extra pointer. We don't need to keep them separate and we can
easily replace them all with just a struct ha_caller. Note that the
extra pointer could be converted to a pool ID stored into arg8 or
arg32 and be dropped as well, but this would first require to define
IDs for pools (which we currently do not have).
This field is misnamed because its real and important content is the
date the task was woken up, not the date it was called. It temporarily
holds the call date during execution but this remains confusing. In
fact before the latency measurements were possible it was indeed a call
date. Thus is will now be called wake_date.
This change is necessary because a subsequent fix will require the
introduction of the real call date in the thread ctx.
The first column's width may vary a lot depending on outputs, and it's
annoying to have large empty columns on small names and mangled large
columns that are not yet large enough. In order to overcome this, this
patch adds a width field to the memstats applet's context, and this
width is calculated the first time the function is entered, by estimating
the width of all lines that will be dumped. This is simple enough and
does the job well. If in the future some filtering criteria are added,
it will still be possible to perform a single pass on everything
depending on the desired output format.
The calling function name is now stored in the structure, and it's
reported when the "all" argument is passed. The first column is
significantly enlarged because some names are really wide :-(
This command will create the requested number of tasks competing on a
lock, resulting in triggering the watchdog and crashing the process.
This will help stress the watchdog and inspect the lock debugging parts.
The previous attempt to fix thread dumps in commit 672972604 ("BUG/MEDIUM:
debug: fix possible hang when multiple threads dump at once") still had
some shortcomings. Sometimes parallel dumps are jerky essentially due to
the way that threads synchronize on startup and end. In addition the risk
of waiting forever for a stopped thread exists, and panics happening in
parallel to thread dumps are not more reliable either.
This commit revisits the state transitions so that all threads may request
a dump in parallel, that all of them wait for each other in the handler,
and that one thread is responsible for counting every other and checking
that the total matches the number of active threads.
Then for stopping there's a finishing phase that all threads wait for so
that none quits this area too early. Given that we now know the number of
participants to the dump, we can let them each decrement the counter when
leaving so that another dump may only start after the last participant
has completely left.
Now many thread dumps in parallel are running fine, so do panics. No
backport is needed as this was the result of the changes for thread
groups.
Some panic dumps are mangled or truncated due to the watchdog firing at
the same time on multiple threads and calling ha_panic() simultaneously.
What may happen in this case is that the second one waits for the first
one to finish but as soon as it's done the second one resets the buffer
and dumps again, sometimes resetting the first one's dump. Also the first
one's abort() may trigger while the second one is currently dumping,
resulting in a full dump followed by a truncated one, leading to
confusion. Sometimes some lines appear in the middle of a dump as well.
It doesn't happen often and is easier to trigger by causing massive
deadlocks.
There's no reason for the process to resist to a panic, so we can safely
add a counter and no nothing on subsequent calls. Ideally we'd wait there
forever but as this may happen inside a signal handler (e.g. watchdog),
it doesn't always work, so the easiest thing to do is to return so that
the thread is interrupted as soon as possible and brought to the debug
handler to be dumped.
This should be backported, at least to 2.6 and possibly to older versions
as well.
Running several concurrent "show threads" in loops might occasionally
cause a segfault when trying to retrieve the stream from appctx_sc()
which may be null while the applet is finishing. It's not easy to
reproduce, it requires 3-5 sessions in parallel for about a minute
or so. The appctx_sc must be checked before passing it to sc_strm().
This must be backported to 2.6 which also has the bug.
A bug in the thread dumper was introduced by commit 00c27b50c ("MEDIUM:
debug: make the thread dumper not rely on a thread mask anymore"). If
two or more threads try to trigger a thread dump exactly at the same
time, the second one may loop indefinitely trying to set the value to 1
while the other ones will wait for it to finish dumping before leaving.
This is a consequence of a logic change using thread numbers instead of
a thread mask, as threads do not need to see all other ones there anymore.
No backport is needed, this is only for 2.7.
Building with threads and without thread dump (e.g. macos, freebsd)
warns that thread_dump_state is unused. This happened in fact with
recentcommit 1229ef312 ("MINOR: wdt: do not rely on threads_to_dump
anymore"). The solution would be to mark it unused, but after a
second thought, it can be convenient to keep it exported to help
debug crashes, so let's export it again. It's just not referenced in
include files since it's not needed outside.
The thread mask is too short to dump more than 64 bits. Thus here we're
using a different approach with two counters, one for the next thread ID
to dump (which always exists, as it's looked up), and the second one for
the number of threads done dumping. This allows to dump threads in ascending
order then to let them wait for all others to be done, then to leave without
the risk of an overlapping dump until the done count is null again.
This allows to remove threads_to_dump which was the last non-FD variable
using a global thread mask.