On armv7 haproxy doesn't work because of the fixes on the double-word
CAS. There are two issues. The first one is that the last argument in
case of dwcas is a pointer to the set of value and not a value ; the
second is that it's not enough to cast the data as (void*) since it will
be a single word. Let's fix this by using the pointers as an array of
long. This was tested on i386, armv7, x86_64 and aarch64 and it is now
fine. An alternate approach using a struct was attempted as well but it
used to produce less optimal code.
This fix must be backported to 1.9. This fixes github issue #105.
Cc: Olivier Houchard <ohouchard@haproxy.com>
In pendconn_redistribute() we scan the queue using eb32_next() on the
node we've just deleted, which is wrong since the node is not in the
tree anymore, and it could dereference one node that has already been
released by another thread. Note that we cannot use eb32_first() in the
loop here instead because we need to skip pendconns having SF_FORCE_PRST.
Instead, let's keep a copy of the next node before deleting it.
In addition, the pendconn retrieved there is wrong, it uses &node as
the pointer instead of node, resulting in very quick crashes when the
server list is scanned.
Fortunately this only happens when "option redispatch" is used in
conjunction with "maxconn" on server lines, "cookie" for the stickiness,
and when a server goes down with entries in its queue.
This bug was introduced by commit 0355dabd7 ("MINOR: queue: replace
the linked list with a tree") so the fix must be backported to 1.9.
In fwrr_get_next_server(), we optionally pass a server to avoid. It
usually points to the current server during a redispatch operation. If
this server is usable, an "avoided" pointer is set and we continue to
look for another server. If in the end no other server is found, then
we fall back to this avoided one, which is still better than nothing.
The problem that may arise with threads is that in the mean time, this
avoided server might have received extra connections and might not be
usable anymore. This causes it to be queued a second time in the "full"
list and the loop to search for a server again, ending up on this one
again and so on.
This patch makes sure that we break out of the loop when we have to
pick the avoided server. It's probably what the code intended to do
as the current break statement causes fwrr_update_position() and
fwrr_dequeue_srv() to be called again on the avoided server.
It must be backported to 1.9 and 1.8, and seems appropriate for older
versions though it's unclear what the impact of this bug might be
there since the race doesn't exist and we're left with the double
update of the server's position.
The unused fd_del and fd_skip were being abused during debugging sessions
as general purpose event counters. With their removal, let's officially
have dedicated counters for such use cases. These counters are called
"ctr0".."ctr2" and are listed at the end when DEBUG_DEV is set.
starting with OpenSSL 1.0.0 recommended way to disable compression is
using SSL_OP_NO_COMPRESSION when creating context.
manipulations with SSL_COMP_get_compression_methods, sk_SSL_COMP_num
are only required for OpenSSL < 1.0.0
Since commit 88698d9 ("MEDIUM: connections: Add a way to control the
number of idling connections.") when building without threads, gcc
complains that the operations made on the idle_orphan_conns[] list is
out of bounds, which is always false since 1) <i> can only equal zero,
and 2) given it's equal to <tid> we never even enter the loop. But as
usual it thinks it knows better, so let's mask the origin of this <i>
value to shut it up. Another solution consists in making <i> unsigned
and adding an explicit range check.
Now when we fail to send because the mux buffer is full, before giving
up and marking MFULL, we try to allocate another buffer in the mux's
ring to try again. Thanks to this (and provided there are enough buffers
allocated to the mux's ring), a single stream picked in the send_list
cannot steal all the mux's room at once. For this, we expand the ring
size to 31 buffers as it seems to be optimal on benchmarks since it
divides the number of context switches by 3. It will inflate each H2
conn's memory by 1 kB.
The bandwidth is now much more stable. Prior to this, it a test on
h2->h1 with very large objects (1 GB), a few tens of connections and
a few tens of streams per connection would show a varying performance
between 34 and 95 Gbps on 2 cores/4 threads, with h2_snd_buf() stopped
on a buffer full condition between 300000 and 600000 times per second.
Now the performance is constantly between 88 and 96 Gbps. Measures show
that buffer full conditions are met around only 159 times per second
in this case, or rougly 2000 to 4000 times less often.
This makes the code more readable and reduces the calls to br_tail().
In addition, all calls to h2_get_buf() are now made via this local
variable, which should significantly help for retries.
Now send() uses a loop to iterate over all buffers to be sent. These
buffers are released and deleted from the vector once completely sent.
If any buffer gets released, offer_buffers() is called to wake up some
waiters.
For now it's only one buffer long so the head and tails are always the
same, thus it doesn't change what used to work. In short, br_tail(h2c->mbuf)
was inserted everywhere we used to have h2c->mbuf.
Transferring large objects over H2 sometimes shows unexplained performance
variations. A long analysis resulted in the following discovery. Often the
mux buffer looks like this :
[ empty_head | data | empty_tail ]
Typical numbers are (very common) :
- empty_head = 31
- empty_tail = 16 (total free=47)
- data = 16337
- size = 16384
- data to copy: 43
The reason for these holes are the blocking factors that are not always
the same in and out (due to keeping 9 bytes for the frame size, or the
56 bytes corresponding to the HTX header). This can easily happen 10000
times a second if the network bandwidth permits it!
In this case, while copying a DATA frame we find that the buffer has its
free space wrapped so we decide to realign it to optimize the copy. It's
possible that this practice stems from the code used to emit headers,
which do not support fragmentation and which had no other option left.
But it comes with two problems :
- we don't check if the data fits, which results in a memcpy for nothing
- we can move huge amounts of data to just copy a small block.
This patch addresses this two ways :
- first, by not forcing a data realignment if what we have to copy does
not fit, as this is totally pointless ;
- second, by refusing to move too large data blocks. The threshold was
set to 1 kB, because it may make sense to move 1 kB of data to copy
a 15 kB one at once, which will leave as a single 16 kB block, but
it doesn't make sense to mvoe 15 kB to copy just 1 kB. In all cases
the data would fit and would just be split into two blocks, which is
not very expensive, hence the low limit to 1 kB
With such changes, realignments are very rare, they show up around once
every 15 seconds at 60 Gbps, and look like this, resulting in a much more
stable bit rate :
buf=0x7fe6ec0c3510,h=16333,d=35,s=16384 room=16349 in=16337
This patch should be safe for backporting to 1.9 if some performance
issues are reported there.
according to manpage:
sk_TYPE_zero() sets the number of elements in sk to zero. It
does not free sk so after this call sk is still valid.
so we need to free all elements
[wt: seems like it has been there forever and should be backported
to all stable branches]
Fortunately, this loop does nothing. Otherwise it would have led to an infinite
loop. It was probably forgotten during a refactoring, in the early stage of the
HTX.
This patch must be backported to 1.9.
When an 1xx reponse is processed, we forward it immediatly. But another message
may already be in the channel's buffer, waiting to be processed. This may be
another 1xx reponse or the final one. So instead of forwarding everything, we
must take care to only forward the processed 1xx response.
This patch must be backported to 1.9.
When a parsing error occurrs in the H1 multiplexer, we stop to copy HTX
blocks. So the error may be reported with an emtpy HTX message. For instance, if
the headers parsing failed. When it happens, the flag CS_FL_EOS is also set on
the conn_stream. But it is an error. Most of time, it is set on established
connections, so it is not really an issue. But if it happens when the server
connection is not fully established, the connection is shut down immediatly and
the stream-interface is switched from SI_ST_CON to SI_ST_DIS/CLO. So HTX
analyzers have no chance to catch the error.
Instead of setting CS_FL_EOS, it is fairly better to set CS_FL_EOI, which is the
right flag to use. The same is also done on H2 upgrade. As a side effet of this
fix, in the stream-interface code, we must now set the flag CF_READ_PARTIAL on
the channel when the flag CF_EOI is set. It is a warranty to wakeup the stream
when EOI is reported to the channel while no data are received.
This patch must be backported to 1.9.
In HTX, when a HEADERS frame is formatted before sending it to the client or the
server, If an EOM is found because there is no body, we must count it in the
number bytes sent.
This patch must be backported to 1.9.
When a LUA HTTP object is created using the current TXN object, it is important
to also set the right direction and flags, using ones from the TXN object.
This patch may be backported to all supported branches with the lua
support. But, it seems to have no impact for now.
In spoe_release_appctx(), the SPOE applet may be used after it was released to
get its exit status code. Of course, HAProxy crashes when this happens.
This patch must be backported to 1.9 and 1.8.
As fat as possible, we try to keep the connections alive on redirect. It's
possible when the request has no body or when the request parsing is finished.
No backport is needed.
The stats page now reports the per-process output bit rate and applies
the usual conversions needed to turn the TCP payload rate to an Ethernet
bit rate in order to give a reasonably accurate estimate of how far from
interface saturation we are.
Many times we've been missing per-process traffic statistics. While it
didn't make sense in multi-process mode, with threads it does. Thus we
now have a counter of bytes emitted by raw_sock, and a freq counter for
these as well. However, freq_ctr are limited to 32 bits, and given that
loads of 300 Gbps have already been reached over a loopback using
splicing, we need to downscale this a bit. Here we're storing 1/32 of
the byte rate, which gives a theorical limit of 128 GB/s or ~1 Tbps,
which is more than enough. Let's have fun re-reading this sentence in
2029 :-) The values can be read in "show info" output on the CLI.
It's needed on Linux to have access to timerfd_*, and on FreeBSD this
lib is needed as well, though not enabled in our default build. We can
see later if it's OK to enable it, for now let's fix the build issues.
Bah, the linux manpage suggests to use si_int but it's a fake, it's only
a define on sigval.sival_int where sigval is defined as si_value. Let's
use si_value.sival_int, at least it builds on both Linux and FreeBSD. It's
likely that this code will have to be limited to a small subset of OSes
if it causes difficulties like this.
These commands don't follow the same flow as the rest of the commands,
each of them iterates over all header lines before switching to the
next directive. In addition they make no distinction between start
line and headers and can lead to unparsable rewrites which are very
difficult to deal with internally.
Most of them are still occasionally found in configurations, mainly
because of the usual "we've always done this way". By marking them
deprecated and emitting a warning and recommendation on first use of
each of them, we will raise users' awareness of users regarding the
cleaner, faster and more reliable alternatives.
Some use cases of "reqrep" still appear from time to time for URL
rewriting that is not so convenient with other rules. But at least
users facing this requirement will explain their use case so that we
can best serve them. Some discussion started on this subject in a
thread linked to from github issue #100.
The goal is to remove them in 2.1 since they require to reparse the
result before indexing it and we don't want this hack to live long.
The following directives were marked deprecated :
-reqadd
-reqallow
-reqdel
-reqdeny
-reqiallow
-reqidel
-reqideny
-reqipass
-reqirep
-reqitarpit
-reqpass
-reqrep
-reqtarpit
-rspadd
-rspdel
-rspdeny
-rspidel
-rspideny
-rspirep
-rsprep
We've been dealing with a workaround for a bug in splice that used to
affect version 2.6.25 to 2.6.27.12 and which was fixed 10 years ago
in kernel versions which are not supported anymore. Given that people
who would use a kernel in such a range would face much more serious
stability and security issues, it's about time to get rid of this
workaround and of the ASSUME_SPLICE_WORKS build option used to disable
it.
We still have quite a number of build macros which are mapped 1:1 to a
USE_something setting in the makefile but which have a different name.
This patch cleans this up by renaming them to use the USE_something
one, allowing to clean up the makefile and make it more obvious when
reading the code what build option needs to be added.
The following renames were done :
ENABLE_POLL -> USE_POLL
ENABLE_EPOLL -> USE_EPOLL
ENABLE_KQUEUE -> USE_KQUEUE
ENABLE_EVPORTS -> USE_EVPORTS
TPROXY -> USE_TPROXY
NETFILTER -> USE_NETFILTER
NEED_CRYPT_H -> USE_CRYPT_H
CONFIG_HAP_CRYPT -> USE_LIBCRYPT
CONFIG_HAP_NS -> DUSE_NS
CONFIG_HAP_LINUX_SPLICE -> USE_LINUX_SPLICE
CONFIG_HAP_LINUX_TPROXY -> USE_LINUX_TPROXY
CONFIG_HAP_LINUX_VSYSCALL -> USE_LINUX_VSYSCALL
It seems it's not defined on FreeBSD while it's mentioned on Linux that
clock_gettime() can be detected using this. Given that we also have the
test for _POSIX_TIMERS>0 that should cover it well enough. If it breaks
on other systems, we'll see.
Report was here :
https://github.com/haproxy/haproxy/runs/133866993
We currently have the ability to register functions to be called early
on thread creation and at thread deinitialization. It turns out this is
not sufficient because certain such functions may use resources that are
being allocated by the other ones, thus creating a race condition depending
only on the linking order. For example the mworker needs to register a
file descriptor while the pollers will reallocate the fd_updt[] array.
Similarly logs and trashes may be used by some init functions while it's
unclear whether they have been deduplicated.
The same issue happens on deinit, if the fd_updt[] or trash is released
before some functions finish to use them, we'll get into trouble.
This patch creates a couple of early and late callbacks for per-thread
allocation/freeing of resources. A few init functions were moved there,
and the fd init code was split between the two (since it used to both
allocate and initialize at once). This way the init/deinit sequence is
expected to be safe now.
This patch should be backported to 1.9 as at least the trash/log issue
seems to be present. The run_thread_poll_loop() code is a bit different
there as the mworker is not a callback, but it will have no effect and
it's enough to drop the mworker changes.
This bug was reported by Ilya Shipitsin in github issue #104.
At the moment the WURFL module emits 3 lines of warnings upon startup
when it is not referenced in the configuration file, which is quite
confusing. Let's make sure to keep it silent when not configured, as
detected by the absence of the wurfl-data-file statement.
A segfault may happen in ha_wurfl_get() when dereferencing information not
present in wurfl-information-list. Check the node retrieved from the tree,
not its container.
This fix must be backported to 1.9.
The mux's name is the only one reported in lower case in "show sess"
or "haproxy -vv" while the other ones are upper case, so it loses and
the other ones win :-)
It was not as efficient as the watchdog in that it would only trigger
after the problem resolved by itself, and still required a huge margin
to make sure we didn't trigger for an invalid reason. This used to leave
little indication about the cause. Better use the watchdog now and
improve it if needed.
The detector of unkillable tasks remains active though.
Since threads were introduced, we've naturally had a number of bugs
related to locking issues. In addition we've also got some issues
with corrupted lists in certain rare cases not necessarily involving
threads. Not only these events cause a lot of trouble to the production
as it is very hard to detect that the process is stuck in a loop and
doesn't deliver the service anymore, but it's often difficult (or too
late) to collect more debugging information.
The patch presented here implements a lockup detection mechanism, also
known as "watchdog". The principle is that (on systems supporting it),
each thread will have its own CPU timer which progresses as the thread
consumes CPU cycles, and when a deadline is met, a signal is delivered
(SIGALRM here since it doesn't interrupt gdb by default).
The thread handling this signal (which is not necessarily the one which
triggered the timer) figures the thread ID from the signal arguments and
checks if it's really stuck by looking at the time spent since last exit
from poll() and by checking that the thread's scheduler is still alive
(so that even when dealing with configuration issues resulting in insane
amount of tasks being called in turn, it is not possible to accidently
trigger it). Checking the scheduler's activity will usually result in a
second chance, thus doubling the detecting time.
In order not to incorrectly flag a thread as being the cause of the
lockup, the thread_harmless_mask is checked : a thread could very well
be spinning on itself waiting for all other threads to join (typically
what happens when issuing "show sess"). In this case, once all threads
but one (or two) have joined, all the innocent ones are marked harmless
and will not trigger the timer. Only the ones not reacting will.
The deadline is set to one second, which already appears impossible to
reach, especially since it's 1 second of CPU usage, not elapsed time
with the CPU being preempted by other threads/processes/hypervisor. In
practice due to the scheduler's health verification it takes up to two
seconds to decide to panic.
Once all conditions are met, the goal is to crash from the offending
thread. So if it's the current one, we call ha_panic() otherwise the
signal is bounced to the offending thread which deals with it. This
will result in all threads being woken up in turn to dump their context,
the whole state is emitted on stderr in hope that it can be logged, and
the process aborts, leaving a chance for a core to be dumped and for a
service manager to restart it.
An alternative mechanism could be implemented for systems unable to
wake up a thread once its CPU clock reaches a deadline (e.g. FreeBSD).
Instead of waking the timer each and every deadline, it is possible to
use a standard timer which is reset each time we leave poll(). Since
the signal handler rechecks the CPU consumption this will also work.
However a totally idle process may trigger it from time to time which
may or may not confuse some debugging sessions. The same is true for
alarm() which could be another option for systems not having such a
broad choice of timers (but it seems that in this case they will not
have per-thread CPU measurements available either).
The feature is currently implemented only when threads are enabled in
order to keep the code clean, since the main purpose is to detect and
address inter-thread deadlocks. But if it proves useful for other
situations this condition might be relaxed.
This flag is constantly cleared by the scheduler and will be set by the
watchdog timer to detect stuck threads. It is also set by the "show
threads" command so that it is easy to spot if the situation has evolved
between two subsequent calls : if the first "show threads" shows no stuck
thread and the second one shows such a stuck thread, it indicates that
this thread didn't manage to make any forward progress since the previous
call, which is extremely suspicious.
Whenever we can retrieve a valid stream pointer, we now call stream_dump()
to get a detailed dump of the stream currently running on the processor.
This is used by "show threads" and by ha_panic().
This function dumps a lot of information about a stream into the provided
buffer. It is now used by stream_dump_and_crash() and will be used by the
debugger as well.
These functions are used respectively to signal one thread or all threads.
When multithreading is disabled, it's always the current thread which is
signaled.
Commit 5a6e2245f ("REORG: threads: move the struct thread_info from
global.h to hathreads.h") didn't hold its promise well, as the thread_info
struct was still declared and initialized in haproxy.c in addition to being
in hathreads.c. Let's move it for real now.