When starting with a huge maxconn (say 1 billion), the only error seen
is "No polling mechanism available". This doesn't help at all to resolve
the problem. Let's add specific alerts for the failed mallocs. Now we can
get this instead:
[ALERT] 286/154439 (23408) : Not enough memory to allocate 2000000033 entries for fdtab!
This may be backported as far as 2.0 as it helps debugging bad configurations.
Since 2.2 it's safe to enable/disable another thread's FD but the fd_wake
calls will not immediately be considered because nothing wakes the other
threads up. This will have an impact on listeners when deciding to resume
them after they were paused, so at minima we want to wake up one of their
threads, just like the scheduler does on task_kill(). This is what this
patch does.
Changes performed using the following coccinelle patch:
@@
type T;
expression E;
expression t;
@@
(
t = calloc(E, sizeof(*t))
|
- t = calloc(E, sizeof(T))
+ t = calloc(E, sizeof(*t))
)
Looking through the commit history, grepping for coccinelle shows that the same
replacement with a different patch was already performed in the past in commit
02779b6263a177b1e462e53db6eaf57bcda574bc.
This new flag will be used to mark FDs that must be passed to any future
process across the CLI's "_getsocks" command.
The scheme here is quite complex and full of special cases:
- FDs inherited from parent processes are *not* exported this way, as
they are supposed to instead be passed by the master process itself
across reloads. However such FDs ought never to be paused otherwise
this would disrupt the socket in the parent process as well;
- FDs resulting from a "bind" performed over a socket pair, which are
in fact one side of a socket pair passed inside another control socket
pair must not be passed either. Since all of them are used the same
way, for now it's enough never to put this "exported" flag to FDs
bound by the socketpair code.
- FDs belonging to temporary listeners (e.g. a passive FTP data port)
must not be passed either. Fortunately we don't have such FDs yet.
- the rest of the listeners for now are made of TCP, UNIX stream, ABNS
sockets and are exportable, so they get the flag.
- UDP listeners were wrongly created as listeners and are not suitable
here. Their FDs should be passed but for now they are not since the
client doesn't even distinguish the SO_TYPE of the retrieved sockets.
In addition, it's important to keep in mind that:
- inherited FDs may never be closed in master process but may be closed
in worker processes if the service is shut down (useless since still
bound, but technically possible) ;
- inherited FDs may not be disabled ;
- exported FDs may be disabled because the caller will perform the
subsequent listen() on them. However that might not work for all OSes
- exported FDs may be closed, it just means the service was shut down
from the worker, and will be rebound in the new process. This implies
that we have to disable exported on close().
=> as such, contrary to an apparently obvious equivalence, the "exported"
status doesn't imply anything regarding the ability to close a
listener's FD or not.
This essentially undoes what we did in fd.c in 1.8 to support seamless
reload. Since we don't need to remove an fd anymore we can turn
fd_delete() to the simple function it used to be.
When DEBUG_FD is set at build time, we'll keep a counter of per-FD events
in the fdtab. This counter is reported in "show fd" even for closed FDs if
not zero. The purpose is to help spot situations where an apparently closed
FD continues to be reported in loops, or where some events are dismissed.
Some of the recent optimizations around the polling to save a few
epoll_ctl() calls have shown that they could also cause some trouble.
However, over time our code base has become totally asynchronous with
I/Os always attempted from the upper layers and only retried at the
bottom, making it look like we're getting closer to EPOLLET support.
There are showstoppers there such as the listeners which cannot support
this. But given that most of the epoll_ctl() dance comes from the
connections, we can try to enable edge-triggered polling on connections.
What this patch does is to add a new global tunable "tune.fd.edge-triggered",
that makes fd_insert() automatically set an et_possible bit on the fd if
the I/O callback is conn_fd_handler. When the epoll code sees an update
for such an FD, it immediately registers it in both directions the first
time and doesn't update it anymore.
On a few tests it proved quite useful with a 14% request rate increase in
a H2->H1 scenario, reducing the epoll_ctl() calls from 2 per request to
2 per connection.
The option is obviously disabled by default as bugs are still expected,
particularly around the subscribe() code where it is possible that some
layers do not always re-attempt reading data after being woken up.
Since there was a risk of leaving fd_takeover() without properly
stopping the fd, let's take this opportunity for factoring the code
around a commont exit point that's common to both double-cas and locked
modes. This means using the "ret" variable inside the double-CAS code,
and inverting the loop to first test the old values. Doing do also
produces cleaner code because the compiler cannot factorize common
exit paths using asm statements that are present in some atomic ops.
The loop in fd_takeover() around the double-CAS is conditionned on
a previous value of old_masks[0] that always matches tid_bit on the
first iteration because it does not result from the atomic op but
from a pre-loaded value. Let's set the result of the atomic op there
instead so that the conflict between threads can be detected earlier
and before performing the double-word CAS.
When haproxy is compiled without double-word CAS, we use a migration lock
in fd_takeover(). This lock was covering the atomic OR on the running_mask
before checking its value, while it is not needed since this atomic op
already returns the result. Let's just refine the code to avoid grabbing
the lock in the event another thread has already stolen the FD, this may
reduce contention in high reuse rate scenarios.
In fd_takeover(), when a double-width compare-and-swap is implemented,
make sure, if we managed to get the fd, to call fd_stop_recv() on it, so
that the thread that used to own it will know it has to stop polling it.
In fd_takeover(), if we failed to grab the fd, when a double-width
compare-and-swap is not implemented, do not call fd_stop_recv() on the
fd, it is not ours and may be used by another thread.
In issue #648 a second problem was reported, indicating that some users
mistakenly send the log to an FD mapped on a file. This situation doesn't
even enable O_NONBLOCK and results in huge access times in the order of
milliseconds with the lock held and other threads waiting till the
watchdog fires to unblock the situation.
The problem with files is that O_NONBLOCK is ignored, and we still need
to lock otherwise we can end up with interleaved log messages.
What this patch does is different. Instead of locking all writers, it
uses a trylock so that there's always at most one logger and that other
candidates can simply give up and report a failure, just as would happen
if writev() returned -1 due to a pipe full condition. This solution is
elegant because it gives back the control to haproxy to decide to give
up when it takes too much time, while previously it was the kernel that
used to block the syscall.
However at high log rates (500000 req/s) there was up to 50% dropped logs
due to the contention on the lock. In order to address this, we try to
grab the lock up to 200 times and call ha_thread_relax() on failure. This
results in almost no failure (no more than previously with O_NONBLOCK). A
typical test with 6 competing threads writing to stdout chained to a pipe
to a single process shows around 1000 drops for 10 million logs at 480000
lines per second.
Please note that this doesn't mean that writing to a blocking FD is a good
idea, and it might only be temporarily done on testing environments for
debugging. A file or a terminal will continue to block the writing thread
while others spin a little bit and lose their logs, but the writing thread
will still experience performance-killing latencies.
This patch should be backported to 2.1 and 2.0. The code is in log.c in
2.0, but the principle is the same.
This patch fixes all the leftovers from the include cleanup campaign. There
were not that many (~400 entries in ~150 files) but it was definitely worth
doing it as it revealed a few duplicates.
The current state of the logging is a real mess. The main problem is
that almost all files include log.h just in order to have access to
the alert/warning functions like ha_alert() etc, and don't care about
logs. But log.h also deals with real logging as well as log-format and
depends on stream.h and various other things. As such it forces a few
heavy files like stream.h to be loaded early and to hide missing
dependencies depending where it's loaded. Among the missing ones is
syslog.h which was often automatically included resulting in no less
than 3 users missing it.
Among 76 users, only 5 could be removed, and probably 70 don't need the
full set of dependencies.
A good approach would consist in splitting that file in 3 parts:
- one for error output ("errors" ?).
- one for log_format processing
- and one for actual logging.
global.h was one of the messiest files, it has accumulated tons of
implicit dependencies and declares many globals that make almost all
other file include it. It managed to silence a dependency loop between
server.h and proxy.h by being well placed to pre-define the required
structs, forcing struct proxy and struct server to be forward-declared
in a significant number of files.
It was split in to, one which is the global struct definition and the
few macros and flags, and the rest containing the functions prototypes.
The UNIX_MAX_PATH definition was moved to compat.h.
A few includes were missing in each file. A definition of
struct polled_mask was moved to fd-t.h. The MAX_POLLERS macro was
moved to defaults.h
Stdio used to be silently inherited from whatever path but it's needed
for list_pollers() which takes a FILE* and which can thus not be
forward-declared.
All files that were including one of the following include files have
been updated to only include haproxy/api.h or haproxy/api-t.h once instead:
- common/config.h
- common/compat.h
- common/compiler.h
- common/defaults.h
- common/initcall.h
- common/tools.h
The choice is simple: if the file only requires type definitions, it includes
api-t.h, otherwise it includes the full api.h.
In addition, in these files, explicit includes for inttypes.h and limits.h
were dropped since these are now covered by api.h and api-t.h.
No other change was performed, given that this patch is large and
affects 201 files. At least one (tools.h) was already freestanding and
didn't get the new one added.
Implement a new function, fd_takeover(), that lets you become the thread
responsible for the fd. On architectures that do not have a double-width CAS,
use a global rwlock.
fd_set_running() was also changed to be able to compete with fd_takeover(),
either using a dooble-width CAS on both running_mask and thread_mask, or
by claiming a reader on the global rwlock. This extra operation should not
have any measurable impact on modern architectures where threading is
relevant.
In the struct fdtab, introduce a new mask, running_mask. Each thread should
add its bit before using the fd.
Use the running_mask instead of a lock, in fd_insert/fd_delete, we'll just
spin as long as the mask is non-zero, to be sure we access the data
exclusively.
fd_set_running_excl() spins until the mask is 0, fd_set_running() just
adds the thread bit, and fd_clr_running() removes it.
Enabling strict aliasing fails in fd.c when using the double-word CAS,
let's get rid of the (void**)(void*)&cur_list junk and use a union
instead. This way the compiler knows they do alias.
There's a very hard-to-trigger bug in the FD list code where the
fd_add_to_fd_list() function assumes that if the FD it's trying to add
is already locked, it's in the process of being added. Unfortunately, it
can also be in the process of being removed. It is very hard to trigger
because it requires that one thread is removing the FD while another one
is adding it. First very few FDs run on multiple threads (listeners and
DNS), and second, it does not make sense to add and remove the FD at the
same time.
In practice the DNS code built on the older callback-only model does
perform bursts of fd_want_send() for all resolvers at once when it wants
to send a new query (dns_send_query()). And this is more likely to happen
when here are lots of resolutions in parallel and many resolvers, because
the dns_response_recv() callback can also trigger a series of queries on
all resolvers for each invalid response it receives. This means that it
really is perfectly possible to both stop and start in parallel during
short periods of time there.
This issue was not reported before 2.1, but 2.1 had the FD cache, built
on the exact same code base. It's very possible that the issue caused
exactly the opposite situation, where an event was occasionally lost,
causing a DNS retry that worked, and nobody noticing the problem in the
end. In 2.1 the lost entries are the updates asking for not polling for
writes anymore, and the effect is that the poller contiuously reports
writability on the socket when the issue happens.
This patch fixes bug #416 and must be backported as far as 1.8, and
absolutely requires that previous commit "MINOR: fd/threads: make
_GET_NEXT()/_GET_PREV() use the volatile attribute" is backported as
well otherwise it will make the issue worse.
Special thanks to Julien Pivotto for setting up a reliable reproducer
for this difficult issue.
These macros are either used between atomic ops which cause the volatile
to be implicit, or with an explicit volatile cast. However not having it
in the macro causes some traps in the code because certain loop paths
cannot safely be used without risking infinite loops if one isn't careful
enough.
Let's place the volatile attribute inside the macros and remove them from
the explicit places to avoid this. It was verified that the output executable
remains exactly the same byte-wise.
Since commit 7ac0e35f2 in 1.9-dev1 ("MAJOR: fd: compute the new fd polling
state out of the fd lock") we've started to update the FD POLLED bit a
bit more aggressively. Lately with the removal of the FD cache, this bit
is always equal to the ACTIVE bit. There's no point continuing to watch
it and update it anymore, all it does is create confusion and complicate
the code. One interesting side effect is that it now becomes visible that
all fd_*_{send,recv}() operations systematically call updt_fd_polling(),
except fd_cant_recv()/fd_cant_send() which never saw it change.
Logs and sinks were resorting to dirty hacks to initialize an FD to
non-blocking mode. Now we have a bit for this in the fd tab so we can
do it on the fly on first use of the file descriptor. Previously it was
set per log server by writing value 1 to the port, or during a sink
initialization regardless of the usage of the fd.
The "cache" entry was still present in the fdtab struct and it was
reported in "show sess". Removing it broke the cache-line alignment
on 64-bit machines which is important for threads, so it was fixed
by adding an attribute(aligned()) when threads are in use. Doing it
only in this case allows 32-bit thread-less platforms to see the
struct fit into 32 bytes.
Currently both logs and event sinks may use a file descriptor to
atomically emit some output contents. The two may use the same FD though
nothing is done to make sure they use the same lock. Also there is quite
some redundancy between the two. Better make a specific function to send
a fragmented message to a file descriptor which will take care of the
locking via the fd's lock. The function is also able to truncate a
message and to enforce addition of a trailing LF when building the
output message.
In fd_dodelete(), always reset the polled_mask bits, instead on only doing
it if we're closing the file descriptor. We call the poller clo() method
anyway, and failing to do so means that if fd_remove() is used while the
fd is polled, the poller won't attempt to poll on a fd with the same value
as the old one.
This leads to fd being stuck in the SSL code while using the async engine.
This should be backported to 2.0, 1.9 and 1.8.
In the poller code, instead of just remembering if we're currently polling
a fd or not, remember if we're polling it for writing and/or for reading, that
way, we can avoid to modify the polling if it's already polled as needed.
Now that the architecture was changed so that attempts to receive/send data
always come from the upper layers, instead of them only trying to do so when
the lower layer let them know they could try, we can finally get rid of the
fd cache. We don't really need it anymore, and removing it gives us a small
performance boost.
On armv7 haproxy doesn't work because of the fixes on the double-word
CAS. There are two issues. The first one is that the last argument in
case of dwcas is a pointer to the set of value and not a value ; the
second is that it's not enough to cast the data as (void*) since it will
be a single word. Let's fix this by using the pointers as an array of
long. This was tested on i386, armv7, x86_64 and aarch64 and it is now
fine. An alternate approach using a struct was attempted as well but it
used to produce less optimal code.
This fix must be backported to 1.9. This fixes github issue #105.
Cc: Olivier Houchard <ohouchard@haproxy.com>
We still have quite a number of build macros which are mapped 1:1 to a
USE_something setting in the makefile but which have a different name.
This patch cleans this up by renaming them to use the USE_something
one, allowing to clean up the makefile and make it more obvious when
reading the code what build option needs to be added.
The following renames were done :
ENABLE_POLL -> USE_POLL
ENABLE_EPOLL -> USE_EPOLL
ENABLE_KQUEUE -> USE_KQUEUE
ENABLE_EVPORTS -> USE_EVPORTS
TPROXY -> USE_TPROXY
NETFILTER -> USE_NETFILTER
NEED_CRYPT_H -> USE_CRYPT_H
CONFIG_HAP_CRYPT -> USE_LIBCRYPT
CONFIG_HAP_NS -> DUSE_NS
CONFIG_HAP_LINUX_SPLICE -> USE_LINUX_SPLICE
CONFIG_HAP_LINUX_TPROXY -> USE_LINUX_TPROXY
CONFIG_HAP_LINUX_VSYSCALL -> USE_LINUX_VSYSCALL
We currently have the ability to register functions to be called early
on thread creation and at thread deinitialization. It turns out this is
not sufficient because certain such functions may use resources that are
being allocated by the other ones, thus creating a race condition depending
only on the linking order. For example the mworker needs to register a
file descriptor while the pollers will reallocate the fd_updt[] array.
Similarly logs and trashes may be used by some init functions while it's
unclear whether they have been deduplicated.
The same issue happens on deinit, if the fd_updt[] or trash is released
before some functions finish to use them, we'll get into trouble.
This patch creates a couple of early and late callbacks for per-thread
allocation/freeing of resources. A few init functions were moved there,
and the fd init code was split between the two (since it used to both
allocate and initialize at once). This way the init/deinit sequence is
expected to be safe now.
This patch should be backported to 1.9 as at least the trash/log issue
seems to be present. The run_thread_poll_loop() code is a bit different
there as the mworker is not a callback, but it will have no effect and
it's enough to drop the mworker changes.
This bug was reported by Ilya Shipitsin in github issue #104.
This low-level asm implementation of a double CAS was implemented only
for certain architectures (x86_64, armv7, armv8). When threads are not
used, they were not defined, but since they were called directly from
a few locations, they were causing build issues on certain platforms
with threads disabled. This was addressed in commit f4436e1 ("BUILD:
threads: Add __ha_cas_dw fallback for single threaded builds") by
making it fall back to HA_ATOMIC_CAS() when threads are not defined,
but this actually made the situation worse by breaking other cases.
This patch fixes this by creating a high-level macro HA_ATOMIC_DWCAS()
which is similar to HA_ATOMIC_CAS() except that it's intended to work
on a double word, and which rely on the asm implementations when threads
are in use, and uses its own open-coded implementation when threads are
not used. The 3 call places relying on __ha_cas_dw() were updated to
use HA_ATOMIC_DWCAS() instead.
This change was tested on i586, x86_64, armv7, armv8 with and without
threads with gcc 4.7, armv8 with gcc 5.4 with and without threads, as
well as i586 with gcc-3.4 without threads. It will need to be backported
to 1.9 along with the fix above to fix build on armv7 with threads
disabled.
Add a new option, USE_CLOSEFROM. If set, it is assumed the system provides
a closefrom() function, so use it.
It is only implicitely used on FreeBSD for now, it should work on
OpenBSD/NetBSD/DragonflyBSD/Solaris too, but as I have no such system to
test it, I'd rather leave it disabled by default. Users can add USE_CLOSEFROM
explicitely on their make command line to activate it.
Calculate if the fd or task should be locked once, before locking, and
reuse the calculation when determing when to unlock.
Fixes a race condition added in 87d54a9a for fds, and b20aa9ee for tasks,
released in 1.9-dev4. When one thread modifies thread_mask to be a single
thread for a task or fd while a second thread has locked or is waiting on a
lock for that task or fd, the second thread will not unlock it. For FDs,
this is observable when a listener is polled by multiple threads, and is
closed while those threads have events pending. For tasks, this seems
possible, where task_set_affinity is called, but I did not observe it.
This must be backported to 1.9.
The optimized my_closefrom() implementation introduced with previous commit
9188ac60e ("MINOR: fd: implement an optimised my_closefrom() function")
has a small bug causing it to miss some FDs at the end of each batch.
The reason is that poll() returns the number of non-zero events, so
it contains the size of the batch minus the FDs to close. Thus if the
FDs to close are at the beginning they'll be seen but if they're at the
end after all other closed ones, the returned count will not cover them.
No backport is needed.
The idea is that poll() can set the POLLNVAL flag for each invalid
FD in a pollfd list. Thus this function makes use of poll() when
compiled in, and builds lists of up to 1024 FDs at once, checks the
output and only closes those which do not have this flag set. Tests
show that this is about twice as fast as blindly calling close() for
each closed fd.
This is a naive implementation of closefrom() which closes all FDs
starting from the one passed in argument. closefrom() is not provided
on all operating systems, and other versions will follow.
If we fail to initialize pollers due to fdtab/fdinfo/polled_mask
not getting allocated, we free any of those that were allocated
and exit. However the ordering was incorrect, and there was an old
unused and unreachable "fail_cache" path as well, which needs to
be taken when no poller works.
This was introduced with this commit during 1.9-dev :
cb92f5c ("MINOR: pollers: move polled_mask outside of struct fdtab.")
It needs to be backported to 1.9 only.