Some recent glibc updates have added controls on FD_SET/FD_CLR/FD_ISSET
that crash the program if it tries to use a file descriptor larger than
FD_SETSIZE.
For this reason, we now control the compatibility between global.maxsock
and FD_SETSIZE, and refuse to use select() if there too many FDs are
expected to be used. Note that on Solaris, FD_SETSIZE is already forced
to 65536, and that FreeBSD and OpenBSD allow it to be redefined, though
this is not needed thanks to kqueue which is much more efficient.
In practice, since poll() is enabled on all targets, it should not cause
any problem, unless it is explicitly disabled.
This change must be backported to 1.4 because the crashes caused by glibc
have already been reported on this version.
When a polled I/O event is detected, the event is added to the updates
list and the I/O handler is called. Upon return, if the event handler
did not experience an EAGAIN, the event remains in the updates list so
that it will be processed later. But if the event was already in the
spec list, its state is updated and it will be called again immediately
upon exit, by fd_process_spec_events(), so this creates unfairness
between speculative events and polled events.
So don't call the I/O handler upon I/O detection when the FD already is
in the spec list. The fd events are still updated so that the spec list
is up to date with the possible I/O change.
Errors and Hangups are sticky events, which means that once they're
detected, we never clear them, allowing them to be handled later if
needed.
Till now when an error was reported, it used to register a speculative
I/O event for both recv and send. Since the connection had not requested
such events, it was not able to detect a change and did not clear them,
so the events were called in loops until a timeout caused their owner
task to die.
So this patch does two things :
- stop registering spec events when no I/O activity was requested,
so that we don't end up with non-disablable polling state ;
- keep the sticky polling flags (ERR and HUP) when leaving the
connection handler so that an error notification doesn't
magically become a normal recv() or send() report once the
event is converted to a spec event.
It is normally not needed to make the connection handler emit an
error when it detects POLL_ERR because either a registered data
handler will have done it, or the event will be disabled by the
wake() callback.
The old EV_FD_SET() macro was confusing, as it would enable receipt but there
was no way to indicate that EAGAIN was received, hence the recently added
FD_WAIT_* flags. They're not enough as we're still facing a conflict between
EV_FD_* and FD_WAIT_*. So let's offer I/O functions what they need to explicitly
request polling.
These primitives were initially introduced so that callers were able to
conditionally set/disable polling on a file descriptor and check in return
what the state was. It's been long since we last had an "if" on this, and
all pollers' functions were the same for cond_* and their systematic
counter parts, except that this required a check and a specific return
value that are not always necessary.
So let's simplify the FD API by removing this now unused distinction and
by making all specific functions return void.
In an attempt to get rid of fdtab[].state, and to move the relevant
parts to the connection struct, we remove the FD_STCLOSE state which
can easily be deduced from the <owner> pointer as there is a 1:1 match.
fdtab[].ev was only set in ev_sepoll. Unfortunately, some I/O handling
functions now rely on this, so depending on the polling mechanism, some
useless operations might have been performed, such as performing a useless
recv() when a HUP was reported.
This is a very old issue, the flags were only added to the fdtab and not
propagated into any poller. Then they were used in ev_sepoll which needed
them for the cache. It is unsure whether a backport to 1.4 is appropriate
or not.
We now measure the work and idle times in order to report the idle
time in the stats. It's expected that we'll be able to use it at
other places later.
If an asynchronous signal is received outside of the poller, we don't
want the poller to wait for a timeout to occur before processing it,
so we set its timeout to zero, just like we do with pending tasks in
the run queue.
The INTBITS macro was found to be already defined on some platforms,
and to equal 32 (while INTBITS was 5 here). Due to pure luck, there
was no declaration conflict, but it's nonetheless a problem to fix.
Looking at the code showed that this macro was only used for left
shifts and nothing else anymore. So the replacement is obvious. The
new macro, BITS_PER_INT is more obviously correct.
It should be stated as a rule that a C file should never
include types/xxx.h when proto/xxx.h exists, as it gives
less exposure to declaration conflicts (one of which was
caught and fixed here) and it complicates the file headers
for nothing.
Only types/global.h, types/capture.h and types/polling.h
have been found to be valid includes from C files.
This is the first attempt at moving all internal parts from
using struct timeval to integer ticks. Those provides simpler
and faster code due to simplified operations, and this change
also saved about 64 bytes per session.
A new header file has been added : include/common/ticks.h.
It is possible that some functions should finally not be inlined
because they're used quite a lot (eg: tick_first, tick_add_ifset
and tick_is_expired). More measurements are required in order to
decide whether this is interesting or not.
Some function and variable names are still subject to change for
a better overall logics.
The first implementation of the monotonic clock did not verify
forward jumps. The consequence is that a fast changing time may
expire a lot of tasks. While it does seem minor, in fact it is
problematic because most machines which boot with a wrong date
are in the past and suddenly see their time jump by several
years in the future.
The solution is to check if we spent more apparent time in
a poller than allowed (with a margin applied). The margin
is currently set to 1000 ms. It should be large enough for
any poll() to complete.
Tests with randomly jumping clock show that the result is quite
accurate (error less than 1 second at every change of more than
one second).
If the system date is set backwards while haproxy is running,
some scheduled events are delayed by the amount of time the
clock went backwards. This is particularly problematic on
systems where the date is set at boot, because it seldom
happens that health-checks do not get sent for a few hours.
Before switching to use clock_gettime() on systems which
provide it, we can at least ensure that the clock is not
going backwards and maintain two clocks : the "date" which
represents what the user wants to see (mostly for logs),
and an internal date stored in "now", used for scheduled
events.
Under some circumstances, a task may already lie in the run queue
(eg: inter-task wakeup). It is disastrous to wait for an event in
this case because some processing gets delayed.
The timeout functions were difficult to manipulate because they were
rounding results to the millisecond. Thus, it was difficult to compare
and to check what expired and what did not. Also, the comparison
functions were heavy with multiplies and divides by 1000. Now, all
timeouts are stored in timevals, reducing the number of operations
for updates and leading to cleaner and more efficient code.
Gcc provides __attribute__((constructor)) which is very convenient
to execute functions at startup right before main(). All the pollers
have been converted to have their register() function declared like
this, so that it is not necessary anymore to call them from a centralized
file.
Some pollers such as kqueue lose their FD across fork(), meaning that
the registered file descriptors are lost too. Now when the proxies are
started by start_proxies(), the file descriptors are not registered yet,
leaving enough time for the fork() to take place and to get a new pollfd.
It will be the first call to maintain_proxies that will register them.
select, poll and epoll now have their dedicated functions and have
been split into distinct files. Several FD manipulation primitives
have been provided with each poller.
The rest of the code needs to be cleaned to remove traces of
StaticReadEvent/StaticWriteEvent. A trick involving a macro has
temporarily been used right now. Some work needs to be done to
factorize tests and sets everywhere.