In raw_sock, we already check for FD_POLL_HUP after a short recv()
to avoid a useless syscall and detect the end of stream. However,
we fail to check for FD_POLL_ERR here, which causes major issues
as some errors might be delivered and ignored if they are delivered
at the same time as a HUP and there is no data to send that would detect
them in the other direction.
Since the connection's flags do not have CO_FL_ERROR set, the
polling is not disabled on the socket and the pollers immediately
call the conn_fd_handler() again, resulting in CPU spikes for as
long as the timeouts allow them.
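The decision made after a short recv() can be sketched as below. This is
only an illustration of the idea, not the actual patch: the flag values are
made up, and CO_FL_SOCK_RD_SH and after_short_recv() are hypothetical names
used to keep the example self-contained.

```c
#include <assert.h>

/* Illustrative values only, not HAProxy's real definitions. */
#define FD_POLL_ERR      0x08u
#define FD_POLL_HUP      0x10u
#define CO_FL_ERROR      0x01u
#define CO_FL_SOCK_RD_SH 0x02u   /* hypothetical "read side closed" flag */

/* Called once recv() returned less than requested, i.e. the socket buffer
 * is drained. Returns the updated connection flags.
 */
static unsigned int after_short_recv(unsigned int conn_flags, unsigned int poll_ev)
{
    if (poll_ev & FD_POLL_ERR)
        return conn_flags | CO_FL_ERROR;      /* error reported, even with HUP */
    if (poll_ev & FD_POLL_HUP)
        return conn_flags | CO_FL_SOCK_RD_SH; /* clean end of stream */
    return conn_flags;                        /* nothing new, keep polling */
}
```

With CO_FL_ERROR set, the caller has a reason to stop polling instead of
being woken up again for the same event.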
Note that this patch alone fixes the issue but a few patches will
follow to strengthen this fragile area.
Big thanks to Bryan Berry who reported the issue with significant
amounts of detailed traces that helped rule out many other initially
suspected causes and to finally reproduce the issue in the lab.
At least on a heavily patched 2.6.35.9, we can see splice() fail
with EBADF :

  recv(6, "789.123456789.123456789.12345678"..., 1049, 0) = 1049
  send(5, "HTTP/1.1 200\r\nContent-length: 10"..., 8030, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE) = 8030
  gettimeofday({1352717854, 515601}, NULL) = 0
  epoll_wait(0x3, 0x40221008, 0x7, 0) = 0
  gettimeofday({1352717854, 515793}, NULL) = 0
  pipe([7, 8]) = 0
  splice(0x6, 0, 0x8, 0, 0xfe12c, 0x3) = -1 EBADF (Bad file descriptor)
  close(6) = 0

This clearly is a kernel issue since all FDs are valid here, so let's
simply disable splice() on the connection when this happens so that
the session correctly recovers from that issue using recv().
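A minimal sketch of that fallback, assuming the caller passes the errno
left by a failed splice(). CO_FL_NO_SPLICE, enum rx_action and
on_splice_error() are hypothetical names invented for this example, not
the identifiers used by the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define CO_FL_NO_SPLICE 0x01u   /* hypothetical "do not splice anymore" flag */

enum rx_action { RX_RETRY_LATER, RX_USE_RECV, RX_FAIL };

/* Decide how to continue after splice() returned -1 with errno <err>. */
static enum rx_action on_splice_error(int err, unsigned int *conn_flags)
{
    assert(conn_flags != NULL);

    if (err == EAGAIN)
        return RX_RETRY_LATER;           /* nothing available yet */
    if (err == EBADF) {
        *conn_flags |= CO_FL_NO_SPLICE;  /* never try splice() again here */
        return RX_USE_RECV;              /* recover through recv() */
    }
    return RX_FAIL;                      /* any other error is fatal */
}
```

Nothing can have been piped when splice() fails this way, so switching to
recv() should not lose data.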
A failed send() may return ENOTCONN when the connection is not yet established.
On Linux, we generally see EAGAIN but on OpenBSD we clearly have ENOTCONN, so
let's ensure we poll for write when we encounter this error.
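As a sketch, the test on errno after a failed send() would look like the
helper below. The function name is hypothetical; only the errno values come
from the description above.

```c
#include <assert.h>
#include <errno.h>

/* Returns non-zero when a failed send() only means "not writable yet",
 * in which case the caller should poll for write instead of reporting
 * an error.
 */
static int send_should_poll_for_write(int err)
{
    return err == EAGAIN || err == ENOTCONN;
}
```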
Till now we used to perform the L4_CONN check in the data layer
(eg: stream interface) but that does not make sense, because some transport
layers will imply that the connection is opened (eg: SSL), and also because
the complexity to check for this is higher in the data layer than in the
transport layer. This is so true that some read0 cases did not validate
the connection.
So as of now, the transport layer is responsible for clearing L4_CONN when
it detects activity, and the data layer may safely rely on this flag. This
only requires a minor change in raw_sock and stream_interface for now.
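The rule can be pictured with the sketch below, where a receive function
of the transport layer passes its return value to a small helper. The flag
name, its value and the helper are assumptions made for the example.

```c
#include <assert.h>

#define CO_FL_WAIT_L4_CONN 0x01u   /* illustrative name and value */

/* <ret> is what the transport-level receive got: > 0 for data, 0 for a
 * read0 (orderly shutdown), < 0 for "nothing yet" or an error.
 * Data and read0 both prove that the connection was established.
 */
static unsigned int xprt_note_rx(unsigned int conn_flags, long ret)
{
    if (ret >= 0)
        conn_flags &= ~CO_FL_WAIT_L4_CONN;
    return conn_flags;
}
```

The data layer then only has to test the flag and never needs to guess
whether the connection is up.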
While working on the changes required to make the health checks use the
new connections, it started to become obvious that some naming was not
logical at all in the connections. Specifically, it is not logical to
call the "data layer" the layer which is in charge of the whole handshake
and which does not yet provide a data layer once established until a
session has allocated all the required buffers.
In fact, it's more a transport layer, which makes much more sense. The
transport layer offers a medium on which data can transit, and it offers
the functions to move these data when the upper layer requests it. The
upper layer, which iterates over the transport layer's functions to move
data, is the one which should be called the data layer.
The use case where it's obvious is with embryonic sessions : an incoming
SSL connection is accepted. Only the connection is allocated, not the
buffers nor stream interface, etc... The connection handles the SSL
handshake by itself. Once this handshake is complete, we can't use the
data functions because the buffers and stream interface are not there
yet. Hence we have to first call a specific function to complete the
session initialization, after which we'll be able to use the data
functions. This clearly proves that SSL here is only a transport layer
and that the stream interface constitutes the data layer.
A similar change will be performed to rename app_cb => data, but the
two could not be in the same commit for obvious reasons.
When a connection setup is pending and we receive an error without a
POLL_IN flag, we're certain there will be nothing to read from it and
we can safely report an error without attempting a recv() call. This
will be significantly better for health checks which will avoid a useless
recv() on all failed checks.
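The condition can be sketched as follows. The flag values are made up for
the example, and CO_FL_WAIT_L4_CONN and can_skip_recv() are hypothetical
names standing for "connection setup is pending" and for the test itself.

```c
#include <assert.h>

/* Illustrative values only. */
#define FD_POLL_IN         0x01u
#define FD_POLL_ERR        0x08u
#define CO_FL_WAIT_L4_CONN 0x01u   /* hypothetical "setup pending" flag */

/* Returns 1 when the error can be reported right away without recv(). */
static int can_skip_recv(unsigned int conn_flags, unsigned int poll_ev)
{
    return (conn_flags & CO_FL_WAIT_L4_CONN) &&
           (poll_ev & FD_POLL_ERR) &&
           !(poll_ev & FD_POLL_IN);
}
```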
Depending on the pollers used, a connection error may be notified
with POLLOUT|POLLERR|POLLHUP. POLLHUP by itself is enough for the
connection handler to call the read actor, which would only consider
this flag as a good indication of a hangup, without considering the
POLLERR flag.
In order to address this, we now jump directly to the read0 label only
when POLLHUP is reported without POLLERR.
This will be important with health checks as we don't want to believe
a connection was properly established when it's not the case !
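Expressed with the standard poll() event bits, the test amounts to the
sketch below. The function name is hypothetical and the real code uses its
own internal event flags rather than <poll.h>.

```c
#include <assert.h>
#include <poll.h>

/* Returns non-zero only when a hangup was reported without any error,
 * which is the only case where it is safe to treat it as a read0.
 */
static int hup_means_read0(int revents)
{
    return (revents & (POLLHUP | POLLERR)) == POLLHUP;
}
```

When POLLERR is also present, the read path is followed normally so that
the error is seen and reported.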
I/O handlers now all use __conn_{sock,data}_{stop,poll,want}_* instead
of returning dummy flags. The code has become slightly simpler because
some tricks such as the MIN_RET_FOR_READ_LOOP are not needed anymore,
and the data handlers which switch to a handshake handler do not need
to disable themselves anymore.
Some parts of the sock_ops structure were only used by the stream
interface and have been moved into si_ops. Some of them were callbacks
to the stream interface from the connection and have been moved into
app_cb as they're the application seen from the connection (later,
health-checks will need to use them). The rest has moved to data_ops.
Normally at this point the connection could live without knowing about
stream interfaces at all.
The splicing is now provided by the data-layer rcv_pipe/snd_pipe functions
which in turn are called by the stream interface's recv and send callbacks.
The presence of the rcv_pipe/snd_pipe functions is used to attest support
for splicing at the data layer. It looks like the stream-interface's
SI_FL_CAP_SPLICE flag does not make sense anymore as it's used as a proxy
for the pointers above.
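Testing the pointers instead of a flag could look like the sketch below.
The structure layout and the callback signatures are assumptions made for
the example; only the rcv_pipe/snd_pipe names come from the description
above.

```c
#include <assert.h>
#include <stddef.h>

struct connection;   /* opaque in this sketch */
struct pipe;         /* opaque in this sketch */

/* Hypothetical subset of the data-layer operations. */
struct data_ops_sketch {
    int (*rcv_pipe)(struct connection *conn, struct pipe *p, unsigned int count);
    int (*snd_pipe)(struct connection *conn, struct pipe *p);
};

/* Splicing is possible in a direction iff the callback exists. */
static int can_splice_rx(const struct data_ops_sketch *ops)
{
    return ops != NULL && ops->rcv_pipe != NULL;
}

static int can_splice_tx(const struct data_ops_sketch *ops)
{
    return ops != NULL && ops->snd_pipe != NULL;
}

/* Stub standing for a layer which implements the receive side only. */
static int stub_rcv_pipe(struct connection *conn, struct pipe *p, unsigned int count)
{
    (void)conn; (void)p; (void)count;
    return 0;
}
```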
It also appears that we call chk_snd() from the recv callback and then
try to call it again in update_conn(). It is very likely that this last
function will progressively slip into the recv/send callbacks in order
to avoid duplicate check code.
The code works right now with and without splicing. Only raw_sock provides
support for it and it is automatically selected when the various splice
options are set. However it looks like splice-auto doesn't enable it, which
possibly means that the streamer detection code does not work anymore, or
that it's only called at a time when it's too late to enable splicing (in
process_session).
Similar to what was done on the receive path, the data layer now provides
only an snd_buf() callback that is iterated over by the stream interface's
si_conn_send_loop() function.
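The iteration can be sketched as below, with a buffer reduced to a byte
count. All names are hypothetical and the real si_conn_send_loop() has to
deal with much more than this (flags, timeouts, pending pipes).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified buffer: only the number of bytes left to send. */
struct buf_sketch {
    size_t len;
};

/* Callback type: sends what it can from <b>, returns the bytes sent. */
typedef size_t (*snd_buf_fn)(struct buf_sketch *b);

/* Calls <snd_buf> until the buffer is empty or no progress is made. */
static size_t send_loop_sketch(struct buf_sketch *b, snd_buf_fn snd_buf)
{
    size_t total = 0;

    while (b->len > 0) {
        size_t done = snd_buf(b);

        if (done == 0)
            break;          /* would block: stop and wait for polling */
        total += done;
    }
    return total;
}

/* Stub layer which accepts at most 4 bytes per call. */
static size_t stub_snd_buf(struct buf_sketch *b)
{
    size_t n = b->len < 4 ? b->len : 4;

    b->len -= n;
    return n;
}
```

The loop only sees a buffer and a callback, which matches the idea that the
layer below needs no knowledge of channels or stream interfaces.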
The data layer now has no knowledge about channels nor stream interfaces.
The splice() code still needs to be ported as it currently is disabled.
The recv function is now generic and is usable to iterate any connection-to-buf
reading function from a stream interface. So let's move it to stream-interface.
This is the start of the stream connection iterator which calls the
data-layer reader. This still looks a bit tricky but is OK. Splicing
is not handled at all at the moment.
The "raw_sock" prefix will be more convenient for naming functions, as
each function name will start with the data layer's name and end with the
data direction. So let's rename the files now to avoid any further confusion.
The #include directive was also removed from a number of files which do
not need it anymore.