The SUB_CAN_SEND/SUB_CAN_RECV enum values have been confusing a few
times, especially when checking them on reading. After some discussion,
it appears that calling them SUB_RETRY_SEND/SUB_RETRY_RECV more
accurately reflects their purpose since these events may only appear
after a first attempt to perform the I/O operation has failed or was
not completed.
In addition the wait_reason field in struct wait_event which carries
them makes one think that a single reason may happen at once while
it is in fact a set of events. Since the struct is called wait_event
it makes sense that this field is called "events" to indicate it's the
list of events we're subscribed to.
Last, the values for SUB_RETRY_RECV/SEND were swapped so that value
1 corresponds to recv and 2 to send, as is done almost everywhere else
in the code an in the shutdown() call.
There is an issue with some medium sized transfers occasionally not
shutting down at the end. Olivier tracked this to being caused by a
missing wakeup of process_stream(). What happens is that one of the
analysers sets CF_WAKE_WRITE to be woken up at the end of the transfer
to take note of the end of transaction, but a failed si_cs_send() at
the end of process_stream causes the call to be attempted again, with
CF_WAKE_WRITE lost. Then stream_int_notify() doesn't find any valid
condition to wake up process_stream(), and the stream stays there,
idling till the timeout.
In fact, CF_WAKE_WRITE has been designed for calling the analysers
to complete an operation without closing (keep-alive HTTP transfer
for instance). It only applies once the buffer is empty and there
is nothing left to be forwarded. In case the channel is closed, the
wakeup is already granted. So what we need here is to make sure to
wake process_stream() up in case the channel will not be closed and
it doesn't have anything left to be transferred. This is detected by
the lack of CF_AUTO_CLOSE and the emptiness of the buffer + to_forward
after a write activity. So now we take care of always waking the stream
up on end of transfers even if the analysers didn't subscribe to this
or if their subscription was lost.
CF_WAKE_WRITE should probably be killed now, though this first requires
careful inspection.
No backport is needed.
Cc: Olivier Houchard <ohouchard@haproxy.com>
Cc: Christopher Faulet <cfaulet@haproxy.com>
Commit d94f877cd ("BUG/MINOR: mux_pt: Set CS_FL_WANT_ROOM when count is
zero in rcv_buf() callback") triggered a pending issue with this flag,
which is that it's cleared too late and sometimes causes some Rx
transfers to stall. We need to clear it before attempting to receive
otherwise we may risk to see an earlier copy of the flag.
Note that it should probably be defined that this flag could be purged
on each invocation of mux->rcv_buf(), which would make sense.
No backport is needed.
In si_cs_recv(), report that arrive at the end of stream only if we were
indeed connected, we don't want that if the connection failed and we're about
to retry.
The flag CS_FL_EOS can be set while no data was received. So the flas
CS_FL_RCV_MORE is not set. In this case, the read0 was never processed by the
stream interface. To be sure to process it, the test on CS_FL_RCV_MORE has been
moved after the one on CS_FL_EOS.
CS_FL_RCV_MORE is used in two cases, to let the conn_stream
know there may be more data available, and to let it know that
it needs more room. We can't easily differentiate between the
two, and that may leads to hangs, so split it into two flags,
CS_FL_RCV_MORE, that means there may be more data, and
CS_FL_WANT_ROOM, that means we need more room.
This should not be backported.
After 8706c8131 ("BUG/MEDIUM: mux_pt: Always set CS_FL_RCV_MORE."), a
side effect caused failed receives to mark the buffer as missing room,
a flag that no other place can remove since it's empty. Ideally we need
a separate flag to mean "failed to deliver data by lack of room", but
in the mean time at the very least we must not mark as blocked an
empty buffer.
No backport is needed.
In si_cs_recv(), when there's an error on the connection or the conn_stream,
don't give up if CS_FL_RCV_MORE is set on the conn_stream, as it means there's
still data available.
In si_cs_send(), don't give up and subscribe if the connection is still
waiting for a SSL handshake. We will never be woken up once the handshake is
done if we're using HTTP/2. Instead, directly try to send data. When using
the mux_pt, if the handshake is not done yet, snd_buf() would return 0 and
we will subscribe anyway.
In si_cs_send(), some checks are done the CS flags en the connection flags
before calling snd_buf(). But these checks are useless because they have already
been done earlier in the function. The harder to figure out is the flag
CO_FL_SOCK_WR_SH. So it is now tested with CF_SHUTW at the beginning.
In si_cs_send, as said in comments, snd_buf() should only be called if there is
no data in the pipe anymore. But actually, this condition was not respected.
For the same reason than for the commit b46784b1c ("MINOR: stream-int: Notify
caller when an error is reported after a rcv_pipe()"), we return 1 after the
call to rcv_buf() in si_cs_send() to notify the caller some processing may be
triggered.
This patch is not flagged as a bug because no strange behaviour was yet observed
without it. It is just a proactive fix to be consistent.
In si_cs_send(), when an error is found on the CS or the connection at the
beginning of the function, we return 1 to notify the caller some processing may
be triggered. So, it seems logical to do the same after the call to rcv_pipe().
This patch is not flagged as a bug because no strange behaviour was yet observed
without it. It is just a proactive fix to be consistent.
The flag CS_FL_READ_PARTIAL can be set by the mux on the conn_stream to notify
the stream interface that some data were received. Is is used in si_cs_recv to
re-arm read timeout on the channel.
By setting the flag CO_RFL_KEEP_RSV when calling mux->rcv_buf, the
stream-interface notifies the mux it must keep some space to preserve the
buffer's reserve. This flag is only useful for multiplexers handling structured
data, because in such case, the stream-interface cannot know the real amount of
free space in the channel's buffer.
By setting the flag CO_RFL_BUF_FLUSH when calling mux->rcv_buf, the
stream-interface notifies the mux it should flush its buffers without reading
more data. This flag is set when the SI want to use the kernel TCP splicing to
forward data. Of course, the mux can respect it or not, depending on its
state. It's just an information.
Do not destroy the connection when we're about to destroy a stream. This
prevents us from doing keepalive on server connections when the client is
using HTTP/2, as a new stream is created for each request.
Instead, the session is now responsible for destroying connections.
When reusing connections, the attach() mux method is now used to create a new
conn_stream.
Remaining calls to si_cant_put() were all for lack of room and were
turned to si_rx_room_blk(). A few places where SI_FL_RXBLK_ROOM was
cleared by hand were converted to si_rx_room_rdy().
The now unused si_cant_put() function was removed.
The channel can disable reading from the stream-interface using various
methods, such as :
- CF_DONT_READ
- !channel_may_recv()
- and possibly others
Till now this was done by mangling SI_FL_RX_WAIT_EP which is not
appropriate at all since it's not the stream interface which decides
whether it wants to deliver data or not. Some places were also wrongly
relying on SI_FL_RXBLK_ROOM since it was the only other alternative,
but it's not suitable for CF_DONT_READ.
Let's use the SI_FL_RXBLK_CHAN flag for this instead. It will properly
prevent the stream interface from being woken up and reads from
subscribing to more receipt without being accidently removed. It is
automatically reset if CF_DONT_READ is not set in stream_int_notify().
The code is not trivial because it splits the logic between everything
related to buffer contents (channel_is_empty(), CF_WRITE_PARTIAL, etc)
and buffer policy (CF_DONT_READ). Also it now needs to decide timeouts
based on any blocking flag and not just SI_FL_RXBLK_ROOM anymore.
It looks like this patch has caused a minor performance degradation on
connection rate, which possibly deserves being investigated deeper as
the test conditions are uncertain (e.g. slightly more subscribe calls?).
For a long time, stream_int_update() and stream_int_notify() used to only
conditionally call si_chk_rcv() based on state change detection. This
detection is not reliable and quite complex. With the new blocked flags
that si_chk_rcv() checks, it's much more reliable to always call the
function to take into account recent changes,and let it decide if it needs
to wake something up or not.
This also removes the calls to si_chk_rcv() that were performed in
si_update_both() since these ones are systematically performed in
stream_int_update() after updating the Rx flags.
Till now we were using si_done_put() upon shutr, but these flags could
be reset upon next activity. Now let's switch to SI_FL_RXBLK_SHUT which
doesn't go away. It's also set in stream_int_update() in case a shutr
condition is detected.
The now unused si_done_put() was removed.
If an applet reports being blocked due to any of the channel-side flags,
it's reportedly ready to deliver incoming data. It's better to do this
after the return from the applet handler so that applet developers don't
have to worry about details related to flags ordering.
Instead of first indicating that there's more data to read from the
conn_stream then re-adjusting this info along the function, we now
instead set the status according to the subscription status at the
end. It's easier, more accurate, and less sensitive to intermediary
changes.
This will soon allow to remove all the si_cant_put() calls that were
placed in the middle to force a subsequent callback and prevent the
function from subscribing to the mux layer.
The SI_FL_WANT_PUT flag is used in an awkward way, sometimes it's
set by the stream-interface to mean "I have something to deliver",
sometimes it's cleared by the channel to say "I don't want you to
send what you have", and it has to be set back once CF_DONT_READ
is cleared. This will have to be split between SI_FL_RX_WAIT_EP
and SI_FL_RXBLK_CHAN. This patch only replaces all uses of the
flag with its natural (but negated) replacement SI_FL_RX_WAIT_EP.
The code is expected to be strictly equivalent. The now unused flag
was completely removed.
This flag is not enough to describe all blocking situations, as can be
seen in each case we remove it. The muxes has taught us that using multiple
blocking flags in parallel will be much easier, so let's start to do this
now. This patch only renames this flags in order to make next changes more
readable.
There currently is an optimization in stream_int_notify() consisting
in not trying to forward small bits of data if extra data remain to be
processed. The purpose is to avoid forwarding one chunk at a time if
multiple chunks are available to be parsed at once. It consists in
avoiding sending pending output data if there are still data to be
parsed in the channel's buffer, since process_stream() will have the
opportunity to deal with them all at once.
Not only this optimization is less useful with the new way the connections
work, but it even causes problems like lost events since WAIT_ROOM will
not be removed. And with HTX, it will never be able to update the input
buffer after the first read.
Let's relax the rules now, by always sending if we don't have the
CF_EXPECT_MORE flag (used to group writes), or if the buffer is
already full.
The function used to abuse the internals of mux_pt to retrieve a
conn_stream, which will not work anymore after the idle connection
changes. Let's make it rely on the more reliable cs_get_first()
instead.
When splice() reports a pipe full condition, we go through the common
code used to release a possibly empty pipe (which we don't have) and which
immediately tries to allocate a buffer that will never be used. Further,
it may even subscribe to get this buffer if the resources are low. Let's
simply get out of this way if the pipe is full.
This fix could be backported to 1.8 though the code is a bit different
overthere.
Since we don't necessarily pass through conn_fd_handler() when reading,
conn_refresh_polling_flags() is not necessarily called when performing
a recv() operation, thus flags like CO_FL_WAIT_ROOM are not cleared.
It happens that si_cs_recv() checks CO_FL_WAIT_ROOM before deciding to
receive into a buffer, to see if the previous rcv_pipe() call failed by
lack of pipe room. The combined effect of these two statements is that
at the end of a file transmission, when there's too little data to
warrant the use of a pipe and the pipe is empty, we refrain from using
rcv_pipe() for the last few bytes, but since CO_FL_WAIT_ROOM is still
present, we don't use rcv_buf() either, and the connection remains
frozen in this state with si_cs_recv() called in loops.
In order to fix this we can simply manually clear CO_FL_WAIT_ROOM when
not using pipe so that the next check sees the result of the previous
operation and not an old one. We could equally call
cond_refresh_polling_flags() but that would be overkill and dangerous
given that it would manipulate the connection's flags under the mux.
By the way ideally the mux should report this flag into the connstream
for cleaner manipulation.
No backport is needed as this is only post 1.9-dev2.
As part of the changes that went into 1.9-dev2 regarding the polling
modifications, the changes consecutive to the removal of the wait_list
from the conn_streams (commit 71384551a) made si_cs_recv() occasionally
return without subscribing to receive events, causing spliced transfers
to randomly fail if the client was at least as fast as the server. This
may remain unnoticed on most deployments since servers are usually close
to haproxy with higher bandwidth than clients have, resulting in buffers
always being full.
In order to reproduce his effect, it is better to do it on the local
machine and to transfer very large objects (hundreds of gigs) over a
single connection, to see it suddenly stall after a few tens of gigs.
Now with this fix it's fine even after 3 TB over a single connection.
No backport is needed.
Splicing was in great part broken over the last few development version
due to the use of co_data() to detect if data are available in the channel.
But co_data() only looks at buffered data, not spliced data.
Channel_is_empty() takes care of both and should be used. With this,
splicing restarts to work but there are still a few cases where transfers
may stall.
No backport is needed.
Subsequent to the recent stream-int updates, we started to consider that
SI_FL_WANT_PUT needs to be set when receipt is enabled, but this is wrong
and results in 100% CPU when an HTTP client stays idle after a keep-alive
request because the stream-int has nothing to provide and nothing to send.
In fact just like for applets this flag should reflect the continuation
of an attempt. So it's si_cs_recv() which should set the flag, and clear
it if it has nothing more to provide. This function is called the first
time in process_stream()), and called again during transfers, so it will
always be up to date during stream_int_update() and stream_int_notify().
As a special case, it should also be set when a connection switches to
the established state. And we should absolutely refrain from calling
si_cs_recv() to re-enable reading, normally just setting this flag
(from within the stream-int's handler or prior to calling si_chk_rcv())
is expected to be OK.
A corner case remains where it was observed that in stream_int_notify() we
can sometimes be called with an empty output channel with SI_FL_WAIT_ROOM
and no CF_WRITE_PARTIAL, so there's no way to detect that we should
re-enable receiving. It's easy to also take care of this condition
there for the time it takes to figure if this situation is expected
or not.
Now it becomes more obvious that relying on a single flag to request
room (or on two flags to arbiter activity) is not workable given the
autonomy of both sides. The mux_h2 has taught us that blocking flags
are much more reliable, require much less condition and are much easier
to deal with. That's probably something to consider quickly in this
area.
No backport is needed.
Before calling the mux to get incoming data, we get the amount of space
available at the input of the buffer. If there is no space, we don't try to read
more data. This is good enough when raw data are stored in the buffer. But this
info has no meaning when structured data are stored. Because with the HTTP
refactoring, such kind of data will be stored in buffers, it is a bit annoying.
So, to avoid any problems, we always call the mux. It is the mux's responsiblity
to notify the stream interface it needs more space to store more data. This must
be done by setting the flag CS_FL_RCV_MORE on the conn_stream.
This is exactly what we do in the pass-through mux when <count> is null.
This flag is set on the stream interface when we should wait for more space in
the channel's buffer to store more incoming data. This means we should wait some
outgoing data are sent before retrying to receive more data.
But in stream interface functions, at many places, instead of checking this
flag, we use the function channel_may_recv to know if we can (re)start
reading. This currently works but it is not really consistent. And, it works
because only raw data are stored in buffers. But it will be a problem when we
start to store structured data in buffers.
So to avoid any problems with futur implementations, we now rely only on
SI_FL_WAIT_ROOM. The function channel_may_recv can still be called, but only
when we are sure to handle raw data (for instance in functions ci_put*). To do
so, among other things, we must be sure to unset SI_FL_WAIT_ROOM and offer an
opportunity to call chk_rcv() on a stream interface when some data are sent
on the other end, which is now granted by the previous patch series.
We exclusively use stream_int_update() now, the lower layers are not
called anymore so let's remove them, as well as si_update() which used
to be their wrapper.
It's far from being clean, but at least it allows to resync both CS and
applets from the same place, taking into account the fact that CS are
processed synchronously for the send side while appletx are processed
outside of the process_stream() loop. The arrangement is optimised to
minimize the amount of iteration by handling send first, then updating
the SI_FL_WAIT_ROOM flags and only then dealing with si_chk_rcv() on
both sides. The SI_FL_WANT_PUT flag is set if needed before calling
si_chk_rcv() since this is done prior to calling stream_int_update().
Now there's no risk that stream_int_notify() is called anymore during
such operations, thus we cannot have any spurious wake-up anymore. The
case where a successful send() could complete a pending connect() is
handled by taking any stream-int state changes into account at the
call place, which is normal since process_stream() is designed to
iterate till stabilisation.
Doing this solves most of the remaining inconsistencies between CS and
applets.
The function used to be called in turn for each side of the stream, but
since it's called exclusively from process_stream(), it prevents us from
making use of the knowledge we have of the operations in progress for
each side, resulting in having to go all the way through functions like
stream_int_notify() which are not appropriate there.
That patch creates a new function, si_update_both() which takes two
stream interfaces expected to belong to the same stream, and processes
their flags in a more suitable order, but for now doesn't change the
logic at all.
The next step will consist in trying to reinsert the rest of the socket
layer-specific update code to ultimately update the flags correctly at
the end of the operation.
Instead of clearing the SI_FL_WAIT_ROOM flag and losing the information
about the need from the producer to be woken up, we now call si_chk_rcv()
immediately. This is cheap to do and it could possibly be further improved
by only doing it when SI_FL_WAIT_ROOM was still set, though this will
require some extra auditing of the code paths.
The only remaining place where the flag was cleared without a call to
si_chk_rcv() is si_alloc_ibuf(), but since this one is called from a
receive path woken up from si_chk_rcv() or not having failed, the
clearing was not necessary anymore either.
And there was one place in stream_int_notify() where si_chk_rcv() was
called with SI_FL_WAIT_ROOM still explicitly set so this place was
adjusted in order to clear the flag prior to calling si_chk_rcv().
Now we don't have any situation where we randomly clear SI_FL_WAIT_ROOM
without trying to wake the other side up, nor where we call si_chk_rcv()
with the flag set, so this flag should accurately represent a failed
attempt at putting data into the buffer.
When CF_DONT_READ is set, till now we used to set SI_FL_WAIT_ROOM, which
is not appropriate since it would lose the subscribe status. Instead let's
clear SI_FL_WANT_PUT (just like applets do), and set the flag only when
CF_DONT_READ is cleared.
We have to do this in stream_int_update(), and in si_cs_io_cb() after
returning from si_cs_recv() since it would be a bit invasive to hack
this one for now. It must not be done in stream_int_notify() otherwise
it would re-enable blocked applets.
Last, when si_chk_rcv() is called, it immediately clears the flag before
calling ->chk_rcv() so that we are not tempted to uselessly loop on the
same call until the receive function is called. This is the same principle
as what is done with the applet scheduler.
This flag should already be cleared before calling the *chk_rcv() functions.
Before adapting all call places, let's first make sure si_chk_rcv() clears
it before calling them so that these functions do not have to check it again
and so that they do not adjust it. This function will only call the lower
layers if the SI_FL_WANT_PUT flag is present so that the endpoint can decide
not to be called (as done with applets).
We now do this on the si_cs_recv() path so that we always have
SI_FL_WANT_PUT properly set when there's a need to receive and
SI_FL_WAIT_ROOM upon failure.
It doesn't make sense to limit this code to applets, as any stream
interface can use it. Let's rename it by simply dropping the "applet_"
part of the name. No other change was made except updating the comments.
Commit eafd8ebcf ("MEDIUM: stream-int: call si_cs_process() in
stream_int_update_conn") uncovered a sleeping bug. By calling
si_cs_process() within si_update(), we end up calling stream_int_notify().
We rely on it to update the stream-int before quitting as a hack, but
it happens to immediately wake the task up while the stream int's
state is still SI_ST_CON (during the connection establishment). The
observable effect is that an unreachable server causes haproxy to
use 100% CPU until the connection timeout strikes.
This patch fixes this by not causing the wake up for the SI_ST_CON
state. It would equally be possible to check for states higher than
SI_ST_EST as is done in other places, but for now better stay on the
safe side by covering the only issue that can be triggered. It's
suspected that this issue slightly affects older versions by causing
one extra call to process_stream() during the connection setup for
each activity change on the other side, but this should not have
any observable effect.
No backport is needed.
It took me 17 minutes this morning to figure where si->wait_event was
set (it's in si_reset() which should now probably be renamed since it
doesn't just perform a reset anymore but also an allocation) and what
its task was assigned to (si_cs_io_cb() even for applets and empty SI).
This is too confusing and not intuitive enough, let's at least add a
few comments for now to help figure how this stuff works next time.
This one causes some events to be lost. It has already been tested in
an experimental branch but was not merged until being certain it was
needed. Fred figured that requesting /?k=1&s=447392 from httpterm through
haproxy-master was enough to stall the transfer.
No backport is needed, this only affects 1.9-dev5.
There's a call there to si_cs_send() while we're supposed to come from
si_cs_io_cb() which has just done it. But in fact we can also come here
as a lower layer callback from ->wake() after a connection is established.
Since most of the time we'll end up here with either no data in the buffer
or a blocked output, let's simply check if we're already susbcribed to send
events before calling si_cs_send().
stream_int_notify() is I/O agnostic and should not wake up the tasklet,
it's up to si_cs_process() to do that, just like si_applet_wake_cb()
does it for the applet.
This one was added by commit 53216e7db ("MEDIUM: connections: Don't
directly mess with the polling from the upper layers.") after the
removal of the conditional cs_want_send() call. But after analysis
it turned out that it's not needed since the si_cs_send() call will
either succeed or subscribe.
Calling si_cs_send() alone is always dangerous because it can result
in the loss of an event if it manages to empty the buffer. Indeed, in
this case it's critical to call si_chk_rcv() on the opposite stream-int.
Given that si_cs_process() takes care of all this, let's call it instead.
All this code could possibly be refined soon to avoid redoing the whole
stream_int_notify() and do it only after a send(), but at the moment it's
not important.
With the new synchronous si_cs_send() at the end of process_stream(),
we're seeing re-appear the I/O layer specific part of the stream interface
which is supposed to deal with I/O event subscription. The only difference
is that now we subscribe to I/Os only after having attempted (and failed)
them.
This patch brings a cleanup in this by reintroducing stream_int_update_conn()
with the send code from process_stream(). However this alone would not be
enough because the flags which are cleared afterwards would result in the
loss of the possible events (write events only at the moment). So the flags
clearing and stream-int state updates are also performed inside si_update()
between the generic code and the I/O specific code. This definitely makes
sense as after this call we can simply check again for channel and SI flag
changes and decide to loop once again or not.
Well that's only 3 places (applet.c, stream_interface.c, hlua.c). This
ensures we always clear SI_FL_WAIT_ROOM before setting it on failure,
so that it is granted that SI_FL_WAIT_ROOM always indicates a lack of
room for doing an operation, including the inability to allocate a
buffer for this.
The behaviour of the flag CF_WRITE_PARTIAL was modified by commit
95fad5ba4 ("BUG/MAJOR: stream-int: don't re-arm recv if send fails") due
to a situation where it could trigger an immediate wake up of the other
side, both acting in loops via the FD cache. This loss has caused the
need to introduce CF_WRITE_EVENT as commit c5a9d5bf, to replace it, but
both flags express more or less the same thing and this distinction
creates a lot of confusion and complexity in the code.
Since the FD cache now acts via tasklets, the issue worked around in the
first patch no longer exists, so it's more than time to kill this hack
and to restore CF_WRITE_PARTIAL's semantics (i.e.: there has been some
write activity since we last left process_stream).
This patch mostly reverts the two commits above. Only the part making
use of CF_WROTE_DATA instead of CF_WRITE_PARTIAL to detect the loss of
data upon connection setup was kept because it's more accurate and
better suited.
With the previous connection model, when we purposely decided to stop
receiving in order to avoid polling after a complete request was received
for example, it was needed to set SI_FL_WAIT_ROOM to prevent receive
polling from being re-armed. Now with the new subscription-based model
there is no such thing anymore and there is noone to remove this flag
either. Thus if a request takes more than one packet to come in or
spans over too many packets, this flag will cause it to wait forever.
Let's simply remove this flag now.
This patch should not be backported since older versions still need
that this flag is set here to stop receiving.
Avoid using conn_xprt_want_send/recv, and totally nuke cs_want_send/recv,
from the upper layers. The polling is now directly handled by the connection
layer, it is activated on subscribe(), and unactivated once we got the event
and we woke the related task.
When subscribing, we don't need to provide a list element, only the h2 mux
needs it. So instead, Add a list element to struct h2s, and use it when a
list is needed.
This forces us to use the unsubscribe method, since we can't just unsubscribe
by using LIST_DEL anymore.
This patch is larger than it should be because it includes some renaming.
Instead of using si_cs_io_cb() in process_stream() use si_cs_send/si_cs_recv
instead, as si_cs_io_cb() may lead to process_stream being woken up when it
shouldn't be, and thus timeout would never get triggered.
Instead of waiting for the connection layer to let us know we can read,
attempt to receive as soon as process_stream() is called, and subscribe
to receive events if we can't receive yet.
Now, except for idle connections, the recv(), send() and wake() methods are
no more, all the lower layers do is waking tasklet for anybody waiting
for I/O events.
Remove the recv() method from mux and conn_stream.
The goal is to always receive from the upper layers, instead of waiting
for the connection later. For now, recv() is still called from the wake()
method, but that should change soon.
For struct connection, struct conn_stream, and for the h2 mux, add 2 new
lists, one that handles waiters for recv, and one that handles waiters for
recv and send. That way we can ask to subscribe for either recv or send.
Call si_cs_send() at the beginning of si_cs_wake_cb(), instead of from
stream_int_notify-), so that if we get a connection error while trying to
send, the stream_interface will get SI_FL_ERR, the associated task will
be woken up, and the connection will be properly destroyed.
No backport needed.
If we subscribed to send, and the callback is called, call the wake callback
after, so that process_stream() may be woken up if needed.
This is 1.9-specific, no backport is needed.
It is possible that the conn_stream gets detached from the stream_interface,
and as it subscribed to the wait list, si_cs_io_cb() gets called anyway,
so make sure we have a conn_stream before attempting to send more data.
This is 1.9-specific, no backport is needed.
Instead of just using the conn_stream wait_list, give the stream_interface
its own. When the conn_stream will have its own buffers, the stream_interface
may have to wait on it.
Instead of using si_cs_send() as a task handler, define a new function,
si_cs_io_cb(), and give si_cs_send() its original prototype. Right now
si_cs_io_cb() just handles send, but later it'll handle recv() too.
It is mandatory to be sure to process data blocked in the RX buffer of the
conn_stream while the shutr/read0 was already processed. The stream interface
doesn't need to rely on this flags because it already tests CS_FL_EOS.
This function is generic and is able to automatically transfer data from a
buffer to the conn_stream's tx buffer. It does this automatically if the mux
doesn't define another snd_buf() function.
It cannot yet be used as-is with the conn_stream's txbuf without risking to
lose data on close since conn_streams need to be orphaned for this.
There is a long-time issue which affects some applets, at least the stats
applet. If a large stats page is read over a slow link, regularly the
channel's buffer contains too many response data to allow another round
of ci_putblk() to copy a new message. In this case the applet calls
si_applet_cant_put() to mention that it failed to emit data into the
channel's buffer, and wants to be called only once some room is made.
The problem is that stream_int_update(), which is called from
process_stream(), will clear this flag whenever it sees there's some
spare room in the channel's buffer. It causes the applet to be woken
again immediately. This is very visible when reading a large stats
page over a slow link, because in this case haproxy will run at 100%
CPU and strace shows mostly epoll_wait(0). It is very likely that some
other applets like CLI, Lua, peers or SPOE have also been affected but
that the effect were less noticeable because it was mixed with traffic.
Ideally stream_int_update() should not touch these flags, but changing
this would require a very careful auditing of all users. Instead here
what we do is that we respect the flag if the channel still has output
data. This way the flag will automatically disappear once the buffer is
empty, and the applet function will be called only when input data
remains, if at all.
This patch alone is not enough to observe the behaviour change on the
stats page because another bug takes over, addressed by next patch
(BUG/MEDIUM: stats: don't ask for more data as long as we're responding).
When both are applied, dumping stats for 5k backends over a 10 Mbps link
take 1% CPU instead of 100%, with 1.5k epoll_wait() calls instead of 80k.
This fix should be backported at least as far as 1.5.
If the cs has data pending or shutdown and the input channel is still
waiting for reads, let's simply call the recv() function from the wake()
callback. This will allow the lower layers to simply wake the upper one
up without having to consider the recv() nor anything else.
This function is generic and is able to automatically transfer data
from a conn_stream's rx buffer to the destination buffer. It does this
automatically if the mux doesn't define another rcv_buf() function.
This new function wl_set_waitcb() prepopulates a wait_list with a tasklet
and a context and returns it so that it can be passed to ->subscribe() to
be added to a connection or conn_stream's wait_list. The caller doesn't
need to know all the insiders details anymore this way.
Totally nuke the "send" method, instead, the upper layer decides when it's
time to send data, and if it's not possible, uses the new subscribe() method
to be called when it can send data again.
Now all the code used to manipulate chunks uses a struct buffer instead.
The functions are still called "chunk*", and some of them will progressively
move to the generic buffer handling code as they are cleaned up.
Chunks are only a subset of a buffer (a non-wrapping version with no head
offset). Despite this we still carry a lot of duplicated code between
buffers and chunks. Replacing chunks with buffers would significantly
reduce the maintenance efforts. This first patch renames the chunk's
fields to match the name and types used by struct buffers, with the goal
of isolating the code changes from the declaration changes.
Most of the changes were made with spatch using this coccinelle script :
@rule_d1@
typedef chunk;
struct chunk chunk;
@@
- chunk.str
+ chunk.area
@rule_d2@
typedef chunk;
struct chunk chunk;
@@
- chunk.len
+ chunk.data
@rule_i1@
typedef chunk;
struct chunk *chunk;
@@
- chunk->str
+ chunk->area
@rule_i2@
typedef chunk;
struct chunk *chunk;
@@
- chunk->len
+ chunk->data
Some minor updates to 3 http functions had to be performed to take size_t
ints instead of ints in order to match the unsigned length here.
Now the buffers only contain the header and a pointer to the storage
area which can be anywhere. This will significantly simplify buffer
swapping and will make it possible to map chunks on buffers as well.
The buf_empty variable was removed, as now it's enough to have size==0
and area==NULL to designate the empty buffer (thus a non-allocated head
is the empty buffer by default). buf_wanted for now is indicated by
size==0 and area==(void *)1.
The channels and the checks now embed the buffer's head, and the only
pointer is to the storage area. This slightly increases the unallocated
buffer size (3 extra ints for the empty buffer) but considerably
simplifies dynamic buffer management. It will also later permit to
detach unused checks.
The way the struct buffer is arranged has proven quite efficient on a
number of tests, which makes sense given that size is always accessed
and often first, followed by the othe ones.
A few locations still accessing ->i and ->o directly were changed to
use ci_data() and co_data() respectively. A call to b_del() was replaced
with co_set_data() in si_cs_send() so that ->o will is automatically be
decremented after the migration.
With this flag we introduce the notion of "dry" vs "wet" buffers : some
demultiplexers like the H2 mux require as much room as possible for some
operations that are not retryable like decoding a headers frame. For this
they need to know if the buffer is congested with data scheduled for
leaving soon or not. Since the new API will not provide this information
in the buffer itself, the caller must indicate it. We never need to know
the amount of such data, just the fact that the buffer is not in its
optimal condition to be used for receipt. This "CO_RFL_BUF_WET" flag is
used to mention that such outgoing data are still pending in the buffer
and that a sensitive receiver should better let it "dry" before using it.
The mux and transport rcv_buf() now takes a "flags" argument, just like
the snd_buf() one or like the equivalent syscall lower part. The upper
layers will use this to pass some information such as indicating whether
the buffer is free from outgoing data or if the lower layer may allocate
the buffer itself.
This way the mux doesn't need to modify the buffer's metadata anymore
nor to know the output's size. The mux->snd_buf() function now takes a
const buffer and it's up to the caller to update the buffer's state.
The return type was updated to return a size_t to comply with the count
argument.
These ones manipulate the output data count which will be specific to
the channel soon, so prepare the call points to use the channel only.
The b_* functions are now unused and were removed.
Instead of looking for CO_FL_EARLY_DATA to know if we have to try to wake
up a stream, because it is waiting for a SSL handshake, instead add a new
conn_stream flag, CS_FL_WAIT_FOR_HS. This way we don't have to rely on
CO_FL_EARLY_DATA, and we will only wake streams that are actually waiting.
When an end of stream has been reported, we should not try to receive again
as the mux layer might not be prepared to this and could report unexpected
errors.
This is more of a strengthening measure that follows the introduction of
conn_stream that came in 1.8. It's desired to backport this into 1.8 though
it's uncertain at this time whether it may have caused real issues.
When a stream interface tries to read data from a mux using rcv_buf(),
sometimes it sees 0 as the return value and concludes that there's no
more data while there are, resulting in the connection being polled for
more data and no new attempt being made at reading these pending data.
Now it will automatically check for flag CS_FL_RCV_MORE to know if the
mux really did not have anything available or was not able to provide
these data by lack of room in the destination buffer, and will set
SI_FL_WAIT_ROOM accordingly. This will ensure that once current data
lying in the buffer are forwarded to the other side, reading chk_rcv()
will be called to re-enable reading.
It's important to note that in practice it will rely on the mux's
update_poll() function to re-enable reading and that where the calls
are placed in the stream interface, it's not possible to perform a
new synchronous rcv_buf() call. Thus a corner case remains where the
mux cannot receive due to a full buffer or any similar condition, but
needs to be able to wake itself up to deliver pending data. This is a
limitation of the current connection/conn_stream API which will likely
need a new event subscription to at least call ->wake() asynchronously
(eg: mux->{kick,restart,touch,update} ?).
For now the affected mux (h2 only) will have to take care of the extra
logic to carefully enable polling to restart processing incoming data.
This patch relies on previous one (MINOR: conn_stream: add new flag
CS_FL_RCV_MORE to indicate pending data) and both must be backported to
1.8.
In case any stream was waiting for the handshake after receiving early data,
we have to wake all of them. Do so by making the mux responsible for
removing the CO_FL_EARLY_DATA flag after all of them are woken up, instead
of doing it in si_cs_wake_cb(), which would then only work for the first one.
This makes wait_for_handshake work with HTTP/2.
Commit 9aaf778 ("MAJOR: connection : Split struct connection into struct
connection and struct conn_stream.") had to change the way the stream
interface deals with incoming data to accomodate the mux. A break
statement got lost during a change, leading to the receive call being
performed twice even when CF_READ_DONTWAIT is set. The most noticeable
effect is that it made the bug described in commit 33982cb ("BUG/MAJOR:
stream: ensure analysers are always called upon close") much easier to
reproduce as it would appear even with an HTTP frontend.
Let's just restore the stream-interface flag and the break here, as in
the previous code.
No backport is needed as this was introduced during 1.8-dev.
When a write activity is reported on a channel, it is important to keep this
information for the stream because it take part on the analyzers' triggering.
When some data are written, the flag CF_WRITE_PARTIAL is set. It participates to
the task's timeout updates and to the stream's waking. It is also used in
CF_MASK_ANALYSER mask to trigger channels anaylzers. In the past, it was cleared
by process_stream. Because of a bug (fixed in commit 95fad5ba4 ["BUG/MAJOR:
stream-int: don't re-arm recv if send fails"]), It is now cleared before each
send and in stream_int_notify. So it is possible to loss this information when
process_stream is called, preventing analyzers to be called, and possibly
leading to a stalled stream.
Today, this happens in HTTP2 when you call the stat page or when you use the
cache filter. In fact, this happens when the response is sent by an applet. In
HTTP1, everything seems to work as expected.
To fix the problem, we need to make the difference between the write activity
reported to lower layers and the one reported to the stream. So the flag
CF_WRITE_EVENT has been added to notify the stream of the write activity on a
channel. It is set when a send succedded and reset by process_stream. It is also
used in CF_MASK_ANALYSER. finally, it is checked in stream_int_notify to wake up
a stream and in channel_check_timeouts.
This bug is probably present in 1.7 but it seems to have no effect. So for now,
no needs to backport it.
Commit 4ac4928 ("BUG/MINOR: stream-int: don't set MSG_MORE on SHUTW_NOW
without AUTO_CLOSE") was incomplete. H2 reveals another situation where
the input stream is marked closed with the request and we set MSG_MORE,
causing a delay before the request leaves.
Better avoid setting the flag on the request path for close cases in
general.
At all call places where a conn_stream is in use, we can now use
cs_close() to get rid of a conn_stream and of its underlying connection
if the mux estimates it makes sense. This is what is currently being
done for the pass-through mux.
Instead of having to manually handle lingering outside, let's make
conn_sock_shutw() check for it before calling shutdown(). We simply
don't want to emit the FIN if we're going to reset the connection
due to lingering. It's particularly important for silent-drop where
it's absolutely mandatory that no packet leaves the machine.
In a 1:1 connection:stream there's no problem relying on the connection
flags alone to check for errors. But in a mux, it will be possible to mark
certain streams in error without having to mark all of them. An example is
an H2 client sending RST_STREAM frames to abort a long download, or a parse
error requiring to abort only this specific stream.
This commit ensures that stream-interface and checks properly check for
CS_FL_ERROR in cs->flags wherever CO_FL_ERROR was in use. Most likely over
the long term, any check for CO_FL_ERROR will have to disappear.
All the references to connections in the data path from streams and
stream_interfaces were changed to use conn_streams. Most functions named
"something_conn" were renamed to "something_cs" for this. Sometimes the
connection still is what matters (eg during a connection establishment)
and were not always renamed. The change is significant and minimal at the
same time, and was quite thoroughly tested now. As of this patch, all
accesses to the connection from upper layers go through the pass-through
mux.
Add a new sample fetch, "ssl_fc_has_early", a boolean that will be true
if early data were sent, and a new action, "wait-for-handshake", if used,
the request won't be forwarded until the SSL handshake is done.
We've been keep this test for a connection being established since 1.5-dev14
when the stream-interface was still accessing the FD directly. The test on
CO_FL_HANDSHAKE and L{4,6}_CONN is totally useless here, and can even be
counter-productive on pure TCP where it could prevent a request from being
sent on a connection still attempting to complete its establishment. And it
creates an abnormal dependency between the layers that will complicate the
implementation of the mux, so let's get rid of it now.
Instead of having to manually handle lingering outside, let's make
conn_sock_shutw() check for it before calling shutdown(). We simply
don't want to emit the FIN if we're going to reset the connection
due to lingering. It's particularly important for silent-drop where
it's absolutely mandatory that no packet leaves the machine.
These flags are not exactly for the data layer, they instead indicate
what is expected from the transport layer. Since we're going to split
the connection between the transport and the data layers to insert a
mux layer, it's important to have a clear idea of what each layer does.
All function conn_data_* used to manipulate these flags were renamed to
conn_xprt_*.
For HTTP/2 we'll need some buffer-only equivalent functions to some of
the ones applying to channels and still squatting the bi_* / bo_*
namespace. Since these names have kept being misleading for quite some
time now and are really getting annoying, it's time to rename them. This
commit will use "ci/co" as the prefix (for "channel in", "channel out")
instead of "bi/bo". The following ones were renamed :
bi_getblk_nc, bi_getline_nc, bi_putblk, bi_putchr,
bo_getblk, bo_getblk_nc, bo_getline, bo_getline_nc, bo_inject,
bi_putchk, bi_putstr, bo_getchr, bo_skip, bi_swpbuf
Since around 1.5-dev12, we've been setting MSG_MORE on send() on various
conditions, including the fact that SHUTW_NOW is present, but we don't
check that it's accompanied with AUTO_CLOSE. The result is that on requests
immediately followed by a close (where AUTO_CLOSE is not set), the request
gets delayed in the TCP stack before being sent to the server. This is
visible with the H2 code where the end-of-stream flag is set on requests,
but probably happens when a POLL_HUP is detected along with the request.
The (lack of) presence of option abortonclose has no effect here since we
never send the SHUTW along with the request.
This fix can be backported to 1.7, 1.6 and 1.5.
When
1) HAProxy configured to enable splice on both directions
2) After some high load, there are 2 input channels with their socket buffer
being non-empty and pipe being full at the same time, sitting in `fd_cache`
without any other fds.
The 2 channels will repeatedly be stopped for receiving (pipe full) and waken
for receiving (data in socket), thus getting out and in of `fd_cache`, making
their fd swapping location in `fd_cache`.
There is a `if (entry < fd_cache_num && fd_cache[entry] != fd) continue;`
statement in `fd_process_cached_events` to prevent frequent polling, but since
the only 2 fds are constantly swapping location, `fd_cache[entry] != fd` will
always hold true, thus HAProxy can't make any progress.
The root cause of the issue is dual :
- there is a single fd_cache, for next events and for the ones being
processed, while using two distinct arrays would avoid the problem.
- the write side of the stream interface wakes the read side up even
when it couldn't write, and this one really is a bug.
Due to CF_WRITE_PARTIAL not being cleared during fast forwarding, a failed
send() attempt will still cause ->chk_rcv() to be called on the other side,
re-creating an entry for its connection fd in the cache, causing the same
sequence to be repeated indefinitely without any opportunity to make progress.
CF_WRITE_PARTIAL used to be used for what is present in these tests : check
if a recent write operation was performed. It's part of the CF_WRITE_ACTIVITY
set and is tested to check if timeouts need to be updated. It's also used to
detect if a failed connect() may be retried.
What this patch does is use CF_WROTE_DATA() to check for a successful write
for connection retransmits, and to clear CF_WRITE_PARTIAL before preparing
to send in stream_int_notify(). This way, timeouts are still updated each
time a write succeeds, but chk_rcv() won't be called anymore after a failed
write.
It seems the fix is required all the way down to 1.5.
Without this patch, the only workaround at this point is to disable splicing
in at least one direction. Strictly speaking, splicing is not absolutely
required, as regular forwarding could theorically cause the issue to happen
if the timing is appropriate, but in practice it appears impossible to
reproduce it without splicing, and even with splicing it may vary.
The following config manages to reproduce it after a few attempts (haproxy
going 100% CPU and having to be killed) :
global
maxpipes 50000
maxconn 10000
listen srv1
option splice-request
option splice-response
bind :8001
server s1 127.0.0.1:8002
server$ tcploop 8002 L N20 A R10 S1000000 R10 S1000000 R10 S1000000 R10 S1000000 R10 S1000000
client$ tcploop 8001 N20 C T S1000000 R10 J
After careful inspection, this flag is set at exactly two places :
- once in the health-check receive callback after receipt of a
response
- once in the stream interface's shutw() code where CF_SHUTW is
always set on chn->flags
The flag was checked in the checks before deciding to send data, but
when it is set, the wake() callback immediately closes the connection
so the CO_FL_SOCK_WR_SH flag is also set.
The flag was also checked in si_conn_send(), but checking the channel's
flag instead is enough and even reveals that one check involving it
could never match.
So it's time to remove this flag and replace its check with a check of
CF_SHUTW in the stream interface. This way each layer is responsible
for its shutdown, this will ease insertion of the mux layer.
This flag is both confusing and wrong. It is supposed to report the
fact that the data layer has received a shutdown, but in fact this is
reported by CO_FL_SOCK_RD_SH which is set by the transport layer after
this condition is detected. The only case where the flag above is set
is in the stream interface where CF_SHUTR is also set on the receiving
channel.
In addition, it was checked in the health checks code (while never set)
and was always test jointly with CO_FL_SOCK_RD_SH everywhere, except in
conn_data_read0_pending() which incorrectly doesn't match the second
time it's called and is fortunately protected by an extra check on
(ic->flags & CF_SHUTR).
This patch gets rid of the flag completely. Now conn_data_read0_pending()
accurately reports the fact that the transport layer has detected the end
of the stream, regardless of the fact that this state was already consumed,
and the stream interface watches ic->flags&CF_SHUTR to know if the channel
was already closed by the upper layer (which it already used to do).
The now unused conn_data_read0() function was removed.
The stream interface chk_snd() code checks if the connection has already
subscribed to write events in order to avoid attempting a useless write()
which will fail. But it used to check both the CO_FL_CURR_WR_ENA and the
CO_FL_DATA_WR_ENA flags, while the former may only be present without the
latterif either the other side just disabled writing did not synchronize
yet (which is harmless) or if it's currently performing a handshake, which
is being checked by the next condition and will be better dealt with by
properly subscribing to the data events.
This code was added back in 1.5-dev20 to limit the number of useless calls
to splice() but both flags were checked at once while only CO_FL_DATA_WR_ENA
was needed. This bug seems to have no impact other than making code changes
more painful. This fix may be backported down to 1.5 though is unlikely to
be needed there.
Introduced regression with 'MAJOR: applet scheduler rework' (1.8-dev only).
The fix consist to re-enable the appctx immediatly from the
applet wake cb if the process_stream is not pending in runqueue
and the applet want perform a put or a get and the WAIT_ROOM
flag was removed by stream_int_notify.
In order to authorize call of appctx_wakeup on running task:
- from within the task handler itself.
- in futur, from another thread.
The appctx is considered paused as default after running the handler.
The handler should explicitly call appctx_wakeup to be re-called.
When the appctx_free is called on a running handler. The real
free is postponed at the end of the handler process.
Very early in the connection rework process leading to v1.5-dev12, commit
56a77e5 ("MEDIUM: connection: complete the polling cleanups") marked the
end of use for this flag which since was never set anymore, but it continues
to be tested. Let's kill it now.
A tcp half connection can cause 100% CPU on expiration.
First reproduced with this haproxy configuration :
global
tune.bufsize 10485760
defaults
timeout server-fin 90s
timeout client-fin 90s
backend node2
mode tcp
timeout server 900s
timeout connect 10s
server def 127.0.0.1:3333
frontend fe_api
mode tcp
timeout client 900s
bind :1990
use_backend node2
Ie timeout server-fin shorter than timeout server, the backend server
sends data, this package is left in the cache of haproxy, the backend
server continue sending fin package, haproxy recv fin package. this
time the session information is as follows:
time the session information is as follows:
0x2373470: proto=tcpv4 src=127.0.0.1:39513 fe=fe_api be=node2
srv=def ts=08 age=1s calls=3 rq[f=848000h,i=0,an=00h,rx=14m58s,wx=,ax=]
rp[f=8004c020h,i=0,an=00h,rx=,wx=14m58s,ax=] s0=[7,0h,fd=6,ex=]
s1=[7,18h,fd=7,ex=] exp=14m58s
rp has set the CF_SHUTR state, next, the client sends the fin package,
session information is as follows:
0x2373470: proto=tcpv4 src=127.0.0.1:39513 fe=fe_api be=node2
srv=def ts=08 age=38s calls=4 rq[f=84a020h,i=0,an=00h,rx=,wx=,ax=]
rp[f=8004c020h,i=0,an=00h,rx=1m11s,wx=14m21s,ax=] s0=[7,0h,fd=6,ex=]
s1=[9,10h,fd=7,ex=] exp=1m11s
After waiting 90s, session information is as follows:
0x2373470: proto=tcpv4 src=127.0.0.1:39513 fe=fe_api be=node2
srv=def ts=04 age=4m11s calls=718074391 rq[f=84a020h,i=0,an=00h,rx=,wx=,ax=]
rp[f=8004c020h,i=0,an=00h,rx=?,wx=10m49s,ax=] s0=[7,0h,fd=6,ex=]
s1=[9,10h,fd=7,ex=] exp=? run(nice=0)
cpu information:
6899 root 20 0 112224 21408 4260 R 100.0 0.7 3:04.96 haproxy
Buffering is set to ensure that there is data in the haproxy buffer, and haproxy
can receive the fin package, set the CF_SHUTR flag, If the CF_SHUTR flag has been
set, The following code does not clear the timeout message, causing cpu 100%:
stream.c:process_stream:
if (unlikely((res->flags & (CF_SHUTR|CF_READ_TIMEOUT)) == CF_READ_TIMEOUT)) {
if (si_b->flags & SI_FL_NOHALF)
si_b->flags |= SI_FL_NOLINGER;
si_shutr(si_b);
}
If you have closed the read, set the read timeout does not make sense.
With or without cf_shutr, read timeout is set:
if (tick_isset(s->be->timeout.serverfin)) {
res->rto = s->be->timeout.serverfin;
res->rex = tick_add(now_ms, res->rto);
}
After discussion on the mailing list, setting half-closed timeouts the
hard way here doesn't make sense. They should be set only at the moment
the shutdown() is performed. It will also solve a special case which was
already reported of some half-closed timeouts not working when the shutw()
is performed directly at the stream-interface layer (no analyser involved).
Since the stream interface layer cannot know the timeout values, we'll have
to store them directly in the stream interface so that they are used upon
shutw(). This patch does this, fixing the problem.
An easier reproducer to validate the fix is to keep the huge buffer and
shorten all timeouts, then call it under tcploop server and client, and
wait 3 seconds to see haproxy run at 100% CPU :
global
tune.bufsize 10485760
listen px
bind :1990
timeout client 90s
timeout server 90s
timeout connect 1s
timeout server-fin 3s
timeout client-fin 3s
server def 127.0.0.1:3333
$ tcploop 3333 L W N20 A P100 F P10000 &
$ tcploop 127.0.0.1:1990 C S10000000 F
Recent fix 7bf3fa3 ("BUG/MAJOR: connection: update CO_FL_CONNECTED before
calling the data layer") marked an end to a fragile situation where the
absence of CO_FL_{CONNECTED,L4,L6}* flags is used to mark the completion
of a connection setup. The problem is that by setting the CO_FL_CONNECTED
flag earlier, we can indeed call the ->wake() function from conn_fd_handler
but the stream-interface's wake function needs to see CO_FL_CONNECTED unset
to detect that a connection has just been established, so if there's no
pending data in the buffer, the connection times out. The other ->wake()
functions (health checks and idle connections) don't do this though.
So instead of trying to detect a subtle change in connection flags,
let's simply rely on the stream-interface's state and validate that the
connection is properly established and that handshakes are completed
before reporting the WRITE_NULL indicating that a pending connection was
just completed.
This patch passed all tests of handshake and non-handshake combinations,
with synchronous and asynchronous connect() and should be safe for backport
to 1.7, 1.6 and 1.5 when the fix above is already present.
While developing an experimental applet performing only one read per full
line, it appeared that it would be woken up for the client's close, not
read all data (missing LF), then wait for a subsequent call, and would only
be woken up on client timeout to finish the read. The reason is that we
preset SI_FL_WAIT_DATA in the stream-interface's flags to avoid a fast loop,
but there's nothing which can remove this flag until there's a read operation.
We must definitely remove it in stream_int_notify() each time we're called
with CF_SHUTW_NOW because we know there will be no more subsequent read
and we don't want an applet which keeps the WANT_GET flag to block on this.
This fix should be backported to 1.7 and 1.6 though it's uncertain whether
cli, peers, lua or spoe really are affected there.
When an entity tries to get a buffer, if it cannot be allocted, for example
because the number of buffers which may be allocated per process is limited,
this entity is added in a list (called <buffer_wq>) and wait for an available
buffer.
Historically, the <buffer_wq> list was logically attached to streams because it
were the only entities likely to be added in it. Now, applets can also be
waiting for a free buffer. And with filters, we could imagine to have more other
entities waiting for a buffer. So it make sense to have a generic list.
Anyway, with the current design there is a bug. When an applet failed to get a
buffer, it will wait. But we add the stream attached to the applet in
<buffer_wq>, instead of the applet itself. So when a buffer is available, we
wake up the stream and not the waiting applet. So, it is possible to have
waiting applets and never awakened.
So, now, <buffer_wq> is independant from streams. And we really add the waiting
entity in <buffer_wq>. To be generic, the entity is responsible to define the
callback used to awaken it.
In addition, applets will still request an input buffer when they become
active. But they will not be sleeped anymore if no buffer are available. So this
is the responsibility to the applet I/O handler to check if this buffer is
allocated or not. This way, an applet can decide if this buffer is required or
not and can do additional processing if not.
[wt: backport to 1.7 and 1.6]
Commit 5fddab0 ("OPTIM: stream_interface: disable reading when
CF_READ_DONTWAIT is set") improved the connection layer's efficiency
back in 1.5-dev13 by avoiding successive read attempts on an active
FD. But by disabling this on a polled FD, it causes an unpleasant
side effect which is that the FD that was subscribed to polling is
suddenly stopped and may need to be re-enabled once the kernel
starts to slow down on data eviction (eg: saturated server at the
other end, bursty traffic caused by too large maxpollevents).
This behaviour is observable with persistent connections when there
is a large enough connection count so that there's no data in the
early connection and polling is required, because there are then
up to 4 epoll_ctl() calls per request. It's important that the
server is slower than haproxy to cause some delays when reading
response.
The current connection layer as designed in 1.6 with the FD cache
doesn't require this trick anymore, though it still benefits from
it when it saves an FD from being uselessly polled. But compared
to the increased cost of enabling and disabling poll all the time,
it's still better to disable it. In some cases it's possible to
observe a performance increase as high as 30% by avoiding this
epoll_ctl() dance.
In the end we only want to disable it when the FD is speculatively
read and not when it's polled. For this we introduce a new function
__conn_data_done_recv() which is used to indicate that we're done
with recv() and not interested in new attempts. If/when we later
support event-triggered epoll, this function will have to change
a bit to do the same even in the polled case.
A quick test with keep-alive requests run on a dual-core / dual-
thread Atom shows a significant improvement :
single process, 0 bytes :
before: Requests per second: 12243.20 [#/sec] (mean)
after: Requests per second: 13354.54 [#/sec] (mean)
single process, 4k :
before: Requests per second: 9639.81 [#/sec] (mean)
after: Requests per second: 10991.89 [#/sec] (mean)
dual process, 0 bytes (unstable) :
before: Requests per second: 16900-19800 ~ 17600 [#/sec] (mean)
after: Requests per second: 18600-21400 ~ 20500 [#/sec] (mean)
While the SI_ST_DIS state is set *after* doing the close on a connection,
it was set *before* calling release on an applet. Applets have no internal
flags contrary to connections, so they have no way to detect they were
already released. Because of this it happened that applets were closed
twice, once via si_applet_release() and once via si_release_endpoint() at
the end of a transaction. The CLI applet could perform a double free in
this case, though the situation to cause it is quite hard because it
requires that the applet is stuck on output in states that produce very
few data.
In order to solve this, we now assign the SI_ST_DIS state *after* calling
->release, and we refrain from doing so if the state is already assigned.
This makes applets work much more like connections and definitely avoids
this double release.
In the future it might be worth making applets have their own flags like
connections to carry their own state regardless of the stream interface's
state, especially when dealing with connection reuse.
No backport is needed since this issue was caused by the rearchitecture
in 1.6.
If an applet tries to write to a closed connection, it hangs forever.
This results in some "get map" commands on the CLI to leave orphaned
connections alive.
Now the applet wakeup function detects that the applet still wants to
write while the channel is closed for reads, which is the equivalent
to the common "broken pipe" situation. In this case, an error is
reported on the stream interface, just as it happens with connections
trying to perform a send() in a similar situation.
With this fix the stats socket is properly released.
This function is a callback made only for calls from the applet handler.
Rename it to remove confusion. It's currently called from the Lua code
but that's not correct, we should call the notify and update functions
instead otherwise it will not enable the applet again.
This one is not needed anymore as what it used to do is either
completely covered by the new stream_int_notify() function, or undesired
and inherited from the past as a side effect of introducing the
connections.
This update is theorically never called since it's assigned only when
nothing is connected to the stream interface. However a test has been
added to si_update() to stay safe if some foreign code decides to call
si_update() in unsafe situations.
The code to report completion after a connection update or an applet update
was almost the same since applets stole it from the connection. But the
differences made them hard to maintain and prevented the creation of new
functions doing only one part of the work.
This patch replaces the common code from the si_conn_wake_cb() and
si_applet_wake_cb() with a single call to stream_int_notify() which only
notifies the stream (si+channels+task) from the outside.
No functional change was made beyond this.
stream_int_notify() was taken from the common part between si_conn_wake_cb()
and si_applet_done(). It is designed to report activity to a stream from
outside its handler. It'll generally be used by lower layers to report I/O
completion but may also be used by remote streams if the buffer processing
is shared.
The condition to release the SI_FL_WAIT_ROOM flag was abnormally
complicated because it was inherited from 6 years ago before we used
to check for the buffer's emptiness. The CF_READ_PARTIAL flag had to be
removed, and the complex test was replaced with a simpler one checking
if *some* data were moved out or not.
The reason behind this change is to have a condition compatible with
both connections and applets, as applets currently don't work very
well in this area. Specifically, some optimizations on the applet
side cause them not to release the flag above until the buffer is
empty, which may prevent applets from taking together (eg: peers
over large haproxy buffers and small kernel buffers).
Now the call to stream_int_update() is moved to si_update(), which
is exclusively called from the stream, so that the socket layer may
be updated without updating the stream layer. This will later permit
to call it individually from other places (other tasks or applets for
example).
Now that we have a generic stream_int_update() function, we can
replace the equivalent part in stream_int_update_conn() and
stream_int_update_applet() to avoid code duplication.
There is no functional change, as the code is the same but split
in two functions for each call.
This function is designed to be called from within the stream handler to
update the channels' expiration timers and the stream interface's flags
based on the channels' flags. It needs to be called only once after the
channels' flags have settled down, and before they are cleared, though it
doesn't harm to call it as often as desired (it just slightly hurts
performance). It must not be called from outside of the stream handler,
as what it does will be used to compute the stream task's expiration.
The code was taken directly from stream_int_update_applet() and
stream_int_update_conn() which had exactly the same one except for
applet-specific or connection-specific status update.
The purpose is to separate the connection-specific parts so that the
stream-int specific one can be factored out. There's no functional
change here, only code displacement.
If we're going to call the task we don't need to call the appctx anymore
since the task may decide differently in the end and will do the proper
thing using ->update(). This reduces one wake up call per session and
may go down to half in case of high concurrency (scheduling races).
The applets don't fiddle with SI_FL_WAIT_ROOM anymore, instead they indicate
what they want, possibly that they failed (eg: WAIT_ROOM), and it's done() /
update() which finally updates the WAIT_* flags according to the channels'
and stream interface's states. This solves the issue of the pauses during a
"show sess" without creating busy loops.
Now si->update() is used to update any type of stream interface, whether
it's an applet, a connection or even nothing. We don't call si_applet_call()
anymore at the end of the resync and we don't have the risk that the
stream's task is reinserted into the run queue, which makes the code
a bit simpler.
The stream_int_update_applet() function was simplified to ensure that
it remained compatible with this standardized calling convention. It
was almost copy-pasted from the update code dedicated to connections.
Just like for si_applet_done(), it seems that it should be possible to
merge the two functions except that it would require some slow operations,
except maybe if the type of end point is tested inside the update function
itself.
The applet I/O handlers now rely on si_applet_done() which itself decides
to wake up or sleep the appctx. Now it becomes critical that applte handlers
properly call this on every exit path so that the appctx is removed from the
active list after I/O have been handled. One such call was added to the Lua
socket handler. It used to work without it probably because the main task is
woken up by the parent task but now it's needed.
This is the equivalent of si_conn_wake() but for applets. It will be
called after changes to the stream interface are brought by the applet
I/O handler. Ultimately it will release buffers and may be even wake
the stream's task up if some important changes are detected.
It would be nice to be able to merge it with the connection's wake
function since it mostly manipulates the stream interface, but there
are minor differences (such as how to enable/disable polling on a fd
vs applet) and some specificities to applets (eg: don't wake the
applet up until the output is empty) which would require abstract
functions which would slow down everything.