A session doesn't need buffers all the time, especially when they're
empty. With this patch, we don't allocate buffers anymore when the
session is initialized, we only allocate them in two cases :
- during process_session()
- during I/O operations
During process_session(), we try hard to allocate both buffers at once
so that we know for sure that a started operation can complete. Indeed,
a previous version of this patch used to allocate one buffer at a time,
but it can result in a deadlock when all buffers are allocated for
requests for example, and there's no buffer left to emit error responses.
Here, if any of the buffers cannot be allocated, the whole operation is
cancelled and the session is added at the tail of the buffer wait queue.
At the end of process_session(), a call to session_release_buffers() is
done so that we can offer unused buffers to other sessions waiting for
them.
For I/O operations, we only need to allocate a buffer on the Rx path.
For this, we only allocate a single buffer but ensure that at least two
are available to avoid the deadlock situation. In case buffers are not
available, SI_FL_WAIT_ROOM is set on the stream interface and the session
is queued. Unused buffers resulting either from a successful send() or
from an unused read buffer are offered to pending sessions during the
->wake() callback.
When a session_alloc_buffers() fails to allocate one or two buffers,
it subscribes the session to buffer_wq, and waits for another session
to release buffers. It's then removed from the queue and woken up with
TASK_WAKE_RES, and can attempt its allocation again.
We decide to try to wake as many waiters as we release buffers so
that if we release 2 and two waiters need only once, they both have
their chance. We must never come to the situation where we don't wake
enough tasks up.
It's common to release buffers after the completion of an I/O callback,
which can happen even if the I/O could not be performed due to half a
failure on memory allocation. In this situation, we don't want to move
out of the wait queue the session that was just added, otherwise it
will never get any buffer. Thus, we only force ourselves out of the
queue when freeing the session.
Note: at the moment, since session_alloc_buffers() is not used, no task
is subscribed to the wait queue.
In stream_int_register_handler(), we call si_alloc_appctx(si) but as a
mistake, instead of checking the return value for a NULL, we test <si>.
This bug was discovered under extreme memory contention (memory for only
two buffers with 500 connections waiting) and after 3 million failed
connections. While it was very hard to produce it, the fix is tagged
major because in theory it could happen when haproxy runs with a very
low "-m" setting preventing from allocating just the few bytes needed
for an appctx. But most users will never be able to trigger it. The
fix was confirmed to address the bug.
This fix must be backported to 1.5.
Commit bb2e669 ("BUG/MAJOR: http: correctly rewind the request body
after start of forwarding") was incorrect/incomplete. It used to rely on
CF_READ_ATTACHED to stop updating msg->sov once data start to leave the
buffer, but this is unreliable because since commit a6eebb3 ("[BUG]
session: clear BF_READ_ATTACHED before next I/O") merged in 1.5-dev1,
this flag is only ephemeral and is cleared once all analysers have
seen it. So we can start updating msg->sov again each time we pass
through this place with new data. With a sufficiently large amount of
data, it is possible to make msg->sov wrap and validate the if()
condition at the top, causing the buffer to advance by about 2GB and
crash the process.
Note that the offset cannot be controlled by the attacker because it is
a sum of millions of small random sizes depending on how many bytes were
read by the server and how many were left in the buffer, only because
of the speed difference between reading and writing. Also, nothing is
written, the invalid pointer resulting from this operation is only read.
Many thanks to James Dempsey for reporting this bug and to Chris Forbes for
narrowing down the faulty area enough to make its root cause analysable.
This fix must be backported to haproxy 1.5.
This commit modifies the PROXY protocol V2 specification to support headers
longer than 255 bytes allowing for optional extensions. It implements the
PROXY protocol V2 which is a binary representation of V1. This will make
parsing more efficient for clients who will know in advance exactly how
many bytes to read. Also, it defines and implements some optional PROXY
protocol V2 extensions to send information about downstream SSL/TLS
connections. Support for PROXY protocol V1 remains unchanged.
The new tune.idletimer value allows one to set a different value for
idle stream detection. The default value remains set to one second.
It is possible to disable it using zero, and to change the default
value at build time using DEFAULT_IDLE_TIMER.
Disabling the streamer flags after an idle period will help TCP proxies
to better adapt to the streams they're forwarding, especially with SSL
where this will allow the SSL sender to use smaller records. This is
typically used to optimally relay HTTP and derivatives such as SPDY or
HTTP/2 in pure TCP mode when haproxy is used as an SSL offloader.
This idea was first proposed by Ilya Grigorik on the haproxy mailing
list, and his tests seem to confirm the improvement :
https://www.mail-archive.com/haproxy@formilux.org/msg12576.html
By having the stream interface pass the CF_STREAMER flag to the
snd_buf() primitive, we're able to tell the send layer whether
we're sending large chunks or small ones.
We use this information in SSL to adjust the max record dynamically.
This results in small chunks respecting tune.ssl.maxrecord at the
beginning of a transfer or for small transfers, with an automatic
switch to full records if the exchanges last long. This allows the
receiver to parse HTML contents on the fly without having to retrieve
16kB of data, which is even more important with small initcwnd since
the receiver does not need to wait for round trips to start fetching
new objects. However, sending large files still produces large chunks.
For example, with tune.ssl.maxrecord = 2859, we see 5 write(2885)
sent in two segments each and 6 write(16421).
This idea was first proposed on the haproxy mailing list by Ilya Grigorik.
This prevents us from passing other useful info and requires the
upper levels to know these flags. Let's use a new flags category
instead : CO_SFL_*. For now, only MSG_MORE has been remapped.
We don't need to call fd_stop_both() since we already call
conn_cond_update_polling() which will do it. This call was introduced by
commit d29a066 ("BUG/MAJOR: connection: always recompute polling status
upon I/O").
We used to only update the polling flags in data phase, but after that
we could update other flags. It does not seem possible to trigger a
bug here but it's not very safe either. Better always keep them up to
date.
The recv/send callbacks must check for readiness themselves instead of
having their callers do it. This will strengthen the test and will also
ensure we never refrain from calling a handshake handler because a
direction is being polled while the other one is ready.
We simply remove these functions and replace their calls with the
appropriate ones :
- if we're in the data phase, we can simply report wait on the FD
- if we're in the socket phase, we may also have to signal the
desire to read/write on the socket because it might not be
active yet.
These flags were used to report the readiness of the file descriptor.
Now this readiness is directly checked at the file descriptor itself.
This removes the need for constantly synchronizing updates between the
file descriptor and the connection and ensures that all layers share
the same level of information.
For now, the readiness is updated in conn_{sock,data}_poll_* by directly
touching the file descriptor. This must move to the lower layers instead
so that these functions can disappear as well. In this state, the change
works but is incomplete. It's sensible enough to avoid making it more
complex.
Now the sock/data updates become much simpler because they just have to
enable/disable access to a file descriptor and not to care anymore about
its readiness.
Recent commit d7ad9f5 ("MAJOR: channel: add a new flag CF_WAKE_WRITE to
notify the task of writes") was not correct. It used to wake up the task
as soon as there was some write activity and the flag was set, even if there
were still some data to be forwarded. This resulted in process_session()
being called a lot when transfering chunk-encoded HTTP responses made of
very large chunks.
The purpose of the flag is to wake up only a task waiting for some
room and not the other ones, so it's totally counter-productive to
wake it up as long as there are data to forward because the task
will not be allowed to write anyway.
Also, the commit above was taking some risks by not considering
certain events anymore (eg: state != SI_ST_EST). While such events
are not used at the moment, if some new features were developped
in the future relying on these, it would be better that they could
be notified when subscribing to the WAKE_WRITE event, so let's
restore the condition.
Now we can more safely rely on the connection state to decide how to
drain and what to do when data are drained. Callers don't need to
manipulate the file descriptor's state anymore.
Note that it also removes the need for the fix ea90063 ("BUG/MEDIUM:
stream-int: fix the keep-alive idle connection handler") since conn_drain()
correctly sets the polling flags.
It was not possible to know if the drain() function had hit an
EAGAIN, so now we change the API of this function to return :
< 0 if EAGAIN was met
= 0 if some data remain
> 0 if a shutdown was received
Since commit 6b66f3e ([MAJOR] implement autonomous inter-socket forwarding)
introduced in 1.3.16-rc1, we've been relying on a stupid mechanism to wake
up the task after a write, which was an exact copy-paste of the reader side.
The principle was that if we empty a buffer and there's no forwarding
scheduled or if the *producer* is not in a connected state, then we wake
the task up.
That does not make any sense. It happens to wake up too late sometimes (eg,
when the request analyser waits for some room in the buffer to start to
work), and leads to unneeded wakeups in client-side keep-alive, because
the task is woken up when the response is sent, while the analysers are
simply waiting for a new request.
In order to fix this, we introduce a new channel flag : CF_WAKE_WRITE. It
is designed so that an analyser can explicitly request being notified when
some data were written. It is used only when the HTTP request or response
analysers need to wait for more room in the buffers. It is automatically
cleared upon wake up.
The flag is also automatically set by the functions which try to write into
a buffer from an applet when they fail (bi_putblk() etc...).
That allows us to remove the stupid condition above and avoid some wakeups.
In http-server-close and in http-keep-alive modes, this reduces from 4 to 3
the average number of wakeups per request, and increases the overall
performance by about 1.5%.
Since the applet rework and the removal of the inter-task applets,
we must not clear the stream-interface's owner task anymore otherwise
we risk a crash when maintaining keep-alive with an applet. This is
not possible right now so there is no impact yet, but this bug is not
easy to track down. No backport is needed.
Commit 2737562 (MEDIUM: stream-int: implement a very simplistic idle
connection manager) implemented an idle connection handler. In the
case where all data is drained from the server, it fails to disable
polling, resulting in a busy spinning loop.
Thanks to Sander Klein and Guillaume Castagnino for reporting this bug.
No backport is needed.
Idle connections are not monitored right now. So if a server closes after
a response without advertising it, it won't be detected until a next
request wants to use the connection. This is a bit problematic because
it unnecessarily maintains file descriptors and sockets in an idle
state.
This patch implements a very simple idle connection manager for the stream
interface. It presents itself as an I/O callback. The HTTP engine enables
it when it recycles a connection. If a close or an error is detected on the
underlying socket, it tries to drain as much data as possible from the socket,
detect the close and responds with a close as well, then detaches from the
stream interface.
Right now we see many places doing their own setsockopt(SO_LINGER).
Better only do it just before the close() in fd_delete(). For this
we add a new flag on the file descriptor, indicating if it's safe or
not to linger. If not (eg: after a connect()), then the setsockopt()
call is automatically performed before a close().
The flag automatically turns to safe when receiving a read0.
conn_xprt_ready() reports if the transport layer is ready.
conn_ctrl_ready() reports if the control layer is ready.
The stream interface uses si_conn_ready() to report that the
underlying connection is ready. This will be used for connection
reuse in keep-alive mode.
From now on, a call to stream_int_register_handler() causes a call
to si_alloc_appctx() and returns an initialized appctx for the
current stream interface. If one was previously allocated, it is
released. If the stream interface was attached to a connection, it
is released as well.
The appctx are allocated from the same pools as the connections, because
they're substantially smaller in size, and we can't have both a connection
and an appctx on an interface at any moment.
In case of memory shortage, the call may return NULL, which is already
handled by all consumers of stream_int_register_handler().
The field appctx was removed from the stream interface since we only
rely on the endpoint now. On 32-bit, the stream_interface size went down
from 108 to 44 bytes. On 64-bit, it went down from 144 to 64 bytes. This
represents a memory saving of 160 bytes per session.
It seems that a later improvement could be to move the call to
stream_int_register_handler() to session.c for most cases.
The task returned by stream_int_register_handler() is never used, however we
always need to access the appctx afterwards. So make it return the appctx
instead. We already plan for it to fail, which is the reason for the addition
of a few tests and the possibility for the HTTP analyser to return a status
code 500.
We used to have two very similar functions for sending a PROXY protocol
line header. The reason is that the default one relies on the stream
interface to retrieve the other end's address, while the "local" one
performs a local address lookup and sends that instead (used by health
checks).
Now that the send_proxy_ofs is stored in the connection and not the
stream interface, we can make the local_send_proxy rely on it and
support partial sends. This also simplifies the code by removing the
local_send_proxy function, making health checks use send_proxy_ofs,
resulting in the removal of the CO_FL_LOCAL_SPROXY flag, and the
associated test in the connection handler. The other flag,
CO_FL_SI_SEND_PROXY was renamed without the "SI" part so that it
is clear that it is not dedicated anymore to a usage with a stream
interface.
Till now the send_proxy_ofs field remained in the stream interface,
but since the dynamic allocation of the connection, it makes a lot
of sense to move that into the connection instead of the stream
interface, since it will not be statically allocated for each
session.
Also, it turns out that moving it to the connection fils an alignment
hole on 64 bit architectures so it does not consume more memory, and
removing it from the stream interface was an opportunity to correctly
reorder fields and reduce the stream interface's size from 160 to 144
bytes (-10%). This is 32 bytes saved per session.
The outgoing connection is now allocated dynamically upon the first attempt
to touch the connection's source or destination address. If this allocation
fails, we fail on SN_ERR_RESOURCE.
As we didn't use si->conn anymore, it was removed. The endpoints are released
upon session_free(), on the error path, and upon a new transaction. That way
we are able to carry the existing server's address across retries.
The stream interfaces are not initialized anymore before session_complete(),
so we could even think about allocating them dynamically as well, though
that would not provide much savings.
The session initialization now makes use of conn_new()/conn_free(). This
slightly simplifies the code and makes it more logical. The connection
initialization code is now shorter by about 120 bytes because it's done
at once, allowing the compiler to remove all redundant initializations.
The si_attach_applet() function now takes care of first detaching the
existing endpoint, and it is called from stream_int_register_handler(),
so we can safely remove the calls to si_release_endpoint() in the
application code around this call.
A call to si_detach() was made upon stream_int_unregister_handler() to
ensure we always free the allocated connection if one was allocated in
parallel to setting an applet (eg: detect HTTP proxy while proceeding
with stats maybe).
si_prepare_conn() is not appropriate in our case as it both initializes and
attaches the connection to the stream interface. Due to the asymmetry between
accept() and connect(), it causes some fields such as the control and transport
layers to be reinitialized.
Now that we can separately initialize these fields using conn_prepare(), let's
break this function to only attach the connection to the stream interface.
Also, by analogy, si_prepare_none() was renamed si_detach(), and
si_prepare_applet() was renamed si_attach_applet().
Currently the control and transport layers of a connection are supposed
to be initialized when their respective pointers are not NULL. This will
not work anymore when we plan to reuse connections, because there is an
asymmetry between the accept() side and the connect() side :
- on accept() side, the fd is set first, then the ctrl layer then the
transport layer ; upon error, they must be undone in the reverse order,
then the FD must be closed. The FD must not be deleted if the control
layer was not yet initialized ;
- on the connect() side, the fd is set last and there is no reliable way
to know if it has been initialized or not. In practice it's initialized
to -1 first but this is hackish and supposes that local FDs only will
be used forever. Also, there are even less solutions for keeping trace
of the transport layer's state.
Also it is possible to support delayed close() when something (eg: logs)
tracks some information requiring the transport and/or control layers,
making it even more difficult to clean them.
So the proposed solution is to add two flags to the connection :
- CO_FL_CTRL_READY is set when the control layer is initialized (fd_insert)
and cleared after it's released (fd_delete).
- CO_FL_XPRT_READY is set when the control layer is initialized (xprt->init)
and cleared after it's released (xprt->close).
The functions have been adapted to rely on this and not on the pointers
anymore. conn_xprt_close() was unused and dangerous : it did not close
the control layer (eg: the socket itself) but still marks the transport
layer as closed, preventing any future call to conn_full_close() from
finishing the job.
The problem comes from conn_full_close() in fact. It needs to close the
xprt and ctrl layers independantly. After that we're still having an issue :
we don't know based on ->ctrl alone whether the fd was registered or not.
For this we use the two new flags CO_FL_XPRT_READY and CO_FL_CTRL_READY. We
now rely on this and not on conn->xprt nor conn->ctrl anymore to decide what
remains to be done on the connection.
In order not to miss some flag assignments, we introduce conn_ctrl_init()
to initialize the control layer, register the fd using fd_insert() and set
the flag, and conn_ctrl_close() which unregisters the fd and removes the
flag, but only if the transport layer was closed.
Similarly, at the transport layer, conn_xprt_init() calls ->init and sets
the flag, while conn_xprt_close() checks the flag, calls ->close and clears
the flag, regardless xprt_ctx or xprt_st. This also ensures that the ->init
and the ->close functions are called only once each and in the correct order.
Note that conn_xprt_close() does nothing if the transport layer is still
tracked.
conn_full_close() now simply calls conn_xprt_close() then conn_full_close()
in turn, which do nothing if CO_FL_XPRT_TRACKED is set.
In order to handle the error path, we also provide conn_force_close() which
ignores CO_FL_XPRT_TRACKED and closes the transport and the control layers
in turns. All relevant instances of fd_delete() have been replaced with
conn_force_close(). Now we always know what state the connection is in and
we can expect to split its initialization.
The connection will only remain there as a pre-allocated entity whose
goal is to be placed in ->end when establishing an outgoing connection.
All connection initialization can be made on this connection, but all
information retrieved should be applied to the end point only.
This change is huge because there were many users of si->conn. Now the
only users are those who initialize the new connection. The difficulty
appears in a few places such as backend.c, proto_http.c, peers.c where
si->conn is used to hold the connection's target address before assigning
the connection to the stream interface. This is why we have to keep
si->conn for now. A future improvement might consist in dynamically
allocating the connection when it is needed.
The long-term goal is to have a context for applets as an alternative
to the connection and not as a complement. At the moment, the context
is still stored into the stream interface, and we only put a pointer
to the applet's context in si->end, initialize the context with object
type OBJ_TYPE_APPCTX, and this allows us not to allocate an entry when
deciding to switch to an applet.
A special care is taken to never dereference si->conn anymore when
dealing with an applet. That's why it's important that si->end is
always set to the proper type :
si->end == NULL => not connected to anything
*si->end == OBJ_TYPE_APPCTX => connected to an applet
*si->end == OBJ_TYPE_CONN => real connection (server, proxy, ...)
The session management code used to check the applet from the connection's
target. Now it uses the stream interface's end point and does not touch the
connection at all. Similarly, we stop checking the connection's addresses
and file descriptors when reporting the applet's status in the stats dump.
Since last commit, we now have a pointer to the applet in the
applet context. So we don't need the si->release function pointer
anymore, it can be extracted from applet->applet.release. At many
places, the ->release function was still tested for real connections
while it is only limited to applets, so most of them were simply
removed. For the remaining valid uses, a new inline function
si_applet_release() was added to simplify the check and the call.
si_prepare_embedded() was used both to attach an applet and to detach
anything from a stream interface. Split it into si_prepare_none() to
detach and si_prepare_applet() to attach an applet.
si->conn->target is now assigned from within these two functions instead
of their respective callers.
This is to be more consistent with the other functions. The only
reason why these functions used to return a value was to let the
caller adjust polling by itself, but now their only callers were
the si_shutr()/si_shutw() inline functions. Now these functions
do not depend anymore on the connection.
These connection variant of these functions now call
conn_data_stop_recv()/conn_data_stop_send() before returning order
not to require a return code anymore. The applet version does not
need this at all.
These functions induce a lot of ifs everywhere because they consider two
different cases, one which is where the connection exists and has a file
descriptor, and the other one which is the default case where at most an
applet has to be notified.
Let's have them in si_ops and automatically decide which one to use.
The connection shutdown sequence has been slightly simplified, and we
now clear the flags at the end.
Also we remove SHUTR_NOW after a shutw with nolinger, as it's cleaner
not to keep it.
When we get a hard error from a syscall indicating the socket is dead,
it makes sense to set the CO_FL_SOCK_WR_SH and CO_FL_SOCK_RD_SH flags
to indicate that the socket may not be used anymore. It will ease the
error processing in health checks where the state of socket is very
important. We'll also be able to avoid some setsockopt(nolinger) after
an error.
For now, the rest of the code is not impacted because CO_FL_ERROR is
always tested prior to these flags.
PROXY protocol header was not tolerant to signals, so it might cause a
connection to report an error if a signal comes in at the exact same
moment the send is done.
This is 1.5-specific and does not need any backport.
The connection flag CO_FL_ERROR will be tested in the functions both
si_conn_recv_cb() and si_conn_send_cb(). If CO_FL_ERROR has been set, out_error
branch will be executed. But the only job of out_error branch is to set
CO_FL_ERROR on connection flag. So it's better return directly than goto
out_error branch under such conditions. As a result, out_error branch becomes
needless and can be removed.
In addition, the return type of si_conn_send_loop() is also changed to void.
The caller should check conn->flags for errors just like stream_int_chk_snd_conn()
does as below:
static void stream_int_chk_snd_conn(struct stream_interface *si)
{
...
conn_refresh_polling_flags(si->conn);
- if (si_conn_send(si->conn) < 0) {
+ si_conn_send(si->conn);
+ if (si->conn->flags & CO_FL_ERROR) {
...
}
Signed-off-by: Godbach <nylzhaowei@gmail.com>
Though si_conn_send_loop() does not loop over ->snd_buf() after commit ed7f836,
there is still some codes left which use `while` but only execute once. This
commit does the cleanup job and rename si_conn_send_loop() to si_conn_send().
Signed-off-by: Godbach <nylzhaowei@gmail.com>
David Berard reported that send-proxy was broken on FreeBSD and tracked the
issue to be an error returned by send(). We already had the same issue in
the past in another area which was addressed by the following commit :
0ea0cf6 BUG: raw_sock: also consider ENOTCONN in addition to EAGAIN
In fact, on Linux send() returns EAGAIN when the connection is not yet
established while other OSes return ENOTCONN. Let's consider ENOTCONN for
send-proxy there as the same as EAGAIN.
David confirmed that this change properly fixed the issue.
Another place was affected as well (health checks with send-proxy), and
was fixed.
This fix does not need any backport since it only affects 1.5.
Splicing is avoided for small transfers because it's generally cheaper
to perform a couple of recv+send calls than pipe+splice+splice. This
has the consequence that the last chunk of a large transfer may be
transferred using recv+send if it's less than 4 kB. But when the pipe
is already set up, it's better to use splice() to read the pending data,
since they will get merged with the pending ones. This is what now
happens everytime the reader is slower than the writer.
Note that this change alone could have fixed most of the CPU hog bug,
except at the end when only the close was pending.
As explained in previous patch, we incorrectly call chk_snd() when
performing a read even if the write event is already subscribed to
poll(). This is counter-productive because we're almost sure to get
an EAGAIN.
A quick test shows that this fix halves the number of failed splice()
calls without adding any extra work on other syscalls.
This could have been tagged as an improvement, but since this behaviour
made the analysis of previous bug more complex, it still qualifies as
a fix.
Mark Janssen reported an issue in 1.5-dev19 which was introduced
in 1.5-dev12 by commit 96199b10. From time to time, randomly, the
CPU usage spikes to 100% for seconds to minutes.
A deep analysis of the traces provided shows that it happens when
waiting for the response to a second pipelined HTTP request, or
when trying to handle the received shutdown advertised by epoll()
after the last block of data. Each time, splice() was involved with
data pending in the pipe.
The cause of this was that such events could not be taken into account
by splice nor by recv and were left pending :
- the transfer of the last block of data, optionally with a shutdown
was not handled by splice() because of the validation that to_forward
is higher than MIN_SPLICE_FORWARD ;
- the next recv() call was inhibited because of the test on presence
of data in the pipe. This is also what prevented the recv() call
from handling a response to a pipelined request until the client
had ACKed the previous response.
No less than 4 different methods were experimented to fix this, and the
current one was finally chosen. The principle is that if an event is not
caught by splice(), then it MUST be caught by recv(). So we remove the
condition on the pipe's emptiness to perform an recv(), and in order to
prevent recv() from being used in the middle of a transfer, we mark
supposedly full pipes with CO_FL_WAIT_ROOM, which makes sense because
the reason for stopping a splice()-based receive is that the pipe is
supposed to be full.
The net effect is that we don't wake up and sleep in loops during these
transient states. This happened much more often than expected, sometimes
for a few cycles at end of transfers, but rarely long enough to be
noticed, unless a client timed out with data pending in the pipe. The
effect on CPU usage is visible even when transfering 1MB objects in
pipeline, where the CPU usage drops from 10 to 6% on a small machine at
medium bandwidth.
Some further improvements are needed :
- the last chunk of a splice() transfer is never done using splice due
to the test on to_forward. This is wrong and should be performed with
splice if the pipe has not yet been emptied ;
- si_chk_snd() should not be called when the write event is already being
polled, otherwise we're almost certain to get EAGAIN.
Many thanks to Mark for all the traces he cared to provide, they were
essential for understanding this issue which was not reproducible
without.
Only 1.5-dev is affected, no backport is needed.
Commit 7bb68abb introduced the SI_FL_NOHALF flag in dev10. It is used
to automatically close the write side of a connection whose read side
is closed. But the patch also caused the opposite to happen, which is
that a simple shutw() call would immediately close the connection. This
is not desired because when using option abortonclose, we want to pass
the client's shutdown to the server which will decide what to do with
it. So let's avoid the close when SHUTR is not set.
The code review during the chase for the POST freeze uncovered another possible
issue which might appear when we perform an incomplete read and want to stop because
of READ_DONTWAIT or because we reached the maximum read_poll limit. Reading is
disabled but SI_FL_WAIT_ROOM was not set, possibly causing some cases where a
send() on the other side would not wake the reader up until another activity
on the same side calls the update function which fixes its status.
Since the changes in connection management, it became necessary to re-enable
polling after a fast-forward transfer would complete.
One such issue was addressed after dev12 by commit 9f7c6a18 (BUG/MAJOR:
stream_interface: certain workloads could cause get stuck) but unfortunately,
it was incomplete as very subtle cases would occasionally remain unaddressed
when a buffer was marked with the NOEXP flag, which is used during POST
uploads. The wake up must be performed even when the flag is there, the
flag is used only to refresh the timeout.
Too many conditions need to be hit together for the situation to be
reproducible, but it systematically appears for some users.
It is particularly important to credit Sander Klein and John Rood from
Picturae ICT ( http://picturae.com/ ) for reporting this bug on the mailing
list, providing configs and countless traces showing the bug in action, and
for their patience testing litteraly tens of snapshots and versions of
supposed fixes during a full week to narrow the commit range until the bug
was really knocked down! As a side effect of their numerous tests, several
other bugs were fixed.
stream_int_chk_rcv_conn() did not clear connection flags before updating them. It
is unsure whether this could have caused the stalled transfers that have been
reported since dev15.
In order to avoid such further issues, we now use a simple inline function to do
all the job.
Back in the days where polling was made with select() where all FDs
were checked at once, stream_int_chk_snd_conn() used to check whether
the file descriptor it was passed was ready or not, so that it did
not perform the work for nothing.
Right now FDs are checked just before calling the I/O handler so this
test never matches at best, or may return false information at worst.
Since conn_fd_handler() always clears the flags upon exit, it looks
like a missed event cannot happen right now. Still, better remove
this outdated check than wait for it to cause issues.