This one enforces a per-process connection rate limit, regardless of what
may be set per frontend. It can be a way to limit the CPU usage of a process
being severely attacked.
The side effect is that the global process connection rate is now measured
for each incoming connection, so it will be possible to report it.
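As an illustration only (the real code relies on haproxy's internal
frequency counters, not the sketch below), the measurement can be pictured
as a two-bucket sliding window updated on every accepted connection and
compared against the global rate limit :

    #include <time.h>

    /* Illustrative two-bucket rate counter; the real code uses haproxy's
     * own frequency counters. */
    struct conn_rate {
        time_t   curr_sec;  /* second currently being counted */
        unsigned curr_cnt;  /* connections accepted during curr_sec */
        unsigned prev_cnt;  /* connections accepted during the previous second */
    };

    /* account for one incoming connection and return an estimated rate/s */
    static unsigned conn_rate_update(struct conn_rate *r, time_t now)
    {
        if (now != r->curr_sec) {
            r->prev_cnt = (now == r->curr_sec + 1) ? r->curr_cnt : 0;
            r->curr_cnt = 0;
            r->curr_sec = now;
        }
        r->curr_cnt++;
        /* rough estimate: current second plus half of the previous one */
        return r->curr_cnt + r->prev_cnt / 2;
    }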
Some older libcs don't define splice() and don't define _syscall*()
either, which causes build errors if splicing is enabled.
To solve this, we now split the syscall redefinition into two layers :
- one file per syscall (epoll, splice)
- one common file to declare the _syscall*() macros
The code is cleaner because files using the syscalls just have to include
their respective file. It's not advisable to merge multiple syscall families
into the same file unless all of them are intended to be used simultaneously,
because defining unused static functions causes warnings to be emitted during
the build.
As a result, the new USE_MY_SPLICE parameter was added in order to be able
to define the splice() syscall separately.
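For illustration, assuming a build where libc lacks splice(), one possible
fallback under the USE_MY_SPLICE guard routes through syscall(2); the actual
code declares the raw syscall through the _syscall*() macros mentioned above,
and my_splice() is a hypothetical name :

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <sys/types.h>

    #if defined(USE_MY_SPLICE)
    /* hypothetical fallback when libc does not provide splice() */
    static inline ssize_t my_splice(int fd_in, loff_t *off_in,
                                    int fd_out, loff_t *off_out,
                                    size_t len, unsigned int flags)
    {
        return syscall(SYS_splice, fd_in, off_in, fd_out, off_out, len, flags);
    }
    #endif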
This global task is used to periodically check for the end of a resource shortage
and to try to enable queued listeners again. This is important in case some
temporary system-wide shortage is encountered, so that we don't have to wait
for an existing connection to be released before checking the queue again.
For situations where listeners are queued due to the global maxconn being
reached, the task is woken up at least every second. For situations where
a system resource shortage is detected (memory, sockets, ...) the task is
woken up at least every 100 ms. That way, recovery from severe events can
still be achieved under acceptable conditions.
This function is finally not needed anymore, as it has been replaced with
a per-proxy task that is scheduled when some limits are encountered on
incoming connections or when the process is stopping. The savings should
be noticeable on configs with a large number of proxies. The most important
point is that the rate limiting is now enforced in a clean and solid way.
All listeners that are limited by a proxy-specific resource are now
queued at the proxy's and not globally. This allows finer-grained
wakeups when releasing resources.
When an accept() fails because of a connection limit or a memory shortage,
we now disable it and queue it so that it's dequeued only when a connection
is released. This has improved the behaviour of the process near the fd limit
as now a listener with no connection (eg: stats) will not loop forever
trying to get its connection accepted.
The solution is still not 100% perfect, as we'd like to have this used when
proxy limits are reached (use a per-proxy list) and for safety, we'd need
to have dedicated tasks to periodically re-enable them (eg: to overcome
temporary system-wide resource limitations when no connection is released).
For listeners that are not bound to a frontend, the limit on the
number of accepted connections is tested at the end of the accept()
loop, but we don't break out of the loop, meaning that if more
connections than what the listener allows are available and if this
is less than the proxy's limits and within the size of a batch, then
they could be accepted. In practice, this problem currently cannot
appear since all listeners are bound to a frontend, and it's a very
minor issue anyway.
1.4 has the same issue (which cannot happen there either), but there
is some code after it, so it's the code cleanup which revealed it.
When an accept() returns -1 ENFILE due to system limits, it leaves the
connection pending in the backlog and epoll() comes back immediately
afterwards, trying to make us accept it again. This causes haproxy to
remain at 100% CPU until something makes an accept() possible again.
Now upon such resource shortage, we mark the listener FULL so that we
only enable it again once at least one connection has been released.
In fact we only do that if there are some active connections on this
proxy, so that it has a chance to be marked not full again. This makes
haproxy remain idle when all resources are used, which helps a lot
releasing those resources as fast as possible.
Backport to 1.4 might be desirable but difficult and tricky.
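A hedged sketch of that accept() error path; mark_listener_full() and the
active-connection test are hypothetical stand-ins for the real listener and
proxy bookkeeping :

    #include <stddef.h>
    #include <errno.h>
    #include <sys/socket.h>

    static int try_accept(int listen_fd, int proxy_has_active_conns,
                          void (*mark_listener_full)(int fd))
    {
        int fd = accept(listen_fd, NULL, NULL);

        if (fd >= 0)
            return fd;              /* success: caller handles the connection */

        switch (errno) {
        case EAGAIN:                /* nothing pending: normal, keep polling */
            break;
        case ENFILE: case EMFILE:
        case ENOBUFS: case ENOMEM:
            /* system-wide shortage: stop polling this listener so we don't
             * spin at 100% CPU; only do it if the proxy still has active
             * connections, so a release can mark it not-full again. */
            if (proxy_has_active_conns)
                mark_listener_full(listen_fd);
            break;
        default:
            break;
        }
        return -1;
    }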
It's more expensive to call splice() on short payloads than to use
recv()+send(). One of the reasons is that doing a splice() involves
allocating a pipe. One other reason is that the kernel will have to
copy itself if we try to splice less than a page. So let's set a
threshold of 4kB below which we don't splice.
A quick test shows that on chunked encoded data, with splice we had
6826 syscalls (1715 splice, 3461 recv, 1650 send) while with this
patch, the same transfer resulted in 5793 syscalls (3896 recv, 1897
send).
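In sketch form, assuming the 4kB page-size threshold above :

    /* illustrative helper: below one page, recv()+send() is cheaper than
     * allocating a pipe and splicing */
    #define SPLICE_MIN_BYTES 4096

    static int should_splice(unsigned long long bytes_to_forward)
    {
        return bytes_to_forward >= SPLICE_MIN_BYTES;
    }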
Fast-forwarding between file descriptors is nice but can be counter-productive
when only one part of the buffer is forwarded, because it can result in doubling
the number of send() syscalls. This is what happens on HTTP chunking, because
the chunk data are sent, then the CRLF + next chunk size are parsed and immediately
scheduled for forwarding. This results in two send() for the same block while a
single one would have done it.
There are some very rare server-to-server applications that abuse the HTTP
protocol and expect the payload phase to be highly interactive, with many
interleaved data chunks in both directions within a single request. This is
absolutely not supported by the HTTP specification and will not work across
most proxies or servers. When such applications attempt to do this through
haproxy, it works but they will experience high delays due to the network
optimizations which favor performance by instructing the system to wait for
enough data to be available in order to only send full packets. Typical
delays are around 200 ms per round trip. Note that this only happens with
abnormal uses. Normal uses such as CONNECT requests or WebSockets are not
affected.
When "option http-no-delay" is present in either the frontend or the backend
used by a connection, all such optimizations will be disabled in order to
make the exchanges as fast as possible. Of course this offers no guarantee on
the functionality, as it may break at any other place. But if it works via
HAProxy, it will work as fast as possible. This option should never be used
by default, and should never be used at all unless such a buggy application
is discovered. The impact of using this option is an increase of bandwidth
usage and CPU usage, which may significantly lower performance in high
latency environments.
This change should be backported to 1.4 since the first report of such a
misuse was in 1.4. Next patch will also be needed.
The FL_TCP flag was a leftover from the old days when we were using TCP_CORK.
With MSG_MORE it's not needed anymore so we can remove the condition and
noticeably simplify the test.
When sending is complete, it's preferred to systematically clear the flags
that were set for that transfer. What could happen is that the to_forward
counter had caused the MSG_MORE flag to be set and BF_EXPECT_MORE not to
be cleared, resulting in this flag being unexpectedly maintained for the
next round.
The code has taken extreme care not to do this until now, but it's not
acceptable that the caller has to know these precise semantics. So let's
unconditionally clear the flag instead.
For the sake of safety, this fix should be backported to 1.4.
Patch 5ab04ec47c9946a2bbc535687c023215ca813da0 was incomplete,
because if the first send() fails on an empty buffer, we fail
to rearm the polling and we can't establish the connection
anymore.
The issue was reported by Ben Timby who provided large amounts
of traces of various tests helping to reliably reproduce the issue.
Upon connection establishment, stream_sock is now able to send a PROXY
line before sending any data. Since it's possible that the buffer is
already full, and we don't want to allocate a block for that line, we
compute it on-the-fly when we need it. We just store the offset from
which to (re-)send from the end of the line, since it's assumed that
multiple outputs of the same proxy line will be strictly equivalent. In
practice, one call is enough. We just make sure to handle the case where
the first send() would indicate an incomplete output, even though it's
very unlikely to ever happen.
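A minimal IPv4-only sketch of the mechanism; the function name and the
stored offset are illustrative, and the real code handles partial sends
within stream_sock itself :

    #include <stdio.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>

    /* (re-)build the PROXY line on the fly and send what remains of it;
     * *sent tracks how much of the line already left. Returns 1 when the
     * whole line has been sent. */
    static int send_proxy_line(int fd, const struct sockaddr_in *src,
                               const struct sockaddr_in *dst, size_t *sent)
    {
        char line[108], s[INET_ADDRSTRLEN], d[INET_ADDRSTRLEN];
        int len;
        ssize_t ret;

        inet_ntop(AF_INET, &src->sin_addr, s, sizeof(s));
        inet_ntop(AF_INET, &dst->sin_addr, d, sizeof(d));
        len = snprintf(line, sizeof(line), "PROXY TCP4 %s %s %u %u\r\n",
                       s, d, ntohs(src->sin_port), ntohs(dst->sin_port));

        ret = send(fd, line + *sent, len - *sent, MSG_DONTWAIT);
        if (ret > 0)
            *sent += ret;
        return *sent == (size_t)len;
    }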
Now that we have the target pointer and type in the stream interface,
we don't need the applet.handler pointer anymore. That makes the code
somewhat cleaner because we know we're dealing with an applet by checking
its type instead of checking the pointer is not null.
I/O handlers are still delicate to manipulate. They have no type, they're
just raw functions which have no knowledge of themselves. Let's have them
declared as applets once for all. That way we can have multiple applets
share the same handler functions and we can store their names there. When
we later need to add more parameters (eg: usage stats), we'll be able to
do so in the applets themselves.
The CLI functions have been prefixed with "cli" instead of "stats" as it's
clearly what is going on there.
The applet descriptor in the stream interface should get all the applet
specific data (st0, ...) but this will be done in the next patch so that
we don't pollute this one too much.
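A hedged sketch of the direction this takes; the field names below are
illustrative rather than the exact haproxy structures :

    struct stream_interface;

    /* an applet now has an identity of its own: a name plus its handler,
     * which several applets may share */
    struct si_applet {
        const char *name;                        /* e.g. the CLI or stats applet */
        void (*fct)(struct stream_interface *);  /* the I/O handler itself */
        /* later: usage statistics, applet-specific state (st0, ...) */
    };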
When a client connection aborts while the server-side connection is in
turn-around after a failed connection attempt, the turn-around timeout
is reset in shutw() but the state is not changed. The session then
remains stuck in this state forever. Change the QUE and TAR states to
DIS just as we do for CER to fix this.
This patch should be backported to 1.4.
Some distros' libc are built for CPUs earlier than i686 and as such do
not offer support for Linux kernel's faster vsyscalls. This code adds
a new build option USE_VSYSCALLS to bypass libc for most commonly used
system calls. A net gain of about 10% can be observed with this change
alone.
It only works when /proc/sys/abi/vsyscall32 equals exactly 2. When it's
set to 1, the VDSO is randomized and cannot be used.
The stream_sock's accept() used to close the FD upon error, but this
was also sometimes performed by the frontend's accept() called via the
session's accept(). Those interlaced calls were also responsible for the
spaghetti-looking error unrolling code in session.c and stream_sock.c.
Now the frontend must not close the FD anymore, the session is responsible
for that. It also takes care of just closing the FD or also removing from
the FD lists, depending on its state. The socket-level accept() does not
have to care about that anymore.
Some Alert() messages were remaining in the accept() path, where they
would have no chance of being detected. Remove some of them (the impossible
ones) and replace the relevant ones with send_log() so that the admin
has a chance to catch them.
Jozsef R.Nagy reported a reliability issue on FreeBSD. Sometimes an error
would be emitted, reporting the inability to switch a socket to non-blocking
mode and the listener would definitely not accept anything. Cyril Bonté
narrowed this bug down to the call to EV_FD_CLR(l->fd, DIR_RD).
He was right because this call is wrong. It only disables input events on
the listening socket, without setting the listener to the LI_LISTEN state,
so any subsequent call to enable_listener() from maintain_proxies() is
ignored ! The correct fix consists in calling disable_listener() instead.
It is debatable whether we should keep such an error path or just ignore the
event. The goal in earlier versions was to temporarily disable new activity
in order to let the system recover while releasing resources.
This counter is incremented for each incoming connection and each active
listener, and is used to prevent haproxy from stopping upon SIGUSR1. It
will thus be possible for some tasks to increment this counter in order
to prevent haproxy from dying until they have completed their job.
After a read, there was a condition to mandatorily wake the task
up if the BF_READ_DONTWAIT flag was set. This was wrong because
the wakeup condition in this case can be deduced from the other
ones. Another condition was put on the other side not being in
SI_ST_EST state. It is not appropriate to do this because it
causes a useless wakeup at the beginning of every first request
in case of speculative polling, due to the fact that we don't
read anything and that the other side is still in SI_ST_INI.
Also, the wakeup was performed whenever to_forward was null,
which causes an unexpected wakeup upon the first read for the
same reason. However, those two conditions are valid if and
only if at least one read was performed.
Also, the BF_SHUTR flag was tested as part of the wakeup condition,
while this one can only be set if BF_READ_NULL is set too. So let's
simplify this ambiguous test by removing the BF_SHUTR part from the
condition to only process events.
Last, the BF_READ_DONTWAIT flag was unconditionally cleared,
while sometimes there would have been no I/O. Now we only clear
it once the I/O operation has been performed, which maintains
its validity until the I/O occurs.
Finally, those fixes saved approximately 16% of the per-session
wakeups and 20% of the epoll_ctl() calls, which translates into
slightly less under high load due to the request often being ready
when the read() occurs. A performance increase between 2 and 5% is
expected depending on the workload.
It does not seem necessary to backport this change to 1.4, even though
it fixes some performance issues. It may later be backported if
required to fix something else because the risk of regression seems
very low due to the fact that we're more in line with the documented
semantics.
When a connection is closed on a stream interface, some iohandlers
will need to be informed in order to release some resources. This
normally happens upon a shutr+shutw. It is the equivalent of the
fd_delete() call which is done for real sockets, except that this
time we release internal resources.
It can also be used with real sockets because it does not cost
anything else and might one day be useful.
The frontend's connection was accounted for once the session was
instantiated. This was problematic because the early ACLs weren't
able to correctly account for the number of concurrent connections.
Now we count the connection once it is assigned to the frontend.
It also brings the nice advantage of being more symmetrical, because
the stream_sock's accept() does not have to account for that anymore,
only the session's accept() does.
It's not normal to initialize the server-side stream interface from the
accept() function, because it may change later. Thus, we introduce a new
stream_sock_prepare_interface() function which is called just before the
connect() and which sets all of the stream_interface's callbacks to the
default ones used for real sockets. The ->connect function is also set
at the same instant so that we can easily add new server-side protocols
soon.
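Sketched with illustrative names only, the preparation step amounts to
resetting the server-side callbacks to the socket defaults right before
the connect() :

    struct stream_interface;

    /* hypothetical bundle of per-interface callbacks */
    struct si_ops_sketch {
        void (*update)(struct stream_interface *);
        void (*shutr)(struct stream_interface *);
        void (*shutw)(struct stream_interface *);
        int  (*connect)(struct stream_interface *);
    };

    /* called just before connect(): restore the real-socket defaults */
    static void prepare_interface_sketch(struct si_ops_sketch *si,
                                         const struct si_ops_sketch *sock_defaults)
    {
        *si = *sock_defaults;   /* all callbacks, including ->connect, set at once */
    }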
For a long time we had two large accept() functions, one for TCP
sockets instantiating proxies, and another one for UNIX sockets
instantiating the stats interface.
A lot of code was duplicated and both did not work exactly the same way.
Now we have a stream_sock layer accept() called for either TCP or UNIX
sockets, and this function calls the frontend-specific accept() function
which does the rest of the frontend-specific initialisation.
Some code is still duplicated (session & task allocation, stream interface
initialization), and might benefit from having an intermediate session-level
accept() callback to perform such initializations. Still there are some
minor differences that need to be addressed first. For instance, the monitor
nets should only be checked for proxies and not for other connection templates.
Last, we renamed l->private as l->frontend. The "private" pointer in
the listener is only used to store a frontend, so let's rename it to
eliminate this ambiguity. When we later support detached listeners
(eg: FTP), we'll add another field to avoid the confusion.
The 'client.c' file now only contained frontend-specific functions,
so it has naturally been renamed 'frontend.c'. Same for client.h. This
has also been an opportunity to remove some cross references from
files that should not have depended on it.
In the end, this file should contain protocol-agnostic accept()
code, which would initialize a session, task, etc... based on an
accept() from a lower layer. Right now there are still references
to TCP.
We get a lot of those, especially with web crawlers :
recv(2, 0x810b610, 7000, 0) = -1 ECONNRESET (Connection reset by peer)
shutdown(2, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected)
close(2) = 0
There's no need to perform the shutdown() here, the socket is already
in error so it is down.
By default we automatically wait for enough data to fill large
packets if buf->to_forward is not null. This causes a problem
with POST/Expect requests which have a data size but no data
immediately available. Instead of causing noticeable delays on
such requests, simply add a flag to disable waiting when sending
requests.
Many times we see a lot of short responses in HTTP (typically 304 on a
reload). It is a waste of network bandwidth to send that many small packets
when we know we can merge them. When we know that another HTTP request is
following a response, we set BF_EXPECT_MORE on the response buffer, which
will turn MSG_MORE on exactly once. That way, multiple short responses can
leave pipelined if their corresponding requests were also pipelined.
While it could be dangerous to enable MSG_MORE on infinite data (eg:
interactive sessions), it makes sense to enable it when we know the
chunk to be sent is just a part of a larger one.
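In sketch form, with illustrative flag names, the decision made at send()
time is simply :

    #include <sys/socket.h>

    #define SKETCH_EXPECT_MORE 0x01   /* stand-in for BF_EXPECT_MORE */

    /* ask the kernel to coalesce into full packets when more data is known
     * to follow, either from to_forward or from the expect-more hint */
    static int send_flags(unsigned int buf_flags, unsigned long long to_forward)
    {
        int flags = MSG_DONTWAIT | MSG_NOSIGNAL;

        if (to_forward || (buf_flags & SKETCH_EXPECT_MORE))
            flags |= MSG_MORE;
        return flags;
    }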
This new flag may be set by any user on a stream interface to tell
the underlying protocol that there is no need for lingering on the
socket since we know the other side either received everything or
does not care about what we sent.
This will typically be used with forced server close in HTTP mode,
where we want to quickly close a server connection after receiving
its response. Otherwise the system would prevent us from reusing
the same port for some time.
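At the socket level this typically maps to an abortive close, roughly as
below (illustrative only) :

    #include <string.h>
    #include <sys/socket.h>

    /* a zero linger timeout makes close() send an RST instead of leaving
     * the port stuck in TIME_WAIT */
    static void set_nolinger(int fd)
    {
        struct linger nolinger;

        memset(&nolinger, 0, sizeof(nolinger));
        nolinger.l_onoff  = 1;
        nolinger.l_linger = 0;
        setsockopt(fd, SOL_SOCKET, SO_LINGER, &nolinger, sizeof(nolinger));
    }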
Since we'll soon be able to close a connection with remaining data in a
buffer, it becomes obvious that we can prepare to close when we're about
to send the last chunk of data and not the whole buffer.
Since the introduction of the automatic sizing of buffers during reads,
a bug appeared where the max size could be negative, causing large
chunks of memory to be overwritten during recv() calls if a read pointer
was already past the buffer's limit.
We used to apply a limit to each buffer's size in order to leave
some room to rewrite headers, then we used to remove this limit
once the session switched to a data state.
Proceeding that way becomes a problem with keepalive because we
have to know when to stop reading too much data into the buffer
so that we can leave some room again to process next requests.
The principle we adopt here consists in only relying on to_forward+send_max.
Indeed, both of those data define how many bytes will leave the buffer.
So as long as their sum is larger than maxrewrite, we can safely
fill the buffers. If they are smaller, then we refrain from filling
the buffer. This means that we won't risk filling buffers when
reading the last data chunk followed by a POST request and its contents.
The only impact identified so far is that we must ensure that the
BF_FULL flag is correctly dropped when starting to forward. Right
now this is OK because nobody inflates to_forward without using
buffer_forward().
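A hedged sketch of the resulting rule, with illustrative names :

    /* how far a recv() may fill the buffer: as long as the bytes already
     * scheduled to leave (send_max + to_forward) cover the rewrite reserve,
     * the whole buffer may be used; otherwise keep the reserve free */
    static unsigned int recv_limit(unsigned int buf_size, unsigned int maxrewrite,
                                   unsigned long long to_forward,
                                   unsigned int send_max)
    {
        if (to_forward + send_max >= maxrewrite)
            return buf_size;
        return buf_size - maxrewrite + (unsigned int)(to_forward + send_max);
    }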
Yohan Tordjman at Dstorage found that upgrading haproxy to 1.4-dev4
caused truncated objects to be returned. An strace quickly exhibited
the issue which was 100% reproducible :
4297 epoll_wait(0, {}, 10, 0) = 0
4297 epoll_wait(0, {{EPOLLIN, {u32=7, u64=7}}}, 10, 1000) = 1
4297 splice(0x7, 0, 0x5, 0, 0xffffffffffffffff, 0x3) = -1 EINVAL (Invalid argument)
4297 shutdown(7, 1 /* send */) = 0
4297 close(7) = 0
4297 shutdown(2, 1 /* send */) = 0
4297 close(2) = 0
This is caused by the fact that the forward length is taken from
BUF_INFINITE_FORWARD, which is -1. The problem does not appear
in 32-bit mode because this value is first cast to an unsigned
long, truncating it to 32-bit (4 GB). Setting an upper bound
fixes the issue.
Also, a second error check has been added for splice. If EINVAL
is returned, we fall back to recv().
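Both points can be sketched as follows; MAX_SPLICE_AT_ONCE and the -2
return convention are illustrative, not the actual code :

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <stddef.h>

    #define MAX_SPLICE_AT_ONCE (1 << 30)   /* never pass "infinite" (-1) to splice() */

    static ssize_t splice_some(int fd_in, int pipe_wr, unsigned long long to_forward)
    {
        size_t len = (to_forward > MAX_SPLICE_AT_ONCE) ? MAX_SPLICE_AT_ONCE
                                                       : (size_t)to_forward;
        ssize_t ret = splice(fd_in, NULL, pipe_wr, NULL, len,
                             SPLICE_F_MOVE | SPLICE_F_NONBLOCK);

        if (ret < 0 && errno == EINVAL)
            return -2;   /* caller should disable splicing and fall back to recv() */
        return ret;
    }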
When processing a GET or HEAD request in close mode, we know we don't
need to read anything anymore on the socket, so we can disable it.
Doing this can save up to 40% of the recv calls, and half of the
epoll_ctl calls.
For this we need a buffer flag indicating that we're not interested in
reading anymore. Right now, this flag disables both speculative and polled reads.
We might benefit from disabling only speculative reads, but we will need
at least this flag when we want to support keepalive anyway.
Currently we don't disable the flag on completion, but it does not
matter as we close ASAP when performing the shutw().
Some rarely used information is stored in fdtab, making it larger for no
reason (source port ranges, remote address, ...). Such information
lies there because the checks can't find it anywhere else. The goal
will be to move this information to the stream interface once the
checks make use of it.
For now, we move it to an fdinfo array. This simple change might
have improved the cache hit ratio a little bit, because a 0.5%
performance increase has been measured.
In old versions, before 1.3.16, we had to refresh the timeouts after
each call to process_session() because the stream socket handler did
not do it. Now that the sockets can exchange data for a long period
without calling process_session(), we can now detect old activity and
refresh a timeout long after the last activity, causing some timeouts
to be detected too late.
The fix simply consists in not checking for activity anymore in
stream_sock_data_finish() but only setting a timeout if it was not
previously set.
By default, when data is sent over a socket, both the write timeout and the
read timeout for that socket are refreshed, because we consider that there is
activity on that socket, and we have no other means of guessing if we should
receive data or not.
While this default behaviour is desirable for almost all applications, there
exists a situation where it is desirable to disable it, and only refresh the
read timeout if there are incoming data. This happens on sessions with large
timeouts and low amounts of exchanged data such as telnet sessions. If the
server suddenly disappears, the output data accumulates in the system's
socket buffers, both timeouts are correctly refreshed, and there is no way
to know the server does not receive them, so we don't time out. However, when
the underlying protocol always echoes sent data, it would be enough by itself
to detect the issue using the read timeout. Note that this problem does not
happen with more verbose protocols because data won't accumulate long in the
socket buffers.
When this option is set on the frontend, it will disable read timeout updates
on data sent to the client. There probably is little use of this case. When
the option is set on the backend, it will disable read timeout updates on
data sent to the server. Doing so will typically break large HTTP posts from
slow lines, so use it with caution.
We had to add a new stream_interface flag : SI_FL_DONT_WAKE. This flag
is used to indicate that a stream interface is being updated and that
no wake up should be sent to its owner. This will be required for tasks
embedded into stream interfaces. Otherwise, we could have the
owner task send wakeups to itself during status updates, thus
preventing the state from converging. As long as a stream_interface's
status is being monitored and adjusted, there is no reason to wake it
up again, as we know its changes will be seen and considered.
In TCP, we don't want to forward chunks of data, we want to forward
indefinitely. This patch introduces a special value for the amount
of data to be forwarded. When buffer_forward() is called with
BUF_INFINITE_FORWARD, it configures the buffer to never stop
forwarding until the end.
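A simplified sketch of the behaviour; the structure and the special value
below are illustrative stand-ins for the real buffer code :

    #define SKETCH_INFINITE_FORWARD (~0ULL)   /* stand-in for BUF_INFINITE_FORWARD */

    struct sketch_buffer {
        unsigned long long to_forward;  /* bytes to pass along without waking the task */
    };

    /* schedule "bytes" more bytes for forwarding, or switch to infinite mode */
    static void sketch_forward(struct sketch_buffer *b, unsigned long long bytes)
    {
        if (bytes == SKETCH_INFINITE_FORWARD ||
            b->to_forward == SKETCH_INFINITE_FORWARD)
            b->to_forward = SKETCH_INFINITE_FORWARD;  /* never stop forwarding */
        else
            b->to_forward += bytes;
    }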