The remains of the stats socket code has nothing to do in proto_uxst
anymore and must move to dumpstats. The code is much cleaner and more
structured. It was also an opportunity to rename AN_REQ_UNIX_STATS
as AN_REQ_STATS_SOCK as the stats socket is no longer unix-specific
either.
The last item refering to stats in proto_uxst is the setting of the
task's nice value which should in fact come from the listener.
process_session() is now ready to handle unix stats sockets. This
first step works and old code has not been removed. A cleanup is
required. The stats handler is not unix socket-centric anymore and
should move to dumpstats.c.
When a stream interface has no connect() function, it means it is
immediately connected, so we don't need any connection request.
This will be used with unix sockets.
In order to merge the unix session handling code, we have to maintain
the number of per-listener connections in the session. This was only
performed for unix sockets till now.
Creating a frontend for the global stats socket will help merge
unix sockets management with the other socket management. Since
frontends are huge structs, we only allocate it if required.
The connection establishment was completely handled by backend.c which
normally just handles LB algos. Since it's purely TCP, it must move to
proto_tcp.c. Also, instead of calling it directly, we now call it via
the stream interface, which will later help us unify session handling.
Andrew Azarov reported that haproxy-1.4-dev1 does not build
under FreeBSD 7.2 because SOL_TCP is not defined. So add a
check for its definition before using it. This only impacts
network optimisations anyway.
This Linux-specific option was never really used in production and
has since been superseded by new splicing options brought by recent
Linux kernels.
It caused several particular cases in the code because the kernel
would take care of the session without haproxy being able to do
anything on it, which became hard to handle in the new architecture.
Let's simply get rid of it now that there is a replacement available.
The new "node-name" stats setting enables reporting of a node ID on
the stats page. It is possible to return the system's host name as
well as a specific name.
Romuald du Song reported a strange bug causing "option tcplog" to
unexpectedly use global log parameters if no log server was declared.
Eventhough it can be useful in some circumstances, it only hides
configuration bugs and can even cause traffic logs to be sent to
the wrong logger, since global settings are just for the process.
This has been fixed and a warning has been added for configurations
where tcplog or httplog are set without any logger. This fix must
be backported to 1.3.20, but not to 1.3.15.X in order not to risk
any regression on old configurations.
Cristian Ditoiu reported a major regression when testing 1.3.19 at
transfer.ro. It would crash within a few minutes while 1.3.15.10
was OK. He offered to help so we could run gdb and debug the crash
live. We finally found that the crash was the result of a regression
introduced by recent fix 814c978fb6
(task: fix possible timer drift after update) which makes it possible
for a tree walk to start from a detached task if this task has got
its timeout disabled due to a missing timeout.
The trivial fix below has been extensively tested and confirmed not
to crash anymore.
Special thanks to Cristian who spontaneously provided a lot of help
and trust to debug this issue which at first glance looked impossible
after reading the code and traces, but took less than an hour to spot
and fix when caught live in gdb ! That's really appreciated !
During a direct data transfer from the server to the client, if the
system did not have enough buffers anymore, haproxy would not enable
write polling again if it could write at least one data chunk. Under
normal conditions, this would remain undetected because the remaining
data would be pushed by next data chunks.
However, when this happens on the last chunk of a session, or the last
in a series in an interactive bidirectional TCP transfer, haproxy would
only start sending again when the read timeout was reached on the side
it stopped writing, causing long pauses on some protocols such as SQL.
This bug was reported by an Exceliance customer who generously offered
to help us by sending large amounts of traces and running various tests
on production systems.
It is quite hard to trigger it but it becomes easier with a ping-pong
TCP service which transfers random data sizes, with a modified version
of send() able to send packets smaller than the average transfer size.
A cleaner fix would imply only updating the write timeout when data
transfers are *attempted*, not succeeded, but that requires more
sensible code changes without fixing the result. It is a candidate
for a later patch though.
I've discovered a configuration with lots of occurrences of the
following :
acl xxx hdr_beg (host) xxx
The problem is that hdr_beg will match every header against patterns
(host) and xxx due to the space between both, which certainly is not
what the user wanted. Now we detect such ACLs and report a warning
with a suggestion to add "--" between "hdr_beg" and "(host)" if this
is definitely what is wanted.
When issuing commands on the unix socket, there's no way to
know if the result is empty or if the command is wrong. This
patch makes invalid command return a help message.
AIX wants string.h in signal.c (and is right to do so) :
gcc -Iinclude -Wall -O2 -g -DTPROXY -DENABLE_POLL -DCONFIG_HAPROXY_VERSION=\"1.3.18\" -DCONFIG_HAPROXY_DATE=\"2009/05/10\" -c -o src/signal.o src/signal.c
src/signal.c: In function 'signal_init':
src/signal.c:32: warning: implicit declaration of function 'memset'
src/signal.c:32: warning: incompatible implicit declaration of built-in function 'memset'
Do not exit early at the first error found while checking configuration
validity. This particularly helps spotting multiple wrong tracked server
names at once.
Try not to immediately exit on non-fatal errors while parsing a
listen section, so that the user has a chance to get most of the
errors at once, which is quite convenient especially during config
checks with the -c argument.
Try not to immediately exit on non-fatal errors while parsing the
global section, so that the user has a chance to get most of the
errors at once, which is quite convenient especially during config
checks with the -c argument. Some other errors such as unresolved
server names also don't make the parser exit too early.
MSIE does not correctly display spaced digits. It requires a margin of
at least one pixel. Also, it does not correctly hide empty cells, so we
work around this by setting the background white. Last, the H1 font was
too large, so we reduce it by one size, which is still OK in other
browsers.
We should respect tcp-smart-connect for checks too. First it reduces
the traffic, and second it ensures that the checks see the same thing
as the production traffic, which is better for debugging.
When the scheduler detected that a task was misplaced in the timer
queue, it used to place it right again. Unfortunately, it did not
check whether it would still call the new task from its new place.
This resulted in some tasks not getting called on timeout once in
a while, causing a minor drift for repetitive timers. This effect
was only observable with slow health checks and without any activity
because no other task would cause the scheduler to be immediately
called again.
In practice, it does not affect any real-world configuration, but
it's still better to fix it.
As reported by Maik Broemme, if something different from "if" or
"unless" was specified after "tcp-request content accept", the
condition would silently remain void. The parser must obviously
complain since this typically corresponds to a forgotten "if".
As reported by Jean-Baptiste Quenot and Robbie Aelter, sometimes a
backend server error is converted to a 502 error if the backend stops
before reading all the request. The reason is that the remote system
sends a TCP RST packet because there are still unread data pending in
the socket buffer. This RST is translated as a socket error on the
local system, and this error is reported by the poller.
However, most of the time, it's a write error, but the system is
still able to read the remaining pending data, such as in the trace
below :
send(7, "GET /aaa HTTP/1.0\r\nUser-Agent: Mo"..., 1123, MSG_DONTWAIT|MSG_NOSIGNAL) = 1123
epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
epoll_wait(3, {{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}}, 8, 1000) = 1
gettimeofday({1247593958, 643572}, NULL) = 0
recv(7, "HTTP/1.0 400 Bad request\r\nCache-C"..., 7000, MSG_NOSIGNAL) = 187
setsockopt(6, SOL_TCP, TCP_NODELAY, [0], 4) = 0
setsockopt(6, SOL_TCP, TCP_CORK, [1], 4) = 0
send(6, "HTTP/1.0 400 Bad request\r\nCache-C"..., 187, MSG_DONTWAIT|MSG_NOSIGNAL) = 187
shutdown(6, 1 /* send */) = 0
The recv succeeded while epoll_wait() reported an error.
Note: This case is very hard to reproduce and requires that the backend
server is reached via the loopback in order to minimise latency and
reduce the risk of sent data being ACKed.
When we close a socket with unread data in the buffer, or when the
nolinger option is set, we regularly lose the last fragment, which
often contains the error message. This typically occurs when sending
too large a request. Only the RST is seen due to the close() (since
not all data were read) and the output message never reaches the
network.
Doing a shutdown() before the close() solves this annoying issue
because the data are really pushed before the system sends the RST.
The new statement "persist rdp-cookie" enables RDP cookie
persistence. The RDP cookie is then extracted from the RDP
protocol, and compared against available servers. If a server
matches the RDP cookie, then it gets the connection.
This patch adds support for hashing RDP cookies in order to
use them as a load-balancing key. The new "rdp-cookie(name)"
load-balancing metric has to be used for this. It is still
mandatory to wait for an RDP cookie in the frontend, otherwise
it will randomly work.
The RDP protocol is quite simple and documented, which permits
an easy detection and extraction of cookies. It can be useful
to match the MSTS cookie which can contain the username specified
by the client.
Since we can call the HTTP parser from TCP inspection rules, it makes
sense to be able to use the HTTP ACLs with it. That way, we can decide
from a TCP frontend to take a switching decision based on full layer7
decoding. This might be useful to perform layer7 content switching from
a layer4 frontend in fact. For instance, we might want to be able to
detect http/https on a frontend, but still switch to backend X or Y
depending on the Host header. Note that it is mandatory to wait for
an HTTP request otherwise the ACLs will randomly match.
Since we can now switch from TCP to HTTP, we need to be able to apply
the HTTP request timeout after switching. That means we need to take
it from the backend and not from the frontend. Since the backend points
to the frontend before switching, that changes nothing for the normal
case.
In case of switching from TCP to HTTP, we want the HTTP request timeout
to be properly initialized. For this, we have to jump to the analyser
without breaking out of the loop nor waiting for incoming data. The way
it is done right now is not particularly clean but it works.
A cleaner method might involve pushing function pointers into a circular
list.
This patch allows a TCP frontend to switch to an HTTP backend.
During the switch, missing structures are automatically allocated.
The HTTP parser is enabled so that the backend first waits for a
full HTTP request.
Now that we can perform TCP-based content switching, it makes sense
to be able to detect HTTP traffic and act accordingly. We already
have an HTTP decoder, we just have to call it in order to detect HTTP
protocol. Note that since the decoder will automatically fill in the
interesting fields of the HTTP transaction, it would make sense to
use this parsing to extend HTTP matching to TCP.
Right now only HTTP proxies may use HTTP headers in ACLs, but
when this evolves, we'll need to be able to allocate the hdr_idx
on demand. The solution consists in allocating it only when it is
certain that at least one ACL requires HTTP parsing, regardless
of the mode the proxy is in. This is what is achieved by this
patch.
This patch propagates the ACL conditions' "requires" bitfield
to the proxies. This makes it possible to know exactly what a
proxy might have to support for any request, which helps knowing
whether we have to allocate some space for certain types of
structures or not (eg: the hdr_idx struct).
The concept might be extended to a lot more types of information,
such as detecting whether we need to allocate some space for some
request ACLs which need a result in the response, etc...
The HTTP processing has been splitted into 7 steps, one of which
is not anymore HTTP-specific (content-switching). That way, it
becomes possible to use "use_backend" rules in TCP mode. A new
"use_server" directive should follow soon.
Some stream analysers might become generic enough to be called
for several bits. So we cannot have the analyser bit hard coded
into the analyser itself. Let's make the caller inform the callee.
We want to split several steps in HTTP processing so that
we can call individual analysers depending on what processing
we want to perform. The first step consists in splitting the
part that waits for a request from the rest.
redirect rules are documented as being processed last before
use_backend but were mistakenly processed before block rules.
Fortunately very few people use a mix of block and redirect
rules, so this bug has never been reported yet.
The splice code did not consider compatibility between both ends
of the connection. Now we set different capabilities on each
stream interface, depending on what the protocol can splice to/from.
Right now, only TCP is supported. Thanks to this, we're now able to
automatically detect when splice() is not implemented and automatically
disable it on one end instead of reporting errors to the upper layer.
It will soon be necessary to support permanent analysers (eg: HTTP in
keep-alive mode). We first have to slightly rework the call to the
request analysers so that we don't force ->analysers to be 0 before
forwarding data.
When the nolinger option is used, we must not close too fast because
some data might be left unsent. Instead we must proceed with a normal
shutdown first, then a close. Also, we want to avoid merging FIN with
the last segment if nolinger is set, because if that one gets lost,
there is no chance for it to be retransmitted.
We now support up to 10 distinct configuration files. They are
all loaded in the order defined by -f <file1> -f <file2> ...
This can be useful in order to store global, private, public,
etc... configurations in distinct files.
This is a first step towards support of multiple configuration files.
Now readcfgfile() only reads a file in memory and performs very minimal
parsing. The checks are performed afterwards.
Buffer errors (timeouts and I/O errors) were handled at two places,
just after the analysers and after again.
Now that the timeout detection has moved, it has become easier to
handle those errors.
This has also made it possible for the request and response analysers
to be processed together as a down-up event, and all the up-down I/O
updates to be processed afterwards, which is exactly what we're looking
for. Interestingly this has reduced the number of iterations of
(stream_int, req_resp) from (5,6,5) to (5,5,4).
Several tests have been run without any issue found.
It's useless to check for buffer timeouts every time we call
process_session() because we already control when we set the flag. So
let's check them at the precise moment where the flag is set.
We want to be able to keep information about errors and timeouts
as long as possible in the buffer. Let's not clear these flags
anymore and keep them static. This does not seem to cause any
trouble, though a finer review might be wise.
Sometimes it can be useful to limit the advertised TCP MSS on
incoming connections, for instance when requests come through
a VPN or when the system is running with jumbo frames enabled.
Passing the "mss <value>" arguments to a "bind" line will set
the value. This works under Linux >= 2.6.28, and maybe a few
earlier ones, though due to an old kernel bug most of earlier
versions will probably ignore it. It is also possible that some
other OSes will support this.
This new option enables combining of request buffer data with
the initial ACK of an outgoing TCP connection. Doing so saves
one packet per connection which is quite noticeable on workloads
mostly consisting in small objects. The option is not enabled by
default.
Setting TCP_CORK on a socket before sending the last segment enables
automatic merging of this segment with the FIN from the shutdown()
call. Playing with TCP_CORK is not easy though as we have to track
the status of the TCP_NODELAY flag since both are mutually exclusive.
Doing so saves one more packet per session and offers about 5% more
performance.
There is no reason not to do it, so there is no associated option.
This option disables TCP quick ack upon accept. It is also
automatically enabled in HTTP mode, unless the option is
explicitly disabled with "no option tcp-smart-accept".
This saves one packet per connection which can bring reasonable
amounts of bandwidth for servers processing small requests.
A new keyword prefix "default" has been introduced in order to
reset some options to their default values. This can be needed
for instance when an option is forced disabled or enabled in a
defaults section and when later sections want to use automatic
settings regardless of what was specified there. Right now it
is only supported by options, just like the "no" prefix.
Sometimes we would want to implement implicit default options,
but for this we need to be able to disable them, which requires
to keep track of "no option" settings. With this change, an option
explicitly disabled in a defaults section will still be seen as
explicitly disabled. There should be no regression as nothing makes
use of this yet.
Some users are already hitting the 64k source port limit when
connecting to servers. The system usually maintains a list of
unused source ports, regardless of the source IP they're bound
to. So in order to go beyond the 64k concurrent connections, we
have to manage the source ip:port lists ourselves.
The solution consists in assigning a source port range to each
server and use a free port in that range when connecting to that
server, either for a proxied connection or for a health check.
The port must then be put back into the server's range when the
connection is closed.
This mechanism is used only when a port range is specified on
a server. It makes it possible to reach 64k connections per
server, possibly all from the same IP address. Right now it
should be more than enough even for huge deployments.
When a new process fails to grab some ports, it sends a signal to
the old process in order to release them. Then it tries to bind
again. If it still fails (eg: one of the ports is bound to a
completely different process), it must send the continue signal
to the old process so that this one re-binds to the ports. This
is correctly done, but the newly bound ports are not released
first, which sometimes causes the old process to remain running
with no port bound. The fix simply consists in unbinding all
ports before sending the signal to the old process.
It is recommended to have -D in init scripts, but -D also implies
quiet mode, which hides warning messages, and both options are now
completely unrelated. Remove the implication to get warnings with
-D.
The stats HTML output were barely readable on some browsers such as
firefox on Linux, due to the selected helvetica font which is too
small. Specifying "arial" first fixes the issue without changing the
table size. Also, the default size of 0.8em choosen to get 10px out
of 12px is wrong because it gets 9px when rounded down.
Some users want to keep the max sessions/s seen on servers, frontends
and backends for capacity planning. It's easy to grab it while the
session count is updated, so let's keep it.
Some people are using haproxy in a shared environment where the
system logger by default sends alert and emerg messages to all
consoles, which happens when all servers go down on a backend for
instance. These people can not always change the system configuration
and would like to limit the outgoing messages level in order not to
disturb the local users.
The addition of an optional 4th field on the "log" line permits
exactly this. The minimal log level ensures that all outgoing logs
will have at least this level. So the logs are not filtered out,
just set to this level.
There is a patch made by me that allow for balancing on any http header
field.
[WT:
made minor changes:
- turned 'balance header name' into 'balance hdr(name)' to match more
closely the ACL syntax for easier future convergence
- renamed the proxy structure fields header_* => hh_*
- made it possible to use the domain name reduction to any header, not
only "host" since it makes sense to do it with other ones.
Otherwise patch looks good.
/WT]
Since 1.3.17, a config containing one of the following lines would
crash the parser :
tcp content reject
tcp content accept
This is because a check is performed on the condition which is not
specified. The obvious fix consists in checkinf for a condition
first.
Some big traffic sites have trouble dealing with logs and tend to
disable them. Here are two new options to help cope with massive
logs.
- dontlog-normal only disables logging for 100% successful
connections, other ones will still be logged
- log-separate-errors will cause non-100% successful connections
to be logged at level "err" instead of level "info" so that a
properly configured syslog daemon can send them to a different
file for longer conservation.
epoll, sepoll and kqueue pollers should check that their fd is not
closed before attempting to close it, otherwise we can end up with
multiple closes of fd #0 upon exit, which is harmless but dirty.
The small list of signals currently handled by haproxy were processed
as soon as they were received. This has caused trouble with calls to
pool_gc2() occuring in the middle of libc's memory management functions
seldom causing deadlocks preventing the old process from leaving.
Now these signals use the new async signal framework and are called
asynchronously, when there is no risk of recursion. This ensures more
reliable operation, especially for sensible processing such as memory
management.
If an asynchronous signal is received outside of the poller, we don't
want the poller to wait for a timeout to occur before processing it,
so we set its timeout to zero, just like we do with pending tasks in
the run queue.
These functions will be used to deliver asynchronous signals in order
to make the signal handling functions more robust. The goal is to keep
the same interface to signal handlers.
I have attached a patch which will add on every http request a new
header 'X-Original-To'. If you have HAProxy running in transparent mode
with a big number of SQUID servers behind it, it is very nice to have
the original destination ip as a common header to make decisions based
on it.
The whole thing is configurable with a new option 'originalto'. I have
updated the sourcecode as well as the documentation. The 'haproxy-en.txt'
and 'haproxy-fr.txt' files are untouched, due to lack of my french
language knowledge. ;)
Also the patch adds this header for IPv4 only. I haven't any IPv6 test
environment running here and don't know if getsockopt() with SO_ORIGINAL_DST
will work on IPv6. If someone knows it and wants to test it I can modify
the diff. Feel free to ask me questions or things which should be changed. :)
--Maik
The pointer arithmetics was wrong in http_capture_bad_message().
This has no impact right now because the error only msg->som was
affected and right now it's always 0. But this was a bug waiting
for keepalive support to strike.
The response message in the transaction structure was not properly
initialised at session initialisation. In theory it cannot cause any
trouble since the affected field os expected to always remain NULL.
However, in some circumstances, such as building on 64-bit platforms
with certain options, the struct session can be exactly 1024 bytes,
the same size of the requri field, so the pools are merged and the
uninitialised field may contain non-null data, causing crashes if
an invalid response is encountered and archived.
The fix simply consists in correctly initialising the missing fields.
This bug cannot affect architectures where the session pool is not
shared (32-bit architectures), but this is only by pure luck.
A race condition exists in the hot reconfiguration code. It is
theorically possible that the second signal is sent during a free()
in the first list, which can cause crashes or freezes (the later
have been observed). Just set up a counter to ensure we do not
recurse.
The byte counters have long been 64-bit to avoid overflows. But with
several sites nowadays, we see session counters wrap around every 10-days
or so. So it was the moment to switch counters to 64-bit, including
error and warning counters which can theorically rise as fast as session
counters even if in practice there is very low risk.
The performance impact should not be noticeable since those counters are
only updated once per session. The stats output have been carefully checked
for proper types on both 32- and 64-bit platforms.
It's useful to be able to accept an invalid header name in a request
or response but still be able to monitor further such errors. Now,
when an invalid request/response is received and accepted due to
an "accept-invalid-http-{request|response}" option, the invalid
request will be captured for later analysis with "show errors" on
the stats socket.
Sometimes it is required to let invalid requests pass because
applications sometimes take time to be fixed and other servers
do not care. Thus we provide two new options :
option accept-invalid-http-request (for the frontend)
option accept-invalid-http-response (for the backend)
When those options are set, invalid requests or responses do
not cause a 403/502 error to be generated.
This function sets CSS letter spacing after each 3rd digit. The page must
create a class "rls" (right letter spacing) with style "letter-spacing: 0.3em"
in order to use it.
Under some circumstances, it appears possible to refresh a timeout
just after a side has been shut. For instance, if poll() plans to
call both read and write, and the read side calls chk_snd() which
in turn causes a shutw to occur, then stream_sock_write could update
its write timeout. The same problem happens the other way.
The timeout checks will then not catch these cases because they
ignore timeouts in case of shut{r,w}.
This is very likely to be the major cause of the 100% CPU usages
reported by Bart Bobrowski.
The fix consists in always ensuring that a side is not shut before
updating its timeout.
For complex troubleshooting, it's sometimes useful to be able to
completely dump all the states and flags related to a session.
Now "show sess" will report the stream interfaces and buffers
status for each session.
sepoll counts the number of speculative events it has processed in
order to remain fair with epoll_wait(). If a same FD is processed
both for read and for write, it is counted twice. Fix this.
Upon read or write error, we cannot immediately close the FD because
we want to first report the error to the upper layer which will do it
itself. However, we want to prevent any further I/O from being performed
on the FD. This is especially important in case of speculative I/O where
nothing else could stop the FD from still being polled until the upper
layer takes care of the condition.
Some I/O callbacks are able to close their socket themselves. We
want to check this before calling epoll_ctl(EPOLL_CTL_DEL), otherwise
we get a -1 EBADF. Right now is looks like this could not cause any
trouble but the case is racy enough to fix it.
unix sockets are not attached to a real frontend, so there is
no way to disable/enable the listener depending on the global
session count. For this reason, if the global maxconn is reached
and a unix socket comes in, it will just be ignored and remain
in the poll list, which will call again indefinitely.
So we need to accept then drop incoming unix connections when
the table is full.
This should not happen with clean configurations since the global
maxconn should provide enough room for unix sockets.
The stream_interface timeout was not reset upon a connect success or
error, leading to busy loops when requeuing tasks in the past.
Thanks to Bart Bobrowski for reporting the issue.
There is already an optimisation in the speculative poller which
causes newly created FDs to be checked immediately after being
created. Unfortunately, this optimisation causes the whole spec
list to be re-checked while we're only interested in the new FDs.
Doing this minor change causes performance gains of up to 6% on
medium-sized objects with a few hundreds concurrent connections.
If the accept() is done before checking for global.maxconn, we can
accept too many connections and encounter a lack of file descriptors
when trying to connect to the server. This is the cause of the
"cannot get a server socket" message encountered in debug mode
during injections with low timeouts.
While processing the session, we used to resync the FSMs when buffer
flags changed. But since BF_KERN_SPLICING and BF_READ_DONTWAIT were
introduced, sometimes we could resync after they were set, which is
not what we want. This was because there were some old checks left
which did not mask changes with BF_MASK_STATIC before checking.
When the reader does not expect to read lots of data, it can
set BF_READ_DONTWAIT on the request buffer. When it is set,
the stream_sock_read callback will not try to perform multiple
reads, it will return after only one, and clear the flag.
That way, we can immediately return when waiting for an HTTP
request without trying to read again.
On pure request/responses schemes such as monitor-uri or
redirects, this has completely eliminated the EAGAIN occurrences
and the epoll_ctl() calls, resulting in a performance increase of
about 10%. Similar effects should be observed once we support
HTTP keep-alive since we'll immediately disable reads once we
get a full request.
If we get very large data at once, it's almost certain that it's
worthless trying to read again, because we got everything we could
get.
Doing this has made all -EAGAIN disappear from splice reads. The
threshold has been put in the global tunable structures so that if
we one day want to make it accessible from user config, it will be
easy to do so.
If server check interval is null, we might end up looping in
process_srv_chk().
Prevent those values from being zero and add some control in
process_srv_chk() against infinite loops.
It's sometimes useful at least for statistics to keep a task count.
It's easy to do by forcing the rare task creators to always use the
same functions to create/destroy a task.
If a task wants to stay in the run queue, it is possible. It just
needs to wake itself up. We just want to ensure that a reniced
task will be processed at the right instant.
The top of a duplicate tree is not where bit == -1 but at the most
negative bit. This was causing tasks to be queued in reverse order
within duplicates. While this is not dramatic, it's incorrect and
might lead to longer than expected duplicate depths under some
circumstances.
When there are niced tasks, we would only process #tasks/4 per
turn, without taking care of running #tasks when #tasks was below
4, leaving those tasks waiting for a few other tasks to push them.
The fix simply consists in checking (#tasks+3)/4.
Since we're now able to search from a precise expiration date in
the timer tree using ebtree 4.1, we don't need to maintain 4 trees
anymore. Not only does this simplify the code a lot, but it also
ensures that we can always look 24 days back and ahead, which
doubles the ability of the previous scheduler. Indeed, while based
on absolute values, the timer tree is now relative to <now> as we
can always search from <now>-31 bits.
The run queue uses the exact same principle now, and is now simpler
and a bit faster to process. With these changes alone, an overall
0.5% performance gain was observed.
Tests were performed on the few wrapping cases and everything works
as expected.
tcp_request is not meant to decide how an error or a timeout has to
be handled. It must just apply it rules. Now that the error checks
have been added to the session, we don't need to check them anymore
in tcp_request_inspect(), which will only consider the shutdown which
may be the result of such an error.
That makes a lot more sense since tcp_request is not really waiting
for a request.
In order to get termination flags properly updated, the session was
relying a bit too much on http_return_srv_error() which is http-centric.
A generic srv_error function was implemented in the session in order to
catch all connection abort situations. It was then noticed that a request
abort during a connection attempt was not reported, which is now fixed.
Read and write errors/timeouts were not logged either. It was necessary
to add those tests at 4 new locations.
Now it looks like everything is correctly logged. Most likely some error
checking code could now be removed from some analysers.
The connect timeout was not properly detected due to the fact that
it was not correctly initialized. It must be set as the stream interface
timeout, not the buffer's write timeout.
There are some configurations in which redirect rules are declared
after use_backend rules. We can also find "block" rules after any
of these ones. The processing sequence is :
- block
- redirect
- use_backend
So as of now we try to detect wrong ordering to warn the user about
a possibly undesired behaviour.
People are regularly complaining that proxies are linked in reverse
order when reading the stats. This is now definitely fixed because
the proxy order is now fixed to match configuration order.
Sometimes it may make sense to be able to immediately apply a verdict
without waiting at all. It was not possible because no inspect-delay
meant no inspection at all. This is now fixed.
When a backend has no LB algo specified and is not in dispatch, proxy
nor transparent mode, use "balance roundrobin" by default instead of
complaining. This will be particularly useful with stats and redirects.
When data are forwarded between socket, we must update the output
socket's write timeout. This was forgotten, causing sessions to
unexpectedly expire during long posts.
The forwarding condition was not very clear. We would only enable
forwarding when send_max is zero, and we would only splice when no
analyser is installed. In fact we want to enable forward when there
is no analyser and we want to splice at soon as there is data to
forward, regardless of the analysers.
In process_session(), we used to re-run through all the evaluation
loop when only the response had changed. Now we carefully check in
this order :
- changes to the stream interfaces (only SI_ST_DIS)
- changes to the request buffer flags
- changes to the response buffer flags
And we branch to the appropriate section. This saves significant
CPU cycles, which is important since process_session() is one of
the major CPU eaters.
The same changes have been applied to uxst_process_session().
Most of the time, task_queue() will immediately return. By extracting
the preliminary checks and putting them in an inline function, we can
significantly reduce the number of calls to the function itself, and
most of the tests can be optimized away due to the caller's context.
Another minor improvement in process_runnable_tasks() consisted in
taking benefit from the processor's branch prediction unit by making
a special case of the process_session() callback which is by far the
most common one.
All this improved performance by about 1%, mainly during the call
from process_runnable_tasks().
Timers are unsigned and used as tree positions. Ticks are signed and
used as absolute date within current time frame. While the two are
normally equal (except zero), it's important not to confuse them in
the code as they are not interchangeable.
We add two inline functions to turn each one into the other.
The comments have also been moved to the proper location, as it was
not easy to understand what was a tick and what was a timer unit.
All the tasks callbacks had to requeue the task themselves, and update
a global timeout. This was not convenient at all. Now the API has been
simplified. The tasks callbacks only have to update their expire timer,
and return either a pointer to the task or NULL if the task has been
deleted. The scheduler will take care of requeuing the task at the
proper place in the wait queue.
We don't need to remove then add tasks in the wait queue every time we
update a timeout. We only need to do that when the new timeout is earlier
than previous one. We can rely on wake_expired_tasks() to perform the
proper checks and bounce the misplaced tasks in the rare case where this
happens. The motivation behind this is that we very rarely hit timeouts,
so we save a lot of CPU cycles by moving the tasks very rarely. This now
means we can also find tasks with expiration date set to eternity in the
queue, and that is not a problem.
In many situations, we wake a task on an I/O event, then queue it
exactly where it was. This is a real waste because we delete/insert
tasks into the wait queue for nothing. The only reason for this is
that there was only one tree node in the task struct.
By adding another tree node, we can have one tree for the timers
(wait queue) and one tree for the priority (run queue). That way,
we can have a task both in the run queue and wait queue at the
same time. The wait queue now really holds timers, which is what
it was designed for.
The net gain is at least 1 delete/insert cycle per session, and up
to 2-3 depending on the workload, since we save one cycle each time
the expiration date is not changed during a wake up.
A bug was introduced with the ebtree-based scheduler. It seldom causes
some timeouts to last longer than required if they hit an expiration
date which is the same as the last queued date, is also part of a
duplicate tree without being the top of the tree. In this case, the
task will not be expired until after the duplicate tree has been
flushed.
It is easier to reproduce by setting a very short client timeout (1s)
and sending connections and waiting for them to expire with the 408
status. Then in parallel, inject at about 1kh/s. The bug causes the
connections to sometimes wait longer than 1s before timing out.
The cause was the use of eb_insert_dup() on wrong nodes, as this
function is designed to work only on the top of the dup tree. The
solution consists in updating last_timer only when its bit is -1,
and using it only if its bit is still -1 (top of a dup tree).
The fix has not reduced performance because it only fixes the case
where this bug could fire, which is extremely rare.
It's easier to take the counter's age into account when consulting it
than to rotate it first. It also saves some CPU cycles and avoids the
multiply for outdated counters, finally saving CPU cycles here too
when multiple operations need to read the same counter.
The freq_ctr code has also shrinked by one third consecutively to these
optimizations.
term_trace was very useful while reworking the lower layers but has almost
completely been removed from every place it was referenced. Even the few
remaining ones were not accurate, so it's better to completely remove those
references and re-add them from scratch later if needed.
In pure TCP mode, there is no response analyser to switch the server-side
stream interface from INI to CLO when the output has been closed after an
abort. This caused sessions to remain indefinitely active when they were
aborted by the client during a TCP content analysis.
The proper action is to switch the stream interface to the CLO state from
INI when we have write enable and shutdown write.
The rate-limit was applied to the smoothed value which does a special
case for frequencies below 2 events per period. This caused irregular
limitations when set to 1 session per second.
The proper way to handle this is to compute the number of remaining
events that can occur without reaching the limit. This is what has
been added. It also has the benefit that the frequency calculation
is now done once when entering event_accept(), before the accept()
loop, and not once per accept() loop anymore, thus saving a few CPU
cycles during very high loads.
With this fix, rate limits of 1/s are perfectly respected.
The new "rate-limit sessions" statement sets a limit on the number of
new connections per second on the frontend. As it is extremely accurate
(about 0.1%), it is efficient at limiting resource abuse or DoS.
These new ACLs match frontend session rate and backend session rate.
Examples are provided in the doc to explain how to use that in order
to limit abuse of service.
With this change, all frontends, backends, and servers maintain a session
counter and a timer to compute a session rate over the last second. This
value will be very useful because it varies instantly and can be used to
check thresholds. This value is also reported in the stats in a new "rate"
column.
Several algorithms will need to know the millisecond value within
the current second. Instead of doing a divide every time it is needed,
it's better to compute it when it changes, which is when now and now_ms
are recomputed.
curr_sec_ms_scaled is the same multiplied by 2^32/1000, which will be
useful to compute some ratios based on the position within last second.
The new "show errors" command sent on a unix socket will dump
all captured request and response errors for all proxies. It is
also possible to bound the log to frontends and backends whose
ID is passed as an optional parameter.
The output provides information about frontend, backend, server,
session ID, source address, error type, and error position along
with a complete dump of the request or response which has caused
the error.
If a new error scratches the one currently being reported, then
the dump is aborted with a warning message, and processing goes
on to next error.
Each proxy instance, either frontend or backend, now has some room
dedicated to storing a complete dated request or response in case
of parsing error. This will make it possible to consult errors in
order to find the exact cause, which is particularly important for
troubleshooting faulty applications.
If an invalid character is encountered while parsing an HTTP message, we
want to get buf->lr updated to reflect it.
Along this change, a few useless __label__ declarations have been removed
because they caused gcc to consume stack space without putting anything
there.
On overloaded systems, it sometimes happens that hundreds or thousands
of incoming connections are queued in the system's backlog, and all get
dequeued at once. The problem is that when haproxy processes them and
does not apply any limit, this can take some time and the internal date
does not progress, resulting in wrong timer measures for all sessions.
The most common effect of this is that all of these sessions report a
large request time (around several hundreds of ms) which is in fact
caused by the time spent accepting other connections. This might happen
on shared systems when the machine swaps.
For this reason, we finally apply a reasonable limit even in mono-process
mode. Accepting 100 connections at once is fast enough for extreme cases
and will not cause that much of a trouble when the system is saturated.
Problem reported by John Lauro. When "source ... usesrc ..." is
set in the defaults section, it is not possible anymore to remove
the "usesrc" part when declaring a more precise "source" in a
backend. The only workaround was to declare it by server.
We need to clear optional settings when declaring a new "source".
The problem was the same with the "interface" declaration.
Unix socket processing was still quite buggy. It did not properly
handle interrupted output due to a full response buffer. The fix
mainly consists in not trying to prematurely enable write on the
response buffer, just like the standard session works. This also
gets the unix socket code closer to the standard session code
handling.
Commit 8a5c626e73 introduced the sessions
dump on the unix socket. This implementation is buggy because it may try
to link to the sessions list's head after the last session is removed
with a backref. Also, for the LIST_ISEMPTY test to succeed, we have to
proceed with LIST_INIT after LIST_DEL.
As subject when i try to compile haproxy with -DDEBUG_FULL it stop at
stream_sock.c file with:
gcc -Iinclude -Wall -O2 -g -DDEBUG_FULL -DTPROXY -DENABLE_POLL
-DENABLE_EPOLL -DENABLE_SEPOLL -DNETFILTER -DUSE_GETSOCKNAME
-DCONFIG_HAPROXY_VERSION=\"1.3.15\"
-DCONFIG_HAPROXY_DATE=\"2008/04/19\" -c -o src/stream_sock.o
src/stream_sock.c
src/stream_sock.c: In function 'stream_sock_chk_rcv':
src/stream_sock.c:905: error: 'fd' undeclared (first use in this function)
src/stream_sock.c:905: error: (Each undeclared identifier is reported only once
src/stream_sock.c:905: error: for each function it appears in.)
src/stream_sock.c:905: error: 'ob' undeclared (first use in this function)
src/stream_sock.c: In function 'stream_sock_chk_snd':
src/stream_sock.c:940: error: 'fd' undeclared (first use in this function)
src/stream_sock.c:940: error: 'ib' undeclared (first use in this function)
make: *** [src/stream_sock.o] Error 1
With this patch all build fine:
Using the wrong operator (&& instead of &) causes DOWN->UP
transition to take longer than it should and to produce a lot of
redundant logs. With typical "track" usage (1-6 tracking servers) it
shouldn't make a big difference but for heavily tracked servers
this bug leads to hang with 100% CPU usage and extremely big
log spam.
The "bind-process" keyword lets the admin select which instances may
run on which process (in multi-process mode). It makes it easier to
more evenly distribute the load across multiple processes by avoiding
having too many listen to the same IP:ports.
Specifying "interface <name>" after the "source" statement allows
one to bind to a specific interface for proxy<->server traffic.
This makes it possible to use multiple links to reach multiple
servers, and to force traffic to pass via an interface different
from the one the system would have chosen based on the routing
table.
By appending "interface <name>" to a "bind" line, it is now possible
to specifically bind to a physical interface name. Note that this
currently only works on Linux and requires root privileges.
Setting "nosplice" in the global section will disable the use of TCP
splicing (both tcpsplice and linux 2.6 splice). The same will be
achieved using the "-dS" parameter on the command line.
The global tuning options right now only concern the polling mechanisms,
and they are not in the global struct itself. It's not very practical to
add other options so let's move them to the global struct and remove
types/polling.h which was not used for anything else.
global.maxconn/4 seems to be a good hint for global.maxpipes when that
one must be guessed. If the limit is reached, it's still possible to
set it manually in the configuration.
Using pipe pools makes pipe management a lot easier. It also allows to
remove quite a bunch of #ifdefs in areas which depended on the presence
or not of support for kernel splicing.
The buffer now holds a pointer to a pipe structure which is always NULL
except if there are still data in the pipe. When it needs to use that
pipe, it dynamically allocates it from the pipe pool. When the data is
consumed, the pipe is immediately released.
That way, there is no need anymore to care about pipe closure upon
session termination, nor about pipe creation when trying to use
splice().
Another immediate advantage of this method is that it considerably
reduces the number of pipes needed to use splice(). Tests have shown
that even with 0.2 pipe per connection, almost all sessions can use
splice(), because the same pipe may be used by several consecutive
calls to splice().
A new data type has been added : pipes. Some pre-allocated empty pipes
are maintained in a pool for users such as splice which use them a lot
for very short times.
Pipes are allocated using get_pipe() and released using put_pipe().
Pipes which are released with pending data are immediately killed.
The struct pipe is small (16 to 20 bytes) and may even be further
reduced by unifying ->data and ->next.
It would be nice to have a dedicated cleanup task which would watch
for the pipes usage and destroy a few of them from time to time.
Kernels before 2.6.27.13 would have splice() return EAGAIN on shutdown.
By adding a few tricks, we can deal with the situation. If splice()
returns EAGAIN and the pipe is empty, then fallback to recv() which
will be able to check if it's an end of connection or not.
The advantage of this method is that it remains transparent for good
kernels since there is no reason that epoll() will return EPOLLIN
without anything to read, and even if it would happen, the recv()
overhead on this check is minimal.
If splicing is enabled in a backend, we need to guess how many
pipes will be needed. We used to rely on fullconn, but this leads
to non-working splicing when fullconn is not specified. So we now
fallback to global.maxconn.
This code provides support for linux 2.6 kernel splicing. This feature
appeared in kernel 2.6.25, but initial implementations were awkward and
buggy. A kernel >= 2.6.29-rc1 is recommended, as well as some optimization
patches.
Using pipes, this code is able to pass network data directly between
sockets. The pipes are a bit annoying to manage (fd creation, release,
...) but finally work quite well.
Preliminary tests show that on high bandwidths, there's a substantial
gain (approx +50%, only +20% with kernel workarounds for corruption
bugs). With 2000 concurrent connections, with Myricom NICs, haproxy
now more easily achieves 4.5 Gbps for 1 process and 6 Gbps for two
processes buffers. 8-9 Gbps are easily reached with smaller numbers
of connections.
We also try to splice out immediately after a splice in by making
profit from the new ability for a data producer to notify the
consumer that data are available. Doing this ensures that the
data are immediately transferred between sockets without latency,
and without having to re-poll. Performance on small packets has
considerably increased due to this method.
Earlier kernels return only one TCP segment at a time in non-blocking
splice-in mode, while newer return as many segments as may fit in the
pipe. To work around this limitation without hurting more recent kernels,
we try to collect as much data as possible, but we stop when we believe
we have read 16 segments, then we forward everything at once. It also
ensures that even upon shutdown or EAGAIN the data will be forwarded.
Some tricks were necessary because the splice() syscall does not make
a difference between missing data and a pipe full, it always returns
EAGAIN. The trick consists in stop polling in case of EAGAIN and a non
empty pipe.
The receiver waits for the buffer to be empty before using the pipe.
This is in order to avoid confusion between buffer data and pipe data.
The BF_EMPTY flag now covers the pipe too.
Right now the code is disabled by default. It needs to be built with
CONFIG_HAP_LINUX_SPLICE, and the instances intented to use splice()
must have "option splice-response" (or option splice-request) enabled.
It is probably desirable to keep a pool of pre-allocated pipes to
avoid having to create them for every session. This will be worked
on later.
Preliminary tests show very good results, even with the kernel
workaround causing one memcpy(). At 3000 connections, performance
has moved from 3.2 Gbps to 4.7 Gbps.
Some older libc don't define the splice() syscall, and some even
define a wrong one. For this reason, we try our best to declare
it correctly. These definitions still work with recent glibc.
When CONFIG_HAP_LINUX_SPLICE is defined, the buffer structure will be
slightly enlarged to support information needed for kernel splicing
on Linux.
A first attempt consisted in putting this information into the stream
interface, but in the long term, it appeared really awkward. This
version puts the information into the buffer. The platform-dependant
part is conditionally added and will only enlarge the buffers when
compiled in.
One new flag has also been added to the buffers: BF_KERN_SPLICING.
It indicates that the application considers it is appropriate to
use splicing to forward remaining data.
Three new options have been added when CONFIG_HAP_LINUX_SPLICE is
set :
- splice-request
- splice-response
- splice-auto
They are used to enable splicing per frontend/backend. They are also
supported in defaults sections. The "splice-auto" option is meant to
automatically turn splice on for buffers marked as fast streamers.
This should save quite a bunch of file descriptors.
It was required to add a new "options2" field to the proxy structure
because the original "options" is full.
When global.maxpipes is not set, it is automatically adjusted to
the max of the sums of all frontend's and backend's maxconns for
those which have at least one splice option enabled.
When the producer calls stream_sock_chk_snd(), we now try to send
all pending data asynchronously. If it succeeds, we don't have to
enable polling on the FD which saves about half of the calls to
epoll_wait().
In stream_sock_read(), we finally set the WAIT_ROOM flag as soon as
possible, in preparation of the splice code. We reset it when we
detect that some room has been released either in the buffer or in
the splice.
The condition to cakk ->chk_snd() in stream_sock_read() was suboptimal
because we did not call it when the socket was shut down nor when there
was an error after data were added.
Now we ensure to call is whenever there are data pending.
Also, the "full" condition was handled before calling chk_snd(), which
could cause deadlock issues if chk_snd() did consume some data.
stream_sock_write() has been split in two parts :
- the poll callback, intented to be called when an I/O event has
been detected
- the write() core function, which ought to be usable from various
other places, possibly not meant to wake the task up.
The code has also been slightly cleaned up in the process. It's more
readable now.
Some tricks to handle situations where we write nothing were in the
middle of the main loop in stream_sock_write(). This cleanup provides
better source and object code, and slightly shrinks the output code.
This construct collapses into ((flags & (X|Y)) == X) when X is a
single-bit flag. This provides a noticeable code shrink and the
output code results in less conditional jumps.
In the buffers, the read limit used to leave some place for header
rewriting was set by a pointer to the end of the buffer. Not only
this required subtracts at every place in the code, but this will
also soon not be usable anymore when we want to support keepalive.
Let's replace this with a length limit, comparable to the buffer's
length. This has also sightly reduced the code size.
It is not always wise to return 0 in stream_sock_read() upon EAGAIN,
because if we have read enough data, we should consider that enough
and try again later without polling in between.
We still make a difference between small reads and large reads though.
Small reads still lead to polling because we're sure that there's
nothing left in the system's buffers if we read less than one MSS.
The way the buffers and stream interfaces handled ->to_forward was
really not handy for multiple reasons. Now we've moved its control
to the receive-side of the buffer, which is also responsible for
keeping send_max up to date. This makes more sense as it now becomes
possible to send some pre-formatted data followed by forwarded data.
The following explanation has also been added to buffer.h to clarify
the situation. Right now, tests show that the I/O is behaving extremely
well. Some work will have to be done to adapt existing splice code
though.
/* Note about the buffer structure
The buffer contains two length indicators, one to_forward counter and one
send_max limit. First, it must be understood that the buffer is in fact
split in two parts :
- the visible data (->data, for ->l bytes)
- the invisible data, typically in kernel buffers forwarded directly from
the source stream sock to the destination stream sock (->splice_len
bytes). Those are used only during forward.
In order not to mix data streams, the producer may only feed the invisible
data with data to forward, and only when the visible buffer is empty. The
consumer may not always be able to feed the invisible buffer due to platform
limitations (lack of kernel support).
Conversely, the consumer must always take data from the invisible data first
before ever considering visible data. There is no limit to the size of data
to consume from the invisible buffer, as platform-specific implementations
will rarely leave enough control on this. So any byte fed into the invisible
buffer is expected to reach the destination file descriptor, by any means.
However, it's the consumer's responsibility to ensure that the invisible
data has been entirely consumed before consuming visible data. This must be
reflected by ->splice_len. This is very important as this and only this can
ensure strict ordering of data between buffers.
The producer is responsible for decreasing ->to_forward and increasing
->send_max. The ->to_forward parameter indicates how many bytes may be fed
into either data buffer without waking the parent up. The ->send_max
parameter says how many bytes may be read from the visible buffer. Thus it
may never exceed ->l. This parameter is updated by any buffer_write() as
well as any data forwarded through the visible buffer.
The consumer is responsible for decreasing ->send_max when it sends data
from the visible buffer, and ->splice_len when it sends data from the
invisible buffer.
A real-world example consists in part in an HTTP response waiting in a
buffer to be forwarded. We know the header length (300) and the amount of
data to forward (content-length=9000). The buffer already contains 1000
bytes of data after the 300 bytes of headers. Thus the caller will set
->send_max to 300 indicating that it explicitly wants to send those data,
and set ->to_forward to 9000 (content-length). This value must be normalised
immediately after updating ->to_forward : since there are already 1300 bytes
in the buffer, 300 of which are already counted in ->send_max, and that size
is smaller than ->to_forward, we must update ->send_max to 1300 to flush the
whole buffer, and reduce ->to_forward to 8000. After that, the producer may
try to feed the additional data through the invisible buffer using a
platform-specific method such as splice().
*/
Previously, we wrote nothing only if the buffer was empty. Now with
send_max, we can also write nothing because we are not allowed to send
anything due to send_max.
The code starts to look like spaghetti. It needs to be rearranged a
lot before merging the splice patches.
In preparation of splice support, let's add the splice_len member
to the buffer struct. An earlier implementation made it conditional,
which made the whole logics very complex due to a large number of
ifdefs.
Now BF_EMPTY is only set once both buf->l and buf->splice_len are
null. Splice_len is initialized to zero during buffer creation and
is currently not changed, so the whole logics remains unaffected.
When splice gets merged, splice_len will reflect the number of bytes
in flight out of the buffer but not yet sent, typically in a pipe for
the Linux case.
If an analyser sets buf->to_forward to a given value, that many
data will be forwarded between the two stream interfaces attached
to a buffer without waking the task up. The same applies once all
analysers have been released. This saves a large amount of calls
to process_session() and a number of task_dequeue/queue.
By letting the producer tell the consumer there is data to check,
and the consumer tell the producer there is some space left again,
we can cut in half the number of session wakeups.
This is also an important starting point for future splicing support.
Sometimes we don't care about a read timeout, for instance, from the
client when waiting for the server, but we still want the client to
be able to read.
Till now it was done by articially forcing the read timeout to ETERNITY.
But this will cause trouble when we want the low level stream sock to
communicate without waking the session up. So we add a BF_READ_NOEXP
flag to indicate that when the read timeout is to be set, it might
have to be set to ETERNITY.
Since BF_READ_ENA was not used, we replaced this flag.
For keep-alive, line-mode protocols and splicing, we will need to
limit the sender to process a certain amount of bytes. The limit
is automatically set to the buffer size when analysers are detached
from the buffer.
"option transparent" was set and checked on frontends only while it
is purely a backend thing as it replaces the "balance" mode. For this
reason, it did only work in "listen" sections. This change will then
not affect the rare users of this option.
This causes health checks to stop after some time since the new
ticks-based scheduler because a check timeout is set to eternity.
This fix must be merged into master but not in earlier versions
as it only affects the new scheduler.
(cherry picked from commit e349eb452b655dc1adc059f05ba8b36565753393)
Kai Krueger found that previous patch was incomplete, because there is
an unconditionnal call to process_srv_queue() in session_free() which
still causes a dead server to consume pending connections from the
backend.
This call was made unconditionnal so that we don't leave unserved
connections in the server queue, for instance connections coming
in with "option persist" which can bypass the server status check.
However, the server must not touch the backend's queue if it is down.
Another fear was that some connections might remain unserved when
the server is using a dynamic maxconn if the number of connections
to the backend is too low. Right now, srv_dynamic_maxconn() ensures
this cannot happen, so the call can remain conditionnal.
The fix consists in allowing a server to process it own queue whatever
its state, but not to touch the backend's queue if it is down. Its
queue should normally be empty when the server is down because it is
redistributed when the server goes down. The only remaining cases are
precisely the persistent connections with "option persist" set, coming
in after the queue has been redispatched. Those ones must still be
processed when a connection terminates.
(cherry picked from commit cd485c4480)
If the prefix is set to "/", it means the user does not want to alter
the original URI, so we don't want to insert a new slash before the
original URI.
(cherry-picked from commit 02a35c74942c1bce762e996698add1270e6a5030)
It is now possible to set or clear a cookie during a redirection. This
is useful for logout pages, or for protecting against some DoSes. Check
the documentation for the options supported by the "redirect" keyword.
(cherry-picked from commit 4af993822e880d8c932f4ad6920db4c9242b0981)
If "drop-query" is present on a "redirect" line using the "prefix" mode,
then the returned Location header will be the request URI without the
query-string. This may be used on some login/logout pages, or when it
must be decided to redirect the user to a non-secure server.
(cherry-picked from commit f2d361ccd73aa16538ce767c766362dd8f0a88fd)
Josh Goebel reported that haproxy silently dies when it fails to
chroot. In fact, it does so when in daemon mode, because daemon
mode has been disabling output for ages.
Since the code has been reworked, this could have been changed
because there is no reason for this anymore, hence this patch.
(cherry picked from commit 304d6fb00f)
(cherry picked from commit 50b7f7f12c67322c793f50a6be009f0fd0eec1bb)
was just looking through the source, and noticed this... :)
(cherry picked from commit 63b76be713)
(cherry picked from commit a801db6c5ea750f93a3795dbb2e70c03e05bbef4)
Cookie capture would only work by pure luck on the request but did
never work on responses since only the backend was checked. The fix
consists in always checking frontend for cookie captures.
(cherry picked from commit a83c5ba9315a7c47cda2698280b7e49a9d3eb374)
Using an ACL-related keyword in the defaults section causes a
segfault during parsing because the list headers are not initialized.
We must initialize list headers for default instance and reject
keywords relying on ACLs.
(cherry picked from commit 1c90a6ec20)
(cherry picked from commit eb8131b4e418b838b2d62d991d91d94482ba49de)
There is a problem when an instance is marked "disabled". Its ports are
still bound but will not be unbound upon termination. This causes processes
to accumulate during soft restarts, and might even cause failures to restart
new ones due to the inability to bind to the same port.
The ideal solution would be to bind all ports at the end of the configuration
parsing. An acceptable workaround is to unbind all listeners of disabled
proxies. This is what the current patch does.
(cherry picked from commit a944218e9c)
(cherry picked from commit 8cfebbb82b87345bade831920177077e7d25840a)
During a configuration reload, haproxy tried to pause all proxies.
Unfortunately, it also tried to pause backends, which would fail
and cause trouble to the new process since the port was still bound.
(backported from commit eab5c70f93)
(cherry picked from commit ac1ca38e9b07422e21b5b4778918d243768e5498)
srv_dynamic_maxconn() is clearly documented as returning at least 1
possible connection under throttling. But the computation was wrong,
the minimum 1 was divided and got lost in case of very low maxconns.
Apply the MAX(1, max) before returning the result in order to ensure
that a newly appeared server will get some traffic.
(cherry picked from commit 819970098f)
(forward-port of commit 8262d8bd7f)
A bug was introduced during last queue management fix. If a server
connection fails, the allocated connection slot is released, but it
will be needed again after the turn-around. This also causes more
connections than expected to go to the server because it appears to
have less connections than real.
Many thanks to Rupert Fiasco, Mark Imbriaco, Cody Fauser, Brian
Gupta and Alexander Staubo for promptly providing configuration
and diagnosis elements to help reproduce this problem easily.
I'm in the process of setting up one haproxy instance now, and I find
the following acl option useful. I'm not too sure why this option has
not been available before, but I find this useful for my own usage, so
I'm submitting this patch in the hope that it will be useful as well.
The basic idea is to be able to measure the available connection slots
still available (connection, + queue) - anything beyond that can be
redirected to a different backend. 'connslots' = number of available
server connection slots, + number of available server queue slots. In
the case where we encounter srv maxconn = 0, or srv maxqueue = 0 (in
which case we dont need to care about connslots) the value you get is
-1. Note also that this code does not take care of dynamic connections
at this point in time.
The reason why I'm using this new acl (as opposed to 'nbsrv') is that
'nbsrv' only measures servers that are actually *down*. Whereas this
other acl is more fine-grained, and looks into the number of conn
slots available as well.
It is now possible to list all known sessions by issuing "show sess"
on the unix stats socket. The format is not much evolved but it is
very useful for debugging.
The doc has been updated to reflect the new keyword.
This is the first step in implementing a session dump tool.
A session dump will need restart points. It will be necessary for
it to get references to sessions which can be moved when the session
dies.
The principle is not that complex : when a session ends, it looks for
any potential back-references. If it finds any, then it moves them to
the next session in the list. The dump function will of course have
to restart from that new point.
Both should process the response buffer equally. They now both
clear the hijack bit once done, and both receive a pointer to
the response buffer in their arguments.
Instead of calling a hard-coded function to produce data, let's
reference this function into the buffer and call it from there
when BF_HIJACK is set. This goes in the direction of more generic
session management code.
The listener referenced in the fd was only used to check the
listener state upon session termination. There was no guarantee
that the FD had not been reassigned by the moment it was processed,
so this was a bit racy. Having it in the session is more robust.
The unix protocol handler had not been updated during the last
stream_sock changes. This has been done now. There is still a
lot of duplicated code between session.c and proto_uxst.c due
to the way the session is handled. Session.c relies on the existence
of a frontend while it does not exist here.
It is easier to see the difference between the stats part (placed
in dumpstats.c) and the unix-stream part (in proto_uxst.c).
The hijacking function still needs to be dynamically set into the
response buffer, and some cleanup is still required, then all those
changes should be forward-ported to the HTTP part. Adding support
for new keywords should not cause trouble now.
It will be very convenient to have an analyser state in the session.
It will always be initialized to zero. The analysers can make use of
it, but must reset it to zero when they leave.
In order to achieve more generic accept() code, we can set the request
analysers at the listener registration time. It's better than doing it
during accept(), and allows more code reuse.
The accept function must be adapted to the new framework. It is
still broken, and calling it will still result in a segfault. But
this cleanup is needed anyway.
The TCP analyser has moved to proto_tcp.c. Breaking the function
has required finer use of the return value and adding some tests
to process_session().
It was a bit awkward to have session.c call return_srv_error() for
HTTP error messages related to servers. The function has been adapted
to be passed a pointer to the faulty stream interface, and is now a
pointer in the session. It is possible that in the future, it will
become a callback in the stream interface itself.
The new function looks like the previous one except that it operates
at the stream interface level and assumes an already closed SI.
Also remove some old unused occurrences of srv_close_with_err().
In order to avoid having to call per-protocol logging function directly
from session.c, it's better to assign the logging function when the session
is created. This also eliminates a test when the function is needed, and
opens the way to more complete logging functions.
proto_http.c was not suitable for session-related processing, it was
just convenient for the tranformation.
Some more splitting must occur: process_request/response in proto_http.c
must be split again per protocol, and the caller must run a list.
Some functions should be directly attached to the session or the buffer
(eg: perform_http_redirect, return_srv_error, http_sess_log).
All the processing has now completely been split in layers. As of
now, everything is still in process_session() which is not the right
place, but the code sequence works. Timeouts, retries, errors, all
work.
The shutdown sequence has been strictly applied: BF_SHUTR/BF_SHUTW
are only assigned by lower layers. Upper layers can only indicate
their wish to close using BF_SHUTR_NOW and BF_SHUTW_NOW.
When a shutdown is performed on a stream interface, the buffer flags
are updated accordingly and re-checked by upper layers. A lot of care
has been taken to ensure that aborts during intermediate connection
setups are correctly handled and shutdowns correctly propagated to
both buffers.
A future evolution would consist in ensuring that BF_SHUT?_NOW may
be set at any time, and applies only when the buffer is empty. This
might help with error messages, but might complicate the processing
of data remaining in buffers.
Some useless buffer flag combinations have been removed.
Stat counters are still broken (eg: per-server total number of sessions).
Error messages should be delayed to the close instant and be produced by
protocol.
Many functions must now move to proper locations.
It sometimes happens that a connection is aborted at the exact same moment
it establishes. We have to close the socket and not only to shut it down
for writes.
Some corner cases remain. We have to handle the shutr/shutw at the stream
interface and only report the status to the buffer, not the opposite.
The sessions which were remaining stuck were being connecting to the
server while they received a shutw which caused them to partially
stop. A shutw() during a connect() must imply a close().
Now the global variable 'sessions' will be a dual-linked list of all
known sessions. The list element is set at the beginning of the session
so that it's easier to follow them all with gdb.
Two new functions are used instead : buffer_check_{shutr,shutw}.
It is indeed more adequate to check for new closures only when the
buffer reports them.
Several remaining unclosed connections were detected after a test,
even before this patch, so a bug remains. To reproduce, try the
following during 30 seconds :
inject30l4 -n 20000 -l -t 1000 -P 10 -o 4 -u 100 -s 100 -G 127.0.0.1:8000/
There were rare situations where it was not easy to detect that a failed
session attempt had occurred and needed some server cleanup. In particular,
client aborts sometimes lead to session leaks on the server side.
A new state "SI_ST_DIS" (disconnected) has been introduced for this. When
a session has been closed at a stream interface but the server cleanup has
not occurred, this state is entered instead of CLO. The cleanup is then
performed there and the state goes to CLO.
A new diagram has been added to show possible stream_interface state
transitions that can occur in a stream-sock. It makes debugging easier.
The server sessions are now only decremented when entering SI_ST_CER
and SI_ST_CLO states. A state is clearly missing between EST and CLO,
or after CLO (eg: END), because many cleanups are performed upon CLO
and must rely on tricks to ensure being done only once.
The goal of next changes will be to improve what has been started.
Ideally, the FD should only notify the SI about the change, which
should itself only notify the session when it has some news or when
it needs help (eg: redispatch). The buffer's error processing should
not change the FD's status immediately, otherwise we risk race conds
between a pending connect and a shutw (for instance). Also, the new
connect attempt should only be made after layer 7 and all the crap
above buffers.
It is quite hard to track when the current session has already been counted
or discounted from the server's total number of established sessions. For
this reason, we introduce a new session flag, SN_CURR_SESS, which indicates
if the current session is one of those reported by the server or not. It
simplifies session accounting and makes it far more robust. It also makes
it possible to perform a last-minute cleanup during session_free().
Right now, with this fix and a few more buffer transitions fixes, no session
were found to remain after a test.
Tracking connection status changes was hard, and some code was
redundant. A new SI_ST_CER state was added to the stream interface
to indicate a past connection error, and an SI_FL_ERR flag was
added to report past I/O error. The stream_sock code does not set
the connection to SI_ST_CLO anymore in case of I/O error, it's
the upper layer which does it. This makes it possible to know
exactly when the file descriptors are allocated.
The new SI_ST_CER state permitted to split tcp_connection_status()
in two parts, one processing SI_ST_CON and the other one SI_ST_CER.
Synchronous connection errors now make use of this last state, hence
eliminating duplicate code.
Some ib<->ob copy paste errors were found and fixed, and all entities
setting SI_ST_CLO also shut the buffers down.
Some of these stream_interface specific functions and structures
have migrated to a new stream_interface.c file.
Some types of errors are still not detected by the buffers. For
instance, let's assume the following scenario in one single pass
of process_session: a connection sits in SI_ST_TAR state during
a retry. At TAR expiration, a new connection attempt is made, the
connection is obtained and srv->cur_sess is increased. Then the
buffer timeout is fires and everything is cleared, the new state
becomes SI_ST_CLO. The cleaning code checks that previous state
was either SI_ST_CON or SI_ST_EST to release the connection. But
that's wrong because last state is still SI_ST_TAR. So the
server's connection count does not get decreased.
This means that prev_state must not be used, and must be replaced
by some transition detection instead of level detection.
The following debugging line was useful to track state changes :
fprintf(stderr, "%s:%d: cs=%d ss=%d(%d) rqf=0x%08x rpf=0x%08x\n", __FUNCTION__, __LINE__,
s->si[0].state, s->si[1].state, s->si[1].err_type, s->req->flags, s-> rep->flags);
The connection setup code has been refactored in order to
make it run only on low level (stream interface). Several
complicated functions have been removed from backend.c,
and we now have sess_update_stream_int() to manage
an assigned connection, sess_prepare_conn_req() to assign a
server to a connection request, perform_http_redirect() to
redirect instead of connecting to server, and return_srv_error()
to return connection error status messages.
The stream_interface status changes are checked before adjusting
buffer flags, so that the buffers can be informed about this lower
level update.
A new connection is initiated by changing si->state from SI_ST_INI
to SI_ST_REQ.
The code seems to work but is awfully dirty. Some functions need
to be moved, and the layering is not yet quite clear.
A lot of dead old code has simply been removed.
It was not practical to have QUEUE and TAR timers in buffers, as they caused
triggering of the timeout flags. Move them to the stream interface where they
belong.
Now we have almost two distinct parts between tcp and http.
Only the connection establishment code still requires some
resynchronization, the rest does not.
Those entries were really needed for cleaner and better code. Using them
has permitted to automatically close a file descriptor during a shut write,
reducing by 20% the number of calls to process_session() and derived
functions.
Process_session() does not need to know the file descriptor anymore, though
it still remains very complicated due to the special case for the connect
mode.
As of now, a stream socket does not directly wake up the task
but it does contact the stream interface which itself knows the
task. This allows us to perform a few cleanups upon errors and
shutdowns, which reduces the number of calls to data_update()
from 8 per session to 2 per session, and make all the functions
called in the process_session() loop completely swappable.
Some improvements are required. We need to provide a shutw()
function on stream interfaces so that one side which closes
its read part on an empty buffer can propagate the close to
the remote side.
The owner of an fd was initially a task but this was sometimes
casted to a (struct listener *). We'll soon need more types,
so void* is more appropriate.
It's very frequent to require some information about the
reason why a task is running. Some flags have been added
so that a task now knows if it got woken up due to I/O
completion, timeout, etc...
A test has shown that more than 16% of the calls to task_wakeup()
could be avoided because the task is already woken up. So make it
inline and move the test to the inline part.
When an accept() creates a new FD, it is already marked as set for
reads. But the task will be woken up without first checking if the
socket could be read.
The speculative I/O gives us a chance to either read the FD if there
are data pending on it, or immediately mark it for poll mode if
nothing is pending.
Simply doing this reduces the number of calls to process_session
from 6 to 5 per session, 2 to 1 calls to process_request, 10% less
calls to epoll_ctl, fd_clr, fd_set, stream_sock_data_update, 20%
less eb32_insert/eb_delete, etc... General performance increase
seems to be around 3%.
The buffer flags became a big bazaar. Re-arrange them
so that their names are more explicit and so that they
are more easily readable in hex form. Some aggregates
have also been adjusted.
With small HTTP messages, stream_sock_read() tends to wake the
task up for a message read without indicating that it may be
the last one. The reason is that level-triggered pollers generally
don't report HUP with data, but only afterwards, so stream_sock_read
has no chance to detect this condition and needs a respin.
So now we return on incomplete buffers only when the buffer is known
as a streamer, because here it generally makes sense. The net result
is that the number of calls in a single HTTP session has dropped
from 5 to 3, with one less wake up and several less calls to
stream_sock_data_update().
It was a waste to constantly update the file descriptor's status
and timeouts during a flags update. So stream_sock_process_data
has been slit in two parts :
stream_sock_data_update() => computes updated flags
stream_sock_data_finish() => computes timeouts
Only the first one is called during flag updates. The second one
is only called upon completion. The number of calls to fd_set/fd_clr
has now significantly dropped.
Also, it's useless to check for errors and timeouts in the
process_session() loop, it's enough to check for them at the
beginning.
The client side now relies on stream_sock_process_data(). One
part has not yet been re-implemented, it concerns the calls
to produce_content().
process_session() has been adjusted to correctly check for
changing bits in order not to call useless functions too many
times.
It already appears that stream_sock_process_data() should be
split so that the timeout computations are only performed at
the exit of process_session().
We really want to ensure that we don't miss a timeout update and do not
update them for nothing. So the code takes care of updating the timeout
in the two following circumstances :
- it was not set
- some I/O has been performed
Maybe we'll be able to remove that from stream_sock_{read|write}, or
we'll find a way to ensure that we never have to re-enable this.
srv_state has been removed from HTTP state machines, and states
have been split in either TCP states or analyzers. For instance,
the TARPIT state has just become a simple analyzer.
New flags have been added to the struct buffer to compensate this.
The high-level stream processors sometimes need to force a disconnection
without touching a file-descriptor (eg: report an error). But if
they touched BF_SHUTW or BF_SHUTR, the file descriptor would not
be closed. Thus, the two SHUT?_NOW flags have been added so that
an application can request a forced close which the stream interface
will be forced to obey.
During this change, a new BF_HIJACK flag was added. It will
be used for data generation, eg during a stats dump. It
prevents the producer on a buffer from sending data into it.
BF_SHUTR_NOW /* the producer must shut down for reads ASAP */
BF_SHUTW_NOW /* the consumer must shut down for writes ASAP */
BF_HIJACK /* the producer is temporarily replaced */
BF_SHUTW_NOW has precedence over BF_HIJACK. BF_HIJACK has
precedence over BF_MAY_FORWARD (so that it does not need it).
New functions buffer_shutr_now(), buffer_shutw_now(), buffer_abort()
are provided to manipulate BF_SHUT* flags.
A new type "stream_interface" has been added to describe both
sides of a buffer. A stream interface has states and error
reporting. The session now has two stream interfaces (one per
side). Each buffer has stream_interface pointers to both
consumer and producer sides.
The server-side file descriptor has moved to its stream interface,
so that even the buffer has access to it.
process_srv() has been split into three parts :
- tcp_get_connection() obtains a connection to the server
- tcp_connection_failed() tests if a previously attempted
connection has succeeded or not.
- process_srv_data() only manages the data phase, and in
this sense should be roughly equivalent to process_cli.
Little code has been removed, and a lot of old code has been
left in comments for now.
In backend.c, we had an EV_FD_SET() called before fd_insert().
This is wrong because fd_insert updates maxfd which might be
used by some of the pollers during EV_FD_SET(), although this
is not currently the case.
The following patch introduced a minor bug :
[MINOR] permit renaming of x-forwarded-for header
If "option forwardfor" is declared in a defaults section, the header name
is never set and we see an empty header name before the value. Also, the
header name was not reset between two defaults sections.
When any processing remains on a buffer, it must be up to the
processing functions to set the termination flags, because they
are the only ones who know about higher levels.
It's a shame not to use buffer->wex for connection timeouts since by
definition it cannot be used till the connection is not established.
Using it instead of ->cex also makes the buffer processing more
symmetric.
Instead of calling all functions in a loop, process_session now
calls them according to buffer flags changes. This ensures that
we almost never call functions for nothing. The flags settings
are still quite coarse, but the number of average functions
calls per session has dropped from 31 to 18 (the calls to
process_srv dropped from 13 to 7 and the calls to process_cli
dropped from 13 to 8).
This could still be improved by memorizing which flags each
function uses, but that would add a level of complexity which
is not desirable and maybe even not worth the small gain.
It is not always convenient to run checks on req->l in functions to
check if a buffer is empty or full. Now the stream_sock functions
set flags BF_EMPTY and BF_FULL according to the buffer contents. Of
course, functions which touch the buffer contents adjust the flags
too.
BF_SHUTR_PENDING and BF_SHUTW_PENDING were poor ideas because
BF_SHUTR is the pending of BF_SHUTW_DONE and BF_SHUTW is the
pending of BF_SHUTR_DONE. Remove those two useless and confusing
"pending" versions and rename buffer_shut{r,w}_* functions.