6665 Commits

Author SHA1 Message Date
Willy Tarreau
d0a201b35c [CLEANUP] task: distinguish between clock ticks and timers
Timers are unsigned and used as tree positions. Ticks are signed and
used as absolute date within current time frame. While the two are
normally equal (except zero), it's important not to confuse them in
the code as they are not interchangeable.

We add two inline functions to turn each one into the other.

The comments have also been moved to the proper location, as it was
not easy to understand what was a tick and what was a timer unit.
2009-03-08 15:58:07 +01:00
Willy Tarreau
721fdbc381 [BUG] event_accept() must always wake the task up, even in health mode
event_accept() did not wake the task up in health mode, so that mode was
not working anymore.
2009-03-08 12:25:07 +01:00
Willy Tarreau
26c250683f [MEDIUM] minor update to the task api: let the scheduler queue itself
All the tasks callbacks had to requeue the task themselves, and update
a global timeout. This was not convenient at all. Now the API has been
simplified. The tasks callbacks only have to update their expire timer,
and return either a pointer to the task or NULL if the task has been
deleted. The scheduler will take care of requeuing the task at the
proper place in the wait queue.
2009-03-08 09:38:41 +01:00
Willy Tarreau
4136522527 [OPTIM] displace tasks in the wait queue only if absolutely needed
We don't need to remove then add tasks in the wait queue every time we
update a timeout. We only need to do that when the new timeout is earlier
than previous one. We can rely on wake_expired_tasks() to perform the
proper checks and bounce the misplaced tasks in the rare case where this
happens. The motivation behind this is that we very rarely hit timeouts,
so we save a lot of CPU cycles by moving the tasks very rarely. This now
means we can also find tasks with expiration date set to eternity in the
queue, and that is not a problem.
2009-03-08 07:59:27 +01:00
Willy Tarreau
4726f53794 [OPTIM] task: don't unlink a task from a wait queue when waking it up
In many situations, we wake a task on an I/O event, then queue it
exactly where it was. This is a real waste because we delete/insert
tasks into the wait queue for nothing. The only reason for this is
that there was only one tree node in the task struct.

By adding another tree node, we can have one tree for the timers
(wait queue) and one tree for the priority (run queue). That way,
we can have a task both in the run queue and wait queue at the
same time. The wait queue now really holds timers, which is what
it was designed for.

The net gain is at least 1 delete/insert cycle per session, and up
to 2-3 depending on the workload, since we save one cycle each time
the expiration date is not changed during a wake up.
2009-03-08 07:59:18 +01:00
Willy Tarreau
1b8ca663a4 [BUG] task: fix handling of duplicate keys
A bug was introduced with the ebtree-based scheduler. It seldom causes
some timeouts to last longer than required if they hit an expiration
date which is the same as the last queued date, is also part of a
duplicate tree without being the top of the tree. In this case, the
task will not be expired until after the duplicate tree has been
flushed.

It is easier to reproduce by setting a very short client timeout (1s)
and sending connections and waiting for them to expire with the 408
status. Then in parallel, inject at about 1kh/s. The bug causes the
connections to sometimes wait longer than 1s before timing out.

The cause was the use of eb_insert_dup() on wrong nodes, as this
function is designed to work only on the top of the dup tree. The
solution consists in updating last_timer only when its bit is -1,
and using it only if its bit is still -1 (top of a dup tree).

The fix has not reduced performance because it only fixes the case
where this bug could fire, which is extremely rare.
2009-03-08 07:57:47 +01:00
Willy Tarreau
39af0f663d [BUG] rate-limit in defaults section was ignored
Just a missing initialisation of the field when creating a proxy.
2009-03-07 11:53:44 +01:00
Willy Tarreau
2ade301505 [BUG] disable any analysers for monitoring requests
We must not parse an HTTP request on a monitoring request. In fact,
we should even create a dedicated monitoring analyser.
2009-03-06 19:16:39 +01:00
Willy Tarreau
3d8c5531d8 [OPTIM] freq_ctr: do not rotate the counters when reading
It's easier to take the counter's age into account when consulting it
than to rotate it first. It also saves some CPU cycles and avoids the
multiply for outdated counters, finally saving CPU cycles here too
when multiple operations need to read the same counter.

The freq_ctr code has also shrinked by one third consecutively to these
optimizations.
2009-03-06 14:29:25 +01:00
Willy Tarreau
ec22b2c27a [CLEANUP] remove last references to term_trace
term_trace was very useful while reworking the lower layers but has almost
completely been removed from every place it was referenced. Even the few
remaining ones were not accurate, so it's better to completely remove those
references and re-add them from scratch later if needed.
2009-03-06 13:07:40 +01:00
Willy Tarreau
9279562e2a [BUG] switch server-side stream interface to close in case of abort
In pure TCP mode, there is no response analyser to switch the server-side
stream interface from INI to CLO when the output has been closed after an
abort. This caused sessions to remain indefinitely active when they were
aborted by the client during a TCP content analysis.

The proper action is to switch the stream interface to the CLO state from
INI when we have write enable and shutdown write.
2009-03-06 12:51:23 +01:00
Willy Tarreau
79584225e5 [OPTIM] rate-limit: cleaner behaviour on low rates and reduce consumption
The rate-limit was applied to the smoothed value which does a special
case for frequencies below 2 events per period. This caused irregular
limitations when set to 1 session per second.

The proper way to handle this is to compute the number of remaining
events that can occur without reaching the limit. This is what has
been added. It also has the benefit that the frequency calculation
is now done once when entering event_accept(), before the accept()
loop, and not once per accept() loop anymore, thus saving a few CPU
cycles during very high loads.

With this fix, rate limits of 1/s are perfectly respected.
2009-03-06 09:18:27 +01:00
Willy Tarreau
efcbc6e66d [OPTIM] maintain_proxies: only wake up when the frontend will be ready
It's not needed to try to check the frontend's freq counter every
millisecond, we can precisely compute when to wake up.
2009-03-06 08:27:10 +01:00
Willy Tarreau
bb9251ed8f [BUG] typo in timeout error reporting : report *res and not *err 2009-03-06 08:05:40 +01:00
Willy Tarreau
604e83097f [BUG] interface binding: length must include the trailing zero
The interface length passed to the setsockopt(SO_BINDTODEVICE) must
include the trailing \0. Otherwise it will randomly fail.
2009-03-06 00:48:23 +01:00
Willy Tarreau
3a7d20781d [MEDIUM] implement "rate-limit sessions" for the frontend
The new "rate-limit sessions" statement sets a limit on the number of
new connections per second on the frontend. As it is extremely accurate
(about 0.1%), it is efficient at limiting resource abuse or DoS.
2009-03-05 23:48:25 +01:00
Willy Tarreau
079ff0a207 [MINOR] acl: add 2 new verbs: fe_sess_rate and be_sess_rate
These new ACLs match frontend session rate and backend session rate.
Examples are provided in the doc to explain how to use that in order
to limit abuse of service.
2009-03-05 21:34:28 +01:00
Willy Tarreau
3a8efeb46d [BUG] the "connslots" keyword was matched as "connlots"
This bug has been lying there since the patch got merged.
2009-03-05 21:31:36 +01:00
Willy Tarreau
7f062c4193 [MEDIUM] measure and report session rate on frontend, backends and servers
With this change, all frontends, backends, and servers maintain a session
counter and a timer to compute a session rate over the last second. This
value will be very useful because it varies instantly and can be used to
check thresholds. This value is also reported in the stats in a new "rate"
column.
2009-03-05 18:43:00 +01:00
Willy Tarreau
755905857a [MINOR] add curr_sec_ms and curr_sec_ms_scaled for current second.
Several algorithms will need to know the millisecond value within
the current second. Instead of doing a divide every time it is needed,
it's better to compute it when it changes, which is when now and now_ms
are recomputed.

curr_sec_ms_scaled is the same multiplied by 2^32/1000, which will be
useful to compute some ratios based on the position within last second.
2009-03-05 16:56:16 +01:00
Willy Tarreau
defc52da95 [MINOR] errors dump must use user-visible date, not internal date. 2009-03-04 20:53:44 +01:00
Willy Tarreau
74808cb907 [MEDIUM] implement error dump on unix socket with "show errors"
The new "show errors" command sent on a unix socket will dump
all captured request and response errors for all proxies. It is
also possible to bound the log to frontends and backends whose
ID is passed as an optional parameter.

The output provides information about frontend, backend, server,
session ID, source address, error type, and error position along
with a complete dump of the request or response which has caused
the error.

If a new error scratches the one currently being reported, then
the dump is aborted with a warning message, and processing goes
on to next error.
2009-03-04 15:53:18 +01:00
Willy Tarreau
f073a83b1d [MEDIUM] store a complete dump of request and response errors in proxies
Each proxy instance, either frontend or backend, now has some room
dedicated to storing a complete dated request or response in case
of parsing error. This will make it possible to consult errors in
order to find the exact cause, which is particularly important for
troubleshooting faulty applications.
2009-03-04 10:26:38 +01:00
Willy Tarreau
7552c031c0 [MINOR] ensure that http_msg_analyzer updates pointer to invalid char
If an invalid character is encountered while parsing an HTTP message, we
want to get buf->lr updated to reflect it.

Along this change, a few useless __label__ declarations have been removed
because they caused gcc to consume stack space without putting anything
there.
2009-03-01 11:10:40 +01:00
Willy Tarreau
f49d1df25c [BUG] global.tune.maxaccept must be limited even in mono-process mode
On overloaded systems, it sometimes happens that hundreds or thousands
of incoming connections are queued in the system's backlog, and all get
dequeued at once. The problem is that when haproxy processes them and
does not apply any limit, this can take some time and the internal date
does not progress, resulting in wrong timer measures for all sessions.

The most common effect of this is that all of these sessions report a
large request time (around several hundreds of ms) which is in fact
caused by the time spent accepting other connections. This might happen
on shared systems when the machine swaps.

For this reason, we finally apply a reasonable limit even in mono-process
mode. Accepting 100 connections at once is fast enough for extreme cases
and will not cause that much of a trouble when the system is saturated.
2009-03-01 08:35:41 +01:00
Willy Tarreau
368480cf45 [BUG] the "source" keyword must first clear optional settings
Problem reported by John Lauro. When "source ... usesrc ..." is
set in the defaults section, it is not possible anymore to remove
the "usesrc" part when declaring a more precise "source" in a
backend. The only workaround was to declare it by server.

We need to clear optional settings when declaring a new "source".
The problem was the same with the "interface" declaration.
2009-03-01 08:27:21 +01:00
Willy Tarreau
7b92db4cd5 [BUILD] proto_http did not build on gcc-2.95
move the DPRINTF below the local variable declarations.
2009-02-24 10:48:35 +01:00
Willy Tarreau
38c99bcb98 [BUG] fix unix socket processing of interrupted output
Unix socket processing was still quite buggy. It did not properly
handle interrupted output due to a full response buffer. The fix
mainly consists in not trying to prematurely enable write on the
response buffer, just like the standard session works. This also
gets the unix socket code closer to the standard session code
handling.
2009-02-22 15:58:45 +01:00
Willy Tarreau
fd3828e263 [BUG] fix random memory corruption using "show sess"
Commit 8a5c626e73bac905d150185e45110525588d7b4c introduced the sessions
dump on the unix socket. This implementation is buggy because it may try
to link to the sessions list's head after the last session is removed
with a backref. Also, for the LIST_ISEMPTY test to succeed, we have to
proceed with LIST_INIT after LIST_DEL.
2009-02-22 15:17:24 +01:00
Vincenzo Farruggia
9b97cff1c2 [BUILD] Haproxy won't compile if DEBUG_FULL is defined
As subject when i try to compile haproxy with -DDEBUG_FULL it stop at
stream_sock.c file with:
gcc -Iinclude -Wall -O2 -g     -DDEBUG_FULL  -DTPROXY -DENABLE_POLL
-DENABLE_EPOLL -DENABLE_SEPOLL -DNETFILTER -DUSE_GETSOCKNAME
-DCONFIG_HAPROXY_VERSION=\"1.3.15\"
-DCONFIG_HAPROXY_DATE=\"2008/04/19\" -c -o src/stream_sock.o
src/stream_sock.c
src/stream_sock.c: In function 'stream_sock_chk_rcv':
src/stream_sock.c:905: error: 'fd' undeclared (first use in this function)
src/stream_sock.c:905: error: (Each undeclared identifier is reported only once
src/stream_sock.c:905: error: for each function it appears in.)
src/stream_sock.c:905: error: 'ob' undeclared (first use in this function)
src/stream_sock.c: In function 'stream_sock_chk_snd':
src/stream_sock.c:940: error: 'fd' undeclared (first use in this function)
src/stream_sock.c:940: error: 'ib' undeclared (first use in this function)
make: *** [src/stream_sock.o] Error 1

With this patch all build fine:
2009-02-04 22:46:19 +01:00
Krzysztof Piotr Oledzki
f39c71c981 [CRITICAL] fix server state tracking: it was O(n!) instead of O(n)
Using the wrong operator (&& instead of &) causes DOWN->UP
transition to take longer than it should and to produce a lot of
redundant logs. With typical "track" usage (1-6 tracking servers) it
shouldn't make a big difference but for heavily tracked servers
this bug leads to hang with 100% CPU usage and extremely big
log spam.
2009-02-04 22:39:03 +01:00
Willy Tarreau
0b9c02c861 [MEDIUM] implement bind-process to limit service presence by process
The "bind-process" keyword lets the admin select which instances may
run on which process (in multi-process mode). It makes it easier to
more evenly distribute the load across multiple processes by avoiding
having too many listen to the same IP:ports.
2009-02-04 22:05:05 +01:00
Willy Tarreau
c76721da57 [MEDIUM] add support for source interface binding at the server level
Add support for "interface <name>" after the "source" statement on
the server line.
2009-02-04 20:20:58 +01:00
Willy Tarreau
d53f96b3f0 [MEDIUM] add support for source interface binding
Specifying "interface <name>" after the "source" statement allows
one to bind to a specific interface for proxy<->server traffic.

This makes it possible to use multiple links to reach multiple
servers, and to force traffic to pass via an interface different
from the one the system would have chosen based on the routing
table.
2009-02-04 18:46:54 +01:00
Willy Tarreau
4e30ed73f4 [BUG] inform the user when root is expected but not set
When a plain user runs haproxy as non-root but some options require
root, let's inform him.
2009-02-04 18:02:48 +01:00
Willy Tarreau
5e6e204d1c [MINOR] add support for bind interface name
By appending "interface <name>" to a "bind" line, it is now possible
to specifically bind to a physical interface name. Note that this
currently only works on Linux and requires root privileges.
2009-02-04 17:19:29 +01:00
Willy Tarreau
0a3b9d90d3 [BUG] we must not exit if protocol binding only returns a warning
Right now, protocol binding cannot return a warning, but when this
will happen, we must not exit but just print the warning.
2009-02-04 17:05:23 +01:00
Krzysztof Piotr Oledzki
7b723efca3 [DOC] remove buggy comment for use_backend
"early blocking based on ACLs" is definitely wrong here
2009-01-27 21:30:31 +01:00
Krzysztof Piotr Oledzki
52d522b566 [BUG] Fix listen & more of 2 couples <ip>:<port>
Fix "listen www-mutualise 80.248.x.y1:80,80.248.x.y2:80,80.248.x.y3:80":

[ALERT] 309/161509 (15450) : Invalid server address: '80.248.x.y1:80,80.248.x.y2'
[ALERT] 309/161509 (15450) : Error reading configuration file : /etc/haproxy/haproxy.cfg

Bug reported by Laurent Dolosor.
2009-01-27 21:00:18 +01:00
Willy Tarreau
3ab68cf0ae [MEDIUM] splice: add the global "nosplice" option
Setting "nosplice" in the global section will disable the use of TCP
splicing (both tcpsplice and linux 2.6 splice). The same will be
achieved using the "-dS" parameter on the command line.
2009-01-25 16:03:28 +01:00
Willy Tarreau
43b78999ec [MEDIUM] move global tuning options to the global structure
The global tuning options right now only concern the polling mechanisms,
and they are not in the global struct itself. It's not very practical to
add other options so let's move them to the global struct and remove
types/polling.h which was not used for anything else.
2009-01-25 15:42:27 +01:00
Willy Tarreau
686ac828fa [OPTIM] make global.maxpipes default to global.maxconn/4 when not specified
global.maxconn/4 seems to be a good hint for global.maxpipes when that
one must be guessed. If the limit is reached, it's still possible to
set it manually in the configuration.
2009-01-25 14:06:58 +01:00
Willy Tarreau
a206fa9d5d [STATS] report pipe usage in the statistics
Pipe usage is reported in info and web stats including maxpipes, pipes_free,
and pipes_used.
2009-01-25 14:02:00 +01:00
Willy Tarreau
3eba98aa57 [MEDIUM] splice: make use of pipe pools
Using pipe pools makes pipe management a lot easier. It also allows to
remove quite a bunch of #ifdefs in areas which depended on the presence
or not of support for kernel splicing.

The buffer now holds a pointer to a pipe structure which is always NULL
except if there are still data in the pipe. When it needs to use that
pipe, it dynamically allocates it from the pipe pool. When the data is
consumed, the pipe is immediately released.

That way, there is no need anymore to care about pipe closure upon
session termination, nor about pipe creation when trying to use
splice().

Another immediate advantage of this method is that it considerably
reduces the number of pipes needed to use splice(). Tests have shown
that even with 0.2 pipe per connection, almost all sessions can use
splice(), because the same pipe may be used by several consecutive
calls to splice().
2009-01-25 13:56:13 +01:00
Willy Tarreau
982b6e37e4 [MEDIUM] introduce pipe pools
A new data type has been added : pipes. Some pre-allocated empty pipes
are maintained in a pool for users such as splice which use them a lot
for very short times.

Pipes are allocated using get_pipe() and released using put_pipe().
Pipes which are released with pending data are immediately killed.
The struct pipe is small (16 to 20 bytes) and may even be further
reduced by unifying ->data and ->next.

It would be nice to have a dedicated cleanup task which would watch
for the pipes usage and destroy a few of them from time to time.
2009-01-25 13:49:53 +01:00
Willy Tarreau
98b306be65 [MEDIUM] splice: add hints to support older buggy kernels
Kernels before 2.6.27.13 would have splice() return EAGAIN on shutdown.
By adding a few tricks, we can deal with the situation. If splice()
returns EAGAIN and the pipe is empty, then fallback to recv() which
will be able to check if it's an end of connection or not.

The advantage of this method is that it remains transparent for good
kernels since there is no reason that epoll() will return EPOLLIN
without anything to read, and even if it would happen, the recv()
overhead on this check is minimal.
2009-01-25 11:11:32 +01:00
Willy Tarreau
afb4876778 [BUG] reserve some pipes for backends with splice enabled
If splicing is enabled in a backend, we need to guess how many
pipes will be needed. We used to rely on fullconn, but this leads
to non-working splicing when fullconn is not specified. So we now
fallback to global.maxconn.
2009-01-25 10:42:05 +01:00
Willy Tarreau
5bd8c376ad [MAJOR] complete support for linux 2.6 kernel splicing
This code provides support for linux 2.6 kernel splicing. This feature
appeared in kernel 2.6.25, but initial implementations were awkward and
buggy. A kernel >= 2.6.29-rc1 is recommended, as well as some optimization
patches.

Using pipes, this code is able to pass network data directly between
sockets. The pipes are a bit annoying to manage (fd creation, release,
...) but finally work quite well.

Preliminary tests show that on high bandwidths, there's a substantial
gain (approx +50%, only +20% with kernel workarounds for corruption
bugs). With 2000 concurrent connections, with Myricom NICs, haproxy
now more easily achieves 4.5 Gbps for 1 process and 6 Gbps for two
processes buffers. 8-9 Gbps are easily reached with smaller numbers
of connections.

We also try to splice out immediately after a splice in by making
profit from the new ability for a data producer to notify the
consumer that data are available. Doing this ensures that the
data are immediately transferred between sockets without latency,
and without having to re-poll. Performance on small packets has
considerably increased due to this method.

Earlier kernels return only one TCP segment at a time in non-blocking
splice-in mode, while newer return as many segments as may fit in the
pipe. To work around this limitation without hurting more recent kernels,
we try to collect as much data as possible, but we stop when we believe
we have read 16 segments, then we forward everything at once. It also
ensures that even upon shutdown or EAGAIN the data will be forwarded.

Some tricks were necessary because the splice() syscall does not make
a difference between missing data and a pipe full, it always returns
EAGAIN. The trick consists in stop polling in case of EAGAIN and a non
empty pipe.

The receiver waits for the buffer to be empty before using the pipe.
This is in order to avoid confusion between buffer data and pipe data.
The BF_EMPTY flag now covers the pipe too.

Right now the code is disabled by default. It needs to be built with
CONFIG_HAP_LINUX_SPLICE, and the instances intented to use splice()
must have "option splice-response" (or option splice-request) enabled.

It is probably desirable to keep a pool of pre-allocated pipes to
avoid having to create them for every session. This will be worked
on later.

Preliminary tests show very good results, even with the kernel
workaround causing one memcpy(). At 3000 connections, performance
has moved from 3.2 Gbps to 4.7 Gbps.
2009-01-19 00:32:22 +01:00
Willy Tarreau
6b4aad4c1b [MEDIUM] add definitions for Linux kernel splicing
Some older libc don't define the splice() syscall, and some even
define a wrong one. For this reason, we try our best to declare
it correctly. These definitions still work with recent glibc.
2009-01-18 21:59:13 +01:00
Willy Tarreau
259de1b702 [MINOR] introduce structures required to support Linux kernel splicing
When CONFIG_HAP_LINUX_SPLICE is defined, the buffer structure will be
slightly enlarged to support information needed for kernel splicing
on Linux.

A first attempt consisted in putting this information into the stream
interface, but in the long term, it appeared really awkward. This
version puts the information into the buffer. The platform-dependant
part is conditionally added and will only enlarge the buffers when
compiled in.

One new flag has also been added to the buffers: BF_KERN_SPLICING.
It indicates that the application considers it is appropriate to
use splicing to forward remaining data.
2009-01-18 21:56:21 +01:00