balance hdr(<name>) provides on option 'use_domain_only' to match only the
domain part in a header (designed for the Host header).
Olivier Fredj reported that the hashes were not the same for
'subdomain.domain.tld' and 'domain.tld'.
This is because the pointer was rewinded one step to far, resulting in a hash
calculated against wrong values :
- '.domai' for 'subdomain.domain.tld'
- ' domai' for 'domain.tld' (beginning with the space in the header line)
Another special case is when no dot can be found in the header : the hash will
be calculated against an empty string.
The patch addresses both cases : 'domain' will be used to compute the hash for
'subdomain.domain.tld', 'domain.tld' and 'domain' (using the whole header value
for the last case).
The fix must be backported to haproxy 1.5 and 1.4.
Since commit 3dd6a25 ("MINOR: stream-int: retrieve session pointer from
stream-int"), we can get the session from the task, so let's get rid of
this less obvious function.
Add some documentation about the environment variables available with
"external-check command". Currently, only one of them is dynamically updated
on each check : HAPROXY_SERVER_CURCONN.
commit 9ede66b0 introduced an environment variable (HAPROXY_SERVER_CURCONN) that
was supposed to be dynamically updated, but it was set only once, during its
initialization.
Most of the code provided in this previous patch has been rewritten in order to
easily update the environment variables without reallocating memory during each
check.
Now, HAPROXY_SERVER_CURCONN will contain the current number of connections on
the server at the time of the check.
This is the equivalent of eb9fd51 ("OPTIM: stream_sock: reduce the amount
of in-flight spliced data") whose purpose is to try to immediately send
spliced data if available.
bi_swpbuf() swaps the buffer passed in argument with the one attached to
the channel, but only if this last one is empty. The idea is to avoid a
copy when buffers can simply be swapped.
This setting is used to limit memory usage without causing the alloc
failures caused by "-m". Unexpectedly, tests have shown a performance
boost of up to about 18% on HTTP traffic when limiting the number of
buffers to about 10% of the amount of concurrent connections.
tune.buffers.limit <number>
Sets a hard limit on the number of buffers which may be allocated per process.
The default value is zero which means unlimited. The minimum non-zero value
will always be greater than "tune.buffers.reserve" and should ideally always
be about twice as large. Forcing this value can be particularly useful to
limit the amount of memory a process may take, while retaining a sane
behaviour. When this limit is reached, sessions which need a buffer wait for
another one to be released by another session. Since buffers are dynamically
allocated and released, the waiting time is very short and not perceptible
provided that limits remain reasonable. In fact sometimes reducing the limit
may even increase performance by increasing the CPU cache's efficiency. Tests
have shown good results on average HTTP traffic with a limit to 1/10 of the
expected global maxconn setting, which also significantly reduces memory
usage. The memory savings come from the fact that a number of connections
will not allocate 2*tune.bufsize. It is best not to touch this value unless
advised to do so by an haproxy core developer.
Used in conjunction with the dynamic buffer allocator.
tune.buffers.reserve <number>
Sets the number of buffers which are pre-allocated and reserved for use only
during memory shortage conditions resulting in failed memory allocations. The
minimum value is 2 and is also the default. There is no reason a user would
want to change this value, it's mostly aimed at haproxy core developers.
We've already experimented with three wake up algorithms when releasing
buffers : the first naive one used to wake up far too many sessions,
causing many of them not to get any buffer. The second approach which
was still in use prior to this patch consisted in waking up either 1
or 2 sessions depending on the number of FDs we had released. And this
was still inaccurate. The third one tried to cover the accuracy issues
of the second and took into consideration the number of FDs the sessions
would be willing to use, but most of the time we ended up waking up too
many of them for nothing, or deadlocking by lack of buffers.
This patch completely removes the need to allocate two buffers at once.
Instead it splits allocations into critical and non-critical ones and
implements a reserve in the pool for this. The deadlock situation happens
when all buffers are be allocated for requests pending in a maxconn-limited
server queue, because then there's no more way to allocate buffers for
responses, and these responses are critical to release the servers's
connection in order to release the pending requests. In fact maxconn on
a server creates a dependence between sessions and particularly between
oldest session's responses and latest session's requests. Thus, it is
mandatory to get a free buffer for a response in order to release a
server connection which will permit to release a request buffer.
Since we definitely have non-symmetrical buffers, we need to implement
this logic in the buffer allocation mechanism. What this commit does is
implement a reserve of buffers which can only be allocated for responses
and that will never be allocated for requests. This is made possible by
the requester indicating how much margin it wants to leave after the
allocation succeeds. Thus it is a cooperative allocation mechanism : the
requester (process_session() in general) prefers not to get a buffer in
order to respect other's need for response buffers. The session management
code always knows if a buffer will be used for requests or responses, so
that is not difficult :
- either there's an applet on the initiator side and we really need
the request buffer (since currently the applet is called in the
context of the session)
- or we have a connection and we really need the response buffer (in
order to support building and sending an error message back)
This reserve ensures that we don't take all allocatable buffers for
requests waiting in a queue. The downside is that all the extra buffers
are really allocated to ensure they can be allocated. But with small
values it is not an issue.
With this change, we don't observe any more deadlocks even when running
with maxconn 1 on a server under severely constrained memory conditions.
The code becomes a bit tricky, it relies on the scheduler's run queue to
estimate how many sessions are already expected to run so that it doesn't
wake up everyone with too few resources. A better solution would probably
consist in having two queues, one for urgent requests and one for normal
requests. A failed allocation for a session dealing with an error, a
connection event, or the need for a response (or request when there's an
applet on the left) would go to the urgent request queue, while other
requests would go to the other queue. Urgent requests would be served
from 1 entry in the pool, while the regular ones would be served only
according to the reserve. Despite not yet having this, it works
remarkably well.
This mechanism is quite efficient, we don't perform too many wake up calls
anymore. For 1 million sessions elapsed during massive memory contention,
we observe about 4.5M calls to process_session() compared to 4.0M without
memory constraints. Previously we used to observe up to 16M calls, which
rougly means 12M failures.
During a test run under high memory constraints (limit enforced to 27 MB
instead of the 58 MB normally needed), performance used to drop by 53% prior
to this patch. Now with this patch instead it *increases* by about 1.5%.
The best effect of this change is that by limiting the memory usage to about
2/3 to 3/4 of what is needed by default, it's possible to increase performance
by up to about 18% mainly due to the fact that pools are reused more often
and remain hot in the CPU cache (observed on regular HTTP traffic with 20k
objects, buffers.limit = maxconn/10, buffers.reserve = limit/2).
Below is an example of scenario which used to cause a deadlock previously :
- connection is received
- two buffers are allocated in process_session() then released
- one is allocated when receiving an HTTP request
- the second buffer is allocated then released in process_session()
for request parsing then connection establishment.
- poll() says we can send, so the request buffer is sent and released
- process session gets notified that the connection is now established
and allocates two buffers then releases them
- all other sessions do the same till one cannot get the request buffer
without hitting the margin
- and now the server responds. stream_interface allocates the response
buffer and manages to get it since it's higher priority being for a
response.
- but process_session() cannot allocate the request buffer anymore
=> We could end up with all buffers used by responses so that none may
be allocated for a request in process_session().
When the applet processing leaves the session context, the test will have
to be changed so that we always allocate a response buffer regardless of
the left side (eg: H2->H1 gateway). A final improvement would consists in
being able to only retry the failed I/O operation without waking up a
task, but to date all experiments to achieve this have proven not to be
reliable enough.
A session doesn't need buffers all the time, especially when they're
empty. With this patch, we don't allocate buffers anymore when the
session is initialized, we only allocate them in two cases :
- during process_session()
- during I/O operations
During process_session(), we try hard to allocate both buffers at once
so that we know for sure that a started operation can complete. Indeed,
a previous version of this patch used to allocate one buffer at a time,
but it can result in a deadlock when all buffers are allocated for
requests for example, and there's no buffer left to emit error responses.
Here, if any of the buffers cannot be allocated, the whole operation is
cancelled and the session is added at the tail of the buffer wait queue.
At the end of process_session(), a call to session_release_buffers() is
done so that we can offer unused buffers to other sessions waiting for
them.
For I/O operations, we only need to allocate a buffer on the Rx path.
For this, we only allocate a single buffer but ensure that at least two
are available to avoid the deadlock situation. In case buffers are not
available, SI_FL_WAIT_ROOM is set on the stream interface and the session
is queued. Unused buffers resulting either from a successful send() or
from an unused read buffer are offered to pending sessions during the
->wake() callback.
When a session_alloc_buffers() fails to allocate one or two buffers,
it subscribes the session to buffer_wq, and waits for another session
to release buffers. It's then removed from the queue and woken up with
TASK_WAKE_RES, and can attempt its allocation again.
We decide to try to wake as many waiters as we release buffers so
that if we release 2 and two waiters need only once, they both have
their chance. We must never come to the situation where we don't wake
enough tasks up.
It's common to release buffers after the completion of an I/O callback,
which can happen even if the I/O could not be performed due to half a
failure on memory allocation. In this situation, we don't want to move
out of the wait queue the session that was just added, otherwise it
will never get any buffer. Thus, we only force ourselves out of the
queue when freeing the session.
Note: at the moment, since session_alloc_buffers() is not used, no task
is subscribed to the wait queue.
This patch introduces session_alloc_recv_buffer(), session_alloc_buffers()
and session_release_buffers() whose purpose will be to allocate missing
buffers and release unneeded ones around the process_session() and during
I/O operations.
I/O callbacks only need a single buffer for recv operations and none
for send. However we still want to ensure that we don't pick the last
buffer. That's what session_alloc_recv_buffer() is for.
This allocator is atomic in that it always ensures we can get 2 buffers
or fails. Here, if any of the buffers is not ready and cannot be
allocated, the operation is cancelled. The purpose is to guarantee that
we don't enter into the deadlock where all buffers are allocated by the
same size of all sessions.
A queue will have to be implemented for failed allocations. For now
they're just reported as failures.
This function is used to allocate a buffer and ensure that we leave
some margin after it in the pool. The function is not obvious. While
we allocate only one buffer, we want to ensure that at least two remain
available after our allocation. The purpose is to ensure we'll never
enter a deadlock where all sessions allocate exactly one buffer, and
none of them will be able to allocate the second buffer needed to build
a response in order to release the first one.
We also take care of remaining fast in the the fast path by first
checking whether or not there is enough margin, in which case we only
rely on b_alloc_fast() which is guaranteed to succeed. Otherwise we
take the slow path using pool_refill_alloc().
This function allocates a buffer and replaces *buf with this buffer. If
no memory is available, &buf_wanted is used instead. No control is made
to check if *buf already pointed to another buffer. The allocated buffer
is returned, or NULL in case no memory is available. The difference with
b_alloc() is that this function only picks from the pool and never calls
malloc(), so it can fail even if some memory is available. It is the
caller's job to refill the buffer pool if needed.
We'll soon want to release buffers together upon failure so we need to
allocate them after the channels. Let's change this now. There's no
impact on the behaviour, only the error path is unrolled slightly
differently. The same was done in peers.
Till now we'd consider a buffer full even if it had size==0 due to pointing
to buf.size. Now we change this : if buf_wanted is present, it means that we
have already tried to allocate a buffer but failed. Thus the buffer must be
considered full so that we stop trying to poll for reads on it. Otherwise if
it's empty, it's buf_empty and we report !full since we may allocate it on
the fly.
Doing so ensures that even when no memory is available, we leave the
channel in a sane condition. There's a special case in proto_http.c
regarding the compression, we simply pre-allocate the tmpbuf to point
to the dummy buffer. Not reusing &buf_empty for this allows the rest
of the code to differenciate an empty buffer that's not used from an
empty buffer that results from a failed allocation which has the same
semantics as a buffer full.
Channels are now created with a valid pointer to a buffer before the
buffer is allocated. This buffer is a global one called "buf_empty" and
of size zero. Thus it prevents any activity from being performed on
the buffer and still ensures that chn->buf may always be dereferenced.
b_free() also resets the buffer to &buf_empty, and was split into
b_drop() which does not reset the buffer.
We don't call pool_free2(pool2_buffers) anymore, we only call b_free()
to do the job. This ensures that we can start to centralize the releasing
of buffers.
It's not clean to initialize the buffer before the channel since it
dereferences one pointer in the channel. Also we'll want to let the
channel pre-initialize the buffer, so let's ensure that the channel
is always initialized prior to the buffers.
b_alloc() now allocates a buffer and initializes it to the size specified
in the pool minus the size of the struct buffer itself. This ensures that
callers do not need to care about buffer details anymore. Also this never
applies memory poisonning, which is slow and useless on buffers.
We'll soon need to be able to switch buffers without touching the
channel, so let's move buffer initialization out of channel_init().
We had the same in compressoin.c.
Till now this function would only allocate one entry at a time. But with
dynamic buffers we'll like to allocate the number of missing entries to
properly refill the pool.
Let's modify it to take a minimum amount of available entries. This means
that when we know we need at least a number of available entries, we can
ask to allocate all of them at once. It also ensures that we don't move
the pointers back and forth between the caller and the pool, and that we
don't call pool_gc2() for each failed malloc. Instead, it's called only
once and the malloc is only allowed to fail once.
pool_alloc2() used to pick the entry from the pool, fall back to
pool_refill_alloc(), and to perform the poisonning itself, which
pool_refill_alloc() was also doing. While this led to optimal
code size, it imposes memory poisonning on the buffers as well,
which is extremely slow on large buffers.
This patch cuts the allocator in 3 layers :
- a layer to pick the first entry from the pool without falling back to
pool_refill_alloc() : pool_get_first()
- a layer to allocate a dirty area by falling back to pool_refill_alloc()
but never performing the poisonning : pool_alloc_dirty()
- pool_alloc2() which calls the latter and optionally poisons the area
No functional changes were made.
Remove the code dealing with the old dual-linked lists imported from
librt that has remained unused for the last 8 years. Now everything
uses the linux-like circular lists instead.
In zlib we track memory usage. The problem is that the way alloc_zlib()
and free_zlib() account for memory is different, resulting in variations
that can lead to negative zlib_mem being reported. The alloc() function
uses the requested size while the free() function uses the pool size. The
difference can happen when pools are shared with other pools of similar
size. The net effect is that zlib_mem can be reported negative with a
slowly decreasing count, and over the long term the limit will not be
enforced anymore.
The fix is simple : let's use the pool size in both cases, which is also
the exact value when it comes to memory usage.
This fix must be backported to 1.5.
create_server_socket() used to dereference objt_server(conn->target),
but if the target is not a server (eg: a proxy) then it's NULL and we
get a segfault. This can be reproduced with a proxy using "dispatch"
with no server, even when namespaces are disabled, because that code
is not #ifdef'd. The fix consists in first checking if the target is
a server.
This fix does not need to be backported, this is 1.6-only.
There's a long-standing bug in pool_gc2(). It tries to protect the pool
against releasing of too many entries but the formula is wrong as it
compares allocated to minavail instead of (allocated-used) to minavail.
Under memory contention, it ends up releasing more than what is granted
by minavail and causes trouble to the dynamic buffer allocator.
This bug is in fact major by itself, but since minavail has never been
used till now, there is no impact at least in mainline. A backport to
1.5 is desired anyway in case any future backport or out-of-tree patch
relies on this.
In stream_int_register_handler(), we call si_alloc_appctx(si) but as a
mistake, instead of checking the return value for a NULL, we test <si>.
This bug was discovered under extreme memory contention (memory for only
two buffers with 500 connections waiting) and after 3 million failed
connections. While it was very hard to produce it, the fix is tagged
major because in theory it could happen when haproxy runs with a very
low "-m" setting preventing from allocating just the few bytes needed
for an appctx. But most users will never be able to trigger it. The
fix was confirmed to address the bug.
This fix must be backported to 1.5.
The path "MAJOR: namespace: add Linux network namespace support" doesn't
permit to use internal data producer like a "peers synchronisation"
system. The result is a segfault when the internal application starts.
This patch fix the commit b3e54fe387
It is introduced in 1.6dev version, it doesn't need to be backported.
By convention, the HAProxy CLI doesn't return message if the opration
is sucessfully done. The MAP and ACL returns the "Done." message, an
its noise the output during big MAP or ACL injection.
Immo Goltz reported a case of segfault while parsing the config where
we try to propagate processes across stopped frontends (those with a
"disabled" statement). The fix is trivial. The workaround consists in
commenting out these frontends, although not always easy.
This fix must be backported to 1.5.
propagate_processes() has a typo in a condition :
if (!from->cap & PR_CAP_FE)
return;
The return is never taken because each proxy has at least one capability
so !from->cap always evaluates to zero. Most of the time the caller already
checks that <from> is a frontend. In the cases where it's not tested
(use_backend, reqsetbe), the rules have been checked for the context to
be a frontend as well, so in the end it had no nasty side effect.
This should be backported to 1.5.
Since during parsing stage, curproxy always represents a proxy to be operated,
it should be a mistake by referring proxy.
Signed-off-by: Godbach <nylzhaowei@gmail.com>
Willy: commit f2e8ee2b introduced an optimization in the old speculative epoll
code, which implemented its own event cache. It was needed to store that many
events (it was bound to maxsock/4 btw). Now the event cache lives on its own
and we don't need this anymore. And since events are allocated on the kernel
side, we only need to allocate the events we want to return.
As a result, absmaxevents will be not used anymore. Just remove the definition
and the comment of it, replace it with global.tune.maxpollevents. It is also an
optimization of memory usage for large amounts of sockets.
Signed-off-by: Godbach <nylzhaowei@gmail.com>
random() will generate a number between 0 and RAND_MAX. POSIX mandates
RAND_MAX to be at least 32767. GNU libc uses (1<<31 - 1) as
RAND_MAX.
In smp_fetch_rand(), a reduction is done with a multiply and shift to
avoid skewing the results. However, the shift was always 32 and hence
the numbers were not distributed uniformly in the specified range. We
fix that by dividing by RAND_MAX+1. gcc is smart enough to turn that
into a shift:
0x000000000046ecc8 <+40>: shr $0x1f,%rax
Since commit 656c5fa7e8 ("BUILD: ssl: disable OCSP when using
boringssl) the OCSP code is bypassed when OPENSSL_IS_BORINGSSL
is defined. The correct thing to do here is to use OPENSSL_NO_OCSP
instead, which is defined for this exact purpose in
openssl/opensslfeatures.h.
This makes haproxy forward compatible if boringssl ever introduces
full OCSP support with the additional benefit that it links fine
against a OCSP-disabled openssl.
Signed-off-by: Lukas Tribus <luky-37@hotmail.com>
Using "option tcp-checks" without any rule is different from not using
it at all in that checks are sent with the TCP quick ack mode enabled,
causing servers to log incoming port probes.
This commit fixes this behaviour by disabling quick-ack on tcp-checks
unless the next rule exists and is an expect. All combinations were
tested and now the behaviour is as expected : basic port probes are
now doing a SYN-SYN/ACK-RST sequence.
This fix must be backported to 1.5.
If "option tcp-check" is used and no "tcp-check" rule is specified, we
only look at rule->action which dereferences the proxy's memory and which
can randomly match TCPCHK_ACT_CONNECT or whatever else, causing a check
to fail. This bug is the result of an incorrect fix attempted in commit
f621bea ("BUG/MINOR: tcpcheck connect wrong behavior").
This fix must be backported into 1.5.
tcp_check_main() would condition the polling for writes on check->type,
but this is absurd given that check->type == PR_O2_TCPCHK_CHK since this
is the only way we can get there! This patch removes this confusing test.
The external command accepted 4 arguments, some with the value "NOT_USED" when
not applicable. In order to make the exernal command more generic, this patch
also provides the values in environment variables. This allows to provide more
information.
Currently, the supported environment variables are :
PATH, as previously provided.
HAPROXY_PROXY_NAME, the backend name
HAPROXY_PROXY_ID, the backend id
HAPROXY_PROXY_ADDR, the first bind address if available (or empty)
HAPROXY_PROXY_PORT, the first bind port if available (or empty)
HAPROXY_SERVER_NAME, the server name
HAPROXY_SERVER_ID, the server id
HAPROXY_SERVER_ADDR, the server address
HAPROXY_SERVER_PORT, the server port if available (or empty)
HAPROXY_SERVER_MAXCONN, the server max connections
HAPROXY_SERVER_CURCONN, the current number of connections on the server