In order to manage "If-Modified-Since" requests, we need to keep a
reference time for our cache entries (to which the conditional request's
date will be compared).
This reference is either extracted from the "Last-Modified" header, or
the "Date" header, or the reception time of the response (in decreasing
order of priority).
The date values are converted into seconds since epoch in order to ease
comparisons and to limit storage space.
BorinSSL pretends to be 1.1.1 version of OpenSSL. It messes some
version based feature presense checks. For example, OpenSSL specific
early data support.
Let us change that feature detction to SSL_READ_EARLY_DATA_SUCCESS
macro check instead of version comparision.
Previous commit ae32ac74db ("BUG/MINOR: log: fix memory leak on logsrv
parse error") addressed one issue and introduced another one, the logsrv
pointer may also be null at the end of the function so we must test it
before deciding to dereference it.
This should be backported along with the patch above to 2.2.
In case of parsing error on logsrv, we can leave parse_logsrv() without
releasing logsrv->ring_name or smp_rgs. Let's free them on the error path.
This should fix issue #926 detected by Coverity.
The impact is only a tiny leak just before reporting a fatal error, so it
will essentially annoy valgrind.
This can be backported to 2.0 (just drop the ring part).
It's a regression from b3201a3e "BUG/MINOR: disable dynamic OCSP load
with BoringSSL". The origin bug is link to 76b4a12 "BUG/MEDIUM: ssl:
memory leak of ocsp data at SSL_CTX_free()": ssl_sock_free_ocsp()
shoud be in #ifndef OPENSSL_IS_BORINGSSL.
To avoid long #ifdef for small code, the BoringSSL part for ocsp load
is isolated in a simple #ifdef.
This must be backported in 2.2 and 2.1
`att_beg` is assigned to `next` at the end of the `for` loop, but is
assigned to `prev` at the beginning of the loop, which is itself
assigned to `next` after each loop. So it represents a double
assignation for the same value. Also `att_beg` is not used after the end
of the loop.
this is a partial fix for github issue #923, all the others could
probably be marked as intentional to protect future changes.
no backport needed.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
Issue #910 reports that we fail to check a few extchk_setenv() in the
child process. These are mostly harmless, but instead of counting on
the external check script to fail the dirty way, better fail cleanly
when detecting the failure.
This could probably be backported to all stable branches.
As reported by Coverity in issue #917, commit 96bca33 ("OPTIM: queue:
decrement the nbpend and totpend counters outside of the lock")
introduced a bug when moving the increments outside of the loop,
because we can't always rely on the pendconn "p" here as it may
be null. We can retrieve the proxy pointer directly from s->proxy
instead. The same is true for pendconn_redistribute(), though the
last "p" pointer there was still valid. This patch fixes both.
No backport is needed, this was introduced just before 2.3-dev8.
The "weight" column on the stats page is somewhat confusing when using
slowstart becaue it reports the effective weight, without being really
explicit about it. In some situations the user-configured weight is more
relevant (especially with long slowstarts where it's important to know
if the configured weight is correct).
This adds a new uweight stat which reports a server's user-configured
weight, and in a backend it receives the sum of all servers' uweights.
In addition it adds the mention of "effective" in a few descriptions
for the "weight" column (help and doc).
As a result, the list of servers in a backend is now always scanned
when dumping the stats. But this is not a problem given that these
servers are already scanned anyway and for way heavier processing.
In order to be compatible with the "set ssl cert" command of the CLI,
this patch restrict the ssl-load-extra-del-ext to files with a ".crt"
extension in the configuration.
Related to issue #785.
Should be backported where 8e8581e ("MINOR: ssl: 'ssl-load-extra-del-ext'
removes the certificate extension") was backported.
When dumping the stats page (or the CSV output), when many states are
mixed, it's hard to figure the number of up servers. But when showing
only the "up" servers or hiding the "maint" servers, there's no way to
know how many servers are configured, which is problematic when trying
to update server-templates.
What this patch does, for dumps in "up" or "no-maint" modes, is to add
after the backend's "UP" or "DOWN" state "(%d/%d)" indicating the number
of servers seen as UP to the total number of servers in the backend. As
such, seeing "UP (33/39)" immediately tells that there are 6 servers that
are not listed when using "up", or will let the client figure how many
servers are left once deducted the number of non-maintenance ones. It's
not done on default dumps so as not to disturb existing tools, which
already have all the information they need in the dump.
"no-maint" is a bit similar to "up" except that it will only hide
servers that are in maintenance (or disabled in the configuration), and
not those that are enabled but failed a check. One benefit here is to
significantly reduce the output of the "show stat" command when using
large server-templates containing entries that are not yet provisioned.
Note that the prometheus exporter also has such an option which does
the exact same.
We already had it on the HTTP interface but it was not accessible on the
CLI. It can be very convenient to hide servers which are down, do not
resolve, or are in maintenance.
Leastconn has the nice propery of being able to sort servers by their
current usage. It's really a shame to force all requests into the backend
queue when the algo would be able to also consider their current queue.
In order not to change existing behavior but extend it, this patch allows
leastconn to elect servers which are already full if they have an explicitly
configured maxqueue setting above zero and their queue hasn't reached that
threshold. This will significantly reduce the pressure in the backend queue
when queuing a lot with lots of servers.
A test on 8 threads with 100 servers configured with maxconn 1 jumped
from 165krps to 330krps with maxqueue 15 with this patch.
This partially undoes commit 82cd5c13a ("OPTIM: backend: skip LB when we
know the backend is full") but allows to scale much better even by setting
a single-digit maxqueue value. Some better heuristics could be used to
maintain the behavior of the bypass in the patch above, consisting in
keeping it if it's known that there is no server with a configured
maxqueue in the farm (or in the backend).
When servers are queued into the leastconn tree, it's important to also
consider their queue length. There could be some servers with lots of
queued requests that we don't want to hammer with extra connections. In
order not to add extra stress to the LB algorithm, we don't update the
value when adding to the queue, only when updating the connection count
(i.e. picking from the queue or releasing a connection). This will be
sufficient to significantly improve the fairness in such situations.
We don't need to do that inside the lock. However since the operation
used to be done in deep functions, we have to make it resurface closer
to visible parts. It remains reasonably self-contained in queue.c so
that's not that big of a deal. Some places (redistribute) could benefit
from a single operation for all counts at once. Others like
pendconn_process_next_strm() are still called with both locks held but
now it will be possible to change this.
Instead of incrementing, decrementing them and updating their max under
the lock, make them atomic and keep them out of the lock as much as
possible. For __pendconn_unlink_* it would be wide to decide to move
these counters outside of the function, inside the callers so that a
single atomic op can be done per counter even for groups of operations.
Similarly to previous changes, we know if we're dealing with a server
or proxy lock so let's directly lock at the finest possible places
there. It's worth noting that a part of the operation consisting in
an increment and update of a max could be done outside of the lock
using atomic ops and a CAS.
The function is called with the lock held and does too many tests for
things that are already known from its callers. Let's split it in two
so that its callers call either the per-server or per-proxy function
depending on where the element is (since they had to determine it
prior to taking the lock).
No need to use an exclusive lock on the proxy anymore when reading its
setting, a read lock is enough. A few other places continue to use a
write-lock when modifying simple flags only in order to let this
function see a consistent value all along. This might be changed in
the future using barriers and local copies.
This is an anticipation of finer grained locking for the queues. For now
all lock places take a write lock so that there is no difference at all
with previous code.
In h2_send(), if we are in a state where we know it is no longer possible to
send data, we must exit the sending loop to avoid any possiblity to loop
forever. It may happen if the mbuf ring is released while the H2_CF_MUX_MFULL
flag is still set. Here is a possible scenario to trigger the bug :
1) The mbuf ring is full because we are unable to send data. The
H2_CF_MUX_MFULL flag is set on the H2 connection.
2) At this stage, the task timeout expires because the H2 connection is
blocked. We enter in h2_timeout_task() function. Because the mbuf ring is
full, we cannot send the GOAWAY frame. Thus the H2_CF_GOAWAY_FAILED flag is
set. The H2 connection is not released yet because there is still a stream
attached. Here we leave h2_timeout_task() function.
3) A bit later, the H2 connection is woken up. If h2_process(), nothing is
performed by the first attempt to send data, in h2_send(). Then, because
the H2_CF_GOAWAY_FAILED flag is set, the mbuf ring is released. But the
H2_CF_MUX_MFULL flag is still there. At this step a second attempt to send
data is performed.
4) In h2_send(), we try to send data in a loop. To exist this loop, done
variable must be set to 1. Because the H2_CF_MUX_MFULL flag is set, we
don't call h2_process_mux() and done is not updated. Because the mbuf ring
is now empty, nothing is sent and the H2_CF_MUX_MFULL flag is never
removed. Now, we loop forever... waiting for the watchdog.
To fix the bug, we now exit the loop if one of these conditions is true :
- The H2_CF_GOAWAY_FAILED flag is set on the H2 connection
- The CO_FL_SOCK_WR_SH flag is set on the underlying connection
- The H2 connection is in the H2_CS_ERROR2 state
This patch should fix the issue #912 and most probably #875. It must be
backported as far as the 1.8.
When an internal response is returned to a client, the message payload must be
skipped if it is a reply to a HEAD request. The payload is removed from the HTX
message just before the message forwarding.
This bugs has been around for a long time. It was already there in the pre-HTX
versions. In legacy HTTP mode, internal errors are not parsed. So this bug
cannot be easily fixed. Thus, this patch should only be backported in all HTX
versions, as far as 2.0. However, the code has significantly changed in the
2.2. Thus in the 2.1 and 2.0, the patch must be entirely reworked.
Partial support of conditional HTTP requests. This commit adds the
support of the 'If-None-Match' header (see RFC 7232#3.2).
When a client specifies a list of ETags through one or more
'If-None-Match' headers, they are all compared to the one that might have
been stored in the corresponding http cache entry until one of them
matches.
If a match happens, a specific "304 Not Modified" response is
sent instead of the cached data. This response has all the stored
headers but no other data (see RFC 7232#4.1). Otherwise, the whole cached data
is sent.
Although unlikely in a GET/HEAD request, the "If-None-Match: *" syntax is
valid and also receives a "304 Not Modified" response (RFC 7434#4.3.2).
This resolves a part of GitHub issue #821.
When sent by a server for a given resource, the ETag header is
stored in the coresponding cache entry (as any other header). So in
order to perform future ETag comparisons (for subsequent conditional
HTTP requests), we keep the length of the ETag and its offset
relative to the start of the cache_entry.
If no ETag header exists, the length and offset are zero.
Add a function that compares two etags that might be of different types.
If any of them is weak, the 'W/' prefix is discarded and a strict string
comparison is performed.
Co-authored-by: Tim Duesterhus <tim@bastelstu.be>
If the slowstart value in a state file implies the latest state change
is within the slowstart period, we end up calling srv_update_status()
to reschedule the server's state change but its task is not yet
allocated and remains null, causing a crash on startup.
Make sure srv_update_status() supports being called with partially
initialized servers which do not yet have a task. If the task has to
be scheduled, it will necessarily happen after initialization since
it will result from a state change.
This should be backported wherever server-state is present.
In commit 5cd4bbd7a ("BUG/MAJOR: threads/queue: Fix thread-safety issues
on the queues management") the counter of transferred connections was
accidently lost, so that when a server goes down with connections in its
queue, it will always be reported that 0 connection were transferred.
This should be backported as far as 1.8 since the patch above was
backported there.
In issue #785, users are reporting that it's not convenient to load a
".crt.key" when the configuration contains a ".crt".
This option allows to remove the extension of the certificate before
trying to load any extra SSL file (.key, .ocsp, .sctl, .issuer etc.)
The patch changes a little bit the way ssl_sock_load_files_into_ckch()
looks for the file.
safer to close handle before the object is put back in the global pool.
this was introduced by commit 9378bbe0bef4005155d ("MEDIUM: listener:
use protocol->accept_conn() to accept a connection")
this should fix github issue #902
no backport needed.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
As previously discussed, nbproc usage is bad, deprecated, and scheduled
for removal in 2.5.
If "nbproc" is found with more than one process while nbthread is not
set, a warning will be emitted encouraging to remove it or to migrate
to nbthread instead. This makes sure the user has an opportunity to
both see the message and silence it.
This counter is only updated and never used, and in addition it's done
without any atomicity so it's very unlikely to be correct on multi-CPU
systems! Let's just remove it since it's not used.
It's a bit overkill to register an initcall to call a function to set
a lock to zero when not debugging, let's just declare the lock as
pre-initialized to zero.
When using a low hash-balance-factor value, it's possible to loop
many times trying to find the best server. Figures in the order of
100-300 times were observed for 1000 servers with a factor of 101
(which seems a bit excessive for such a large farm). Given that
there's nothing in that function that prevents multiple threads
from working in parallel, let's switch to a read lock. Tests on
8 threads show roughly a 2% performance increase with this.
The "first" algorithm creates a lot of contention because all threads
focus on the same server by definition (the first available one). By
turning the exclusive lock to a read lock in fas_get_next_server(),
the request rate increases by 16% for 8 threads when many servers are
getting close to their maxconn.
This function doesn't change the tree, it only looks for the first
usable server, so let's do that under a read lock to limit the
situations like the ones described in issue #881 where finding a
usable server when dealing with lots of saturated ones can be
expensive. At least threads will now be able to look up in
parallel.
It's interesting to note that s->served is not incremented during the
server choice, nor is the server repositionned. So right now already,
nothing prevents multiple threads from picking the same server. This
will not cause a significant imbalance anyway given that the server
will automatically be repositionned at the right place, but this might
be something to improve in the future if it doesn't come with too high
a cost.
It also looks like the way a server's weight is updated could be
revisited so that the write lock gets tighter at the expense of a
short part of inconsistency between weights and servers still present
in the tree.
- map_get_server_hash() doesn't need a write lock since it only
reads the array, let's only use a read lock here.
- map_get_server_rr() only needs exclusivity to adjust the rr_idx
while looking for its entry. Since this one is not used by
map_get_server_hash(), let's turn this lock to a seek lock that
doesn't block reads.
With 8 threads, no significant performance difference was noticed
given that lookups are usually instant with this LB algo so the
lock contention is rare.
It was previously a spinlock, and it happens that a number of LB algos
only lock it for lookups, without performing any modification. Let's
first turn it to an rwlock and w-lock it everywhere. This is strictly
identical.
It was carefully checked that every HA_SPIN_LOCK() was turned to
HA_RWLOCK_WRLOCK() and that HA_SPIN_UNLOCK() was turned to
HA_RWLOCK_WRUNLOCK() on this lock. _INIT and _DESTROY were updated too.
The server lock must be held when server_take_conn() and server_drop_conn()
lbprm callback functions are called. It is a documented prerequisite but it is
not always performed. It only affects leastconn and fas lb algorithm. Others
don't use these callback functions.
A race condition on the next pending effecive weight (next_eweight) may be
encountered with the leastconn lb algorithm. An agent check may set it to 0
while fwlc_srv_reposition() is called. The server is locked during the
next_eweight update. But because the server lock is not acquired when
fwlc_srv_reposition() is called, we may use it to recompute the server key,
leading to a division by 0.
This patch must be backported as far as 1.8.
It is not guaranteed that the backend connection has an owner. It is set when
the connection is created. But when the connection is moved in a server idle
list, the connection owner is set to NULL and may never be set again. On the
other hand, when a mux is created or when a CS is attached, the session is
always defined. The H1 stream always keep a reference on it when it is
created. Thus, when a bad message is captured we should not rely on the
connection owner to retrieve the session. Instead we should get it from the H1
stream.
If an agent try to set a variable with the NULL data type, an unset is perform
instead to avoid undefined behaviors. Once decoded, such data are translated to
a sample with the type SMP_T_ANY. It is unexpected in HAProxy. When a variable
is set with such sample, no data are attached to the variable. Thus, when the
variable is retrieved later in the transaction, the sample data are
uninitialized, leading to undefined behaviors depending on how it is used. For
instance, it leads to a crash if the debug converter is used on such variable.
This patch should fix the issue #855. It must be backported as far as 1.8.
Detect if the sni used a constant value and if so, allow to reuse this
connection for later sessions. Use a combination of SMP_USE_INTRN +
!SMP_F_VOLATILE to consider a sample as a constant value.
This features has been requested on github issue #371.
During a peers session collision (two peer sessions opened on both side) we must
mark the peer the session of which will be shutdown as alive, if not ->reconnect
timer will be set with a wrong value if the synchro task expires after the peer
has been reconnected. This possibly leads to unexpected deconnections during handshakes.
Furthermore, this patch cancels any heartbeat tranmimission when a reconnection
is prepared.
Right now when running a configuration with many global timers (e.g. many
health checks), there is a lot of contention on the global wait queue
lock because all threads queue up in front of it to scan it.
With 2000 servers checked every 10 milliseconds (200k checks per second),
after 23 seconds running on 8 threads, the lock stats were this high:
Stats about Lock TASK_WQ:
write lock : 9872564
write unlock: 9872564 (0)
wait time for write : 9208.409 msec
wait time for write/lock: 932.727 nsec
read lock : 240367
read unlock : 240367 (0)
wait time for read : 149.025 msec
wait time for read/lock : 619.991 nsec
i.e. ~5% of the total runtime spent waiting on this specific lock.
With upgradable locks we don't need to work like this anymore. We
can just try to upgade the read lock to a seek lock before scanning
the queue, then upgrade the seek lock to a write lock for each element
we want to delete there and immediately downgrade it to a seek lock.
The benefit is double:
- all other threads which need to call next_expired_task() before
polling won't wait anymore since the seek lock is compatible with
the read lock ;
- all other threads competing on trying to grab this lock will fail
on the upgrade attempt from read to seek, and will let the current
lock owner finish collecting expired entries.
Doing only this has reduced the wake_expired_tasks() CPU usage in a
very large servers test from 2.15% to 1.04% as reported by perf top,
and increased by 3% the health check rate (all threads being saturated).
This is expected to help against (and possibly solve) the problem
described in issue #875.