Initially per-thread pool caches were stored into a fixed-size array.
But this was a bit ugly because the last allocated pools were not able
to benefit from the cache at all. As a work around to preserve
performance, a size of 64 cacheable pools was set by default (there
are 51 pools at the moment, excluding any addon and debugging code),
so all in-tree pools were covered, at the expense of higher memory
usage.
In addition an index had to be calculated for each pool, and was used
to acces the pool cache head into that array. The pool index was not
even stored into the pools so it was required to determine it to access
the cache when the pool was already known.
This patch changes this by moving the pool cache head into the pool
head itself. This way it is certain that each pool will have its own
cache. This removes the need for index calculation.
The pool cache head is 32 bytes long so it was aligned to 64B to avoid
false sharing between threads. The extra cost is not huge (~2kB more
per pool than before), and we'll make better use of that space soon.
The pool cache head contains the size, which should probably be removed
since it's already in the pool's head.
In commit fb117e6a8 ("MEDIUM: memory: don't let pool_put_to_cache() free
the objects itself") pool_evict_from_cache() was introduced with no
argument, yet the only call place passes it the pool, the pointer and
the index number!
Let's remove these as they even let the reader think that the function
does something specific to the current pool while it's not the case.
This patch renames all existing uri-normalizers into a more consistent naming
scheme:
1. The part of the URI that is being touched.
2. The modification being performed as an explicit verb.
This normalizer merges `../` path segments with the predecing segment, removing
both the preceding segment and the `../`.
Empty segments do not receive special treatment. The `merge-slashes` normalizer
should be executed first.
See GitHub Issue #714.
When a thread ends its harmeless period, we must only consider running
threads when testing threads_want_rdv_mask mask. To do so, we reintroduce
all_threads_mask mask in the bitwise operation (It was removed to fix a
deadlock).
Note that for now it is useless because there is no way to stop threads or
to have threads reserved for another task. But it is safer this way to avoid
bugs in the future.
A previous patch was pushed to fix a deadlock when an isolated thread ends
its harmless period (a9a9e9aac ["BUG/MEDIUM: thread: Fix a deadlock if an
isolated thread is marked as harmless"]). But, unfortunately, the fix is
incomplete. The same must be done in the outer loop, in
thread_harmless_end() function. The current thread must be ignored when
threads_want_rdv_mask mask is tested.
This patch must also be backported as far as 2.0.
This library is required for the subsequent patch which adds
the JSON query possibility.
It is necessary to change the include statement in "src/mjson.c"
because the imported includes in haproxy are in "include/import"
orig: #include "mjson.h"
new: #include <import/mjson.h>
ub64dec and ub64enc are the base64url equivalent of b64dec and base64
converters. base64url encoding is the "URL and Filename Safe Alphabet"
variant of base64 encoding. It is also used in in JWT (JSON Web Token)
standard.
RFC1421 mention in base64.c file is deprecated so it was replaced with
RFC4648 to which existing converters, base64/b64dec, still apply.
Example:
HAProxy:
http-request return content-type text/plain lf-string %[req.hdr(Authorization),word(2,.),ub64dec]
Client:
Token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VyIjoiZm9vIiwia2V5IjoiY2hhZTZBaFhhaTZlIn0.5VsVj7mdxVvo1wP5c0dVHnr-S_khnIdFkThqvwukmdg
$ curl -H "Authorization: Bearer ${TOKEN}" http://haproxy.local
{"user":"foo","key":"chae6AhXai6e"}
Allocation error are now handled in bind_conf_alloc() functions. Thus
callers, when not already done, are also updated to catch NULL return value.
This patch may be backported (at least partially) to all stable
versions. However, it only fix errors durung configuration parsing. Thus it
is not mandatory.
Add the trace support for the checks. Only tcp-check based health-checks are
supported, including the agent-check.
In traces, the first argument is always a check object. So it is easy to get
all info related to the check. The tcp-check ruleset, the conn-stream and
the connection, the server state...
Olivier spotted that I messed up during a rebase of commit 92c059c2a
("MINOR: atomic: implement native BTS/BTR for x86"), losing the x86
version of the BTS/BTR and leaving the generic version for it instead
of having this block in the #else. Since this variant is not used for
now it was easy to overlook it. Let's re-implement it here.
The time initialization was made a bit complex because we rely on a
dummy negative argument to reset all fields, leaving no distinction
between process-level initialization and thread-level initialization.
This patch changes this by introducing two functions, one for the
process and the second one for the threads. This removes ambigous
test and makes sure that the relevant fields are always initialized
exactly once. This also offers a better solution to the bug fixed in
commit b48e7c001 ("BUG/MEDIUM: time: make sure to always initialize
the global tick") as there is no more special values for global_now_ms.
It's simple enough to be backported if any other time-related issues
are encountered in stable versions in the future.
It was only used by freq_ctr and is not used anymore. In addition the
local curr_sec_ms was removed, as well as the equivalent extern
definitions which did not exist anymore either.
update_freq_ctr_period() was still not very clean and didn't wait for
the rotation lock to be dropped before trying again, thus maintaining
the contention at a high level. In addition, the rotation update was
made in three steps, which are not very efficient in terms of bus
cycles.
Here the wait loop was reworked so that the fast path remains short
and that the contended path waits for the lock to be dropped before
attempting another write, but it only waits a relax cycle before
attempting a read. The rotation block was simplified to remove a
test that was already validated by the first loop, and so that the
retrieval of the current period, its reset and its increment are all
performed in a single atomic op and the store to the previous period
is performed immediately after.
All this results in significantly smaller code for the inline function
(~1kB total) and a shorter critical path.
When counters are rotated, there is contention between the threads which
can slow down the operation of the thread performing the rotation. Let's
apply a cpu_relax there to let the first thread finish faster.
It remains cumbersome to preserve two versions of the freq counters and
two different internal clocks just for this. In addition, the savings
from using two different mechanisms are not that important as the only
saving is a divide that is replaced by a multiply, but now thanks to
the freq_ctr_total() unificaiton the code could also be simplified to
optimize it in case of constants.
This patch turns all non-period freq_ctr functions to static inlines
which call the period-based ones with a period of 1 second. A direct
benefit is that a single internal clock is now needed for any counter
and that they now all rely on ticks.
These 1-second counters are essentially used to report request rates
and to enforce a connection rate limitation in listeners. It was
verified that these continue to work like before.
Both structures are identical except the name of the field starting
the period and its description. Let's call them all freq_ctr and the
period's start "curr_tick" which is generic.
This is only a temporary change and fields are expected to remain
the same with no code change (verified).
There was still no function to compute a wait time for periods, let's
implement it on top of freq_ctr_total() as we'll soon need it for the
per-second one. The divide here is applied on the frequency so that it
will be replaced with a reciprocal multiply when constant.
This one is the easiest to implement, it just requires a call and a
divide of the result. Anti-flapping correction for low-rates was
preserved.
Now calls using a constant period will be able to use a reciprocal
multiply for the period instead of a divide.
Most of the functions designed to read a counter over a period go through
the same complex loop and only differ in the way they use the returned
values, so it was worth implementing all this into freq_ctr_total() which
returns the total number of events over a period so that the caller can
finish its operation using a divide or a remaining time calculation. As
a special case, read_freq_ctr_period() doesn't take pending events but
requires to enable an anti-flapping correction at very low frequencies.
Thus the function implements it when pend<0.
Thanks to this function it will be possible to reimplement the other ones
as inline and merge the per-second ones with the arbitrary period ones
without always adding the cost of a 64 bit divide.
Some variables are mostly read (mostly pointers) but they tend to be
merged with other ones in the same cache line, slowing their access down
in multi-thread setups. This patch declares an empty, aligned variable
in a section called "read_mostly". This will force a cache-line alignment
on this section so that any variable declared in it will be certain to
avoid false sharing with other ones. The section will be eliminated at
link time if not used.
A __read_mostly attribute was added to compiler.h to ease use of this
section.
HA_SECTION() is used as an attribute to force a section name. This is
required because OSX prepends "__DATA, " in front of the declaration.
HA_SECTION_START() and HA_SECTION_STOP() are used as post-attribute on
variable declaration to designate the section start/end (needed only on
OSX, empty on others).
For platforms with an obsolete linker, all macros are left empty. It would
possibly still work on some of them but this will not be needed anyway.
Due to length restrictions on OSX the initcall sections are called "i_"
there while they're called "init_" on other OSes. However the start and
end of sections are still called "__start_init_" and "__stop_init_",
which forces to have distinct code between the OSes. Let's switch everyone
to "i_" and rename the symbols accordingly.
The trace() function is convenient to avoid calling trace() when traces
are not enabled, but there starts to be some callers which place complex
expressions in their trace calls, which results in all of them to be
evaluated before being passed as arguments to the trace() function. This
needlessly wastes precious CPU cycles.
Let's change the function for a macro, so that the arguments are now only
evaluated when the surce has traces enabled. However having a generic
macro being called "trace()" can easily cause conflicts with innocent
code so we rename it "_trace".
Just doing this has resulted in a 2.5% increase of the HTTP/1 request rate.
Interestingly, all arrays used to declare patterns were read-write while
only hard-coded. Let's mark them const so that they move from data to
rodata and don't risk to experience false sharing.
This argument is not being used inside the function (and the functions
themselves are unused as well) and not documented. Its purpose is not clear.
Just remove it.
With latest commit f50906519 ("MEDIUM: fd: merge fdtab[].ev and state
for FD_EV_* and FD_POLL_* into state") one occurrence of a pair of
chars was missed in fd_stop_both(), resulting in the operation to
fail if the upper flags were set. Interestingly it managed to fail
2 tests in all setups in the CI while all used to work fine on my
local machines. Probably that the reason is that the chars had enough
room above them for the CAS to fail then refill "old" overwriting the
upper parts of the stack, and that thanks to this the subsequent tests
worked. With ASAN being used on lots of tests, it very likely caught
it but used to only report failed tests with no more info.
No backport is needed, as this was never released nor backported.
The current BTS/BTR operations on x86 are ugly because they rely on a
CAS, so they may be unfair and take time to converge. Fortunately,
where they are currently used (mostly FDs) the contention is expected
to be rare (mostly listeners). But this also limits their use to such
few low-load cases.
On x86 there is a set of BTS/BTR instructions which help for this,
but before the FD's state migrated to 32 bits there was little use of
them since they do not exist in 8 bits.
Now at least it makes sense to use them, at the very least in order
to significantly reduce the code size (one BTS instead of a CMPXCHG
loop). The implementation relies on modern gcc's ability to return
condition flags and limit code inflation and register spilling. The
fall back is retained on the old implementation for all other situations
(inappropriate target size or non-capable compiler). The code shrank
by 1.6 kB on the fast path.
As expected, for now on up to 4 threads there is no measurable difference
of performance.
Probably due to the result of an old copy-paste, HA_ATOMIC_BTS/BTR were
still implemented using the __sync_* builtins instead of the more
modern __atomic_* which allow to specify the memory model. Let's update
this to use the newer there and also implement the relaxed variants
(which are not used for now).
This patch replaces roughly all occurrences of an HA_ATOMIC_ADD(&foo, 1)
or HA_ATOMIC_SUB(&foo, 1) with the equivalent HA_ATOMIC_INC(&foo) and
HA_ATOMIC_DEC(&foo) respectively. These are 507 changes over 45 files.