Limit the number of old entries we remove in one call of
stktable_trash_oldest(), as we do so while holding the heavily contended
update write lock, so we'd rather not hold it for too long.
This helps stick tables perform better under heavy load.
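A minimal C sketch of the idea (the constant name and the eviction
helper are illustrative, not the actual haproxy code):

  /* bound the number of entries purged per call so the heavily
   * contended update write lock is never held for long */
  #define STKTABLE_MAX_UPDATES_AT_ONCE 100   /* assumed value */

  int budget = STKTABLE_MAX_UPDATES_AT_ONCE;
  struct stksess *ts;

  while (budget-- > 0 && (ts = pick_oldest_entry(t))) /* hypothetical helper */
      __stksess_free(t, ts);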
There is a lot of contention trying to add updates to the tree. So
instead of trying to add the updates to the tree right away, just add
them to an mt-list (with one mt-list per thread group, so that the
mt-list does not become the new point of contention that much), and
create a tasklet dedicated to adding updates to the tree, in batches, to
avoid keeping the update lock for too long.
This helps stick tables perform better under heavy load.
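A hedged sketch of the shape of this change (the list and tasklet field
names are assumptions for illustration):

  /* producer side: enqueue on the current thread group's mt-list
   * instead of contending on the shared update tree */
  MT_LIST_APPEND(&t->pend_updts[tgid - 1], &ts->pend_updts);
  tasklet_wakeup(t->updt_task);

  /* consumer side: a dedicated tasklet drains the per-group lists and
   * inserts the entries into the update tree in one batch, under a
   * single write lock acquisition */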
As discussed in GH #2423, there are some cases where src_{inc,clr}_gpc*
is not sufficient because we need to perform the lookup on a specific
key. Indeed, just like we did in e642916 ("MEDIUM: stktable: leverage
smp_fetch_* helpers from sample conv"), we can easily implement new
table converters based on existing fetches. This is what we do in
this patch.
Also the doc was updated so that src_{inc,clr}_gpc* fetches now point to
their generic equivalent table_{inc,clr}_gpc*. Indeed, src_{inc,clr}_gpc*
are simply aliases.
This should fix GH #2423.
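For example (illustrative config, assuming a table named st_src storing
gpc0):

  backend st_src
      stick-table type ip size 100k expire 1h store gpc0

  frontend fe
      # increment gpc0 for the entry keyed by the source address,
      # without any prior track-sc rule
      acl mark src,table_inc_gpc0(st_src) -m bool
      tcp-request connection accept if mark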
sample_conv_table_bytes_out_rate() was defined in the middle of other
stick-table sample convs without any ordering logic. Let's put it
where it belongs, right after sample_conv_table_bytes_in_rate().
In this patch we try to prevent code duplication: some fetches and sample
converters do the exact same thing, except that the converter takes the
argument as input data. Until now, both the converter and the fetch
had their own implementation (copy-pasted), with only the fetch-specific
or converter-specific lookup part differing.
Thanks to previous commits, we now have generic sample fetch helpers
that take the stkctr as argument, so let's leverage them directly
from the converter functions when available. This allows us to remove
a lot of code duplication and should make code maintenance easier in the
future.
While this patch adds more insertions than deletions, it actually
tries to simplify the lookup logic for sc_ and src_ sticktable fetches.
Indeed, the smp_create_src_stkctr() + smp_fetch_sc_stkctr() combination
was used everywhere a fetch supports both the sc_ and src_ forms, and
smp_fetch_sc_stkctr() even integrated some of the src-oriented fetch
logic.
Not only this was confusing, but it made the task of adding new generic
fetches even more complex.
Thus in this patch we completely dedicate smp_fetch_sc_stkctr() to sc_
oriented fetches, while smp_create_src_stkctr() is now renamed to
smp_fetch_src_stkctr() and can now work on its own for src_ oriented
fetches. It takes an additional parameter, "create", to tell the
function whether the entry should be created if it doesn't exist yet.
Now it's up to the calling function to know if it should be using the
sc_ oriented fetch or the src_ oriented one based on the input keyword.
In this patch we split several sample fetch functions that are leveraged
by the "src-" fetches such as smp_fetch_sc_inc_gpc().
Indeed, for all of them, we add an intermediate helper function that takes
a stkctr pointer as parameter and performs the logic, leaving the lookup
part in the calling function. Before this patch, existing functions
were doing the lookup + the fetch logic. Thanks to this patch it will
become easier to add generic converters taking a lookup key as input.
List of targeted functions:
- smp_fetch_sc_inc_gpc()
- smp_fetch_sc_inc_gpc0()
- smp_fetch_sc_inc_gpc1()
- smp_fetch_sc_clr_gpc()
- smp_fetch_sc_clr_gpc0()
- smp_fetch_sc_clr_gpc1()
- smp_fetch_sc_conn_cnt()
- smp_fetch_sc_conn_rate()
- smp_fetch_sc_updt_conn_cnt()
- smp_fetch_sc_conn_curr()
- smp_fetch_sc_glitch_cnt()
- smp_fetch_sc_glitch_rate()
- smp_fetch_sc_sess_cnt()
- smp_fetch_sc_sess_rate()
- smp_fetch_sc_http_req_cnt()
- smp_fetch_sc_http_req_rate()
- smp_fetch_sc_http_err_cnt()
- smp_fetch_sc_http_err_rate()
- smp_fetch_sc_http_fail_cnt()
- smp_fetch_sc_http_fail_rate()
- smp_fetch_sc_kbytes_in()
- smp_fetch_sc_bytes_in_rate()
- smp_fetch_sc_kbytes_out()
- smp_fetch_sc_gpc1_rate()
- smp_fetch_sc_gpc0_rate()
- smp_fetch_sc_gpc_rate()
- smp_fetch_sc_get_gpc1()
- smp_fetch_sc_get_gpc0()
- smp_fetch_sc_get_gpc()
- smp_fetch_sc_get_gpt0()
- smp_fetch_sc_get_gpt()
- smp_fetch_sc_bytes_out_rate()
Please note that this patch doesn't render any good using "git show" or
"git diff". For all the functions listed above, a new helper function was
defined right above it, with the same name without "_sc". These new
functions perform the fetch part, while the original ones (with "_sc")
now simply perform the lookup and then leverage the corresponding fetch
helper.
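As a hedged illustration of the split pattern (simplified: the entry
locking and release that the real code performs are omitted):

  /* fetch part only: operates on an already looked-up stkctr */
  static int smp_fetch_conn_cnt(struct stkctr *stkctr, struct sample *smp)
  {
      void *ptr;

      smp->flags = SMP_F_VOL_TEST;
      smp->data.type = SMP_T_SINT;
      smp->data.u.sint = 0;
      if (stkctr_entry(stkctr)) {
          ptr = stktable_data_ptr(stkctr->table, stkctr_entry(stkctr),
                                  STKTABLE_DT_CONN_CNT);
          if (ptr)
              smp->data.u.sint = stktable_data_cast(ptr, std_t_uint);
      }
      return 1;
  }

  /* original function: keeps only the lookup, then delegates */
  static int smp_fetch_sc_conn_cnt(const struct arg *args, struct sample *smp,
                                   const char *kw, void *private)
  {
      struct stkctr tmpstkctr;
      struct stkctr *stkctr;

      stkctr = smp_fetch_sc_stkctr(smp->sess, smp->strm, args, kw, &tmpstkctr);
      if (!stkctr)
          return 0;
      return smp_fetch_conn_cnt(stkctr, smp);
  }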
smp_fetch_stksess(table, smp, create) performs a lookup in <table> by
using <smp> as a key. It returns the matching entry on success and NULL
on failure. <create> can be set to 1 to force the entry creation.
We then use this helper everywhere relevant to prevent code duplication.
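A hedged usage sketch:

  /* look up the entry keyed by the sample; don't create it */
  ts = smp_fetch_stksess(t, smp, 0);
  if (!ts)
      return 0; /* no entry matching <smp> in <t> */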
As discussed in GH #2838, the previous fix f399dbf
("MINOR: stktable: fix potential build issue in smp_to_stkey") which
attempted to remove conversion ambiguity and prevent a build warning
proved to be insufficient.
This time, we implement Willy's suggestion, which is to use a union to
perform the conversion.
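A sketch of the idea (details may differ from the actual patch):

  union {
      uint64_t u64;
      uint32_t u32;
  } *data = (void *)&smp->data.u.sint;

  /* the assignment performs the numeric truncation, so u32 holds the
   * low 32 bits at a well-defined location on both little- and
   * big-endian machines, without any ambiguous cast */
  data->u32 = data->u64;
  static_table_key.key = &data->u32;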
Hopefully this should fix GH #2838. If that's the case (and only in that
case), then this patch may be backported with f399dbf (else the patch
won't apply) anywhere b59d1fd ("BUG/MINOR: stktable: fix big-endian
compatiblity in smp_to_stkey()") was backported.
In 819fc6f563
("MEDIUM: threads/stick-tables: handle multithreads on stick tables"),
sample fetch and action functions were properly guarded with stksess
read/write locks for read and write operations respectively, but the
sample_conv_table functions leveraged by "table_*" converters were
overlooked.
This bug was not known to cause issues in existing deployments yet (at
least it was not reported), but due to its nature it can theoretically
lead to inconsistent values being reported by "table_*" converters if
the value is being updated by another thread in parallel.
It should be backported to all stable versions.
[ada: for versions < 3.0, glitch_cnt and glitch_rate samples should be
ignored as they first appeared in 3.0]
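A hedged sketch of the kind of guard that was missing (following the
usual stick-table locking pattern; details are illustrative):

  HA_RWLOCK_RDLOCK(STK_SESS_LOCK, &ts->lock);
  ptr = stktable_data_ptr(t, ts, STKTABLE_DT_GPC0);
  if (ptr)
      smp->data.u.sint = stktable_data_cast(ptr, std_t_uint);
  HA_RWLOCK_RDUNLOCK(STK_SESS_LOCK, &ts->lock);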
smp_to_stkey() uses an ambiguous cast from a 64-bit integer to a 32-bit
unsigned integer. While it is intended, let's make the cast less
ambiguous by explicitly casting the right part of the assignment to the
proper type.
This should fix GH #2838.
As discussed in GH #1750, we were lacking a sample fetch to be able to
retrieve the key from the currently tracked counter entry. To do so,
sc_key fetch can now be used. It returns a sample with the correct type
(table key type) corresponding to the tracked counter entry (from previous
track-sc rules).
If no entry is currently tracked, it returns nothing.
It can be used in the standard form "sc_key(<sc_number>)" or the legacy
forms "sc0_key", "sc1_key" and "sc2_key".
Documentation was updated.
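For example (illustrative config):

  http-request track-sc0 src table st_src
  # log the key (here, the source address) of the sc0 tracked entry
  log-format "tracked_key=%[sc_key(0)]"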
stksess_getkey(t, ts) returns a stktable_key struct pointer filled with
data from the input <ts> entry in the <t> table. The returned pointer
uses the static_table_key variable. Indeed, the stktable_key struct is
more convenient to manipulate than having to deal with key extraction
from the stksess struct directly.
When smp_to_stkey() deals with SINT samples, since stick tables deal
with 32-bit integers while the SINT sample is a 64-bit integer, the
conversion was done in place in smp_to_stkey(): the 64-bit integer was
truncated before the key would point to it. Unfortunately this only
works on little-endian architectures, because on big-endian ones the key
would point to the wrong 32-bit range.
To fix the issue and make the conversion endian-proof, let's re-assign
the sample as a 32-bit integer before the key points to it.
Thanks to Willy for having spotted the bug and suggesting the above fix.
It should be backported to all stable versions.
Some actions such as "sc0_get_gpc0" (using smp_fetch_sc_stkctr()
internally) can take an optional table name as parameter to perform the
lookup on a different table from the tracked one but using the key from
the tracked entry. It is done by leveraging the stktable_lookup() function
which was originally meant to perform intra-table lookups.
Calling sc0_get_gpc0() with a different table name will result in
stktable_lookup() being called to perform the lookup using a stksess
from a different table. While it is theoretically fine, it comes with a
pitfall: both tables (the one from where the stksess originates and the
actual target table) should rely on the exact same key type and length.
Failure to do so actually results in undefined behavior, because the key
type and/or length from one table is used to perform the lookup in
another table, while the underlying lookup API expects explicit type and
key length.
For instance, consider the below example:
  peers testpeers
      bind 127.0.0.1:10001
      server localhost

      table test type binary len 1 size 100k expire 1h store gpc0
      table test2 type string size 100k expire 1h store gpc0

  listen test_px
      mode http
      bind 0.0.0.0:8080
      http-request track-sc0 bin(AA) table testpeers/test
      http-request track-sc1 str(ok) table testpeers/test2
      log-format "%[sc0_get_gpc0(testpeers/test2)]"
      log stdout format raw local0
      server s1 git.haproxy.org:80
Performing a curl request to localhost:8080 will cause uninitialized
reads because the string "ok" from the test2 table will be compared as a
string against the "AA" binary sample, which is not NULL terminated:
  ==2450742== Conditional jump or move depends on uninitialised value(s)
  ==2450742==    at 0x484F238: strlen (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
  ==2450742==    by 0x27BCE6: stktable_lookup (stick_table.c:539)
  ==2450742==    by 0x281470: smp_fetch_sc_stkctr (stick_table.c:3580)
  ==2450742==    by 0x283083: smp_fetch_sc_get_gpc0 (stick_table.c:3788)
  ==2450742==    by 0x2A805C: sample_process (sample.c:1376)
So let's prevent that by adding some comments to the
stktable_set_entry() function description, and by adding a check in
smp_fetch_sc_stkctr() to ensure both the source stksess and the target
table share the same key properties.
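A hedged sketch of the added check (field names per the stktable
struct, check shape approximate):

  /* refuse cross-table lookups when key properties don't match */
  if (stkctr->table->type != t->type ||
      stkctr->table->key_size != t->key_size)
      return NULL; /* incompatible tables: don't perform the lookup */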
While it could be relevant to backport this in all stable versions, it is
probably safer to wait for some time before doing so, to ensure that no
existing configs rely on this ambiguity because the fact that the target
table and source stksess entry need to share the same key type and length
is not explicitly documented.
As discussed in GH #2286, {set, clear, show} table commands were unable
to deal with array types such as gpt, because they handled such types as
non-array types, so only the first entry (i.e. gpt[0]) was considered.
In this patch we add an extra logic around array-types handling so that
it is possible to specify an array index right after the type, like this:
  set table peer/table key mykey data.gpt[2] value
  # where 2 is the entry index that we want to access
If no index is specified, then it implicitly defaults to 0 to mimic
previous behavior.
Same as stktable_get_data_type(), but tries to parse an optional index
in the form "name[idx]" (only for array types).
Falls back to stktable_get_data_type() when no index is provided.
Thanks to the previous commit, the stktable struct now has a "flags"
member. Let's take this opportunity to remove the isolated "nopurge"
attribute in the stktable struct and rely on a flag named STK_FL_NOPURGE
instead.
This helps to better organize stktable struct members.
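A minimal sketch of the change:

  /* before: if (!t->nopurge) ...   after: */
  if (!(t->flags & STK_FL_NOPURGE))
      stktable_trash_oldest(t, budget);  /* purge only when allowed */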
When "recv-only" keyword is added on a stick table declaration (in peers
or proxy section), haproxy considers that the table is only used for
data retrieval from a remote location and not used to perform local
updates. As such, it enables the retrieval of local-only values such
as conn_cur that are ignored by default. This can be useful in some
contexts where we want to know about local-only values such as conn_cur
from a remote peer.
To do this, add stktable struct flags which default to NONE, and enable
the RECV_ONLY flag on the table when the "recv-only" keyword is found in
the table declaration. Then, in peer_treat_updatemsg(), when handling
table updates, don't ignore data updates for local-only values if the
flag is set.
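For example (illustrative config):

  peers mypeers
      bind 127.0.0.1:10000
      server hap2 192.168.0.2:10000
      # conn_cur pushed by the remote peer is stored instead of being
      # ignored as a local-only value
      table sessions type ip size 100k expire 1h store conn_cur recv-only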
The file name used to point to the calling function's stack for stick
tables, which was OK during parsing but remained dangling afterwards.
At least it was already marked const so as not to accidentally free it.
Let's make it point to a file_name_node now.
Add a factor parameter to stick-tables, called "brates-factor", that is
applied to in/out bytes rates to work around the 32-bit limit of the
frequency counters. Thanks to this factor, it is possible to have byte
rates beyond 4GB/s. Instead of counting each byte, we count blocks of
bytes. Among other things, it will be useful for the bwlim filter, to be
able to configure a shared limit exceeding 4GB/s.
For now, this parameter must be in the range ]0-1024].
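For example (illustrative config):

  backend st
      # count 1024-byte blocks instead of single bytes, so the 32-bit
      # frequency counters can express rates up to roughly 4TB/s
      stick-table type ip size 100k expire 1h store bytes_out_rate(1s) brates-factor 1024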
Since 2.5, an array of GPC is provided to replace legacy gpc0/gpc1.
src_inc_gpc is a sample fetch which is used to increment counters in
this array.
A crash occurs if src_inc_gpc is used without any previous track-sc
rule. This is caused by an error in smp_fetch_sc_inc_gpc(). When a
temporary stick counter is created via smp_create_src_stkctr(), the
table pointer argument used is not correct: it points to the counter ID
instead of the table argument. To fix this, use the proper sample fetch
second arg.
This can be reproduced with the following config:

  acl mark src_inc_gpc(0,<table>) -m bool
  tcp-request connection accept if mark
This should be backported up to 2.6.
Guarded functions to kill a sticky session, stksess_kill() and
stksess_kill_if_expired(), may or may not decrement and test its
reference counter before really killing it. This depends on a parameter:
if it is set to a non-zero value, the ref count is decremented and, if
it falls to zero, the session is killed. Otherwise, if this parameter is
equal to zero, the session is killed regardless of the ref count value.
In the code, these functions are always called with a non-zero
parameter and the ref count is always decremented and tested. So, there
is no reason to still have a special case, especially because it is not
really easy to say whether it is supported or not. Does it mean it is
possible to kill a sticky session while it is still referenced
somewhere? Probably not. So, does it mean it is possible to kill an
unreferenced session? This case may be problematic because the session
is accessed outside of any lock and thus may be released by another
thread because it is unreferenced. Enlarging the scope of the lock to
avoid any issue is possible, but it is a bit of a shame to do so because
there is no usage for now.
The best is to simplify the API and remove this case. Now, stksess_kill()
and stksess_kill_if_expired() functions always decrement and test the ref
count before killing a sticky session.
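A hedged sketch of the simplified semantics (the atomic macro is
haproxy's, the internal kill helper name is assumed):

  /* always decrement the ref count; only kill once nobody references
   * the session anymore */
  if (HA_ATOMIC_SUB_FETCH(&ts->ref_cnt, 1) != 0)
      return;
  __stksess_kill(t, ts);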
When we try to kill a session, the shard must be locked before decrementing
the ref count on the session. Otherwise, the ref count can fall to 0 and a
purge task (stktable_trash_oldest or process_table_expire) may release the
session before we have the opportunity to acquire the lock on the shard to
effectively kill the session. This could lead to a double free.
Here is the scenario:

  Thread 1                                 Thread 2

  stksess_kill(ts)
    if (ATOMIC_DEC(&ts->ref_cnt) != 0)
        return
    /* here the ref count is 0 */
                                           stktable_trash_oldest()
                                             LOCK(&sh_lock)
                                             if (!ATOMIC_LOAD(&ts->ref_cnt))
                                                 __stksess_free(ts)
                                             UNLOCK(&sh_lock)
    /* here the session was released */
    LOCK(&sh_lock)
    __stksess_free(ts) <--- double free
    UNLOCK(&sh_lock)
The bug was introduced in 2.9 by commit 7968fe3889 ("MEDIUM:
stick-table: change the ref_cnt atomically"). The ref count must be
decremented inside the lock for the stksess_kill() and
stksess_kill_if_expired() functions.
This patch should fix issue #2611. It must be backported as far as 2.9.
In 2.9 there is no sharding and the whole table is locked, so the patch
will have to be adapted.
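A hedged sketch of the fixed ordering (shard/lock field names
approximate):

  /* grab the shard lock first so that a concurrent purge cannot free
   * the session between our decrement and our own free */
  HA_RWLOCK_WRLOCK(STK_TABLE_LOCK, &t->shards[shard].sh_lock);
  if (!HA_ATOMIC_SUB_FETCH(&ts->ref_cnt, 1))
      __stksess_free(t, ts);
  HA_RWLOCK_WRUNLOCK(STK_TABLE_LOCK, &t->shards[shard].sh_lock);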
As reported by @Bbulatov in GH #2586, stktable_data_ptr() return value is
used without checking it isn't NULL first, which may happen if the given
type is invalid or not stored in the table.
However, since data_type is set by table_prepare_data_request() right
before cli_io_handler_table() is invoked, data_type is not expected to
be invalid: table_prepare_data_request() already checked that the type
is stored inside the table. Thus stktable_data_ptr() should not fail at
this point, so we add a BUG_ON() to indicate that.
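A minimal sketch:

  ptr = stktable_data_ptr(t, ts, dt);
  /* dt was validated by table_prepare_data_request(); a NULL result
   * would be a logic bug, not a runtime condition */
  BUG_ON(!ptr);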
During changes made in 2.7 by commits 8d3c3336f9 ("MEDIUM: stick-table:
make stksess_kill_if_expired() avoid the exclusive lock") and 996f1a5124
("MEDIUM: stick-table: do not take a lock to update t->current anymore."),
the operation was done cautiously one baby step at a time and the final
cleanup was not done, as we're keeping a read lock under an atomic dec.
Furthermore there's a pool_free() call under that lock, and we try to
avoid pool_alloc() and pool_free() under locks for their nasty side
effects (e.g. when memory gets recompacted), so let's really drop it
now.
Note that the performance gain is not really perceptible here, it's
essentially for code clarity reasons that this has to be done.
Due to the code in stktable_touch_with_exp() being the same as in other
functions previously built around a loop trying first to upgrade a read
lock, then falling back to a direct write lock, there remains a
confusing construct with multiple tests on use_wrlock, which is
obviously zero when tested. Let's remove them since the value is known
and the loop does not exist anymore.
In GH issue #2552, Christian Ruppert reported an increase in crashes
with recent 3.0-dev versions, always related with stick-tables and peers.
One particularity of his config is that it has a lot of peers.
While trying to reproduce, it empirically was found that firing 10 load
generators at 10 different haproxy instances tracking a random key among
100k against a table of max 5k entries, on 8 threads and between a total
of 50 parallel peers managed to reproduce the crashes in seconds, very
often in ebtree deletion or insertion code, but not only.
The debugging revealed that the crashes are often caused by a parent node
being corrupted while delete/insert tries to update it regarding a recently
inserted/removed node, and that that corrupted node had always been proven
to be deleted, then immediately freed, so it ought not be visited in the
tree from functions enclosed between a pair of lock/unlock. As such the
only possibility was that it had experienced unexpected inserts. Also,
running with pool integrity checking would 90% of the time cause crashes
during allocation based on corrupted contents in the node, likely because
it was found at two places in the same tree and still present as a parent
of a node being deleted or inserted (hence the __stksess_free and
stktable_trash_oldest callers being visible on these items).
Indeed the issue is in fact related to the test set (occasionally redundant
keys, many peers). What happens is that sometimes, a same key is learned
from two different peers. When it is learned for the first time, we end up
in stktable_touch_with_exp() in the "else" branch, where the test for
existence is made before taking the lock (since commit cfeca3a3a3
("MEDIUM: stick-table: touch updates under an upgradable read lock") that
was merged in 2.9), and from there the entry is added. But if one of
the threads manages to insert it before the other thread takes the lock,
then the second thread will try to insert this node again. And inserting
an already inserted node will corrupt the tree (note that we never
switched to enforcing a check in insertion code on this due to API
history that would break various code parts).
Here the solution is simple: it requires rechecking leaf_p after getting
the lock, to avoid touching anything if the entry has already been
inserted in the meantime.
Many thanks to Christian Ruppert for testing this and for his invaluable
help on this hard-to-trigger issue.
This fix needs to be backported to 2.9.
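A hedged sketch of the recheck (lock label and field names
approximate):

  HA_RWLOCK_WRLOCK(STK_TABLE_LOCK, &t->updt_lock);
  /* recheck under the lock: another thread may have inserted the
   * entry between our lockless check and the lock acquisition */
  if (!ts->upd.node.leaf_p) {
      ts->upd.key = ++t->update;
      eb32_insert(&t->updates, &ts->upd);
  }
  HA_RWLOCK_WRUNLOCK(STK_TABLE_LOCK, &t->updt_lock);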
When a sticky session is killed, we must be sure no other entity is
still referencing it. The session's ref_cnt must be 0. However, there is
a race with peers, as described in 21447b1dd4 ("BUG/MAJOR:
stick-tables: fix race with peers in entry expiration"). When the update
lock is acquired, we must recheck the ref_cnt value.
This patch is part of a debugging session about issue #2552. It must be
backported to 2.9.
It is the same issue as the one fixed in process_table_expire()
(21447b1dd4 ["BUG/MAJOR: stick-tables: fix race with peers in entry
expiration"]). In stktable_trash_oldest(), when the update lock is
acquired, we must take care to check the ref_cnt again because some
peers may increment it (see the commit above for details).
This patch fixes a crash mentioned in 2552#issuecomment-2110532706. It
must be backported to 2.9.
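A hedged sketch of the recheck (simplified, lock label approximate):

  HA_RWLOCK_WRLOCK(STK_TABLE_LOCK, &t->updt_lock);
  /* recheck under the lock: a peer may have taken a reference in the
   * meantime, in which case the entry must not be freed */
  if (HA_ATOMIC_LOAD(&ts->ref_cnt) != 0) {
      HA_RWLOCK_WRUNLOCK(STK_TABLE_LOCK, &t->updt_lock);
      continue;  /* skip this entry, keep purging others */
  }
  eb32_delete(&ts->upd);
  HA_RWLOCK_WRUNLOCK(STK_TABLE_LOCK, &t->updt_lock);
  __stksess_free(t, ts);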
Since 3.0-dev7 with commit 1a088da7c2 ("MAJOR: stktable: split the keys
across multiple shards to reduce contention"), building without threads
yields a warning about the shard not being used. This is because the
locks API does nothing with its arguments, and that is the only place
where the shard is being used. We cannot modify the lock API to pretend
to consume its argument because quite often it's not even instantiated.
Let's just pretend we consume shard using an explicit ALREADY_CHECKED()
statement instead. While we're at it, let's make sure that XXH32() is
not called when there is a single bucket!
No backport is needed.
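A hedged sketch of both points:

  shard = 0;
  if (CONFIG_HAP_TBL_BUCKETS > 1)
      shard = XXH32(key->key, key->key_len, 0) % CONFIG_HAP_TBL_BUCKETS;
  /* without threads the locks API expands to nothing and <shard> is
   * otherwise unused: silence the "set but not used" warning */
  ALREADY_CHECKED(shard);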
In 2.9 with commit 7968fe3889 ("MEDIUM: stick-table: change the ref_cnt
atomically") we significantly relaxed the stick-tables locking when
dealing with peers by adjusting the ref_cnt atomically and moving it
out of the lock.
However it opened a tiny window that became problematic in 3.0-dev7
when the table's contention was lowered by commit 1a088da7c2 ("MAJOR:
stktable: split the keys across multiple shards to reduce contention").
What happens is that some peers may access the entry for reading at the
moment it's about to expire; while the read accesses to push the data
remain unnoticed (possibly that from time to time we push crap), the
release of the refcount causes a new write that may damage anything
else. The scenario is the following:
  process_table_expire()               peer_send_teachmsgs()

                                          RDLOCK(&updt_lock);
    tick_is_expired() != 0
    ebmb_delete(ts->key);
    if (ts->upd.node.leaf_p) {
                                          HA_ATOMIC_INC(&ts->ref_cnt);
                                          RDUNLOCK(&updt_lock);
      WRLOCK(&updt_lock);
      eb32_delete(&ts->upd);
    }
    __stksess_free(t, ts);
                                          peer_send_updatemsg(ts);
                                          RDLOCK(&updt_lock);
                                          HA_ATOMIC_DEC(&ts->ref_cnt);
Here it's clear that the bottom part of peer_send_teachmsgs() believes
it is protected but may act on freed data.
This is more visible when enabling -dMtag,no-merge,integrity because
the ATOMIC_DEC(&ref_cnt) decrements one byte in the area, that makes
the eviction check fail while the tag has the address of the left
__stksess_free(), proving a completed pool_free() before the decrement,
and the anomaly there is pretty visible in the crash dump. Changing
INC()/DEC() with ADD(2)/DEC(2) shows that the byte is now off by two,
confirming that the operation happened there.
The solution is not very hard, it consists in checking for the ref_cnt
on the left after grabbing the lock, and doing both before deleting the
element, so that we have the guarantee that either the peer will not
take it or that it has already started taking it.
This was proven to be sufficient, as instead of crashing after 3s of
injection with 4 peers, 16 threads and 130k RPS, it survived for 15mn.
In order to stress the setup, a config involving 4+ peers, tracking
HTTP request with randoms and applying a bwlim-out filter with a
random key, with a client made of 160 h2 conns downloading 10 streams
of 4MB objects in parallel managed to trigger it within a few seconds:
  frontend ft
      http-request track-sc0 rand(100000) table tbl
      filter bwlim-out lim-out limit 2047m key rand(100000000),ipmask(32) min-size 1 table tbl
      http-request set-bandwidth-limit lim-out
      use_backend bk

  backend bk
      server s1 198.18.0.30:8000
      server s2 198.18.0.34:8000

  backend tbl
      stick-table type ip size 1000k expire 1s store http_req_cnt,bytes_in_rate(1s),bytes_out_rate(1s) peers peers
This seems to be very dependent on the timing and setup though.
This will need to be backported to 2.9. This part of the code was
reindented with shards but the block should remain mostly unchanged.
The logic to apply is the same.
When adding the shards support to tables in commit 1a088da7c ("MAJOR:
stktable: split the keys across multiple shards to reduce contention"),
the condition to stop eliminating entries once the batch size is reached
was based on a pre-decrement of the max_search counter. But the code
then goes back into the outer loop, which doesn't check it; the next
time the counter is checked, when entering the next shard, it becomes
even more negative and the loop properly stops, but at first glance it
looks like an int overflow (which it is not). Let's make sure the outer
loop stops on this condition so that we don't continue searching when
the limit is reached.
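A hedged sketch of the corrected loop structure:

  for (shard = 0; shard < CONFIG_HAP_TBL_BUCKETS; shard++) {
      if (max_search <= 0)
          break;  /* batch budget exhausted in a previous shard */
      /* per-shard eviction loop, pre-decrementing max_search */
  }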
While changing the stick-table indexing that led to commit 1a088da7c
("MAJOR: stktable: split the keys across multiple shards to reduce
contention"), I met a problem with the task's expiration date being
incorrectly updated, I fixed it and apparently I committed the wrong
version :-/
The effect is that the task's date is only correctly reset if the
table is empty, otherwise the task wakes up again and is queued at
the previous date, eating 100% CPU. The tick_isfirst() must not be
used when storing the last result.
No backport is needed as this was only merged in 3.0-dev7.
This bug arrived with this commit:

  MAJOR: stktable: split the keys across multiple shards to reduce contention

At this time, there are no callers which call stktable_get_entry()
without checking the nullity of the <key> passed as parameter. But the
documentation of this function says it supports the case where <key> is
null.
Move the nullity test on <key> to the first statement of this function.
Thanks to @chipitsine for having reported this issue in GH #2518.
In order to reduce the contention on the table when keys expire
quickly, we're spreading the load over multiple trees. This applies to
both the keys and the expiration dates. The shard number is calculated
from the key value itself, both when looking up and when setting it.
The "show table" dump on the CLI iterates over all shards so that the
output is not fully sorted, it's only sorted within each shard. The Lua
table dump just does the same. It was verified with a Lua program to
count stick-table entries that it works as intended (the test case is
reproduced here as it's clearly not easy to automate as a vtc):
  function dump_stk()
      local dmp = core.proxies['tbl'].stktable:dump({});
      local count = 0
      for _, __ in pairs(dmp) do
          count = count + 1
      end
      core.Info('Total entries: ' .. count)
  end
  core.register_action("dump_stk", {'tcp-req', 'http-req'}, dump_stk, 0);

  ##

  global
      tune.lua.log.stderr on
      lua-load-per-thread lua-cnttbl.lua

  listen front
      bind :8001
      http-request lua.dump_stk if { path_beg /stk }
      http-request track-sc1 rand(),upper,hex table tbl
      http-request redirect location /

  backend tbl
      stick-table size 100k type string len 12 store http_req_cnt

  ##

  $ h2load -c 16 -n 10000 0:8001/
  $ curl 0:8001/stk
  ## A count close to 100k appears on haproxy's stderr
  ## On the CLI, "show table tbl" | wc will show the same.
Some large parts were reindented only to add a top-level loop to iterate
over shards (e.g. process_table_expire()). Better check the diff using
git show -b.
The number of shards is decided just like for the pools, at build time
based on the max number of threads, so that we can keep a constant. Maybe
this should be done differently. For now CONFIG_HAP_TBL_BUCKETS is used,
and defaults to CONFIG_HAP_POOL_BUCKETS to keep the benefits of all the
measurements made for the pools. It turns out that this value seems to
be the most reasonable one without inflating the struct stktable too
much. By default for 1024 threads the value is 32 and delivers 980k RPS
in a test involving 80 threads, while adding 1kB to the struct stktable
(roughly doubling it). The same test at 64 gives 1008 kRPS and at 128
it gives 1040 kRPS for 8 times the initial size. 16 would be too low
however, with 675k RPS.
The stksess already has a shard number; it's the one used to decide
which peer connection to send the entry on. Maybe we should also store
the one associated with the entry itself instead of recalculating it,
though it does not happen that often. The operation is done by hashing
the key using XXH32().
The peers also take and release the table's lock, but the way it's used
is not very clear yet, so at this point it's certain this will not work.
At this point, this allowed us to completely unlock the performance on
an 80-thread setup:
  before: 5.4 Gbps, 150k RPS, 80 cores
    52.71%  haproxy  [.] stktable_lookup_key
    36.90%  haproxy  [.] stktable_get_entry.part.0
     0.86%  haproxy  [.] ebmb_lookup
     0.18%  haproxy  [.] process_stream
     0.12%  haproxy  [.] process_table_expire
     0.11%  haproxy  [.] fwrr_get_next_server
     0.10%  haproxy  [.] eb32_insert
     0.10%  haproxy  [.] run_tasks_from_lists

  after: 36 Gbps, 980k RPS, 80 cores
    44.92%  haproxy  [.] stktable_get_entry
     5.47%  haproxy  [.] ebmb_lookup
     2.50%  haproxy  [.] fwrr_get_next_server
     0.97%  haproxy  [.] eb32_insert
     0.92%  haproxy  [.] process_stream
     0.52%  haproxy  [.] run_tasks_from_lists
     0.45%  haproxy  [.] conn_backend_get
     0.44%  haproxy  [.] __pool_alloc
     0.35%  haproxy  [.] process_table_expire
     0.35%  haproxy  [.] connect_server
     0.35%  haproxy  [.] h1_headers_to_hdr_list
     0.34%  haproxy  [.] eb_delete
     0.31%  haproxy  [.] srv_add_to_idle_list
     0.30%  haproxy  [.] h1_snd_buf
WIP: uint64_t -> long
WIP: ulong -> uint
code is much smaller
Thanks to the previous commit, we can now simply perform an atomic read
on stksess->seen and take the write lock to recreate the entry only if
at least one peer has seen it, otherwise leave it untouched. On a test
on 40 cores, the performance used to drop from 2.10 to 1.14M RPS when
one peer was connected, now it drops to 2.05, thus there's basically
no impact of connecting a peer vs ~45% previously, all spent in the
read lock. This can be particularly important when often updating the
same entries (user-agent, source address during an attack etc).
Right now we're taking the stick-tables update lock for reads just for
the sake of checking if the update index is past it or not. That's
costly because even taking the read lock is sufficient to provoke a
cache line write, while when under load or attack it's frequent that
the update has not yet been propagated and wouldn't require anything.
This commit brings a new field to the stksess, "seen", which is zeroed
when the entry is updated, and set to one as soon as at least one peer
starts to consult it. This way it will reflect that the entry must be
updated again so that this peer can see it. Otherwise no update will
be necessary. For now the flag is only set/reset but not exploited.
Great care is taken to avoid writes whenever possible.
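A hedged sketch of the protocol:

  /* updater side: the entry changed, peers must consult it again */
  HA_ATOMIC_STORE(&ts->seen, 0);

  /* peer side: record that at least one peer consulted the entry;
   * test first to avoid a needless cache-line write */
  if (!HA_ATOMIC_LOAD(&ts->seen))
      HA_ATOMIC_STORE(&ts->seen, 1);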
In 2.7 we addressed a race condition in the stick tables expiration task
with commit fbb934d ("BUG/MEDIUM: stick-table: fix a race condition when
updating the expiration task"). The issue was that the task could be
running on another thread which would destroy its expiration timer
while one had just recalculated it and prepares to queue it, causing
a bug due to the attempt to queue an expired task. The fix consisted in
enclosing the change into the stick-table's lock, which had a very low
cost since it's done only after having checked that the date changed,
i.e. no more than once every millisecond.
But as reported by Ricardo and Felipe from Taghos in GitHub issue
#2508, a tiny race remained after the fix: the unlock() was done before
the call to task_queue(), leaving a small window for another thread to
run between unlock() and task_queue() and erase the timer. As confirmed,
it's sufficient to also protect the task_queue() call.
But overall this raises a point regarding the task_queue() API on tasks
that may run anywhere. A while ago an attempt was made at removing the
timer for woken-up tasks, but something like this would deserve to be
revisited with more atomicity on the timer manipulation (e.g. atomically
using task_schedule() instead, maybe). This should be backported to all
stable branches.
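A hedged sketch of the fix (field names approximate):

  HA_RWLOCK_WRLOCK(STK_TABLE_LOCK, &t->lock);
  if (t->exp_task->expire != exp_next) {
      t->exp_task->expire = exp_next;
      /* queue while still holding the lock so that no other thread
       * can erase the timer between the update and the queuing */
      task_queue(t->exp_task);
  }
  HA_RWLOCK_WRUNLOCK(STK_TABLE_LOCK, &t->lock);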
The main CLI I/O handler is responsible for interrupting processing on
shutdown/abort. It is not the responsibility of the I/O handlers of CLI
commands to take care of it.
In resolvers.c:rslv_promex_next_ts() and in
stick-tables.c:stk_promex_next_ts(), an unused argument was mistakenly
called "unsued" instead of "unused". Let's fix this in a separate patch
so that it can be omitted from backports if this causes build problems.
This adds a new pair of stored types in the stick-tables:
- glitch_cnt
- glitch_rate
These keep count of the number of glitches reported on a front connection,
in order to decide how to act with a badly defective client or a potential
attacker. For now nothing updates these counters, but all the infrastructure
needed to configure, update and retrieve them was added, including the doc.
No regtest was added yet since they're not filled yet.
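For example (illustrative config; nothing fills the counters yet, as
noted above):

  frontend fe
      bind :8080
      stick-table type ip size 100k expire 30s store glitch_cnt,glitch_rate(10s)
      http-request track-sc0 src
      # once the counters are updated, a defective client could be
      # rejected like this:
      http-request deny if { sc0_glitch_rate gt 10 }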
Commit 9b2717e7b ("MINOR: stktable: use {show,set,clear} table with
ptr") stores a pointer in a long long (64 bits), which fails the cast to
void* on 32-bit platforms:
  src/stick_table.c: In function 'table_process_entry_per_ptr':
  src/stick_table.c:5136:37: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
   5136 |                 ts = stktable_lookup_ptr(t, (void *)ptr);
On all our supported platforms, longs and pointers are of the same size,
so let's just turn this to a ulong instead.
In table_process_entry(), stktable_data_ptr() result is dereferenced
without checking if it's NULL first, which may happen when bad inputs
are provided to the function.
However, data_type and ts arguments were already checked prior to calling
the function, so we know for sure that stktable_data_ptr() will never
return NULL in this case.
Yet some static code analyzers such as Coverity are confused because
they think that the result might possibly be NULL.
(See GH #2398)
To make it explicit that we always provide good inputs and expect valid
result, let's switch to the __stktable_data_ptr() unsafe function.
This patch adds support for an optional ptr (0xffff form) instead of
the key argument to match against existing sticktable entries, e.g. if
the key is empty or cannot be matched on the CLI due to incompatible
characters.
Lookup is performed using a linear search, so it will be slower than a
key search, which relies on an eb tree lookup.
Example:

  set table mytable key mykey data.gpc0 1
  show table mytable
  > 0x7fbd00032bd8: key=mykey use=0 exp=86373242 shard=0 gpc0=1
  clear table mytable ptr 0x7fbd00032bd8
This patch depends on:
  - "MINOR: stktable: add table_process_entry helper function"
It should solve GH #2118.