Instead of having table_process_entry() decrement the session's ref
counter, do it outside, from the caller. Some code paths were missed,
such as when an action was invalid, which would lead to the ref counter
not being decremented and the session never being destroyable.
It makes more sense to do that from the caller, which just obtained the
ref counter anyway.
This should be backported up to 2.8.
A great deal of the task_queue() contention is caused by this function
because its task is created using task_new_anywhere() and is subject to
heavy updates. Let's make it single-threaded by rotating the assigned
threads during initialization so that each table only runs on one
thread.
However there's a trick: the function used to call task_queue() to
requeue the task if it had advanced its timer (which may only happen
when learning an entry from a peer). We can't do that anymore since we
can't queue another thread's task. Thus, if the task needs to be
scheduled earlier than previously planned, we simply perform a wakeup.
It will likely do nothing and will self-adjust its next wakeup timer.
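As a rough standalone illustration of this scheme (hypothetical names
and plain C11 atomics, not HAProxy's task API), the owner thread is
picked round-robin at init time and other threads only request a
wakeup:

  #include <stdatomic.h>
  #include <stddef.h>

  struct sketch_table {
      int owner_thread;            /* only this thread may (re)queue the task */
      atomic_int wakeup_requested; /* "please run soon" hint from other threads */
  };

  /* rotate the owner across tables so each table's task runs on one thread */
  void assign_table_threads(struct sketch_table *tabs, size_t nb, int nb_threads)
  {
      for (size_t i = 0; i < nb; i++)
          tabs[i].owner_thread = (int)(i % nb_threads);
  }

  /* any thread: we cannot requeue a task owned by another thread, so we
   * only ask for a wakeup; the owner will most likely do nothing and will
   * self-adjust its next expiration timer when it runs.
   */
  void request_earlier_run(struct sketch_table *t)
  {
      atomic_store(&t->wakeup_requested, 1);
  }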
Doing so halves the number of multi-thread task wakeups. In addition
the request rate at saturation increased by 12% with 16 peers and 40
tables on 16 8-thread processes. This should improve the situation
described by Felipe in issues #3084 and #3101.
This should be backported to 3.2 after some extended checks.
In stktable_requeue_exp(), there's a tiny race at the beginning during
which we check the task's expiration date to decide whether or not to
wake process_table_expire() up. During this race, the task might just
have finished running on its owner thread, and we can miss a task_queue()
opportunity, which probably explains why, during testing, a few entries
were occasionally left at the end.
Let's perform a CAS to confirm the value is still the same before
leaving. This way we're certain that our value has been seen at least
once.
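A minimal standalone sketch of this confirmation step, using C11 atomics
and hypothetical names (the real code works on the task's expiration
date):

  #include <stdatomic.h>

  atomic_long task_expire;       /* stands for the task's expiration date */

  void wake_owner_task(void) { } /* stand-in for waking the expiry task */

  /* Make sure the task will run no later than <when>. The CAS re-checks
   * that the date we based our decision on is still in place; if it
   * changed under us (the task just ran), we fall back to a wakeup and
   * let the task recompute its own timer.
   */
  void ensure_task_runs_by(long when)
  {
      long cur = atomic_load(&task_expire);

      if (when >= cur &&
          atomic_compare_exchange_strong(&task_expire, &cur, cur))
          return;                /* unchanged: the current date covers us */
      wake_owner_task();
  }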
This should be backported to 3.2.
This function is a tasklet handler used to send peers updates, and it
shows up quite a bit in "show tasks" and "show profiling tasks", so let's
export it so that we don't face a cryptic symbol name:
  $ socat - /tmp/haproxy-n10.stat <<< "show tasks"
  Running tasks: 43 (8 threads)
    function                   places     %  lat_tot  lat_avg calls_tot calls_avg calls%
    process_table_expire           16  37.2   1.072m   4.021s    115831      7239   15.4
    task_process_applet            15  34.8   1.072m   4.287s    486299     32419   65.0
    stktable_add_pend_updates       8  18.6        -        -     89725     11215   12.0
    sc_conn_io_cb                   3   6.9        -        -      5007      1669    0.6
    process_peer_sync               1   2.3   4.293s   4.293s     50765     50765    6.7
This should be backported to 3.2 as it helps with debugging the
table+peers processing overhead.
The stick-table expiration of ref-counted entries was insufficiently
addressed by commit 324f0a60ab ("BUG/MINOR: stick-tables: never leave
used entries without expiration"), because now entries are just requeued
where they were, so they're visited over and over for long sessions,
causing process_table_expire() to loop, eating CPU and causing lock
contention.
Here we take care of refreshing their timer when they are met, so
that we don't meet them more than once per stick-table lifetime. It
should address at least a part of the recent degradation that Felipe
noticed in GH #3084.
Since the fix above was marked for backporting to 3.2, this one should
be backported there as well.
This one doesn't need to wait forever; if it cannot do its work, it can
postpone it. When building with a high value of
STKTABLE_MAX_UPDATES_AT_ONCE (1000),
it's still possible to trigger warnings in this function on the write lock
that is contended by peers and expiration. Changing it to a trylock
resolves the issue.
This should be backported to 3.2 after a bit of testing.
process_table_expire() can take quite a lot of time running over all
shards. During this time it will hinder track-sc rules and peers, which
will experience increased latency to do their work, especially peers,
where each message takes a lock whose cumulated time can exceed the
watchdog's patience.
Here we proceed just like in stktable_trash_oldest(), i.e. we use a
trylock to detect contention. The first time it happens, if we hadn't
purged anything, we switch to a regular lock to perform the operation,
and the next time it happens we abort. This guarantees that some entries
will be expired and that contention will be reduced when detected.
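A standalone sketch of that strategy (plain pthread locks and
hypothetical names standing in for the real shard locks and expiration
code):

  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  #define NB_SHARDS 4

  static pthread_rwlock_t shard_lock[NB_SHARDS];

  /* hypothetical stand-in for the real per-shard expiration work */
  static int expire_entries_in_shard(int shard) { (void)shard; return 1; }

  /* Visit all shards with a trylock; on the first contention, insist with
   * a blocking lock only if nothing was purged yet, otherwise give up.
   */
  static int expire_all_shards(void)
  {
      int purged = 0;
      bool contended = false;

      for (int s = 0; s < NB_SHARDS; s++) {
          if (pthread_rwlock_trywrlock(&shard_lock[s]) != 0) {
              if (purged || contended)
                  break;                              /* second time: abort */
              contended = true;
              pthread_rwlock_wrlock(&shard_lock[s]);  /* first time: insist */
          }
          purged += expire_entries_in_shard(s);
          pthread_rwlock_unlock(&shard_lock[s]);
      }
      return purged;
  }

  int main(void)
  {
      for (int s = 0; s < NB_SHARDS; s++)
          pthread_rwlock_init(&shard_lock[s], NULL);
      printf("expired %d entries\n", expire_all_shards());
      return 0;
  }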
With this change, various tests didn't manage to produce any warning,
including at the end of the load generation session.
This should be backported to 3.2 after a bit more testing.
stktable_trash_oldest() does insist a lot on purging what was requested,
only limited by STKTABLE_MAX_UPDATES_AT_ONCE. This is called in two
conditions, one to allocate a new stksess, and the other one to purge
entries of a stopping process. The cost of iterating over all shards
is huge, and a shard lock is taken each time before looking up entries.
Moreover, multiple threads can end up doing the same and looking hard for
many entries to purge when only one is needed. Furthermore, all threads
start from the same shard, hence synchronize their locks. All of this
costs a lot to other operations such as access from peers.
This commit simplifies the approach by ignoring the budget, starting
from a random shard number, and using a trylock so as to be able to
give up early in case of contention. The approach chosen here consists
in trying hard to flush at least one entry, but once at least one is
evicted or at least one trylock failed, then a failure on the trylock
will result in finishing.
The function now returns a success as long as one entry was freed.
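A reduced standalone sketch of the resulting logic (pthread mutexes and
hypothetical helpers standing in for the real shard locks and eviction
code):

  #include <pthread.h>
  #include <stdlib.h>

  #define NB_SHARDS 8

  pthread_mutex_t shard_lock[NB_SHARDS];  /* assumed initialized at startup */

  /* hypothetical stand-in: frees one old entry if possible, returns 1 if so */
  int free_oldest_in_shard(int shard) { (void)shard; return 1; }

  /* returns non-zero as soon as at least one entry was freed */
  int trash_oldest(void)
  {
      int start = rand() % NB_SHARDS;  /* don't let all threads hit shard 0 */
      int freed = 0, failed = 0;

      for (int i = 0; i < NB_SHARDS; i++) {
          int s = (start + i) % NB_SHARDS;

          if (pthread_mutex_trylock(&shard_lock[s]) != 0) {
              if (freed || failed)
                  break;               /* contention after progress: give up */
              failed = 1;
              continue;                /* first failure: keep trying others */
          }
          freed += free_oldest_in_shard(s);
          pthread_mutex_unlock(&shard_lock[s]);
          if (freed)
              break;                   /* one evicted entry is enough */
      }
      return freed;
  }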
With this, tests no longer show watchdog warnings during tests, though
a few still remain when stopping the tests (which are not related to
this function but to the contention from process_table_expire()).
With this change, under high contention the purge of some entries might
be postponed and the table may occasionally contain slightly more entries
than its configured size (though this already happens since stksess_new()
first increments ->current before decrementing it).
Measurements were made on a 64-core system with 8 peers
of 16 threads each, at CPU saturation (350k req/s each doing 10
track-sc) for 10M req, with 3 different approaches:
- this one resulted in 1500 failures to find an entry (0.015%
size overhead), with the lowest contention and the fairest
peers distribution.
- leaving only after a success resulted in 229 failures (0.0029%
size overhead) but doubled the time spent in the function (on
the write lock precisely).
- leaving only when both a success and a failed lock were met
resulted in 31 failures (0.00031% overhead) but the contention
was high enough again so that peers were not all up to date.
Considering that a saturated machine exceeding its entry count by
0.015% is pretty minimal, the mechanism is kept.
This should be backported to 3.2 after a bit more testing as it
resolves some watchdog warnings and panics. It requires the preceding
commit "MINOR: stick-table: permit stksess_new() to temporarily
allocate more entries" to over-allocate instead of failing in case
of contention.
stksess_new() calls stktable_trash_oldest() to release some entries.
If that fails, however, stksess_new() refuses to allocate an entry. This
is a problem because it doesn't permit stktable_trash_oldest() to be used
in best-effort mode, which forces it to impose high contention. There's
no problem with
allocating slightly more in practice. In the worst case if all entries are
in use, it's not shocking to temporarily exceed the number of entries by a
few units.
Let's relax this problematic rule. This patch might need to be backported
to 3.2 after a bit more testing in order to support locking relaxation.
This helps keep the contention level low: when we hold the update lock
that we know other parts may be relying on (peers, track-sc, etc.), we
decrease the remaining visit counters 4 times as fast to further reduce
the contention. At this point no more warnings are seen during
intense synchronization (2x64 cores, 1.5M req/s with a track-sc each,
5M entries in use).
As reported by Felipe in GH issue #3084, on large systems it's not
sufficient to leave the expiration process after a certain number of
expired entries, because if they accumulate too fast, it's still
possible to spend a lot of time visiting many of them (e.g. those still
in use).
Thus here we're taking a stricter approach consisting in counting the
number of visited entries, which allows us to leave early if we can't
do the expected work in a reasonable amount of time.
In order to avoid always stopping on the first shards and never visiting
the last ones, we always start from a random shard number and loop
from that one. This way even if we always leave early, all shards will
be handled equally.
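A rough sketch of the resulting loop shape, with hypothetical helper
names and an arbitrary budget value (not the real constants):

  #define NB_SHARDS  8
  #define MAX_VISITS 1000   /* arbitrary value for the sketch */

  /* hypothetical stand-in: visits at most <budget> entries of one shard,
   * expiring those that can be expired, and returns the number visited.
   */
  extern int visit_shard(int shard, int budget);
  extern int random_shard(void);

  /* returns non-zero if the budget ran out and more work remains */
  int expire_pass(void)
  {
      int start = random_shard();   /* don't always stop on the first shards */
      int budget = MAX_VISITS;

      for (int i = 0; i < NB_SHARDS && budget > 0; i++)
          budget -= visit_shard((start + i) % NB_SHARDS, budget);

      return budget <= 0;           /* caller reschedules the task soon */
  }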
This should be backported to 3.2.
When trying to kill/expire entries, if a ref-counted entry is found,
let's requeue it with its expiration timer instead of leaving it out,
because other ref-counters (e.g. peers) will not purge it otherwise,
leaving it orphan. This one seems trickier to trigger, though it appears
to happen sometimes when peers are late and a long resync is active and
competing with intense calls to process_table_expire() (i.e. when no
other activity is there).
This must be backported to 3.2. It's likely that older versions are
affected as well, but possibly differently since the expiration
mechanism changed between 3.1 and 3.2, so better not take unneeded
risks there.
In 3.2, the table expiration latency was improved by commit 994cc58576
("MEDIUM: stick-tables: Limit the number of entries we expire"), however
it introduced an issue by which it's possible to leave the loop after a
certain number of elements were expired, without requeuing the deleted
elements. The issue this causes is that other places holding a non-null
ref_cnt will not necessarily delete them themselves, resulting in orphan
elements in the table. These will then pollute it and force the recycling
of old ones more often, which in turn increases contention.
Let's check for the expiration counter before deleting the element so
that it can be found upon next visit.
This fix must be backported to 3.2. It is directly related to GH
issue #3084. Thanks to Felipe and Ricardo for sharing precious info
and testing a candidate fix.
Cap the sticky counter index with tune.nb_stk_ctr instead of
MAX_SESS_STKCTR for sc-add-gpc. The same logic is already implemented for
the sc-inc-gpc and sc-set-gpt keywords, so it appears to have been missed
for sc-add-gpc.
This fixes GitHub issue #3061. Thanks to @ma311 for the report and the
analysis of the issue.
This should be backported to all versions down to 2.8 included.
Since commit 388539faa ("MEDIUM: stick-tables: defer adding updates to a
tasklet"), between the entry creation and its arrival in the updates tree,
there is time for scheduling, and it now becomes possible for an stksess
entry to be requeued into the list while it's still in the tree as a remote
one. Only local updates were removed prior to being inserted. In this case
we would re-insert the entry, causing it to appear as the parent of two
distinct nodes or leaves, and to be visited from the first leaf during a
delete() after having already been removed and freed, causing a crash,
as Christian reported in issue #2959.
There's no reason to backport this as this appeared with the commit above
in 3.2-dev13.
It might be possible not to see the element in the tree, then not to
see it in the update list, thus not to take the lock before deleting.
But an element in the list could have moved to the tree during the
check, and be removed later without the updt_lock.
Let's delete prior to checking the presence in the tree to avoid
this situation. No backport is needed since this arrived in -dev13
with the update list.
However the doc purposely says the opposite, to encourage migrating away
from "ip". The goal is that in the future we change "ip" to mean "ipv6",
which seems to be what most users naturally expect. But we cannot break
configurations in the LTS version so for now "ipv4" is the alias.
The reason for not changing it in the table is that the type name is
used at a few places (look for "].kw"):
- dumps
- promex
We'd rather not change that output for 3.2, but only do it in 3.3.
This way, 3.2 can be made future-proof by using "ipv4" in the config
without any other side effect.
Please see github issue #2962 for updates on this transition.
As reported in GH #2958, commit 6c9b315 caused a regression with sc_*
fetches and tracked counter id > 9.
As such, the below configuration would cause a BUG_ON() to be triggered:
  global
      log stdout format raw local0
      tune.stick-counters 11

  defaults
      log global
      mode http

  frontend www
      bind *:8080
      acl track_me bool(true)
      http-request set-var(txn.track_var) str("a")
      http-request track-sc10 var(txn.track_var) table rate_table if track_me
      http-request set-var(txn.track_var_rate) sc_gpc_rate(0,10,rate_table)
      http-request return status 200

  backend rate_table
      stick-table type string size 1k expire 5m store gpc_rate(1,1m)
While in 6c9b315 the src_fetch logic was removed from
smp_fetch_sc_stkctr(), num > 9 is indeed no longer expected as the
original num value. But what we didn't consider is that num is
effectively re-assigned for the generic sc_* variant.
Thus the BUG_ON() is misplaced, as it should only be evaluated for
non-generic fetches. This explains why it triggers with valid
configurations.
Thanks to GH user @tkjaer for his detailed report and bug analysis.
No backport needed, this bug is specific to 3.2.
In process_table_expire(), limit the number of entries we remove in one
call, and just reschedule the task if there's more to do. Removing
entries requires the heavily contended update write lock, and we
don't want to hold it for too long.
This helps stick tables perform better under heavy load.
Limit the number of old entries we remove in one call of
stktable_trash_oldest(), as we do so while holding the heavily contended
update write lock, so we'd rather not hold it for too long.
This helps stick tables perform better under heavy load.
There is a lot of contention trying to add updates to the tree. So
instead of trying to add the updates to the tree right away, just add
them to an mt-list (with one mt-list per thread group, so that the
mt-list itself does not become too much of a new contention point), and
create a tasklet dedicated to adding updates to the tree, in batches, to
avoid keeping the update lock for too long.
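The scheme can be pictured with this much-reduced standalone sketch,
where plain pthread primitives and hypothetical names stand in for the
mt-lists, the tasklet and the update tree:

  #include <pthread.h>
  #include <stddef.h>

  #define NB_TGROUPS 4

  struct pending_update {
      struct pending_update *next;
      /* ... payload (stksess pointer, etc.) ... */
  };

  static struct update_queue {
      pthread_mutex_t lock;        /* stands in for the per-group mt-list,
                                    * assumed initialized at startup */
      struct pending_update *head;
  } queue[NB_TGROUPS];

  static pthread_rwlock_t updt_lock = PTHREAD_RWLOCK_INITIALIZER;

  /* hypothetical stand-in for the insertion into the shared update tree */
  static void insert_into_tree(struct pending_update *u) { (void)u; }

  /* producer side: cheap, contention limited to the group's own queue */
  void queue_update(int tgid, struct pending_update *u)
  {
      pthread_mutex_lock(&queue[tgid].lock);
      u->next = queue[tgid].head;
      queue[tgid].head = u;
      pthread_mutex_unlock(&queue[tgid].lock);
  }

  /* consumer side (the dedicated tasklet): detach one batch at a time,
   * then insert it under the update lock in a single, bounded critical
   * section.
   */
  void drain_updates(int tgid)
  {
      struct pending_update *u, *next;

      pthread_mutex_lock(&queue[tgid].lock);
      u = queue[tgid].head;
      queue[tgid].head = NULL;
      pthread_mutex_unlock(&queue[tgid].lock);

      pthread_rwlock_wrlock(&updt_lock);
      for (; u; u = next) {
          next = u->next;
          insert_into_tree(u);
      }
      pthread_rwlock_unlock(&updt_lock);
  }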
This helps stick tables perform better under heavy load.
As discussed in GH #2423, there are some cases where src_{inc,clr}_gpc*
is not sufficient because we need to perform the lookup on a specific
key. Indeed, just like we did in e642916 ("MEDIUM: stktable: leverage
smp_fetch_* helpers from sample conv"), we can easily implement new
table converters based on existing fetches. This is what we do in
this patch.
Also the doc was updated so that src_{inc,clr}_gpc* fetches now point to
their generic equivalent table_{inc,clr}_gpc*. Indeed, src_{inc,clr}_gpc*
are simply aliases.
This should fix GH #2423.
sample_conv_table_bytes_out_rate() was defined in the middle of other
stick-table sample convs without any ordering logic. Let's put it
where it belongs, right after sample_conv_table_bytes_in_rate().
In this patch we try to prevent code duplication: some fetches and sample
converters do the exact same thing, except that the converter takes the
argument as input data. Until now, both the converter and the fetch
had their own implementation (copy-pasted), with the fetch-specific or
converter-specific lookup part.
Thanks to previous commits, we now have generic sample fetch helpers
that take the stkctr as argument, so let's leverage them directly
from the converter functions when available. This allows us to remove
a lot of code duplication and should make code maintenance easier in the
future.
While this patch adds more insertions than deletions, it actually tries
to simplify the lookup logic for sc_ and src_ stick-table fetches.
Indeed, smp_create_src_stkctr() and smp_fetch_sc_stkctr() combination
was used everywhere the fetch supports sc_ and src_ form, and
smp_fetch_sc_stkctr() even integrated some of the src-oriented fetch logic.
Not only this was confusing, but it made the task of adding new generic
fetches even more complex.
Thus in this patch we completely dedicate smp_fetch_sc_stkctr() to sc_
oriented fetches, while smp_create_src_stkctr() is now renamed to
smp_fetch_src_stkctr() and can now work on its own for src_ oriented
fetches. It takes an additional parameter, "create", to tell the function
if the entry should be created if it doesn't exist yet.
Now it's up to the calling function to know if it should be using the
sc_ oriented fetch or the src_ oriented one based on the input keyword.
In this patch we split several sample fetch functions that are leveraged
by the "src-" fetches such as smp_fetch_sc_inc_gpc().
Indeed, for all of them, we add an intermediate helper function that takes
a stkctr pointer as parameter and performs the logic, leaving the lookup
part in the calling function. Before this patch existing functions were
doing the lookup + the fetch logic. Thanks to this patch it will become
easier to add generic converters taking a lookup key as input.
List of targeted functions:
- smp_fetch_sc_inc_gpc()
- smp_fetch_sc_inc_gpc0()
- smp_fetch_sc_inc_gpc1()
- smp_fetch_sc_clr_gpc()
- smp_fetch_sc_clr_gpc0()
- smp_fetch_sc_clr_gpc1()
- smp_fetch_sc_conn_cnt()
- smp_fetch_sc_conn_rate()
- smp_fetch_sc_updt_conn_cnt()
- smp_fetch_sc_conn_curr()
- smp_fetch_sc_glitch_cnt()
- smp_fetch_sc_glitch_rate()
- smp_fetch_sc_sess_cnt()
- smp_fetch_sc_sess_rate()
- smp_fetch_sc_http_req_cnt()
- smp_fetch_sc_http_req_rate()
- smp_fetch_sc_http_err_cnt()
- smp_fetch_sc_http_err_rate()
- smp_fetch_sc_http_fail_cnt()
- smp_fetch_sc_http_fail_rate()
- smp_fetch_sc_kbytes_in()
- smp_fetch_sc_bytes_in_rate()
- smp_fetch_kbytes_out()
- smp_fetch_sc_gpc1_rate()
- smp_fetch_sc_gpc0_rate()
- smp_fetch_sc_gpc_rate()
- smp_fetch_sc_get_gpc1()
- smp_fetch_sc_get_gpc0()
- smp_fetch_sc_get_gpc()
- smp_fetch_sc_get_gpt0()
- smp_fetch_sc_get_gpt()
- smp_fetch_sc_bytes_out_rate()
Please note that this patch doesn't render well with "git show" or
"git diff". For all the functions listed above, a new helper function was
defined right above it, with the same name without "_sc". These new
functions perform the fetch part, while the original ones (with "_sc")
now simply perform the lookup and then leverage the corresponding fetch
helper.
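In a deliberately simplified form, with hypothetical names (the real
helpers keep the original names minus "_sc" and the real lookup goes
through the tracked counters), the pattern looks like this:

  struct sketch_stkctr { unsigned int gpc0; };

  /* fetch part only: works on a counter already resolved by the caller,
   * so converters with their own lookup can reuse it directly.
   */
  unsigned int fetch_get_gpc0(const struct sketch_stkctr *ctr)
  {
      return ctr ? ctr->gpc0 : 0;
  }

  /* sc_-style entry point: performs the lookup, then delegates the fetch */
  unsigned int sc_get_gpc0(int sc_num,
                           struct sketch_stkctr *(*lookup)(int sc_num))
  {
      return fetch_get_gpc0(lookup(sc_num));
  }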
smp_fetch_stksess(table, smp, create) performs a lookup in <table> by
using <smp> as a key. It returns the matching entry on success and NULL on
failure. <create> can be set to 1 to force the entry creation.
We then use this helper everywhere relevant to prevent code duplication.
As discussed in GH #2838, the previous fix f399dbf
("MINOR: stktable: fix potential build issue in smp_to_stkey") which
attempted to remove conversion ambiguity and prevent a build warning,
proved to be insufficient.
This time, we implement Willy's suggestion, which is to use a union to
perform the conversion.
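For illustration, here is a minimal standalone version of the union
trick (hypothetical type names, not the actual sample structure):

  #include <inttypes.h>
  #include <stdio.h>

  union sint_key {
      int64_t  sint;   /* value as carried by the sample */
      uint32_t u32;    /* value as expected by a 32-bit table key */
  };

  int main(void)
  {
      union sint_key conv;

      conv.sint = INT64_C(0x1234567890);
      conv.u32  = (uint32_t)conv.sint;   /* explicit, well-defined truncation */
      /* the key can now safely point to conv.u32, whatever the endianness */
      printf("key = %" PRIu32 "\n", conv.u32);
      return 0;
  }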
Hopefully this should fix GH #2838. If that's the case (and only in that
case), then this patch may be backported with f399dbf (else the patch
won't apply) anywhere b59d1fd ("BUG/MINOR: stktable: fix big-endian
compatibility in smp_to_stkey()") was backported.
In 819fc6f563
("MEDIUM: threads/stick-tables: handle multithreads on stick tables"),
sample fetch and action functions were properly guarded with stksess
read/write locks for read and write operations respectively, but the
sample_conv_table functions leveraged by "table_*" converters were
overlooked.
This bug was not known to cause issues in existing deployments yet (at
least it was not reported), but due to its nature it can theoretically lead
to inconsistent values being reported by "table_*" converters if the value
is being updated by another thread in parallel.
It should be backported to all stable versions.
[ada: for versions < 3.0, glitch_cnt and glitch_rate samples should be
ignored as they first appeared in 3.0]
smp_to_stkey() uses an ambiguous cast from a 64-bit integer to a 32-bit
unsigned integer. While it is intended, let's make the cast less
ambiguous by explicitly casting the right part of the assignment to the
proper type.
This should fix GH #2838.
As discussed in GH #1750, we were lacking a sample fetch to be able to
retrieve the key from the currently tracked counter entry. To do so,
sc_key fetch can now be used. It returns a sample with the correct type
(table key type) corresponding to the tracked counter entry (from previous
track-sc rules).
If no entry is currently tracked, it returns nothing.
It can be used with the standard form "sc_key(<sc_number>)" or the
legacy forms "sc0_key", "sc1_key", "sc2_key".
Documentation was updated.
stksess_getkey(t, ts) returns a stktable_key struct pointer filled with
data from input <ts> entry in <t> table. The returned pointer uses the
static_table_key variable. Indeed, the stktable_key struct is more
convenient to manipulate than having to deal with the key extraction
from the stksess struct directly.
When smp_to_stkey() deals with SINT samples, since stick-tables deal
with 32-bit integers while an SINT sample is a 64-bit integer, an
in-place conversion was done in smp_to_stkey(). For that, the 64-bit
integer was truncated before the key would point to it. Unfortunately
this only works on little-endian architectures because on big-endian
ones the key would point to the wrong 32-bit half.
To fix the issue and make the conversion endian-proof, let's re-assign
the sample as 32bit integer before the key points to it.
Thanks to Willy for having spotted the bug and suggesting the above fix.
It should be backported to all stable versions.
Some actions such as "sc0_get_gpc0" (using smp_fetch_sc_stkctr()
internally) can take an optional table name as parameter to perform the
lookup on a different table from the tracked one but using the key from
the tracked entry. It is done by leveraging the stktable_lookup() function
which was originally meant to perform intra-table lookups.
Calling sc0_get_gpc0() with a different table name will result in
stktable_lookup() being called to perform a lookup using a stksess from
a different table. While it is theoretically fine, it comes with a
pitfall: both tables (the one from which the stksess originates and the
actual target table) should rely on the exact same key type and length.
Failure to do so actually results in undefined behavior, because the key
type and/or length from one table is used to perform the lookup in
another table, while the underlying lookup API expects explicit type and
key length.
For instance, consider the below example:
  peers testpeers
      bind 127.0.0.1:10001
      server localhost
      table test type binary len 1 size 100k expire 1h store gpc0
      table test2 type string size 100k expire 1h store gpc0

  listen test_px
      mode http
      bind 0.0.0.0:8080
      http-request track-sc0 bin(AA) table testpeers/test
      http-request track-sc1 str(ok) table testpeers/test2
      log-format "%[sc0_get_gpc0(testpeers/test2)]"
      log stdout format raw local0
      server s1 git.haproxy.org:80
Performing a curl request to localhost:8080 will cause uninitialized reads
because string "ok" from test2 table will be compared as a string against
"AA" binary sample which is not NULL terminated:
==2450742== Conditional jump or move depends on uninitialised value(s)
==2450742== at 0x484F238: strlen (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==2450742== by 0x27BCE6: stktable_lookup (stick_table.c:539)
==2450742== by 0x281470: smp_fetch_sc_stkctr (stick_table.c:3580)
==2450742== by 0x283083: smp_fetch_sc_get_gpc0 (stick_table.c:3788)
==2450742== by 0x2A805C: sample_process (sample.c:1376)
So let's prevent that by adding some comments to the stktable_set_entry()
function description, and by adding a check in smp_fetch_sc_stkctr() to
ensure that both the source stksess and the target table share the same
key properties.
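The added check boils down to something like the following sketch
(hypothetical structure and field names, not the real stktable layout):

  #include <stddef.h>

  struct sketch_table {
      int    key_type;   /* e.g. integer, string, binary... */
      size_t key_size;
  };

  /* refuse cross-table lookups unless both tables agree on the key layout,
   * otherwise the lookup would read the key with the wrong type/length.
   */
  int keys_are_compatible(const struct sketch_table *src,
                          const struct sketch_table *dst)
  {
      return src->key_type == dst->key_type && src->key_size == dst->key_size;
  }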
While it could be relevant to backport this in all stable versions, it is
probably safer to wait for some time before doing so, to ensure that no
existing configs rely on this ambiguity because the fact that the target
table and source stksess entry need to share the same key type and length
is not explicitly documented.
As discussed in GH #2286, {set, clear, show} table commands were unable
to deal with array types such as gpt, because they handled such types
as non-array types, thus only the first entry (ie: gpt[0]) was
considered.
In this patch we add an extra logic around array-types handling so that
it is possible to specify an array index right after the type, like this:
set table peer/table key mykey data.gpt[2] value
# where 2 is the entry index that we want to access
If no index is specified, then it implicitly defaults to 0 to mimic
previous behavior.
Same as stktable_get_data_type(), but tries to parse optional index in
the form "name[idx]" (only for array types).
Falls back to stktable_get_data_type() when no index is provided.
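A small standalone sketch of that parsing (hypothetical helper name and
simplified error handling):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* returns 0 on success and fills <name>/<idx>; -1 on syntax error */
  int parse_data_type_arg(const char *arg, char *name, size_t name_len,
                          unsigned int *idx)
  {
      const char *bracket = strchr(arg, '[');
      char *end;

      *idx = 0;                                  /* implicit default */
      if (!bracket) {
          snprintf(name, name_len, "%s", arg);
          return 0;
      }
      if ((size_t)(bracket - arg) >= name_len)
          return -1;
      memcpy(name, arg, bracket - arg);
      name[bracket - arg] = 0;

      *idx = strtoul(bracket + 1, &end, 10);
      if (end == bracket + 1 || *end != ']' || end[1])
          return -1;                             /* want exactly "name[number]" */
      return 0;
  }

  int main(void)
  {
      char name[64];
      unsigned int idx;

      if (parse_data_type_arg("gpt[2]", name, sizeof(name), &idx) == 0)
          printf("type=%s idx=%u\n", name, idx); /* prints: type=gpt idx=2 */
      return 0;
  }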
Thanks to the previous commit, the stktable struct now has a "flags"
struct member. Let's take this opportunity to remove the isolated
"nopurge" attribute in the stktable struct and rely on a flag named
STK_FL_NOPURGE instead.
This helps to better organize stktable struct members.
When "recv-only" keyword is added on a stick table declaration (in peers
or proxy section), haproxy considers that the table is only used for
data retrieval from a remote location and not used to perform local
updates. As such, it enables the retrieval of local-only values such
as conn_cur that are ignored by default. This can be useful in some
contexts where we want to know about local-only values such as conn_cur
from a remote peer.
To do this, add a stktable struct "flags" field which defaults to NONE,
and enable the RECV_ONLY flag on the table when the "recv-only" keyword
is found in the table declaration. Then, in peer_treat_updatemsg(), when
handling table updates, don't ignore data updates for local-only values
if the flag is set.
The file name used to point to the calling function's stack for stick
tables, which was OK during parsing but remained dangling afterwards.
At least it was already marked const so as not to accidentally free it.
Let's make it point to a file_name_node now.
Add a factor parameter to stick-tables, called "brates-factor", that is
applied to the in/out bytes rates to work around the 32-bit limit of the
frequency counters. Thanks to this factor, it is possible to have byte
rates beyond 4GB. Instead of counting each byte, we count blocks of
bytes. Among other things, it will be useful for the bwlim filter, to be
able to configure shared limits exceeding 4GB/s.
For now, this parameter must be in the range ]0-1024].
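As an illustrative example (numbers picked here for clarity, not taken
from the patch): with "brates-factor 1024" the counters account for
1024-byte blocks, so a measured rate of 4 million blocks per second
represents about 4GB/s while staying far below the 32-bit limit, which
then only caps the rate at around 4TB/s.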
Since 2.5, an array of GPC is provided to replace legacy gpc0/gpc1.
src_inc_gpc is a sample fetch which is used to increment counters in
this array.
A crash occurs if src_inc_gpc is used without any previous track-sc
rule. This is caused by an error in smp_fetch_sc_inc_gpc(). When the
temporary stick counter is created via smp_create_src_stkctr(), the
table pointer argument used is not correct: it points to the counter ID
instead of the table argument. To fix this, use the proper sample fetch
second argument.
This can be reproduced with the following config:

  acl mark src_inc_gpc(0,<table>) -m bool
  tcp-request connection accept if mark
This should be backported up to 2.6.
Guarded functions to kill a sticky session, stksess_kill() and
stksess_kill_if_expired(), may or may not decrement and test its
reference counter before really killing it. This depends on a parameter.
If it is set to a non-zero value, the ref count is decremented and if it
falls to zero, the session is killed. Otherwise, if this parameter is
equal to zero, the session is killed regardless of the ref count value.
In the code, these functions are always called with a non-zero parameter and
the ref count is always decremented and tested. So, there is no reason to
still have a special case, especially because it is not really easy to
say whether it is supported or not. Does it mean it is possible to kill
a sticky session while it is still referenced somewhere? Probably not.
So, does it mean it is possible to kill an unreferenced session? This
case may be problematic because the session is accessed outside of any
lock and thus may be released by another thread because it is
unreferenced. Enlarging the scope of the lock to avoid any issue is
possible but it is a bit of a shame to do so because there is no usage
for now.
The best is to simplify the API and remove this case. Now, stksess_kill()
and stksess_kill_if_expired() functions always decrement and test the ref
count before killing a sticky session.
When we try to kill a session, the shard must be locked before decrementing
the ref count on the session. Otherwise, the ref count can fall to 0 and a
purge task (stktable_trash_oldest or process_table_expire) may release the
session before we have the opportunity to acquire the lock on the shard to
effectively kill the session. This could lead to a double free.
Here is the scenario:
  Thread 1                                  Thread 2

  stksess_kill(ts)
    if (ATOMIC_DEC(&ts->ref_cnt) != 0)
        return
    /* here the ref count is 0 */
                                            stktable_trash_oldest()
                                              LOCK(&sh_lock)
                                              if (!ATOMIC_LOAD(&ts->ref_cnt))
                                                  __stksess_free(ts)
                                              UNLOCK(&sh_lock)
    /* here the session was released */
    LOCK(&sh_lock)
    __stksess_free(ts)    <--- double free
    UNLOCK(&sh_lock)
The bug was introduced in 2.9 by the commit 7968fe3889 ("MEDIUM:
stick-table: change the ref_cnt atomically"). The ref count must be
decremented inside the lock for the stksess_kill() and
stksess_kill_if_expired() functions.
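As a reduced standalone sketch of the fixed ordering (a pthread mutex
and C11 atomics standing in for the shard lock and the ref count,
hypothetical names):

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdlib.h>

  struct sketch_sess {
      atomic_uint ref_cnt;
      /* ... */
  };

  static pthread_mutex_t shard_lock = PTHREAD_MUTEX_INITIALIZER;

  /* take the shard lock *before* dropping our reference, so that a
   * concurrent purge cannot free the session between our decrement and
   * our own free.
   */
  void sess_kill(struct sketch_sess *ts)
  {
      pthread_mutex_lock(&shard_lock);
      if (atomic_fetch_sub(&ts->ref_cnt, 1) == 1)
          free(ts);                  /* we held the last reference */
      pthread_mutex_unlock(&shard_lock);
  }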
This patch should fix issue #2611. It must be backported as far as 2.9.
On 2.9 there is no sharding and the whole table is locked, so the patch
will have to be adapted.
As reported by @Bbulatov in GH #2586, stktable_data_ptr() return value is
used without checking it isn't NULL first, which may happen if the given
type is invalid or not stored in the table.
However, since the data type is set by table_prepare_data_request()
right before cli_io_handler_table() is invoked, it is not expected to
be invalid: table_prepare_data_request() normally checked that the type
is stored inside the table. Thus stktable_data_ptr() should not be failing
at this point, so we add a BUG_ON() to indicate that.
During changes made in 2.7 by commits 8d3c3336f9 ("MEDIUM: stick-table:
make stksess_kill_if_expired() avoid the exclusive lock") and 996f1a5124
("MEDIUM: stick-table: do not take a lock to update t->current anymore."),
the operation was done cautiously one baby step at a time and the final
cleanup was not done, as we're keeping a read lock under an atomic dec.
Furthermore there's a pool_free() call under that lock, and we try to
avoid pool_alloc() and pool_free() under locks for their nasty side
effects (e.g. when memory gets recompacted), so let's really drop it
now.
Note that the performance gain is not really perceptible here, it's
essentially for code clarity reasons that this has to be done.
Because the code in stktable_touch_with_exp() is the same as in other
functions that were previously built around a loop first trying to
upgrade a read lock and then falling back to a direct write lock, there
remains a confusing construct with multiple tests on use_wrlock, which
is obviously zero when tested. Let's remove them since the value is
known and the loop does not exist anymore.