MEDIUM: stick-tables: relax stktable_trash_oldest() to only purge what is needed

stktable_trash_oldest() insists a lot on purging what was requested,
only limited by STKTABLE_MAX_UPDATES_AT_ONCE. It is called in two
situations: to allocate a new stksess, and to purge the entries of a
stopping process. The cost of iterating over all shards is huge, and a
shard lock is taken each time before looking up entries.

Moreover, multiple threads can end up doing the same work, searching
hard for many entries to purge when only one is needed. Furthermore,
all threads start from the same shard, hence they synchronize on the
same locks. All of this is costly to other operations such as accesses
from peers.

This commit simplifies the approach by ignoring the budget, starting
from a random shard number, and using a trylock so as to be able to
give up early in case of contention. The approach chosen here consists
in trying hard to flush at least one entry: once at least one entry has
been evicted, or once one trylock has already failed, any further
trylock failure ends the operation.

The function now returns success as long as at least one entry was freed.
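
To illustrate the logic above, here is a rough standalone sketch of the
give-up-early pattern, using plain pthread rwlocks instead of HAProxy's
HA_RWLOCK macros. The shards[] array, NB_SHARDS and evict_one_expired()
are hypothetical stand-ins for the real stick-table structures, and the
loop is deliberately simplified compared with the patch below:

  /* Not HAProxy code: a simplified model of the eviction loop described
   * above; the shard locks are assumed to be set up with pthread_rwlock_init().
   */
  #include <pthread.h>
  #include <stdlib.h>

  #define NB_SHARDS 32

  struct shard {
      pthread_rwlock_t lock;
      /* the expiration tree of the real code would live here */
  };

  static struct shard shards[NB_SHARDS];

  /* hypothetical helper: evicts one expired entry under the shard lock,
   * returns the number of entries released (0 or 1)
   */
  extern int evict_one_expired(struct shard *sh);

  /* returns non-zero if at least one entry was released */
  int trash_oldest_sketch(void)
  {
      int start = rand() % NB_SHARDS; /* random start, as with statistical_prng_range() */
      int failed_once = 0;
      int freed = 0;
      int i;

      for (i = 0; i < NB_SHARDS; i++) {
          struct shard *sh = &shards[(start + i) % NB_SHARDS];

          if (pthread_rwlock_trywrlock(&sh->lock) != 0) {
              if (freed)
                  break; /* some room was already made, don't insist */
              if (failed_once)
                  break; /* already waited once, that's enough */
              failed_once = 1;
              pthread_rwlock_wrlock(&sh->lock); /* wait at most this one time */
          }

          freed += evict_one_expired(sh);
          pthread_rwlock_unlock(&sh->lock);
      }
      return freed;
  }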

With this, watchdog warnings no longer show up during the tests, though
a few still remain when stopping them (these are not related to this
function but to the contention from process_table_expire()).

With this change, under high contention some entries' purge might be
postponed and the table may occasionally contain slightly more entries
than its configured size (though this already happens since stksess_new()
first increments ->current before decrementing it).
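
As a rough sketch of that behaviour (not the actual stksess_new() code),
the temporary overshoot comes from reserving the slot before attempting
the now best-effort purge; all names below are illustrative:

  /* Illustration only (not the actual stksess_new()): why ->current may
   * briefly exceed the configured size when eviction is best-effort.
   * The names below are made up for the example.
   */
  #include <stdatomic.h>
  #include <stdlib.h>

  extern int trash_oldest_sketch(void); /* best-effort purge, may free nothing */

  struct table_sketch {
      atomic_uint current;  /* entries currently allocated */
      unsigned int size;    /* configured limit */
  };

  void *alloc_entry_sketch(struct table_sketch *t)
  {
      void *ptr;

      /* the slot is reserved first; under load several threads may all
       * observe a value above t->size at the same time
       */
      if (atomic_fetch_add(&t->current, 1) >= t->size) {
          /* the purge may give up early on lock contention; the extra
           * entry is kept anyway and will be recovered later
           */
          trash_oldest_sketch();
      }

      ptr = malloc(64); /* stand-in for pool_alloc(t->pool) */
      if (!ptr)
          atomic_fetch_sub(&t->current, 1); /* give the slot back on failure */
      return ptr;
  }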

Measurements were made on a 64-core system with 8 peers of 16 threads
each, at CPU saturation (350k req/s, each doing 10 track-sc), over 10M
requests, with 3 different approaches:

  - this one resulted in 1500 failures to find an entry (0.015%
    size overhead), with the lowest contention and the fairest
    peers distribution.

  - leaving only after a success resulted in 229 failures (0.0029%
    size overhead) but doubled the time spent in the function (on
    the write lock precisely).

  - leaving only when both a success and a failed lock were met
    resulted in 31 failures (0.00031% overhead) but the contention
    was again high enough that peers were not all up to date.

Considering that exceeding the configured number of entries by 0.015%
on a saturated machine is pretty minimal, this mechanism is kept.

This should be backported to 3.2 after a bit more testing, as it
resolves some watchdog warnings and panics. It requires the preceding
commit "MINOR: stick-table: permit stksess_new() to temporarily
allocate more entries" in order to over-allocate instead of failing in
case of contention.

Willy Tarreau 2025-09-09 11:56:11 +02:00
parent b119280f60
commit f87cf8b76e
3 changed files with 41 additions and 26 deletions


@@ -71,7 +71,7 @@ struct stkctr *smp_create_src_stkctr(struct session *sess, struct stream *strm,
 int stktable_compatible_sample(struct sample_expr *expr, unsigned long table_type);
 int stktable_register_data_store(int idx, const char *name, int std_type, int arg_type);
 int stktable_get_data_type(char *name);
-int stktable_trash_oldest(struct stktable *t, int to_batch);
+int stktable_trash_oldest(struct stktable *t);
 int __stksess_kill(struct stktable *t, struct stksess *ts);
 
 /************************* Composite address manipulation *********************


@@ -2190,7 +2190,6 @@ struct task *manage_proxy(struct task *t, void *context, unsigned int state)
      * to push to a new process and
      * we are free to flush the table.
      */
-    int budget;
     int cleaned_up;
 
     /* We purposely enforce a budget limitation since we don't want
@@ -2203,14 +2202,12 @@ struct task *manage_proxy(struct task *t, void *context, unsigned int state)
      * Moreover, we must also anticipate the pool_gc() call which
      * will also be much slower if there is too much work at once
      */
-    budget = MIN(p->table->current, (1 << 15)); /* max: 32K */
-    cleaned_up = stktable_trash_oldest(p->table, budget);
+    cleaned_up = stktable_trash_oldest(p->table);
     if (cleaned_up) {
         /* immediately release freed memory since we are stopping */
         pool_gc(NULL);
-        if (cleaned_up > (budget / 2)) {
-            /* most of the budget was used to purge entries,
-             * it is very likely that there are still trashable
+        if (cleaned_up) {
+            /* it is very likely that there are still trashable
              * entries in the table, reschedule a new cleanup
              * attempt ASAP
              */


@@ -286,12 +286,12 @@ static struct stksess *__stksess_init(struct stktable *t, struct stksess * ts)
 }
 
 /*
- * Trash oldest <to_batch> sticky sessions from table <t>
- * Returns number of trashed sticky sessions. It may actually trash less
- * than expected if finding these requires too long a search time (e.g.
- * most of them have ts->ref_cnt>0). This function locks the table.
+ * Trash up to STKTABLE_MAX_UPDATES_AT_ONCE oldest sticky sessions from table
+ * <t>. Returns non-null if it managed to release at least one entry. It will
+ * avoid waiting on a lock if it managed to release at least one object. It
+ * tries hard to limit the time spent evicting objects.
  */
-int stktable_trash_oldest(struct stktable *t, int to_batch)
+int stktable_trash_oldest(struct stktable *t)
 {
     struct stksess *ts;
     struct eb32_node *eb;
@@ -299,24 +299,34 @@ int stktable_trash_oldest(struct stktable *t, int to_batch)
     int max_per_shard;
     int done_per_shard;
     int batched = 0;
+    int to_batch;
     int updt_locked;
+    int failed_once = 0;
     int looped;
     int shard;
+    int init_shard;
 
-    shard = 0;
+    /* start from a random shard number to avoid starvation in the last ones */
+    shard = init_shard = statistical_prng_range(CONFIG_HAP_TBL_BUCKETS - 1);
 
-    if (to_batch > STKTABLE_MAX_UPDATES_AT_ONCE)
-        to_batch = STKTABLE_MAX_UPDATES_AT_ONCE;
+    to_batch = STKTABLE_MAX_UPDATES_AT_ONCE;
 
     max_search = to_batch * 2; // no more than 50% misses
     max_per_shard = (to_batch + CONFIG_HAP_TBL_BUCKETS - 1) / CONFIG_HAP_TBL_BUCKETS;
 
-    while (batched < to_batch) {
+    do {
         done_per_shard = 0;
         looped = 0;
         updt_locked = 0;
 
-        HA_RWLOCK_WRLOCK(STK_TABLE_LOCK, &t->shards[shard].sh_lock);
+        if (HA_RWLOCK_TRYWRLOCK(STK_TABLE_LOCK, &t->shards[shard].sh_lock) != 0) {
+            if (batched)
+                break; // no point insisting, we have or made some room
+            if (failed_once)
+                break; // already waited once, that's enough
+            failed_once = 1;
+            HA_RWLOCK_WRLOCK(STK_TABLE_LOCK, &t->shards[shard].sh_lock);
+        }
 
         eb = eb32_lookup_ge(&t->shards[shard].exps, now_ms - TIMER_LOOK_BACK);
 
         while (batched < to_batch && done_per_shard < max_per_shard) {
@@ -349,8 +359,12 @@ int stktable_trash_oldest(struct stktable *t, int to_batch)
             if (ts->expire != ts->exp.key || HA_ATOMIC_LOAD(&ts->ref_cnt) != 0) {
             requeue:
-                if (!tick_isset(ts->expire))
+                if (!tick_isset(ts->expire)) {
+                    /* don't waste more time here if we're not alone */
+                    if (failed_once)
+                        break;
                     continue;
+                }
 
                 ts->exp.key = ts->expire;
                 eb32_insert(&t->shards[shard].exps, &ts->exp);
@@ -369,6 +383,9 @@ int stktable_trash_oldest(struct stktable *t, int to_batch)
                 if (!eb || tick_is_lt(ts->exp.key, eb->key))
                     eb = &ts->exp;
 
+                /* don't waste more time here if we're not alone */
+                if (failed_once)
+                    break;
                 continue;
             }
 
@@ -395,6 +412,10 @@ int stktable_trash_oldest(struct stktable *t, int to_batch)
             __stksess_free(t, ts);
             batched++;
             done_per_shard++;
+
+            /* don't waste more time here if we're not alone */
+            if (failed_once)
+                break;
         }
 
         if (updt_locked)
@@ -402,13 +423,10 @@ int stktable_trash_oldest(struct stktable *t, int to_batch)
 
         HA_RWLOCK_WRUNLOCK(STK_TABLE_LOCK, &t->shards[shard].sh_lock);
 
-        if (max_search <= 0)
-            break;
-
-        shard = (shard + 1) % CONFIG_HAP_TBL_BUCKETS;
-        if (!shard)
-            break;
-    }
+        shard++;
+        if (shard >= CONFIG_HAP_TBL_BUCKETS)
+            shard = 0;
+    } while (max_search > 0 && shard != init_shard);
 
     return batched;
 }
@@ -438,7 +456,7 @@ struct stksess *stksess_new(struct stktable *t, struct stktable_key *key)
              * locking contention but it's not a problem in practice,
              * these will be recovered later.
              */
-            stktable_trash_oldest(t, (t->size >> 8) + 1);
+            stktable_trash_oldest(t);
         }
 
         ts = pool_alloc(t->pool);