Having a thread_local for the pool cache is messy: all elements need to
be initialized upon startup, but that cannot be done before the threads
are created, and once they are created it's too late. For this reason,
the allocation code used to check for the pool's initialization, and
it was the release code which detected the first call and initialized
the cache on the fly, which is not exactly optimal.
Now that we have initcalls, let's turn this into a per-thread array.
This array is initialized very early in the boot process (STG_PREPARE)
so that pools are always safe to use. This makes it possible to remove
the tests from the alloc/free calls.
Doing just this has removed 2.5 kB of code on all cumulated pool_alloc()
and pool_free() paths.
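As an illustration only, a rough sketch of the idea using simplified types;
struct list, LIST_INIT() and INITCALL0() stand for HAProxy's own list and
initcall helpers, and all other names are made up:

/* per-thread array of pool caches, initialized once at boot (STG_PREPARE)
 * instead of lazily from the release path; names are illustrative */
#define MAX_THREADS       64
#define MAX_CACHED_POOLS  32

struct pool_cache_head_sketch {
    struct list list;     /* head of this pool's cached objects */
    size_t size;          /* size of one object */
    unsigned int count;   /* number of objects currently cached */
};

static struct pool_cache_head_sketch pool_cache[MAX_THREADS][MAX_CACHED_POOLS];

static void init_pool_caches_sketch(void)
{
    int thr, idx;

    for (thr = 0; thr < MAX_THREADS; thr++)
        for (idx = 0; idx < MAX_CACHED_POOLS; idx++)
            LIST_INIT(&pool_cache[thr][idx].list);
}

/* runs before any thread is started, so pool_alloc()/pool_free() never need
 * to test whether the cache was initialized */
INITCALL0(STG_PREPARE, init_pool_caches_sketch);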
Instead of exporting a number of pools and having to manually delete
them in deinit() or to have dedicated destructors to remove them, let's
simply kill all pools on deinit().
For this a new function pool_destroy_all() was introduced. As its name
implies, it destroys and frees all pools (provided they no longer have
any users, of course).
This allowed the removal of 4 implicit destructors, 2 explicit ones, and
11 individual calls to pool_destroy(). In addition it properly removes
the mux_pt_ctx pool which was not cleared on exit (no backport needed
here since it's 1.9 only). The sig_handler pool doesn't need to be
exported anymore and is now static.
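For illustration, a minimal sketch of what such a function can look like,
assuming HAProxy's list_for_each_entry_safe() iterator and the global "pools"
list; pool_destroy() is expected to leave pools that still have users alone:

void pool_destroy_all_sketch(void)
{
    struct pool_head *entry, *back;

    /* walk the global pool list and destroy every entry */
    list_for_each_entry_safe(entry, back, &pools, list)
        pool_destroy(entry);
}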
The new function create_pool_callback() takes 3 args including the
return pointer, and creates a pool with the specified name and size.
In case of allocation error, it emits an error message and returns.
The new macro REGISTER_POOL() registers a callback using this function
and can be used to request the creation of a pool while guaranteeing
that the allocation is checked. An even simpler approach is to use
DECLARE_POOL() and DECLARE_STATIC_POOL(), which both declare and register
the pool.
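A hypothetical usage example (the structure and pool names are made up; only
the macro and the pool_alloc()/pool_free() calls reflect the description
above):

/* declare a structure, its pool, and register the pool's creation in one go */
struct example_ctx {
    struct list list;
    int state;
};

DECLARE_STATIC_POOL(pool_head_example, "example_ctx", sizeof(struct example_ctx));

static void example_use(void)
{
    struct example_ctx *ctx = pool_alloc(pool_head_example);

    if (!ctx)
        return;
    ctx->state = 0;
    /* ... */
    pool_free(pool_head_example, ctx);
}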
This switches explicit calls to various trivial registration methods for
keywords, muxes or protocols from constructors to INITCALL1 at stage
STG_REGISTER. All these calls have in common that they consume a single
pointer and return void. Doing this removes 26 constructors. The following
calls were addressed (a conversion sketch follows the list) :
- acl_register_keywords
- bind_register_keywords
- cfg_register_keywords
- cli_register_kw
- flt_register_keywords
- http_req_keywords_register
- http_res_keywords_register
- protocol_register
- register_mux_proto
- sample_register_convs
- sample_register_fetches
- srv_register_keywords
- tcp_req_conn_keywords_register
- tcp_req_cont_keywords_register
- tcp_req_sess_keywords_register
- tcp_res_cont_keywords_register
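For reference, roughly what each conversion looks like (the keyword list name
is illustrative):

/* before: a constructor run by the C runtime */
__attribute__((constructor))
static void __cfg_kw_init(void)
{
    cfg_register_keywords(&cfg_kws);
}

/* after: an initcall executed during boot at stage STG_REGISTER */
INITCALL1(STG_REGISTER, cfg_register_keywords, &cfg_kws);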
Remaining calls to si_cant_put() were all for lack of room and were
turned to si_rx_room_blk(). A few places where SI_FL_RXBLK_ROOM was
cleared by hand were converted to si_rx_room_rdy().
The now unused si_cant_put() function was removed.
It doesn't make sense to limit this code to applets, as any stream
interface can use it. Let's rename it by simply dropping the "applet_"
part of the name. No other change was made except updating the comments.
Fred reported a random crash related to the pools. This was introduced
by commit e18db9e98 ("MEDIUM: pools: implement a thread-local cache for
pool entries") because the minimum pool item size should have been
increased to 32 bytes to accommodate the 2 double-linked lists.
No backport is needed.
Similarly to what was done in 7a6ad88b02,
take into account that free_list is a void **, and so use
a void ** too when attempting to do a CAS.
The calls to HA_ATOMIC_CAS() on the lockfree version of the pool allocator
were mistakenly done on (void*) for the old value instead of (void **).
While this has no impact on "recent" gcc, it does have one for gcc < 4.7
since the CAS was open coded and it's not possible to assign a temporary
variable of type "void".
No backport is needed, this only affects 1.9.
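A simplified, illustrative sketch of the pattern (barriers and ABA handling
omitted, names made up): the expected-value argument passed to
HA_ATOMIC_CAS() is a void ** (here &cmp, with cmp being a void *) and is
never cast down to (void *), so the open-coded CAS used by gcc < 4.7 never
has to manipulate an object of type "void":

static void *pool_pop_sketch(struct pool_head *pool)
{
    void **free_list_ptr = (void **)&pool->free_list;
    void *cmp, *next;

    do {
        cmp = *free_list_ptr;      /* current head of the free list */
        if (!cmp)
            return NULL;           /* pool is empty */
        next = *(void **)cmp;      /* second element becomes the new head */
    } while (!HA_ATOMIC_CAS(free_list_ptr, &cmp, next));

    return cmp;
}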
Each thread now keeps the last ~512 kB of freed objects into a local
cache. There are some heuristics involved so that a specific pool cannot
use more than 1/8 of the total cache in number of objects. Tests have
shown that 512 kB is an optimal size on a 24-thread test running on a
dual-socket machine, resulting in an overall 7.5% performance increase
and the cache miss ratio dropping from 19.2% to 17.7%. Anyway it seems
pointless to keep more than an L2 cache's worth of objects, which probably
explains why sizes between 256 and 512 kB are optimal.
Cached objects appear in two lists, one per pool and one LRU to help
with fair eviction. Currently there is no way to check each thread's
cache state nor to flush it. This cache cannot be disabled and is
enabled as soon as the lockless pools are enabled (i.e.: threads are
enabled, no pool debugging is in use and the CPU supports a double word
CAS).
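A sketch of the per-object bookkeeping this implies; the names are
illustrative and the 512 kB and 1/8 figures simply restate the description
above:

/* each cached object is linked twice: in its pool's own list and in the
 * thread-wide LRU used for fair eviction; this is also why pool entries
 * need at least 32 bytes (two double-linked lists) */
struct pool_cache_item_sketch {
    struct list by_pool;   /* link in the owning pool's cache list */
    struct list by_lru;    /* link in the per-thread LRU */
};

#define POOL_CACHE_SIZE  (512 * 1024)   /* per-thread cache budget, in bytes */
/* heuristic: a single pool may not hold more than 1/8 of the cached objects */
#define POOL_MAX_SHARE(total_count)  ((total_count) / 8)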
For caching it will be convenient to have indexes associated with pools,
without having to dereference the pool itself. One solution could consist
of replacing all pool pointers with integers, but this would limit the
number of allocatable pools. Instead, the first 32 pools are allocated
from a pre-allocated array whose base address is known so that it's trivial
to convert a pool to an index in this array. Pools that cannot fit there
will be allocated normally.
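A sketch of the resulting pointer-to-index conversion (array and helper names
are illustrative):

#include <sys/types.h>

#define MAX_BASE_POOLS 32

static struct pool_head pool_base_start[MAX_BASE_POOLS];  /* pre-allocated slab */

/* returns the index of a pool taken from the static array, or -1 for pools
 * allocated normally (beyond the first 32) */
static inline ssize_t pool_get_index_sketch(const struct pool_head *pool)
{
    size_t idx = (size_t)(pool - pool_base_start);

    return (idx < MAX_BASE_POOLS) ? (ssize_t)idx : -1;
}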
Chunks are only a subset of a buffer (a non-wrapping version with no head
offset). Despite this we still carry a lot of duplicated code between
buffers and chunks. Replacing chunks with buffers would significantly
reduce the maintenance efforts. This first patch renames the chunk's
fields to match the name and types used by struct buffers, with the goal
of isolating the code changes from the declaration changes.
Most of the changes were made with spatch using this coccinelle script :
@rule_d1@
typedef chunk;
struct chunk chunk;
@@
- chunk.str
+ chunk.area
@rule_d2@
typedef chunk;
struct chunk chunk;
@@
- chunk.len
+ chunk.data
@rule_i1@
typedef chunk;
struct chunk *chunk;
@@
- chunk->str
+ chunk->area
@rule_i2@
typedef chunk;
struct chunk *chunk;
@@
- chunk->len
+ chunk->data
Some minor updates to 3 http functions had to be performed so that they take
size_t arguments instead of ints, in order to match the unsigned length here.
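In other words, the renamed fields line up as follows (a simplified
before/after view with distinct tags so both can be shown; the rest of the
structure is left out):

#include <stddef.h>

/* before the rename */
struct chunk_before {
    char *str;      /* beginning of the data */
    int len;        /* current length */
};

/* after the rename: names and types follow struct buffer */
struct chunk_after {
    char *area;     /* was "str" */
    size_t data;    /* was "len", now unsigned, hence the size_t updates */
};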
Since commit cf975d4 ("MINOR: pools/threads: Implement lockless memory
pools."), we support lockless pools. However the parts dedicated to
detecting use-after-free are not present in this part, making DEBUG_UAF
useless in this situation.
The present patch sets a new define CONFIG_HAP_LOCKLESS_POOLS when such
a compatible architecture is detected and pool debugging is not requested,
then makes use of it everywhere in the pools and buffers functions. This
way enabling DEBUG_UAF will automatically disable the lockless version.
No backport is needed as this is purely 1.9-dev.
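A sketch of the resulting guard; USE_THREAD and HA_HAVE_CAS_DW stand for
"threads enabled" and "double-word CAS available" and may not be the exact
upstream macro names:

/* lockless pools are only compiled in when threads are enabled, a double-word
 * CAS is available, and use-after-free debugging is not requested */
#if defined(USE_THREAD) && defined(HA_HAVE_CAS_DW) && !defined(DEBUG_UAF)
#define CONFIG_HAP_LOCKLESS_POOLS
#endif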
During the migration to the second version of the pools, the new
functions and pool pointers were all called "pool_something2()" and
"pool2_something". Now there's no more pool v1 code and it's a real
pain to still have to deal with this. Let's clean this up now by
removing the "2" everywhere, and by renaming the pool heads
"pool_head_something".
For HTTP/2 we'll need some buffer-only equivalent functions to some of
the ones applying to channels and still squatting the bi_* / bo_*
namespace. Since these names have been misleading for quite some time now
and are really getting annoying, it's time to rename them. This
commit will use "ci/co" as the prefix (for "channel in", "channel out")
instead of "bi/bo". The following ones were renamed :
bi_getblk_nc, bi_getline_nc, bi_putblk, bi_putchr,
bo_getblk, bo_getblk_nc, bo_getline, bo_getline_nc, bo_inject,
bi_putchk, bi_putstr, bo_getchr, bo_skip, bi_swpbuf
Usually it's desirable to merge similarly sized pools, which is the
reason why their size is rounded up to the next multiple of 16. But
for the buffers this is problematic because we add the size of
struct buffer to the user-requested size, and the rounding results
in 8 extra bytes that are usable in the end. So the user gets more
bytes than asked for, and in case of SSL it results in short writes
for the extra bytes that are sent above multiples of 16 kB.
So we add a new flag MEM_F_EXACT to request that the size not be
rounded up when creating the entry. This doesn't disable merging, though.
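A sketch of the rounding performed at pool creation time (the MEM_F_EXACT
value shown here is illustrative):

#define MEM_F_EXACT 0x2   /* illustrative value */

static unsigned int pool_rounded_size_sketch(unsigned int size, unsigned int flags)
{
    /* round up to the next multiple of 16 so that similarly sized pools can
     * be merged, unless the caller asked for the exact size (e.g. buffers,
     * where the 8 stray bytes cause short writes over SSL) */
    if (!(flags & MEM_F_EXACT))
        size = (size + 15) & ~15U;
    return size;
}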
When DEBUG_MEMORY_POOLS is used, we now use the link pointer at the end
of the pool to store a pointer to the pool, and to control it during
pool_free2() in order to serve four purposes :
- at any instant we can know what pool an object was allocated from
when examining memory, hence how we should possibly decode it ;
- it serves to detect double free when they happen, as the pointer
cannot be valid after the element is linked into the pool ;
- it serves to detect if an element is released in the wrong pool ;
- it serves as a canary, to detect if some buffers experienced an
overflow before being released.
All these elements will definitely help better troubleshoot strange
situations, or at least confirm that certain conditions did not happen.
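A sketch of how the trailing pointer can be used; the helper names are made
up (the real code relies on macros), and pool->size is assumed to already
account for the extra pointer reserved by DEBUG_MEMORY_POOLS:

#include <stdlib.h>

static inline void **pool_mark_sketch(struct pool_head *pool, void *item)
{
    /* last word of the item */
    return (void **)((char *)item + pool->size - sizeof(void *));
}

static inline void pool_set_mark_sketch(struct pool_head *pool, void *item)
{
    /* on allocation: record the owner, handy when examining a core */
    *pool_mark_sketch(pool, item) = (void *)pool;
}

static inline void pool_check_mark_sketch(struct pool_head *pool, void *item)
{
    /* on free: a mismatch reveals a double free, a release into the wrong
     * pool, or an overflow that crushed the canary */
    if (*pool_mark_sketch(pool, item) != (void *)pool)
        abort();
}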
When debugging a core file, it's sometimes convenient to be able to
visit the released entries in the pools (typically last released
session). Unfortunately the first bytes of these entries are destroyed
by the link elements of the pool. And of course, most structures have
their most accessed elements at the beginning of the structure (typically
flags). Let's add a build-time option DEBUG_MEMORY_POOLS which allocates
an extra pointer in each pool to put the link at the end of each pool
item instead of the beginning.
When debugging an issue, sometimes it can be useful to be able to use
byte 0 to poison memory areas, resulting in the same effect as a calloc().
This patch changes the default mem_poison_byte to -1 to disable it so that
all positive values are usable.
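A sketch of the resulting behaviour: -1 (the default) disables poisoning, and
any value >= 0, including 0, is a valid poison byte; poisoning with 0 mimics
calloc() and exposes code relying on implicitly zeroed memory:

#include <string.h>

static int mem_poison_byte = -1;

static inline void pool_poison_sketch(void *area, size_t size)
{
    if (mem_poison_byte >= 0)
        memset(area, mem_poison_byte, size);
}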
Till now this function would only allocate one entry at a time. But with
dynamic buffers we'd like to allocate the number of missing entries to
properly refill the pool.
Let's modify it to take a minimum number of available entries. This means
that when we know we need at least a number of available entries, we can
ask to allocate all of them at once. It also ensures that we don't move
the pointers back and forth between the caller and the pool, and that we
don't call pool_gc2() for each failed malloc. Instead, it's called only
once and the malloc is only allowed to fail once.
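A sketch of the intended behaviour; the field names (allocated, used, size)
follow the surrounding messages, the helpers are hypothetical, and the real
function also hands one entry back to the caller:

#include <stdlib.h>

void pool_gc2_sketch(void);                                            /* hypothetical */
void pool_put_to_free_list_sketch(struct pool_head *pool, void *item); /* hypothetical */

/* make sure at least <avail> entries are available in <pool>'s free list;
 * the GC is called at most once and malloc() may only fail once */
static int pool_refill_sketch(struct pool_head *pool, unsigned int avail)
{
    int failed = 0;

    while (pool->allocated - pool->used < avail) {
        void *item = malloc(pool->size);

        if (!item) {
            if (failed++)
                return 0;            /* second failure: give up */
            pool_gc2_sketch();       /* reclaim unused entries, once */
            continue;
        }
        pool->allocated++;
        pool_put_to_free_list_sketch(pool, item);
    }
    return 1;
}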
pool_alloc2() used to pick the entry from the pool, fall back to
pool_refill_alloc(), and perform the poisoning itself, which
pool_refill_alloc() was also doing. While this led to optimal
code size, it imposes memory poisoning on the buffers as well,
which is extremely slow on large buffers.
This patch cuts the allocator into 3 layers :
- a layer to pick the first entry from the pool without falling back to
  pool_refill_alloc() : pool_get_first()
- a layer to allocate a dirty area by falling back to pool_refill_alloc()
  but never performing the poisoning : pool_alloc_dirty()
- pool_alloc2() which calls the latter and optionally poisons the area
No functional changes were made.
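A sketch of the resulting layering (prototypes are simplified; the poisoning
reuses mem_poison_byte as above):

#include <string.h>

extern int mem_poison_byte;

void *pool_get_first_sketch(struct pool_head *pool);     /* layer 1: fast path */
void *pool_refill_alloc_sketch(struct pool_head *pool);  /* slow path */

/* layer 2: fall back to the refill path, but never poison */
static void *pool_alloc_dirty_sketch(struct pool_head *pool)
{
    void *p = pool_get_first_sketch(pool);

    if (!p)
        p = pool_refill_alloc_sketch(pool);
    return p;
}

/* layer 3: the general-purpose allocator, optionally poisoning the area;
 * buffer allocation can use the dirty layer directly and skip the memset() */
static void *pool_alloc2_sketch(struct pool_head *pool)
{
    void *p = pool_alloc_dirty_sketch(pool);

    if (p && mem_poison_byte >= 0)
        memset(p, mem_poison_byte, pool->size);
    return p;
}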
There's a long-standing bug in pool_gc2(). It tries to protect the pool
against releasing of too many entries but the formula is wrong as it
compares allocated to minavail instead of (allocated-used) to minavail.
Under memory contention, it ends up releasing more than what is granted
by minavail and causes trouble to the dynamic buffer allocator.
This bug is in fact major by itself, but since minavail has never been
used till now, there is no impact at least in mainline. A backport to
1.5 is desired anyway in case any future backport or out-of-tree patch
relies on this.
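A sketch of the corrected guard inside the GC loop (the helper popping an
entry from the free list is hypothetical; field names follow the description):

#include <stdlib.h>

void *pool_pop_free_sketch(struct pool_head *pool);   /* hypothetical */

static void pool_trim_sketch(struct pool_head *pool)
{
    /* keep at least <minavail> spare (allocated minus used) entries for the
     * dynamic buffer allocator; the old code wrongly compared <allocated>
     * alone against <minavail> */
    while (pool->free_list &&
           pool->allocated - pool->used > pool->minavail) {
        free(pool_pop_free_sketch(pool));
        pool->allocated--;
    }
}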
show pools
Dump the status of internal memory pools. This is useful to track memory
usage when suspecting a memory leak for example. It does exactly the same
as the SIGQUIT when running in foreground except that it does not flush
the pools.
From time to time, some bugs are discovered that are caused by non-initialized
memory areas. It happens that most platforms return a zero-filled area upon
first malloc() thus hiding potential bugs. This patch also replaces malloc()
in pools with calloc() to ensure that all platforms exhibit the same behaviour
upon startup. In order to catch these bugs more easily, add a -dM command line
flag to enable memory poisoning. Optionally, passing -dM<byte> forces the
poisoning byte to <byte>.
A race condition exists in the hot reconfiguration code. It is
theoretically possible that the second signal is sent during a free()
in the first list, which can cause crashes or freezes (the latter
have been observed). Just set up a counter to ensure we do not
recurse.
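A sketch of such a guard (the handler name and surrounding logic are made up):

#include <signal.h>

static volatile sig_atomic_t soft_stopping;

static void sig_soft_stop_sketch(int sig)
{
    (void)sig;
    if (soft_stopping++)
        return;   /* a previous signal is still freeing the lists: don't recurse */
    /* ... free the first list, then proceed with the reload ... */
}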
New functions implemented:
- deinit_pollers: called at the end of deinit()
- prune_acl: called via list_for_each_entry_safe
Add missing pool_destroy2 calls:
- p->hdr_idx_pool
- pool2_tree64
Implement all task stopping:
- health-check: needs new "struct task" in the struct server
- queue processing: queue_mgt
- appsess_refresh: appsession_refresh
before (idle system):
==6079== LEAK SUMMARY:
==6079== definitely lost: 1,112 bytes in 75 blocks.
==6079== indirectly lost: 53,356 bytes in 2,090 blocks.
==6079== possibly lost: 52 bytes in 1 blocks.
==6079== still reachable: 150,996 bytes in 504 blocks.
==6079== suppressed: 0 bytes in 0 blocks.
after (idle system):
==6945== LEAK SUMMARY:
==6945== definitely lost: 7,644 bytes in 137 blocks.
==6945== indirectly lost: 9,913 bytes in 587 blocks.
==6945== possibly lost: 0 bytes in 0 blocks.
==6945== still reachable: 0 bytes in 0 blocks.
==6945== suppressed: 0 bytes in 0 blocks.
before (running system for ~2m):
==9343== LEAK SUMMARY:
==9343== definitely lost: 1,112 bytes in 75 blocks.
==9343== indirectly lost: 54,199 bytes in 2,122 blocks.
==9343== possibly lost: 52 bytes in 1 blocks.
==9343== still reachable: 151,128 bytes in 509 blocks.
==9343== suppressed: 0 bytes in 0 blocks.
after (running system for ~2m):
==11616== LEAK SUMMARY:
==11616== definitely lost: 7,644 bytes in 137 blocks.
==11616== indirectly lost: 9,981 bytes in 591 blocks.
==11616== possibly lost: 0 bytes in 0 blocks.
==11616== still reachable: 4 bytes in 1 blocks.
==11616== suppressed: 0 bytes in 0 blocks.
Still not perfect, but a significant improvement.
Since pools v2, the way pools were destroyed at exit was incorrect
because it did not account for the users of those pools before freeing
them. The added test also ensures there is no double free.
When we're interrupted by another instance, it is very likely
that the other one will need some memory. Now we know how to
free what is not used, so let's do it.
Also, only free non-null pointers. Previously, pool_destroy()
implicitly checked for this case, which was incidentally
needed.
- keep the number of users of each pool
- call the garbage collector on out of memory conditions
- sort the pools by size for faster creation
- force the alignment size to 16 bytes instead of 4*sizeof(void *) (see the
  sketch below)
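A sketch combining the users count and the 16-byte alignment mentioned above
(sorting by size and the GC-on-OOM behaviour are omitted, and all names are
illustrative):

#include <stdlib.h>

struct pool_head_sketch {
    struct pool_head_sketch *next;
    unsigned int size;
    unsigned int users;   /* number of distinct callers sharing this pool */
    const char *name;
};

static struct pool_head_sketch *pools_sketch;

static struct pool_head_sketch *create_pool_sketch(const char *name, unsigned int size)
{
    struct pool_head_sketch *pool;

    size = (size + 15) & ~15U;   /* force 16-byte aligned entry sizes */
    for (pool = pools_sketch; pool; pool = pool->next) {
        if (pool->size == size) {
            pool->users++;       /* merge with the existing pool */
            return pool;
        }
    }

    pool = calloc(1, sizeof(*pool));
    if (!pool)
        return NULL;
    pool->name  = name;
    pool->size  = size;
    pool->users = 1;
    pool->next  = pools_sketch;
    pools_sketch = pool;
    return pool;
}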