Commit 82cd5c13a ("OPTIM: backend: skip LB when we know the backend is
full") has uncovered a long-burried bug in the dequeing code: when a
server releases a connection, it picks a new one from the proxy's or
its queue. Technically speaking it only picks a pendconn which is a
link between a position in the queue and a stream. It then sets this
pendconn's target to itself, and wakes up the stream's task so that
it can try to connect again.
The stream then goes through the regular connection setup phases,
calls back_try_conn_req() which calls pendconn_dequeue(), which
sets the stream's target to the pendconn's and releases the pendconn.
It then reaches assign_server() which sees no SF_ASSIGNED and calls
assign_server_and_queue() to perform load balancing or queuing. This
one first destroys the stream's target and gets ready to perform load
balancing. At this point we're load-balancing for no reason since we
already knew what server was available. And this is where the commit
above comes into play: the check for the backend's queue above may
detect other connections that arrived in between, and will immediately
return FULL, forcing this request back into the queue. If the server
had a very low maxconn (e.g. 1 due to a long slowstart), it's possible
that this evicted connection was the last one on the server and that
no other one will ever be present to process the queue. Usually a
regularly processed request will still have its own srv_conn that will
be used during stream_free() to dequeue other connections. But if the
server had a down-up cycle, then a call to pendconn_grab_from_px()
may start to dequeue entries which had no srv_conn and which will have
no server slot to offer when they expire, thus maintaining the situation
above forever. Worse, as new requests arrive, there are always some
requests in the queue and the situation feeds on itself.
The correct fix here is to properly set SF_ASSIGNED in pendconn_dequeue()
when the stream's target is assigned (as it's what this flag means), so
as to avoid a load-balancing pass when dequeuing.
Many thanks to Pierre Cheynier for the numerous detailed traces he
provided that helped narrow this problem down.
This could be backported to all stable versions, but in practice only
2.3 and above are really affected since the presence of the commit
above. Given how tricky this code is it's better to limit it to those
versions that really need it.
This patch replaces roughly all occurrences of an HA_ATOMIC_ADD(&foo, 1)
or HA_ATOMIC_SUB(&foo, 1) with the equivalent HA_ATOMIC_INC(&foo) and
HA_ATOMIC_DEC(&foo) respectively. These are 507 changes over 45 files.
Currently our atomic ops return a value but it's never known whether
the fetch is done before or after the operation, which causes some
confusion each time the value is desired. Let's create an explicit
variant of these operations suffixed with _FETCH to explicitly mention
that the fetch occurs after the operation, and make use of it at the
few call places.
The two algos defining these functions (first and leastconn) do not need the
server's lock. However it's already present in pendconn_process_next_strm()
so the API must be updated so that the functions may take it if needed and
that the callers indicate whether they already own it.
As such, the call places (backend.c and stream.c) now do not take it
anymore, queue.c was unchanged since it's already held, and both "first"
and "leastconn" were updated to take it if not already held.
A quick test on the "first" algo showed a jump from 432 to 565k rps by
just dropping the lock in stream.c!
The remaining contention on the server lock solely comes from
sess_change_server() which takes the lock to add and remove a
stream from the server's actconn list. This is both expensive
and pointless since we have mt-lists, and this list is only
used by the CLI's "shutdown server sessions" command!
Let's migrate to an mt-list and remove the need for this costly
lock. By doing so, the request rate increased by ~1.8%.
This patch removes unecessary tests on p or pp pointers in
pendconn_process_next_strm() function. This should make cppcheck happy and
avoid false report of null pointer dereference.
This patch should fix the issue #1036.
As reported by Coverity in issue #917, commit 96bca33 ("OPTIM: queue:
decrement the nbpend and totpend counters outside of the lock")
introduced a bug when moving the increments outside of the loop,
because we can't always rely on the pendconn "p" here as it may
be null. We can retrieve the proxy pointer directly from s->proxy
instead. The same is true for pendconn_redistribute(), though the
last "p" pointer there was still valid. This patch fixes both.
No backport is needed, this was introduced just before 2.3-dev8.
We don't need to do that inside the lock. However since the operation
used to be done in deep functions, we have to make it resurface closer
to visible parts. It remains reasonably self-contained in queue.c so
that's not that big of a deal. Some places (redistribute) could benefit
from a single operation for all counts at once. Others like
pendconn_process_next_strm() are still called with both locks held but
now it will be possible to change this.
Instead of incrementing, decrementing them and updating their max under
the lock, make them atomic and keep them out of the lock as much as
possible. For __pendconn_unlink_* it would be wide to decide to move
these counters outside of the function, inside the callers so that a
single atomic op can be done per counter even for groups of operations.
Similarly to previous changes, we know if we're dealing with a server
or proxy lock so let's directly lock at the finest possible places
there. It's worth noting that a part of the operation consisting in
an increment and update of a max could be done outside of the lock
using atomic ops and a CAS.
The function is called with the lock held and does too many tests for
things that are already known from its callers. Let's split it in two
so that its callers call either the per-server or per-proxy function
depending on where the element is (since they had to determine it
prior to taking the lock).
This is an anticipation of finer grained locking for the queues. For now
all lock places take a write lock so that there is no difference at all
with previous code.
In commit 5cd4bbd7a ("BUG/MAJOR: threads/queue: Fix thread-safety issues
on the queues management") the counter of transferred connections was
accidently lost, so that when a server goes down with connections in its
queue, it will always be reported that 0 connection were transferred.
This should be backported as far as 1.8 since the patch above was
backported there.
This patch fixes all the leftovers from the include cleanup campaign. There
were not that many (~400 entries in ~150 files) but it was definitely worth
doing it as it revealed a few duplicates.
This one was not easy because it was embarking many includes with it,
which other files would automatically find. At least global.h, arg.h
and tools.h were identified. 93 total locations were identified, 8
additional includes had to be added.
In the rare files where it was possible to finalize the sorting of
includes by adjusting only one or two extra lines, it was done. But
all files would need to be rechecked and cleaned up now.
It was the last set of files in types/ and proto/ and these directories
must not be reused anymore.
extern struct dict server_name_dict was moved from the type file to the
main file. A handful of inlined functions were moved at the bottom of
the file. Call places were updated to use server-t.h when relevant, or
to simply drop the entry when not needed.
The files remained mostly unchanged since they were OK. However, half of
the users didn't need to include them, and about as many actually needed
to have it and used to find functions like srv_currently_usable() through
a long chain that broke when moving the file.
It was moved without any change, however many callers didn't need it at
all. This was a consequence of the split of proto_http.c into several
parts that resulted in many locations to still reference it.
The TASK_IS_TASKLET() macro was moved to the proto file instead of the
type one. The proto part was a bit reordered to remove a number of ugly
forward declaration of static inline functions. About a tens of C and H
files had their dependency dropped since they were not using anything
from task.h.
This one is particularly tricky to move because everyone uses it
and it depends on a lot of other types. For example it cannot include
arg-t.h and must absolutely only rely on forward declarations to avoid
dependency loops between vars -> sample_data -> arg. In order to address
this one, it would be nice to split the sample_data part out of sample.h.
Now the file is ready to be stored into its final destination. A few
minor reorderings were performed to keep the file properly organized,
making the various sections more visible (cache & lockless).
In addition and to stay consistent, memory.c was renamed to pool.c.
This one is included almost everywhere and used to rely on a few other
.h that are not needed (unistd, stdlib, standard.h). It could possibly
make sense to split it into multiple parts to distinguish operations
performed on timers and the internal time accounting, but at this point
it does not appear much important.
This splits the hathreads.h file into types+macros and functions. Given
that most users of this file used to include it only to get the definition
of THREAD_LOCAL and MAXTHREADS, the bare minimum was placed into thread-t.h
(i.e. types and macros).
All the thread management was left to haproxy/thread.h. It's worth noting
the drop of the trailing "s" in the name, to remove the permanent confusion
that arises between this one and the system implementation (no "s") and the
makefile's option (no "s").
For consistency, src/hathreads.c was also renamed thread.c.
A number of files were updated to only include thread-t which is the one
they really needed.
Some future improvements are possible like replacing empty inlined
functions with macros for the thread-less case, as building at -O0 disables
inlining and causes these ones to be emitted. But this really is cosmetic.
All files that were including one of the following include files have
been updated to only include haproxy/api.h or haproxy/api-t.h once instead:
- common/config.h
- common/compat.h
- common/compiler.h
- common/defaults.h
- common/initcall.h
- common/tools.h
The choice is simple: if the file only requires type definitions, it includes
api-t.h, otherwise it includes the full api.h.
In addition, in these files, explicit includes for inttypes.h and limits.h
were dropped since these are now covered by api.h and api-t.h.
No other change was performed, given that this patch is large and
affects 201 files. At least one (tools.h) was already freestanding and
didn't get the new one added.
This is where other imported components are located. All files which
used to directly include ebtree were touched to update their include
path so that "import/" is now prefixed before the ebtree-related files.
The ebtree.h file was slightly adjusted to read compiler.h from the
common/ subdirectory (this is the only change).
A build issue was encountered when eb32sctree.h is loaded before
eb32tree.h because only the former checks for the latter before
defining type u32. This was addressed by adding the reverse ifdef
in eb32tree.h.
No further cleanup was done yet in order to keep changes minimal.
When an end pointer is passed, instead of complaining that a comma is
missing after a keyword, sample_parse_expr() will silently return the
pointer to the current location into this return pointer so that the
caller can continue its parsing. This will be used by more complex
expressions which embed sample expressions, and may even permit to
embed sample expressions into arguments of other expressions.
There is a very short race in the queues which happens in the following
situation:
- stream A on thread 1 is being processed by a server
- stream B on thread 2 waits in the backend queue for a server
- stream B on thread 2 is fed up with waiting and expires, calls
stream_free() which calls pendconn_free(), which sees the
stream attached
- at the exact same instant, stream A finishes on thread 1, sees
one stream is waiting (B), detaches it and wakes it up
- stream B continues pendconn_free() and calls pendconn_unlink()
- pendconn_unlink() now detaches the node again and performs a
second deletion (harmless since idempotent), and decrements
srv/px->nbpend again
=> the number of connections on the proxy or server may reach -1 if/when
this race occurs.
It is extremely tight as it can only occur during the test on p->leaf_p
though it has been witnessed at least once. The solution consists in
testing leaf_p again once the lock is held to make sure the element was
not removed in the mean time.
This should be backported to 2.0 and 1.9, probably even 1.8.
A problem involving server slowstart was reported by @max2k1 in issue #197.
The problem is that pendconn_grab_from_px() takes the proxy lock while
already under the server's lock while process_srv_queue() first takes the
proxy's lock then the server's lock.
While the latter seems more natural, it is fundamentally incompatible with
mayn other operations performed on servers, namely state change propagation,
where the proxy is only known after the server and cannot be locked around
the servers. Howwever reversing the lock in process_srv_queue() is trivial
and only the few functions related to dynamic cookies need to be adjusted
for this so that the proxy's lock is taken for each server operation. This
is possible because the proxy's server list is built once at boot time and
remains stable. So this is what this patch does.
The comments in the proxy and server structs were updated to mention this
rule that the server's lock may not be taken under the proxy's lock but
may enclose it.
Another approach could consist in using a second lock for the proxy's queue
which would be different from the regular proxy's lock, but given that the
operations above are rare and operate on small servers list, there is no
reason for overdesigning a solution.
This fix was successfully tested with 10000 servers in a backend where
adjusting the dyncookies in loops over the CLI didn't have a measurable
impact on the traffic.
The only workaround without the fix is to disable any occurrence of
"slowstart" on server lines, or to disable threads using "nbthread 1".
This must be backported as far as 1.8.
The old module proto_http does not exist anymore. All code dedicated to the HTTP
analysis is now grouped in the file proto_htx.c. So, to finish the polishing
after removing the legacy HTTP code, proto_htx.{c,h} files have been moved in
http_ana.{c,h} files.
In addition, all HTX analyzers and related functions prefixed with "htx_" have
been renamed to start with "http_" instead.
In pendconn_redistribute() we scan the queue using eb32_next() on the
node we've just deleted, which is wrong since the node is not in the
tree anymore, and it could dereference one node that has already been
released by another thread. Note that we cannot use eb32_first() in the
loop here instead because we need to skip pendconns having SF_FORCE_PRST.
Instead, let's keep a copy of the next node before deleting it.
In addition, the pendconn retrieved there is wrong, it uses &node as
the pointer instead of node, resulting in very quick crashes when the
server list is scanned.
Fortunately this only happens when "option redispatch" is used in
conjunction with "maxconn" on server lines, "cookie" for the stickiness,
and when a server goes down with entries in its queue.
This bug was introduced by commit 0355dabd7 ("MINOR: queue: replace
the linked list with a tree") so the fix must be backported to 1.9.
This commit replaces the explicit pool creation that are made in
constructors with a pool registration. Not only this simplifies the
pools declaration (it can be done on a single line after the head is
declared), but it also removes references to pools from within
constructors. The only remaining create_pool() calls are those
performed in init functions after the config is parsed, so there
is no more user of potentially uninitialized pool now.
It has been the opportunity to remove no less than 12 constructors
and 6 init functions.
This switches explicit calls to various trivial registration methods for
keywords, muxes or protocols from constructors to INITCALL1 at stage
STG_REGISTER. All these calls have in common to consume a single pointer
and return void. Doing this removes 26 constructors. The following calls
were addressed :
- acl_register_keywords
- bind_register_keywords
- cfg_register_keywords
- cli_register_kw
- flt_register_keywords
- http_req_keywords_register
- http_res_keywords_register
- protocol_register
- register_mux_proto
- sample_register_convs
- sample_register_fetches
- srv_register_keywords
- tcp_req_conn_keywords_register
- tcp_req_cont_keywords_register
- tcp_req_sess_keywords_register
- tcp_res_cont_keywords_register
- flt_register_keywords
These ones are mostly called from cfgparse.c for the parsing and do
not depend on the HTTP representation. The functions's prototypes
were moved to proto/http_rules.h, making this file work exactly like
tcp_rules. Ideally we should stop calling these functions directly
from cfgparse and register keywords, but there are a few cases where
that wouldn't work (stats http-request) so it's probably not worth
trying to go this far.
Since commit 3ff577e ("MAJOR: server: make server state changes
synchronous again"), srv_update_status() is called with the server
lock held. It calls (among others) pendconn_redistribute() which used
to take this lock, causing CPU loops by default, or crashes if build
with -DDEBUG_THREAD. Since this function is not called from any other
place anymore, it doesn't require the lock on its own so let's simply
drop it from there.
No backport is needed, this is 1.9-specific.
The priority values are used when connections are queued to determine
which connections should be served first. The lowest priority class is
served first. When multiple requests from the same class are found, the
earliest (according to queue_time + offset) is served first. The queue
offsets can span over roughly 17 minutes after which the offsets will
wrap around. This allows up to 8 minutes spent in the queue with no
reordering.
This adds the set-priority-class and set-priority-offset actions to
http-request and tcp-request content. At this point they are not used
yet, which is the purpose of the next commit, but all the logic to
set and clear the values is there.
We'll need trees to manage the queues by priorities. This change replaces
the list with a tree based on a single key. It's effectively a list but
allows us to get rid of the list management right now.
We store the queue index in the stream and check it on dequeueing to
figure how many entries were processed in between. This way we'll be
able to count the elements that may later be added before ours.
The current name is misleading as it implies a queue size, but the value
instead indicates a position in the queue.
The value is only the queue size at the exact moment the element is enqueued.
Soon we will gain the ability to insert anywhere into the queue, upon which
clarity of the name is more important.
When switching back from a backup to an active server, the backup server
currently continues to drain the proxy's connections, which is a problem
because it's not expected to be able to pick them.
This patch ensures that a backup server will only pick backend connections
if there is no active server and it is the selected backup server or all
backup servers are supposed to be used.
This issue seems to have existed forever, so this fix should be backported
to all stable versions.
Now that we can wake one thread sleeping in the poller, we don't have to
use THREAD_WANT_SYNC any more.
This gives a significant performance boost on highly contended accesses
(servers with maxconn 1), showing a jump from 21k to 31k conn/s on a
test involving 8 threads.
This lock was necessary to manipulate the pendconn element between
concurrent places, but was causing great difficulties in the list walk
by having to iterate over multiple entries instead of being able to
safely pick the first one (in fact the first element was always the
right one but the locking model was hard to prove).
Here since we know we can always rely on the queue's locks, we take
the queue's lock every time we need to modify the element. In practice
it was already the case everywhere except in pendconn_dequeue() which
only works on an element that was already detached. This function had
to be protected against the risk of meeting an incompletely detached
element (which could be unlinked but not yet assigned). By taking the
queue lock around the LIST_ISEMPTY test, it's enough to ensure that a
concurrent thread either didn't begin or had completed the operation.
The true benefit really is in pendconn_process_next_strm() where we
can again safely work with the first element of each queue. This will
significantly simplify next updates to this code.
The new pendconn_queue_lock() and pendconn_queue_unlock() functions are
made to make it more convenient to lock or unlock the pendconn queue
either at the proxy or the server depending on pendconn->srv. This way
it is possible to remove the open-coding of these locks at various places.
These ones have been used in pendconn_unlink() and pendconn_add(), thus
significantly simplifying the logic there.