"log-bufsize" may now be used for a log server (in a log backend) to
configure the bufsize of implicit ring associated to the server (which
defaults to BUFSIZE).
Using "mode log" in a backend section turns the proxy in a log backend
which can be used to log-balance logs between multiple log targets
(udp or tcp servers)
log backends can be used as regular log targets using the log directive
with "backend@be_name" prefix, like so:
| log backend@mybackend local0
A log backend will distribute log messages to servers according to the
log load-balancing algorithm that can be set using the "log-balance"
option from the log backend section. For now, only the roundrobin
algorithm is supported and set by default.
In cli_parse_delete_server(), we take care of checking that the server is
in MAINT and that the cur_sess counter is set to 0, in the hope that no
connection/stream ressources continue to point to the server, else we
refuse to delete it.
As shown in GH #2298, this is not sufficient.
Indeed, when the server option "on-marked-down shutdown-sessions" is not
used, server streams are not purged when srv enters maintenance mode.
As such, there could be remaining streams that point to the server. To
detect this, a secondary check on srv->cur_sess counter was performed in
cli_parse_delete_server(). Unfortunately, there are some code paths that
could lead to cur_sess being decremented, and not resulting in a stream
being actually shutdown. As such, if the delete_server cli is handled
right after cur_sess has been decremented with streams still pointing to
the server, we could face some nasty bugs where stream->srv_conn could
point to garbage memory area, as described in the original github report.
To make the check more reliable prior to deleting the server, we don't
rely exclusively on cur_sess and directly check that the server is not
used in any stream through the srv_has_stream() helper function.
Thanks to @capflam which found out the root cause for the bug and greatly
helped to provide the fix.
This should be backported up to 2.6.
rdr_pfx was not being free during server cleanup, leading to small memory
leak when "redir" argument was used on a server line (HTTP only).
This should be backported to every stable versions.
[For 2.6 and 2.7: the free should be performed in srv_drop() directly.
For older versions: free in deinit() function near the free for the
cookie string]
This reverts commit c618ed5ff4.
The list iterator is broken. As found by Fred, running QUIC single-
threaded shows that only the first connection is accepted because the
accepter relies on the element being initialized once detached (which
is expected and matches what MT_LIST_DELETE_SAFE() used to do before).
However while doing this in the quic_sock code seems to work, doing it
inside the macro show total breakage and the unit test doesn't work
anymore (random crashes). Thus it looks like the fix is not trivial,
let's roll this back for the time it will take to fix the loop.
The new mt_list code supports exponential back-off on conflict, which
is important for use cases where there is contention on a large number
of threads. The API evolved a little bit and required some updates:
- mt_list_for_each_entry_safe() is now in upper case to explicitly
show that it is a macro, and only uses the back element, doesn't
require a secondary pointer for deletes anymore.
- MT_LIST_DELETE_SAFE() doesn't exist anymore, instead one just has
to set the list iterator to NULL so that it is not re-inserted
into the list and the list is spliced there. One must be careful
because it was usually performed before freeing the element. Now
instead the element must be nulled before the continue/break.
- MT_LIST_LOCK_ELT() and MT_LIST_UNLOCK_ELT() have always been
unclear. They were replaced by mt_list_cut_around() and
mt_list_connect_elem() which more explicitly detach the element
and reconnect it into the list.
- MT_LIST_APPEND_LOCKED() was only in haproxy so it was left as-is
in list.h. It may however possibly benefit from being upstreamed.
This required tiny adaptations to event_hdl.c and quic_sock.c. The
test case was updated and the API doc added. Note that in order to
keep include files small, the struct mt_list definition remains in
list-t.h (par of the internal API) and was ifdef'd out in mt_list.h.
A test on QUIC with both quictls 1.1.1 and wolfssl 5.6.3 on ARM64 with
80 threads shows a drastic reduction of CPU usage thanks to this and
the refined memory barriers. Please note that the CPU usage on OpenSSL
3.0.9 is significantly higher due to the excessive use of atomic ops
by openssl, but 3.1 is only slightly above 1.1.1 though:
- before: 35 Gbps, 3.5 Mpps, 7800% CPU
- after: 41 Gbps, 4.2 Mpps, 2900% CPU
Backend idle connections are purged on a recurring occurence during the
process lifetime. An estimated number of needed connections is
calculated and the excess is removed periodically.
Before this patch, purge was done directly using the idle then the safe
connection tree of a server instance. This has a major drawback to take
no account of a specific ordre and it may removed functional connections
while leaving ones which will fail on the next reuse.
The problem can be worse when using criteria to differentiate idle
connections such as the SSL SNI. In this case, purge may remove
connections with a high rate of reusing while leaving connections with
criteria never matched once, thus reducing drastically the reuse rate.
To improve this, introduce an alternative storage for idle connection
used in parallel of the idle/safe trees. Now, each connection inserted
in one of this tree is also inserted in the new list at
`srv_per_thread.idle_conn_list`. This guarantees that recently used
connection is present at the end of the list.
During the purge, use this list instead of idle/safe trees. Remove first
connection in front of the list which were not reused recently. This
will ensure that connection that are frequently reused are not purged
and should increase the reuse rate, particularily if distinct idle
connection criterias are in used.
Define a new function _srv_add_idle(). This is a simple wrapper to
insert a connection in the server idle tree. This is reserved for simple
usage and require to idle_conns lock. In most cases,
srv_add_to_idle_list() should be used.
This patch does not have any functional change. However, it will help
with the next patch as idle connection will be always inserted in a list
as secondary storage along with idle/safe trees.
Small change of API for conn_delete_from_tree(). Now the connection
instance is taken as argument instead of its inner node.
No functional change introduced with this commit. This simplifies
slightly invocation of conn_delete_from_tree(). The most useful changes
is that this function will be extended in the next patch to be able to
remove the connection from its new idle list at the same time as in its
idle tree.
Implement reverse-connect server. This server type cannot instantiate
its own connection on transfer. Instead, it can only reuse connection
from its idle pool. These connections will be populated using the future
'tcp-request session attach-srv' rule.
A reverse-connect has no address. Instead, it uses a new custom server
notation with '@' character prefix. For the moment, only '@reverse' is
defined. An extra check is implemented to ensure server is used in a
HTTP proxy.
A connection contains extra elements which are only used for the backend
side. Regroup their allocation and deallocation in two new functions
named conn_backend_init() and conn_backend_deinit().
No functional change is introduced with this commit. The new functions
are reused in place of manual alloc/dealloc in conn_new() / conn_free().
This patch will be useful for reverse connect support with connection
conversion from backend to frontend side and vice-versa.
Several CLI handlers use a server argument specified with the format
'<backend>/<server>'. The parsing of this arguement is done in two
steps, first splitting the string with '/' delimiter and then use
get_backend_server() to retrieve the server instance.
Refactor this code sections with the following changes :
* splitting is reimplented using ist API
* get_backend_server() is removed. Instead use the already existing
proxy_be_by_name() then server_find_by_name() which contains
duplicated code with the now removed function.
No functional change occurs with this commit. However, it will be useful
to add new configuration options reusing the same '<backend>/<server>'
for reverse connect.
During startup, when the "none" method for "init-addr" is evaluated, a
warning is emitted if a resolution failure was previously encountered. The
documentation of the "none" method states it should be used to ignore server
resolution failures and let the server starts in DOWN state. However,
because a warning may be emitted, it is not possible to start HAProxy with
"zero-warning" option.
The same is true when "-dr" command line option is used. It is counter
intuitive and, in a way, this contradict what is specified in the
documentation.
So instead, a notice message is now emitted. At the end, if "-dr" command
line option is used or if "none" method is explicitly used, it means the
admin is agree with server resolution failures. There is no reason to emit a
warning.
This patch should fix the issue #2176. It could be backported to all stable
versions but backporting to 2.8 is probably enough for now.
srv->rid default value is set in _srv_parse_init() after the server is
succesfully allocated using new_server().
This is wrong because new_server() can be used independently so rid value
assignment would be skipped in this case.
Hopefully new_server() allocates server data using calloc() so srv->rid
is already set to 0 in practise. But if calloc() is replaced by malloc()
or other memory allocating function that doesn't zero-initialize srv
members, this could lead to rid being uninitialized in some cases.
This should be backported in 2.8 with 61e3894dfe ("MINOR: server: add
srv->rid (revision id) value")
When support for 'namespace' keyword was added for the 'default-server'
directive in 22f41a2 ("MINOR: server: Make 'default-server' support
'namespace' keyword."), we forgot to copy the attribute from the parent
to the newly created server.
This resulted in the 'namespace' keyword being parsed without errors when
used from a 'default-server' directive, but in practise the option was
simply ignored.
There's no need to duplicate the netns struct because it is stored in
a shared list, so copying the pointer does the job.
This patch partially fixes GH #2038 and should be backported to all
stable versions.
proxy default-server is a specific type of server that is not allocated
using new_server(): it is directly stored within the parent proxy
structure. However, since it may contain some default config options that
may be inherited by regular servers, it is also subject to dynamic members
(strings, structures..) that needs to be deallocated when the parent proxy
is cleaned up.
Unfortunately, srv_drop() may not be used directly from p->defsrv since
this function is meant to be used on regular servers only (those created
using new_server()).
To circumvent this, we're splitting srv_drop() to make a new function
called srv_free_params() that takes care of the member cleaning which
originally takes place in srv_drop(). This function is exposed through
server.h, so it may be called from outside server.c.
Thanks to this, calling srv_free_params(&p->defsrv) from free_proxy()
prevents any memory leaks due to dynamic parameters allocated when
parsing a default-server line from a proxy section.
This partially fixes GH #2173 and may be backported to 2.8.
[While it could also be relevant for other stable versions, the patch
won't apply due to architectural changes / name changes between 2.4 => 2.6
and then 2.6 => 2.8. Considering this is a minor fix that only makes
memory analyzers happy during deinit paths (at least for <= 2.8), it might
not be worth the trouble to backport them any further?]
There are several misuses in peers sections that are not detected during the
configuration parsing and that could lead to undefined behaviors or crashes.
First, only one listener is expected for a peers section. If several bind
lines or local peer definitions are used, an error is triggered. However, if
multiple addresses are set on the same bind line, there is no error while
only the last listener is properly configured. On the 2.8, there is no crash
but side effects are hardly predictable. On older version, HAProxy crashes
if an unconfigured listener is used.
Then, there is no check on remote peers name. It is unexpected to have same
name for several remote peers. There is now a test, performed during the
post-parsing, to verify all remote peer names are unique.
Finally, server parsing options for the peers sections are changed to be
sure a port is always defined, and not a port range or a port offset.
This patch fixes the issue #2066. It could be backported to all stable
versions.
When server is transitionning from UP to DOWN, a log message is generated.
e.g.: "Server backend_name/server_name is DOWN")
However since f71e064 ("MEDIUM: server: split srv_update_status() in two
functions"), the allocated buffer tmptrash which is used to prepare the
log message is not freed after it has been used, resulting in a small
memory leak each time a server goes DOWN because of an operational
change.
This is a 2.8 specific bug, no backport needed unless the above commit
gets backported.
Within srv_update_status subfunctions _op() and _adm(), each time tmptrash
is freed, we assign it to NULL to ensure it will not be reused.
However, within those functions it is not very useful given that tmptrash
is never checked against NULL except upon allocation through
alloc_trash_chunk(), which happens everytime a new log message is
generated, sent, and then freed right away, so there are no code paths
that could lead to tmptrash being checked for reuse (tmptrash is
systematically overwritten since all log messages are independant from
each other).
This was raised by coverity, see GH #2162.
Remaining in drain mode after removing one of server admins flags leads
to this message being generated:
"Server name/backend is leaving forced drain but remains in drain mode."
However this is not necessarily true: the server might just be leaving
MAINT with the IDRAIN flag set, so the report is incorrect in this case.
(FDRAIN was not set so it cannot be cleared)
To prevent confusion around this message and to comply with the code
comment above it: we remove the "leaving forced drain" precision to
make the report suitable for multiple transitions.
Adding a new event type: SERVER_CHECK.
This event is published when a server's check state ought to be reported.
(check status change or check result)
SERVER_CHECK event is provided as a server event with additional data
carrying relevant check's context such as check's result and health.
Adding a new SERVER event in the event_hdl API.
SERVER_ADMIN is implemented as an advanced server event.
It is published each time the administrative state changes.
(when s->cur_admin changes)
SERVER_ADMIN data is an event_hdl_cb_data_server_admin struct that
provides additional info related to the admin state change, but can
be casted as a regular event_hdl_cb_data_server struct if additional
info is not needed.
Reuse cb_data from STATE event to publish UP and DOWN events.
This saves some CPU time since the event is only constructed
once to publish STATE, STATE+UP or STATE+DOWN depending on the
state change.
Adding a new SERVER event in the event_hdl API.
SERVER_STATE is implemented as an advanced server event.
It is published each time the server's effective state changes.
(when s->cur_state changes)
SERVER_STATE data is an event_hdl_cb_data_server_state struct that
provides additional info related to the server state change, but can
be casted as a regular event_hdl_cb_data_server struct if additional
info is not needed.
add a macro helper to help publish server events to global and
per-server subscription list at once since all server events
support both subscription modes.
This puts an end to the occasional confusion between the "now" date
that is internal, monotonic and not synchronized with the system's
date, and "date" which is the system's date and not necessarily
monotonic. Variable "now" was removed and replaced with a 64-bit
integer "now_ns" which is a counter of nanoseconds. It wraps every
585 years, so if all goes well (i.e. if humanity does not need
haproxy anymore in 500 years), it will just never wrap. This implies
that now_ns is never nul and that the zero value can reliably be used
as "not set yet" for a timestamp if needed. This will also simplify
date checks where it becomes possible again to do "date1<date2".
All occurrences of "tv_to_ns(&now)" were simply replaced by "now_ns".
Due to the intricacies between now, global_now and now_offset, all 3
had to be turned to nanoseconds at once. It's not a problem since all
of them were solely used in 3 functions in clock.c, but they make the
patch look bigger than it really is.
The clock_update_local_date() and clock_update_global_date() functions
are now much simpler as there's no need anymore to perform conversions
nor to round the timeval up or down.
The wrapping continues to happen by presetting the internal offset in
the short future so that the 32-bit now_ms continues to wrap 20 seconds
after boot.
The start_time used to calculate uptime can still be turned to
nanoseconds now. One interrogation concerns global_now_ms which is used
only for the freq counters. It's unclear whether there's more value in
using two variables that need to be synchronized sequentially like today
or to just use global_now_ns divided by 1 million. Both approaches will
work equally well on modern systems, the difference might come from
smaller ones. Better not change anyhting for now.
One benefit of the new approach is that we now have an internal date
with a resolution of the nanosecond and the precision of the microsecond,
which can be useful to extend some measurements given that timestamps
also have this resolution.
Instead we're using ns_to_sec(tv_to_ns(&now)) which allows the tv_sec
part to disappear. At this point, "now" is only used as a timeval in
clock.c where it is updated.
Adding the possibility to publish an event using a struct wrapper
around existing SERVER events to provide additional contextual info.
Using the specific struct wrapper is not required: it is supported
to cast event data as a regular server event data struct so
that we don't break the existing API.
However, casting event data with a more explicit data type allows
to fetch event-only relevant hints.
Considering that srv_update_status() is now synchronous again since
3ff577e1 ("MAJOR: server: make server state changes synchronous again"),
and that we can easily identify if the update is from an operational
or administrative context thanks to "MINOR: server: pass adm and op cause
to srv_update_status()".
And given that administrative and operational updates cannot be cumulated
(since srv_update_status() is called synchronously and independently for
admin updates and state/operational updates, and the function directly
consumes the changes).
We split srv_update_status() in 2 distinct parts:
Either <type> is 0, meaning the update is an operational update which
is handled by directly looking at cur_state and next_state to apply the
proper transition.
Also, the check to prevent operational state from being applied
if MAINT admin flag is set is no longer needed given that the calling
functions already ensure this (ie: srv_set_{running,stopping,stopped)
Or <type> is 1, meaning the update is an administrative update, where
cur_admin and next_admin are evaluated to apply the proper transition and
deduct the resulting server state (next_state is updated implicitly).
Once this is done, both operations share a common code path in
srv_update_status() to update proxy and servers stats if required.
Thanks to this change, the function's behavior is much more predictable,
it is not an all-in-one function anymore. Either we apply an operational
change, else it is an administrative change. That's it, we cannot mix
the 2 since both code paths are now properly separated.
Operational and administrative state change causes are not propagated
through srv_update_status(), instead they are directly consumed within
the function to provide additional info during the call when required.
Thus, there is no valid reason for keeping adm and op causes within
server struct. We are wasting space and keeping uneeded complexity.
We now exlicitly pass change type (operational or administrative) and
associated cause to srv_update_status() so that no extra storage is
needed since those values are only relevant from srv_update_status().
Fixing function comments for the server state changing function since they
still refer to asynchonous propagation of server state which is no longer
in play.
Moreover, there were some mixups between running/stopping.
This one is greatly inspired by "MINOR: server: change adm_st_chg_cause storage type".
While looking at current srv_op_st_chg_cause usage, it was clear that
the struct needed some cleanup since some leftovers from asynchronous server
state change updates were left behind and resulted in some useless code
duplication, and making the whole thing harder to maintain.
Two observations were made:
- by tracking down srv_set_{running, stopped, stopping} usage,
we can see that the <reason> argument is always a fixed statically
allocated string.
- check-related state change context (duration, status, code...) is
not used anymore since srv_append_status() directly extracts the
values from the server->check. This is pure legacy from when
the state changes were applied asynchronously.
To prevent code duplication, useless string copies and make the reason/cause
more exportable, we store it as an enum now, and we provide
srv_op_st_chg_cause() function to fetch the related description string.
HEALTH and AGENT causes (check related) are now explicitly identified to
make consumers like srv_append_op_chg_cause() able to fetch checks info
from the server itself if they need to.
srv_append_status() has become a swiss-knife function over time.
It is used from server code and also from checks code, with various
inputs and distincts code paths, making it very hard to guess the
actual behavior of the function (resulting string output).
To simplify the logic behind it, we're dividing it in multiple contextual
functions that take simple inputs and do explicit things, making them
more predictable and easier to maintain.
Even though it doesn't look like it at first glance, this is more like
a cleanup than an actual code improvement:
Given that srv->adm_st_chg_cause has been used to exclusively store
static strings ever since it was implemented, we make the choice to
store it as an enum instead of a fixed-size string within server
struct.
This will allow to save some space in server struct, and will make
it more easily exportable (ie: event handlers) because of the
reduced memory footprint during handling and the ability to later get
the corresponding human-readable message when it's explicitly needed.
Now that we have a generic srv_lb_propagate(s) function, let's
use it each time we explicitly wan't to set the status down as
well.
Indeed, it is tricky to try to handle "down" case explicitly,
instead we use srv_lb_propagate() which will call the proper
function that will handle the new server state.
This will allow some code cleanup and will prevent any logic
error.
This commit depends on:
- "MINOR: server: propagate server state change to lb through single function"
Use a dedicated helper function to propagate server state change to
lb algorithms, since it is performed at multiple places within
srv_update_status() function.
Based on "BUG/MINOR: server: don't miss server stats update on server
state transitions", we're also taking advantage of the new centralized
logic to update down_trans server counter directly from there instead
of multiple places.
s->last_change and s->down_time updates were manually updated for each
effective server state change within srv_update_status().
This is rather error-prone, and as a result there were still some state
transitions that were not handled properly since at least 1.8.
ie:
- when transitionning from DRAIN to READY: downtime was updated
(which is wrong since a server in DRAIN state should not be
considered as DOWN)
- when transitionning from MAINT to READY: downtime was not updated
(this can be easily seen in the html stats page)
To fix these all at once, and prevent similar bugs from being introduced,
we centralize the server last_change and down_time stats logic at the end
of srv_update_status():
If the server state changed during the call, then it means that
last_change must be updated, with a special case when changing from
STOPPED state which means the server was previously DOWN and thus
downtime should be updated.
This patch depends on:
- "MINOR: server: explicitly commit state change in srv_update_status()"
This could be backported to every stable versions.
backend "down" stats logic has been duplicated multiple times in
srv_update_status(), resulting in the logic now being error-prone.
For example, the following bugfix was needed to compensate for a
copy-paste introduced bug: d332f139
("BUG/MINOR: server: update last_change on maint->ready transitions too")
While the above patch works great, we actually forgot to update the
proxy downtime like it is done for other down->up transitions...
This is simply illustrating that the current design is error-prone,
it is very easy to miss something in this area.
To properly update the proxy downtime stats on the maint->ready transition,
to cleanup srv_update_status() and to prevent similar bugs from being
introduced in the future, proxy/backend stats update are now automatically
performed at the end of the server state change if needed.
Thus we can remove existing updates that were performed at various places
within the function, this simplifies things a bit.
This patch depends on:
- "MINOR: server: explicitly commit state change in srv_update_status()"
This could be backported to all stable versions.
Backport notes:
2.2:
Replace
struct task *srv_cleanup_toremove_conns(struct task *task, void *context, unsigned int state)
by
struct task *srv_cleanup_toremove_connections(struct task *task, void *context, unsigned short state)
As shown in 8f29829 ("BUG/MEDIUM: checks: a down server going to
maint remains definitely stucked on down state."), state changes
that don't result in explicit lb state change, require us to perform
an explicit server state commit to make sure the next state is
applied before returning from the function.
This is the case for server state changes that don't trigger lb logic
and only perform some logging.
This is quite error prone, we could easily forget a state change
combination that could result in next_state, next_admin or next_eweight
not being applied. (cur_state, cur_admin and cur_eweight would be left
with unexpected values)
To fix this, we explicitly call srv_lb_commit_status() at the end
of srv_update_status() to enforce the new values, even if they were
already applied. (when a state changes requires lb state update
an implicit commit is already performed)
Applying the state change multiple times is safe (since the next value
always points to the current value).
Backport notes:
2.2:
Replace
struct task *srv_cleanup_toremove_conns(struct task *task, void *context, unsigned int state)
by
struct task *srv_cleanup_toremove_connections(struct task *task, void *context, unsigned short state)
Report message for tracking servers completely leaving drain is wrong:
The check for "leaving drain .. via" never evaluates because the
condition !(s->next_admin & SRV_ADMF_FDRAIN) is always true in the
current block which is guarded by !(s->next_admin & SRV_ADMF_DRAIN).
For tracking servers that leave inherited drain mode, this results in the
following message being emitted:
"Server x/b is UP (leaving forced drain)"
Instead of:
"Server x/b is UP (leaving drain) via x/a"
To this fix: we check if FDRAIN is currently set, else it means that the
drain status is inherited from the tracked server (IDRAIN)
This regression was introduced with 64cc49cf ("MAJOR: servers: propagate server
status changes asynchronously."), thus it may be backported to every stable
versions.
This commit does nothing that ought to be mentioned, except that
it adds missing comments and slighty moves some function calls
out of "sensitive" code in preparation of some server code refactors.
Expose proxy_uuid variable in event_hdl_cb_data_server struct to
overcome proxy_name fixed length limitation.
proxy_uuid may be used by the handler to perform proxy lookups.
This should be preferred over lookups relying proxy_name.
(proxy_name is suitable for printing / logging purposes but not for
ID lookups since it has a maximum fixed length)
srv_update_status() function comment says that the function "is designed
to be called asynchronously".
While this used to be true back then with 64cc49cf
("MAJOR: servers: propagate server status changes asynchronously.")
This is not true anymore since 3ff577e ("MAJOR: server: make server state changes
synchronous again")
Fixing the comment in order to better reflect current behavior.
These ones are genenerally harmless on modern compilers because the
compiler checks them. While gcc optimizes them away without even
referencing strcpy(), clang prefers to call strcpy(). Nevertheless they
prevent from enabling stricter checks so better remove them altogether.
They were all replaced by strlcpy2() and the size of the destination
which is always known there.
srv_drop() function is reponsible for freeing the server when the
refcount reaches 0.
There is one exception: when global.mode has the MODE_STOPPING flag set,
srv_drop() will ignore the refcount and free the server on first
invocation.
This logic has been implemented with 13f2e2ce ("BUG/MINOR: server: do
not use refcount in free_server in stopping mode") and back then doing
so was not a problem since dynamic server API was just implemented and
srv_take() and srv_drop() were not widely used.
Now that dynamic server API is starting to get more popular we cannot
afford to keep the current logic: some modules or lua scripts may hold
references to existing server and also do their cleanup in deinit phases
In this kind of situation, it would be easy to trigger double-frees
since every call to srv_drop() on a specific server will try to free it.
To fix this, we take a different approach and try to fix the issue at
the source: we now properly drop server references involved with
checks/agent_checks in deinit_srv_check() and deinit_srv_agent_check().
While this could theorically be backported up to 2.6, it is not very
relevant for now since srv_drop() usage in older versions is very
limited and we're only starting to face the issue in mid 2.8
developments. (ie: lua core updates)
In srv_drop(), we only call the ssl->destroy_srv() method on
specific conditions.
But this has two downsides:
First, destroy_srv() is reponsible for freeing data that may have been
allocated in prepare_srv(), but not exclusively: it also frees
ssl-related parameters allocated when parsing a server entry, such as
ca-file for instance.
So this is quite error-prone, we could easily miss a condition where
some data needs to be deallocated using destroy_srv() even if
prepare_srv() was not used (since prepare_srv() is also conditional),
thus resulting in memory leaks.
Moreover, depending on srv->proxy to guard the check is probably not
a good idea here, since srv_drop() could be called in late de-init paths
in which related proxy could be freed already. srv_drop() should only
take care of freeing local server data without external logic.
Thankfully, destroy_srv() function performs the necessary checks to
ensure that a systematic call to the function won't result in invalid
reads or double frees.
No backport needed.
We recently discovered a bug which affects dynamic server deletion:
When a server is deleted, it is removed from the "visible" server list.
But as we've seen in previous commit
("MINOR: server: add SRV_F_DELETED flag"), it can still be accessed by
someone who keeps a reference on it (waiting for the final srv_drop()).
Throughout this transient state, server ptr is still valid (may be
dereferenced) and the flag SRV_F_DELETED is set.
However, as the server is not part of server list anymore, we have
an issue: srv->next pointer won't be updated anymore as the only place
where we perform such update is in cli_parse_delete_server() by
iterating over the "visible" server list.
Because of this, we cannot guarantee that a server with the
SRV_F_DELETED flag has a valid 'next' ptr: 'next' could be pointing
to a fully removed (already freed) server.
This problem can be easily demonstrated with server dumping in
the stats:
server list dumping is performed in stats_dump_proxy_to_buffer()
The function can be interrupted and resumed later by design.
ie: output buffer is full: partial dump and finish the dump after
the flush
This is implemented by calling srv_take() on the server being dumped,
and only releasing it when we're done with it using srv_drop().
(drop can be delayed after function resume if buffer is full)
While the function design seems OK, it works with the assumption that
srv->next will still be valid after the function resumes, which is
not true. (especially if multiple servers are being removed in between
the 2 dumping attempts)
In practice, this did not cause any crash yet (at least this was not
reported so far), because server dumping is so fast that it is very
unlikely that multiple server deletions make their way between 2
dumping attempts in most setups. But still, this is a problem that we
need to address because some upcoming work might depend on this
assumption as well and for the moment it is not safe at all.
========================================================================
Here is a quick reproducer:
With this patch, we're creating a large deletion window of 3s as soon
as we reach a server named "t2" while iterating over the list.
This will give us plenty of time to perform multiple deletions before
the function is resumed.
| diff --git a/src/stats.c b/src/stats.c
| index 84a4f9b6e..15e49b4cd 100644
| --- a/src/stats.c
| +++ b/src/stats.c
| @@ -3189,11 +3189,24 @@ int stats_dump_proxy_to_buffer(struct stconn *sc, struct htx *htx,
| * Temporarily increment its refcount to prevent its
| * anticipated cleaning. Call free_server to release it.
| */
| + struct server *orig = ctx->obj2;
| for (; ctx->obj2 != NULL;
| ctx->obj2 = srv_drop(sv)) {
|
| sv = ctx->obj2;
| + printf("sv = %s\n", sv->id);
| srv_take(sv);
| + if (!strcmp("t2", sv->id) && orig == px->srv) {
| + printf("deletion window: 3s\n");
| + thread_idle_now();
| + thread_harmless_now();
| + sleep(3);
| + thread_harmless_end();
| +
| + thread_idle_end();
| +
| + goto full; /* simulate full buffer */
| + }
|
| if (htx) {
| if (htx_almost_full(htx))
| @@ -4353,6 +4366,7 @@ static void http_stats_io_handler(struct appctx *appctx)
| struct channel *res = sc_ic(sc);
| struct htx *req_htx, *res_htx;
|
| + printf("http dump\n");
| /* only proxy stats are available via http */
| ctx->domain = STATS_DOMAIN_PROXY;
|
Ok, we're ready, now we start haproxy with the following conf:
global
stats socket /tmp/ha.sock mode 660 level admin expose-fd listeners thread 1-1
nbthread 2
frontend stats
mode http
bind *:8081 thread 2-2
stats enable
stats uri /
backend farm
server t1 127.0.0.1:1899 disabled
server t2 127.0.0.1:18999 disabled
server t3 127.0.0.1:18998 disabled
server t4 127.0.0.1:18997 disabled
And finally, we execute the following script:
curl localhost:8081/stats&
sleep .2
echo "del server farm/t2" | nc -U /tmp/ha.sock
echo "del server farm/t3" | nc -U /tmp/ha.sock
This should be enough to reveal the issue, I easily manage to
consistently crash haproxy with the following reproducer:
http dump
sv = t1
http dump
sv = t1
sv = t2
deletion window = 3s
[NOTICE] (2940566) : Server deleted.
[NOTICE] (2940566) : Server deleted.
http dump
sv = t2
sv = �����U
[1] 2940566 segmentation fault (core dumped) ./haproxy -f ttt.conf
========================================================================
To fix this, we add prev_deleted mt_list in server struct.
For a given "visible" server, this list will contain the pending
"deleted" servers references that point to it using their 'next' ptr.
This way, whenever this "visible" server is going to be deleted via
cli_parse_delete_server() it will check for servers in its
'prev_deleted' list and update their 'next' pointer so that they no
longer point to it, and then it will push them in its
'next->prev_deleted' list to transfer the update responsibility to the
next 'visible' server (if next != NULL).
Then, following the same logic, the server about to be removed in
cli_parse_delete_server() will push itself as well into its
'next->prev_deleted' list (if next != NULL) so that it may still use its
'next' ptr for the time it is in transient removal state.
In srv_drop(), right before the server is finally freed, we make sure
to remove it from the 'next->prev_deleted' list so that 'next' won't
try to perform the pointers update for this server anymore.
This has to be done atomically to prevent 'next' srv from accessing a
purged server.
As a result:
for a valid server, either deleted or not, 'next' ptr will always
point to a non deleted (ie: visible) server.
With the proposed fix, and several removal combinations (including
unordered cli_parse_delete_server() and srv_drop() calls), I cannot
reproduce the crash anymore.
Example tricky removal sequence that is now properly handled:
sv list: t1,t2,t3,t4,t5,t6
ops:
take(t2)
del(t4)
del(t3)
del(t5)
drop(t3)
drop(t4)
drop(t5)
drop(t2)
Set the SRV_F_DELETED flag when server is removed from the cli.
When removing a server from the cli (in cli_parse_delete_server()),
we update the "visible" server list so that the removed server is no
longer part of the list.
However, despite the server being removed from "visible" server list,
one could still access the server data from a valid ptr (ie: srv_take())
Deleted flag helps detecting when a server is in transient removal
state: that is, removed from the list, thus not visible but not yet
purged from memory.