If an error occured during the CLI 'add server' handler, the newly
created server must be removed from the proxy list if already inserted.
Currently, this can happen on the extremely rare error during server id
generation if there is no id left.
The removal operation is not thread-safe, it must be conducted before
releasing the thread isolation.
This can be backported up to 2.4. Please note that dynamic server track
is not implemented in 2.4, so the release_server_track invocation must
be removed for the backport to prevent a compilation error.
In 2.4, runtime server deletion was brought by commit e558043e1 ("MINOR:
server: implement delete server cli command"). A comment remained in the
code about a theoretical race between the thread_isolate() call and another
thread being in the process of allocating memory before accessing the
server via a reference that was grabbed before the memory allocation,
since the thread_harmless_now()/thread_harmless_end() pair around mmap()
may have the effect of allowing cli_parse_delete_server() to proceed.
Now that the full thread isolation is available, let's update the code
to rely on this. Now it is guaranteed that competing threads will either
be in the poller or queued in front of thread_isolate_full().
This may be backported to 2.4 if any report of breakage suggests the bug
really exists, in which case the two following patches will also be
needed:
MINOR: threads: make thread_release() not wait for other ones to complete
MEDIUM: threads: add a stronger thread_isolate_full() call
If an error occurs during a dynamic server creation with tracking, it
must be removed from the tracked list. This operation is not thread-safe
and thus must be conducted under the thread isolation.
Track support for dynamic servers has been introduced in this release.
This does not need to be backported.
Allow the usage of the 'track' keyword for dynamic servers. On server
deletion, the server is properly removed from the tracking chain to
prevents NULL pointer dereferencing.
Prevents the use of the "track" keyword for a dynamic server. This
simplifies the deletion of a dynamic server, without having to worry
about servers which might tracked it.
A BUG_ON is present in the dynamic server delete function to validate
this assertion.
When a default-server line specified a client certificate to use, the
frontend would not take it into account and create an empty SSL context,
which would raise an error on the backend side ("peer did not return a
certificate").
This bug was introduced by d817dc733eacfd7cf5bb0bbc6128f44644db078e in
which the SSL contexts are created earlier than before (during the
default-server line parsing) without setting it in the corresponding
server structures. It then made the server create an empty SSL context
in ssl_sock_prepare_srv_ctx because it thought it needed one.
It was raised on redmine, in Bug #3906.
It can be backported to 2.4.
The commit 3406766d5 ("MEDIUM: resolvers: add a ref between servers and srv
request or used SRV record") introduced a regression. The first server of a
template based on SRV record is no longer resolved. The same bug exists for
a normal server based on a SRV record.
In fact, the server used during parsing (used as reference when a
server-template line is parsed) is never attached to the corresponding srvrq
object. Thus with following lines, no resolution is performed because
"srvrq->attached_servers" is empty:
server-template test 1 _http.domain.tld resolvers dns ...
server test1 _http.domain.tld resolvers dns ...
This patch should fix the issue #1295 (but not confirmed yet it is the same
bug). It must be backported everywhere the above commit is.
If resolv_get_ip_from_response() returns an error (or an unexpected return
value), the server is set to RMAINT status. However, its address must also
be reset. Otherwise, it is still reported by the cli on "show servers state"
commands. This may be confusing. Note that it is a theorical patch because
this code path does not exist. Thus it is not tagged as a BUG.
This patch may be backported as far as 2.0.
For A/AAAA resolution, if no ip is found for a server in the response, the
server is set to RMAINT status. However, its address must also be
reset. Otherwise, it is still reported by the cli on "show servers state"
commands. This may be confusing.
This patch may be backported as far as 2.0.
A queue is specific to a server or a proxy, so we don't need to place
this distinction inside all pendconns, it can be in the queue itself.
This commit adds the relevant fields "px" and "sv" into the struct
queue, and initializes them accordingly.
This basically undoes the API changes that were performed by commit
0274286dd ("BUG/MAJOR: server: fix deadlock when changing maxconn via
agent-check") to address the deadlock issue: since process_srv_queue()
doesn't use the server lock anymore, it doesn't need the "server_locked"
argument, so let's get rid of it before it gets used again.
Till now whenever a server or proxy's queue was touched, this server
or proxy's lock was taken. Not only this requires distinct code paths,
but it also causes unnecessary contention with other uses of these locks.
This patch adds a lock inside the "queue" structure that will be used
the same way by the server and the proxy queuing code. The server used
to use a spinlock and the proxy an rwlock, though the queue only used
it for locked writes. This new version uses a spinlock since we don't
need the read lock part here. Tests have not shown any benefit nor cost
in using this one versus the rwlock so we could change later if needed.
The lower contention on the locks increases the performance from 362k
to 374k req/s on 16 threads with 20 servers and leastconn. The gain
with roundrobin even increases by 9%.
This is tagged medium because the lock is changed, but no other part of
the code touches the queues, with nor without locking, so this should
remain invisible.
This reverts commit fcb8bf8650ec6b5614d1b88db54f1200ebd96cbd.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This reverts commit c83e45e9b001591633188a480a896c935d3c9625.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This basically undoes the API changes that were performed by commit
0274286dd ("BUG/MAJOR: server: fix deadlock when changing maxconn via
agent-check") to address the deadlock issue: since process_srv_queue()
doesn't use the server lock anymore, it doesn't need the "server_locked"
argument, so let's get rid of it before it gets used again.
Till now whenever a server or proxy's queue was touched, this server
or proxy's lock was taken. Not only this requires distinct code paths,
but it also causes unnecessary contention with other uses of these locks.
This patch adds a lock inside the "queue" structure that will be used
the same way by the server and the proxy queuing code. The server used
to use a spinlock and the proxy an rwlock, though the queue only used
it for locked writes. This new version uses a spinlock since we don't
need the read lock part here. Tests have not shown any benefit nor cost
in using this one versus the rwlock so we could change later if needed.
The lower contention on the locks increases the performance from 491k
to 507k req/s on 16 threads with 20 servers and leastconn. The gain
with roundrobin even increases by 6%.
The performance profile changes from this:
13.03% haproxy [.] fwlc_srv_reposition
8.08% haproxy [.] fwlc_get_next_server
3.62% haproxy [.] process_srv_queue
1.78% haproxy [.] pendconn_dequeue
1.74% haproxy [.] pendconn_add
to this:
11.95% haproxy [.] fwlc_srv_reposition
7.57% haproxy [.] fwlc_get_next_server
3.51% haproxy [.] process_srv_queue
1.74% haproxy [.] pendconn_dequeue
1.70% haproxy [.] pendconn_add
At this point the differences are mostly measurement noise.
This is tagged medium because the lock is changed, but no other part of
the code touches the queues, with nor without locking, so this should
remain invisible.
The server_parse_maxconn_change_request locks the server lock. However,
this function can be called via agent-checks or lua code which already
lock it. This bug has been introduced by the following commit :
commit 79a88ba3d09f7e2b73ae27cb5d24cc087a548fa6
BUG/MAJOR: server: prevent deadlock when using 'set maxconn server'
This commit tried to fix another deadlock with can occur because
previoulsy server_parse_maxconn_change_request requires the server lock
to be held. However, it may call internally process_srv_queue which also
locks the server lock. The locking policy has thus been updated. The fix
is functional for the CLI 'set maxconn' but fails to address the
agent-check / lua counterparts.
This new issue is fixed in two steps :
- changes from the above commit have been reverted. This means that
server_parse_maxconn_change_request must again be called with the
server lock.
- to counter the deadlock fixed by the above commit, process_srv_queue
now takes an argument to render the server locking optional if the
caller already held it. This is only used by
server_parse_maxconn_change_request.
The above commit was subject to backport up to 1.8. Thus this commit
must be backported in every release where it is already present.
Activate the 'ssl' keyword for dynamic servers. This is the final step
to have ssl dynamic servers feature implemented. If activated,
ssl_sock_prepare_srv_ctx will be called at the end of the 'add server'
CLI handler.
At the same time, update the management doc to list all ssl keywords
implemented for dynamic servers.
'set server ssl' uses ssl parameters from default-server. As dynamic
servers does not reuse any default-server parameters, this command has
no sense for them.
The commit c7b391aed ("BUG/MEDIUM: server/cli: Fix ABBA deadlock when fqdn
is set from the CLI") introduced 2 bugs. The first one is a typo on the
server's lock label (s/SERVER_UNLOCK/SERVER_LOCK/). The second one is about
the server's lock itself. It must be acquired to execute the "agent-send"
subcommand.
The patch above is marked to be backported as far as 1.8. Thus, this one
must also backported as far 1.8.
BUG/MINOR: server/cli: Don't forget to lock server on agent-send subcommand
When a server relies on a SRV resolution, a task is created to clean it up
(fqdn/port and address) when the SRV resolution is considered as outdated
(based on the resolvers 'timeout' value). It is only possible if the server
inherits outdated info from a state file and is no longer selected to be
attached to a SRV item. Note that most of time, a server is attached to a
SRV item. Thus when the item becomes obsolete, the server is cleaned
up.
It is important to have such task to be sure the server will be free again
to have a chance to be resolved again with fresh information. Of course,
this patch is a workaround to solve a design issue. But there is no other
obvious way to fix it without rewritting all the resolvers part. And it must
be backportable.
This patch relies on following commits:
* MINOR: resolvers: Clean server in a dedicated function when removing a SRV item
* MINOR: resolvers: Remove server from named_servers tree when removing a SRV item
All the series must be backported as far as 2.2 after some observation
period. Backports to 2.0 and 1.8 must be evaluated.
To perform servers resolution, the resolver's lock is first acquired then
the server's lock when necessary. However, when the fqdn is set via the CLI,
the opposite is performed. So, it is possible to experience an ABBA
deadlock.
To fix this bug, the server's lock is acquired and released for each
subcommand of "set server" with an exception when the fqdn is set. The
resolver's lock is first acquired. Of course, this means we must be sure to
have a resolver to lock.
This patch must be backported as far as 1.8.
If a server is configured to rely on a SRV resolution, we must forbid to
change its fqdn on the CLI. Indeed, in this case, the server retrieves its
fqdn from the SRV resolution. If the fqdn is changed via the CLI, this
conflicts with the SRV resolution and leaves the server in an undefined
state. Most of time, the SRV resolution remains enabled with no effect on
the server (no update). Some time the A/AAAA resolution for the new fqdn is
not enabled at all. It depends on the server state and resolver state when
the CLI command is executed.
This patch must be backported as far as 2.0 (maybe to 1.8 too ?) after some
observation period.
To avoid repeating the same source code, allocating memory and initializing
the per_thr field from the server structure is transferred to a separate
function.
Until then, the servers were automatically attached on their creation
into the proxy addr_node tree via _srv_parse_init. In case of an invalid
dynamic server which is instantly freed, no detach operation was made
leaving a NULL server in the tree.
Change this mode of operation by marking the attach operation as
optional in _srv_parse_init. This operation is not conduct for a dynamic
server. The server is attached only at the end of the CLI handler when
it is marked as valid.
This must be backported up to 2.4.
A bug is present when trying to create a dynamic server with a fixed id.
If the server is detected invalid due to a later parsing arguments
error, the server is not removed from the proxy used ids tree before
being freed.
Change the mode of operation of 'id' keyword parsing handler. The
insertion in the backend tree is removed from the handler and is not
taken in charge by parse_server for configuration parsing. For the
dynamic servers, the insertion is called at the end of the 'add server'
CLI handler when the server has been validated.
This must be backported up to 2.4.
If no id is specified by the user for a dynamic server, it is necessary
to generate a new one. This operation is now done at the end of 'add
server' CLI handler. The server is then inserted into the proxy ids
tree.
Without this, several features may be broken for dynamic servers. Among
them, there is the "first" lb algorithm, the persistence using
stick-tables or the uniqueness internal check of srv_parse_id.
This must be backported up to 2.4.
Do not leave deleted server in used_server_id/used_server_addr backend
trees. This might lead to crashes if a deleted server is used through
these trees.
At this moment, dynamic servers are only added in used_server_id if they
have a fixed id. They are never inserted in used_server_addr as this
code is missing. So these new delete instructions are noop. However, a
fix will be provided soon to insert properly all dynamic servers in both
used_server_id and used_server_addr trees so the deletion counterpart
will be mandatory in the CLI server delete handler.
This must be backported to 2.4.
Some config parsing handlers were designed to be run at startup on a
single-thread. When executing at runtime for dynamic servers,
thread-safety is not guaranteed. This is the case for example in
srv_parse_id which manipulates backend used_ids tree.
One solution could be to add locks but it might be tricky to found all
affected functions and it can be an easy source of deadlock. The other
solution which has been chosen is to use thread-isolation over almost
all of the cli_parse_add_server CLI handler.
For now this solution is sufficient. If some users make heavy use of the
'add server', hurting the overall performance, it will be necessary to
design a much thinner solution.
This must be backported up to 2.4.
This patch fix the issue adding a test in srvrq before registering
the server on it during server template init.
This was a regression due to commit :
3406766d57fc11478d54a6fa2d048cbfe4524a4e
This should be backported with this previous commit (until 2.0)
This patch add a ref into servers to register them onto the
record answer item used to set their hostnames.
It also adds a head list into 'srvrq' to register servers free
to be affected to a SRV record.
A head of a tree is also added to srvrq to put servers which
present a hotname in server state file. To re-link them fastly
to the matching record as soon an item present the same name.
This results in better performances on SRV record response
parsing.
This is an optimization but it could avoid to trigger the haproxy's
internal wathdog in some circumstances. And for this reason
it should be backported as far we can (2.0 ?)
This patch adds a head list into answer items on servers which use
this record to set their IPs. It makes lookup on duplicated ip faster and
allow to check immediatly if an item is still valid renewing the IP.
This results in better performances on A/AAAA resolutions.
This is an optimization but it could avoid to trigger the haproxy's
internal wathdog in some circumstances. And for this reason
it should be backported as far we can (2.0 ?)
In case of SRV records, The answer item list was purged by the
error callback of the first requester which considers the error
could not be safely ignored. It makes this item list unavailable
for subsequent requesters even if they consider the error
could be ignored.
On A resolution or do_resolve action error, the answer items were
never trashed.
This patch re-work the error callbacks and the code to check the return code
If a callback return 1, we consider the error was ignored and
the answer item list must be kept. At the opposite, If all error callbacks
of all requesters of the same resolution returns 0 the list will be purged
This patch should be backported as far as 2.0.
Define srv.init_addr_methods to SRV_IADDR_NONE on 'add server' CLI
handler. This explicitly states that no resolution will be made on the
server creation.
This is not a real bug as the default value (SRV_IADDR_END) has the same
effect in practice. However the intent is clearer and prevent to use the
default "libc,last" by mistake which cannot execute on runtime (blocking
call + file access via gethostbyname/getaddrinfo).
The doc is also updated to reflect this limitation.
This should be backported up to 2.4.
Replace memprintf usage in _srv_parse* functions by ha_alert calls. This
has the advantage to simplify the function prototype by removing an
extra char** argument.
As a consequence, the CLI handler of 'add server' is updated to output
the user messages buffers if not empty.
Initialize the parsing context in srv_init_addr. This function is called
after configuration check.
This will standardize the stderr output on startup with the parse_server
function.
Fix memprintf used in server_parse_sni_expr. Error messages should not
be ending with a newline as it will be inserted in the parent function
on the ha_alert invocation.
Two calloc calls were not checked in the srv_parse_source function.
Considering that this function could be called at runtime through a
dynamic server creation via the CLI, this could lead to an unfortunate
crash.
It was raised in GitHub issue #1233.
It could be backported to all stable branches even though the runtime
crash could only happen on branches where dynamic server creation is
possible.
A deadlock is possible with 'set maxconn server' command, if there is
pending connection ready to be dequeued. This is caused by the locking
of server spinlock in both cli_parse_set_maxconn_server and
process_srv_queue.
Fix this by reducing the scope of the server lock into
server_parse_maxconn_change_request. If connection are dequeued, the
lock is taken a second time. This can be seen as suboptimal but as it
happens only during 'set maxconn server' it can be considered as
tolerable.
This issue was reported on the mailing list, for the 1.8.x branch.
It must be backported up to the 1.8.
Only check servers attached to a proxy with PR_CAP_LB.
This does not need to be backported as the diag message was added in the
current 2.4-dev branch.
There were 102 CLI commands whose help were zig-zagging all along the dump
making them unreadable. This patch realigns all these messages so that the
command now uses up to 40 characters before the delimiting colon. About a
third of the commands did not correctly list their arguments which were
added after the first version, so they were all updated. Some abuses of
the term "id" were fixed to use a more explanatory term. The
"set ssl ocsp-response" command was not listed because it lacked a help
message, this was fixed as well. The deprecated enable/disable commands
for agent/health/server were prominently written as deprecated. Whenever
possible, clearer explanations were provided.
Implement a function to close all server idle connections. This function
is called via a global deinit server handler.
The main objective is to prevents from leaving sockets in TIME_WAIT
state. To limit the set of operations on shutdown and prevents
tasks rescheduling, only the ctrl stack closing is done.
The text mentionned that only backends with consistent hash method were
supported for dynamic servers. In fact, it is only required that the lb
algorith is dynamic.
gcc still reports a potential null pointer dereference in delete server
function event with a BUG_ON before it. Remove the misleading NULL check
in the for loop which should never happen.
This does not need to be backported.