For A/AAAA resolution, if no ip is found for a server in the response, the
server is set to RMAINT status. However, its address must also be
reset. Otherwise, it is still reported by the cli on "show servers state"
commands. This may be confusing.
This patch may be backported as far as 2.0.
A queue is specific to a server or a proxy, so we don't need to place
this distinction inside all pendconns, it can be in the queue itself.
This commit adds the relevant fields "px" and "sv" into the struct
queue, and initializes them accordingly.
This basically undoes the API changes that were performed by commit
0274286dd ("BUG/MAJOR: server: fix deadlock when changing maxconn via
agent-check") to address the deadlock issue: since process_srv_queue()
doesn't use the server lock anymore, it doesn't need the "server_locked"
argument, so let's get rid of it before it gets used again.
Till now whenever a server or proxy's queue was touched, this server
or proxy's lock was taken. Not only this requires distinct code paths,
but it also causes unnecessary contention with other uses of these locks.
This patch adds a lock inside the "queue" structure that will be used
the same way by the server and the proxy queuing code. The server used
to use a spinlock and the proxy an rwlock, though the queue only used
it for locked writes. This new version uses a spinlock since we don't
need the read lock part here. Tests have not shown any benefit nor cost
in using this one versus the rwlock so we could change later if needed.
The lower contention on the locks increases the performance from 362k
to 374k req/s on 16 threads with 20 servers and leastconn. The gain
with roundrobin even increases by 9%.
This is tagged medium because the lock is changed, but no other part of
the code touches the queues, with nor without locking, so this should
remain invisible.
This reverts commit fcb8bf8650ec6b5614d1b88db54f1200ebd96cbd.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This reverts commit c83e45e9b001591633188a480a896c935d3c9625.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This basically undoes the API changes that were performed by commit
0274286dd ("BUG/MAJOR: server: fix deadlock when changing maxconn via
agent-check") to address the deadlock issue: since process_srv_queue()
doesn't use the server lock anymore, it doesn't need the "server_locked"
argument, so let's get rid of it before it gets used again.
Till now whenever a server or proxy's queue was touched, this server
or proxy's lock was taken. Not only this requires distinct code paths,
but it also causes unnecessary contention with other uses of these locks.
This patch adds a lock inside the "queue" structure that will be used
the same way by the server and the proxy queuing code. The server used
to use a spinlock and the proxy an rwlock, though the queue only used
it for locked writes. This new version uses a spinlock since we don't
need the read lock part here. Tests have not shown any benefit nor cost
in using this one versus the rwlock so we could change later if needed.
The lower contention on the locks increases the performance from 491k
to 507k req/s on 16 threads with 20 servers and leastconn. The gain
with roundrobin even increases by 6%.
The performance profile changes from this:
13.03% haproxy [.] fwlc_srv_reposition
8.08% haproxy [.] fwlc_get_next_server
3.62% haproxy [.] process_srv_queue
1.78% haproxy [.] pendconn_dequeue
1.74% haproxy [.] pendconn_add
to this:
11.95% haproxy [.] fwlc_srv_reposition
7.57% haproxy [.] fwlc_get_next_server
3.51% haproxy [.] process_srv_queue
1.74% haproxy [.] pendconn_dequeue
1.70% haproxy [.] pendconn_add
At this point the differences are mostly measurement noise.
This is tagged medium because the lock is changed, but no other part of
the code touches the queues, with nor without locking, so this should
remain invisible.
The server_parse_maxconn_change_request locks the server lock. However,
this function can be called via agent-checks or lua code which already
lock it. This bug has been introduced by the following commit :
commit 79a88ba3d09f7e2b73ae27cb5d24cc087a548fa6
BUG/MAJOR: server: prevent deadlock when using 'set maxconn server'
This commit tried to fix another deadlock with can occur because
previoulsy server_parse_maxconn_change_request requires the server lock
to be held. However, it may call internally process_srv_queue which also
locks the server lock. The locking policy has thus been updated. The fix
is functional for the CLI 'set maxconn' but fails to address the
agent-check / lua counterparts.
This new issue is fixed in two steps :
- changes from the above commit have been reverted. This means that
server_parse_maxconn_change_request must again be called with the
server lock.
- to counter the deadlock fixed by the above commit, process_srv_queue
now takes an argument to render the server locking optional if the
caller already held it. This is only used by
server_parse_maxconn_change_request.
The above commit was subject to backport up to 1.8. Thus this commit
must be backported in every release where it is already present.
Activate the 'ssl' keyword for dynamic servers. This is the final step
to have ssl dynamic servers feature implemented. If activated,
ssl_sock_prepare_srv_ctx will be called at the end of the 'add server'
CLI handler.
At the same time, update the management doc to list all ssl keywords
implemented for dynamic servers.
'set server ssl' uses ssl parameters from default-server. As dynamic
servers does not reuse any default-server parameters, this command has
no sense for them.
The commit c7b391aed ("BUG/MEDIUM: server/cli: Fix ABBA deadlock when fqdn
is set from the CLI") introduced 2 bugs. The first one is a typo on the
server's lock label (s/SERVER_UNLOCK/SERVER_LOCK/). The second one is about
the server's lock itself. It must be acquired to execute the "agent-send"
subcommand.
The patch above is marked to be backported as far as 1.8. Thus, this one
must also backported as far 1.8.
BUG/MINOR: server/cli: Don't forget to lock server on agent-send subcommand
When a server relies on a SRV resolution, a task is created to clean it up
(fqdn/port and address) when the SRV resolution is considered as outdated
(based on the resolvers 'timeout' value). It is only possible if the server
inherits outdated info from a state file and is no longer selected to be
attached to a SRV item. Note that most of time, a server is attached to a
SRV item. Thus when the item becomes obsolete, the server is cleaned
up.
It is important to have such task to be sure the server will be free again
to have a chance to be resolved again with fresh information. Of course,
this patch is a workaround to solve a design issue. But there is no other
obvious way to fix it without rewritting all the resolvers part. And it must
be backportable.
This patch relies on following commits:
* MINOR: resolvers: Clean server in a dedicated function when removing a SRV item
* MINOR: resolvers: Remove server from named_servers tree when removing a SRV item
All the series must be backported as far as 2.2 after some observation
period. Backports to 2.0 and 1.8 must be evaluated.
To perform servers resolution, the resolver's lock is first acquired then
the server's lock when necessary. However, when the fqdn is set via the CLI,
the opposite is performed. So, it is possible to experience an ABBA
deadlock.
To fix this bug, the server's lock is acquired and released for each
subcommand of "set server" with an exception when the fqdn is set. The
resolver's lock is first acquired. Of course, this means we must be sure to
have a resolver to lock.
This patch must be backported as far as 1.8.
If a server is configured to rely on a SRV resolution, we must forbid to
change its fqdn on the CLI. Indeed, in this case, the server retrieves its
fqdn from the SRV resolution. If the fqdn is changed via the CLI, this
conflicts with the SRV resolution and leaves the server in an undefined
state. Most of time, the SRV resolution remains enabled with no effect on
the server (no update). Some time the A/AAAA resolution for the new fqdn is
not enabled at all. It depends on the server state and resolver state when
the CLI command is executed.
This patch must be backported as far as 2.0 (maybe to 1.8 too ?) after some
observation period.
To avoid repeating the same source code, allocating memory and initializing
the per_thr field from the server structure is transferred to a separate
function.
Until then, the servers were automatically attached on their creation
into the proxy addr_node tree via _srv_parse_init. In case of an invalid
dynamic server which is instantly freed, no detach operation was made
leaving a NULL server in the tree.
Change this mode of operation by marking the attach operation as
optional in _srv_parse_init. This operation is not conduct for a dynamic
server. The server is attached only at the end of the CLI handler when
it is marked as valid.
This must be backported up to 2.4.
A bug is present when trying to create a dynamic server with a fixed id.
If the server is detected invalid due to a later parsing arguments
error, the server is not removed from the proxy used ids tree before
being freed.
Change the mode of operation of 'id' keyword parsing handler. The
insertion in the backend tree is removed from the handler and is not
taken in charge by parse_server for configuration parsing. For the
dynamic servers, the insertion is called at the end of the 'add server'
CLI handler when the server has been validated.
This must be backported up to 2.4.
If no id is specified by the user for a dynamic server, it is necessary
to generate a new one. This operation is now done at the end of 'add
server' CLI handler. The server is then inserted into the proxy ids
tree.
Without this, several features may be broken for dynamic servers. Among
them, there is the "first" lb algorithm, the persistence using
stick-tables or the uniqueness internal check of srv_parse_id.
This must be backported up to 2.4.
Do not leave deleted server in used_server_id/used_server_addr backend
trees. This might lead to crashes if a deleted server is used through
these trees.
At this moment, dynamic servers are only added in used_server_id if they
have a fixed id. They are never inserted in used_server_addr as this
code is missing. So these new delete instructions are noop. However, a
fix will be provided soon to insert properly all dynamic servers in both
used_server_id and used_server_addr trees so the deletion counterpart
will be mandatory in the CLI server delete handler.
This must be backported to 2.4.
Some config parsing handlers were designed to be run at startup on a
single-thread. When executing at runtime for dynamic servers,
thread-safety is not guaranteed. This is the case for example in
srv_parse_id which manipulates backend used_ids tree.
One solution could be to add locks but it might be tricky to found all
affected functions and it can be an easy source of deadlock. The other
solution which has been chosen is to use thread-isolation over almost
all of the cli_parse_add_server CLI handler.
For now this solution is sufficient. If some users make heavy use of the
'add server', hurting the overall performance, it will be necessary to
design a much thinner solution.
This must be backported up to 2.4.
This patch fix the issue adding a test in srvrq before registering
the server on it during server template init.
This was a regression due to commit :
3406766d57fc11478d54a6fa2d048cbfe4524a4e
This should be backported with this previous commit (until 2.0)
This patch add a ref into servers to register them onto the
record answer item used to set their hostnames.
It also adds a head list into 'srvrq' to register servers free
to be affected to a SRV record.
A head of a tree is also added to srvrq to put servers which
present a hotname in server state file. To re-link them fastly
to the matching record as soon an item present the same name.
This results in better performances on SRV record response
parsing.
This is an optimization but it could avoid to trigger the haproxy's
internal wathdog in some circumstances. And for this reason
it should be backported as far we can (2.0 ?)
This patch adds a head list into answer items on servers which use
this record to set their IPs. It makes lookup on duplicated ip faster and
allow to check immediatly if an item is still valid renewing the IP.
This results in better performances on A/AAAA resolutions.
This is an optimization but it could avoid to trigger the haproxy's
internal wathdog in some circumstances. And for this reason
it should be backported as far we can (2.0 ?)
In case of SRV records, The answer item list was purged by the
error callback of the first requester which considers the error
could not be safely ignored. It makes this item list unavailable
for subsequent requesters even if they consider the error
could be ignored.
On A resolution or do_resolve action error, the answer items were
never trashed.
This patch re-work the error callbacks and the code to check the return code
If a callback return 1, we consider the error was ignored and
the answer item list must be kept. At the opposite, If all error callbacks
of all requesters of the same resolution returns 0 the list will be purged
This patch should be backported as far as 2.0.
Define srv.init_addr_methods to SRV_IADDR_NONE on 'add server' CLI
handler. This explicitly states that no resolution will be made on the
server creation.
This is not a real bug as the default value (SRV_IADDR_END) has the same
effect in practice. However the intent is clearer and prevent to use the
default "libc,last" by mistake which cannot execute on runtime (blocking
call + file access via gethostbyname/getaddrinfo).
The doc is also updated to reflect this limitation.
This should be backported up to 2.4.
Replace memprintf usage in _srv_parse* functions by ha_alert calls. This
has the advantage to simplify the function prototype by removing an
extra char** argument.
As a consequence, the CLI handler of 'add server' is updated to output
the user messages buffers if not empty.
Initialize the parsing context in srv_init_addr. This function is called
after configuration check.
This will standardize the stderr output on startup with the parse_server
function.
Fix memprintf used in server_parse_sni_expr. Error messages should not
be ending with a newline as it will be inserted in the parent function
on the ha_alert invocation.
Two calloc calls were not checked in the srv_parse_source function.
Considering that this function could be called at runtime through a
dynamic server creation via the CLI, this could lead to an unfortunate
crash.
It was raised in GitHub issue #1233.
It could be backported to all stable branches even though the runtime
crash could only happen on branches where dynamic server creation is
possible.
A deadlock is possible with 'set maxconn server' command, if there is
pending connection ready to be dequeued. This is caused by the locking
of server spinlock in both cli_parse_set_maxconn_server and
process_srv_queue.
Fix this by reducing the scope of the server lock into
server_parse_maxconn_change_request. If connection are dequeued, the
lock is taken a second time. This can be seen as suboptimal but as it
happens only during 'set maxconn server' it can be considered as
tolerable.
This issue was reported on the mailing list, for the 1.8.x branch.
It must be backported up to the 1.8.
Only check servers attached to a proxy with PR_CAP_LB.
This does not need to be backported as the diag message was added in the
current 2.4-dev branch.
There were 102 CLI commands whose help were zig-zagging all along the dump
making them unreadable. This patch realigns all these messages so that the
command now uses up to 40 characters before the delimiting colon. About a
third of the commands did not correctly list their arguments which were
added after the first version, so they were all updated. Some abuses of
the term "id" were fixed to use a more explanatory term. The
"set ssl ocsp-response" command was not listed because it lacked a help
message, this was fixed as well. The deprecated enable/disable commands
for agent/health/server were prominently written as deprecated. Whenever
possible, clearer explanations were provided.
Implement a function to close all server idle connections. This function
is called via a global deinit server handler.
The main objective is to prevents from leaving sockets in TIME_WAIT
state. To limit the set of operations on shutdown and prevents
tasks rescheduling, only the ctrl stack closing is done.
The text mentionned that only backends with consistent hash method were
supported for dynamic servers. In fact, it is only required that the lb
algorith is dynamic.
gcc still reports a potential null pointer dereference in delete server
function event with a BUG_ON before it. Remove the misleading NULL check
in the for loop which should never happen.
This does not need to be backported.
Implement a new CLI command 'del server'. It can be used to removed a
dynamically added server. Only servers in maintenance mode can be
removed, and without pending/active/idle connection on it.
Add a new reg-test for this feature. The scenario of the reg-test need
to first add a dynamic server. It is then deleted and a client is used
to ensure that the server is non joinable.
The management doc is updated with the new command 'del server'.
cli_parse_add_server can be executed in parallel by several CLI
instances and so must be thread-safe. The critical points of the
function are :
- server duplicate detection
- insertion of the server in the proxy list
The mode of operation has been reversed. The server is first
instantiated and parsed. The duplicate check has been moved at the end
just before the insertion in the proxy list, under the thread isolation.
Thus, the thread safety is guaranteed and server allocation is kept
outside of locks/thread isolation.
The current "ADD" vs "ADDQ" is confusing because when thinking in terms
of appending at the end of a list, "ADD" naturally comes to mind, but
here it does the opposite, it inserts. Several times already it's been
incorrectly used where ADDQ was expected, the latest of which was a
fortunate accident explained in 6fa922562 ("CLEANUP: stream: explain
why we queue the stream at the head of the server list").
Let's use more explicit (but slightly longer) names now:
LIST_ADD -> LIST_INSERT
LIST_ADDQ -> LIST_APPEND
LIST_ADDED -> LIST_INLIST
LIST_DEL -> LIST_DELETE
The same is true for MT_LISTs, including their "TRY" variant.
LIST_DEL_INIT keeps its short name to encourage to use it instead of the
lazier LIST_DELETE which is often less safe.
The change is large (~674 non-comment entries) but is mechanical enough
to remain safe. No permutation was performed, so any out-of-tree code
can easily map older names to new ones.
The list doc was updated.
The test in srv_alloc_lb() to allocate the lb_nodes[] array used in the
consistent hash was incorrect, it wouldn't do it for consistent hash and
could do it for regular random.
No backport is needed as this was added for dynamic servers in 2.4-dev by
commit f99f77a50 ("MEDIUM: server: implement 'add server' cli command").