For server checks, SSL and PROXY is automatically inherited from the
server settings if no specific check port is specified. Change this
behavior for dynamic servers : explicit "check-ssl"/"check-send-proxy"
are required for them.
Without this change, it is impossible to add a dynamic server with
SSL/PROXY settings and checks without, if the check port is not
explicit. This is because "no-check-ssl"/"no-check-send-proxy" keywords
are not available for dynamic servers.
This change respects the principle that dynamic servers on the CLI
should not reuse the same shortcuts used during the config file parsing.
Mostly because we expect this feature to be manipulated by automated
tools, contrary to the config file which should aim to be the shortest
possible for human readability.
Update the documentation of the "check" keyword to reflect this change.
The unsafe conn-stream API (__cs_*) is now used when we are sure the good
endpoint or application is attached to the conn-stream. This avoids compiler
warnings about possible null derefs. It also simplify the code and clear up
any ambiguity about manipulated entities.
In the same way a stream has always valid conn-streams, when a health-checks
is created, a conn-stream is now created and the health-check is attached on
it, as an app. This simplify a bit the connect part when a health-check is
running.
In the same way the conn-stream has a pointer to the stream endpoint , this
patch adds a pointer to the application entity in the conn-stream
structure. For now, it is a stream or a health-check. It is mandatory to
merge the stream-interface with the conn-stream.
The conn-stream will progressively replace the stream-interface. Thus, a
stream will have to allocate the backend conn-stream during its
creation. This means it will be possible to have a conn-stream with no
connection. To prepare this change, we test the conn-stream's connection
when we retrieve it.
These functions are declared as external functions in check.h and
as inline functions in check.c. Let's move them as static inline in
check.h. This appeared in 2.4 with the following commits:
4858fb2e1 ("MEDIUM: check: align agentaddr and agentport behaviour")
1c921cd74 ("BUG/MINOR: check: consitent way to set agentaddr")
While harmless (it only triggers build warnings with some gcc 4.x),
it should probably be backported where the paches above are present
to keep the code consistent.
When cleaning up the code to remove most explicit task masks in commit
beeabf531 ("MINOR: task: provide 3 task_new_* wrappers to simplify the
API"), a mistake was done with the external checks where the call does
task_new_on(1) instead of task_new_on(0) due to the confusion with the
previous mask 1.
No backport is needed as that's only 2.5-dev.
This change is required to support TCP/HTTP rules in defaults sections. The
'disabled' bitfield in the proxy structure, used to know if a proxy is
disabled or stopped, is replaced a generic bitfield named 'flags'.
PR_DISABLED and PR_STOPPED flags are renamed to PR_FL_DISABLED and
PR_FL_STOPPED respectively. In addition, everywhere there is a test to know
if a proxy is disabled or stopped, there is now a bitwise AND operation on
PR_FL_DISABLED and/or PR_FL_STOPPED flags.
The last 3 fields were 3 list heads that are per-thread, and which are:
- the pool's LRU head
- the buffer_wq
- the streams list head
Moving them into thread_ctx completes the removal of dynamic elements
from the struct thread_info. Now all these dynamic elements are packed
together at a single place for a thread.
We'll need to improve the API to pass other arguments in the future, so
let's start to adapt better to the current use cases. task_new() is used:
- 18 times as task_new(tid_bit)
- 18 times as task_new(MAX_THREADS_MASK)
- 2 times with a single bit (in a loop)
- 1 in the debug code that uses a mask
This patch provides 3 new functions to achieve this:
- task_new_here() to create a task on the calling thread
- task_new_anywhere() to create a task to be run anywhere
- task_new_on() to create a task to run on a specific thread
The change is trivial and will allow us to later concentrate the
required adaptations to these 3 functions only. It's still possible
to call task_new() if needed but a comment was added to encourage the
use of the new ones instead. The debug code was not changed and still
uses it.
'slowstart' can be used without check on a server, with the CLI handlers
'enable/disable server'. Move the code to initialize and start the
slowstart task outside of check.c.
This change will also be reused to enable slowstart for dynamic servers.
In a future patch, it will be possible to remove at runtime every
servers, both static and dynamic. This requires to extend the server
refcount for all instances.
First, refcount manipulation functions have been renamed to better
express the API usage.
* srv_refcount_use -> srv_take
The refcount is always initialize to 1 on the server creation in
new_server. It's also incremented for each check/agent configured on a
server instance.
* free_server -> srv_drop
This decrements the refcount and if null, the server is freed, so code
calling it must not use the server reference after it. As a bonus, this
function now returns the next server instance. This is useful when
calling on the server loop without having to save the next pointer
before each invocation.
In these functions, remove the checks that prevent refcount on
non-dynamic servers. Each reference to "dynamic" in variable/function
naming have been eliminated as well.
There is currently a leak on agent-check for dynamic servers. When
deleted, the check rules and vars are not liberated. This leak grows
each time a dynamic server with agent-check is deleted.
Replace the manual purge code by a free_check invocation which
centralizes all the details on check cleaning.
There is no leak for health check because in this case the proxy is the
owner of the check vars and rules.
This should not be backported, unless dynamic server checks are
backported.
If an error occured during a dynamic server creation, free_check is used
to liberate a possible agent-check. However, this does not free
associated vars and rules associated as this is done on another function
named deinit_srv_agent_check.
To simplify the check free and avoid a leak, move free vars/rules in
free_check. This is valid because deinit_srv_agent_check also uses
free_check.
This operation is done only for an agent-check because for a health
check, the proxy instance is the owner of check vars/rules.
This should not be backported, unless dynamic server checks are
backported.
Do not reset check flags when setting CHK_ST_PURGE.
Currently, this change has no impact. However, it is semantically wrong
to clear important flags such as CHK_ST_AGENT on purge.
Furthermore, this change will become mandatoy for a future fix to
properly free agent checks on dynamic servers removal. For this, it will
be needed to differentiate health/agent-check on purge via CHK_ST_AGENT
to properly free agent checks.
This must not be backported unless dynamic servers checks are
backported.
Test if server is not null before using free_server in the check purge
operation. Currently, the null server scenario should not occured as
purge is used with refcounted dynamic servers. However, this might not
be always the case if purge is use in the future in other cases; thus
the test is useful for extensibility.
No need to backport, unless dynamic server checks are backported.
This has been reported through a coverity report in github issue #1343.
This commit is the counterpart for agent check of
"MEDIUM: server: implement check for dynamic servers".
The "agent-check" keyword is enabled for dynamic servers. The agent
check must manually be activated via "enable agent" CLI. This can
enable the dynamic server if the agent response is "ready" without an
explicit "enable server" CLI.
Implement check support for dynamic servers. The "check" keyword is now
enabled for dynamic servers. If used, the server check is initialized
and the check task started in the "add server" CLI handler. The check is
explicitely disabled and must be manually activated via "enable health"
CLI handler.
The dynamic server refcount is incremented if a check is configured. On
"delete server" handler, the check is purged, which decrements the
refcount.
Implement a collection of keywords deemed safe and useful to dynamic
servers. The list of the supported keywords is :
- addr
- check-proto
- check-send-proxy
- check-via-socks4
- rise
- fall
- fastinter
- downinter
- port
- agent-addr
- agent-inter
- agent-port
- agent-send
Implement a mechanism to free a started check on runtime for dynamic
servers. A new function check_purge is created for this. The check task
will be marked for deletion and scheduled to properly close connection
elements and free the task/tasklet/buf_wait elements.
This function will be useful to delete a dynamic server wich checks.
global maxsock is used to estimate a number of fd to reserve for
internal use, such as checks. It is incremented at startup with the info
from the config file.
Disable this incrementation in checks functions at runtime. First, it
currently serves no purpose to increment it after startup. Worse, it may
lead to out-of-bound accesse on the fdtab.
This will be useful to initiate checks for dynamic servers.
Remove static qualifier on init_srv_check, init_srv_agent_check and
start_check_task. These functions will be called in server.c for dynamic
servers with checks.
Allocate default tcp ruleset for every backend without explicit rules
defined, even if no server in the backend use check. This change is
required to implement checks for dynamic servers.
This allocation is done on check_config_validity. It must absolutely be
called before check_proxy_tcpcheck (called via post proxy check) which
allocate the implicit tcp connect rule.
A config like the below fails to validate because of a bogus test:
backend b1
tcp-check connect port 1234
option tcp-check
server s1 1.2.3.4 check
[ALERT] (18887) : config : config: proxy 'b1': server 's1' has neither
service port nor check port, and a tcp_check rule
'connect' with no port information.
A || instead of a && only validates the connect rule when both the
address *and* the port are set. A work around is to set the rule like
this:
tcp-check connect addr 0:1234 port 1234
This needs to be backported as far as 2.2 (2.0 is OK).
In srv_parse_agent_check the error code is not returned in case
something goes wrong. The value 0 is always return.
Additionally, there's a small cleanup of unreachable returns that in
most checks are not present either and removed in two places they were
present. This makes the code consistent across the different checks.
Set "config :" as a prefix for the user messages context before starting
the configuration parsing. All following stderr output will be prefixed
by it.
As a consequence, remove extraneous prefix "config" already specified in
various ha_alert/warning/notice calls.
On observe mode, if a server is marked as DOWN, the server's health-check is
rescheduled using the fastinter timeout if the new expiration date is newer
that the current one. But this must only be performed if the fastinter
timeout is defined.
Internally, tick_is_lt() function only checks the date and does not perform any
verification on the provided args. Thus, we must take care of it. However, it is
possible to disable the server health-check by setting its task expiration date
to TICK_ETERNITY.
This patch must be backported as far as 2.2. It is related to
A connection may be synchronously established. In the tcpcheck context, it
may be a problem if several connections come one after another. In this
case, there is no event to close the very first connection before starting
the next one. The checks is thus blocked and timed out, a L7 timeout error
is reported.
To fix the bug, when a tcpcheck is started, we immediately evaluate its
state. Most of time, nothing is performed and we must wait. But it is thus
possible to handle the result of a successfull connection.
This patch should fix the issue #1234. It must be backported as far as 2.2.
The current "ADD" vs "ADDQ" is confusing because when thinking in terms
of appending at the end of a list, "ADD" naturally comes to mind, but
here it does the opposite, it inserts. Several times already it's been
incorrectly used where ADDQ was expected, the latest of which was a
fortunate accident explained in 6fa922562 ("CLEANUP: stream: explain
why we queue the stream at the head of the server list").
Let's use more explicit (but slightly longer) names now:
LIST_ADD -> LIST_INSERT
LIST_ADDQ -> LIST_APPEND
LIST_ADDED -> LIST_INLIST
LIST_DEL -> LIST_DELETE
The same is true for MT_LISTs, including their "TRY" variant.
LIST_DEL_INIT keeps its short name to encourage to use it instead of the
lazier LIST_DELETE which is often less safe.
The change is large (~674 non-comment entries) but is mechanical enough
to remain safe. No permutation was performed, so any out-of-tree code
can easily map older names to new ones.
The list doc was updated.
The dummy frontend used to create the session of the tcp-checks is
initialized without identifier. However, it is required because this id may
be used without any guard, for instance in log-format string via "%f" or
when fe_name sample fetch is called. Thus, an unset id may lead to crashes.
This patch must be backported as far as 2.2.
Add the trace support for the checks. Only tcp-check based health-checks are
supported, including the agent-check.
In traces, the first argument is always a check object. So it is easy to get
all info related to the check. The tcp-check ruleset, the conn-stream and
the connection, the server state...
This patch replaces roughly all occurrences of an HA_ATOMIC_ADD(&foo, 1)
or HA_ATOMIC_SUB(&foo, 1) with the equivalent HA_ATOMIC_INC(&foo) and
HA_ATOMIC_DEC(&foo) respectively. These are 507 changes over 45 files.
The function's purpose used to be to fail a buffer allocation if that
allocation wouldn't result in leaving some buffers available. Thus,
some allocations could succeed and others fail for the sole purpose of
trying to provide 2 buffers at once to process_stream(). But things
have changed a lot with 1.7 breaking the promise that process_stream()
would always succeed with only two buffers, and later the thread-local
pool caches that keep certain buffers available that are not accounted
for in the global pool so that local allocators cannot guess anything
from the number of currently available pools.
Let's just replace all last uses of b_alloc_margin() with b_alloc() once
for all.
Prepare the server parsing API to support dynamic servers.
- define a new parsing flag to be used for dynamic servers
- each keyword contains a new field dynamic_ok to indicate if it can be
used for a dynamic server. For now, no keyword are supported.
- do not copy settings from the default server for a new dynamic server.
- a dynamic server is created in a maintenance mode and requires an
explicit 'enable server' command.
- a new server flag named SRV_F_DYNAMIC is created. This flag is set for
all servers created at runtime. It might be useful later, for example
to know if a server can be purged.
It's been too short for quite a while now and is now full. It's still
time to extend it to 32-bits since we have room for this without
wasting any space, so we now gained 16 new bits for future flags.
The values were not reassigned just in case there would be a few
hidden u16 or short somewhere in which these flags are placed (as
it used to be the case with stream->pending_events).
The patch is tagged MEDIUM because this required to update the task's
process() prototype to use an int instead of a short, that's quite a
bunch of places.
Refactoring performed with the following Coccinelle patch:
@@
char *s;
@@
(
- ist2(s, strlen(s))
+ ist(s)
|
- ist2(strdup(s), strlen(s))
+ ist(strdup(s))
)
Note that this replacement is safe even in the strdup() case, because `ist()`
will not call `strlen()` on a `NULL` pointer. Instead is inserts a length of
`0`, effectively resulting in `IST_NULL`.
This makes the code more readable and less prone to copy-paste errors.
In addition, it allows to place some __builtin_constant_p() predicates
to trigger a link-time error in case the compiler knows that the freed
area is constant. It will also produce compile-time error if trying to
free something that is not a regular pointer (e.g. a function).
The DEBUG_MEM_STATS macro now also defines an instance for ha_free()
so that all these calls can be checked.
178 occurrences were converted. The vast majority of them were handled
by the following Coccinelle script, some slightly refined to better deal
with "&*x" or with long lines:
@ rule @
expression E;
@@
- free(E);
- E = NULL;
+ ha_free(&E);
It was verified that the resulting code is the same, more or less a
handful of cases where the compiler optimized slightly differently
the temporary variable that holds the copy of the pointer.
A non-negligible amount of {free(str);str=NULL;str_len=0;} are still
present in the config part (mostly header names in proxies). These
ones should also be cleaned for the same reasons, and probably be
turned into ist strings.
Historically this function would try to wake the most accurate number of
process_stream() waiters. But since the introduction of filters which could
also require buffers (e.g. for compression), things started not to be as
accurate anymore. Nowadays muxes and transport layers also use buffers, so
the runqueue size has nothing to do anymore with the number of supposed
users to come.
In addition to this, the threshold was compared to the number of free buffer
calculated as allocated minus used, but this didn't work anymore with local
pools since these counts are not updated upon alloc/free!
Let's clean this up and pass the number of released buffers instead, and
consider that each waiter successfully called counts as one buffer. This
is not rocket science and will not suddenly fix everything, but at least
it cannot be as wrong as it is today.
This could have been marked as a bug given that the current situation is
totally broken regarding this, but this probably doesn't completely fix
it, it only goes in a better direction. It is possible however that it
makes sense in the future to backport this as part of a larger series if
the situation significantly improves.
The buffer wait queue used to be global historically but this doest not
make any sense anymore given that the most common use case is to have
thread-local pools. Thus there's no point waking up waiters of other
threads after releasing an entry, as they won't benefit from it.
Let's move the queue head to the thread_info structure and use
ti->buffer_wq from now on.
The server lock was taken preventively for anything in health_adjust(),
including the static config checks needed to detect that the lock was not
needed, while the function is always called on the response path to update
a server's status. This was responsible for huge contention causing a
performance drop of about 17% on 16 threads. Let's move the lock only
where it should be, i.e. inside the function around the critical sections
only. By doing this, a 16-thread process jumped back from 575 to 675 krps.
This should be backported to 2.3 as the situation degraded there, and
maybe later to 2.2.
There's an issue when a server state changes, we use an integer comparison
to decide whether or not to reschedule a test instead of using a wrapping
timer comparison. This will cause some health-checks not to be immediately
triggered half of the time, and some unneeded calls to task_queue() to be
performed in other cases.
This bug has always been there as it was introduced with the commit that
added the feature, 97f07b832 ("[MEDIUM] Decrease server health based on
http responses / events, version 3"). This may be backported everywhere.
This patch splits current dns.c into two files:
The first dns.c contains code related to DNS message exchange over UDP
and in future other TCP. We try to remove depencies to resolving
to make it usable by other stuff as DNS load balancing.
The new resolvers.c inherit of the code specific to the actual
resolvers.
Note:
It was really difficult to obtain a clean diff dur to the amount
of moved code.
Note2:
Counters and stuff related to stats is not cleany separated because
currently counters for both layers are merged and hard to separate
for now.