With previous commit, 9e080bf ("BUG/MINOR: checks: make sure fastinter is used
even on forced transitions"), on-error mark-down|sudden-death|fail-check are
now working as expected.
However, on-error fastinter remains broken because srv_getinter(), used in
the above commit to check the expiration date, won't return fastinter interval
if server health is maxed out (which is the case with on-error fastinter mode).
To fix this, we introduce a check flag named CHK_ST_FASTINTER.
This flag is set when on-error is triggered. This way we can force
srv_getinter() to return fastinter interval whenever the flag is set.
The flag is automatically cleared as soon as the new check task expiry is
recalculated in process_chk_conn().
This restores original behavior prior to d114f4a ("MEDIUM: checks: spread the
checks load over random threads").
It must be backported to 2.7 along with the aforementioned commits.
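The flag mechanism described above can be sketched roughly as follows. This is a simplified illustration and not the actual HAProxy code: the structure layout, the CHK_ST_FASTINTER bit value and every sketch_* name are assumptions made for the example, and the interval selection is reduced to the single "health is maxed out" case.

```c
#include <assert.h>

/* Hypothetical bit value, chosen for this sketch only. */
#define CHK_ST_FASTINTER 0x0400

struct sketch_check {
    unsigned int state; /* CHK_ST_* flags */
    int inter;          /* regular interval (ms) */
    int fastinter;      /* fast interval (ms), 0 if unset */
    int health;         /* current health counter */
    int rise, fall;     /* thresholds */
};

/* Returns the interval to use before the next check. */
int sketch_getinter(const struct sketch_check *c)
{
    int fast = c->fastinter ? c->fastinter : c->inter;

    /* forced by on-error: use fastinter even if health is maxed out */
    if (c->state & CHK_ST_FASTINTER)
        return fast;

    /* health maxed out: regular interval */
    if (c->health == c->rise + c->fall - 1)
        return c->inter;

    /* server in transition */
    return fast;
}

/* Computes the next expiry date, then clears the one-shot flag. */
int sketch_next_expiry(struct sketch_check *c, int now_ms)
{
    int exp;

    assert(c->inter > 0);
    exp = now_ms + sketch_getinter(c);
    c->state &= ~CHK_ST_FASTINTER;
    return exp;
}
```

Clearing the flag right where the expiry is recalculated keeps it one-shot: only the check that follows the on-error event is scheduled with fastinter.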
We're using srv_update_status() as the only event source for UP/DOWN server
events in an attempt to simplify the support for these 2 events.
It seems srv_update_status() is the common path for server state changes anyway.
Tested with server state updated from various sources:
- the cli
- server-state file (maybe we could disable this or at least don't publish
in global event queue in the future if it ends in slower startup for setups
relying on huge server state files)
- dns records (ie: srv template)
(again, could be fine-tuned to only publish in server specific subscriber
list and no longer in global subscription list if mass dns updates tend to
slow down srv_update_status())
- normal checks and observe checks (HCHK_STATUS_HANA)
(same as above, if checks related state update storms are expected)
- lua scripts
- html stats page (admin mode)
Basic support for ADD and DEL server events is added through this commit:
SERVER_ADD is published on dynamic server addition through cli.
SERVER_DEL is published on dynamic server deletion through cli.
This work depends on:
"MINOR: event_hdl: add event handler base api"
"MINOR: server: add srv->rid (revision id) value"
With the current design, we cannot distinguish between a
previously existing server that was deleted and a new server
reusing the deleted server's name/id.
This can cause some confusion when auditing stats/events/logs,
because the new server will look similar to the old
one.
To address this, we're adding a new value in server structure: rid
rid (revision id) value is an unsigned 32-bit value that is set upon
server creation. Value is derived from a global counter that starts
at 0 and is incremented each time one or multiple server deletions are
followed by a server addition (meaning that old name/id reuse could occur).
Thanks to this revision id, it is now easy to tell whether the server
we're looking at is the same as before or if it has been deleted and
re-added in the meantime.
(combining server name/id + server revision id yields a process-wide unique
identifier)
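The counter rule described above (increment only when one or more deletions are followed by an addition) could look like the sketch below. All names are hypothetical and the code ignores locking; it only illustrates the rule, not the actual server code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical global state for this sketch. */
static uint32_t sketch_rid_counter;  /* starts at 0 */
static int sketch_deleted_since_add; /* a deletion happened since the last addition */

struct sketch_server {
    const char *name;
    uint32_t rid; /* revision id, set once at creation */
};

/* Records that a server was deleted: its name/id may now be reused. */
void sketch_server_del(struct sketch_server *srv)
{
    assert(srv != NULL);
    sketch_deleted_since_add = 1;
}

/* Creates a server; bumps the counter if a name/id reuse is possible. */
void sketch_server_add(struct sketch_server *srv, const char *name)
{
    assert(srv != NULL);
    if (sketch_deleted_since_add) {
        sketch_rid_counter++;
        sketch_deleted_since_add = 0;
    }
    srv->name = name;
    srv->rid = sketch_rid_counter;
}
```

Several deletions in a row cost a single increment, so the counter only moves when an ambiguity could actually appear.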
Complete ipcmp() function with a new argument <check_port>. If this
argument is true, the function will compare port values besides IP
addresses and return true only if both are identical.
This commit will simplify QUIC connection migration detection. As such,
it should be backported to 2.7.
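As an illustration of the optional port comparison, here is a minimal sketch. The function name is hypothetical and it returns a plain boolean, which may differ from the return convention of the real ipcmp(); only AF_INET and AF_INET6 are handled.

```c
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Returns 1 if both addresses are identical, 0 otherwise. When
 * <check_port> is non-zero, the ports must match as well.
 */
int sketch_addr_equal(const struct sockaddr_storage *a,
                      const struct sockaddr_storage *b, int check_port)
{
    assert(a && b);
    if (a->ss_family != b->ss_family)
        return 0;

    if (a->ss_family == AF_INET) {
        const struct sockaddr_in *a4 = (const struct sockaddr_in *)a;
        const struct sockaddr_in *b4 = (const struct sockaddr_in *)b;

        if (a4->sin_addr.s_addr != b4->sin_addr.s_addr)
            return 0;
        return !check_port || a4->sin_port == b4->sin_port;
    }

    if (a->ss_family == AF_INET6) {
        const struct sockaddr_in6 *a6 = (const struct sockaddr_in6 *)a;
        const struct sockaddr_in6 *b6 = (const struct sockaddr_in6 *)b;

        if (memcmp(&a6->sin6_addr, &b6->sin6_addr, sizeof(a6->sin6_addr)) != 0)
            return 0;
        return !check_port || a6->sin6_port == b6->sin6_port;
    }

    return 0; /* unsupported family: never considered equal here */
}
```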
In order to evenly pick idle connections from other threads, there is
a "next_takeover" index in the server, that is incremented each time
a connection is picked from another thread, and indicates which one to
start from next time.
With thread groups this doesn't work well because the index is the same
regardless of the group, and if a group has more threads than another,
there's even a risk to reintroduce an imbalance.
This patch introduces a new per-tgroup storage in servers which, for now,
only contains an instance of this next_takeover index. This way each
thread will now only manipulate the index specific to its own group, and
the takeover will become fair again. More entries may come soon.
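A rough sketch of the per-group index follows. The structure and function names, as well as the group limit, are assumptions for the example; the sketch is single-threaded, whereas real code would need atomic updates.

```c
#include <assert.h>

#define SKETCH_MAX_TGROUPS 4 /* hypothetical limit */

/* Per-thread-group storage; only holds the takeover index for now. */
struct sketch_srv_per_tgroup {
    unsigned int next_takeover; /* thread index relative to the group */
};

struct sketch_server {
    struct sketch_srv_per_tgroup per_tgrp[SKETCH_MAX_TGROUPS];
};

/* Returns the in-group thread index to start from, and advances the
 * index of that group only. <grp_threads> is the group's thread count.
 */
unsigned int sketch_pick_takeover(struct sketch_server *srv,
                                  unsigned int grp, unsigned int grp_threads)
{
    unsigned int idx;

    assert(grp < SKETCH_MAX_TGROUPS && grp_threads > 0);
    idx = srv->per_tgrp[grp].next_takeover % grp_threads;
    srv->per_tgrp[grp].next_takeover = (idx + 1) % grp_threads;
    return idx;
}
```

Since each group wraps on its own thread count, a group with 3 threads and a group with 2 threads rotate independently.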
In 2.2, some idle conns usage metrics were added by commit cf612a045
("MINOR: servers: Add a counter for the number of currently used
connections."), which mentioned that the operation doesn't need to be
atomic since we're not seeking exact values. This is true but at least
we should use atomic stores to make sure not to cause invalid values
to appear on archs that wouldn't guarantee atomicity when writing an
int, such as writing two 16-bit words. This is pretty unlikely on our
targets but better keep the code safe against this.
This may be backported as far as 2.2.
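The idea can be illustrated with C11 atomics, which is an assumption of this sketch (the actual code relies on its own atomic macros). The structure and function names are hypothetical. The compare-then-store sequence is deliberately not atomic as a whole, since exact values are not needed, but each individual load and store is, so a reader can never observe a half-written value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_counters {
    atomic_uint curr_used_conns; /* connections currently in use */
    atomic_uint max_used_conns;  /* approximate peak */
};

/* Updates the approximate peak. A concurrent update may be lost, which
 * is acceptable here, but no torn value can ever be stored or read.
 */
void sketch_update_max(struct sketch_counters *c)
{
    unsigned int cur, max;

    assert(c != NULL);
    cur = atomic_load_explicit(&c->curr_used_conns, memory_order_relaxed);
    max = atomic_load_explicit(&c->max_used_conns, memory_order_relaxed);

    if (cur > max)
        atomic_store_explicit(&c->max_used_conns, cur, memory_order_relaxed);
}
```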
cli_parse_add_server() is the CLI handler for 'add server' command. This
function uses usermsgs_ctx to retrieve log messages from internal
ha_alert() calls and display them at the end of the handler.
At the beginning of the handler, stderr prefix is defined to "CLI" via
usermsgs_clr() function. However, this is not reset at the end. This
causes inconsistency for stderr output :
1. each ha_alert() invocation will reuse "CLI" prefix if 'add server'
command was executed before, even in non-CLI context
2. usermsgs_ctx is thread local, so this is only true if this runs on
the same thread as 'add server' handler.
To fix this, ensure that "CLI" prefix is now reset after
cli_parse_add_server(). This is done thanks to the addition of
cli_umsg()/cli_umsgerr() functions.
This can be backported up to 2.5 if we prefer to ensure output
consistency at the risk of changing stderr behaviors in stable versions.
In this case, the previous commit should be backported before :
MINOR: cli: define usermsgs print context
Add "shards" new keyword for "peers" section to configure the number
of peer shards attached to such sections. This impacts all the stick-tables
attached to the section.
Add "shard" new "server" parameter to configure the peers which participate in
all the stick-tables contents distribution. Each peer receives the stick-tables updates
only for keys with this shard value as distribution hash. The "shard" value
is stored in ->shard new server struct member.
cfg_parse_peers() which is the function which is called to parse all
the lines of a "peers" section is modified to parse the "shards" parameter
stored in ->nb_shards new peers struct member.
Add srv_parse_shard() new callback into server.c to parse the "shard"
parameter.
Implement stksess_getkey_hash() to compute the distribution hash for a
stick-table key as the 64-bit xxhash of the key concatenated to the stick-table
name. This function is called by stksess_setkey_shard(), itself
called by the already implemented function which creates a new stick-table
key (stksess_new()).
Add ->idlen new stktable struct member to store the stick-table name length
to not have to compute it each time a stick-table key hash is computed.
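The hash computation can be sketched as below. To stay self-contained, the sketch swaps XXH64 for a 64-bit FNV-1a hash, so the values differ from what the real code produces. All names are hypothetical, and the mapping from hash to shard number (modulo the number of shards, plus one, with 0 meaning "no sharding") is an assumption of the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FNV_OFFSET 0xcbf29ce484222325ULL
#define SKETCH_FNV_PRIME  0x100000001b3ULL

/* FNV-1a 64-bit, a stand-in for XXH64. <h> is the running state so
 * that several buffers can be hashed as one concatenation.
 */
uint64_t sketch_fnv1a64(const void *data, size_t len, uint64_t h)
{
    const unsigned char *p = data;
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= p[i];
        h *= SKETCH_FNV_PRIME;
    }
    return h;
}

struct sketch_table {
    const char *id;         /* table name */
    size_t idlen;           /* cached name length, computed once */
    unsigned int nb_shards; /* 0 when sharding is not configured */
};

/* Hash of the key concatenated to the table name. */
uint64_t sketch_key_hash(const struct sketch_table *t,
                         const void *key, size_t keylen)
{
    uint64_t h = sketch_fnv1a64(key, keylen, SKETCH_FNV_OFFSET);

    return sketch_fnv1a64(t->id, t->idlen, h);
}

/* Shard number in [1, nb_shards], or 0 without sharding (assumed mapping). */
unsigned int sketch_key_shard(const struct sketch_table *t,
                              const void *key, size_t keylen)
{
    assert(t && key);
    if (!t->nb_shards)
        return 0;
    return (unsigned int)(sketch_key_hash(t, key, keylen) % t->nb_shards) + 1;
}
```

Mixing the table name into the hash means the same key may land on different shards in different tables, and caching the name length avoids a strlen() per key.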
Idle connections do not work on 32-bit machines due to an alignment issue
causing the connection nodes to be indexed with their lower 32-bits set to
zero and the higher 32 ones containing the 32 lower bits of the hash. The
cause is the use of ebmb_node with an aligned data, as on this platform
ebmb_node is only 32-bit aligned, leaving a hole before the following hash
which is a uint64_t:
  $ pahole -C conn_hash_node ./haproxy
  struct conn_hash_node {
          struct ebmb_node           node;       /*     0    20 */
          /* XXX 4 bytes hole, try to pack */
          int64_t                    hash;       /*    24     8 */
          struct connection *        conn;       /*    32     4 */
          /* size: 40, cachelines: 1, members: 3 */
          /* sum members: 32, holes: 1, sum holes: 4 */
          /* padding: 4 */
          /* last cacheline: 40 bytes */
  };
Instead, eb64 nodes should be used when it comes to simply storing a
64-bit key, and that is what this patch does.
For backports, a variant consisting in simply marking the "hash" member
with a "packed" attribute on the struct also does the job (tested), and
might be preferable if the fix is difficult to adapt. Only 2.6 and 2.5
are affected by this.
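The hole and the "packed" variant can be reproduced with a small stand-alone sketch. The 20-byte node below is a hypothetical stand-in for the 32-bit ebmb_node layout, the structure names are made up, and the attributes are GCC/Clang extensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 20-byte stand-in for the tree node as seen on a 32-bit build. */
struct sketch_node {
    uint32_t words[5];
};

/* The key is explicitly 8-byte aligned: a hole appears before it, so
 * the bytes that immediately follow the node are not the key.
 */
struct sketch_hash_node_aligned {
    struct sketch_node node;
    uint64_t hash __attribute__((aligned(8)));
};

/* The key is packed: it immediately follows the node, without hole. */
struct sketch_hash_node_packed {
    struct sketch_node node;
    uint64_t hash __attribute__((packed));
};

/* Number of padding bytes between the node and the key. */
size_t sketch_hole_aligned(void)
{
    return offsetof(struct sketch_hash_node_aligned, hash) - sizeof(struct sketch_node);
}

size_t sketch_hole_packed(void)
{
    return offsetof(struct sketch_hash_node_packed, hash) - sizeof(struct sketch_node);
}

/* Sanity check of the layout the packed variant relies on. */
void sketch_check_layout(void)
{
    assert(sizeof(struct sketch_node) == 20);
    assert(sketch_hole_packed() == 0);
}
```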
When calling 'add server' with a hostname from the cli (runtime),
str2sa_range() does not resolve hostname because it is purposely
called without PA_O_RESOLVE flag.
This leads to 'srv->addr_node.key' being NULL. According to Willy it
is fine behavior, as long as we handle it properly, and is already
handled like this in srv_set_addr_desc().
This patch fixes GH #1865 by adding an extra check before inserting
'srv->addr_node' into 'be->used_server_addr'. Insertion and removal
will be skipped if 'addr_node.key' is NULL.
It must be backported to 2.6 and 2.5 only.
When parsing a peers section, it's particularly difficult to make the
difference between the local peer which doesn't have any address, and
other peers which need one, and the error messages do not help because
with just:
  peers foo
      bind :8001
      server foo 127.0.0.1:8001
      server bar 127.0.0.2:8001
One can get such a confusing message when the local peer is "bar":
[peers.cfg:15] : 'server foo/bar' : unknown keyword '127.0.0.1:8001'.
It's not clear there why the other peer doesn't trigger an error.
With this commit we add a hint in the error message when no address
was expected. The error remains quite generic (since deep into the
server code) but at least the user gets a hint about why the keyword
wasn't understood:
[peers.cfg:15] : 'server foo/bar' : unknown keyword '127.0.0.1:8001'.
Hint: no address was expected for this server.
There's no more reason for keeping the code and definitions in conn_stream,
let's move all that to stconn. The alphabetical ordering of include files
was adjusted.
This file contains all the stream-connector functions that are specific
to application layers of type stream. So let's name it accordingly so
that it's easier to figure what's located there.
The alphabetical ordering of include files was preserved.
This applies the change so that the applet code stops using ci_putchk()
and friends everywhere possible, for the much safer applet_put*() instead.
The change is mechanical but large. Two or three functions used to have no
appctx and a cs derived from the appctx instead, which was a reminiscence
of old times' stream_interface. These were simply changed to directly take
the appctx. No sensitive change was performed, and the old (more complex)
API is still usable when needed (e.g. the channel is already known).
The change touched roughly a hundred locations, with no less than 124
lines removed.
It's worth noting that the stats applet, the oldest of the series, could
get a serious lifting, as it's still very channel-centric instead of
propagating the appctx along the chain. Given that this code doesn't
change often, there's no emergency to clean it up but it would look
better.
This renames the "struct conn_stream" to "struct stconn" and updates
the descriptions in all comments (and the rare help descriptions) to
"stream connector" or "connector". This touches a lot of files but
the change is minimal. The local variables were not even renamed, so
there's still a lot of "cs" everywhere.
This one is the pointer to the conn_stream which is always in the
endpoint that is always present in the appctx, thus it's not needed.
This patch removes it and replaces it with appctx_cs() instead. A
few occurrences that were using __cs_strm(appctx->owner) were moved
directly to appctx_strm() which does the equivalent.
Remaining flags and associated functions are moved into the conn-stream
scope. These flags are added on the endpoint and not the conn-stream
itself. This way it will be possible to get them from the mux or the
applet. The functions to get or set these flags are renamed accordingly with
the "cs_" prefix and updated to manipulate a conn-stream instead of a
stream-interface.
At many places, we now use the new CS functions to get a stream or a channel
from a conn-stream instead of using the stream-interface API. It is the
first step to reduce the scope of the stream-interfaces. The main change
here is about the applet I/O callback functions. Before the refactoring, the
stream-interface was the appctx owner. Thus, it was heavily used. Now, as
far as possible, the conn-stream is used. Of course, there remain many calls to
the stream-interface API.
All registered config keywords that are valid in the config parser are
dumped to stdout organized like the regular sections (global, listen,
etc). Some keywords that are known to only be valid in frontends or
backends will be suffixed with [FE] or [BE].
All regularly registered "bind" and "server" keywords are also dumped,
one per "bind" or "server" line. Those depending on ssl are listed after
the "ssl" keyword. Doing so required to export the listener and server
keyword lists that were static.
The function is called from dump_registered_keywords() for keyword
class "cfg".
Dynamic servers feature is now judged to be stable enough. Remove the
experimental-mode requirement for "add/del server" commands. This should
facilitate dynamic servers adoption.
Because appctx is now an endpoint of the conn-stream, there is no reason to
still have the stream-interface as appctx owner. Thus, the conn-stream is
now the appctx owner.
While giving a fresh try to `set server ssl` (which I wrote), I realised
the behavior is a bit inconsistent. Indeed when using this command over
a server with ssl enabled for the data path but also for the health
check path we have:
- data and health check done using tls
- emit `set server be_foo/srv0 ssl off`
- data path and health check path becomes plain text
- emit `set server be_foo/srv0 ssl on`
- data path becomes tls and health check path remains plain text
while I thought the end result would be:
- data path and health check path comes back in tls
In the current code we indeed erase all connections while deactivating,
but restore only the data path while activating. I made this mistake in
the past because I was testing with a case where the health check was plain
text by default.
There are several ways to solve this issue. The cleanest one would
probably be to avoid changing the health check connection when we use
`set server ssl` command, and create a new command `set server
ssl-check` to change this. For now I assumed this would be ok to simply
avoid changing the health check path and be more consistent.
This patch tries to address that and also update the documentation. It
should not break the existing usage with health check on plain text, as
in this case they should have `no-check-ssl` in defaults. Without this
patch, it makes the command unusable in an env where you have a list of
servers to add along the way with initial `server-template`, and all
using tls for data and healthcheck path.
For 2.6 we should probably reconsider and add `set server ssl-check`
command for better granularity of cases.
If this solution is accepted, this patch should be backported up to >=
2.4.
The alternative solution was to restore the previous state, but I
believe this will create even more confusion in the future.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
This bug was introduced by d817dc73 ("MEDIUM: ssl: Load client
certificates in a ckch for backend servers") in which the creation of
the SSL_CTX for a server was moved to the configuration parser when
using a "crt" keyword instead of being done in ssl_sock_prepare_srv_ctx().
The patch 0498fa40 ("BUG/MINOR: ssl: Default-server configuration ignored by
server") made it worse by setting the same SSL_CTX for every server
using a default-server, resulting in any SSL option on a server being applied
to every server in its backend.
This patch fixes the issue by reintroducing a string which stores the
path of certificate inside the server structure, and loading the
certificate in ssl_sock_prepare_srv_ctx() again.
This is a quick fix to backport, a cleaner way can be achieved by always
creating the SSL_CTX in ssl_sock_prepare_srv_ctx() and splitting
properly the ssl_sock_load_srv_cert() function.
This patch fixes issue #1488.
Must be backported as far as 2.4.
When a server is dynamically added via the CLI with a custom id, the key
used to insert it in the backend's tree of used names is not initialized.
The server id must be used but it is only used when no custom id is
provided. Thus, with a custom id, HAProxy crashes.
Now, the server id is always used to init this key, to be able to insert the
server in the corresponding tree.
This patch should fix the issue #1481. It must be backported as far as 2.4.
During post-parsing stage, the SSL context of a server is initialized if SSL
is configured on the server or its default-server. It is required to be able
to enable SSL at runtime. However a regression was introduced, because the
last parsed default-server is used. But it is not necessarily the
default-server line used to configure the server. This may lead to
erroneously initialize the SSL context for a server without SSL parameter or
to skip it while it should be done.
The problem is the default-server used to configure a server is not saved
during configuration parsing. So, the information is lost during the
post-parsing. To fix the bug, the SRV_F_DEFSRV_USE_SSL flag is
introduced. It is used to know when a server was initialized with a
default-server using SSL.
For the record, the commit f63704488e ("MEDIUM: cli/ssl: configure ssl on
server at runtime") has introduced the bug.
This patch must be backported as far as 2.4.
This patch adds a union and struct into dns_counter struct to split
application specific counters.
The only current existing application is the resolver.c layer but
in the future we could handle different applications such as dns load
balancing with other specific counters.
This patch should not be backported.
Implement parsing for the server keyword 'ws'. This is used to configure
the mode of selection for websocket protocol. The configuration
documentation has been updated.
A new regtest has been created to test the proper behavior of the
keyword.
Handle properly websocket streams if the server uses an ALPN with both
h1 and h2. Add a new field h2_ws in the server structure. If set to off,
reuse is automatically disabled on backend and ALPN is forced to http1.x
if possible. Nothing is done if on.
Implement a mechanism to be able to use a different http version for
websocket streams. A new server member <ws> represents the algorithm to
select the protocol. This can override the server <proto>
configuration. If the connection uses ALPN for proto selection, it is
updated for websocket streams to select the right protocol.
Three modes of selection are implemented :
- auto : use the same protocol between non-ws and ws streams. If ALPN is
used, try to update it to "http/1.1"; this is only done if the server
ALPN contains "http/1.1".
- h1 : use http/1.1
- h2 : use http/2.0; this requires the server to support RFC8441 or an
error will be returned by haproxy.
At a few places we were still using protocol_by_family() instead of
the richer protocol_lookup(). The former is limited as it enforces
SOCK_STREAM and a stream protocol at the control layer. At least with
protocol_lookup() we don't have this limitation. The values were still
set for now but later we can imagine making them configurable on the
fly.
This one was used to indicate whether the callee had to follow particularly
safe code path when removing resolutions. Since the code now uses a kill
list, this is not needed anymore.
This change is required to support TCP/HTTP rules in defaults sections. The
'disabled' bitfield in the proxy structure, used to know if a proxy is
disabled or stopped, is replaced by a generic bitfield named 'flags'.
PR_DISABLED and PR_STOPPED flags are renamed to PR_FL_DISABLED and
PR_FL_STOPPED respectively. In addition, everywhere there is a test to know
if a proxy is disabled or stopped, there is now a bitwise AND operation on
PR_FL_DISABLED and/or PR_FL_STOPPED flags.
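The resulting test pattern can be sketched as follows. The bit values and the structure are assumptions for the example; only the flag names come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Bit values are assumptions made for this sketch. */
#define PR_FL_DISABLED 0x01 /* proxy disabled in the configuration */
#define PR_FL_STOPPED  0x02 /* proxy stopped at runtime */

struct sketch_proxy {
    unsigned int flags; /* PR_FL_* */
};

/* Returns non-zero if the proxy is either disabled or stopped. */
int sketch_proxy_is_inactive(const struct sketch_proxy *px)
{
    assert(px != NULL);
    return !!(px->flags & (PR_FL_DISABLED | PR_FL_STOPPED));
}
```

A generic flags field leaves room for unrelated bits, which a dedicated 'disabled' bitfield could not hold.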
This function is bogus at the API level: it demands that the input string
is zero-terminated *and* that its length *including* the trailing zero is
passed on input. While that already looks smelly, the trailing zero is
copied as-is, and is then explicitly replaced with a zero... Not only
all callers have to pass hostname_len+1 everywhere to work around this
absurdity, but this requirement causes a bug in the do-resolve() action
that passes random string lengths on input, and that will be fixed on a
subsequent patch.
Let's fix this API issue for now.
This patch will have to be backported, and in versions 2.3 and older,
the function is in dns.c and is called dns_str_to_dn_label().
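For illustration, a conversion taking the plain string length (without the trailing zero) could look like the sketch below. It is not the actual function: the name, the exact return value and the rejection of empty labels are choices made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Turns "www.example.org" into "\3www\7example\3org" followed by a
 * trailing zero. <str_len> does NOT include any trailing zero, and
 * <dn> must hold at least <str_len> + 2 bytes. Returns the number of
 * bytes written before the trailing zero, or -1 on error.
 */
int sketch_str_to_dn_label(const char *str, size_t str_len,
                           char *dn, size_t dn_len)
{
    size_t i;
    size_t start = 0; /* offset in <dn> of the current label's length byte */

    assert(str && dn);
    if (dn_len < str_len + 2)
        return -1;

    for (i = 0; i <= str_len; i++) {
        if (i == str_len || str[i] == '.') {
            size_t lbl = i - start; /* length of the label that just ended */

            if (lbl == 0 || lbl > 63)
                return -1;
            dn[start] = (char)lbl;
            start = i + 1;
        } else {
            /* characters are shifted by one to leave room for the
             * first length byte; each dot slot receives a length.
             */
            dn[i + 1] = str[i];
        }
    }
    dn[str_len + 1] = '\0';
    return (int)(str_len + 1);
}
```

Callers then pass the hostname length as-is instead of hostname_len+1.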
The following functions are quite heavy and have no reason to be kept
inlined:
srv_release_conn, srv_lookup_conn, srv_lookup_conn_next,
srv_add_to_idle_list
They were moved to server.c. It's worth noting that they're a bit
at the edge between server and connection and that maybe we could
create an idle-conn file for these in the near future.
This one has nothing to do with ssl_sock as it manipulates the struct
server only. Let's move it to server.c and remove unneeded dependencies
on ssl_sock.h. This further reduces by 10% the number of includes of
opensslconf.h and by 0.5% the number of compiled lines.
The SSL stuff in struct server takes less than 3% of it and requires
lots of annoying ifdefs in the code just to take care of the cases
where the field is absent. Let's get rid of this and stop including
openssl-compat from server.c to detect NPN and ALPN capabilities.
This reduces the total LoC by another 0.4%.
We'll need to improve the API to pass other arguments in the future, so
let's start to adapt better to the current use cases. task_new() is used:
- 18 times as task_new(tid_bit)
- 18 times as task_new(MAX_THREADS_MASK)
- 2 times with a single bit (in a loop)
- 1 in the debug code that uses a mask
This patch provides 3 new functions to achieve this:
- task_new_here() to create a task on the calling thread
- task_new_anywhere() to create a task to be run anywhere
- task_new_on() to create a task to run on a specific thread
The change is trivial and will allow us to later concentrate the
required adaptations to these 3 functions only. It's still possible
to call task_new() if needed but a comment was added to encourage the
use of the new ones instead. The debug code was not changed and still
uses it.
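The three helpers can be pictured as thin wrappers around the mask-based call. Everything below is a stand-in written for this sketch (task structure, allocator, current thread bit, mask constant); only the idea of mapping each helper to one kind of mask comes from the description above.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define SKETCH_MAX_THREADS_MASK (~0UL) /* stand-in for "all threads" */

struct sketch_task {
    unsigned long thread_mask; /* threads allowed to run the task */
};

/* Stand-in for the calling thread's bit: pretend we run on thread 2. */
static unsigned long sketch_tid_bit = 1UL << 2;

/* Mask-based constructor, the only one that existed before. */
struct sketch_task *sketch_task_new(unsigned long mask)
{
    struct sketch_task *t = malloc(sizeof(*t));

    if (t)
        t->thread_mask = mask;
    return t;
}

/* Task bound to the calling thread. */
struct sketch_task *sketch_task_new_here(void)
{
    return sketch_task_new(sketch_tid_bit);
}

/* Task that may run on any thread. */
struct sketch_task *sketch_task_new_anywhere(void)
{
    return sketch_task_new(SKETCH_MAX_THREADS_MASK);
}

/* Task bound to thread number <thr>. */
struct sketch_task *sketch_task_new_on(unsigned int thr)
{
    assert(thr < sizeof(unsigned long) * CHAR_BIT);
    return sketch_task_new(1UL << thr);
}
```

With callers expressing intent rather than building masks, a later change of the underlying representation only has to touch these three functions.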
Enable the 'slowstart' keyword for dynamic servers. The slowstart task
is allocated in 'add server' handler if slowstart is used.
As the server is created in disabled state, there is no need to start
the task. The slowstart task will be automatically started on the first
'enable server' invocation.
'slowstart' can be used without check on a server, with the CLI handlers
'enable/disable server'. Move the code to initialize and start the
slowstart task outside of check.c.
This change will also be reused to enable slowstart for dynamic servers.
Allow to use the check related keywords defined in server.c. These
keywords can be enabled now that checks have been implemented for
dynamic servers.
Here is the list of the new keywords supported :
- error-limit
- observe
- on-error
- on-marked-down
- on-marked-up
Allow to configure ssl support for dynamic server checks independently
of the ssl server configuration. This is done via the keyword
"check-ssl". Also enable to configure the sni/alpn used for the check
via "check-sni/alpn".
The ssl context is not initialized for a dynamic server, even if there
is a tcpcheck rule which uses ssl on the related backend. This will cause
the check initialization to fail with the message :
"Out of memory when initializing an SSL connection"
This can be reproduced by having the following config in the backend :
option tcp-check
tcp-check connect ssl
and create a dynamic server with check activated and a ca-file.
Fix this by calling the prepare_srv xprt callback when the proxy option
PR_O_TCPCHK_SSL is set.
Check support for dynamic servers has been merged in the current branch.
No backport needed.
Test that checks have been configured on the server before enabling via
the 'enable health' CLI. This mirrors the 'enable agent' command.
Without this, a user can use the command on the server without checks.
This leaves the server in an undefined state. Notably, the stat page
reports the server in check transition.
This condition was left out in the following reorg commit:
2c04eda8b5
REORG: cli: move "{enable|disable} health" to server.c
This should be backported up to 1.8.