In an attempt to try to provide automatic maxconn settings, we need to
decorrelate a listner's backlog and maxconn so that these values can be
independent. This introduces a listener_backlog() function which retrieves
the backlog value from the listener's backlog, the frontend's, the
listener's maxconn, the frontend's or falls back to 1024. This
corresponds to what was done in cfgparse.c to force a value there except
the last fallback which was not set since the frontend's maxconn is always
known.
Now that nbproc and nbthread are exclusive, we can still provide more
detailed explanations about what we've found in the config when a bind
line appears on multiple threads and processes at the same time, then
ignore the setting.
This patch reduces the listener's thread mask to a single mask instead
of an array of masks per process. Now we have only one thread mask and
one process mask per bind-conf. This removes ~504 bytes of RAM per
bind-conf and will simplify handling of thread masks.
If a "bind" line only refers to process numbers not found by its parent
frontend or not covered by the global nbproc directive, or to a thread
not covered by the global nbthread directive, a warning is emitted saying
what will be used instead.
If the master was reloaded and there was a established connection to a
server, the FD resulting from the accept was leaking.
There was no CLOEXEC flag set on the FD of the socketpair created during
a connect call. This is specific to the socketpair in the master process
but it should be applied to every protocol in case we use them in the
master at some point.
No backport needed.
This switches explicit calls to various trivial registration methods for
keywords, muxes or protocols from constructors to INITCALL1 at stage
STG_REGISTER. All these calls have in common to consume a single pointer
and return void. Doing this removes 26 constructors. The following calls
were addressed :
- acl_register_keywords
- bind_register_keywords
- cfg_register_keywords
- cli_register_kw
- flt_register_keywords
- http_req_keywords_register
- http_res_keywords_register
- protocol_register
- register_mux_proto
- sample_register_convs
- sample_register_fetches
- srv_register_keywords
- tcp_req_conn_keywords_register
- tcp_req_cont_keywords_register
- tcp_req_sess_keywords_register
- tcp_res_cont_keywords_register
- flt_register_keywords
In the various connect_server() functions, don't reset the connection flags,
as some may have been set before. The flags are initialized in conn_init(),
anyway.
This patch improves the previous fix by implementing the socket draining
code directly in conn_sock_drain() so that it always applies regardless
of the protocol's family. Thus it gets rid of tcp_drain().
Right now conn_sock_drain() calls the protocol's ->drain() function if
it exists, otherwise it simply tries to disable polling for receiving
on the connection. This doesn't work well anymore since we've implemented
the muxes in 1.8, and it has a side effect with keep-alive backend
connections established over unix sockets. What happens is that if
during the idle time after a request, a connection reports some data,
si_idle_conn_null_cb() is called, which will call conn_sock_drain().
This one sees there's no drain() on unix sockets and will simply disable
polling for data on the connection. But it doesn't do anything on the
conn_stream. Thus while leaving the conn_fd_handler, the mux's polling
is updated and recomputed based on the conn_stream's polling state,
which is still enabled, and nothing changes, so we see the process
use 100% CPU in this case because the FD remains active in the cache.
There are several issues that need to be addressed here. The first and
most important is that we cannot expect some protocols to simply stop
reading data when asked to drain pending data. So this patch make the
unix sockets rely on tcp_drain() since the functions are the same. This
solution is appropriate for backporting, but a better one is desired for
the long term. The second issue is that si_idle_conn_null_cb() shouldn't
drain the connection but the conn_stream.
At the moment we don't have any way to drain a conn_stream, though a flag
on rcv_buf() will do it well. Until we support muxes on the server side
it is not a problem so this part can be addressed later.
This fix must be backported to 1.8.
When checking if a socket we got from the parent is suitable for a listener,
we just checked that the path matched sockname.tmp, however this is
unsuitable for abns sockets, where we don't have to create a temporary
file and rename it later.
To detect that, check that the first character of the sun_path is 0 for
both, and if so, that &sun_path[1] is the same too.
This should be backported to 1.8.
When a listener is temporarily disabled, we start by locking it and then we call
.pause callback of the underlying protocol (tcp/unix). For TCP listeners, this
is not a problem. But listeners bound on an unix socket are in fact closed
instead. So .pause callback relies on unbind_listener function to do its job.
Unfortunatly, unbind_listener hold the listener's lock and then call an internal
function to unbind it. So, there is a deadlock here. This happens during a
reload. To fix the problemn, the function do_unbind_listener, which is lockless,
is now exported and is called when a listener bound on an unix socket is
temporarily disabled.
This patch must be backported in 1.8.
When removing the socket from the xfer_sock_list, we want to set
next->prev to prev, not to next->prev, which is useless.
This should be backported to 1.8.
fd_insert() is currently called just after setting the owner and iocb,
but proceeding like this prevents the operation from being atomic and
requires a lock to protect the maxfd computation in another thread from
meeting an incompletely initialized FD and computing a wrong maxfd.
Fortunately for now all fdtab[].owner are set before calling fd_insert(),
and the first lock in fd_insert() enforces a memory barrier so the code
is safe.
This patch moves the initialization of the owner and iocb to fd_insert()
so that the function will be able to properly arrange its operations and
remain safe even when modified to become lockless. There's no other change
beyond the internal API.
The listeners and connectors may complain that process-wide or
system-wide FD limits have been reached and will in this case report
maxfd as the limit. This is wrong in fact since there's no reason for
the whole FD space to be contiguous when the total # of FD is reached.
A better approach would consist in reporting the accurate number of
opened FDs, but this is pointless as what matters here is to give a
hint about what might be wrong. So let's simply report the configured
maxsock, which will generally explain why the process' limits were
reached, which is the most common reason. This removes another
dependency on maxfd.
These flags are not exactly for the data layer, they instead indicate
what is expected from the transport layer. Since we're going to split
the connection between the transport and the data layers to insert a
mux layer, it's important to have a clear idea of what each layer does.
All function conn_data_* used to manipulate these flags were renamed to
conn_xprt_*.
A config containing "stats socket /path/to/socket mode admin" used to
silently start and be unusable (mode 0, level user) because the "mode"
parser doesn't take care of non-digits. Now it properly reports :
[ALERT] 276/144303 (7019) : parsing [ext-check.cfg:4] : 'stats socket' : ''mode' : missing or invalid mode 'admin' (octal integer expected)'
This can probably be backported to 1.7, 1.6 and 1.5, though reporting
parsing errors in very old versions probably isn't a good idea if the
feature was left unused for years.
Since everything is self contained in proto_uxst.c there's no need to
export anything. The same should be done for proto_tcp.c but the file
contains other stuff that's not related to the TCP protocol itself
and which should first be moved somewhere else.
cfgparse has no business directly calling each individual protocol's 'add'
function to create a listener. Now that they're all registered, better
perform a protocol lookup on the family and have a standard ->add method
for all of them.
It's a shame that cfgparse() has to make special cases of each protocol
just to cast the port to the target address family. Let's pass the port
in argument to the function. The unix listener simply ignores it.
Till now connections used to rely exclusively on file descriptors. It
was planned in the past that alternative solutions would be implemented,
leading to member "union t" presenting sock.fd only for now.
With QUIC, the connection will need to continue to exist but will not
rely on a file descriptor but a connection ID.
So this patch introduces a "connection handle" which is either a file
descriptor or a connection ID, to replace the existing "union t". We've
now removed the intermediate "struct sock" which was never used. There
is no functional change at all, though the struct connection was inflated
by 32 bits on 64-bit platforms due to alignment.
James Brown reported some cases where a race condition happens between
the old and the new processes resulting in the leaving process removing
a newly bound unix socket. Jeff gave all the details he observed here :
https://www.mail-archive.com/haproxy@formilux.org/msg25001.html
The unix socket removal was an attempt at an optimal cleanup, which
almost never works anyway since the process is supposed to be chrooted.
And in the rare cases where it works it occasionally creates trouble.
There was already a workaround in place to avoid removing this socket
when it's been inherited from a parent's file descriptor.
So let's finally kill this useless stuff now to definitely get rid of
this persistent problem.
This fix should be backported to all stable releases.
Add a new command that will send all the listening sockets, via the
stats socket, and their properties.
This is a first step to workaround the linux problem when reloading
haproxy.
There's a test after a successful synchronous connect() consisting
in waking the data layer up asap if there's no more handshake.
Unfortunately this test is run before setting the CO_FL_SEND_PROXY
flag and before the transport layer adds its own flags, so it can
indicate a willingness to send data while it's not the case and it
will have to be handled later.
This has no visible effect except a useless call to a function in
case of health checks making use of the proxy protocol for example.
Additionally a corner case where EALREADY was returned and considered
equivalent to EISCONN was fixed so that it's considered equivalent to
EINPROGRESS given that the connection is not complete yet. But this
code should never return on the first call anyway so it's mostly a
cleanup.
This fix should be backported to 1.7 and 1.6 at least to avoid
headaches during some debugging.
When a connect() to a unix socket returns EAGAIN we talk about
"no free ports" in the error/debug message, which only makes
sense when using TCP.
Explain connect() failure and suggest troubleshooting server
backlog size.
Abstract namespace sockets ignore the shutdown() call and do not make
it possible to temporarily stop listening. The issue it causes is that
during a soft reload, the new process cannot bind, complaining that the
address is already in use.
This change registers a new pause() function for unix sockets and
completely unbinds the abstract ones since it's possible to rebind
them later. It requires the two previous patches as well as preceeding
fixes.
This fix should be backported into 1.5 since the issue apperas there.
Jan Seda noticed that abstract sockets are incompatible with soft reload,
because the new process cannot bind and immediately fails. This patch marks
the binding as retryable and not fatal so that the new process can try to
bind again after sending a signal to the old process.
Note that this fix is not enough to completely solve the problem, but it
is necessary. This patch should be backported to 1.5.
When bind() fails (function uxst_bind_listener()), the fail path doesn't
consider the abstract namespace and tries to unlink paths held in
uninitiliazed memory (tempname and backname). See the strace excerpt;
the strings still hold the path from test1.
===============================================================================================
23722 bind(5, {sa_family=AF_FILE, path=@"test2"}, 110) = -1 EADDRINUSE (Address already in use)
23722 unlink("/tmp/test1.sock.23722.tmp") = -1 ENOENT (No such file or directory)
23722 close(5) = 0
23722 unlink("/tmp/test1.sock.23722.bak") = -1 ENOENT (No such file or directory)
===============================================================================================
This patch should be backported to 1.5.
Plain "tcp" health checks sent to a unix socket cause two connect()
calls to be made, one to connect, and a second one to verify that the
connection properly established. But with unix sockets, we get
immediate notification of success, so we can avoid this second
attempt. However we need to ensure that we'll visit the connection
handler even if there's no remaining handshake pending, so for this
we claim we have some data to send in order to enable polling for
writes if there's no more handshake.
These sockets are the same as Unix sockets except that there's no need
for any filesystem access. The address may be whatever string both sides
agree upon. This can be really convenient for inter-process communications
as well as for chaining backends to frontends.
These addresses are forced by prepending their address with "abns@" for
"abstract namespace".
We've had everything in place for this for a while now, we just missed
the connect function for UNIX sockets. Note that in order to connect to
a UNIX socket inside a chroot, the path will have to be relative to the
chroot.
UNIX sockets connect about twice as fast as TCP sockets (or consume
about half of the CPU at the same rate). This is interesting for
internal communications between SSL processes and HTTP processes
for example, or simply to avoid allocating source ports on the
loopback.
The tcp_connect_probe() function is still used to probe a dataless
connection, but it is compatible so that's not an issue for now.
Health checks are not yet fully supported since they require a port.
Using the address syntax "fd@<num>", a listener may inherit a file
descriptor that the caller process has already bound and passed as
this number. The fd's socket family is detected using getsockname(),
and the usual initialization is performed through the existing code
for that family, but the socket creation is skipped.
Whether the parent has performed the listen() call or not is not
important as this is detected.
For UNIX sockets, we immediately clear the path after preparing a
socket so that we never remove it in case an abort would happen due
to a late error during startup.
Unix permissions are per-bind configuration line and not per listener,
so let's concretize this in the way the config is stored. This avoids
some unneeded loops to set permissions on all listeners.
The access level is not part of the unix perms so it has been moved
away. Once we can use str2listener() to set all listener addresses,
we'll have a bind keyword parser for this one.
Navigating through listeners was very inconvenient and error-prone. Not to
mention that listeners were linked in reverse order and reverted afterwards.
In order to definitely get rid of these issues, we now do the following :
- frontends have a dual-linked list of bind_conf
- frontends have a dual-linked list of listeners
- bind_conf have a dual-linked list of listeners
- listeners have a pointer to their bind_conf
This way we can now navigate from anywhere to anywhere and always find the
proper bind_conf for a given listener, as well as find the list of listeners
for a current bind_conf.
The "mode", "uid", "gid", "user" and "group" bind options were moved to
proto_uxst as they are unix-specific.
Note that previous versions had a bug here, only the last listener was
updated with the specified settings. However, it almost never happens
that bind lines contain multiple UNIX socket paths so this is not that
much of a problem anyway.
The "raw_sock" prefix will be more convenient for naming functions as
it will be prefixed with the data layer and suffixed with the data
direction. So let's rename the files now to avoid any further confusion.
The #include directive was also removed from a number of files which do
not need it anymore.
fdtab[].state was only used to know whether a connection was in progress
or an error was encountered. Instead we now use connection->flags to store
a flag for both. This way, connection management will be able to update the
connection status on I/O.
The destination address is purely a connection thing and not an fd thing.
It's also likely that later the address will be stored into the connection
and linked to by the SI.
struct fdinfo only keeps the pointer to the port range and the local port
for now. All of this also needs to move to the connection but before this
the release of the port range must move from fd_delete() to a new function
dedicated to the connection.