fd_stop_recv() has nothing to do in the generic listener code, it's per
protocol as some don't need it. For instance with abns@ it could even
lead to fd_stop_recv(-1). And later with QUIC we don't want to touch
the fd at all! It used to be that since commit f2cb169487 delegating
fd manipulation to their respective threads it wasn't possible to call
it down there but it's not the case anymore, so let's perform the action
in the protocol-specific code.
This function is used as a wrapper to set a listener's state everywhere.
We'll use it later to maintain some counters in a consistent state when
switching state so it's capital that all state changes go through it.
No functional change was made beyond calling the wrapper.
This one will be needed to more accurately select a protocol. It may
differ from the socket type for QUIC, which uses dgram at the socket
layer and provides stream at the control layer. The upper level requests
a control layer only so we need this field.
This removes the following fields from struct protocol that are now
retrieved from the protocol family instead: .sock_family, .sock_addrlen,
.l3_addrlen, .addrcmp, .bind, .get_src, .get_dst.
This also removes the UDP-specific udp{,6}_get_{src,dst}() functions
which were referenced but not used yet. Their goal was only to remap
the original AF_INET* addresses to AF_CUST_UDP*.
Note that .sock_domain is still there as it's used as a selector for
the protocol struct to be used.
We need to specially handle protocol families which regroup common
functions used for a given address family. These functions include
bind(), addrcmp(), get_src() and get_dst() for now. Some fields are
also added about the address family, socket domain (protocol family
passed to the socket() syscall), and address length.
These protocol families are referenced from the protocols but not yet
used.
All protocol's listeners now only take care of themselves and not of
the receiver anymore since that's already being done in proto_bind_all().
Now it finally becomes obvious that UDP doesn't need a listener, as the
only thing it does is to set the listener's state to LI_LISTEN!
This removes all the AF_UNIX-specific code from uxst_bind_listener()
and now simply relies on sock_unix_bind_listener() to do the same
job. As mentionned in previous commit, the only difference is that
now an unlikely failure on listen() will not result in a roll back
of the temporary socket names since they will have been renamed
during the bind() operation (as expected). But such failures do not
correspond to any normal case and mostly denote operating system
issues so there's no functionality loss here.
This function performs all the bind-related stuff for UNIX sockets that
was previously done in uxst_bind_listener(). There is a very tiny
difference however, which is that previously, in the unlikely event
where listen() would fail, it was still possible to roll back the binding
and rename the backup to the original socket. Now we have to rename it
before calling returning, hence it will be done before calling listen().
However, this doesn't cover any particular use case since listen() has no
reason to fail there (and the rollback is not done for inherited sockets),
that was just done that way as a generic error processing path.
The code is not used yet and is referenced in the uxst proto's ->bind().
It's the receiver's FD that's inherited from the parent process, not
the listener's so the flag must move to the receiver so that appropriate
actions can be taken.
In order to split the receiver from the listener, we'll need to know that
a socket is already bound and ready to receive. We used to do that via
tha LI_O_ASSIGNED state but that's not sufficient anymore since the
receiver might not belong to a listener anymore. The new RX_F_BOUND flag
is used for this.
Some socket settings used to be retrieved via the listener and the
bind_conf. Now instead we use the receiver and its settings whenever
appropriate. This will simplify the removal of the dependency on the
listener.
The receiver is the one which depends on the protocol while the listener
relies on the receiver. Let's move the protocol there. Since there's also
a list element to get back to the listener from the proto list, this list
element (proto_list) was moved as well. For now when scanning protos, we
still see listeners which are linked by their rx.proto_list part.
The listening socket is represented by its file descriptor, which is
generic to all receivers and not just listeners, so it must move to
the rx struct.
It's worth noting that in order to extend receivers and listeners to
other protocols such as QUIC, we'll need other handles than file
descriptors here, and that either a union or a cast to uintptr_t
will have to be used. This was not done yet and the field was
preserved under the name "fd" to avoid adding confusion.
There currently is a large inconsistency in how binding parameters are
split between bind_conf and listeners. It happens that for historical
reasons some parameters are available at the listener level but cannot
be configured per-listener but only for a bind_conf, and thus, need to
be replicated. In addition, some of the bind_conf parameters are in fact
for the listening socket itself while others are for the instanciated
sockets.
A previous attempt at splitting listeners into receivers failed because
the boundary between all these settings is not well defined.
This patch introduces a level of listening socket settings in the
bind_conf, that will be detachable later. Such settings that are solely
for the listening socket are:
- unix socket permissions (used only during binding)
- interface (used for binding)
- network namespace (used for binding)
- process mask and thread mask (used during startup)
The rest seems to be used only to initialize the resulting sockets, or
to control the accept rate. For now, only the unix params (bind_conf->ux)
were moved there.
This is essentially a merge from tcp_find_compatible_fd() and
uxst_find_compatible_fd() that relies on a listener's address and
compare function and still checks for other variations. For AF_INET6
it compares a few of the listener's bind options. A minor change for
UNIX sockets is that transparent mode, interface and namespace used
to be ignored when trying to pick a previous socket while now if they
are changed, the socket will not be reused. This could be refined but
it's still better this way as there is no more risk of using a
differently bound socket by accident.
Eventually we should not pass a listener there but a set of binding
parameters (address, interface, namespace etc...) which ultimately will
be grouped into a receiver. For now this still doesn't exist so let's
stick to the listener to break dependencies in the rest of the code.
The new addrcmp() protocol member points to the function to be used to
compare two addresses of the same family.
When picking an FD from a previous process, we can now use the address
specific address comparison functions instead of having to rely on a
local implementation. This will help move that code to a more central
place.
The new file sock.c will contain generic code for standard sockets
relying on file descriptors. We currently have way too much duplication
between proto_uxst, proto_tcp, proto_sockpair and proto_udp.
For now only get_src, get_dst and sock_create_server_socket were moved,
and are used where appropriate.
Let's finish the cleanup and get rid of all bind and server keywords
parsers from proto_uxst.c. They're now moved to cfgparse-unix.c. Now
proto_uxst.c is clean and only contains code related to binding and
connecting.
This new flag will be used to mark FDs that must be passed to any future
process across the CLI's "_getsocks" command.
The scheme here is quite complex and full of special cases:
- FDs inherited from parent processes are *not* exported this way, as
they are supposed to instead be passed by the master process itself
across reloads. However such FDs ought never to be paused otherwise
this would disrupt the socket in the parent process as well;
- FDs resulting from a "bind" performed over a socket pair, which are
in fact one side of a socket pair passed inside another control socket
pair must not be passed either. Since all of them are used the same
way, for now it's enough never to put this "exported" flag to FDs
bound by the socketpair code.
- FDs belonging to temporary listeners (e.g. a passive FTP data port)
must not be passed either. Fortunately we don't have such FDs yet.
- the rest of the listeners for now are made of TCP, UNIX stream, ABNS
sockets and are exportable, so they get the flag.
- UDP listeners were wrongly created as listeners and are not suitable
here. Their FDs should be passed but for now they are not since the
client doesn't even distinguish the SO_TYPE of the retrieved sockets.
In addition, it's important to keep in mind that:
- inherited FDs may never be closed in master process but may be closed
in worker processes if the service is shut down (useless since still
bound, but technically possible) ;
- inherited FDs may not be disabled ;
- exported FDs may be disabled because the caller will perform the
subsequent listen() on them. However that might not work for all OSes
- exported FDs may be closed, it just means the service was shut down
from the worker, and will be rebound in the new process. This implies
that we have to disable exported on close().
=> as such, contrary to an apparently obvious equivalence, the "exported"
status doesn't imply anything regarding the ability to close a
listener's FD or not.
When a connect() doesn't immediately succeed (i.e. most of the times),
fd_cant_send() is called to enable polling. But given that we don't
mark that we cannot receive either, we end up performing a failed
recvfrom() immediately when the connect() is finally confirmed, as
indicated in issue #253.
This patch simply adds fd_cant_recv() as well so that we're only
notified once the recv path is ready. The reason it was not there
is purely historic, as in the past when there was the fd cache,
doing it would have caused a pending recv request to be placed into
the fd cache, hence a useless recvfrom() upon success (i.e. what
happens now).
Without this patch, forwarding 100k connections does this:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
17.51 0.704229 7 100000 100000 connect
16.75 0.673875 3 200000 sendto
16.24 0.653222 3 200036 close
10.82 0.435082 1 300000 100000 recvfrom
10.37 0.417266 1 300012 setsockopt
7.12 0.286511 1 199954 epoll_ctl
6.80 0.273447 2 100000 shutdown
5.34 0.214942 2 100005 socket
4.65 0.187137 1 105002 5002 accept4
3.35 0.134757 1 100004 fcntl
0.61 0.024585 4 5858 epoll_wait
With the patch:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
18.04 0.697365 6 100000 100000 connect
17.40 0.672471 3 200000 sendto
17.03 0.658134 3 200036 close
10.57 0.408459 1 300012 setsockopt
7.69 0.297270 1 200000 recvfrom
7.32 0.282934 1 199922 epoll_ctl
7.09 0.274027 2 100000 shutdown
5.59 0.216041 2 100005 socket
4.87 0.188352 1 104697 4697 accept4
3.35 0.129641 1 100004 fcntl
0.65 0.024959 4 5337 1 epoll_wait
Note the total disappearance of 1/3 of failed recvfrom() *without*
adding any extra syscall anywhere else.
The trace of an HTTP health check is now totally clean, with no useless
syscall at all anymore:
09:14:21.959255 connect(9, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
09:14:21.959292 epoll_ctl(4, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=9, u64=9}}) = 0
09:14:21.959315 epoll_wait(4, [{EPOLLOUT, {u32=9, u64=9}}], 200, 1000) = 1
09:14:21.959376 sendto(9, "OPTIONS / HTTP/1.0\r\ncontent-leng"..., 41, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 41
09:14:21.959436 epoll_wait(4, [{EPOLLOUT, {u32=9, u64=9}}], 200, 1000) = 1
09:14:21.959456 epoll_ctl(4, EPOLL_CTL_MOD, 9, {EPOLLIN|EPOLLRDHUP, {u32=9, u64=9}}) = 0
09:14:21.959512 epoll_wait(4, [{EPOLLIN|EPOLLRDHUP, {u32=9, u64=9}}], 200, 1000) = 1
09:14:21.959548 recvfrom(9, "HTTP/1.0 200\r\nContent-length: 0\r"..., 16320, 0, NULL, NULL) = 126
09:14:21.959570 close(9) = 0
With the edge-triggered poller, it gets even better:
09:29:15.776201 connect(9, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
09:29:15.776256 epoll_ctl(4, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=9, u64=9}}) = 0
09:29:15.776287 epoll_wait(4, [{EPOLLOUT, {u32=9, u64=9}}], 200, 1000) = 1
09:29:15.776320 sendto(9, "OPTIONS / HTTP/1.0\r\ncontent-leng"..., 41, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 41
09:29:15.776374 epoll_wait(4, [{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=9, u64=9}}], 200, 1000) = 1
09:29:15.776406 recvfrom(9, "HTTP/1.0 200\r\nContent-length: 0\r"..., 16320, 0, NULL, NULL) = 126
09:29:15.776434 close(9) = 0
It could make sense to backport this patch to 2.2 and maybe 2.1 after
it has been sufficiently checked for absence of side effects in 2.3-dev,
as some people had reported an extra overhead like in issue #168.
When building with gcc-8 -fsanitize=address, we get this warning once on
an strncpy() call in proto_uxst.c:
src/proto_uxst.c:262:3: warning: 'strncpy' output may be truncated copying 107 bytes from a string of length 4095 [-Wstringop-truncation]
strncpy(addr.sun_path, tempname, sizeof(addr.sun_path) - 1);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It happens despite the test on snprintf() at the top (since gcc's string
handling is totally empiric), and requires the strlen() test to be placed
"very close" to the strncpy() call (with "very close" yet to be determined).
There's no other way to shut this one except disabling it. Given there's
only one instance of this warning and the cost of dealing with it in the
code is not huge, let's decorate the code to make gcc happily believe it
is smart since it seems to have a mind of itself.
This patch fixes all the leftovers from the include cleanup campaign. There
were not that many (~400 entries in ~150 files) but it was definitely worth
doing it as it revealed a few duplicates.
The current state of the logging is a real mess. The main problem is
that almost all files include log.h just in order to have access to
the alert/warning functions like ha_alert() etc, and don't care about
logs. But log.h also deals with real logging as well as log-format and
depends on stream.h and various other things. As such it forces a few
heavy files like stream.h to be loaded early and to hide missing
dependencies depending where it's loaded. Among the missing ones is
syslog.h which was often automatically included resulting in no less
than 3 users missing it.
Among 76 users, only 5 could be removed, and probably 70 don't need the
full set of dependencies.
A good approach would consist in splitting that file in 3 parts:
- one for error output ("errors" ?).
- one for log_format processing
- and one for actual logging.
The type file is becoming a mess, half of it is for the proxy protocol,
another good part describes conn_streams and mux ops, it would deserve
being split again. At least it was reordered so that elements are easier
to find, with the PP-stuff left at the end. The MAX_SEND_FD macro was moved
to compat.h as it's said to be the value for Linux.
The TASK_IS_TASKLET() macro was moved to the proto file instead of the
type one. The proto part was a bit reordered to remove a number of ugly
forward declaration of static inline functions. About a tens of C and H
files had their dependency dropped since they were not using anything
from task.h.
global.h was one of the messiest files, it has accumulated tons of
implicit dependencies and declares many globals that make almost all
other file include it. It managed to silence a dependency loop between
server.h and proxy.h by being well placed to pre-define the required
structs, forcing struct proxy and struct server to be forward-declared
in a significant number of files.
It was split in to, one which is the global struct definition and the
few macros and flags, and the rest containing the functions prototypes.
The UNIX_MAX_PATH definition was moved to compat.h.
A few includes were missing in each file. A definition of
struct polled_mask was moved to fd-t.h. The MAX_POLLERS macro was
moved to defaults.h
Stdio used to be silently inherited from whatever path but it's needed
for list_pollers() which takes a FILE* and which can thus not be
forward-declared.
And also rename standard.c to tools.c. The original split between
tools.h and standard.h dates from version 1.3-dev and was mostly an
accident. This patch moves the files back to what they were expected
to be, and takes care of not changing anything else. However this
time tools.h was split between functions and types, because it contains
a small number of commonly used macros and structures (e.g. name_desc)
which in turn cause the massive list of includes of tools.h to conflict
with the callers.
They remain the ugliest files of the whole project and definitely need
to be cleaned and split apart. A few types are defined there only for
functions provided there, and some parts are even OS-specific and should
move somewhere else, such as the symbol resolution code.
The protocol.h files are pretty low in the dependency and (sadly) used
by some files from common/. Almost nothing was changed except lifting a
few comments.
This one is included almost everywhere and used to rely on a few other
.h that are not needed (unistd, stdlib, standard.h). It could possibly
make sense to split it into multiple parts to distinguish operations
performed on timers and the internal time accounting, but at this point
it does not appear much important.
Half of the users of this include only need the type definitions and
not the manipulation macros nor the inline functions. Moves the various
types into mini-clist-t.h makes the files cleaner. The other one had all
its includes grouped at the top. A few files continued to reference it
without using it and were cleaned.
In addition it was about time that we'd rename that file, it's not
"mini" anymore and contains a bit more than just circular lists.
This one used to be stored into debug.h but the debug tools got larger
and require a lot of other includes, which can't use BUG_ON() anymore
because of this. It does not make sense and instead this macro should
be placed into the lower includes and given its omnipresence, the best
solution is to create a new bug.h with the few surrounding macros needed
to trigger bugs and place assertions anywhere.
Another benefit is that it won't be required to add include <debug.h>
anymore to use BUG_ON, it will automatically be covered by api.h. No
less than 32 occurrences were dropped.
The FSM_PRINTF macro was dropped since not used at all anymore (probably
since 1.6 or so).
All files that were including one of the following include files have
been updated to only include haproxy/api.h or haproxy/api-t.h once instead:
- common/config.h
- common/compat.h
- common/compiler.h
- common/defaults.h
- common/initcall.h
- common/tools.h
The choice is simple: if the file only requires type definitions, it includes
api-t.h, otherwise it includes the full api.h.
In addition, in these files, explicit includes for inttypes.h and limits.h
were dropped since these are now covered by api.h and api-t.h.
No other change was performed, given that this patch is large and
affects 201 files. At least one (tools.h) was already freestanding and
didn't get the new one added.
Just like previous commit, we don't need to pass through the connection
layer anymore to enable polling during a connect(), we know the FD, so
let's simply call fd_want_send().
When intializing a listener, let's make sure the bind_thread mask is
always limited to all_threads_mask when inserting the FD. This will
avoid seeing listening FDs with bits corresponding to threads that are
not active (e.g. when using "bind ... process 1/even"). The side effect
is very limited, all that was identified is that atomic operations are
used in fd_update_events() when not necessary. It's more a matter of
long-term correctness in practice.
This fix might be backported as far as 1.8 (then proto_sockpair must
be dropped).
We do have some checks for the UNIX socket path length to validate the
full pathname of a unix socket but the pathname extension is only taken
into account when using a bind_prefix. The second check only matches
against MAXPATHLEN. So this means that path names between 98 and 108
might successfully parse but fail to bind. Let's adjust the check in
the address parser and refine the error checking at the bind() step.
This addresses bug #493.
Apparently gcc developers decided that strncpy() semantics are no longer
valid and now deserve a warning, especially if used exactly as designed.
This results in issue #304. Let's just remove one to the target size to
please her majesty gcc, the God of C Compilers, who tries hard to make
users completely eliminate any use of string.h and reimplement it by
themselves at much higher risks. Pfff....
This can be backported to stable version, the fix is harmless since it
ignores the last zero that is already set on next line.
In order to address the absurd polling sequence described in issue #253,
let's make sure we disable receiving on a connection until it's established.
Previously with bottom-top I/Os, we were almost certain that a connection
was ready when the first I/O was confirmed. Now we can enter various
functions, including process_stream(), which will attempt to read
something, will fail, and will then subscribe. But we don't want them
to try to receive if we know the connection didn't complete. The first
prerequisite for this is to mark the connection as not ready for receiving
until it's validated. But we don't want to mark it as not ready for sending
because we know that attempting I/Os later is extremely likely to work
without polling.
Once the connection is confirmed we re-enable recv readiness. In order
for this event to be taken into account, the call to tcp_connect_probe()
was moved earlier, between the attempt to send() and the attempt to recv().
This way if tcp_connect_probe() enables reading, we have a chance to
immediately fall back to this and read the possibly pending data.
Now the trace looks like the following. It's far from being perfect
but we've already saved one recvfrom() and one epollctl():
epoll_wait(3, [], 200, 0) = 0
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
connect(7, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=7, u64=7}}) = 0
epoll_wait(3, [{EPOLLOUT, {u32=7, u64=7}}], 200, 1000) = 1
connect(7, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
sendto(7, "OPTIONS / HTTP/1.0\r\n\r\n", 22, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 22
epoll_ctl(3, EPOLL_CTL_MOD, 7, {EPOLLIN|EPOLLRDHUP, {u32=7, u64=7}}) = 0
epoll_wait(3, [{EPOLLIN|EPOLLRDHUP, {u32=7, u64=7}}], 200, 1000) = 1
getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(7, "HTTP/1.0 200\r\nContent-length: 0\r\nX-req: size=22, time=0 ms\r\nX-rsp: id=dummy, code=200, cache=1, size=0, time=0 ms (0 real)\r\n\r\n", 16384, 0, NULL, NULL) = 126
close(7) = 0
Dragan Dosen found that the listeners lock is not sufficient to protect
the listeners list when proxies are stopping because the listeners are
also unlinked from the protocol list, and under certain situations like
bombing with soft-stop signals or shutting down many frontends in parallel
from multiple CLI connections, it could be possible to provoke multiple
instances of delete_listener() to be called in parallel for different
listeners, thus corrupting the protocol lists.
Such operations are pretty rare, they are performed once per proxy upon
startup and once per proxy on shut down. Thus there is no point trying
to optimize anything and we can use a global lock to protect the protocol
lists during these manipulations.
This fix (or a variant) will have to be backported as far as 1.8.