When the "process" setting of a bind line limits the processes a
listening socket is enabled on, a "disable frontend" operation followed
by an "enable frontend" triggers a bug because an attempt is made to bind
all declared listeners again, regardless of their assigned processes.
At minimum this can create new sockets that receive no traffic, and at
worst it can prevent a frontend from being re-enabled if it's bound to a
privileged port.
This bug was introduced by commit 1c4b814 ("MEDIUM: listener: support
rebinding during resume()") merged in 1.6-dev1, trying to perform the
bind() before checking the process list instead of after.
Just move the process check before the bind() operation to fix this.
This fix must be backported to 1.7 and 1.6.
Thanks to Pavlos for reporting this one.
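As an illustration only, the corrected order of operations can be sketched
as below. This is a simplified model and not the actual listener code: the
struct layout, fake_bind() and resume_listener_sketch() are hypothetical
names, and the process list is assumed to be a bitmask indexed by
(relative_pid - 1), with 0 meaning "all processes".

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for a listener. */
struct listener {
    unsigned long bind_proc; /* bitmask of allowed processes, 0 = all */
    int fd;                  /* -1 when the socket is not bound */
    int bind_calls;          /* counts simulated bind() attempts */
};

/* Simulated bind(): records the attempt and pretends a socket was opened. */
static int fake_bind(struct listener *l)
{
    l->bind_calls++;
    l->fd = 3;
    return 1;
}

/* Returns 1 on success (including "nothing to do here"), 0 on failure. */
int resume_listener_sketch(struct listener *l, int relative_pid)
{
    assert(relative_pid >= 1);

    /* Process check first: a listener assigned to other processes is
     * left alone and reported as a success, so no bind() is attempted.
     */
    if (l->bind_proc && !(l->bind_proc & (1UL << (relative_pid - 1))))
        return 1;

    /* Only now try to rebind a listener that was fully stopped. */
    if (l->fd < 0 && !fake_bind(l))
        return 0;

    return 1;
}
```

With a listener assigned to process 2 only, calling this from process 1
returns success without touching the socket, while calling it from
process 2 performs the rebind.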
Historically, all listeners have a pointer to the frontend. But since
the introduction of SSL, we now have an intermediary layer called
bind_conf corresponding to a "bind" line. It makes no sense to have
the frontend on each listener given that it's the same for all
listeners belonging to a same bind_conf. Also certain parts like
SSL can only operate on bind_conf and need the frontend.
This patch fixes this by moving the frontend pointer from the listener
to the bind_conf. The extra indirection is quite cheap and the
places where this is used are very scarce.
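A minimal sketch of the resulting layout, assuming heavily reduced
structures (the real ones carry many more fields, and the helper name
listener_frontend() is hypothetical):

```c
#include <stddef.h>

/* Reduced, illustrative versions of the structures involved. */
struct proxy {
    const char *id;
};

struct bind_conf {
    struct proxy *frontend;     /* now stored once per "bind" line */
};

struct listener {
    struct bind_conf *bind_conf; /* no direct frontend pointer anymore */
};

/* Hypothetical accessor: reach the frontend through the bind_conf. */
static inline struct proxy *listener_frontend(const struct listener *l)
{
    return l->bind_conf ? l->bind_conf->frontend : NULL;
}
```

All listeners created from the same "bind" line share one bind_conf, so
they necessarily report the same frontend.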
When a NetScaler application switch is used as an L3+ switch, information
regarding the original IP and TCP headers is lost as a new TCP
connection is created between the NetScaler and the backend server.
NetScaler provides a feature to insert in the TCP data the original data
that can then be consumed by the backend server.
Specifications and documentation from NetScaler:
  https://support.citrix.com/article/CTX205670
  https://www.citrix.com/blogs/2016/04/25/how-to-enable-client-ip-in-tcpip-option-of-netscaler/
When CIP is enabled on the NetScaler, a TCP packet is inserted just after
the TCP handshake. It is composed as follows:
  - CIP magic number : 4 bytes
    Both sender and receiver have to agree on a magic number so that
    they both handle the incoming data as a NetScaler Client IP insertion
    packet.
  - Header length : 4 bytes
    Defines the length of the remaining data.
  - IP header : >= 20 bytes if IPv4, 40 bytes if IPv6
    Contains the header of the last IP packet sent by the client during TCP
    handshake.
  - TCP header : >= 20 bytes
    Contains the header of the last TCP packet sent by the client during TCP
    handshake.
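The layout above can be walked with a small parser such as the sketch
below. It is not the code from the patch: cip_parse() and read_be32() are
hypothetical names, the two 4-byte fields are assumed to be in network
byte order, and the 65535-byte cap is an arbitrary sanity limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian value (the byte order is an assumption). */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical parser for the layout listed above.
 * Returns the total number of bytes used by the CIP data (8 + header
 * length) once it is complete, 0 if more bytes are needed, or -1 if the
 * magic number does not match or the lengths are implausible.
 */
long cip_parse(const unsigned char *buf, size_t len, uint32_t magic,
               const unsigned char **ip_hdr, uint32_t *hdr_len)
{
    uint32_t hlen;
    unsigned int version;

    assert(buf && ip_hdr && hdr_len);

    if (len < 8)
        return 0;                 /* magic + length not complete yet */
    if (read_be32(buf) != magic)
        return -1;                /* not the magic both sides agreed on */

    hlen = read_be32(buf + 4);
    if (hlen < 20 + 20 || hlen > 65535)
        return -1;                /* below IPv4 + TCP minimum, or absurd */
    if (len - 8 < hlen)
        return 0;                 /* headers not fully received */

    version = buf[8] >> 4;        /* IP version nibble */
    if (version != 4 && version != 6)
        return -1;
    if (version == 6 && hlen < 40 + 20)
        return -1;                /* below IPv6 + TCP minimum */

    *ip_hdr = buf + 8;
    *hdr_len = hlen;
    return (long)hlen + 8;
}
```

A caller would skip the returned number of bytes before handing the
remaining data to the upper layers.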
When a listener is not bound to a process its frontend belongs to, it
is only paused and not stopped. This creates confusion from the outside
as "netstat -ltnp" for example will report only the parent process as
the listener instead of the effective one. "ss -lnp" will report that
all processes are listening to all sockets.
This is confusing enough to suggest a fix. Now we simply stop the unused
listeners. Example with this simple config :
    global
        nbproc 4

    frontend haproxy_test
        bind-process 1-40
        bind :12345 process 1
        bind :12345 process 2
        bind :12345 process 3
        bind :12345 process 4
Before the patch :
    $ netstat -ltnp
    Proto Recv-Q Send-Q Local Address     Foreign Address   State    PID/Program name
    tcp        0      0 0.0.0.0:12345     0.0.0.0:*         LISTEN   30457/./haproxy
    tcp        0      0 0.0.0.0:12345     0.0.0.0:*         LISTEN   30457/./haproxy
    tcp        0      0 0.0.0.0:12345     0.0.0.0:*         LISTEN   30457/./haproxy
    tcp        0      0 0.0.0.0:12345     0.0.0.0:*         LISTEN   30457/./haproxy
After the patch :
    $ netstat -ltnp
    Proto Recv-Q Send-Q Local Address     Foreign Address   State    PID/Program name
    tcp        0      0 0.0.0.0:12345     0.0.0.0:*         LISTEN   30504/./haproxy
    tcp        0      0 0.0.0.0:12345     0.0.0.0:*         LISTEN   30503/./haproxy
    tcp        0      0 0.0.0.0:12345     0.0.0.0:*         LISTEN   30502/./haproxy
    tcp        0      0 0.0.0.0:12345     0.0.0.0:*         LISTEN   30501/./haproxy
This patch may be backported to 1.6 and 1.5, but it relies on commit
7a798e5 ("CLEANUP: fix inconsistency between fd->iocb, proto->accept
and accept()") since it will expose an API inconsistency by including
listener.h in the .c.
There's quite some inconsistency in the internal API. listener_accept()
which is the main accept() function returns void but is declared as int
in the include file. It's assigned to proto->accept() for all stream
protocols where an int is expected but the result is never checked (nor
is it documented by the way). This proto->accept() is in turn assigned
to fd->iocb() which is supposed to return an int composed of FD_WAIT_*
flags, but which is never checked either.
So let's fix all this mess :
- nobody checks accept()'s return
- nobody checks iocb()'s return
- nobody sets a return value
=> let's mark all these functions void and keep the current ones intact.
Additionally we now include listener.h from listener.c to ensure we won't
silently hide this incoherency in the future.
Note that this patch could/should be backported to 1.6 and even 1.5 to
simplify debugging sessions.
The union name "data" is a little heavy when reading the source code,
because it leads to expressions like "data.data.sint". Renaming it from
"data" to "u" makes them easier to read, e.g. "data.u.sint".
This patch removes the information that was stored both in the struct
sample_data and in the struct sample. Now only the struct sample_data
contains data, and the struct sample uses a struct sample_data to store
its own data.
This patch removes the 32-bit unsigned integer and the 32-bit signed
integer types. It replaces them with a single 64-bit signed type.
This makes integers easier to use and clarifies signed and unsigned usage.
With the previous version, signed and unsigned values were used in place
of each other, and converters sometimes lost the sign. For example,
divisions were processed as "unsigned", so if one operand was negative,
the result was wrong.
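The division problem is easy to reproduce outside of haproxy. The two
helpers below are purely illustrative (their names are hypothetical and
they are not the converters themselves); they only contrast a division
performed on 32-bit unsigned values with one performed on 64-bit signed
values.

```c
#include <assert.h>
#include <stdint.h>

/* Old behaviour, modelled: operands are reinterpreted as unsigned. */
uint32_t div_as_uint32(int32_t a, int32_t b)
{
    assert(b != 0);
    return (uint32_t)a / (uint32_t)b;
}

/* New behaviour, modelled: a single signed 64-bit type keeps the sign. */
int64_t div_as_sint64(int64_t a, int64_t b)
{
    assert(b != 0);
    return a / b;
}
```

Dividing -10 by 2 gives -5 with the signed 64-bit version, whereas the
unsigned 32-bit version first turns -10 into 4294967286 and returns
2147483643.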
Note that the integer pattern matching and dotted version pattern matching
are already working with signed 64-bit integer values.
There is one user-visible change : the "uint()" and "sint()" sample fetch
functions which used to return a constant integer have been replaced with
a new more natural, unified "int()" function. These functions were only
introduced in the latest 1.6-dev2 so there's no impact on regular
deployments.
This patch removes the structs "session", "stream" and "proxy" from
the sample-fetches and converters function prototypes.
This makes the function prototypes lighter.
Pavlos Parissis reported that a sequence of disable/enable on a frontend
performed on the CLI can result in an error if the frontend has several
"bind" lines each bound to different processes. This is because the
resume_listener() function returns a failure for frontends not part of
the current process instead of returning a success to pretend there was
no failure.
This fix should be backported to 1.5.
Many such functions need a session, and until now they used to dereference
the stream. Once we remove the stream from the embryonic session, this
will not be possible anymore.
So as of now, sample fetch functions will be called with this :
- sess = NULL, strm = NULL : never
- sess = valid, strm = NULL : tcp-req connection
- sess = valid, strm = valid, strm->txn = NULL : tcp-req content
- sess = valid, strm = valid, strm->txn = valid : http-req / http-res
All of them can now retrieve the HTTP transaction *if it exists* from
the stream and be sure to get NULL there when called with an embryonic
session.
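The calling contexts listed above suggest a defensive pattern along the
following lines. This is only a sketch with reduced structures: smp_txn()
and fetch_status_sketch() are hypothetical names and the real prototypes
carry more arguments.

```c
#include <stddef.h>

/* Reduced, illustrative versions of the structures involved. */
struct http_txn { int status; };
struct session  { int id; };
struct stream   { struct session *sess; struct http_txn *txn; };

/* Return the HTTP transaction if it exists, NULL otherwise. This covers
 * both the embryonic session (strm == NULL) and a stream without HTTP
 * (strm->txn == NULL).
 */
static struct http_txn *smp_txn(const struct stream *strm)
{
    return strm ? strm->txn : NULL;
}

/* Hypothetical fetch: returns 1 and fills *out when an HTTP transaction
 * is available, 0 when the sample cannot be produced in this context.
 */
int fetch_status_sketch(const struct session *sess,
                        const struct stream *strm, int *out)
{
    const struct http_txn *txn = smp_txn(strm);

    (void)sess; /* always valid per the table above, unused here */
    if (!txn)
        return 0;
    *out = txn->status;
    return 1;
}
```

Such a fetch simply reports "no sample" from "tcp-req connection" and
"tcp-req content" rules, and only yields a value from HTTP rules.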
The patch is a bit large because many locations were touched (all fetch
functions had to have their prototype adjusted). The opportunity was
taken to also uniformize the call names (the stream is now always "strm"
instead of "l4") and to fix indent where it was broken. This way when
we later introduce the session here there will be less confusion.
With HTTP/2, we'll have to support multiplexed streams. A stream is in
fact the largest part of what we currently call a session: it has buffers,
logs, etc.
In order to catch any error, this commit removes any reference to the
struct session and tries to rename most "session" occurrences in function
names to "stream" and "sess" to "strm" when that's related to a session.
The files stream.{c,h} were added and session.{c,h} removed.
The session will be reintroduced later and a few parts of the stream
will progressively be moved over there. It will more or less contain
only what we need in an embryonic session.
Sample fetch functions and converters will have to change a bit so
that they'll use an L5 (session) instead of what's currently called
"L4" which is in fact L6 for now.
Once all changes are completed, we should see approximately this :
L7 - http_txn
L6 - stream
L5 - session
L4 - connection | applet
There will be at most one http_txn per stream, and a same session will
possibly be referenced by multiple streams. A connection will point to
a session and to a stream. The session will hold all the information
we need to keep even when we don't yet have a stream.
Some more cleanup is needed because some code was already far from
being clean. The server queue management still refers to sessions at
many places while comments talk about connections. This will have to
be cleaned up once we have a server-side connection pool manager.
Stream flags "SN_*" still need to be renamed; it doesn't seem like
any of them will need to move to the session.
When a listener resumes operations, supporting a full rebind makes it
possible to perform a full stop as a pause(). This will be used for
pausing abstract namespace unix sockets.
In order to fix the abstract socket pause mechanism during soft restarts,
we'll need to proceed differently depending on the socket protocol. The
pause_listener() function already supports some protocol-specific handling
for the TCP case.
This commit makes this cleaner by adding a new ->pause() function to the
protocol struct, which, if defined, may be used to pause a listener of a
given protocol.
For now, only TCP has been adapted, with the specific code moved from
pause_listener() to tcp_pause_listener().
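The dispatch through an optional per-protocol callback can be sketched as
follows. The structures are reduced to the bare minimum, the "_sketch"
names are hypothetical, and the return convention (1 when the protocol
hook paused the listener, 0 when the protocol provides no hook) is an
assumption made for this example only.

```c
#include <stddef.h>

struct listener;

/* Reduced protocol descriptor with an optional pause() method. */
struct protocol {
    const char *name;
    int (*pause)(struct listener *l); /* NULL if not provided */
};

struct listener {
    const struct protocol *proto;
    int paused;
};

/* Stand-in for the TCP-specific code moved out of the generic function. */
static int tcp_pause_listener_sketch(struct listener *l)
{
    l->paused = 1;
    return 1;
}

const struct protocol proto_tcp_sketch  = { "tcp",  tcp_pause_listener_sketch };
const struct protocol proto_unix_sketch = { "unix", NULL };

/* Generic entry point: delegate to the protocol when it defines pause(). */
int pause_listener_sketch(struct listener *l)
{
    if (l->proto->pause)
        return l->proto->pause(l);
    return 0; /* no protocol-specific handling available */
}
```

Adding pause support for another protocol then only requires filling in
its pause pointer, without touching the generic function.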
When stopping a listener, its fd is closed but not set to -1, so it is
not possible to re-open it again. This is currently harmless, but it may
have an impact once the abstract sockets are modified to perform a
complete close on soft-reload.
The fix can be backported to 1.5 and may even apply to 1.4 (protocols.c).
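The pattern being fixed is the classical "close and forget". A minimal
sketch, using a reduced hypothetical listener structure and function name:

```c
#include <unistd.h>

/* Reduced, illustrative listener: only the file descriptor matters here. */
struct listener {
    int fd; /* -1 means "no socket" */
};

/* Close the socket and mark the listener as having no fd, so that later
 * code can tell it needs to be re-opened and never closes it twice.
 */
void stop_listener_sketch(struct listener *l)
{
    if (l->fd >= 0) {
        close(l->fd);
        l->fd = -1;
    }
}
```

Calling the function a second time is then a no-op instead of closing a
descriptor number that may have been reused for something else.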
Now that we know what processes a "bind" statement is attached to, we
have the ability to avoid starting some of them when they're not on the
proper process. This feature is disabled when running in foreground
however, so that debug mode continues to work with everything bound to
the first and only process.
The main purpose of this change is to finally allow the global stats
sockets to be each bound to a different process.
It can also be used to force haproxy to use different sockets in different
processes for the same IP:port. The purpose is that under Linux 3.9 and
above (and possibly other OSes), when multiple processes are bound to the
same IP:port via different sockets, the system is capable of performing
a perfect round-robin between the socket queues instead of letting any
process pick all the connections from a queue. This results in a smoother
load balancing and may achieve a higher performance with a large enough
maxaccept setting.
During some tests in multi-process mode under Linux, it appeared that
issuing "disable frontend foo" on the CLI to pause a listener would
make the shutdown(read) of certain processes disturb another process
listening on the same socket, resulting in a 100% CPU loop. What
happens is that accept() returns EAGAIN without accepting anything.
Fortunately, we see that epoll_wait() reports EPOLLIN+EPOLLRDHUP
(likely because the FD points to the same file in the kernel), so we
can use that to stop the other process from trying to accept connections
for a short time and try again later, hoping for the situation to change.
We must not disable the FD otherwise there's no way to re-enable it.
Additionally, during these tests, a loop was encountered on EINVAL which
was not caught. Now if we catch an EINVAL, we proceed the same way, in
case the socket is re-enabled later.
On ARM, glibc does not implement accept4() and simply returns ENOSYS
which was not caught as a reason to fall back to accept(), resulting
in a spinning process since poll() would call again.
Let's change the error detection mechanism to save the broken status
of the syscall into a local variable that is used to fall back to the
legacy accept().
In addition to this, since the code was becoming a bit messy, the
accept4() was removed, so now the fallback code and the legacy code
are the same. This will also increase bug report accuracy if needed.
This is 1.5-specific, no backport is needed.
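The fallback logic can be sketched as below for Linux with glibc. This is
not the code from the patch: accept_nonblock() is a hypothetical wrapper,
the "broken" status is kept in a static variable for simplicity, and only
ENOSYS is treated as a reason to fall back.

```c
#define _GNU_SOURCE /* needed for accept4() on some architectures */
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>

/* Set once accept4() has been found unusable on this system. */
static int accept4_broken;

/* Accept a connection and return a non-blocking socket, or -1 with errno
 * set. accept4() is tried first; if the system reports ENOSYS, this is
 * remembered and the legacy accept() + fcntl() pair is used from then on.
 */
int accept_nonblock(int lfd, struct sockaddr *addr, socklen_t *len)
{
    int cfd;

    if (!accept4_broken) {
        cfd = accept4(lfd, addr, len, SOCK_NONBLOCK);
        if (cfd >= 0 || errno != ENOSYS)
            return cfd;
        accept4_broken = 1; /* never try accept4() again */
    }

    cfd = accept(lfd, addr, len);
    if (cfd >= 0)
        fcntl(cfd, F_SETFL, O_NONBLOCK);
    return cfd;
}
```

Because the status is remembered, a system without accept4() only pays
for the failed syscall once instead of on every incoming connection.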
Just like the previous commit, we sometimes want to limit the rate of
incoming SSL connections. While it can be done for a frontend, it was
not possible for a whole process, which makes sense when multiple
processes are running on a system to serve multiple customers.
The new global "maxsslrate" setting can be used to set a limit on the
session rate going to the SSL frontends. The limit applies before
the SSL handshake and not after, so that it saves the SSL stack from
expensive key computations that would finally be aborted before being
accounted for.
The same setting may be changed at run time on the CLI using
"set rate-limit ssl-session global".
It's sometimes useful to be able to limit the connection rate on a machine
running many haproxy instances (eg: per customer) but it removes the ability
for that machine to defend itself against a DoS. Thus, better also provide a
limit on the session rate, which does not include the connections rejected by
"tcp-request connection" rules. This makes it possible to set much higher
limits on the connection rate without having to raise the session rate limit
to insane values.
The limit can be changed on the CLI using "set rate-limit sessions global",
or in the global section using "maxsessrate".
This is the reimplementation of the "done" action : when we experience
a short read, we're almost certain that we've exhausted the system's
buffers and that we'll meet an EAGAIN if we attempt to read again. If
the FD is not yet polled, the stream interface already takes care of
stopping the speculative read. When the FD is already being polled, we
have two options :
- either we're running from a level-triggered poller, in which case
we'd rather report that we've reached the end so that we don't
speculate over the poller and let it report next time data are
available ;
- or we're running from an edge-triggered poller in which case we
have no choice and have to see the EAGAIN to re-enable events.
At the moment we don't have any edge-triggered poller, so it's desirable
to avoid speculative I/O that we know will fail.
Note that this must not be ported to SSL since SSL hides the real
readiness of the file descriptor.
Thanks to this change, we observe no EAGAIN anymore during keep-alive
transfers, and failed recvfrom() are reduced by half in http-server-close
mode (the client-facing side is always being polled and the second recv
can be avoided). Doing so results in about 5% performance increase in
keep-alive mode. Similarly, we used to have up to about 1.6% of EAGAIN
on accept() (1/maxaccept), and these have completely disappeared under
high loads.
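The short-read heuristic can be illustrated with the sketch below. It is
not the stream interface code: recv_once() is a hypothetical helper, and
read() is used instead of recv() so that the example also works on pipes.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Perform one read of at most <room> bytes. On return, *drained is set
 * to 1 when there is no point in trying again immediately: either the
 * read was short (the system buffers are most likely empty), or the
 * system already reported EAGAIN. It is left at 0 after a full read,
 * an end of stream, or a hard error.
 */
ssize_t recv_once(int fd, void *buf, size_t room, int *drained)
{
    ssize_t ret = read(fd, buf, room);

    *drained = 0;
    if (ret > 0 && (size_t)ret < room)
        *drained = 1; /* short read: a retry would very likely hit EAGAIN */
    else if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        *drained = 1;
    return ret;
}
```

A caller running under a level-triggered poller can stop reading as soon
as the flag is set and wait for the poller to report new data, which
saves the extra failing syscall.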
This commit heavily changes the polling system in order to definitely
fix the frequent breakage of SSL which needs to remember the last
EAGAIN before deciding whether to poll or not. Now we have a state per
direction for each FD, instead of the previous and current states we used
to keep. An FD can have up to 8 different states for each direction,
each of which is the result of a 3-bit combination. These 3 bits
indicate a wish to access the FD, the readiness of the FD and the
subscription of the FD to the polling system.
This means that it will now be possible to remember the state of a
file descriptor across disable/enable sequences that generally happen
during forwarding, where enabling reading on a previously disabled FD
would result in forgetting the EAGAIN flag it met last time.
Several new state manipulation functions have been introduced or
adapted :
- fd_want_{recv,send} : enable receiving/sending on the FD regardless
of its state (sets the ACTIVE flag) ;
- fd_stop_{recv,send} : stop receiving/sending on the FD regardless
of its state (clears the ACTIVE flag) ;
- fd_cant_{recv,send} : report a failure to receive/send on the FD
corresponding to EAGAIN (clears the READY flag) ;
- fd_may_{recv,send} : report the ability to receive/send on the FD
as reported by poll() (sets the READY flag) ;
Some functions are used to report the current FD status :
- fd_{recv,send}_active
- fd_{recv,send}_ready
- fd_{recv,send}_polled
Some functions were removed :
- fd_ev_clr(), fd_ev_set(), fd_ev_rem(), fd_ev_wai()
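The per-direction state can be modelled with three flag bits, as in the
sketch below. The bit values, the structure and the function names (which
take the direction as an argument instead of having _recv/_send variants)
are assumptions made for this example and do not reproduce the real fd.h.

```c
#include <assert.h>

/* Assumed bit assignment for the three per-direction flags. */
#define FD_EV_ACTIVE 1U /* someone wants to access the FD */
#define FD_EV_READY  2U /* the FD was not (yet) found to return EAGAIN */
#define FD_EV_POLLED 4U /* the FD is subscribed to the poller */

enum fd_dir { DIR_RD = 0, DIR_WR = 1 };

/* One independent 3-bit state per direction. */
struct fd_state {
    unsigned int ev[2];
};

static unsigned int *fd_ev(struct fd_state *fd, enum fd_dir dir)
{
    assert(dir == DIR_RD || dir == DIR_WR);
    return &fd->ev[dir];
}

void fd_want(struct fd_state *fd, enum fd_dir dir) { *fd_ev(fd, dir) |= FD_EV_ACTIVE; }
void fd_stop(struct fd_state *fd, enum fd_dir dir) { *fd_ev(fd, dir) &= ~FD_EV_ACTIVE; }
void fd_cant(struct fd_state *fd, enum fd_dir dir) { *fd_ev(fd, dir) &= ~FD_EV_READY; }
void fd_may(struct fd_state *fd, enum fd_dir dir)  { *fd_ev(fd, dir) |= FD_EV_READY; }

int fd_active(struct fd_state *fd, enum fd_dir dir)
{
    return !!(*fd_ev(fd, dir) & FD_EV_ACTIVE);
}

int fd_ready(struct fd_state *fd, enum fd_dir dir)
{
    return !!(*fd_ev(fd, dir) & FD_EV_READY);
}
```

Since the ACTIVE and READY bits are independent, stopping and re-enabling
a direction no longer erases the fact that the last access hit EAGAIN.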
The POLLHUP/POLLERR flags are now reported as ready so that the I/O layers
know they can try to access the file descriptor to get this information.
In order to simplify the conditions to add/remove cache entries, a new
function fd_alloc_or_release_cache_entry() was created to be used from
pollers while scanning for updates.
The following pollers have been updated :
ev_select() : done, built, tested on Linux 3.10
ev_poll() : done, built, tested on Linux 3.10
ev_epoll() : done, built, tested on Linux 3.10 & 3.13
ev_kqueue() : done, built, tested on OpenBSD 5.2
The accept loop used to force fd_poll_recv() even in places where it
was not completely appropriate (eg: unexpected errors). It does not
yet cause trouble but will with the upcoming polling changes. Let's
use it only where relevant now. EINTR/ECONNABORTED do not result in
poll() anymore but the failed connection is simply skipped (this code
dates from 1.1.32 when error codes were first considered).
The accept4() Linux syscall requires _GNU_SOURCE on ix86, otherwise
it emits a warning. On other archs including x86_64, this problem
doesn't happen. Thanks to Charles Carter from Sigma Software for
reporting this.
We're having a lot of duplicate code just because of minor variants between
fetch functions that could be dealt with if the functions had the pointer to
the original keyword, so let's pass it as the last argument. An earlier
version used to pass a pointer to the sample_fetch element, but this is not
the best solution for two reasons :
- fetch functions will solely rely on the keyword string
- some other smp_fetch_* users do not have the pointer to the original
keyword and were forced to pass NULL.
So finally we're passing a pointer to the keyword as a const char *, which
perfectly fits the original purpose.
Benoit Dolez reported a failure to start haproxy 1.5-dev19. The
process would immediately report an internal error with missing
fetches from some crap instead of ACL names.
The cause is that some versions of gcc seem to trim static structs
containing a variable array when moving them to BSS, and only keep
the fixed size, which is just a list head for all ACL and sample
fetch keywords. This was confirmed at least with gcc 3.4.6. And we
can't move these structs to const because they contain a list element
which is needed to link all of them together during the parsing.
The bug indeed appeared with 1.5-dev19 because it's the first one
to have some empty ACL keyword lists.
One solution is to impose -fno-zero-initialized-in-bss to everyone
but this is not really nice. Another solution consists in ensuring
the struct is never empty so that it does not move there. The easy
solution consists in having a non-null list head since it's not yet
initialized.
A new "ILH" list head type was thus created for this purpose : create
an Initialized List Head so that gcc cannot move the struct to BSS.
This fixes the issue for this version of gcc and does not create any
burden for the declarations.
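The idea can be sketched as follows. The placeholder values used in the
macro, the reduced keyword list structure (with a fixed one-entry array
instead of a variable one) and the register_kws() helper are assumptions
made for this example.

```c
#include <stddef.h>

struct list {
    struct list *n; /* next */
    struct list *p; /* previous */
};

/* Initialized List Head: non-null placeholder pointers, so that the
 * enclosing struct is not all-zero and cannot be moved to BSS. The
 * values are never dereferenced, they are overwritten at registration.
 */
#define ILH { .n = (struct list *)1, .p = (struct list *)2 }

/* Reduced keyword list with no keyword at all, only the terminator. */
struct kw_list_sketch {
    struct list list;
    const char *kw[1];
};

static struct kw_list_sketch empty_kws = { ILH, { NULL } };

/* Global list of all registered keyword lists, initially empty. */
static struct list all_kws = { &all_kws, &all_kws };

/* Append a keyword list at the tail of the global list. */
void register_kws(struct kw_list_sketch *kwl)
{
    kwl->list.n = &all_kws;
    kwl->list.p = all_kws.p;
    all_kws.p->n = &kwl->list;
    all_kws.p = &kwl->list;
}
```

The declaration of an empty keyword list thus stays a one-liner, while
its list head only becomes meaningful once the list is registered.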
Now that ACLs solely rely on sample fetch functions, make them use the
same arg mask. All inconsistencies have been fixed separately prior to
this patch, so this patch almost only adds a new pointer indirection
and removes all references to ARG*() in the definitions.
The parsing is still performed by the ACL code though.
ACL fetch functions used to directly reference a fetch function. Now
that all ACL fetches have their sample fetches equivalent, we can make
ACLs reference a sample fetch keyword instead.
In order to simplify the code, a sample keyword name may be NULL if it
is the same as the ACL's, which is the most common case.
A minor change appeared, http_auth always expects one argument though
the ACL allowed it to be missing and reported as such afterwards, so
fix the ACL to match this. This is not really a bug.
The following sample fetch functions were only usable by ACLs but are now
usable by sample fetches too :
dst_conn, so_id,
The fetch functions have been renamed "smp_fetch_*".
If some listeners are mistakenly configured with 0 as the maxaccept value,
then we now consider them as limited to one accept() at a time. This will
avoid some issues as fixed by the past commit.
global.tune.maxaccept was used for all listeners. This becomes really
inconvenient when some listeners are bound to a single process and other
ones are bound to many processes.
Now we change the principle : we count the number of processes a listener
is bound to, and apply the maxaccept either entirely if there is a single
process, or divided by twice the number of processes in order to maintain
fairness.
The default limit has also been increased from 32 to 64 as it appeared that
on small machines, 32 was too low to achieve high connection rates.
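As a worked example of this rule, the helper below computes a
per-listener value from the global setting. It is a sketch only:
listener_maxaccept() is a hypothetical name, and the clamping to a
minimum of 1 (which also covers a value mistakenly configured as 0) is an
assumption about how the rounding is handled.

```c
#include <assert.h>

/* Compute the number of accept() calls a listener may perform per
 * wake-up, given the global setting and the number of processes the
 * listener is bound to.
 */
int listener_maxaccept(int global_maxaccept, int nbproc_bound)
{
    int max = global_maxaccept;

    assert(nbproc_bound >= 1);

    if (nbproc_bound > 1)
        max /= 2 * nbproc_bound; /* share fairly between processes */
    if (max < 1)
        max = 1;                 /* never less than one accept() */
    return max;
}
```

With the new default of 64, a listener bound to a single process may
accept 64 connections at once, while one shared by 4 processes is limited
to 64 / (2 * 4) = 8.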
It happens that on some systems, the libc is recent enough to permit
building with accept4() but the kernel does not support it. The result
is then a disaster since no connection is accepted. We now detect this
and automatically fall back to accept() and fcntl() when this happens.
On Linux, accept4() does the same as accept() except that it allows
the caller to specify some flags to set on the resulting socket. We
use this to set the O_NONBLOCK flag and thus to save one fcntl()
call in each connection. The effect is a small performance gain of
around 1%.
The option is automatically enabled when target linux2628 is set, or
when the USE_ACCEPT4 Makefile variable is set. If the libc is too old
to provide the equivalent function, this is automatically detected and
our own function is used instead. In any case it is possible to force
the use of our implementation with USE_MY_ACCEPT4.
Pausing a UNIX_STREAM socket results in a major pain because the socket
does not correctly resume: it wakes poll() but returns EAGAIN on accept(),
resulting in a busy loop. So let's only pause protocols that support it.
This issue has existed since UNIX sockets were introduced on bind lines.
We were having several different behaviours with monitor-net and
"mode health" :
- monitor-net on TCP connections was evaluated just after accept(),
did not count a connection on the frontend, was not subject
to tcp-request connection rules, and caused an immediate close().
- monitor-net in HTTP mode was evaluated once the session was
accepted (eg: on top of SSL), returned "HTTP/1.0 200 OK\r\n\r\n"
over the connection's data layer and instantiated a session which
was responsible for closing this connection. A connection AND a
session were counted for the frontend ;
- "mode health" with "option httpchk" would do exactly the same as
monitor-net in HTTP mode ;
- "mode health" without "option httpchk" would do the same as above
except that "OK" was returned instead of "HTTP/1.0 200 OK\r\n\r\n".
None of them took care of cleaning the input buffer, sometimes resulting
in a TCP reset being emitted after the last packet if a request was received
over the connection.
Given the inconsistencies and the complexity in keeping all these features
handled at the right position, we now slightly changed the way they are
handled :
- all of them are handled just after the "tcp-request connection" rules,
so that all of them may be blocked using such rules, offering more
flexibility and consistency ;
- no connection handshake is performed anymore for non-TCP modes
- all of them send the response as raw data over the socket, there is no
more difference between TCP and HTTP mode for example (these rules were
never meant to be served over SSL connections and were never documented
as able to do that).
- any possible pending data on the incoming socket is drained before the
response is sent, in order to avoid the risk of a reset.
- none of them exactly did what was documented !
This results in more consistent, more flexible and more accurate handling of
monitor rules, with smaller and more robust code.
Navigating through listeners was very inconvenient and error-prone. Not to
mention that listeners were linked in reverse order and reversed afterwards.
In order to definitely get rid of these issues, we now do the following :
- frontends have a doubly-linked list of bind_conf
- frontends have a doubly-linked list of listeners
- bind_conf have a doubly-linked list of listeners
- listeners have a pointer to their bind_conf
This way we can now navigate from anywhere to anywhere and always find the
proper bind_conf for a given listener, as well as find the list of listeners
for a current bind_conf.
When an unknown "bind" keyword is detected, dump the list of all
registered keywords. Unsupported default alternatives are also reported
as "not supported".
With the arrival of SSL, the "bind" keyword has received even more options,
all of which are processed in cfgparse in a cumbersome way. So it's time to
let modules register their own bind options. This is done very similarly to
the ACLs with a small difference in that we distinguish between an
unknown option and a known, unimplemented option.
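The three possible outcomes of a keyword lookup can be sketched as below.
Everything here is illustrative: the structure, the status names, the
keyword table and the convention that a NULL parser marks a known but
unimplemented option are assumptions made for this example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum bind_kw_status {
    BIND_KW_OK = 0,          /* known and parsed */
    BIND_KW_NOT_IMPLEMENTED, /* known, but not built in */
    BIND_KW_UNKNOWN          /* never registered */
};

/* Reduced keyword descriptor: a NULL parser means "not implemented". */
struct bind_kw {
    const char *kw;
    int (*parse)(const char *arg, int *out);
};

/* Hypothetical parser storing a numeric argument. */
static int parse_backlog_sketch(const char *arg, int *out)
{
    *out = atoi(arg);
    return 0;
}

/* Hypothetical registry: "ssl" stands for an option whose module was
 * not compiled in.
 */
static const struct bind_kw bind_kws_sketch[] = {
    { "backlog", parse_backlog_sketch },
    { "ssl",     NULL },
    { NULL,      NULL }
};

/* Look a keyword up and run its parser when there is one. */
enum bind_kw_status bind_parse_option(const char *name, const char *arg,
                                      int *out)
{
    const struct bind_kw *kw;

    for (kw = bind_kws_sketch; kw->kw; kw++) {
        if (strcmp(kw->kw, name) != 0)
            continue;
        if (!kw->parse)
            return BIND_KW_NOT_IMPLEMENTED;
        kw->parse(arg, out);
        return BIND_KW_OK;
    }
    return BIND_KW_UNKNOWN;
}
```

The configuration parser can then emit a different error message for a
typo than for an option that exists but was disabled at build time.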