The FCGI application handles all the configuration parameters used to format
requests sent to an application. The configuration of an application is grouped
in a dedicated section (fcgi-app <name>) and referenced in a backend to be used
(use-fcgi-app <name>). To be valid, a FCGI application must at least define a
document root. But it is also possible to set the default index, a regex to
split the script name and the path-info from the request URI, parameters to set
or unset... In addition, this patch also adds a FCGI filter, responsible for
all processing on a stream.
To avoid code duplication in the futur mux FCGI, functions parsing H1 messages
and converting them into HTX have been moved in the file h1_htx.c. Some
specific parts remain in the mux H1. But most of the parsing is now generic.
Application is a generic term here. It is a modules which handle its own log
server list, with no dependency on a proxy. Such applications can now call the
function app_log() to log messages, passing a log server list and a tag as
parameters. Internally, the function __send_log() has been adapted accordingly.
Most of times, when a keyword is added in proxy section or on the server line,
we need to have a post-parser callback to check the config validity for the
proxy or the server which uses this keyword.
It is possible to register a global post-parser callback. But all these
callbacks need to loop on the proxies and servers to do their job. It is neither
handy nor efficient. Instead, it is now possible to register per-proxy and
per-server post-check callbacks.
Most of times, when any allocation is done during configuration parsing because
of a new keyword in proxy section or on the server line, we must add a call in
the deinit() function to release allocated ressources. It is now possible to
register a post-deinit callback because, at this stage, the proxies and the
servers are already releases.
Now, it is possible to register deinit callbacks per-proxy or per-server. These
callbacks will be called for each proxy and server before releasing them.
This new flag may be used to report unexpected error because of not well
formatted HTX messages (not related to a parsing error) or our incapactity to
handle the processing because we reach a limit (ressource exhaustion, too big
headers...). It should result to an error 500 returned to the client when
applicable.
It is now possible to export stats using the JSON format from the HTTP stats
page. Like for the CSV export, to export stats in JSON, you must add the option
";json" on the stats URL. It is also possible to dump the JSON schema with the
option ";json-schema". Corresponding Links have been added on the HTML page.
This patch fixes the issue #263.
This adds two extra fields to the stats, one for the current number of idle
connections and one for the configured limit. A tooltip link now appears on
the HTML page to show these values in front of the active connection values.
This should be backported to 2.0 and 1.9 as it's the only way to monitor
the idle connections behaviour.
When using "http-reuse safe", which is the default, a new incoming connection
does not automatically reuse an existing connection for the first request, as
we don't want to risk to lose the contents if we know the client will not be
able to replay the request. A side effect to this is that when dealing with
mostly http-close traffic, the reuse rate is extremely low and we keep
accumulating server-side connections that may even never be reused. At some
point we're limited to a ratio of file descriptors, but when the system is
configured with very high FD limits, we can still reach the limit of outgoing
source ports and make the system significantly slow down trying to find an
available port for outgoing connections. A simple test on my laptop with
ulimit 100000 and with the following config results in the load immediately
dropping after a few seconds :
listen l1
bind :4445
mode http
server s1 127.0.0.1:8000
As can be seen, the load falls from 38k cps to 400 cps during the first 200ms
(in fact when the source port table is full and connect() takes ages to find
a spare port for a new connection):
$ injectl464 -p 4 -o 1 -u 10 -G 127.0.0.1:4445/ -F -c -w 100
hits ^hits hits/s ^h/s bytes kB/s last errs tout htime sdht ptime
2439 2439 39338 39338 356094 5743 5743 0 0 0.4 0.5 0.4
7637 5198 38185 37666 1115002 5575 5499 0 0 0.7 0.5 0.7
7719 82 25730 820 1127002 3756 120 0 0 21.8 18.8 21.8
7797 78 19492 780 1138446 2846 114 0 0 61.4 2.5 61.4
7877 80 15754 800 1150182 2300 117 0 0 58.6 0.5 58.6
7920 43 13200 430 1156488 1927 63 0 0 58.9 0.3 58.9
At this point, lots of connections are indeed in use, for only 10 connections
on the frontend side:
$ ss -ant state established | wc -l
39022
This patch makes sure we never keep more idle connections than we've ever
had outstanding requests on a server. This way the total number of idle
connections will never exceed the sum of maximum connections. Thus highly
loaded servers will be able to get many connections and slightly loaded
servers will keep less. Ideally we should apply similar limits per process
and the per backend, but in practice this already addresses the issues
pretty well:
$ injectl464 -p 4 -o 1 -u 10 -G 127.0.0.1:4445/ -F -c -w 100
hits ^hits hits/s ^h/s bytes kB/s last errs tout htime sdht ptime
4423 4423 40209 40209 645758 5870 5870 0 0 0.2 0.4 0.2
8020 3597 40100 39966 1170920 5854 5835 0 0 0.2 0.4 0.2
12037 4017 40123 40170 1757402 5858 5864 0 0 0.2 0.4 0.2
16069 4032 40172 40320 2346074 5865 5886 0 0 0.2 0.4 0.2
20047 3978 40013 39386 2926862 5842 5750 0 0 0.3 0.4 0.3
24005 3958 40008 39979 3504730 5841 5837 0 0 0.2 0.4 0.2
$ ss -ant state established | wc -l
234
This patch must be backported to 2.0. It could be useful in 1.9 as well
eventhough pools and reuse are not enabled by default there.
As mentioned in previous commit, these flags do not map well to
modern poller capabilities. Let's use the FD_EV_*_{R,W} flags instead.
This first patch only performs a 1-to-1 mapping making sure that the
previously reported flags are still reported identically while using
the closest possible semantics in the pollers.
It's worth noting that kqueue will now support improvements such as
returning distinctions between shut and errors on each direction,
though this is not exploited for now.
There's currently a big ambiguity on our use of POLLHUP because we
currently map POLLHUP and POLLRDHUP to FD_POLL_HUP. The first one
indicates a close in *both* directions while the second one indicates
a unidirectional close. Since we don't know from the resulting flag
we always have to read when reported. Furthermore kqueue only reports
unidirectional responses which are mapped to FD_POLL_HUP as well, and
their write closes are mapped to a general error.
We could add a new FD_POLL_RDHUP flag to improve the mapping, or
switch only to the POLL* flags, but that further complicates the
portability for operating systems like FreeBSD which do not have
POLLRDHUP but have its semantics.
Let's instead directly use the per-direction flag values we already
have, and it will be a first step in the direction of finer states.
Thus we introduce an ERR and a SHUT status for each direction, that
the pollers will be able to compute and pass to fd_update_events().
It's worth noting that FD_EV_STATUS already sees the two new flags,
but they are harmless since used only by fd_{recv,send}_state() which
are never called. Thus in its current state this patch must be totally
transparent.
These two functions are used to enable recv/send but only if the FD is
not marked as active yet. The purpose is to conditionally mark them as
tentatively usable without interfering with the polling if polling was
already enabled, when it's supposed to be likely true.
Given that all our I/Os are now directed from top to bottom and not the
opposite way around, and the FD cache was removed, it doesn't make sense
anymore to create FDs that are marked not ready since this would prevent
the first accesses unless the caller explicitly does an fd_may_recv()
which is not expected to be its job (which conn_ctrl_init() has to do
by the way). Let's move this into fd_insert() instead, and have a single
atomic operation for both directions via fd_may_both().
Now that we don't have to update FD_EV_POLLED_* at the same time as
FD_EV_ACTIVE_*, we don't need to use a CAS anymore, a bit-test-and-set
operation is enough. Doing so reduces the code size by a bit more than
1 kB. One function was special, fd_done_recv(), whose comments and doc
were inaccurate for the part related to the lack of polling.
Since commit 7ac0e35f2 in 1.9-dev1 ("MAJOR: fd: compute the new fd polling
state out of the fd lock") we've started to update the FD POLLED bit a
bit more aggressively. Lately with the removal of the FD cache, this bit
is always equal to the ACTIVE bit. There's no point continuing to watch
it and update it anymore, all it does is create confusion and complicate
the code. One interesting side effect is that it now becomes visible that
all fd_*_{send,recv}() operations systematically call updt_fd_polling(),
except fd_cant_recv()/fd_cant_send() which never saw it change.
Now by prefixing a log server with "ring@<name>" it's possible to send
the logs to a ring buffer. One nice thing is that it allows multiple
sessions to consult the logs in real time in parallel over the CLI, and
without requiring file system access. At the moment, ring0 is created as
a default sink for tracing purposes and is available. No option is
provided to create new rings though this is trivial to add to the global
section.
Instead of detecting an AF_UNSPEC address family for a log server and
to deduce a file descriptor, let's create a target type field and
explicitly mention that the socket is of type FD.
The purpose is to be able to remember that initialization was already
done for a file descriptor. This will allow to get rid of some dirty
hacks performed in the logs or fd sinks where the init state of the
fd has to be guessed.
The "cache" entry was still present in the fdtab struct and it was
reported in "show sess". Removing it broke the cache-line alignment
on 64-bit machines which is important for threads, so it was fixed
by adding an attribute(aligned()) when threads are in use. Doing it
only in this case allows 32-bit thread-less platforms to see the
struct fit into 32 bytes.
Now it is possible for a reader to subscribe and wait for new events
sent to a ring buffer. When new events are written to a ring buffer,
the applets that are subscribed are woken up to display new events.
For now we only support this with the CLI applet called by "show events"
since the I/O handler is indeed a CLI I/O handler. But it's not
complicated to add other mechanisms to consume events and forward them
to external log servers for example. The wait mode is enabled by adding
"-w" after "show events <sink>". An extra "-n" was added to directly
seek to new events only.
Some CLI parsers are currently abusing the CLI context types such as
pointers to stuff longs into them by lack of room. But the context is
80 bytes while cli is only 48, thus there's some room left. This patch
adds a list element and two size_t usable as various offsets. The list
element is initialized.
The detail level initially based on syslog levels is not used, while
something related is missing, trace verbosity, to indicate whether or
not we want to call the decoding callback and what level of decoding
we want (raw captures etc). Let's change the field to "verbosity" for
this. A verbosity of zero means that the decoding callback is not
called, and all other levels are handled by this callback and are
source-specific. The source is now prompted to list the levels that
are proposed to the user. When the source doesn't define anything,
"quiet" and "default" are available.
Working on adding traces to mux-h2 revealed that the function names are
manually copied a lot in developer traces. The reason is that they are
not preprocessor macros and as such cannot be concatenated. Let's
slightly adjust the trace() function call to take a function name just
after the file:line argument. This argument is only added for the
TRACE_DEVEL and 3 new TRACE_ENTER, TRACE_LEAVE, and TRACE_POINT macros
and left NULL for others. This way the function name is only reported
for traces aimed at the developers. The pretty-print callback was also
extended to benefit from this. This will also significantly shrink the
data segment as the "entering" and "leaving" strings will now be merged.
One technical point worth mentioning is that the function name is *not*
passed as an ist to the inline function because it's not considered as
a builtin constant by the compiler, and would lead to strlen() being
run on it from all call places before calling the inline function. Thus
instead we pass the const char * (that the compiler knows where to find)
and it's the __trace() function that converts it to an ist for internal
consumption and for the pretty-print callback. Doing this avoids losing
5-10% peak performance.
The "payload" trace level was ambigous because its initial purpose was
to be able to dump received data. But it doesn't make sense to force to
report data transfers just to be able to report state changes. For
example, all snd_buf()/rcv_buf() operations coming from the application
layer should be tagged at this level. So here we move this payload level
above the state transitions and rename it to avoid the ambiguity making
one think it's only about request/response payload. Now it clearly is
about any data transfer and is thus just below the developer level. The
help messages on the CLI and the doc were slightly reworded to help
remove this ambiguity.
In prompts on the CLI we now commonly need to propose a keyword name
and a description and it doesn't make sense to define a new struct for
each such pairs. Let's simply have a generic "name_desc" for this.
Save the authority TLV in a PROXYv2 header from the client connection,
if present, and make it available as fc_pp_authority.
The fetch can be used, for example, to set the SNI for a backend TLS
connection.
Previously the callback was almost mandatory so it made sense to have it
before the message. Now that it can default to the one declared in the
trace source, most TRACE() calls contain series of empty args and callbacks,
which make them suitable for being at the end and being totally omitted.
This patch thus reverses the TRACE arguments so that the message appears
first, then the mask, then arg1..arg4, then the callback. In practice
we'll mostly see 1 arg, or 2 args and nothing else, and it will not be
needed anymore to pass long series of commas in the middle of the
arguments. However if a source is enforced, the empty commas will still
be needed for all omitted arguments.
It becomes apparent that most traces will use a single trace pretty
print callback, so let's allow the trace source to declare a default
one so that it can be omitted from trace calls, and will be used if
no other one is specified.
The principle is that when emitting a message, if some dropped events
were logged, we first attempt to report this counter before going
further. This is done under an exclusive lock while all logs are
produced under a shared lock. This ensures that the dropped line is
accurately reported and doesn't accidently arrive after a later
event.
This now provides sink_new_buf() which allocates a ring buffer. One such
ring ("buf0") of 1 MB is created already, and may be used by sink_write().
The sink's creation should probably be moved somewhere else later.
The three functions (attach, IO handler, and release) are meant to be
called by any CLI command which requires to dump the contents of a ring
buffer. We do not implement anything generic to dump any ring buffer on
the CLI since it's meant to be used by other functionalities above.
However these functions deal with locking and everything so it's trivial
to embed them in other code.
This function tries to write to the ring buffer, possibly removing enough
old messages to make room for the new one. It takes two arrays of fragments
on input to ease the insertion of prefixes by the caller. It atomically
writes the message, possibly truncating it if desired, and returns the
operation's status.
Our circular buffers are well suited for being used as ring buffers for
not-so-structured data. The machanism here consists in making room in a
buffer before inserting a new record which is prefixed by its size, and
looking up next record based on the previous one's offset and size. We
can have up to 255 consumers watching for data (dump in progress, tail)
which guarantee that entrees are not recycled while they're being dumped.
The complete representation is described in the header file. For now only
ring_new(), ring_resize() and ring_free() are created.
Currently both logs and event sinks may use a file descriptor to
atomically emit some output contents. The two may use the same FD though
nothing is done to make sure they use the same lock. Also there is quite
some redundancy between the two. Better make a specific function to send
a fragmented message to a file descriptor which will take care of the
locking via the fd's lock. The function is also able to truncate a
message and to enforce addition of a trailing LF when building the
output message.
The new functions are :
__b_put_varint() : inserts a varint when it's known that it fits
b_put_varint() : tries to insert a varint at the tail
b_get_varint() : tries to get a varint from the head
b_peek_varint() : tries to peek a varint at a specific offset
Wrapping is supported so that they are expected to be safe to use to
manipulate varints with buffers anywhere.
It will sometimes be useful to encode varints to know the output size in
advance. Two versions are provided, one inline using a switch/case construct
which will be trivial for use with constants (and will be very fast albeit
huge) and one function iterating on the number which is 5 times smaller,
for use with variables.
I forgot to fix this one before pushing, despite my tests. lockon_ptr is
only used to compare pointers, it doesn't need to point to a writable
location. Without threads the atomic store is turned into an assignment
and rightfully complains.
Given that we can pass typed arguments to the trace() function, let's
add provisions for tracking them. They are source-specific so we need
to let the source fill their name and description. Only those with a
non-null name will be proposed.
With a few macros it's possible for a trace source to commit to only
using a certain type for a given argument (or set of). This will be
particularly useful to let the trace subsystem retrieve some precious
information such as a connection, session, listener, source address or
so, and enable/disable filtering and/or locking.
The new TRACE_<level>() macros take a mask, 4 args, a callback and a
static message. From this they also inherit the TRACE_SOURCE macro from
the caller, which contains the pointer to the trace source (so that it's
not required to paste it everywhere), and an ist string is also made by
the concatenation of the file name and the line number. This uses string
concatenation by the preprocessor, and turns it into an ist by the compiler
so that there is no operation at all to perform to adjust the data length
as the compiler knows where to cut during the optimization phase. Last,
the message is also automatically turned into an ist so that it's trivial
to put it into an iovec without having to run strlen() on it.
All arguments and the callback may be empty and will then automatically
be replaced with a NULL pointer. This makes the TRACE calls slightly
lighter especially since arguments are not always used. Several other
options were considered to use variadic macros but there's no outstanding
rule that justifies to place an argument before another one, and it still
looks convenient to have the message be the last one to encourage copy-
pasting of the trace statements.
A generic TRACE() macro takes TRACE_LEVEL in from the source file as the
trace level instead of taking it from its name. This may slightly simplify
the production of traces that always run at the same level (internal core
parts may probably only be called at developer level).
The trace() call will support an optional decoding callback and 4
arguments that this function is supposed to know how to use to provide
extra information. The output remains unchanged when the function is
NULL. Otherwise, the message is pre-filled into the thread-local
trace_buf, and the function is called with all arguments so that it
completes the buffer in a readable form depending on the expected
level of detail.
This new "level" argument will allow the trace sources to label the
traces for different purposes, and filter out some of them if they
are not relevant to the current target. Right now we have 5 different
levels:
- USER : the least verbose one, only a few functional information
- PAYLOAD: like user but also displays some payload-related information
- PROTO: focuses on the protocol's framing
- STATE: also indicate state internal transitions or non-transitions
- DEVELOPER: adds extra info about branches taken in the code (break
points, return points)
We now pass an extra argument "where" to the trace() call, which
is supposed to be an ist made of the concatenation of the filename
and the line number. We only keep the last 10 chars from this string
since the end of file names is most often easy to recognize. This
gives developers useful information at very low cost.
For now it remains quite basic. It performs a few state checks, calls
the source's sink if defined, and performs the transitions between
RUNNING, STOPPED and WAITING when the configured events match.
For now it lists the sources if one is not provided, and checks
for the source's existence. It lists the events if not provided,
checks for their existence if provided, and adjusts reported
events/start/stop/pause events, and performs state transitions.
It lists sinks and adjusts them as well. Filters, lock, and
level are not implemented yet.
The principle of this subsystem will be to support taking live traces
at various places in the code with conditional triggers, filters, and
ability to lock on some elements. The traces will support typed events
and will be sent into sinks made of ring buffers, file descriptors or
remote servers.
This is the most basic type of sink. It pre-registers "stdout" and
"stderr", and is able to use writev() on them. The writev() operation
is locked to avoid mixing outputs. It's likely that the registration
should move somewhere else to take into account the fact that stdout
and stderr are still opened or are closed.
The principle will be to be able to dispatch events to various destinations
called "sinks". This is already done in part in logs where log servers can
be either a UDP socket or a file descriptor. This will be needed with the
new trace subsystem where we may also want to add ring buffers. And it turns
out that all such destinations make sense at all places. Logs may need to be
sent to a TCP server via a ring buffer, or consulted from the CLI. Trace
events may need to be sent to stdout/stderr as well as to remote log servers.
This patch creates a new structure "sink" aiming at addressing these similar
needs. The goal is to merge together what is common to all of them, such as
the output format, the dropped events count, etc, and also keep separately
the target identification (network address, file descriptor). Provisions
were made to have a "waiter" on the sink. For a TCP log server it will be
the task to wake up after writing to the log buffer. For a ring buffer, it
could be the list of watchers on the CLI running a "tail" operation and
waiting for new events. A lock was also placed in the struct since many
operations will require some locking, including the FD ones. The output
formats covers those in use by logs and two extra ones prepending the ISO
time in front of the message (convenient for stdio/buffer).
For now only the generic infrastructure is present, no type-specific
output is implemented. There's the sink_write() function which prepares
and formats a message to be sent, trying hard to avoid copies and only
using pointer manipulation, where the type-specific code just has to be
added. Dropped messages are already counted (for now 100% drop). The
message is put into an iovec array as it will be trivial to use with
file descriptors and sockets.
The function call tracing code is a quite old and was never ported to
support threads. It's not even sure whether it still works well, but
at least its presence creates confusion for future work so let's rename
it to calltrace.c and add a comment about its lack of thread-safety.
It's sometimes convenient for debugging macros not to be forced to
explicitly pass NULL in an unused argument. This macro does this, it
replaces a missing arg with NULL.
The current functions are seen outside from the debugging code and are
convenient to export so that we can improve the thread dump output :
void hlua_applet_tcp_fct(struct appctx *ctx);
void hlua_applet_http_fct(struct appctx *ctx);
struct task *hlua_process_task(struct task *task, void *context, unsigned short state);
Of course they are only available when USE_LUA is defined.
This is somewhat related to indent_msg() except that this one places a
known prefix at the beginning of each line, allows to replace the EOL
character, and not to insert a prefix on the first line if not desired.
It works with a normal output buffer/chunk so it doesn't need to allocate
anything nor to modify the input string. It is suitable for use in multi-
line backtraces.
When I/O events are being processed, we want to make sure to mark the
thread as not stuck. The reason is that some pollers (like poll()) which
do not limit the number of FDs they report could possibly report a huge
amount of FD all having to perform moderately expensive operations in
the I/O callback (e.g. via mux-pt which forwards to the upper layers),
making the watchdog think the thread is stuck since it does not schedule.
Of course this must never happen but if it ever does we must be liberal
about it.
This should be backported to 2.0, where the situation may happen more
easily due to the FD cache which can start to collect a large amount of
events. It may be related to the report in issue #201 though nothing is
certain about it.
These functions perform all the boring filling of the appctx's
cli struct needed by CLI parsers to return a message or an error,
and they return 1 so that they can be used as a single-line return
statement. They may be used for const messages or dynamic messages.
Right now we used to have extremely inconsistent states to report output,
one is CLI_ST_PRINT which prints constant message cli->msg with the
assigned severity, and CLI_ST_PRINT_FREE which prints dynamically
allocated cli->err with severity LOG_ERR, and nothing in between,
eventhough it's useful to be able to report dynamically allocated
messages as well as constant error messages.
This patch adds two extra states, which are not particularly well named
given the constraints imposed by existing ones. One is CLI_ST_PRINT_ERR
which prints a constant error message. The other one is CLI_ST_PRINT_DYN
which prints a dynamically allocated message. By doing so we maintain
the compatibility with current code.
It is important to keep in mind that we cannot pre-initialize pointers
and automatically detect what message type it is based on the assigned
fields, because the CLI's context is in a union shared with all other
users, thus unused fields contain anything upon return. This is why we
have no choice but using 4 states. Keeping the two fields <msg> and
<err> remains useful because one is const and not the other one, and
this catches may copy-paste mistakes. It's just that <err> is pretty
confusing here, it should be renamed.
The CPU time accounting field called "cpu_time" is used only by tasks
and not tasklets, yet it used to be stored into the TASK_COMMON part,
which doesn't make sense and wastes tasklet memory. In addition, moving
it to tasks also helps better group the various parts in cache lines.
Since last commit there's no point anymore in having two variants of the
same function, let's switch to b_free() only. __b_drop() was renamed to
__b_free() for obvious consistency reasons.
A small race exists in buffers with "show sess all". This one wants to show
some information grabbed from the buffer (especially in HTX mode). But the
thread owning this buffer might just be releasing its area, right after a
free() or munmap() call, resulting in a head that is not seen as empty yet
though the area was released. It may then be dereferenced by "show sess all"
causing a crash. Note that in practice it only happens in debug mode with
UAF enabled, but it's tricky enough to fix it right now.
This should be backported to stable versions which support threads and a
store barrier. It's worth noting that by performing the clearing first,
b_free() and b_drop() now become two exact equivalent.
Commit 85b2cae63 ("MINOR: pools: make the thread harmless during the
mmap/munmap syscalls") was used to relax the pressure experienced by
other threads when running in debug mode with UAF enabled. It places
a pair of thread_harmless_now()/thread_harmless_end() around the call
to mmap(), assuming callers are not sensitive to parallel activity.
But there are a few cases like "show sess all" where this happens in
isolated threads, and marking the thread as harmless there is a very
bad idea, even worse when arriving to thread_harmless_end() which loops
forever.
Let's only do that when the thread is not isolated. No backport is
needed as the patch above was only in 2.1-dev.
When parsing references to stick-tables declared as backends, they are added to
a list of proxies (they are proxies!) which refer to this stick-tables.
Before this patch we added them to these list without checking they were already
present, making the silly hypothesis the actions/sample were checked/resolved in the same
order the proxies are parsed.
This patch implement a simple inline function to in_proxies_list() to test
the presence of a proxy in a list of proxies. We use this function when resolving
/checking samples/actions.
This bug was introduced by 015e4d7 commit.
Must be backported to 2.0.
In stream_set_backend(), if we have a TCP stream, and we want to upgrade it
to H2 instead of attempting ot reuse the stream, just destroy the
conn_stream, make sure we don't log anything about the stream, and pretend
we failed setting the backend, so that the stream will get destroyed.
New streams will then be created by the mux, as if the connection just
happened.
This fixes a crash when upgrading from TCP to H2, as the H2 mux totally
ignored the conn_stream provided by the upgrade, as reported in github
issue #196.
This should be backported to 2.0.
It happens that upon looping threads the watchdog fires, starts a dump,
and other threads expire their budget while waiting for the other threads
to get dumped and trigger a watchdog event again, adding some confusion
to the traces. With this patch the situation becomes clearer as we export
the list of threads being dumped so that the watchdog can check it before
deciding to trigger. This way such threads in queue for being dumped are
not attempted to be reported in turn.
This should be backported to 2.0 as it helps understand stack traces.
In the poller code, instead of just remembering if we're currently polling
a fd or not, remember if we're polling it for writing and/or for reading, that
way, we can avoid to modify the polling if it's already polled as needed.
Now that the architecture was changed so that attempts to receive/send data
always come from the upper layers, instead of them only trying to do so when
the lower layer let them know they could try, we can finally get rid of the
fd cache. We don't really need it anymore, and removing it gives us a small
performance boost.
A problem involving server slowstart was reported by @max2k1 in issue #197.
The problem is that pendconn_grab_from_px() takes the proxy lock while
already under the server's lock while process_srv_queue() first takes the
proxy's lock then the server's lock.
While the latter seems more natural, it is fundamentally incompatible with
mayn other operations performed on servers, namely state change propagation,
where the proxy is only known after the server and cannot be locked around
the servers. Howwever reversing the lock in process_srv_queue() is trivial
and only the few functions related to dynamic cookies need to be adjusted
for this so that the proxy's lock is taken for each server operation. This
is possible because the proxy's server list is built once at boot time and
remains stable. So this is what this patch does.
The comments in the proxy and server structs were updated to mention this
rule that the server's lock may not be taken under the proxy's lock but
may enclose it.
Another approach could consist in using a second lock for the proxy's queue
which would be different from the regular proxy's lock, but given that the
operations above are rare and operate on small servers list, there is no
reason for overdesigning a solution.
This fix was successfully tested with 10000 servers in a backend where
adjusting the dyncookies in loops over the CLI didn't have a measurable
impact on the traffic.
The only workaround without the fix is to disable any occurrence of
"slowstart" on server lines, or to disable threads using "nbthread 1".
This must be backported as far as 1.8.
When a lua action or a lua sample fetch is called, a lua transaction is
created. It is an entry in the stack containing the class TXN. Thanks to it, we
can know the direction (request or response) of the call. But, for some
functions, it is also necessary to know if the buffer is "HTTP ready" for the
given direction. "HTTP ready" means there is a valid HTTP message in the
channel's buffer. So, when a lua action or a lua sample fetch is called, the
flag HLUA_TXN_HTTP_RDY is set if it is appropriate.
This one was added by commit daacf3664 ("BUG/MEDIUM: protocols: add a
global lock for the init/deinit stuff") but I forgot to add it to the
include file, breaking DEBUG_THREAD.
There is no standard case for HTTP header names because, as stated in the
RFC7230, they are case-insensitive. So applications must handle them in a
case-insensitive manner. But some bogus applications erroneously rely on the
case used by most browsers. This problem becomes critical with HTTP/2
because all header names must be exchanged in lowercase. And HAProxy uses the
same convention. All header names are sent in lowercase to clients and servers,
regardless of the HTTP version.
This design choice is linked to the HTX implementation. So, for previous
versions (2.0 and 1.9), a workaround is to disable the HTX mode to fall
back to the legacy HTTP mode.
Since the legacy HTTP mode was removed, some users reported interoperability
issues because their application was not able anymore to handle HTTP/1 message
received from HAProxy. So, we've decided to add a way to change the case of some
headers before sending them. It is now possible to define a "mapping" between a
lowercase header name and a version supported by the bogus application. To do
so, you must use the global directives "h1-case-adjust" and
"h1-case-adjust-file". Then options "h1-case-adjust-bogus-client" and
"h1-case-adjust-bogus-server" may be used in proxy sections to enable the
conversion. See the configuration manual for more info.
Of course, our advice is to urgently upgrade these applications for
interoperability concerns and because they may be vulnerable to various types of
content smuggling attacks. But, if your are really forced to use an unmaintained
bogus application, you may use these directive, at your own risks.
If it is relevant, this feature may be backported to 2.0.
Dragan Dosen found that the listeners lock is not sufficient to protect
the listeners list when proxies are stopping because the listeners are
also unlinked from the protocol list, and under certain situations like
bombing with soft-stop signals or shutting down many frontends in parallel
from multiple CLI connections, it could be possible to provoke multiple
instances of delete_listener() to be called in parallel for different
listeners, thus corrupting the protocol lists.
Such operations are pretty rare, they are performed once per proxy upon
startup and once per proxy on shut down. Thus there is no point trying
to optimize anything and we can use a global lock to protect the protocol
lists during these manipulations.
This fix (or a variant) will have to be backported as far as 1.8.
Empty error files may be used to disable the sending of any message for specific
error codes. A common use-case is to use the file "/dev/null". This way the
default error message is overridden and no message is returned to the client. It
was supported in the legacy HTTP mode, but not in HTX. Because of a bug, such
messages triggered an error.
This patch must be backported to 2.0 and 1.9. However, the patch will have to be
adapted.
When forcing the outgoing address of a connection, till now we used to
allocate this outgoing connection and set the address into it, then set
SF_ADDR_SET. With connection reuse this causes a whole lot of issues and
difficulties in the code.
Thanks to the previous changes, it is now possible to store the target
address into the stream instead, and copy the address from the stream to
the connection when initializing the connection. assign_server_address()
does this and as a result SF_ADDR_SET now reflects the presence of the
target address in the stream, not in the connection. The http_proxy mode,
the peers and the master's CLI now use the same mechanism. For now the
existing connection code was not removed to limit the amount of tricky
changes, but the allocated connection is not used anymore.
This change also revealed a latent issue that we've been having around
option http_proxy : the address was set in the connection but neither the
SF_ADDR_SET nor the SF_ASSIGNED flags were set. It looks like the connection
could establish only due to the fact that it existed with a non-null
destination address.
The purpose will be to store the target address there and not to
allocate a connection just for this anymore. For now it's only placed
in the struct, a few fields were moved to plug some holes, and the
entry is freed on release (never allocated yet for now). This must
have no impact. Note that in order to fit, the store_count which
previously was an int was turned into a short, which is way more
than enough given that the hard-coded limit is 8.
Now addresses are dynamically allocated when needed. Each connection is
created with src=dst=NULL, these entries are allocated on the fly, and
released when the connection is released.
This commit places calls to sockaddr_alloc() at the places where an address
is needed, and makes sure that the allocation is properly tested. This does
not add too many error paths since connection allocations are already in the
vicinity and share the same error paths. For the two cases where a
clear_addr() was called, instead the address was not allocated.
This pool will be used to allocate storage for source and destination
addresses used in connections. Two functions sockaddr_{alloc,free}()
were added and will have to be used everywhere an address is needed.
These ones are safe for progressive replacement as they check that the
existing pointer is set before replacing it. The pool is not yet used
during allocation nor freeing. Also they operate on pointers to pointers
so they will perform checks and replace values. The free one nulls the
pointer.
This is in preparation for the switch to dynamic address allocation,
let's migrate the code using the old fields to the pointers instead.
Note that no extra check was added for now, the purpose is only to
get the code to use the pointers and still work.
In the proxy protocol message handling we make sure the addresses are
properly allocated before declaring them unset.
At the moment we're facing difficulties with connection reuse based on
the fact that connections may be allocated very early only to set a
target address in transparent mode. With the imminent removal of the
legacy mode, the connection reuse by a same stream will not exist
anymore and all this awful complexity is not justified anymore. However
we still need to be able to assign addresses somewhere.
Thus instead of allocating a connection, we'll only place addresses where
needed in the stream during operations. But this takes quite some room
(typically 128 bytes). This is a nice opportunity for cleaning all this
up and dynamically allocatating the addresses fields, which will result
in actually saving memory from connection structs since most of the time
the client's "to" address is not used and the server's "from" is not used
either, thus saving ~256 bytes per end-to-end connection.
For now these new "src" and "dst" pointers point to addr.from and addr.to.
This will allow us to smoothly update the whole code to use these pointers
prior to going further and switching them to pools.
These functions are not used anymore. They didn't report failures and
as such were often misused. conn_get_src() and conn_get_dst() now
replaced them everywhere.
The backend connect code uses conn_get_{from,to}_addr to forward addresses
in transparent mode and to map server ports, without really checking if the
operation succeeds. In preparation of future changes, let's switch to
conn_get_{src,dst}() and integrate status check for possible failures.
These functions currently are the same as conn_get_from_addr() and
conn_get_to_addr() respectively except that they return a status for
the operation that the caller can test.
Default HTTP error messages are stored in an array of chunks. And since the HTX
was added, these messages are also converted in HTX and stored in another
array. But now, the first array is not used anymore because the legacy HTTP mode
was removed.
So now, only the array with the HTX messages are kept. The other one was
removed.
The keywords req* and rsp* are now unsupported. So the corresponding lists are
now unused. It is safe to remove them from the structure proxy.
As a result, the code dealing with these rules in HTTP analyzers was also
removed.
It was announced for the 2.1. Following keywords are now unsupported:
* reqadd, reqallow, reqiallow, reqdel, reqidel, reqdeny, reqideny, reqpass,
reqipass, reqrep, reqirep reqtarpit, reqitarpit
* rspadd, rspdel, rspidel, rspdeny, rspideny, rsprep, rspirep
a fatal error is emitted if one of these keyword is found during the
configuraion parsing.
The option 'http-tunnel' is deprecated and it was only used in the legacy HTTP
mode. So this option is now totally ignored and a warning is emitted during
HAProxy startup if it is found in a configuration file.
The old module proto_http does not exist anymore. All code dedicated to the HTTP
analysis is now grouped in the file proto_htx.c. So, to finish the polishing
after removing the legacy HTTP code, proto_htx.{c,h} files have been moved in
http_ana.{c,h} files.
In addition, all HTX analyzers and related functions prefixed with "htx_" have
been renamed to start with "http_" instead.
Many flags of the HTTP transction (TX_*) are now unused and useless. So the
flags TX_WAIT_CLEANUP, TX_HDR_CONN_*, TX_CON_CLO_SET and TX_CON_KAL_SET were
removed. Most of TX_CON_WANT_* were also removed. Only TX_CON_WANT_TUN has been
kept.
First of all, all legacy HTTP analyzers and all functions exclusively used by
them were removed. So the most of the functions in proto_http.{c,h} were
removed. Only functions to deal with the HTTP transaction have been kept. Then,
http_msg and hdr_idx modules were entirely removed. And finally the structure
http_msg was lightened of all its useless information about the legacy HTTP. The
structure hdr_ctx was also removed because unused now, just like unused states
in the enum h1_state. Note that the memory pool "hdr_idx" was removed and
"http_txn" is now smaller.
This commit breaks the compatibility with filters still relying on the legacy
HTTP code. The legacy callbacks were removed (http_data, http_chunk_trailers and
http_forward_data).
For now, the filters must still set the flag FLT_CFG_FL_HTX to be used on HTX
streams.
Since the legacy HTTP mode is disbabled, all HTTP sample fetches work on HTX
streams. So it is safe to remove all code relying on HTTP legacy mode. Among
other things, the function smp_prefetch_http() was removed with the associated
macros CHECK_HTTP_MESSAGE_FIRST() and CHECK_HTTP_MESSAGE_FIRST_PERM().
Since the legacy HTTP mode is disabled and no multiplexer relies on it anymore,
there is no reason to have 2 multiplexer protocols for the HTTP. So the protocol
PROTO_MODE_HTX was removed and all HTTP multiplexers use now PROTO_MODE_HTTP.
Because the h2 multiplexer only uses the HTX mode, following H2 functions were
removed :
* h2_prepare_h1_reqline
* h2_make_h1_request()
* h2_make_h1_trailers()
Instead of using a array of (struct block), it is more natural and intuitive to
use an array of char. Indeed, not only (struct block) are stored in this array,
but also their payload.
<head> and <tail> fields are now signed 32-bits integers. For an empty HTX
message, these fields are set to -1. So the field <used> is now useless and can
safely be removed. To know if an HTX message is empty or not, we just compare
<head> against -1 (it also works with <tail>). The function htx_nbblks() has
been added to get the number of used blocks.
A long time ago, applets were seen as an alternative to connections,
and since their respective sizes were roughly equal it appeared wise
to share the same pool. Nowadays, connections got significantly larger
but applets are not that often used, except for the cache. However
applets are mostly complementary and not alternatives anymore, as
it's very possible not to have a back connection or to share one with
other streams.
The connections will soon lose their addresses and their size will
shrink so much that appctx won't fit anymore. Given that the old
benefits of sharing these pools have long disappeared, let's stop
doing this and have a dedicated pool for appctx.
Since commit 81492c989 ("MINOR: threads: flatten the per-thread cpu-map"),
we don't keep the proc*thread matrix anymore to represent the full binding
possibilities, but only the proc and thread ones. The problem is that the
per-process binding is not the same for each thread and for the process,
and the proc[] array was assumed to store the per-proc first thread value
when doing this change. Worse, the logic present there tries to deal with
thread ranges and process ranges in a way which automatically exclused the
other possibility (since ranges cannot be used on both) but as such fails
to apply changes if neither the process nor the thread is expressed as a
range.
The real problem comes from the fact that specifying cpu-map 1/1 doesn't
yet reveal if the per-process mask or the per-thread mask needs to be
updated. In practice it's the thread one but then the current storage
doesn't allow to store the binding of the first thread of each other
process in nbproc>1 configurations.
When removing the proc*thread matrix, what ought to have been kept was
both the thread column for process 1 and the process line for threads 1,
but instead only the thread column was kept. This patch reintroduces the
storage of the configuration for the first thread of each process so that
it is again possible to store either the per-thread or per-process
configuration.
As a partial workaround for existing configurations, it is possible to
systematically indicate at least two processes or two threads at once
and map them by pairs or more so that at least two values are present
in the range. E.g :
# set processes 1-4 to cpus 0-3 :
cpu-map auto:1-4/1 0 1 2 3
# or:
cpu-map 1-2/1 0 1
cpu-map 2-3/1 2 3
# set threads 1-4 to cpus 0-3 :
cpu-map auto:1/1-4 0 1 2 3
# or :
cpu-map 1/1-2 0 1
cpu-map 3/3-4 2 3
This fix must be backported to 2.0.
Move the logic to decide if we redispatch to a new server from
sess_update_st_cer() to a new inline function, stream_choose_redispatch(), and
use it in do_l7_retry() instead of just setting the state to SI_ST_REQ.
That way, when using L7 retries, we won't redispatch the request to another
server except if "option redispatch" is used.
This should be backported to 2.0.
Sometimes we need to delegate some list processing to a function running
on another thread. In this case the list element will simply be queued
into a dedicated self-locked list and the task responsible for this list
will be woken up, calling the associated function which will run over the
list.
This is what work_list does. Such lists will be dedicated to a limited
type of work but will significantly ease such remote handling. A function
is provided to create these per-thread lists, their tasks and to properly
bind each task to a distinct thread, so that the caller only has to store
the resulting pointer to the start of the structure.
These structures should not be abused though as each head will consume
4 pointers per thread, hence 32 bytes per thread or 2 kB for 64 threads.
When we're purging idle connections, there's a race condition, when we're
removing the connection from the idle list, to add it to the list of
connections to free, if the thread owning the connection tries to free it
at the same time.
To fix this, simply add a per-thread lock, that has to be hold before
removing the connection from the idle list, and when, in conn_free(), we're
about to remove the connection from every list. That way, we know for sure
the connection will stay valid while we remove it from the idle list, to add
it to the list of connections to free.
This should happen rarely enough that it shouldn't have any impact on
performances.
This has not been reported yet, but could provoke random segfaults.
This should be backported to 2.0.
The maximum number of idle connections for a server can be configured by setting
the server option "pool-max-conn". But when we try to add a connection in its
idle list, because of a wrong comparison, it may be rejected because there are
already "pool-max-conn - 1" idle connections.
This patch must be backported to 2.0 and 1.9.
While experimenting with potentially improved fairness and latency using
ticket locks on a Ryzen 16-thread/8-core, a very strange situation happened
a lot for some levels of traffic. Around 300k connections per second, no
more connections would be accepted on the multi-threaded listener but all
others would continue to work fine. All attempts to trace showed that the
threads were all in the trylock in the fd cache, or in the spinlock of
fd_update_events(), or in the one of fd_may_recv(). But as indicated this
was not a deadlock since the process continues to work fine.
After quite some investigation it appeared that the issue is caused by a
lack of fairness between the fdcache's trylock and these functions' spin
locks above. In fact, regardless of the success or failure of the fdcache's
attempt at grabbing the lock, the poller was calling fd_update_events()
which locks the FD once for something that can be done with a CAS, and
then calls fd_may_recv() with another lock for something that most often
didn't change. The high contention on these spinlocks leaves no chance to
any other thread to grab the lock using trylock(), and once this happens,
there is no thread left to process incoming connection events nor to stop
polling on the FD, leaving all threads at 100% CPU but partially operational.
This patch addresses the issue by using bit-test-and-set instead of the OR
in fd_may_recv() / fd_may_send() so that nothing is done if the FD was
already configured as expected. It does the same in fd_update_events()
using a CAS to check if the FD's events need to be changed at all or not.
With this patch applied, it became impossible to reproduce the issue, and
now there's no way to saturate all 16 CPUs with the load used for testing,
as no more than 1350-1400 were noticed at 300+kcps vs 1600.
Ideally this patch should go further and try to remove the remaining
incarnations of the fdlock as this seems possible, but it's difficult
enough to be done in a distinct patch that will not have to be backported.
It is possible that workloads involving a high connection rate may slightly
benefit from this patch and observe a slightly lower CPU usage even when
the service doesn't misbehave.
This patch must be backported to 2.0 and 1.9.
These calls can take quite some time and leave the thread harmless so
it's better to mark it as such. This makes "show sess" respond way
faster during high loads running on processes build with DEBUG_UAF
since these calls are stressed a lot.
When calling mmap(), in general the system gives us a page but does not
really allocate it until we first dereference it. And it turns out that
this time is much longer than the time to perform the mmap() syscall.
Unfortunately, when running with memory debugging enabled, we mmap/munmap()
each object resulting in lots of such calls and a high contention on the
allocator. And the first accesses to the page being done under the pool
lock is extremely damaging to other threads.
The simple fact of writing a 0 at the beginning of the page after
allocating it and placing the POOL_LINK pointer outside of the lock is
enough to boost the performance by 8x in debug mode and to save the
watchdog from triggering on lock contention. This is what this patch
does.
The malloc and free calls and especially the underlying mmap/munmap()
can occasionally take a huge amount of time and even cause the thread
to sleep. This is visible when haproxy is compiled with DEBUG_UAF which
causes every single pool allocation/free to allocate and release pages.
In this case, when using the locked pools, the watchdog can occasionally
fire under high contention (typically requesting 40000 1M objects in
parallel over 8 threads). Then, "perf top" shows that 50% of the CPU
time is spent in mmap() and munmap(). The reason the watchdog fires is
because some threads spin on the pool lock which is held by other threads
waiting on mmap() or munmap().
This patch modifies this so that the pool lock is released during these
syscalls. Not only this allows other threads to request try to allocate
their data in parallel, but it also considerably reduces the lock
contention.
Note that the locked pools are only used on small architectures where
high thread counts would not make sense, so this will not provide any
benefit in the general case. However it makes the debugging versions
way more stable, which is always appreciated.
In the function stream_int_notify(), when the opposite stream-interface is
blocked because there is no more room into the input buffer, if the flag
CF_WRITE_PARTIAL is set on this buffer, it is unblocked. It is a way to unblock
the reads on the other side because some data was sent.
But it is a problem during the fast-forwarding because only the stream is able
to remove the flag CF_WRITE_PARTIAL. So it is possible to have this flag because
of a previous send while the input buffer of the opposite stream-interface is
now full. In such case, the opposite stream-interface will be woken up for
nothing because its input buffer is full. If the same happens on the opposite
side, we will have a loop consumming all the CPU.
To fix the bug, the opposite side is now only notify if there is some available
room in its input buffer in the function si_cs_send(), so only if some data was
sent.
This patch must be backported to 2.0 and 1.9.
This code should be now used by action to stop at the same time the rules
processing and the possible following processings. And from its side, the return
code ACT_RET_STOP should be used to only stop rules processing.
So concretely, for TCP rules, there is no changes. ACT_RET_STOP and ACT_RET_DONE
are handled the same way. However, for HTTP rules, ACT_RET_STOP should now be
mapped on HTTP_RULE_RES_STOP and ACT_RET_DONE on HTTP_RULE_RES_DONE. So this
way, a action will have the possibilty to stop all processing or only rules
processing.
Note that changes about the TCP is done in this commit but changes about the
HTTP will be done in another one because it will fix a bug in the same time.
This patch must be backported to 2.0 because a bugfix depends on it.
When deciding if we keep an idle connection in the session, check if
the number of connections currently in the session is >= the max allowed,
not >, or we'll keep an extra connection.
This should be backported to 1.9 and 2.0.
Just calling conn_force_unsubscribe() from conn_upgrade_mux_fe() is not
enough, as there may be multiple XPRT involved. Instead, require that
any user of conn_upgrade_mux_fe() unsubscribe itself before calling it.
This should fix upgrading a TCP connection to HTX when using SSL.
This should be backported to 2.0.
The receive limit of an HTX channel must be calculated against the total size of
the HTX message. Otherwise, the buffer may never be seen as full whereas the
receive limit is 0. Indeed, the function channel_htx_full() already takes care
to add a block size to the buffer's reserve (8 bytes). So if the function
channel_htx_recv_limit() also keep a block size free in addition to the buffer's
reserve, it means that at least 2 block size will be kept free but only one will
be taken into account, freezing the stream if the option http-buffer-request is
enabled.
This patch fixes the Github issue #136. It should be backported to 2.0 and
1.9. Thanks jaroslawr (Jarosław Rzeszótko) for his help.
Revert commit fe4abe62c7.
The goal was to make sure for health-checks, we would not get sockets in
TIME_WAIT. To do so, we would not call shutdown() if linger_risk is set.
However that is wrong, and that means shutw would never be forwarded to
the server, and thus we could get connection that are never properly closed.
Instead, to fix the original problem as described here :
https://www.mail-archive.com/haproxy@formilux.org/msg34080.html
Just make sure the checks code call cs_shutr() before calling cs_shutw().
If shutr has been called, conn_sock_shutw() will make no attempt to call
shutdown(), as it knows close() will be called.
We should really review and revamp the shutr/shutw code, as described in
github issue #142.
This should be backported to 1.9 and 2.0.
When using a level lower than admin on the master CLI, a \n is output
before the response, this is caused by the response of the "operator" or
"user" that are sent before the actual command.
To fix this problem we introduce the flag APPCTX_CLI_ST1_NOLF which ask
a command response to not be followed by the final \n.
This patch made a special case with the command operator and user
followed by a - so they are not followed by \n.
This patch must be backported to 2.0 and 1.9.
As its name suggest, this function change the value length of a block. But it
also update the HTX message accordingly. It simplifies the HTX API. The function
htx_set_blk_value_len() is still available and must be used with caution because
this one does not update the HTX message. It just updates the HTX block. It
should be considered as an internal function. When possible,
htx_change_blk_value_len() should be used instead.
This function is used to fix a bug affecting the 2.0. So, this patch must be
backported to 2.0.
Server states can be recovered from either a "global" file (all backends)
or a "local" file (per backend).
The way the algorithm to parse the state file was first implemented was good
enough for a low number of backends and servers per backend.
Basically, for each backend the state file (global or local) is opened,
parsed entirely and for each line we check if it contains data related to
a server from the backend we're currently processing.
We must read the file entirely, just in case some lines for the current
backend are stored at the end of the file.
This does not scale at all!
This patch changes the behavior above for the "global" file only. Now,
the global file is read and parsed once and all lines it contains are
stored in a tree, for faster discovery.
This result in way much less fopen, fgets, and strcmp calls, which make
loading of very big state files very quick now.
In commit 86eded6c6 ("CLEANUP: tasks: rename task_remove_from_tasklet_list()
to tasklet_remove_*") which consisted in removing the casts between tasks
and tasklet, I was a bit too fast to believe that we only saw tasklets in
this function since process_runnable_tasks() also uses it with tasks under
a cast. So removing the bookkeeping on task_list_size was not appropriate.
Bah, the joy of casts which hide the real thing...
This patch does two things at once to address this mess once for all:
- it restores the decrement of task_list_size when it's a real task,
but moves it to process_runnable_task() since it's the only place
where it's allowed to call it with a task
- it moves the increment there as well and renames
task_insert_into_tasklet_list() to tasklet_insert_into_tasklet_list()
of obvious consistency reasons.
This way the increment/decrement of task_list_size is made at the only
places where the cast is enforced, so it has less risks to be missed.
The comments on top of these functions were updated to reflect that they
are only supposed to be used with tasklets and that the caller is responsible
for keeping task_list_size up to date if it decides to enforce a task there.
Now we don't have to worry anymore about how these functions work outside
of the scheduler, which is better longterm-wise. Thanks to Christopher for
spotting this mistake.
No backport is needed.
In conn_sock_shutw(), avoid calling shutdown() if linger_risk is set. Not
doing so will result in getting sockets in TIME_WAIT for some time.
This is particularly observable with health checks.
This should be backported to 1.9.
The function really only operates on tasklets, its arguments are always
tasklets cast as tasks to match the function's type, to be cast back to
a struct tasklet. Let's rename it to tasklet_remove_from_tasklet_list(),
take a struct tasklet, and get rid of the undesired task casts.
It's really confusing to call it a task because it's a tasklet and used
in places where tasks and tasklets are used together. Let's rename it
to tasklet to remove this confusion.
The first one, HTX_SL_F_HAS_SCHM, will be used to know the request has an
explicit scheme. So, in H2, it is always true because the pseudo-header
":scheme" is mandatory. In H1, it is only true when an absolute URI is found on
the start-line. The other flags, HTX_SL_F_SCHM_HTTP and HTX_SL_F_SCHM_HTTPS,
will be used to know which scheme the request have. For now, other protocols are
not handled.
The aim of these flags is to pass this information to the backend side in
general, and to the H2 mux in particular. So the multiplexer will have a chance
to use this information to send the right scheme to the server.
In the HTX structure, the field <first> is used to know where to (re)start the
analysis. It may differ from the message's head. It is especially important to
update it to handle 1xx messages, to be sure to restart the analysis on the next
message (another 1xx message or the final one). It is also updated when some
data are forwarded (the headers or part of the body). But this update is an
error and must never be done at the analysis level. It is a bug, because some
sample fetches may be used after the data forwarding (but before the first send
of course). At this stage, if the first block position does not point on the
start-line, most of HTTP sample fetches fail.
So now, when something is forwarding by HTX analyzers, the first block position
is not update anymore.
This issue was reported on Github. See #119. No backport needed.
When channel_full() is called for an HTX stream, we fall back on the HTX
version. This function is called, among other, from tcp_inspect_request(). With
this patch, the inspect delay is respected again.
This patch must be backported to 1.9.
With both I/O and tasks in the same tasklet list, we now have a very
smooth and responsive scheduler, providing a good fairness between I/O
activities. With the lower layers relying on tasklet a lot (I/O wakeup,
subscribe, etc), there may often be a large number of totally autonomous
tasklets doing their business such as forwarding data between two muxes.
But the task scheduler historically refrained from picking tasks from the
priority-ordered run queue to put them into the tasklet list until this
later had less than max_runqueue_depth entries. This was to make sure that
low-latency, high-priority tasks would have an opportunity to be dequeued
before others even if they arrive late. But the counter used for this is
still the tasklet list size, which contains countless I/O events. This
causes an unfairness between unbounded I/Os and bounded tasks, resulting
for example in the CLI responding slower when forwarding 40 Gbps of HTTP
traffic spread over a thousand of connections.
A good solution consists in sticking to the initial intent of
max_runqueue_depth which is to limit the number of tasks in the list
(to maintain fairness between them) and not to limit the number of these
tasks among tasklets. It just turns out that the task_list_size initially
was this task counter and changed over time to be a tasklet list size.
Let's simply refrain from updating it for pure tasklets so that it takes
back its original role of counting real tasks as its name implies. With
this change the CLI becomes instantly responsive under load again.
This patch may possibly be backported to 1.9 though it requires some
careful checks.
The function htx_add_data_before() was removed because it was buggy. The
function htx_move_blk_before() may be used if necessary to do something
equivalent, except it just moves blocks. It doesn't handle the adding.
In an HTX message, it may have 2 available rooms to store a new block. The first
one is between the blocks and their payload. Blocks are added starting from the
end of the buffer and their payloads are added starting from the begining. So
the first free room is between these 2 edges. The second one is at the begining
of the buffer, when we start to wrap to add new payloads. Once we start to use
this one, the other one is ignored until the next defragmentation of the HTX
message.
In theory, there is no problem. But in practice, some lacks in the HTX structure
force us to defragment too often HTX messages to always be in a known state. The
second free room is not tracked as it should do and the first one may be easily
corrupted when rewrites happen.
So to fix the problem and avoid unecessary defragmentation, the HTX structure
has been refactored. The front (the block's position of the first payload before
the blocks) is no more stored. Instead we keep the relative addresses of 3 edges:
* tail_addr : The start address of the free space in front of the the blocks
table
* head_addr : The start address of the free space at the beginning
* end_addr : The end address of the free space at the beginning
Here is the general view of the HTX message now:
head_addr end_addr tail_addr
| | |
V V V
+------------+------------+------------+------------+------------------+
| | | | | |
| PAYLOAD | Free space | PAYLOAD | Free space | Blocks area |
| ==> | 1 | ==> | 2 | <== |
+------------+------------+------------+------------+------------------+
<head_addr> is always lower or equal to <end_addr> and <tail_addr>. <end_addr>
is always lower or equal to <tail_addr>.
In addition;, to simplify everything, the blocks area are now contiguous. It
doesn't wrap anymore. So the head is always the block with the lowest position,
and the tail is always the one with the highest position.
The function htx_add_data_before() is buggy and cannot work. It first add a data
block and then move it before another one, passed in argument. The problem
happens when a defragmentation is done to add the new block. In this case, the
reference is no longer valid, because the blocks are rearranged. So, instead of
moving the new block before the reference, it is moved at the head of the HTX
message.
So this function has been removed. It was only used by the compression filter to
add a last data block before a TLR, EOT or EOM block. Now, the new function
htx_add_last_data() is used. It adds a last data block, after all others and
before any TLR, EOT or EOM block. Then, the next bock is get. It is the first
non-data block after data in the HTX message. The compression loop continues
with it.
This patch must be backported to 1.9.
This function provides an alternate way to leave a critical section run
under thread_isolate(). Currently, a thread may remain in thread_release()
without having the time to notice that the rdv mask was released and taken
again by another thread entering thread_isolate() (often the same that just
released it). This is because threads wait in harmless mode in the loop,
which is compatible with the conditions to enter thread_isolate(). It's
not possible to make them wait with the harmless bit off or we cannot know
when the job is finished for the next thread to start in thread_isolate(),
and if we don't clear the rdv bit when going there, we create another
race on the start point of thread_isolate().
This new synchronous variant of thread_release() makes use of an extra
mask to indicate the threads that want to be synchronously released. In
this case, they will be marked harmless before releasing their sync bit,
and will wait for others to release their bit as well, guaranteeing that
thread_isolate() cannot be started by any of them before they all left
thread_sync_release(). This allows to construct synchronized blocks like
this :
thread_isolate()
/* optionally do something alone here */
thread_sync_release()
/* do something together here */
thread_isolate()
/* optionally do something alone here */
thread_sync_release()
And so on. This is particularly useful during initialization where several
steps have to be respected and no thread must start a step before the
previous one is completed by other threads.
This one must not be placed after any call to thread_release() or it would
risk to block an earlier call to thread_isolate() which the current thread
managed to leave without waiting for others to complete, and end up here
with the thread's harmless bit cleared, blocking others. This might be
improved in the future.
As reported in GH issue #109 and in discourse issue
https://discourse.haproxy.org/t/haproxy-returns-408-or-504-error-when-timeout-client-value-is-every-25d
the time parser doesn't error on overflows nor underflows. This is a
recurring problem which additionally has the bad taste of taking a long
time before hitting the user.
This patch makes parse_time_err() return special error codes for overflows
and underflows, and adds the control in the call places to report suitable
errors depending on the requested unit. In practice, underflows are almost
never returned as the parsing function takes care of rounding values up,
so this might possibly happen on 64-bit overflows returning exactly zero
after rounding though. It is not really possible to cut the patch into
pieces as it changes the function's API, hence all callers.
Tests were run on about every relevant part (cookie maxlife/maxidle,
server inter, stats timeout, timeout*, cli's set timeout command,
tcp-request/response inspect-delay).
When we look up an dictionary entry in the cache used upon transmission
we store the last result in ->prev_lookup of struct dcache_tx so that
to compare it with the subsequent entries to look up and save performances.
When allocating new dictionary entries we store the length of the strings.
May be useful so that not to have to call strlen() too much often at runing
time.
We store pointers to server names dictionary entries in a pre-allocated array of
ebpt_node's (->entries member of struct dcache_tx) to cache those sent to remote
peers. Consequently the ID used to identify the server name dictionary entry is
also used as index for this array. There is no need to implement a lookup by key
for this dictionary cache.
The fd_sets we've been using in the log encoding functions are not portable
and were shown to break at least under Cygwin. This patch gets rid of them
in favor of the new bitmap functions. It was verified with the config below
that the log output was exactly the same before and after the change :
defaults
mode http
option httplog
log stdout local0
timeout client 1s
timeout server 1s
timeout connect 1s
frontend foo
bind :8001
capture request header chars len 255
backend bar
option httpchk "GET" "/" "HTTP/1.0\r\nchars: \x01\x02\x03\x04\x05\x06\x07\x08\x09\x0b\x0c\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20\x21\x22\x23\x24\x25\x26\x27\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f\x40\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f\x50\x51\x52\x53\x54\x55\x56\x57\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f\x60\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
server foo 127.0.0.1:8001 check
We now have ha_bit_{set,clr,flip,test} to manipulate bitfields made
of arrays of longs. The goal is to get rid of the remaining non-portable
FD_{SET,CLR,ISSET} that still exist at a few places.
Just like we have a synchronous recv() function for the stream interface,
let's have a synchronous send function that we'll be able to call from
different places. For now this only moves the code, nothing more.
We should not update the two directions at once, in fact we should update
the Rx path after recv() and the Tx path after send(). Let's start by
splitting the update function in two for this.
The purpose of making idle-conns switch to SI_ST_CON was to make the
transition detectable and the operation retryable in case of connection
error. Now we have the RDY state for this which is much more suitable
since it indicates a validated connection on which we didn't necessarily
send anything yet. This will still lead to a transition to EST while not
requiring unnatural write polling nor connect timeouts.
The main reason for all the trouble we're facing with stream interface
error or timeout reports during the connection phase is that we currently
can't make the difference between a connection attempt and a validated
connection attempt. It is problematic because we tend to switch early
to SI_ST_EST but can't always do what we want in this state since it's
supposed to be set when we don't need to visit sess_establish() again.
This patch introduces a new state betwen SI_ST_CON and SI_ST_EST, which
is SI_ST_RDY. It indicates that we've verified that the connection is
ready. It's a transient state, like SI_ST_DIS, that cannot persist when
leaving process_stream(). For now it is not set, only verified in various
tests where SI_ST_CON was used or SI_ST_EST depending on the cases.
The stream-int state diagram was minimally updated to reflect the new
state, though it is largely obsolete and would need to be seriously
updated.
The stream interface state checks involving ranges were replaced with
checks on a set of states, already revealing some issues. No issue was
fixed, all was replaced in a one-to-one mapping for easier control. Some
checks involving a strict difference were also replaced with fields to
be clearer. At this stage, the result must be strictly equivalent. A few
tests were also turned to their bit-field equivalent for better readability
or in preparation for upcoming changes.
The test performed in the SPOE filter was swapped so that the closed and
error states are evicted first and that the established vs conn state is
tested second.
At some places we do check for ranges of stream-int states but those
are confusing as states ordering is not well known (e.g. it's not obvious
that CER is between CON and EST). Let's create a bit field from states so
that we can match multiple states at once instead. The new enum si_state_bit
contains SI_SB_* which are state bits instead of state values. The function
si_state_in() indicates if the state in argument is one of those represented
by the bit mask in second argument.
Now that the various handshakes come with their own XPRT, there's no
need for the CONN_FL_SOCK* flags, and the conn_sock_want|stop functions,
so garbage-collect them.
Add a new XPRT that is used when using non-SSL handshakes, such as proxy
protocol or Netscaler, instead of taking care of it in conn_fd_handler().
This XPRT is installed when any of those is used, and it removes itself once
the handshake is done.
This should allow us to remove the distinction between CO_FL_SOCK* and
CO_FL_XPRT*.
Add a new method to xprt_ops, remove_xprt. When called, if the provided
xprt_ctx is the same as the xprt's underlying xprt_ctx, it then uses the
new xprt provided, otherwise it calls the remove_xprt method of the next
xprt.
The goal is to be able to add a temporary xprt, that removes itself from
the chain when it did what it had to do. This will be used to implement
a pseudo-xprt for anything that just requires a handshake (such as the
proxy protocol).
As the SSL code may have different needs than the upper layer, ie it may want
to receive when the upper layer wants to right, instead of directly forwarding
the subscribe to the underlying xprt, handle it ourself. The SSL code will
know remember any subscribe call, and wake the tasklet when it is ready
for more I/O.
This type of blocks is useless because transition between data and trailers is
obvious. And when there is no trailers, the end-of-message is still there to
know when data end for chunked messages.
HTTP trailers are now parsed in the same way headers are. It means trailers are
converted to K/V blocks followed by an end-of-trailer marker. For now, to make
things simple, the type for trailer blocks are not the same than for header
blocks. But the aim is to make no difference between headers and trailers by
using the same type. Probably for the end-of marker too.
It was only done for the headers (including the EOH marker). data were prefixed
by the info field of these blocks. The payload and the trailers of the messages
were stored in raw. The total size of headers and payload were kept in the
cached object state to help output formatting.
Now, info about each HTX block is store in the cache. Only data are allowed to
be splitted. Otherwise, all blocks of an HTX message are handled the same way,
both when storing a message in the cache and when delivering it from the
cache. This will help the cache implementation to be more robust to internal
changes in the HTX. Especially for the upcoming parsing of trailers. There is
also no more need to keep extra info in the cached object state.
In channel_htx_forward() and channel_htx_forward_forever(), if the HTX message
is empty, the underlying buffer may be really empty too. And we have no warranty
the caller will call htx_to_buf() later. And in practice, it is almost never
done. So the channel's buffer must not be altered. Otherwise, the buffer may be
considered as full (data == size) for an empty HTX message and no outgoing data.
This patch must be backported to 1.9.
Make usage of the APIs implemented for dictionaries (dict.c) and their LRU caches (struct dcache)
so that to send/receive server names used for the server by name stickiness. These
names are sent over the network as follows:
- in every case we send the encode length of the data (STD_T_DICT), then
- if the server names is not present in the cache used upon transmission (struct dcache_tx)
we cache it and we the ID of this TX cache entry followed the encode length of the
server name, and finally the sever name itseft (non NULL terminated string).
- if the server name is present, we repead these operations but we only send the TX cache
entry ID.
Upon receipt, the couple of (cache IDs, server name) are stored the LRU cache used
only upon receipt (struct dcache_rx). As the peers protocol is symetrical, the fact
that the server name is present in the received data (resp. or not) denotes if
the entry is absent (resp. or not).
This simple patch only adds definitions to create a new stick-table
data type ID and a new standard type to store information in relation
wich dictionary entries (STD_T_DICT).
We want to send some stick-table data fields stored as strings in dictionaries
without consuming too much memory and CPU. To do so we implement with this patch
a cache for send/received dictionaries entries. These dictionary of strings entries are
stored in others real dictionary entries with an identifier as key (unsigned int)
and a pointer to the dictionary of strings entries as values.
This patch adds minimalistic definitions to implement dictionary new data structure
which is an ebtree of ebpt_node structs with strings as keys. Note that this has nothing
to see with real dictionary data structure (maps of keys in association with values).
As reported in GH issue #99, when hard-stop-after triggers and threads
are in use, the chance that any thread releases the resources in use by
the other ones is non-null. Thus no thread should be allowed to deinit()
nor exit by itself.
Here we take a different approach. We simply use a 3rd possible value
for the "killed" variable so that all threads know they must break out
of the run-poll-loop and immediately stop.
This patch was tested by commenting the stream_shutdown() calls in
hard_stop() to increase the chances to see a stream use released
resources. With this fix applied, it never crashes anymore.
This fix should be backported to 1.9 and 1.8.
Have "socks4" and "check-via-socks4" server keyword added.
Implement handshake with SOCKS4 proxy server for tcp stream connection.
See issue #82.
I have the "SOCKS: A protocol for TCP proxy across firewalls" doc found
at "https://www.openssh.com/txt/socks4.protocol". Please reference to it.
[wt: for now connecting to the SOCKS4 proxy over unix sockets is not
supported, and mixing IPv4/IPv6 is discouraged; indeed, the control
layer is unique for a connection and will be used both for connecting
and for target address manipulation. As such it may for example report
incorrect destination addresses in logs if the proxy is reached over
IPv6]
Remove the active_tasks_mask variable, we can deduce if we've work to do
by other means, and it is costly to maintain. Instead, introduce a new
function, thread_has_tasks(), that returns non-zero if there's tasks
scheduled for the thread, zero otherwise.
Add session flags, and add a new flag, SESS_FL_PREFER_LAST, to be set when
we use NTLM authentication, and we should reuse the last connection. This
should fix using NTLM with HTX. This totally replaces TX_PREFER_LAST.
This should be backported to 1.9.
In lock profiles it's visible that there is a huge contention on the
buffer lock. The reason is that when offer_buffers() is called, it
systematically takes the lock before verifying if there is any
waiter. However doing so doesn't protect against races since a
waiter can happen just after we release the lock as well. Similarly
in h2 we take the lock every time an h2c is going to be released,
even without checking that the h2c belongs to a wait list. These
two have now been addressed by verifying non-emptiness of the list
prior to taking the lock.
Haproxy is designed to be able to continue to run even under very low
memory conditions. However this can sometimes have a serious impact on
performance that it hard to diagnose. Let's report counters of failed
pool and buffer allocations per thread in show activity.
We have been abusing the do_poll()'s timeout for a while, making it zero
whenever there is some known activity. The problem this poses is that it
complicates activity diagnostic by incrementing the poll_exp field for
each known activity. It also requires extra computations that could be
avoided.
This change passes a "wake" argument to say that the poller must not
sleep. This simplifies the operations and allows one to differenciate
expirations from activity.
In order to later allow htx_add_data() to transmit partial blocks and
avoid defragmenting the buffer, we'll need to return the number of bytes
consumed. This first modification makes the function do this and its
callers take this into account. At the moment the function still works
atomically so it returns either the block size or zero. However all
call places have been adapted to consider any value between zero and
the block size.
The functions channel_htx_fwd_payload() and channel_htx_fwd_all() should now be
used to forward, respectively, a part of the HTX payload or all of it. These
functions forward data and update the first block position.
We don't store the start-line position anymore in the HTX message. Instead we
store the first block position to analyze. For now, it is almost the same. But
once all changes will be made on this part, this position will have to be used
by HTX analyzers, and only in the analysis context, to know where the analyse
should start.
When new blocks are added in an HTX message, if the first block position is not
defined, it is set. When the block pointed by it is removed, it is set to the
block following it. -1 remains the value to unset the position. the first block
position is unset when the HTX message is empty. It may also be unset on a
non-empty message, meaning every blocks were already analyzed.
From HTX analyzers point of view, this position is always set during headers
analysis. When they are waiting for a request or a response, if it is unset, it
means the analysis should wait. But once the analysis is started, and as long as
headers are not forwarded, it points to the message start-line.
As mentionned, outside the HTX analysis, no code must rely on the first block
position. So multiplexers and applets must always use the head position to start
a loop on an HTX message.
The function channel_htx_fwd_headers() should now be used by HTX analyzers to
forward all headers of an HTX message, from the start-line to the corresponding
EOH. It takes care to update the star-line position.
The field hdrs_bytes has been added in the structure htx_sl. It should be used
to set how many bytes are help by all headers, from the start-line to the
corresponding EOH block. it must be set to -1 if it is unknown.
This functions should be used to get the maximum size for a block, not exceeding
the max amount of bytes passed in argument. Thus max may be set to -1 to have no
limit.
When channel_recv_max() is called for an HTX stream, we fall back on the HTX
version. This function is called from si_cs_recv(). This will let us pass the
max amount of bytes to read to HTX multiplexers.
Now, we only return the start-line. If not found, NULL is returned. No lookup is
performed and the HTX message is no more updated. It is now the caller
responsibility to update the position of the start-line to the right value. So
when it is not found, i.e sl_pos is set to -1, it means the last start-line has
been already processed and the next one has not been inserted yet.
It is mandatory to rely on this kind of warranty to store 1xx informational
responses and final reponse in the same HTX message.
It is the first block relatively to the start-line. So it is the start-line if
its position is set (sl_pos != -1), otherwise it is the head. The functions
htx_get_first() and htx_get_first_blk() can be used to get it. This change is
mandatory to consider 1xx informational messages as part of a response.
The head of an HTX message is heavily used whereas the wrap position is only
used when a block is added or removed. So it is more logical to store the head
position in the HTX message instead of the wrap one. The wrap position can be
easily deduced. To get it, the new function htx_get_wrap() may be used.
On armv7 haproxy doesn't work because of the fixes on the double-word
CAS. There are two issues. The first one is that the last argument in
case of dwcas is a pointer to the set of value and not a value ; the
second is that it's not enough to cast the data as (void*) since it will
be a single word. Let's fix this by using the pointers as an array of
long. This was tested on i386, armv7, x86_64 and aarch64 and it is now
fine. An alternate approach using a struct was attempted as well but it
used to produce less optimal code.
This fix must be backported to 1.9. This fixes github issue #105.
Cc: Olivier Houchard <ohouchard@haproxy.com>
The unused fd_del and fd_skip were being abused during debugging sessions
as general purpose event counters. With their removal, let's officially
have dedicated counters for such use cases. These counters are called
"ctr0".."ctr2" and are listed at the end when DEBUG_DEV is set.
The purpose is to manipulate rings made of series of buffers so that
it is possible to continue to work on a next buffer once one is full.
This will be used by muxes to deal with contention between multiple
streams and a single output buffer. No data is expected to span over
multiple buffers, all of them will be used like a regular buffer. This
will significantly limit the amount of changes and the code complexity
while still supporting larger output buffering.
The ring is made of a head and a tail indexes both of which point to a
buffer descriptor. At least one descriptor is always valid, so it could
be seen as a form of pagination always presenting one buffer. The root
of the ring is itself stored into a buffer descriptor so that the user
only has to declare a buffer array and to call br_init() on it in order
to use it.
It has not been used for many years, is unlikely to be reused and
conflicts with the similarly named macro in flt_trace, causing warnings
at build time when including debug.h in low-level files. Let's simply
remove it.
It's amazing that the value was still incremented under the date lock,
let's first use an atomic increment for the counter and move it out of
the date lock to reduce contention. These are just counters, we don't
need to take locks if we're not rotating, atomic ops are enough. This
patch does this, and leaves the lock for when the period is over. It's
important to note that some values might be added just before or just
after a rotation but this is not a problem since we don't care if a
value is counted in the previous or next period when it's exactly on
the edge. Great care was taken to ensure that the current counter is
always atomically updated.
Other minor cleanups were performed, such as avoiding to reload the
value from memory after a CAS, or using &~1 instead of two shifts to
remove the lowest bit.
Many times we've been missing per-process traffic statistics. While it
didn't make sense in multi-process mode, with threads it does. Thus we
now have a counter of bytes emitted by raw_sock, and a freq counter for
these as well. However, freq_ctr are limited to 32 bits, and given that
loads of 300 Gbps have already been reached over a loopback using
splicing, we need to downscale this a bit. Here we're storing 1/32 of
the byte rate, which gives a theorical limit of 128 GB/s or ~1 Tbps,
which is more than enough. Let's have fun re-reading this sentence in
2029 :-) The values can be read in "show info" output on the CLI.
SI_TKILL is for Linux. We're again in the non-portable area. Both OSes
use macros to define these values so we can #ifdef them. Let's make
SI_TKILL defined based on SI_LWP when only the latter is defined.
These commands don't follow the same flow as the rest of the commands,
each of them iterates over all header lines before switching to the
next directive. In addition they make no distinction between start
line and headers and can lead to unparsable rewrites which are very
difficult to deal with internally.
Most of them are still occasionally found in configurations, mainly
because of the usual "we've always done this way". By marking them
deprecated and emitting a warning and recommendation on first use of
each of them, we will raise users' awareness of users regarding the
cleaner, faster and more reliable alternatives.
Some use cases of "reqrep" still appear from time to time for URL
rewriting that is not so convenient with other rules. But at least
users facing this requirement will explain their use case so that we
can best serve them. Some discussion started on this subject in a
thread linked to from github issue #100.
The goal is to remove them in 2.1 since they require to reparse the
result before indexing it and we don't want this hack to live long.
The following directives were marked deprecated :
-reqadd
-reqallow
-reqdel
-reqdeny
-reqiallow
-reqidel
-reqideny
-reqipass
-reqirep
-reqitarpit
-reqpass
-reqrep
-reqtarpit
-rspadd
-rspdel
-rspdeny
-rspidel
-rspideny
-rspirep
-rsprep
Mustafa Yildirim reported in Discourse that ports >32767 advertised
in SRV records are wrong. Given the high value they definitely
correspond to a sign extension of a negative number. The cause was
indeed that the port is declared as a signed int in the dns_answer_item
structure, and Lukas confirmed in github issue #103 that turning it to
unsigned addresses the issue.
It is worth noting that there are other such fields in this structure
that don't look right (ttl, priority, class, type) and that someone
should audit this part to be certain they are properly typed.
This fix must be backported to 1.9 and likely to 1.8 as well.
We still have quite a number of build macros which are mapped 1:1 to a
USE_something setting in the makefile but which have a different name.
This patch cleans this up by renaming them to use the USE_something
one, allowing to clean up the makefile and make it more obvious when
reading the code what build option needs to be added.
The following renames were done :
ENABLE_POLL -> USE_POLL
ENABLE_EPOLL -> USE_EPOLL
ENABLE_KQUEUE -> USE_KQUEUE
ENABLE_EVPORTS -> USE_EVPORTS
TPROXY -> USE_TPROXY
NETFILTER -> USE_NETFILTER
NEED_CRYPT_H -> USE_CRYPT_H
CONFIG_HAP_CRYPT -> USE_LIBCRYPT
CONFIG_HAP_NS -> DUSE_NS
CONFIG_HAP_LINUX_SPLICE -> USE_LINUX_SPLICE
CONFIG_HAP_LINUX_TPROXY -> USE_LINUX_TPROXY
CONFIG_HAP_LINUX_VSYSCALL -> USE_LINUX_VSYSCALL
It seems it's not defined on FreeBSD while it's mentioned on Linux that
clock_gettime() can be detected using this. Given that we also have the
test for _POSIX_TIMERS>0 that should cover it well enough. If it breaks
on other systems, we'll see.
Report was here :
https://github.com/haproxy/haproxy/runs/133866993
We currently have the ability to register functions to be called early
on thread creation and at thread deinitialization. It turns out this is
not sufficient because certain such functions may use resources that are
being allocated by the other ones, thus creating a race condition depending
only on the linking order. For example the mworker needs to register a
file descriptor while the pollers will reallocate the fd_updt[] array.
Similarly logs and trashes may be used by some init functions while it's
unclear whether they have been deduplicated.
The same issue happens on deinit, if the fd_updt[] or trash is released
before some functions finish to use them, we'll get into trouble.
This patch creates a couple of early and late callbacks for per-thread
allocation/freeing of resources. A few init functions were moved there,
and the fd init code was split between the two (since it used to both
allocate and initialize at once). This way the init/deinit sequence is
expected to be safe now.
This patch should be backported to 1.9 as at least the trash/log issue
seems to be present. The run_thread_poll_loop() code is a bit different
there as the mworker is not a callback, but it will have no effect and
it's enough to drop the mworker changes.
This bug was reported by Ilya Shipitsin in github issue #104.
This will be used by the watchdog to detect that a thread locked up.
It's only defined on platforms supporting it. This patch only reserves
the room for the timer in the struct. A special value was reserved for
the uninitialized timer. The problem is that the POSIX API was horribly
designed, defining no invalid value, thus for each timer it is required
to keep a second variable to indicate whether it's valid. A quick check
shows that defining a 32-bit invalid value is not something uncommon
across other implementations, with ~0 being common. Let's try with this
and if it causes issues we can revisit this decision.
This flag is constantly cleared by the scheduler and will be set by the
watchdog timer to detect stuck threads. It is also set by the "show
threads" command so that it is easy to spot if the situation has evolved
between two subsequent calls : if the first "show threads" shows no stuck
thread and the second one shows such a stuck thread, it indicates that
this thread didn't manage to make any forward progress since the previous
call, which is extremely suspicious.
This function dumps a lot of information about a stream into the provided
buffer. It is now used by stream_dump_and_crash() and will be used by the
debugger as well.
These functions are used respectively to signal one thread or all threads.
When multithreading is disabled, it's always the current thread which is
signaled.
Some code starts to add ifdefs everywhere to work around the lack of
threads_harmless_mask when threads are not compiled in. This one is
often used to indicate a thread having joined the rendez-vous point or
a thread sleeping in the poller. By setting it to zero we translate
what usually is required in debugging code (i.e. the only thread is
currently working) and for signal handlers we can use a combination of
threads_harmless_mask and sleeping_threads_mask to detect the polling
cases as well. Similarly do the same with threads_want_rdv_mask which
is less often used though.
The struct mworker_proc is not uniformly freed everywhere, sometimes leading
to leaks of the `id` string (and possibly the other strings).
Introduce a mworker_free_child function instead of duplicating the freeing
logic everywhere to prevent this kind of issues.
This leak was reported in issue #96.
It looks like the leaks have been introduced in commit 9a1ee7ac31,
which is specific to 2.0-dev. Backporting `mworker_free_child` might be
helpful to ease backporting other fixes, though.
Some structures have optional fields which depend on availability of
certain features on certain platforms, and having to stuff lots of
ifdefs in these structs makes them unreadable. Using real values like
ints requires some initialization and adds even more confusion.
Here we take a different approach : we create an empty type called
empty_t to use as a substitute for the real type that is not implemented
and which doesn't contain any value (it's an empty struct). Thus it has
a size of zero but an address, thus a pointer may point to it. It will
not have to be initialized though. Some initialization code might even
continue to work and do nothing like initializing it using memset with
its sizeof which is zero.
The clock_gettime() man page says we must check that _POSIX_TIMERS is
defined to a value greater than zero, not just that it's simply defined
so let's fix this right now.
Event ports are kqueue/epoll polling class for Solaris. Code is based
on https://github.com/joyent/haproxy-1.8/tree/joyent/dev-v1.8.8.
Event ports are available only on SunOS systems derived from
Solaris 10 and later (including illumos systems).
Since we're likely to access this thread_info struct more frequently in
the future, let's reserve the thread-local symbol to access it directly
and avoid always having to combine thread_info and tid. This pointer is
set when tid is set.
In order to ease the internal time API, we'll have the threads time always
present even when threads are disabled. Let's make sure clockid_t, and the
minimum clock times are defined even on older or non-compatible systems.
It doesn't make sense to keep this struct thread_info in global.h, it
causes difficulties to access its contents from hathreads.h, let's move
it to the threads where it ought to have been created.
Historically standard.h was the location where we used to (re-)define the
standard set of macros and functions, and to complement the ones missing
on the target OS. Over time it has become a toolbox in itself relying on
many other things, and its definition of LONGBITS is used everywhere else
(e.g. for MAX_THREADS), resulting in painful circular dependencies.
Let's move these few defines (integer sizes) to compat.h where other
similar definitions normally are.
It's a bit too easy to crash by accident when using dump_hex() on any
area. Let's have a function to check if the memory may safely be read
first. This one abuses the stat() syscall checking if it returns EFAULT
or not, in which case it means we're not allowed to read from there. In
other situations it may return other codes or even a success if the
area pointed to by the file exists. It's important not to abuse it
though and as such it's tested only once per output line.
This function dumps all existing threads using the thread dump mechanism
then aborts. This will be used by the lockup detection and by debugging
tools.
This is the per-thread CPU runtime clock, it will be used to measure
the CPU usage of each thread and by the lockup detection mechanism. It
must only be retrieved at the beginning of run_thread_poll_loop() since
the thread must already have been started for this. But it must be done
before performing any per-thread initcall so that all thread init
functions have access to the clock ID.
Note that it could make sense to always have this clockid available even
in non-threaded situations and place the process' clock there instead.
But it would add portability issues which are currently easy to deal
with by disabling threads so it may not be worth it for now.
This way we'll be able to store more per-thread information than just
the pthread pointer. The storage became an array of struct instead of
an allocated array since it's very small (typically 512 bytes) and not
worth the hassle of dealing with memory allocation on this. The array
was also renamed thread_info to make its intended usage more explicit.
Now that we have the guarantee that init calls happen before any other
thread starts, we don't need anymore the workaround installed by commit
1605c7ae6 ("BUG/MEDIUM: threads/mworker: fix a race on startup") and we
can instead rely on a regular per-thread initcall for this function. It
will only be performed on worker thread #0, the other ones and the master
have nothing to do, just like in the original code that was only moved
to the function.
The current "show threads" command was too limited as it was not possible
to dump other threads' detailed states (e.g. their tasks). This patch
goes further by using thread signals so that each thread can dump its
own state in turn into a shared buffer provided by the caller. Threads
are synchronized using a mechanism very similar to the rendez-vous point
and using this method, each thread can safely dump any of its contents
and the caller can finally report the aggregated ones from the buffer.
It is important to keep in mind that the list of signal-safe functions
is limited, so we take care of only using chunk_printf() to write to a
pre-allocated buffer.
This mechanism is enabled by USE_THREAD_DUMP and is enabled by default
on Linux 2.6.28+. On other platforms it falls back to the previous
solution using the loop and the less precise dump.
With the thread debugger it becomes visible that we can leave some
wandering pointers for a while in curr_task, which is inappropriate.
This patch addresses this by resetting curr_task to NULL before really
freeing the area. This way it becomes safe even regarding signals.
At some places we're using a painful ifdef to decide whether to use
sched_yield() or pl_cpu_relax() to relax in loops, this is hardly
exportable. Let's move this to ha_thread_relax() instead and une
this one only.
Instead of having them dump into the trash and initialize it, let's have
the caller initialize a buffer and pass it. This will be convenient to
dump multiple threads at once into a single buffer.
The new function ha_thread_dump() will dump debugging info about all known
threads. The current thread will contain a bit more info. The long-term goal
is to make it possible to use it in signal handlers to improve the accuracy
of some dumps.
The function dumps its output into the trash so as it was trivial to add,
a new "show threads" command appeared on the CLI.
Gil Bahat reported build issues on Cygwin starting with 1.9 due to a
difference in the way the linker handles the weak symbols there,
causing multiple declarations of ist_lc[] and ist_uc[]. It's likely
that this issue could also happen on any older or non-ELF linker.
This patch addresses this by using literals instead on such platforms,
leaving it to the compiler to merge the constants when it can. On other
platforms the resulting executable is slightly larger due to strings
that could not be merged but this is a minor detail compared to not
being able to build at all.
If this change alone is confirmed to fix these issues, it's safe to
backport to 1.9.
We do have some code paths testing for impossible errors that tend to
be quite confusing, first for maintenance (what to do on such errors,
and how far to guess the bug), second for developers as it tends to
hide the main purpose and expectations of these call places. Also
most of the time impossible errors are ignored by the callers so the
tests are not even usable during debugging.
Let's instead implement a BUG_ON macro which takes a condition, which
if true, will cause a message to be emitted and optionally to crash the
process. Additionally, these calls inserted at various places server as
hints and documentation for developers to know that such conditions
must absolutely not happen.
This is only enabled when DEBUG_STRICT or DEBUG_STRICT_NOCRASH are set.
As its name implies, DEBUG_STRICT_NOCRASH only performs the test but
does not crash, which can be useful to track some checkpoints.
At the moment nothing uses this code.
On recent gcc versions with the null-deref checks, ABORT_NOW() rightfully
emits such a warning. But here it's on purpose. Simply changing the memory
address to 1 makes gcc happy.
It was only set and not consumed after the previous change. The reason
is that the task's context always contains the relevant information,
so there is no need for a second pointer.
Some code parts use LIST_ISEMPTY() a lot on list elements to detect
if they were reset consecutive to their removal from a list, but this
test is always confusing as this was initially designed for list heads.
Instead let's have a new macro, LIST_ADDED(), which returns true when
the element is in a list (i.e. it's not "empty").
In conn_xprt_close(), after calling xprt->close(), don't forget to set
conn->xprt_ctx to NULL, or we may attempt to reuse the now-free'd
conn->xprt_ctx if the connection failed and we're retrying it.
This low-level asm implementation of a double CAS was implemented only
for certain architectures (x86_64, armv7, armv8). When threads are not
used, they were not defined, but since they were called directly from
a few locations, they were causing build issues on certain platforms
with threads disabled. This was addressed in commit f4436e1 ("BUILD:
threads: Add __ha_cas_dw fallback for single threaded builds") by
making it fall back to HA_ATOMIC_CAS() when threads are not defined,
but this actually made the situation worse by breaking other cases.
This patch fixes this by creating a high-level macro HA_ATOMIC_DWCAS()
which is similar to HA_ATOMIC_CAS() except that it's intended to work
on a double word, and which rely on the asm implementations when threads
are in use, and uses its own open-coded implementation when threads are
not used. The 3 call places relying on __ha_cas_dw() were updated to
use HA_ATOMIC_DWCAS() instead.
This change was tested on i586, x86_64, armv7, armv8 with and without
threads with gcc 4.7, armv8 with gcc 5.4 with and without threads, as
well as i586 with gcc-3.4 without threads. It will need to be backported
to 1.9 along with the fix above to fix build on armv7 with threads
disabled.
The following macros are now defined for openssl < 1.1 so that we
can remove the code performing direct access to the structures :
BIO_get_data(), BIO_set_data(), BIO_set_init(), BIO_meth_free(),
BIO_meth_new(), BIO_meth_set_gets(), BIO_meth_set_puts(),
BIO_meth_set_read(), BIO_meth_set_write(), BIO_meth_set_create(),
BIO_meth_set_ctrl(), BIO_meth_set_destroy()
Add a new action for http-request, disable-l7-retry, that can be used to
disable any attempt at retry requests (see retry-on) if it fails for any
reason other than a connection failure.
This is useful for example to make sure POST requests aren't retried.
__ha_cas_dw() is used in fd_rm_from_fd_list() and when built without
USE_THREADS=1 the linker fails to find __ha_cas_dw(). Add a definition
of __ha_cas_dw() for the #ifndef USE_THREADS case.
Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
It's always a pain to have to stuff lots of #ifdef USE_OPENSSL around
ssl headers, it even results in some of them appearing in a random order
and multiple times just to benefit form an existing ifdef block. Let's
make these headers safe for inclusion when USE_OPENSSL is not defined,
they now perform the test themselves and do nothing if USE_OPENSSL is
not defined. This allows to remove no less than 8 such ifdef blocks
and make include blocks more readable.
Since we're providing a compatibility layer for multiple OpenSSL
implementations and their derivatives, it is important that no C file
directly includes openssl headers but only passes via openssl-compat
instead. As a bonus this also gets rid of redundant complex rules for
inclusion of certain files (engines etc).
Some defines like OPENSSL_VERSION or X509_getm_notBefore() have nothing
to do in ssl_sock and must move to openssl-compat.h so that they are
consistently shared by the whole code. A warning in the code was added
against wild additions of macros there.
LibreSSL causes lots of build issues by pretending to be OpenSSL 2.0.0,
and it requires lots of care for each #if added to cover any specific
OpenSSL features.
This commit addresses the problem by making LibreSSL only advertise the
version it forked from (1.0.1g) and by starting to use tests based on
its real version to enable features instead of working by exclusion.
Most tests on OPENSSL_VERSION_NUMBER have become complex and break all
the time because this number is fake for some derivatives like LibreSSL.
This patch creates a new macro, HA_OPENSSL_VERSION_NUMBER, which will
carry the real openssl version defining the compatibility level, and
this version will be adjusted depending on the variants.
The tfo code was based on an old patch, and the value of the SRV_F_FASTOPEN
flag it used was since reused for SRV_F_COOKIESET. So give SRV_F_FASTOPEN
its own value.
When a sample fetch is encoded, we use its context to set info about the
fragmentation. But if the sample is not found, the function sample_process()
returns NULL. So we me be sure the sample exists before setting its context.
This patch must be backported to 1.9 and 1.8.
This patch implements a new global parameter for the master-worker mode.
When setting the mworker-max-reloads value, a worker receive a SIGTERM
if its number of reloads is greater than this value.
Since previous commit it's not needed anymore to test a task pointer
before calling task_destory() so let's just remove these tests from
the various callers before they become confusing. The function's
arguments were also documented. The same should probably be done
with tasklet_free() which involves a test in roughly half of the
call places.
Commit 3f795f7 ("MEDIUM: tasks: Merge task_delete() and task_free() into
task_destroy().") replaced task_delete() and task_free() with a single
function named task_destroy().
This patch adds a check for struct task* argument in function
task_destroy() to prevent a possible segfault on NULL and also to make
the function safer for use in other cases.
Now we atomically allocate the my_regex struct within function
regex_comp() and compile the regex or free both in case of failure. The
pointer to the allocated my_regex struct is returned directly. The
my_regex* argument to regex_comp() is removed.
Function regex_free() was modified so that it systematically frees the
my_regex entry. The function does nothing when called with a NULL as
argument (like free()). It will avoid existing risk of not properly
freeing the initialized area.
Other structures are also updated in order to be compatible (the ones
related to Lua and action rules).
With this patch we add a prefix to stick-table names declared in "peers" sections
concatenating the "peers" section name followed by a '/' character with
the stick-table name. Consequently, "peers" sections have their own
namespace for their stick-tables. Obviously, these stick-table names are not the
ones which should be sent over the network. So these configurations must be
compatible and should make A and B peers communicate with peers protocol:
# haproxy A config, old way stick-table declerations
peers mypeers
peer A ...
peer B ...
backend t1
stick-table type string size 10m store gpc0 peers mypeers
# haproxy B config, new way stick-table declerations
peers mypeers
peer A ...
peer B ...
table t1 type string size store gpc0 10m
This "network" name is stored in ->nid new field of stktable struct. The "local"
stktable-name is still stored in ->id.
Add a list of proxies for all the stick-tables (->proxies_list struct stktable
member) so that to be able to compute the process bindings of the peers after having
parsed the configuration file.
The proxies are added to the stick-tables they reference when parsing
stick-tables lines in proxy sections, when checking the actions in
check_trk_action() and when resolving samples args for stick-tables
without checking is they are duplicates. We check only there is no loop.
Then, after having parsed everything, we add the proxy bindings to the
peers frontend bindings with stick-tables they reference.
This patch adds the support for the "table" line parsing in "peers" sections
to declare stick-table in such sections. This also prevents the user from having
to declare dummy backends sections with a unique stick-table inside.
Even if still supported, this usage will become deprecated.
To do so, the ->table member of proxy struct which is a stktable struct is replaced
by a pointer to a stktable struct allocated at parsing time in src/cfgparse-listen.c
for the dummy stick-table backends and in src/cfgparse.c for "peers" sections.
This has an impact on the code for stick-table sample converters and on the stickiness
rules parsers which first store the name of the dummy before resolving the rules.
This patch replaces proxy_tbl_by_name() calls by stktable_find_by_name() calls
to lookup for stick-tables stored in "stktable_by_name" ebtree at parsing time.
There is only one remaining place where proxy_tbl_by_name() is used: src/hlua.c.
At several places in the code we relied on the fact that ->size member of stick-table
was equal to zero to consider the stick-table was present by not configured,
this do not make sense anymore as ->table member of struct proxyis fow now on a pointer.
These tests are replaced by a test on ->table value itself.
In "peers" section we do not have to temporary store the name of the section the
stick-table are attached to because this name is obviously already known just after
having entered this "peers" section.
About the CLI stick-table I/O handler, the pointer to proxy struct is replaced by
a pointer to a stktable struct.
With this patch we move the code responsible of parsing "stick-table"
lines to implement parse_stick_table() function in src/stick-tabble.c
so that to be able to parse "stick-table" elsewhere than in proxy sections.
We have have also added a conf struct to stktable struct to store the filename
and the line in the file the stick-table has been parsed to help in
diagnosing and displaying any configuration issue.
This implements support for the new API which relies on a call to
setsockopt().
On systems that support it (currently, only Linux >= 4.11), this enables
using TCP fast open when connecting to server.
Please note that you should use the retry-on "conn-failure", "empty-response"
and "response-timeout" keywords, or the request won't be able to be retried
on failure.
Co-authored-by: Olivier Houchard <ohouchard@haproxy.com>
The connect() method had 2 arguments, "data", that tells if there's pending
data to be sent, and "delack" that tells if we have to use a delayed ack
inconditionally, or if the backend is configured with tcp-smart-connect.
Turn that into one argument, "flags".
That way it'll be easier to provide more informations to connect() without
adding extra arguments.
SSL_SESSION_get0_id_context is introduced in LibreSSL-2.7.0
async operations are not supported by LibreSSL
early data is not supported by LibreSSL
packet_length is removed from SSL struct in LibreSSL
Add a way to retry requests if we got a junk response from the server, ie
an incomplete response, or something that is not valid HTTP.
To do so, one can use the new "junk-response" keyword for retry-on.
Add a new keyword for retry-on, 0rtt-rejected. If set, we will try to
replay requests for which we sent early data that got rejected by the
server.
If that option is set, we will attempt to use 0rtt if "allow-0rtt" is set
on the server line even if the client didn't send early data.
When running in HTX mode, if we sent the request, but failed to get the
answer, either because the server just closed its socket, we hit a server
timeout, or we get a 404, 408, 425, 500, 501, 502, 503 or 504 error,
attempt to retry the request, exactly as if we just failed to connect to
the server.
To do so, add a new backend keyword, "retry-on".
It accepts a list of keywords, which can be "none" (never retry),
"conn-failure" (we failed to connect, or to do the SSL handshake),
"empty-response" (the server closed the connection without answering),
"response-timeout" (we timed out while waiting for the server response),
or "404", "408", "425", "500", "501", "502", "503" and "504".
The default is "conn-failure".
Currently the thread array is a local variable inside a function block
and there is no access to it from outside, which often complicates
debugging. Let's make it global and export it. Also the allocation
return is now checked.
When we initially experimented with threads and processes support, we
needed to implement arrays of threads per process for cpu-map, but this
is not needed anymore since we support either threads or processes.
Let's simply make the thread-based cpu-map per thread and not per
thread and per process since that's not used anymore. Doing so reduces
the global struct from 33kB to 1.5kB.
When for some reason the session is not the owner of the connection anymore,
make sure we remove CO_FL_SESS_IDLE, even if we're about to call
conn->mux->destroy(), as the destroy may not destroy the connection
immediately if it's still in use.
This should be backported to 1.9.
u
In channel_erase(), don't forget to set output to 0, otherwise the
channel won't seem empty, when it really is, and that could lead to
stream never closing properly.
This should be backported to 1.9.
There is a bug when global.tune.maxaccept is set to -1 (no limit). It is pretty
visible with one process (nbproc sets to 1). The functions listener_accept() and
accept_queue_process() don't expect to handle negative maxaccept values. So
instead of accepting incoming connections without any limit, none are never
accepted and HAProxy loop infinitly in the scheduler.
When there are 2 or more processes, the bug is a bit more subtile. The limit for
a listener is set to 1. So only one connection is accepted at a time by a given
listener. This happens because the listener's maxaccept value is an unsigned
integer. In check_config_validity(), it is first set to UINT_MAX (-1 casted in
an unsigned integer), and then some calculations on it leads to an integer
overflow.
To fix the bug, the listener's maxaccept value is now a signed integer. So, if a
negative value is set for global.tune.maxaccept, we keep it untouched for the
listener and no calculation is made on it. Then, in the listener code, this
signed value is casted to a unsigned one. It simplifies all tests instead of
dealing with negative values. So, it limits the number of connections accepted
at a time to UINT_MAX at most. But, honestly, it not an issue.
This patch must be backported to 1.9 and 1.8.
Port range uses a ring buffer, and unfortunately, when making haproxy
multithreaded, it's been overlooked, and the ring buffer is not thread-safe.
When specifying a source range, 2 or more threads could pick the same
port, and of course only one of them could use the port, the others would
always fail the connection.
To fix this, make it a lock-free ring buffer. This is easier than usual
because we know the ring buffer can never be full.
This should be backported to 1.8 and 1.9.
The same way we have HA_ATOMIC_STORE(), implement HA_ATOMIC_LOAD().
This should be backported to 1.8 and 1.9, as we need it for a bug fix
in port ranges.
It's not logical to report context switch rates per thread in show activity
because everything else is a counter and it's not even possible to compare
values. Let's only report counts. Further, this simplifies the scheduler's
code.