Move the definition of the various _HA_ATOMIC_* macros that use
__atomic_* in the #if GCC_VERSION >= 4.7, not just after it, so that we
can build with older versions of gcc again.
The upgrade is performed when an H2 preface is detected when the first request
on a connection is parsed. The CS is destroyed by setting EOS flag on it. A
special flag is added on the HTX message to warn the HTX analyzers the stream
will be closed because of an upgrade. This way, no error and no log are
emitted. When the mux h1 is released, we create a mux h2, without any CS and
passing the buffer with the unparsed H2 preface.
The flag H1_MF_CLEAN_CONN_HDR has been added to let the H1 parser sanitize
connection headers. It means it will remove all "close" and "keep-alive" values
during the parsing. One noticeable effect is that connection headers may be
unfolded. In practice, this is not a problem because it is not frequent to have
multiple values for the connection headers.
If this flag is set, during the parsing The function
h1_parse_next_connection_header() is called in a loop instead of
h1_parse_conection_header().
No need to backport this patch
When creating a new initcall, don't forget to define the symbols, as it may
not be done automatically and that would lead to undefined symbols.
This should be backported to 1.9.
I found on an (old) AIX 5.1 machine that stdint.h didn't exist while
inttypes.h which is expected to include it does exist and provides the
desired functionalities.
As explained here, stdint being just a subset of inttypes for use in
freestanding environments, it's probably always OK to switch to inttypes
instead:
https://pubs.opengroup.org/onlinepubs/009696799/basedefs/stdint.h.html
Also it's even clearer here in the autoconf doc :
https://www.gnu.org/software/autoconf/manual/autoconf-2.61/html_node/Header-Portability.html
"The C99 standard says that inttypes.h includes stdint.h, so there's
no need to include stdint.h separately in a standard environment.
Some implementations have inttypes.h but not stdint.h (e.g., Solaris
7), but we don't know of any implementation that has stdint.h but not
inttypes.h"
The current initcall implementation relies on dedicated sections (one
section per init stage) to store the initcall descriptors. Then upon
startup, these sections are scanned from beginning to end and all items
found there are called in sequence.
On platforms like AIX or Cygwin it seems difficult to figure the
beginning and end of sections as the linker doesn't seem to provide
the corresponding symbols. In order to replace this, this patch
simply implements an array of single linked (one per init stage)
which are fed using constructors for each register call. These
constructors are declared static, with a name depending on their
line number in the file, in order to avoid name clashes. The final
effect is the same, except that the method is slightly more expensive
in that it explicitly produces code to register these initcalls :
$ size haproxy.sections haproxy.constructor
text data bss dec hex filename
4060312 249176 1457652 5767140 57ffe4 haproxy.sections
4062862 260408 1457652 5780922 5835ba haproxy.constructor
This mechanism is enabled as an alternative to the default one when
build option USE_OBSOLETE_LINKER is set. This option is currently
enabled by default only on AIX and Cygwin, and may be attempted for
any target which fails to build complaining about missing symbols
__start_init_* and/or __stop_init_*.
Once confirmed as a reliable fix, this will likely have to be backported
to 1.9 where AIX and Cygwin do not build anymore.
Older Solaris and AIX versions do not have unsetenv(). This adds a
fairly simple implementation which scans the environment, for use
with those systems. It will simply require to pass the define in
the "DEFINE" macro at build time like this :
DEFINE="-Dunsetenv=my_unsetenv"
http_known_methods, HTTP_100 and HTTP_103 were not declared extern and
as such were multiply defined since they were in http.h. There was
apparently no more side effect but it may depend on the platform and
the linker.
This needs to be backported to 1.9.
Injecting on a saturated listener started to exhibit some deadlocks
again between LIST_POP_LOCKED() and LIST_DEL_LOCKED(). Olivier found
it was due to a leftover from a previous debugging session. This patch
fixes it.
This will have to be backported if the other LIST_*_LOCKED() patches
are backported.
Some packages used to rely on DEFAULT_MAXCONN to set the default global
maxconn value to use regardless of the initial ulimit. The recent changes
made the lowest bound set to 100 so that it is compatible with almost any
environment. Now that DEFAULT_MAXCONN is not needed for anything else, we
can use it for the lowest bound set when maxconn is not configured. This
way it retains its original purpose of setting the default maxconn value
eventhough most of the time the effective value will be higher thanks to
the automatic computation based on "ulimit -n".
This entry was still set to 2000 but never used anymore. The only places
where it appeared was as an alias to SYSTEM_MAXCONN which forces it, so
let's turn these ones to SYSTEM_MAXCONN and remove the default value for
DEFAULT_MAXCONN. SYSTEM_MAXCONN still defines the upper bound however.
Add variants of the HA_ATOMIC* macros, prefixed with a _, that do the
atomic operation with no barrier generated by the compiler. It is expected
the developer adds barriers manually if needed.
When using the new __atomic* API, ask the compiler to generate barriers.
A variant of those functions that don't generate barriers will be added later.
Before that, using HA_ATOMIC* would not generate any barrier, and some parts
of the code should be reviewed and missing barriers should be added.
This should probably be backported to 1.8 and 1.9.
Implement __ha_barrier functions to be used when trying to protect data
modified by atomic operations (except when using HA_ATOMIC_STORE).
On intel, atomic operations either use the LOCK prefix and xchg, and both
atc as full barrier, so there's no need to add an extra barrier.
We already have my_ffsl() to find the lowest bit set in a word, and
this patch implements the search for the highest bit set in a word.
On x86 it uses the bsr instruction and on other architectures it
uses an efficient implementation.
It turns out that we call LIST_DEL+LIST_INIT very frequently and that
the compiler doesn't know what pointers get modified in the e->n->p
and e->p->n dance, so when LIST_INIT() is called, it reloads these
pointers, which is quite a bit of a mess in terms of performance.
This patch adds LIST_DEL_INIT() to perform the two operations at once
using local temporary variables so that the compiler knows these
pointers are left unaffected.
Well, that's becoming embarrassing. Now this fixes commit 4ef6801c
("BUG/MEDIUM: list: correct fix for LIST_POP_LOCKED's removal of last
element") which itself tried to fix commit 285192564. This fix only
works under low contention and was tested with the listener's queue.
With the idle conns it's obvious that it's still wrong since adding
more than one element to the list leaves a LLIST_BUSY pointer into
the list's head. This was visible when accumulating idle connections
in a server's list.
This new version of the fix almost goes back to the original code,
except that since then we addressed issues with expectedly idempotent
operations that were not. Now the code has been verified on paper again
and has survived 300 million connections spread over 4 threads.
This will have to be backported if the commit above is backported.
As seen with Olivier, in the end the fix in commit 285192564 ("BUG/MEDIUM:
list: fix LIST_POP_LOCKED's removal of the last pointer") is wrong,
the code there was right but the bug was triggered by another bug in
LIST_ADDQ_LOCKED() which doesn't properly update the list's head by
inserting in the wrong order.
This will have to be backported if the commit above is backported.
There is a very difficult to reproduce race in the listener's accept
code, which is much easier to reproduce once connection limits are
properly enforced. It's an ABBA lock issue :
- the following functions take l->lock then lq_lock :
disable_listener, pause_listener, listener_full, limit_listener,
do_unbind_listener
- the following ones take lq_lock then l->lock :
resume_listener, dequeue_all_listener
This is because __resume_listener() only takes the listener's lock
and expects to be called with lq_lock held. The problem can easily
happen when listener_full() and limit_listener() are called a lot
while in parallel another thread releases sessions for the same
listener using listener_release() which in turn calls resume_listener().
This scenario is more prevalent in 2.0-dev since the removal of the
accept lock in listener_accept(). However in 1.9 and before, a different
but extremely unlikely scenario can happen :
thread1 thread2
............................ enter listener_accept()
limit_listener()
............................ long pause before taking the lock
session_free()
dequeue_all_listeners()
lock(lq_lock) [1]
............................ try_lock(l->lock) [2]
__resume_listener()
spin_lock(l->lock) =>WAIT[2]
............................ accept()
l->accept()
nbconn==maxconn =>
listener_full()
state==LI_LIMITED =>
lock(lq_lock) =>DEADLOCK[1]!
In practice it is almost impossible to trigger it because it requires
to limit both on the listener's maxconn and the frontend's rate limit,
at the same time, and to release the listener when the connection rate
goes below the limit between poll() returns the FD and the lock is
taken (a few nanoseconds). But maybe with threads competing on the
same core it has more chances to appear.
This patch removes the lq_lock and replaces it with a lockless queue
for the listener's wait queue (well, technically speaking a self-locked
queue) brought by commit a8434ec14 ("MINOR: lists: Implement locked
variations.") and its few subsequent fixes. This relieves us from the
need of the lq_lock and removes the deadlock. It also gets rid of the
distinction between __resume_listener() and resume_listener() since the
only difference was the lq_lock. All listener removals from the list
are now unconditional to avoid races on the state. It's worth noting
that the list used to never be initialized and that it used to work
only thanks to the state tests, so the initialization has now been
added.
This patch must carefully be backported to 1.9 and very likely 1.8.
It is mandatory to be careful about replacing all manipulations of
l->wait_queue, global.listener_queue and p->listener_queue.
These operations previously used to return a "locked" element, which is
a constraint when multiple threads try to delete the same element, because
the second one will block indefinitely. Instead, let's make sure that both
LIST_DEL_LOCKED() and LIST_POP_LOCKED() always reinitialize the element
after deleting it. This ensures that the second thread will immediately
unblock and succeed with the removal. It also secures the pop vs delete
competition that may happen when trying to remove an element that's about
to be dequeued.
Commit a8434ec14 ("MINOR: lists: Implement locked variations.")
introduced locked lists which use the elements pointers as locks
for concurrent operations. Under heavy stress the lists occasionally
fail. The cause is a missing barrier at some points when updating
the list element and the head : nothing prevents the compiler (or
CPU) from updating the list head first before updating the element,
making another thread jump to a wrong location. This patch simply
adds the missing barriers before these two opeations.
This will have to be backported if the commit above is backported.
There was a typo making the last updated pointer be the pre-last element's
prev instead of the last's prev element. It didn't show up during early
tests because the contention is very rare on this one and it's implicitly
recovered when updating the pointers to go to the next element, but it was
clearly visible in the listener_accept() tests by having all threads block
on LIST_POP_LOCKED() with n==p==LLIST_BUSY.
This will have to be backported if commit a8434ec14 ("MINOR: lists:
Implement locked variations.") is backported.
Commit a8434ec14 ("MINOR: lists: Implement locked variations.")
introduced locked lists which use the elements pointers as locks
for concurrent operations. A copy-paste typo in LIST_ADDQ_LOCKED()
causes corruption in the list in case the next pointer is already
held, as it restores the previous pointer into the next one. It
may impact the server pools.
This will have to be backported if the commit above is backported.
Threads have long matured by now, still for most users their usage is
not trivial. It's about time to enable them by default on platforms
where we know the number of CPUs bound. This patch does this, it counts
the number of CPUs the process is bound to upon startup, and enables as
many threads by default. Of course, "nbthread" still overrides this, but
if it's not set the default behaviour is to start one thread per CPU.
The default number of threads is reported in "haproxy -vv". Simply using
"taskset -c" is now enough to adjust this number of threads so that there
is no more need for playing with cpu-map. And thanks to the previous
patches on the listener, the vast majority of configurations will not
need to duplicate "bind" lines with the "process x/y" statement anymore
either, so a simple config will automatically adapt to the number of
processors available.
Function mask_find_rank_bit() returns the bit position in mask <m> of
the nth bit set of rank <r>, between 0 and LONGBITS-1 included, starting
from the left. For example ranks 0,1,2,3 for mask 0x55 will be 6, 4, 2
and 0 respectively. This algorithm is based on a popcount variant and
is described here : https://graphics.stanford.edu/~seander/bithacks.html.
In LIST_DEL_LOCKED(), initialize p2 to NULL, and only attempt to set it back
to its previous value if we had a previous element, and thus p2 is non-NULL.
Implement LIST_ADD_LOCKED(), LIST_ADDQ_LOCKED(), LIST_DEL_LOCKED() and
LIST_POP_LOCKED().
LIST_ADD_LOCKED, LIST_ADDQ_LOCKED and LIST_DEL_LOCKED work the same as
LIST_ADD, LIST_ADDQ and LIST_DEL, except before any manipulation it locks
the relevant elements of the list, so it's safe to manipulate the list
with multiple threads.
LIST_POP_LOCKED() removes the first element from the list, and returns its
data.
This function is useful to parse strings made of unsigned integers
and to allocate a C array of unsigned integers from there.
For instance this function allocates this array { 1, 2, 3, 4, } from
this string: "1.2.3.4".
The function htx_drain() can now be used to drain data from an HTX message.
It will be used by other commits to fix bugs, so it must be backported to 1.9.
1xx responses does not work in HTTP2 when the HTX is enabled. First of all, when
a response is parsed, only one HEADERS frame is expected. So when an interim
response is received, the flag H2_SF_HEADERS_RCVD is set and the next HEADERS
frame (for another interim repsonse or the final one) is parsed as a trailers
one. Then when the response is sent, because an EOM block is found at the end of
the interim HTX response, the ES flag is added on the frame, closing too early
the stream. Here, it is a design problem of the HTX. Iterim responses are
considered as full messages, leading to some ambiguities when HTX messages are
processed. This will not be fixed now, but we need to keep it in mind for future
improvements.
To fix the parsing bug, the flag H2_MSGF_RSP_1XX is added when the response
headers are decoded. When this flag is set, an EOM block is added into the HTX
message, despite the fact that there is no ES flag on the frame. And we don't
set the flag H2_SF_HEADERS_RCVD on the corresponding H2S. So the next HEADERS
frame will not be parsed as a trailers one.
To fix the sending bug, the ES flag is not set on the frame when an interim
response is processed and the flag H2_SF_HEADERS_SENT is not set on the
corresponding H2S.
This patch must be backported to 1.9.
The existing threading flag in the 51Degrees API
(FIFTYONEDEGREES_NO_THREADING) has now been mapped to the HAProxy
threading flag (USE_THREAD), and the 51Degrees module code has been made
thread safe.
In Pattern, the cache is now locked with a spin lock from hathreads.h
using a new lable 'OTHER_LOCK'. The workset pool is now created with the
same size as the number of threads to avoid any time waiting on a
worket.
In Hash Trie, the global device offsets structure is only used in single
threaded operation. Multi threaded operation creates a new offsets
structure in each thread.
For some embedded systems, it's pointless to have 32- or even 64- large
arrays of processes when it's known that much fewer processes will be
used in the worst case. Let's introduce this MAX_PROCS define which
contains the highest number of processes allowed to run at once. It
still defaults to LONGBITS but may be lowered.
We'll call popcount() more often so better use a parallel method
than an iterative one. One optimal design is proposed at the site
below. It requires a fast multiplication though, but even without
it will still be faster than the iterative one, and all relevant
64 bit platforms do have a multiply unit.
https://graphics.stanford.edu/~seander/bithacks.html
When compiling with DEBUG_FAIL_ALLOC, add a new option, tune.fail-alloc,
that gives the percentage of chances an allocation fails.
This is useful to check that allocation failures are always handled
gracefully.
The previous patch clarifies the fact that the htx pointer is never null
along all the code. This test for a null will never match, didn't catch
the pointer 1 before the fix for b_is_null(), but it confuses the compiler
letting it think that any dereferences made to this pointer after this
test could actually mean we're dereferencing a null. Let's now drop this
test. This saves us from having to add impossible tests everywhere to
avoid the warning.
This should be backported to 1.9 if the b_is_null() patch is backported.
Update the comments above htxbuf() and htx_from_buf() to make it clear
that they always return valid htx pointers so that callers know they do
not have to test them. This is only true after the fix on b_is_null()
which was the only known corner case.
This should be backported to 1.9 if the b_is_null() patch is backported.
In b_is_null(), make sure we return 1 if the buffer is waiting for its
allocation, as users assume there's memory allocated if b_is_null() returns
0.
The indirect impact of not having this was that htxbuf() would not match
b_is_null() for a buffer waiting for an allocation, and would thus return
the value 1 for the htx pointer, causing various crashes under low memory
condition.
Note that this patch makes gcc versions 6 and above report two null-deref
warnings in proto_htx.c since htx_is_empty() continues to check for a null
pointer without knowing that this is protected by the test on b_is_null().
This is addressed by the following patches.
This should be backported to 1.9.
The new function h2_frame_check() checks the protocol limits for the
received frame (length, ID, direction) and returns a verdict made of
a connection error code. The purpose is to be able to validate any
frame regardless of the state and the ability to call the frame handler,
and to emit a GOAWAY early in this case.
There's some value in being able to limit MAX_THREADS, either to save
precious resources in embedded environments, or to protect certain
deployments against accidently incorrect settings.
With this patch, if MAX_THREADS is defined at build time, it will be
used. However, given that LONGBITS is not a macro but is defined
according to sizeof(long), we can't check the value range at build
time and instead we need to perform the check at early boot time.
However, the compiler is able to optimize away the constant comparisons
and doesn't even emit the check code when values are correct.
The output message regarding threading support was improved to report
the number of threads.
The header used to be parsed only in HTX but not in legacy. And even in
HTX mode, the value was dropped. Let's always parse it and report the
parsed value back so that we'll be able to store it in the streams.
RFC7541#6.3 mandates that an error is reported when a dynamic table size
update announces a size larger than the one configured with settings. This
is tested by h2spec using test "hpack/6.3/1".
This must be backported to 1.9 and possibly 1.8 as well.
This patch adds H2_FT_HDR_MASK to group all frame types carrying headers
information, and H2_FT_LATE_MASK to group frame types allowed to arrive
after a stream was closed.
While testing fixes, it's sometimes confusing to rebuild only one C file
(e.g. a mux) and not to have the correct commit ID reported in "haproxy -v"
nor on the stats page.
This patch adds a new "version.c" file which is always rebuilt. It's
very small and contains only 3 variables derived from the various
version strings. These variables are used instead of the macros at the
few places showing the version. This way the output version of the
running code is always correct for the parts that were rebuilt.
Currently the H1 headers parser works for either a request or a response
because it starts from the start line. It is also able to resume its
processing when it was interrupted, but in this case it doesn't update
the list.
Make it support a new flag, H1_MF_HDRS_ONLY so that the caller can
indicate it's only interested in the headers list and not the start
line. This will be convenient to parse H1 trailers.
This function is usable to transform a list of H2 header fields to a
HTX trailers block. It takes care of rejecting forbidden headers and
pseudo-headers when performing the conversion. It also emits the
trailing CRLF that is currently needed in the HTX trailers block.
This function is usable to transform a list of H2 header fields to a
H1 trailers block. It takes care of rejecting forbidden headers and
pseudo-headers when performing the conversion.
This function will be used to move parts of a buffer to another place
in the same buffer, even if the parts overlap. In order to keep things
under reasonable control, it only uses a length and absolute offsets
for the source and destination, and doesn't consider head nor data.
Released version 2.0-dev0 with the following main changes :
- BUG/MAJOR: connections: Close the connection before freeing it.
- REGTEST: Require the option LUA to run lua tests
- REGTEST: script: Process script arguments before everything else
- REGTEST: script: Evaluate the varnishtest command to allow quoted parameters
- REGTEST: script: Add the option --clean to remove previous log direcotries
- REGTEST: script: Add the option --debug to show logs on standard ouput
- REGTEST: script: Add the option --keep-logs to keep all log directories
- REGTEST: script: Add the option --use-htx to enable the HTX in regtests
- REGTEST: script: Print only errors in the results report
- REGTEST: Add option to use HTX prefixed by the macro 'no-htx'
- REGTEST: Make reg-tests target support argument.
- REGTEST: Fix a typo about barrier type.
- REGTEST: Be less Linux specific with a syslog regex.
- REGTEST: Missing enclosing quotes for ${tmpdir} macro.
- REGTEST: Exclude freebsd target for some reg tests.
- BUG/MEDIUM: h2: Don't forget to quit the sending_list if SUB_CALL_UNSUBSCRIBE.
- BUG/MEDIUM: mux-h2: Don't forget to quit the send list on error reports
- BUG/MEDIUM: dns: Don't prevent reading the last byte of the payload in dns_validate_response()
- BUG/MEDIUM: dns: overflowed dns name start position causing invalid dns error
- BUG/MINOR: compression/htx: Don't compress responses with unknown body length
- BUG/MINOR: compression/htx: Don't add the last block of data if it is empty
- MEDIUM: mux_h1: Implement h1_show_fd.
- REGTEST: script: Add support of alternatives in requited options list
- REGTEST: Add a basic test for the compression
- BUG/MEDIUM: mux-h2: don't needlessly wake up the demux on short frames
- REGTEST: A basic test for "http-buffer-request"
- BUG/MEDIUM: server: Also copy "check-sni" for server templates.
- MINOR: ssl: Add ssl_sock_set_alpn().
- MEDIUM: checks: Add check-alpn.
When producing an HTX message, we can't rely on the next-level H1 parser
to check and deduplicate the content-length header, so we have to do it
while parsing a message. The algorithm is the exact same as used for H1
messages.
When using DEBUG_MEMORY_POOLS, when we want to crash, instead of using
*(int *)0 = 0, use *(volatile int *)0 = 0, or clang will just translate it
to a nop, instead of dereferencing 0.
All the HTX definition is self-contained and doesn't really depend on
anything external since it's a mostly protocol. In addition, some
external similar files (like h2) also placed in common used to rely
on it, making it a bit awkward.
This patch moves the two htx.h files into a single self-contained one.
The historical dependency on sample.h could be also removed since it
used to be there only for http_meth_t which is now in http.h.
The new function hpack_encode_path() supports encoding a path into
the ":path" header. It knows about "/" and "/index.html" which use
a single byte, and falls back to literal encoding for other ones,
with a fast path for short paths < 127 bytes.
The new function hpack_encode_scheme() supports encoding a scheme
into the ":scheme" header. It knows about "https" and "http" which use
a single byte, and falls back to literal encoding for other ones.
The new function hpack_encode_method() supports encoding a method.
It knows about GET and POST which use a single byte, and falls back
to literal encoding for other ones.
This header exists with 7 different values, it's worth taking them
into account for the encoding, hence these functions. One of them
makes use of an integer only and computes the 3 output bytes in case
of literal. The other one benefits from the knowledge of an existing
string, which for example exists in the case of H1 to H2 encoding.
For long header values whose index is known, hpack_encodde_long_idx()
may now be used. This function emits the short index and follows with
the header's value.
Most direct calls to HPACK functions are made to encode short header
fields like methods, schemes or statuses, whose lengths and indexes
are known. Let's have a small function to do this.
We'll need these functions from other inline functions, let's make them
accessible. len_to_bytes() was renamed to hpack_len_to_bytes() since it's
now exposed.
This macro may be used to block constant propagation that lets the compiler
detect a possible NULL dereference on a variable resulting from an explicit
assignment in an impossible check. Sometimes a function is called which does
safety checks and returns NULL if safe conditions are not met. The place
where it's called cannot hit this condition and dereferencing the pointer
without first checking it will make the compiler emit a warning about a
"potential null pointer dereference" which is hard to work around. This
macro "washes" the pointer and prevents the compiler from emitting tests
branching to undefined instructions. It may only be used when the developer
is absolutely certain that the conditions are guaranteed and that the
pointer passed in argument cannot be NULL by design.
A typical use case is a top-level function doing this :
if (frame->type == HEADERS)
parse_frame(frame);
Then parse_frame() does this :
void parse_frame(struct frame *frame)
{
const char *frame_hdr;
frame_hdr = frame_hdr_start(frame);
if (*frame_hdr == FRAME_HDR_BEGIN)
process_frame(frame);
}
and :
const char *frame_hdr_start(const struct frame *frame)
{
if (frame->type == HEADERS)
return frame->data;
else
return NULL;
}
Above parse_frame() is only called for frame->type == HEADERS so it will
never get a NULL in return from frame_hdr_start(). Thus it's always safe
to dereference *frame_hdr since the check was already performed above.
It's then safe to address it this way instead of inventing dummy error
code paths that may create real bugs :
void parse_frame(struct frame *frame)
{
const char *frame_hdr;
frame_hdr = frame_hdr_start(frame);
ALREADY_CHECKED(frame_hdr);
if (*frame_hdr == FRAME_HDR_BEGIN)
process_frame(frame);
}
Calling tolower/toupper for each character is slow, a lookup into a
256-byte table is cheaper, especially for common characters used in
header field names which all fit into a cache line. Let's create these
two variables marked weak so that they're included only once.
The ist functions were missing functions to copy an IST into a target
buffer, making some code have to resort to memcpy(), which tends to be
overkill for small strings, that the compiler cannot guess. In addition
sometimes there is a need to turn a string to lower or upper case so it
had to be overwritten after the operation.
This patch adds 6 functions to copy an ist to a buffer, as binary or as a
string (i.e. a zero is or is not appended), and optionally to apply a
lower case or upper case transformation on the fly.
A number of tests were performed to optimize the processing for small
strings. The loops are marked unlikely to dissuade the compilers from
over-optimizing them and switching to SIMD instructions. The lower case
or upper case transformations used to rely on external functions for
each character and to crappify the code due to clobbered registers,
which is not acceptable when we know that only a certain class of chars
has to be transformed, so the test was open-coded.
Till now we could only produce an HTTP/1 request from a list of H2
request headers. Now the new function h2_make_htx_request() does the
same but using the HTX encoding instead, while respecting the H2
semantics. The code is not much different from the first version,
only the encoding differs.
For now it's not used.
During startup, after the configuration parsing, all HTTP error messages
(errorloc, errorfile or default messages) are converted into HTX messages and
stored in dedicated buffers. We use it to return errors in the HTX analyzers
instead of using ugly OOB blocks.
Having a thread_local for the pool cache is messy as we need to
initialize all elements upon startup, but we can't until the threads
are created, and once created it's too late. For this reason, the
allocation code used to check for the pool's initialization, and
it was the release code which used to detect the first call and to
initialize the cache on the fly, which is not exactly optimal.
Now that we have initcalls, let's turn this into a per-thread array.
This array is initialized very early in the boot process (STG_PREPARE)
so that pools are always safe to use. This allows to remove the tests
from the alloc/free calls.
Doing just this has removed 2.5 kB of code on all cumulated pool_alloc()
and pool_free() paths.
Instead of exporting a number of pools and having to manually delete
them in deinit() or to have dedicated destructors to remove them, let's
simply kill all pools on deinit().
For this a new function pool_destroy_all() was introduced. As its name
implies, it destroys and frees all pools (provided they don't have any
user anymore of course).
This allowed to remove 4 implicit destructors, 2 explicit ones, and 11
individual calls to pool_destroy(). In addition it properly removes
the mux_pt_ctx pool which was not cleared on exit (no backport needed
here since it's 1.9 only). The sig_handler pool doesn't need to be
exported anymore and became static now.
The new function create_pool_callback() takes 3 args including the
return pointer, and creates a pool with the specified name and size.
In case of allocation error, it emits an error message and returns.
The new macro REGISTER_POOL() registers a callback using this function
and will be usable to request some pools creation and guarantee that
the allocation will be checked. An even simpler approach is to use
DECLARE_POOL() and DECLARE_STATIC_POOL() which declare and register
the pool.
Using __decl_spinlock(), __decl_rwlock(), __decl_aligned_spinlock()
and __decl_aligned_rwlock(), one can now simply declare a spinlock
or an rwlock which will automatically be initialized at boot time
by calling the ha_spin_init() or ha_rwlock_init() callback. The
"aligned" variants enforce a 64-byte alignment on the lock.
This patch adds ha_spin_init() and ha_rwlock_init() which are used as
a callback to initialise locks at boot time. They perform exactly the
same as HA_SPIN_INIT() or HA_RWLOCK_INIT() but from within a real
function.
We currently have to deal with multiple initialization stages in a way
that can be confusing, because certain parts rely on others having been
properly initialized. Most calls consist in adding lists to existing
lists, whose heads are initialized in the declaration so this is easy.
But some calls create new pools and require pools to be properly
initialized. Pools currently are thread-local and as such cannot be
pre-initialized, requiring run-time checks.
All this could be simplified by using multiple boot stages and allowing
functions to be registered at various stages.
One approach might be to use gcc's constructor priorities, but this
requires gcc >= 4.3 which eliminates a wide spectrum of working compilers,
and some versions of certain compilers (like clang 3.0) are known for
silently ignore these priorities.
Instead we can use our own init function registration mechanism. A first
attempt was made using register_function() calls in all constructors but
this made the code more painful.
This patch's approach is different. It creates sections containing
arrays of pointers to "initcall" descriptors. An initcall contains a
pointer to a function and an argument. Each section corresponds to a
specific initialization stage. Each module creates such descriptors
for various calls it requires. The main() function starts by scanning
each of these sections in turn to process these initcalls.
This will make it possible to remove many constructors from various
modules, by simply placing initcalls for the requested functions next
to the keyword lists that need to be called.
A first attempt was made by placing the initcalls directly into the
sections instead of creating an array of pointers, but it becomes
sensitive to the array's alignment which depends on the compiler and
the linker, so it seems too fragile.
For now we support 6 init stages :
- STG_PREPARE : preset variables, tables and list heads
- STG_LOCK : initialize spinlocks and rwlocks
- STG_ALLOC : allocate the required structures
- STG_POOL : create pools
- STG_REGISTER : register static lists (keywords etc)
- STG_INIT : subsystems normal initialization
These ones are declared directly in the files where they are needed
using one of the INITCALL* macros, passing 0 to 3 pointers as
arguments.
The API should possibly be extended to support a return value to give
a status to the caller, and to support a unified API, possibly a bit
more flexibility in the arguments. In this case it might make sense to
support a set of macros to register functions having a different API
and to pass the function type in the initcall itself.
Special thanks to Olivier for showing how to scan sections as this is
not something particularly well documented and exactly what I've been
missing to achieve this.
Building with musl and gcc-5.3 for MIPS returns this :
include/common/buf.h: In function 'b_dist':
include/common/buf.h:252:2: error: unknown type name 'ssize_t'
ssize_t dist = to - from;
^
Including stdint or stddef is not sufficient there to get ssize_t,
unistd is needed as well. It's likely that other platforms will have
the same issue. This patch also addresses it in ist.h and memory.h.
At the moment the situation with activity measurement is quite tricky
because the struct activity is defined in global.h and declared in
haproxy.c, with operations made in time.h and relying on freq_ctr
which are defined in freq_ctr.h which itself includes time.h. It's
barely possible to touch any of these files without breaking all the
circular dependency.
Let's move all this stuff to activity.{c,h} and be done with it. The
measurement of active and stolen time is now done in a dedicated
function called just after tv_before_poll() instead of mixing the two,
which used to be a lazy (but convenient) decision.
No code was changed, stuff was just moved around.
This was the largest function of the whole file, taking a rough second
to build alone. Let's move it to a distinct file along with a few
dependencies. Doing so saved about 2 seconds on the total build time.
The config parser is the largest file to build and its build dominates
the total project's build time. Let's start to split it into multiple
smaller pieces by extracting the "global" section parser into a new
file called "cfgparse-global.c". This removes 1/4th of the file's build
time.
These 2 functions are pretty naive. They only split a start-line into its 3
substrings or a header line into its name and value. Spaces before and after
each part are skipped. No CRLF at the end are expected.
This patch implements http_apply_early_hint_rule() function is responsible of
building HTTP 103 Early Hint responses each time a "early-hint" rule is matched.
When namespaces are disabled, support is still reported because the file
is built with almost nothing in it but built anyway. Instead of extending
the scope of the numerous ifdefs in this file, better avoid building it
when namespaces are diabled. In this case we define my_socketat() as an
inline function mapping directly to socket(). The struct netns_entry
still needs to be defined because it's used by various other functions
in the code.
It was reported here that authentication may fail when threads are
enabled :
https://bugzilla.redhat.com/show_bug.cgi?id=1643941
While I couldn't reproduce the issue, it's obvious that there is a
problem with the use of the non-reentrant crypt() function there.
On Linux systems there's crypt_r() but not on the vast majority of
other ones. Thus a first approach consists in placing a lock around
this crypt() call. Another patch may relax it when crypt_r() is
available.
This fix must be backported to 1.8. Thanks to Ryan O'Hara for the
quick notification.
Commit 27346b01a ("OPTIM: tools: optimize my_ffsl() for x86_64") optimized
my_ffsl() for intensive use cases in the scheduler, but as half of the times
I got it wrong so it counted bits the reverse way. It doesn't matter for the
scheduler nor fd cache but it broke cpu-map with threads which heavily relies
on proper ordering.
We should probably consider dropping support for gcc < 3.4 and switching
to builtins for these ones, though often they are as ambiguous.
No backport is needed.
When building with DEBUG_MEMORY_POOLS, an element returned from the
cache would not have its pool link initialized unless it's allocated
using pool_alloc(). This is problematic for buffer allocators which
use pool_alloc_dirty(), as freeing this object will make the code
think it was allocated from another pool. This patch does two things :
- make __pool_get_from_cache() set the link
- remove the extra initialization from pool_alloc() since it's always
done in either __pool_get_first() or __pool_refill_alloc()
This patch is marked MINOR since it only affects code explicitly built
for debugging. No backport is needed.
When mapping memory with mmap(), we should use a fd of -1, not 0. 0 may
work on linux, but it doesn't work on FreeBSD, and probably other OSes.
It would be nice to backport this to 1.8 to help debugging there.
Commit ac6c880 ("BUILD: memory: fix pointer declaration for atomic CAS")
attemtped to fix a build warning affecting the lock-free version of the
pool allocator. But the fix tried to hide the cause instead of addressing
it, thus clang still complains about (void **) not matching (void ***).
The real solution is to declare free_list (void **) and not to use a cast.
Now this builds fine with gcc/clang with and without threads.
No backport is needed.
The purpose is to detect if threads or processes are competing for the
same CPU. This can happen when threads are incorrectly bound, or after a
reload if the previous process still has an important activity. With
threads this situation is problematic because a preempted thread holding
a lock will block other ones waiting for this lock to be released.
A first attempt consisted in measuring the cumulated lost time more
precisely but the system's scheduler is smart enough to try to limit the
thread preemption rate by mostly context switching during poll()'s blank
periods, so most of the time lost is not seen. In essence this is good
because it means a thread is not preempted with a lock held, and even
regarding the rendez-vous point it cannot prevent the other ones from
making progress. But still it happens tens to hundreds of times per
second that a thread might be preempted, so it's still possible to detect
that the situation is happening, thus it's interesting to measure and
report its frequency.
Each time we enter the poller, we check the CPU time spent working and
see if we've lost time doing something else. To limit false positives,
we're only interested in losses of 500 microseconds or more (i.e. half
a clock tick on a 1 kHz system). If so, it indicates that some time was
stolen by another thread or process. Note that we purposely store some
sub-millisecond counters so that under heavy traffic with a 1 kHz clock,
it's still possible to measure something without being subject to the
risk of rounding errors (i.e. if exactly 1 ms is stolen it's possible
that the time difference could often be slightly lower).
This counter of lost CPU time slots time is reported in "show activity"
in numbers of milliseconds of CPU lost per second, per 15s, and total
over the process' life. By definition, the per-second counter cannot
report values larger than 1000 per thread per second and the 15s one
will be limited to 15000/s in the worst case, but it's possible that
peak values exceed such thresholds after long pauses.
These two functions retrieve respectively the monotonic clock time and
the per-thread CPU time when available on the platform, or return zero.
These syscalls may require to link with -lrt on certain libc, which is
enabled in the Makefile with USE_RT=1 (default on Linux systems).
The calls to HA_ATOMIC_CAS() on the lockfree version of the pool allocator
were mistakenly done on (void*) for the old value instead of (void **).
While this has no impact on "recent" gcc, it does have one for gcc < 4.7
since the CAS was open coded and it's not possible to assign a temporary
variable of type "void".
No backport is needed, this only affects 1.9.
By placing this code into time.h (tv_entering_poll() and tv_leaving_poll())
we can remove the logic from the pollers and prepare for extending this to
offer more accurate time measurements.
The 4 pollers all contain the same code used to compute the poll timeout.
This is pointless, let's centralize this into fd.h. This also gets rid of
the useless SCHEDULER_RESOLUTION macro which used to work arond a very old
linux 2.2 bug causing select() to wake up slightly before the timeout.
Each thread now keeps the last ~512 kB of freed objects into a local
cache. There are some heuristics involved so that a specific pool cannot
use more than 1/8 of the total cache in number of objects. Tests have
shown that 512 kB is an optimal size on a 24-thread test running on a
dual-socket machine, resulting in an overall 7.5% performance increase
and a cache miss ratio reducing from 19.2 to 17.7%. Anyway it seems
pointless to keep more than an L2 cache, which probably explains why
sizes between 256 and 512 kB are optimal.
Cached objects appear in two lists, one per pool and one LRU to help
with fair eviction. Currently there is no way to check each thread's
cache state nor to flush it. This cache cannot be disabled and is
enabled as soon as the lockless pools are enabled (i.e.: threads are
enabled, no pool debugging is in use and the CPU supports a double word
CAS).
For caching it will be convenient to have indexes associated with pools,
without having to dereference the pool itself. One solution could consist
in replacing all pool pointers with integers but this would limit the
number of allocatable pools. Instead here we allocate the 32 first pools
from a pre-allocated array whose base address is known so that it's trivial
to convert a pool to an index in this array. Pools that cannot fit there
will be allocated normally.
This statement is used as a hint for the compiler so that it knows that
the location where it's placed cannot be reached. It will mostly be used
after longjmp() or equivalent statements that deal with error processing
and that the compiler doesn't know will not return on certain conditions,
so that it doesn't complain about null dereferences on error paths.
HTTP_FLG_* and HTTP_IS_* were moved from "proto/proto_http.h" to "common/http.h"
but the associated comment was forgotten during the move.
This is 1.9-specific and should not be backported.
This call is now used quite a bit in the fd cache, to decide which cache
to add/remove the fd to/from, when waking up a task for a single thread
in __task_wakeup(), in fd_cant_recv() and in fd_process_cached_events(),
and we can replace it with a single instruction, removing ~30 instructions
and ~80 bytes from the inner loop of some of these functions.
In addition the test for zero value was replaced with a comment saying
that it is illegal and leads to an undefined behaviour. The code does
not make use of this useless case today.
In commit f161d0f51 ("BUG/MINOR: pools/threads: don't ignore DEBUG_UAF
on double-word CAS capable archs") I moved some defines and accidently
messed up with lockfree pools. The problem is that the HA_HAVE_CAS_DW
macro is not defined anymore where the CONFIG_HAP_LOCKLESS_POOLS macro
is set, so this fix implicitly disabled lockfree pools.
This patch fixes this by moving the capabilities definition to config.h
(probably that we'd benefit from having an "arch.h" file to declare the
capabilities offered by the architecture). In a test on a 12-core machine,
we used to measure 19s spent in the pool lock for 1M requests without
this patch, and 0 with it so that's definitely a net saving.
No backport is required, this is only for 1.9.
OpenSSL released support for TLSv1.3. It also added a separate function
SSL_CTX_set_ciphersuites that is used to set the ciphers used in the
TLS 1.3 handshake. This change adds support for that new configuration
option by adding a ciphersuites configuration variable that works
essentially the same as the existing ciphers setting.
Note that it should likely be backported to 1.8 in order to ease usage
of the now released openssl-1.1.1.
In ci_insert_line2() and b_rep_blk(), we can't afford to wrap, so don't use
b_tail() to check if we do, use __b_tail() instead.
This should be backported to previous versions.
The current proto_http.c file is huge and contains different processing
domains making it very difficult to work on an alternative representation.
This commit moves some parts to other files :
- ACL registration code => http_acl.c
This code only creates some ACL mappings and doesn't know anything
about HTTP nor about the representation. This code could even have
moved to acl.c but it was not worth polluting it again.
- HTTP sample conversion => http_conv.c
This code doesn't depend on the internal representation but definitely
manipulates some HTTP elements, such as dates. It also has access to
captures.
- HTTP sample fetching => http_fetch.c
This code does depend entirely on the internal representation but is
totally independent on the analysers. Placing it into a different
file will ease the transition to the new representation and the
creation of a wrapper if required. An include file was created due
to CHECK_HTTP_MESSAGE_FIRST() being used at various places.
- HTTP action registration => http_act.c
This code doesn't directly interact with the messages nor the
transaction but it does so via some exported http functions like
http_replace_req_line() or http_set_status() so it will be easier
to change only this after the conversion.
- a few very generic parts were found and moved to http.{c,h} as
relevant.
It is worth noting that the functions moved to these new files are not
referenced anywhere outside of the files and are only called as registered
callbacks, so these files do not even require associated include files.
Tim Düsterhus found using afl-fuzz that some parts of the HPACK decoder
use incorrect bounds checking which do not catch negative values after
a type cast. The first culprit is hpack_valid_idx() which takes a signed
int and is fed with an unsigned one, but a few others are affected as
well due to being designed to work with an uint16_t as in the table
header, thus not being able to detect the high offset bits, though they
are not exposed if hpack_valid_idx() is fixed.
The impact is that the HPACK decoder can be crashed by an out-of-bounds
read. The only work-around without this patch is to disable H2 in the
configuration.
CVE-2018-14645 was assigned to this bug.
This patch addresses all of these issues at once. It must be backported
to 1.8.
These two functions were apparently written on the same model as their
parents when added by commit 11bcb6c4f ("[MEDIUM] IPv6 support for syslog")
except that they perform an assignment instead of a return, and as a
result fall through the next case where the assigned value may possibly
be partially overwritten. At least under Linux the port offset is the
same in both sockaddr_in and sockaddr_in6 so the value is written twice
without side effects.
This needs to be backported as far as 1.5.
This protocol is based on the uxst one, but it uses socketpair and FD
passing insteads of a connect()/accept().
The "sockpair@" prefix has been implemented for both bind and server
keywords.
When HAProxy wants to connect through a sockpair@, it creates 2 new
sockets using the socketpair() syscall and pass one of the socket
through the FD specified on the server line.
On the bind side, haproxy will receive the FD, and will use it like it
was the FD of an accept() syscall.
This protocol was designed for internal communication within HAProxy
between the master and the workers, but it's possible to use it
externaly with a wrapper and pass the FD through environment variabls.
The following functions only deal with header field values and are agnostic
to the HTTP version so they were moved to http.c :
http_header_match2(), find_hdr_value_end(), find_cookie_value_end(),
extract_cookie_value(), parse_qvalue(), http_find_url_param_pos(),
http_find_next_url_param().
Those lacking the "http_" prefix were modified to have it.
These error codes and messages are agnostic to the version, even if
they are represented as HTTP/1.0 messages. Ultimately they will have
to be transformed into internal HTTP messages to be used everywhere.
The HTTP/1.1 100 Continue message was turned to an IST and the local
copy in the Lua code was removed.
This function is purely HTTP once http_txn is put aside. So the original
one was renamed to http_txn_get_path() and it extracts the relevant offsets
from the txn to pass them to http_get_path(). One benefit of the new version
is that it returns the length at the same time so that allowed to slightly
simplify http_get_path_from_string() which had to look up the end pointer
previously and which is not needed anymore.
It's a bit painful to have to deal with HTTP semantics for each protocol
version (H1 and H2), and working on the version-agnostic code further
emphasizes the problem.
This patch creates http.h and http.c which are agnostic to the version
in use, and which borrow a few parts from proto_http and from h1. For
example the once thought h1-specific h1_char_classes array is in fact
dictated by RFC7231 and is used to parse HTTP headers. A few changes
were made to a few files which were including proto_http.h while they
only needed http.h.
Certain string definitions pre-dated the introduction of indirect
strings (ist) so some were used to simplify the definition of the known
HTTP methods. The current lookup code saves 2 kB of a heavily used table
and is faster than the previous table based lookup (typ. 14 ns vs 16
before).
This function was split in two at commit f7d0447 ("MINOR: buffers:
split b_putblk() into __b_putblk()") but it's wrong, the first half's
length is not adjusted to the requested size so it copies more than
desired.
This is purely 1.9-specific, no backport is needed.
We've been missing it several times and now we'll need it to increment
a request counter. Let's do it once for all.
This patch will need to be backported to 1.8 with the associated fix.
Since commit 843b7cb ("MEDIUM: chunks: make the chunk struct's fields
match the buffer struct") a chunk length is unsigned so we can remove
negative size checks.
When b_slow_realign is called with the <output> parameter equal to 0, the
buffer's head, after the realign, must be set to 0. It was errornously set to
the buffer's size, because there was no test on the value of <output>.
The current synchronization point enforces certain restrictions which
are hard to workaround in certain areas of the code. The fact that the
critical code can only be called from the sync point itself is a problem
for some callback-driven parts. The "show fd" command for example is
fragile regarding this.
Also it is expensive in terms of CPU usage because it wakes every other
thread just to be sure all of them join to the rendez-vous point. It's a
problem because the sleeping threads would not need to be woken up just
to know they're doing nothing.
Here we implement a different approach. We keep track of harmless threads,
which are defined as those either doing nothing, or doing harmless things.
The rendez-vous is used "for others" as a way for a thread to isolate itself.
A thread then requests to be alone using thread_isolate() when approaching
the dangerous area, and then waits until all other threads are either doing
the same or are doing something harmless (typically polling). The function
only returns once the thread is guaranteed to be alone, and the critical
section is terminated using thread_release().
When threads are disabled, some variables such as tid and tid_bit are
still checked everywhere, the MAX_THREADS_MASK macro is ~0UL while
MAX_THREADS is 1, and the all_threads_mask variable is replaced with a
macro forced to zero. The compiler cannot optimize away all this code
involving checks on tid and tid_bit, and we end up in special cases
where all_threads_mask has to be specifically tested for being zero or
not. It is not even certain the code paths are always equivalent when
testing without threads and with nbthread 1.
Let's change this to make sure we always present a single thread when
threads are disabled, and have the relevant values declared as constants
so that the compiler can optimize all the tests away. Now we have
MAX_THREADS_MASK set to 1, all_threads_mask set to 1, tid set to zero
and tid_bit set to 1. Doing just this has removed 4 kB of code in the
no-thread case.
A few checks for all_threads_mask==0 have been removed since it never
happens anymore.
An offsetof() macro was introduced with commit 928fbfa ("MINOR: compiler:
introduce offsetoff().") with a fallback for older compilers. But this
breaks gcc 3.4 because __size_t and __uintptr_t are not defined there.
However size_t and uintptr_t are, so let's fix it this way. No backport
needed.
The purpose is to make sure that all variables which directly depend
on this nbthread argument are set at the right moment. For now only
all_threads_mask needs to be set. It used to be set while calling
thread_sync_init() which is called too late for certain checks. The
same function handles threads and non-threads, which removes the need
for some thread-specific knowledge from cfgparse.c.
If nbthread is MAX_THREADS, the shift operation needed to compute
all_threads_mask fails in thread_sync_init(). Instead pass a number
of threads to this function and let it compute the mask without
overflowing.
This should be backported to 1.8.
This lock was necessary to manipulate the pendconn element between
concurrent places, but was causing great difficulties in the list walk
by having to iterate over multiple entries instead of being able to
safely pick the first one (in fact the first element was always the
right one but the locking model was hard to prove).
Here since we know we can always rely on the queue's locks, we take
the queue's lock every time we need to modify the element. In practice
it was already the case everywhere except in pendconn_dequeue() which
only works on an element that was already detached. This function had
to be protected against the risk of meeting an incompletely detached
element (which could be unlinked but not yet assigned). By taking the
queue lock around the LIST_ISEMPTY test, it's enough to ensure that a
concurrent thread either didn't begin or had completed the operation.
The true benefit really is in pendconn_process_next_strm() where we
can again safely work with the first element of each queue. This will
significantly simplify next updates to this code.
Whenever it's possible to avoid a copy, b_xfer() will simply swap the
buffer's heads without touching the data. This has brought the performance
back from 140 kH/s to 202 kH/s on the test case.
The latter function is more suited to operations that don't require any
check because the check has already been performed. It will be used by
other b_* functions.
This function is used a lot in block copies and is needlessly
complicated since it still uses pointer arithmetic. Let's fall
back to regular offsets and simplify it. This removed around
23 bytes from b_putblk() and it removed any conditional jump.
In thread_sync_barrier, we exit when all threads have set their own bit in the
barrier mask. It is done by comparing it to all_threads_mask. But we must not
use a simple equality to do so, becaue all_threads_mask may change. Since commit
ba86c6c25 ("MINOR: threads: Be sure to remove threads from all_threads_mask on
exit"), when a thread exit, its bit is removed from all_threads_mask. Instead,
we must use a bitwise AND to test is all bits of all_threads_mask are set.
This also requires that all_threads_mask is set to volatile if we want to
catch changes.
This patch must be backported in 1.8.
Now all the code used to manipulate chunks uses a struct buffer instead.
The functions are still called "chunk*", and some of them will progressively
move to the generic buffer handling code as they are cleaned up.
Chunks are only a subset of a buffer (a non-wrapping version with no head
offset). Despite this we still carry a lot of duplicated code between
buffers and chunks. Replacing chunks with buffers would significantly
reduce the maintenance efforts. This first patch renames the chunk's
fields to match the name and types used by struct buffers, with the goal
of isolating the code changes from the declaration changes.
Most of the changes were made with spatch using this coccinelle script :
@rule_d1@
typedef chunk;
struct chunk chunk;
@@
- chunk.str
+ chunk.area
@rule_d2@
typedef chunk;
struct chunk chunk;
@@
- chunk.len
+ chunk.data
@rule_i1@
typedef chunk;
struct chunk *chunk;
@@
- chunk->str
+ chunk->area
@rule_i2@
typedef chunk;
struct chunk *chunk;
@@
- chunk->len
+ chunk->data
Some minor updates to 3 http functions had to be performed to take size_t
ints instead of ints in order to match the unsigned length here.
Now the buffers only contain the header and a pointer to the storage
area which can be anywhere. This will significantly simplify buffer
swapping and will make it possible to map chunks on buffers as well.
The buf_empty variable was removed, as now it's enough to have size==0
and area==NULL to designate the empty buffer (thus a non-allocated head
is the empty buffer by default). buf_wanted for now is indicated by
size==0 and area==(void *)1.
The channels and the checks now embed the buffer's head, and the only
pointer is to the storage area. This slightly increases the unallocated
buffer size (3 extra ints for the empty buffer) but considerably
simplifies dynamic buffer management. It will also later permit to
detach unused checks.
The way the struct buffer is arranged has proven quite efficient on a
number of tests, which makes sense given that size is always accessed
and often first, followed by the othe ones.
It used to be called 'len' during the reorganisation but strictly speaking
it's not a length since it wraps. Also we already use '_data' as the suffix
to count available data, and data is also what we use to indicate the amount
of data in a pipe so let's improve consistency here. It was important to do
this in two operations because data used to be the name of the pointer to
the storage area.
This one is more generic and designed to work on a random block. It
may later get a b_rep_ist() variant since many strings are already
available as (ptr,len).
There was no point keeping that function in the buffer part since it's
exclusively used by HTTP at the channel level, since it also automatically
appends the CRLF. This further cleans up the buffer code.
The new file istbuf.h links the indirect strings (ist) with the buffers.
The purpose is to encourage addition of more standard buffer manipulation
functions that rely on this in order to improve the overall ease of use
along all the code. Just like ist.h and buf.h, this new file is not
expected to depend on anything beyond these two files.
A few functions were added and/or converted from buffer.h :
- b_isteq() : indicates if a buffer and a string match
- b_isteat() : consumes a string from the buffer if it matches
- b_istput() : appends a small string to a buffer (all or none)
- b_putist() : appends part of a large string to a buffer
The equivalent functions were removed from buffer.h and changed at the
various call places.
The two variants now do exactly the same (appending at the tail of the
buffer) so let's not keep the distinction between these classes of
functions and have generic ones for this. It's also worth noting that
b{i,o}_putchk() wasn't used at all and was removed.
There's no distinction between in and out data now. The latter covers
the needs of the former and supports wrapping. The extra cost is
negligible given the locations where it's used.
Since we never access this field directly anymore, but only through the
channel's wrappers, it can now move to the channel. The buffers are now
completely free from the distinction between input and output data.
Since we use "_data" for the amount of data at many places, as opposed to
"_space" for the amount of space, let's rename the "data" field to "area"
so that we can reuse "data" later for the amount of data in the buffer
(currently called "len" despite not being contigous).
b_set_data() is used :
- in proto_http and hlua to trim input data (b_set_data(co_data()))
- in SPOE to append data to a buffer while building a message
In no case will this truncate a buffer so we can safely remove the
test for len < b->output.
b_del() is used in :
- mux_h2 with the demux buffer : always processes input data
- checks with output data though output is not considered at all there
- b_eat() which is not used anywhere
- co_skip() where the len is always <= output
Thus the distinction for output data is not needed anymore and the
decrement can be made inconditionally in co_skip().
This is intentionally the minimal and safest set of changes, some cleanups
area still required. These changes are quite tricky and cannot be
independantly tested, so it's important to keep this patch as bisectable
as possible.
buf_empty and buf_wanted were changed and are now exactly similar since
there's no <p> member in the structure anymore. Given that no test is
ever made in the code to check that buf == &buf_wanted, it may be possible
that we don't need to have two anymore, unless some buf_empty tests have
precedence. This will have to be investigated.
A significant part of this commit affects the HTTP compression code,
which used to deeply manipulate the input and output buffers without
any reasonable solution for a better abstraction. For this reason, if
any regression is met and designates this patch as the culprit, it is
important to run tests which specifically involve compression or which
definitely don't use it in order to spot the issue.
Cc: Olivier Houchard <ohouchard@haproxy.com>
For the same consistency reasons, let's use b_empty() at the few places
where an empty buffer is expected, or c_empty() if it's done on a channel.
Some of these places were there to realign the buffer so
{b,c}_realign_if_empty() was used instead.
We used to have variations around buffer_total_space() and
size-buffer_len() or size-b_data(). Let's simplify all this. buffer_len()
was also removed as not used anymore.
Now the new API functions are being used everywhere, we can get rid
of b_ptr(). A few last users like bi_istput() and bo_istput() appear
to only differ by what part of the buffer they're increasing, but
that should quickly be merged.
Now that there are no more users requiring to modify the buffer anymore,
switch these ones to const char and const buffer. This will make it more
obvious next time send functions are tempted to modify the buffer's output
count. Minor adaptations were necessary at a few call places which were
using char due to the function's previous prototype.
Till now the callers had to know which one to call for specific use cases.
Let's fuse them now since a single one will remain after the API migration.
Given that bi_del() may only be used where o==0, just combine the two tests
by first removing output data then only input.
This will be important so that we can parse a buffer without touching it.
Now we indicate where from the buffer's head we plan to start to copy, and
for how many bytes. This will be used by send functions to loop at the end
of the buffer without having to update the buffer's output byte count.
This new functoin limits itself to the amount of data available in the
buffer and doesn't care about the direction anymore. It's only called
from co_getblk() which already checks that no more than the available
output bytes is requested.
These ones were merged into a single b_contig_space() that covers both
(the bo_ case was a simplified version of the other one). The function
doesn't use ->i nor ->o anymore.
This function was sometimes used from a channel and sometimes from a buffer.
In both cases it requires knowledge of the size of the output data (to skip
them). Here the split ensures the channel can deal with this point, and that
other places not having output data can continue to work.
These ones manipulate the output data count which will be specific to
the channel soon, so prepare the call points to use the channel only.
The b_* functions are now unused and were removed.
Where relevant, the channel version is used instead. The buffer version
was ported to be more generic and now takes a swap buffer and the output
byte count to know where to set the alignment point. The H2 mux still
uses buffer_slow_realign() with buf->o but it will change later.
Many places deal with buffer realignment after data removal. The method
is always the same : if the buffer is empty, set its pointer to the origin.
Let's have a function for this so that we have less code to change with the
new API.
Add a new function that lets you set the amount of input in a buffer.
For now it extends/truncates b->i except if the total length is
below b->o in which case it clears i and adjusts o.
Instead of doing b->i -= directly, introduce b_sub(), that does the job, to
make it easier to switch to the future API.
Also add b_add(), that increases b->i, instead of using it directly, and
bo_add(), that does increase b->o.
Here's the list of newly introduced functions :
- b_data(), returning the total amount of data in the buffer (currently i+o)
- b_orig(), returning the origin of the storage area, that is, the place of
position 0.
- b_wrap(), pointer to wrapping point (currently data+size)
- b_size(), returning the size of the buffer
- b_room(), returning the amount of bytes left available
- b_full(), returning true if the buffer is full, otherwise false
- b_stop(), pointer to end of data mark (currently p+i), used to compute
distances or a stop pointer for a loop.
- b_peek(), this one will help make the transition to the new buffer model.
It returns a pointer to a position in the buffer known from an offest
relative to the beginning of the data in the buffer. Thus, we can replace
the following occurrences :
bo_ptr(b) => b_peek(b, 0);
bo_end(b) => b_peek(b, b->o);
bi_ptr(b) => b_peek(b, b->o);
bi_end(b) => b_peek(b, b->i + b->o);
b_ptr(b, ofs) => b_peek(b, b->o + ofs);
- b_head(), pointer to the beginning of data (currently bo_ptr())
- b_tail(), pointer to first free place (currently bi_ptr())
- b_next() / b_next_ofs(), pointer to the next byte, taking wrapping
into account.
- b_dist(), returning the distance between two pointers belonging to a buffer
- b_reset(), which resets the buffer
- b_space_wraps(), indicating if the free space wraps around the buffer
- b_almost_full(), indicating if 3/4 or more of the buffer are used
Some of these are provided with the unchecked variants using the "__"
prefix, or with the "_ofs" suffix indicating they return a relative
position to the buffer's origin instead of a pointer.
Cc: Olivier Houchard <ohouchard@haproxy.com>
Passing unsigned ints everywhere is painful, and will cause some headache
later when we'll want to integrate better with struct ist which already
uses size_t. Let's switch buffers to use size_t instead.
The buffer code currently depends on pools and other stuff and is not
really autonomous anymore. The rewrite of the new API is an opportunity
to clean this up. This patch creates a new file (buf.h) which does not
depend on other elements and which will only contain what is needed to
perform the most basic buffer operations. The new API will be introduced
in this file and the conversion will be finished once buffer.h is empty.
The definition of struct buffer was moved to this new file, using more
explicity stdint types for the sizes and offsets.
Most new functions will be implemented in two variants :
__b_something() : unchecked variant, no wrapping is expected
b_something() : wrapping-checked variant
This way callers will be able to select which one to use depending on
the use cases.
The behavior of sigprocmask in an multithreaded environment is
undefined.
The new macro ha_sigmask() calls either pthreads_sigmask() or
sigprocmask() if haproxy was built with thread support or not.
This should be backported to 1.8.
A few users reported that building without threads was accidently broken
after commit 6b96f72 ("BUG/MEDIUM: pollers: Use a global list for fd
shared between threads.") due to all_threads_mask not being defined.
It's OK to set it to zero as other code parts do when threads are
enabled but only one thread is used.
This needs to be backported to 1.8.
With the old model, any fd shared by multiple threads, such as listeners
or dns sockets, would only be updated on one threads, so that could lead
to missed event, or spurious wakeups.
To avoid this, add a global list for fd that are shared, using the same
implementation as the fd cache, and only remove entries from this list
when every thread as updated its poller.
[wt: this will need to be backported to 1.8 but differently so this patch
must not be backported as-is]
We'll need this in order to support uploading chunks. The h2 to h1
converter checks for the presence of the content-length header field
as well as the CONNECT method and returns these information to the
caller. The caller indicates whether or not a body is detected for
the message (presence of END_STREAM or not). No transfer-encoding
header is emitted yet.
With gcc < 4.7, when HAProxy is built with threads, the macros
HA_ATOMIC_CAS/XCHG/STORE relies on the legacy __sync builtins. These macros
are slightly complicated than the versions relying on the '_atomic'
builtins. Internally, some local variables are defined, prefixed with '__' to
avoid name clashes with the caller.
On the other hand, the macros HA_ATOMIC_UPDATE_MIN/MAX call HA_ATOMIC_CAS. Some
local variables are also definied in these macros, following the same naming
rule as below. The problem is that '__new' variable is used in
HA_ATOMIC_MIN/_MAX and in HA_ATOMIC_CAS. Obviously, the behaviour is undefined
because '__new' in HA_ATOMIC_CAS is left uninitialized. Unfortunatly gcc fails
to detect this error.
To fix the problem, all internal variables to macros are now suffixed with name
of the macros to avoid clashes (for instance, '__new_cas' in HA_ATOMIC_CAS).
This patch must be backported in 1.8.
The management of the servers and the proxies queues was not thread-safe at
all. First, the accesses to <strm>->pend_pos were not protected. So it was
possible to release it on a thread (for instance because the stream is released)
and to use it in same time on another one (because we redispatch pending
connections for a server). Then, the accesses to stream's information (flags and
target) from anywhere is forbidden. To be safe, The stream's state must always
be updated in the context of process_stream.
So to fix these issues, the queue module has been refactored. A lock has been
added in the pendconn structure. And now, when we try to dequeue a pending
connection, we start by unlinking it from the server/proxy queue and we wake up
the stream. Then, it is the stream reponsibility to really dequeue it (or
release it). This way, we are sure that only the stream can create and release
its <pend_pos> field.
However, be careful. This new implementation should be thread-safe
(hopefully...). But it is not optimal and in some situations, it could be really
slower in multi-threaded mode than in single-threaded one. The problem is that,
when we try to dequeue pending connections, we process it from the older one to
the newer one independently to the thread's affinity. So we need to wait the
other threads' wakeup to really process them. If threads are blocked in the
poller, this will add a significant latency. This problem happens when maxconn
values are very low.
This patch must be backported in 1.8.
When the block of data need to be split to support the wrapping, the start of
the second block of data was wrong. We must be sure to skup data copied during
the first memcpy.
This patch must be backported to 1.8.
When the block of data need to be split to support the wrapping, the start of
the second block of data was wrong. We must be sure to skip data copied during
the first memcpy.
This patch must be backported to 1.8, 1.7, 1.6 and 1.5.
Since we use padding before the allocated page, it's trivial to place
the allocated address there and see if it gets mangled once we release
it.
This may be backported to stable releases already using DEBUG_UAF.
Commit 158fa75 ("MINOR: pools: implement DEBUG_UAF to detect use after free")
implemented pool use-after-free detection, but the mmap() return value isn't
properly checked, preventing the call to pool_alloc_area() from returning
NULL. So on out-of-memory a mangled pointer is returned, causing a crash on
the pool_alloc() site instead of forcing a GC. It doesn't affect regular
operations however, just complicates complex bug investigations.
This fix should be backported to 1.8 and to 1.7.
Since commit cf975d4 ("MINOR: pools/threads: Implement lockless memory
pools."), we support lockless pools. However the parts dedicated to
detecting use-after-free are not present in this part, making DEBUG_UAF
useless in this situation.
The present patch sets a new define CONFIG_HAP_LOCKLESS_POOLS when such
a compatible architecture is detected, and when pool debugging is not
requested, then makes use of this everywhere in pools and buffers
functions. This way enabling DEBUG_UAF will automatically disable the
lockless version.
No backport is needed as this is purely 1.9-dev.
This removes the end label from memory.h.
The labels are unused as of cf975d46bc
which is unreleased (and incidentally the first commit containing
those labels, thus they never have been used).
A TLS ticket keys file can be updated on the CLI and used in same time. So we
need to protect it to be sure all accesses are thread-safe. Because updates are
infrequent, a R/W lock has been used.
This patch must be backported in 1.8
Commit f61f0cb ("MINOR: threads: Introduce double-width CAS on x86_64
and arm.") introduced the double CAS. But the ARMv7 version is bogus,
it uses the value of the pointers instead of dereferencing them. When
lucky, it simply doesn't build due to impossible registers combinations.
Otherwise it will immediately crash at run time when facing traffic.
No backport is needed, this bug was introduced in 1.9-dev.
Create a local, per-thread, fdcache, for file descriptors that only belongs
to one thread, and make the global fd cache mostly lockless, as we can get
a lot of contention on the fd cache lock.
Recent changes to the enum were not synchronized with the lock debugging
code. Now we use a switch/case instead of an array so that the compiler
throws a warning if there is any inconsistency.
To be backported to 1.8 (at least to add the START entry).
Marc Fournier reported an interesting case when using threads with the
master-worker mode : sometimes, a listener would have its FD closed
during startup. Sometimes it could even be health checks seeing this.
What happens is that after the threads are created, and the pollers
enabled on each threads, the master-worker pipe is registered, and at
the same time a close() is performed on the write side of this pipe
since the children must not use it.
But since this is replicated in every thread, what happens is that the
first thread closes the pipe, thus releases the FD, and the next thread
starting a listener in parallel gets this FD reassigned. Then another
thread closes the FD again, which this time corresponds to the listener.
It can also happen with the health check sockets if they're started
early enough.
This patch splits the mworker_pipe_register() function in two, so that
the close() of the write side of the FD is performed very early after the
fork() and long before threads are created (we don't need to delay it
anyway). Only the pipe registration is done in the threaded code since
it is important that the pollers are properly allocated for this.
The mworker_pipe_register() function now takes care of registering the
pipe only once, and this is guaranteed by a new surrounding lock.
The call to protocol_enable_all() looks fragile in theory since it
scans the list of proxies and their listeners, though in practice
all threads scan the same list and take the same locks for each
listener so it's not possible that any of them escapes the process
and finishes before all listeners are started. And the operation is
idempotent.
This fix must be backported to 1.8. Thanks to Marc for providing very
detailed traces clearly showing the problem.
This one allows not to inflate some structures when threads are
disabled. Now struct global is 1.4 kB instead of 33 kB.
Should be backported to 1.8 for ease of backporting of upcoming
patches.
Till now the use of __atomic_* gcc builtins required gcc >= 4.7. Since
some supported and quite common operating systems like CentOS 6 still
come with older versions (4.4) and the mapping to the older builtins
is reasonably simple, let's implement it.
This code is only used for gcc < 4.7. It has been quickly tested on a
machine using gcc 4.4.4 and provided expected results.
This patch should be backported to 1.8.
In hpack_dht_make_room(), we try to fulfill this rule form RFC7541#4.4 :
"It is not an error to attempt to add an entry that is larger than the
maximum size; an attempt to add an entry larger than the maximum size
causes the table to be emptied of all existing entries and results in
an empty table."
Unfortunately it is not consistent with the way it's used in
hpack_dht_insert() as this last one will consider a success as a
confirmation it can copy the header into the table, and a failure as
an indexing error. This results in the two following issues :
- if a client sends too large a header into an empty table, this
header may overflow the table. Fortunately, most clients send
small headers like :authority first, and never mark headers that
don't fit into the table as indexable since it is counter-productive ;
- if a client sends too large a header into a populated table, the
operation fails after the table is totally flushed and the request
is not processed.
This patch fixes the two issues at once :
- a header not fitting into an empty table is always a sign that it
will never fit ;
- not fitting into the table is not an error
Thanks to Yves Lafon for reporting detailed traces demonstrating this
issue. This fix must be backported to 1.8.
If the hpack decoder sees an invalid header index, it emits value
"### ERR ###" that was used during debugging instead of rejecting the
block. This is harmless, and was detected by h2spec.
To backport to 1.8.
Released version 1.9-dev0 with the following main changes :
- BUG/MEDIUM: stream: don't automatically forward connect nor close
- BUG/MAJOR: stream: ensure analysers are always called upon close
- BUG/MINOR: stream-int: don't try to read again when CF_READ_DONTWAIT is set
- MEDIUM: mworker: Add systemd `Type=notify` support
- BUG/MEDIUM: cache: free callback to remove from tree
- CLEANUP: cache: remove unused struct
- MEDIUM: cache: enable the HTTP analysers
- CLEANUP: cache: remove wrong comment
- MINOR: threads/atomic: rename local variables in macros to avoid conflicts
- MINOR: threads/plock: rename local variables in macros to avoid conflicts
- MINOR: threads/atomic: implement pl_mb() in asm on x86
- MINOR: threads/atomic: implement pl_bts() on non-x86
- MINOR: threads/build: atomic: replace the few inlines with macros
- BUILD: threads/plock: fix a build issue on Clang without optimization
- BUILD: ebtree: don't redefine types u32/s32 in scope-aware trees
- BUILD: compiler: add a new type modifier __maybe_unused
- BUILD: h2: mark some inlined functions "unused"
- BUILD: server: check->desc always exists
- BUG/MEDIUM: h2: properly report connection errors in headers and data handlers
- MEDIUM: h2: add a function to emit an HTTP/1 request from a headers list
- MEDIUM: h2: change hpack_decode_headers() to only provide a list of headers
- BUG/MEDIUM: h2: always reassemble the Cookie request header field
- BUG/MINOR: systemd: ignore daemon mode
- CONTRIB: spoa_example: allow to compile outside HAProxy.
- CONTRIB: spoa_example: remove bref, wordlist, cond_wordlist
- CONTRIB: spoa_example: remove last dependencies on type "sample"
- CONTRIB: spoa_example: remove SPOE enums that are useless for clients
- CLEANUP: cache: reorder includes
- MEDIUM: shctx: use unsigned int for len and block_count
- MEDIUM: cache: "show cache" on the cli
- BUG/MEDIUM: cache: use key=0 as a condition for freeing
- BUG/MEDIUM: cache: refcount forbids to free the objects
- BUG/MEDIUM: cache fix cli_kws structure
- BUG/MEDIUM: deinit: correctly deinitialize the proxy and global listener tasks
- BUG/MINOR: ssl: Always start the handshake if we can't send early data.
- MINOR: ssl: Don't disable early data handling if we could not write.
- MINOR: pools: prepare functions to override malloc/free in pools
- MINOR: pools: implement DEBUG_UAF to detect use after free
- BUG/MEDIUM: threads/time: fix time drift correction
- BUG/MEDIUM: threads/time: maintain a common time reference between all threads
- MINOR: sample: Add "thread" sample fetch
- BUG/MINOR: Use crt_base instead of ca_base when crt is parsed on a server line
- BUG/MINOR: stream: fix tv_request calculation for applets
- BUG/MAJOR: h2: always remove a stream from the send list before freeing it
- BUG/MAJOR: threads/task: dequeue expired tasks under the WQ lock
- MINOR: ssl: Handle reading early data after writing better.
- MINOR: mux: Make sure every string is woken up after the handshake.
- MEDIUM: cache: store sha1 for hashing the cache key
- MINOR: http: implement the "http-request reject" rule
- MINOR: h2: send RST_STREAM before GOAWAY on reject
- MEDIUM: h2: don't gracefully close the connection anymore on Connection: close
- MINOR: h2: make use of client-fin timeout after GOAWAY
- MEDIUM: config: ensure that tune.bufsize is at least 16384 when using HTTP/2
- MINOR: ssl: Handle early data with BoringSSL
- BUG/MEDIUM: stream: always release the stream-interface on abort
- BUG/MEDIUM: cache: free ressources in chn_end_analyze
- MINOR: cache: move the refcount decrease in the applet release
- BUG/MINOR: listener: Allow multiple "process" options on "bind" lines
- MINOR: config: Support a range to specify processes in "cpu-map" parameter
- MINOR: config: Slightly change how parse_process_number works
- MINOR: config: Export parse_process_number and use it wherever it's applicable
- MINOR: standard: Add my_ffsl function to get the position of the bit set to one
- MINOR: config: Add auto-increment feature for cpu-map
- MINOR: config: Support partial ranges in cpu-map directive
- MINOR:: config: Remove thread-map directive
- MINOR: config: Add the threads support in cpu-map directive
- MINOR: config: Add threads support for "process" option on "bind" lines
- MEDIUM: listener: Bind listeners on a thread subset if specified
- CLEANUP: debug: Use DPRINTF instead of fprintf into #ifdef DEBUG_FULL/#endif
- CLEANUP: log: Rename Alert/Warning in ha_alert/ha_warning
- MINOR/CLEANUP: proxy: rename "proxy" to "proxies_list"
- CLEANUP: pools: rename all pool functions and pointers to remove this "2"
- DOC: update the roadmap file with the latest changes merged in 1.8
- DOC: fix mangled version in peers protocol documentation
- DOC: add initial peers protovol v2.0 documentation.
- DOC: mention William as maintainer of the cache and master-worker
- DOC: add Christopher and Emeric as maintainers of the threads
- MINOR: cache: replace a fprint() by an abort()
- MEDIUM: cache: max-age configuration keyword
- DOC: explain HTTP2 timeout behavior
- DOC: cache: configuration and management
- MAJOR: mworker: exits the master on failure
- BUG/MINOR: threads: don't drop "extern" on the lock in include files
- MINOR: task: keep a pointer to the currently running task
- MINOR: task: align the rq and wq locks
- MINOR: fd: cache-align fdtab and fdcache locks
- MINOR: buffers: cache-align buffer_wq_lock
- CLEANUP: server: reorder some fields in struct server to save 40 bytes
- CLEANUP: proxy: slightly reorder the struct proxy to reduce holes
- CLEANUP: checks: remove 16 bytes of holes in struct check
- CLEANUP: cache: more efficiently pack the struct cache
- CLEANUP: fd: place the lock at the beginning of struct fdtab
- CLEANUP: pools: align pools on a cache line
- DOC: config: add a few bits about how to configure HTTP/2
- BUG/MAJOR: threads/queue: avoid recursive locking in pendconn_get_next_strm()
- BUILD: Makefile: reorder object files by size
There are just a few pools, and they're stressed a lot, so it makes
sense to dedicate them a cache line to avoid contention and to place
the lock at the beginning.
Commit 9dcf9b6 ("MINOR: threads: Use __decl_hathreads to declare locks")
accidently lost a few "extern" in certain lock declarations, possibly
causing certain entries to be declared at multiple places. Apparently
it hasn't caused any harm though.
The offending ones were :
- fdtab_lock
- fdcache_lock
- poll_lock
- buffer_wq_lock