The hathreads.h file has quickly become a total mess because it contains
thread definitions, atomic operations and locking operations, all this for
multiple combinations of threads, debugging and architectures, and all this
done with random ordering!
This first patch extracts all the atomic ops code from hathreads.h to move
it to haproxy/atomic.h. The code there still contains several sections
based on non-thread vs thread, and GCC versions in the latter case. Each
section was arranged in the exact same order to ease finding.
The redundant HA_BARRIER() which was the same as __ha_compiler_barrier()
was dropped in favor of the latter which follows the naming convention
of all other barriers. It was only used in freq_ctr.c which was updated.
Additionally, __ha_compiler_barrier() was defined inconditionally but
used only for thread-related operations, so it was made thread-only like
HA_BARRIER() used to be. We'd still need to have two types of compiler
barriers, one for the general case (e.g. signals) and another one for
concurrency, but this was not addressed here.
Some comments were added at the beginning of each section to inform about
the use case and warn about the traps to avoid.
Some files which continue to include hathreads.h solely for atomic ops
should now be updated.
Half of the users of this include only need the type definitions and
not the manipulation macros nor the inline functions. Moves the various
types into mini-clist-t.h makes the files cleaner. The other one had all
its includes grouped at the top. A few files continued to reference it
without using it and were cleaned.
In addition it was about time that we'd rename that file, it's not
"mini" anymore and contains a bit more than just circular lists.
File buf.h is one common cause of pain in the dependencies. Many files in
the code need it to get the struct buffer definition, and a few also need
the inlined functions to manipulate a buffer, but the file used to depend
on a long chain only for BUG_ON() (addressed by last commit).
Now buf.h is split into buf-t.h which only contains the type definitions,
and buf.h for all inlined functions. Callers who don't care can continue
to use buf.h but files in types/ must only use buf-t.h. sys/types.h had
to be added to buf.h to get ssize_t as used by b_move(). It's worth noting
that ssize_t is only supposed to be a size_t supporting -1, so b_move()
ought to be rethought regarding this.
The files were moved to haproxy/ and all their users were updated
accordingly. A dependency issue was addressed on fcgi whose C file didn't
include buf.h.
This one used to be stored into debug.h but the debug tools got larger
and require a lot of other includes, which can't use BUG_ON() anymore
because of this. It does not make sense and instead this macro should
be placed into the lower includes and given its omnipresence, the best
solution is to create a new bug.h with the few surrounding macros needed
to trigger bugs and place assertions anywhere.
Another benefit is that it won't be required to add include <debug.h>
anymore to use BUG_ON, it will automatically be covered by api.h. No
less than 32 occurrences were dropped.
The FSM_PRINTF macro was dropped since not used at all anymore (probably
since 1.6 or so).
Fortunately that file wasn't made dependent upon haproxy since it was
integrated, better isolate it before it's too late. Its dependency on
api.h was the result of the change from config.h, which in turn wasn't
correct. It was changed back to stddef.h for size_t and sys/types.h for
ssize_t. The recently added reference to MAX() was changed as it was
placed only to avoid a zero length in the non-free-standing version and
was causing a build warning in the hpack encoder.
This file is to openssl what compat.h is to the libc, so it makes sense
to move it to haproxy/. It could almost be part of api.h but given the
amount of openssl stuff that gets loaded I fear it could increase the
build time.
Note that this file contains lots of inlined functions. But since it
does not depend on anything else in haproxy, it remains safe to keep
all that together.
There is one "template.h" per include subdirectory to show how to create
a new file but in practice nobody knows they're here so they're useless.
Let's simply remove them.
All files that were including one of the following include files have
been updated to only include haproxy/api.h or haproxy/api-t.h once instead:
- common/config.h
- common/compat.h
- common/compiler.h
- common/defaults.h
- common/initcall.h
- common/tools.h
The choice is simple: if the file only requires type definitions, it includes
api-t.h, otherwise it includes the full api.h.
In addition, in these files, explicit includes for inttypes.h and limits.h
were dropped since these are now covered by api.h and api-t.h.
No other change was performed, given that this patch is large and
affects 201 files. At least one (tools.h) was already freestanding and
didn't get the new one added.
This is where other imported components are located. All files which
used to directly include ebtree were touched to update their include
path so that "import/" is now prefixed before the ebtree-related files.
The ebtree.h file was slightly adjusted to read compiler.h from the
common/ subdirectory (this is the only change).
A build issue was encountered when eb32sctree.h is loaded before
eb32tree.h because only the former checks for the latter before
defining type u32. This was addressed by adding the reverse ifdef
in eb32tree.h.
No further cleanup was done yet in order to keep changes minimal.
The support for reqrep and friends was removed in 2.1 but the
chain_regex() function and the "action" field in the regex struct
was still there. This patch removes them.
One point worth mentioning though. There is a check_replace_string()
function whose purpose was to validate the replacement strings passed
to reqrep. It should also be used for other replacement regex, but is
never called. Callers of exp_replace() should be checked and a call to
this function should be added to detect the error early.
This patch adds new statement "server" into ring section, and the
related "timeout connect" and "timeout server".
server <name> <address> [param*]
Used to configure a syslog tcp server to forward messages from ring buffer.
This supports for all "server" parameters found in 5.2 paragraph.
Some of these parameters are irrelevant for "ring" sections.
timeout connect <timeout>
Set the maximum time to wait for a connection attempt to a server to succeed.
Arguments :
<timeout> is the timeout value specified in milliseconds by default, but
can be in any other unit if the number is suffixed by the unit,
as explained at the top of this document.
timeout server <timeout>
Set the maximum time for pending data staying into output buffer.
Arguments :
<timeout> is the timeout value specified in milliseconds by default, but
can be in any other unit if the number is suffixed by the unit,
as explained at the top of this document.
Example:
global
log ring@myring local7
ring myring
description "My local buffer"
format rfc3164
maxlen 1200
size 32764
timeout connect 5s
timeout server 10s
server mysyslogsrv 127.0.0.1:6514
Commit 04f5fe87d3 introduced an rwlock in the pools to deal with the risk
that pool_flush() dereferences an area being freed, and commit 899fb8abdc
turned it into a spinlock. The pools already contain a spinlock in case of
locked pools, so let's use the same and simplify the code by removing ifdefs.
At this point I'm really suspecting that if pool_flush() would instead
rely on __pool_get_first() to pick entries from the pool, the concurrency
problem could never happen since only one user would get a given entry at
once, thus it could not be freed by another user. It's not certain this
would be faster however because of the number of atomic ops to retrieve
one entry compared to a locked batch.
HTTP_1XX, HTTP_3XX and HTTP_4XX message templates are no longer used. Only
HTTP_302 and HTTP_303 are used during configuration parsing by "errorloc" family
directives. So these templates are removed from the generic http code. And
HTTP_302 and HTTP_303 templates are moved as static strings in the function
parsing "errorloc" directives.
There is no reason to not use proxy's error replies to emit 401/407
responses. The function http_reply_40x_unauthorized(), responsible to emit those
responses, is not really complex. It only adds a
WWW-Authenticate/Proxy-Authenticate header to a generic message.
So now, error replies can be defined for 401 and 407 status codes, using
errorfile or http-error directives. When an http-request auth rule is evaluated,
the corresponding error reply is used. For 401 responses, all occurrences of the
WWW-Authenticate header are removed and replaced by a new one with a basic
authentication challenge for the configured realm. For 407 responses, the same
is done on the Proxy-Authenticate header. If the error reply must not be
altered, "http-request return" rule must be used instead.
During pool_free(), when the ->allocated value is 125% of needed_avg or
more, instead of putting the object back into the pool, it's immediately
freed using free(). By doing this we manage to significantly reduce the
amount of memory pinned in pools after transient traffic spikes.
During a test involving a constant load of 100 concurrent connections
each delivering 100 requests per second, the memory usage was a steady
21 MB RSS. Adding a 1 minute parallel load of 40k connections all looping
on 100kB objects made the memory usage climb to 938 MB before this patch.
With the patch it was only 660 MB. But when this parasit load stopped,
before the patch the RSS would remain at 938 MB while with the patch,
it went down to 480 then 180 MB after a few seconds, to stabilize around
69 MB after about 20 seconds.
This can be particularly important to improve reloads where the memory
has to be shared between the old and new process.
Another improvement would be welcome, we ought to have a periodic task
to check pools usage and continue to free up unused objects regardless
of any call to pool_free(), because the needed_avg value depends on the
past and will not cover recently refilled objects.
This adds a sliding estimate of the pools' usage. The goal is to be able
to use this to start to more aggressively free memory instead of keeping
lots of unused objects in pools. The average is calculated as a sliding
average over the last 1024 consecutive measures of ->used during calls to
pool_free(), and is bumped up for 1/4 of its history from ->allocated when
allocation from the pool fails and results in a call to malloc().
The result is a floating value between ->used and ->allocated, that tries
to react fast to under-estimates that result in expensive malloc() but
still maintains itself well in case of stable usage, and progressively
goes down if usage shrinks over time.
This new metric is reported as "needed_avg" in "show pools".
Sadly due to yet another include dependency hell, we couldn't reuse the
functions from freq_ctr.h so they were temporarily duplicated into memory.h.
Recent commit 2bdcc70fa7 ("MEDIUM: hpack: use a pool for the hpack table")
made the hpack code finally use a pool with very unintrusive code that was
assumed to be trivial enough to adjust if the code needed to be reused
outside of haproxy. Unfortunately the code in contrib/hpack already uses
it and broke the oss-fuzz tests as it doesn't build anymore.
This patch adds an HPACK_STANDALONE macro to decide if we should use the
pools or malloc+free. The resulting macros are called hpack_alloc() and
hpack_free() respectively, and the size must be passed into the pool
itself.
The htx_copy_msg() function can now be used to copy the HTX message stored in a
buffer in an existing HTX message. It takes care to not overwrite existing
data. If the destination message is empty, a raw copy is performed. All the
message is copied or nothing.
This function is used instead of channel_htx_copy_msg().
This reverts commit 957ec59571.
As discussed with Emeric, the current syntax is not extensible enough,
this will be turned to a section instead in a forthcoming patch.
Instead of using malloc/free to allocate an HPACK table, let's declare
a pool. However the HPACK size is configured by the H2 mux, so it's
also this one which allocates it after post_check.
This patch adds the new global statement:
ring <name> [desc <desc>] [format <format>] [size <size>] [maxlen <length>]
Creates a named ring buffer which could be used on log line for instance.
<desc> is an optionnal description string of the ring. It will appear on
CLI. By default, <name> is reused to fill this field.
<format> is the log format used when generating syslog messages. It may be
one of the following :
iso A message containing only the ISO date, followed by the text.
The PID, process name and system name are omitted. This is
designed to be used with a local log server.
raw A message containing only the text. The level, PID, date, time,
process name and system name are omitted. This is designed to be
used in containers or during development, where the severity only
depends on the file descriptor used (stdout/stderr). This is
the default.
rfc3164 The RFC3164 syslog message format. This is the default.
(https://tools.ietf.org/html/rfc3164)
rfc5424 The RFC5424 syslog message format.
(https://tools.ietf.org/html/rfc5424)
short A message containing only a level between angle brackets such as
'<3>', followed by the text. The PID, date, time, process name
and system name are omitted. This is designed to be used with a
local log server. This format is compatible with what the systemd
logger consumes.
timed A message containing only a level between angle brackets such as
'<3>', followed by ISO date and by the text. The PID, process
name and system name are omitted. This is designed to be
used with a local log server.
<length> is the maximum length of event message stored into the ring,
including formatted header. If the event message is longer
than <length>, it would be truncated to this length.
<name> is the ring identifier, which follows the same naming convention as
proxies and servers.
<size> is the optionnal size in bytes. Default value is set to BUFSIZE.
Note: Historically sink's name and desc were refs on const strings. But with new
configurable rings a dynamic allocation is needed.
The EVP_MD_CTX_create() and EVP_MD_CTX_destroy() functions were renamed to
EVP_MD_CTX_new() and EVP_MD_CTX_free() in OpenSSL 1.1.0, respectively. These
functions are used by the digest converter, introduced by the commit 8e36651ed
("MINOR: sample: Add digest and hmac converters"). So for prior versions of
openssl, macros are used to fallback on old functions.
This patch must only be backported if the commit 8e36651ed is backported too.
This one really ought to be defined in hathreads.h like all other thread
definitions, which is what this patch does. As expected, all files but
one (regex.h) were already including hathreads.h when using THREAD_LOCAL;
regex.h was fixed for this.
This was the last entry in config.h which is now useless.
The setting of CONFIG_HAP_LOCKLESS_POOLS depending on threads and
compat was done in config.h for use only in memory.h and memory.c
where other settings are dealt with. Further, the default pool cache
size was set there from a fixed value instead of being set from
defaults.h
Let's move the decision to enable lockless pools via
CONFIG_HAP_LOCKLESS_POOLS to memory.h, and set the default pool
cache size in defaults.h like other default settings.
This was the next-to-last setting in config.h.
CONFIG_HAP_MEM_OPTIM was introduced with memory pools in 1.3 and dropped
in 1.6 when pools became the only way to allocate memory. Still the
option remained present in config.h. Let's kill it.
Just like in previous patch, it happens that HA_ATOMIC_UPDATE_MIN() and
HA_ATOMIC_UPDATE_MAX() would evaluate the (val) argument up to 3 times.
However this time it affects both thread and non-thread versions. It's
strange because the copy was properly performed for the (new) argument
in order to avoid this. Anyway it was done for the "val" one as well.
A quick code inspection showed that this currently has no effect as
these macros are fairly limited in usage.
It would be best to backport this for long-term stability (till 1.8)
but it will not fix an existing bug.
When threads are disabled, HA_ATOMIC_CAS() becomes a simple compound
expression. However this expression presents a problem, which is that
its arguments are evaluated multiple times, once for the comparison
and once again for the assignement. This presents a risk of performing
some side-effect operations twice in the non-threaded case (e.g. in
case of auto-increment or function return).
The macro was rewritten using local copies for arguments like the
other macros do.
Fortunately a complete inspection of the code indicates that this case
currently never happens. It was however responsible for the strict-aliasing
warning emitted when building fd.c without threads but with 64-bit CAS.
This may be backported as far as 1.8 though it will not fix any existing
bug and is more of a long-term safety measure in case a future fix would
depend on this behavior.
I changed my mind twice on this one and pushed after the last test with
threads disabled, without re-enabling long long, causing this rightful
build warning.
This needs to be backported if the previous commit ff64d3b027 ("MINOR:
threads: export the POSIX thread ID in panic dumps") is backported as
well.
It is very difficult to map a panic dump against a gdb thread dump
because the thread numbers do not match. However gdb provides the
pthread ID but this one is supposed to be opaque and not to be cast
to a scalar.
This patch provides a fnuction, ha_get_pthread_id() which retrieves
the pthread ID of the indicated thread and casts it to an unsigned
long long so as to lose the least possible amount of information from
it. This is done cleanly using a union to maintain alignment so as
long as these IDs are stored on 1..8 bytes they will be properly
reported. This ID is now presented in the panic dumps so it now
becomes possible to map these threads. When threads are disabled,
zero is returned. For example, this is a panic dump:
Thread 1 is about to kill the process.
*>Thread 1 : id=0x7fe92b825180 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
stuck=1 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=5119122 now=2009446995 diff=2004327873
curr_task=0xc99bf0 (task) calls=4 last=0
fct=0x592440(task_run_applet) ctx=0xca9c50(<CLI>)
strm=0xc996a0 src=unix fe=GLOBAL be=GLOBAL dst=<CLI>
rqf=848202 rqa=0 rpf=80048202 rpa=0 sif=EST,200008 sib=EST,204018
af=(nil),0 csf=0xc9ba40,8200
ab=0xca9c50,4 csb=(nil),0
cof=0xbf0e50,1300:PASS(0xc9cee0)/RAW((nil))/unix_stream(20)
cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
call trace(20):
| 0x59e4cf [48 83 c4 10 5b 5d 41 5c]: wdt_handler+0xff/0x10c
| 0x7fe92c170690 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13690
| 0x7ffce29519d9 [48 c1 e2 20 48 09 d0 48]: linux-vdso:+0x9d9
| 0x7ffce2951d54 [eb d9 f3 90 e9 1c ff ff]: linux-vdso:__vdso_gettimeofday+0x104/0x133
| 0x57b484 [48 89 e6 48 8d 7c 24 10]: main+0x157114
| 0x50ee6a [85 c0 75 76 48 8b 55 38]: main+0xeaafa
| 0x50f69c [48 63 54 24 20 85 c0 0f]: main+0xeb32c
| 0x59252c [48 c7 c6 d8 ff ff ff 44]: task_run_applet+0xec/0x88c
Thread 2 : id=0x7fe92b6e6700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=786738 now=1086955 diff=300217
curr_task=0
Thread 3 : id=0x7fe92aee5700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=828056 now=1129738 diff=301682
curr_task=0
Thread 4 : id=0x7fe92a6e4700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=818900 now=1153551 diff=334651
curr_task=0
And this is the gdb output:
(gdb) info thr
Id Target Id Frame
* 1 Thread 0x7fe92b825180 (LWP 15234) 0x00007fe92ba81d6b in raise () from /lib64/libc.so.6
2 Thread 0x7fe92b6e6700 (LWP 15235) 0x00007fe92bb56a56 in epoll_wait () from /lib64/libc.so.6
3 Thread 0x7fe92a6e4700 (LWP 15237) 0x00007fe92bb56a56 in epoll_wait () from /lib64/libc.so.6
4 Thread 0x7fe92aee5700 (LWP 15236) 0x00007fe92bb56a56 in epoll_wait () from /lib64/libc.so.6
We can clearly see that while threads 1 and 2 are the same, gdb's
threads 3 and 4 respectively are haproxy's threads 4 and 3.
This may be backported to 2.0 as it removes some confusion in github issues.
A shared tcp-check ruleset is now created to support LDAP check. This way no
extra memory is used if several backends use a LDAP check.
The following sequance is used :
tcp-check send-binary "300C020101600702010304008000"
tcp-check expect rbinary "^30" min-recv 14 \
on-error "Not LDAPv3 protocol"
tcp-check expect custom
The last expect rule relies on a custom function to check the LDAP server reply.
A share tcp-check ruleset is now created to support smtp checks. This way no
extra memory is used if several backends use a smtp check.
The following sequence is used :
tcp-check connect default linger
tcp-check expect rstring "^[0-9]{3}[ \r]" min-recv 4 \
error-status "L7RSP" on-error "%[check.payload(),cut_crlf]"
tcp-check expect rstring "^2[0-9]{2}[ \r]" min-recv 4 \
error-status "L7STS" \
on-error %[check.payload(4,0),ltrim(' '),cut_crlf] \
status-code "check.payload(0,3)"
tcp-echeck send "%[var(check.smtp_cmd)]\r\n" log-format
tcp-check expect rstring "^2[0-9]{2}[- \r]" min-recv 4 \
error-status "L7STS" \
on-error %[check.payload(4,0),ltrim(' '),cut_crlf] \
on-success "%[check.payload(4,0),ltrim(' '),cut_crlf]" \
status-code "check.payload(0,3)"
The variable check.smtp_cmd is by default the string "HELO localhost" by may be
customized setting <helo> and <domain> parameters on the option smtpchk
line. Note there is a difference with the old smtp check. The server gretting
message is checked before send the HELO/EHLO comand.
A share tcp-check ruleset is now created to support redis checks. This way no
extra memory is used if several backends use a redis check.
The following sequence is used :
tcp-check send "*1\r\n$4\r\nPING\r\n"
tcp-check expect string "+PONG\r\n" error-status "L7STS" \
on-error "%[check.payload(),cut_crlf]" on-success "Redis server is ok"
list_for_each_entry_rev() and list_for_each_entry_from_rev() and corresponding
safe versions have been added to iterate on a list in the reverse order. All
these functions work the same way than the forward versions, except they use the
.p field to move for an element to another.
The url_decode() function used by the url_dec converter and a few other
call points is ambiguous on its processing of the '+' character which
itself isn't stable in the spec. This one belongs to the reserved
characters for the query string but not for the path nor the scheme,
in which it must be left as-is. It's only in argument strings that
follow the application/x-www-form-urlencoded encoding that it must be
turned into a space, that is, in query strings and POST arguments.
The problem is that the function is used to process full URLs and
paths in various configs, and to process query strings from the stats
page for example.
This patch updates the function to differentiate the situation where
it's parsing a path and a query string. A new argument indicates if a
query string should be assumed, otherwise it's only assumed after seeing
a question mark.
The various locations in the code making use of this function were
updated to take care of this (most call places were using it to decode
POST arguments).
The url_dec converter is usually called on path or url samples, so it
needs to remain compatible with this and will default to parsing a path
and turning the '+' to a space only after a question mark. However in
situations where it would explicitly be extracted from a POST or a
query string, it now becomes possible to enforce the decoding by passing
a non-null value in argument.
It seems to be what was reported in issue #585. This fix may be
backported to older stable releases.