Thanks to ("MINOR: tools: add vma_set_name() helper"), set a name hint
for user created memory-backed sinks (ring sections without backing-file)
so that they can be easily indentified in /proc/<pid>/maps.
Depending on malloc() implementation, such memory areas will normally be
merged on the heap under MMAP_THRESHOLD (128 kB by default) and will
have a dedicated memory area once the threshold is exceeded. As such, when
large enough, they will appear like this in /proc/<pid>/maps:
7b8e8ac00000-7b8e8bf13000 rw-p 00000000 00:00 0 [anon💍myring]
Since 40d1c84bf0 ("BUG/MAJOR: ring: free the ring storage not the ring
itself when using maps"), munmap() call for startup_logs's ring and
file-backed rings fails to work (EINVAL) and causes memory leaks during
process cleanup.
munmap() fails because it is called with the ring's usable area pointer
which is an offset from the underlying original memory block allocated
using mmap(). Indeed, ring_area() helper function was misused because
it didn't explicitly mention that the returned address corresponds to
the usable storage's area, not the allocated one.
To fix the issue, we add an explicit ring_allocated_area() helper to
return the allocated area for the ring, just like we already have
ring_allocated_size() for the allocated size, and we properly use both
the allocated size and allocated area to manipulate them using munmap()
and msync().
No backport needed.
last_change was a member present in both proxy and server struct. It is
used as an age statistics to report the last update of the object.
Move last_change into fe_counters/be_counters. This is necessary to be
able to manipulate it through generic stat column and report it into
stats-file.
Note that there is a change for proxy structure with now 2 different
last_change values, on frontend and backend side. Special care was taken
to ensure that the value is initialized only on the proxy side. The
other value is set to 0 unless a listen proxy is instantiated. For the
moment, only backend counter is reported in stats. However, with now two
distinct values, stats could be extended to report it on both side.
A recent issue was uncovered by the CI which started to randomly report
segfaults on a few tests, and more systematically on FreeBSD. It turn
out that it was introduced by recent commit 03816ccfa9 ("MAJOR: ring:
insert an intermediary ring_storage level"), which overlooked the munmap()
path of the sink and startup logs: once the ring and its storage were
split, it was no longer correct to munmap() the ring, only its storage
area needs to be unmapped, and the ring must always be freed separately.
Thanks to Christopher and William for their help at trying to reproduce
it and figure the circumstances that triggers it.
No backport is needed.
It was only used to protect the list which is now an mt_list so it
doesn't provide any required protection anymore. It obviously also
used to provide strict ordering between the writer and the reader
when the writer started to update the messages, but that's now
covered by the oredered tail updates and updates to the readers
count to protect the area.
The message rate on small thread counts (up to 12) saw a boost of
roughly 5% while on large counts while for large counts it lost
about 2% due to some contention now becoming visible elsewhere.
Typical measures are 6.13M -> 6.61M at 3C6T, and 1.88 -> 1.92M at
24C48T on the EPYC.
Rings are keeping a lock only for the list, which apparently doesn't
need anything more than an mt_list, so let's first turn it into that
before dropping the lock. There should be no visible effect.
We're now locking the tail while looking for some room in the ring. In
fact it's still while writing to it, but the goal definitely is to get
rid of the lock ASAP. For this we reserve the topmost bit of the tail
as a lock, which may have as a possible visible effect that buffers will
be limited to 2GB instead of 4GB on 32-bit machines (though in practise,
good luck for allocating more than 2GB contiguous on 32-bit), but in
practice since the size is read with atol() and some operating systems
limit it to LONG_MAX unless passing negative numbers, the limit is
already there.
For now the impact on x86_64 is significant (drop from 2.35 to 1.4M/s
on 48 threads on EPYC 24 cores) but this situation is only temporary
so that changes can be reviewable and bisectable.
Other approaches were attempted, such as using XCHG instead, which is
slightly faster on x86 with low thread counts (but causes more write
contention), and forces readers to stall under heavy traffic because
they can't access a valid value for the queue anymore. A CAS requires
preloading the value and is les good on ARMv8.1. XADD could also be
considered with 12-13 upper bits of the offset dedicated to locking,
but that looks overkill.
The purpose is to store a head and a tail that are independent so that
we can further improve the API to update them independently from each
other.
The struct was arranged like the original one so that as long as a ring
has its head set to zero (i.e. no recycling) it will continue to work.
The new format is already detectable thanks to the "rsvd" field which
indicates the number of reserved bytes at the beginning. It's located
where the buffer's area pointer previously was, so that older versions
of haring can continue to open the ring in repair mode, and newer ones
can use the fact that the upper bits of that variable are zero to guess
that it's working with the new format instead of the old one. Also let's
keep in mind that the layout will further change to place some alignment
constraints.
The haring tool will thus updated based on this and it detects that the
rsvd field is smaller than a page and that the sum of it with the size
equals the mapped size, in which case it uses the new dump_v2() function
instead of dump_v1(). The new function also creates a buffer from the
ring's area, size, head and tail and calls the generic one so that no
other code had to be adapted.
In ring_resize() we used to check if the new ring was at least as large
as the previous one before resizing it, but what counts is that it's as
large as the previous one's contents. Initially it was thought this
would not really matter, but given that rings are initially created as
BUFSIZE, it's currently not possible to shrink them for debugging
purposes. Now with this change it is.
We'll need to add more complex structures in the ring, such as wait
queues. That's far too much to be stored into the area in case of
file-backed contents, so let's split the ring definition and its
storage once for all.
This patch introduces a struct ring_storage which is assigned to
ring->storage, which contains minimal information to represent the
storage layout, i.e. for now only the buffer, and all the rest
remains in the ring itself. The storage is appended immediately after
it and the buffer's pointer always points to that area. It has the
benefit of remaining 100% compatible with the existing file-backed
layout. In memory, the allocation loses the size of a struct buffer.
It's not even certain it's worth placing the size there, given that it's
constant and that a dump of a ring wouldn't really need it (the file size
is sufficient). But for now everything comes with the struct buffer, and
later this will change once split into head and tail. Also this area may
be completed with more information in the future (e.g. storage version,
format, endianness, word size etc).
Some open-coded constructs were updated to make use of the ring accessors
instead. This allows to remove some direct dependencies on the buffers
API a bit more.
This code becomes even simpler and almost does not need any knowledge
of the structure of the ring anymore. It even highlighted that an old
race had not been fixed due to code duplication, but that's now done.
The rink reader code was duplicated as-is in 2.2 for the ring forwarding
code in commits 494c505703 ("MEDIUM: ring: add server statement to forward
messages from a ring") and 975564784f ("MEDIUM: ring: add new srv statement
to support octet counting forward") (which only differs by using a prefix
instead of a suffix to delimit messages).
Unfortunately, that makes it almost impossible to rework the core ring
code because all these parts rely on it. This first commit aims at
restoring a common structure for the core loop by just calling a distinct
function based on the use case. The functions are either
applet_append_line() when a whole line is to be emitted followed by an LF
character, or syslog_applet_appent_event() when trying to send a TCP
syslog line prepended with its size in decimal.
There is no functional change beyond this.
The sink lock was made to prevent event producers from passing while
there were other threads trying to print a "dropped" message, in order
to guarantee the absence of reordering. It has a serious impact however,
which is that all threads need to take the read lock when producing a
regular trace even when there's no reader.
This patch takes a different approach. The drop counter is shifted left
by one so that the lowest bit is used to indicate that one thread is
already taking care of trying to dump the counter. Threads only read
this value normally, and will only try to change it if it's non-null,
in which case they'll first check if they are the first ones trying to
dump it, otherwise will simply count another drop and leave. This has
a large benefit. First, it will avoid the locking that causes stalls
as soon as a slow reader is present. Second, it avoids any write on the
fast path as long as there's no drop. And it remains very lightweight
since we just need to add +2 or subtract 2*dropped in operations, while
offering the guarantee that the sink_write() has succeeded before
unlocking the counter.
While a reader was previously limiting the traffic to 11k RPS under
4C/8T, now we reach 36k RPS vs 14k with no reader, so readers will no
longer slow the traffic down and will instead even speed it up due to
avoiding the contention down the chain in the ring. The locking cost
dropped from ~75% to ~60% now (it's in ring_write now).
When a reader doesn't read fast enough and causes drops, subsequent
threads try to produce a "dropped" message. But it takes time to
produce and emit this message, in part due to the use of chunk_printf()
that relies on vfprintf() which has to parse the printf format, and
during this time other threads may continue to increment the counter.
This is the reason why this is currently performed in a loop. When
reading what is received, it's common to see a large count followed
by one or two single-digit counts, indicating that we could possibly
have improved that by writing faster.
Let's improve the situation a little bit. First we're now using a
static message prefixed with enough space to write the digits, and a
call to ultoa_r() fills these digits from right to left so that we
don't have to process a format string nor perform a copy of the message.
Second, we now re-check the counter immediately after having prepared
the message so that we still get an opportunity for updating it. In
order to avoid too long loops, this is limited to 10 iterations.
Tests show that the number of single-digit "dropped" counters on output
now dropped roughly by 15-30%. Also, it was observed that with 8 threads,
there's almost never more than one retry.
That's exactly the same as commit 53bfab080c ("BUG/MINOR: sink: fix a race
condition between the writer and the reader") that went into 2.7 and was
backported as far as 2.4, except that since the code was duplicated, the
second instance was not noticed, leaving the race present. The race has
a limited impact, if a forwarder reaches the end of the logs and a new
message arrives before it leaves, the forwarder will only wake up after
yet another new message will be sent. In practice it remains unnoticeable
because for the race to trigger, one needs to have a steady flow of logs,
which means the wakeup will happen anyway.
This should be backported, but no need to insist on it if it resists.
These both flags are set after releasing the applet, in
appctx_shut(). Concretly, it means the applet is shutdown for reads and
writes. Once set, the applet's I/O handler was no longer called. Tests on
these flags are useless. There is no chance to match them.
Since 04276f3d ("MEDIUM: server: split the address and the port into two
different fields") we should not use srv->addr to store server's port
and rely on srv->svc_port instead.
For sink servers, we correctly set >svc_port upon server creation but
we didn't use it when initializing address for the connection.
As a result, FQDN resolution will not work properly with sink servers.
Hopefully, this used to work by accident because sink servers were
resolved using the PA_O_RESOLVE flag in str2sa_range(), which made the
srv->addr contain the port in addition to the address.
But this will fail to work when FQDN resolution is postponed because only
->svc_port will contain the proper server port upon resolution.
For instance, FQDN resolution with servers from log backends (which are
resolved as regular servers, that is, without the PA_O_RESOLVE) will fail
to work because of this.
This may be backported as far as 2.2 even though the bug didn't have
noticeable effects for versions below 2.9
[In 2.2, sink_forward_session_init() didn't exist it should be applied in
sink_forward_session_create()]
"log-bufsize" may now be used for a log server (in a log backend) to
configure the bufsize of implicit ring associated to the server (which
defaults to BUFSIZE).
This helper function can be used to create a new sink from an existing
server struct (and thus existing proxy as well), in order to spare some
resources when possible.
implicit rings were automatically forced to the parent logger format, but
this was done upon ring creation.
This is quite restrictive because we might want to choose the desired
format right before generating the log header (ie: when producing the
log message), depending on the logger (log directive) that is
responsible for the log message, and with current logic this is not
possible. (To this day, we still have dedicated implicit ring per log
directive, but this might change)
In ring_write(), we check if the sink->fmt is specified:
- defined: we use it since it is the most precise format
(ie: for named rings)
- undefined: then we fallback to the format from the logger
With this change, implicit rings' format is now set to UNSPEC upon
creation. This is safe because the log header building function
automatically enforces the "raw" format when UNSPEC is set. And since
logger->format also defaults to "raw", no change of default behavior
should be expected.
Introduce log_header struct to easily pass log header data between
functions and use that to simplify the logic around log header
handling.
While at it, some outdated comments were updated as well.
No change in behavior should be expected.
Since a5b325f92 ("MINOR: protocol: add a real family for existing FDs"),
we don't rely anymore on AF_UNSPEC for buffer rings in do_send_log.
But we kept it as a parsing hint to differentiate between implicit and
named rings during ring buffer postparsing.
However it is still a bit confusing and forces us to systematically rely
on target->addr, even for named buffer rings where it doesn't make much
sense anymore.
Now that target->addr was made a pointer in a recent commit, we can
choose not to initialize it when not needed (i.e.: named rings) and use
this as a hint to distinguish implicit rings during init since they rely
on the addr struct to temporarily store the ring's address until the ring
is actually created during postparsing step.
log targets were immediately embedded in logger struct (previously
named logsrv) and could not be used outside of this context.
In this patch, we're introducing log_target type with the associated
helper functions so that it becomes possible to declare and use log
targets outside of loggers scope.
When 'log' directive was implemented, the internal representation was
named 'struct logsrv', because the 'log' directive would directly point
to the log target, which used to be a (UDP) log server exclusively at
that time, hence the name.
But things have become more complex, since today 'log' directive can point
to ring targets (implicit, or named) for example.
Indeed, a 'log' directive does no longer reference the "final" server to
which the log will be sent, but instead it describes which log API and
parameters to use for transporting the log messages to the proper log
destination.
So now the term 'logsrv' is rather confusing and prevents us from
introducing a new level of abstraction because they would be mixed
with logsrv.
So in order to better designate this 'log' directive, and make it more
generic, we chose the word 'logger' which now replaces logsrv everywhere
it was used in the code (including related comments).
This is internal rewording, so no functional change should be expected
on user-side.
now forward_px only serves as a hint to know if a proxy was created
specifically for the sink, in which case the sink is responsible for it.
Everywhere forward_px was used in appctx context: get the parent proxy from
the sft->srv instead.
This permits to finally get rid of the double link dependency between sink
and proxy.
Removing unnecessary dependency on proxy->parent pointer in
sink appctx functions by directly using the sink sft from the
applet->svcctx to get back to sink related structs.
Thanks to this, proxy used for a ringbuf does not have to be exclusive
to a single sink anymore.
It's useless to check if sink has been created with BUF type after
calling sink_new_buf() since the goal of the function is to create
a new sink of BUF type.
Fixing some typos that have been overlooked during the recent log/sink
API improvements. Using this patch to make sink_new_from_logsrv() static
since it is not used outside of sink.c
The ring lock was initially mostly used for the logs and used to inherit
its name in lock stats. Now that it's exclusively used by rings, let's
rename it accordingly.
To further clean the code and remove duplication, some sink postparsing
and sink->sft finalization is now performed in a dedicated function
named sink_finalize().
In this patch we move sink freeing logic outside of sink_deinit() function
in order to create the sink_free() helper function that could be used
on error paths for example.
We previously had postparsing logic but only for logsrv sinks, but now we
need to make this operation on logsrv directly instead of sinks to prepare
for additional postparsing logic that is not sink-specific.
To do this, we migrated post_sink_resolve() and sink_postresolve_logsrvs()
to their postresolve_logsrvs() and postresolve_logsrv_list() equivalents.
Then, we split postresolve_logsrv_list() so that the sink-only logic stays
in sink.c (sink_resolve_logsrv_buffer() function), and the "generic"
target part stays in log.c as resolve_logsrv().
Error messages formatting was preserved as far as possible but some slight
variations are to be expected.
As for the functional aspect, no change should be expected.
maxlen now defaults ~0 (instead of BUFSIZE) to make sure no implicit
truncation will be performed when the option is not specified, since the
doc doesn't mention any default value for maxlen. As such, if the payload
is too big, it will be dropped (this is the default expected behavior).
Consider the following example:
|log ring@test-ring len 2000 local0
|
|ring test-ring
| maxlen 1000
This would result in emitted logs being silently truncated to 1000 because
test-ring maxlen is smaller than the log directive maxlen.
In this patch we're adding an extra check in post_sink_resolve() to detect
this kind of confusing setups and warn the user about the implicit
truncation when DIAG mode is on.
This commit depends on:
- "MINOR: sink: simplify post_sink_resolve function"
To prevent logs from being silently (and unexpectly droppped) at runtime,
we check that the maxlen parameter from the log directives are
strictly inferior to the targeted ring size.
|global
| tune.bufsize 16384
| log tcp@127.0.0.1:514 len 32768
| log myring@127.0.0.1:514 len 32768
|ring myring
| # no explicit size
On such configs, a diag warning will be reported.
This commit depends on:
- "MINOR: sink: simplify post_sink_resolve function"
- "MINOR: ring: add a function to compute max ring payload"
When user specifies a maxlen parameter that is greater than the size of
a given ring section, a warning is emitted to inform that the max length
exceeds size, and then the maxlen is forced to size.
The logic is good, but imprecise, because it doesn't take into account
the slight overhead from storing payloads into the ring.
In practise, we cannot store a single message which is exactly the same
length than size. Doing so will result in the message being dropped at
runtime.
Thanks to the ring_max_payload() function introduced in "MINOR: ring: add
a function to compute max ring payload", we can now deduce the maximum
value for the maxlen parameter before it could result in messages being
dropped.
When maxlen value is set to an improper value, the warning will be emitted
and maxlen will be forced to the maximum "single" payload len that could
fit in the ring buffer, preventing messages from being dropped
unexpectedly.
This commit depends on:
- "MINOR: ring: add a function to compute max ring payload"
This may be backported as far as 2.2
When errors are encountered in sink_new_from_logsrv() function,
incompetely allocated ressources are freed to prevent memory leaks.
For instance: logsrv implicit server is manually cleaned up on error prior
to returning from the function.
However, since 198e92a8e5 ("MINOR: server: add a global list of all known
servers") every server created using new_server() is registered to the
global list, but unfortunately the manual srv cleanup in
sink_new_from_logsrv() doesn't remove the srv from the global list, so the
freed server will still be referenced there, which can result in invalid
reads later.
Moreover, server API has evolved since, and now the srv_drop() function is
available for that purpose, so let's use it, but make sure that srv is
freed before the proxy because on older versions srv_drop() expects the
srv to be linked to a valid proxy pointer.
This must be backported up to 2.4.
[For 2.4 version, free_server() must be used instead of srv_drop()]
Multiple error paths (memory,IO related) in cfg_post_parse_ring() were
not implemented correcly and could result in memory leak or undefined
behavior.
Fixing them all at once.
This can be backported in 2.4
sft freeing attempt made in a575421 ("BUG/MINOR: sink: missing sft free in
sink_deinit()") is incomplete, because sink->sft is meant to be used as a
list and not a single sft entry.
Because of that, the previous fix only frees the first sft entry, which
fixes memory leaks for single-server forwarders (this is the case for
implicit rings), but could still result in memory leaks when multiple
servers are configured in a explicit ring sections.
What this patch does: instead of directly freeing sink->sft, it iterates
over every list members to free them.
It must be backported up to 2.4 with a575421.
sink_write() currently relies on sink->maxlen to know when to stop
writing a given payload.
But it could be useful to pass a smaller, explicit value to sink_write()
to stop before the ring maxlen, for instance if the ring is shared between
multiple feeders.
sink_write() now takes an optional maxlen parameter:
if maxlen is > 0, then sink_write will stop writing at maxlen if maxlen
is smaller than ring->maxlen, else only ring->maxlen will be considered.
[for haproxy <= 2.7, patch must be applied by hand: that is:
__sink_write() and sink_write() should be patched to take maxlen into
account and function calls to sink_write() should use 0 as second argument
to keep original behavior]
When maxlen parameter exceeds sink size, a warning is generated and maxlen
is enforced to sink size. But the err_code is incorrectly set to ERR_ALERT
Indeed, being a "warning", ERR_WARN should be used here.
This may be backported as far as 2.2