Thanks to all previous changes, it is now possible to move the
stream-interface into the conn-stream. To do so, some SI functions are
removed and their conn-stream counterparts are added. In addition, the
conn-stream is now responsible for creating and releasing the
stream-interface. While the stream-interfaces were inlined in the stream
structure, there is now a pointer in the conn-stream. stream-interfaces are
now dynamically allocated. Thus a dedicated pool is added. It is a temporary
change because, at the end, the stream-interface structure will most
probably disappear.
To be able to move the stream-interface from the stream to the conn-stream,
all access to the SI is done via the conn-stream. This patch is limited to
the dns part.
In the same way the conn-stream has a pointer to the stream endpoint, this
patch adds a pointer to the application entity in the conn-stream
structure. For now, it is a stream or a health-check. It is mandatory to
merge the stream-interface with the conn-stream.
Because appctx is now an endpoint of the conn-stream, there is no reason to
still have the stream-interface as appctx owner. Thus, the conn-stream is
now the appctx owner.
Thanks to previous changes, it is now possible to set an appctx as endpoint
for a conn-stream. This means the appctx is no longer linked to the
stream-interface but to the conn-stream. Thus, a pointer to the conn-stream
is explicitly stored in the stream-interface. The endpoint (connection or
appctx) can be retrieved via the conn-stream.
The crash that was fixed by commit 7045590d8 ("BUG/MAJOR: dns: attempt
to lock globaly for msg waiter list instead of use barrier") has now been
completely analysed and confirmed to be partially a result of the
debugging code added to LIST_INLIST(), which was looking at both
pointers and their reciprocals, and that, if used in a concurrent
context, could perfectly return false if a neighbor was being added or
removed while the current one didn't change, allowing the LIST_APPEND
to fail.
As the LIST API was not designed to be used in a concurrent context,
we should not rely on LIST_INLIST() but on the newly introduced
LIST_INLIST_ATOMIC().
This patch simply reverts the commit above to switch to the new test,
saving a lock during potentially long operations. It was verified that
the check doesn't fail anymore.
It is unclear what the performance impact of the fix above could be in
some contexts. If any performance regression is observed, it could make
sense to backport this patch, along with the previous commit introducing
the LIST_INLIST_ATOMIC() macro.
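The difference between the two kinds of checks can be illustrated with a
minimal sketch. It assumes a simplified doubly-linked element built on C11
atomics; struct elem and the elem_*() helpers are hypothetical names for
illustration and are not the actual LIST_INLIST() or LIST_INLIST_ATOMIC()
implementations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal list element, not HAProxy's struct list. */
struct elem {
    _Atomic(struct elem *) n;   /* next element */
    _Atomic(struct elem *) p;   /* previous element */
};

/* A detached element points to itself in both directions. */
static void elem_init(struct elem *e)
{
    atomic_store(&e->n, e);
    atomic_store(&e->p, e);
}

/* Strict check: also verifies that both neighbours point back to <e>.
 * If a neighbour is being added or removed concurrently, this may
 * transiently return false even though <e> itself did not change.
 */
static bool elem_inlist_strict(struct elem *e)
{
    struct elem *n = atomic_load(&e->n);
    struct elem *p = atomic_load(&e->p);

    return n != e && atomic_load(&n->p) == e && atomic_load(&p->n) == e;
}

/* Relaxed check: a single atomic load of the element's own pointer. */
static bool elem_inlist_atomic(struct elem *e)
{
    return atomic_load(&e->n) != e;
}

/* Append at the tail. Not thread-safe by itself; it only shows that a
 * wrong "not in list" answer would lead to a second, invalid append.
 */
static void elem_append(struct elem *head, struct elem *e)
{
    struct elem *last = atomic_load(&head->p);

    assert(!elem_inlist_atomic(e));
    atomic_store(&e->n, head);
    atomic_store(&e->p, last);
    atomic_store(&last->n, e);
    atomic_store(&head->p, e);
}
```

In this sketch only the strict variant depends on the neighbours' state,
which is why it is unsuitable as a membership test while other threads
touch the list.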
A few places have been caught triggering late bugs recently, always cases
of use-after-free because a freed element was still found in one of the
lists. This patch adds a few checks for such elements in dns_session_free()
before the final pool_free() and dns_session_io_handler() before adding
elements to lists to make sure they remain consistent. They do not trigger
anymore now.
When dns_session_release() calls dns_session_free(), it was shown that
the appctx might still be attached to the session's ring:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000006437d7 in dns_session_free (ds=0x7f895439e810) at src/dns.c:768
768 BUG_ON(!LIST_ISEMPTY(&ds->ring.waiters));
[Current thread is 1 (Thread 0x7f895bbe2700 (LWP 31792))]
(gdb) bt
#0 0x00000000006437d7 in dns_session_free (ds=0x7f895439e810) at src/dns.c:768
#1 0x0000000000643ab8 in dns_session_release (appctx=0x7f89545a4ff0) at src/dns.c:805
#2 0x000000000062e35a in si_applet_release (si=0x7f89545a5550) at include/haproxy/stream_interface.h:236
#3 0x000000000063150f in stream_int_shutw_applet (si=0x7f89545a5550) at src/stream_interface.c:1697
#4 0x0000000000640ab8 in si_shutw (si=0x7f89545a5550) at include/haproxy/stream_interface.h:437
#5 0x0000000000643103 in dns_session_io_handler (appctx=0x7f89545a4ff0) at src/dns.c:725
#6 0x00000000006d776f in task_run_applet (t=0x7f89545a5100, context=0x7f89545a4ff0, state=81924) at src/applet.c:90
#7 0x000000000068b82b in run_tasks_from_lists (budgets=0x7f895bbbf5c0) at src/task.c:611
#8 0x000000000068c258 in process_runnable_tasks () at src/task.c:850
#9 0x0000000000621e61 in run_poll_loop () at src/haproxy.c:2636
#10 0x0000000000622328 in run_thread_poll_loop (data=0x8d7440 <ha_thread_info+64>) at src/haproxy.c:2807
#11 0x00007f895c54a06b in start_thread () from /lib64/libpthread.so.0
#12 0x00007f895bf3772f in clone () from /lib64/libc.so.6
(gdb) p &ds->ring.waiters
$1 = (struct list *) 0x7f895439e8a8
(gdb) p ds->ring.waiters
$2 = {
n = 0x7f89545a5078,
p = 0x7f89545a5078
}
(gdb) p ds->ring.waiters->n
$3 = (struct list *) 0x7f89545a5078
(gdb) p *ds->ring.waiters->n
$4 = {
n = 0x7f895439e8a8,
p = 0x7f895439e8a8
}
Let's always detach it before freeing so that it remains possible to
check the dns_session's ring before releasing it, and possibly catch
bugs.
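The "detach before free" ordering can be sketched as follows. This is a
simplified illustration assuming a plain circular list; the sk_* names and
both structures are hypothetical and do not reflect the real dns_session
or ring layout.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical circular doubly-linked list. */
struct sk_list { struct sk_list *n, *p; };

static void sk_list_init(struct sk_list *l) { l->n = l->p = l; }
static int sk_list_isempty(const struct sk_list *l) { return l->n == l; }

static void sk_list_append(struct sk_list *head, struct sk_list *el)
{
    el->p = head->p;
    el->n = head;
    head->p->n = el;
    head->p = el;
}

/* Unlink <el> and reinitialize it so it is safe to test or unlink again. */
static void sk_list_del_init(struct sk_list *el)
{
    el->n->p = el->p;
    el->p->n = el->n;
    sk_list_init(el);
}

struct sk_session { struct sk_list waiters; };   /* stands in for the ring */
struct sk_waiter  { struct sk_list entry; };     /* stands in for the appctx */

/* The final free can now verify that nobody is still attached. */
static void sk_session_free(struct sk_session *s)
{
    assert(sk_list_isempty(&s->waiters));
    free(s);
}

/* Release path: always detach the waiter first, then free. */
static void sk_session_release(struct sk_session *s, struct sk_waiter *w)
{
    sk_list_del_init(&w->entry);
    sk_session_free(s);
}
```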
The barrier is insufficient here to protect the waiters list as we can
definitely catch situations where ds->waiter shows an inconsistency
whereby the element is not attached when entering the "if" block and
is already attached when attaching it later.
This patch uses a larger lock to maintain consistency. Without it the
code would crash in 30-180 minutes under heavy stress, always showing
the same problem (ds->waiter->n->p != &ds->waiter). Now it seems to
always resist, suggesting that this was indeed the problem.
This will have to be backported to 2.4.
Using TCP, after a session release and free, the session can remain
attached to the list of sessions with a response message waiting for
a commit (ds->waiter). This results in a use-after-free of this
session.
Also, on some error paths and after a free, a session could remain attached
to the lists of available idle/free sessions (ds->list).
This patch ensures the session is removed from those external lists
before a free.
This patch should be backported to all versions including
DNS over TCP (2.4).
We'll need to improve the API to pass other arguments in the future, so
let's start to adapt better to the current use cases. task_new() is used:
- 18 times as task_new(tid_bit)
- 18 times as task_new(MAX_THREADS_MASK)
- 2 times with a single bit (in a loop)
- 1 in the debug code that uses a mask
This patch provides 3 new functions to achieve this:
- task_new_here() to create a task on the calling thread
- task_new_anywhere() to create a task to be run anywhere
- task_new_on() to create a task to run on a specific thread
The change is trivial and will allow us to later concentrate the
required adaptations to these 3 functions only. It's still possible
to call task_new() if needed but a comment was added to encourage the
use of the new ones instead. The debug code was not changed and still
uses it.
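As an illustration, the three helpers could be thin wrappers around the
mask-based call. The sketch below is only an assumption about their shape:
the task structure, the 64-bit mask, sk_current_tid and the stub
sk_task_new() are hypothetical stand-ins, not the real scheduler code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SK_MAX_THREADS 64            /* assumption: one bit per thread */
#define SK_ALL_THREADS UINT64_MAX    /* assumption: "any thread" mask */

struct sk_task { uint64_t thread_mask; };

/* Stand-in for the calling thread's ID. */
static unsigned int sk_current_tid = 0;

/* Stand-in for the original mask-based allocator. */
static struct sk_task *sk_task_new(uint64_t mask)
{
    struct sk_task *t = calloc(1, sizeof(*t));

    if (t)
        t->thread_mask = mask;
    return t;
}

/* Create a task bound to the calling thread. */
static struct sk_task *sk_task_new_here(void)
{
    return sk_task_new(UINT64_C(1) << sk_current_tid);
}

/* Create a task that may run on any thread. */
static struct sk_task *sk_task_new_anywhere(void)
{
    return sk_task_new(SK_ALL_THREADS);
}

/* Create a task bound to thread <thr>. */
static struct sk_task *sk_task_new_on(unsigned int thr)
{
    assert(thr < SK_MAX_THREADS);
    return sk_task_new(UINT64_C(1) << thr);
}
```

With such wrappers, callers no longer need to know how the thread affinity
is encoded, which is what makes later API changes local to these functions.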
appctx_new() is exclusively called with tid_bit and it only uses the
mask to pass it to the accompanying task. There is no point requiring
the caller to know about a mask there, nor is there any point in
creating an applet outside of the context of its own thread anyway.
Let's drop this and pass tid_bit to task_new() directly.
Some of the Lua doc and a few places still used "Haproxy" or "HAproxy".
There was even one "HA proxy". A few of them were in an example of VTest
output, indicating that VTest ought to be fixed as well. No big deal but
better address all the remaining ones so that these inconsistencies stop
spreading around.
The current "ADD" vs "ADDQ" is confusing because when thinking in terms
of appending at the end of a list, "ADD" naturally comes to mind, but
here it does the opposite, it inserts. Several times already it's been
incorrectly used where ADDQ was expected, the latest of which was a
fortunate accident explained in 6fa922562 ("CLEANUP: stream: explain
why we queue the stream at the head of the server list").
Let's use more explicit (but slightly longer) names now:
LIST_ADD -> LIST_INSERT
LIST_ADDQ -> LIST_APPEND
LIST_ADDED -> LIST_INLIST
LIST_DEL -> LIST_DELETE
The same is true for MT_LISTs, including their "TRY" variant.
LIST_DEL_INIT keeps its short name to encourage its use instead of the
lazier LIST_DELETE which is often less safe.
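The semantic difference between inserting and appending can be shown with
a minimal sketch. It assumes a plain circular doubly-linked list; the
sk_list_*() functions are hypothetical and only mirror what the renamed
macros are described as doing.

```c
#include <assert.h>

/* Hypothetical circular doubly-linked list. */
struct sk_list { struct sk_list *n, *p; };

static void sk_list_init(struct sk_list *l) { l->n = l->p = l; }

/* "Insert" (formerly ADD): place <el> right after the head, i.e. at the
 * beginning of the list.
 */
static void sk_list_insert(struct sk_list *head, struct sk_list *el)
{
    assert(head != el);
    el->n = head->n;
    el->p = head;
    head->n->p = el;
    head->n = el;
}

/* "Append" (formerly ADDQ): place <el> right before the head, i.e. at the
 * end of the list.
 */
static void sk_list_append(struct sk_list *head, struct sk_list *el)
{
    assert(head != el);
    el->p = head->p;
    el->n = head;
    head->p->n = el;
    head->p = el;
}
```

Appending a, then b, then inserting c yields the order c, a, b when
walking from the head.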
The change is large (~674 non-comment entries) but is mechanical enough
to remain safe. No permutation was performed, so any out-of-tree code
can easily map older names to new ones.
The list doc was updated.
This patch replaces roughly all occurrences of an HA_ATOMIC_ADD(&foo, 1)
or HA_ATOMIC_SUB(&foo, 1) with the equivalent HA_ATOMIC_INC(&foo) and
HA_ATOMIC_DEC(&foo) respectively. These are 507 changes over 45 files.
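The equivalence can be sketched with C11 atomics. The SK_ATOMIC_* macros
and the session counter helpers below are hypothetical; they are not the
HA_ATOMIC_* definitions, which may rely on different primitives.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical equivalents built on C11 atomics. */
#define SK_ATOMIC_ADD(ptr, v) atomic_fetch_add((ptr), (v))
#define SK_ATOMIC_SUB(ptr, v) atomic_fetch_sub((ptr), (v))

/* Shorthands for the very common "by one" case. */
#define SK_ATOMIC_INC(ptr)    SK_ATOMIC_ADD((ptr), 1)
#define SK_ATOMIC_DEC(ptr)    SK_ATOMIC_SUB((ptr), 1)

/* Example user: a counter of active sessions. */
static void sk_session_get(atomic_uint *counter)
{
    SK_ATOMIC_INC(counter);
}

static void sk_session_put(atomic_uint *counter)
{
    assert(atomic_load(counter) > 0);
    SK_ATOMIC_DEC(counter);
}
```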
For a long time we've had fdtab[].ev and fdtab[].state which contain two
arbitrary sets of information, one is mostly the configuration plus some
shutdown reports and the other one is the latest polling status report
which also contains some sticky error and shutdown reports.
These ones used to be stored into distinct chars, complicating certain
operations and not even allowing to clearly see concurrent accesses (e.g.
fd_delete_orphan() would set the state to zero while fd_insert() would
only set the event to zero).
This patch creates a single uint with the two sets in it, still delimited
at the byte level for better readability. The original FD_EV_* values
remained at the lowest bit levels as they are also known by their bit
value. The next step will consist in merging the remaining bits into it.
The whole bits are now cleared both in fd_insert() and _fd_delete_orphan()
because after a complete check, it is certain that in both cases these
functions are the only ones touching these areas. Indeed, for
_fd_delete_orphan(), the thread_mask has already been zeroed before a
poller can call fd_update_event() which would touch the state, so it
is certain that _fd_delete_orphan() is alone. Regarding fd_insert(),
only one thread will get an FD at any moment, and as this FD has
already been released by _fd_delete_orphan() by definition it is certain
that previous users have definitely stopped touching it.
Strictly speaking there's no need for clearing the state again in
fd_insert() but it's cheap and will remove some doubts during some
troubleshooting sessions.
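A minimal sketch of the idea, two byte-delimited sets sharing a single
atomic word that can be cleared in one operation, is given below. The
masks, the sk_fd structure and the helpers are assumptions made for
illustration; they are not the real fdtab layout or the FD_EV_* values.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: configuration-like flags in the low byte, latest
 * polling report in the next byte.
 */
#define SK_CFG_MASK   0x000000ffu
#define SK_POLL_MASK  0x0000ff00u
#define SK_POLL_SHIFT 8

static_assert((SK_CFG_MASK & SK_POLL_MASK) == 0, "sets must not overlap");

struct sk_fd { _Atomic uint32_t state; };

/* Both sets are cleared by a single store. */
static void sk_fd_reset(struct sk_fd *fd)
{
    atomic_store(&fd->state, 0);
}

/* Set configuration bits without touching the polling byte. */
static void sk_fd_set_cfg(struct sk_fd *fd, uint8_t cfg)
{
    atomic_fetch_or(&fd->state, (uint32_t)cfg);
}

/* Replace the polling byte while preserving the configuration byte. */
static void sk_fd_report_poll(struct sk_fd *fd, uint8_t poll)
{
    uint32_t old = atomic_load(&fd->state);
    uint32_t upd;

    do {
        upd = (old & ~SK_POLL_MASK) | ((uint32_t)poll << SK_POLL_SHIFT);
    } while (!atomic_compare_exchange_weak(&fd->state, &old, upd));
}

static uint8_t sk_fd_cfg(struct sk_fd *fd)
{
    return (uint8_t)(atomic_load(&fd->state) & SK_CFG_MASK);
}

static uint8_t sk_fd_poll(struct sk_fd *fd)
{
    return (uint8_t)((atomic_load(&fd->state) & SK_POLL_MASK) >> SK_POLL_SHIFT);
}
```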
It's been too short for quite a while now and is now full. It's still
time to extend it to 32-bits since we have room for this without
wasting any space, so we now gained 16 new bits for future flags.
The values were not reassigned just in case there would be a few
hidden u16 or short somewhere in which these flags are placed (as
it used to be the case with stream->pending_events).
The patch is tagged MEDIUM because this required updating the task's
process() prototype to use an int instead of a short, which affects quite
a bunch of places.
When dns_connect_nameserver() is called, the nameserver always has a dgram
field properly defined. The caller, dns_send_nameserver(), already performed
the appropriate verification.
When a DNS session is created, the call to ring_attach() never fails. The
ring is freshly initialized and there is no other watcher on it. Thus, the
call always succeeds.
Instead of catching an error that must never happen, we use the DISGUISE()
macro to make static analyzers happy.
This makes the code more readable and less prone to copy-paste errors.
In addition, it allows placing some __builtin_constant_p() predicates
to trigger a link-time error in case the compiler knows that the freed
area is constant. It will also produce a compile-time error when trying to
free something that is not a regular pointer (e.g. a function).
The DEBUG_MEM_STATS macro now also defines an instance for ha_free()
so that all these calls can be checked.
178 occurrences were converted. The vast majority of them were handled
by the following Coccinelle script, some slightly refined to better deal
with "&*x" or with long lines:
@ rule @
expression E;
@@
- free(E);
- E = NULL;
+ ha_free(&E);
It was verified that the resulting code is the same, give or take a
handful of cases where the compiler optimized slightly differently
the temporary variable that holds the copy of the pointer.
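For illustration, a free-and-reset helper in the spirit of ha_free() can
be sketched as below. This is not the actual macro: sk_free() is a
hypothetical name, it relies on the GNU __typeof__ extension, and it omits
the __builtin_constant_p() and DEBUG_MEM_STATS parts mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Free the area pointed to by *<pptr> and reset the pointer to NULL.
 * <pptr> is evaluated only once (GNU __typeof__ extension assumed).
 */
#define sk_free(pptr) do {                  \
        __typeof__(pptr) sk_p_ = (pptr);    \
        free(*sk_p_);                       \
        *sk_p_ = NULL;                      \
    } while (0)

/* Example user: a buffer whose area must never be left dangling. */
struct sk_buf { char *area; size_t size; };

static void sk_buf_release(struct sk_buf *b)
{
    sk_free(&b->area);
    b->size = 0;
    assert(b->area == NULL);
}
```

Since free(NULL) is a no-op, calling the helper twice on the same pointer
is harmless, which is the property that makes the pattern less prone to
double-free mistakes.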
A non-negligible amount of {free(str);str=NULL;str_len=0;} are still
present in the config part (mostly header names in proxies). These
ones should also be cleaned for the same reasons, and probably be
turned into ist strings.
dns_session_release() only uses its struct dns_stream_server to access
the lock, so a warning is emitted when threads are disabled. Let's mark
it __maybe_unused.
It seems that fd_delete() performs the close of the file descriptor,
so we must not close the fd once again after that.
This should fix issues #1128, #1130 and #1131.
This patch fixes a case which should never happen when writing
to the output channel, since we check the available room beforehand.
This patch should fix github issue #1132.
This patch fixes the return code in case dns_connect_server is called
on an unsupported type (which should not happen). Doing this we have
the guarantee that after a return 0 the fd is never -1.
This patch should fix github issues #1127, #1128 and #1130.
This patch adds a missing test in dns_session_io_handler when getting
the query id from the buffer of the ring. An error should never
happen since messages are completely added atomically.
This patch should fix github issue #1133.
This patch introduces the "dns_stream_nameserver" to use DNS over
TCP on strict nameservers. For the upper layer it is analogous to
the API used with UDP nameservers except that the user switches
the nameserver to "stream" mode at init using "dns_stream_init".
The fallback from UDP to TCP is not handled and this is not the
purpose of this feature. This is done to choose the transport layer
during the initialization.
Currently there is a hardcoded limit of 4 pipelined transactions
per TCP connection. A batch of idle connections is expired every 5s.
This code is designed to support a maximum DNS message size on TCP: 64k.
Note: this code won't retry unanswered queries; this should be
handled by the upper layer.
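The 64k limit follows from the way DNS messages are carried over TCP:
each message is preceded by a two-byte length field in network byte order
(RFC 1035, section 4.2.2). The sketch below shows that framing only; the
sk_dns_tcp_*() helpers are hypothetical and unrelated to the session code
introduced by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_DNS_TCP_MAX_MSG 65535

static_assert(SK_DNS_TCP_MAX_MSG == UINT16_MAX, "length prefix is 16 bits");

/* Write <msg> into <out> preceded by its 2-byte big-endian length.
 * Returns the number of bytes written, or 0 if it does not fit.
 */
static size_t sk_dns_tcp_frame(uint8_t *out, size_t out_size,
                               const uint8_t *msg, size_t msg_len)
{
    if (msg_len > SK_DNS_TCP_MAX_MSG || out_size < msg_len + 2)
        return 0;
    out[0] = (uint8_t)(msg_len >> 8);
    out[1] = (uint8_t)(msg_len & 0xff);
    memcpy(out + 2, msg, msg_len);
    return msg_len + 2;
}

/* Returns the length of the first message if <in> holds a complete frame,
 * or -1 if more data is needed.
 */
static long sk_dns_tcp_peek(const uint8_t *in, size_t in_len)
{
    size_t len;

    if (in_len < 2)
        return -1;
    len = ((size_t)in[0] << 8) | in[1];
    return (in_len >= len + 2) ? (long)len : -1;
}
```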
This patch splits the current dns.c into two files:
The first, dns.c, contains code related to DNS message exchange over UDP
and, in the future, over TCP. We try to remove dependencies on resolving
to make it usable by other things such as DNS load balancing.
The new resolvers.c inherits the code specific to the actual
resolvers.
Note:
It was really difficult to obtain a clean diff due to the amount
of moved code.
Note2:
Counters and stuff related to stats are not cleanly separated because
currently counters for both layers are merged and hard to separate
for now.
This patch splits the recv and send functions into two layers. The
lowest is responsible for DNS message transactions over
the network. Doing this we could use the DNS message layer
for something other than resolving, load balancing for instance.
This patch also reworks the way to init a nameserver and
introduces the new struct dns_dgram_server to prepare the arrival
of dns_stream_server and the support of DNS over TCP.
The way to retry a send failure of a request because of EAGAIN
was reworked. Previously there was no control and all "pending"
queries were replayed each time it reached an EAGAIN. This
patch introduces a ring to stack messages in case of send
failure. This ring is emptied if the poller shows that the
socket is ready again to push messages.
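The queue-on-EAGAIN behaviour can be sketched with a small fixed-size
queue. This is only an illustration under simplifying assumptions: the
sk_* types, the slot count and the budget-based fake sender are all
hypothetical, and the real code uses a ring buffer rather than this
structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_RING_SLOTS 8   /* arbitrary size for the sketch */

struct sk_msg  { const void *data; size_t len; };
struct sk_ring { struct sk_msg slot[SK_RING_SLOTS]; size_t head, count; };

/* Returns true when the message was sent, false when it would block. */
typedef bool (*sk_send_fn)(const struct sk_msg *m, void *ctx);

/* Stand-in for a socket send: succeeds while *<ctx> (an int) is > 0. */
static bool sk_budget_send(const struct sk_msg *m, void *ctx)
{
    int *budget = ctx;

    (void)m;
    if (*budget <= 0)
        return false;   /* what EAGAIN would look like */
    (*budget)--;
    return true;
}

static bool sk_ring_push(struct sk_ring *r, struct sk_msg m)
{
    if (r->count == SK_RING_SLOTS)
        return false;
    r->slot[(r->head + r->count) % SK_RING_SLOTS] = m;
    r->count++;
    return true;
}

/* Send right away only if nothing is queued (to preserve ordering),
 * otherwise, or if the send would block, stack the message.
 */
static bool sk_send_or_queue(struct sk_ring *r, struct sk_msg m,
                             sk_send_fn send, void *ctx)
{
    if (r->count == 0 && send(&m, ctx))
        return true;
    return sk_ring_push(r, m);
}

/* Called when the poller reports the socket as writable again. */
static size_t sk_ring_flush(struct sk_ring *r, sk_send_fn send, void *ctx)
{
    size_t sent = 0;

    assert(r->count <= SK_RING_SLOTS);
    while (r->count && send(&r->slot[r->head], ctx)) {
        r->head = (r->head + 1) % SK_RING_SLOTS;
        r->count--;
        sent++;
    }
    return sent;
}
```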
Counters are currently stored into the low-level nameservers struct but
most of them are resolving layer data and are increased in the upper layer.
So this patch renames the prototypes used to allocate/dump them with the
'resolv' prefix, pending a clean split.
Some types are specific to the resolver code and are renamed using
the 'resolv' prefix instead of 'dns'.
-struct dns_query_item {
+struct resolv_query_item {
-struct dns_answer_item {
+struct resolv_answer_item {
-struct dns_response_packet {
+struct resolv_response {
Resolv callbacks are also updated to rely on counters and not on
nameservers.
"show stat domain dns" will now show the parent id (i.e. resolvers
section name).
When a server DNS resolution is performed, there is no reason to set an
unconfigured check port to the server port because, by default, if the
check port is not set, the server's one is used. Thus we can remove this
useless assignment. It is mandatory for next improvements.
V2 of this fix which includes a missing pointer initialization which was
causing a segfault in v1 (949a7f6459)
This bug happens when a service has multiple records on the same host
and the server provides the A/AAAA resolution in the response as AR
(Additional Records).
In such a condition, the first occurrence of the host will be taken from
the Additional section, while the second (and next ones) will be processed
by an independent resolution task (like we used to do before 2.2).
This can lead to a situation where the "synchronisation" of the
resolution may diverge, as described in github issue #971.
Because of this behavior, HAProxy mixes various types of requests to
resolve the full list of servers: SRV+AR for all "first" occurrences and
A/AAAA for all other occurrences of an existing hostname.
E.g., with the following type of response:
;; ANSWER SECTION:
_http._tcp.be2.tld. 3600 IN SRV 5 500 80 A2.tld.
_http._tcp.be2.tld. 3600 IN SRV 5 500 86 A3.tld.
_http._tcp.be2.tld. 3600 IN SRV 5 500 80 A1.tld.
_http._tcp.be2.tld. 3600 IN SRV 5 500 85 A3.tld.
;; ADDITIONAL SECTION:
A2.tld. 3600 IN A 192.168.0.2
A3.tld. 3600 IN A 192.168.0.3
A1.tld. 3600 IN A 192.168.0.1
A3.tld. 3600 IN A 192.168.0.3
the first A3 host is resolved using the Additional Section and the
second one through a dedicated A request.
When linking the SRV records to their respective Additional one, a
condition was missing (check whether said SRV record is already attached
to an Additional one), so the search stopped at the first SRV record whose
target field matched the Additional record name. Hence only the first
occurrence of a target was managed by an additional record.
This patch adds a condition in this loop to ensure the record being
parsed is not already linked to an Additional Record. If so, we can
carry on the parsing to find a possible next one with the same target
field value.
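The added condition can be illustrated with a simplified linking loop.
The structures and the function below are hypothetical and much simpler
than the real response parser; they only show why skipping already-linked
SRV records lets the second A3 entry find its own Additional Record.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified records. */
struct sk_ar  { const char *name; const char *addr; };
struct sk_srv { const char *target; unsigned int port; const struct sk_ar *ar; };

/* For each Additional Record, link it to the first SRV record with the
 * same target that is not linked yet. Returns the number of links made.
 */
static size_t sk_link_additional(struct sk_srv *srv, size_t nsrv,
                                 const struct sk_ar *ar, size_t nar)
{
    size_t linked = 0;

    assert(srv != NULL && ar != NULL);
    for (size_t i = 0; i < nar; i++) {
        for (size_t j = 0; j < nsrv; j++) {
            if (srv[j].ar != NULL)
                continue;   /* already linked: look for the next one */
            if (strcmp(srv[j].target, ar[i].name) != 0)
                continue;
            srv[j].ar = &ar[i];
            linked++;
            break;
        }
    }
    return linked;
}
```

Without the first test, the inner loop would stop on the already-linked
A3 entry each time and the second one would never receive an Additional
Record.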
backport status: 2.2 and above
This reverts commit 949a7f6459.
The first part of the patch introduces a bug. When a dns answer item is
allocated, its <ar_item> is only initialized at the end of the parsing, when
the item is added in the answer list. Thus, we must not try to release it
during the parsing.
The second part is also probably buggy. It fixes the issue #971 but reverts
a fix for the issue #841 (see commit fb0884c8297 "BUG/MEDIUM: dns: Don't
store additional records in a linked-list"). So it must be at least
revalidated.
This revert fixes a segfault reported in a comment of the issue #971. It
must be backported as far as 2.2.
This bug happens when a service has multiple records on the same host
and the server provides the A/AAAA resolution in the response as AR
(Additional Records).
In such a condition, the first occurrence of the host will be taken from
the Additional section, while the second (and next ones) will be processed
by an independent resolution task (like we used to do before 2.2).
This can lead to a situation where the "synchronisation" of the
resolution may diverge, as described in github issue #971.
Because of this behavior, HAProxy mixes various types of requests to
resolve the full list of servers: SRV+AR for all "first" occurrences and
A/AAAA for all other occurrences of an existing hostname.
E.g., with the following type of response:
;; ANSWER SECTION:
_http._tcp.be2.tld. 3600 IN SRV 5 500 80 A2.tld.
_http._tcp.be2.tld. 3600 IN SRV 5 500 86 A3.tld.
_http._tcp.be2.tld. 3600 IN SRV 5 500 80 A1.tld.
_http._tcp.be2.tld. 3600 IN SRV 5 500 85 A3.tld.
;; ADDITIONAL SECTION:
A2.tld. 3600 IN A 192.168.0.2
A3.tld. 3600 IN A 192.168.0.3
A1.tld. 3600 IN A 192.168.0.1
A3.tld. 3600 IN A 192.168.0.3
the first A3 host is resolved using the Additional Section and the
second one through a dedicated A request.
When linking the SRV records to their respective Additional one, a
condition was missing (check whether said SRV record is already attached
to an Additional one), so the search stopped at the first SRV record whose
target field matched the Additional record name. Hence only the first
occurrence of a target was managed by an additional record.
This patch adds a condition in this loop to ensure the record being
parsed is not already linked to an Additional Record. If so, we can
carry on the parsing to find a possible next one with the same target
field value.
backport status: 2.2 and above