A few includes were missing in each file. A definition of
struct polled_mask was moved to fd-t.h. The MAX_POLLERS macro was
moved to defaults.h
Stdio used to be silently inherited from whatever path but it's needed
for list_pollers() which takes a FILE* and which can thus not be
forward-declared.
Most of the file was a large set of HTX elements manipulation functions
and few types, so splitting them allowed to further reduce dependencies
and shrink the build time. Doing so revealed that a few files (h2.c,
mux_pt.c) needed haproxy/buf.h and were previously getting it through
htx.h. They were fixed.
The pretty confusing "buffer.h" was in fact not the place to look for
the definition of "struct buffer" but the one responsible for dynamic
buffer allocation. As such it defines the struct buffer_wait and the
few functions to allocate a buffer or wait for one.
This patch moves it renaming it to dynbuf.h. The type definition was
moved to its own file since it's included in a number of other structs.
Doing this cleanup revealed that a significant number of files used to
rely on this one to inherit struct buffer through it but didn't need
anything from this file at all.
This moves types/activity.h to haproxy/activity-t.h and
proto/activity.h to haproxy/activity.h.
The macros defining the bit field values for the profiling variable
were moved to the type file to be more future-proof.
Now the file is ready to be stored into its final destination. A few
minor reorderings were performed to keep the file properly organized,
making the various sections more visible (cache & lockless).
In addition and to stay consistent, memory.c was renamed to pool.c.
types/freq_ctr.h was moved to haproxy/freq_ctr-t.h and proto/freq_ctr.h
was moved to haproxy/freq_ctr.h. Files were updated accordingly, no other
change was applied.
This splits the hathreads.h file into types+macros and functions. Given
that most users of this file used to include it only to get the definition
of THREAD_LOCAL and MAXTHREADS, the bare minimum was placed into thread-t.h
(i.e. types and macros).
All the thread management was left to haproxy/thread.h. It's worth noting
the drop of the trailing "s" in the name, to remove the permanent confusion
that arises between this one and the system implementation (no "s") and the
makefile's option (no "s").
For consistency, src/hathreads.c was also renamed thread.c.
A number of files were updated to only include thread-t which is the one
they really needed.
Some future improvements are possible like replacing empty inlined
functions with macros for the thread-less case, as building at -O0 disables
inlining and causes these ones to be emitted. But this really is cosmetic.
This one used to be stored into debug.h but the debug tools got larger
and require a lot of other includes, which can't use BUG_ON() anymore
because of this. It does not make sense and instead this macro should
be placed into the lower includes and given its omnipresence, the best
solution is to create a new bug.h with the few surrounding macros needed
to trigger bugs and place assertions anywhere.
Another benefit is that it won't be required to add include <debug.h>
anymore to use BUG_ON, it will automatically be covered by api.h. No
less than 32 occurrences were dropped.
The FSM_PRINTF macro was dropped since not used at all anymore (probably
since 1.6 or so).
All files that were including one of the following include files have
been updated to only include haproxy/api.h or haproxy/api-t.h once instead:
- common/config.h
- common/compat.h
- common/compiler.h
- common/defaults.h
- common/initcall.h
- common/tools.h
The choice is simple: if the file only requires type definitions, it includes
api-t.h, otherwise it includes the full api.h.
In addition, in these files, explicit includes for inttypes.h and limits.h
were dropped since these are now covered by api.h and api-t.h.
No other change was performed, given that this patch is large and
affects 201 files. At least one (tools.h) was already freestanding and
didn't get the new one added.
Only allow L7 retries when using HTTP, it only really makes sense for HTTP,
anyway, and as the L7 retries code assume the message will be HTX, it will
crash when used with mode TCP.
This should fix github issue #627.
This should be backported to 2.1 and 2.0.
Now we very rarely catch spinning streams, and whenever we catch one it
seems a filter is involved, but we currently report no info about them.
Let's print the list of enabled filters on the stream with such a crash
to help with the reports. A typical output will now look like this:
[ALERT] 121/165908 (1110) : A bogus STREAM [0x7fcaf4016a60] is spinning at 2 calls per second and refuses to die, aborting now! Please report this error to developers [strm=0x7fcaf4016a60 src=127.0.0.1 fe=l1 be=l1 dst=<CACHE> rqf=6dc42000 rqa=48000 rpf=a0040223 rpa=24000000 sif=EST,10008 sib=DIS,80110 af=(nil),0 csf=0x7fcaf4023c00,10c000 ab=0x7fcaf40235f0,4 csb=(nil),0 cof=0x7fcaf4016610,1300:H1(0x7fcaf4016840)/RAW((nil))/tcpv4(29) cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0) filters={0x7fcaf4016fb0="cache store filter", 0x7fcaf4017080="compression filter"}]
This may be backported to 2.0.
With server-template was introduced the possibility to scale the
number of servers in a backend without needing a configuration change
and associated reload. On the other hand it became impractical to
write use-server rules for these servers as they would only accept
existing server labels as argument. This patch allows the use of
log-format notation to describe targets of a use-server rules, such
as in the example below:
listen test
bind *:1234
use-server %[hdr(srv)] if { hdr(srv) -m found }
use-server s1 if { path / }
server s1 127.0.0.1:18080
server s2 127.0.0.1:18081
If a use-server rule is applied because it was conditionned by an
ACL returning true, but the target of the use-server rule cannot be
resolved, no other use-server rule is evaluated and we fall back to
load balancing.
This feature was requested on the ML, and bumped with issue #563.
It's more generic and versatile than the previous shut_your_big_mouth_gcc()
that was used to silence annoying warnings as it's not limited to ignoring
syscalls returns only. This allows us to get rid of the aforementioned
function and the shut_your_big_mouth_gcc_int variable, that started to
look ugly in multi-threaded environments.
In the rare case of immediate connect() (unix sockets, socket pairs, and
occasionally TCP over the loopback), it is counter-productive to subscribe
for sending and then getting immediately back to process_stream() after
having passed through si_cs_process() just to update the connection. We
already know it is established and it doesn't have any handshake anymore
so we just have to complete it and return to process_stream() with the
stream_interface in the SI_ST_RDY state. In this case, process_stream will
simply loop back to the beginning to synchronize the state and turn it to
SI_ST_EST/ASS/CLO/TAR etc.
This will save us from having to needlessly subscribe in the connect()
code, something which in addition cannot work with edge-triggered pollers.
This lock was only needed to protect the buffer_wq list, but now we have
the mt_list for this. This patch simply turns the buffer_wq list to an
mt_list and gets rid of the lock.
It's worth noting that the whole buffer_wait thing still looks totally
wrong especially in a threaded context: the wakeup_cb() callback is
called synchronously from any thread and may end up calling some
connection code that was not expected to run on a given thread. The
whole thing should probably be reworked to use tasklets instead and be
a bit more centralized.
This counter is already incremented when a new request is received (or if an
error occurred waiting it). So it must not be incremented when the stream is
terminated, at the end of process_strem(). This bug was introduced by the commit
cff0f739e ("MINOR: counters: Review conditions to increment counters from
analysers").
No backport needed.
In process_stream(), when a client or a server abort is handled, the
corresponding listener's counter is incremented. But, we must be sure to have a
listener attached to the session. This bug was introduced by the commit
cff0f739e5.
Thanks to Fred to reporting me the bug.
No need to backport this patch, except if commit cff0f739e5 is backported.
The flags in the act_flag enum have been renamed act_opt. It means ACT_OPT
prefix is used instead of ACT_FLAG. The purpose of this patch is to reserve the
action flags for the actions configuration.
Now, for these counters, the following rules are followed to know if it must be
incremented or not:
* if it exists for a frontend, the counter is incremented
* if stats must be collected for the session's listener, if the counter exists
for this listener, it is incremented
* if the backend is already assigned, if the counter exists for this backend,
it is incremented
* if a server is attached to the stream, if the counter exists for this
server, it is incremented
It is not hardcoded rules. Some counters are still handled in a different
way. But many counters are incremented this way now.
For more than a decade we've kept all the sess_update_st_*() functions
in stream.c while they're only there to work in relation with what is
currently being done in backend.c (srv_redispatch_connect, connect_server,
etc). Let's move all this pollution over there and take this opportunity
to try to find slightly less confusing names for these old functions
whose role is only to handle transitions from one specific stream-int
state:
sess_update_st_rdy_tcp() -> back_handle_st_rdy()
sess_update_st_con_tcp() -> back_handle_st_con()
sess_update_st_cer() -> back_handle_st_cer()
sess_update_stream_int() -> back_try_conn_req()
sess_prepare_conn_req() -> back_handle_st_req()
sess_establish() -> back_establish()
The last one remained in stream.c because it's more or less a completion
function which does all the initialization expected on a connection
success or failure, can set analysers and emit logs.
The other ones could possibly slightly benefit from being modified to
take a stream-int instead since it's really what they're working with,
but it's unimportant here.
In process_sticking_rules() we only want to apply the first store-request
rule for a given table, but when doing so we need to make sure we only
count actual store-request rules when we list the sticking rules.
Failure to do so leads to not being able to write store-request and match
sticking rules in any order as a match rule after a store-request rule
will be ignored.
The following configuration reproduces the issue:
global
stats socket /tmp/foobar
defaults
mode http
frontend in
bind *:8080
default_backend bar
backend bar
server s1 127.0.0.1:21212
server s2 127.0.0.1:21211
stick store-request req.hdr(foo)
stick match req.hdr(foo)
stick-table type string size 10
listen foo
bind *:21212
bind *:21211
http-request deny deny_status 200 if { dst_port 21212 }
http-request deny
This patch fixes issue #448 and should be backported as far as 1.6.
Building on a 32-bit platform produces these warnings in trace code:
src/stream.c: In function 'strm_trace':
src/stream.c:226:29: warning: format '%lu' expects argument of type 'long unsigned int', but argument 9 has type 'size_t {aka const unsigned int}' [-Wformat=]
chunk_appendf(&trace_buf, " req=(%p .fl=0x%08x .ana=0x%08x .exp(r,w,a)=(%u,%u,%u) .o=%lu .tot=%llu .to_fwd=%u)",
^
src/stream.c:229:29: warning: format '%lu' expects argument of type 'long unsigned int', but argument 9 has type 'size_t {aka const unsigned int}' [-Wformat=]
chunk_appendf(&trace_buf, " res=(%p .fl=0x%08x .ana=0x%08x .exp(r,w,a)=(%u,%u,%u) .o=%lu .tot=%llu .to_fwd=%u)",
^
src/mux_fcgi.c: In function 'fcgi_trace':
src/mux_fcgi.c:443:29: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t {aka const unsigned int}' [-Wformat=]
chunk_appendf(&trace_buf, " - VAL=%lu", *val);
^
src/mux_h1.c: In function 'h1_trace':
src/mux_h1.c:290:29: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t {aka const unsigned int}' [-Wformat=]
chunk_appendf(&trace_buf, " - VAL=%lu", *val);
^
Let's just cast the type to long. This should be backported to 2.1.
For backends and servers, some average times for last 1024 connections are
already calculated. For the moment, the averages for the time passed in the
queue, the connect time, the response time (for HTTP session only) and the total
time are calculated. Now, in addition, the maximum time observed for these
values are also stored.
In addition, These new counters are cleared as all other max values with the CLI
command "clear counters".
This patch is related to #272.
The "shutdown session server" command used to open-code the list traversal
while there's already a function for this: srv_shutdown_streams(). Better
use it.
We need to call vars_init() when the list is empty otherwise we
can't use variables in the response scope. This regression was
introduced by cda7f3f5 (MINOR: stream: don't prune variables if
the list is empty).
The following config reproduces the issue:
defaults
mode http
frontend in
bind *:11223
http-request set-var(req.foo) str("foo") if { path /bar }
http-request set-header bar %[var(req.foo)] if { var(req.foo) -m found }
http-response set-var(res.bar) str("bar")
http-response set-header foo %[var(res.bar)] if { var(res.bar) -m found }
use_backend out
backend out
server s1 127.0.0.1:11224
listen back
bind *:11224
http-request deny deny_status 200
> GET /ba HTTP/1.1
> Host: localhost:11223
> User-Agent: curl/7.66.0
> Accept: */*
>
< HTTP/1.0 200 OK
< Cache-Control: no-cache
< Content-Type: text/html
> GET /bar HTTP/1.1
> Host: localhost:11223
> User-Agent: curl/7.66.0
> Accept: */*
>
< HTTP/1.0 200 OK
< Cache-Control: no-cache
< Content-Type: text/html
< foo: bar
This must be backported as far as 1.9.
All TCP and HTTP captures are stored in 2 arrays, one for the request and
another for the response. In HAPRoxy 1.5, these arrays are part of the HTTP
transaction and thus are released during its cleanup. Because in this version,
the transaction is part of the stream (in 1.5, streams are still called
sessions), the cleanup is always performed, for HTTP and TCP streams.
In HAProxy 1.6, the HTTP transaction was moved out from the stream and is now
dynamically allocated only when required (becaues of an HTTP proxy or an HTTP
sample fetch). In addition, still in 1.6, the captures arrays were moved from
the HTTP transaction to the stream. This way, it is still possible to capture
elements from TCP rules for a full TCP stream. Unfortunately, the release is
still exclusively performed during the HTTP transaction cleanup. Thus, for a TCP
stream where the HTTP transaction is not required, the TCP captures, if any, are
never released.
Now, all captures are released when the stream is freed. This fixes the memory
leak for TCP streams. For streams with an HTTP transaction, the captures are now
released when the transaction is reset and not systematically during its
cleanup.
This patch must be backported as fas as 1.6.
Runtime traces are now supported for the streams, only if compiled with
debug. process_stream() is covered as well as TCP/HTTP analyzers and filters.
In traces, the first argument is always a stream. So it is easy to get the info
about the channels and the stream-interfaces. The second argument, when defined,
is always a HTTP transaction. And the third one is an HTTP message. The trace
message is adapted to report HTTP info when possible.
Despite the addition of the mux layer, no change have been made on how to enable
the TCP splicing on process_stream(). We still check if transport layer on both
sides support the splicing, but we don't check the muxes support. So it is
possible to start to splice data with an unencrypted H2 connection on a side and
an H1 connection on the other. This leads to a freeze of the stream until a
client or server timeout is reached.
This patch fixed a part of the issue #356. It must be backported as far as 1.8.
the option "http-send-name-header" is an eyesore. It was responsible of several
bugs because it is handled after the message analysis. With the HTX
representation, the situation is cleaner because no rewind on forwarded data is
required. But it remains ugly.
With recent changes in HAProxy, we have the opportunity to make it fairly
better. The message formatting in now done in the HTTP multiplexers. So it seems
to be the right place to handle this option. Now, the server name is added by
the HTTP multiplexers (h1, h2 and fcgi).
The "cache" entry was still present in the fdtab struct and it was
reported in "show sess". Removing it broke the cache-line alignment
on 64-bit machines which is important for threads, so it was fixed
by adding an attribute(aligned()) when threads are in use. Doing it
only in this case allows 32-bit thread-less platforms to see the
struct fit into 32 bytes.
There were 221 places where a status message or an error message were built
to be returned on the CLI. All of them were replaced to use cli_err(),
cli_msg(), cli_dynerr() or cli_dynmsg() depending on what was expected.
This removed a lot of duplicated code because most of the times, 4 lines
are replaced by a single, safer one.
Since last commit there's no point anymore in having two variants of the
same function, let's switch to b_free() only. __b_drop() was renamed to
__b_free() for obvious consistency reasons.
In stream_set_backend(), if we have a TCP stream, and we want to upgrade it
to H2 instead of attempting ot reuse the stream, just destroy the
conn_stream, make sure we don't log anything about the stream, and pretend
we failed setting the backend, so that the stream will get destroyed.
New streams will then be created by the mux, as if the connection just
happened.
This fixes a crash when upgrading from TCP to H2, as the H2 mux totally
ignored the conn_stream provided by the upgrade, as reported in github
issue #196.
This should be backported to 2.0.
In sess_established(), don't immediately switch the backend stream_interface
to SI_ST_DIS if we only got a SHUTR. We may still have something to send,
ie if the request is a POST, and we should be switched to SI_ST8DIS later
when the shutw will happen.
This should be backported to 2.0 and 1.9.
The purpose will be to store the target address there and not to
allocate a connection just for this anymore. For now it's only placed
in the struct, a few fields were moved to plug some holes, and the
entry is freed on release (never allocated yet for now). This must
have no impact. Note that in order to fit, the store_count which
previously was an int was turned into a short, which is way more
than enough given that the hard-coded limit is 8.
No allocation is needed there. Some extra checks were added in the
stream dump code to make sure the source address is effectively valid
(it always is but it doesn't cost much to be certain).
The stream outputs requires to retrieve connections sources and
destinations. The previous call involving conn_get_{to,from}_addr()
was missing a status check which has now been integrated with the
new call since these places already handle connection errors there.
The same code parts were reused for "show peers" and were modified
similarly.
The old module proto_http does not exist anymore. All code dedicated to the HTTP
analysis is now grouped in the file proto_htx.c. So, to finish the polishing
after removing the legacy HTTP code, proto_htx.{c,h} files have been moved in
http_ana.{c,h} files.
In addition, all HTX analyzers and related functions prefixed with "htx_" have
been renamed to start with "http_" instead.
First of all, all legacy HTTP analyzers and all functions exclusively used by
them were removed. So the most of the functions in proto_http.{c,h} were
removed. Only functions to deal with the HTTP transaction have been kept. Then,
http_msg and hdr_idx modules were entirely removed. And finally the structure
http_msg was lightened of all its useless information about the legacy HTTP. The
structure hdr_ctx was also removed because unused now, just like unused states
in the enum h1_state. Note that the memory pool "hdr_idx" was removed and
"http_txn" is now smaller.
All streams created for an HTTP proxy must now use the HTX internal
resprentation. So, it is no more necessary to test the flag PR_O2_USE_HTX. It
means a stream is an HTX stream if the frontend is an HTTP proxy or if the
frontend multiplexer, if any, set the flag MX_FL_HTX.
Since the legacy HTTP mode is disabled, old HTTP analyzers do nothing but call
those of the HTX. So, it is safe to directly call HTX analyzers from
process_stream().
<head> and <tail> fields are now signed 32-bits integers. For an empty HTX
message, these fields are set to -1. So the field <used> is now useless and can
safely be removed. To know if an HTX message is empty or not, we just compare
<head> against -1 (it also works with <tail>). The function htx_nbblks() has
been added to get the number of used blocks.
Move the logic to decide if we redispatch to a new server from
sess_update_st_cer() to a new inline function, stream_choose_redispatch(), and
use it in do_l7_retry() instead of just setting the state to SI_ST_REQ.
That way, when using L7 retries, we won't redispatch the request to another
server except if "option redispatch" is used.
This should be backported to 2.0.
It's really confusing to call it a task because it's a tasklet and used
in places where tasks and tasklets are used together. Let's rename it
to tasklet to remove this confusion.
These flags are not used by analysers, only by the shut* functions, and
they were covered by CF_MASK_STATIC only because in the past the shut
functions were in the middle of the analysers. But here they are causing
excess loop backs which provide no value and increase processing cost.
Ideally the CF_MASK_STATIC bitfield should be revisited, but doing this
alone is enough to reduce by 30% the number of calls to si_sync_send().
In process_stream() we detect a number of conditions to decide to loop
back to the analysers. Some of them are excessive in that they perform
a strict comparison instead of filtering on the flags relevant to the
analysers as is done at other places, resulting in excess wakeups. One
of the effect is that after a successful WRITE_PARTIAL, a second send is
not possible, resulting in the loss of WRITE_PARTIAL, causing another
wakeup! Let's apply the same mask and verify the flags correctly.
The "goto redo" at the end of process_stream() to make the states converge
is still a big source of problems and mostly stems from the very late call
to the send() functions, whose results need to be considered, while it's
being done in si_update_both() when leaving.
This patch extracts the si_sync_send() calls from si_update_both(), and
places them at the relevant places in process_stream(), which are just
after the amount of data to forward is updated and before the shutw()
calls (which were also moved). The stream-interface resynchronization
needs to go slightly upper to take into account the transition from CON
to RDY that will happen consecutive to some successful send(), and that's
all.
By doing so we can now get rid of this loop and have si_update_both()
called only to update the stream interface and channel when leaving the
function, as it was initially designed to work.
It is worth noting that a number of the remaining conditions to perform
a goto resync_XXX still seem suboptimal and would benefit from being
refined to perform les resynchronization. But what matters at this stage
is that the code remains valid and efficient.
Till now when a wakeup happens after a connection is attempted, we go
through sess_update_st_con_tcp() to deal with the various possible events,
then to sess_update_st_cer() to deal with a possible error detected by the
former, or to sess_establish() to complete the connection validation. There
are multiple issues in the way this is handled, which have accumulated over
time. One of them is that any spurious wakeup during SI_ST_CON would validate
the READ_ATTACHED flag and wake the analysers up. Another one is that nobody
feels responsible for clearing SI_FL_EXP if it happened at the same time as
a success (and it is present in all reports of loops to date). And another
issue is that aborts cannot happen after a clean connection setup with no
data transfer (since CF_WRITE_NULL is part of CF_WRITE_ACTIVITY). Last, the
flags cleanup work was hackish, added here and there to please the next
function (typically what had to be donne in commit 7a3367cca to work around
the url_param+reuse issue by moving READ_ATTACHED to CON).
This patch performs a significant lift up of this setup code. First, it
makes sure that the state handlers are the ones responsible for the cleanup
of the stuff they rely on. Typically sess_sestablish() will clean up the
SI_FL_EXP flag because if we decided to validate the connection it means
that we want to ignore this late timeout. Second, it splits the CON and
RDY state handlers because the former only has to deal with failures,
timeouts and non-events, while the latter has to deal with partial or
total successes. Third, everything related to connection success was
moved to sess_establish() since it's the only safe place to do so, and
this function is also called at a few places to deal with synchronous
connections, which are not seen by intermediary state handlers.
The code was made a bit more robust, for example by making sure we
always set SI_FL_NOLINGER when aborting a connection so that we don't
have any risk to leave a connection in SHUTW state in case it was
validated late. The useless return codes of some of these functions
were dropped so that callers only rely on the stream-int's state now
(which was already partially the case anyway).
The code is now a bit cleaner, could be further improved (and functions
renamed) but given the sensitivity of this part, better limit changes to
strictly necessary. It passes all reg tests.
The main reason for all the trouble we're facing with stream interface
error or timeout reports during the connection phase is that we currently
can't make the difference between a connection attempt and a validated
connection attempt. It is problematic because we tend to switch early
to SI_ST_EST but can't always do what we want in this state since it's
supposed to be set when we don't need to visit sess_establish() again.
This patch introduces a new state betwen SI_ST_CON and SI_ST_EST, which
is SI_ST_RDY. It indicates that we've verified that the connection is
ready. It's a transient state, like SI_ST_DIS, that cannot persist when
leaving process_stream(). For now it is not set, only verified in various
tests where SI_ST_CON was used or SI_ST_EST depending on the cases.
The stream-int state diagram was minimally updated to reflect the new
state, though it is largely obsolete and would need to be seriously
updated.
The stream interface state checks involving ranges were replaced with
checks on a set of states, already revealing some issues. No issue was
fixed, all was replaced in a one-to-one mapping for easier control. Some
checks involving a strict difference were also replaced with fields to
be clearer. At this stage, the result must be strictly equivalent. A few
tests were also turned to their bit-field equivalent for better readability
or in preparation for upcoming changes.
The test performed in the SPOE filter was swapped so that the closed and
error states are evicted first and that the established vs conn state is
tested second.
The test for the send-name-header field used to cover all states between
SI_ST_CON and SI_ST_CLO, which include SI_ST_CER and SI_ST_DIS. Trying to
send a header in these states makes no sense at all, so let's fix this.
This should have no visible impact so no backport is needed.
With this patch we modify the stickiness server targets lookup behavior.
First we look for this server targets by their names before looking for them by their
IDs if not found. We also insert a dictionary entry for the name of the server targets
and store the address of this entry in the underlying stick-table.
During 1.9 development cycle a shortcut was made in process_stream() to
update the analysers immediately after an I/O even detected on the send()
path while leaving the function. In order to prevent this from being abused
by a single stream stealing all the CPU, the loop didn't cover the initial
recv() call, so that events ultimately converge.
This has caused a number of issues over time because the conditions to
decide to loop are a bit tricky. For example the CF_READ_PARTIAL flag is
not immediately removed from rqf_last and may appear for a long time at
this point, sometimes causing some loops to last long.
Another unexpected side effect is that all analysers are called again with
no data to process, just because CF_WRITE_PARTIAL is present. We cannot get
rid of this event even if of very rare use, because some analysers might
wait for some data to leave a buffer before proceeding. With a full loop,
this event would have been merged with a subsequent recv() allowing analysers
to do something more useful than just ack an event they don't care about.
While during early 1.9-dev it was very important to be kind with the
scheduler, nowadays it's lock-free for local tasks so this optimization
is much less interesting to use it for I/Os, especially if we factor in
the trouble it causes.
This patch thus removes the use of the loop for regular I/Os and instead
performs a task_wakeup() with an I/O event so that the task will be
scheduled after all other ones and will have a chance to perform another
recv() and possibly to gather more I/O events to be processed at once.
Synchronous errors and transitions to SI_ST_DIS however are still handled
by the loop.
Doing so significantly reduces the average number of calls to analysers
(those are typically halved when compression is enabled in legacy mode),
and as a side benefit, has increased the H1 performance by about 1%.
The head of an HTX message is heavily used whereas the wrap position is only
used when a block is added or removed. So it is more logical to store the head
position in the HTX message instead of the wrap one. The wrap position can be
easily deduced. To get it, the new function htx_get_wrap() may be used.
It was not as efficient as the watchdog in that it would only trigger
after the problem resolved by itself, and still required a huge margin
to make sure we didn't trigger for an invalid reason. This used to leave
little indication about the cause. Better use the watchdog now and
improve it if needed.
The detector of unkillable tasks remains active though.
This function dumps a lot of information about a stream into the provided
buffer. It is now used by stream_dump_and_crash() and will be used by the
debugger as well.
When we receive a read0, and we're still in SI_ST_CON state (so on an
outgoing conneciton), don't immediately switch to SI_ST_DIS, or, we would
never call sess_establish(), and so the analysers will never run.
Instead, let sess_establish() handle that case, and switch to SI_ST_DIS if
we already have CF_SHUTR on the channel.
This should be backported to 1.9.
The test consisted in checking that there was always a timeout on a
stream's task and was only enabled when built in development mode,
but 1) it is never tested and 2) if it had been tested it would have
been noticed that it triggers a bit too easily on the CLI. Let's get
rid of this old one.
This makes sure that the stream is not visible from its own task just
before starting to free some of its components. This way we have the
guarantee that a stream found in a task list is totally valid and can
safely be dereferenced.
Add a new action for http-request, disable-l7-retry, that can be used to
disable any attempt at retry requests (see retry-on) if it fails for any
reason other than a connection failure.
This is useful for example to make sure POST requests aren't retried.
A backend stream-interface attached to a reused connection remains in the state
SI_ST_CONN until some data are sent to validate the connection. But when the
url_param algorithm is used to balance connections, no data are sent while the
connection is not established. So it is a chicken and egg situation.
To solve the problem, if no error is detected and when the request channel is
waiting for the connect(), we mark the read side as attached on the response
channel as soon as possible and we wake the request channel up once. This
happens in 2 places. The first one is right after the connect(), when the
stream-interface is still in state SI_ST_CON, in the function
sess_update_st_con_tcp(). The second one is when an applet is used instead of a
real connection to a server, in the function sess_prepare_conn_req(). In fact,
it is done when the backend stream-interface is set to the state SI_ST_EST.
This patch must be backported to 1.9.
When running in HTX mode, if we sent the request, but failed to get the
answer, either because the server just closed its socket, we hit a server
timeout, or we get a 404, 408, 425, 500, 501, 502, 503 or 504 error,
attempt to retry the request, exactly as if we just failed to connect to
the server.
To do so, add a new backend keyword, "retry-on".
It accepts a list of keywords, which can be "none" (never retry),
"conn-failure" (we failed to connect, or to do the SSL handshake),
"empty-response" (the server closed the connection without answering),
"response-timeout" (we timed out while waiting for the server response),
or "404", "408", "425", "500", "501", "502", "503" and "504".
The default is "conn-failure".
In sess_update_st_con_tcp(), if we have an error on the stream_interface
because we tried to send early_data but failed, don't flag the request
channel as CF_WRITE_ERROR, or we will never reach the analyser that sends
back the 425 response.
This should be backported to 1.9.
On some occasions we've had loops happening when processing actions
(e.g. a yield not being well understood) resulting in analysers being
called in loops until the analysis timeout without incrementing the
stream's call count, thus this type of bug cannot be caught by the
current protection system.
What this patch proposes is to start to measure the time spent in analysers
when profiling is enabled on the thread, in order to detect if a stream is
really misbehaving. In this case we measured the consumed CPU time, not the
wall clock time, so as not to be affected by possible noisy neighbours
sharing the same CPU. When more than 100ms are spent in an analyser, we
trigger the stream_dump_and_crash() function to report the anomaly.
The choice of 100ms comes from the fact that regular calls only take around
1 microsecond and it seems reasonable to accept a degradation factor of
100000, which covers very slow machines such as home gateways running on
sub-ghz processors, with extremely heavy configurations. Some complete
tests show that even this common bogus map_regm() entry supposedly designed
to extract a port from an IP:port entry does not trigger the timeout (25 ms
evaluation time for a 4kB header, exercise left to the reader to spot the
mistake) :
([0-9]{0,3}).([0-9]{0,3}).([0-9]{0,3}).([0-9]{0,3}):([0-9]{0,5}) \5
However this one purposely designed to kill haproxy definitely dies as it
manages to completely freeze the whole process for more than one second
on a 4 GHz CPU for only 120 bytes in :
(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})b \1
This protection will definitely help during the code stabilization period
and may possibly be left enabled later depending on reported issues or not.
If you've noticed that your workload is affected by this patch, please
report it as you have very likely found a bug. And in the mean time you
can turn profiling off to disable it.
If a stream is caught spinning over itself at more than 100000 loops per
second and for more than one second, the process will be aborted and the
offender reported on the console and logs. Typical figures usually are just
a few tens to hundreds per second over a very short time so there is a huge
margin here. Using even higher values could also work but there is the risk
of not being able to catch offenders if multiple ones start to bug at the
same time and share the load. This code should ideally be disabled for
stable releases, though in theory nothing should ever trigger it.
During 1.9 development (and even a bit after) we've started to face a
significant number of situations where streams were abusively spinning
due to an uncaught error flag or complex conditions that couldn't be
correctly identified. Sometimes streams wake appctx up and conversely
as well. More importantly when this happens the only fix is to restart.
This patch adds a new function to report a serious error, some relevant
info and to crash the process using abort() so that a core dump is
available. The purpose will be for this function to be called in various
situations where the process is unfixable. It will help detect these
issues much earlier during development and may even help fixing test
platforms which are able to automatically restart when such a condition
happens, though this is not the primary purpose.
This patch only provides the function and doesn't use it yet.
Very similarly to previous commit doing the same for streams, we now
measure and report an appctx's call rate. This will help catch applets
which do not consume all their data and/or which do not properly report
that they're waiting for something else. Some of them like peers might
theorically be able to exhibit some occasional peeks when teaching a
full table to a nearby peer (e.g. the new replacement process), but
nothing close to what a bogus service can do so there is no risk of
confusion.
Quite a few times some bugs have made a stream task incorrectly
handle a complex combination of events, which was often reported as
"100% CPU", and was usually caused by the event not being properly
identified and flushed, and the stream's handler called in loops.
This patch adds a call rate counter to the stream struct. It's not
huge, it's really inexpensive (especially compared to the rest of the
processing function) and will easily help spot such tasks in "show sess"
output, possibly even allowing to kill them.
A future patch should probably consist in alerting when they're above a
certain threshold, possibly sending a dump and killing them. Some options
could also consist in aborting in order to get an analyzable core dump
and let a service manager restart a fresh new process.
A regression was introduced with the commit c9aecc8ff ("BUG/MEDIUM: stream:
Don't request a server connection if a shutw was scheduled"). Among other this,
it breaks the CLI when the shutr on the client side is handled with the client
data. To depend on the flag CF_SHUTW_NOW to not establish the server connection
when an error on the client side is detected is the right way to fix the bug,
because this flag may be set without any error on the client side.
So instead, we abort the request where the error is handled and only when the
backend stream-interface is in the state SI_ST_INI. This way, there is no
ambiguity on the reason why the abort accurred. The stream-interface is also
switched to the state SI_ST_CLO.
This patch must be backported to 1.9. If the commit c9aecc8ff is backported to
previous versions, this one MUST also be backported. Otherwise, it MAY be
backported to older versions that 1.9 with caution.
Fix some missing initializations wich came with 333939c commit (MINOR: action:
new '(http-request|tcp-request content) do-resolve' action). The DNS contexts of
streams which were allocated were not initialized by stream_new(). This leaded to
accesses to non-allocated memory when freeing these contexts with stream_free().
The 'do-resolve' action is an http-request or tcp-request content action
which allows to run DNS resolution at run time in HAProxy.
The name to be resolved can be picked up in the request sent by the
client and the result of the resolution is stored in a variable.
The time the resolution is being performed, the request is on pause.
If the resolution can't provide a suitable result, then the variable
will be empty. It's up to the admin to take decisions based on this
statement (return 503 to prevent loops).
Read carefully the documentation concerning this feature, to ensure your
setup is secure and safe to be used in production.
This patch creates a global counter to track various errors reported by
the action 'do-resolve'.
If a shutdown for writes was performed on the client side (CF_SHUTW is set on
the request channel) while the server connection is still unestablished (the
stream-int is in the state SI_ST_INI), then it is aborted. It must also be
aborted when the shudown for write is pending (only CF_SHUTW_NOW is
set). Otherwise, some errors on the request channel can be ignored, leaving the
stream in an undefined state.
This patch must be backported to 1.9. It may probably be backported to all
suported versions, but it is unclear if the bug is visbile for older versions
than 1.9. So it is probably safer to wait bug reports on these versions to
backport this patch.
task_delete() was never used without calling task_free() just after, and
task_free() was only used on error pathes to destroy a just-created task,
so merge them into task_destroy(), that will remove the task from the
wait queue, and make sure the task is either destroyed immediately if it's
not in the run queue, or destroyed when it's supposed to run.
The flag SF_HTX has been added to know when a stream uses the HTX or not. It is
set when an HTX stream is created. There are 2 conditions to set it. The first
one is when the HTTP frontend enables the HTX. The second one is when the attached
conn_stream uses an HTX multiplexer.
In process_stream(), only try again when there's the SI_FL_ERR flag and we're
in a connected state, otherwise we can loop forever.
It used to work because si_update_both() bogusly removed the SI_FL_ERR flag,
and it would never be set at this point. Now it does, so take that into
account.
Many, many thanks to Maciej Zdeb for reporting the problem, and helping
investigating it.
This should be backported to 1.9.
In commit d7704b534, we introduced and expiration flag on the stream interface,
which is used for the connect, the queue and the turn around. Because the
turn around state isn't an error, the flag was reset in process_stream(), and
later in commit cff6411f9 when introducing the SI_FL_ERR flag, the cleanup
of the flag at this place was erroneously generalized.
To fix this, the SI_FL_EXP flag is only cleared at the end of the turn around
state, and nobody should clear the stream interface flags anymore.
This should be backported to 1.9, it has no known impact on older versions.
As si_update_both() sets prev_state to state for each stream_interface, if
we want to check it changed, copy it before calling si_update_both().
This should be backported to 1.9.
Don't inconditionally remove the SI_FL_ERR code in si_update_both(), which
is called at the end of process_stream(). Doing so was a bug that was there
since the flag was introduced, because we were always setting si->flags to
SI_FL_NONE, however we don't want to lose that one, except if we will retry
connecting, so only remove it in sess_update_st_cer().
This should be backported to 1.9.
Because the flag CF_SHUTR is no more set to mark the end of the message by the
H2 multiplexer, we can rely on it again to detect aborts. there is no more need
to make a check on the flag SI_FL_CLEAN_ABRT when the option abortonclose is
enabled. So, this option should work as before for h2 clients.
This patch must be backported to 1.9 with the previous EOI patches.
In addition to stats and cache applets, there are also HTTP applet services
declared in an http-request rule. All these applets are now handled the same
way. Among other things, the header Expect is handled at the same place for all
these applets.
It's never easy to guess what services are built in. We currently have
the prometheus exporter in contrib/ which is the only extension for now.
Let's enumerate all available ones just like we do for filterr and pollers.
gcc 6+ complains about a possible null-deref here due to the test in
objt_server() :
if (objt_server(s->target))
HA_ATOMIC_ADD(&objt_server(s->target)->counters.retries, 1);
Let's simply change it to __objt_server(). This can be backported to
1.9 and 1.8.
Commit 32211a1 ("BUG/MEDIUM: stream: Don't forget to free
s->unique_id in stream_free().") addressed a memory leak but in
exchange may cause double-free due to the fact that after freeing
s->unique_id it doesn't null it and then calls http_end_txn()
which frees it again. Thus the process quickly crashes at runtime.
This fix must be backported to all stable branches where the
aforementioned patch was backported.
In stream_free(), free s->unique_id. We may still have one, because it's
allocated in log.c::strm_log() no matter what, even if it's a TCP connection
and thus it won't get free'd by http_end_txn().
Failure to do so leads to a memory leak.
This should probably be backported to all maintained branches.
In 1.5-dev13, a bug was introduced by commit e3224e870 ("BUG/MINOR:
session: ensure that we don't retry connection if some data were sent").
If a connection error is reported after some data were sent (and lost),
we used to accidently mark the front connection as being in error instead
of only the back one because the two direction flags were applied to the
same channel. This case is extremely rare with raw connections but can
happen a bit more often with multiplexed streams. This will result in
the error not being correctly reported to the client.
This patch can be backported to all supported versions.
When a connection reuse fails, we must not wait before retrying, as most
likely the issue is related to the reused connection and not to the server
itself.
This should be backported to 1.9, though it depends on previous patches
dealing with SI_ST_CON for connection reuse.
Before the first send() attempt, we should be in SI_ST_CON, not
SI_ST_EST, since we have not yet attempted to send and we are
allowed to retry. This is particularly important with complex
outgoing muxes which can fail during the first send attempt (e.g.
failed stream ID allocation).
It only requires that sess_update_st_con_tcp() knows about this
possibility, as we must not forcefully close a reused connection
when facing an error in this case, this will be handled later.
This may be backported to 1.9 with care after some observation period.
We currently detect a number of situations where we have to immediately
deal with a state change, but we failed to consider the case of the
synchronous error reported on the stream-interface. We definitely do not
want to have to wait for a timeout to handle this one, especially at the
beginning of the connection when it can lead to an immediate retry.
This should be backported to 1.9.
The "show sess all" command didn't allow to detect whether compression
is in use for a given stream, which is sometimes annoying. Let's add a
few more info about the HTTP messages, namely the flags, body len, chunk
len and the "next" pointer.
The "waiting" flag indicates if the stream is waiting for some memory,
and was placed on the same output line as the txn for ease of reading.
But since 1.6 the txn is not part of the stream anymore so this output
was placed under a condition, resulting in "waiting" to appear only
when a txn is present. Let's move it upper, closer to the stream's
flags to fix this.
This may safely be backported though it has little value for older
versions.
Commit b9af88151 ("MINOR: stream/htx: Add info about the HTX structs in
"show sess all" command") accidently forgot the flags on the request
path, it was only on the response path.
It makes sense to backport this to 1.9 so that both outputs are the same.
This one used to rely on a few spin locks around lists manipulations
only but 1) there were still a few races (e.g. when aborting, or
between STAT_ST_INIT and STAT_ST_LIST), and 2) after last commit
which dumps htx info it became obvious that dereferencing the buffer
contents is not safe at all.
This patch uses the thread isolation from the rendez-vous point
instead, to guarantee that nothing moves during the dump. It may
make the dump a bit slower but it will be 100% safe.
This fix must be backported to 1.9, and possibly to 1.8 which likely
suffers from the short races above, eventhough they're extremely
hard to trigger.
For HTX streams, info about the HTX structure is now dumped for the request and
the response channels in "show sess all" command.
The patch may be backported to 1.9.
As long-time changes have accumulated over time, the exported functions
of the stream-interface were almost all prefixed "si_<something>" while
most private ones (mostly callbacks) were called "stream_int_<something>".
There were still a few confusing exceptions, which were addressed to
follow this shcme :
- stream_sock_read0(), only used internally, was renamed stream_int_read0()
and made static
- stream_int_notify() is only private and was made static
- stream_int_{check_timeouts,report_error,retnclose,register_handler,update}
were renamed si_<something>.
Now it is clearer when checking one of these if it risks to be used outside
or not.
Before setting the infinite forward, we first forward all remaining input data
from the channel. Of course for HTX streams, this must be done using the amount
of data in the HTX message not in the channel (which appears as full because of
the HTX).
When the connection failed, we don't really want to close the conn_stream,
as we're probably about to retry, so just make sure the file descriptor is
closed.
These flags haven't been used for a while. SF_TUNNEL was reintroduced
by commit d62b98c6e ("MINOR: stream: don't set backend's nor response
analysers on SF_TUNNEL") to handle the two-level streams needed to
deal with the first model for H2, and was not removed after this model
was abandonned. SF_INITIALIZED was only set. SF_CONN_TAR was never
referenced at all.
All the HTX definition is self-contained and doesn't really depend on
anything external since it's a mostly protocol. In addition, some
external similar files (like h2) also placed in common used to rely
on it, making it a bit awkward.
This patch moves the two htx.h files into a single self-contained one.
The historical dependency on sample.h could be also removed since it
used to be there only for http_meth_t which is now in http.h.
The CLI proxy was not handling payload. To do that, we needed to keep a
connection active on a server and to transfer each new line over that
connection until we receive a empty line.
The CLI proxy handles the payload in the same way that the CLI do it.
Examples:
$ echo -e "@1;add map #-1 <<\n$(cat data)\n" | socat /tmp/master-socket -
$ socat /tmp/master-socket readline
prompt
master> @1
25130> add map #-1 <<
+ test test
+ test2 test2
+ test3 test3
+
25130>
To ease the fast forwarding and the infinte forwarding on HTX proxies, 2
functions have been added to let the channel be almost aware of the way data are
stored in its buffer. By calling these functions instead of legacy ones, we are
sure to forward the right amount of data.
These potential null-deref warnings are emitted on gcc 7 and above
when threads are disabled due to the use of objt_server() after an
existing validity test. Let's switch to __objt_server() since we
know the pointer is valid, it will not confuse the compiler.
Some of these may be backported to 1.8.
We can reach sess_update_st_con_tcp() while we still have a connection
attached, so take that into account, and free the connection, instead of
assuming it's always a conn_stream.
signal_init(), init_log(), init_stream(), and init_task() all used to
only preset some values and lists. This needs to be done very early to
provide a reliable interface to all other users. The calls used to be
explicit in haproxy.c:init(). Now they're placed in initcalls at the
STG_PREPARE stage. The functions are not exported anymore.
This commit replaces the explicit pool creation that are made in
constructors with a pool registration. Not only this simplifies the
pools declaration (it can be done on a single line after the head is
declared), but it also removes references to pools from within
constructors. The only remaining create_pool() calls are those
performed in init functions after the config is parsed, so there
is no more user of potentially uninitialized pool now.
It has been the opportunity to remove no less than 12 constructors
and 6 init functions.
This patch replaces a number of __decl_hathread() followed by HA_SPIN_INIT
or HA_RWLOCK_INIT by the new __decl_spinlock() or __decl_rwlock() which
automatically registers the lock for initialization in during the STG_LOCK
init stage. A few static modifiers were lost in the process, but since they
were not essential at all it was not worth extending the API to provide such
a variant.
This switches explicit calls to various trivial registration methods for
keywords, muxes or protocols from constructors to INITCALL1 at stage
STG_REGISTER. All these calls have in common to consume a single pointer
and return void. Doing this removes 26 constructors. The following calls
were addressed :
- acl_register_keywords
- bind_register_keywords
- cfg_register_keywords
- cli_register_kw
- flt_register_keywords
- http_req_keywords_register
- http_res_keywords_register
- protocol_register
- register_mux_proto
- sample_register_convs
- sample_register_fetches
- srv_register_keywords
- tcp_req_conn_keywords_register
- tcp_req_cont_keywords_register
- tcp_req_sess_keywords_register
- tcp_res_cont_keywords_register
- flt_register_keywords
When ending a stream, if the origin is an appctx, the appctx will have been
destroyed already, but it does not destroy the session. So later, when we
try to destroy the session, we try to dereference sess->origin and die
trying.
Fix this by explicitely setting sess->origin to NULL before calling
session_free().
If an ALPN (or a NPN) was chosen for a server, defer choosing the mux until
after the SSL handshake is done, and the ALPN/NPN has been negociated, so
that we know which mux to pick.
As we now will no longer try tro subscribe to recv/send events before the
connection is established, there's no need to reactivate polling on the fd
when retrying connection. It will be activated later on subscribe.
Right now we measure for each task the cumulated time spent waiting for
the CPU and using it. The timestamp uses a 64-bit integer to report a
nanosecond-level date. This is only enabled when "profiling.tasks" is
enabled, and consumes less than 1% extra CPU on x86_64 when enabled.
The cumulated processing time and wait time are reported in "show sess".
The task's counters are also reset when an HTTP transaction is reset
since the HTTP part pretends to restart on a fresh new stream. This
will make sure we always report correct numbers for each request in
the logs.
At the moment the situation with activity measurement is quite tricky
because the struct activity is defined in global.h and declared in
haproxy.c, with operations made in time.h and relying on freq_ctr
which are defined in freq_ctr.h which itself includes time.h. It's
barely possible to touch any of these files without breaking all the
circular dependency.
Let's move all this stuff to activity.{c,h} and be done with it. The
measurement of active and stolen time is now done in a dedicated
function called just after tv_before_poll() instead of mixing the two,
which used to be a lazy (but convenient) decision.
No code was changed, stuff was just moved around.
When a server is down, the channel's data must not be consumed. This is
required to allow redispatch and connection retry. So now, we wait for
the connection to be marked as connected, with the flag CO_FL_CONNECTED,
before starting to consume channel's data. In the mux, this event is
tracked with the flag H1C_F_CS_WAIT_CONN.
Now, the connection mode is detected in the mux and not in HTX analyzers
anymore. Keep-alive connections are now managed by the mux. A new stream is
created for each transaction. This removes the most important part of the
synchronization between channels and the HTTP transaction cleanup. These changes
only affect the HTX part (proto_htx.c). Legacy HTTP analyzers remain untouched
for now.
On the client-side, the mux is responsible to create new streams when a new
request starts. It is also responsible to parse and update the "Connection:"
header of the response. On the server-side, the mux is responsible to parse and
update the "Connection:" header of the request. Muxes on each side are
independent. For now, there is no connection pool on the server-side, so it
always close the server connection.
Instead of trying to receive as soon as the connection is created, and to
eventually have to transfer subscription if we move connections, wait
until the connection is established before attempting to recv.
Remaining calls to si_cant_put() were all for lack of room and were
turned to si_rx_room_blk(). A few places where SI_FL_RXBLK_ROOM was
cleared by hand were converted to si_rx_room_rdy().
The now unused si_cant_put() function was removed.
The stream interface used to conflate a missing buffer and lack of
buffer space into SI_FL_WAIT_ROOM but this causes difficulties as
these cannot be checked at the same moment and are not resolved at
the same moment either. Now we instead mark the buffer as presumably
available using si_rx_buff_rdy() and mark it as unavailable+requested
using si_rx_buff_blk().
The call to si_alloc_buf() was moved after si_stop_put(). This makes
sure that the SI_FL_RX_WAIT_EP flag is cleared on allocation failure so
that the function is called again if the callee fails to do its work.
This flag is not enough to describe all blocking situations, as can be
seen in each case we remove it. The muxes has taught us that using multiple
blocking flags in parallel will be much easier, so let's start to do this
now. This patch only renames this flags in order to make next changes more
readable.
There are still some unwelcome synchronous calls to si_cs_recv() in
process_stream(). Let's have a new function si_sync_recv() to perform
a synchronous receive call on a stream interface regardless of the type
of its endpoint, and move these calls there. For now it only implements
conn_streams since it doesn't seem useful to support applets there. The
function implements an extra check for the stream interface to be in an
established state before attempting anything.
In commit f26c26c ("BUG/MEDIUM: stream-int: change the way buffer room
is requested by a stream-int") we used to call si_want_put() at the
end of sess_update_st_con_tcp(), when switching to SI_ST_EST state.
But this is incorrect as there are a few other situations where we
can switch to this state, such as in si_connect() where a connection
reuse is detected, or when directly calling an applet (in which case
that was already covered anyway). For now it doesn't have any side
effect but it could impact connection reuse after the stream-int
changes by stalling an immediately reused connection.
Let's move this flag change to sess_establish() instead, which is the
only place which is always called exactly once on connection setup.
No backport is needed, this is purely 1.9.
Subsequent to the recent stream-int updates, we started to consider that
SI_FL_WANT_PUT needs to be set when receipt is enabled, but this is wrong
and results in 100% CPU when an HTTP client stays idle after a keep-alive
request because the stream-int has nothing to provide and nothing to send.
In fact just like for applets this flag should reflect the continuation
of an attempt. So it's si_cs_recv() which should set the flag, and clear
it if it has nothing more to provide. This function is called the first
time in process_stream()), and called again during transfers, so it will
always be up to date during stream_int_update() and stream_int_notify().
As a special case, it should also be set when a connection switches to
the established state. And we should absolutely refrain from calling
si_cs_recv() to re-enable reading, normally just setting this flag
(from within the stream-int's handler or prior to calling si_chk_rcv())
is expected to be OK.
A corner case remains where it was observed that in stream_int_notify() we
can sometimes be called with an empty output channel with SI_FL_WAIT_ROOM
and no CF_WRITE_PARTIAL, so there's no way to detect that we should
re-enable receiving. It's easy to also take care of this condition
there for the time it takes to figure if this situation is expected
or not.
Now it becomes more obvious that relying on a single flag to request
room (or on two flags to arbiter activity) is not workable given the
autonomy of both sides. The mux_h2 has taught us that blocking flags
are much more reliable, require much less condition and are much easier
to deal with. That's probably something to consider quickly in this
area.
No backport is needed.
It's far from being clean, but at least it allows to resync both CS and
applets from the same place, taking into account the fact that CS are
processed synchronously for the send side while appletx are processed
outside of the process_stream() loop. The arrangement is optimised to
minimize the amount of iteration by handling send first, then updating
the SI_FL_WAIT_ROOM flags and only then dealing with si_chk_rcv() on
both sides. The SI_FL_WANT_PUT flag is set if needed before calling
si_chk_rcv() since this is done prior to calling stream_int_update().
Now there's no risk that stream_int_notify() is called anymore during
such operations, thus we cannot have any spurious wake-up anymore. The
case where a successful send() could complete a pending connect() is
handled by taking any stream-int state changes into account at the
call place, which is normal since process_stream() is designed to
iterate till stabilisation.
Doing this solves most of the remaining inconsistencies between CS and
applets.
The function used to be called in turn for each side of the stream, but
since it's called exclusively from process_stream(), it prevents us from
making use of the knowledge we have of the operations in progress for
each side, resulting in having to go all the way through functions like
stream_int_notify() which are not appropriate there.
That patch creates a new function, si_update_both() which takes two
stream interfaces expected to belong to the same stream, and processes
their flags in a more suitable order, but for now doesn't change the
logic at all.
The next step will consist in trying to reinsert the rest of the socket
layer-specific update code to ultimately update the flags correctly at
the end of the operation.
It doesn't make sense to limit this code to applets, as any stream
interface can use it. Let's rename it by simply dropping the "applet_"
part of the name. No other change was made except updating the comments.
This function replaces stream_res_available(), which is used as a callback
for the buffer allocator. It now carefully checks which stream interface
was blocked on a buffer allocation, tries to allocate the input buffer to
this stream interface, and wakes the task up once such a buffer was found.
It will automatically remove the SI_FL_WAIT_ROOM flag upon success since
the info this flag indicates becomes wrong as soon as the buffer is
allocated.
The code is still far from being perfect because if a call to si_cs_recv()
fails to allocate a buffer, we'll still end up passing via process_stream()
again, but this could be improved in the future by using finer-grained
wake-up notifications.
This patch implements analysers for parsing the CLI and extra features
for the master's CLI.
For each command (sent alone, or separated by ; or \n) the request
analyser will determine to which server it should send the request.
The 'mode cli' proxy is able to parse a prefix for each command which is
used to select the apropriate server. The prefix start by @ and is
followed by "master", the PID preceded by ! or the relative PID. (e.g.
@master, @1, @!1234). The servers are not round-robined anymore.
The command is sent with a SHUTW which force the server to close the
connection after sending its response. However the proxy allows a
keepalive connection on the client side and does not close.
The response analyser does not do much stuff, it only reinits the
connection when it received a close from the server, and forward the
response. It does not analyze the response data.
The only guarantee of the end of the response is the close of the
server, we can't rely on the double \n since it's not send by every
command.
This could be reimplemented later as a filter.
With the new synchronous si_cs_send() at the end of process_stream(),
we're seeing re-appear the I/O layer specific part of the stream interface
which is supposed to deal with I/O event subscription. The only difference
is that now we subscribe to I/Os only after having attempted (and failed)
them.
This patch brings a cleanup in this by reintroducing stream_int_update_conn()
with the send code from process_stream(). However this alone would not be
enough because the flags which are cleared afterwards would result in the
loss of the possible events (write events only at the moment). So the flags
clearing and stream-int state updates are also performed inside si_update()
between the generic code and the I/O specific code. This definitely makes
sense as after this call we can simply check again for channel and SI flag
changes and decide to loop once again or not.
The rationale here is that we should never need to try to send() at the
beginning of process_stream() because :
- if something was pending, it's very unlikely that it was unblocked
and not sent just between the last poll() and the wakeup instant.
- if something pending was recently sent, then we don't have anything
to send anymore.
So at first glance it doesn't seem like there could be any valid case
where trying to send before entering the function brings any benefit.
If a buffer allocation failed, we have SI_FL_WAIT_ROOM set and c_size(buf)
being zero. It's the only moment where we have a new opportunity to try to
allocate this buffer. However we don't want to waste our time trying this
if both are non-null since it indicates missing room without any changed
condition.
The vars_prune() and vars_init() functions involve locking while most of
the time there is no variable at all in streams nor sessions. Let's check
for emptiness before calling these functions. Simply doing this has
increased the multithreaded performance from 1.5 to 5% depending on the
workload.
The behaviour of the flag CF_WRITE_PARTIAL was modified by commit
95fad5ba4 ("BUG/MAJOR: stream-int: don't re-arm recv if send fails") due
to a situation where it could trigger an immediate wake up of the other
side, both acting in loops via the FD cache. This loss has caused the
need to introduce CF_WRITE_EVENT as commit c5a9d5bf, to replace it, but
both flags express more or less the same thing and this distinction
creates a lot of confusion and complexity in the code.
Since the FD cache now acts via tasklets, the issue worked around in the
first patch no longer exists, so it's more than time to kill this hack
and to restore CF_WRITE_PARTIAL's semantics (i.e.: there has been some
write activity since we last left process_stream).
This patch mostly reverts the two commits above. Only the part making
use of CF_WROTE_DATA instead of CF_WRITE_PARTIAL to detect the loss of
data upon connection setup was kept because it's more accurate and
better suited.
Make sure we call tasklet_free() only after si_release_endpoint(), when the
unsubscribe() method has been called, so that we're sure the mux won't
attempt to access the taslet.
Avoid using conn_xprt_want_send/recv, and totally nuke cs_want_send/recv,
from the upper layers. The polling is now directly handled by the connection
layer, it is activated on subscribe(), and unactivated once we got the event
and we woke the related task.
When retrying to connect to a server, because the previous connection failed,
make sure if we subscribed to the previous connection, the polling flags will
be true for the new fd.
No backport is needed.
In case pool_alloc() fails in stream_new(), we try to detach the stream
from the list before it has been added, dereferencing a NULL. In order
to fix it, simply move the LIST_DEL call upwards.
This must be backported to 1.8.
Make sure we unsubscribe from events before si_release_endpoint destroys
the conn_stream, or it will be never called. To do so, move the call to
unsubscribe to si_release_endpoint() directly.
This is 1.9-specific and shouldn't be backported.
When subscribing, we don't need to provide a list element, only the h2 mux
needs it. So instead, Add a list element to struct h2s, and use it when a
list is needed.
This forces us to use the unsubscribe method, since we can't just unsubscribe
by using LIST_DEL anymore.
This patch is larger than it should be because it includes some renaming.
These ones are mostly called from cfgparse.c for the parsing and do
not depend on the HTTP representation. The functions's prototypes
were moved to proto/http_rules.h, making this file work exactly like
tcp_rules. Ideally we should stop calling these functions directly
from cfgparse and register keywords, but there are a few cases where
that wouldn't work (stats http-request) so it's probably not worth
trying to go this far.
At the eand of process_stream(), we wake the task if there's something in
the input buffer, after attempting a recv. However this is wrong, and we should
only do so if we received new data. Just check the CF_READ_PARTIAL flag.
This is 1.9-specific and should not be backported.
Instead of using si_cs_io_cb() in process_stream() use si_cs_send/si_cs_recv
instead, as si_cs_io_cb() may lead to process_stream being woken up when it
shouldn't be, and thus timeout would never get triggered.
Instead of waiting for the connection layer to let us know we can read,
attempt to receive as soon as process_stream() is called, and subscribe
to receive events if we can't receive yet.
Now, except for idle connections, the recv(), send() and wake() methods are
no more, all the lower layers do is waking tasklet for anybody waiting
for I/O events.
The handshake processing time used to be stored per stream, which was
valid when there was exactly one stream per session. With H2 and
multiplexing it's not the case anymore and the reported handshake times
are wrong in the logs as it's computed between the TCP accept() and the
stream creation. Let's first move the handshake where it belongs, which
is the session.
However, this is not enough because we don't want to report an excessive
idle time either for H2 (since many requests use the connection).
So the solution used here is to have the stream retrieve sess->tv_accept
and the handshake duration when the stream is created, and let the mux
immediately reset them. This way, the handshake time becomes zero for the
second and subsequent requests in H2 (which was already the case in H1),
and the idle time exactly counts how long the connection remained unused
while it could be used, so in H1 it runs from the end of the previous
response and in H2 it runs from the end of the previous request since the
channel is already available.
This patch will need to be backported to 1.8.
The request counter is incremented when creating a new stream and when
resetting a stream, preparing for a new request. Unfortunately during
the thread migration this was missed, leading to non-atomic increments
in case threads are in use. The most visible side effect is that two
requests may have the same ID from time to time in the logs. However
the SPOE also uses this ID to route responses back to the stream so it
may also lead to occasional spurious SPOE timeouts.
Note that it still doesn't guarantee temporal unicity in the stream
identifiers since a long and a short connection could technically use
the same ID. The likeliness that this happens at the same time is almost
null (roughly threads*runqueue_depth/2^32 that it happens in the same
poll loop), but it will have to be addressed later anyway.
This patch must be backported to 1.8 with the other one it relies on
("MINOR: thread: implement HA_ATOMIC_XADD()").
When freeing the stream, make sure we remove the stream interfaces from the
wait lists, in case it was in there.
This is 1.9-specific, no backport is needed.
Instead of just using the conn_stream wait_list, give the stream_interface
its own. When the conn_stream will have its own buffers, the stream_interface
may have to wait on it.
This adds the set-priority-class and set-priority-offset actions to
http-request and tcp-request content. At this point they are not used
yet, which is the purpose of the next commit, but all the logic to
set and clear the values is there.
The current name is misleading as it implies a queue size, but the value
instead indicates a position in the queue.
The value is only the queue size at the exact moment the element is enqueued.
Soon we will gain the ability to insert anywhere into the queue, upon which
clarity of the name is more important.
We'll soon need to rely on the pendconn position at the time of dequeuing
to figure the position a stream took in the queue. Usually it's not a
problem since pendconn_free() is called once the connection starts, but
it will make a difference for failed dequeues (eg: queue timeout reached).
Thus it's important to call pendconn_free() before logging in cases we are
not certain whether it was already performed, and to call pendconn_unlink()
after we know the pendconn will not be used so that we collect the queue
state as accurately as possible. As a benefit it will also make the
server's and backend's queues count more accurate in these cases.
Now pendconn_free() takes a stream, checks that pend_pos is set, clears
it, and uses pendconn_unlink() to complete the job. It's cleaner and
centralizes all the bookkeeping work in pendconn_unlink() only and
ensures that there's a single place where the stream's position in the
queue is manipulated.
It remained some fragments of the old buffers API in debug messages, here and
there.
This was caused by the recent buffer API changes, no backport is needed.
Now all the code used to manipulate chunks uses a struct buffer instead.
The functions are still called "chunk*", and some of them will progressively
move to the generic buffer handling code as they are cleaned up.
Chunks are only a subset of a buffer (a non-wrapping version with no head
offset). Despite this we still carry a lot of duplicated code between
buffers and chunks. Replacing chunks with buffers would significantly
reduce the maintenance efforts. This first patch renames the chunk's
fields to match the name and types used by struct buffers, with the goal
of isolating the code changes from the declaration changes.
Most of the changes were made with spatch using this coccinelle script :
@rule_d1@
typedef chunk;
struct chunk chunk;
@@
- chunk.str
+ chunk.area
@rule_d2@
typedef chunk;
struct chunk chunk;
@@
- chunk.len
+ chunk.data
@rule_i1@
typedef chunk;
struct chunk *chunk;
@@
- chunk->str
+ chunk->area
@rule_i2@
typedef chunk;
struct chunk *chunk;
@@
- chunk->len
+ chunk->data
Some minor updates to 3 http functions had to be performed to take size_t
ints instead of ints in order to match the unsigned length here.
Now the buffers only contain the header and a pointer to the storage
area which can be anywhere. This will significantly simplify buffer
swapping and will make it possible to map chunks on buffers as well.
The buf_empty variable was removed, as now it's enough to have size==0
and area==NULL to designate the empty buffer (thus a non-allocated head
is the empty buffer by default). buf_wanted for now is indicated by
size==0 and area==(void *)1.
The channels and the checks now embed the buffer's head, and the only
pointer is to the storage area. This slightly increases the unallocated
buffer size (3 extra ints for the empty buffer) but considerably
simplifies dynamic buffer management. It will also later permit to
detach unused checks.
The way the struct buffer is arranged has proven quite efficient on a
number of tests, which makes sense given that size is always accessed
and often first, followed by the othe ones.
For the same consistency reasons, let's use b_empty() at the few places
where an empty buffer is expected, or c_empty() if it's done on a channel.
Some of these places were there to realign the buffer so
{b,c}_realign_if_empty() was used instead.
Passing unsigned ints everywhere is painful, and will cause some headache
later when we'll want to integrate better with struct ist which already
uses size_t. Let's switch buffers to use size_t instead.
Since applets are now part of the main scheduler, it's useful to report
their nice value and the number of calls to the applet handler, to see
where the CPU is spent.
There's no real reason to have a specific scheduler for applets anymore, so
nuke it and just use tasks. This comes with some benefits, the first one
being that applets cannot induce high latencies anymore since they share
nice values with other tasks. Later it will be possible to configure the
applets' nice value. The second benefit is that the applet scheduler was
not very thread-friendly, having a big lock around it in prevision of this
change. Thus applet-intensive workloads should now scale much better with
threads.
Some more improvement is possible now : some applets also use a task to
handle timers and timeouts. These ones could now be simplified to use only
one task.
In preparation for thread-specific runqueues, change the task API so that
the callback takes 3 arguments, the task itself, the context, and the state,
those were retrieved from the task before. This will allow these elements to
change atomically in the scheduler while the application uses the copied
value, and even to have NULL tasks later.
In order to use arbitrary data in the CLI (multiple lines or group of words
that must be considered as a whole, for example), it is now possible to add a
payload to the commands. To do so, the first line needs to end with a special
pattern: <<\n. Everything that follows will be left untouched by the CLI parser
and will be passed to the commands parsers.
Per-command support will need to be added to take advantage of this
feature.
Signed-off-by: Aurlien Nephtali <aurelien.nephtali@corp.ovh.com>
issue was identified by cppcheck
[src/map.c:372] -> [src/map.c:376]: (warning) Variable 'appctx->st2' is reassigned a value before the old one has been used. 'break;' missing?
[src/map.c:433] -> [src/map.c:437]: (warning) Variable 'appctx->st2' is reassigned a value before the old one has been used. 'break;' missing?
[src/map.c:555] -> [src/map.c:559]: (warning) Variable 'appctx->st2' is reassigned a value before the old one has been used. 'break;' missing?
[src/stream.c:3264] -> [src/stream.c:3268]: (warning) Variable 'appctx->st2' is reassigned a value before the old one has been used. 'break;' missing?
Signed-off-by: Ilya Shipitsin <chipitsine@gmail.com>
The management of the servers and the proxies queues was not thread-safe at
all. First, the accesses to <strm>->pend_pos were not protected. So it was
possible to release it on a thread (for instance because the stream is released)
and to use it in same time on another one (because we redispatch pending
connections for a server). Then, the accesses to stream's information (flags and
target) from anywhere is forbidden. To be safe, The stream's state must always
be updated in the context of process_stream.
So to fix these issues, the queue module has been refactored. A lock has been
added in the pendconn structure. And now, when we try to dequeue a pending
connection, we start by unlinking it from the server/proxy queue and we wake up
the stream. Then, it is the stream reponsibility to really dequeue it (or
release it). This way, we are sure that only the stream can create and release
its <pend_pos> field.
However, be careful. This new implementation should be thread-safe
(hopefully...). But it is not optimal and in some situations, it could be really
slower in multi-threaded mode than in single-threaded one. The problem is that,
when we try to dequeue pending connections, we process it from the older one to
the newer one independently to the thread's affinity. So we need to wait the
other threads' wakeup to really process them. If threads are blocked in the
poller, this will add a significant latency. This problem happens when maxconn
values are very low.
This patch must be backported in 1.8.
An fd cache entry might be removed and added at the end of the list, while
another thread is parsing it, if that happens, we may miss fd cache entries,
to avoid that, add a new field in the struct fdtab, "added_mask", which
contains a mask for potentially affected threads, if it is set, the
corresponding thread will set its bit in fd_cache_mask, to avoid waiting in
poll while it may have more work to do.
Create a local, per-thread, fdcache, for file descriptors that only belongs
to one thread, and make the global fd cache mostly lockless, as we can get
a lot of contention on the fd cache lock.
Since the fd update tables are per-thread, we need to have a bit per
thread to indicate whether an update exists, otherwise this can lead
to lost update events every time multiple threads want to update the
same FD. In practice *for now*, it only happens at start time when
listeners are enabled and ask for polling after facing their first
EAGAIN. But since the pollers are still shared, a lost event is still
recovered by a neighbor thread. This will not reliably work anymore
with per-thread pollers, where it has been observed a few times on
startup that a single-threaded listener would not always accept
incoming connections upon startup.
It's worth noting that during this code review it appeared that the
"new" flag in the fdtab isn't used anymore.
This fix should be backported to 1.8.
A number of counters have been added at special places helping better
understanding certain bug reports. These counters are maintained per
thread and are shown using "show activity" on the CLI. The "clear
counters" commands also reset these counters. The output is sent as a
single write(), which currently produces up to about 7 kB of data for
64 threads. If more counters are added, it may be necessary to write
into multiple buffers, or to reset the counters.
To backport to 1.8 to help collect more detailed bug reports.
James Mc Bride reported an interesting case affecting all versions since
at least 1.5 : if a client aborts a connection on an empty buffer at the
exact moment a server redispatch happens, the CF_SHUTW_NOW flag on the
channel is immediately turned into CF_SHUTW, which is not caught by
check_req_may_abort(), leading the redispatch to be performed anyway
with the channel marked as shut in both directions while the stream
interface correctly establishes. This situation makes no sense.
Ultimately the transfer times out and the server-side stream interface
remains in EST state while the client is in CLO state, and this case
doesn't correspond to anything we can handle in process_stream, leading
to poll() being woken up all the time without any progress being made.
And the session cannot even be killed from the CLI.
So we must ensure that check_req_may_abort() also considers the case
where the channel is already closed, which is what this patch does.
Thanks to James for providing detailed captures allowing to diagnose
the problem.
This fix must be backported to all maintained versions.
The H2 mux can cleanly report an error when a client closes, which is not
the case for the pass-through mux which only reports shutr. That was the
reason why "option abortonclose" was created since there was no way to
distinguish a clean shutdown after sending the request from an abort.
The problem is that in case of H2, the streams are always shut read after
the request is complete (when the END_STREAM flag is received), and that
when this lands on a backend configured with "option abortonclose", this
aborts the request. Disabling abortonclose is not always an option when
H1 and H2 have to coexist.
This patch makes use of the newly introduced mux capabilities reported
via the stream interface's SI_FL_CLEAN_ABRT indicating that the mux is
safe and that there is no need to turn a clean shutread into an abort.
This way abortonclose has no effect on requests initiated from an H2
mux.
This patch as well as these 3 previous ones need to be backported to
1.8 :
- BUG/MINOR: h2: properly report a stream error on RST_STREAM
- MINOR: mux: add flags to describe a mux's capabilities
- MINOR: stream-int: set flag SI_FL_CLEAN_ABRT when mux supports clean aborts
By copying the info in the stream interface that the mux cleanly reports
aborts, we'll have the ability to check this flag wherever needed regardless
of the presence of a mux or not.
Commit 3e13cba ("MEDIUM: session: make use of the connection's destroy
callback") ensured that connections could be autonomous to destroy the
session they initiated, but it didn't take care of doing the same for
applets. Such applets are used for peers, Lua and SPOE outgoing
connections. In this case, once the stream ends, it closes everything
and nothing takes care of releasing the session. The problem is not
immediately obvious since the only visible effect is that older
processes will not quit on reload after having leaked one such session.
For now we check in stream_free() if the session's origin is the applet
we're releasing, and then free the session as well. Something more
uniform should probably be done once we manage to unify applets and
connections a bit more.
This fix needs to be backported to 1.8. Thanks to Emmanuel Hocdet for
reporting the problem.
During the migration to the second version of the pools, the new
functions and pool pointers were all called "pool_something2()" and
"pool2_something". Now there's no more pool v1 code and it's a real
pain to still have to deal with this. Let's clean this up now by
removing the "2" everywhere, and by renaming the pool heads
"pool_head_something".
The cache exhibited a but in process_stream() where upon abort it is
possible to switch the stream-int's state to SI_ST_CLO without calling
si_release_endpoint(), resulting in a possibly missing ->release() for
the applet.
It should affect all other applets as well (eg: lua, spoe, peers) and
should carefully be backported to stable branches after some observation
period.
When the stats code was moved to an applet, it wasn't completely
cleaned of its usage of the HTTP transaction and it used to store
the HTTP status in txn->status and to set the HTTP request date to
<now> from within the applet. This is totally wrong because the
applet is seen as a server from the HTTP engine, which parses its
response, so the http_txn must not be touched there.
This was made visible by the cache which would always exhibit a
negative TR log, indicating that nowhere in the code we took care of
setting s->logs.tv_request while the code above used to continue to
hide this. Another side effect of this issue is that under load, if
the stats applet call risks to be delayed, the reported t_queue can
appear negative by being below tv_request-tv_accept.
This patch removes the assignment of tv_request and txn->status from
the applet code and instead sets the tv_request if still unset when
connecting to the applet. This ensures that all applets report correct
request timers now.
A recent issue affecting HTTP/2 + redirect + cache has uncovered an old
problem affecting all existing versions regarding the way events are
reported to analysers.
It happens that when an event is reported, analysers see it and may
decide to temporarily pause processing and prevent other analysers from
processing the same event. Then the event may be cleared and upon the
next call to the analysers, some of them will never see it.
This is exactly what happens with CF_READ_NULL if it is received before
the request is processed, like during redirects : the first time, some
analysers see it, pause, then the event may be converted to a SHUTW and
cleared, and on next call, there's nothing to process. In practice it's
hard to get the CF_READ_NULL flag during the request because requests
have CF_READ_DONTWAIT, preventing the read0 from happening. But on
HTTP/2 it's presented along with any incoming request. Also on a TCP
frontend the flag is not set and it's possible to read the NULL before
the request is parsed.
This causes a problem when filters are present because flt_end_analyse
needs to be called to release allocated resources and remove the
CF_FLT_ANALYZE flag. And the loss of this event prevents the analyser
from being called and from removing itself, preventing the connection
from ever ending.
This problem just shows that the event processing needs a serious revamp
after 1.8. In the mean time we can deal with the really problematic case
which is that we *want* to call analysers if CF_SHUTW is set on any side
ad it's the last opportunity to terminate a processing. It may
occasionally result in some analysers being called for nothing in half-
closed situations but it will take care of the issue.
An example of problematic configuration triggering the bug in 1.7 is :
frontend tcp
bind :4445
default_backend http
backend http
redirect location /
compression algo identity
Then submitting requests which immediately close will have for effect
to accumulate streams which will never be freed :
$ printf "GET / HTTP/1.1\r\n\r\n" >/dev/tcp/0/4445
This fix must be backported to 1.7 as well as any version where commit
c0c672a ("BUG/MINOR: http: Fix conditions to clean up a txn and to
handle the next request") was backported. This commit didn't cause the
bug but made it much more likely to happen.
Upon stream instanciation, we used to enable channel auto connect
and auto close to ease TCP processing. But commit 9aaf778 ("MAJOR:
connection : Split struct connection into struct connection and
struct conn_stream.") has revealed that it was a bad idea because
this commit enables reading of the trailing shutdown that may follow
a small requests, resulting in a read and a shutr turned into shutw
before the stream even has a chance to apply the filters. This
causes an issue with impossible situations where the backend stream
interface is still in SI_ST_INI with a closed output, which blocks
some streams for example when performing a redirect with filters
enabled.
Let's change this so that we only enable these two flags if there is
no analyser on the stream. This way process_stream() has a chance to
let the analysers decide whether or not to allow the shutdown event
to be transferred to the other side.
It doesn't seem possible to trigger this issue before 1.8, so for now
it is preferable not to backport this fix.
This macro should be used to declare variables or struct members depending on
the USE_THREAD compile option. It avoids the encapsulation of such declarations
between #ifdef/#endif. It is used to declare all lock variables.
It happens that no single analyser has ever needed to set res.analyse_exp,
so that process_stream() didn't consider it when computing the next task
expiration date. Since Lua actions were introduced in 1.6, this can be
needed on http-response actions for example, so let's ensure it's properly
handled.
Thanks to Nick Dimov for reporting this bug. The fix needs to be
backported to 1.7 and 1.6.
When a write activity is reported on a channel, it is important to keep this
information for the stream because it take part on the analyzers' triggering.
When some data are written, the flag CF_WRITE_PARTIAL is set. It participates to
the task's timeout updates and to the stream's waking. It is also used in
CF_MASK_ANALYSER mask to trigger channels anaylzers. In the past, it was cleared
by process_stream. Because of a bug (fixed in commit 95fad5ba4 ["BUG/MAJOR:
stream-int: don't re-arm recv if send fails"]), It is now cleared before each
send and in stream_int_notify. So it is possible to loss this information when
process_stream is called, preventing analyzers to be called, and possibly
leading to a stalled stream.
Today, this happens in HTTP2 when you call the stat page or when you use the
cache filter. In fact, this happens when the response is sent by an applet. In
HTTP1, everything seems to work as expected.
To fix the problem, we need to make the difference between the write activity
reported to lower layers and the one reported to the stream. So the flag
CF_WRITE_EVENT has been added to notify the stream of the write activity on a
channel. It is set when a send succedded and reset by process_stream. It is also
used in CF_MASK_ANALYSER. finally, it is checked in stream_int_notify to wake up
a stream and in channel_check_timeouts.
This bug is probably present in 1.7 but it seems to have no effect. So for now,
no needs to backport it.
Now we don't remove the session when a stream dies, instead we
detach the stream and let the mux decide to release the connection
and call session_free() instead.
Since multiple streams can share one session attached to one listener,
the listener_release() call must be done in session_free() and not in
stream_free(), otherwise we end up with a negative count in H2.
Now that the mux will take care of closing the client connection at the
right moment, we don't need to close the client connection anymore, and
we just need to close the conn_stream.
At all call places where a conn_stream is in use, we can now use
cs_close() to get rid of a conn_stream and of its underlying connection
if the mux estimates it makes sense. This is what is currently being
done for the pass-through mux.
All the references to connections in the data path from streams and
stream_interfaces were changed to use conn_streams. Most functions named
"something_conn" were renamed to "something_cs" for this. Sometimes the
connection still is what matters (eg during a connection establishment)
and were not always renamed. The change is significant and minimal at the
same time, and was quite thoroughly tested now. As of this patch, all
accesses to the connection from upper layers go through the pass-through
mux.
For HTTP/2 and QUIC, we'll need to deal with multiplexed streams inside
a connection. After quite a long brainstorming, it appears that the
connection interface to the existing streams is appropriate just like
the connection interface to the lower layers. In fact we need to have
the mux layer in the middle of the connection, between the transport
and the data layer.
A mux can exist on two directions/sides. On the inbound direction, it
instanciates new streams from incoming connections, while on the outbound
direction it muxes streams into outgoing connections. The difference is
visible on the mux->init() call : in one case, an upper context is already
known (outgoing connection), and in the other case, the upper context is
not yet known (incoming connection) and will have to be allocated by the
mux. The session doesn't have to create the new streams anymore, as this
is performed by the mux itself.
This patch introduces this and creates a pass-through mux called
"mux_pt" which is used for all new connections and which only
calls the data layer's recv,send,wake() calls. One incoming stream
is immediately created when init() is called on the inbound direction.
There should not be any visible impact.
Note that the connection's mux is purposely not set until the session
is completed so that we don't accidently run with the wrong mux. This
must not cause any issue as the xprt_done_cb function is always called
prior to using mux's recv/send functions.
The stick table API was slightly reworked:
A global spin lock on stick table was added to perform lookup and
insert in a thread safe way. The handling of refcount on entries
is now handled directly by stick tables functions under protection
of this lock and was removed from the code of callers.
The "stktable_store" function is no more externalized and users should
now use "stktable_set_entry" in any case of insertion. This last one performs
a lookup followed by a store if not found. So the code using "stktable_store"
was re-worked.
Lookup, and set_entry functions automatically increase the refcount
of the returned/stored entry.
The function "sticktable_touch" was renamed "sticktable_touch_local"
and is now able to decrease the refcount if last arg is set to true. It
is allowing to release the entry without taking the lock twice.
A new function "sticktable_touch_remote" is now used to insert
entries coming from remote peers at the right place in the update tree.
The code of peer update was re-worked to use this new function.
This function is also able to decrease the refcount if wanted.
The function "stksess_kill" also handle a parameter to decrease
the refcount on the entry.
A read/write lock is added on each entry to protect the data content
updates of the entry.
Now, each proxy contains a lock that must be used when necessary to protect
it. Moreover, all proxy's counters are now updated using atomic operations.
First, we use atomic operations to update jobs/totalconn/actconn variables,
listener's nbconn variable and listener's counters. Then we add a lock on
listeners to protect access to their information. And finally, listener queues
(global and per proxy) are also protected by a lock. Here, because access to
these queues are unusal, we use the same lock for all queues instead of a global
one for the global queue and a lock per proxy for others.
2 global locks have been added to protect, respectively, the run queue and the
wait queue. And a process mask has been added on each task. Like for FDs, this
mask is used to know which threads are allowed to process a task.
For many tasks, all threads are granted. And this must be your first intension
when you create a new task, else you have a good reason to make a task sticky on
some threads. This is then the responsibility to the process callback to lock
what have to be locked in the task context.
Nevertheless, all tasks linked to a session must be sticky on the thread
creating the session. It is important that I/O handlers processing session FDs
and these tasks run on the same thread to avoid conflicts.
Calls to build_logline() are audited in order to use dynamic trash buffers
allocated by alloc_trash_chunk() instead of global trash buffers.
This is similar to commits 07a0fec ("BUG/MEDIUM: http: Prevent
replace-header from overwriting a buffer") and 0d94576 ("BUG/MEDIUM: http:
prevent redirect from overwriting a buffer").
This patch should be backported in 1.7, 1.6 and 1.5. It relies on commit
b686afd ("MINOR: chunks: implement a simple dynamic allocator for trash
buffers") for the trash allocator, which has to be backported as well.
These flags are not exactly for the data layer, they instead indicate
what is expected from the transport layer. Since we're going to split
the connection between the transport and the data layers to insert a
mux layer, it's important to have a clear idea of what each layer does.
All function conn_data_* used to manipulate these flags were renamed to
conn_xprt_*.
Certain types and enums are very specific to the HTTP/1 parser, and we'll
need to share them with the HTTP/2 to HTTP/1 translation code. Let's move
them to h1.c/h1.h. Those with very few occurrences or only used locally
were renamed to explicitly mention the relevant HTTP version :
enum ht_state -> h1_state.
http_msg_state_str -> h1_msg_state_str
HTTP_FLG_* -> H1_FLG_*
http_char_classes -> h1_char_classes
Others like HTTP_IS_*, HTTP_MSG_* are left to be done later.
For HTTP/2 we'll need some buffer-only equivalent functions to some of
the ones applying to channels and still squatting the bi_* / bo_*
namespace. Since these names have kept being misleading for quite some
time now and are really getting annoying, it's time to rename them. This
commit will use "ci/co" as the prefix (for "channel in", "channel out")
instead of "bi/bo". The following ones were renamed :
bi_getblk_nc, bi_getline_nc, bi_putblk, bi_putchr,
bo_getblk, bo_getblk_nc, bo_getline, bo_getline_nc, bo_inject,
bi_putchk, bi_putstr, bo_getchr, bo_skip, bi_swpbuf
Commit bcb86ab ("MINOR: session: add a streams field to the session
struct") added this list of streams that is not needed anymore. Let's
get rid of it now.
When
1) HAProxy configured to enable splice on both directions
2) After some high load, there are 2 input channels with their socket buffer
being non-empty and pipe being full at the same time, sitting in `fd_cache`
without any other fds.
The 2 channels will repeatedly be stopped for receiving (pipe full) and waken
for receiving (data in socket), thus getting out and in of `fd_cache`, making
their fd swapping location in `fd_cache`.
There is a `if (entry < fd_cache_num && fd_cache[entry] != fd) continue;`
statement in `fd_process_cached_events` to prevent frequent polling, but since
the only 2 fds are constantly swapping location, `fd_cache[entry] != fd` will
always hold true, thus HAProxy can't make any progress.
The root cause of the issue is dual :
- there is a single fd_cache, for next events and for the ones being
processed, while using two distinct arrays would avoid the problem.
- the write side of the stream interface wakes the read side up even
when it couldn't write, and this one really is a bug.
Due to CF_WRITE_PARTIAL not being cleared during fast forwarding, a failed
send() attempt will still cause ->chk_rcv() to be called on the other side,
re-creating an entry for its connection fd in the cache, causing the same
sequence to be repeated indefinitely without any opportunity to make progress.
CF_WRITE_PARTIAL used to be used for what is present in these tests : check
if a recent write operation was performed. It's part of the CF_WRITE_ACTIVITY
set and is tested to check if timeouts need to be updated. It's also used to
detect if a failed connect() may be retried.
What this patch does is use CF_WROTE_DATA() to check for a successful write
for connection retransmits, and to clear CF_WRITE_PARTIAL before preparing
to send in stream_int_notify(). This way, timeouts are still updated each
time a write succeeds, but chk_rcv() won't be called anymore after a failed
write.
It seems the fix is required all the way down to 1.5.
Without this patch, the only workaround at this point is to disable splicing
in at least one direction. Strictly speaking, splicing is not absolutely
required, as regular forwarding could theorically cause the issue to happen
if the timing is appropriate, but in practice it appears impossible to
reproduce it without splicing, and even with splicing it may vary.
The following config manages to reproduce it after a few attempts (haproxy
going 100% CPU and having to be killed) :
global
maxpipes 50000
maxconn 10000
listen srv1
option splice-request
option splice-response
bind :8001
server s1 127.0.0.1:8002
server$ tcploop 8002 L N20 A R10 S1000000 R10 S1000000 R10 S1000000 R10 S1000000 R10 S1000000
client$ tcploop 8001 N20 C T S1000000 R10 J
There are several places where we see feconn++, feconn--, totalconn++ and
an increment on the frontend's number of connections and connection rate.
This is done exactly once per session in each direction, so better take
care of this counter in the session and simplify the callers. At least it
ensures a better symmetry. It also ensures consistency as till now the
lua/spoe/peers frontend didn't have these counters properly set, which can
be useful at least for troubleshooting.
Instead of duplicating some sensitive listener-specific code in the
session and in the stream code, let's call listener_release() when
releasing a connection attached to a listener.
Each user of a session increments/decrements the jobs variable at its
own place, resulting in a real mess and inconsistencies between them.
Let's have session_new() increment jobs and session_free() decrement
it.
The server state and weight was reworked to handle
"pending" values updated by checks/CLI/LUA/agent.
These values are commited to be propagated to the
LB stack.
In further dev related to multi-thread, the commit
will be handled into a sync point.
Pending values are named using the prefix 'next_'
Current values used by the LB stack are named 'cur_'
Unfortunatly, a regression bug was introduced in the commit 1486b0ab
("BUG/MEDIUM: http: Switch HTTP responses in TUNNEL mode when body length is
undefined"). HTTP responses with undefined body length are blocked until timeout
when the compression is enabled. This bug was fixed in commit 69744d92
("BUG/MEDIUM: http: Fix blocked HTTP/1.0 responses when compression is
enabled").
The bug is still the same. We do not forward response data because we are
waiting for the synchronization between the HTTP request and the response.
To fix the bug, conditions to infinitly forward channel data has been slightly
relaxed. Now, it is done if there is no more analyzer registered on the channel
or if _FLT_END analyzer is still there but without the flag CF_FLT_ANALYZE. This
last condition is only possible when a channel is waiting the end of the other
side. So, fundamentally, it means that no one is analyzing the channel
anymore. This is a transitional state during a sync phase.
This patch must be backported in 1.7.
Currently a task is allocated in session_new() and serves two purposes :
- either the handshake is complete and it is offered to the stream via
the second arg of stream_new()
- or the handshake is not complete and it's diverted to be used as a
timeout handler for the embryonic session and repurposed once we land
into conn_complete_session()
Furthermore, the task's process() function was taken from the listener's
handler in conn_complete_session() prior to being replaced by a call to
stream_new(). This will become a serious mess with the mux.
Since it's impossible to have a stream without a task, this patch removes
the second arg from stream_new() and make this function allocate its own
task. In session_accept_fd(), we now only allocate the task if needed for
the embryonic session and delete it later.
Till now connections used to rely exclusively on file descriptors. It
was planned in the past that alternative solutions would be implemented,
leading to member "union t" presenting sock.fd only for now.
With QUIC, the connection will need to continue to exist but will not
rely on a file descriptor but a connection ID.
So this patch introduces a "connection handle" which is either a file
descriptor or a connection ID, to replace the existing "union t". We've
now removed the intermediate "struct sock" which was never used. There
is no functional change at all, though the struct connection was inflated
by 32 bits on 64-bit platforms due to alignment.
Commit 4850e51 ("BUG/MAJOR: lua: Do not force the HTTP analysers in
use-services") fixed a bug in how services are used in Lua, but this
fix broke the ability for Lua services to support keep-alive.
The cause is that we branch to a service while we have not yet set the
body analysers on the request nor the response, and when we start to
deal with the response we don't have any request analyser anymore. This
leads the response forward engine to detect an error and abort. It's
very likely that this also causes some random truncation of responses
though this has not been observed during the tests.
The root cause is not the Lua part in fact, the commit above was correct,
the problem is the implementation of the "use-service" action. When done
in an HTTP request, it bypasses the load balancing decisions and the
connect() phase. These ones are normally the ones preparing the request
analysers to parse the body when keep-alive is set. This should be dealt
with in the main process_use_service() function in fact.
That's what this patch does. If process_use_service() is called from the
http-request rule set, it enables the XFER_BODY analyser on the request
(since the same is always set on the response). Note that it's exactly
what is being done on the stats page which properly supports keep-alive
and compression.
This fix must be backported to 1.7 and 1.6 as the breakage appeared in 1.6.3.
Now each stream is added to the session's list of streams, so that it
will be possible to know all the streams belonging to a session, and
to know if any stream is still attached to a sessoin.
stream_free() used to close the front connection by using s->sess->origin,
instead of using s->si[0].end. This is very visible in HTTP/2 where the
front connection is abusively closed and causes all sort of issues including
crashes caused by double closes due to the same origin being referenced many
times.
It's also suspected that it may have caused some of the early issues met
during the Lua development.
It's uncertain whether stable branches are affected. It might be worth
backporting it once it has been confirmed not to create new impacts.
Commits 2ab8867 ("MINOR: ssl: compare server certificate names to the
SNI on outgoing connections") and 96c7b8d ("BUG/MINOR: ssl: Fix check
against SNI during server certificate verification") made it possible
to check that the server's certificate matches the name presented in
the SNI field. While it solves a class of problems, it opens another
one which is that by failing such a connection, we'll retry it and put
more load on the server. It can be a real problem if a user can trigger
this issue, which is what will very often happen when the SNI is forwarded
from the client to the server.
This patch solves this by detecting that this very specific hostname
verification failed and that the hostname was provided using SNI, and
then it simply disables retries and the failure is immediate.
At the time of writing this patch, the previous patches were not backported
(yet), so no backport is needed for this one unless the aforementionned
patches are backported as well. This patch requires previous patches
"BUG/MINOR: ssl: make use of the name in SNI before verifyhost" and
"MINOR: ssl: add a new error code for wrong server certificates".
This flag can be set on a channel to pretend there is activity on it. This is a
way to wake-up the corresponding stream and evaluate stream analyzers on the
channel. It is correctly handled on both channels but removed only on the
request channel.
This patch is flagged as a bug but for now, CF_WAKE_ONCE is never set on the
response channel.
In order to authorize call of appctx_wakeup on running task:
- from within the task handler itself.
- in futur, from another thread.
The appctx is considered paused as default after running the handler.
The handler should explicitly call appctx_wakeup to be re-called.
When the appctx_free is called on a running handler. The real
free is postponed at the end of the handler process.
In order to implement hot-pluggable applets like we'll need for HTTP/2
which will speak a different protocol than the expected one, it will be
mandatory to be able to clear all analysers from the request and response
channel and/or to keep only the ones the applet initializer installed.
Unfortunately for now in sess_establish() we systematically place a number
of analysers inherited from the frontend, backend and some hard-coded ones.
This patch reuses the now unused SF_TUNNEL flag on the stream to indicate
we're dealing with a tunnel and don't want to add more analysers anymore.
It will be usable to install such a specific applet.
Ideally over the long term it might be nice to be able to set the mode on
the stream instead of the proxy so that we can decide to change a stream's
mode (eg: TCP, HTTP, HTTP/2) at run time. But it would require many more
changes for a gain which is not yet obvious.
It doesn't make sense that stream_new() doesn't sets the target nor
analysers and that the caller has to do it even if it doesn't know
about streams (eg: in session_accept_fd()). This causes trouble for
H2 where the applet handling the protocol cannot properly change
these information during its init phase.
Let's ensure it's always set and that the callers don't set it anymore.
Note: peers and lua don't use analysers and that's properly handled.
The task_wakeup was called on stream_new, but the task/stream
wasn't fully initialized yet. The task_wakeup must be called
explicitly by the caller once the task/stream is initialized.
To finish a HTTP transaction and to start the new one, we check, among other
things, that there is enough space in the reponse buffer to eventually inject a
message during the parsing of the next request. Because these messages can reach
the maximum buffers size, it is mandatory to have an empty response
buffer. Remaining input data are trimmed during the txn cleanup (in
http_reset_txn), so we just need to check that the output data were flushed.
The current implementation depends on channel_congested, which does check the
reserved area is available. That's not of course good enough. There are other
tests on the reponse buffer is http_wait_for_request. But conditions to move on
are almost the same. So, we can imagine some scenarii where some output data
remaining in the reponse buffer during the request parsing prevent any messages
injection.
To fix this bug, we just wait that output data were flushed before cleaning up
the HTTP txn (ie. s->res.buf->o == 0). In addition, in http_reset_txn we realign
the response buffer (note the buffer is empty at this step).
Thanks to this changes, there is no more need to set CF_EXPECT_MORE on the
response channel in http_end_txn_clean_session. And more important, there is no
more need to check the response buffer state in http_wait_for_request. This
remove a workaround on response analysers to handle HTTP pipelining.
This patch can be backported in 1.7, 1.6 and 1.5.
A tcp half connection can cause 100% CPU on expiration.
First reproduced with this haproxy configuration :
global
tune.bufsize 10485760
defaults
timeout server-fin 90s
timeout client-fin 90s
backend node2
mode tcp
timeout server 900s
timeout connect 10s
server def 127.0.0.1:3333
frontend fe_api
mode tcp
timeout client 900s
bind :1990
use_backend node2
Ie timeout server-fin shorter than timeout server, the backend server
sends data, this package is left in the cache of haproxy, the backend
server continue sending fin package, haproxy recv fin package. this
time the session information is as follows:
time the session information is as follows:
0x2373470: proto=tcpv4 src=127.0.0.1:39513 fe=fe_api be=node2
srv=def ts=08 age=1s calls=3 rq[f=848000h,i=0,an=00h,rx=14m58s,wx=,ax=]
rp[f=8004c020h,i=0,an=00h,rx=,wx=14m58s,ax=] s0=[7,0h,fd=6,ex=]
s1=[7,18h,fd=7,ex=] exp=14m58s
rp has set the CF_SHUTR state, next, the client sends the fin package,
session information is as follows:
0x2373470: proto=tcpv4 src=127.0.0.1:39513 fe=fe_api be=node2
srv=def ts=08 age=38s calls=4 rq[f=84a020h,i=0,an=00h,rx=,wx=,ax=]
rp[f=8004c020h,i=0,an=00h,rx=1m11s,wx=14m21s,ax=] s0=[7,0h,fd=6,ex=]
s1=[9,10h,fd=7,ex=] exp=1m11s
After waiting 90s, session information is as follows:
0x2373470: proto=tcpv4 src=127.0.0.1:39513 fe=fe_api be=node2
srv=def ts=04 age=4m11s calls=718074391 rq[f=84a020h,i=0,an=00h,rx=,wx=,ax=]
rp[f=8004c020h,i=0,an=00h,rx=?,wx=10m49s,ax=] s0=[7,0h,fd=6,ex=]
s1=[9,10h,fd=7,ex=] exp=? run(nice=0)
cpu information:
6899 root 20 0 112224 21408 4260 R 100.0 0.7 3:04.96 haproxy
Buffering is set to ensure that there is data in the haproxy buffer, and haproxy
can receive the fin package, set the CF_SHUTR flag, If the CF_SHUTR flag has been
set, The following code does not clear the timeout message, causing cpu 100%:
stream.c:process_stream:
if (unlikely((res->flags & (CF_SHUTR|CF_READ_TIMEOUT)) == CF_READ_TIMEOUT)) {
if (si_b->flags & SI_FL_NOHALF)
si_b->flags |= SI_FL_NOLINGER;
si_shutr(si_b);
}
If you have closed the read, set the read timeout does not make sense.
With or without cf_shutr, read timeout is set:
if (tick_isset(s->be->timeout.serverfin)) {
res->rto = s->be->timeout.serverfin;
res->rex = tick_add(now_ms, res->rto);
}
After discussion on the mailing list, setting half-closed timeouts the
hard way here doesn't make sense. They should be set only at the moment
the shutdown() is performed. It will also solve a special case which was
already reported of some half-closed timeouts not working when the shutw()
is performed directly at the stream-interface layer (no analyser involved).
Since the stream interface layer cannot know the timeout values, we'll have
to store them directly in the stream interface so that they are used upon
shutw(). This patch does this, fixing the problem.
An easier reproducer to validate the fix is to keep the huge buffer and
shorten all timeouts, then call it under tcploop server and client, and
wait 3 seconds to see haproxy run at 100% CPU :
global
tune.bufsize 10485760
listen px
bind :1990
timeout client 90s
timeout server 90s
timeout connect 1s
timeout server-fin 3s
timeout client-fin 3s
server def 127.0.0.1:3333
$ tcploop 3333 L W N20 A P100 F P10000 &
$ tcploop 127.0.0.1:1990 C S10000000 F
It is important to defined analyzers (AN_REQ_* and AN_RES_*) in the same order
they are evaluated in process_stream. This order is really important because
during analyzers evaluation, we run them in the order of the lower bit to the
higher one. This way, when an analyzer adds/removes another one during its
evaluation, we know if it is located before or after it. So, when it adds an
analyzer which is located before it, we can switch to it immediately, even if it
has already been called once but removed since.
With the time, and introduction of new analyzers, this order was broken up. the
main problems come from the filter analyzers. We used values not related with
their evaluation order. Furthermore, we used same values for request and response
analyzers.
So, to fix the bug, filter analyzers have been splitted in 2 distinct lists to
have different analyzers for the request channel than those for the response
channel. And of course, we have moved them to the right place.
Some other analyzers have been reordered to respect the evaluation order:
* AN_REQ_HTTP_TARPIT has been moved just before AN_REQ_SRV_RULES
* AN_REQ_PRST_RDP_COOKIE has been moved just before AN_REQ_STICKING_RULES
* AN_RES_STORE_RULES has been moved just after AN_RES_WAIT_HTTP
Note today we have 29 analyzers, all stored into a 32 bits bitfield. So we can
still add 4 more analyzers before having a problem. A good way to fend off the
problem for a while could be to have a different bitfield for request and
response analyzers.
[wt: all of this must be backported to 1.7, and part of it must be backported
to 1.6 and 1.5]
When an entity tries to get a buffer, if it cannot be allocted, for example
because the number of buffers which may be allocated per process is limited,
this entity is added in a list (called <buffer_wq>) and wait for an available
buffer.
Historically, the <buffer_wq> list was logically attached to streams because it
were the only entities likely to be added in it. Now, applets can also be
waiting for a free buffer. And with filters, we could imagine to have more other
entities waiting for a buffer. So it make sense to have a generic list.
Anyway, with the current design there is a bug. When an applet failed to get a
buffer, it will wait. But we add the stream attached to the applet in
<buffer_wq>, instead of the applet itself. So when a buffer is available, we
wake up the stream and not the waiting applet. So, it is possible to have
waiting applets and never awakened.
So, now, <buffer_wq> is independant from streams. And we really add the waiting
entity in <buffer_wq>. To be generic, the entity is responsible to define the
callback used to awaken it.
In addition, applets will still request an input buffer when they become
active. But they will not be sleeped anymore if no buffer are available. So this
is the responsibility to the applet I/O handler to check if this buffer is
allocated or not. This way, an applet can decide if this buffer is required or
not and can do additional processing if not.
[wt: backport to 1.7 and 1.6]
A stream can be awakened for different reasons. During its processing, it can be
early stopped if no buffer is available. In this situation, the reason why the
stream was awakened is lost, because we rely on the task state, which is reset
after each processing loop.
In many cases, that's not a big deal. But it can be useful to accumulate the
task states if the stream processing is interrupted, especially if some filters
need to be called.
To be clearer, here is an simple example:
1) A stream is awakened with the reason TASK_WOKEN_MSG.
2) Because no buffer is available, the processing is interrupted, the stream
is back to sleep. And the task state is reset.
3) Some buffers become available, so the stream is awakened with the reason
TASK_WOKEN_RES. At this step, the previous reason (TASK_WOKEN_MSG) is lost.
Now, the task states are saved for a stream and reset only when the stream
processing is not interrupted. The correspoing bitfield represents the pending
events for a stream. And we use this one instead of the task state during the
stream processing.
Note that TASK_WOKEN_TIMER and TASK_WOKEN_RES are always removed because these
events are always handled during the stream processing.
[wt: backport to 1.7 and 1.6]
<run_queue> is used to track the number of task in the run queue and
<run_queue_cur> is a copy used for the reporting purpose. These counters has
been renamed, respectively, <tasks_run_queue> and <tasks_run_queue_cur>. So the
naming is consistent between tasks and applets.
[wt: needed for next fixes, backport to 1.7 and 1.6]
These 2 analyzers are responsible of the data forwarding in, respectively, HTTP
mode and TCP mode. Now, the analyzer responsible of the HTTP data forwarding is
called before the one responsible of the TCP data forwarding. This will allow
the filtering of tunneled data in HTTP.
[wt: backport desired in 1.7 - no impact right now but may impact the ability
to backport future fixes]
Historically we used to have the stick counters processing put into
session.c which became stream.c. But a big part of it is now in
stick-table.c (eg: converters) but despite this we still have all
the sample fetch functions in stream.c
These parts do not depend on the stream anymore, so let's move the
remaining chunks to stick-table.c and have cleaner files.
What remains in stream.c is everything needed to attach/detach
trackers to the stream and to update the counters while the stream
is being processed.
There's no more reason to keep tcp rules processing inside proto_tcp.c
given that there is nothing in common there except these 3 letters : tcp.
The tcp rules are in fact connection, session and content processing rules.
Let's move them to "tcp-rules" and let them live their life there.
It could be argued that it's between server, stream and session but
at least due to the fact that it operates on streams, its best place
is in stream.c.
Move 'show sess' CLI functions to stream.c and use the cli keyword API
to register it on the CLI.
[wt: the choice of stream vs session makes sense because since 1.6 these
really are streams that we're dumping and not sessions anymore]
proto/dumpstats.h has been split in 4 files:
* proto/cli.h contains protypes for the CLI
* proto/stats.h contains prototypes for the stats
* types/cli.h contains definition for the CLI
* types/stats.h contains definition for the stats
The bug described in commit 568743a ("BUG/MEDIUM: stream-int: completely
detach connection on connect error") was not a stream-interface layer bug
but a connection layer bug. There was exactly one place in the code where
we could change a file descriptor's status without first checking whether
it is valid or not, it was in conn_stop_polling(). This one is called when
the polling status is changed after an update, and calls fd_stop_both even
if we had already closed the file descriptor :
1479388298.484240 ->->->->-> conn_fd_handler > conn_cond_update_polling
1479388298.484240 ->->->->->-> conn_cond_update_polling > conn_stop_polling
1479388298.484241 ->->->->->->-> conn_stop_polling > conn_ctrl_ready
1479388298.484241 conn_stop_polling < conn_ctrl_ready
1479388298.484241 ->->->->->->-> conn_stop_polling > fd_stop_both
1479388298.484242 ->->->->->->->-> fd_stop_both > fd_update_cache
1479388298.484242 ->->->->->->->->-> fd_update_cache > fd_release_cache_entry
1479388298.484242 fd_update_cache < fd_release_cache_entry
1479388298.484243 fd_stop_both < fd_update_cache
1479388298.484243 conn_stop_polling < fd_stop_both
1479388298.484243 conn_cond_update_polling < conn_stop_polling
1479388298.484243 conn_fd_handler < conn_cond_update_polling
The problem with the previous fix above is that it break the http_proxy mode
and possibly even some Lua parts and peers to a certain extent ; all outgoing
connections where the target address is initially copied into the outgoing
connection which experience a retry would use a random outgoing address after
the retry because closing and detaching the connection causes the target
address to be lost. This was attempted to be addressed by commit 0857d7a
("BUG/MAJOR: stream: properly mark the server address as unset on connect
retry") but it used to only solve the most visible effect and not the root
cause.
Prior to this fix, it was possible to cause this config to keep CLOSE_WAIT
for as long as it takes to expire a client or server timeout (note the
missing client timeout) :
listen test
mode http
bind :8002
server s1 127.0.0.1:8001
$ tcploop 8001 L0 W N20 A R P100 S:"HTTP/1.1 200 OK\r\nContent-length: 0\r\n\r\n" &
$ tcploop 8002 N200 C T W S:"GET / HTTP/1.0\r\n\r\n" O P10000 K
With this patch, these CLOSE_WAIT properly vanish when both processes leave.
This commit reverts the two fixes above and replaces them with the proper
fix in connection.h. It must be backported to 1.6 and 1.5. Thanks to
Robson Roberto Souza Peixoto for providing very detailed traces showing
some obvious inconsistencies leading to finding this bug.
Quite a lot of people have been complaining about option contstats not
working correctly anymore since about 1.4. The reason was that one reason
for the significant performance boost between 1.3 and 1.4 was the ability
to forward data between a server and a client without waking up the stream
manager. And we couldn't afford to force sessions to constantly wake it
up given that most of the people interested in contstats are also those
interested in high performance transmission.
An idea was experimented with in the past, consisting in limiting the
amount of transmissible data before waking it up, but it was not usable
on slow connections (eg: FTP over modem lines, RDP, SSH) as stats would
be updated too rarely if at all, so that idea was dropped.
During a discussion today another idea came up : ensure that stats are
updated once in a while, since it's the only thing that matters. It
happens that we have the request channel's analyse_exp timeout that is
used to wake the stream up after a configured delay, and that by
definition this timeout is not used when there's no more analyser
(otherwise the stream would wake up and the stats would be updated).
Thus here the idea is to reuse this timeout when there's no analyser
and set it to now+5 seconds so that a stream wakes up at least once
every 5 seconds to update its stats. It should be short enough to
provide smooth traffic graphs and to allow to debug outputs of "show
sess" more easily without inflicting too much load even for very large
number of concurrent connections.
This patch is simple enough and safe enough to be backportable to 1.6
if there is some demand.
Bartosz Koninski reported that recent commit 568743a ("BUG/MEDIUM: stream-int:
completely detach connection on connect error") introduced a nasty side effect
during the connection retry, causing a reconnect attempt to be performed on a
random address, possibly another server from another backend as shown in his
reproducer.
The real reason is in fact that by calling si_release_endpoint() after a
failed connect attempt and not clearing the SN_ADDR_SET flag on the
stream, we indicate that we continue to trust the connection's current
address. So upon next connect attempt, a connection is picked from the
pool and the last target address is reused. It is not very easy to
reproduce it 100% reliably out of production traffic, but it's easier to
see when haproxy is started with -dM to force the pools to be filled with
junk, and where subsequent connections are not even attempted due to their
family being incorrect.
The correct fix consists in clearing the SN_ADDR_SET flag after calling
si_release_endpoint() during a retry. But it outlines a deeper issue which
is in fact that the target address is *stored* in the connection and its
validity is stored in the stream. Until we have a better solution to store
target addresses, it would be better to rely on the connection flags
(CO_FL_ADDR_TO_SET) for this purpose. But it also outlines the fact that
the same issue still exists in Lua sockets and in idle sockets, which
fortunately are not affected by this issue.
Thanks to Bartosz for providing all the elements needed to understand the
problem.
This fix needs to be backported to 1.6 and 1.5.
Tq is the time between the instant the connection is accepted and a
complete valid request is received. This time includes the handshake
(SSL / Proxy-Protocol), the idle when the browser does preconnect and
the request reception.
This patch decomposes %Tq in 3 measurements names %Th, %Ti, and %TR
which returns respectively the handshake time, the idle time and the
duration of valid request reception. It also adds %Ta which reports
the request's active time, which is the total time without %Th nor %Ti.
It replaces %Tt as the total time, reporting accurate measurements for
HTTP persistent connections.
%Th is avalaible for TCP and HTTP sessions, %Ti, %TR and %Ta are only
avalaible for HTTP connections.
In addition to this, we have new timestamps %tr, %trg and %trl, which
log the date of start of receipt of the request, respectively in the
default format, in GMT time and in local time (by analogy with %t, %T
and %Tl). All of them are obviously only available for HTTP. These values
are more relevant as they more accurately represent the request date
without being skewed by a browser's preconnect nor a keep-alive idle
time.
The HTTP log format and the CLF log format have been modified to
use %tr, %TR, and %Ta respectively instead of %t, %Tq and %Tt. This
way the default log formats now produce the expected output for users
who don't want to manually fiddle with the log-format directive.
Example with the following log-format :
log-format "%ci:%cp [%tr] %ft %b/%s h=%Th/i=%Ti/R=%TR/w=%Tw/c=%Tc/r=%Tr/a=%Ta/t=%Tt %ST %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r"
The request was sent by hand using "openssl s_client -connect" :
Aug 23 14:43:20 haproxy[25446]: 127.0.0.1:45636 [23/Aug/2016:14:43:20.221] test~ test/test h=6/i=2375/R=261/w=0/c=1/r=0/a=262/t=2643 200 145 - - ---- 1/1/0/0/0 0/0 "GET / HTTP/1.1"
=> 6 ms of SSL handshake, 2375 waiting before sending the first char (in
fact the time to type the first line), 261 ms before the end of the request,
no time spent in queue, 1 ms spend connecting to the server, immediate
response, total active time for this request = 262ms. Total time from accept
to close : 2643 ms.
The timing now decomposes like this :
first request 2nd request
|<-------------------------------->|<-------------- ...
t tr t tr ...
---|----|----|----|----|----|----|----|----|--
: Th Ti TR Tw Tc Tr Td : Ti ...
:<---- Tq ---->: :
:<-------------- Tt -------------->:
:<--------- Ta --------->:
Bryan Talbot reported a very interesting bug. The sc_trackers() sample
fetch seems to have escaped the sanitization that was performed during 1.5
to ensure all dereferences of stkctr_entry() were safe. Here if a tacker
is set on a backend and is then checked against a different backend where
the entry doesn't exist, stkctr_entry() returns NULL and this is dereferenced
to retrieve the ref count. Thanks to Bryan for his detailed bug report featuring
a working config and reproducer.
This fix must be backported to 1.6 and 1.5.
Tim Butler reported a troubling issue affecting all versions since 1.5.
When a connection error occurs and a retry is performed on the same server,
the server connection first goes into the turn-around state (SI_ST_TAR) for
one second. During this time, the client may speak and try to push some
data. The tests in place confirm that the stream interface is in a state
<= SI_ST_EST and that a connection exists, so all ingredients are present
to try to perform a send() to forward data. The send() cannot be performed
since the connection's control layer is marked as not ready, but the
polling flags are changed, and due to the remaining error flag present
on the connection, the polling on the FD is disabled in both directions.
But if this FD was reassigned to another connection in the mean time, it
is this FD which is disabled, and it causes a timeout on another connection.
A configuration allowing to reproduce the issue looks like this :
listen test
bind :8003
server s1 127.0.0.1:8001 # this one should be closed
listen victim
bind :8002
server s1 127.0.0.1:8000 # this one should respond slowly (~50ms)
Two parallel injections should be run with short time-outs (100ms). After
some time, some dead connections will appear in listener "victim" due to
their I/Os being disabled by some of the failed transfers on "test"
instance. These ones will only be flushed on time out. A dead connection
looks like this :
> show sess 0x7dcb70
0x7dcb70: [07/Aug/2016:08:58:40.120151] id=3771 proto=tcpv4 source=127.0.0.1:34682
flags=0xce, conn_retries=3, srv_conn=0x7da020, pend_pos=(nil)
frontend=victim (id=3 mode=tcp), listener=? (id=1) addr=127.0.0.1:8002
backend=victim (id=3 mode=tcp) addr=127.0.0.1:37736
server=s1 (id=1) addr=127.0.0.1:8000
task=0x7dcaf8 (state=0x08 nice=0 calls=2 exp=<NEVER> age=30s)
si[0]=0x7dcd68 (state=EST flags=0x08 endp0=CONN:0x7e2410 exp=<NEVER>, et=0x000)
si[1]=0x7dcd88 (state=EST flags=0x18 endp1=CONN:0x7e0cd0 exp=<NEVER>, et=0x000)
co0=0x7e2410 ctrl=tcpv4 xprt=RAW data=STRM target=LISTENER:0x7d9ea8
flags=0x0020b306 fd=122 fd.state=25 fd.cache=0 updt=0
co1=0x7e0cd0 ctrl=tcpv4 xprt=RAW data=STRM target=SERVER:0x7da020
flags=0x0020b306 fd=93 fd.state=20 fd.cache=0 updt=0
req=0x7dcb80 (f=0x848000 an=0x0 pipe=0 tofwd=-1 total=129)
an_exp=<NEVER> rex=<NEVER> wex=<NEVER>
buf=0x7893c0 data=0x7893d4 o=0 p=0 req.next=0 i=0 size=0
res=0x7dcbc0 (f=0x80008000 an=0x0 pipe=0 tofwd=-1 total=0)
an_exp=<NEVER> rex=<NEVER> wex=<NEVER>
buf=0x7893c0 data=0x7893d4 o=0 p=0 rsp.next=0 i=0 size=0
The solution against this issue is to completely detach the connection upon
error instead of only performing a forced close.
This fix should be backported to 1.6 and 1.5.
Special thanks to Tim who did all the troubleshooting work and provided a
lot of traces allowing to find the root cause of this problem.
When a stick table is tracked, and another one is used later on the
configuration, a segfault occurs.
The function "smp_create_src_stkctr" can return a NULL value, and
its value is not tested, so one other function try to dereference
a NULL pointer. This patch just add a verification of the NULL
pointer.
The problem is reproduced with this configuration:
listen www
mode http
bind :12345
tcp-request content track-sc0 src table IPv4
http-request allow if { sc0_inc_gpc0(IPv6) gt 0 }
server dummy 127.0.0.1:80
backend IPv4
stick-table type ip size 10 expire 60s store gpc0
backend IPv6
stick-table type ipv6 size 10 expire 60s store gpc0
Thank to kabefuna@gmail.com for the bug report.
This patch must be backported in the 1.6 and 1.5 version.
'channel_analyze' callback has been removed. Now, there are 2 callbacks to
surround calls to analyzers:
* channel_pre_analyze: Called BEFORE all filterable analyzers. it can be
called many times for the same analyzer, once at each loop until the
analyzer finishes its processing. This callback is resumable, it returns a
negative value if an error occurs, 0 if it needs to wait, any other value
otherwise.
* channel_post_analyze: Called AFTER all filterable analyzers. Here, AFTER
means when an analyzer finishes its processing. This callback is NOT
resumable, it returns a negative value if an error occurs, any other value
otherwise.
Pre and post analyzer callbacks are not automatically called. 'pre_analyzers'
and 'post_analyzers' bit fields in the filter structure must be set to the right
value using AN_* flags (see include/types/channel.h).
The flag AN_RES_ALL has been added (AN_REQ_ALL already exists) to ease the life
of filter developers. AN_REQ_ALL and AN_RES_ALL include all filterable
analyzers.
Now, to call an analyzer in 'process_stream' function, we should use
FLT_ANALAYZE or ANALYZE macros, depending if this is a filterable analyzer or
not.
In 1.4-dev3, commit 31971e5 ("[MEDIUM] add support for infinite forwarding")
made it possible to configure the lower layer to forward data indefinitely
by setting the forward size to CHN_INFINITE_FORWARD (4GB-1). By then larger
chunk sizes were not supported so there was no confusion in the usage of the
function.
Since 1.5 we support 64-bit content-lengths and chunk sizes and the function
has grown to support 64-bit arguments, though it still limits a single pass
to 32-bit quantities (what fit in the channel's to_forward field). The issue
now becomes that a 4GB-1 content-length can be confused with infinite
forwarding (in fact it's 4GB-1+what was already in the buffer). It causes a
visible effect when transferring this exact size because the transfer rate
is lower than with other sizes due in part to the disabling of the Nagle
algorithm on the sendto() call.
In theory with keep-alive it should prevent a second request from being
processed after such a transfer, but since the analysers are still present,
the forwarding analyser properly counts down the remaining size to transfer
and ultimately the transaction gets correctly reset so there is no visible
effect.
Since the root cause of the issue is an API problem (lack of distinction
between a real valid length and a magic value), this patch modifies the API
to have a new dedicated function called channel_forward_forever() to program
a permanent forwarding. The existing function __channel_forward() was modified
to properly take care of the requested sizes and ensure it 1) never overflows
and 2) never reaches CHN_INFINITE_FORWARD by accident.
It is worth noting that the function used to have a bug causing a 2GB
forward to be scheduled if it was called with less data than what is present
in buf->i. Fortunately this bug couldn't be triggered with existing code.
This fix should be backported to 1.6 and 1.5. While it also theorically
affects 1.4, it's better not to backport it there, as the risk of breaking
large object transfers due to significant API differences is high, compared
to the fact that the largest supported objects (4GB-1) are just slower to
transfer.
The previous buffer space bug has revealed an issue causing some stalled
connections to remain orphaned forever, preventing an old process from
dying. The issue is that once in a while a task may be woken up because
a disabled expiration timer has been reached despite no timeout being
reached. In this case we exit very early but the SI_FL_DONT_WAKE flag
wasn't cleared, resulting in new events not waking the task up. It may
be one of the reasons why a few people have already observed some peers
connections stuck in CLOSE_WAIT state.
This bug was introduced in 1.5-dev13 by commit 798f432 ("OPTIM: session:
don't process the whole session when only timers need a refresh"), so
the fix must be backported to 1.6 and 1.5.
Depending on the path that led to sess_update_stream_int(), it's
possible that we had a read error on the frontend, but that we haven't
checked if we may abort the connection.
This was seen in particular the following setup: tcp mode, with
abortonclose set, frontend using ssl. If the ssl connection had a first
successful read, but the second read failed, we would stil try to open a
connection to the backend, although we had enough information to close
the connection early.
sess_update_stream_int() had some logic to handle that case in the
SI_ST_QUE and SI_ST_TAR, but that was missing in the SI_ST_ASS case.
This patches addresses the issue by verifying the state of the req
channel (and the abortonclose option) right before opening the
connection to the backend, so we have the opportunity to close the
connection there, and factorizes the shared SI_ST_{QUE,TAR,ASS} code.
The sc_* sample fetch can work without the struct strm, because the
tracked counters are also stored in the session. So, this patchs
removes the check for the strm existance.
This bug is recent and was introduced in 1.7-dev2 by commit 6204cd9
("BUG/MAJOR: vars: always retrieve the stream and session from the sample")
This bugfix must be backported in 1.6.
This is the continuation of previous patch called "BUG/MAJOR: samples:
check smp->strm before using it".
It happens that variables may have a session-wide scope, and that their
session is retrieved by dereferencing the stream. But nothing prevents them
from being used from a streamless context such as tcp-request connection,
thus crashing the process. Example :
tcp-request connection accept if { src,set-var(sess.foo) -m found }
In order to fix this, we have to always ensure that variable manipulation
only happens via the sample, which contains the correct owner and context,
and that we never use one from a different source. This results in quite a
large change since a lot of functions are inderctly involved in the call
chain, but the change is easy to follow.
This fix must be backported to 1.6, and requires the last two patches.
Since commit 6879ad3 ("MEDIUM: sample: fill the struct sample with the
session, proxy and stream pointers") merged in 1.6-dev2, the sample
contains the pointer to the stream and sample fetch functions as well
as converters use it heavily.
The problem is that earlier commit 87b0966 ("REORG/MAJOR: session:
rename the "session" entity to "stream"") had split the session and
stream resulting in the possibility for smp->strm to be NULL before
the stream was initialized. This is what happens in tcp-request
connection rulesets, as discovered by Baptiste.
The sample fetch functions must now check that smp->strm is valid
before using it. An alternative could consist in using a dummy stream
with nothing in it to avoid some checks but it would only result in
deferring them to the next step anyway, and making it harder to detect
that a stream is valid or the dummy one.
There is still an issue with variables which requires a complete
independant fix. They use strm->sess to find the session with strm
possibly NULL and passed as an argument. All call places indirectly
use smp->strm to build strm. So the problem is there but the API needs
to be changed to remove this duplicate argument that makes it much
harder to know what pointer to use.
This fix must be backported to 1.6, as well as the next one fixing
variables.
This new analyzer will be called for each HTTP request/response, before the
parsing of the body. It is identified by AN_FLT_HTTP_HDRS.
Special care was taken about the following condition :
* the frontend is a TCP proxy
* filters are defined in the frontend section
* the selected backend is a HTTP proxy
So, this patch explicitly add AN_FLT_HTTP_HDRS analyzer on the request and the
response channels when the backend is a HTTP proxy and when there are filters
attatched on the stream.
This patch simplifies http_request_forward_body and http_response_forward_body
functions.
When no filter is attached to the stream, the CPU footprint due to the calls to
filters_* functions is huge, especially for chunk-encoded messages. Using macros
to check if we have some filters or not is a great improvement.
Furthermore, instead of checking the filter list emptiness, we introduce a flag
to know if filters are attached or not to a stream.
HTTP compression has been rewritten to use the filter API. This is more a PoC
than other thing for now. It allocates memory to work. So, if only for that, it
should be rewritten.
In the mean time, the implementation has been refactored to allow its use with
other filters. However, there are limitations that should be respected:
- No filter placed after the compression one is allowed to change input data
(in 'http_data' callback).
- No filter placed before the compression one is allowed to change forwarded
data (in 'http_forward_data' callback).
For now, these limitations are informal, so you should be careful when you use
several filters.
About the configuration, 'compression' keywords are still supported and must be
used to configure the HTTP compression behavior. In absence of a 'filter' line
for the compression filter, it is added in the filter chain when the first
compression' line is parsed. This is an easy way to do when you do not use other
filters. But another filter exists, an error is reported so that the user must
explicitly declare the filter.
For example:
listen tst
...
compression algo gzip
compression offload
...
filter flt_1
filter compression
filter flt_2
...
This patch adds the support of filters in HAProxy. The main idea is to have a
way to "easely" extend HAProxy by adding some "modules", called filters, that
will be able to change HAProxy behavior in a programmatic way.
To do so, many entry points has been added in code to let filters to hook up to
different steps of the processing. A filter must define a flt_ops sutrctures
(see include/types/filters.h for details). This structure contains all available
callbacks that a filter can define:
struct flt_ops {
/*
* Callbacks to manage the filter lifecycle
*/
int (*init) (struct proxy *p);
void (*deinit)(struct proxy *p);
int (*check) (struct proxy *p);
/*
* Stream callbacks
*/
void (*stream_start) (struct stream *s);
void (*stream_accept) (struct stream *s);
void (*session_establish)(struct stream *s);
void (*stream_stop) (struct stream *s);
/*
* HTTP callbacks
*/
int (*http_start) (struct stream *s, struct http_msg *msg);
int (*http_start_body) (struct stream *s, struct http_msg *msg);
int (*http_start_chunk) (struct stream *s, struct http_msg *msg);
int (*http_data) (struct stream *s, struct http_msg *msg);
int (*http_last_chunk) (struct stream *s, struct http_msg *msg);
int (*http_end_chunk) (struct stream *s, struct http_msg *msg);
int (*http_chunk_trailers)(struct stream *s, struct http_msg *msg);
int (*http_end_body) (struct stream *s, struct http_msg *msg);
void (*http_end) (struct stream *s, struct http_msg *msg);
void (*http_reset) (struct stream *s, struct http_msg *msg);
int (*http_pre_process) (struct stream *s, struct http_msg *msg);
int (*http_post_process) (struct stream *s, struct http_msg *msg);
void (*http_reply) (struct stream *s, short status,
const struct chunk *msg);
};
To declare and use a filter, in the configuration, the "filter" keyword must be
used in a listener/frontend section:
frontend test
...
filter <FILTER-NAME> [OPTIONS...]
The filter referenced by the <FILTER-NAME> must declare a configuration parser
on its own name to fill flt_ops and filter_conf field in the proxy's
structure. An exemple will be provided later to make it perfectly clear.
For now, filters cannot be used in backend section. But this is only a matter of
time. Documentation will also be added later. This is the first commit of a long
list about filters.
It is possible to have several filters on the same listener/frontend. These
filters are stored in an array of at most MAX_FILTERS elements (define in
include/types/filters.h). Again, this will be replaced later by a list of
filters.
The filter API has been highly refactored. Main changes are:
* Now, HA supports an infinite number of filters per proxy. To do so, filters
are stored in list.
* Because filters are stored in list, filters state has been moved from the
channel structure to the filter structure. This is cleaner because there is no
more info about filters in channel structure.
* It is possible to defined filters on backends only. For such filters,
stream_start/stream_stop callbacks are not called. Of course, it is possible
to mix frontend and backend filters.
* Now, TCP streams are also filtered. All callbacks without the 'http_' prefix
are called for all kind of streams. In addition, 2 new callbacks were added to
filter data exchanged through a TCP stream:
- tcp_data: it is called when new data are available or when old unprocessed
data are still waiting.
- tcp_forward_data: it is called when some data can be consumed.
* New callbacks attached to channel were added:
- channel_start_analyze: it is called when a filter is ready to process data
exchanged through a channel. 2 new analyzers (a frontend and a backend)
are attached to channels to call this callback. For a frontend filter, it
is called before any other analyzer. For a backend filter, it is called
when a backend is attached to a stream. So some processing cannot be
filtered in that case.
- channel_analyze: it is called before each analyzer attached to a channel,
expects analyzers responsible for data sending.
- channel_end_analyze: it is called when all other analyzers have finished
their processing. A new analyzers is attached to channels to call this
callback. For a TCP stream, this is always the last one called. For a HTTP
one, the callback is called when a request/response ends, so it is called
one time for each request/response.
* 'session_established' callback has been removed. Everything that is done in
this callback can be handled by 'channel_start_analyze' on the response
channel.
* 'http_pre_process' and 'http_post_process' callbacks have been replaced by
'channel_analyze'.
* 'http_start' callback has been replaced by 'http_headers'. This new one is
called just before headers sending and parsing of the body.
* 'http_end' callback has been replaced by 'channel_end_analyze'.
* It is possible to set a forwarder for TCP channels. It was already possible to
do it for HTTP ones.
* Forwarders can partially consumed forwardable data. For this reason a new
HTTP message state was added before HTTP_MSG_DONE : HTTP_MSG_ENDING.
Now all filters can define corresponding callbacks (http_forward_data
and tcp_forward_data). Each filter owns 2 offsets relative to buf->p, next and
forward, to track, respectively, input data already parsed but not forwarded yet
by the filter and parsed data considered as forwarded by the filter. A any time,
we have the warranty that a filter cannot parse or forward more input than
previous ones. And, of course, it cannot forward more input than it has
parsed. 2 macros has been added to retrieve these offets: FLT_NXT and FLT_FWD.
In addition, 2 functions has been added to change the 'next size' and the
'forward size' of a filter. When a filter parses input data, it can alter these
data, so the size of these data can vary. This action has an effet on all
previous filters that must be handled. To do so, the function
'filter_change_next_size' must be called, passing the size variation. In the
same spirit, if a filter alter forwarded data, it must call the function
'filter_change_forward_size'. 'filter_change_next_size' can be called in
'http_data' and 'tcp_data' callbacks and only these ones. And
'filter_change_forward_size' can be called in 'http_forward_data' and
'tcp_forward_data' callbacks and only these ones. The data changes are the
filter responsability, but with some limitation. It must not change already
parsed/forwarded data or data that previous filters have not parsed/forwarded
yet.
Because filters can be used on backends, when we the backend is set for a
stream, we add filters defined for this backend in the filter list of the
stream. But we must only do that when the backend and the frontend of the stream
are not the same. Else same filters are added a second time leading to undefined
behavior.
The HTTP compression code had to be moved.
So it simplifies http_response_forward_body function. To do so, the way the data
are forwarded has changed. Now, a filter (and only one) can forward data. In a
commit to come, this limitation will be removed to let all filters take part to
data forwarding. There are 2 new functions that filters should use to deal with
this feature:
* flt_set_http_data_forwarder: This function sets the filter (using its id)
that will forward data for the specified HTTP message. It is possible if it
was not already set by another filter _AND_ if no data was yet forwarded
(msg->msg_state <= HTTP_MSG_BODY). It returns -1 if an error occurs.
* flt_http_data_forwarder: This function returns the filter id that will
forward data for the specified HTTP message. If there is no forwarder set, it
returns -1.
When an HTTP data forwarder is set for the response, the HTTP compression is
disabled. Of course, this is not definitive.
Arkadiy Kulev noticed that if a server is marked down while a connection
is being trying to establish, we still insist on performing retries on
the same server, which is absurd. Better perform the redispatch if we
already know the server is down. Because of this, it's likely that the
observe-l4 and sudden-death mechanisms are not optimal an cannot help
much the connection which was used to detect the problem.
The fix should be backported to 1.6 and 1.5 at least.
client-fin and server-fin are bogus. They are applied on the write
side after a SHUTR was seen. The immediate effect is that sometimes
if a SHUTR was seen after a SHUTW on the same side, the timeout is
enabled again regardless of the fact that the output is already
closed. This results in the timeout event not to be processed and
a busy poll loop to happen until another timeout on the stream gets
rid of it. Note that haproxy continues its job during this, it's just
that it eats all the CPU trying to handle an event that it ignores.
An reproducible case consists in having a client stop reading data from
a server to ensure data remain in the response buffer, then the client
sends a shutdown(write). If abortonclose is enabled on haproxy, the
shutdown is passed to the server side and the server responds with a
SHUTR that cannot immediately be forwarded to the client since the
buffer is full. During this time the event is ignored and the task is
woken again in loops.
It is worth noting that the timeout handling since 1.5 is a bit fragile
and that it might be possible that other similar conditions still exist,
so the timeout handling should be audited regarding this issue.
Many thanks to BaiYang for providing detailed information showing the
problem in action.
This bug also affects 1.5 thus the fix must be backported.
This new target can be called from the frontend or the backend. It
is evaluated just before the backend choice and just before the server
choice. So, the input stream or HTTP request can be forwarded to a
server or to an internal service.
Mailing list participant "mlist" reported negative conn_cur values in
stick tables as the result of "tcp-request connection track-sc". The
reason is that after the stick entry it copied from the session to the
stream, both the session and the stream grab a reference to the entry
and when the stream ends, it decrements one reference and one connection,
then the same is done for the session.
In fact this problem was already encountered slightly differently in the
past and addressed by Thierry using the patch below as it was believed by
then to be only a refcount issue since it was the observable symptom :
827752e "BUG/MEDIUM: stick-tables: refcount error after copying SC..."
In reality the problem is that the stream must touch neither the refcount
nor the connection count for entries it inherits from the session. While
we have no way to tell whether a track entry was inherited from the session
(since they're simply memcpy'd), it is possible to prevent the stream from
touching an entry that already exists in the session because that's a
guarantee that it was inherited from it.
Note that it may be a temporary fix. Maybe in the future when a session
gives birth to multiple streams we'll face a situation where a session may
be updated to add more tracked entries after instanciating some streams.
The correct long-term fix is to mark some tracked entries as shared or
private (or RO/RW). That will allow the session to track more entries
even after the same trackers are being used by early streams.
No backport is needed, this is only caused by the session/stream split in 1.6.
A crash was reported when using the "famous" http-send-name-header
directive. This time it's a bit tricky, it requires a certain number of
conditions to be met including maxconn on a server, queuing, timeout in
the queue and cookie-based persistence.
The problem is that in stream.c, before calling http_send_name_header(),
we check a number of conditions to know if we have to replace the header
name. But prior to reaching this place, it's possible for
sess_update_stream_int() to fail and change the stream-int's state to
SI_ST_CLO, send an error 503 to the client, and flush all buffers. But
http_send_name_header() can only be called with valid buffer contents
matching the http_msg's description. So when it rewinds the stream to
modify the header, buf->o becomes negative by the size of the incoming
request and is used as the argument to memmove() which basically
displaces 4GB of memory off a few bytes to write the new name, resulting
in a core and a core file that's really not fun to play with.
The solution obviously consists in refraining from calling this nasty
function when the stream interface is already closed.
This bug also affects 1.5 and possibly 1.4, so the fix must be backported
there.
The union name "data" is a little bit heavy while we read the source
code because we can read "data.data.sint". The rename from "data" to "u"
makes the read easiest like "data.u.sint".
This patch remove the struct information stored both in the struct
sample_data and in the striuct sample. Now, only thestruct sample_data
contains data, and the struct sample use the struct sample_data for storing
his own data.
During 1.5-dev20 there was some code refactoring to make the src_* fetch
function use the same code as sc_*. Unfortunately this introduced a
regression where src_* doesn't create an entry anymore if it does not
exist in the table. The reason is that smp_fetch_sc_stkctr() only calls
stktable_lookup_key() while src_inc_*/src_clr_* used to make use of
stktable_update_key() which additionally create the entry if it does
not exist.
There's no point modifying the common function for these two exceptions,
so instead we now have a function dedicated to the creation of this entry
for src_* only. It is called when the entry didn't exist, so that requires
minimal modifications to existing code.
Thanks to Thierry Fournier for helping diagnose the issue.
This fix must be backported to 1.5.
During the processing of tcp-request connection, the stream doesn't exists, so the
stick counters are stored in the session. When the stream is created it must
inherit from the session sc.
This patch fix this behavior.
[WT: this is specific to 1.6, no backport needed]
Currently it is possible for the current_rule field to be evaluated before
being set, leading to valgrind complaining:
==16783== Conditional jump or move depends on uninitialised value(s)
==16783== at 0x44E662: http_res_get_intercept_rule (proto_http.c:3730)
==16783== by 0x44E662: http_process_res_common (proto_http.c:6528)
==16783== by 0x4797B7: process_stream (stream.c:1851)
==16783== by 0x414634: process_runnable_tasks (task.c:238)
==16783== by 0x40B02F: run_poll_loop (haproxy.c:1528)
==16783== by 0x407F25: main (haproxy.c:1887)
This was introduced by commit 152b81e7b2.
This patch removes the 32 bits unsigned integer and the 32 bit signed
integer. It replaces these types by a unique type 64 bit signed.
This makes easy the usage of integer and clarify signed and unsigned use.
With the previous version, signed and unsigned are used ones in place of
others, and sometimes the converter loose the sign. For example, divisions
are processed with "unsigned", if one entry is negative, the result is
wrong.
Note that the integer pattern matching and dotted version pattern matching
are already working with signed 64 bits integer values.
There is one user-visible change : the "uint()" and "sint()" sample fetch
functions which used to return a constant integer have been replaced with
a new more natural, unified "int()" function. These functions were only
introduced in the latest 1.6-dev2 so there's no impact on regular
deployments.
When the HTTP forwarder is used, it resets msg->sov so that we know that
the parsing pointer has advanced by exactly (msg->eoh + msg->eol - msg->sov)
bytes which may have to be rewound in case we want to perform an HTTP fetch
after forwarding has started (eg: upon connect).
But when the backend is in TCP mode, there may be no HTTP forwarding
analyser installed, still we may want to perform these HTTP fetches in
case we have already ensured at the TCP layer that we have a properly
parsed HTTP transaction.
In order to solve this, we reset msg->sov before doing a channel_forward()
so that we can still compute http_rewind() on the pending data. That ensures
the buffer is always rewindable even in mixed TCP+HTTP mode.
This patch adds support of variables during the processing of each stream. The
variables scope can be set as 'session', 'transaction', 'request' or 'response'.
The variable type is the type returned by the assignment expression. The type
can change while the processing.
The allocated memory can be controlled for each scope and each request, and for
the global process.
First, findproxy() was renamed proxy_find_by_name() so that its explicit
that a name is required for the lookup. Second, we give this function
the ability to search for tables if needed. Third we now provide inline
wrappers to pass the appropriate PR_CAP_* flags and to explicitly look
up a frontend, backend or table.
For backend load balancing it sometimes makes sense to redispatch rather
than retrying against the same server. For example, when machines or routers
fail you may not want to waste time retrying against a dead server and
would instead prefer to immediately redispatch against other servers.
This patch allows backend sections to specify that they want to
redispatch on a particular interval. If the interval N is positive the
redispatch occurs on every Nth retry, and if the interval N is negative then
the redispatch occurs on the Nth retry prior to the last retry (-1 is the
default and maintains backwards compatibility). In low latency environments
tuning this setting can save a few hundred milliseconds when backends fail.
This patch removes the structs "session", "stream" and "proxy" from
the sample-fetches and converters function prototypes.
This permits to remove some weight in the prototype call.
Since 1.5, the request body analyser has become independant from any
other element and does not even disturb the message forwarder anymore.
And since it's disabled by default, we can place it before most
analysers so that it's can preempt any other one if an intermediary
one enables it.
The applets don't fiddle with SI_FL_WAIT_ROOM anymore, instead they indicate
what they want, possibly that they failed (eg: WAIT_ROOM), and it's done() /
update() which finally updates the WAIT_* flags according to the channels'
and stream interface's states. This solves the issue of the pauses during a
"show sess" without creating busy loops.
We used to allocate a request buffer so that we could process applets
from process_stream(), and this was causing some trouble because it was
not possible for an analyzer to return an error to an applet, which
we'll need for HTTP/2. Now that we don't call applets anymore from
process_stream() we can simplify this and ensure that a response is
always allocated to process a stream.
Now si->update() is used to update any type of stream interface, whether
it's an applet, a connection or even nothing. We don't call si_applet_call()
anymore at the end of the resync and we don't have the risk that the
stream's task is reinserted into the run queue, which makes the code
a bit simpler.
The stream_int_update_applet() function was simplified to ensure that
it remained compatible with this standardized calling convention. It
was almost copy-pasted from the update code dedicated to connections.
Just like for si_applet_done(), it seems that it should be possible to
merge the two functions except that it would require some slow operations,
except maybe if the type of end point is tested inside the update function
itself.
We don't pass sess->origin anymore but the pointer to the previous step. Now
it should be much easier to chain elements together once applets are moved out
of streams. Indeed, the session is only used for configuration and not for the
dynamic chaining anymore.
It's not the stream's job to manipulate the connection's flags, it's
more related to the session that accepted the new connection. And the
only case where we have to do it conditionally is based on the frontend
which is known from the session, thus it makes sense to do it there.
When the stream is instanciated from an applet, it doesn't necessarily
have a listener. The listener was sparsely used there, just to retrieve
the task function, update the listeners' stats, and set the analysers
and default target, both of which are often zero from applets. Thus
these elements are now initialized with default values that the caller
is free to change if desired.
The auto-forwarding mechanism in case no analyser is set is generic
to the streams. Also the timeouts on the client side are better preset
in the stream initialization as well.
The function was called stream_accept_session(), let's rename it
stream_new() and make it return the newly allocated pointer. It's
more convenient for some callers who need it.
This function was specified as being able to return 3 states, which had
repercussions to the stream accept function. It was used at the time
when the frontend would do the monitoring itself. This is not the case
anymore, so let's simplify this.
In stream_accept_session(), we perform some operations that explicitly
want a connection as the origin, but we'll soon have other types of
origin (eg: applet). Thus change the test to ensure we only call this
code with connections. Additionally, we refrain from calling fe->accept()
if the origin is not a connection, because for now the only fe->accept()
may only use a connection (frontend_accept).
This concerns everythins related to accepting a new session and
expiring the embryonic session. There's still a hard-coded call
to stream_accept_session() which could be set somewhere in the
frontend, but for now it's not a problem.
Now that the previous changes were made, we can add a struct task
pointer to stream_complete() and get rid of it in struct session.
The new relation between connection, session and task are like this :
orig -- sess <-- context
| |
v |
conn -- owner ---> task
Some session-specific parts should now move away from stream.
The function now only initializes a session, calls the tcp req connection
rules, and calls stream_complete() to finish initialization. If a handshake
is needed, it is done without allocating the stream at all.
Temporarily, in order to limit the amount of changes, the task allocated
is put into sess->task, and it is used by the connection for the handshake
or is offered to the stream. At this point we set the relation between
sess/task/conn this way :
orig -- sess <-- context
| ^ +- task -+ |
v | v |
conn -- owner task
The task must not remain in the session and ultimately it is planned to
remove this task pointer from the session because it can be found by
having conn->owner = task, and looping back from sess to conn, and to
find the session from the connection via the task.
Since the tcp-request connection rules don't need the stream anymore, we
can safely move the session-specific stuff earlier and prepare for a split
of session and stream initialization. Some work remains to be done.
This one is not needed anymore since we cannot track the stream counters
prior to reaching these locations. Only session counters may be tracked
and they're properly committed during session_free().
It passes a NULL wherever a stream was needed (acl_exec_cond() and
action_ptr mainly). It can still track the connection rate correctly
and block based on ACLs.
In order to support sessions tracking counters, we first ensure that there
is no overlap between streams' stkctr and sessions', and we allow an
automatic lookup into the session's counters when the stream doesn't
have a counter or when the stream doesn't exist during an access via
a sample fetch. The functions used to update the stream counters only
update them and not the session counters however.
The stick counters in the session will be used for everything not related
to contents, hence the connections / concurrent sessions / etc. They will
be usable by "tcp-request connection" rules even without a stream. For now
they're just allocated and initialized.
Doing so ensures we don't need to use the stream anymore to prepare the
log information to report a failed handshake on an embryonic session.
Thus, prepare_mini_sess_log_prefix() now takes a session in argument.
Now that we have sess->origin to carry that information along, we don't
need to put that into strm->target anymore, so we remove one dependence
on the stream in embryonic connections.
Many such function need a session, and till now they used to dereference
the stream. Once we remove the stream from the embryonic session, this
will not be possible anymore.
So as of now, sample fetch functions will be called with this :
- sess = NULL, strm = NULL : never
- sess = valid, strm = NULL : tcp-req connection
- sess = valid, strm = valid, strm->txn = NULL : tcp-req content
- sess = valid, strm = valid, strm->txn = valid : http-req / http-res
All of them can now retrieve the HTTP transaction *if it exists* from
the stream and be sure to get NULL there when called with an embryonic
session.
The patch is a bit large because many locations were touched (all fetch
functions had to have their prototype adjusted). The opportunity was
taken to also uniformize the call names (the stream is now always "strm"
instead of "l4") and to fix indent where it was broken. This way when
we later introduce the session here there will be less confusion.
Now this one is dynamically allocated. It means that 280 bytes of memory
are saved per TCP stream, but more importantly that it will become
possible to remove the l7 pointer from fetches and converters since
it will be deduced from the stream and will support being null.
A lot of care was taken because it's easy to forget a test somewhere,
and the previous code used to always trust s->txn for being valid, but
all places seem to have been visited.
All HTTP fetch functions check the txn first so we shouldn't have any
issue there even when called from TCP. When branching from a TCP frontend
to an HTTP backend, the txn is properly allocated at the same time as the
hdr_idx.
This one will not necessarily be allocated for each stream, and we want
to use the fact that it equals null to know it's not present so that we
can always deduce its presence from the stream pointer.
This commit only creates the new pool.
The header captures are now general purpose captures since tcp rules
can use them to capture various contents. That removes a dependency
on http_txn that appeared in some sample fetch functions and in the
order by which captures and http_txn were allocated.
Interestingly the reset of the header captures were done at too many
places as http_init_txn() used to do it while it was done previously
in every call place.
When s->si[0].end was dereferenced as a connection or anything in
order to retrieve information about the originating session, we'll
now use sess->origin instead so that when we have to chain multiple
streams in HTTP/2, we'll keep accessing the same origin.
Just like for the listener, the frontend is session-wide so let's move
it to the session. There are a lot of places which were changed but the
changes are minimal in fact.
There is now a pointer to the session in the stream, which is NULL
for now. The session pool is created as well. Some parts will move
from the stream to the session now.
With HTTP/2, we'll have to support multiplexed streams. A stream is in
fact the largest part of what we currently call a session, it has buffers,
logs, etc.
In order to catch any error, this commit removes any reference to the
struct session and tries to rename most "session" occurrences in function
names to "stream" and "sess" to "strm" when that's related to a session.
The files stream.{c,h} were added and session.{c,h} removed.
The session will be reintroduced later and a few parts of the stream
will progressively be moved overthere. It will more or less contain
only what we need in an embryonic session.
Sample fetch functions and converters will have to change a bit so
that they'll use an L5 (session) instead of what's currently called
"L4" which is in fact L6 for now.
Once all changes are completed, we should see approximately this :
L7 - http_txn
L6 - stream
L5 - session
L4 - connection | applet
There will be at most one http_txn per stream, and a same session will
possibly be referenced by multiple streams. A connection will point to
a session and to a stream. The session will hold all the information
we need to keep even when we don't yet have a stream.
Some more cleanup is needed because some code was already far from
being clean. The server queue management still refers to sessions at
many places while comments talk about connections. This will have to
be cleaned up once we have a server-side connection pool manager.
Stream flags "SN_*" still need to be renamed, it doesn't seem like
any of them will need to move to the session.