First of all, all legacy HTTP analyzers and all functions exclusively used by
them were removed. So most of the functions in proto_http.{c,h} were removed;
only the functions dealing with the HTTP transaction have been kept. Then, the
http_msg and hdr_idx modules were entirely removed. And finally, the structure
http_msg was stripped of all the information that was only useful to the legacy
HTTP code. The structure hdr_ctx was also removed because it is now unused, just
like the unused states in the enum h1_state. Note that the memory pool "hdr_idx"
was removed and "http_txn" is now smaller.
This commit breaks the compatibility with filters still relying on the legacy
HTTP code. The legacy callbacks were removed (http_data, http_chunk_trailers and
http_forward_data).
For now, the filters must still set the flag FLT_CFG_FL_HTX to be used on HTX
streams.
Since the legacy HTTP mode is disabled, all HTTP sample fetches work on HTX
streams. So it is safe to remove all code relying on HTTP legacy mode. Among
other things, the function smp_prefetch_http() was removed with the associated
macros CHECK_HTTP_MESSAGE_FIRST() and CHECK_HTTP_MESSAGE_FIRST_PERM().
Since the legacy HTTP mode is disabled and no multiplexer relies on it anymore,
there is no reason to have 2 multiplexer protocols for HTTP. So the protocol
PROTO_MODE_HTX was removed and all HTTP multiplexers now use PROTO_MODE_HTTP.
Because the h2 multiplexer only uses the HTX mode, the following H2 functions
were removed:
* h2_prepare_h1_reqline()
* h2_make_h1_request()
* h2_make_h1_trailers()
Instead of using an array of (struct block), it is more natural and intuitive
to use an array of char. Indeed, not only are the (struct block) stored in this
array, but also their payload.
The <head> and <tail> fields are now signed 32-bit integers. For an empty HTX
message, these fields are set to -1. So the field <used> is now useless and can
safely be removed. To know if an HTX message is empty or not, we just compare
<head> against -1 (it also works with <tail>). The function htx_nbblks() has
been added to get the number of used blocks.
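As an illustration, here is a minimal sketch of the new convention (simplified
stand-in types, not the real struct htx), showing how the -1 convention makes
the old <used> counter redundant:

    #include <stdint.h>

    /* Simplified stand-in for the real struct htx: <head> and <tail> are
     * signed 32-bit block indexes, -1 meaning "no block at all".
     */
    struct htx_sketch {
        int32_t head;   /* oldest block index, or -1 if the message is empty */
        int32_t tail;   /* newest block index, or -1 if the message is empty */
    };

    /* An HTX message is empty when <head> is -1 (the test on <tail> works too). */
    static inline int htx_sketch_is_empty(const struct htx_sketch *htx)
    {
        return htx->head == -1;
    }

    /* Number of used blocks, replacing the removed <used> field. */
    static inline int32_t htx_sketch_nbblks(const struct htx_sketch *htx)
    {
        return (htx->head == -1) ? 0 : htx->tail - htx->head + 1;
    }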
A long time ago, applets were seen as an alternative to connections,
and since their respective sizes were roughly equal it appeared wise
to share the same pool. Nowadays, connections have become significantly
larger while applets are not used that often, except for the cache. Moreover,
applets are now mostly complementary to connections rather than alternatives,
as it is perfectly possible to have no back connection at all, or to share
one with other streams.
The connections will soon lose their addresses and their size will
shrink so much that appctx won't fit anymore. Given that the old
benefits of sharing these pools have long disappeared, let's stop
doing this and have a dedicated pool for appctx.
Since commit 81492c989 ("MINOR: threads: flatten the per-thread cpu-map"),
we don't keep the proc*thread matrix anymore to represent the full binding
possibilities, but only the proc and thread ones. The problem is that the
per-process binding is not the same for each thread and for the process,
and the proc[] array was assumed to store the per-proc first thread value
when doing this change. Worse, the logic present there tries to deal with
thread ranges and process ranges in a way which automatically excludes the
other possibility (since ranges cannot be used on both) but as such fails
to apply changes if neither the process nor the thread is expressed as a
range.
The real problem comes from the fact that specifying cpu-map 1/1 doesn't
yet reveal if the per-process mask or the per-thread mask needs to be
updated. In practice it's the thread one, but then the current storage
doesn't allow storing the binding of the first thread of each other
process in nbproc>1 configurations.
When removing the proc*thread matrix, what ought to have been kept was
both the thread column for process 1 and the process line for thread 1,
but instead only the thread column was kept. This patch reintroduces the
storage of the configuration for the first thread of each process so that
it is again possible to store either the per-thread or per-process
configuration.
As a partial workaround for existing configurations, it is possible to
systematically indicate at least two processes or two threads at once
and map them by pairs or more so that at least two values are present
in the range. E.g.:
# set processes 1-4 to cpus 0-3 :
cpu-map auto:1-4/1 0 1 2 3
# or:
cpu-map 1-2/1 0 1
cpu-map 3-4/1 2 3
# set threads 1-4 to cpus 0-3 :
cpu-map auto:1/1-4 0 1 2 3
# or :
cpu-map 1/1-2 0 1
cpu-map 1/3-4 2 3
This fix must be backported to 2.0.
Move the logic to decide if we redispatch to a new server from
sess_update_st_cer() to a new inline function, stream_choose_redispatch(), and
use it in do_l7_retry() instead of just setting the state to SI_ST_REQ.
That way, when using L7 retries, we won't redispatch the request to another
server unless "option redispatch" is used.
This should be backported to 2.0.
Sometimes we need to delegate some list processing to a function running
on another thread. In this case the list element will simply be queued
into a dedicated self-locked list and the task responsible for this list
will be woken up, calling the associated function which will run over the
list.
This is what work_list does. Such lists will be dedicated to a limited
type of work but will significantly ease such remote handling. A function
is provided to create these per-thread lists, their tasks and to properly
bind each task to a distinct thread, so that the caller only has to store
the resulting pointer to the start of the structure.
These structures should not be abused though as each head will consume
4 pointers per thread, hence 32 bytes per thread or 2 kB for 64 threads.
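As a rough illustration of the pattern (all names below are made up for the
example and do not match haproxy's actual work_list API), the principle looks
like this:

    #include <pthread.h>

    struct work_item_sketch {
        struct work_item_sketch *next;
        void *arg;
    };

    /* One such head exists per thread; <process> is the function the owner
     * thread runs over the queued elements.
     */
    struct work_head_sketch {
        pthread_mutex_t lock;                       /* protects <head> */
        struct work_item_sketch *head;              /* elements queued by any thread */
        void (*process)(struct work_item_sketch *); /* handler run by the owner thread */
    };

    /* Called from any thread: queue an element for the owner thread. */
    static void work_queue_sketch(struct work_head_sketch *wl, struct work_item_sketch *item)
    {
        pthread_mutex_lock(&wl->lock);
        item->next = wl->head;
        wl->head = item;
        pthread_mutex_unlock(&wl->lock);
        /* the real code would now wake up the task bound to the owner thread */
    }

    /* Called from the owner thread's task: drain the list and process it. */
    static void work_run_sketch(struct work_head_sketch *wl)
    {
        struct work_item_sketch *item;

        pthread_mutex_lock(&wl->lock);
        item = wl->head;
        wl->head = NULL;
        pthread_mutex_unlock(&wl->lock);

        while (item) {
            struct work_item_sketch *next = item->next;
            wl->process(item);
            item = next;
        }
    }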
When we're purging idle connections, there is a race condition between the
thread removing a connection from the idle list, in order to add it to the
list of connections to free, and the thread owning the connection trying to
free it at the same time.
To fix this, simply add a per-thread lock that has to be held before
removing the connection from the idle list, and when, in conn_free(), we're
about to remove the connection from every list. That way, we know for sure
the connection will stay valid while we remove it from the idle list, to add
it to the list of connections to free.
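A rough sketch of the locking scheme (illustrative names and types only, not
haproxy's actual structures):

    #include <pthread.h>

    struct conn_sketch {
        struct conn_sketch *next;
    };

    struct idle_conns_sketch {
        pthread_mutex_t lock;            /* the new per-thread lock */
        struct conn_sketch *idle_head;   /* idle connections owned by this thread */
    };

    /* Purge path, possibly run by another thread: hold the owner's lock while
     * moving a connection from the idle list to the list of connections to free.
     */
    static void purge_one_idle_sketch(struct idle_conns_sketch *owner,
                                      struct conn_sketch **to_free)
    {
        pthread_mutex_lock(&owner->lock);
        struct conn_sketch *conn = owner->idle_head;
        if (conn) {
            owner->idle_head = conn->next;
            conn->next = *to_free;
            *to_free = conn;
        }
        pthread_mutex_unlock(&owner->lock);
    }

    /* conn_free() path, run by the owner: take the same lock before unlinking
     * the connection from every list, so it cannot vanish while the purge
     * above is manipulating it.
     */
    static void conn_free_sketch(struct idle_conns_sketch *owner, struct conn_sketch *conn)
    {
        pthread_mutex_lock(&owner->lock);
        if (owner->idle_head == conn)            /* detach from the idle list */
            owner->idle_head = conn->next;
        pthread_mutex_unlock(&owner->lock);
        /* now it is safe to actually release <conn> */
    }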
This should happen rarely enough that it shouldn't have any impact on
performance.
This has not been reported yet, but could provoke random segfaults.
This should be backported to 2.0.
The maximum number of idle connections for a server can be configured by setting
the server option "pool-max-conn". But when we try to add a connection in its
idle list, because of a wrong comparison, it may be rejected because there are
already "pool-max-conn - 1" idle connections.
This patch must be backported to 2.0 and 1.9.
While experimenting with potentially improved fairness and latency using
ticket locks on a Ryzen 16-thread/8-core, a very strange situation happened
a lot for some levels of traffic. Around 300k connections per second, no
more connections would be accepted on the multi-threaded listener but all
others would continue to work fine. All attempts to trace showed that the
threads were all in the trylock in the fd cache, or in the spinlock of
fd_update_events(), or in the one of fd_may_recv(). But as indicated this
was not a deadlock since the process continued to work fine.
After quite some investigation it appeared that the issue is caused by a
lack of fairness between the fdcache's trylock and these functions' spin
locks above. In fact, regardless of the success or failure of the fdcache's
attempt at grabbing the lock, the poller was calling fd_update_events()
which locks the FD once for something that can be done with a CAS, and
then calls fd_may_recv() with another lock for something that most often
didn't change. The high contention on these spinlocks leaves no chance to
any other thread to grab the lock using trylock(), and once this happens,
there is no thread left to process incoming connection events nor to stop
polling on the FD, leaving all threads at 100% CPU but partially operational.
This patch addresses the issue by using bit-test-and-set instead of the OR
in fd_may_recv() / fd_may_send() so that nothing is done if the FD was
already configured as expected. It does the same in fd_update_events()
using a CAS to check if the FD's events need to be changed at all or not.
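The idea can be sketched with plain C11 atomics (haproxy uses its own atomic
wrappers; the names and bit values below are illustrative):

    #include <stdatomic.h>

    /* Only write to the shared event mask when a bit actually needs to change,
     * so unchanged FDs cause neither locking nor cache-line ping-pong.
     */
    static inline void fd_may_recv_sketch(_Atomic unsigned int *ev, unsigned int ready_bit)
    {
        /* cheap read first: most of the time the bit is already set */
        if (atomic_load_explicit(ev, memory_order_relaxed) & ready_bit)
            return;
        /* equivalent of a bit-test-and-set: set it only when it changes */
        atomic_fetch_or_explicit(ev, ready_bit, memory_order_release);
    }

    /* Same principle with a CAS when several bits must be updated at once. */
    static inline void fd_update_events_sketch(_Atomic unsigned int *ev, unsigned int new_bits)
    {
        unsigned int old = atomic_load_explicit(ev, memory_order_relaxed);

        while ((old | new_bits) != old &&
               !atomic_compare_exchange_weak_explicit(ev, &old, old | new_bits,
                                                      memory_order_release,
                                                      memory_order_relaxed))
            ;  /* <old> was refreshed by the failed CAS; re-check and retry */
    }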
With this patch applied, it became impossible to reproduce the issue, and
now there's no way to saturate all 16 CPUs with the load used for testing,
as no more than 1350-1400% CPU was noticed at 300+ kcps vs 1600%.
Ideally this patch should go further and try to remove the remaining
incarnations of the fdlock as this seems possible, but it's difficult
enough to be done in a distinct patch that will not have to be backported.
It is possible that workloads involving a high connection rate may slightly
benefit from this patch and observe a slightly lower CPU usage even when
the service doesn't misbehave.
This patch must be backported to 2.0 and 1.9.
These calls can take quite some time and leave the thread harmless so
it's better to mark it as such. This makes "show sess" respond way
faster during high loads running on processes built with DEBUG_UAF
since these calls are stressed a lot.
When calling mmap(), in general the system gives us a page but does not
really allocate it until we first dereference it. And it turns out that
this time is much longer than the time to perform the mmap() syscall.
Unfortunately, when running with memory debugging enabled, we mmap/munmap()
each object resulting in lots of such calls and a high contention on the
allocator. And having the first access to the page performed under the pool
lock is extremely damaging to other threads.
The simple fact of writing a 0 at the beginning of the page after
allocating it and placing the POOL_LINK pointer outside of the lock is
enough to boost the performance by 8x in debug mode and to save the
watchdog from triggering on lock contention. This is what this patch
does.
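A minimal sketch of the trick with made-up names (the real code also places
the POOL_LINK pointer at that moment):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Allocate one debugging page and touch it immediately, outside of any
     * pool lock, so the page fault (far more expensive than the mmap()
     * syscall itself) is not taken while other threads wait on the lock.
     */
    static void *pool_alloc_page_sketch(size_t size)
    {
        void *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (area == MAP_FAILED)
            return NULL;

        *(char *)area = 0;  /* first dereference: force the real allocation now */
        return area;
    }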
The malloc and free calls and especially the underlying mmap/munmap()
can occasionally take a huge amount of time and even cause the thread
to sleep. This is visible when haproxy is compiled with DEBUG_UAF which
causes every single pool allocation/free to allocate and release pages.
In this case, when using the locked pools, the watchdog can occasionally
fire under high contention (typically requesting 40000 1M objects in
parallel over 8 threads). Then, "perf top" shows that 50% of the CPU
time is spent in mmap() and munmap(). The reason the watchdog fires is
because some threads spin on the pool lock which is held by other threads
waiting on mmap() or munmap().
This patch modifies this so that the pool lock is released during these
syscalls. Not only does this allow other threads to try to allocate their
data in parallel, it also considerably reduces the lock
contention.
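The change can be sketched like this (simplified, made-up pool structure; the
real code deals with both the allocation and the accounting):

    #include <pthread.h>
    #include <stdlib.h>

    struct pool_sketch {
        pthread_mutex_t lock;
        void *free_list;        /* linked through the first word of each object */
        size_t obj_size;
    };

    static void *pool_get_sketch(struct pool_sketch *pool)
    {
        void *obj;

        pthread_mutex_lock(&pool->lock);
        obj = pool->free_list;
        if (obj) {
            pool->free_list = *(void **)obj;   /* pop a cached object */
            pthread_mutex_unlock(&pool->lock);
            return obj;
        }
        /* Nothing cached: release the lock across the expensive allocation
         * (malloc() and, with DEBUG_UAF, the underlying mmap()) so other
         * threads can keep allocating and releasing in parallel.
         */
        pthread_mutex_unlock(&pool->lock);
        return malloc(pool->obj_size);
    }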
Note that the locked pools are only used on small architectures where
high thread counts would not make sense, so this will not provide any
benefit in the general case. However it makes the debugging versions
way more stable, which is always appreciated.
In the function stream_int_notify(), when the opposite stream-interface is
blocked because there is no more room in the input buffer, if the flag
CF_WRITE_PARTIAL is set on this buffer, it is unblocked. It is a way to unblock
the reads on the other side because some data was sent.
But it is a problem during the fast-forwarding because only the stream is able
to remove the flag CF_WRITE_PARTIAL. So it is possible to have this flag because
of a previous send while the input buffer of the opposite stream-interface is
now full. In such a case, the opposite stream-interface will be woken up for
nothing because its input buffer is full. If the same happens on the opposite
side, we will have a loop consuming all the CPU.
To fix the bug, the opposite side is now only notified, in the function
si_cs_send(), if there is some available room in its input buffer, so only if
some data was sent.
This patch must be backported to 2.0 and 1.9.
This return code should now be used by actions to stop, at the same time, the
rules processing and any possible following processing. And from its side, the
return code ACT_RET_STOP should be used to only stop the rules processing.
So concretely, for TCP rules, there is no change. ACT_RET_STOP and ACT_RET_DONE
are handled the same way. However, for HTTP rules, ACT_RET_STOP should now be
mapped onto HTTP_RULE_RES_STOP and ACT_RET_DONE onto HTTP_RULE_RES_DONE. This
way, an action will have the possibility to stop all processing or only the
rules processing.
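The expected mapping for HTTP rules can be summed up as follows (the enum
values are illustrative placeholders, not the real ones):

    enum act_ret_sketch   { ACT_RET_CONT_SK, ACT_RET_STOP_SK, ACT_RET_DONE_SK };
    enum http_rule_sketch { HTTP_RULE_RES_CONT_SK, HTTP_RULE_RES_STOP_SK, HTTP_RULE_RES_DONE_SK };

    static enum http_rule_sketch map_act_ret_sketch(enum act_ret_sketch ret)
    {
        switch (ret) {
        case ACT_RET_STOP_SK: return HTTP_RULE_RES_STOP_SK; /* stop the rules processing only */
        case ACT_RET_DONE_SK: return HTTP_RULE_RES_DONE_SK; /* stop all processing */
        default:              return HTTP_RULE_RES_CONT_SK; /* keep evaluating */
        }
    }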
Note that the changes for TCP are done in this commit but the changes for HTTP
will be done in another one because it will also fix a bug at the same time.
This patch must be backported to 2.0 because a bugfix depends on it.
When deciding if we keep an idle connection in the session, check if
the number of connections currently in the session is >= the max allowed,
not >, or we'll keep an extra connection.
This should be backported to 1.9 and 2.0.
Just calling conn_force_unsubscribe() from conn_upgrade_mux_fe() is not
enough, as there may be multiple XPRT involved. Instead, require that
any user of conn_upgrade_mux_fe() unsubscribe itself before calling it.
This should fix upgrading a TCP connection to HTX when using SSL.
This should be backported to 2.0.
The receive limit of an HTX channel must be calculated against the total size of
the HTX message. Otherwise, the buffer may never be seen as full even though
the receive limit is 0. Indeed, the function channel_htx_full() already takes
care to add a block size to the buffer's reserve (8 bytes). So if the function
channel_htx_recv_limit() also keeps a block size free in addition to the
buffer's reserve, it means that at least 2 block sizes will be kept free but
only one will be taken into account, freezing the stream if the option
http-buffer-request is
enabled.
This patch fixes the Github issue #136. It should be backported to 2.0 and
1.9. Thanks jaroslawr (Jarosław Rzeszótko) for his help.
Revert commit fe4abe62c7c5206dff1802f42d17014e198b9141.
The goal was to make sure that, for health checks, we would not get sockets in
TIME_WAIT. To do so, we would not call shutdown() if linger_risk is set.
However that is wrong, and it means shutw would never be forwarded to the
server, and thus we could get connections that are never properly closed.
Instead, to fix the original problem as described here:
https://www.mail-archive.com/haproxy@formilux.org/msg34080.html
just make sure the checks code calls cs_shutr() before calling cs_shutw().
If shutr has been called, conn_sock_shutw() will make no attempt to call
shutdown(), as it knows close() will be called.
We should really review and revamp the shutr/shutw code, as described in
github issue #142.
This should be backported to 1.9 and 2.0.
When using a level lower than admin on the master CLI, a \n is output before
the response; this is caused by the responses to the "operator" or "user"
commands that are sent before the actual command.
To fix this problem we introduce the flag APPCTX_CLI_ST1_NOLF, which asks for
a command response not to be followed by the final \n.
This patch makes a special case of the "operator" and "user" commands followed
by a '-', so that their responses are not followed by a \n.
This patch must be backported to 2.0 and 1.9.
As its name suggests, this function changes the value length of a block. But
it also updates the HTX message accordingly. It simplifies the HTX API. The
function
htx_set_blk_value_len() is still available and must be used with caution because
this one does not update the HTX message. It just updates the HTX block. It
should be considered as an internal function. When possible,
htx_change_blk_value_len() should be used instead.
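The difference between the two helpers boils down to the following (simplified
types; the real functions work on struct htx and struct htx_blk):

    #include <stdint.h>

    struct htx_msg_sketch { uint32_t data; };   /* total payload bytes in the message */
    struct htx_blk_sketch { uint32_t vlen; };   /* length of this block's value */

    /* Internal helper: only touches the block, the caller must fix up <data>. */
    static inline void set_blk_value_len_sketch(struct htx_blk_sketch *blk, uint32_t vlen)
    {
        blk->vlen = vlen;
    }

    /* Public helper: also keeps the message-level accounting consistent. */
    static inline void change_blk_value_len_sketch(struct htx_msg_sketch *htx,
                                                   struct htx_blk_sketch *blk, uint32_t vlen)
    {
        htx->data += vlen - blk->vlen;   /* unsigned delta wraps correctly */
        set_blk_value_len_sketch(blk, vlen);
    }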
This function is used to fix a bug affecting 2.0. So, this patch must be
backported to 2.0.
Server states can be recovered from either a "global" file (all backends)
or a "local" file (per backend).
The way the algorithm to parse the state file was first implemented was good
enough for a low number of backends and servers per backend.
Basically, for each backend the state file (global or local) is opened,
parsed entirely and for each line we check if it contains data related to
a server from the backend we're currently processing.
We must read the file entirely, just in case some lines for the current
backend are stored at the end of the file.
This does not scale at all!
This patch changes the behavior above for the "global" file only. Now,
the global file is read and parsed once and all lines it contains are
stored in a tree, for faster discovery.
This results in far fewer fopen(), fgets(), and strcmp() calls, which makes
loading very big state files very quick now.
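The new approach can be sketched as follows (the real code uses haproxy's
ebtree; here POSIX tsearch()/tfind() stand in for it and the key construction
is simplified):

    #include <search.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct state_line_sketch {
        char *key;    /* e.g. "backend_name/server_name" */
        char *line;   /* the full state line, parsed later on demand */
    };

    static int state_line_cmp(const void *a, const void *b)
    {
        const struct state_line_sketch *la = a, *lb = b;
        return strcmp(la->key, lb->key);
    }

    /* Read and index the global file once instead of re-reading it per backend. */
    static void load_state_file_sketch(FILE *f, void **root)
    {
        char buf[512];

        while (fgets(buf, sizeof(buf), f)) {
            char *line = strdup(buf);
            char *tok  = strtok(buf, " \n");   /* simplified key extraction */

            if (!tok) {
                free(line);
                continue;
            }
            struct state_line_sketch *sl = malloc(sizeof(*sl));
            sl->line = line;
            sl->key  = strdup(tok);
            tsearch(sl, root, state_line_cmp);
        }
    }

    /* O(log n) lookup of one server instead of a full file scan. */
    static struct state_line_sketch *find_state_sketch(void *const *root, const char *key)
    {
        struct state_line_sketch needle = { .key = (char *)key, .line = NULL };
        void *node = tfind(&needle, root, state_line_cmp);
        return node ? *(struct state_line_sketch **)node : NULL;
    }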
In commit 86eded6c6 ("CLEANUP: tasks: rename task_remove_from_tasklet_list()
to tasklet_remove_*") which consisted in removing the casts between tasks
and tasklet, I was a bit too fast to believe that we only saw tasklets in
this function since process_runnable_tasks() also uses it with tasks under
a cast. So removing the bookkeeping on task_list_size was not appropriate.
Bah, the joy of casts which hide the real thing...
This patch does two things at once to address this mess once for all:
- it restores the decrement of task_list_size when it's a real task,
but moves it to process_runnable_task() since it's the only place
where it's allowed to call it with a task
- it moves the increment there as well and renames
task_insert_into_tasklet_list() to tasklet_insert_into_tasklet_list()
for obvious consistency reasons.
This way the increment/decrement of task_list_size is made at the only
places where the cast is enforced, so there is less risk of it being missed.
The comments on top of these functions were updated to reflect that they
are only supposed to be used with tasklets and that the caller is responsible
for keeping task_list_size up to date if it decides to enforce a task there.
Now we don't have to worry anymore about how these functions work outside
of the scheduler, which is better in the long term. Thanks to Christopher for
spotting this mistake.
No backport is needed.
In conn_sock_shutw(), avoid calling shutdown() if linger_risk is set. Not
doing so will result in getting sockets in TIME_WAIT for some time.
This is particularly observable with health checks.
This should be backported to 1.9.
The function really only operates on tasklets: its arguments are always
tasklets cast as tasks to match the function's type, only to be cast back to
a struct tasklet. Let's rename it to tasklet_remove_from_tasklet_list(),
take a struct tasklet, and get rid of the undesired task casts.
It's really confusing to call it a task because it's a tasklet and used
in places where tasks and tasklets are used together. Let's rename it
to tasklet to remove this confusion.
The first one, HTX_SL_F_HAS_SCHM, will be used to know whether the request has an
explicit scheme. So, in H2, it is always true because the pseudo-header
":scheme" is mandatory. In H1, it is only true when an absolute URI is found on
the start-line. The other flags, HTX_SL_F_SCHM_HTTP and HTX_SL_F_SCHM_HTTPS,
will be used to know which scheme the request has. For now, other protocols are
not handled.
The aim of these flags is to pass this information to the backend side in
general, and to the H2 mux in particular. So the multiplexer will have a chance
to use this information to send the right scheme to the server.
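On the H1 side, the detection can be sketched like this (the flag values and
names are illustrative only):

    #include <strings.h>

    #define SK_HAS_SCHM   0x01
    #define SK_SCHM_HTTP  0x02
    #define SK_SCHM_HTTPS 0x04

    /* In H1 only an absolute URI carries a scheme; in H2 the HAS_SCHM flag
     * would always be set because the ":scheme" pseudo-header is mandatory.
     */
    static unsigned int scheme_flags_sketch(const char *uri)
    {
        if (strncasecmp(uri, "https://", 8) == 0)
            return SK_HAS_SCHM | SK_SCHM_HTTPS;
        if (strncasecmp(uri, "http://", 7) == 0)
            return SK_HAS_SCHM | SK_SCHM_HTTP;
        return 0;  /* origin-form URI: no explicit scheme */
    }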
In the HTX structure, the field <first> is used to know where to (re)start the
analysis. It may differ from the message's head. It is especially important to
update it to handle 1xx messages, to be sure to restart the analysis on the next
message (another 1xx message or the final one). It is also updated when some
data are forwarded (the headers or part of the body). But this update is an
error and must never be done at the analysis level. It is a bug, because some
sample fetches may be used after the data forwarding (but before the first send
of course). At this stage, if the first block position does not point to the
start-line, most HTTP sample fetches fail.
So now, when something is forwarded by the HTX analyzers, the first block
position is not updated anymore.
This issue was reported on Github. See #119. No backport needed.
When channel_full() is called for an HTX stream, we fall back on the HTX
version. This function is called, among others, from tcp_inspect_request(). With
this patch, the inspect delay is respected again.
This patch must be backported to 1.9.
With both I/O and tasks in the same tasklet list, we now have a very
smooth and responsive scheduler, providing a good fairness between I/O
activities. With the lower layers relying on tasklets a lot (I/O wakeup,
subscribe, etc), there may often be a large number of totally autonomous
tasklets doing their business such as forwarding data between two muxes.
But the task scheduler historically refrained from picking tasks from the
priority-ordered run queue to put them into the tasklet list until the latter
had fewer than max_runqueue_depth entries. This was to make sure that
low-latency, high-priority tasks would have an opportunity to be dequeued
before others even if they arrive late. But the counter used for this is
still the tasklet list size, which contains countless I/O events. This
causes an unfairness between unbounded I/Os and bounded tasks, resulting
for example in the CLI responding slower when forwarding 40 Gbps of HTTP
traffic spread over a thousand connections.
A good solution consists in sticking to the initial intent of
max_runqueue_depth which is to limit the number of tasks in the list
(to maintain fairness between them) and not to limit the number of these
tasks among tasklets. It just turns out that the task_list_size initially
was this task counter and changed over time to be a tasklet list size.
Let's simply refrain from updating it for pure tasklets so that it takes
back its original role of counting real tasks as its name implies. With
this change the CLI becomes instantly responsive under load again.
This patch may possibly be backported to 1.9 though it requires some
careful checks.
The function htx_add_data_before() was removed because it was buggy. The
function htx_move_blk_before() may be used if necessary to do something
equivalent, except it just moves blocks. It doesn't handle the adding.
In an HTX message, there may be 2 free rooms available to store a new block.
The first one is between the blocks and their payload. Blocks are added
starting from the end of the buffer and their payloads are added starting from
the beginning. So the first free room is between these 2 edges. The second one
is at the beginning of the buffer, when we start to wrap to add new payloads.
Once we start to use
this one, the other one is ignored until the next defragmentation of the HTX
message.
In theory, there is no problem. But in practice, some shortcomings in the HTX
structure force us to defragment HTX messages too often to always be in a
known state. The second free room is not tracked as it should be and the first
one may easily be corrupted when rewrites happen.
So to fix the problem and avoid unnecessary defragmentation, the HTX structure
has been refactored. The front (the position of the block whose payload is
stored just before the blocks table) is no longer stored. Instead we keep the
relative addresses of 3 edges:
* tail_addr : The start address of the free space in front of the blocks table
* head_addr : The start address of the free space at the beginning
* end_addr : The end address of the free space at the beginning
Here is the general view of the HTX message now:
         head_addr    end_addr                  tail_addr
             |            |                         |
             V            V                         V
+------------+------------+------------+------------+------------------+
|            |            |            |            |                  |
|  PAYLOAD   | Free space |  PAYLOAD   | Free space |    Blocks area   |
|    ==>     |     1      |    ==>     |     2      |       <==        |
+------------+------------+------------+------------+------------------+
<head_addr> is always lower than or equal to <end_addr> and <tail_addr>.
<end_addr> is always lower than or equal to <tail_addr>.
In addition, to simplify everything, the blocks area is now contiguous. It
doesn't wrap anymore. So the head is always the block with the lowest position,
and the tail is always the one with the highest position.
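With these three edges, computing both free rooms becomes straightforward
(simplified struct with illustrative names):

    #include <stdint.h>

    struct htx_edges_sketch {
        uint32_t tail_addr;   /* start of the free space in front of the blocks table */
        uint32_t head_addr;   /* start of the free space at the beginning */
        uint32_t end_addr;    /* end of the free space at the beginning */
    };

    /* Free room 2: between the last payload and the blocks table, whose
     * start address <blocks_start> is derived from the tail block position.
     */
    static inline uint32_t htx_room_before_blocks_sketch(const struct htx_edges_sketch *e,
                                                         uint32_t blocks_start)
    {
        return blocks_start - e->tail_addr;
    }

    /* Free room 1: the area at the beginning, used once payloads wrap. */
    static inline uint32_t htx_room_at_start_sketch(const struct htx_edges_sketch *e)
    {
        return e->end_addr - e->head_addr;
    }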
The function htx_add_data_before() is buggy and cannot work. It first adds a
data block and then moves it before another one, passed as argument. The problem
happens when a defragmentation is done to add the new block. In this case, the
reference is no longer valid, because the blocks are rearranged. So, instead of
moving the new block before the reference, it is moved to the head of the HTX
message.
So this function has been removed. It was only used by the compression filter to
add a last data block before a TLR, EOT or EOM block. Now, the new function
htx_add_last_data() is used. It adds a last data block, after all others and
before any TLR, EOT or EOM block. Then, the next block is retrieved. It is the
first non-data block after the data in the HTX message. The compression loop
continues with it.
This patch must be backported to 1.9.