When commit 0350b90e3 ("MEDIUM: htx: make htx_add_data() never defragment
the buffer") was introduced, it made htx_add_data() actually be able to
add less data than it was asked for, and the callers must use the returned
value to know how much was added. The H2 code used to rely on the frame
length instead of the return value. A version of the code doing this was
written but is obviously not the one that got merged, resulting in breaking
large uploads or downloads when HTX would have instead defragmented the
buffer because the HTX side sees less contents than what the H2 side sees.
This patch fixes this again. No backport is needed.
Olivier found that commit 99ad1b3e8 ("MINOR: mux-h2: stop relying on
CS_FL_REOS") managed to break abortonclose again with H2. What happens
is that while the CS_FL_REOS flag was set on some transitions to the
HREM state, it's not set on all and is in fact only set when the low
level connection is closed. So making the replacement condition match
the HREM and ERROR states is not correct and causes completely correct
requests to send advertise an early close of the connection layer while
only the stream's input is closed.
In order to avoid this, we now properly split the checks for the CLOSED
state and for the closed connection. This way there is no risk to set
the EOS flag too early on the connection.
No backport is needed.
The function really only operates on tasklets, its arguments are always
tasklets cast as tasks to match the function's type, to be cast back to
a struct tasklet. Let's rename it to tasklet_remove_from_tasklet_list(),
take a struct tasklet, and get rid of the undesired task casts.
It's really confusing to call it a task because it's a tasklet and used
in places where tasks and tasklets are used together. Let's rename it
to tasklet to remove this confusion.
By default, the scheme "https" is always used. But when an explicit scheme was
defined and when this scheme is "http", we use it in the request sent to the
server. This is done by checking flags of the start-line. If the flag
HTX_SL_F_HAS_SCHM is set, it means an explicit scheme was defined on the client
side. And if the flag HTX_SL_F_SCHM_HTTP is set, it means the scheme "http" was
used.
Instead of using the macro MAX_HTTP_HDR to limit the number of headers parsed
before throwing an error, we now use the custom global variable
global.tune.max_http_hdr.
This patch must be backported to 1.9.
There seems to be a tricky case in the H2 mux related to stream flow
control versus buffer a full situation : is a large response cannot
be entirely sent to the client due to the stream window being too
small, the stream is paused with the SFCTL flag. Then the upper
layer stream might get bored and expire this stream. It will then
shut it down first. But the shutdown operation might fail if the
mux buffer is full, resulting in the h2s being subscribed to the
deferred_shut event with the stream *not* added to the send_list
since it's blocked in SFCTL. In the mean time the upper layer completely
closes, calling h2_detach(). There we have a send_wait (the pending
shutw), the stream is marked with SFCTL so we orphan it.
Then if the client finally reads all the data that were clogging the
buffer, the send_list is run again, but our stream is not there. From
this point, the connection's stream list is not empty, the mux buffer
is empty, so the connection's timeout is not set. If the client
disappears without updating the stream's window, nothing will expire
the connection.
This patch makes sure we always keep the connection timeout updated.
There might be finer solutions, such as checking that there are still
living streams in the connection (i.e. streams not blocked in SFCTL
state), though this is not necessarily trivial nor useful, since the
client timeout is the same for the upper level stream and the connection
anyway.
This patch needs to be backported to 1.9 and 1.8 after some observation.
This type of blocks is useless because transition between data and trailers is
obvious. And when there is no trailers, the end-of-message is still there to
know when data end for chunked messages.
HTTP trailers are now parsed in the same way headers are. It means trailers are
converted to K/V blocks followed by an end-of-trailer marker. For now, to make
things simple, the type for trailer blocks are not the same than for header
blocks. But the aim is to make no difference between headers and trailers by
using the same type. Probably for the end-of marker too.
Usually when calling offer_buffer(), we don't expect to offer it to
ourselves. But with h2 we have the same buffer_wait for the two directions
so we can unblock the recv path when completing a send(), or we can unblock
part of the mux buffer after sending the first few buffers that we managed
to collect. Thus it is important to always accept to wake up any requester.
A few parts of this patch could possibly be backported but earlier versions
already have other issues related to low-buffer condition so it's not sure
it's worth taking the risk to make things worse.
The test for the mux alloc failure in h2_send() right after an attempt
at h2_process_mux() used to make sense as it tried to detect that this
latter failed to produce data. But now that we have a list of buffers,
it is a perfectly valid situation where there can still be data in the
buffer(s).
So now when we see this flag we only declare it's the last run on the
loop. In addition we need to make sure we break out of the loop on
snd_buf failure, or we'll loop indefinitely, for example when the buf
is full and we can't send.
No backport is needed.
In h2c_frt_stream_new, if we failed to create the stream for some reason,
don't forget to set h2s->cs to NULL before calling h2s_destroy(), otherwise
h2s_destroy() will call h2s_close(), which will attempt to access
h2s->cs->flags if it's non-NULL.
This should be backported to 1.9.
In lock profiles it's visible that there is a huge contention on the
buffer lock. The reason is that when offer_buffers() is called, it
systematically takes the lock before verifying if there is any
waiter. However doing so doesn't protect against races since a
waiter can happen just after we release the lock as well. Similarly
in h2 we take the lock every time an h2c is going to be released,
even without checking that the h2c belongs to a wait list. These
two have now been addressed by verifying non-emptiness of the list
prior to taking the lock.
In order to later allow htx_add_data() to transmit partial blocks and
avoid defragmenting the buffer, we'll need to return the number of bytes
consumed. This first modification makes the function do this and its
callers take this into account. At the moment the function still works
atomically so it returns either the block size or zero. However all
call places have been adapted to consider any value between zero and
the block size.
1xx informational messages (all except 101) are now part of the HTTP reponse,
semantically speaking. These messages are not followed by an EOM anymore,
because a final reponse is always expected. All these parts can also be
transferred to the channel in same time, if possible. The HTX response analyzer
has been update to forward them in loop, as the legacy one.
In the function htx_xfer_blks(), we take care to transfer all headers in one
time. When the current block is a start-line, we check if there is enough space
to transfer all headers too. If not, and if the destination is empty, a parsing
error is reported on the source.
The H2 multiplexer is the only one to use this function. When a parsing error is
reported during the transfer, the flag CS_FL_EOI is also set on the conn_stream.
Now, the SI calls h2_rcv_buf() with the right count value. So we can rely on
it. Unlike the H1 multiplexer, it is fairly easier for the H2 multiplexer
because the HTX message already exists, we only transfer blocks from the H2S to
the channel. And this part is handled by htx_xfer_blks().
This patch makes the function more accurate. Thanks to the function
htx_get_max_blksz(), the transfer of data has been simplified. Note that now the
total number of bytes copied (metadata + payload) is returned. This slighly
change how the function is used in the H2 multiplexer.
in the H2 multiplexer, when a HEADERS frame is built before sending it, we have
the warranty the start-line is the head of the HTX message. It is safer to rely
on this fact than on the sl_pos value. For now, it's safe to use sl_pos in muxes
because HTTP 1xx messages are considered as full messages in HTX and only one
HTTP message can be stored at a time in HTX. But we are trying to handle 1xx
messages as a part of the reponse message. In this way, an HTTP reponse will be
the sum of all 1xx informational messages followed by the final response. So it
will be possible to have several start-line in the same HTX message. And the
sl_pos will point to the first unprocessed start-line from the analyzers point
of view.
Now when we fail to send because the mux buffer is full, before giving
up and marking MFULL, we try to allocate another buffer in the mux's
ring to try again. Thanks to this (and provided there are enough buffers
allocated to the mux's ring), a single stream picked in the send_list
cannot steal all the mux's room at once. For this, we expand the ring
size to 31 buffers as it seems to be optimal on benchmarks since it
divides the number of context switches by 3. It will inflate each H2
conn's memory by 1 kB.
The bandwidth is now much more stable. Prior to this, it a test on
h2->h1 with very large objects (1 GB), a few tens of connections and
a few tens of streams per connection would show a varying performance
between 34 and 95 Gbps on 2 cores/4 threads, with h2_snd_buf() stopped
on a buffer full condition between 300000 and 600000 times per second.
Now the performance is constantly between 88 and 96 Gbps. Measures show
that buffer full conditions are met around only 159 times per second
in this case, or rougly 2000 to 4000 times less often.
This makes the code more readable and reduces the calls to br_tail().
In addition, all calls to h2_get_buf() are now made via this local
variable, which should significantly help for retries.
Now send() uses a loop to iterate over all buffers to be sent. These
buffers are released and deleted from the vector once completely sent.
If any buffer gets released, offer_buffers() is called to wake up some
waiters.
For now it's only one buffer long so the head and tails are always the
same, thus it doesn't change what used to work. In short, br_tail(h2c->mbuf)
was inserted everywhere we used to have h2c->mbuf.
Transferring large objects over H2 sometimes shows unexplained performance
variations. A long analysis resulted in the following discovery. Often the
mux buffer looks like this :
[ empty_head | data | empty_tail ]
Typical numbers are (very common) :
- empty_head = 31
- empty_tail = 16 (total free=47)
- data = 16337
- size = 16384
- data to copy: 43
The reason for these holes are the blocking factors that are not always
the same in and out (due to keeping 9 bytes for the frame size, or the
56 bytes corresponding to the HTX header). This can easily happen 10000
times a second if the network bandwidth permits it!
In this case, while copying a DATA frame we find that the buffer has its
free space wrapped so we decide to realign it to optimize the copy. It's
possible that this practice stems from the code used to emit headers,
which do not support fragmentation and which had no other option left.
But it comes with two problems :
- we don't check if the data fits, which results in a memcpy for nothing
- we can move huge amounts of data to just copy a small block.
This patch addresses this two ways :
- first, by not forcing a data realignment if what we have to copy does
not fit, as this is totally pointless ;
- second, by refusing to move too large data blocks. The threshold was
set to 1 kB, because it may make sense to move 1 kB of data to copy
a 15 kB one at once, which will leave as a single 16 kB block, but
it doesn't make sense to mvoe 15 kB to copy just 1 kB. In all cases
the data would fit and would just be split into two blocks, which is
not very expensive, hence the low limit to 1 kB
With such changes, realignments are very rare, they show up around once
every 15 seconds at 60 Gbps, and look like this, resulting in a much more
stable bit rate :
buf=0x7fe6ec0c3510,h=16333,d=35,s=16384 room=16349 in=16337
This patch should be safe for backporting to 1.9 if some performance
issues are reported there.
In HTX, when a HEADERS frame is formatted before sending it to the client or the
server, If an EOM is found because there is no body, we must count it in the
number bytes sent.
This patch must be backported to 1.9.
It is not legal to subscribe if we're already subscribed, or to unsubscribe
if we did not subscribe, so instead of trying to handle those cases, just
assert that it's ok using the new BUG_ON() macro.
Just like CS_FL_REOS previously, the CS_FL_EOI flag is abused as a proxy
for H2_SF_ES_RCVD. The problem is that this flag is consumed by the
application layer and is set immediately when an end of stream was met,
which is too early since the application must retrieve the rxbuf's
contents first. The effect is that some transfers are truncated (mostly
the first one of a connection in most tests).
The problem of mixing CS flags and H2S flags in the H2 mux is not new
(and is currently being addressed) but this specific one was emphasized
in commit 63768a63d ("MEDIUM: mux-h2: Don't mix the end of the message
with the end of stream") which was backported to 1.9. Note that other
flags, particularly CS_FL_REOS still need to be asynchronously reported,
though their impact seems more limited for now.
This patch makes sure that all internal uses of CS_FL_EOI are replaced
with a test on H2_SF_ES_RCVD (as there is a 1-to-1 equivalence) and that
CS_FL_EOI is only reported once the rxbuf is empty.
This should ideally be backported to 1.9 unless it causes too much
trouble due to the recent changes in this area, as 1.9 *seems* not
to be directly affected by this bug.
This flag was introduced early in 1.9 development (a3f7efe00) to report
the fact that the rxbuf that was present on the conn_stream was followed
by a shutr. Since then the rxbuf moved from the conn_stream to the h2s
(638b799b0) but the flag remained on the conn_stream. It is problematic
because some state transitions inside the mux depend on it, thus depend
on the CS, and as such have to test for its existence before proceeding.
This patch replaces the test on CS_FL_REOS with a test on the only
states that set this flag (H2_SS_CLOSED, H2_SS_HREM, H2_SS_ERROR).
The few places where the flag was set were removed (the flag is not
used by the data layer).
This flag is currently set when an incoming close was received, which
results in the stream being in either H2_SS_HREM, H2_SS_CLOSED, or
H2_SS_ERROR states, so let's remove the test for the OPEN and HLOC
cases.
If the stream closes and quits while there's no room in the mux buffer
to send an RST frame, next time it is attempted it will not lead to
the connection being closed because the conn_stream will have been
released and the KILL_CONN flag with it as well.
This patch reserves a new H2_SF_KILL_CONN flag that is copied from
the CS when calling shut{r,w} so that the stream remains autonomous
on this even when the conn_stream leaves.
This should ideally be backported to 1.9 though it depends on several
previous patches that may or may not be suitable for backporting. The
severity is very low so there's no need to insist in case of trouble.
In h2s_wake_one_stream() we used to rely on the temporary flags used to
adjust the CS to determine the new h2s state. This really is not convenient
and creates far too many dependencies. This commit just moves the same
condition to the places where the temporary flags were set so that we
don't have to rely on these anymore. Whether these are relevant or not
was not the subject of the operation, what matters was to make sure the
conditions to adjust the stream's state and the CS's flags remain the
same. Later it could be studied if these conditions are correct or not.
h2s_wake_one_stream() has access to all the required elements to update
the connstream's flags and figure the necessary state transitions, so
let's move the conditions there from h2_wake_some_streams().
It's problematic to have to pass some CS flags to this function because
that forces some h2s state transistions to update them just in time
while some of them are supposed to only be updated during I/O operations.
As a first step this patch transfers the decision to pass CS_FL_ERR_PENDING
from the caller to the leaf function h2s_wake_one_stream(). It is easy
since this is the only flag passed there and it depends on the position of
the stream relative to the last_sid if it was set.
h2_wake_some_streams() first looks up streams whose IDs are greater than
or equal to last+1, then checks if the id is lower than or equal to last,
which by definition will never match. Let's remove this confusing leftover
from ancient code.
These functions may fail to emit an RST or an empty DATA frame because
the mux is full or busy. Then they subscribe the h2s and try again.
However when doing so, they will already have marked the error state on
the stream and will not pass anymore through the sequence resulting in
the failed frame to be attempted to be sent again nor to the close to
be done, instead they will return a success.
It is important to only leave when the stream is already closed, but
to go through the whole sequence otherwise.
This patch should ideally be backported to 1.9 though it's possible that
the lack of the WANT_SHUT* flags makes this difficult or dangerous. The
severity is low enough to avoid this in case of trouble.
It was only set and not consumed after the previous change. The reason
is that the task's context always contains the relevant information,
so there is no need for a second pointer.
This one used to rely on the combined return statuses of the shutr/w
functions but now that we have the H2_SF_WANT_SHUT{R,W} flags we don't
need this anymore if we properly remove these flags after their operations
succeed. This is what this patch does.
Currently when a shutr/shutw fails due to lack of buffer space, we abuse
the wait_event's handle pointer to place up to two bits there in addition
to the original pointer. This pointer is not used for anything but this
and overall the intent becomes clearer with h2s flags than with these
two alien bits in the pointer, so let's use clean flags now.
Lots of places were using LIST_ISEMPTY() to detect if a stream belongs
to one of the send lists or to detect if a connection was already
waiting for a buffer or attached to an idle list. Since these ones are
not list heads but list elements, let's use LIST_ADDED() instead.
In this long thread, Maciej Zdeb reported that the H2 mux was still
going through endless loops from time to time :
https://www.mail-archive.com/haproxy@formilux.org/msg33709.html
What happens is the following :
- in h2s_frt_make_resp_data() we can set H2_SF_BLK_SFCTL and remove the
stream from the send_list
- then in h2_shutr() and h2_shutw(), we check if the list is empty before
subscribing the element, which is true after the case above
- then in h2c_update_all_ws() we still have H2_SF_BLK_SFCTL with the item
in the send_list, thus LIST_ADDQ() adds it a second time.
This patch adds a check of list emptiness before performing the LIST_ADDQ()
when the flow control window opens. Maciej reported that it reliably fixed
the problem for him.
As later discussed with Olivier, this fixes the consequence of the issue
rather than its cause. The root cause is that a stream should never be in
the send_list with a blocking flag set and the various places that can lead
to this situation must be revisited. Thus another fix is expected soon for
this issue, which will require some observation. In the mean time this one
is easy enough to validate and to backport.
Many thanks to Maciej for testing several versions of the patch, each
time providing detailed traces which allowed to nail the problem down.
This patch must be backported to 1.9.
When we have to stop sending due to the stream flow control, don't check
if send_wait is NULL to know if we're in the send_list, because at this
point it'll always be NULL, while we're probably in the list.
Use LIST_ISEMPTY(&h2s->list) instead.
Failing to do so mean we might be added in the send_list when flow control
allows us to emit again, while we're already in it.
While I'm here, replace LIST_DEL + LIST_INIT by LIST_DEL_INIT.
This should be backported to 1.9.
In h2_detach(), if we still have a send_wait pointer, because we woke the
tasklet up, but it hasn't ran yet, explicitely set send_wait to NULL after
we removed the tasklet from the task list.
Failure to do so may lead to crashes if the h2s isn't immediately destroyed,
because we considered there were still something to send.
This should be backported to 1.9.
A typo was introduced in the following commit : 927b88ba0 ("BUG/MAJOR:
mux-h2: fix race condition between close on both ends") making the test
on h2s->cs never being done and h2c->cs being dereferenced without being
tested. This also confirms that this condition does not happen on this
side but better fix it right now to be safe.
This must be backported to 1.9.
Since previous commit it's not needed anymore to test a task pointer
before calling task_destory() so let's just remove these tests from
the various callers before they become confusing. The function's
arguments were also documented. The same should probably be done
with tasklet_free() which involves a test in roughly half of the
call places.
If when receiving an H2 response we fail to add an EOM block after too
large a trailers block, we must not leave the trailers block alone as it
violates the internal assumptions by not being followed by an EOM, even
when an error is reported. We must then make sure the error will safely
be reported to upper layers and that no attempt will be made to forward
partial blocks.
This must be backported to 1.9.
In message https://www.mail-archive.com/haproxy@formilux.org/msg33541.html
Patrick Hemmer reported an interesting bug affecting H2 and trailers.
The problem is that in order to close the stream we have to see the EOM
block, but nothing guarantees it will atomically be delivered with the
trailers block(s). So the code currently waits for it by returning zero
when it was not found, resulting in the caller (h2_snd_buf()) to loop
forever calling it again.
The current internal connection/connstream API doesn't allow a send
actor to notify its caller that it cannot process the data until it
gets more, so even returning zero will only lead to calls in loops
without any guarantee that any progress will be made.
Some late amendments to HTX already guaranteed the atomicity of the
trailers block during snd_buf(), which is currently ensured by the
fact that producers create exactly one such trailers block for all
trailers. So in practice we can only loop between trailers and EOM.
This patch changes the behaviour by making h2s_htx_make_trailers()
become atomic by not consuming the EOM block. This way either it finds
the end of trailers marker (empty line) or it fails. Once it sends the
trailers block, ES is set so the stream turns HLOC or CLOSED. Thanks
to previous patch "MEDIUM: mux-h2: discard contents that are to be sent
after a shutdown" is is now safe to interrupt outgoing data processing,
and the late EOM block will silently be discarded when the caller
finally sends it.
This is a bit tricky but should remain solid by design, and seems like
the only option we have that is compatible with 1.9, where it must be
backported along with the aforementioned patch.
In h2_snd_buf() we discard any possible buffer contents requested to be
sent after a close or an error. But in practice we can extend this to
any case where the stream is locally half-closed since it means we will
never be able to send these data anymore.
For now it must not change anything, but it will be used by subsequent
patches to discard lone a HTX EOM block arriving after the trailers
block.
In case a header frame carrying trailers just fits into the HTX buffer
but leaves no room for the EOM block, we used to return the same code
as the one indicating we're missing data. This could would result in
such frames causing timeouts instead of immediate clean aborts. Now
they are properly reported as stream errors (since the frame was
decoded and the compression context is still synchronized).
This must be backported to 1.9.
When sending trailers, we may face an empty HTX trailers block or even
have to discard some of the headers there and be left with nothing to
send. RFC7540 forbids sending of empty HEADERS frames, so in this case
we turn to DATA frames (which is possible since after other DATA).
The code used to only check the input frame's contents to decide whether
or not to switch to a DATA frame, it didn't consider the possibility that
the frame only used to contain headers discarded later, thus it could still
emit an empty HEADERS frame in such a case. This patch makes sure that the
output frame size is checked instead to take the decision.
This patch must be backported to 1.9. In practice this situation is never
encountered since the discarded headers have really nothing to do in a
trailers block.
In h2c_decode_headers(), now that we support CONTINUATION frames, we
try to defragment all pending frames at once before processing them.
However if the first is exactly full and the second cannot be parsed,
we don't detect the problem and we wait for the next part forever due
to an incorrect check on exit; we must abort the processing as soon as
the current frame remains full after defragmentation as in this case
there is no way to make forward progress.
Thanks to Yves Lafon for providing traces exhibiting the problem.
This must be backported to 1.9.
For most of the xprt methods, provide a xprt_ctx. This will be useful later
when we'll want to be able to stack xprts.
The init() method now has to create and provide the said xprt_ctx if needed.
task_delete() was never used without calling task_free() just after, and
task_free() was only used on error pathes to destroy a just-created task,
so merge them into task_destroy(), that will remove the task from the
wait queue, and make sure the task is either destroyed immediately if it's
not in the run queue, or destroyed when it's supposed to run.
Instead of abusing the SUB_CALL_UNSUBSCRIBE flag, revamp the H2 code a bit
so that it just checks if h2s->sending_list is empty to know if the tasklet
of the stream_interface has been waken up or not.
send_wait is now set to NULL in h2_snd_buf() (ideally we'd set it to NULL
as soon as we're waking the tasklet, but it can't be done, because we still
need it in case we have to remove the tasklet from the task list).
In h2_subscribe(), don't add ourself to the send_list if we're already in it.
That may happen if we try to send and fail twice, as we're only removed
from the send_list if we managed to send data, to promote fairness.
Failing to do so can lead to either an infinite loop, or some random crashes,
as we'd get the same h2s in the send_list twice.
This should be backported to 1.9.
In the h1 and h2 muxes, make sure we unsubscribed before destroying the
mux context.
Failing to do so will lead in a segfault later, as the connection will
attempt to dereference its conn->send_wait or conn->recv_wait, which pointed
to the now-free'd mux context.
This was introduced by commit 39a96ee16e, so
should only be backported if that commit gets backported.
When a mux context is released, we must be sure it exists before dereferencing
it. The bug was introduced in the commit 39a96ee16 ("MEDIUM: muxes: Be prepared
to don't own connection during the release").
No need to backport this patch, expect if the commit 39a96ee16 is backported
too.
This happens during mux upgrades. In such case, when the destroy() callback is
called, the connection points to a different mux's context than the one passed
to the callback. It means the connection is owned by another mux. The old mux is
then released but the connection is not closed.
It is mandatory to handle mux upgrades, because during a mux upgrade, the
connection will be reassigned to another multiplexer. So when the old one is
destroyed, it does not own the connection anymore. Or in other words, conn->ctx
does not point to the old mux's context when its destroy() callback is
called. So we now rely on the multiplexer context do destroy it instead of the
connection.
In addition, h1_release() and h2_release() have also been updated in the same
way.
The mux's callback init() now take a pointer to a buffer as extra argument. It
must be used by the multiplexer as its input buffer. This buffer is always NULL
when a multiplexer is initialized with a fresh connection. But if a mux upgrade
is performed, it may be filled with existing data. Note that, for now, mux
upgrades are not supported. But this commit is mandatory to do so.
Instead of using the connection context to make the difference between a
frontend connection and a backend connection, we now rely on the function
conn_is_back().
A multiplexer must now set the flag MX_FL_HTX when it uses the HTX to structured
the data exchanged with channels. the muxes h1 and h2 set this flag. Of course,
for the mux h2, it is set on h2_htx_ops only.
Instead of using the same mux_ops structure for the legacy HTTP mode and the HTX
mode, a dedicated mux_ops is now used for the HTX mode. Same callbacks are used
for both. But the flags may be different depending on the mode used.
Modify h2c_restart_reading() to add a new parameter, to let it know if it
should consider if the buffer isn't empty when retrying to read or not, and
call h2c_restart_reading() using 0 as a parameter from h2_process_demux().
If we're leaving h2_process_demux() with a non-empty buffer, it means the
frame is incomplete, and we're waiting for more data, and if we already
subscribed, we'll be waken when more data are available.
Failing to do so means we'll be waken up in a loop until more data are
available.
This should be backported to 1.9.
Recent commit 63768a63d ("MEDIUM: mux-h2: Don't mix the end of the message
with the end of stream") introduced a race which may manifest itself with
small connection counts on large objects and large server timeouts in
legacy mode. Sometimes h2s_close() is called while the data layer is
subscribed to read events but nothing in the chain can cause this wake-up
to happen and some streams stall for a while at the end of a transfer
until the server timeout strikes and ends the stream completes.
We need to wake the stream up if it's subscribed to rx events there,
which is what this patch does. When the patch above is backported to
1.9, this patch will also have to be backported.
Previous commit 3ea351368 ("BUG/MEDIUM: h2: Remove the tasklet from the
task list if unsubscribing.") uncovered an issue which needs to be
addressed in the scheduler's API. The function task_remove_from_task_list()
was initially designed to remove a task from the running tasklet list from
within the scheduler, and had to be used in h2 to abort pending I/O events.
However this function was not designed to be idempotent, occasionally
causing a double removal from the tasklet list, with the second doing
nothing but affecting the apparent tasks count and making haproxy use
100% CPU on some tests consisting in stopping the client during some
transfers. The h2_unsubscribe() function can sometimes be called upon
stream exit after an error where the tasklet was possibly already
removed, so it.
This patch does 2 things :
- it renames task_remove_from_task_list() to
__task_remove_from_tasklet_list() to discourage users from calling
it. Also note the fix in the naming since it's a tasklet list and
not a task list. This function is still uesd from the scheduler.
- it adds a new, idempotent, task_remove_from_tasklet_list() function
which does nothing if the task is already not in the tasklet list.
This patch will need to be backported where the commit above is backported.
In h2_unsubscribe(), if we unsubscribe on SUB_CALL_UNSUBSCRIBE, then remove
ourself from the sending_list, and remove the tasklet from the task list.
We're probably about to destroy the stream anyway, so we don't want the
tasklet to run, or to stay in the sending_list, or it could lead to a crash.
This should be backpored to 1.9.
In h2_deferred_shut(), don't just set h2s->send_wait to NULL, instead, use
the same logic as in h2_snd_buf() and only do so if we successfully sent data
(or if we don't want to send them anymore). Setting it to NULL can lead to
crashes.
This should be backported to 1.9.
In h2s_notify_send(), use the new sending_list instead of using the old
way of setting hs->send_wait to NULL, failing to do so may lead to crashes.
This should be backported to 1.9.
In h2_deferred_shut(), only attempt to destroy the h2s if h2s->cs is NULL.
h2s->cs being non-NULL means it's still referenced by the stream interface,
so it may try to use it later, and that could lead to a crash.
This should be backported to 1.9.
The H2 multiplexer now sets CS_FL_EOI when it receives a frame with the ES
flag. And when the H2 streams is closed, it set the flag CS_FL_REOS.
This patch should be backported to 1.9.
On the send path, try to be fair, and make sure the first to attempt to
send data will actually be the first to send data when it's possible (ie
when the mux' buffer is not full anymore).
To do so, use a separate list element for the sending_list, and only remove
the h2s from the send_list/fctl_list if we successfully sent data. If we did
not, we'll keep our place in the list, and will be able to try again next time.
This should be backported to 1.9.
Some functions' roles and usage are far from being obvious, and diving
into this part each time requires deep concentration before starting to
understand who does what. Let's add a few comments which help figure
some of the useful pieces.
We tend to refrain from sending data a bit too much in the H2 mux :
whenever there are pending data in the buffer and we try to copy
something larger than 1/4 of the buffer we prefer to pause. This
is suboptimal for medium-sized objects which have to send their
headers and later their data.
This patch slightly changes this by allowing a copy of a large block
if it fits at once and if the realign cost is small, i.e. the pending
data are small or the block fits in the contiguous area. Depending on
the object size this measurably improves the download performance by
between 1 and 10%, and possibly lowers the transfer latency for medium
objects.
In h2_stop_senders(), when we're about to move the h2s about to send back
to the send_list, because we know the mux is full, instead of putting them
all in the send_list, put them back either in the fctl_list or the send_list
depending on if they are waiting for the flow control or not. This also makes
sure they're inserted in their arrival order and not reversed.
This should be backported to 1.9.
In h2_detach(), don't bother keeping the h2s even if it was waiting for
flow control if we no longer are subscribed for receiving or sending, as
nobody will do anything once we can write in the mux, anyway. Failing to do
so may lead to h2s being kept opened forever.
This should be backported to 1.9.
If we're waiting until we can send a shutr and/or a shutw, once we're done
and not considering sending anything, destroy the h2s, and eventually the
h2c if we're done with the whole connection, or it will never be done.
This should be backported to 1.9.
For conveniance, in HTTP muxes (h1 and h2), the end of the stream and the end of
the message are reported the same way to the stream, by setting the flag
CS_FL_EOS. In the stream-interface, when CS_FL_EOS is detected, a shutdown for
read is reported on the channel side. This is historical. With the legacy HTTP
layer, because the parsing is done by the stream in HTTP analyzers, the EOS
really means a shutdown for read.
Most of time, for muxes h1 and h2, it works pretty well, especially because the
keep-alive is handled by the muxes. The stream is only used for one
transaction. So mixing EOS and EOM is good enough. But not everytime. For now,
client aborts are only reported if it happens before the end of the request. It
is an error and it is properly handled. But because the EOS was already
reported, client aborts after the end of the request are silently
ignored. Eventually an error can be reported when the response is sent to the
client, if the sending fails. Otherwise, if the server does not reply fast
enough, an error is reported when the server timeout is reached. It is the
expected behaviour, excpect when the option abortonclose is set. In this case,
we must report an error when the client aborts. But as said before, this event
can be ignored. So to be short, for now, the abortonclose is broken.
In fact, it is a design problem and we have to rethink all channel's flags and
probably the conn-stream ones too. It is important to split EOS and EOM to not
loose information anymore. But it is not a small job and the refactoring will be
far from straightforward.
So for now, temporary flags are introduced. When the last read is received, the
flag CS_FL_READ_NULL is set on the conn-stream. This way, we can set the flag
SI_FL_READ_NULL on the stream interface. Both flags are persistant. And to be
sure to wake the stream, the event CF_READ_NULL is reported. So the stream will
always have the chance to handle the last read.
This patch must be backported to 1.9 because it will be used by another patch to
fix the option abortonclose.
According to the H2 spec (see #8.1.4), setting the REFUSED_STREAM error code
is a way to indicate that the stream is being closed prior to any processing
having occurred, such as when a server-side H1 keepalive connection is closed
without sending anything (which differs from the regular error case since
haproxy doesn't even generate an error message). Any request that was sent on
the reset stream can be safely retried. So, when a stream is closed, if no
data was ever sent back (ie. the flag H2_SF_HEADERS_SENT is not set), we can
set the REFUSED_STREAM error code on the RST_STREAM frame.
This patch may be backported to 1.9.
This only happens for server streams because their id is assigned when the first
message is sent. If these streams are not woken up, some events can be lost
leading to frozen streams. For instance, it happens when a server closes its
connection before sending its preface.
This patch must be backported to 1.9.
In order to allow the H2 parser to report parsing errors, we must make
sure to always pass the HTX_FL_PARSING_ERROR flag from the h2s htx to
the conn_stream's htx.
A crash in H2 was reported in issue #52. It turns out that there is a
small but existing race by which a conn_stream could detach itself
using h2_detach(), not being able to destroy the h2s due to pending
output data blocked by flow control, then upon next h2s activity
(transfer_data or trailers parsing), an ES flag may need to be turned
into a CS_FL_REOS bit, causing a dereference of a NULL stream. This is
a side effect of the fact that we still have a few places which
incorrectly depend on the CS flags, while these flags should only be
set by h2_rcv_buf() and h2_snd_buf().
All candidate locations along this path have been secured against this
risk, but the code should really evolve to stop depending on CS anymore.
This fix must be backported to 1.9 and possibly partially to 1.8.
The h2c_send_settings() function was initially made to serve on the
frontend. Here we don't need to advertise that we don't support PUSH
since we don't do that ourselves. But on the backend side it's
different because PUSH is enabled by default so we must announce that
we don't want the server to use it.
This must be backported to 1.9.
When chunked-encoding is used in HTX mode, a trailers HTX block is always
made due to the way trailers are currently implemented (verbatim copy of
the H1 representation). Because of this it's not possible to know when
processing data that we've reached the end of the stream, and it's up
to the function encoding the trailers (h2s_htx_make_trailers) to put the
end of stream. But when there are no trailers and only an empty HTX block,
this one cannot produce a HEADERS frame, thus it cannot send the END_STREAM
flag either, leaving the other end with an incomplete message, waiting for
either more data or some trailers. This is particularly visible with POST
requests where the server continues to wait.
What this patch does is transform the HEADERS frame into an empty DATA
frame when meeting an empty trailers block. It is possible to do this
because we've not sent any trailers so the other end is still waiting
for DATA frames. The check is made after attempting to encode the list
of headers, so as to minimize the specific code paths.
Thanks to Dragan Dosen for reporting the issue with a reproducer.
This fix must be backported to 1.9.
This creates a new tunable "tune.h2.max-frame-size" to adjust the
advertised max frame size. When not set it still defaults to the buffer
size. It is convenient to advertise sizes lower than the buffer size,
for example when using very large buffers.
For now, this can be only done when a content-length is specified. In fact, it
is the same value than h2s->body_len, the remaining body length according to
content-length. Setting this field allows the fast forwarding at the channel
layer, improving significantly data transfer for big objects.
This patch may be backported to 1.9.
1xx responses does not work in HTTP2 when the HTX is enabled. First of all, when
a response is parsed, only one HEADERS frame is expected. So when an interim
response is received, the flag H2_SF_HEADERS_RCVD is set and the next HEADERS
frame (for another interim repsonse or the final one) is parsed as a trailers
one. Then when the response is sent, because an EOM block is found at the end of
the interim HTX response, the ES flag is added on the frame, closing too early
the stream. Here, it is a design problem of the HTX. Iterim responses are
considered as full messages, leading to some ambiguities when HTX messages are
processed. This will not be fixed now, but we need to keep it in mind for future
improvements.
To fix the parsing bug, the flag H2_MSGF_RSP_1XX is added when the response
headers are decoded. When this flag is set, an EOM block is added into the HTX
message, despite the fact that there is no ES flag on the frame. And we don't
set the flag H2_SF_HEADERS_RCVD on the corresponding H2S. So the next HEADERS
frame will not be parsed as a trailers one.
To fix the sending bug, the ES flag is not set on the frame when an interim
response is processed and the flag H2_SF_HEADERS_SENT is not set on the
corresponding H2S.
This patch must be backported to 1.9.
It is especially important when some data are blocked in the RX buf and the
channel buffer is already full. In such case, instead of exiting the function
directly, we need to set right flags on the conn_stream. CS_FL_RCV_MORE and
CS_FL_WANT_ROOM must be set, otherwise, the stream-interface will subscribe to
receive events, thinking it is not blocked.
This bug leads to connection freeze when everything was received with some data
blocked in the RX buf and a channel full.
This patch must be backported to 1.9.
PiBa-NL reported that some servers don't fall back to the Host header when
:authority is absent. After studying all the combinations of Host and
:authority, it appears that we always have to send the latter, hence we
never need the former. In case of CONNECT method, the authority is retrieved
from the URI part, otherwise it's extracted from the Host field.
The tricky part is that we have to scan all headers for the Host header
before dumping other headers. This is due to the fact that we must emit
pseudo headers before other ones. One improvement could possibly be made
later in the request parser to search and emit the Host header immediately
if authority was not found. This would cost nothing on the vast marjority
of requests and make the lookup faster on output since Host would appear
first.
This fix must be backported to 1.9.
Till now we used to only rely on tune.h2.max-concurrent-streams but if
a peer advertises a lower limit this can cause streams to be reset or
even the conection to be killed. Let's respect the peer's value for
outgoing streams.
This patch should be backported to 1.9, though it depends on the following
ones :
BUG/MINOR: server: fix logic flaw in idle connection list management
MINOR: mux-h2: max-concurrent-streams should be unsigned
MINOR: mux-h2: make sure to only check concurrency limit on the frontend
MINOR: mux-h2: learn and store the peer's advertised MAX_CONCURRENT_STREAMS setting
We used not to take it into account because we only used the configured
parameter everywhere. This patch makes sure we can actually learn the
value advertised by the peer. We still enforce our own limit on top of
it however, to make sure we can actually limit resources or stream
concurrency in case of suboptimal server settings.
h2_has_too_many_cs() was renamed to h2_frt_has_too_many_cs() to make it
clear it's only used to throttle the frontend connection, and the call
places were adjusted to only call this code on a front connection. In
practice it was already the case since the H2_CF_DEM_TOOMANY flag is
only set there. But now the ambiguity is removed.
With variable connection limits, it's not possible to accurately determine
whether the mux is still in use by comparing usage and max to be equal due
to the fact that one determines the capacity and the other one takes care
of the context. This can cause some connections to be dropped before they
reach their stream ID limit.
It seems it could also cause some connections to be terminated with
streams still alive if the limit was reduced to match the newly computed
avail_streams() value, though this cannot yet happen with existing muxes.
Instead let's switch to usage reports and simply check whether connections
are both unused and available before adding them to the idle list.
This should be backported to 1.9.
We used to rely on a hint that a shutw() or shutr() without data is an
indication that the upper layer had performed a tcp-request content reject
and really wanted to kill the connection, but sadly there is another
situation where this happens, which is failed keep-alive request to a
server. In this case the upper layer stream silently closes to let the
client retry. In our case this had the side effect of killing all the
connection.
Instead of relying on such hints, let's address the problem differently
and rely on information passed by the upper layers about the intent to
kill the connection. During shutr/shutw, this is detected because the
flag CS_FL_KILL_CONN is set on the connstream. Then only in this case
we send a GOAWAY(ENHANCE_YOUR_CALM), otherwise we only send the reset.
This makes sure that failed backend requests only fail frontend requests
and not the whole connections anymore.
This fix relies on the two previous patches adding SI_FL_KILL_CONN and
CS_FL_KILL_CONN as well as the fix for the connection close, and it must
be backported to 1.9 and 1.8, though the code in 1.8 could slightly differ
(cs is always valid) :
BUG/MEDIUM: mux-h2: wait for the mux buffer to be empty before closing the connection
MINOR: stream-int: add a new flag to mention that we want the connection to be killed
MINOR: connstream: have a new flag CS_FL_KILL_CONN to kill a connection
When finishing to respond on a stream, a shutw() is called (resulting
in either an end of stream or RST), then h2_detach() is called, and may
decide to kill the connection is a number of conditions are satisfied.
Actually one of these conditions is that a GOAWAY frame was already sent
or attempted to be sent. This one is wrong, because it can happen in at
least these two situations :
- a shutw() sends a GOAWAY to obey tcp-request content reject
- a graceful shutdown is pending
In both cases, the connection will be aborted with the mux buffer holding
some data. In case of a strong abort the client will not see the GOAWAY or
RST and might want to try again, which is counter-productive. In case of
the graceful shutdown, it could result in truncated data. It looks like a
valid candidate for the issue reported here :
https://www.mail-archive.com/haproxy@formilux.org/msg32433.html
A backport to 1.9 and 1.8 is necessary.
In h2_frt_transfer_data(), we support both HTX and legacy modes. The
HTX mode is detected from the proxy option and sets a valid pointer
into the htx variable. Better rely on this variable in all the function
rather than testing the option again. This way the code is clearer and
even the compiler knows this pointer is valid when it's dereferenced.
This should be backported to 1.9 if the b_is_null() patch is backported.
We used to respond a connection error in case we received a trailers
frame on a closed stream, but it's a problem to do this if the error
was caused by a reset because the sender has not yet received it and
is just a victim of the timing. Thus we must not close the connection
in this case.
This patch may be backported to 1.9 but then it requires the following
previous ones :
MINOR: h2: add a generic frame checker
MEDIUM: mux-h2: check the frame validity before considering the stream state
CLEANUP: mux-h2: remove stream ID and frame length checks from the frame parsers
It's not convenient to have such structural checks mixed with the ones
related to the stream state. Let's remove all these basic tests that are
already covered once for all when reading the frame header.
There are some uneasy situation where it's difficult to validate a frame's
format without being in an appropriate state. This patch makes sure that
each frame passes through h2_frame_check() before being checked in the
context of the stream's state. This makes sure we can always return a GOAWAY
for protocol violations even if we can't process the frame.
RFC7540#5.1 states that these are the only states allowing any frame
type. For response HEADERS frames, we cannot accept that they are
delivered on idle streams of course, so we're left with these two
states only. It is important to test this so that we can remove the
generic CLOSE_STREAM test for such frames in the main loop.
This must be backported to 1.9 (1.8 doesn't have response HEADERS).
If a response HEADERS frame arrives on a closed connection (due to a
client abort sending an RST_STREAM), it's currently immediately rejected
with an RST_STREAM, like any other frame. This is incorrect, as HEADERS
frames must first be decoded to keep the HPACK decoder synchronized,
possibly breaking subsequent responses.
This patch excludes HEADERS/CONTINUATION/PUSH_PROMISE frames from the
central closed state test and leaves to the respective frame parsers
the responsibility to decode the frame then send RST_STREAM.
This fix must be backported to 1.9. 1.8 is not directly impacted since
it doesn't have response HEADERS nor trailers thus cannot recover from
such situations anyway.
The H2 spec requires to send GOAWAY when the client sends a frame after
it has already closed using END_STREAM. Here the corresponding case was
the fallback of a series of tests on the stream state, but it unfortunately
also catches old closed streams which we don't know anymore. Thus any late
packet after we've sent an RST_STREAM will trigger this GOAWAY and break
other streams on the connection.
This can happen when launching two tabs in a browser targetting the same
slow page through an H2-to-H2 proxy, and pressing Escape to stop one of
them. The other one gets an error when the page finally responds (and it
generally retries), and the logs in the middle indicate SD-- flags since
the late response was cancelled.
This patch takes care to only send GOAWAY on streams we still know.
It must be backported to 1.9 and 1.8.
When receiving a HEADERS or DATA frame with END_STREAM set, we would
inconditionally switch to half-closed(remote). This is wrong because we
could already have been in half-closed(local) and need to switch to closed.
This happens in the following situations :
- receipt of the end of a client upload after we've already responded
(e.g. redirects to POST requests)
- receipt of a response on the backend side after we've already finished
sending the request (most common case).
This may possibly have caused some streams to stay longer than needed
at the end of a transfer, though this is not apparent in tests.
This must be backported to 1.9 and 1.8.
When a settings frame updates the initial window, all affected streams's
window is updated as well. However the streams are not put back into the
send list if they were already blocked on flow control. The effect is that
such a stream will only be woken up by a WINDOW_UPDATE message but not by
a SETTINGS changing the initial window size. This can be verified with
h2spec's test http2/6.9.2/1 which occasionally fails without this patch.
It is unclear whether this situation is really met in field, but the
fix is trivial, it consists in adding each unblocked streams to the
wait list as is done for the window updates.
This fix must be backported to 1.9. For 1.8 the patch needs quite
a few adaptations. It's better to copy-paste the code block from
h2c_handle_window_update() adding the stream to the send_list when
its mws is > 0.
The WINDOW_UPDATE and DATA frame handlers used to still have a check on
h2s to return either h2s_error() or h2c_error(). This is a leftover from
the early code. The h2s cannot be null there anymore as it has already
been dereferenced before reaching these locations.
A subtle bug was introduced with H2 on the backend. RFC7540 states that
an attempt to create a stream on an ID not higher than the max known is
a connection error. This was translated into rejecting HEADERS frames
for closed streams. But with H2 on the backend, if the client aborts
and causes an RST_STREAM to be emitted, the stream is effectively closed,
and if/once the server responds, it starts by emitting a HEADERS frame
with this ID thus it is interpreted as a connection error.
This test must of course consider the side the mux is installed on and
not take this for a connection error on responses.
The effect is that an aborted stream on an outgoing H2 connection, for
example due to a client stopping a transfer with option abortonclose
set, would lead to an abort of all other streams. In the logs, this
appears as one or several CD-- line(s) followed by one or several SD--
lines which are victims.
Thanks to Luke Seelenbinder for reporting this problem and providing
enough elements to help understanding how to reproduce it.
This fix must be backported to 1.9.
The calculation of available outgoing H2 streams was improved by commit
d64a3ebe6 ("BUG/MINOR: mux-h2: always check the stream ID limit in
h2_avail_streams()"), but it still is incorrect because RFC7540#6.8
specifically forbids the creation of new streams after a GOAWAY frame
was received. Thus we must not mark the connection as available anymore
in order to be able to handle a graceful shutdown.
This needs to be backported to 1.9.
This is mandated by RFC7541#8.1.2.6. Till now we didn't have a copy of
the content-length header field. But now that it's already parsed, it's
easy to add the check.
The reg-test was updated to match the new behaviour as the previous one
expected unadvertised data to be silently discarded.
This should be backported to 1.9 along with previous patch (MEDIUM: h2:
always parse and deduplicate the content-length header) after it has got
a bit more exposure.
The header used to be parsed only in HTX but not in legacy. And even in
HTX mode, the value was dropped. Let's always parse it and report the
parsed value back so that we'll be able to store it in the streams.
This parameter allows to limit the number of successive requests sent
on a connection. Let's compare it to the number of streams already sent
on the connection to decide if the connection may still appear in the
idle list or not. This may be used to help certain servers work around
resource leaks, and also helps dealing with the issue of the GOAWAY in
flight which requires to set a usage limit on the client to be reliable.
This must be backported to 1.9.
One of the reasons for the excessive number of aborted requests when a
server sets a limit on the highest stream ID is that we don't check
this limit while allocating a new stream.
This patch does this at two locations :
- when a backend stream is allocated, we verify that there are still
IDs left ;
- when the ID is assigned, we verify that it's not higher than the
advertised limit.
This should be backported to 1.9.
This function is used to decide whether to put an idle connection back
into the idle pool. While it considers the limit in number of concurrent
requests, it does not consider the limit in number of streams, so if a
server announces a low limit in a GOAWAY frame, it will be ignored.
However there is a caveat : since we assign the stream IDs when sending
them, we have a number of allocated streams which max_id doesn't take
care of. This can be addressed by adding a new nb_reserved count on each
connection to keep track of the ID-less streams.
This patch makes sure we take care of the remaining number of streams
if such a limit was announced, or of the number of streams before the
highest ID. Now it is possible to accurately know how many streams
can be allocated, and the number of failed outgoing streams has dropped
in half.
This must be backported to 1.9.
When sending RST_STREAM in response to a frame delivered on an already
closed stream, we used not to be able to update the error code and
deliver an RST_STREAM with a wrong code (e.g. H2_ERR_CANCEL). Let's
always allow to update the code so that RST_STREAM is always sent
with the appropriate error code (most often H2_ERR_STREAM_CLOSED).
This should be backported to 1.9 and possibly to 1.8.
There are incompatible MUST statements in the HTTP/2 specification. Some
require a stream error and others a connection error for the same situation.
As discussed in the thread below, let's always apply the connection error
when relevant (headers-like frame in half-closed(remote)) :
https://mailarchive.ietf.org/arch/msg/httpbisa/pOIWRBRBdQrw5TDHODZXp8iblcE
This must be backported to 1.9, possibly to 1.8 as well.
Since we now support CONTINUATION frames, we must take care of properly
aborting the connection when they are sent on a closed stream. By default
we'd get a stream error which is not sufficient since the compression
context is modified and unrecoverable.
More info in this discussion :
https://mailarchive.ietf.org/arch/msg/httpbisa/azZ1jiOkvM3xrpH4jX-Q72KoH00
This needs to be backported to 1.9 and possibly to 1.8 (less important there).
There was an incomplete test in h2c_frt_handle_headers() resulting
in negative return values from h2c_decode_headers() not being taken
as errors. The effect is that the stream is then aborted on timeout
only.
This fix must be backported to 1.9.
In case we cannot allocate a stream ID for an outgoing stream, the stream
will be aborted. The problem is that we also release it and it will be
destroyed again by the application detecting the error, leading to a NULL
dereference in h2_shutr() and h2_shutw(). Let's only mark the error on the
CS and let the rest of the code handle the close.
This should be backported to 1.9.
When data are pushed in the channel's buffer, in h2_rcv_buf(), the mux-h2 must
respect the reserve if the flag CO_RFL_KEEP_RSV is set. In HTX, because the
stream-interface always sees the buffer as full, there is no other way to know
the reserve must be respected.
This patch must be backported to 1.9.
Tim Dsterhus reported a possible crash in the H2 HEADERS frame decoder
when the PRIORITY flag is present. A check is missing to ensure the 5
extra bytes needed with this flag are actually part of the frame. As per
RFC7540#4.2, let's return a connection error with code FRAME_SIZE_ERROR.
Many thanks to Tim for responsibly reporting this issue with a working
config and reproducer. This issue was assigned CVE-2018-20615.
This fix must be backported to 1.9 and 1.8.
Now the H2 mux will parse and encode the HTX trailers blocks and send
the corresponding HEADERS frame. Since these blocks contain pure H1
trailers which may be fragmented on line boundaries, if first needs
to collect all of them, parse them using the H1 parser, build a list
and finally encode all of them at once once the EOM is met. Note that
this HEADERS frame always carries the end-of-headers and end-of-stream
flags.
This was tested using the helloworld examples from the grpc project,
as well as with the h2c tools. It doesn't seem possible at the moment
to test tailers using varnishtest though.
We want to make sure we won't emit another empty DATA frame if we meet
HTX_BLK_EOM after and end of stream was already sent. For now it cannot
happen as far as HTX is respected, but with trailers it may become
ambiguous.
When forwarding an H2 request to an H1 server, if the request doesn't
have a content-length header field, it is chunked. In this case it is
possible to send trailers to the server, which is what this patch does.
If the transfer is performed without chunking, then the trailers are
silently discarded.
This is not exactly a bug but a long-time design limitation. We used not
to decode trailers in H2, resulting in broken connections each time a
trailer was sent, since it was impossible to keep the HPACK decompressor
synchronized. Now that the sequencing of operations permits it, we must
make sure to at least properly decode them.
What we try to do is to identify if a HEADERS frame was already seen and
use this indication to know if it's a headers or a trailers. For this,
h2c_decode_headers() checks if the stream indicates that a HEADERS frame
was already received. If so, it decodes it and emits the trailing
0 CRLF CRLF in case of H1, or the HTX_EOD + HTX_EOM blocks in case of HTX,
to terminate the data stream.
The trailers contents are still deleted for now but the request works, and
the connection remains synchronized and usable for subsequent streams.
The correctness may be tested using a simple config and h2spec :
h2spec -o 1000 -v -t -S -k -h 127.0.0.1 -p 4443 generic/4/4
This should definitely be backported to 1.9 given the low impact for the
benefit. However it cannot be backported to 1.8 since the operations cannot
be resumed. The following patches are also needed with this one :
MINOR: mux-h2: make h2c_decode_headers() return a status, not a count
MINOR: mux-h2: add a new dummy stream : h2_error_stream
MEDIUM: mux-h2: make h2c_decode_headers() support recoverable errors
BUG/MINOR: mux-h2: detect when the HTX EOM block cannot be added after headers
MINOR: mux-h2: check for too many streams only for idle streams
MINOR: mux-h2: set H2_SF_HEADERS_RCVD when a HEADERS frame was decoded
The HEADERS frame parser checks if we still have too many streams, but
this should only be done for idle streams, otherwise it would prevent
us from processing trailer frames.
In h2c_frt_handle_headers() and h2c_bck_handle_headers() we have an unused
error path made of the strm_err label, while send_rst is used to emit an
RST upon stream error after forcing the stream to h2_refused_stream. Let's
remove this unused strm_err block now.
In h2c_frt_handle_headers(), we test the stream for SS_ERROR just after
setting it to SS_OPEN, this makes no sense and creates confusion in the
error path. Remove this misleading test.
In case we receive a very large HEADERS frame which doesn't leave enough
room to place the EOM block after the decoded headers, we must fail the
stream. This test was missing, resulting in the loss of the EOM, possibly
leaving the stream waiting for a time-out.
Note that we also clear h2c->dfl here so that we don't attempt to clear
it twice when going back to the demux.
If this is backported to 1.9, it also requires that the following patches
are backported as well :
MINOR: mux-h2: make h2c_decode_headers() return a status, not a count
MINOR: mux-h2: add a new dummy stream : h2_error_stream
MEDIUM: mux-h2: make h2c_decode_headers() support recoverable errors
When a decoding error is recoverable, we should emit a stream error and
not a connection error. This patch does this by carefully checking the
connection state before deciding to send a connection error. If only the
stream is in error, an RST_STREAM is sent.
This function used to return a byte count for the output produced, or
zero on failure. Not only this value is not used differently than a
boolean, but it prevents us from returning stream errors when a frame
cannot be extracted because it's too large, or from parsing a frame
and producing nothing on output.
This patch modifies its API to return <0 on errors, 0 on inability to
proceed, or >0 on success, irrelevant to the amount of output data.
In h2c_decode_headers() we update the buffer's length according to the
amount of data produced (outlen). But in case of HTX this outlen value
is not a quantity, just an indicator of success, resulting in the buffer
being added one extra byte and temporarily showing .data > .size, which
is wrong. Fortunately this is overridden when leaving the function by
htx_to_buf() so the impact only exists in step-by-step debugging, but
it definitely needs to be fixed.
This must be backported to 1.9.
When dealing with a server's H2 response, we used to set the
end-of-stream flag on the conn_stream and the stream before parsing
the response, which is incorrect since we can fail to process this
response by lack of room, buffer or anything. The extend of this problem
is still limited to a few rare cases, but with trailers it will cause a
systematic failure.
This fix must be backported to 1.9.
This function handles response HEADERS frames, it is not responsible
for creating new streams thus it must not check if we've reached the
stream count limit, otherwise it could lead to some undesired pauses
which bring no benefit.
This must be backported to 1.9.
If we exit this function because some data are pending in the rxbuf, we
currently don't indicate any blocking flag, which will prevent the operation
from being attempted again. Let's set H2_CF_DEM_SFULL in this case to indicate
there's not enough room in the stream buffer so that the operation may be
attempted again once we make room. It seems that this issue cannot be
triggered right now but it definitely will with trailers.
This fix should be backported to 1.9 for completeness.
h2c_restart_reading() is used at various place to resume processing of
demux data, but this one refrains from doing so if the mux is already
subscribed for receiving. It just happens that even if some incoming
frame processing is interrupted, the mux is always subscribed for
receiving, so this condition alone is not enough, it must be combined
with the fact that the demux buffer is empty, otherwise some resume
events are lost. This typically happens when we refrain from processing
some incoming data due to missing room in the stream's rxbuf, and want
to resume in h2c_rcv_buf(). It will become even more visible with trailers
since these ones want to have an empty rxbuf before proceeding.
This must be backported to 1.9.
In h2c_decode_headers() we mistakenly check for H2_F_DATA_END_STREAM
while we should check for H2_F_HEADERS_END_STREAM. Both have the same
value (1) but better stick to the correct flag.
When a session adds a connection to its connection list, we used to remove
connections for an another server if there were not enough room for our
server. This can't work, because those lists are now the list of connections
we're responsible for, not just the idle connections.
To fix this, allow for an unlimited number of servers, instead of using
an array, we're now using a linked list.
In h2_detach(), don't add the connection to the idle list if nb_streams
is at the max. This can happen if we already closed that stream before, so
its slot became available and was used by another stream.
This should be backported to 1.9.
Now that the HEADERS frame decoding is retryable, we can safely try to
fold CONTINUATION frames into a HEADERS frame when the END_OF_HEADERS
flag is missing. In order to do this, h2c_decode_headers() moves the
frames payloads in-situ and leaves a hole that is plugged when leaving
the function. There is no limit to the number of CONTINUATION frames
handled this way provided that all of them fit into the buffer. The
error reported when meeting isolated CONTINUATION frames has now changed
from INTERNAL_ERROR to PROTOCOL_ERROR.
Now there is only one (unrelated) remaining failure in h2spec.
The H2 demux only checks for too many streams in h2c_frt_stream_new(),
then refuses to create a new stream and causes the connection to be
aborted by sending a GOAWAY frame. This will also happen if any error
happens during the stream creation (e.g. memory allocation).
RFC7540#5.1.2 says that attempts to create streams in excess should
instead be dealt with using an RST_STREAM frame conveying either the
PROTOCOL_ERROR or REFUSED_STREAM reason (the latter being usable only
if it is guaranteed that the stream was not processed). In theory it
should not happen for well behaving clients, though it may if we
configure a low enough h2.max_concurrent_streams limit. This error
however may definitely happen on memory shortage.
Previously it was not possible to use RST_STREAM due to the fact that
the HPACK decompressor would be desynchronized. But now we first decode
and only then try to allocate the stream, so the decompressor remains
synchronized regardless of policy or resources issues.
With this patch we enforce stream termination with RST_STREAM and
REFUSED_STREAM if this protocol violation happens, as well as if there
is a temporary condition like a memory allocation issue. It will allow
a client to recover cleanly.
This could possibly be backported to 1.9. Note that this requires that
these five previous patches are merged as well :
MINOR: h2: add a bit-based frame type representation
MEDIUM: mux-h2: remove padlen during headers phase
MEDIUM: mux-h2: decode HEADERS frames before allocating the stream
MINOR: mux-h2: make h2c_send_rst_stream() use the dummy stream's error code
MINOR: mux-h2: add a new dummy stream for the REFUSED_STREAM error code
This patch introduces a new dummy stream, h2_refused_stream, in CLOSED
status with the aforementioned error code. It will be usable to reject
unexpected extraneous streams.
We currently have 2 dummy streams allowing us to send an RST_STREAM
message with an error code matching this one. However h2c_send_rst_stream()
still enforces the STREAM_CLOSED error code for these dummy streams,
ignoring their respective errcode fields which however are properly
set.
Let's make the function always use the stream's error code. This will
allow to create other dummy streams for different codes.
It's hard to recover from a HEADERS frame decoding error after having
already created the stream, and it's not possible to recover from a
stream allocation error without dropping the connection since we can't
maintain the HPACK context, so let's decode it before allocating the
stream, into a temporary buffer that will then be offered to the newly
created stream.
Three types of frames may be padded : DATA, HEADERS and PUSH_PROMISE.
Currently, each of these independently deals with padding and needs to
wait for and skip the initial padlen byte. Not only this complicates
frame processing, but it makes it very hard to process CONTINUATION
frames after a padded HEADERS frame, and makes it complicated to perform
atomic calls to h2s_decode_headers(), which are needed if we want to be
able to maintain the HPACK decompressor's context even when dropping
streams.
This patch takes a different approach : the padding is checked when
parsing the frame header, the padlen byte is waited for and parsed,
and the dpl value is updated with this padlen value. This will allow
the frame parsers to decide to overwrite the padding if needed when
merging adjacent frames.
Since commit f210191 ("BUG/MEDIUM: h2: don't accept new streams if
conn_streams are still in excess") we're refraining from reading input
frames if we've reached the limit of number of CS. The problem is that
it prevents such situations from working fine. The initial purpose was
in fact to prevent from reading new HEADERS frames when this happens,
and causes some occasional transfer hiccups and pauses with large
concurrencies.
Given that we now properly reject extraneous streams before checking
this value, we can be sure never to have too many streams, and that
any higher value is only caused by a scheduling reason and will go
down after the scheduler calls the code.
This fix must be backported to 1.9 and possibly to 1.8. It may be
tested using h2spec this way with an h2spec config :
while :; do
h2spec -o 5 -v -t -S -k -h 127.0.0.1 -p 4443 http2/5.1.2
done
We were returning a stream error of type PROTOCOL_ERROR on empty HEADERS
frames, but RFC7540#4.2 stipulates that we should instead return a
connection error of type FRAME_SIZE_ERROR.
This may be backported to 1.9 and 1.8 though it's unlikely to have any
real life effect.
Commit dc57236 ("BUG/MINOR: mux-h2: advertise a larger connection window
size") caused a WINDOW_UPDATE message to be sent early with the connection
to increase the connection's window size. It turns out that it causes some
minor trouble that need to be worked around :
- varnishtest cannot transparently cope with the WU frames during the
handshake, forcing all tests to explicitly declare the handshake
sequence ;
- some vtc scripts randomly fail if the WU frame is sent after another
expected response frame, adding uncertainty to some tests ;
- h2spec doesn't correctly identify these WU at the connection level
that it believes are the responses to some purposely erroneous frames
it sends, resulting in some errors being reported
None of these are a problem with real clients but they add some confusion
during troubleshooting.
Since the fix above was intended to increase the upload bandwidth, we
have another option which is to increase the window size with the first
WU frame sent for the connection. This way, no WU frame is sent until
one is really needed, and this first frame will adjust the window to
the maximum value. It will make the window increase slightly later, so
the client will experience the first round trip when uploading data,
but this should not be perceptible, and is not worth the extra hassle
needed to maintain our debugging abilities. As an extra bonus, a few
extra bytes are saved for each connection until the first attempt to
upload data.
This should possibly be backported to 1.9 and 1.8.
In some situations, if too short a frame header is received, we may leave
h2_process_demux() waking up the task again without checking that we were
already subscribed.
In order to avoid this once for all, let's introduce an h2_restart_reading()
function which performs the control and calls the task up. This way we won't
needlessly wake the task up if it's already waiting for I/O.
Must be backported to 1.9.
In mux_h2_unsubscribe, don't forget to leave the sending_list if
SUB_CALL_UNSUBSCRIBE was set. SUB_CALL_UNSUBSCRIBE means we were about
to be woken up for writing, unless the mux was too full to get more data.
If there's an unsubscribe call in the meanwhile, we should leave the list,
or we may be put back in the send_list.
This should be backported to 1.9.
In h2_snd_buf(), if we couldn't send the data because of flow control, and
the connection got a shutr, then add CS_FL_ERROR (or CS_FL_ERR_PENDING). We
will never get any window update, so we will never be unlocked, anyway.
No backport is needed.
When dealing with early data we scan the list of stream to notify them.
We're not supposed to have h2s->cs == NULL here but it doesn't cost much
to make the scan more robust and verify it before notifying.
No backport is needed.
If we had no pending read, it could be complicated to report an
RST_STREAM to a sender since we used to only report it via the
rx side if subscribed. Similarly in h2_wake_some_streams() we
now try all methods, hoping to catch all possible events.
No backport is needed.
In order to report an error to the data layer, we have different ways
depending on the situation. At a lot of places it's open-coded and not
always correct. Let's create a new function h2s_alert() to handle this
task. It tries to wake on recv() first, then on send(), then using
wake().
In the mux h2, make sure we set CS_FL_ERR_PENDING and wake the recv task,
instead of setting CS_FL_ERROR, if CS_FL_EOS is not set, so if there's
potentially still some data to be sent.
Commiy 8519357c ("BUG/MEDIUM: mux-h2: report asynchronous errors in
h2_wake_some_streams()") addressed an issue with synchronous errors
but forgot to fix the call places to also pass CS_FL_ERR_PENDING
instead of CS_FL_ERROR.
No backport is needed.
We most often store the mux context there but it can also be something
else while setting up the connection. Better call it "ctx" and know
that it's the owner's context than misleadingly call it mux_ctx and
get caught doing suspicious tricks.
The SUB_CAN_SEND/SUB_CAN_RECV enum values have been confusing a few
times, especially when checking them on reading. After some discussion,
it appears that calling them SUB_RETRY_SEND/SUB_RETRY_RECV more
accurately reflects their purpose since these events may only appear
after a first attempt to perform the I/O operation has failed or was
not completed.
In addition the wait_reason field in struct wait_event which carries
them makes one think that a single reason may happen at once while
it is in fact a set of events. Since the struct is called wait_event
it makes sense that this field is called "events" to indicate it's the
list of events we're subscribed to.
Last, the values for SUB_RETRY_RECV/SEND were swapped so that value
1 corresponds to recv and 2 to send, as is done almost everywhere else
in the code an in the shutdown() call.
Today the demux only wakes a stream up after receiving some contents, but
not necessarily on close or error. Let's do it based on both error flags
and both EOS flags. With a bit of refinement we should be able to only do
it when the pending bits are there but not the static ones.
No backport is needed.
This function is called when dealing with a connection error or a GOAWAY
frame. It used to report a synchronous error instead of an asycnhronous
error, which can lead to data truncation since whatever is still available
in the rxbuf will be ignored. Let's correctly use CS_FL_ERR_PENDING instead
and only fall back to CS_FL_ERROR if CS_FL_EOS was already delivered.
No backport is needed.
If EOS has already been reported on the conn_stream, there won't be
any read anymore to turn ERR_PENDING into ERROR, so we have to do
report it directly.
No backport is needed.
The h2s pointer was used to scan fctl lists prior to being used to scan
the send list by ID, so it could appear non-null eventhough the list is
empty, resulting in misleading information on empty connections.
No backport is needed.
Most of the time when we issue "show fd" to dump a mux's state, it's
to figure why a transfer is frozen. Connection, stream and conn_stream
states are critical there. And most of the time when this happens there
is a single stream left in the H2 mux, so let's always dump the last
known stream on show fd, as most of the time it will be the one of
interest.
Commit 7505f94f9 ("MEDIUM: h2: Don't use a wake() method anymore.")
changed the conditions to restart demuxing so that this happens as soon
as something is read. But similar to previous fix, at an end of stream
we may be woken up with nothing to read but data still available in the
demux buffer, so we must also use this as a valid condition for demuxing.
No backport is needed, this is purely 1.9.
Commit 082f559d3 ("BUG/MEDIUM: h2: restart demuxing after releasing
buffer space") tried to address a situation where transfers could stall
after a read, but the condition was not completely covered : some stalls
may still happen at end of stream because there's nothing anymore to
receive and the last data lie in the demux buffer. Thus we must also
consider this state as a valid condition to restart demuxing.
No backport is needed.
Add a new flag to conn_streams, CS_FL_ERR_PENDING. This is to be set instead
of CS_FL_ERR in case there's still more data to be read, so that we read all
the data before closing.
In h2_deferred_shut, if we're done sending the shutr/shutw, don't destroy
the h2s if it still has a conn_stream attached, or the conn_stream may try
to access it again.
When creating new conn_streams, always set the CS_FL_NOT_FIRST flag. We
don't really care about being the first request for HTTP/2, this only
really makes sense for HTTP/1, and that way we can reuse connections.
In session, don't keep an infinite number of connection that can idle.
Add a new frontend parameter, "max-session-srv-conns" to set a max number,
with a default value of 5.
Instead of trying to get the session from the connection, which is not
always there, and of course there could be multiple sessions per connection,
provide it with the init() and attach() methods, so that we know the
session for each outgoing stream.
In the mux_h1 and mux_h2, move the test to see if we should add the
connection in the idle list until after we destroyed the h1s/h2s, that way
later we'll be able to check if the connection has no stream at all, and if
it should be added to the server idling list.
When using zerocopy, start from the beginning of the data, not from the
beginning of the buffer, it may have contained headers, and so the data
won't start at the beginning of the buffer.
The transpory layer now respects buffer alignment, so we don't need to
cheat anymore pretending we have some data at the head, adjusting the
buffer's head is enough.
For a long time we've been realigning empty buffers in the transport
layers, where the I/Os were performed based on callbacks. Doing so is
optimal for higher data throughput but makes it trickier to optimize
unaligned data, where mux_h1/h2 have to claim some data are present
in the buffer to force unaligned accesses to skip the frame's header
or the chunk header.
We don't need to do this anymore since the I/O calls are now always
performed from top to bottom, so it's only the mux's responsibility
to realign an empty buffer if it wants to.
In practice it doesn't change anything, it's just a convention, and
it will allow the code to be simplified in a next patch.
In the mux detach function, when using HTX, take the connection over if
it no longer has an owner (ie because the session that was the owner left).
It is done for legacy code in proto_http.c, but not for HTX.
Also when using HTX, in H2, try to add the connection back to idle_conns if
it was not already (ie we used to use all the available streams, and we're
freeing one). That too was done in proto_http.c.
H2 has a 9-byte frame header, and HTX has a 40-byte frame header.
By artificially advancing the Rx header and limiting the amount of
bytes read to protect the end of the buffer, we can make the data
payload perfectly aligned with HTX blocks and optimize the copy.
This is similar to what was done for the H1 mux : when the mux's buffer
is empty and the htx area contains exactly one data block of the same
size as the requested count, and all window and frame size conditions are
satisfied, then it's possible to simply swap the caller's buffer with the
mux's output buffer and adjust offsets and length to match the entire
DATA HTX block in the middle. An H2 frame header has to be prepended
before the block but this always fits in an HTX frame header.
In this case we perform a true zero-copy operation from end-to-end. This
is the situation that happens all the time with large files. When using
HTX over H2 over TLS, this brings a 3% extra performance gain. TLS remains
a limiting factor here but the copy definitely has a cost. Also since
haproxy can now use H2 in clear, the savings can be higher.
Due to blocking factor being different on H1 and H2, we regularly end
up with tails of data blocks that leave room in the mux buffer, making
it tempting to copy the pending frame into the remaining room left, and
possibly realigning the output buffer.
Here we check if the output buffer contains data, and prefer to wait
if either the current frame doesn't fit or if it's larger than 1/4 of
the buffer. This way upon next call, either a zero copy, or a larger
and aligned copy will be performed, taking the whole chunk at once.
Doing so increases the H2 bandwidth by slightly more than 1% on large
objects.
By default H2 uses a 65535 bytes window for the connection, and changing
it requires sending a WINDOW_UPDATE message. We only used to update the
window when receiving data, thus never increasing it further.
As reported by user klzgrad on the mailing list, this seriously limits
the upload bitrate, and will have an even higher impact on the backend
H2 connections to origin servers.
There is no technical reason for keeping this window so low, so let's
increase it to the maximum possible value (2G-1). We do this by
pretending we've already received that many data minus the maximum
data the client might already send (65535), so that an early
WINDOW_UPDATE message is sent right after the SETTINGS frame.
This should be backported to 1.8. This patch depends on previous
patch "BUG/MINOR: mux-h2: refrain from muxing during the preface".
The condition to refrain from processing the mux was insufficient as it
would only handle the outgoing connections. In essence it is not that much
of a problem since we don't have streams yet on an incoming connetion. But
it prevents waiting for the end of the preface before sending an early
WINDOW_UPDATE message, thus causing the connections to fail in this case.
This must be backported to 1.8 with a few minor adaptations.
All the HTX definition is self-contained and doesn't really depend on
anything external since it's a mostly protocol. In addition, some
external similar files (like h2) also placed in common used to rely
on it, making it a bit awkward.
This patch moves the two htx.h files into a single self-contained one.
The historical dependency on sample.h could be also removed since it
used to be there only for http_meth_t which is now in http.h.
This way we don't open-code the HPACK status codes anymore in the H2
code. Special care was taken not to cause any slowdown as this code is
very sensitive.
Jerome reported that outgoing H2 failed for methods different from GET
or POST. It turns out that the HPACK encoding is performed by hand in
the outgoing headers encoding function and that the data length was not
incremented to cover the literal method value, resulting in a corrupted
HEADERS frame.
Admittedly this code should move to the generic HPACK code.
No backport is needed.
Gcc 7 warns about a potential null pointer deref that cannot happen
since the start line block is guaranteed to be present in the functions
where it's dereferenced. Let's mark it as already checked.
When we're using HTX, we don't have to generate chunk header/trailers, and
that ultimately leads to a crash when we try to access a buffer that
contains just chunk trailers.
This should not be backported.
Since HTX stores header names in lower case already, we don't need to
do it again anymore. This increased H2 performance by 2.7% on quick
tests, now making H2 overr HTX about 5.5% faster than H2 over H1.
CS_FL_RCV_MORE is used in two cases, to let the conn_stream
know there may be more data available, and to let it know that
it needs more room. We can't easily differentiate between the
two, and that may leads to hangs, so split it into two flags,
CS_FL_RCV_MORE, that means there may be more data, and
CS_FL_WANT_ROOM, that means we need more room.
This should not be backported.
Due to a thinko, I used sl_off as the start line index number but it's
not it, it's its offset. The first index is obtained using htx_get_head(),
and the start line is obtained using htx_get_sline(). This caused crashes
to happen when forwarding HTX traffic via the H2 mux once the HTX buffer
started to wrap.
No backport is needed.
Now, the function htx_from_buf() will set the buffer's length to its size
automatically. In return, the caller should call htx_to_buf() at the end to be
sure to leave the buffer hosting the HTX message in the right state. When the
caller can use the function htxbuf() to get the HTX message without any update
on the underlying buffer.
It's incorrect to send more bytes than requested, because some filters
(e.g. compression) might intentionally hold on some blocks, so DATA
blocks must not be processed past the advertised byte count. It is not
the case for headers however.
No backport is needed.
If we're blocking on mux full, mux busy or whatever, we must get out of
the loop. In legacy mode this problem doesn't exist as we can normally
return 0 but here it's not a sufficient condition to stop sending, so
we must inspect the blocking flags as well.
No backport is needed.
The way htx_xfer_blks() was used is wrong, if we receive data, we must
report everything we found, not just the headers blocks. This ways causing
the EOM to be postponed and some fast responses (or errors) to be incorrectly
delayed.
No backport is needed.
In h2_snd_buf(), when running with htx, make sure we return the amount of
data the caller specified, if we emptied the buffer, as it is what the
caller expects, and will lead to him properly consider the buffer to be
empty.
When reaching h2_shutr/h2_shutw, as we may have generated an empty frame,
a goaway or a rst, make sure we wake the I/O tasklet, or we may not send
what we just generated.
Also in h2_shutw(), don't forget to return if all went well, we don't want
to subscribe the h2s to wait events.
Add a new method to muxes, "max_streams", that returns the max number of
streams the mux can handle. This will be used to know if a mux is in use
or not.
When creating a new stream, don't bother flagging a connection with
H2_CF_DEM_TOOMANY if we created the last available stream. We won't create
any other anyway, because h2_avail_streams() would return 0 available streams,
and has it is a blocking flag, it prevents us from reading data after.
The function now calls h2c_bck_handle_headers() or h2c_frt_handle_headers()
depending on the connection's side. The former doesn't create a new stream
but feeds an existing one. At this point it's possible to forward an H2
request to a backend server and retrieve the response headers.
This function does not really depend on the request, all it does is
also valid for H2 responses found on the backend side, so this patch
renames it and makes it call the appropriate decoder based on the
direction.
This creates an H2 HEADERS frame from an HTX request. The code is
very similar to the response encoding, so probably that in the future
we'll have to factor these functions differently. The HTX's start line
type is used to decide on the direction. We also purposely error out
when trying to encode an H2 request from an H1 message since it's not
implemented.
For now it reports an immediate error when trying to encode the request
since it doesn't parse as a response. We take care of sending the preface
and settings frame with the outgoing connection, and not to wait for a
preface during the H2_CS_PREFACE phase for outgoing connections.
For the backend we'll need to allocate streams as well. Let's do this
with h2c_bck_stream_new(). The stream ID allocator was split from it
so that the caller can decide whether or not to stay on the same
connection or create a new one. It possibly isn't the best way to do
this as once we're on the mux it's too late to give up creation of a
new stream. Another approach would possibly consist in detaching muxes
that reached their connection count limit before they can be reused.
Instead of choosing the stream id as soon as the stream is created, wait
until data is about to be sent. If we don't do that, the stream may send
data out of order, and so the stream 3 may send data before the stream 1,
and then when the stream 1 will try to send data, the other end will
consider that an error, as stream ids should always be increased.
Cc: Olivier Houchard <ohouchard@haproxy.com>
We declare two configurations for the H2 mux. One supporting only
the frontend in HTTP mode and one supporting both sides in HTX mode.
This is only to ease development at this point. Trying to assign an h2
mux on the server side will still fail during h2_init() anyway instead
of at config parsing time.
If we decided to emit the end of stream flag on the H2 response headers
frame, we must remove the EOM block from the HTX stream, otherwise it
will lead to an extra DATA frame being sent with the ES flag and will
violate the protocol.
This is used for uploads, we can now convert H2 DATA frames to HTX
DATA blocks. It's uncertain whether it's better to reuse the same
function or to split it in two at this point. For now the same
function was added with some paths specific to HTX. In this mode
we loop back to the same or next frame in order to try to complete
DATA blocks.
At the moment the way it's done is not optimal. We should aggregate multiple
blocks into a single DATA frame, and we should merge the ES flag with the
last one when we already know we've reached the end. For now and for an
easier tracking of the HTX stream, an individual empty DATA frame is sent
with the ES bit when EOM is met.
The DATA function is called for DATA, EOD and EOM since these stats indicate
that a previous frame was already produced without the ES flag (typically a
headers frame or another DATA frame). Thus it makes sense to handle all these
blocks there.
There's still an uncertainty on the way the EOD and EOM HTX blocks must be
accounted for, as they're counted as one byte in the HTX stream, but if we
count that byte off when parsing these blocks, we end up sending too much
and desynchronizing the HTX stream. Maybe it hides an issue somewhere else.
At least it's possible to reliably retrieve payloads up to 1 GB over H2/HTX
now. It's still unclear why larger ones are interrupted at 1 GB.
When using HTX, we need a separate function to emit a headers frame.
The code is significantly different from the H1 to H2 conversion, though
it borrows some parts there. It looks like the part building the H2 frame
from the headers list could be factored out, however some of the logic
around dealing with end of stream or block sizes remains different.
With this patch it becomes possible to retrieve bodyless HTTP responses
using H2 over HTX.
When the proxy is configured to use HTX mode, the headers frames
will be converted to HTX header blocks instead of HTTP/1 messages.
This requires very little modifications to the existing function
so it appeared better to do it this way than to duplicate it.
Only the request headers are handled, responses are not processed
yet and data frames are not processed yet either. The return value
is inaccurate but this is not an issue since we're using it as a
boolean : data received or not.
Now h2_snd_buf() will check the proxy's mode to decide whether to use
HTX-specific send functions or legacy functions. In HTX mode, the HTX
blocks of the output buffer will be parsed and the related functions
will be called accordingly based on the block type, and unimplemented
blocks will be skipped. For now all blocks are skipped, this is only
helpful for debugging.
The function needs to be slightly adapted to transfer HTX blocks, since
it may face a full buffer on the receive path, thus it needs to transfer
HTX blocks between the two sides ignoring the <count> argument in this
mode.
The H2 mux will now be called for both HTTP and HTX modes. For now the
data transferr functions are not HTX-aware so this will lead to problems
if used as-is but it's convenient for development and debugging.
In h2_recv(), return 1 if there's an error on the connection, not just if
there's a read0 pending, so that h2_process() can be called and act as a
janitor.
In h2_process_demux(), if we're demuxing multiple frames, and the previous
frame led to a stream getting closed, don't bogusly consider that an error,
and destroy the next stream, as there are valid cases where the stream could
be closed.
Instead of exporting a number of pools and having to manually delete
them in deinit() or to have dedicated destructors to remove them, let's
simply kill all pools on deinit().
For this a new function pool_destroy_all() was introduced. As its name
implies, it destroys and frees all pools (provided they don't have any
user anymore of course).
This allowed to remove 4 implicit destructors, 2 explicit ones, and 11
individual calls to pool_destroy(). In addition it properly removes
the mux_pt_ctx pool which was not cleared on exit (no backport needed
here since it's 1.9 only). The sig_handler pool doesn't need to be
exported anymore and became static now.
This commit replaces the explicit pool creation that are made in
constructors with a pool registration. Not only this simplifies the
pools declaration (it can be done on a single line after the head is
declared), but it also removes references to pools from within
constructors. The only remaining create_pool() calls are those
performed in init functions after the config is parsed, so there
is no more user of potentially uninitialized pool now.
It has been the opportunity to remove no less than 12 constructors
and 6 init functions.
Most calls to hap_register_post_check(), hap_register_post_deinit(),
hap_register_per_thread_init(), hap_register_per_thread_deinit() can
be done using initcalls and will not require a constructor anymore.
Let's create a set of simplified macros for this, called respectively
REGISTER_POST_CHECK, REGISTER_POST_DEINIT, REGISTER_PER_THREAD_INIT,
and REGISTER_PER_THREAD_DEINIT.
Some files were not modified because they wouldn't benefit from this
or because they conditionally register (e.g. the pollers).
This switches explicit calls to various trivial registration methods for
keywords, muxes or protocols from constructors to INITCALL1 at stage
STG_REGISTER. All these calls have in common to consume a single pointer
and return void. Doing this removes 26 constructors. The following calls
were addressed :
- acl_register_keywords
- bind_register_keywords
- cfg_register_keywords
- cli_register_kw
- flt_register_keywords
- http_req_keywords_register
- http_res_keywords_register
- protocol_register
- register_mux_proto
- sample_register_convs
- sample_register_fetches
- srv_register_keywords
- tcp_req_conn_keywords_register
- tcp_req_cont_keywords_register
- tcp_req_sess_keywords_register
- tcp_res_cont_keywords_register
- flt_register_keywords
Since the connection changes in 1.9, some breakage happened to the H2 mux
whose initial design was heavily relying on the fact that connection-level
functions were woken up after data were transferred to the stream layer.
We need to wake the demux up after receiving such data if the demux is
blocked. This at least allows to receive POSTs again. One issue remains,
it looks like the end of the uploaded data is silently discarded if the
server responds before the end of the transfer (H2 in half-closed(local)
state), which doesn't happen with 1.8.14 and nghttp as the client.
No backport is needed.
After the changes to the connection layer in 1.9, some wake up calls
need to be introduced to re-activate reading from the connection. One
such place is at the end of h2_process_demux(), otherwise processing
of input data stops after a few frames.
No backport is needed.
Do not destroy the connection when we're about to destroy a stream. This
prevents us from doing keepalive on server connections when the client is
using HTTP/2, as a new stream is created for each request.
Instead, the session is now responsible for destroying connections.
When reusing connections, the attach() mux method is now used to create a new
conn_stream.
Add a new method for mux, avail_streams, that returns the number of streams
still available for a mux.
For the mux_pt, it'll return 1 if the connection is in idle, or 0. For
the H2 mux, it'll return the max number of streams allowed, minus the number
of streams currently in use.
This method is used to retrieve the first known good conn_stream from
the mux. It will be used to find the other end of a connection when
dealing with the proxy protocol for example.
Commit d4dd22d ("MINOR: h2: Let user of h2_recv() and h2_send() know xfer
has been done") changed the API without documenting the expected returned
values which appear to come out of nowhere in the code :-( Please don't
do that anymore! The description was recovered from the commit message.
We wake up all the streams waiting to send data when we have space available
in the mux buffer. Doing so means we probably wake way too many streams,
because after a few the buffer will probably be full instead. So keep a
list of all the streams that are about to send data, and if we detect that
the buffer is full, unschedule the tasks and put the streams back to the
send_list.
Avoid using conn_xprt_want_send/recv, and totally nuke cs_want_send/recv,
from the upper layers. The polling is now directly handled by the connection
layer, it is activated on subscribe(), and unactivated once we got the event
and we woke the related task.
In h2_recv(), return 1 if we have data available, or if h2_recv_allowed()
failed, to be sure h2_process() is called.
Also don't subscribe if our buffer is full.
When we're closing a stream, is there's no stream left and a goaway was sent,
close the connection, there's no reason to keep it open.
[wt: it's likely that this is needed in 1.8 as well, though it's unclear
how to trigger this issue, some tests are needed]
We will need to know if a mux was created for a front or a back
connection and once it's established it's much harder, so let's
introduce H2_CF_IS_BACK for this.
For backend connections we'll have to initialize streams but not allocate
conn_streams since they'll already be there. Thus this patch splits the
h2c_stream_new() function into one dedicated to allocation of a new stream
and another one supposed to attach this stream to an existing frontend
connection.
Till now in order to figure the timeouts, we used to retrieve the proxy
from the session's owner, but the new API provides it so it's better to
simply take it from the caller at init time. We take this opportunity to
store the pointer to the proxy into the h2 connection so that we can
reuse it later when needed.
The init function was split into the mux init and the front init, but it
appears that most of the code will be common between the two sides when
implementing the backend init. Thus let's simply make this a unique
h2_init() function.
h2_snd_buf() must not accept to send data if the preface was not yet
received nor sent. At the moment it doesn't happen but it can with
server-side H2.
At a few places we check these states to detect if a stream has valid
data/errcode or is one of the two dummy streams (idle or closed). It
will become problematic for outgoing streams as it will not be possible
to report errors for example since the stream will switch from IDLE
state only after sending a HEADERS frame.
There is a safer solution consisting in checking the stream ID, which
may only be zero in the dummy streams. This patch changes the test to
only rely on the stream ID.
If we can't send data for a stream because of its flow control, make sure
not to put it in the send_list, until the flow control lets it send again.
This is specific to 1.9, and should not be backported.
When subscribing, we don't need to provide a list element, only the h2 mux
needs it. So instead, Add a list element to struct h2s, and use it when a
list is needed.
This forces us to use the unsubscribe method, since we can't just unsubscribe
by using LIST_DEL anymore.
This patch is larger than it should be because it includes some renaming.
As we don't know how subscriptions are handled, we can't just assume we can
use LIST_DEL() to unsubscribe, so introduce a new method to mux and connections
to do so.
Commit 8ae735da0 ("MEDIUM: mux_h2: Revamp the send path when blocking.")
added a tasklet allocation in h2_stream_new(), however the error exit path
fails to reset h2s in case the tasklet cannot be allocated, resulting in
the h2s pointer to be returned as valid to the caller. Let's readjust the
exit path to always return NULL on error and to always log as well (since
there is no reason for not logging on such important errors).
No backport is needed, this is strictly 1.9-dev.
Since commit 7505f94f9 ("MEDIUM: h2: Don't use a wake() method anymore."),
the H2 mux's init() calls h2_process(). But this last one may detect an
early error and call h2_release(), destroying the connection, and return
-1. At this point we're screwed because the caller will still dereference
the connection for various things ranging from the configuration of the
proxy protocol header to the retries. We could simply return -1 here upon
failure but that's not enough since the stream layer really needs to keep
its connection structure allocated (to clean it up in session_kill_embryonic
or for example because it holds the destination address to reconnect to
when the connection goes to the backend). Thus the correct solution here is
to only schedule a wakeup of the I/O callback so that the init succeeds,
and that the connection is only handled later.
No backport is needed, this is 1.9-specific.
While it was possible to consider the status before parsing response
headers, it's wrong to do it for request headers and could lead to
random behaviours due to this status matching other fields instead.
Additionnally there is little to no value in doing this for each and
every new header field. It's much better to reset the content-length
at once in the callerwhen seeing such statuses (which currently is only
the H2 mux).
No backport is needed, this is purely 1.9.
The h2 parser has this specificity that if it cannot send the headers
frame resulting from the headers it just parsed, it needs to drop it
and parse it again later. Since commit 8852850 ("MEDIUM: h1: let the
caller pass the initial parser's state"), when this happens the parser
remains in the data state and the headers are not parsed again next
time, resulting in a parse error. Let's reset the parser on exit there.
No backport is needed.
If we're detaching the conn_stream, and it was subscribed to be waken up
when more data was available to receive, unsubscribe it.
No backport is needed.
Empty both send_list and fctl_list when destroying the h2 context, so that
if we're freeing the stream after, it doesn't try to remove itself from the
now-deleted list.
No backport is needed.
Till now it was very difficult for a mux to know what proxy it was
working for. Let's pass the proxy when the mux is instanciated at
init() time. It's not yet used but the H1 mux will definitely need
it, just like the H2 mux when dealing with backend connections.
The h1 parser used to systematically turn header field names to lower
case because it was designed for H2. Let's add a flag which is off by
default to condition this behaviour so that when using it from an H1
parser it will not affect the message.
The HTTP status is not relevant to the H1 message but to the H2 stream
itself. It used to be placed there by pure convenience but better move
it before it's too hard to remove.
This state was only a delimiter between headers and body but it now
causes more harm than good because it requires someone to change it.
Since the H1 parser knows if we're in DATA or CHUNK_SIZE, simply let
it set the right next state so that h1m->state constantly matches
what is expected afterwards.
This will allow the parser to fill some extra fields like the method or
status without having to store them permanently in the HTTP message. At
this point however the parser cannot restart from an interrupted read.
There's no reason to have the two sides in H1 format since we only use
one at a time (the response at the moment). While completely removing
the request declaration, let's rename the response to "h1m" to clarify
that it's the unique h1 message there.
This is the *parsing* state of an HTTP/1 message. Currently the h1_state
is composite as it's made both of parsing and control (100SENT, BODY,
DONE, TUNNEL, ENDING etc). The purpose here is to have a purely H1 state
that can be used by H1 parsers. For now it's equivalent to h1_state.
Instead of waiting for the connection layer to let us know we can read,
attempt to receive as soon as process_stream() is called, and subscribe
to receive events if we can't receive yet.
Now, except for idle connections, the recv(), send() and wake() methods are
no more, all the lower layers do is waking tasklet for anybody waiting
for I/O events.
Change fctl_list and send_list to be lists of struct wait_list, and nuke
send_wait_list, as it's now redundant.
Make the code responsible for shutr/shutw subscribe to those lists.
Instead of having our wake() method called each time a fd event happens,
just subscribe to recv/send events, and get our tasklet called when that
happens. If any recv/send was possible, the equivalent of what h2_wake_cb()
will be done.
Let the connection layer know we're always interested in getting more data,
so that we get scheduled as soon as data is available, instead of relying
on the wake() method.
Make h2_recv() and h2_send() return 1 if data has been sent/received, or 0
if it did not. That way the caller will be able to know if more work may
have to be done.
Remove the recv() method from mux and conn_stream.
The goal is to always receive from the upper layers, instead of waiting
for the connection later. For now, recv() is still called from the wake()
method, but that should change soon.
For struct connection, struct conn_stream, and for the h2 mux, add 2 new
lists, one that handles waiters for recv, and one that handles waiters for
recv and send. That way we can ask to subscribe for either recv or send.
Christopher noticed that the CS_FL_EOS to CS_FL_REOS conversion was
incomplete : when the connectionis closed, we mark the streams with EOS
instead of REOS, causing the loss of any possibly pending data. At the
moment it's not an issue since H2 is used only with a client, but with
servers it could be a real problem if servers close the connection right
after sending their response.
This patch should be backported to 1.8.
The h2 mux currently lacks some basic transparency. Some errors cause the
connection to be aborted but they couldn't be reported. With this patch,
almost all situations where an error will cause a stream or connection to
be aborted without the ability for an existing stream to report it will be
reported in the logs. This at least provides a solution to monitor the
activity and abnormal traffic.
While parsing a headers frame, if the frame is wrapped in the buffer
and needs to be unwrapped, it will be duplicated before being processed.
But if it contains certain combinations of invalid flags, the parser
returns without releasing the temporary buffer leading to a memory
leak.
This fix needs to be backported to 1.8.
The handshake processing time used to be stored per stream, which was
valid when there was exactly one stream per session. With H2 and
multiplexing it's not the case anymore and the reported handshake times
are wrong in the logs as it's computed between the TCP accept() and the
stream creation. Let's first move the handshake where it belongs, which
is the session.
However, this is not enough because we don't want to report an excessive
idle time either for H2 (since many requests use the connection).
So the solution used here is to have the stream retrieve sess->tv_accept
and the handshake duration when the stream is created, and let the mux
immediately reset them. This way, the handshake time becomes zero for the
second and subsequent requests in H2 (which was already the case in H1),
and the idle time exactly counts how long the connection remained unused
while it could be used, so in H1 it runs from the end of the previous
response and in H2 it runs from the end of the previous request since the
channel is already available.
This patch will need to be backported to 1.8.
Multiplexers are not necessarily associated to an ALPN. ALPN is a TLS extension,
so it is not always defined or used. Instead, we now rather speak of
multiplexer's protocols. So in this patch, there are no significative changes,
some structures and functions are just renamed.
Now, a multiplexer can specify if it can be install on incoming connections
(ALPN_SIDE_FE), on outgoing connections (ALPN_SIDE_BE) or both
(ALPN_SIDE_BOTH). These flags are compatible with proxies' ones.
This is a partial revert of the commit deccd1116 ("MEDIUM: mux: make
mux->snd_buf() take the byte count in argument"). It is a requirement to do
zero-copy transfers. This will be mandatory when the TX buffer of the
conn_stream will be used.
So, now, data are consumed by mux->snd_buf() and not only sent. So it needs to
update the buffer state. On its side, the caller must be aware the buffer can be
replaced y an empty or unallocated one.
As a side effet of this change, the function co_set_data() is now only responsible
to update the channel set, by update ->output field.
Some h2 connections remaining in CLOSE_WAIT state forever have been
reported for a while. Thanks to detailed captures provided by Milan
Petruzelka, the sequence where this happens became clearer :
1) multiple streams compete for the mux and are queued in the send_list
2) at this point the mux has to emit a GOAWAY for any reason (for
example because it received a bad message)
3) the streams are woken up, notified about the error
4) h2_detach() is called for each of them
5) the CS they are detached from the H2S
6) since the streams are marked as blocked for some room, they are
orphaned and nothing more is done on them.
7) at this point, any activity on the connection goes through h2_wake()
which sees the conneciton in ERROR2 state, tries again to release
the streams, cannot, and stops polling (thus even connection errors
cannot be detected anymore).
=> from this point, no more events can be received on the connection, and
the streams remain orphaned forever.
This patch makes sure that we never return without doing anything once
an error was met. It has to act both on the h2_detach() side (for h2
streams being detached after the error was emitted) and on the h2_wake()
side (for errors reported after h2s have already been orphaned).
Many thanks to Milan Petruzelka and Janusz Dziemidowicz for their
awesome work on this issue, collecting traces and testing patches,
and to Olivier Doucet for extra testing and confirming the fix.
This fix must be backported to 1.8.
Instead of calling the data layer from each individual frame processing
function, we now call it from demux. This requires to know the h2s that
was created inside h2c_frt_handle_headers(), which is why the pointer is
now returned. This results in a small performance boost from 58k to 60k
POST requests/s compared to -master, thanks to half the number of
si_cs_recv_cb() calls and 66% calls to si_cs_wake_cb().
It's interesting to note that all calls to data_cb->recv() are now always
immediately followed by a call to data_cb->wake(). The next step should
be to let the ->wake handler perform the recv() call itself. For this it
will be useful to have some info on the CS to indicate whether or not it
is ready to be read (ie: contains a non-empty input buffer).
We still call the parser but it should soon not be needed anymore. The
decode functions don't need the buffer nor the max size anymore. They
must also not touch the CS_FL_EOS or CS_FL_RCV_MORE flags either, so
this is done within h2_rcv_buf() after transmission.
The "flags" argument to h2_frt_decode_headers() and h2_frt_transfer_data()
has been removed since it's not used anymore.
The purpose here is also to ensure we can split the lower from the top
layers. The way the CS_FL_MSG_MORE flag is set was updated so that it's
set or cleared upon exit depending on the buffer's remaining contents.
The purpose is to decode to a temporary buffer and then to copy this buffer
to the caller. This double-copy definitely has an impact on performance, the
test code goes down from 220k to 140k req/s, but this memcpy() will disappear
soon.
The test on CO_RFL_BUF_WET has become irrelevant now since we only use
the cs' rxbuf, so we cannot be blocked by "output" data that has to be
forwarded first. Thus instead we don't start until the rxbuf is empty
(it will be drained from any input data once the stream processes it).
The purpose is to decode to a temporary buffer and then to copy this buffer
to the caller upon request to avoid having to process frames on the fly
when called from the higher level. For now the buffer is only initialized
on stream creation via cs_new() and allocated if the buffer_wait's callback
is called.
Totally nuke the "send" method, instead, the upper layer decides when it's
time to send data, and if it's not possible, uses the new subscribe() method
to be called when it can send data again.
Add a new "subscribe" method for connection, conn_stream and mux, so that
upper layer can subscribe to them, to be called when the event happens.
Right now, the only event implemented is "SUB_CAN_SEND", where the upper
layer can register to be called back when it is possible to send data.
The connection and conn_stream got a new "send_wait_list" entry, which
required to move a few struct members around to maintain an efficient
cache alignment (and actually this slightly improved performance).
Now all the code used to manipulate chunks uses a struct buffer instead.
The functions are still called "chunk*", and some of them will progressively
move to the generic buffer handling code as they are cleaned up.
Chunks are only a subset of a buffer (a non-wrapping version with no head
offset). Despite this we still carry a lot of duplicated code between
buffers and chunks. Replacing chunks with buffers would significantly
reduce the maintenance efforts. This first patch renames the chunk's
fields to match the name and types used by struct buffers, with the goal
of isolating the code changes from the declaration changes.
Most of the changes were made with spatch using this coccinelle script :
@rule_d1@
typedef chunk;
struct chunk chunk;
@@
- chunk.str
+ chunk.area
@rule_d2@
typedef chunk;
struct chunk chunk;
@@
- chunk.len
+ chunk.data
@rule_i1@
typedef chunk;
struct chunk *chunk;
@@
- chunk->str
+ chunk->area
@rule_i2@
typedef chunk;
struct chunk *chunk;
@@
- chunk->len
+ chunk->data
Some minor updates to 3 http functions had to be performed to take size_t
ints instead of ints in order to match the unsigned length here.
Now the buffers only contain the header and a pointer to the storage
area which can be anywhere. This will significantly simplify buffer
swapping and will make it possible to map chunks on buffers as well.
The buf_empty variable was removed, as now it's enough to have size==0
and area==NULL to designate the empty buffer (thus a non-allocated head
is the empty buffer by default). buf_wanted for now is indicated by
size==0 and area==(void *)1.
The channels and the checks now embed the buffer's head, and the only
pointer is to the storage area. This slightly increases the unallocated
buffer size (3 extra ints for the empty buffer) but considerably
simplifies dynamic buffer management. It will also later permit to
detach unused checks.
The way the struct buffer is arranged has proven quite efficient on a
number of tests, which makes sense given that size is always accessed
and often first, followed by the othe ones.
The new file istbuf.h links the indirect strings (ist) with the buffers.
The purpose is to encourage addition of more standard buffer manipulation
functions that rely on this in order to improve the overall ease of use
along all the code. Just like ist.h and buf.h, this new file is not
expected to depend on anything beyond these two files.
A few functions were added and/or converted from buffer.h :
- b_isteq() : indicates if a buffer and a string match
- b_isteat() : consumes a string from the buffer if it matches
- b_istput() : appends a small string to a buffer (all or none)
- b_putist() : appends part of a large string to a buffer
The equivalent functions were removed from buffer.h and changed at the
various call places.
The two variants now do exactly the same (appending at the tail of the
buffer) so let's not keep the distinction between these classes of
functions and have generic ones for this. It's also worth noting that
b{i,o}_putchk() wasn't used at all and was removed.
There is no more distinction between ->i and ->o for the mux's buffers,
we always use b_data() to know the buffer's length since only one side
is used for each direction.
With this flag we introduce the notion of "dry" vs "wet" buffers : some
demultiplexers like the H2 mux require as much room as possible for some
operations that are not retryable like decoding a headers frame. For this
they need to know if the buffer is congested with data scheduled for
leaving soon or not. Since the new API will not provide this information
in the buffer itself, the caller must indicate it. We never need to know
the amount of such data, just the fact that the buffer is not in its
optimal condition to be used for receipt. This "CO_RFL_BUF_WET" flag is
used to mention that such outgoing data are still pending in the buffer
and that a sensitive receiver should better let it "dry" before using it.
The mux and transport rcv_buf() now takes a "flags" argument, just like
the snd_buf() one or like the equivalent syscall lower part. The upper
layers will use this to pass some information such as indicating whether
the buffer is free from outgoing data or if the lower layer may allocate
the buffer itself.
It also returns a size_t. This is in order to clean the API. Note
that the H2 mux still uses some ints in the functions called from
h2_rcv_buf(), though it's not really a problem given that H2 frames
are smaller. It may deserve a general cleanup later though.
This way the mux doesn't need to modify the buffer's metadata anymore
nor to know the output's size. The mux->snd_buf() function now takes a
const buffer and it's up to the caller to update the buffer's state.
The return type was updated to return a size_t to comply with the count
argument.
This way the senders don't need to modify the buffer's metadata anymore
nor to know about the output's split point. This way the functions can
take a const buffer and it's clearer who's in charge of updating the
buffer after a send. That's why the buffer realignment is now performed
by the caller of the transport's snd_buf() functions.
The return type was updated to return a size_t to comply with the count
argument.
Now that there are no more users requiring to modify the buffer anymore,
switch these ones to const char and const buffer. This will make it more
obvious next time send functions are tempted to modify the buffer's output
count. Minor adaptations were necessary at a few call places which were
using char due to the function's previous prototype.
The few places where they were still used were replaced with b_peek() and
b_wrap() respectively. The parts making use of ->i and ->o should now be
convertible to the new API.
Functions h2s_frt_make_resp_headers() and h2s_frt_make_resp_data() used
to modify the buffer's output data count. This is problematic for the
buffer's rework as we don't want to rely on this anymore. This commit
modifies these functions to take an offset (relative to the buffer's
head) and a maximum byte count. Thus h2_snd_buf() now calls them with
buf->o and takes care of removing deleted data itself. The send functions
now almost support being passed const buffers (except for the data part
which is still embedded).
There's no more error return combined with the send output, though
the comments were misleading. Let's fix this as well as the functions'
prototypes. h2_snd_buf()'s return value wasn't changed yet since it
has to match the ->snd_buf prototype.
Till now the callers had to know which one to call for specific use cases.
Let's fuse them now since a single one will remain after the API migration.
Given that bi_del() may only be used where o==0, just combine the two tests
by first removing output data then only input.
This will be important so that we can parse a buffer without touching it.
Now we indicate where from the buffer's head we plan to start to copy, and
for how many bytes. This will be used by send functions to loop at the end
of the buffer without having to update the buffer's output byte count.
These ones were merged into a single b_contig_space() that covers both
(the bo_ case was a simplified version of the other one). The function
doesn't use ->i nor ->o anymore.
This function was sometimes used from a channel and sometimes from a buffer.
In both cases it requires knowledge of the size of the output data (to skip
them). Here the split ensures the channel can deal with this point, and that
other places not having output data can continue to work.
Where relevant, the channel version is used instead. The buffer version
was ported to be more generic and now takes a swap buffer and the output
byte count to know where to set the alignment point. The H2 mux still
uses buffer_slow_realign() with buf->o but it will change later.
Passing unsigned ints everywhere is painful, and will cause some headache
later when we'll want to integrate better with struct ist which already
uses size_t. Let's switch buffers to use size_t instead.
If a timeout strikes on the connection side with some active streams,
there is a corner case which can sometimes cause the following sequence
to happen :
- There are active streams but there are data in the mux buffer
(eg: a client suddenly disconnected during a download with pending
requests). The timeout is active.
- The timeout strikes, h2_timeout_task() is called, kills the task and
doesn't close the connection since there are streams left ; The
connection is marked in H2_CS_ERROR ;
- the streams are woken up and closed ;
- when the last stream closes, calling h2_detach(), it sees the
tree list is empty, but there is no condition allowing the
connection to be closed (mbuf->o > 0), thus it does nothing ;
- since the task is dead, there's no more hope to clear this
situation later
For now we can take care of this by adding a test for the presence of
H2_CS_ERROR and !task, implying the timeout task triggered already
and will not be able to handle this again.
Over the long term it seems like a more reliable test on should be
made, so that it is possible to know whether or not someone is still
able to close this connection.
A big thanks to Janusz Dziemidowicz and Milan Petruzelka for providing
many details helping in figuring this bug.
We currently don't process trailers on H2, but this has an impact : on
chunked HTTP/1 responses, we decide to emit the ES bit once we see the
0CRLF. From this point the stream switches to the CLOSED state, which
aborts processing of the remaining bytes. Thus the extra CRLF which ends
trailers is not processed and remains in the buffer. This prevents the
stream from being notified about end of transmission, which in turn keeps
the mux busy and prevents the connection from quitting.
The case of the trailers is not the root cause of this issue, though it
is what triggers it. The root cause is that upon error and/or close, once
we know we're not going to process any more data, we must absolutely flush
any remaining bytes from the output buffer, otherwise there is no way the
stream can quit. This is what this patch does.
It looks very likely related to the issues reported and debugged by
Janusz Dziemidowicz and Milan Petruzelka.
One way to reproduce it is to chain two proxies with the last one emitting
chunked data (typically using the stats page) :
global
stats socket /tmp/sock1 mode 666 level admin
stats timeout 1h
tune.ssl.default-dh-param 1024
tune.bufsize 16384
defaults
mode http
timeout connect 4s
timeout client 10s
timeout server 20s
listen px1
bind :4443 ssl crt rsa+dh2048.pem npn h2 alpn h2
server s1 127.0.0.1:4445
listen px2
bind :4444 ssl crt rsa+dh2048.pem npn h2 alpn h2
bind :4445
stats uri /
Then use curl to fetch the stats through px1 :
curl --http2 -k "https://127.0.0.1:4443/"
When curl is sent to the first one, "show sess" issued to the CLI will
show a remaining session during the client timeout. When curl is aimed at
port 4444 (px2), there is no such remaining session.
This fix needs to be backported to 1.8.
The streams bookkeeping made in H2 is used for protocol compliance only
but it doesn't consider the number of conn_streams still attached to the
mux. It causes an issue when http-request set-nice rules are applied on
H2 requests processed on a saturated machine. Indeed, in this case, the
requests are accepted and assigned a default nice value of zero. When
they are processed, their nice value changes to a higher one (say 1024).
The response is sent through the H2 mux, which detects the end of stream
and decrements the protocol-level stream count (h2c->nb_streams). The
client may then send a new request. But the conn_stream is still attached
and will require a new call to process_stream() to finish, which is made
through the scheduler. Given that the machine is saturated, it is assumed
that many tasks are present in the scheduler. Thus the closing tasks holding
a higher nice value will pass after the new stream creations. If the client
is fast enough with a low latency link, it may add a lot of new stream
creations before the stream terminations have a chance to disappear due
to their high nice value, resulting in a huge amount of memory being used.
The solution consists in letting a mux always monitor its conn_streams and
refrain from creating new ones when it is full. Here the H2 mux checks the
nb_cs counter and sets a new blocked flag (H2_CF_DEM_TOOMANY) if the limit
was reached, so that the frame parser requests a pause in the new stream
creation, leaving some time for the pending conn_streams to vanish.
Several experiments were made using varying thresholds to see if
overbooking would provide any benefit here but it turned out not to be
the case, so the conn_stream limit remains set to the exact streams
limit. Interestingly various performance measurements showed that the
code tends to be slightly faster now than without the limit, probably
due to the smoother memory usage.
This commit requires previous patch ("MINOR: h2: keep a count of the number
of conn_streams attached to the mux"). It needs to be backported to 1.8.
The h2 mux only knows about the number of H2 streams which are not in a
CLOSED state. This is used for protocol compliance. But it doesn't hold
the number of really attached streams. It is a problem because depending
on scheduling, it is possible that more streams are attached to the mux
than the ones seen at the protocol level, due to some streams taking some
time to be detached. Let's add this count based on the conn_streams.
Note: this patch is part of a series of fixes which will have to be
backported to 1.8.
There's no real reason to have a specific scheduler for applets anymore, so
nuke it and just use tasks. This comes with some benefits, the first one
being that applets cannot induce high latencies anymore since they share
nice values with other tasks. Later it will be possible to configure the
applets' nice value. The second benefit is that the applet scheduler was
not very thread-friendly, having a big lock around it in prevision of this
change. Thus applet-intensive workloads should now scale much better with
threads.
Some more improvement is possible now : some applets also use a task to
handle timers and timeouts. These ones could now be simplified to use only
one task.
In preparation for thread-specific runqueues, change the task API so that
the callback takes 3 arguments, the task itself, the context, and the state,
those were retrieved from the task before. This will allow these elements to
change atomically in the scheduler while the application uses the copied
value, and even to have NULL tasks later.
Upload requests not carrying a content-length nor tunnelling data must
be sent chunked-encoded over HTTP/1. The code was planned but for some
reason forgotten during the implementation, leading to such payloads to
be sent as tunnelled data.
Browsers always emit a content length in uploads so this problem doesn't
happen for most sites. However some applications may send data frames
after a request without indicating it earlier.
The only way to detect that a client will need to send data is that the
HEADERS frame doesn't hold the ES bit. In this case it's wise to look
for the content-length header. If it's not there, either we're in tunnel
(CONNECT method) or chunked-encoding (other methods).
This patch implements this.
The following request is sent using content-length :
curl --http2 -sk https://127.0.0.1:4443/s2 -XPOST -T /large/file
and these ones using chunked-encoding :
curl --http2 -sk https://127.0.0.1:4443/s2 -XPUT -T /large/file
curl --http2 -sk https://127.0.0.1:4443/s2 -XPUT -T - < /dev/urandom
Thanks to Robert Samuel Newson for raising this issue with details.
This fix must be backported to 1.8.
We'll need this in order to support uploading chunks. The h2 to h1
converter checks for the presence of the content-length header field
as well as the CONNECT method and returns these information to the
caller. The caller indicates whether or not a body is detected for
the message (presence of END_STREAM or not). No transfer-encoding
header is emitted yet.
The incoming H2 frame length was checked against the max_frame_size
setting instead of being checked against the bufsize. The max_frame_size
only applies to outgoing traffic and not to incoming one, so if a large
enough frame size is advertised in the SETTINGS frame, a wrapped frame
will be defragmented into a temporary allocated buffer where the second
fragment my overflow the heap by up to 16 kB.
It is very unlikely that this can be exploited for code execution given
that buffers are very short lived and their address not realistically
predictable in production, but the likeliness of an immediate crash is
absolutely certain.
This fix must be backported to 1.8.
Many thanks to Jordan Zebor from F5 Networks for reporting this issue
in a responsible way.
When a stream blocks on a mux buffer full/unallocated or on connection
flow control, a flag among H2_SF_MUX_M* is set, but the stream is not
always added to the connection's list. It's properly done when the
operations are performed from the connection handler but not always when
done from the stream handler. For instance, a simple shutr or shutw may
fail by lack of room. If it's immediately followed by a call to h2_detach(),
the stream remains lying around in no list at all, and prevents the
connection from ending. This problem is actually quite difficult to
trigger and seems to require some large objects and low server-side
timeouts.
This patch covers all identified paths. Some are redundant but since the
code will change and will be simplified in 1.9, it's better to stay on
the safe side here for now. It must be backported to 1.8.
Commit e3f36cd ("MINOR: h2: implement a basic "show_fd" function")
accidently brought one surrounding debugging part that was in the same
context. No backport needed.
The purpose here is to dump some information regarding an H2 connection,
and a few statistics about its streams. The output looks like this :
35 : st=0x55(R:PrA W:PrA) ev=0x00(heopi) [lc] cache=0 owner=0x7ff49ee15e80 iocb=0x588a61(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00201366 fe=decrypt mux=H2 mux_ctx=0x7ff49ee16f30 st0=2 flg=0x00000002 fctl_cnt=0 send_cnt=33 tree_cnt=33 orph_cnt=0
- st0 is the connection's state (FRAME_H here)
- flg is the connection's flags (MUX_MFULL here)
- fctl_cnt is the number of streams in the fctl_list
- send_cnt is the number of streams in the send_list
- tree_cnt is the number of streams in the streams_by_id tree
- orph_cnt is the number of orphaned streams (cs==0) in the tree
Interrupting an h2load test shows that some connections remain active till
the client timeout. This is due to the fact that h2_detach() immediately
returns if the h2s flags indicate that the h2s is still waiting for some
buffer room in the output mux (possibly to emit a response or to send some
window updates). If the connection is broken, these data will never leave
and must not prevent the stream from being terminated nor the connection
from being released.
This fix must be backported to 1.8.
Currently, h2_release() will release all resources assigned to the h2
connection, including the timeout task if any. But since the multi-threaded
scheduler, the timeout task could very well be queued in the thread-local
list of running tasks without any way to remove it, so task_delete() will
have no effect and task_free() will cause this undefined object to be
dereferenced.
In order to prevent this from happening, we never release the task in
h2_release(), instead we wake it up after marking its context NULL so that
the task handler can release the task.
Future improvements could consist in modifying the scheduler so that a
task_wakeup() has to be done on any task having to be killed, letting
the scheduler take care of it.
This fix must be backported to 1.8. This bug was apparently not reported
so far.
Since these two functions are always used together, let's simplify
the code by having a single one for both operations. It also ensures
we don't leave wandering elements that risk to leak later.
The code is safer and more robust this way, it avoids multiple paths.
This is possible due to the idempotence of LIST_DEL() and eb32_delete()
that are called in h2s_detach().
Several people reported very strange occasional crashes when using H2.
Every time it appeared that either an h2s or a task was corrupted. The
outcome is that a missing LIST_DEL() when removing an orphaned stream
from the list in h2_wake_some_streams() can cause this stream to
remain present in the send list after it was freed. This may happen
when receiving a GOAWAY frame for example. In the mean time the send
list may be processed due to pending streams, and the just released
stream is still found. If due to a buffer full condition we left the
h2_process_demux() loop before being able to process the pending
stream, the pool entry may be reassigned somewhere else. Either another
h2 connection will get it, or a task, since they are the same size and
are shared. Then upon next pass in h2_process_mux(), the stream is
processed again. Either it crashes here due to modifications, or the
contents are harmless to it and its last changes affect the other object
reasigned to this area (typically a struct task). In the case of a
collision with struct task, the LIST_DEL operation performed on h2s
corrupts the task's wait queue's leaf_p pointer, thus all the wait
queue's structure.
The fix consists in always performing the LIST_DEL in h2s_detach().
It will also make h2s_stream_new() more robust against a possible
future situation where stream_create_from_cs() could have sent data
before failing.
Many thanks to all the reporters who provided extremely valuable
information, traces and/or cores, namely Thierry Fournier, Yves Lafon,
Holger Amann, Peter Lindegaard Hansen, and discourse user "slawekc".
This fix must be backported to 1.8. It is probably better to also
backport the following code cleanups with it as well to limit the
divergence between master and 1.8-stable :
00dd078 CLEANUP: h2: rename misleading h2c_stream_close() to h2s_close()
0a10de6 MINOR: h2: provide and use h2s_detach() and h2s_free()
There are some corner cases where this could happen by accident. Since
the spec explicitly forbids this (RFC7540#5.4.2), let's add a test in
the two only functions which make the RST to avoid this. Thanks to user
klzgrad for reporting this problem. Usually it is expected to be harmless
but may result in browsers issuing a warning.
This fix must be backported to 1.8.
Recent fixes made to process partial frames broke the flow control on
DATA frames, as the padding is not considered anymore, only the actual
data is. Let's simply take account of the padding once the transfer
ends. The probability to meet this bug is low because, when used, padding
is small and it can require a large number of padded transfers before the
window is completely depleted.
Thanks to user klzgrad for reporting this bug and confirming the fix.
This fix must be backported to 1.8.
Right now the h2 idle timeout is only set when there is no stream. If we
fail to send because the socket buffers are full (generally indicating
the client has left), we also need to arm it so that we can properly
expire such connections, otherwise some failed transfers might leave
H2 connections pending forever.
Thanks to Thierry Fournier for the diag and the traces.
This patch needs to be backported to 1.8.
We used to have one buffer allocator per direction while we can never
block on two buffers at once. Let's have a single one and rely on the
connection's flags to know which one we're waitinf for.
This function takes an h2c and an h2s but it never uses the h2c, which
is a bit confusing at some places in the code. Let's make it clear that
it only operates on the h2s instead by renaming it and removing the
unused h2c argument.
In case a stream tries to emit more data than advertised by the chunks
or content-length headers, the extra data remains in the channel's output
buffer until the channel's timeout expires. It can easily happen when
sending malformed error files making use of a wrong content-length or
having extra CRLFs after the empty chunk. It may also be possible to
forge such a bad response using Lua.
The H1 to H2 encoder must protect itself against this by marking the data
presented to it as consumed if it decides to discard them, so that the
sending stream doesn't wait for the timeout to trigger.
The visible effect of this problem is a huge memory usage and a high
concurrent connection count during benchmarks when using such bad data
(a typical place where this easily happens).
This fix must be backported to 1.8.
In h2_get_dbuf, when the buffer allocation was failing, dbuf_wait.target was
errornously set to the connection (h2c->conn) instead of the h2 connection
descriptor (h2c).
This patch must be backported to 1.8.
This removes the unused next_header_block and try_again labels
from mux_h2.c.
try_again is unused as of a76e4c2183,
which first appeared in haproxy 1.8.0.
next_header_block is unused as of 872855998b,
which was backported to haproxy 1.8.0 as
59fcb216085a7aa9744cffe39567c80de4ebd6bf.
Instead of looking for CO_FL_EARLY_DATA to know if we have to try to wake
up a stream, because it is waiting for a SSL handshake, instead add a new
conn_stream flag, CS_FL_WAIT_FOR_HS. This way we don't have to rely on
CO_FL_EARLY_DATA, and we will only wake streams that are actually waiting.
Peter Lindegaard Hansen reported a problem affecting some POST requests
sent by MSIE on 1.8.3. Lukas found that we incorrectly dealt with the
END_STREAM flag on empty DATA frames.
What happens in fact is that while we correctly report that we've read a
zero-byte frame, since commit 8fc016d ("BUG/MEDIUM: h2: support uploading
partial DATA frames") backported into 1.8.2, we've been able to return
without updating the parser's state nor checking the frame flags in this
case.
The fix is trival, we just need not to return too early.
This fix must be backported to 1.8.
During a reload operation, instead of keeping the H2 connections opened
forever causing confusion during configuration changes, let's send a
graceful shutdown so that the client knows that it would better open a
new connection for future requests. We can't really catch the signal
from H2, but we can advertise this graceful shutdown upon the next I/O
event (eg: a WINDOW_UPDATE from the client or a new request). One of
the visible effect is that the old process quits much faster.
This patch should be backported to 1.8 since it is affected by this
problem.
The recent patch introducing the H2_CS_FRAME_E state to emit stream
resets was not totally correct in that in the rare case where there is
no room left to emit the reset, the next call to process it later could
use an uninitialized stream. This only affects responses to frames that
are sent on closed streams though.
This fix must be backported to 1.8.
The h2spec utility found certain situations where we're returning an
RST_STREAM while a GOAWAY is expected. While we can't always reliably
decide which one to use (eg: after a stream has been closed for a long
time), in practice we often still have the stream available until it's
destroyed at the application level. This provides the flags we need to
verify the conditions that led to its closure, namely if RST was sent
or received, or if it was regularly closed using a double ES.
The first step consists in marking all closed streams as having already
sent an RST_STREAM frame. This will ensure that we can send an RST_STREAM
for a late transmission on a stream we have forgotten about instead of
risking to break the connection. The next steps consist in re-arranging
the H2_SS_CLOSED checks so that we can deliver a GOAWAY frame for the
few cases where an unexpected frame was received after a double ES.
By carefully taking care of these specificities, we can reduce by 4 the
number of remaining compliance issues.
Note: some tests start to become a bit long and to be repeated at various
places. Probably that adding a bitmask of allowed/forbidden frame types
per state and/or per situation could significantly help. It's likely
that some deeper tests in the frame handlers could also be removed now
as they can't be triggered anymore.
This fix should be backported to 1.8.
Some stream errors applied to half-closed and closed streams are not
properly reported, especially after the stream transistions to the
closed state. The reason is that the code checks for this "error"
stream state in order to send an RST frame. But if the stream was
just closed or was already closed, there's no way to validate this
condition, and the error is never reported to the peer.
In order to address this situation, we'll add a new FRAME_E demux state
which indicates that the previously parsed frame triggered a stream error
of type STREAM CLOSED that needs to be reported. Proceeding like this
will ensure that we don't lose that information even if we can't
immediately send the message. It also removes the confusion where FRAME_A
could be used either for ACKs or for RST.
The state transition has been added after every h2s_error() on the demux
path. It seems that we might need to have two distinct h2s_error()
functions, one for the mux and another one for the demux, though it
would provide little benefit. It also becomes more apparent that the
H2_SS_ERROR state is only used to detect the need to report an error
on the mux direction. Maybe this will have to be revisited later.
This simple change managed to eliminate 5 bugs reported by h2spec.
This fix must be backported to 1.8.
This new field will be used to describe certain properties of some
muxes. For now we only add MX_FL_CLEAN_ABRT to indicate that a mux
is able to unambiguously report aborts using CS_FL_ERROR contrary
to others who may only report it via a read0. This will be used to
improve handling of the abortonclose option with H2. Other flags
may come later to report multiplexing capabilities or not, support
of client/server sides etc.
Commit 4974561 ("BUG/MEDIUM: h2: enforce the per-connection stream limit")
implemented a stream limit enforcement on the connection but it was not
correctly done as it would count streams still known by the connection,
which includes the lingering ones that are already marked close. We need
to count only the non-closed ones, which this patch does. The effect is
that some streams are rejected a bit before the limit.
This fix needs to be backported to 1.8.
Tunnelled responses are those without a content-length nor a chunked
encoding. They are specially dealt with in the current code but the
behaviour is not correct. The fact that the chunk size is left to zero
with a state artificially set to CHUNK_SIZE validates the test on
whether or not to set the end of stream flag. Thus the first DATA
frame always carries the ES flag and subsequent ones remain blocked.
This patch fixes it in two ways :
- update h1m->curr_len to the size of the current buffer so that it
is properly subtracted later to find the real end ;
- don't set the state to CHUNK_SIZE when there's no content-length
and instead set it to CHUNK_SIZE only when there's chunking.
This fix needs to be backported to 1.8.
We used to switch the stream's state to HREM when seeing and ES bit on
the DATA frame before actually being able to process that frame, possibly
resulting in the DATA frame being processed after the stream was seen as
half-closed and possibly being rejected. The state must not change before
the frame is really processed.
Also fixes a harmless typo in the flag name which should have DATA and
not HEADERS in its name (but all values are equal).
Must be backported to 1.8.
Since last commit it's not required that the DATA frames are complete anymore
so better start with what we have. Only the HEADERS frame requires this. This
may be backported as part of the upload fixes.