The H2 demux only checks for too many streams in h2c_frt_stream_new(),
then refuses to create a new stream and causes the connection to be
aborted by sending a GOAWAY frame. This will also happen if any error
happens during the stream creation (e.g. memory allocation).
RFC7540#5.1.2 says that attempts to create streams in excess should
instead be dealt with using an RST_STREAM frame conveying either the
PROTOCOL_ERROR or REFUSED_STREAM reason (the latter being usable only
if it is guaranteed that the stream was not processed). In theory it
should not happen for well behaving clients, though it may if we
configure a low enough h2.max_concurrent_streams limit. This error
however may definitely happen on memory shortage.
Previously it was not possible to use RST_STREAM due to the fact that
the HPACK decompressor would be desynchronized. But now we first decode
and only then try to allocate the stream, so the decompressor remains
synchronized regardless of policy or resources issues.
With this patch we enforce stream termination with RST_STREAM and
REFUSED_STREAM if this protocol violation happens, as well as if there
is a temporary condition like a memory allocation issue. It will allow
a client to recover cleanly.
This could possibly be backported to 1.9. Note that this requires that
these five previous patches are merged as well :
MINOR: h2: add a bit-based frame type representation
MEDIUM: mux-h2: remove padlen during headers phase
MEDIUM: mux-h2: decode HEADERS frames before allocating the stream
MINOR: mux-h2: make h2c_send_rst_stream() use the dummy stream's error code
MINOR: mux-h2: add a new dummy stream for the REFUSED_STREAM error code
This patch introduces a new dummy stream, h2_refused_stream, in CLOSED
status with the aforementioned error code. It will be usable to reject
unexpected extraneous streams.
We currently have 2 dummy streams allowing us to send an RST_STREAM
message with an error code matching this one. However h2c_send_rst_stream()
still enforces the STREAM_CLOSED error code for these dummy streams,
ignoring their respective errcode fields which however are properly
set.
Let's make the function always use the stream's error code. This will
allow to create other dummy streams for different codes.
It's hard to recover from a HEADERS frame decoding error after having
already created the stream, and it's not possible to recover from a
stream allocation error without dropping the connection since we can't
maintain the HPACK context, so let's decode it before allocating the
stream, into a temporary buffer that will then be offered to the newly
created stream.
Three types of frames may be padded : DATA, HEADERS and PUSH_PROMISE.
Currently, each of these independently deals with padding and needs to
wait for and skip the initial padlen byte. Not only this complicates
frame processing, but it makes it very hard to process CONTINUATION
frames after a padded HEADERS frame, and makes it complicated to perform
atomic calls to h2s_decode_headers(), which are needed if we want to be
able to maintain the HPACK decompressor's context even when dropping
streams.
This patch takes a different approach : the padding is checked when
parsing the frame header, the padlen byte is waited for and parsed,
and the dpl value is updated with this padlen value. This will allow
the frame parsers to decide to overwrite the padding if needed when
merging adjacent frames.
Since commit f210191 ("BUG/MEDIUM: h2: don't accept new streams if
conn_streams are still in excess") we're refraining from reading input
frames if we've reached the limit of number of CS. The problem is that
it prevents such situations from working fine. The initial purpose was
in fact to prevent from reading new HEADERS frames when this happens,
and causes some occasional transfer hiccups and pauses with large
concurrencies.
Given that we now properly reject extraneous streams before checking
this value, we can be sure never to have too many streams, and that
any higher value is only caused by a scheduling reason and will go
down after the scheduler calls the code.
This fix must be backported to 1.9 and possibly to 1.8. It may be
tested using h2spec this way with an h2spec config :
while :; do
h2spec -o 5 -v -t -S -k -h 127.0.0.1 -p 4443 http2/5.1.2
done
We were returning a stream error of type PROTOCOL_ERROR on empty HEADERS
frames, but RFC7540#4.2 stipulates that we should instead return a
connection error of type FRAME_SIZE_ERROR.
This may be backported to 1.9 and 1.8 though it's unlikely to have any
real life effect.
Commit dc57236 ("BUG/MINOR: mux-h2: advertise a larger connection window
size") caused a WINDOW_UPDATE message to be sent early with the connection
to increase the connection's window size. It turns out that it causes some
minor trouble that need to be worked around :
- varnishtest cannot transparently cope with the WU frames during the
handshake, forcing all tests to explicitly declare the handshake
sequence ;
- some vtc scripts randomly fail if the WU frame is sent after another
expected response frame, adding uncertainty to some tests ;
- h2spec doesn't correctly identify these WU at the connection level
that it believes are the responses to some purposely erroneous frames
it sends, resulting in some errors being reported
None of these are a problem with real clients but they add some confusion
during troubleshooting.
Since the fix above was intended to increase the upload bandwidth, we
have another option which is to increase the window size with the first
WU frame sent for the connection. This way, no WU frame is sent until
one is really needed, and this first frame will adjust the window to
the maximum value. It will make the window increase slightly later, so
the client will experience the first round trip when uploading data,
but this should not be perceptible, and is not worth the extra hassle
needed to maintain our debugging abilities. As an extra bonus, a few
extra bytes are saved for each connection until the first attempt to
upload data.
This should possibly be backported to 1.9 and 1.8.
In some situations, if too short a frame header is received, we may leave
h2_process_demux() waking up the task again without checking that we were
already subscribed.
In order to avoid this once for all, let's introduce an h2_restart_reading()
function which performs the control and calls the task up. This way we won't
needlessly wake the task up if it's already waiting for I/O.
Must be backported to 1.9.
In mux_h2_unsubscribe, don't forget to leave the sending_list if
SUB_CALL_UNSUBSCRIBE was set. SUB_CALL_UNSUBSCRIBE means we were about
to be woken up for writing, unless the mux was too full to get more data.
If there's an unsubscribe call in the meanwhile, we should leave the list,
or we may be put back in the send_list.
This should be backported to 1.9.
In h2_snd_buf(), if we couldn't send the data because of flow control, and
the connection got a shutr, then add CS_FL_ERROR (or CS_FL_ERR_PENDING). We
will never get any window update, so we will never be unlocked, anyway.
No backport is needed.
When dealing with early data we scan the list of stream to notify them.
We're not supposed to have h2s->cs == NULL here but it doesn't cost much
to make the scan more robust and verify it before notifying.
No backport is needed.
If we had no pending read, it could be complicated to report an
RST_STREAM to a sender since we used to only report it via the
rx side if subscribed. Similarly in h2_wake_some_streams() we
now try all methods, hoping to catch all possible events.
No backport is needed.
In order to report an error to the data layer, we have different ways
depending on the situation. At a lot of places it's open-coded and not
always correct. Let's create a new function h2s_alert() to handle this
task. It tries to wake on recv() first, then on send(), then using
wake().
In the mux h2, make sure we set CS_FL_ERR_PENDING and wake the recv task,
instead of setting CS_FL_ERROR, if CS_FL_EOS is not set, so if there's
potentially still some data to be sent.
Commiy 8519357c ("BUG/MEDIUM: mux-h2: report asynchronous errors in
h2_wake_some_streams()") addressed an issue with synchronous errors
but forgot to fix the call places to also pass CS_FL_ERR_PENDING
instead of CS_FL_ERROR.
No backport is needed.
We most often store the mux context there but it can also be something
else while setting up the connection. Better call it "ctx" and know
that it's the owner's context than misleadingly call it mux_ctx and
get caught doing suspicious tricks.
The SUB_CAN_SEND/SUB_CAN_RECV enum values have been confusing a few
times, especially when checking them on reading. After some discussion,
it appears that calling them SUB_RETRY_SEND/SUB_RETRY_RECV more
accurately reflects their purpose since these events may only appear
after a first attempt to perform the I/O operation has failed or was
not completed.
In addition the wait_reason field in struct wait_event which carries
them makes one think that a single reason may happen at once while
it is in fact a set of events. Since the struct is called wait_event
it makes sense that this field is called "events" to indicate it's the
list of events we're subscribed to.
Last, the values for SUB_RETRY_RECV/SEND were swapped so that value
1 corresponds to recv and 2 to send, as is done almost everywhere else
in the code an in the shutdown() call.
Today the demux only wakes a stream up after receiving some contents, but
not necessarily on close or error. Let's do it based on both error flags
and both EOS flags. With a bit of refinement we should be able to only do
it when the pending bits are there but not the static ones.
No backport is needed.
This function is called when dealing with a connection error or a GOAWAY
frame. It used to report a synchronous error instead of an asycnhronous
error, which can lead to data truncation since whatever is still available
in the rxbuf will be ignored. Let's correctly use CS_FL_ERR_PENDING instead
and only fall back to CS_FL_ERROR if CS_FL_EOS was already delivered.
No backport is needed.
If EOS has already been reported on the conn_stream, there won't be
any read anymore to turn ERR_PENDING into ERROR, so we have to do
report it directly.
No backport is needed.
The h2s pointer was used to scan fctl lists prior to being used to scan
the send list by ID, so it could appear non-null eventhough the list is
empty, resulting in misleading information on empty connections.
No backport is needed.
Most of the time when we issue "show fd" to dump a mux's state, it's
to figure why a transfer is frozen. Connection, stream and conn_stream
states are critical there. And most of the time when this happens there
is a single stream left in the H2 mux, so let's always dump the last
known stream on show fd, as most of the time it will be the one of
interest.
Commit 7505f94f9 ("MEDIUM: h2: Don't use a wake() method anymore.")
changed the conditions to restart demuxing so that this happens as soon
as something is read. But similar to previous fix, at an end of stream
we may be woken up with nothing to read but data still available in the
demux buffer, so we must also use this as a valid condition for demuxing.
No backport is needed, this is purely 1.9.
Commit 082f559d3 ("BUG/MEDIUM: h2: restart demuxing after releasing
buffer space") tried to address a situation where transfers could stall
after a read, but the condition was not completely covered : some stalls
may still happen at end of stream because there's nothing anymore to
receive and the last data lie in the demux buffer. Thus we must also
consider this state as a valid condition to restart demuxing.
No backport is needed.
Add a new flag to conn_streams, CS_FL_ERR_PENDING. This is to be set instead
of CS_FL_ERR in case there's still more data to be read, so that we read all
the data before closing.
In h2_deferred_shut, if we're done sending the shutr/shutw, don't destroy
the h2s if it still has a conn_stream attached, or the conn_stream may try
to access it again.
When creating new conn_streams, always set the CS_FL_NOT_FIRST flag. We
don't really care about being the first request for HTTP/2, this only
really makes sense for HTTP/1, and that way we can reuse connections.
In session, don't keep an infinite number of connection that can idle.
Add a new frontend parameter, "max-session-srv-conns" to set a max number,
with a default value of 5.
Instead of trying to get the session from the connection, which is not
always there, and of course there could be multiple sessions per connection,
provide it with the init() and attach() methods, so that we know the
session for each outgoing stream.
In the mux_h1 and mux_h2, move the test to see if we should add the
connection in the idle list until after we destroyed the h1s/h2s, that way
later we'll be able to check if the connection has no stream at all, and if
it should be added to the server idling list.
When using zerocopy, start from the beginning of the data, not from the
beginning of the buffer, it may have contained headers, and so the data
won't start at the beginning of the buffer.
The transpory layer now respects buffer alignment, so we don't need to
cheat anymore pretending we have some data at the head, adjusting the
buffer's head is enough.
For a long time we've been realigning empty buffers in the transport
layers, where the I/Os were performed based on callbacks. Doing so is
optimal for higher data throughput but makes it trickier to optimize
unaligned data, where mux_h1/h2 have to claim some data are present
in the buffer to force unaligned accesses to skip the frame's header
or the chunk header.
We don't need to do this anymore since the I/O calls are now always
performed from top to bottom, so it's only the mux's responsibility
to realign an empty buffer if it wants to.
In practice it doesn't change anything, it's just a convention, and
it will allow the code to be simplified in a next patch.
In the mux detach function, when using HTX, take the connection over if
it no longer has an owner (ie because the session that was the owner left).
It is done for legacy code in proto_http.c, but not for HTX.
Also when using HTX, in H2, try to add the connection back to idle_conns if
it was not already (ie we used to use all the available streams, and we're
freeing one). That too was done in proto_http.c.
H2 has a 9-byte frame header, and HTX has a 40-byte frame header.
By artificially advancing the Rx header and limiting the amount of
bytes read to protect the end of the buffer, we can make the data
payload perfectly aligned with HTX blocks and optimize the copy.
This is similar to what was done for the H1 mux : when the mux's buffer
is empty and the htx area contains exactly one data block of the same
size as the requested count, and all window and frame size conditions are
satisfied, then it's possible to simply swap the caller's buffer with the
mux's output buffer and adjust offsets and length to match the entire
DATA HTX block in the middle. An H2 frame header has to be prepended
before the block but this always fits in an HTX frame header.
In this case we perform a true zero-copy operation from end-to-end. This
is the situation that happens all the time with large files. When using
HTX over H2 over TLS, this brings a 3% extra performance gain. TLS remains
a limiting factor here but the copy definitely has a cost. Also since
haproxy can now use H2 in clear, the savings can be higher.
Due to blocking factor being different on H1 and H2, we regularly end
up with tails of data blocks that leave room in the mux buffer, making
it tempting to copy the pending frame into the remaining room left, and
possibly realigning the output buffer.
Here we check if the output buffer contains data, and prefer to wait
if either the current frame doesn't fit or if it's larger than 1/4 of
the buffer. This way upon next call, either a zero copy, or a larger
and aligned copy will be performed, taking the whole chunk at once.
Doing so increases the H2 bandwidth by slightly more than 1% on large
objects.
By default H2 uses a 65535 bytes window for the connection, and changing
it requires sending a WINDOW_UPDATE message. We only used to update the
window when receiving data, thus never increasing it further.
As reported by user klzgrad on the mailing list, this seriously limits
the upload bitrate, and will have an even higher impact on the backend
H2 connections to origin servers.
There is no technical reason for keeping this window so low, so let's
increase it to the maximum possible value (2G-1). We do this by
pretending we've already received that many data minus the maximum
data the client might already send (65535), so that an early
WINDOW_UPDATE message is sent right after the SETTINGS frame.
This should be backported to 1.8. This patch depends on previous
patch "BUG/MINOR: mux-h2: refrain from muxing during the preface".
The condition to refrain from processing the mux was insufficient as it
would only handle the outgoing connections. In essence it is not that much
of a problem since we don't have streams yet on an incoming connetion. But
it prevents waiting for the end of the preface before sending an early
WINDOW_UPDATE message, thus causing the connections to fail in this case.
This must be backported to 1.8 with a few minor adaptations.
All the HTX definition is self-contained and doesn't really depend on
anything external since it's a mostly protocol. In addition, some
external similar files (like h2) also placed in common used to rely
on it, making it a bit awkward.
This patch moves the two htx.h files into a single self-contained one.
The historical dependency on sample.h could be also removed since it
used to be there only for http_meth_t which is now in http.h.