For the same consistency reasons, let's use b_empty() at the few places
where an empty buffer is expected, or c_empty() if it's done on a channel.
Some of these places were there to realign the buffer so
{b,c}_realign_if_empty() was used instead.
We used to have variations around buffer_total_space() and
size-buffer_len() or size-b_data(). Let's simplify all this. buffer_len()
was also removed as not used anymore.
With this flag we introduce the notion of "dry" vs "wet" buffers : some
demultiplexers like the H2 mux require as much room as possible for some
operations that are not retryable like decoding a headers frame. For this
they need to know if the buffer is congested with data scheduled for
leaving soon or not. Since the new API will not provide this information
in the buffer itself, the caller must indicate it. We never need to know
the amount of such data, just the fact that the buffer is not in its
optimal condition to be used for receipt. This "CO_RFL_BUF_WET" flag is
used to mention that such outgoing data are still pending in the buffer
and that a sensitive receiver should better let it "dry" before using it.
The mux and transport rcv_buf() now takes a "flags" argument, just like
the snd_buf() one or like the equivalent syscall lower part. The upper
layers will use this to pass some information such as indicating whether
the buffer is free from outgoing data or if the lower layer may allocate
the buffer itself.
It also returns a size_t. This is in order to clean the API. Note
that the H2 mux still uses some ints in the functions called from
h2_rcv_buf(), though it's not really a problem given that H2 frames
are smaller. It may deserve a general cleanup later though.
Just like we have a size_t for xprt->snd_buf(), we adjust to use size_t
for rcv_buf()'s count argument and return value. It also removes the
ambiguity related to the possibility to see a negative value there.
This way the mux doesn't need to modify the buffer's metadata anymore
nor to know the output's size. The mux->snd_buf() function now takes a
const buffer and it's up to the caller to update the buffer's state.
The return type was updated to return a size_t to comply with the count
argument.
This way the senders don't need to modify the buffer's metadata anymore
nor to know about the output's split point. This way the functions can
take a const buffer and it's clearer who's in charge of updating the
buffer after a send. That's why the buffer realignment is now performed
by the caller of the transport's snd_buf() functions.
The return type was updated to return a size_t to comply with the count
argument.
Now that there are no more users requiring to modify the buffer anymore,
switch these ones to const char and const buffer. This will make it more
obvious next time send functions are tempted to modify the buffer's output
count. Minor adaptations were necessary at a few call places which were
using char due to the function's previous prototype.
The few places where they were still used were replaced with b_peek() and
b_wrap() respectively. The parts making use of ->i and ->o should now be
convertible to the new API.
Functions h2s_frt_make_resp_headers() and h2s_frt_make_resp_data() used
to modify the buffer's output data count. This is problematic for the
buffer's rework as we don't want to rely on this anymore. This commit
modifies these functions to take an offset (relative to the buffer's
head) and a maximum byte count. Thus h2_snd_buf() now calls them with
buf->o and takes care of removing deleted data itself. The send functions
now almost support being passed const buffers (except for the data part
which is still embedded).
There's no more error return combined with the send output, though
the comments were misleading. Let's fix this as well as the functions'
prototypes. h2_snd_buf()'s return value wasn't changed yet since it
has to match the ->snd_buf prototype.
Till now the callers had to know which one to call for specific use cases.
Let's fuse them now since a single one will remain after the API migration.
Given that bi_del() may only be used where o==0, just combine the two tests
by first removing output data then only input.
This will be important so that we can parse a buffer without touching it.
Now we indicate where from the buffer's head we plan to start to copy, and
for how many bytes. This will be used by send functions to loop at the end
of the buffer without having to update the buffer's output byte count.
This new functoin limits itself to the amount of data available in the
buffer and doesn't care about the direction anymore. It's only called
from co_getblk() which already checks that no more than the available
output bytes is requested.
These ones were merged into a single b_contig_space() that covers both
(the bo_ case was a simplified version of the other one). The function
doesn't use ->i nor ->o anymore.
This function was sometimes used from a channel and sometimes from a buffer.
In both cases it requires knowledge of the size of the output data (to skip
them). Here the split ensures the channel can deal with this point, and that
other places not having output data can continue to work.
These ones manipulate the output data count which will be specific to
the channel soon, so prepare the call points to use the channel only.
The b_* functions are now unused and were removed.
The few call places where it's used can use the trash as a swap buffer,
which is made for this exact purpose. This way we can rely on the
generic b_slow_realign() call.
Where relevant, the channel version is used instead. The buffer version
was ported to be more generic and now takes a swap buffer and the output
byte count to know where to set the alignment point. The H2 mux still
uses buffer_slow_realign() with buf->o but it will change later.
Passing unsigned ints everywhere is painful, and will cause some headache
later when we'll want to integrate better with struct ist which already
uses size_t. Let's switch buffers to use size_t instead.
If a timeout strikes on the connection side with some active streams,
there is a corner case which can sometimes cause the following sequence
to happen :
- There are active streams but there are data in the mux buffer
(eg: a client suddenly disconnected during a download with pending
requests). The timeout is active.
- The timeout strikes, h2_timeout_task() is called, kills the task and
doesn't close the connection since there are streams left ; The
connection is marked in H2_CS_ERROR ;
- the streams are woken up and closed ;
- when the last stream closes, calling h2_detach(), it sees the
tree list is empty, but there is no condition allowing the
connection to be closed (mbuf->o > 0), thus it does nothing ;
- since the task is dead, there's no more hope to clear this
situation later
For now we can take care of this by adding a test for the presence of
H2_CS_ERROR and !task, implying the timeout task triggered already
and will not be able to handle this again.
Over the long term it seems like a more reliable test on should be
made, so that it is possible to know whether or not someone is still
able to close this connection.
A big thanks to Janusz Dziemidowicz and Milan Petruzelka for providing
many details helping in figuring this bug.
We currently don't process trailers on H2, but this has an impact : on
chunked HTTP/1 responses, we decide to emit the ES bit once we see the
0CRLF. From this point the stream switches to the CLOSED state, which
aborts processing of the remaining bytes. Thus the extra CRLF which ends
trailers is not processed and remains in the buffer. This prevents the
stream from being notified about end of transmission, which in turn keeps
the mux busy and prevents the connection from quitting.
The case of the trailers is not the root cause of this issue, though it
is what triggers it. The root cause is that upon error and/or close, once
we know we're not going to process any more data, we must absolutely flush
any remaining bytes from the output buffer, otherwise there is no way the
stream can quit. This is what this patch does.
It looks very likely related to the issues reported and debugged by
Janusz Dziemidowicz and Milan Petruzelka.
One way to reproduce it is to chain two proxies with the last one emitting
chunked data (typically using the stats page) :
global
stats socket /tmp/sock1 mode 666 level admin
stats timeout 1h
tune.ssl.default-dh-param 1024
tune.bufsize 16384
defaults
mode http
timeout connect 4s
timeout client 10s
timeout server 20s
listen px1
bind :4443 ssl crt rsa+dh2048.pem npn h2 alpn h2
server s1 127.0.0.1:4445
listen px2
bind :4444 ssl crt rsa+dh2048.pem npn h2 alpn h2
bind :4445
stats uri /
Then use curl to fetch the stats through px1 :
curl --http2 -k "https://127.0.0.1:4443/"
When curl is sent to the first one, "show sess" issued to the CLI will
show a remaining session during the client timeout. When curl is aimed at
port 4444 (px2), there is no such remaining session.
This fix needs to be backported to 1.8.
The streams bookkeeping made in H2 is used for protocol compliance only
but it doesn't consider the number of conn_streams still attached to the
mux. It causes an issue when http-request set-nice rules are applied on
H2 requests processed on a saturated machine. Indeed, in this case, the
requests are accepted and assigned a default nice value of zero. When
they are processed, their nice value changes to a higher one (say 1024).
The response is sent through the H2 mux, which detects the end of stream
and decrements the protocol-level stream count (h2c->nb_streams). The
client may then send a new request. But the conn_stream is still attached
and will require a new call to process_stream() to finish, which is made
through the scheduler. Given that the machine is saturated, it is assumed
that many tasks are present in the scheduler. Thus the closing tasks holding
a higher nice value will pass after the new stream creations. If the client
is fast enough with a low latency link, it may add a lot of new stream
creations before the stream terminations have a chance to disappear due
to their high nice value, resulting in a huge amount of memory being used.
The solution consists in letting a mux always monitor its conn_streams and
refrain from creating new ones when it is full. Here the H2 mux checks the
nb_cs counter and sets a new blocked flag (H2_CF_DEM_TOOMANY) if the limit
was reached, so that the frame parser requests a pause in the new stream
creation, leaving some time for the pending conn_streams to vanish.
Several experiments were made using varying thresholds to see if
overbooking would provide any benefit here but it turned out not to be
the case, so the conn_stream limit remains set to the exact streams
limit. Interestingly various performance measurements showed that the
code tends to be slightly faster now than without the limit, probably
due to the smoother memory usage.
This commit requires previous patch ("MINOR: h2: keep a count of the number
of conn_streams attached to the mux"). It needs to be backported to 1.8.
The h2 mux only knows about the number of H2 streams which are not in a
CLOSED state. This is used for protocol compliance. But it doesn't hold
the number of really attached streams. It is a problem because depending
on scheduling, it is possible that more streams are attached to the mux
than the ones seen at the protocol level, due to some streams taking some
time to be detached. Let's add this count based on the conn_streams.
Note: this patch is part of a series of fixes which will have to be
backported to 1.8.
Commit 200b0fa ("MEDIUM: Add support for updating TLS ticket keys via
socket") introduced support for updating TLS ticket keys from the CLI,
but missed a small corner case : if multiple bind lines reference the
same tls_keys file, the same reference is used (as expected), but during
the clean shutdown, it will lead to a double free when destroying the
bind_conf contexts since none of the lines knows if others still use
it. The impact is very low however, mostly a core and/or a message in
the system's log upon old process termination.
Let's introduce some basic refcounting to prevent this from happening,
so that only the last bind_conf frees it.
Thanks to Janusz Dziemidowicz and Thierry Fournier for both reporting
the same issue with an easy reproducer.
This fix needs to be backported from 1.6 to 1.8.
By default, HAProxy's DNS resolution at runtime ensure that there is no
IP address duplication in a backend (for servers being resolved by the
same hostname).
There are a few cases where people want, on purpose, to disable this
feature.
This patch introduces a couple of new server side options for this purpose:
"resolve-opts allow-dup-ip" or "resolve-opts prevent-dup-ip".
dns_get_ip_from_response() is used to compare the caller current IP to
the IP available in the records returned by the DNS server.
A scoring system is in place to get the best IP address available.
That said, in the current implementation, there are a couple of issues:
1. a comment does not match what the code does
2. the code does not match what the commet says (score value is not
incremented with '2')
This patch fixes both issues.
Backport status: 1.8
The comment was about "prefered network ip version" while it's actually
"prefered ip version" in the code.
Fixed
Backport status: 1.7 and 1.8
Be careful, this patch may not apply on 1.7, since the score was '4'
for this item at that time.
Change the way the process groups are set. Indeed setsid() was called
for every processes which caused the worker to have a different process
group than the master.
This patch behave in a better way:
- In daemon mode only, each child do a setsid()
- In master worker + daemon mode, the setsid() is done in the master before
forking the children
- In any foreground mode, we don't do a setsid()
Could be backported in 1.8 but the master-worker mode is mostly used
with systemd which rely on cgroups so that won't affect much people.