This function must be called when new incoming data are pushed in the channel's
buffer. It updates the channel state and take care of the fast forwarding by
consuming right amount of data and decrementing "->to_forward" accordingly when
necessary. In fact, this patch just moves a part of ci_putblk in a dedicated
function.
This patch must be backported to 1.9.
In ci_insert_line2() and b_rep_blk(), we can't afford to wrap, so don't use
b_tail() to check if we do, use __b_tail() instead.
This should be backported to previous versions.
Now the buffers only contain the header and a pointer to the storage
area which can be anywhere. This will significantly simplify buffer
swapping and will make it possible to map chunks on buffers as well.
The buf_empty variable was removed, as now it's enough to have size==0
and area==NULL to designate the empty buffer (thus a non-allocated head
is the empty buffer by default). buf_wanted for now is indicated by
size==0 and area==(void *)1.
The channels and the checks now embed the buffer's head, and the only
pointer is to the storage area. This slightly increases the unallocated
buffer size (3 extra ints for the empty buffer) but considerably
simplifies dynamic buffer management. It will also later permit to
detach unused checks.
The way the struct buffer is arranged has proven quite efficient on a
number of tests, which makes sense given that size is always accessed
and often first, followed by the othe ones.
There was no point keeping that function in the buffer part since it's
exclusively used by HTTP at the channel level, since it also automatically
appends the CRLF. This further cleans up the buffer code.
For the same consistency reasons, let's use b_empty() at the few places
where an empty buffer is expected, or c_empty() if it's done on a channel.
Some of these places were there to realign the buffer so
{b,c}_realign_if_empty() was used instead.
We used to have variations around buffer_total_space() and
size-buffer_len() or size-b_data(). Let's simplify all this. buffer_len()
was also removed as not used anymore.
Now that there are no more users requiring to modify the buffer anymore,
switch these ones to const char and const buffer. This will make it more
obvious next time send functions are tempted to modify the buffer's output
count. Minor adaptations were necessary at a few call places which were
using char due to the function's previous prototype.
This will be important so that we can parse a buffer without touching it.
Now we indicate where from the buffer's head we plan to start to copy, and
for how many bytes. This will be used by send functions to loop at the end
of the buffer without having to update the buffer's output byte count.
This new functoin limits itself to the amount of data available in the
buffer and doesn't care about the direction anymore. It's only called
from co_getblk() which already checks that no more than the available
output bytes is requested.
These ones were merged into a single b_contig_space() that covers both
(the bo_ case was a simplified version of the other one). The function
doesn't use ->i nor ->o anymore.
These ones manipulate the output data count which will be specific to
the channel soon, so prepare the call points to use the channel only.
The b_* functions are now unused and were removed.
When haproxy is compiled using GCC <= 3.x or >= 5.x the `unlikely`
macro performs a comparison with zero: `(x) != 0`, thus returning
either 0 or 1.
In `int co_getline_nc()` this macro was accidentally applied to
the variable `retcode` itself, instead of the result of the
comparison `retcode <= 0`. As a result any negative `retcode`
is converted to `1` for purposes of the comparison.
Thus never taking the branch (and exiting the function) for
negative values.
This in turn leads to reads of uninitialized memory in the for-loop
below:
==12141== Conditional jump or move depends on uninitialised value(s)
==12141== at 0x4EB6B4: co_getline_nc (channel.c:346)
==12141== by 0x421CA4: hlua_socket_receive_yield (hlua.c:1713)
==12141== by 0x421F6F: hlua_socket_receive (hlua.c:1896)
==12141== by 0x529B08F: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x52A7EFC: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529B497: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529711A: lua_pcallk (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x52ABDF0: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529B08F: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x52A7EFC: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529A9F1: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529B523: lua_resume (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141==
==12141== Use of uninitialised value of size 8
==12141== at 0x4EB6B9: co_getline_nc (channel.c:346)
==12141== by 0x421CA4: hlua_socket_receive_yield (hlua.c:1713)
==12141== by 0x421F6F: hlua_socket_receive (hlua.c:1896)
==12141== by 0x529B08F: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x52A7EFC: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529B497: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529711A: lua_pcallk (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x52ABDF0: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529B08F: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x52A7EFC: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529A9F1: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529B523: lua_resume (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141==
==12141== Invalid read of size 1
==12141== at 0x4EB6B9: co_getline_nc (channel.c:346)
==12141== by 0x421CA4: hlua_socket_receive_yield (hlua.c:1713)
==12141== by 0x421F6F: hlua_socket_receive (hlua.c:1896)
==12141== by 0x529B08F: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x52A7EFC: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529B497: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529711A: lua_pcallk (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x52ABDF0: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529B08F: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x52A7EFC: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529A9F1: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== by 0x529B523: lua_resume (in /usr/lib/x86_64-linux-gnu/liblua5.3.so.0.0.0)
==12141== Address 0x8637171e928bb500 is not stack'd, malloc'd or (recently) free'd
Fix this bug by correctly applying the `unlikely` macro to the result of the comparison.
This bug exists as of commit ca16b03813
which is the first commit adding this function.
v1.6-dev1 is the first tag containing this commit, the fix should
be backported to haproxy 1.6 and newer.
There's no point having the channel marked writable as these functions
only extract data from the channel. The code was retrieved from their
ci/co ancestors.
For HTTP/2 we'll need some buffer-only equivalent functions to some of
the ones applying to channels and still squatting the bi_* / bo_*
namespace. Since these names have kept being misleading for quite some
time now and are really getting annoying, it's time to rename them. This
commit will use "ci/co" as the prefix (for "channel in", "channel out")
instead of "bi/bo". The following ones were renamed :
bi_getblk_nc, bi_getline_nc, bi_putblk, bi_putchr,
bo_getblk, bo_getblk_nc, bo_getline, bo_getline_nc, bo_inject,
bi_putchk, bi_putstr, bo_getchr, bo_skip, bi_swpbuf
The function buffer_contig_space is buggy and could lead to pernicious bugs
(never hitted until now, AFAIK). This function should return the number of bytes
that can be written into the buffer at once (without wrapping).
First, this function is used to inject input data (bi_putblk) and to inject
output data (bo_putblk and bo_inject). But there is no context. So it cannot
decide where contiguous space should placed. For input data, it should be after
bi_end(buf) (ie, buf->p + buf->i modulo wrapping calculation). For output data,
it should be after bo_end(buf) (ie, buf->p) and input data are assumed to not
exist (else there is no space at all).
Then, considering we need to inject input data, this function does not always
returns the right value. And when we need to inject output data, we must be sure
to have no input data at all (buf->i == 0), else the result can also be wrong
(but this is the caller responsibility, so everything should be fine here).
The buffer can be in 3 different states:
1) no wrapping
<---- o ----><----- i ----->
+------------+------------+-------------+------------+
| |oooooooooooo|iiiiiiiiiiiii|xxxxxxxxxxxx|
+------------+------------+-------------+------------+
^ <contig_space>
p ^ ^
l r
2) input wrapping
...---> <---- o ----><-------- i -------...
+-----+------------+------------+--------------------+
|iiiii|xxxxxxxxxxxx|oooooooooooo|iiiiiiiiiiiiiiiiiiii|
+-----+------------+------------+--------------------+
<contig_space> ^
^ ^ p
l r
3) output wrapping
...------ o ------><----- i -----> <----...
+------------------+-------------+------------+------+
|oooooooooooooooooo|iiiiiiiiiiiii|xxxxxxxxxxxx|oooooo|
+------------------+-------------+------------+------+
^ <contig_space>
p ^ ^
l r
buffer_contig_space returns (l - r). The cases 1 and 3 are correctly
handled. But for the second case, r is wrong. It points on the buffer's end
(buf->data + buf->size). It should be bo_end(buf) (ie, buf->p - buf->o).
To fix the bug, the function has been splitted. Now, bi_contig_space and
bo_contig_space should be used to know the contiguous space available to insert,
respectively, input data and output data. For bo_contig_space, input data are
assumed to not exist. And the right version is used, depending what we want to
do.
In addition, to clarify the buffer's API, buffer_realign does not return value
anymore. So it has the same API than buffer_slow_realign.
This patch can be backported in 1.7, 1.6 and 1.5.
The unlikely macro doesn't take in acount the condition, but only one
variable.
Must be backported in 1.6
[wt: with gcc 3.x, unlikely(x) is defined as __builtin_expect((x) != 0, 0)
so the condition is wrong for negative numbers, which correspond to the
case where bi_getblk_nc() has reached the end of the buffer and the
channel is already closed. With gcc 4.x, the output is cast to unsigned long
so the <=0 will not match negative values either. This is only used in Lua
for now so that may explain why it hasn't hit yet]
In 1.4-dev3, commit 31971e5 ("[MEDIUM] add support for infinite forwarding")
made it possible to configure the lower layer to forward data indefinitely
by setting the forward size to CHN_INFINITE_FORWARD (4GB-1). By then larger
chunk sizes were not supported so there was no confusion in the usage of the
function.
Since 1.5 we support 64-bit content-lengths and chunk sizes and the function
has grown to support 64-bit arguments, though it still limits a single pass
to 32-bit quantities (what fit in the channel's to_forward field). The issue
now becomes that a 4GB-1 content-length can be confused with infinite
forwarding (in fact it's 4GB-1+what was already in the buffer). It causes a
visible effect when transferring this exact size because the transfer rate
is lower than with other sizes due in part to the disabling of the Nagle
algorithm on the sendto() call.
In theory with keep-alive it should prevent a second request from being
processed after such a transfer, but since the analysers are still present,
the forwarding analyser properly counts down the remaining size to transfer
and ultimately the transaction gets correctly reset so there is no visible
effect.
Since the root cause of the issue is an API problem (lack of distinction
between a real valid length and a magic value), this patch modifies the API
to have a new dedicated function called channel_forward_forever() to program
a permanent forwarding. The existing function __channel_forward() was modified
to properly take care of the requested sizes and ensure it 1) never overflows
and 2) never reaches CHN_INFINITE_FORWARD by accident.
It is worth noting that the function used to have a bug causing a 2GB
forward to be scheduled if it was called with less data than what is present
in buf->i. Fortunately this bug couldn't be triggered with existing code.
This fix should be backported to 1.6 and 1.5. While it also theorically
affects 1.4, it's better not to backport it there, as the risk of breaking
large object transfers due to significant API differences is high, compared
to the fact that the largest supported objects (4GB-1) are just slower to
transfer.
The recent addition of "show env" on the CLI has revealed an interesting
design bug. Chunks are supposed to support a negative length to indicate
that they carry no data. chunk_printf() sets this size to -1 if the string
is too large for the buffer. At a few places in the http engine we may end
up with trash.len = -1. But bi_putchk(), chunk_appendf() and a few other
chunks consumers don't consider this case as possible and will use such a
chunk, possibly restoring an invalid string or trying to copy -1 bytes.
This fix takes care of clarifying the situation in a backportable way
where such sizes are used, so that a negative length indicating an error
remains present until the chunk is reinitialized or overwritten. But a
cleaner design adjustment needs to be done so that there's a clear contract
on how to use these chunks. At first glance it doesn't seem *that* useful
to support negative sizes, so probably this is what should change.
This fix must be backported to 1.6 and 1.5.
It was inappropriate to put this flag on every failed write into an
input buffer because it depends where it happens. When it's in the
context of an analyser (eg: hlua) it makes sense. When it's in the
context of an applet (eg: dumpstats), it does not make sense, and
it only happens to work because currently applets are scheduled by
the sessions. The proper solution for applets would be to add the
flag SI_FL_WAIT_ROOM on the stream interface.
Thus, we now don't set any flag anymore in bi_put* and it's up to the
caller to either set CF_WAKE_WRITE on the channel or SI_FL_WAIT_ROOM
on the stream interface. Changes were applied to hlua, peers and
dumpstats.
We now have functions to retrieve one block and one line from
either the input or the output part of a buffer. They return
up to two (pointer,length) values in case the buffer wraps.
This function's name was poorly chosen and is confusing to the point of
being suspiciously used at some places. The operations it does always
consider the ability to forward pending input data before receiving new
data. This is not obvious at all, especially at some places where it was
used when consuming outgoing data to know if the buffer has any chance
to ever get the missing data. The code needs to be re-audited with that
in mind. Care must be taken with existing code since the polarity of the
function was switched with the renaming.
bi_swpbuf() swaps the buffer passed in argument with the one attached to
the channel, but only if this last one is empty. The idea is to avoid a
copy when buffers can simply be swapped.
The errors reported by Igor Chan on the stats interface in chunked mode
were caused by data wrapping at the wrong place in the buffer. It could
be reliably reproduced by picking random buffer sizes until the issue
appeared (for a given conf, 5300 with 1024 maxrewrite was OK).
The issue is that the stats interface uses bi_putchk() to emit data,
which relies on bi_putblk(). This code checks the largest part that can
be emitted while preserving the rewrite reserve, but uses that result to
compute the wrapping offset, which is wrong. If some data remain present
in the buffer, the wrapping may be allowed and will happen before the
end of the buffer, leaving some old data in the buffer.
The reason it did not happen before keep-alive is simply that the buffer
was much less likely to contain older data. It also used to happen only
for certain configs after a certain amount of time because the size of
the counters used to inflate the output till the point wrapping started
to happen.
The fix is trivial, buffer_contig_space_with_res() simply needs to be
replaced by buffer_contig_space().
Note that peers were using the same function so it is possible that they
were affected as well.
This issue was introduced in 1.5-dev8. No backport to stable is needed.
Since commit 6b66f3e ([MAJOR] implement autonomous inter-socket forwarding)
introduced in 1.3.16-rc1, we've been relying on a stupid mechanism to wake
up the task after a write, which was an exact copy-paste of the reader side.
The principle was that if we empty a buffer and there's no forwarding
scheduled or if the *producer* is not in a connected state, then we wake
the task up.
That does not make any sense. It happens to wake up too late sometimes (eg,
when the request analyser waits for some room in the buffer to start to
work), and leads to unneeded wakeups in client-side keep-alive, because
the task is woken up when the response is sent, while the analysers are
simply waiting for a new request.
In order to fix this, we introduce a new channel flag : CF_WAKE_WRITE. It
is designed so that an analyser can explicitly request being notified when
some data were written. It is used only when the HTTP request or response
analysers need to wait for more room in the buffers. It is automatically
cleared upon wake up.
The flag is also automatically set by the functions which try to write into
a buffer from an applet when they fail (bi_putblk() etc...).
That allows us to remove the stupid condition above and avoid some wakeups.
In http-server-close and in http-keep-alive modes, this reduces from 4 to 3
the average number of wakeups per request, and increases the overall
performance by about 1.5%.
We must not report incomplete data if the buffer is not full, otherwise
we can abort some processing on the stats socket when dealing with massive
amounts of commands.
Most calls to channel_forward() are performed with short byte counts and
are already optimized in channel_forward() taking just a few instructions.
Thus it's a waste of CPU cycles to call a function for this, let's just
inline the short byte count case and fall back to the common one for
remaining situations.
Doing so has increased the chunked encoding parser's performance by 12% !
With this commit, we now separate the channel from the buffer. This will
allow us to replace buffers on the fly without touching the channel. Since
nobody is supposed to keep a reference to a buffer anymore, doing so is not
a problem and will also permit some copy-less data manipulation.
Interestingly, these changes have shown a 2% performance increase on some
workloads, probably due to a better cache placement of data.
This is a massive rename of most functions which should make use of the
word "channel" instead of the word "buffer" in their names.
In concerns the following ones (new names) :
unsigned long long channel_forward(struct channel *buf, unsigned long long bytes);
static inline void channel_init(struct channel *buf)
static inline int channel_input_closed(struct channel *buf)
static inline int channel_output_closed(struct channel *buf)
static inline void channel_check_timeouts(struct channel *b)
static inline void channel_erase(struct channel *buf)
static inline void channel_shutr_now(struct channel *buf)
static inline void channel_shutw_now(struct channel *buf)
static inline void channel_abort(struct channel *buf)
static inline void channel_stop_hijacker(struct channel *buf)
static inline void channel_auto_connect(struct channel *buf)
static inline void channel_dont_connect(struct channel *buf)
static inline void channel_auto_close(struct channel *buf)
static inline void channel_dont_close(struct channel *buf)
static inline void channel_auto_read(struct channel *buf)
static inline void channel_dont_read(struct channel *buf)
unsigned long long channel_forward(struct channel *buf, unsigned long long bytes)
Some functions provided by channel.[ch] have kept their "buffer" name because
they are really designed to act on the buffer according to some information
gathered from the channel. They have been moved together to the same place in
the file for better readability but they were not changed at all.
The "buffer" memory pool was also renamed "channel".
Get rid of these confusing BF_* flags. Now channel naming should clearly
be used everywhere appropriate.
No code was changed, only a renaming was performed. The comments about
channel operations was updated.
These functions do not depend on the channel flags anymore thus they're
much better suited to be used on plain buffers. Move them from channel
to buffer.
This is similar to the recent removal of BF_OUT_EMPTY. This flag was very
problematic because it relies on permanently changing information such as the
to_forward value, so it had to be updated upon every change to the buffers.
Previous patch already got rid of its users.
One part of the change is sensible : the flag was also part of BF_MASK_STATIC,
which is used by process_session() to rescan all analysers in case the flag's
status changes. At first glance, none of the analysers seems to change its
mind base on this flag when it is subject to change, so it seems fine not to
add variation checks here. Otherwise it's possible that checking the buffer's
input and output is more reliable than checking the flag's replacement.
This flag is quite complex to get right and updating it everywhere is a
major pain, especially since the buffer/channel split. This is the first
step of getting rid of it. Instead now it's dynamically computed whenever
needed.
This flag was very problematic because it was composite in that both changes
to the pipe or to the buffer had to cause this flag to be updated, which is
not always simple (eg: there may not even be a channel attached to a buffer
at all).
There were not that many users of this flags, mostly setters. So the flag got
replaced with a macro which reports whether the channel is empty or not, by
checking both the pipe and the buffer.
One part of the change is sensible : the flag was also part of BF_MASK_STATIC,
which is used by process_session() to rescan all analysers in case the flag's
status changes. At first glance, none of the analysers seems to change its
mind base on this flag when it is subject to change, so it seems fine not to
add variation checks here. Otherwise it's possible that checking the buffer's
output size is more useful than checking the flag's replacement.