PiBa-NL reported an issue affecting logs when stdout is enabled at the
same time as an IP address. It does not affect FD and UNIX, but does
still affect multiple FDs. What happens is that the condition to detect
that the initialization was not made relies on the FD being -1, and in
this case the FD points to the *unique* FD used for AF_INET sockets, so
the configured socket used for outgoing logs over UDP gets overwritten
by the last configured FD. This is not appropriate, so instead we rely
on the sin_port part of the IPv4-mapped address to store the
initialization state for each FD.
This part deserves being significantly revamped, as IPv6 is still not
possible due to the way the FDs are managed, and inherited FDs are a
bit hackish.
Note that this patch relies on "MINOR: tools: preset the port of
fd-based "sockets" to zero" in order to operate properly.
No backport is needed.
Addresses made of a file descriptor store the file descriptor into the
address part of a sin_addr. Contrary to other address classes, there's
no way to figure later based on the FD if an initialization was done
(which is how logs initialize their FDs). The port part is currently
left with random data, so let's instead specifically set the port part
to zero when creating an FD, and let the code using it set whatever
info it needs there, typically an initialization state.
PiBa-NL reported that since this commit f157384 ("MINOR: backend: count
the number of connect and reuse per server and per backend"), reg-test
connection/h00001 fails. Indeed it does, the server is not checked for
existing prior to updating its counter. It should also fail with
transparent mode.
Commit 84cca66 ("BUG/MEDIUM: htx: When performing zero-copy, start from
the right offset.") uncovered another issue which is that the send function
may occasionally be called without any block. It's important to check for
this case when computing the zero-copy offsets.
No backport is needed.
In sess_build_logline(), don't attempt to call sample_fetch_as_type()
if we don't have a stream.
It used never to happen in the past before commit 09bb27c ("MEDIUM: log:
make sess_build_logline() support being called with no stream"). But now
it can happen when sess_log() is called from the lower layers (i.e. the
H2 mux got garbage when it was expecting a preface frame), and it reveals
that some sample fetch functions and some converter fnuctions which rely
on the stream don't test for its existence. For the sample fetch functions,
a durable solution is easy and would consist in adapting sample_process()
to verify the SMP_USE_* bits when the stream is not set. But for the
converters we don't have this option as they don't declare whether or not
they use a stream (which possibly is the real issue).
Thus for now let's disable sample_fetch_as_type() if a stream does not
exist, until it can be more accurately refined later.
If a reload was issued to the master process and failed, it is critical
that the admin sees it because it means that the saved configuration
does not work anymore and might not be usable after a full restart. For
this reason in this case we modify the "master" prompt to explicitly
indicate that a reload failed.
If the reload fail after the parsing of the configuration, the
mworker_proc structures are created for the processes it tried to
create.
The mworker_proc_list_to_env() function was exporting these unitialized
structures in the "HAPROXY_PROCESSES" environment variable which was
leading to this kind of output in "show proc":
4294967295 worker [was: 1] 1 17879d 16h26m28s
When using zerocopy, start from the beginning of the data, not from the
beginning of the buffer, it may have contained headers, and so the data
won't start at the beginning of the buffer.
This patch is a bit huge but nothing special here. Some functions have been
duplicated to support the HTX, some others have a switch inside to do so. So,
now, it is possible to create HTTP applets from HTX proxies. However, TCP
applets remains unsupported.
Functions from then Channel class are now forbidden for LUA scripts called from
HTTP proxies. These functions totally hijacked the HTTP parser, leaving it in an
undefined state. This patch is tagged as MAJOR because it could be see as a
compatibility breakage. But a LUA script using one of these functions has a very
low probablity to work correctly except by chance.
So, concretely, following functions are concerned: Channel.get, Channel.dup,
Channel.getline, Channel.set, Channel.append, Channel.send,
Channel.forward. Others remain available.
When deciding whether to scan the global run queue or not, we currently
check the configured threads number, and if it's 1 we skip the queue
since it's not supposed to be used. However when running with a master
process and multiple threads in the workers, the master will turn this
number back to 1 while some task wakeups might possibly have set bits
in the global tasks mask, thus causing active_tasks_mask to have one
bit permanently set, preventing the process from sleeping.
Instead of checking global.nbthread, let's check for the current
thread's bit in global_tasks_mask. First it will make this part of the
code more consistent, working like a test and set operation, it will
solve the issue with master+nbthread and as a bonus it will save a
lock/unlock for each scheduler call when the thread doesn't have a
task in the global run queue.
The tooltip in the HTML stats page was damaged by commit 1b0f85e47 ("MINOR:
stats: also report the failed header rewrites warnings on the stats page"),
due to the header rewrites counter being inserted at the wrong place and
taking the place of the other statuses.
This is only for 1.9, no backport is needed.
Sadly we didn't have the cumulated number of connections established to
servers till now, so let's now update it per backend and per-server and
report it in the stats. On the stats page it appears in the tooltip
when hovering over the total sessions count field.
The transpory layer now respects buffer alignment, so we don't need to
cheat anymore pretending we have some data at the head, adjusting the
buffer's head is enough.
For a long time we've been realigning empty buffers in the transport
layers, where the I/Os were performed based on callbacks. Doing so is
optimal for higher data throughput but makes it trickier to optimize
unaligned data, where mux_h1/h2 have to claim some data are present
in the buffer to force unaligned accesses to skip the frame's header
or the chunk header.
We don't need to do this anymore since the I/O calls are now always
performed from top to bottom, so it's only the mux's responsibility
to realign an empty buffer if it wants to.
In practice it doesn't change anything, it's just a convention, and
it will allow the code to be simplified in a next patch.
The cconf variable was not initialized before the two first possible
error exits before being freed, resulting in random crashes instead
of displaying an error message if the cache ID was missing from the
filter declaration.
No backport is needed, this is exclusively 1.9.
The leastconn algorithm queues available servers based on their weighted
current load. But this results in an inaccurate load balancing when weights
differ and the load is very low, because what matters is not the load before
picking the server but the load resulting from picking the server. At the
very least, it must be granted that servers with the highest weight are
always picked first when no server has any connection.
This patch addresses this by simply adding one to the current connections
count when queuing the server, since this is the load the server will have
once picked. This finally allows to bridge the gap that existed between
the "leastconn" and the "first" algorithms.
In the mux detach function, when using HTX, take the connection over if
it no longer has an owner (ie because the session that was the owner left).
It is done for legacy code in proto_http.c, but not for HTX.
Also when using HTX, in H2, try to add the connection back to idle_conns if
it was not already (ie we used to use all the available streams, and we're
freeing one). That too was done in proto_http.c.
Before trying to add a connection to the idle list, make sure it doesn't
have the error, the shutr or the shutw flag. If any of them is present,
don't bother trying to recycle the connection, it's going to be destroyed
anyway.
A first patch was pushed to fix this bug if it happens during the headers
parsing. But it is also possible to hit the bug during the parsing of
chunks. For instance, if the server sends only part of the trailers, some data
remains unparsed. So it the server closes its connection without sending the end
of the response, we fall back again into an infinite loop.
The fix contains in 2 parts. First, we block the receive if a read0 or an error
is detected on the connection, independently if the input buffer is empty or
not. Then, the flags CS_FL_RCV_MORE and CL_FL_WANT_ROOM are always reset when
input data are processed. We set them again only when necessary.
Add a new method to mux, "reset", that is used to let the mux know the
connection attempt failed, and we're about to retry, so it just have to
reinit itself. Currently only the H1 mux needs it.
When the connection failed, we don't really want to close the conn_stream,
as we're probably about to retry, so just make sure the file descriptor is
closed.
In si_cs_recv(), report that arrive at the end of stream only if we were
indeed connected, we don't want that if the connection failed and we're about
to retry.
CS_FL_EOS | CS_FL_REOS can be set by the mux if the connection failed, so make
sure we remove them before retrying to connect, or it may lead to a premature
close of the connection.
In the master CLI, the commands and the prefix were still parsed and
trimmed after the pattern payload. Don't parse anything but the end of a
line till we are in payload mode.
Put the search of the pattern after the trim so we can use correctly a
payload with a command which is prefixed by @.
Handle the CLI level in the master CLI. In order to do this, the master
CLI stores the level in the stream. Each command are prefixed by a
"user" or "operator" command before they are forwarded to the target
CLI.
The level can be configured in the haproxy program arguments with the
level keyword: -S /tmp/sock,level,admin -S /tmp/sock2,level,user.
Implement "show cli level" which show the level of the current CLI
session.
Implement "operator" and "user" which lower the permissions of the
current CLI session.
We need to make sure that the record length is not making us read
past the end of the data we received.
Before this patch we could for example read the 16 bytes
corresponding to an AAAA record from the non-initialized part of
the buffer, possibly accessing anything that was left on the stack,
or even past the end of the 8193-byte buffer, depending on the
value of accepted_payload_size.
To be backported to 1.8, probably also 1.7.
Some callers of dns_read_name() do not make sure that we can read
the first byte, holding the length of the next label, without going
past our buffer, so we need to make sure of that.
In addition, if the label is a compressed one we need to make sure
that we can read the following byte to compute the target offset.
To be backported to 1.8, probably also 1.7.
When a compressed pointer is encountered, dns_read_name() will call
itself with the pointed-to offset in the packet.
With a specially crafted packet, it was possible to trigger an
infinite-loop recursion by making the pointer points to itself.
While it would be possible to handle that particular case differently
by making sure that the target is different from the current offset,
it would still be possible to craft a packet with a very long chain
of valid pointers, always pointing backwards. To prevent a stack
exhaustion in that case, this patch restricts the number of recursive
calls to 100, which should be more than enough.
To be backported to 1.8, probably also 1.7.
The commit 3815b227f ("MEDIUM: mux-h1: implement true zero-copy of DATA blocks")
broke the output of chunked messages. When the zero-copy was performed on such
messages, no chunk size was emitted nor ending CRLF.
Now, the chunked envelope is added when necessary. We have at least the size of
the struct htx to emit it. So 40 bytes for now. It should be enough.
Change the output of the relative pid for the old processes, displays
"[was: X]" instead of just "X" which was confusing if you want to
connect to the CLI of an old PID.
H2 has a 9-byte frame header, and HTX has a 40-byte frame header.
By artificially advancing the Rx header and limiting the amount of
bytes read to protect the end of the buffer, we can make the data
payload perfectly aligned with HTX blocks and optimize the copy.
This is similar to what was done for the H1 mux : when the mux's buffer
is empty and the htx area contains exactly one data block of the same
size as the requested count, and all window and frame size conditions are
satisfied, then it's possible to simply swap the caller's buffer with the
mux's output buffer and adjust offsets and length to match the entire
DATA HTX block in the middle. An H2 frame header has to be prepended
before the block but this always fits in an HTX frame header.
In this case we perform a true zero-copy operation from end-to-end. This
is the situation that happens all the time with large files. When using
HTX over H2 over TLS, this brings a 3% extra performance gain. TLS remains
a limiting factor here but the copy definitely has a cost. Also since
haproxy can now use H2 in clear, the savings can be higher.
Due to blocking factor being different on H1 and H2, we regularly end
up with tails of data blocks that leave room in the mux buffer, making
it tempting to copy the pending frame into the remaining room left, and
possibly realigning the output buffer.
Here we check if the output buffer contains data, and prefer to wait
if either the current frame doesn't fit or if it's larger than 1/4 of
the buffer. This way upon next call, either a zero copy, or a larger
and aligned copy will be performed, taking the whole chunk at once.
Doing so increases the H2 bandwidth by slightly more than 1% on large
objects.
By default H2 uses a 65535 bytes window for the connection, and changing
it requires sending a WINDOW_UPDATE message. We only used to update the
window when receiving data, thus never increasing it further.
As reported by user klzgrad on the mailing list, this seriously limits
the upload bitrate, and will have an even higher impact on the backend
H2 connections to origin servers.
There is no technical reason for keeping this window so low, so let's
increase it to the maximum possible value (2G-1). We do this by
pretending we've already received that many data minus the maximum
data the client might already send (65535), so that an early
WINDOW_UPDATE message is sent right after the SETTINGS frame.
This should be backported to 1.8. This patch depends on previous
patch "BUG/MINOR: mux-h2: refrain from muxing during the preface".