The stktable_touch_remote considers the expire field stored in the stksess
struct.
The expire field was updated on the a newly created stksess to store.
But if the stksess with a same key is still present the expire was not updated.
This patch postpones the update of the expire field of the stksess just before
processing the "touch".
These bug was introduced in commit:
MEDIUM: threads/stick-tables: handle multithreads on stick tables.
And the fix should be backported on 1.8.
Since peers were ported to an applet in 1.5, an issue appeared which
is that certain attempts to close an outgoing connection are a bit
"too nice". Specifically, protocol errors and stream timeouts result
in a clean shutdown to be sent, waiting for the other side to confirm.
This is particularly problematic in the case of timeouts since by
definition the other side will not confirm as it has disappeared.
As found by Fred, this issue was further emphasized in 1.8 by commit
f9ce57e ("MEDIUM: connection: make conn_sock_shutw() aware of
lingering") which causes clean shutdowns not to be sent if the fd is
marked as linger_risk, because now even a clean timeout will not be
sent on an idle peers session, and the other one will have nothing
to respond to.
The solution here is to set NOLINGER on the outgoing stream interface
to ensure we always close whenever we attempt a simple shutdown.
However it is important to keep in mind that this also underlines
some weaknesses of the shutr/shutw processing inside process_stream()
and that all this part needs to be reworked to clearly consider the
abort case, and to stop the confusion between linger_risk and NOLINGER.
This fix needs to be backported as far as 1.5 (all versions are affected).
However, during testing of the backport it was found that 1.5 never tries
to close the peers connection on timeout, so it suffers for another issue.
Commit 8d8aa0d ("MEDIUM: threads/listeners: Make listeners thread-safe")
mistakenly placed HA_ATOMIC_ADD(job, 1) to replace a job--, so it maintains
the job count too high preventing the process from cleanly exiting on
reload.
This needs to be backported to 1.8.
During the migration to the second version of the pools, the new
functions and pool pointers were all called "pool_something2()" and
"pool2_something". Now there's no more pool v1 code and it's a real
pain to still have to deal with this. Let's clean this up now by
removing the "2" everywhere, and by renaming the pool heads
"pool_head_something".
Using peers or stick table we could update an freq_ctr
using a tick value with the first bit set but this
bit is reserved for lock since multithreading support.
All the references to connections in the data path from streams and
stream_interfaces were changed to use conn_streams. Most functions named
"something_conn" were renamed to "something_cs" for this. Sometimes the
connection still is what matters (eg during a connection establishment)
and were not always renamed. The change is significant and minimal at the
same time, and was quite thoroughly tested now. As of this patch, all
accesses to the connection from upper layers go through the pass-through
mux.
For HTTP/2 and QUIC, we'll need to deal with multiplexed streams inside
a connection. After quite a long brainstorming, it appears that the
connection interface to the existing streams is appropriate just like
the connection interface to the lower layers. In fact we need to have
the mux layer in the middle of the connection, between the transport
and the data layer.
A mux can exist on two directions/sides. On the inbound direction, it
instanciates new streams from incoming connections, while on the outbound
direction it muxes streams into outgoing connections. The difference is
visible on the mux->init() call : in one case, an upper context is already
known (outgoing connection), and in the other case, the upper context is
not yet known (incoming connection) and will have to be allocated by the
mux. The session doesn't have to create the new streams anymore, as this
is performed by the mux itself.
This patch introduces this and creates a pass-through mux called
"mux_pt" which is used for all new connections and which only
calls the data layer's recv,send,wake() calls. One incoming stream
is immediately created when init() is called on the inbound direction.
There should not be any visible impact.
Note that the connection's mux is purposely not set until the session
is completed so that we don't accidently run with the wrong mux. This
must not cause any issue as the xprt_done_cb function is always called
prior to using mux's recv/send functions.
A lock is used to protect accesses to a peer structure.
A the lock is taken in the applet handler when the peer is identified
and released living the applet handler.
In the scheduling task for peers section, the lock is taken for every
listed peer and released at the end of the process task function.
The peer 'force shutdown' function was also re-worked.
A global lock has been added to protect accesses to the list of active
applets. A process mask has also been added on each applet. Like for FDs and
tasks, it is used to know which threads are allowed to process an
applet. Because applets are, most of time, linked to a session, it should be
sticky on the same thread. But in all cases, it is the responsibility of the
applet handler to lock what have to be protected in the applet context.
The stick table API was slightly reworked:
A global spin lock on stick table was added to perform lookup and
insert in a thread safe way. The handling of refcount on entries
is now handled directly by stick tables functions under protection
of this lock and was removed from the code of callers.
The "stktable_store" function is no more externalized and users should
now use "stktable_set_entry" in any case of insertion. This last one performs
a lookup followed by a store if not found. So the code using "stktable_store"
was re-worked.
Lookup, and set_entry functions automatically increase the refcount
of the returned/stored entry.
The function "sticktable_touch" was renamed "sticktable_touch_local"
and is now able to decrease the refcount if last arg is set to true. It
is allowing to release the entry without taking the lock twice.
A new function "sticktable_touch_remote" is now used to insert
entries coming from remote peers at the right place in the update tree.
The code of peer update was re-worked to use this new function.
This function is also able to decrease the refcount if wanted.
The function "stksess_kill" also handle a parameter to decrease
the refcount on the entry.
A read/write lock is added on each entry to protect the data content
updates of the entry.
First, we use atomic operations to update jobs/totalconn/actconn variables,
listener's nbconn variable and listener's counters. Then we add a lock on
listeners to protect access to their information. And finally, listener queues
(global and per proxy) are also protected by a lock. Here, because access to
these queues are unusal, we use the same lock for all queues instead of a global
one for the global queue and a lock per proxy for others.
2 global locks have been added to protect, respectively, the run queue and the
wait queue. And a process mask has been added on each task. Like for FDs, this
mask is used to know which threads are allowed to process a task.
For many tasks, all threads are granted. And this must be your first intension
when you create a new task, else you have a good reason to make a task sticky on
some threads. This is then the responsibility to the process callback to lock
what have to be locked in the task context.
Nevertheless, all tasks linked to a session must be sticky on the thread
creating the session. It is important that I/O handlers processing session FDs
and these tasks run on the same thread to avoid conflicts.
For HTTP/2 we'll need some buffer-only equivalent functions to some of
the ones applying to channels and still squatting the bi_* / bo_*
namespace. Since these names have kept being misleading for quite some
time now and are really getting annoying, it's time to rename them. This
commit will use "ci/co" as the prefix (for "channel in", "channel out")
instead of "bi/bo". The following ones were renamed :
bi_getblk_nc, bi_getline_nc, bi_putblk, bi_putchr,
bo_getblk, bo_getblk_nc, bo_getline, bo_getline_nc, bo_inject,
bi_putchk, bi_putstr, bo_getchr, bo_skip, bi_swpbuf
Commit bcb86ab ("MINOR: session: add a streams field to the session
struct") added this list of streams that is not needed anymore. Let's
get rid of it now.
There are several places where we see feconn++, feconn--, totalconn++ and
an increment on the frontend's number of connections and connection rate.
This is done exactly once per session in each direction, so better take
care of this counter in the session and simplify the callers. At least it
ensures a better symmetry. It also ensures consistency as till now the
lua/spoe/peers frontend didn't have these counters properly set, which can
be useful at least for troubleshooting.
Each user of a session increments/decrements the jobs variable at its
own place, resulting in a real mess and inconsistencies between them.
Let's have session_new() increment jobs and session_free() decrement
it.
Since v1.7 it's pointless to reference a listener when greating a session
for an outgoing connection, it only complicates the code. SPOE and Lua were
cleaned up in 1.8-dev1 but the peers code was forgotten. This patch fixes
this by not assigning such a listener for outgoing connections. It also has
the extra benefit of not discounting the outgoing connections from the number
of allowed incoming connections (the code currently adds a safety marging of
3 extra connections to take care of this).
Currently a task is allocated in session_new() and serves two purposes :
- either the handshake is complete and it is offered to the stream via
the second arg of stream_new()
- or the handshake is not complete and it's diverted to be used as a
timeout handler for the embryonic session and repurposed once we land
into conn_complete_session()
Furthermore, the task's process() function was taken from the listener's
handler in conn_complete_session() prior to being replaced by a call to
stream_new(). This will become a serious mess with the mux.
Since it's impossible to have a stream without a task, this patch removes
the second arg from stream_new() and make this function allocate its own
task. In session_accept_fd(), we now only allocate the task if needed for
the embryonic session and delete it later.
Now each stream is added to the session's list of streams, so that it
will be possible to know all the streams belonging to a session, and
to know if any stream is still attached to a sessoin.
When several stick-tables were configured with several peers sections,
only a part of them could be synchronized: the ones attached to the last
parsed 'peers' section. This was due to the fact that, at least, the peer I/O handler
refered to the wrong peer section list, in fact always the same: the last one parsed.
The fact that the global peer section list was named "struct peers *peers"
lead to this issue. This variable name is dangerous ;).
So this patch renames global 'peers' variable to 'cfg_peers' to ensure that
no such wrong references are still in use, then all the functions wich used
old 'peers' variable have been modified to refer to the correct peer list.
Must be backported to 1.6 and 1.7.
With this patch additional information are added to stick-table definition
messages so that to make external application capable of learning peer
stick-table configurations. First stick-table entries duration is added
followed by the frequency counters type IDs and values.
May be backported to 1.7 and 1.6.
The task_wakeup was called on stream_new, but the task/stream
wasn't fully initialized yet. The task_wakeup must be called
explicitly by the caller once the task/stream is initialized.
When a peer task has sent a synchronization request to remote peers
its next expiration date was updated based on a resynchronization timeout
value which itself may have already expired leading the underlying
poller to wait for 0ms during a fraction of second (consuming high CPU
resources).
With this patch we update such peer task expiration dates only if
the resynchronization timeout is not already expired.
Thanks to Patrick Hemmer who reported an issue with nice traces
which helped in finding this one.
This patch may be backported to 1.7 and 1.6.
A peer session which has just been created upon reconnect timeout expirations,
could be right after shutdown (at peer session level) because the remote
side peer could also righ after have connected. In such a case the underlying
TCP session was still running (connect()/accept()) and finally left in CLOSE_WAIT
state after the remote side stopped writting (shutdown(SHUT_WR)).
Now on, with this patch we never shutdown such peer sessions wich have just
been created. We leave them connect to the remote peer which is already
connected and must shutdown its own peer session.
Thanks to Patric Hemmer and Yves Lafon at w3.org for reporting this issue,
and for having tested this patch on the field.
Thanks also to Willy and Yelp blogs which helped me a lot in fixing it
(see https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/ and
https://engineeringblog.yelp.com/2015/04/true-zero-downtime-haproxy-reloads.htmll).
A buffer overflow could happen if an integer is badly encoded in
the data part of a msg received from a peer. It should not happen
with authenticated peers (the handshake do not use this function).
This patch makes the code of the 'intdecode' function more robust.
It also adds some comments about the intencode function.
This bug affects versions >= 1.6.
Historically, all listeners have a pointer to the frontend. But since
the introduction of SSL, we now have an intermediary layer called
bind_conf corresponding to a "bind" line. It makes no sense to have
the frontend on each listener given that it's the same for all
listeners belonging to a same bind_conf. Also certain parts like
SSL can only operate on bind_conf and need the frontend.
This patch fixes this by moving the frontend pointer from the listener
to the bind_conf. The extra indirection is quite cheap given and the
places were this is used are very scarce.
When an entity tries to get a buffer, if it cannot be allocted, for example
because the number of buffers which may be allocated per process is limited,
this entity is added in a list (called <buffer_wq>) and wait for an available
buffer.
Historically, the <buffer_wq> list was logically attached to streams because it
were the only entities likely to be added in it. Now, applets can also be
waiting for a free buffer. And with filters, we could imagine to have more other
entities waiting for a buffer. So it make sense to have a generic list.
Anyway, with the current design there is a bug. When an applet failed to get a
buffer, it will wait. But we add the stream attached to the applet in
<buffer_wq>, instead of the applet itself. So when a buffer is available, we
wake up the stream and not the waiting applet. So, it is possible to have
waiting applets and never awakened.
So, now, <buffer_wq> is independant from streams. And we really add the waiting
entity in <buffer_wq>. To be generic, the entity is responsible to define the
callback used to awaken it.
In addition, applets will still request an input buffer when they become
active. But they will not be sleeped anymore if no buffer are available. So this
is the responsibility to the applet I/O handler to check if this buffer is
allocated or not. This way, an applet can decide if this buffer is required or
not and can do additional processing if not.
[wt: backport to 1.7 and 1.6]
There's no reason to use the stream anymore, only the appctx should be
used by a peer. This was a leftover from the migration to appctx and it
caused some confusion, so let's totally drop it now. Note that half of
the patch are just comment updates.
It was inherited from initial code but we must only manipulate the appctx
and never the stream, otherwise we always risk shooting ourselves in the
foot.
In case of resource allocation error, peer_session_create() frees
everything allocated and returns a pointer to the stream/session that
was put back into the free pool. This stream/session is then assigned
to ps->{stream,session} with no error control. This means that it is
perfectly possible to have a new stream or session being both used for
a regular communication and for a peer at the same time.
In fact it is the only way (for now) to explain a CLOSE_WAIT on peers
connections that was caught in this dump with the stream interface in
SI_ST_CON state while the error field proves the state ought to have
been SI_ST_DIS, very likely indicating two concurrent accesses on the
same area :
0x7dbd50: [31/Oct/2016:17:53:41.267510] id=0 proto=tcpv4
flags=0x23006, conn_retries=0, srv_conn=(nil), pend_pos=(nil)
frontend=myhost2 (id=4294967295 mode=tcp), listener=? (id=0)
backend=<NONE> (id=-1 mode=-) addr=127.0.0.1:41432
server=<NONE> (id=-1) addr=127.0.0.1:8521
task=0x7dbcd8 (state=0x08 nice=0 calls=2 exp=<NEVER> age=1m5s)
si[0]=0x7dbf48 (state=CLO flags=0x4040 endp0=APPCTX:0x7d99c8 exp=<NEVER>, et=0x000)
si[1]=0x7dbf68 (state=CON flags=0x50 endp1=CONN:0x7dc0b8 exp=<NEVER>, et=0x020)
app0=0x7d99c8 st0=11 st1=0 st2=0 applet=<PEER>
co1=0x7dc0b8 ctrl=tcpv4 xprt=RAW data=STRM target=PROXY:0x7fe62028a010
flags=0x0020b310 fd=7 fd.state=22 fd.cache=0 updt=0
req=0x7dbd60 (f=0x80a020 an=0x0 pipe=0 tofwd=0 total=0)
an_exp=<NEVER> rex=<NEVER> wex=<NEVER>
buf=0x78a3c0 data=0x78a3d4 o=0 p=0 req.next=0 i=0 size=0
res=0x7dbda0 (f=0x80402020 an=0x0 pipe=0 tofwd=0 total=0)
an_exp=<NEVER> rex=<NEVER> wex=<NEVER>
buf=0x78a3c0 data=0x78a3d4 o=0 p=0 rsp.next=0 i=0 size=0
Special thanks to Arnaud Gavara who provided lots of valuable input and
ran some validation testing on this patch.
This fix must be backported to 1.6 and 1.5. Note that in 1.5 the
session is not assigned from within the function so some extra checks
may be needed in the callers.
This part was missed when peers were ported to the new applet
infrastructure in 1.6, the main stream is woken up instead of the
appctx. This creates a race condition by which it is possible to
wake the stream at the wrong moment and miss an event. This bug
might be at least partially responsible for some of the CLOSE_WAIT
that were reported on peers session upon reload in version 1.6.
This fix must be backported to 1.6.
During the stick-table teaching process which occurs at reloading/restart time,
expiration dates of stick-tables entries were not synchronized between peers.
This patch adds two new stick-table messages to provide such a synchronization feature.
As these new messages are not supported by older haproxy peers protocol versions,
this patch increments peers protol version, from 2.0 to 2.1, to help in detecting/supporting
such older peers protocol implementations so that new versions might still be able
to transparently communicate with a newer one.
[wt: technically speaking it would be nice to have this backported into 1.6
as some people who reload often are affected by this design limitation, but
it's not a totally transparent change that may make certain users feel
reluctant to upgrade older versions. Let's let it cook in 1.7 first and
decide later]
After pushing the last update related to a resync process, the teacher resets
the re-connection's origin to the saved one (pointer position when he receive
the resync request). But the last acknowledgement overwrites this pointer to
an inconsistent value. In peersv2, it results with empty table chunks
regularly pushed.
The fix consist to move the confirm code to assure that the confirm message is
always sent after the last acknowledgement related to the resync. And to reset
the re-connection's origin to the saved one when a confirm message is received.
This bug affects versions 1.6 and superior.
For older versions (peersv1), this inconsistent state should not generate side
effects because chunks are not used and a next acknowlegement message resets
the pointer in a valid state.
This bug is due to a copy/paste error and was introduced
with peers-protocol v2:
The last_pushed pointer was not correctly reset to the teaching_origin at
the end of the teaching state but to the first update present in the tree.
The result: some updates were re-pushed after leaving the teaching state.
This fix needs to be backported to 1.6.
Instead of repeating the type of the LHS argument (sizeof(struct ...))
in calls to malloc/calloc, we directly use the pointer
name (sizeof(*...)). The following Coccinelle patch was used:
@@
type T;
T *x;
@@
x = malloc(
- sizeof(T)
+ sizeof(*x)
)
@@
type T;
T *x;
@@
x = calloc(1,
- sizeof(T)
+ sizeof(*x)
)
When the LHS is not just a variable name, no change is made. Moreover,
the following patch was used to ensure that "1" is consistently used as
a first argument of calloc, not the last one:
@@
@@
calloc(
+ 1,
...
- ,1
)
In C89, "void *" is automatically promoted to any pointer type. Casting
the result of malloc/calloc to the type of the LHS variable is therefore
unneeded.
Most of this patch was built using this Coccinelle patch:
@@
type T;
@@
- (T *)
(\(lua_touserdata\|malloc\|calloc\|SSL_get_app_data\|hlua_checkudata\|lua_newuserdata\)(...))
@@
type T;
T *x;
void *data;
@@
x =
- (T *)
data
@@
type T;
T *x;
T *data;
@@
x =
- (T *)
data
Unfortunately, either Coccinelle or I is too limited to detect situation
where a complex RHS expression is of type "void *" and therefore casting
is not needed. Those cases were manually examined and corrected.
The frequency counters's window start is sent as "now - freq.date",
which is a positive age compared to the current date. But on receipt,
this age was added to the current date instead of subtracted. So
since the date was always in the future, they were always expired if
the activity changed side in less than the counter's measuring period
(eg: 10s).
This bug was reported by Christian Ruppert who also provided an easy
reproducer.
It needs to be backported to 1.6.
New sticktable entries learned from a remote peer can be pushed to others after
a random delay because they are not inserted at the right position in the updates
tree.
function 'peer_prepare_ackmsg' is designed to use the argument 'msg'
instead of 'trash.str'.
There is currently no bug because the caller passes 'trash.str' in
the 'msg' argument.
Some updates are pushed using an incremental update message after a
re-connection whereas the origin is forgotten by the peer.
These updates are never correctly acknowledged. So they are regularly
re-pushed after an idle timeout and a re-connect.
The fix consists to use an absolute update message in some cases.
Pradeep Jindal reported and troubleshooted a bug causing haproxy to die
during startup on all processes not making use of a peers section. It
only happens with nbproc > 1 when peers are declared. Technically it's
when the peers task is stopped on processes that don't use it that the
crash occurred (a task_free() called on a NULL task pointer).
This only affects peers v2 in the dev branch, no backport is needed.
The table definition message id was used instead of the update acknowledgement id.
This bug causes a malformated message and a protocol error and breaks the
connection.
After that, the updates remain unacknowledged.
This patch removes the special stick tables types names and
use the standard sample type names. This avoid the maintainance
of two types and remove the switch/case for matching a sample
type for each stick table type.
It is possible to propagate entries of any data-types in stick-tables between
several haproxy instances over TCP connections in a multi-master fashion. Each
instance pushes its local updates and insertions to remote peers. The pushed
values overwrite remote ones without aggregation. Interrupted exchanges are
automatically detected and recovered from the last known point.