mirror of https://git.haproxy.org/git/haproxy.git/ synced 2025-10-26 14:10:59 +01:00

2021-07-30 - File descriptor migration between threads

An FD migration may happen on any idle connection that experiences a takeover()
operation by another thread. In this case the acting thread becomes the owner
of the connection (and FD) while the previous owner(s) need to forget about it.

File descriptor migration between threads is a fairly complex operation because
it is required to maintain durable consistency between the pollers' states and
haproxy's desired state. Indeed, very often the FD is registered within one
thread's poller and that thread might be waiting in the system, so there is no
way to synchronously update it. This is where thread_mask, polled_mask and
per-thread updates are used:

  - a thread knows if it's allowed to manipulate an FD by looking at its bit in
    the FD's thread_mask.

  - each thread knows if it was polling an FD by looking at its bit in the
    polled_mask field ; a recent migration is usually indicated by a bit being
    present in polled_mask and absent from thread_mask.

  - other threads know whether it's safe to take over an FD by looking at the
    running_mask: if it contains any other thread's bit, then other threads are
    using it and it's not safe to take it over.

  - sleeping threads are notified about the need to update their polling via
    local or global updates to the FD. Each thread has its own local update
    list and its own bit in the update_mask to know whether there are pending
    updates for it. This allows polling to reconverge with the desired state
    at the last instant before polling.

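The mask checks listed above can be pictured with a minimal sketch. The
structure layout and the helper names (fd_may_use, fd_recently_migrated,
fd_safe_to_take_over) are assumptions made for illustration and are not taken
from haproxy's sources. Plain reads are used for brevity, where real code would
presumably rely on atomic loads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified per-FD state: one bit per thread in each mask. */
struct fd_state {
    uint64_t thread_mask;   /* threads allowed to manipulate the FD */
    uint64_t polled_mask;   /* threads whose poller still knows the FD */
    uint64_t running_mask;  /* threads currently using the FD */
    uint64_t update_mask;   /* threads with a pending polling update */
};

/* The caller may touch the FD only if its bit is in thread_mask. */
static inline bool fd_may_use(const struct fd_state *fd, uint64_t tid_bit)
{
    return (fd->thread_mask & tid_bit) != 0;
}

/* Polled but no longer owned: the FD most likely migrated recently. */
static inline bool fd_recently_migrated(const struct fd_state *fd,
                                        uint64_t tid_bit)
{
    return (fd->polled_mask & tid_bit) && !(fd->thread_mask & tid_bit);
}

/* Any other thread's bit in running_mask makes a takeover unsafe. */
static inline bool fd_safe_to_take_over(const struct fd_state *fd,
                                        uint64_t tid_bit)
{
    return (fd->running_mask & ~tid_bit) == 0;
}
```

For instance, with thread_mask=0x2 and polled_mask=0x3, thread 0x1 sees that it
still polls an FD it no longer owns, and has to fix its poller state before
sleeping again.
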
While the description above could be seen as "progressive" (it technically is)
in that there is always a transition and convergence period in a migrated FD's
life, functionally speaking it's perfectly atomic thanks to the running bit and
to the per-thread idle connections lock: no takeover is permitted without
holding the idle_conns lock, and takeover may only happen by atomically picking
a connection from the list that is also protected by this lock. In practice, an
FD is never taken over by itself, but always in the context of a connection,
and by atomically removing a connection from an idle list, it is possible to
guarantee that a connection will not be picked, hence that its FD will not be
taken over.

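The role of the idle connections lock can be sketched as follows. This is only
an illustration under stated assumptions: the types, the spinlock and the
helper names (idle_list_remove, idle_list_pick) are hypothetical stand-ins, not
haproxy's actual list or lock API, and the fd_takeover() step that a real
takeover performs under the lock is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical idle connection, linked into its owner's idle list. */
struct conn {
    struct conn *next;
    int fd;
    int in_idle_list;
};

/* Hypothetical per-thread idle list with its lock (a simple spinlock here). */
struct idle_list {
    atomic_flag lock;
    struct conn *head;
};

static inline void idle_lock(struct idle_list *l)
{
    while (atomic_flag_test_and_set_explicit(&l->lock, memory_order_acquire))
        ; /* spin until the lock is free */
}

static inline void idle_unlock(struct idle_list *l)
{
    atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

/* Owner side: detach the connection so that no other thread can pick it. */
static inline void idle_list_remove(struct idle_list *l, struct conn *c)
{
    idle_lock(l);
    for (struct conn **p = &l->head; *p; p = &(*p)->next) {
        if (*p == c) {
            *p = c->next;
            c->next = NULL;
            c->in_idle_list = 0;
            break;
        }
    }
    idle_unlock(l);
}

/* Takeover side: a connection is only ever obtained from the list, and only
 * under the lock, so a detached connection can never be picked. */
static inline struct conn *idle_list_pick(struct idle_list *l)
{
    idle_lock(l);
    struct conn *c = l->head;
    if (c) {
        l->head = c->next;
        c->next = NULL;
        c->in_idle_list = 0;
    }
    idle_unlock(l);
    return c;
}
```

Once idle_list_remove() has returned, a competing idle_list_pick() cannot see
the connection anymore, which is what makes its FD unreachable for a takeover.
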
same thread as list!

The possible entry points to a race to use a file descriptor are the following
ones, with their respective sequences:

  1) takeover: requested by conn_backend_get() on behalf of connect_server()
    - take the idle_conns_lock, protecting against a parallel access from the
      I/O tasklet or timeout task
    - pick the first connection from the list
    - attempt an fd_takeover() on this connection's FD. Usually it works,
      unless a late wakeup of the owning thread shows up in the FD's running
      mask. The operation is performed in fd_takeover() using a DWCAS which
      tries to switch both running and thread_mask to the caller's tid_bit. A
      concurrent bit in running is enough to make it fail. This guarantees
      another thread does not wake up from I/O in the middle of the takeover.
      In case of conflict, this FD is skipped and the attempt is tried again
      with the next connection.
    - reset the task/tasklet contexts to NULL, as a signal that they are not
      allowed to run anymore. The tasks retrieve their execution context from
      the scheduler in the arguments, but will check the tasks' context from
      the structure under the lock to detect this possible change, and abort.
    - at this point the takeover succeeded, the idle_conns_lock is released and
      the connection and its FD are now owned by the caller

  2) poll report: happens on late rx, shutdown or error on idle conns
    - fd_set_running() is called to atomically set the caller's bit in the
      running_mask and check that the caller's tid_bit is still present in the
      thread_mask. Upon failure the caller arranges to stop reporting that FD
      (e.g. by immediate removal or by an asynchronous update). Upon success,
      it's guaranteed that any concurrent fd_takeover() will fail the DWCAS and
      that another connection will need to be picked instead.
    - FD's state is possibly updated
    - the iocb is called if needed (almost always)
    - if the iocb didn't kill the connection, release the bit from running_mask
      making the connection possibly available to a subsequent fd_takeover().

  3) I/O tasklet, timeout task: timeout or subscribed wakeup
    - start by taking the idle_conns_lock, ensuring no takeover() will pick the
      same connection from this point.
    - check the task/tasklet's context to verify that no recently completed
      takeover() stole the connection. If it's NULL, the connection was lost,
      the lock is released and the task/tasklet killed. Otherwise it is
      guaranteed that no other thread may use that connection (current takeover
      candidates are waiting on the lock, previous owners waking from poll()
      lost their bit in the thread_mask and will not touch the FD).
    - the connection is removed from the idle conns list. From this point on,
      no other thread will find it there or even try fd_takeover() on it.
    - the idle_conns_lock is now released, the connection is protected and its
      FD is not reachable by other threads anymore.
    - the task does what it has to do
    - if the connection is still usable (i.e. not upon timeout), it's inserted
      again into the idle conns list, meaning it may instantly be taken over
      by a competing thread.

  4) wake() callback: happens on last user after xfers (may free() the conn)
    - the connection is still owned by the caller, it's still subscribed to
      polling but the connection is idle thus inactive. Errors or shutdowns
      may be reported late, via sock_conn_iocb() and conn_notify_mux(), thus
      the running bit is set (i.e. a concurrent fd_takeover() will fail).
    - if the connection is in the list, the idle_conns_lock is grabbed, the
      connection is removed from the list, and the lock is released.
    - mux->wake() is called
    - if the connection previously was in the list, it's reinserted under the
      idle_conns_lock.


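The interaction between sequences 1) and 2) above can be sketched in portable
C11. This is not haproxy's implementation: as an assumption made for this
sketch, both masks are limited to 32 bits and packed into one 64-bit word so
that an ordinary 64-bit CAS can stand in for the DWCAS. The function names
follow the text, but their bodies, as well as pack() and fd_clear_running(),
are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: running_mask lives in the upper 32 bits and thread_mask in the
 * lower 32 bits of a single word, updated with one 64-bit CAS. */
typedef _Atomic uint64_t fd_masks_t;

static inline uint64_t pack(uint32_t running, uint32_t thread)
{
    return ((uint64_t)running << 32) | thread;
}

/* Poll report side: set our running bit only if we still own the FD. */
static inline bool fd_set_running(fd_masks_t *m, uint32_t tid_bit)
{
    uint64_t old = atomic_load(m);

    do {
        if (!((uint32_t)old & tid_bit))
            return false;          /* migrated away: stop reporting this FD */
    } while (!atomic_compare_exchange_weak(m, &old, old | pack(tid_bit, 0)));
    return true;
}

/* Release the running bit once the iocb is done with the FD. */
static inline void fd_clear_running(fd_masks_t *m, uint32_t tid_bit)
{
    atomic_fetch_and(m, ~pack(tid_bit, 0));
}

/* Takeover side: succeeds only when nobody is running on the FD; running and
 * thread_mask both switch to the caller's bit in a single atomic step. */
static inline bool fd_takeover(fd_masks_t *m, uint32_t tid_bit)
{
    uint64_t old = atomic_load(m);

    do {
        if (old >> 32)
            return false;          /* a concurrent running bit: skip this FD */
    } while (!atomic_compare_exchange_weak(m, &old, pack(tid_bit, tid_bit)));

    fd_clear_running(m, tid_bit);  /* the caller now owns the FD */
    return true;
}
```

Whichever side wins the CAS first excludes the other: a successful
fd_set_running() makes the takeover fail, and a successful fd_takeover() makes
the late poll report fail its ownership check.
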
Once the DWCAS between running_mask and thread_mask is removed, the sequences
become:

fd_takeover:
     1  if (!CAS(&running_mask, 0, tid_bit))
     2      return fail;
     3  atomic_store(&thread_mask, tid_bit);
     4  atomic_and(&running_mask, ~tid_bit);

poller:
     1  do {
     2      /* read consistent running_mask & thread_mask */
     3      do {
     4          run = atomic_load(&running_mask);
     5          thr = atomic_load(&thread_mask);
     6      } while (run & ~thr);
     7
     8      if (!(thr & tid_bit)) {
     9          /* takeover has started */
    10          goto disable_fd;
    11      }
    12  } while (!CAS(&running_mask, run, run | tid_bit));

fd_delete:
     1  atomic_or(&running_mask, tid_bit);
     2  atomic_store(&thread_mask, 0);
     3  atomic_and(&running_mask, ~tid_bit);

The loop in poller:3-6 makes sure that the thread_mask we read matches the
last updated running_mask. If nobody can give up on fd_takeover(), it might
even be possible to spin on thread_mask only. With this, late pollers will not
set their running bit anymore.
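
For reference, the three sequences can be transcribed into C11 atomics roughly
as follows. This is a sketch under assumptions, not the actual code: the
struct layout and the name fd_poller_set_running() are made up for the example,
and the "goto disable_fd" branch is turned into a false return value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-FD state, reduced to the two masks discussed above. */
struct fd_state {
    _Atomic uint64_t running_mask;
    _Atomic uint64_t thread_mask;
};

/* fd_takeover: claim running first, then move ownership, then release. */
static inline bool fd_takeover(struct fd_state *fd, uint64_t tid_bit)
{
    uint64_t expected = 0;

    if (!atomic_compare_exchange_strong(&fd->running_mask, &expected, tid_bit))
        return false;                       /* someone is running on the FD */
    atomic_store(&fd->thread_mask, tid_bit);
    atomic_fetch_and(&fd->running_mask, ~tid_bit);
    return true;
}

/* poller: returns true with the running bit set, or false when the FD was
 * taken over and must be disabled in the local poller. */
static inline bool fd_poller_set_running(struct fd_state *fd, uint64_t tid_bit)
{
    uint64_t run, thr;

    do {
        /* read a consistent pair: spin while a running thread is not (yet)
         * part of thread_mask, i.e. while a takeover or delete is ongoing. */
        do {
            run = atomic_load(&fd->running_mask);
            thr = atomic_load(&fd->thread_mask);
        } while (run & ~thr);

        if (!(thr & tid_bit))
            return false;                   /* takeover has started */
    } while (!atomic_compare_exchange_strong(&fd->running_mask, &run,
                                             run | tid_bit));
    return true;
}

/* fd_delete: hold the running bit while dropping all owners. */
static inline void fd_delete(struct fd_state *fd, uint64_t tid_bit)
{
    atomic_fetch_or(&fd->running_mask, tid_bit);
    atomic_store(&fd->thread_mask, 0);
    atomic_fetch_and(&fd->running_mask, ~tid_bit);
}
```

Between steps 1 and 3 of fd_takeover, running_mask holds a bit that is absent
from thread_mask, which is exactly the condition the inner poller loop spins
on.
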