Merge branch 'sfx-xdp-fallback-tx-queues'

Íñigo Huguet says: ==================== sfc: fallback for lack of xdp tx queues If there are not enough hardware resources to allocate one tx queue per CPU for XDP, XDP_TX and XDP_REDIRECT actions were unavailable, and using them resulted each time with the packet being drop and this message in the logs: XDP TX failed (-22) These patches implement 2 fallback solutions for 2 different situations that might happen: 1. There are not enough free resources for all the tx queues, but there are some free resources available 2. There are not enough free resources at all for tx queues. Both solutions are based in sharing tx queues, using __netif_tx_lock for synchronization. In the second case, as there are not XDP TX queues to share, network stack queues are used instead, but since we're taking __netif_tx_lock, concurrent access to the queues is correctly protected. The solution for this second case might affect performance both of XDP traffic and normal traffice due to lock contention if both are used intensively. That's why I call it a "last resort" fallback: it's not a desirable situation, but at least we have XDP TX working. Some tests has shown good results and indicate that the non-fallback case is not being damaged by this changes. They are also promising for the fallback cases. This is the test: 1. From another machine, send high amount of packets with pktgen, script samples/pktgen/pktgen_sample04_many_flows.sh 2. In the tested machine, run samples/bpf/xdp_rxq_info with arguments "-a XDP_TX --swapmac" and see the results 3. In the tested machine, run also pktgen_sample04 to create high TX normal traffic, and see how xdp_rxq_info results vary Note that this test doesn't check the worst situations for the fallback solutions because XDP_TX will only be executed from the same CPUs that are processed by sfc, and not from every CPU in the system, so the performance drop due to the highest locking contention doesn't happen. I'd like to test that, as well, but I don't have access right now to a proper environment. Test results: Without doing TX: Before changes: ~2,900,000 pps After changes, 1 queues/core: ~2,900,000 pps After changes, 2 queues/core: ~2,900,000 pps After changes, 8 queues/core: ~2,900,000 pps After changes, borrowing from network stack: ~2,900,000 pps With multiflow TX at the same time: Before changes: ~1,700,000 - 2,900,000 pps After changes, 1 queues/core: ~1,700,000 - 2,900,000 pps After changes, 2 queues/core: ~1,700,000 pps After changes, 8 queues/core: ~1,700,000 pps After changes, borrowing from network stack: 1,150,000 pps Sporadic "XDP TX failed (-5)" warnings are shown when running xdp program and pktgen simultaneously. This was expected because XDP doesn't have any buffering system if the NIC is under very high pressure. Thousands of these warnings are shown in the case of borrowing net stack queues. As I said before, this was also expected. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sfx-xdp-fallback-tx-queues'
Íñigo Huguet says: ==================== sfc: fallback for lack of xdp tx queues If there are not enough hardware resources to allocate one tx queue per CPU for XDP, XDP_TX and XDP_REDIRECT actions were unavailable, and using them resulted each time with the packet being drop and this message in the logs: XDP TX failed (-22) These patches implement 2 fallback solutions for 2 different situations that might happen: 1. There are not enough free resources for all the tx queues, but there are some free resources available 2. There are not enough free resources at all for tx queues. Both solutions are based in sharing tx queues, using __netif_tx_lock for synchronization. In the second case, as there are not XDP TX queues to share, network stack queues are used instead, but since we're taking __netif_tx_lock, concurrent access to the queues is correctly protected. The solution for this second case might affect performance both of XDP traffic and normal traffice due to lock contention if both are used intensively. That's why I call it a "last resort" fallback: it's not a desirable situation, but at least we have XDP TX working. Some tests has shown good results and indicate that the non-fallback case is not being damaged by this changes. They are also promising for the fallback cases. This is the test: 1. From another machine, send high amount of packets with pktgen, script samples/pktgen/pktgen_sample04_many_flows.sh 2. In the tested machine, run samples/bpf/xdp_rxq_info with arguments "-a XDP_TX --swapmac" and see the results 3. In the tested machine, run also pktgen_sample04 to create high TX normal traffic, and see how xdp_rxq_info results vary Note that this test doesn't check the worst situations for the fallback solutions because XDP_TX will only be executed from the same CPUs that are processed by sfc, and not from every CPU in the system, so the performance drop due to the highest locking contention doesn't happen. I'd like to test that, as well, but I don't have access right now to a proper environment. Test results: Without doing TX: Before changes: ~2,900,000 pps After changes, 1 queues/core: ~2,900,000 pps After changes, 2 queues/core: ~2,900,000 pps After changes, 8 queues/core: ~2,900,000 pps After changes, borrowing from network stack: ~2,900,000 pps With multiflow TX at the same time: Before changes: ~1,700,000 - 2,900,000 pps After changes, 1 queues/core: ~1,700,000 - 2,900,000 pps After changes, 2 queues/core: ~1,700,000 pps After changes, 8 queues/core: ~1,700,000 pps After changes, borrowing from network stack: 1,150,000 pps Sporadic "XDP TX failed (-5)" warnings are shown when running xdp program and pktgen simultaneously. This was expected because XDP doesn't have any buffering system if the NIC is under very high pressure. Thousands of these warnings are shown in the case of borrowing net stack queues. As I said before, this was also expected. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
e3a843f9 · David S. Miller · 2a48d96f · 6215b608 · e3a843f9 · e3a843f9
Commit e3a843f9 authored Sep 09, 2021 by David S. Miller
3 changed files
--- a/drivers/net/ethernet/sfc/efx_channels.c
+++ b/drivers/net/ethernet/sfc/efx_channels.c
@@ -166,32 +166,46 @@ static int efx_allocate_msix_channels(struct efx_nic *efx,
 	 * We need a channel per event queue, plus a VI per tx queue.
 	 * This may be more pessimistic than it needs to be.
 	 */
-	if (n_channels + n_xdp_ev > max_channels) {
-		netif_err(efx, drv, efx->net_dev,
-			  "Insufficient resources for %d XDP event queues (%d other channels, max %d)\n",
-			  n_xdp_ev, n_channels, max_channels);
-		netif_err(efx, drv, efx->net_dev,
-			  "XDP_TX and XDP_REDIRECT will not work on this interface");
-		efx->n_xdp_channels = 0;
-		efx->xdp_tx_per_channel = 0;
-		efx->xdp_tx_queue_count = 0;
+	if (n_channels >= max_channels) {
+		efx->xdp_txq_queues_mode = EFX_XDP_TX_QUEUES_BORROWED;
+		netif_warn(efx, drv, efx->net_dev,
+			   "Insufficient resources for %d XDP event queues (%d other channels, max %d)\n",
+			   n_xdp_ev, n_channels, max_channels);
+		netif_warn(efx, drv, efx->net_dev,
+			   "XDP_TX and XDP_REDIRECT might decrease device's performance\n");
 	} else if (n_channels + n_xdp_tx > efx->max_vis) {
-		netif_err(efx, drv, efx->net_dev,
-			  "Insufficient resources for %d XDP TX queues (%d other channels, max VIs %d)\n",
-			  n_xdp_tx, n_channels, efx->max_vis);
-		netif_err(efx, drv, efx->net_dev,
-			  "XDP_TX and XDP_REDIRECT will not work on this interface");
-		efx->n_xdp_channels = 0;
-		efx->xdp_tx_per_channel = 0;
-		efx->xdp_tx_queue_count = 0;
+		efx->xdp_txq_queues_mode = EFX_XDP_TX_QUEUES_BORROWED;
+		netif_warn(efx, drv, efx->net_dev,
+			   "Insufficient resources for %d XDP TX queues (%d other channels, max VIs %d)\n",
+			   n_xdp_tx, n_channels, efx->max_vis);
+		netif_warn(efx, drv, efx->net_dev,
+			   "XDP_TX and XDP_REDIRECT might decrease device's performance\n");
+	} else if (n_channels + n_xdp_ev > max_channels) {
+		efx->xdp_txq_queues_mode = EFX_XDP_TX_QUEUES_SHARED;
+		netif_warn(efx, drv, efx->net_dev,
+			   "Insufficient resources for %d XDP event queues (%d other channels, max %d)\n",
+			   n_xdp_ev, n_channels, max_channels);
+
+		n_xdp_ev = max_channels - n_channels;
+		netif_warn(efx, drv, efx->net_dev,
+			   "XDP_TX and XDP_REDIRECT will work with reduced performance (%d cpus/tx_queue)\n",
+			   DIV_ROUND_UP(n_xdp_tx, tx_per_ev * n_xdp_ev));
 	} else {
+		efx->xdp_txq_queues_mode = EFX_XDP_TX_QUEUES_DEDICATED;
+	}
+
+	if (efx->xdp_txq_queues_mode != EFX_XDP_TX_QUEUES_BORROWED) {
 		efx->n_xdp_channels = n_xdp_ev;
 		efx->xdp_tx_per_channel = tx_per_ev;
 		efx->xdp_tx_queue_count = n_xdp_tx;
 		n_channels += n_xdp_ev;
 		netif_dbg(efx, drv, efx->net_dev,
 			  "Allocating %d TX and %d event queues for XDP\n",
-			  n_xdp_tx, n_xdp_ev);
+			  n_xdp_ev * tx_per_ev, n_xdp_ev);
+	} else {
+		efx->n_xdp_channels = 0;
+		efx->xdp_tx_per_channel = 0;
+		efx->xdp_tx_queue_count = n_xdp_tx;
 	}

 	if (vec_count < n_channels) {
@@ -858,6 +872,20 @@ int efx_realloc_channels(struct efx_nic *efx, u32 rxq_entries, u32 txq_entries)
 	goto out;
 }

+static inline int
+efx_set_xdp_tx_queue(struct efx_nic *efx, int xdp_queue_number,
+		     struct efx_tx_queue *tx_queue)
+{
+	if (xdp_queue_number >= efx->xdp_tx_queue_count)
+		return -EINVAL;
+
+	netif_dbg(efx, drv, efx->net_dev, "Channel %u TXQ %u is XDP %u, HW %u\n",
+		  tx_queue->channel->channel, tx_queue->label,
+		  xdp_queue_number, tx_queue->queue);
+	efx->xdp_tx_queues[xdp_queue_number] = tx_queue;
+	return 0;
+}
+
 int efx_set_channels(struct efx_nic *efx)
 {
 	struct efx_tx_queue *tx_queue;
@@ -896,20 +924,9 @@ int efx_set_channels(struct efx_nic *efx)
 			if (efx_channel_is_xdp_tx(channel)) {
 				efx_for_each_channel_tx_queue(tx_queue, channel) {
 					tx_queue->queue = next_queue++;
-
-					/* We may have a few left-over XDP TX
-					 * queues owing to xdp_tx_queue_count
-					 * not dividing evenly by EFX_MAX_TXQ_PER_CHANNEL.
-					 * We still allocate and probe those
-					 * TXQs, but never use them.
-					 */
-					if (xdp_queue_number < efx->xdp_tx_queue_count) {
-						netif_dbg(efx, drv, efx->net_dev, "Channel %u TXQ %u is XDP %u, HW %u\n",
-							  channel->channel, tx_queue->label,
-							  xdp_queue_number, tx_queue->queue);
-						efx->xdp_tx_queues[xdp_queue_number] = tx_queue;
+					rc = efx_set_xdp_tx_queue(efx, xdp_queue_number, tx_queue);
+					if (rc == 0)
 						xdp_queue_number++;
-					}
 				}
 			} else {
 				efx_for_each_channel_tx_queue(tx_queue, channel) {
@@ -918,10 +935,35 @@ int efx_set_channels(struct efx_nic *efx)
 						  channel->channel, tx_queue->label,
 						  tx_queue->queue);
 				}
+
+				/* If XDP is borrowing queues from net stack, it must use the queue
+				 * with no csum offload, which is the first one of the channel
+				 * (note: channel->tx_queue_by_type is not initialized yet)
+				 */
+				if (efx->xdp_txq_queues_mode == EFX_XDP_TX_QUEUES_BORROWED) {
+					tx_queue = &channel->tx_queue[0];
+					rc = efx_set_xdp_tx_queue(efx, xdp_queue_number, tx_queue);
+					if (rc == 0)
+						xdp_queue_number++;
+				}
 			}
 		}
 	}
-	WARN_ON(xdp_queue_number != efx->xdp_tx_queue_count);
+	WARN_ON(efx->xdp_txq_queues_mode == EFX_XDP_TX_QUEUES_DEDICATED &&
+		xdp_queue_number != efx->xdp_tx_queue_count);
+	WARN_ON(efx->xdp_txq_queues_mode != EFX_XDP_TX_QUEUES_DEDICATED &&
+		xdp_queue_number > efx->xdp_tx_queue_count);
+
+	/* If we have more CPUs than assigned XDP TX queues, assign the already
+	 * existing queues to the exceeding CPUs
+	 */
+	next_queue = 0;
+	while (xdp_queue_number < efx->xdp_tx_queue_count) {
+		tx_queue = efx->xdp_tx_queues[next_queue++];
+		rc = efx_set_xdp_tx_queue(efx, xdp_queue_number, tx_queue);
+		if (rc == 0)
+			xdp_queue_number++;
+	}

 	rc = netif_set_real_num_tx_queues(efx->net_dev, efx->n_tx_channels);
 	if (rc)

--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -782,6 +782,12 @@ struct efx_async_filter_insertion {
 #define EFX_RPS_MAX_IN_FLIGHT	8
 #endif /* CONFIG_RFS_ACCEL */

+enum efx_xdp_tx_queues_mode {
+	EFX_XDP_TX_QUEUES_DEDICATED,	/* one queue per core, locking not needed */
+	EFX_XDP_TX_QUEUES_SHARED,	/* each queue used by more than 1 core */
+	EFX_XDP_TX_QUEUES_BORROWED	/* queues borrowed from net stack */
+};
+
 /**
 * struct efx_nic - an Efx NIC
 * @name: Device name (net device name or bus id before net device registered)
@@ -820,6 +826,7 @@ struct efx_async_filter_insertion {
 *	should be allocated for this NIC
 * @xdp_tx_queue_count: Number of entries in %xdp_tx_queues.
 * @xdp_tx_queues: Array of pointers to tx queues used for XDP transmit.
+ * @xdp_txq_queues_mode: XDP TX queues sharing strategy.
 * @rxq_entries: Size of receive queues requested by user.
 * @txq_entries: Size of transmit queues requested by user.
 * @txq_stop_thresh: TX queue fill level at or above which we stop it.
@@ -979,6 +986,7 @@ struct efx_nic {

 	unsigned int xdp_tx_queue_count;
 	struct efx_tx_queue **xdp_tx_queues;
+	enum efx_xdp_tx_queues_mode xdp_txq_queues_mode;

 	unsigned rxq_entries;
 	unsigned txq_entries;

--- a/drivers/net/ethernet/sfc/tx.c
+++ b/drivers/net/ethernet/sfc/tx.c
@@ -428,23 +428,32 @@ int efx_xdp_tx_buffers(struct efx_nic *efx, int n, struct xdp_frame **xdpfs,
 	unsigned int len;
 	int space;
 	int cpu;
-	int i;
+	int i = 0;

-	cpu = raw_smp_processor_id();
+	if (unlikely(n && !xdpfs))
+		return -EINVAL;
+	if (unlikely(!n))
+		return 0;

-	if (!efx->xdp_tx_queue_count ||
-	    unlikely(cpu >= efx->xdp_tx_queue_count))
+	cpu = raw_smp_processor_id();
+	if (unlikely(cpu >= efx->xdp_tx_queue_count))
 		return -EINVAL;

 	tx_queue = efx->xdp_tx_queues[cpu];
 	if (unlikely(!tx_queue))
 		return -EINVAL;

-	if (unlikely(n && !xdpfs))
-		return -EINVAL;
+	if (efx->xdp_txq_queues_mode != EFX_XDP_TX_QUEUES_DEDICATED)
+		HARD_TX_LOCK(efx->net_dev, tx_queue->core_txq, cpu);

-	if (!n)
-		return 0;
+	/* If we're borrowing net stack queues we have to handle stop-restart
+	 * or we might block the queue and it will be considered as frozen
+	 */
+	if (efx->xdp_txq_queues_mode == EFX_XDP_TX_QUEUES_BORROWED) {
+		if (netif_tx_queue_stopped(tx_queue->core_txq))
+			goto unlock;
+		efx_tx_maybe_stop_queue(tx_queue);
+	}

 	/* Check for available space. We should never need multiple
 	 * descriptors per frame.
@@ -484,6 +493,10 @@ int efx_xdp_tx_buffers(struct efx_nic *efx, int n, struct xdp_frame **xdpfs,
 	if (flush && i > 0)
 		efx_nic_push_buffers(tx_queue);

+unlock:
+	if (efx->xdp_txq_queues_mode != EFX_XDP_TX_QUEUES_DEDICATED)
+		HARD_TX_UNLOCK(efx->net_dev, tx_queue->core_txq);
+
 	return i == 0 ? -EIO : i;
 }