Commit 3d9a0d2f authored by Eric Dumazet, committed by David S. Miller

dql: dql_queued() should write first to reduce bus transactions

While running a high-throughput test on a BQL-enabled NIC,
I found a very high cost in ndo_start_xmit() when accessing BQL data.

It turned out the problem was caused by the compiler trying to be
smart, which resulted in a bad MESI transaction:

  0.05 │  mov    0xc0(%rax),%edi    // LOAD dql->num_queued
  0.48 │  mov    %edx,0xc8(%rax)    // STORE dql->last_obj_cnt = count
 58.23 │  add    %edx,%edi
  0.58 │  cmp    %edi,0xc4(%rax)
  0.76 │  mov    %edi,0xc0(%rax)    // STORE dql->num_queued += count
  0.72 │  js     bd8

I got an incredible 10% gain [1] by making sure the cpu does not attempt
to get the cache line in Shared mode first (which then needs a second
transaction before it can be written), but directly issues a request
for ownership.
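
The three displacements in the listing above (0xc0, 0xc4, 0xc8) can be checked against cache-line boundaries. The following is a minimal sketch, assuming 64-byte cache lines and a cache-line-aligned base pointer in %rax; neither assumption is stated in the commit message. `CACHE_LINE_SIZE`, `cache_line_index()` and `dql_hot_fields_share_line()` are hypothetical names used only for illustration.

```c
#include <stdbool.h>

/* Assumption: 64-byte cache lines (typical on x86-64, not stated in the commit). */
#define CACHE_LINE_SIZE 64u

/* Hypothetical helper: index of the cache line, relative to an aligned base,
 * that contains the given byte offset. */
static unsigned int cache_line_index(unsigned int offset)
{
	return offset / CACHE_LINE_SIZE;
}

/* Returns true when the three fields touched by dql_queued() and dql_avail()
 * fall in the same cache line, using the displacements from the disassembly. */
static bool dql_hot_fields_share_line(void)
{
	const unsigned int num_queued   = 0xc0; /* LOAD/STORE dql->num_queued */
	const unsigned int adj_limit    = 0xc4; /* cmp against dql->adj_limit */
	const unsigned int last_obj_cnt = 0xc8; /* STORE dql->last_obj_cnt    */

	return cache_line_index(num_queued) == cache_line_index(adj_limit) &&
	       cache_line_index(adj_limit) == cache_line_index(last_obj_cnt);
}
```

Under these assumptions all three offsets land in the same line, so the initial load and the two stores contend for a single cache line with the cpu running TX completion.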

New code:
	mov    %edx,0xc8(%rax)  // STORE dql->last_obj_cnt = count
	add    %edx,0xc0(%rax)  // RMW   dql->num_queued += count
	mov    0xc4(%rax),%ecx  // LOAD dql->adj_limit
	mov    0xc0(%rax),%edx  // LOAD dql->num_queued
	cmp    %edx,%ecx

The TX completion was running on another cpu, with a high interrupt
rate.

Note that I am using barrier() as a soft hint, as mb() here could be
too costly.
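
The write-first ordering can be tried outside the kernel with a user-space sketch. It assumes GCC or Clang, because it relies on inline asm and `__typeof__`. `struct dql_sketch`, `dql_queued_sketch()` and `dql_avail_sketch()` are illustrative stand-ins, not the kernel's definitions, and the `barrier()` and `ACCESS_ONCE()` macros are written in the style of the kernel's GCC versions. `barrier()` only stops the compiler from reordering memory accesses across it and emits no fence instruction, so the instructions actually generated still depend on the compiler.

```c
/* Compiler-only barrier: no instruction is emitted, but the compiler may not
 * move memory accesses across this point. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Access x exactly once through a volatile-qualified lvalue (GNU C extension). */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Reduced stand-in for struct dql: only the three fields named in the commit. */
struct dql_sketch {
	unsigned int num_queued;   /* total objects ever queued        */
	unsigned int adj_limit;    /* limit adjusted by the completion */
	unsigned int last_obj_cnt; /* size of the last queued object   */
};

static inline void dql_queued_sketch(struct dql_sketch *dql, unsigned int count)
{
	/* Plain store first, so the first access to the line is a write. */
	dql->last_obj_cnt = count;

	/* Hint only: keeps the compiler from hoisting the num_queued load
	 * above the store. It is not a hardware memory barrier. */
	barrier();

	dql->num_queued += count;
}

/* How many objects can still be queued; < 0 means over the limit.
 * The unsigned difference is converted to int, as in the kernel code. */
static inline int dql_avail_sketch(const struct dql_sketch *dql)
{
	return ACCESS_ONCE(dql->adj_limit) - ACCESS_ONCE(dql->num_queued);
}
```

Swapping `barrier()` for a full fence such as GCC's `__sync_synchronize()` would give stronger ordering than the hint needs, which is the cost the commit message avoids by not using mb().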

[1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
parent 68f6a7c6
@@ -73,14 +73,22 @@ static inline void dql_queued(struct dql *dql, unsigned int count)
 {
 	BUG_ON(count > DQL_MAX_OBJECT);
-	dql->num_queued += count;
 	dql->last_obj_cnt = count;
+	/* We want to force a write first, so that cpu do not attempt
+	 * to get cache line containing last_obj_cnt, num_queued, adj_limit
+	 * in Shared state, but directly does a Request For Ownership
+	 * It is only a hint, we use barrier() only.
+	 */
+	barrier();
+	dql->num_queued += count;
 }
 /* Returns how many objects can be queued, < 0 indicates over limit. */
 static inline int dql_avail(const struct dql *dql)
 {
-	return dql->adj_limit - dql->num_queued;
+	return ACCESS_ONCE(dql->adj_limit) - ACCESS_ONCE(dql->num_queued);
 }
 /* Record number of completed objects and recalculate the limit. */
......