    SUNRPC: Replace the use of the xprtiod WQ in rpcrdma · 6b1eb3b2
    Chuck Lever authored
    While setting up a new lab, I accidentally misconfigured the
    Ethernet port for a system that tried an NFS mount using RoCE.
    This made the NFS server unreachable. The following WARNING
    popped on the NFS client while waiting for the mount attempt to
    time out:
    
    kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
    kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca
    kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma>
    kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13
    kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
    kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
    kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
    kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
    kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
    kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027
    kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff
    kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000
    kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731
    kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0
    kernel: FS:  0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000
    kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0
    kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    kernel: Call Trace:
    kernel:  <TASK>
    kernel:  __flush_work.isra.0+0xaf/0x188
    kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
    kernel:  ? lock_timer_base+0x38/0x5f
    kernel:  __cancel_work_timer+0xea/0x13d
    kernel:  ? preempt_latency_start+0x2b/0x46
    kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
    kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
    kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
    kernel:  ? _raw_spin_unlock+0x14/0x29
    kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
    kernel:  ? finish_task_switch.isra.0+0x171/0x249
    kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
    kernel:  process_one_work+0x1d8/0x2d4
    kernel:  worker_thread+0x18b/0x24f
    kernel:  ? rescuer_thread+0x280/0x280
    kernel:  kthread+0xf4/0xfc
    kernel:  ? kthread_complete_and_exit+0x1b/0x1b
    kernel:  ret_from_fork+0x22/0x30
    kernel:  </TASK>
    
    SUNRPC's xprtiod workqueue is WQ_MEM_RECLAIM, so any workqueue whose
    work one of its work items flushes or cancels must also be
    WQ_MEM_RECLAIM to prevent a priority inversion. The internal
    workqueues in the RDMA/core are currently non-MEM_RECLAIM.
    
    Jason Gunthorpe says this about the current state of RDMA/core:
    
    > If you attempt to do a reconnection/etc from within a RECLAIM
    > context it will deadlock on one of the many allocations that are
    > made to support opening the connection.
    >
    > The general idea of reclaim is that the entire task context
    > working under the reclaim is marked with an override of the gfp
    > flags to make all allocations under that call chain reclaim safe.
    >
    > But rdmacm does allocations outside this, eg in the WQs processing
    > the CM packets. So this doesn't work and we will deadlock.
    >
    > Fixing it is a big deal and needs more than poking WQ_MEM_RECLAIM
    > here and there.
    
    So we will change the ULP (rpcrdma) in this case to avoid the use of
    WQ_MEM_RECLAIM where possible. Deadlocks that were possible before
    are not fixed, but at least we no longer have a false sense of
    confidence that the stack won't allocate memory during memory
    reclaim.
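
The flush-dependency rule behind the WARNING, and the reason that moving the connect worker off xprtiod silences it, can be illustrated with a small userspace model. This is only a sketch, not kernel code: the flag values, the `model_` names, and the two non-reclaim queue names are hypothetical, and the model leaves out the separate PF_MEMALLOC check that the kernel's check_flush_dependency() also performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values are hypothetical and local to this model. */
enum {
    MODEL_WQ_MEM_RECLAIM = 1u << 0,  /* queue promises forward progress under reclaim */
    MODEL_WQ_LEGACY      = 1u << 1,  /* legacy queues are exempt from the check */
};

struct model_wq {
    const char *name;
    unsigned int flags;
};

/*
 * Returns true when a work item running on current_wq would trigger the
 * "WQ_MEM_RECLAIM ... is flushing !WQ_MEM_RECLAIM" warning by flushing or
 * cancelling work on target_wq. current_wq == NULL means the caller is
 * not a workqueue worker at all.
 */
bool model_flush_would_warn(const struct model_wq *current_wq,
                            const struct model_wq *target_wq)
{
    if (target_wq->flags & MODEL_WQ_MEM_RECLAIM)
        return false;               /* target can always make progress */
    if (current_wq == NULL)
        return false;               /* not running from a workqueue */
    return (current_wq->flags & (MODEL_WQ_MEM_RECLAIM | MODEL_WQ_LEGACY))
           == MODEL_WQ_MEM_RECLAIM;
}

/* Walks through the situation in this commit message. */
void model_example(void)
{
    struct model_wq xprtiod  = { "xprtiod", MODEL_WQ_MEM_RECLAIM };
    struct model_wq rdma_int = { "rdma_internal (hypothetical)", 0 };
    struct model_wq plain    = { "non_reclaim (hypothetical)", 0 };

    /* Before: connect worker on xprtiod cancels RDMA/core work. */
    assert(model_flush_would_warn(&xprtiod, &rdma_int));

    /* After: connect worker runs on a queue without WQ_MEM_RECLAIM. */
    assert(!model_flush_would_warn(&plain, &rdma_int));
}
```

Calling `model_example()` runs both cases. The second assertion shows only that the warning condition no longer holds; as the message above says, the underlying deadlocks are not fixed by the change.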
    
    Suggested-by: Leon Romanovsky <leon@kernel.org>
    Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
transport.c 21.8 KB