    ice: Fix race during aux device (un)plugging · 486b9eee
    Ivan Vecera authored
    Function ice_plug_aux_dev() assigns the pf->adev field too early, before
    the aux device is initialized. On the other side, ice_unplug_aux_dev()
    starts the aux device deinit and only at the end assigns NULL to
    pf->adev. This is wrong because pf->adev should be non-NULL only when
    the aux device is fully initialized and ready. The wrong order causes
    a crash when ice_send_event_to_aux() is called, because that function
    depends on a non-NULL value of pf->adev and does not expect the aux
    device to be half-initialized or half-destroyed.
    After the order is corrected the race window is tiny, but it is still
    there, as Leon mentioned, so manipulation of pf->adev needs to be
    protected by a mutex.
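    A single-threaded C sketch of the ordering problem described above.
    Everything in it (struct toy_pf, toy_plug(), toy_unplug(), observe())
    is hypothetical and is not ice driver code. The observe() call placed
    between the two steps stands in for ice_send_event_to_aux() running on
    another CPU, so the race is shown deterministically. It models only the
    ordering, not the remaining window that the mutex closes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver structures; NOT the real ice types. */
struct toy_adev {
	bool ready; /* true once "initialization" has finished */
};

struct toy_pf {
	struct toy_adev *adev; /* invariant: non-NULL only when adev->ready */
};

/* What a concurrent reader (think ice_send_event_to_aux()) would observe. */
enum observation { SAW_NULL, SAW_READY, SAW_HALF_BUILT };

static enum observation observe(const struct toy_pf *pf)
{
	if (!pf->adev)
		return SAW_NULL;
	return pf->adev->ready ? SAW_READY : SAW_HALF_BUILT;
}

/*
 * publish_early selects the buggy order (publish pointer, then init) or the
 * fixed order (init, then publish). *mid records what a reader running
 * between the two steps would see.
 */
static int toy_plug(struct toy_pf *pf, bool publish_early, enum observation *mid)
{
	struct toy_adev *adev = calloc(1, sizeof(*adev));

	if (!adev)
		return -1;
	if (publish_early)
		pf->adev = adev;      /* buggy: visible before it is ready */
	*mid = observe(pf);           /* simulated concurrent reader */
	adev->ready = true;           /* initialization completes */
	if (!publish_early)
		pf->adev = adev;      /* fixed: publish only once ready */
	return 0;
}

/* clear_late selects the buggy order (deinit, then clear) or the fixed one. */
static void toy_unplug(struct toy_pf *pf, bool clear_late, enum observation *mid)
{
	struct toy_adev *adev = pf->adev;

	assert(adev != NULL);         /* caller must only unplug a plugged device */
	if (!clear_late)
		pf->adev = NULL;      /* fixed: hide it before teardown */
	adev->ready = false;          /* deinit has started */
	*mid = observe(pf);           /* simulated concurrent reader */
	if (clear_late)
		pf->adev = NULL;      /* buggy: cleared only at the end */
	free(adev);
}
```

    With the buggy order the simulated reader sees a non-NULL but
    half-built device on both plug and unplug; with the fixed order it
    sees NULL, which it can handle safely.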
    
    Fix the (un)plugging functions so that the pf->adev field is set after
    aux device init and cleared before aux device destroy, and protect the
    pf->adev assignment with a new mutex. This mutex is also held during
    the ice_send_event_to_aux() call to ensure that the aux device stays
    valid for the duration of that call.
    Note that the device lock used in ice_send_event_to_aux() still needs
    to be kept to avoid a race with aux driver unload.
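    A user-space sketch of the locking scheme described above, assuming a
    POSIX pthread mutex in place of the kernel mutex. The names (struct
    toy_pf, adev_lock, toy_plug(), toy_unplug(), toy_send_event()) are
    hypothetical; this is not the actual patch, and the per-device lock
    that ice_send_event_to_aux() also takes is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical toy types; NOT struct ice_pf or the auxiliary device. */
struct toy_adev {
	bool ready;
	int events_seen;
};

struct toy_pf {
	pthread_mutex_t adev_lock; /* protects the adev pointer and its use */
	struct toy_adev *adev;
};

static void toy_pf_init(struct toy_pf *pf)
{
	pthread_mutex_init(&pf->adev_lock, NULL);
	pf->adev = NULL;
}

static int toy_plug(struct toy_pf *pf)
{
	struct toy_adev *adev = calloc(1, sizeof(*adev));

	if (!adev)
		return -1;
	adev->ready = true;             /* full init happens before publishing */
	pthread_mutex_lock(&pf->adev_lock);
	pf->adev = adev;                /* publish under the lock */
	pthread_mutex_unlock(&pf->adev_lock);
	return 0;
}

static void toy_unplug(struct toy_pf *pf)
{
	struct toy_adev *adev;

	pthread_mutex_lock(&pf->adev_lock);
	adev = pf->adev;
	pf->adev = NULL;                /* unpublish before teardown */
	pthread_mutex_unlock(&pf->adev_lock);
	if (!adev)
		return;
	adev->ready = false;            /* teardown outside the lock */
	free(adev);
}

/* Returns true if the event reached a plugged device. */
static bool toy_send_event(struct toy_pf *pf)
{
	bool delivered = false;

	pthread_mutex_lock(&pf->adev_lock);
	if (pf->adev) {
		/* The lock guarantees we never see a half-built device. */
		assert(pf->adev->ready);
		pf->adev->events_seen++;
		delivered = true;
	}
	pthread_mutex_unlock(&pf->adev_lock);
	return delivered;
}
```

    Teardown can run after the lock is dropped because senders only
    dereference the pointer while holding the same lock: once it has been
    cleared under the lock, no sender can still be using the old device.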
    
    Reproducer:
    cycle=1
    while :;do
            echo "#### Cycle: $cycle"
    
            ip link set ens7f0 mtu 9000
            ip link add bond0 type bond mode 1 miimon 100
            ip link set bond0 up
            ifenslave bond0 ens7f0
            ip link set bond0 mtu 9000
            ethtool -L ens7f0 combined 1
            ip link del bond0
            ip link set ens7f0 mtu 1500
            sleep 1
    
            let cycle++
    done
    
    In short, when the device is added to (removed from) a bond, the aux
    device is unplugged (plugged). When the MTU of the device is changed,
    an event is sent to the aux device asynchronously. This can race with
    the (un)plugging operation, and because pf->adev is set too early (plug)
    or cleared too late (unplug), ice_send_event_to_aux() can touch
    uninitialized or destroyed fields. In the case of the crash below, that
    field is pf->adev->dev.mutex.
    
    Crash:
    [   53.372066] bond0: (slave ens7f0): making interface the new active one
    [   53.378622] bond0: (slave ens7f0): Enslaving as an active interface with an up link
    [   53.386294] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
    [   53.549104] bond0: (slave ens7f1): Enslaving as a backup interface with an up link
    [   54.118906] ice 0000:ca:00.0 ens7f0: Number of in use tx queues changed invalidating tc mappings. Priority traffic classification disabled!
    [   54.233374] ice 0000:ca:00.1 ens7f1: Number of in use tx queues changed invalidating tc mappings. Priority traffic classification disabled!
    [   54.248204] bond0: (slave ens7f0): Releasing backup interface
    [   54.253955] bond0: (slave ens7f1): making interface the new active one
    [   54.274875] bond0: (slave ens7f1): Releasing backup interface
    [   54.289153] bond0 (unregistering): Released all slaves
    [   55.383179] MII link monitoring set to 100 ms
    [   55.398696] bond0: (slave ens7f0): making interface the new active one
    [   55.405241] BUG: kernel NULL pointer dereference, address: 0000000000000080
    [   55.405289] bond0: (slave ens7f0): Enslaving as an active interface with an up link
    [   55.412198] #PF: supervisor write access in kernel mode
    [   55.412200] #PF: error_code(0x0002) - not-present page
    [   55.412201] PGD 25d2ad067 P4D 0
    [   55.412204] Oops: 0002 [#1] PREEMPT SMP NOPTI
    [   55.412207] CPU: 0 PID: 403 Comm: kworker/0:2 Kdump: loaded Tainted: G S                5.17.0-13579-g57f2d6540f03 #1
    [   55.429094] bond0: (slave ens7f1): Enslaving as a backup interface with an up link
    [   55.430224] Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.4.4 10/07/2021
    [   55.430226] Workqueue: ice ice_service_task [ice]
    [   55.468169] RIP: 0010:mutex_unlock+0x10/0x20
    [   55.472439] Code: 0f b1 13 74 96 eb e0 4c 89 ee eb d8 e8 79 54 ff ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 65 48 8b 04 25 40 ef 01 00 31 d2 <f0> 48 0f b1 17 75 01 c3 e9 e3 fe ff ff 0f 1f 00 0f 1f 44 00 00 48
    [   55.491186] RSP: 0018:ff4454230d7d7e28 EFLAGS: 00010246
    [   55.496413] RAX: ff1a79b208b08000 RBX: ff1a79b2182e8880 RCX: 0000000000000001
    [   55.503545] RDX: 0000000000000000 RSI: ff4454230d7d7db0 RDI: 0000000000000080
    [   55.510678] RBP: ff1a79d1c7e48b68 R08: ff4454230d7d7db0 R09: 0000000000000041
    [   55.517812] R10: 00000000000000a5 R11: 00000000000006e6 R12: ff1a79d1c7e48bc0
    [   55.524945] R13: 0000000000000000 R14: ff1a79d0ffc305c0 R15: 0000000000000000
    [   55.532076] FS:  0000000000000000(0000) GS:ff1a79d0ffc00000(0000) knlGS:0000000000000000
    [   55.540163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [   55.545908] CR2: 0000000000000080 CR3: 00000003487ae003 CR4: 0000000000771ef0
    [   55.553041] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [   55.560173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [   55.567305] PKRU: 55555554
    [   55.570018] Call Trace:
    [   55.572474]  <TASK>
    [   55.574579]  ice_service_task+0xaab/0xef0 [ice]
    [   55.579130]  process_one_work+0x1c5/0x390
    [   55.583141]  ? process_one_work+0x390/0x390
    [   55.587326]  worker_thread+0x30/0x360
    [   55.590994]  ? process_one_work+0x390/0x390
    [   55.595180]  kthread+0xe6/0x110
    [   55.598325]  ? kthread_complete_and_exit+0x20/0x20
    [   55.603116]  ret_from_fork+0x1f/0x30
    [   55.606698]  </TASK>
    
    Fixes: f9f5301e ("ice: Register auxiliary device to provide RDMA")
    Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Ivan Vecera <ivecera@redhat.com>
    Reviewed-by: Dave Ertman <david.m.ertman@intel.com>
    Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
    Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>