1. 23 Feb, 2018 2 commits
    • Doug Ledford's avatar
      Merge branch 'k.o/for-rc' into k.o/wip/dl-for-next · 4bb46608
      Doug Ledford authored
      There is a 14 patch series waiting to come into for-next that has a
      dependecy on code submitted into this kernel's for-rc series.  So, merge
      the for-rc branch into the current for-next in order to make the patch
      series apply cleanly.
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      4bb46608
    • Doug Ledford's avatar
      Merge tag 'mlx5-updates-2018-02-21' of... · f76a5c75
      Doug Ledford authored
      Merge tag 'mlx5-updates-2018-02-21' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next
      
      mlx5-updates-2018-02-21
      
      This series includes shared code updates for mlx5 core driver for both
      netdev and rdma subsystems.
      
      By Saeed,
      First six patches of the series are meant to address a performance issue
      and should provide a performance boost for multi core IRQ interrupt hungry
      workloads.  The issue is fixed in the first patch, all other patches are
      meant to refactor the code in light of this fix.
      
      The problem it comes to fix, is a shared spinlock accessed across all HCA
      IRQs which protects the CQ database.  To solve this we simply move the CQ
      database and its spinlock to be per EQ (IRQ), thus per core.
      
      By Yonatan,
      Fragmented completion queue (CQ) for RDMA,
      core driver implementation to create fragmented CQ buffers rather than
      one large contiguous memory buffer, the implementation scheme already
      exist and used by the netdev CQs, the patch shares that code with the
      rdma CQ creation flow and makes use of the new API in mlx5_ib driver.
      
      Thanks,
      Saeed.
      
      Merged into rdma-next tree as well as net-next tree to prevent conflicts
      in future patches between the two trees.
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      f76a5c75
  2. 21 Feb, 2018 1 commit
    • Leon Romanovsky's avatar
      RDMA/uverbs: Fix kernel panic while using XRC_TGT QP type · f4576587
      Leon Romanovsky authored
      Attempt to modify XRC_TGT QP type from the user space (ibv_xsrq_pingpong
      invocation) will trigger the following kernel panic. It is caused by the
      fact that such QPs missed uobject initialization.
      
      [   17.408845] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
      [   17.412645] IP: rdma_lookup_put_uobject+0x9/0x50
      [   17.416567] PGD 0 P4D 0
      [   17.419262] Oops: 0000 [#1] SMP PTI
      [   17.422915] CPU: 0 PID: 455 Comm: ibv_xsrq_pingpo Not tainted 4.16.0-rc1+ #86
      [   17.424765] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [   17.427399] RIP: 0010:rdma_lookup_put_uobject+0x9/0x50
      [   17.428445] RSP: 0018:ffffb8c7401e7c90 EFLAGS: 00010246
      [   17.429543] RAX: 0000000000000000 RBX: ffffb8c7401e7cf8 RCX: 0000000000000000
      [   17.432426] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
      [   17.437448] RBP: 0000000000000000 R08: 00000000000218f0 R09: ffffffff8ebc4cac
      [   17.440223] R10: fffff6038052cd80 R11: ffff967694b36400 R12: ffff96769391f800
      [   17.442184] R13: ffffb8c7401e7cd8 R14: 0000000000000000 R15: ffff967699f60000
      [   17.443971] FS:  00007fc29207d700(0000) GS:ffff96769fc00000(0000) knlGS:0000000000000000
      [   17.446623] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   17.448059] CR2: 0000000000000048 CR3: 000000001397a000 CR4: 00000000000006b0
      [   17.449677] Call Trace:
      [   17.450247]  modify_qp.isra.20+0x219/0x2f0
      [   17.451151]  ib_uverbs_modify_qp+0x90/0xe0
      [   17.452126]  ib_uverbs_write+0x1d2/0x3c0
      [   17.453897]  ? __handle_mm_fault+0x93c/0xe40
      [   17.454938]  __vfs_write+0x36/0x180
      [   17.455875]  vfs_write+0xad/0x1e0
      [   17.456766]  SyS_write+0x52/0xc0
      [   17.457632]  do_syscall_64+0x75/0x180
      [   17.458631]  entry_SYSCALL_64_after_hwframe+0x21/0x86
      [   17.460004] RIP: 0033:0x7fc29198f5a0
      [   17.460982] RSP: 002b:00007ffccc71f018 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [   17.463043] RAX: ffffffffffffffda RBX: 0000000000000078 RCX: 00007fc29198f5a0
      [   17.464581] RDX: 0000000000000078 RSI: 00007ffccc71f050 RDI: 0000000000000003
      [   17.466148] RBP: 0000000000000000 R08: 0000000000000078 R09: 00007ffccc71f050
      [   17.467750] R10: 000055b6cf87c248 R11: 0000000000000246 R12: 00007ffccc71f300
      [   17.469541] R13: 000055b6cf8733a0 R14: 0000000000000000 R15: 0000000000000000
      [   17.471151] Code: 00 00 0f 1f 44 00 00 48 8b 47 48 48 8b 00 48 8b 40 10 e9 0b 8b 68 00 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 53 89 f5 <48> 8b 47 48 48 89 fb 40 0f b6 f6 48 8b 00 48 8b 40 20 e8 e0 8a
      [   17.475185] RIP: rdma_lookup_put_uobject+0x9/0x50 RSP: ffffb8c7401e7c90
      [   17.476841] CR2: 0000000000000048
      [   17.477764] ---[ end trace 1dbcc5354071a712 ]---
      [   17.478880] Kernel panic - not syncing: Fatal exception
      [   17.480277] Kernel Offset: 0xd000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      
      Fixes: 2f08ee36 ("RDMA/restrack: don't use uaccess_kernel()")
      Signed-off-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      f4576587
  3. 20 Feb, 2018 5 commits
  4. 16 Feb, 2018 2 commits
  5. 15 Feb, 2018 27 commits
    • Bart Van Assche's avatar
      IB/srp: Fix completion vector assignment algorithm · 3a148896
      Bart Van Assche authored
      Ensure that cv_end is equal to ibdev->num_comp_vectors for the
      NUMA node with the highest index. This patch improves spreading
      of RDMA channels over completion vectors and thereby improves
      performance, especially on systems with only a single NUMA node.
      This patch drops support for the comp_vector login parameter by
      ignoring the value of that parameter since I have not found a
      good way to combine support for that parameter and automatic
      spreading of RDMA channels over completion vectors.
      
      Fixes: d92c0da7 ("IB/srp: Add multichannel support")
      Reported-by: default avatarAlexander Schmid <alex@modula-shop-systems.de>
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Cc: Alexander Schmid <alex@modula-shop-systems.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      3a148896
    • Adit Ranadive's avatar
      RDMA/vmw_pvrdma: Fix usage of user response structures in ABI file · 1f5a6c47
      Adit Ranadive authored
      This ensures that we return the right structures back to userspace.
      Otherwise, it looks like the reserved fields in the response structures
      in userspace might have uninitialized data in them.
      
      Fixes: 8b10ba78 ("RDMA/vmw_pvrdma: Add shared receive queue support")
      Fixes: 29c8d9eb ("IB: Add vmw_pvrdma driver")
      Suggested-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: default avatarBryan Tan <bryantan@vmware.com>
      Reviewed-by: default avatarAditya Sarwade <asarwade@vmware.com>
      Reviewed-by: default avatarJorgen Hansen <jhansen@vmware.com>
      Signed-off-by: default avatarAdit Ranadive <aditr@vmware.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      1f5a6c47
    • Leon Romanovsky's avatar
      RDMA/uverbs: Sanitize user entered port numbers prior to access it · 5d4c05c3
      Leon Romanovsky authored
      ==================================================================
      BUG: KASAN: use-after-free in copy_ah_attr_from_uverbs+0x6f2/0x8c0
      Read of size 4 at addr ffff88006476a198 by task syzkaller697701/265
      
      CPU: 0 PID: 265 Comm: syzkaller697701 Not tainted 4.15.0+ #90
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      Call Trace:
       dump_stack+0xde/0x164
       ? dma_virt_map_sg+0x22c/0x22c
       ? show_regs_print_info+0x17/0x17
       ? lock_contended+0x11a0/0x11a0
       print_address_description+0x83/0x3e0
       kasan_report+0x18c/0x4b0
       ? copy_ah_attr_from_uverbs+0x6f2/0x8c0
       ? copy_ah_attr_from_uverbs+0x6f2/0x8c0
       ? lookup_get_idr_uobject+0x120/0x200
       ? copy_ah_attr_from_uverbs+0x6f2/0x8c0
       copy_ah_attr_from_uverbs+0x6f2/0x8c0
       ? modify_qp+0xd0e/0x1350
       modify_qp+0xd0e/0x1350
       ib_uverbs_modify_qp+0xf9/0x170
       ? ib_uverbs_query_qp+0xa70/0xa70
       ib_uverbs_write+0x7f9/0xef0
       ? attach_entity_load_avg+0x8b0/0x8b0
       ? ib_uverbs_query_qp+0xa70/0xa70
       ? uverbs_devnode+0x110/0x110
       ? cyc2ns_read_end+0x10/0x10
       ? print_irqtrace_events+0x280/0x280
       ? sched_clock_cpu+0x18/0x200
       ? _raw_spin_unlock_irq+0x29/0x40
       ? _raw_spin_unlock_irq+0x29/0x40
       ? _raw_spin_unlock_irq+0x29/0x40
       ? time_hardirqs_on+0x27/0x670
       __vfs_write+0x10d/0x700
       ? uverbs_devnode+0x110/0x110
       ? kernel_read+0x170/0x170
       ? _raw_spin_unlock_irq+0x29/0x40
       ? finish_task_switch+0x1bd/0x7a0
       ? finish_task_switch+0x194/0x7a0
       ? prandom_u32_state+0xe/0x180
       ? rcu_read_unlock+0x80/0x80
       ? security_file_permission+0x93/0x260
       vfs_write+0x1b0/0x550
       SyS_write+0xc7/0x1a0
       ? SyS_read+0x1a0/0x1a0
       ? trace_hardirqs_on_thunk+0x1a/0x1c
       entry_SYSCALL_64_fastpath+0x1e/0x8b
      RIP: 0033:0x433c29
      RSP: 002b:00007ffcf2be82a8 EFLAGS: 00000217
      
      Allocated by task 62:
       kasan_kmalloc+0xa0/0xd0
       kmem_cache_alloc+0x141/0x480
       dup_fd+0x101/0xcc0
       copy_process.part.62+0x166f/0x4390
       _do_fork+0x1cb/0xe90
       kernel_thread+0x34/0x40
       call_usermodehelper_exec_work+0x112/0x260
       process_one_work+0x929/0x1aa0
       worker_thread+0x5c6/0x12a0
       kthread+0x346/0x510
       ret_from_fork+0x3a/0x50
      
      Freed by task 259:
       kasan_slab_free+0x71/0xc0
       kmem_cache_free+0xf3/0x4c0
       put_files_struct+0x225/0x2c0
       exit_files+0x88/0xc0
       do_exit+0x67c/0x1520
       do_group_exit+0xe8/0x380
       SyS_exit_group+0x1e/0x20
       entry_SYSCALL_64_fastpath+0x1e/0x8b
      
      The buggy address belongs to the object at ffff88006476a000
       which belongs to the cache files_cache of size 832
      The buggy address is located 408 bytes inside of
       832-byte region [ffff88006476a000, ffff88006476a340)
      The buggy address belongs to the page:
      page:ffffea000191da80 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
      flags: 0x4000000000008100(slab|head)
      raw: 4000000000008100 0000000000000000 0000000000000000 0000000100080008
      raw: 0000000000000000 0000000100000001 ffff88006bcf7a80 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88006476a080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88006476a100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff88006476a180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                  ^
       ffff88006476a200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88006476a280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      ==================================================================
      
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: <stable@vger.kernel.org> # 4.11
      Fixes: 44c58487 ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
      Reported-by: default avatarNoa Osherovich <noaos@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      5d4c05c3
    • Leon Romanovsky's avatar
      RDMA/uverbs: Fix circular locking dependency · 1ff5325c
      Leon Romanovsky authored
      Avoid circular locking dependency by calling
      to uobj_alloc_commit() outside of xrcd_tree_mutex lock.
      
      ======================================================
      WARNING: possible circular locking dependency detected
      4.15.0+ #87 Not tainted
      ------------------------------------------------------
      syzkaller401056/269 is trying to acquire lock:
       (&uverbs_dev->xrcd_tree_mutex){+.+.}, at: [<000000006c12d2cd>] uverbs_free_xrcd+0xd2/0x360
      
      but task is already holding lock:
       (&ucontext->uobjects_lock){+.+.}, at: [<00000000da010f09>] uverbs_cleanup_ucontext+0x168/0x730
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (&ucontext->uobjects_lock){+.+.}:
             __mutex_lock+0x111/0x1720
             rdma_alloc_commit_uobject+0x22c/0x600
             ib_uverbs_open_xrcd+0x61a/0xdd0
             ib_uverbs_write+0x7f9/0xef0
             __vfs_write+0x10d/0x700
             vfs_write+0x1b0/0x550
             SyS_write+0xc7/0x1a0
             entry_SYSCALL_64_fastpath+0x1e/0x8b
      
      -> #0 (&uverbs_dev->xrcd_tree_mutex){+.+.}:
             lock_acquire+0x19d/0x440
             __mutex_lock+0x111/0x1720
             uverbs_free_xrcd+0xd2/0x360
             remove_commit_idr_uobject+0x6d/0x110
             uverbs_cleanup_ucontext+0x2f0/0x730
             ib_uverbs_cleanup_ucontext.constprop.3+0x52/0x120
             ib_uverbs_close+0xf2/0x570
             __fput+0x2cd/0x8d0
             task_work_run+0xec/0x1d0
             do_exit+0x6a1/0x1520
             do_group_exit+0xe8/0x380
             SyS_exit_group+0x1e/0x20
             entry_SYSCALL_64_fastpath+0x1e/0x8b
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&ucontext->uobjects_lock);
                                     lock(&uverbs_dev->xrcd_tree_mutex);
                                     lock(&ucontext->uobjects_lock);
        lock(&uverbs_dev->xrcd_tree_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by syzkaller401056/269:
       #0:  (&file->cleanup_mutex){+.+.}, at: [<00000000c9f0c252>] ib_uverbs_close+0xac/0x570
       #1:  (&ucontext->cleanup_rwsem){++++}, at: [<00000000b6994d49>] uverbs_cleanup_ucontext+0xf6/0x730
       #2:  (&ucontext->uobjects_lock){+.+.}, at: [<00000000da010f09>] uverbs_cleanup_ucontext+0x168/0x730
      
      stack backtrace:
      CPU: 0 PID: 269 Comm: syzkaller401056 Not tainted 4.15.0+ #87
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      Call Trace:
       dump_stack+0xde/0x164
       ? dma_virt_map_sg+0x22c/0x22c
       ? uverbs_cleanup_ucontext+0x168/0x730
       ? console_unlock+0x502/0xbd0
       print_circular_bug.isra.24+0x35e/0x396
       ? print_circular_bug_header+0x12e/0x12e
       ? find_usage_backwards+0x30/0x30
       ? entry_SYSCALL_64_fastpath+0x1e/0x8b
       validate_chain.isra.28+0x25d1/0x40c0
       ? check_usage+0xb70/0xb70
       ? graph_lock+0x160/0x160
       ? find_usage_backwards+0x30/0x30
       ? cyc2ns_read_end+0x10/0x10
       ? print_irqtrace_events+0x280/0x280
       ? __lock_acquire+0x93d/0x1630
       __lock_acquire+0x93d/0x1630
       lock_acquire+0x19d/0x440
       ? uverbs_free_xrcd+0xd2/0x360
       __mutex_lock+0x111/0x1720
       ? uverbs_free_xrcd+0xd2/0x360
       ? uverbs_free_xrcd+0xd2/0x360
       ? __mutex_lock+0x828/0x1720
       ? mutex_lock_io_nested+0x1550/0x1550
       ? uverbs_cleanup_ucontext+0x168/0x730
       ? __lock_acquire+0x9a9/0x1630
       ? mutex_lock_io_nested+0x1550/0x1550
       ? uverbs_cleanup_ucontext+0xf6/0x730
       ? lock_contended+0x11a0/0x11a0
       ? uverbs_free_xrcd+0xd2/0x360
       uverbs_free_xrcd+0xd2/0x360
       remove_commit_idr_uobject+0x6d/0x110
       uverbs_cleanup_ucontext+0x2f0/0x730
       ? sched_clock_cpu+0x18/0x200
       ? uverbs_close_fd+0x1c0/0x1c0
       ib_uverbs_cleanup_ucontext.constprop.3+0x52/0x120
       ib_uverbs_close+0xf2/0x570
       ? ib_uverbs_remove_one+0xb50/0xb50
       ? ib_uverbs_remove_one+0xb50/0xb50
       __fput+0x2cd/0x8d0
       task_work_run+0xec/0x1d0
       do_exit+0x6a1/0x1520
       ? fsnotify_first_mark+0x220/0x220
       ? exit_notify+0x9f0/0x9f0
       ? entry_SYSCALL_64_fastpath+0x5/0x8b
       ? entry_SYSCALL_64_fastpath+0x5/0x8b
       ? trace_hardirqs_on_thunk+0x1a/0x1c
       ? time_hardirqs_on+0x27/0x670
       ? time_hardirqs_off+0x27/0x490
       ? syscall_return_slowpath+0x6c/0x460
       ? entry_SYSCALL_64_fastpath+0x5/0x8b
       do_group_exit+0xe8/0x380
       SyS_exit_group+0x1e/0x20
       entry_SYSCALL_64_fastpath+0x1e/0x8b
      RIP: 0033:0x431ce9
      
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: <stable@vger.kernel.org> # 4.11
      Fixes: fd3c7904 ("IB/core: Change idr objects to use the new schema")
      Reported-by: default avatarNoa Osherovich <noaos@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      1ff5325c
    • Leon Romanovsky's avatar
      RDMA/uverbs: Fix bad unlock balance in ib_uverbs_close_xrcd · 5c2e1c4f
      Leon Romanovsky authored
      There is no matching lock for this mutex. Git history suggests this is
      just a missed remnant from an earlier version of the function before
      this locking was moved into uverbs_free_xrcd.
      
      Originally this lock was protecting the xrcd_table_delete()
      
      =====================================
      WARNING: bad unlock balance detected!
      4.15.0+ #87 Not tainted
      -------------------------------------
      syzkaller223405/269 is trying to release lock (&uverbs_dev->xrcd_tree_mutex) at:
      [<00000000b8703372>] ib_uverbs_close_xrcd+0x195/0x1f0
      but there are no more locks to release!
      
      other info that might help us debug this:
      1 lock held by syzkaller223405/269:
       #0:  (&uverbs_dev->disassociate_srcu){....}, at: [<000000005af3b960>] ib_uverbs_write+0x265/0xef0
      
      stack backtrace:
      CPU: 0 PID: 269 Comm: syzkaller223405 Not tainted 4.15.0+ #87
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      Call Trace:
       dump_stack+0xde/0x164
       ? dma_virt_map_sg+0x22c/0x22c
       ? ib_uverbs_write+0x265/0xef0
       ? console_unlock+0x502/0xbd0
       ? ib_uverbs_close_xrcd+0x195/0x1f0
       print_unlock_imbalance_bug+0x131/0x160
       lock_release+0x59d/0x1100
       ? ib_uverbs_close_xrcd+0x195/0x1f0
       ? lock_acquire+0x440/0x440
       ? lock_acquire+0x440/0x440
       __mutex_unlock_slowpath+0x88/0x670
       ? wait_for_completion+0x4c0/0x4c0
       ? rdma_lookup_get_uobject+0x145/0x2f0
       ib_uverbs_close_xrcd+0x195/0x1f0
       ? ib_uverbs_open_xrcd+0xdd0/0xdd0
       ib_uverbs_write+0x7f9/0xef0
       ? cyc2ns_read_end+0x10/0x10
       ? ib_uverbs_open_xrcd+0xdd0/0xdd0
       ? uverbs_devnode+0x110/0x110
       ? cyc2ns_read_end+0x10/0x10
       ? cyc2ns_read_end+0x10/0x10
       ? sched_clock_cpu+0x18/0x200
       __vfs_write+0x10d/0x700
       ? uverbs_devnode+0x110/0x110
       ? kernel_read+0x170/0x170
       ? __fget+0x358/0x5d0
       ? security_file_permission+0x93/0x260
       vfs_write+0x1b0/0x550
       SyS_write+0xc7/0x1a0
       ? SyS_read+0x1a0/0x1a0
       ? trace_hardirqs_on_thunk+0x1a/0x1c
       entry_SYSCALL_64_fastpath+0x1e/0x8b
      RIP: 0033:0x4335c9
      
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: <stable@vger.kernel.org> # 4.11
      Fixes: fd3c7904 ("IB/core: Change idr objects to use the new schema")
      Reported-by: default avatarNoa Osherovich <noaos@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      5c2e1c4f
    • Leon Romanovsky's avatar
      RDMA/restrack: Increment CQ restrack object before committing · 0cba0efc
      Leon Romanovsky authored
      Once the uobj is committed it is immediately possible another thread
      could destroy it, which worst case, can result in a use-after-free
      of the restrack objects.
      
      Cc: syzkaller <syzkaller@googlegroups.com>
      Fixes: 08f294a1 ("RDMA/core: Add resource tracking for create and destroy CQs")
      Reported-by: default avatarNoa Osherovich <noaos@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      0cba0efc
    • Leon Romanovsky's avatar
      RDMA/uverbs: Protect from command mask overflow · 3f802b16
      Leon Romanovsky authored
      The command number is not bounds checked against the command mask before it
      is shifted, resulting in an ubsan hit. This does not cause malfunction since
      the command number is eventually bounds checked, but we can make this ubsan
      clean by moving the bounds check to before the mask check.
      
      ================================================================================
      UBSAN: Undefined behaviour in
      drivers/infiniband/core/uverbs_main.c:647:21
      shift exponent 207 is too large for 64-bit type 'long long unsigned int'
      CPU: 0 PID: 446 Comm: syz-executor3 Not tainted 4.15.0-rc2+ #61
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      Call Trace:
      dump_stack+0xde/0x164
      ? dma_virt_map_sg+0x22c/0x22c
      ubsan_epilogue+0xe/0x81
      __ubsan_handle_shift_out_of_bounds+0x293/0x2f7
      ? debug_check_no_locks_freed+0x340/0x340
      ? __ubsan_handle_load_invalid_value+0x19b/0x19b
      ? lock_acquire+0x440/0x440
      ? lock_acquire+0x19d/0x440
      ? __might_fault+0xf4/0x240
      ? ib_uverbs_write+0x68d/0xe20
      ib_uverbs_write+0x68d/0xe20
      ? __lock_acquire+0xcf7/0x3940
      ? uverbs_devnode+0x110/0x110
      ? cyc2ns_read_end+0x10/0x10
      ? sched_clock_cpu+0x18/0x200
      ? sched_clock_cpu+0x18/0x200
      __vfs_write+0x10d/0x700
      ? uverbs_devnode+0x110/0x110
      ? kernel_read+0x170/0x170
      ? __fget+0x35b/0x5d0
      ? security_file_permission+0x93/0x260
      vfs_write+0x1b0/0x550
      SyS_write+0xc7/0x1a0
      ? SyS_read+0x1a0/0x1a0
      ? trace_hardirqs_on_thunk+0x1a/0x1c
      entry_SYSCALL_64_fastpath+0x18/0x85
      RIP: 0033:0x448e29
      RSP: 002b:00007f033f567c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 00007f033f5686bc RCX: 0000000000448e29
      RDX: 0000000000000060 RSI: 0000000020001000 RDI: 0000000000000012
      RBP: 000000000070bea0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000000056a0 R14: 00000000006e8740 R15: 0000000000000000
      ================================================================================
      
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: <stable@vger.kernel.org> # 4.5
      Fixes: 2dbd5186 ("IB/core: IB/core: Allow legacy verbs through extended interfaces")
      Reported-by: default avatarNoa Osherovich <noaos@mellanox.com>
      Reviewed-by: default avatarMatan Barak <matanb@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      3f802b16
    • Jason Gunthorpe's avatar
      IB/uverbs: Fix unbalanced unlock on error path for rdma_explicit_destroy · ec6f8401
      Jason Gunthorpe authored
      If remove_commit fails then the lock is left locked while the uobj still
      exists. Eventually the kernel will deadlock.
      
      lockdep detects this and says:
      
       test/4221 is leaving the kernel with locks still held!
       1 lock held by test/4221:
        #0:  (&ucontext->cleanup_rwsem){.+.+}, at: [<000000001e5c7523>] rdma_explicit_destroy+0x37/0x120 [ib_uverbs]
      
      Fixes: 4da70da2 ("IB/core: Explicitly destroy an object while keeping uobject")
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Reviewed-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      ec6f8401
    • Jason Gunthorpe's avatar
      IB/uverbs: Improve lockdep_check · 104f268d
      Jason Gunthorpe authored
      This is really being used as an assert that the expected usecnt
      is being held and implicitly that the usecnt is valid. Rename it to
      assert_uverbs_usecnt and tighten the checks to only accept valid
      values of usecnt (eg 0 and < -1 are invalid).
      
      The tigher checkes make the assertion cover more cases and is more
      likely to find bugs via syzkaller/etc.
      
      Fixes: 38321256 ("IB/core: Add support for idr types")
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      104f268d
    • Leon Romanovsky's avatar
      RDMA/uverbs: Protect from races between lookup and destroy of uobjects · 6623e3e3
      Leon Romanovsky authored
      The race is between lookup_get_idr_uobject and
      uverbs_idr_remove_uobj -> uverbs_uobject_put.
      
      We deliberately do not call sychronize_rcu after the idr_remove in
      uverbs_idr_remove_uobj for performance reasons, instead we call
      kfree_rcu() during uverbs_uobject_put.
      
      However, this means we can obtain pointers to uobj's that have
      already been released and must protect against krefing them
      using kref_get_unless_zero.
      
      ==================================================================
      BUG: KASAN: use-after-free in copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
      Read of size 4 at addr ffff88005fda1ac8 by task syz-executor2/441
      
      CPU: 1 PID: 441 Comm: syz-executor2 Not tainted 4.15.0-rc2+ #56
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      Call Trace:
      dump_stack+0x8d/0xd4
      print_address_description+0x73/0x290
      kasan_report+0x25c/0x370
      ? copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
      copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
      ? uverbs_try_lock_object+0x68/0xc0
      ? modify_qp.isra.7+0xdc4/0x10e0
      modify_qp.isra.7+0xdc4/0x10e0
      ib_uverbs_modify_qp+0xfe/0x170
      ? ib_uverbs_query_qp+0x970/0x970
      ? __lock_acquire+0xa11/0x1da0
      ib_uverbs_write+0x55a/0xad0
      ? ib_uverbs_query_qp+0x970/0x970
      ? ib_uverbs_query_qp+0x970/0x970
      ? ib_uverbs_open+0x760/0x760
      ? futex_wake+0x147/0x410
      ? sched_clock_cpu+0x18/0x180
      ? check_prev_add+0x1680/0x1680
      ? do_futex+0x3b6/0xa30
      ? sched_clock_cpu+0x18/0x180
      __vfs_write+0xf7/0x5c0
      ? ib_uverbs_open+0x760/0x760
      ? kernel_read+0x110/0x110
      ? lock_acquire+0x370/0x370
      ? __fget+0x264/0x3b0
      vfs_write+0x18a/0x460
      SyS_write+0xc7/0x1a0
      ? SyS_read+0x1a0/0x1a0
      ? trace_hardirqs_on_thunk+0x1a/0x1c
      entry_SYSCALL_64_fastpath+0x18/0x85
      RIP: 0033:0x448e29
      RSP: 002b:00007f443fee0c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 00007f443fee16bc RCX: 0000000000448e29
      RDX: 0000000000000078 RSI: 00000000209f8000 RDI: 0000000000000012
      RBP: 000000000070bea0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 0000000000008e98 R14: 00000000006ebf38 R15: 0000000000000000
      
      Allocated by task 1:
      kmem_cache_alloc_trace+0x16c/0x2f0
      mlx5_alloc_cmd_msg+0x12e/0x670
      cmd_exec+0x419/0x1810
      mlx5_cmd_exec+0x40/0x70
      mlx5_core_mad_ifc+0x187/0x220
      mlx5_MAD_IFC+0xd7/0x1b0
      mlx5_query_mad_ifc_gids+0x1f3/0x650
      mlx5_ib_query_gid+0xa4/0xc0
      ib_query_gid+0x152/0x1a0
      ib_query_port+0x21e/0x290
      mlx5_port_immutable+0x30f/0x490
      ib_register_device+0x5dd/0x1130
      mlx5_ib_add+0x3e7/0x700
      mlx5_add_device+0x124/0x510
      mlx5_register_interface+0x11f/0x1c0
      mlx5_ib_init+0x56/0x61
      do_one_initcall+0xa3/0x250
      kernel_init_freeable+0x309/0x3b8
      kernel_init+0x14/0x180
      ret_from_fork+0x24/0x30
      
      Freed by task 1:
      kfree+0xeb/0x2f0
      mlx5_free_cmd_msg+0xcd/0x140
      cmd_exec+0xeba/0x1810
      mlx5_cmd_exec+0x40/0x70
      mlx5_core_mad_ifc+0x187/0x220
      mlx5_MAD_IFC+0xd7/0x1b0
      mlx5_query_mad_ifc_gids+0x1f3/0x650
      mlx5_ib_query_gid+0xa4/0xc0
      ib_query_gid+0x152/0x1a0
      ib_query_port+0x21e/0x290
      mlx5_port_immutable+0x30f/0x490
      ib_register_device+0x5dd/0x1130
      mlx5_ib_add+0x3e7/0x700
      mlx5_add_device+0x124/0x510
      mlx5_register_interface+0x11f/0x1c0
      mlx5_ib_init+0x56/0x61
      do_one_initcall+0xa3/0x250
      kernel_init_freeable+0x309/0x3b8
      kernel_init+0x14/0x180
      ret_from_fork+0x24/0x30
      
      The buggy address belongs to the object at ffff88005fda1ab0
      which belongs to the cache kmalloc-32 of size 32
      The buggy address is located 24 bytes inside of
      32-byte region [ffff88005fda1ab0, ffff88005fda1ad0)
      The buggy address belongs to the page:
      page:00000000d5655c19 count:1 mapcount:0 mapping: (null)
      index:0xffff88005fda1fc0
      flags: 0x4000000000000100(slab)
      raw: 4000000000000100 0000000000000000 ffff88005fda1fc0 0000000180550008
      raw: ffffea00017f6780 0000000400000004 ffff88006c803980 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
      ffff88005fda1980: fc fc fb fb fb fb fc fc fb fb fb fb fc fc fb fb
      ffff88005fda1a00: fb fb fc fc fb fb fb fb fc fc 00 00 00 00 fc fc
      ffff88005fda1a80: fb fb fb fb fc fc fb fb fb fb fc fc fb fb fb fb
      ffff88005fda1b00: fc fc 00 00 00 00 fc fc fb fb fb fb fc fc fb fb
      ffff88005fda1b80: fb fb fc fc fb fb fb fb fc fc fb fb fb fb fc fc
      ==================================================================@
      
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: <stable@vger.kernel.org> # 4.11
      Fixes: 38321256 ("IB/core: Add support for idr types")
      Reported-by: default avatarNoa Osherovich <noaos@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      6623e3e3
    • Jason Gunthorpe's avatar
      IB/uverbs: Hold the uobj write lock after allocate · d9dc7a35
      Jason Gunthorpe authored
      This clarifies the design intention that time between allocate and
      commit has the uobj exclusive to the caller. We already guarantee
      this by delaying publishing the uobj pointer via idr_insert,
      fd_install, list_add, etc.
      
      Additionally holding the usecnt lock during this period provides
      extra clarity and more protection against future mistakes.
      
      Fixes: 38321256 ("IB/core: Add support for idr types")
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      d9dc7a35
    • Matan Barak's avatar
      IB/uverbs: Fix possible oops with duplicate ioctl attributes · 4d39a959
      Matan Barak authored
      If the same attribute is listed twice by the user in the ioctl attribute
      list then error unwind can cause the kernel to deref garbage.
      
      This happens when an object with WRITE access is sent twice. The second
      parse properly fails but corrupts the state required for the error unwind
      it triggers.
      
      Fixing this by making duplicates in the attribute list invalid. This is
      not something we need to support.
      
      The ioctl interface is currently recommended to be disabled in kConfig.
      Signed-off-by: default avatarMatan Barak <matanb@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      4d39a959
    • Matan Barak's avatar
      IB/uverbs: Add ioctl support for 32bit processes · 9dfb2ff4
      Matan Barak authored
      32 bit processes running on a 64 bit kernel call compat_ioctl so that
      implementations can revise any structure layout issues. Point compat_ioctl
      at our normal ioctl because:
      
      - All our structures are designed to be the same on 32 and 64 bit, ie we
        use __aligned_u64 when required and are careful to manage padding.
      
      - Any pointers are stored in u64's and userspace is expected
        to prepare them properly.
      Signed-off-by: default avatarMatan Barak <matanb@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      9dfb2ff4
    • Jason Gunthorpe's avatar
      IB/uverbs: Use __aligned_u64 for uapi headers · 5d2beb57
      Jason Gunthorpe authored
      This has no impact on the structure layout since these structs already
      have their u64s already properly aligned, but it does document that we
      have this requirement for 32 bit compatibility.
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Reviewed-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      5d2beb57
    • Matan Barak's avatar
      IB/uverbs: Fix method merging in uverbs_ioctl_merge · 3d89459e
      Matan Barak authored
      Fix a bug in uverbs_ioctl_merge that looked at the object's iterator
      number instead of the method's iterator number when merging methods.
      
      While we're at it, make the uverbs_ioctl_merge code a bit more clear
      and faster.
      
      Fixes: 118620d3 ('IB/core: Add uverbs merge trees functionality')
      Signed-off-by: default avatarMatan Barak <matanb@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      3d89459e
    • Jason Gunthorpe's avatar
      IB/uverbs: Use u64_to_user_ptr() not a union · 2f36028c
      Jason Gunthorpe authored
      The union approach will get the endianness wrong sometimes if the kernel's
      pointer size is 32 bits resulting in EFAULTs when trying to copy to/from
      user.
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Reviewed-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      2f36028c
    • Jason Gunthorpe's avatar
      IB/uverbs: Use inline data transfer for UHW_IN · 6c976c30
      Jason Gunthorpe authored
      The rule for the API is pointers less than 8 bytes are inlined into
      the .data field of the attribute. Fix the creation of the driver udata
      struct to follow this rule and point to the .data itself when the size
      is less than 8 bytes.
      
      Otherwise if the UHW struct is less than 8 bytes the driver will get
      EFAULT during copy_from_user.
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      6c976c30
    • Matan Barak's avatar
      IB/uverbs: Always use the attribute size provided by the user · 89d9e8d3
      Matan Barak authored
      This fixes several bugs around the copy_to/from user path:
       - copy_to used the user provided size of the attribute
         and could copy data beyond the end of the kernel buffer into
         userspace.
       - copy_from didn't know the size of the kernel buffer and
         could have left kernel memory unexpectedly un-initialized.
       - copy_from did not use the user length to determine if the
         attribute data is inlined or not.
      Signed-off-by: default avatarMatan Barak <matanb@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      89d9e8d3
    • Leon Romanovsky's avatar
      RDMA/restrack: Remove unimplemented XRCD object · 415bb699
      Leon Romanovsky authored
      Resource tracking of XRCD objects is not implemented in current
      version of restrack and hence can be removed.
      
      Fixes: 02d8883f ("RDMA/restrack: Add general infrastructure to track RDMA resources")
      Signed-off-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      415bb699
    • Alaa Hleihel's avatar
      IB/ipoib: Do not warn if IPoIB debugfs doesn't exist · 14fa91e0
      Alaa Hleihel authored
      netdev_wait_allrefs() could rebroadcast NETDEV_UNREGISTER event
      multiple times until all refs are gone, which will result in calling
      ipoib_delete_debug_files multiple times and printing a warning.
      
      Remove the WARN_ONCE since checks of NULL pointers before calling
      debugfs_remove are not needed.
      
      Fixes: 771a5258 ("IB/IPoIB: ibX: failed to create mcg debug file")
      Signed-off-by: default avatarAlaa Hleihel <alaa@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Reviewed-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      14fa91e0
    • Yonatan Cohen's avatar
      IB/mlx5: Implement fragmented completion queue (CQ) · 388ca8be
      Yonatan Cohen authored
      The current implementation of create CQ requires contiguous
      memory, such requirement is problematic once the memory is
      fragmented or the system is low in memory, it causes for
      failures in dma_zalloc_coherent().
      
      This patch implements new scheme of fragmented CQ to overcome
      this issue by introducing new type: 'struct mlx5_frag_buf_ctrl'
      to allocate fragmented buffers, rather than contiguous ones.
      
      Base the Completion Queues (CQs) on this new fragmented buffer.
      
      It fixes following crashes:
      kworker/29:0: page allocation failure: order:6, mode:0x80d0
      CPU: 29 PID: 8374 Comm: kworker/29:0 Tainted: G OE 3.10.0
      Workqueue: ib_cm cm_work_handler [ib_cm]
      Call Trace:
      [<>] dump_stack+0x19/0x1b
      [<>] warn_alloc_failed+0x110/0x180
      [<>] __alloc_pages_slowpath+0x6b7/0x725
      [<>] __alloc_pages_nodemask+0x405/0x420
      [<>] dma_generic_alloc_coherent+0x8f/0x140
      [<>] x86_swiotlb_alloc_coherent+0x21/0x50
      [<>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core]
      [<>] ? mlx5_db_alloc_node+0x69/0x1b0 [mlx5_core]
      [<>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core]
      [<>] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
      [<>] create_cq_kernel+0x90/0x1f0 [mlx5_ib]
      [<>] mlx5_ib_create_cq+0x3b0/0x4e0 [mlx5_ib]
      Signed-off-by: default avatarYonatan Cohen <yonatanc@mellanox.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      388ca8be
    • Saeed Mahameed's avatar
      net/mlx5: Remove redundant EQ API exports · 3ec5693b
      Saeed Mahameed authored
      EQ structure and API is private to mlx5_core driver only, external
      drivers should not have access or the means to manipulate EQ objects.
      
      Remove redundant exports and move API functions out of the linux/mlx5
      include directory into the driver's mlx5_core.h private include file.
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: default avatarGal Pressman <galp@mellanox.com>
      3ec5693b
    • Saeed Mahameed's avatar
      net/mlx5: Move CQ completion and event forwarding logic to eq.c · 3ac7afdb
      Saeed Mahameed authored
      Since CQ tree is now per EQ, CQ completion and event forwarding became
      specific implementation of EQ logic, this patch moves that logic to eq.c
      and makes those functions static.
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: default avatarGal Pressman <galp@mellanox.com>
      3ac7afdb
    • Saeed Mahameed's avatar
      net/mlx5: CQ hold/put API · f105b45b
      Saeed Mahameed authored
      Now as the CQ table is per EQ, add an API to hold/put CQ to be used from
      eq.c in downstream patch.
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: default avatarGal Pressman <galp@mellanox.com>
      f105b45b
    • Saeed Mahameed's avatar
      net/mlx5: EQ add/del CQ API · d5c07157
      Saeed Mahameed authored
      Add API to add/del CQ to/from EQs CQ table to be used in cq.c upon CQ
      creation/destruction, as CQ table is now private to eq.c.
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: default avatarGal Pressman <galp@mellanox.com>
      d5c07157
    • Saeed Mahameed's avatar
      net/mlx5: Add missing likely/unlikely hints to cq events · d2ff4fa5
      Saeed Mahameed authored
      If a hardware event is targeting a CQ, that CQ should exist.
      Add unlikely to error handling flows.
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: default avatarGal Pressman <galp@mellanox.com>
      d2ff4fa5
    • Saeed Mahameed's avatar
      net/mlx5: CQ Database per EQ · 02d92f79
      Saeed Mahameed authored
      Before this patch the driver had one CQ database protected via one
      spinlock, this spinlock is meant to synchronize between CQ
      adding/removing and CQ IRQ interrupt handling.
      
      On a system with large number of CPUs and on a work load that requires
      lots of interrupts, this global spinlock becomes a very nasty hotspot
      and introduces a contention between the active cores, which will
      significantly hurt performance and becomes a bottleneck that prevents
      seamless cpu scaling.
      
      To solve this we simply move the CQ database and its spinlock to be per
      EQ (IRQ), thus per core.
      
      Tested with:
      system: 2 sockets, 14 cores per socket, hyperthreading, 2x14x2=56 cores
      netperf command: ./super_netperf 200 -P 0 -t TCP_RR  -H <server> -l 30 -- -r 300,300 -o -s 1M,1M -S 1M,1M
      
      WITHOUT THIS PATCH:
      Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft %steal  %guest  %gnice   %idle
      Average:     all    4.32    0.00   36.15    0.09    0.00   34.02   0.00    0.00    0.00   25.41
      
      Samples: 2M of event 'cycles:pp', Event count (approx.): 1554616897271
      Overhead  Command          Shared Object                 Symbol
      +   14.28%  swapper          [kernel.vmlinux]              [k] intel_idle
      +   12.25%  swapper          [kernel.vmlinux]              [k] queued_spin_lock_slowpath
      +   10.29%  netserver        [kernel.vmlinux]              [k] queued_spin_lock_slowpath
      +    1.32%  netserver        [kernel.vmlinux]              [k] mlx5e_xmit
      
      WITH THIS PATCH:
      Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
      Average:     all    4.27    0.00   34.31    0.01    0.00   18.71    0.00    0.00    0.00   42.69
      
      Samples: 2M of event 'cycles:pp', Event count (approx.): 1498132937483
      Overhead  Command          Shared Object             Symbol
      +   23.33%  swapper          [kernel.vmlinux]          [k] intel_idle
      +    1.69%  netserver        [kernel.vmlinux]          [k] mlx5e_xmit
      Tested-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: default avatarGal Pressman <galp@mellanox.com>
      02d92f79
  6. 14 Feb, 2018 3 commits