24 Jun, 2019: 5 commits
    • RDMA/mlx5: Refactor MR descriptors allocation · 7796d2a3
      Max Gurtovoy authored
      Improve code readability by using a static helper for each memory
      region type. Reuse the common logic to get smaller functions that are
      easier to maintain, and to reduce code duplication.
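      A minimal sketch of the resulting shape (the helper names below are
      illustrative, not the driver's actual symbols): one static helper owns
      the shared allocation logic, and each memory region type gets a thin
      wrapper that passes in only what differs.

        #include <linux/mlx5/qp.h>       /* struct mlx5_mtt, mlx5_klm */
        #include <linux/mlx5/mlx5_ifc.h> /* MLX5_MKC_ACCESS_MODE_* */

        struct mlx5_ib_mr;               /* driver-private MR type */

        /* Illustrative shape only: not the driver's actual symbols. */
        static int alloc_mr_descs(struct mlx5_ib_mr *mr, int ndescs,
                                  int desc_size, int access_mode)
        {
                /*
                 * Shared logic for every MR type: allocate the descriptor
                 * array and set up the mkey context with the given
                 * descriptor size and access mode.
                 */
                return 0; /* stubbed in this sketch */
        }

        static int alloc_mtt_mr_descs(struct mlx5_ib_mr *mr, int ndescs)
        {
                return alloc_mr_descs(mr, ndescs, sizeof(struct mlx5_mtt),
                                      MLX5_MKC_ACCESS_MODE_MTT);
        }

        static int alloc_klm_mr_descs(struct mlx5_ib_mr *mr, int ndescs)
        {
                return alloc_mr_descs(mr, ndescs, sizeof(struct mlx5_klm),
                                      MLX5_MKC_ACCESS_MODE_KLMS);
        }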
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Use PA mapping for PI handover · 2563e2f3
      Max Gurtovoy authored
      If possible, avoid doing a UMR operation to register the data and
      protection buffers (via MTT/KLM mkeys). Instead, use the local DMA key
      and map the SG lists using PA access. This is safe, since the internal
      keys for data and protection are never exposed to the remote server
      (only the signature key might be exposed). If PA mapping is not
      possible, perform the mapping using MTT/KLM descriptors.
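      The mapping choice above amounts to a tiered fallback. A rough,
      self-contained decision sketch (illustrative only, not the driver's
      actual code or symbols):

        enum pi_map_type { PI_MAP_PA, PI_MAP_MTT, PI_MAP_KLM };

        /* Illustrative decision only, not the driver's actual code. */
        static enum pi_map_type pick_pi_mapping(int data_nents, int meta_nents,
                                                int sg_has_gaps)
        {
                /*
                 * PA path: each SG list collapsed to a single DMA segment,
                 * so the local DMA key addresses it directly and no UMR is
                 * posted; the internal keys never leave the local HCA.
                 */
                if (data_nents == 1 && meta_nents == 1)
                        return PI_MAP_PA;

                /* flat MTT translation when the layout is gap free ... */
                if (!sg_has_gaps)
                        return PI_MAP_MTT;

                /* ... indirect KLM descriptors as the last resort */
                return PI_MAP_KLM;
        }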
      
      The setup of the tested benchmark (using iSER ULP):
       - 2 servers with 24 cores (1 initiator and 1 target)
       - ConnectX-4/ConnectX-5 adapters
       - 24 target sessions with 1 LUN each
       - ramdisk backstore
       - PI active
      
      Performance results running fio (24 jobs, 128 iodepth) using
      write_generate=1 and read_verify=1 (w/w.o patch):
      
      bs      IOPS(read)        IOPS(write)
      ----    ----------        ----------
      512   1266.4K/1262.4K    1720.1K/1732.1K
      4k    793139/570902      1129.6K/773982
      32k   72660/72086        97229/96164
      
      Using write_generate=0 and read_verify=0 (w/w.o patch):
      bs      IOPS(read)        IOPS(write)
      ----    ----------        ----------
      512   1590.2K/1600.1K    1828.2K/1830.3K
      4k    1078.1K/937272     1142.1K/815304
      32k   77012/77369        98125/97435
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Suggested-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Improve PI handover performance · de0ae958
      Israel Rukshin authored
      Under some workloads there is a performance degradation when using a
      KLM mkey instead of an MTT mkey, because KLM descriptors are accessed
      via an indirection that may require more HW resources and cycles.
      A KLM descriptor is not necessary when there are no gaps in the
      data/metadata SG lists, so as an optimization use an MTT mkey whenever
      possible. To that end, allocate an internal MTT mkey as well, and
      choose the effective pi_mr per transaction according to the required
      mapping scheme.
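      The choice hinges on whether the SG list can be described by flat,
      page-sized translation entries. A sketch of such a gap check (an
      assumed helper for illustration, not necessarily the driver's exact
      test): interior misalignment forces the indirect KLM format,
      otherwise MTT suffices.

        #include <linux/scatterlist.h>

        /* Assumed helper for illustration; pg_sz is a power of two. */
        static bool sg_list_has_gaps(struct scatterlist *sgl, int nents,
                                     unsigned int pg_sz)
        {
                struct scatterlist *sg;
                int i;

                for_each_sg(sgl, sg, nents, i) {
                        /* interior elements must start page aligned ... */
                        if (i && sg->offset & (pg_sz - 1))
                                return true;
                        /* ... and all but the last must end page aligned */
                        if (i < nents - 1 &&
                            (sg->offset + sg->length) & (pg_sz - 1))
                                return true;
                }
                return false;
        }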
      
      The setup of the tested benchmark (using iSER ULP):
       - 2 servers with 24 cores (1 initiator and 1 target)
       - ConnectX-4/ConnectX-5 adapters
       - 24 target sessions with 1 LUN each
       - ramdisk backstore
       - PI active
      
      Performance results running fio (24 jobs, 128 iodepth) using
      write_generate=1 and read_verify=1 (w/w.o/baseline):
      
      bs      IOPS(read)                IOPS(write)
      ----    ----------                ----------
      512   1262.4K/1243.3K/1147.1K    1732.1K/1725.1K/1423.8K
      4k    570902/571233/457874       773982/743293/642080
      32k   72086/72388/71933          96164/71789/93249
      
      Using write_generate=0 and read_verify=0 (w/w.o/baseline):
      bs      IOPS(read)                IOPS(write)
      ----    ----------                ----------
      512   1600.1K/1572.1K/1393.3K    1830.3K/1823.5K/1557.2K
      4k    937272/921992/762934       815304/753772/646071
      32k   77369/75052/72058          97435/73180/94612
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Suggested-by: Max Gurtovoy <maxg@mellanox.com>
      Suggested-by: Idan Burstein <idanb@mellanox.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Remove unused IB_WR_REG_SIG_MR code · 5c171cbe
      Israel Rukshin authored
      IB_WR_REG_SIG_MR is no longer needed now that IB_WR_REG_MR_INTEGRITY
      is used instead.
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/rw: Use IB_WR_REG_MR_INTEGRITY for PI handover · e9a53e73
      Israel Rukshin authored
      Replace the old signature handover API with the new one. The new API
      simplifies the PI handover code of ULPs and improves performance. For
      the RW API it reduces the maximum number of work requests per task and
      removes the need to deal with multiple MRs (and their registrations
      and invalidations) per task. All mapping and registration of the data
      and protection buffers is done by the LLD using a single WR and a
      special MR type (IB_MR_TYPE_INTEGRITY) for the PI handover operation.
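      For a ULP the new flow is one map call plus one WR. A minimal sketch
      using the new API (the MR is assumed to come from
      ib_alloc_mr_integrity() with mr->sig_attrs already filled in; CQE and
      signaling plumbing are omitted; reg_pi_mr is a name picked for this
      sketch):

        #include <rdma/ib_verbs.h>

        static int reg_pi_mr(struct ib_qp *qp, struct ib_mr *mr,
                             struct scatterlist *data_sg, int data_nents,
                             struct scatterlist *meta_sg, int meta_nents)
        {
                struct ib_reg_wr reg_wr = {};
                int n;

                /* one call maps both the data and the metadata SG lists */
                n = ib_map_mr_sg_pi(mr, data_sg, data_nents, NULL,
                                    meta_sg, meta_nents, NULL, PAGE_SIZE);
                if (n < 0)
                        return n;

                /* one WR replaces the old multi-MR reg/invalidate chain */
                reg_wr.wr.opcode = IB_WR_REG_MR_INTEGRITY;
                reg_wr.mr = mr;
                reg_wr.key = mr->rkey;
                reg_wr.access = IB_ACCESS_LOCAL_WRITE |
                                IB_ACCESS_REMOTE_READ |
                                IB_ACCESS_REMOTE_WRITE;

                return ib_post_send(qp, &reg_wr.wr, NULL);
        }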
      
      The setup of the tested benchmark (using iSER ULP):
       - 2 servers with 24 cores (1 initiator and 1 target)
       - ConnectX-4/ConnectX-5 adapters
       - 24 target sessions with 1 LUN each
       - ramdisk backstore
       - PI active
      
      Performance results running fio (24 jobs, 128 iodepth) using
      write_generate=1 and read_verify=1 (w/w.o patch):
      
      bs      IOPS(read)        IOPS(write)
      ----    ----------        ----------
      512   1243.3K/1182.3K    1725.1K/1680.2K
      4k    571233/528835      743293/748259
      32k   72388/71086        71789/93573
      
      Using write_generate=0 and read_verify=0 (w/w.o patch):
      bs      IOPS(read)        IOPS(write)
      ----    ----------        ----------
      512   1572.1K/1427.2K    1823.5K/1724.3K
      4k    921992/916194      753772/768267
      32k   75052/73960        73180/95484
      
      There is a performance degradation when writing big block sizes. The
      degradation is caused by the complexity of combining multiple
      indirections and performing an RDMA READ operation through them. This
      will be fixed in the following patches by reducing the indirections
      where possible.
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>