1. 30 Apr, 2019 3 commits
    • Martin Wilck's avatar
      dm mpath: always free attached_handler_name in parse_path() · 940bc471
      Martin Wilck authored
      Commit b592211c ("dm mpath: fix attached_handler_name leak and
      dangling hw_handler_name pointer") fixed a memory leak for the case
      where setup_scsi_dh() returns failure. But setup_scsi_dh may return
      success and not "use" attached_handler_name if the
      retain_attached_hwhandler flag is not set on the map. As setup_scsi_sh
      properly "steals" the pointer by nullifying it, freeing it
      unconditionally in parse_path() is safe.
      
      Fixes: b592211c ("dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarYufen Yu <yuyufen@huawei.com>
      Signed-off-by: default avatarMartin Wilck <mwilck@suse.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      940bc471
    • Helen Koike's avatar
      dm init: fix max devices/targets checks · 8e890c1a
      Helen Koike authored
      dm-init should allow up to DM_MAX_{DEVICES,TARGETS} for devices/targets,
      and not DM_MAX_{DEVICES,TARGETS} - 1.
      
      Fix the checks and also fix the error message when the number of devices
      is surpassed.
      
      Fixes: 6bbc923d ("dm: add support to directly boot to a mapped device")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarHelen Koike <helen.koike@collabora.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      8e890c1a
    • Bryan Gurney's avatar
      dm: add dust target · e4f3fabd
      Bryan Gurney authored
      Add the dm-dust target, which simulates the behavior of bad sectors
      at arbitrary locations, and the ability to enable the emulation of
      the read failures at an arbitrary time.
      
      This target behaves similarly to a linear target.  At a given time,
      the user can send a message to the target to start failing read
      requests on specific blocks.  When the failure behavior is enabled,
      reads of blocks configured "bad" will fail with EIO.
      
      Writes of blocks configured "bad" will result in the following:
      
      1. Remove the block from the "bad block list".
      2. Successfully complete the write.
      
      After this point, the block will successfully contain the written
      data, and will service reads and writes normally.  This emulates the
      behavior of a "remapped sector" on a hard disk drive.
      
      dm-dust provides logging of which blocks have been added or removed
      to the "bad block list", as well as logging when a block has been
      removed from the bad block list.  These messages can be used
      alongside the messages from the driver using a dm-dust device to
      analyze the driver's behavior when a read fails at a given time.
      
      (This logging can be reduced via a "quiet" mode, if desired.)
      
      NOTE: If the block size is larger than 512 bytes, only the first sector
      of each "dust block" is detected.  Placing a limiting layer above a dust
      target, to limit the minimum I/O size to the dust block size, will
      ensure proper emulation of the given large block size.
      Signed-off-by: default avatarBryan Gurney <bgurney@redhat.com>
      Co-developed-by: default avatarJoe Shimkus <jshimkus@redhat.com>
      Co-developed-by: default avatarJohn Dorminy <jdorminy@redhat.com>
      Co-developed-by: default avatarJohn Pittman <jpittman@redhat.com>
      Co-developed-by: default avatarThomas Jaskiewicz <tjaskiew@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      e4f3fabd
  2. 26 Apr, 2019 4 commits
  3. 25 Apr, 2019 1 commit
    • Yufen Yu's avatar
      dm mpath: fix missing call of path selector type->end_io · 5de719e3
      Yufen Yu authored
      After commit 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via
      blk_insert_cloned_request feedback"), map_request() will requeue the tio
      when issued clone request return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.
      
      Thus, if device driver status is error, a tio may be requeued multiple
      times until the return value is not DM_MAPIO_REQUEUE.  That means
      type->start_io may be called multiple times, while type->end_io is only
      called when IO complete.
      
      In fact, even without commit 396eaf21, setup_clone() failure can
      also cause tio requeue and associated missed call to type->end_io.
      
      The service-time path selector selects path based on in_flight_size,
      which is increased by st_start_io() and decreased by st_end_io().
      Missed calls to st_end_io() can lead to in_flight_size count error and
      will cause the selector to make the wrong choice.  In addition,
      queue-length path selector will also be affected.
      
      To fix the problem, call type->end_io in ->release_clone_rq before tio
      requeue.  map_info is passed to ->release_clone_rq() for map_request()
      error path that result in requeue.
      
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Cc: stable@vger.kernl.org
      Signed-off-by: default avatarYufen Yu <yuyufen@huawei.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      5de719e3
  4. 18 Apr, 2019 16 commits
    • Mike Snitzer's avatar
      dm thin metadata: do not write metadata if no changes occurred · 873f258b
      Mike Snitzer authored
      Otherwise, just activating a thin-pool and thin device and then
      deactivating them will cause the thin-pool metadata to be changed
      (e.g. superblock written) -- even without any metadata being changed.
      
      Add 'in_service' flag to struct dm_pool_metadata and set it in
      pmd_write_lock() because all on-disk metadata changes must take a write
      lock of pmd->root_lock.  Once 'in_service' is set it is never cleared.
      __commit_transaction() will return 0 if 'in_service' is not set.
      dm_pool_commit_metadata() is updated to use __pmd_write_lock() so that
      it isn't the sole reason for putting a thin-pool in service.
      
      Also fix dm_pool_commit_metadata() to open the next transaction if the
      return from __commit_transaction() is 0.  Not seeing why the early
      return ever made since for a return of 0 given that dm-io's async_io(),
      as used by bufio, always returns 0.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      873f258b
    • Mike Snitzer's avatar
      dm thin metadata: add wrappers for managing write locking of metadata · 6a1b1ddc
      Mike Snitzer authored
      No functional change, but this prepares to hook off of pmd_write_lock()
      with additional functionality (as provided in next commit).
      Suggested-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      6a1b1ddc
    • Mike Snitzer's avatar
      dm thin metadata: check __commit_transaction()'s return · a1ed4d9e
      Mike Snitzer authored
      Fix __reserve_metadata_snap() to return early if __commit_transaction()
      fails.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      a1ed4d9e
    • Mike Snitzer's avatar
      dm space map common: zero entire ll_disk · c6e086e0
      Mike Snitzer authored
      Otherwise, memory that is allocated (and potentially not previously
      zeroed) will get written to disk as part of the space maps.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      c6e086e0
    • Huaisheng Ye's avatar
      dm writecache: add unlikely for returned value of rb_next/prev · 84420b1e
      Huaisheng Ye authored
      In functions writecache_discard() and writecache_find_entry() there is a
      high probablity that the pointer of structure rb_node won't equal NULL.
      Add unlikely for the pointer node NULL.
      Signed-off-by: default avatarHuaisheng Ye <yehs1@lenovo.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      84420b1e
    • Huaisheng Ye's avatar
      dm writecache: remove needless dereferences in __writecache_writeback_pmem() · 09f2d656
      Huaisheng Ye authored
      bio is already available so there is no need to access it in terms of
      the wb pointer.
      Signed-off-by: default avatarHuaisheng Ye <yehs1@lenovo.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      09f2d656
    • Nikos Tsironis's avatar
      dm snapshot: Use fine-grained locking scheme · 3f1637f2
      Nikos Tsironis authored
      Substitute the global locking scheme with a fine grained one, employing
      the read-write semaphore and the scalable exception tables with
      per-bucket locks introduced by the previous two commits.
      
      Summarizing, we now use a read-write semaphore to protect the mostly
      read fields of the snapshot structure, e.g., valid, active, etc., and
      per-bucket bit spinlocks to protect accesses to the complete and pending
      exception tables.
      
      Finally, we use an extra spinlock (pe_allocation_lock) to serialize the
      allocation of new exceptions by the exception store. This allocation is
      really fast, so the extra spinlock doesn't hurt the performance.
      
      This scheme allows dm-snapshot to scale better, resulting in increased
      IOPS and reduced latency.
      
      Following are some benchmark results using the null_blk device:
      
        modprobe null_blk gb=1024 bs=512 submit_queues=8 hw_queue_depth=4096 \
         queue_mode=2 irqmode=1 completion_nsec=1 nr_devices=1
      
      * Benchmark fio_origin_randwrite_throughput_N, from the device mapper
        test suite [1] (direct IO, random 4K writes to origin device, IO
        engine libaio):
      
        +--------------+-------------+------------+
        | # of workers | IOPS Before | IOPS After |
        +--------------+-------------+------------+
        |      1       |    57708    |   66421    |
        |      2       |    63415    |   77589    |
        |      4       |    67276    |   98839    |
        |      8       |    60564    |   109258   |
        +--------------+-------------+------------+
      
      * Benchmark fio_origin_randwrite_latency_N, from the device mapper test
        suite [1] (direct IO, random 4K writes to origin device, IO engine
        psync):
      
        +--------------+-----------------------+----------------------+
        | # of workers | Latency (usec) Before | Latency (usec) After |
        +--------------+-----------------------+----------------------+
        |      1       |         16.25         |        13.27         |
        |      2       |         31.65         |        25.08         |
        |      4       |         55.28         |        41.08         |
        |      8       |         121.47        |        74.44         |
        +--------------+-----------------------+----------------------+
      
      * Benchmark fio_snapshot_randwrite_throughput_N, from the device mapper
        test suite [1] (direct IO, random 4K writes to snapshot device, IO
        engine libaio):
      
        +--------------+-------------+------------+
        | # of workers | IOPS Before | IOPS After |
        +--------------+-------------+------------+
        |      1       |    72593    |   84938    |
        |      2       |    97379    |   134973   |
        |      4       |    90610    |   143077   |
        |      8       |    90537    |   180085   |
        +--------------+-------------+------------+
      
      * Benchmark fio_snapshot_randwrite_latency_N, from the device mapper
        test suite [1] (direct IO, random 4K writes to snapshot device, IO
        engine psync):
      
        +--------------+-----------------------+----------------------+
        | # of workers | Latency (usec) Before | Latency (usec) After |
        +--------------+-----------------------+----------------------+
        |      1       |         12.53         |         10.6         |
        |      2       |         19.78         |        14.89         |
        |      4       |         40.37         |        23.47         |
        |      8       |         89.32         |        48.48         |
        +--------------+-----------------------+----------------------+
      
      [1] https://github.com/jthornber/device-mapper-test-suiteCo-developed-by: default avatarIlias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: default avatarNikos Tsironis <ntsironis@arrikto.com>
      Acked-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      3f1637f2
    • Nikos Tsironis's avatar
      dm snapshot: Make exception tables scalable · f79ae415
      Nikos Tsironis authored
      Use list_bl to implement the exception hash tables' buckets. This change
      permits concurrent access, to distinct buckets, by multiple threads.
      
      Also, implement helper functions to lock and unlock the exception tables
      based on the chunk number of the exception at hand.
      
      We retain the global locking, by means of down_write(), which is
      replaced by the next commit.
      
      Still, we must acquire the per-bucket spinlocks when accessing the hash
      tables, since list_bl does not allow modification on unlocked lists.
      Co-developed-by: default avatarIlias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: default avatarNikos Tsironis <ntsironis@arrikto.com>
      Acked-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      f79ae415
    • Nikos Tsironis's avatar
      dm snapshot: Replace mutex with rw semaphore · 4ad8d880
      Nikos Tsironis authored
      dm-snapshot uses a single mutex to serialize every access to the
      snapshot state. This includes all accesses to the complete and pending
      exception tables, which occur at every origin write, every snapshot
      read/write and every exception completion.
      
      The lock statistics indicate that this mutex is a bottleneck (average
      wait time ~480 usecs for 8 processes doing random 4K writes to the
      origin device) preventing dm-snapshot to scale as the number of threads
      doing IO increases.
      
      The major contention points are __origin_write()/snapshot_map() and
      pending_complete(), i.e., the submission and completion of pending
      exceptions.
      
      Replace this mutex with a rw semaphore.
      
      We essentially revert commit ae1093be ("dm snapshot: use mutex
      instead of rw_semaphore") and together with the next two patches we
      substitute the single mutex with a fine-grained locking scheme, where we
      use a read-write semaphore to protect the mostly read fields of the
      snapshot structure, e.g., valid, active, etc., and per-bucket bit
      spinlocks to protect accesses to the complete and pending exception
      tables.
      Co-developed-by: default avatarIlias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: default avatarNikos Tsironis <ntsironis@arrikto.com>
      Acked-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      4ad8d880
    • Nikos Tsironis's avatar
      dm snapshot: Don't sleep holding the snapshot lock · 65fc7c37
      Nikos Tsironis authored
      When completing a pending exception, pending_complete() waits for all
      conflicting reads to drain, before inserting the final, completed
      exception. Conflicting reads are snapshot reads redirected to the
      origin, because the relevant chunk is not remapped to the COW device the
      moment we receive the read.
      
      The completed exception must be inserted into the exception table after
      all conflicting reads drain to ensure snapshot reads don't return
      corrupted data. This is required because inserting the completed
      exception into the exception table signals that the relevant chunk is
      remapped and both origin writes and snapshot merging will now overwrite
      the chunk in origin.
      
      This wait is done holding the snapshot lock to ensure that
      pending_complete() doesn't starve if new snapshot reads keep coming for
      this chunk.
      
      In preparation for the next commit, where we use a spinlock instead of a
      mutex to protect the exception tables, we remove the need for holding
      the lock while waiting for conflicting reads to drain.
      
      We achieve this in two steps:
      
      1. pending_complete() inserts the completed exception before waiting for
         conflicting reads to drain and removes the pending exception after
         all conflicting reads drain.
      
         This ensures that new snapshot reads will be redirected to the COW
         device, instead of the origin, and thus pending_complete() will not
         starve. Moreover, we use the existence of both a completed and
         a pending exception to signify that the COW is done but there are
         conflicting reads in flight.
      
      2. In __origin_write() we check first if there is a pending exception
         and then if there is a completed exception. If there is a pending
         exception any submitted BIO is delayed on the pe->origin_bios list and
         DM_MAPIO_SUBMITTED is returned. This ensures that neither writes to the
         origin nor snapshot merging can overwrite the origin chunk, until all
         conflicting reads drain, and thus snapshot reads will not return
         corrupted data.
      
      Summarizing, we now have the following possible combinations of pending
      and completed exceptions for a chunk, along with their meaning:
      
      A. No exceptions exist: The chunk has not been remapped yet.
      B. Only a pending exception exists: The chunk is currently being copied
         to the COW device.
      C. Both a pending and a completed exception exist: COW for this chunk
         has completed but there are snapshot reads in flight which had been
         redirected to the origin before the chunk was remapped.
      D. Only the completed exception exists: COW has been completed and there
         are no conflicting reads in flight.
      Co-developed-by: default avatarIlias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: default avatarNikos Tsironis <ntsironis@arrikto.com>
      Acked-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      65fc7c37
    • Nikos Tsironis's avatar
      list_bl: Add hlist_bl_add_before/behind helpers · 34191ae8
      Nikos Tsironis authored
      Add hlist_bl_add_before/behind helpers to add an element before/after an
      existing element in a bl_list.
      Co-developed-by: default avatarIlias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: default avatarNikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: default avatarPaul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      34191ae8
    • Nikos Tsironis's avatar
      list: Don't use WRITE_ONCE() in hlist_add_behind() · ae325dcd
      Nikos Tsironis authored
      Commit 1c97be67 ("list: Use WRITE_ONCE() when adding to lists and
      hlists") introduced the use of WRITE_ONCE() to atomically write the list
      head's ->next pointer.
      
      hlist_add_behind() doesn't touch the hlist head's ->first pointer so
      there is no reason to use WRITE_ONCE() in this case.
      Co-developed-by: default avatarIlias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: default avatarNikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: default avatarPaul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      ae325dcd
    • Nikos Tsironis's avatar
      dm cache metadata: Fix loading discard bitset · e28adc3b
      Nikos Tsironis authored
      Add missing dm_bitset_cursor_next() to properly advance the bitset
      cursor.
      
      Otherwise, the discarded state of all blocks is set according to the
      discarded state of the first block.
      
      Fixes: ae4a46a1 ("dm cache metadata: use bitset cursor api to load discard bitset")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarNikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      e28adc3b
    • Damien Le Moal's avatar
      dm zoned: Fix zone report handling · 7aedf75f
      Damien Le Moal authored
      The function blkdev_report_zones() returns success even if no zone
      information is reported (empty report). Empty zone reports can only
      happen if the report start sector passed exceeds the device capacity.
      The conditions for this to happen are either a bug in the caller code,
      or, a change in the device that forced the low level driver to change
      the device capacity to a value that is lower than the report start
      sector. This situation includes a failed disk revalidation resulting in
      the disk capacity being changed to 0.
      
      If this change happens while dm-zoned is in its initialization phase
      executing dmz_init_zones(), this function may enter an infinite loop
      and hang the system. To avoid this, add a check to disallow empty zone
      reports and bail out early. Also fix the function dmz_update_zone() to
      make sure that the report for the requested zone was correctly obtained.
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: default avatarShaun Tancheff <shaun@tancheff.com>
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      7aedf75f
    • Dan Carpenter's avatar
      dm zoned: Silence a static checker warning · a3839bc6
      Dan Carpenter authored
      My static checker complains about this line from dmz_get_zoned_device()
      
      	aligned_capacity = dev->capacity & ~(blk_queue_zone_sectors(q) - 1);
      
      The problem is that "aligned_capacity" and "dev->capacity" are sector_t
      type (which is a u64 under most configs) but blk_queue_zone_sectors(q)
      returns a u32 so the higher 32 bits in aligned_capacity are cleared to
      zero.  This patch adds a cast to address the issue.
      
      Fixes: 114e0259 ("dm zoned: ignore last smaller runt zone")
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      a3839bc6
    • Christoph Hellwig's avatar
      dm crypt: fix endianness annotations around org_sector_of_dmreq · c13b5487
      Christoph Hellwig authored
      The sector used here is a little endian value, so use the right
      type for it.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      c13b5487
  5. 14 Apr, 2019 6 commits
    • Linus Torvalds's avatar
      Linux 5.1-rc5 · dc4060a5
      Linus Torvalds authored
      dc4060a5
    • Linus Torvalds's avatar
      Merge branch 'page-refs' (page ref overflow) · 6b3a7077
      Linus Torvalds authored
      Merge page ref overflow branch.
      
      Jann Horn reported that he can overflow the page ref count with
      sufficient memory (and a filesystem that is intentionally extremely
      slow).
      
      Admittedly it's not exactly easy.  To have more than four billion
      references to a page requires a minimum of 32GB of kernel memory just
      for the pointers to the pages, much less any metadata to keep track of
      those pointers.  Jann needed a total of 140GB of memory and a specially
      crafted filesystem that leaves all reads pending (in order to not ever
      free the page references and just keep adding more).
      
      Still, we have a fairly straightforward way to limit the two obvious
      user-controllable sources of page references: direct-IO like page
      references gotten through get_user_pages(), and the splice pipe page
      duplication.  So let's just do that.
      
      * branch page-refs:
        fs: prevent page refcount overflow in pipe_buf_get
        mm: prevent get_user_pages() from overflowing page refcount
        mm: add 'try_get_page()' helper function
        mm: make page ref count overflow check tighter and more explicit
      6b3a7077
    • Matthew Wilcox's avatar
      fs: prevent page refcount overflow in pipe_buf_get · 15fab63e
      Matthew Wilcox authored
      Change pipe_buf_get() to return a bool indicating whether it succeeded
      in raising the refcount of the page (if the thing in the pipe is a page).
      This removes another mechanism for overflowing the page refcount.  All
      callers converted to handle a failure.
      Reported-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarMatthew Wilcox <willy@infradead.org>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      15fab63e
    • Linus Torvalds's avatar
      mm: prevent get_user_pages() from overflowing page refcount · 8fde12ca
      Linus Torvalds authored
      If the page refcount wraps around past zero, it will be freed while
      there are still four billion references to it.  One of the possible
      avenues for an attacker to try to make this happen is by doing direct IO
      on a page multiple times.  This patch makes get_user_pages() refuse to
      take a new page reference if there are already more than two billion
      references to the page.
      Reported-by: default avatarJann Horn <jannh@google.com>
      Acked-by: default avatarMatthew Wilcox <willy@infradead.org>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8fde12ca
    • Linus Torvalds's avatar
      mm: add 'try_get_page()' helper function · 88b1a17d
      Linus Torvalds authored
      This is the same as the traditional 'get_page()' function, but instead
      of unconditionally incrementing the reference count of the page, it only
      does so if the count was "safe".  It returns whether the reference count
      was incremented (and is marked __must_check, since the caller obviously
      has to be aware of it).
      
      Also like 'get_page()', you can't use this function unless you already
      had a reference to the page.  The intent is that you can use this
      exactly like get_page(), but in situations where you want to limit the
      maximum reference count.
      
      The code currently does an unconditional WARN_ON_ONCE() if we ever hit
      the reference count issues (either zero or negative), as a notification
      that the conditional non-increment actually happened.
      
      NOTE! The count access for the "safety" check is inherently racy, but
      that doesn't matter since the buffer we use is basically half the range
      of the reference count (ie we look at the sign of the count).
      Acked-by: default avatarMatthew Wilcox <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      88b1a17d
    • Linus Torvalds's avatar
      mm: make page ref count overflow check tighter and more explicit · f958d7b5
      Linus Torvalds authored
      We have a VM_BUG_ON() to check that the page reference count doesn't
      underflow (or get close to overflow) by checking the sign of the count.
      
      That's all fine, but we actually want to allow people to use a "get page
      ref unless it's already very high" helper function, and we want that one
      to use the sign of the page ref (without triggering this VM_BUG_ON).
      
      Change the VM_BUG_ON to only check for small underflows (or _very_ close
      to overflowing), and ignore overflows which have strayed into negative
      territory.
      Acked-by: default avatarMatthew Wilcox <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f958d7b5
  6. 13 Apr, 2019 10 commits
    • Linus Torvalds's avatar
      Merge tag 'for-linus-20190412' of git://git.kernel.dk/linux-block · 4443f8e6
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Set of fixes that should go into this round. This pull is larger than
        I'd like at this time, but there's really no specific reason for that.
        Some are fixes for issues that went into this merge window, others are
        not. Anyway, this contains:
      
         - Hardware queue limiting for virtio-blk/scsi (Dongli)
      
         - Multi-page bvec fixes for lightnvm pblk
      
         - Multi-bio dio error fix (Jason)
      
         - Remove the cache hint from the io_uring tool side, since we didn't
           move forward with that (me)
      
         - Make io_uring SETUP_SQPOLL root restricted (me)
      
         - Fix leak of page in error handling for pc requests (Jérôme)
      
         - Fix BFQ regression introduced in this merge window (Paolo)
      
         - Fix break logic for bio segment iteration (Ming)
      
         - Fix NVMe cancel request error handling (Ming)
      
         - NVMe pull request with two fixes (Christoph):
             - fix the initial CSN for nvme-fc (James)
             - handle log page offsets properly in the target (Keith)"
      
      * tag 'for-linus-20190412' of git://git.kernel.dk/linux-block:
        block: fix the return errno for direct IO
        nvmet: fix discover log page when offsets are used
        nvme-fc: correct csn initialization and increments on error
        block: do not leak memory in bio_copy_user_iov()
        lightnvm: pblk: fix crash in pblk_end_partial_read due to multipage bvecs
        nvme: cancel request synchronously
        blk-mq: introduce blk_mq_complete_request_sync()
        scsi: virtio_scsi: limit number of hw queues by nr_cpu_ids
        virtio-blk: limit number of hw queues by nr_cpu_ids
        block, bfq: fix use after free in bfq_bfqq_expire
        io_uring: restrict IORING_SETUP_SQPOLL to root
        tools/io_uring: remove IOCQE_FLAG_CACHEHIT
        block: don't use for-inside-for in bio_for_each_segment_all
      4443f8e6
    • Linus Torvalds's avatar
      Merge tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · b60bc066
      Linus Torvalds authored
      Pull NFS client bugfixes from Trond Myklebust:
       "Highlights include:
      
        Stable fix:
      
         - Fix a deadlock in close() due to incorrect draining of RDMA queues
      
        Bugfixes:
      
         - Revert "SUNRPC: Micro-optimise when the task is known not to be
           sleeping" as it is causing stack overflows
      
         - Fix a regression where NFSv4 getacl and fs_locations stopped
           working
      
         - Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
      
         - Fix xfstests failures due to incorrect copy_file_range() return
           values"
      
      * tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
        NFSv4.1 fix incorrect return value in copy_file_range
        xprtrdma: Fix helper that drains the transport
        NFS: Fix handling of reply page vector
        NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
      b60bc066
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 87af0c38
      Linus Torvalds authored
      Pull SCSI fix from James Bottomley:
       "One obvious fix for a ciostor data corruption on error bug"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: csiostor: fix missing data copy in csio_scsi_err_handler()
      87af0c38
    • Linus Torvalds's avatar
      Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux · 09bad0df
      Linus Torvalds authored
      Pull clk fixes from Stephen Boyd:
       "Here's more than a handful of clk driver fixes for changes that came
        in during the merge window:
      
         - Fix the AT91 sama5d2 programmable clk prescaler formula
      
         - A bunch of Amlogic meson clk driver fixes for the VPU clks
      
         - A DMI quirk for Intel's Bay Trail SoC's driver to properly mark pmc
           clks as critical only when really needed
      
         - Stop overwriting CLK_SET_RATE_PARENT flag in mediatek's clk gate
           implementation
      
         - Use the right structure to test for a frequency table in i.MX's
           PLL_1416x driver"
      
      * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
        clk: imx: Fix PLL_1416X not rounding rates
        clk: mediatek: fix clk-gate flag setting
        platform/x86: pmc_atom: Drop __initconst on dmi table
        clk: x86: Add system specific quirk to mark clocks as critical
        clk: meson: vid-pll-div: remove warning and return 0 on invalid config
        clk: meson: pll: fix rounding and setting a rate that matches precisely
        clk: meson-g12a: fix VPU clock parents
        clk: meson: g12a: fix VPU clock muxes mask
        clk: meson-gxbb: round the vdec dividers to closest
        clk: at91: fix programmable clock for sama5d2
      09bad0df
    • Linus Torvalds's avatar
      Merge tag 'pci-v5.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · a3b84248
      Linus Torvalds authored
      Pull PCI fixes from Bjorn Helgaas:
      
       - Add a DMA alias quirk for another Marvell SATA device (Andre
         Przywara)
      
       - Fix a pciehp regression that broke safe removal of devices (Sergey
         Miroshnichenko)
      
      * tag 'pci-v5.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
        PCI: pciehp: Ignore Link State Changes after powering off a slot
        PCI: Add function 1 DMA alias quirk for Marvell 9170 SATA controller
      a3b84248
    • Linus Torvalds's avatar
      Merge tag 'powerpc-5.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · cf60528f
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "A minor build fix for 64-bit FLATMEM configs.
      
        A fix for a boot failure on 32-bit powermacs.
      
        My commit to fix CLOCK_MONOTONIC across Y2038 broke the 32-bit VDSO on
        64-bit kernels, ie. compat mode, which is only used on big endian.
      
        The rewrite of the SLB code we merged in 4.20 missed the fact that the
        0x380 exception is also used with the Radix MMU to report out of range
        accesses. This could lead to an oops if userspace tried to read from
        addresses outside the user or kernel range.
      
        Thanks to: Aneesh Kumar K.V, Christophe Leroy, Larry Finger, Nicholas
        Piggin"
      
      * tag 'powerpc-5.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/mm: Define MAX_PHYSMEM_BITS for all 64-bit configs
        powerpc/64s/radix: Fix radix segment exception handling
        powerpc/vdso32: fix CLOCK_MONOTONIC on PPC64
        powerpc/32: Fix early boot failure with RTAS built-in
      cf60528f
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 5ded8871
      Linus Torvalds authored
      Pull arm64 fixes from Will Deacon:
       "The main thing is a fix to our FUTEX_WAKE_OP implementation which was
        unbelievably broken, but did actually work for the one scenario that
        GLIBC used to use.
      
        Summary:
      
         - Fix stack unwinding so we ignore user stacks
      
         - Fix ftrace module PLT trampoline initialisation checks
      
         - Fix terminally broken implementation of FUTEX_WAKE_OP atomics"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value
        arm64: backtrace: Don't bother trying to unwind the userspace stack
        arm64/ftrace: fix inadvertent BUG() in trampoline check
      5ded8871
    • Linus Torvalds's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 6d0a5984
      Linus Torvalds authored
      Pull x86 fixes from Ingo Molnar:
       "Fix typos in user-visible resctrl parameters, and also fix assembly
        constraint bugs that might result in miscompilation"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/asm: Use stricter assembly constraints in bitops
        x86/resctrl: Fix typos in the mba_sc mount option
      6d0a5984
    • Linus Torvalds's avatar
      Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 122c215b
      Linus Torvalds authored
      Pull timer fix from Ingo Molnar:
       "Fix the alarm_timer_remaining() return value"
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        alarmtimer: Return correct remaining time
      122c215b
    • Linus Torvalds's avatar
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5e6f1fee
      Linus Torvalds authored
      Pull scheduler fix from Ingo Molnar:
       "Fix a NULL pointer dereference crash in certain environments"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/fair: Do not re-read ->h_load_next during hierarchical load calculation
      5e6f1fee