1. 11 Sep, 2017 6 commits
    • dax: remove the pmem_dax_ops->flush abstraction · c3ca015f
      Mikulas Patocka authored
      Commit abebfbe2 ("dm: add ->flush() dax operation support") is
      buggy. A DM device may be composed of multiple underlying devices and
      all of them need to be flushed. That commit just routes the flush
      request to the first device and ignores the other devices.
      
      It could be fixed by adding more complex logic to the device mapper. But
      there is only one implementation of the method pmem_dax_ops->flush - that
      is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
      don't need the pmem_dax_ops->flush abstraction at all; we can call
      arch_wb_cache_pmem() directly from dax_flush(), because
      dax_dev->ops->flush can never reach anything other than
      arch_wb_cache_pmem().
      
      It should also be pointed out that some uses of persistent memory need
      to flush only a very small amount of data (such as one cacheline), and
      going through the device mapper machinery for a single flushed cache
      line would be overkill.
      
      Fix this by removing the pmem_dax_ops->flush abstraction and calling
      arch_wb_cache_pmem() directly from dax_flush(), as sketched below.
      Also, remove the device mapper code that forwards the flushes.
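      A minimal sketch of what dax_flush() then reduces to (assuming the
      existing dax_alive() liveness check stays in place; the real code in
      drivers/dax/super.c may differ in detail):

      #ifdef CONFIG_ARCH_HAS_PMEM_API
      void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
      {
              if (unlikely(!dax_alive(dax_dev)))
                      return;

              /* Call the one real implementation directly. */
              arch_wb_cache_pmem(addr, size);
      }
      #else
      void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
      {
              /* No pmem API on this architecture: flushing is a no-op. */
      }
      #endif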
      
      Fixes: abebfbe2 ("dm: add ->flush() dax operation support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK · b5e8ad92
      Arnd Bergmann authored
      The new lockdep support for completions caused the stack usage in
      dm-integrity to explode; in the case of write_journal, it grew from
      504 bytes to 1120 (using arm gcc-7.1.1):
      
      drivers/md/dm-integrity.c: In function 'write_journal':
      drivers/md/dm-integrity.c:827:1: error: the frame size of 1120 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      
      The problem is not only that the size of 'struct completion' grows
      significantly, but also that we end up with multiple copies of it on
      the stack when we assign it from a local variable after the initial
      declaration.
      
      COMPLETION_INITIALIZER_ONSTACK() is the right thing to use when we
      want to declare and initialize a completion on the stack. However,
      this driver doesn't do that and instead initializes the completion
      just before it is used.
      
      In this case, init_completion() does the same thing more efficiently,
      and drops the stack usage for the function above down to 496 bytes.
      While the other functions in this file are not bad enough to cause a
      warning, they benefit equally from the change, so I made the change
      across the entire file. In the one place where we reuse a completion,
      I picked the cheaper reinit_completion() over init_completion(), as
      the sketch below shows.
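      In dm-integrity terms, the change is essentially the following sketch
      ('comp.comp' stands for the completion embedded in the on-stack
      journal completion structure):

      struct journal_completion comp;

      /* Before: builds a whole initializer and copies it onto the stack. */
      comp.comp = COMPLETION_INITIALIZER_ONSTACK(comp.comp);

      /* After: initializes the existing object in place, no temporary. */
      init_completion(&comp.comp);

      /* Where the same completion is reused, reinitialize it cheaply. */
      reinit_completion(&comp.comp);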
      
      Fixes: cd8084f9 ("locking/lockdep: Apply crossrelease to completions")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm integrity: make blk_integrity_profile structure const · 7c373d66
      Bhumika Goyal authored
      Make this structure const, as it is only stored in the profile field
      of a blk_integrity structure. That field is a pointer to const, so the
      structure itself can be declared const, as sketched below.
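      A sketch of the result (the exact initializer is illustrative; see
      drivers/md/dm-integrity.c for the real one):

      static const struct blk_integrity_profile dm_integrity_profile = {
              .name           = "DM-DIF-EXT-TAG",
              .generate_fn    = NULL,
              .verify_fn      = NULL,
      };

      Since the object is never written after initialization, the compiler
      can place it in a read-only section.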
      Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm integrity: do not check integrity for failed read operations · b7e326f7
      Hyunchul Lee authored
      Even when a read operation fails, dm_integrity_map_continue() still
      calls integrity_metadata() to check integrity.  In this case, just
      complete the I/O without checking (see the sketch below).
      
      This also makes it so read I/O errors do not generate integrity warnings
      in the kernel log.
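      The idea, as a sketch rather than the exact diff (the names
      need_sync_io, read_comp, dio and dec_in_flight() are assumed from
      dm-integrity's synchronous read path):

      if (unlikely(need_sync_io)) {
              wait_for_completion_io(&read_comp);
              if (likely(!bio->bi_status))
                      integrity_metadata(&dio->work); /* verify the tags */
              else
                      dec_in_flight(dio); /* just complete the failed read */
      }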
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
      Acked-by: Milan Broz <gmazyland@gmail.com>
      Acked-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm log writes: fix >512b sectorsize support · 228bb5b2
      Josef Bacik authored
      The distinction between 512b sectors and the device's physical sector
      size was not maintained consistently, and as such the support for
      >512b-sector devices has bugs.  The log metadata expects the native
      sector size but 512b sectors were being stored.  Also, the device's
      sector size was assumed when assigning the bi_sector for blocks that
      were being logged.
      
      Fix this up by adding two helpers to convert between bio and dev
      sectors (sketched below), and use these in the appropriate places to
      fix the problem and make it clear which units go where.  Doing so
      allows dm-log-writes to be used with 4k devices.
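      The helpers are essentially shifts by the difference between the
      device's sector shift and the 512b SECTOR_SHIFT (a sketch;
      lc->sectorshift caches the log2 of the device sector size):

      static inline sector_t bio_to_dev_sectors(struct log_writes_c *lc,
                                                sector_t sectors)
      {
              return sectors >> (lc->sectorshift - SECTOR_SHIFT);
      }

      static inline sector_t dev_to_bio_sectors(struct log_writes_c *lc,
                                                sector_t sectors)
      {
              return sectors << (lc->sectorshift - SECTOR_SHIFT);
      }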
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm log writes: don't use all the cpu while waiting to log blocks · 0c79c620
      Josef Bacik authored
      The check to see whether the logging kthread needs to go to sleep is
      wrong: it checks lc->pending_blocks, which will be non-zero if there
      are any pending blocks at all, whether or not they are ready to be
      logged.  What we really want is to sleep until it is time to log
      blocks, so change this check so that we actually go to sleep in
      between flushes, as sketched below.
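      A sketch of the corrected wait in the kthread loop (assuming
      lc->logging_blocks is the list of blocks that are actually ready to
      be logged):

      if (!try_to_freeze()) {
              set_current_state(TASK_INTERRUPTIBLE);
              /* Sleep unless something is ready on the logging list. */
              if (!kthread_should_stop() && list_empty(&lc->logging_blocks))
                      schedule();
              __set_current_state(TASK_RUNNING);
      }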
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  2. 28 Aug, 2017 17 commits
  3. 27 Aug, 2017 3 commits
    • Avoid page waitqueue race leaving possible page locker waiting · a8b169af
      Linus Torvalds authored
      The "lock_page_killable()" function waits for exclusive access to the
      page lock bit using the WQ_FLAG_EXCLUSIVE bit in the waitqueue entry
      set.
      
      That means that if it gets woken up, other waiters may have been
      skipped.
      
      That, in turn, means that if it sees the page being unlocked, it *must*
      take that lock and return success, even if a lethal signal is also
      pending.
      
      So instead of checking for lethal signals first, we need to check for
      them after we've checked the actual bit that we were waiting for, even
      if that might then delay the killing of the process.
      
      This matches the order of the old "wait_on_bit_lock()" infrastructure
      that the page locking used to use (and is still used in a few other
      areas).
      
      Note that if we still return an error after having unsuccessfully tried
      to acquire the page lock, that is ok: that means that some other thread
      was able to get ahead of us and lock the page, and when that other
      thread then unlocks the page, the wakeup event will be repeated.  So any
      other pending waiters will now get properly woken up.
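      A much-simplified sketch of the resulting ordering in the wait loop
      (not the actual wait_on_page_bit_common() code): try the bit first,
      and only then consider fatal signals:

      for (;;) {
              set_current_state(state);
              /* First: try the bit we were (possibly exclusively) woken for. */
              if (!test_and_set_bit_lock(PG_locked, &page->flags)) {
                      ret = 0;        /* we hold the lock; must report success */
                      break;
              }
              /* Only afterwards: give up on a fatal signal. */
              if (state == TASK_KILLABLE && fatal_signal_pending(current)) {
                      ret = -EINTR;
                      break;
              }
              io_schedule();
      }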
      
      Fixes: 62906027 ("mm: add PageWaiters indicating tasks are waiting for a page bit")
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Minor page waitqueue cleanups · 3510ca20
      Linus Torvalds authored
      Tim Chen and Kan Liang have been battling a customer load that shows
      extremely long page wakeup lists.  The cause seems to be constant NUMA
      migration of a hot page that is shared across a lot of threads, but the
      actual root cause for the exact behavior has not been found.
      
      Tim has a patch that batches the wait list traversal at wakeup time, so
      that we at least don't get long uninterruptible cases where we traverse
      and wake up thousands of processes and get nasty latency spikes.  That
      is likely 4.14 material, but we're still discussing the page waitqueue
      specific parts of it.
      
      In the meantime, I've tried to look at making the page wait queues less
      expensive, and failing miserably.  If you have thousands of threads
      waiting for the same page, it will be painful.  We'll need to try to
      figure out the NUMA balancing issue some day, in addition to avoiding
      the excessive spinlock hold times.
      
      That said, having tried to rewrite the page wait queues, I can at least
      fix up some of the braindamage in the current situation. In particular:
      
       (a) we don't want to continue walking the page wait list if the bit
           we're waiting for already got set again (which seems to be one of
           the patterns of the bad load).  That makes no progress and just
           causes pointless cache pollution chasing the pointers.
      
       (b) we don't want to put the non-locking waiters always on the front of
           the queue, and the locking waiters always on the back.  Not only is
           that unfair, it means that we wake up thousands of reading threads
           that will just end up being blocked by the writer later anyway.
      
      Also add a comment about the layout of 'struct wait_page_key' - there
      is an external user of it in the cachefiles code, which means that it
      has to match the layout of 'struct wait_bit_key' in the first two
      members (see the sketch below).  It happens to match because
      'struct page *' and 'unsigned long *' end up having the same values,
      simply because the page flags are the first member in struct page.
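      For reference, a sketch of the two layouts as they stand (only the
      first two members have to line up):

      struct wait_bit_key {
              void            *flags;
              int             bit_nr;
              unsigned long   timeout;
      };

      struct wait_page_key {
              struct page     *page;          /* occupies the 'flags' slot */
              int             bit_nr;         /* must match wait_bit_key */
              int             page_match;
      };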
      
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Clarify (and fix) MAX_LFS_FILESIZE macros · 0cc3b0ec
      Linus Torvalds authored
      We have a MAX_LFS_FILESIZE macro that is meant to be filled in by
      filesystems (and other IO targets) that know they are 64-bit clean and
      don't have any 32-bit limits in their IO path.
      
      It turns out that our 32-bit value for that limit was bogus.  On 32-bit,
      the VM layer is limited by the page cache to only 32-bit index values,
      but our logic for that was confusing and actually wrong.  We used to
      define that value to
      
      	(((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
      
      which is actually odd in several ways: it limits the index to 31 bits,
      and then it limits files so that they can't have data in that last byte
      of a page that has the highest 31-bit index (ie page index 0x7fffffff).
      
      Neither of those limitations makes sense.  The index is actually the
      full 32-bit unsigned value, and we can use all of that last page.  So
      the maximum size of the file would logically be
      "PAGE_SIZE << BITS_PER_LONG".
      
      However, we do want to avoid the maximum index, because we have code
      that iterates over the page indexes, and we don't want that code to
      overflow.  So the maximum size of a file on a 32-bit host should
      actually be one page less than the full 32-bit index.
      
      So the actual limit is ULONG_MAX << PAGE_SHIFT.  That means that we will
      not actually be using the page of that last index (ULONG_MAX), but we
      can grow a file up to that limit.
      
      The wrong value of MAX_LFS_FILESIZE actually caused problems for Doug
      Nazar, who was still using a 32-bit host, but with a 9.7TB 2 x RAID5
      volume.  It turns out that our old MAX_LFS_FILESIZE was 8TiB (well, one
      byte less), but the actual true VM limit is one page less than 16TiB.
      
      This was invisible until commit c2a9737f ("vfs,mm: fix a dead loop
      in truncate_inode_pages_range()"), which started applying that
      MAX_LFS_FILESIZE limit to block devices too.
      
      NOTE! On 64-bit, the page index isn't a limiter at all, and the limit is
      actually just the offset type itself (loff_t), which is signed.  But for
      clarity, on 64-bit, just use the maximum signed value, and don't make
      people have to count the number of 'f' characters in the hex constant.
      
      So just use LLONG_MAX for the 64-bit case.  That was what the value had
      been before too, just written out as a hex constant.
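      The resulting definitions look like this (matching the reasoning
      above):

      #if BITS_PER_LONG == 32
      /* One page below the full 32-bit index space, so that code iterating
         over page indexes cannot overflow. */
      #define MAX_LFS_FILESIZE        ((loff_t)ULONG_MAX << PAGE_SHIFT)
      #elif BITS_PER_LONG == 64
      /* loff_t is signed, so the limit is simply the largest signed value. */
      #define MAX_LFS_FILESIZE        ((loff_t)LLONG_MAX)
      #endif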
      
      Fixes: c2a9737f ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
      Reported-and-tested-by: Doug Nazar <nazard@nazar.ca>
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 26 Aug, 2017 13 commits
  5. 25 Aug, 2017 1 commit
    • Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 1f5de42d
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "I2C has some bugfixes for you: mainly Jarkko fixed up a few things in
        the designware driver regarding the new slave mode. But Ulf also fixed
        a long-standing and now agreed suspend problem. Plus, some simple
        stuff which nonetheless needs fixing"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: designware: Fix runtime PM for I2C slave mode
        i2c: designware: Remove needless pm_runtime_put_noidle() call
        i2c: aspeed: fixed potential null pointer dereference
        i2c: simtec: use release_mem_region instead of release_resource
        i2c: core: Make comment about I2C table requirement to reflect the code
        i2c: designware: Fix standard mode speed when configuring the slave mode
        i2c: designware: Fix oops from i2c_dw_irq_handler_slave
        i2c: designware: Fix system suspend