07 Jun, 2017 (40 commits)
    • xfs: fix up quotacheck buffer list error handling · bf162426
      Brian Foster authored
      commit 20e8a063 upstream.
      
      The quotacheck error handling of the delwri buffer list assumes the
      resident buffers are locked and doesn't clear the _XBF_DELWRI_Q flag
      on the buffers that are dequeued. This can lead to assert failures
      on buffer release and possibly other locking problems.
      
      Move this code to a delwri queue cancel helper function to
      encapsulate the logic required to properly release buffers from a
      delwri queue. Update the helper to clear the delwri queue flag and
      call it from quotacheck.
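
      A minimal user-space sketch of the cancel-helper pattern described above
      (toy list and flag names, not the XFS code): walk the queue, clear the
      "queued" flag while the buffer is still held, then release it:

        #include <stdio.h>

        #define BUF_ON_DELWRI_Q 0x1   /* hypothetical "on the delwri queue" flag */

        struct buf {
                struct buf *next;
                unsigned int flags;
        };

        /* Drop every buffer from the list and clear its queued state. */
        static void delwri_cancel(struct buf **list)
        {
                while (*list) {
                        struct buf *bp = *list;

                        *list = bp->next;
                        bp->flags &= ~BUF_ON_DELWRI_Q;  /* clear before release */
                        bp->next = NULL;
                        /* real code would also unlock/release the buffer here */
                }
        }

        int main(void)
        {
                struct buf a = { NULL, BUF_ON_DELWRI_Q };
                struct buf *list = &a;

                delwri_cancel(&list);
                printf("flags after cancel: %#x\n", a.flags);  /* prints 0 */
                return 0;
        }
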
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: prevent multi-fsb dir readahead from reading random blocks · 89aab40e
      Brian Foster authored
      commit cb52ee33 upstream.
      
      Directory block readahead uses a complex iteration mechanism to map
      between high-level directory blocks and underlying physical extents.
      This mechanism attempts to traverse the higher-level dir blocks in a
      manner that handles multi-fsb directory blocks and simultaneously
      maintains a reference to the corresponding physical blocks.
      
      This logic doesn't handle certain (discontiguous) physical extent
      layouts correctly with multi-fsb directory blocks. For example,
      consider the case of a 4k FSB filesystem with a 2 FSB (8k) directory
      block size and a directory with the following extent layout:
      
       EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
         0: [0..7]:          88..95            0 (88..95)             8
         1: [8..15]:         80..87            0 (80..87)             8
         2: [16..39]:        168..191          0 (168..191)          24
         3: [40..63]:        5242952..5242975  1 (72..95)            24
      
      Directory block 0 spans physical extents 0 and 1, dirblk 1 lies
      entirely within extent 2 and dirblk 2 spans extents 2 and 3. Because
      extent 2 is larger than the directory block size, the readahead code
      erroneously assumes the block is contiguous and issues a readahead
      based on the physical mapping of the first fsb of the dirblk. This
      results in read verifier failure and a spurious corruption or crc
      failure, depending on the filesystem format.
      
      Further, the subsequent readahead code responsible for walking
      through the physical table doesn't correctly advance the physical
      block reference for dirblk 2. Instead of advancing two physical
      filesystem blocks, the first iteration of the loop advances 1 block
      (correctly), but the subsequent iteration advances 2 more physical
      blocks because the next physical extent (extent 3, above) happens to
      cover more than dirblk 2. At this point, the higher-level directory
      block walking is completely off the rails of the actual physical
      layout of the directory for the respective mapping table.
      
      Update the contiguous dirblock logic to consider the current offset
      in the physical extent to avoid issuing directory readahead to
      unrelated blocks. Also, update the mapping table advancing code to
      consider the current offset within the current dirblock to avoid
      advancing the mapping reference too far beyond the dirblock.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: handle array index overrun in xfs_dir2_leaf_readbuf() · 91bb4f7d
      Eric Sandeen authored
      commit 023cc840 upstream.
      
      Carlos had a case where "find" seemed to start spinning
      forever and never return.
      
      This was on a filesystem with non-default multi-fsb (8k)
      directory blocks, and a fragmented directory with extents
      like this:
      
      0:[0,133646,2,0]
      1:[2,195888,1,0]
      2:[3,195890,1,0]
      3:[4,195892,1,0]
      4:[5,195894,1,0]
      5:[6,195896,1,0]
      6:[7,195898,1,0]
      7:[8,195900,1,0]
      8:[9,195902,1,0]
      9:[10,195908,1,0]
      10:[11,195910,1,0]
      11:[12,195912,1,0]
      12:[13,195914,1,0]
      ...
      
      i.e. the first extent is a contiguous 2-fsb dir block, but
      after that it is fragmented into 1 block extents.
      
      At the top of the readdir path, we allocate a mapping array
      which (for this filesystem geometry) can hold 10 extents; see
      the assignment to map_info->map_size.  During readdir, we are
      therefore able to map extents 0 through 9 above into the array
      for readahead purposes.  If we count by 2, we see that the last
      mapped index (9) is the first block of a 2-fsb directory block.
      
      At the end of xfs_dir2_leaf_readbuf() we have 2 loops to fill
      more readahead; the outer loop assumes one full dir block is
      processed each loop iteration, and an inner loop that ensures
      that this is so by advancing to the next extent until a full
      directory block is mapped.
      
      The problem is that this inner loop may step past the last
      extent in the mapping array as it tries to reach the end of
      the directory block.  This will read garbage for the extent
      length, and as a result the loop control variable 'j' may
      become corrupted and never fail the loop conditional.
      
      The number of valid mappings we have in our array is stored
      in map->map_valid, so stop this inner loop based on that limit.
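
      A tiny sketch of the bounded inner loop (hypothetical names): stop at the
      number of valid mappings instead of trusting the extent lengths to land
      exactly on a directory-block boundary:

        #include <stdio.h>

        int main(void)
        {
                int lengths[] = { 1 };       /* fsbs in the only valid mapping left */
                int map_valid = 1;           /* entries actually filled in the array */
                int fsbs_per_dirblk = 2;     /* 8k dir blocks on a 4k fsb filesystem */
                int j = 0, mapped = 0;

                /* without the "j < map_valid" bound this would read past the array */
                while (j < map_valid && mapped < fsbs_per_dirblk)
                        mapped += lengths[j++];

                printf("stopped safely at j=%d with %d of %d fsbs mapped\n",
                       j, mapped, fsbs_per_dirblk);
                return 0;
        }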
      
      There is an ASSERT at the top of the outer loop for this
      same condition, but we never made it out of the inner loop,
      so the ASSERT never fired.
      
      Huge appreciation to Carlos for debugging and isolating
      the problem.
      Debugged-and-analyzed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Bill O'Donnell <billodo@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: fix integer truncation in xfs_bmap_remap_alloc · 23da04dc
      Christoph Hellwig authored
      commit 52813fb1 upstream.
      
      bno should be an xfs_fsblock_t, which is 64 bits wide, instead of an
      xfs_agblock_t, which truncates the value to 32 bits.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: drop iolock from reclaim context to appease lockdep · eff615a6
      Brian Foster authored
      commit 3b4683c2 upstream.
      
      Lockdep complains about use of the iolock in inode reclaim context
      because it doesn't understand that reclaim has the last reference to
      the inode, and thus an iolock->reclaim->iolock deadlock is not
      possible.
      
      The iolock is technically not necessary in xfs_inactive() and was
      only added to appease an assert in xfs_free_eofblocks(), which can
      be called from other non-reclaim contexts. Therefore, just kill the
      assert and drop the use of the iolock from reclaim context to quiet
      lockdep.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: actually report xattr extents via iomap · 8a1f7858
      Darrick J. Wong authored
      commit 84358536 upstream.
      
      Apparently FIEMAP for xattrs has been broken since we switched to
      the iomap backend because of an incorrect check for xattr presence.
      Also fix the broken locking.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: fix over-copying of getbmap parameters from userspace · 1c862f5a
      Darrick J. Wong authored
      commit be6324c0 upstream.
      
      In xfs_ioc_getbmap, we should only copy the fields of struct getbmap
      from userspace, or else we end up copying random stack contents into the
      kernel.  struct getbmap is a strict subset of getbmapx, so a partial
      structure copy should work fine.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: use dedicated log worker wq to avoid deadlock with cil wq · 2cd6bd86
      Brian Foster authored
      commit 696a5620 upstream.
      
      The log covering background task used to be part of the xfssyncd
      workqueue. That workqueue was removed as of commit 5889608d ("xfs:
      syncd workqueue is no more") and the associated work item scheduled
      to the xfs-log wq. The latter is used for log buffer I/O completion.
      
      Since xfs_log_worker() can invoke a log flush, a deadlock is
      possible between the xfs-log and xfs-cil workqueues. Consider the
      following codepath from xfs_log_worker():
      
      xfs_log_worker()
        xfs_log_force()
          _xfs_log_force()
            xlog_cil_force()
              xlog_cil_force_lsn()
                xlog_cil_push_now()
                  flush_work()
      
      The above is in xfs-log wq context and blocked waiting on the
      completion of an xfs-cil work item. Concurrently, the cil push in
      progress can end up blocked here:
      
      xlog_cil_push_work()
        xlog_cil_push()
          xlog_write()
            xlog_state_get_iclog_space()
              xlog_wait(&log->l_flush_wait, ...)
      
      The above is in xfs-cil context waiting on log buffer I/O
      completion, which executes in xfs-log wq context. In this scenario
      both workqueues are deadlocked waiting on each other.
      
      Add a new workqueue specifically for the high level log covering and
      ail pushing worker, as was the case prior to commit 5889608d.
      Diagnosed-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff() · a19348b7
      Eryu Guan authored
      commit 8affebe1 upstream.
      
      xfs_find_get_desired_pgoff() is used to search for offset of hole or
      data in page range [index, end] (both inclusive), and the max number
      of pages to search should be at least one, if end == index.
      Otherwise the only page is missed and no hole or data is found,
      which is not correct.
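
      A small arithmetic sketch of the off-by-one: an inclusive range
      [index, end] covers end - index + 1 pages, so clamping with end - index
      alone asks for zero pages when end == index (the vector size below is
      just an illustrative stand-in):

        #include <stdio.h>

        #define VEC_SIZE 14                         /* stand-in for PAGEVEC_SIZE */
        #define MIN(a, b) ((a) < (b) ? (a) : (b))

        int main(void)
        {
                unsigned long index = 5, end = 5;   /* single-page range */

                unsigned long buggy = MIN(end - index, VEC_SIZE);
                unsigned long fixed = MIN(end - index + 1, VEC_SIZE);

                printf("buggy wants %lu pages, fixed wants %lu\n", buggy, fixed);
                return 0;
        }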
      
      When block size is smaller than page size, this can be demonstrated
      by preallocating a file with size smaller than page size and writing
      data to the last block. E.g. run this xfs_io command on a 1k block
      size XFS on x86_64 host.
      
        # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
        	    -c "seek -d 0" /mnt/xfs/testfile
        wrote 1024/1024 bytes at offset 2048
        1 KiB, 1 ops; 0.0000 sec (33.675 MiB/sec and 34482.7586 ops/sec)
        Whence  Result
        DATA    EOF
      
      Data at offset 2k was missed, and lseek(2) returned ENXIO.
      
      This was uncovered by generic/285 subtests 07 and 08 on a ppc64 host,
      where the page size is 64k, because a recent change to generic/285
      reduced the preallocated file size to smaller than 64k.
      Signed-off-by: Eryu Guan <eguan@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: use ->b_state to fix buffer I/O accounting release race · 0364c225
      Brian Foster authored
      commit 63db7c81 upstream.
      
      We've had user reports of unmount hangs in xfs_wait_buftarg() that
      analysis shows is due to btp->bt_io_count == -1. bt_io_count
      represents the count of in-flight asynchronous buffers and thus
      should always be >= 0. xfs_wait_buftarg() waits for this value to
      stabilize to zero in order to ensure that all untracked (with
      respect to the lru) buffers have completed I/O processing before
      unmount proceeds to tear down in-core data structures.
      
      The value of -1 implies an I/O accounting decrement race. Indeed,
      the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele()
      (where the buffer lock is no longer held) means that bp->b_flags can
      be updated from an unsafe context. While a user-level reproducer is
      currently not available, some intrusive hacks to run racing buffer
      lookups/ioacct/releases from multiple threads were used to
      successfully manufacture this problem.
      
      Existing callers do not expect to acquire the buffer lock from
      xfs_buf_rele(). Therefore, we can not safely update ->b_flags from
      this context. It turns out that we already have separate buffer
      state bits and associated serialization for dealing with buffer LRU
      state in the form of ->b_state and ->b_lock. Therefore, replace the
      _XBF_IN_FLIGHT flag with a ->b_state variant, update the I/O
      accounting wrappers appropriately and make sure they are used with
      the correct locking. This ensures that buffer in-flight state can be
      modified at buffer release time without racing with modifications
      from a buffer lock holder.
      
      Fixes: 9c7504aa ("xfs: track and serialize in-flight async buffers against unmount")
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Libor Pechacek <lpechacek@suse.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: Fix missed holes in SEEK_HOLE implementation · 34836549
      Jan Kara authored
      commit 5375023a upstream.
      
      XFS SEEK_HOLE implementation could miss a hole in an unwritten extent as
      can be seen by the following command:
      
      xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
             -c "seek -h 0" file
      wrote 57344/57344 bytes at offset 0
      56 KiB, 14 ops; 0.0000 sec (49.312 MiB/sec and 12623.9856 ops/sec)
      wrote 8192/8192 bytes at offset 131072
      8 KiB, 2 ops; 0.0000 sec (70.383 MiB/sec and 18018.0180 ops/sec)
      Whence	Result
      HOLE	139264
      
      Here we can see that the hole at offset 56k was just ignored by the
      SEEK_HOLE implementation. The bug is in xfs_find_get_desired_pgoff(), which does
      not properly detect the case when pages are not contiguous.
      
      Fix the problem by properly detecting when the found page has a larger
      offset than expected.
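
      A toy sketch of that gap check (made-up page indexes): while scanning the
      cached pages of a range, a returned page whose index is greater than the
      next index we expected means the intervening range is a hole:

        #include <stdio.h>

        int main(void)
        {
                long pages[] = { 0, 1, 2, 8, 9 };   /* page indexes found in the cache */
                int nr = (int)(sizeof(pages) / sizeof(pages[0]));
                long expect = 0;
                int i;

                for (i = 0; i < nr; i++) {
                        if (pages[i] > expect) {    /* pages are not contiguous */
                                printf("hole starts at page index %ld\n", expect);
                                return 0;
                        }
                        expect = pages[i] + 1;
                }
                printf("no hole found\n");
                return 0;
        }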
      
      Fixes: d126d43f
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • drm/gma500/psb: Actually use VBT mode when it is found · 4ec50822
      Patrik Jakobsson authored
      commit 82bc9a42 upstream.
      
      With LVDS we were incorrectly picking the pre-programmed mode instead of
      the preferred mode provided by VBT. Make sure we pick the VBT mode if
      one is provided. It is likely that the mode read-out code is still wrong
      but this patch fixes the immediate problem on most machines.
      
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=78562
      Signed-off-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170418114332.12183-1-patrik.r.jakobsson@gmail.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • slub/memcg: cure the brainless abuse of sysfs attributes · 74b4db84
      Thomas Gleixner authored
      commit 478fe303 upstream.
      
      memcg_propagate_slab_attrs() abuses the sysfs attribute file functions
      to propagate settings from the root kmem_cache to a newly created
      kmem_cache.  It does that with:
      
           attr->show(root, buf);
           attr->store(new, buf, strlen(buf));
      
      Aside from being lazy and absurd hackery, this is broken because it does
      not check the return value of the show() function.
      
      Some of the show() functions return 0 w/o touching the buffer.  That
      means in such a case the store function is called with the stale content
      of the previous show().  That causes nonsense like invoking
      kmem_cache_shrink() on a newly created kmem_cache.  In the worst case it
      would cause handing in an uninitialized buffer.
      
      This should be rewritten proper by adding a propagate() callback to
      those slub_attributes which must be propagated and avoid that insane
      conversion to and from ASCII, but that's too large for a hot fix.
      
      Check at least the return value of the show() function, so calling
      store() with stale content is prevented.
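
      A compact sketch of that check with generic callbacks (not the slub
      code): only call store() when show() actually produced output, otherwise
      whatever was left in the buffer from the previous attribute leaks into
      the new object:

        #include <stdio.h>

        struct attr {
                long (*show)(char *buf);
                void (*store)(const char *buf, size_t len);
        };

        /* a show() that returns 0 without touching the buffer */
        static long show_nothing(char *buf) { (void)buf; return 0; }
        static void store_val(const char *buf, size_t len)
        {
                printf("store: %.*s\n", (int)len, buf);
        }

        int main(void)
        {
                struct attr a = { show_nothing, store_val };
                char buf[64] = "stale value from the previous attribute";

                long len = a.show(buf);
                if (len > 0)            /* the fix: skip store() when show() was empty */
                        a.store(buf, (size_t)len);
                else
                        printf("skipped store, show() returned %ld\n", len);
                return 0;
        }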
      
      Steven said:
       "It can cause a deadlock with get_online_cpus() that has been uncovered
        by recent cpu hotplug and lockdep changes that Thomas and Peter have
        been doing.
      
           Possible unsafe locking scenario:
      
                 CPU0                    CPU1
                 ----                    ----
            lock(cpu_hotplug.lock);
                                         lock(slab_mutex);
                                         lock(cpu_hotplug.lock);
            lock(slab_mutex);
      
           *** DEADLOCK ***"
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705201244540.2255@nanos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ksm: prevent crash after write_protect_page fails · 6caa2db3
      Andrea Arcangeli authored
      commit a7306c34 upstream.
      
      "err" needs to be left set to -EFAULT if split_huge_page succeeds.
      Otherwise if "err" gets clobbered with zero and write_protect_page
      fails, try_to_merge_one_page() will succeed instead of returning -EFAULT
      and then try_to_merge_with_ksm_page() will continue thinking kpage is a
      PageKsm when in fact it's still an anonymous page.  Eventually it'll
      crash in page_add_anon_rmap.
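
      A minimal sketch of the clobbering pattern (stub helpers, illustrative
      only): assigning the helper's return value overwrites the -EFAULT
      default, so a later failure silently looks like success:

        #include <stdio.h>
        #include <errno.h>

        static int split_huge_page_stub(void)    { return 0; }   /* succeeds */
        static int write_protect_page_stub(void) { return -1; }  /* fails */

        int main(void)
        {
                int err = -EFAULT;

                /* buggy form: err = split_huge_page_stub(); clobbers -EFAULT with 0 */
                if (split_huge_page_stub())
                        goto out;       /* fixed form: err stays -EFAULT on success */

                if (write_protect_page_stub() == 0)
                        err = 0;        /* only a full success clears the error */
        out:
                printf("err = %d\n", err);      /* -14 (-EFAULT), not 0 */
                return 0;
        }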
      
      This has been reproduced on Fedora25 kernel but I can reproduce with
      upstream too.
      
      The bug was introduced by commit f765f540 ("ksm: prepare to new THP
      semantics"), which went into v4.5.
      
          page:fffff67546ce1cc0 count:4 mapcount:2 mapping:ffffa094551e36e1 index:0x7f0f46673
          flags: 0x2ffffc0004007c(referenced|uptodate|dirty|lru|active|swapbacked)
          page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
          page->mem_cgroup:ffffa09674bf0000
          ------------[ cut here ]------------
          kernel BUG at mm/rmap.c:1222!
          CPU: 1 PID: 76 Comm: ksmd Not tainted 4.9.3-200.fc25.x86_64 #1
          RIP: do_page_add_anon_rmap+0x1c4/0x240
          Call Trace:
            page_add_anon_rmap+0x18/0x20
            try_to_merge_with_ksm_page+0x50b/0x780
            ksm_scan_thread+0x1211/0x1410
            ? prepare_to_wait_event+0x100/0x100
            ? try_to_merge_with_ksm_page+0x780/0x780
            kthread+0xd9/0xf0
            ? kthread_park+0x60/0x60
            ret_from_fork+0x25/0x30
      
      Fixes: f765f540 ("ksm: prepare to new THP semantics")
      Link: http://lkml.kernel.org/r/20170513131040.21732-1-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Federico Simoncelli <fsimonce@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/boot: Use CROSS_COMPILE prefix for readelf · d6eaf7a4
      Rob Landley authored
      commit 37805787 upstream.
      
      The boot code Makefile contains a straight 'readelf' invocation. This
      causes build warnings in cross compile environments, when there is no
      unprefixed readelf accessible via $PATH.
      
      Add the missing $(CROSS_COMPILE) prefix.
      
      [ tglx: Rewrote changelog ]
      
      Fixes: 98f78525 ("x86/boot: Refuse to build with data relocations")
      Signed-off-by: Rob Landley <rob@landley.net>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Cc: "H.J. Lu" <hjl.tools@gmail.com>
      Link: http://lkml.kernel.org/r/ced18878-693a-9576-a024-113ef39a22c0@landley.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • RDMA/qib,hfi1: Fix MR reference count leak on write with immediate · f7e82ab3
      Mike Marciniszyn authored
      commit 1feb4006 upstream.
      
      The handling of IB_RDMA_WRITE_ONLY_WITH_IMMEDIATE will leak a memory
      reference when a buffer cannot be allocated for returning the immediate
      data.
      
      The issue is that the rkey validation has already occurred and the RNR
      nak fails to release the reference that was fruitlessly gotten.  The
      peer will send the identical single packet request when its RNR
      timer pops.
      
      The fix is to release the held reference prior to the rnr nak exit.
      This is the only sequence that requires both rkey validation and the
      buffer allocation on the same packet.
      Tested-by: Tadeusz Struk <tadeusz.struk@intel.com>
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • RDMA/srp: Fix NULL deref at srp_destroy_qp() · 68c98967
      Israel Rukshin authored
      commit 95c2ef50 upstream.
      
      If srp_init_qp() fails at srp_create_ch_ib() then ch->send_cq
      may be NULL.
      Calling ib_destroy_qp() directly is sufficient because
      no work requests were posted on the created qp.
      
      Fixes: 9294000d ("IB/srp: Drain the send queue before destroying a QP")
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Bart van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: consider memblock reservations for deferred memory initialization sizing · dbab0232
      Michal Hocko authored
      commit 864b9a39 upstream.
      
      We have seen an early OOM killer invocation on ppc64 systems with
      crashkernel=4096M:
      
      	kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0
      	kthreadd cpuset=/ mems_allowed=7
      	CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1
      	Call Trace:
      	  dump_stack+0xb0/0xf0 (unreliable)
      	  dump_header+0xb0/0x258
      	  out_of_memory+0x5f0/0x640
      	  __alloc_pages_nodemask+0xa8c/0xc80
      	  kmem_getpages+0x84/0x1a0
      	  fallback_alloc+0x2a4/0x320
      	  kmem_cache_alloc_node+0xc0/0x2e0
      	  copy_process.isra.25+0x260/0x1b30
      	  _do_fork+0x94/0x470
      	  kernel_thread+0x48/0x60
      	  kthreadd+0x264/0x330
      	  ret_from_kernel_thread+0x5c/0xa4
      
      	Mem-Info:
      	active_anon:0 inactive_anon:0 isolated_anon:0
      	 active_file:0 inactive_file:0 isolated_file:0
      	 unevictable:0 dirty:0 writeback:0 unstable:0
      	 slab_reclaimable:5 slab_unreclaimable:73
      	 mapped:0 shmem:0 pagetables:0 bounce:0
      	 free:0 free_pcp:0 free_cma:0
      	Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      	lowmem_reserve[]: 0 0 0 0
      	Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
      	0 total pagecache pages
      	0 pages in swap cache
      	Swap cache stats: add 0, delete 0, find 0/0
      	Free swap  = 0kB
      	Total swap = 0kB
      	819200 pages RAM
      	0 pages HighMem/MovableOnly
      	817481 pages reserved
      	0 pages cma reserved
      	0 pages hwpoisoned
      
      The reason is that the managed memory is too low (only 110MB) while the
      rest of the 50GB is still waiting for the deferred initialization to
      be done.  update_defer_init estimates the initial memory to initialize
      to at least 2GB, but it doesn't consider any memory allocated in that
      range.  In this particular case we've had
      
      	Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB)
      
      so the low 2GB is mostly depleted.
      
      Fix this by considering memblock allocations in the initial static
      initialization estimation.  Move the max_initialise to
      reset_deferred_meminit and implement a simple memblock_reserved_memory
      helper which iterates all reserved blocks and sums the size of all that
      start below the given address.  The cumulative size is then added on top
      of the initial estimation.  This is still not ideal because
      reset_deferred_meminit doesn't consider holes and so a reservation might
      be above the initial estimation, which we ignore, but let's make the
      logic simpler until we really need to handle more complicated cases.
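
      A small sketch of the helper idea on toy data (not the memblock API):
      sum the reserved regions that start below a given address and add that
      on top of the static estimate:

        #include <stdio.h>

        struct region { unsigned long long base, size; };

        static unsigned long long reserved_below(const struct region *r, int nr,
                                                 unsigned long long addr)
        {
                unsigned long long sum = 0;
                int i;

                for (i = 0; i < nr; i++)
                        if (r[i].base < addr)
                                sum += r[i].size;
                return sum;
        }

        int main(void)
        {
                /* e.g. a 4096MB crashkernel reservation starting at 128MB */
                struct region res[] = { { 128ULL << 20, 4096ULL << 20 } };
                unsigned long long estimate = 2048ULL << 20;    /* initial 2GB */

                estimate += reserved_below(res, 1, estimate);
                printf("initialise %llu MB up front\n", estimate >> 20);
                return 0;
        }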
      
      Fixes: 3a80a7fa ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
      Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified · 1cc89263
      James Morse authored
      commit 9a291a7c upstream.
      
      KVM uses get_user_pages() to resolve its stage2 faults.  KVM sets the
      FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
      finds a VM_FAULT_HWPOISON.  KVM handles these hwpoison pages as a
      special case.  (check_user_page_hwpoison())
      
      When huge pages are involved, this doesn't work so well.
      get_user_pages() calls follow_hugetlb_page(), which stops early if it
      receives VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning
      -EFAULT to the caller.  The step to map this to -EHWPOISON based on the
      FOLL_ flags is missing.  The hwpoison special case is skipped, and
      -EFAULT is returned to user-space, causing Qemu or kvmtool to exit.
      
      Instead, move this VM_FAULT_ to errno mapping code into a header file
      and use it from faultin_page() and follow_hugetlb_page().
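
      A hedged sketch of such a mapping helper (the flag bits below are
      stand-ins, not the kernel's values): the poison case only becomes
      -EHWPOISON when the caller asked for it, FOLL_HWPOISON-style:

        #include <stdio.h>
        #include <errno.h>

        #ifndef EHWPOISON
        #define EHWPOISON 133           /* Linux "memory page has hardware error" */
        #endif

        #define FAULT_OOM       0x1     /* illustrative fault-code bits */
        #define FAULT_HWPOISON  0x2
        #define FAULT_SIGBUS    0x4
        #define WANT_HWPOISON   0x1     /* caller opted in, like FOLL_HWPOISON */

        static int fault_to_errno(unsigned int fault, unsigned int flags)
        {
                if (fault & FAULT_OOM)
                        return -ENOMEM;
                if (fault & FAULT_HWPOISON)
                        return (flags & WANT_HWPOISON) ? -EHWPOISON : -EFAULT;
                if (fault & FAULT_SIGBUS)
                        return -EFAULT;
                return 0;
        }

        int main(void)
        {
                printf("%d\n", fault_to_errno(FAULT_HWPOISON, WANT_HWPOISON));
                printf("%d\n", fault_to_errno(FAULT_HWPOISON, 0));
                return 0;
        }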
      
      With this, KVM works as expected.
      
      This isn't a problem for arm64 today as we haven't enabled
      MEMORY_FAILURE, but I can't see any reason this doesn't happen on x86
      too, so I think this should be a fix.  This doesn't apply earlier than
      stable's v4.11.1 due to all sorts of cleanup.
      
      [james.morse@arm.com: add vm_fault_to_errno() call to faultin_page()]
      suggested.
        Link: http://lkml.kernel.org/r/20170525171035.16359-1-james.morse@arm.com
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170524160900.28786-1-james.morse@arm.com
      Signed-off-by: James Morse <james.morse@arm.com>
      Acked-by: Punit Agrawal <punit.agrawal@arm.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mlock: fix mlock count can not decrease in race condition · f814bf46
      Yisheng Xie authored
      commit 70feee0e upstream.
      
      Kefeng reported that when running the following test, the mlock count in
      meminfo will increase permanently:
      
       [1] testcase
       linux:~ # cat test_mlockal
       grep Mlocked /proc/meminfo
        for j in `seq 0 10`
        do
       	for i in `seq 4 15`
       	do
       		./p_mlockall >> log &
       	done
       	sleep 0.2
       done
       # wait some time to let mlock counter decrease and 5s may not enough
       sleep 5
       grep Mlocked /proc/meminfo
      
       linux:~ # cat p_mlockall.c
       #include <sys/mman.h>
       #include <stdlib.h>
       #include <stdio.h>
      
       #define SPACE_LEN	4096
      
       int main(int argc, char ** argv)
       {
      	 	int ret;
      	 	void *adr = malloc(SPACE_LEN);
      	 	if (!adr)
      	 		return -1;
      
      	 	ret = mlockall(MCL_CURRENT | MCL_FUTURE);
      	 	printf("mlcokall ret = %d\n", ret);
      
      	 	ret = munlockall();
      	 	printf("munlcokall ret = %d\n", ret);
      
      	 	free(adr);
      	 	return 0;
      	 }
      
      In __munlock_pagevec() we should decrement NR_MLOCK for each page where
      we clear the PageMlocked flag.  Commit 1ebb7cc6 ("mm: munlock: batch
      NR_MLOCK zone state updates") has introduced a bug where we don't
      decrement NR_MLOCK for pages where we clear the flag, but fail to
      isolate them from the lru list (e.g.  when the pages are on some other
      cpu's percpu pagevec).  Since PageMlocked stays cleared, the NR_MLOCK
      accounting gets permanently disrupted by this.
      
      Fix it by counting the number of pages whose PageMlocked flag is cleared.
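
      A toy sketch of the accounting difference (made-up data): decrement by
      every page whose flag was cleared, not only by the pages that could also
      be isolated from the lru:

        #include <stdio.h>

        int main(void)
        {
                int on_other_cpu_pagevec[] = { 0, 1, 0, 0 };  /* 1 = isolation fails */
                int cleared = 0, isolated = 0, i;

                for (i = 0; i < 4; i++) {
                        cleared++;                /* PageMlocked was cleared here */
                        if (!on_other_cpu_pagevec[i])
                                isolated++;       /* only these made it off the lru */
                }
                /* buggy: NR_MLOCK -= isolated;  fixed: account every cleared page */
                printf("decrement NR_MLOCK by %d, not %d\n", cleared, isolated);
                return 0;
        }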
      
      Fixes: 1ebb7cc6 ("mm: munlock: batch NR_MLOCK zone state updates")
      Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhongjiang <zhongjiang@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/migrate: fix refcount handling when !hugepage_migration_supported() · a0189db3
      Punit Agrawal authored
      commit 30809f55 upstream.
      
      On failing to migrate a page, soft_offline_huge_page() performs the
      necessary update to the hugepage ref-count.
      
      But when !hugepage_migration_supported(), unmap_and_move_hugepage()
      also decrements the page ref-count for the hugepage.  The combined
      behaviour leaves the ref-count in an inconsistent state.
      
      This leads to soft lockups when running the overcommitted hugepage test
      from mce-tests suite.
      
        Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
        soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head)
        INFO: rcu_preempt detected stalls on CPUs/tasks:
         Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
          (detected by 7, t=5254 jiffies, g=963, c=962, q=321)
          thugetlb_overco R  running task        0  2715   2685 0x00000008
          Call trace:
            dump_backtrace+0x0/0x268
            show_stack+0x24/0x30
            sched_show_task+0x134/0x180
            rcu_print_detail_task_stall_rnp+0x54/0x7c
            rcu_check_callbacks+0xa74/0xb08
            update_process_times+0x34/0x60
            tick_sched_handle.isra.7+0x38/0x70
            tick_sched_timer+0x4c/0x98
            __hrtimer_run_queues+0xc0/0x300
            hrtimer_interrupt+0xac/0x228
            arch_timer_handler_phys+0x3c/0x50
            handle_percpu_devid_irq+0x8c/0x290
            generic_handle_irq+0x34/0x50
            __handle_domain_irq+0x68/0xc0
            gic_handle_irq+0x5c/0xb0
      
      Address this by changing the putback_active_hugepage() in
      soft_offline_huge_page() to putback_movable_pages().
      
      This only triggers on systems that enable memory failure handling
      (ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration
      (!ARCH_ENABLE_HUGEPAGE_MIGRATION).
      
      I imagine this wasn't triggered as there aren't many systems running
      this configuration.
      
      [akpm@linux-foundation.org: remove dead comment, per Naoya]
      Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.com
      Reported-by: Manoj Iyer <manoj.iyer@canonical.com>
      Tested-by: Manoj Iyer <manoj.iyer@canonical.com>
      Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dax: fix race between colliding PMD & PTE entries · bdaac1fe
      Ross Zwisler authored
      commit e2093926 upstream.
      
      We currently have two related PMD vs PTE races in the DAX code.  These
      can both be easily triggered by having two threads reading and writing
      simultaneously to the same private mapping, with the key being that
      private mapping reads can be handled with PMDs but private mapping
      writes are always handled with PTEs so that we can COW.
      
      Here is the first race:
      
        CPU 0					CPU 1
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
          handle_pte_fault()
            passes check for pmd_devmap()
      
      					(private mapping read)
      					__handle_mm_fault()
      					  create_huge_pmd()
      					    dax_iomap_pmd_fault() inserts PMD
      
            dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
            			  installed in our page tables at this spot.
      
      Here's the second race:
      
        CPU 0					CPU 1
      
        (private mapping read)
        __handle_mm_fault()
          passes check for pmd_none()
          create_huge_pmd()
            dax_iomap_pmd_fault() inserts PMD
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
      					(private mapping read)
      					__handle_mm_fault()
      					  passes check for pmd_none()
      					  create_huge_pmd()
      
          handle_pte_fault()
            dax_iomap_pte_fault() inserts PTE
      					    dax_iomap_pmd_fault() inserts PMD,
      					       but we already have a PTE at
      					       this spot.
      
      The core of the issue is that while there is isolation between faults to
      the same range in the DAX fault handlers via our DAX entry locking,
      there is no isolation between faults in the code in mm/memory.c.  This
      means for instance that this code in __handle_mm_fault() can run:
      
      	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
      		ret = create_huge_pmd(&vmf);
      
      But by the time we actually get to run the fault handler called by
      create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
      fault has installed a normal PMD here as a parent.  This is the cause of
      the 2nd race.  The first race is similar - there is the following check
      in handle_pte_fault():
      
      	} else {
      		/* See comment in pte_alloc_one_map() */
      		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
      			return 0;
      
      So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
      will bail and retry the fault.  This is correct, but there is nothing
      preventing the PMD from being installed after this check but before we
      actually get to the DAX PTE fault handlers.
      
      In my testing these races result in the following types of errors:
      
        BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
        BUG: non-zero nr_ptes on freeing mm: 15
      
      Fix this issue by having the DAX fault handlers verify that it is safe
      to continue their fault after they have taken an entry lock to block
      other racing faults.
      
      [ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
        Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pawel Lebioda <pawel.lebioda@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Xiong Zhou <xzhou@redhat.com>
      Cc: Eryu Guan <eguan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: avoid spurious 'bad pmd' warning messages · b16e6ab5
      Ross Zwisler authored
      commit d0f0931d upstream.
      
      When the pmd_devmap() checks were added by 5c7fb56e ("mm, dax:
      dax-pmd vs thp-pmd vs hugetlbfs-pmd") to add better support for DAX huge
      pages, they were all added to the end of if() statements after existing
      pmd_trans_huge() checks.  So, things like:
      
        -       if (pmd_trans_huge(*pmd))
        +       if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
      
      When further checks were added after pmd_trans_unstable() checks by
      commit 7267ec00 ("mm: postpone page table allocation until we have
      page to map") they were also added at the end of the conditional:
      
        +       if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
      
      This ordering is fine for pmd_trans_huge(), but doesn't work for
      pmd_trans_unstable().  This is because DAX huge pages trip the bad_pmd()
      check inside of pmd_none_or_trans_huge_or_clear_bad() (called by
      pmd_trans_unstable()), which prints out a warning and returns 1.  So, we
      do end up doing the right thing, but only after spamming dmesg with
      suspicious looking messages:
      
        mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)
      
      Reorder these checks in a helper so that pmd_devmap() is checked first,
      avoiding the error messages, and add a comment explaining why the
      ordering is important.
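
      A tiny sketch of why the ordering matters (stub predicates, not the mm
      code): with short-circuit evaluation, putting the devmap-style check
      first means the noisy "unstable" check never sees the huge entry:

        #include <stdio.h>
        #include <stdbool.h>

        static bool is_devmap(void) { return true; }    /* a DAX huge entry */
        static bool is_trans_unstable(void)
        {
                printf("spurious 'bad pmd' warning would be printed here\n");
                return true;
        }

        int main(void)
        {
                /* buggy order: is_trans_unstable() || is_devmap() warns first  */
                /* fixed order: the devmap check short-circuits the noisy check */
                if (is_devmap() || is_trans_unstable())
                        printf("fall back to pte handling, no warning\n");
                return 0;
        }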
      
      Fixes: 7267ec00 ("mm: postpone page table allocation until we have page to map")
      Link: http://lkml.kernel.org/r/20170522215749.23516-1-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Pawel Lebioda <pawel.lebioda@intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Xiong Zhou <xzhou@redhat.com>
      Cc: Eryu Guan <eguan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once · de12c73f
      Tetsuo Handa authored
      commit c288983d upstream.
      
      Roman Gushchin has reported that the OOM killer can trivially select the
      next OOM victim when a thread doing memory allocation from the page fault
      path was selected as the first OOM victim.
      
          allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           oom_kill_process+0x219/0x3e0
           out_of_memory+0x11d/0x480
           __alloc_pages_slowpath+0xc84/0xd40
           __alloc_pages_nodemask+0x245/0x260
           alloc_pages_vma+0xa2/0x270
           __handle_mm_fault+0xca9/0x10c0
           handle_mm_fault+0xf3/0x210
           __do_page_fault+0x240/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
          Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
          allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           __alloc_pages_slowpath+0xd32/0xd40
           __alloc_pages_nodemask+0x245/0x260
           alloc_pages_vma+0xa2/0x270
           __handle_mm_fault+0xca9/0x10c0
           handle_mm_fault+0xf3/0x210
           __do_page_fault+0x240/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
          ...
          allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           oom_kill_process+0x219/0x3e0
           out_of_memory+0x11d/0x480
           pagefault_out_of_memory+0x68/0x80
           mm_fault_error+0x8f/0x190
           ? handle_mm_fault+0xf3/0x210
           __do_page_fault+0x4b2/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
          Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
      
      There is a race window in which the OOM reaper completes reclaiming the
      first victim's memory while nothing but mutex_trylock() prevents the
      first victim from calling out_of_memory() from pagefault_out_of_memory()
      after memory allocation for the page fault path failed due to being
      selected as an OOM victim.
      
      This is a side effect of commit 9a67f648 ("mm: consolidate
      GFP_NOFAIL checks in the allocator slowpath") because that commit
      silently changed the behavior from
      
          /* Avoid allocations with no watermarks from looping endlessly */
      
      to
      
          /*
           * Give up allocations without trying memory reserves if selected
           * as an OOM victim
           */
      
      in __alloc_pages_slowpath() by moving the location to check TIF_MEMDIE
      flag.  I have noticed this change but I didn't post a patch because I
      thought it is an acceptable change other than noise by warn_alloc()
      because !__GFP_NOFAIL allocations are allowed to fail.  But we
      overlooked that failing memory allocation from page fault path makes
      difference due to the race window explained above.
      
      While it might be possible to add a check to pagefault_out_of_memory()
      that prevents the first victim from calling out_of_memory() or remove
      out_of_memory() from pagefault_out_of_memory(), changing
      pagefault_out_of_memory() does not suppress noise by warn_alloc() when
      allocating thread was selected as an OOM victim.  There is little point
      with printing similar backtraces and memory information from both
      out_of_memory() and warn_alloc().
      
      Instead, if we guarantee that current thread can try allocations with no
      watermarks once when current thread looping inside
      __alloc_pages_slowpath() was selected as an OOM victim, we can follow "who
      can use memory reserves" rules and suppress noise by warn_alloc() and
      prevent memory allocations from page fault path from calling
      pagefault_out_of_memory().
      
      If we take the comment literally, this patch would do
      
        -    if (test_thread_flag(TIF_MEMDIE))
        -        goto nopage;
        +    if (alloc_flags == ALLOC_NO_WATERMARKS || (gfp_mask & __GFP_NOMEMALLOC))
        +        goto nopage;
      
      because gfp_pfmemalloc_allowed() returns false if __GFP_NOMEMALLOC is
      given.  But if I recall correctly (I couldn't find the message), the
      condition is meant to apply to only OOM victims despite the comment.
      Therefore, this patch preserves TIF_MEMDIE check.
      
      Fixes: 9a67f648 ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath")
      Link: http://lkml.kernel.org/r/201705192112.IAF69238.OQOHSJLFOFFMtV@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Roman Gushchin <guro@fb.com>
      Tested-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ALSA: usb: Fix a typo in Tascam US-16x08 mixer element · af03bb0c
      Takashi Iwai authored
      commit 617163fc upstream.
      
      A mixer element created in a quirk for Tascam US-16x08 contains a
      typo: it should be "EQ MidLow Q" instead of "EQ MidQLow Q".
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195875
      Fixes: d2bb390a ("ALSA: usb-audio: Tascam US-16x08 DSP mixer quirk")
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Revert "ALSA: usb-audio: purge needless variable length array" · 0be9a9a4
      Takashi Iwai authored
      commit 64188cfb upstream.
      
      This reverts commit 89b593c3 ("ALSA: usb-audio: purge needless
      variable length array").  The patch turned out to cause a severe
      regression, triggering an Oops at snd_usb_ctl_msg().  It was overlooked
      that snd_usb_ctl_msg() writes back the response to the given buffer,
      while the patch changed it to a read-only const buffer.  (One should
      always double-check when an extra pointer cast is present...)
      
      As a simple fix, just revert the affected commit.  It was merely a
      cleanup.  Although it brings VLA again, it's clearer as a fix.  We'll
      address the VLA later in another patch.
      
      Fixes: 89b593c3 ("ALSA: usb-audio: purge needless variable length array")
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195875
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ALSA: hda - apply STAC_9200_DELL_M22 quirk for Dell Latitude D430 · ffb97b00
      Alexander Tsoy authored
      commit 1fc2e41f upstream.
      
      This model is actually called 92XXM2-8 in Windows driver. But since pin
      configs for M22 and M28 are identical, just reuse M22 quirk.
      
      Fixes external microphone (tested) and probably docking station ports
      (not tested).
      Signed-off-by: Alexander Tsoy <alexander@tsoy.me>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ALSA: hda - No loopback on ALC299 codec · 0c4afdc6
      Takashi Iwai authored
      commit fa16b69f upstream.
      
      ALC299 has no loopback mixer, but the driver still tries to add a beep
      control over the mixer NID which leads to the error at accessing it.
      This patch fixes it by properly declaring mixer_nid=0 for this codec.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195775
      Fixes: 28f1f9b2 ("ALSA: hda/realtek - Add new codec ID ALC299")
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • pcmcia: remove left-over %Z format · d5bc54d0
      Nicolas Iooss authored
      commit ff5a2016 upstream.
      
      Commit 5b5e0928 ("lib/vsprintf.c: remove %Z support") removed some
      usages of format %Z but forgot "%.2Zx".  This makes clang 4.0 report a
      -Wformat-extra-args warning because it does not know about %Z.
      
      Replace %Z with %z.
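
      For reference, a one-line sketch of the standard form: C99 spells the
      size_t length modifier 'z', so the old "%.2Zx" becomes "%.2zx":

        #include <stdio.h>

        int main(void)
        {
                size_t val = 0xabU;

                printf("%.2zx\n", val);   /* standard C99 size_t conversion */
                return 0;
        }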
      
      Link: http://lkml.kernel.org/r/20170520090946.22562-1-nicolas.iooss_linux@m4x.org
      Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: Harald Welte <laforge@gnumonks.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • drm/radeon: Unbreak HPD handling for r600+ · 26097709
      Lyude authored
      commit 3d18e337 upstream.
      
      We end up reading the interrupt register for HPD5, and then writing it
      to HPD6 which on systems without anything using HPD5 results in
      permanently disabling hotplug on one of the display outputs after the
      first time we acknowledge a hotplug interrupt from the GPU.
      
      This code is really bad. But for now, let's just fix this. I will
      hopefully have a large patch series to refactor all of this soon.
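      
      A rough sketch of the copy/paste pattern being fixed (simplified; the
      register and bit names are only illustrative of the r600+ HPD ack path):
      
          /* the status read for HPD5 must be acknowledged on HPD5 as well;
           * writing it to HPD6 clobbers (and can disable) the other pin */
          tmp = RREG32(DC_HPD5_INT_CONTROL);
          tmp |= DC_HPDx_INT_ACK;
          WREG32(DC_HPD5_INT_CONTROL, tmp);  /* was: WREG32(DC_HPD6_INT_CONTROL, tmp) */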
      Reviewed-by: default avatarChristian König <christian.koenig@amd.com>
      Signed-off-by: default avatarLyude <lyude@redhat.com>
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      26097709
    • Alex Deucher's avatar
      drm/radeon/ci: disable mclk switching for high refresh rates (v2) · 6b693bbf
      Alex Deucher authored
      commit 58d7e3e4 upstream.
      
      Even if the vblank period would allow it, it still seems to
      be problematic on some cards.
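      
      A minimal sketch of the kind of check this adds (the helper and the
      120 Hz cutoff are shown for illustration):
      
          /* keep mclk fixed on high refresh rate displays even if the
           * vblank period would be long enough for a switch */
          if (r600_dpm_get_vrefresh(rdev) > 120)
                  disable_mclk_switching = true;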
      
      v2: fix logic inversion (Nils)
      
      bug: https://bugs.freedesktop.org/show_bug.cgi?id=96868
      Acked-by: default avatarChristian König <christian.koenig@amd.com>
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6b693bbf
    • Alex Deucher's avatar
      drm/amd/powerplay/smu7: disable mclk switching for high refresh rates · d6ba1a44
      Alex Deucher authored
      commit 2275a3a2 upstream.
      
      Even if the vblank period would allow it, it still seems to
      be problematic on some cards.
      
      bug: https://bugs.freedesktop.org/show_bug.cgi?id=96868
      Acked-by: default avatarChristian König <christian.koenig@amd.com>
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d6ba1a44
    • Alex Deucher's avatar
      drm/amd/powerplay/smu7: add vblank check for mclk switching (v2) · 662dbfcc
      Alex Deucher authored
      commit 09be4a52 upstream.
      
      Check to make sure the vblank period is long enough to support
      mclk switching.
      
      v2: drop needless initial assignment (Nils)
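      
      A minimal sketch of the check described (the latency constant and
      variable names are illustrative):
      
          /* only allow a memory clock switch when the vblank period gives
           * the memory enough time to retrain */
          if (vblank_time_us < MCLK_SWITCH_TIME_US)
                  disable_mclk_switching = true;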
      
      bug: https://bugs.freedesktop.org/show_bug.cgi?id=96868
      Acked-by: default avatarChristian König <christian.koenig@amd.com>
      Reviewed-by: default avatarRex Zhu <Rex.Zhu@amd.com>
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      662dbfcc
    • Ming Lei's avatar
      nvme: avoid to use blk_mq_abort_requeue_list() · a8aa8a0c
      Ming Lei authored
      commit 986f75c8 upstream.
      
      NVMe may simply add a request to the requeue list without kicking off
      the requeue work if the hw queues are stopped. blk_mq_abort_requeue_list()
      is then called in both nvme_kill_queues() and nvme_ns_remove() to deal
      with this situation.
      
      Unfortunately blk_mq_abort_requeue_list() is a race maker: for example,
      a request may be requeued while the list is being aborted. So this patch
      just calls blk_mq_kick_requeue_list() in nvme_kill_queues() to handle
      the issue the same way nvme_start_queues() does. Now all requests left
      on the requeue list while the queues are stopped will be handled by
      blk_mq_kick_requeue_list() when the queues are restarted, either in
      nvme_start_queues() or in nvme_kill_queues().
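      
      A minimal sketch of the replacement in nvme_kill_queues() (assumed
      shape, not the exact upstream diff):
      
          /* was: blk_mq_abort_requeue_list(ns->queue);  -- racy */
          /* kick the requeue work instead, so parked requests are dispatched
           * and failed through the normal path once the queues run again */
          blk_mq_kick_requeue_list(ns->queue);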
      Reported-by: default avatarZhang Yi <yizhan@redhat.com>
      Reviewed-by: default avatarKeith Busch <keith.busch@intel.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a8aa8a0c
    • Ming Lei's avatar
      nvme: use blk_mq_start_hw_queues() in nvme_kill_queues() · 20c03f45
      Ming Lei authored
      commit 806f026f upstream.
      
      Inside nvme_kill_queues(), we have to start the hw queues to drain the
      requests sitting in the sw queues, the .dispatch list and the requeue
      list, so use blk_mq_start_hw_queues() instead of
      blk_mq_start_stopped_hw_queues(), which only runs queues that are
      stopped; the queues may already have been started, for example by
      nvme_start_queues() in the reset work function.
      
      blk_mq_start_hw_queues() runs the hw queues in the current context,
      instead of running them asynchronously as before. Given that
      nvme_kill_queues() is run from either the remove context or the reset
      worker context, both are fine for running the hw queues directly. And
      holding namespaces_mutex isn't a problem either, because
      nvme_start_freeze() already runs the hw queues this way.
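      
      A minimal sketch of the substitution (assumed, not the exact upstream
      hunk):
      
          /* was: blk_mq_start_stopped_hw_queues(ns->queue, true); */
          /* run the hw queues unconditionally so the sw queues, the
           * .dispatch list and the requeue list are all drained, even if
           * the queues were never marked stopped */
          blk_mq_start_hw_queues(ns->queue);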
      Reported-by: default avatarZhang Yi <yizhan@redhat.com>
      Reviewed-by: default avatarKeith Busch <keith.busch@intel.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      20c03f45
    • Marta Rybczynska's avatar
      nvme-rdma: support devices with queue size < 32 · 0fe9c551
      Marta Rybczynska authored
      commit 0544f549 upstream.
      
      In the case of a small NVMe-oF queue size (<32) we may enter a deadlock
      caused by the fact that IB send completions are only signaled once every
      32 requests, so the send queue fills up before any of them are reaped.
      
      The error is seen as (using mlx5):
      [ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
      [ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12
      
      This patch changes the way the signaling is done so that it now depends
      on the queue depth. The magic define has been removed completely.
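      
      A minimal sketch of depth-dependent signaling (the helper-free form
      below and the field names are illustrative):
      
          /* signal a send completion at least once per half queue depth so
           * send WRs are reclaimed even on queues smaller than 32 */
          if ((++queue->sig_count % max(queue->queue_size / 2, 1)) == 0)
                  wr.send_flags |= IB_SEND_SIGNALED;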
      Signed-off-by: default avatarMarta Rybczynska <marta.rybczynska@kalray.eu>
      Signed-off-by: default avatarSamuel Jones <sjones@kalray.eu>
      Acked-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0fe9c551
    • Jason Gerecke's avatar
      HID: wacom: Have wacom_tpc_irq guard against possible NULL dereference · f88d3d6e
      Jason Gerecke authored
      commit 2ac97f0f upstream.
      
      The following Smatch complaint was generated in response to commit
      2a6cdbdd ("HID: wacom: Introduce new 'touch_input' device"):
      
          drivers/hid/wacom_wac.c:1586 wacom_tpc_irq()
                   error: we previously assumed 'wacom->touch_input' could be null (see line 1577)
      
      The 'touch_input' and 'pen_input' variables point to the 'struct input_dev'
      used for relaying touch and pen events to userspace, respectively. If a
      device does not have a touch interface or pen interface, the associated
      input variable is NULL. The 'wacom_tpc_irq()' function is responsible for
      forwarding input reports to a more-specific IRQ handler function. An
      unknown report could theoretically be mistaken for e.g. a touch report
      on a device which does not have a touch interface. This can be prevented
      by only calling the pen/touch functions when the pen/touch pointers are
      valid.
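      
      A minimal sketch of the guard described (the looks_like_* checks are
      hypothetical placeholders; the NULL tests are the point of the fix):
      
          /* only dispatch to a handler whose input device actually exists */
          if (wacom->touch_input && looks_like_touch_report(data))   /* hypothetical check */
                  return wacom_tpc_single_touch(wacom, len);
          if (wacom->pen_input && looks_like_pen_report(data))       /* hypothetical check */
                  return wacom_tpc_pen(wacom);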
      
      Fixes: 2a6cdbdd ("HID: wacom: Introduce new 'touch_input' device")
      Signed-off-by: default avatarJason Gerecke <jason.gerecke@wacom.com>
      Reviewed-by: default avatarPing Cheng <ping.cheng@wacom.com>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f88d3d6e
    • Bryant G. Ly's avatar
      ibmvscsis: Fix the incorrect req_lim_delta · 8d975ebd
      Bryant G. Ly authored
      commit 75dbf2d3 upstream.
      
      The current code is not correctly calculating the req_lim_delta.
      
      We want to make sure vscsi->credit is always incremented when
      we do not send a response for the scsi op. Thus, for the case where
      a task is successfully aborted, we need to make sure that
      vscsi->credit is incremented.
      
      v2 - Moves the original location of the vscsi->credit increment
      to a better spot. If we increment credit, the next response
      we send back will carry an increased req_lim_delta, but we probably
      shouldn't do that until the aborted cmd is actually released.
      Otherwise the client will think that it can send a new command, and
      we could find ourselves short of command elements. Not likely, but it
      could happen.
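      
      A minimal sketch of the accounting described (simplified; only the
      credit handling is shown):
      
          /* a command completed without a response being sent back
           * (e.g. a successfully aborted task): remember it */
          vscsi->credit += 1;
      
          /* when the next response is built, fold the credit in so the
           * client's request-limit window stays correct */
          rsp->req_lim_delta = cpu_to_be32(1 + vscsi->credit);
          vscsi->credit = 0;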
      
      This patch depends on both:
      commit 25e78531 ("ibmvscsis: Do not send aborted task response")
      commit 98883f1b ("ibmvscsis: Clear left-over abort_cmd pointers")
      Signed-off-by: default avatarBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Reviewed-by: default avatarMichael Cyr <mikecyr@linux.vnet.ibm.com>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8d975ebd
    • Bryant G. Ly's avatar
      ibmvscsis: Clear left-over abort_cmd pointers · e920be83
      Bryant G. Ly authored
      commit 98883f1b upstream.
      
      With the addition of the ibmvscsis->abort_cmd pointer in
      commit 25e78531 ("ibmvscsis: Do not send aborted task response"),
      make sure to explicitly NULL these pointers when clearing the
      DELAY_SEND flag.
      
      Do this for two cases: when getting a new ibmvscsis
      descriptor in ibmvscsis_get_free_cmd() and before posting
      the response completion in ibmvscsis_send_messages().
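      
      A minimal sketch of the pattern applied in both places (assumed, not
      the exact hunks):
      
          /* whenever DELAY_SEND is cleared, also drop the stale pointer so
           * a freed abort command can never be dereferenced later */
          cmd->flags &= ~DELAY_SEND;
          cmd->abort_cmd = NULL;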
      Signed-off-by: default avatarBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Reviewed-by: default avatarMichael Cyr <mikecyr@linux.vnet.ibm.com>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e920be83
    • Artem Savkov's avatar
      scsi: scsi_dh_rdac: Use ctlr directly in rdac_failover_get() · 1fb66c6a
      Artem Savkov authored
      commit 0648a07c upstream.
      
      rdac_failover_get() references struct rdac_controller as
      ctlr->ms_sdev->handler_data->ctlr for no apparent reason. Besides being
      inefficient, this also introduces a null-pointer dereference, because
      send_mode_select() sets ctlr->ms_sdev to NULL before calling
      rdac_failover_get(); a simplified sketch of the change is shown after
      the trace below:
      
      [   18.432550] device-mapper: multipath service-time: version 0.3.0 loaded
      [   18.436124] BUG: unable to handle kernel NULL pointer dereference at 0000000000000790
      [   18.436129] IP: send_mode_select+0xca/0x560
      [   18.436129] PGD 0
      [   18.436130] P4D 0
      [   18.436130]
      [   18.436132] Oops: 0000 [#1] SMP
      [   18.436133] Modules linked in: dm_service_time sd_mod dm_multipath amdkfd amd_iommu_v2 radeon(+) i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm qla2xxx drm serio_raw scsi_transport_fc bnx2 i2c_core dm_mirror dm_region_hash dm_log dm_mod
      [   18.436143] CPU: 4 PID: 443 Comm: kworker/u16:2 Not tainted 4.12.0-rc1.1.el7.test.x86_64 #1
      [   18.436144] Hardware name: IBM BladeCenter LS22 -[79013SG]-/Server Blade, BIOS -[L8E164AUS-1.07]- 05/25/2011
      [   18.436145] Workqueue: kmpath_rdacd send_mode_select
      [   18.436146] task: ffff880225116a40 task.stack: ffffc90002bd8000
      [   18.436148] RIP: 0010:send_mode_select+0xca/0x560
      [   18.436148] RSP: 0018:ffffc90002bdbda8 EFLAGS: 00010246
      [   18.436149] RAX: 0000000000000000 RBX: ffffc90002bdbe08 RCX: ffff88017ef04a80
      [   18.436150] RDX: ffffc90002bdbe08 RSI: ffff88017ef04a80 RDI: ffff8802248e4388
      [   18.436151] RBP: ffffc90002bdbe48 R08: 0000000000000000 R09: ffffffff81c104c0
      [   18.436151] R10: 00000000000001ff R11: 000000000000035a R12: ffffc90002bdbdd8
      [   18.436152] R13: ffff8802248e4390 R14: ffff880225152800 R15: ffff8802248e4400
      [   18.436153] FS:  0000000000000000(0000) GS:ffff880227d00000(0000) knlGS:0000000000000000
      [   18.436154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   18.436154] CR2: 0000000000000790 CR3: 000000042535b000 CR4: 00000000000006e0
      [   18.436155] Call Trace:
      [   18.436159]  ? rdac_activate+0x14e/0x150
      [   18.436161]  ? refcount_dec_and_test+0x11/0x20
      [   18.436162]  ? kobject_put+0x1c/0x50
      [   18.436165]  ? scsi_dh_activate+0x6f/0xd0
      [   18.436168]  process_one_work+0x149/0x360
      [   18.436170]  worker_thread+0x4d/0x3c0
      [   18.436172]  kthread+0x109/0x140
      [   18.436173]  ? rescuer_thread+0x380/0x380
      [   18.436174]  ? kthread_park+0x60/0x60
      [   18.436176]  ret_from_fork+0x2c/0x40
      [   18.436177] Code: 49 c7 46 20 00 00 00 00 4c 89 ef c6 07 00 0f 1f 40 00 45 31 ed c7 45 b0 05 00 00 00 44 89 6d b4 4d 89 f5 4c 8b 75 a8 49 8b 45 20 <48> 8b b0 90 07 00 00 48 8b 56 10 8b 42 10 48 8d 7a 28 85 c0 0f
      [   18.436192] RIP: send_mode_select+0xca/0x560 RSP: ffffc90002bdbda8
      [   18.436192] CR2: 0000000000000790
      [   18.436198] ---[ end trace 40f3e4dca1ffabdd ]---
      [   18.436199] Kernel panic - not syncing: Fatal exception
      [   18.436222] Kernel Offset: disabled
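      
      A simplified sketch of the change (not the actual diff; the mode-select
      builder shown is a hypothetical stand-in for the real callee):
      
          /* before: struct rdac_controller re-derived via a pointer that
           * send_mode_select() has already cleared */
          struct rdac_dh_data *h = ctlr->ms_sdev->handler_data;   /* ms_sdev is NULL here */
          build_mode_select(h->ctlr);
      
          /* after: use the ctlr argument the caller already holds */
          build_mode_select(ctlr);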
      
      Fixes: 32782557 ("scsi_dh_rdac: switch to scsi_execute_req_flags()")
      Signed-off-by: default avatarArtem Savkov <asavkov@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1fb66c6a