1. 15 Nov, 2019 1 commit
    • futex: Prevent robust futex exit race · ca16d5be
      Yang Tao authored
      Robust futexes utilize the robust_list mechanism to allow the kernel to
      release futexes which are held when a task exits. The exit can be voluntary
      or caused by a signal or fault. This prevents waiters from blocking forever.
      
      The futex operations in user space store a pointer to the futex they are
      either locking or unlocking in the op_pending member of the per task robust
      list.
      
      After a lock operation has succeeded the futex is queued in the robust list
      linked list and the op_pending pointer is cleared.
      
      After an unlock operation has succeeded the futex is removed from the
      robust list linked list and the op_pending pointer is cleared.
      
      The robust list exit code checks for the pending operation and any futex
      which is queued in the linked list. It carefully checks whether the futex
      value is the TID of the exiting task. If so, it sets the OWNER_DIED bit and
      tries to wake up a potential waiter.
      
      This is race free for the lock operation but unlock has two race scenarios
      where waiters might not be woken up. These issues can be observed with
      regular robust pthread mutexes. PI aware pthread mutexes are not affected.
      
      (1) Unlocking task is killed after unlocking the futex value in user space
          before being able to wake a waiter.
      
              pthread_mutex_unlock()
                      |
                      V
              atomic_exchange_rel (&mutex->__data.__lock, 0)
                              <------------------------killed
                  lll_futex_wake ()                   |
                                                      |
                                                      |(__lock = 0)
                                                      |(enter kernel)
                                                      |
                                                      V
                                                  do_exit()
                                                  exit_mm()
                                                mm_release()
                                              exit_robust_list()
                                              handle_futex_death()
                                                      |
                                                      |(__lock = 0)
                                                      |(uval = 0)
                                                      |
                                                      V
              if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                      return 0;
      
          The sanity check which ensures that the user space futex is owned by
          the exiting task prevents the wakeup of waiters which in consequence
          block infinitely.
      
      (2) Waiting task is killed after a wakeup and before it can acquire the
          futex in user space.
      
              OWNER                         WAITER
                                         futex_wait()
         pthread_mutex_unlock()               |
                      |                       |
                      |(__lock = 0)           |
                      |                       |
                      V                       |
               futex_wake() ------------>  wakeup()
                                              |
                                              |(return to userspace)
                                              |(__lock = 0)
                                              |
                                              V
                              oldval = mutex->__data.__lock
                                                <-----------------killed
          atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,  |
                              id | assume_other_futex_waiters, 0)      |
                                                                       |
                                                                       |
                                                         (enter kernel)|
                                                                       |
                                                                       V
                                                               do_exit()
                                                              |
                                                              |
                                                              V
                                              handle_futex_death()
                                              |
                                              |(__lock = 0)
                                              |(uval = 0)
                                              |
                                              V
              if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                      return 0;
      
          The sanity check which ensures that the user space futex is owned
          by the exiting task prevents the wakeup of waiters, which seems to
          be correct as the exiting task does not own the futex value, but
          the consequence is that other waiters won't be woken up and block
          infinitely.
      
      In both scenarios the following conditions are true:
      
         - task->robust_list->list_op_pending != NULL
         - user space futex value == 0
         - Regular futex (not PI)
      
      If these conditions are met then it is reasonably safe to wake up a
      potential waiter in order to prevent the above problems.
      
      As this might be a false positive it can cause spurious wakeups, but the
      waiter side has to handle other types of unrelated wakeups, e.g. signals,
      gracefully anyway. So such a spurious wakeup will not affect the
      correctness of these operations.
      
      This workaround must not touch the user space futex value and cannot set
      the OWNER_DIED bit because the lock value is 0, i.e. uncontended. Setting
      OWNER_DIED in this case would result in inconsistent state and subsequently
      in malfunction of the owner died handling in user space.
      
      The rest of the user space state is still consistent as no other task can
      observe the list_op_pending entry in the exiting task's robust list.
      
      The eventually woken up waiter will observe the uncontended lock value and
      take it over.
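
      The decision described above can be sketched in user-space C. This is
      an illustration under stated assumptions, not the kernel's
      handle_futex_death(): the enum, the function names and the boolean
      parameters are hypothetical, while the FUTEX_* constants use the values
      from the kernel's uapi futex header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit layout of the futex word, as in the kernel's uapi <linux/futex.h>. */
#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

/* Hypothetical result type for this sketch; not a kernel definition. */
enum death_action {
    DEATH_NONE,            /* futex is not owned by the exiting task */
    DEATH_WAKE_ONLY,       /* wake one waiter, leave the futex word untouched */
    DEATH_MARK_OWNER_DIED  /* set OWNER_DIED and wake a waiter if WAITERS is set */
};

/* Classify one robust-list entry of an exiting task. */
static enum death_action classify_futex_death(uint32_t uval, uint32_t exiting_tid,
                                              bool pi, bool pending_op)
{
    assert((exiting_tid & ~FUTEX_TID_MASK) == 0);

    /* The three conditions from the changelog: pending op, value 0, not PI. */
    if (pending_op && !pi && uval == 0)
        return DEATH_WAKE_ONLY;

    /* Sanity check: only act on a futex the exiting task actually owns. */
    if ((uval & FUTEX_TID_MASK) != exiting_tid)
        return DEATH_NONE;

    return DEATH_MARK_OWNER_DIED;
}

/* New futex word when the owner died: keep WAITERS, set OWNER_DIED, clear TID. */
static uint32_t owner_died_value(uint32_t uval)
{
    return (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
}
```

      In both race scenarios the call would be
      classify_futex_death(0, tid, false, true), which yields DEATH_WAKE_ONLY;
      without the pending-operation case the same input falls through to
      DEATH_NONE and the waiter is never woken.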
      
      [ tglx: Massaged changelog and comment. Made the return explicit and not
              depend on the subsequent check and added constants to hand into
              handle_futex_death() instead of plain numbers. Fixed a few coding
              style issues. ]
      
      Fixes: 0771dfef ("[PATCH] lightweight robust futexes: core")
      Signed-off-by: Yang Tao <yang.tao172@zte.com.cn>
      Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1573010582-35297-1-git-send-email-wang.yi59@zte.com.cn
      Link: https://lkml.kernel.org/r/20191106224555.943191378@linutronix.de
  2. 13 Nov, 2019 1 commit
  3. 29 Oct, 2019 2 commits
  4. 09 Oct, 2019 2 commits
  5. 08 Oct, 2019 6 commits
  6. 07 Oct, 2019 20 commits
    • Merge branch 'akpm' (patches from Andrew) · eda57a0e
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "The usual shower of hotfixes.
      
        Chris's memcg patches aren't actually fixes - they're mature but a few
        niggling review issues were late to arrive.
      
        The ocfs2 fixes are quite old - those took some time to get reviewer
        attention.
      
        Subsystems affected by this patch series: ocfs2, hotfixes, mm/memcg,
        mm/slab-generic"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
        mm, sl[ou]b: improve memory accounting
        mm, memcg: make scan aggression always exclude protection
        mm, memcg: make memory.emin the baseline for utilisation determination
        mm, memcg: proportional memory.{low,min} reclaim
        mm/vmpressure.c: fix a signedness bug in vmpressure_register_event()
        mm/page_alloc.c: fix a crash in free_pages_prepare()
        mm/z3fold.c: claim page in the beginning of free
        kernel/sysctl.c: do not override max_threads provided by userspace
        memcg: only record foreign writebacks with dirty pages when memcg is not disabled
        mm: fix -Wmissing-prototypes warnings
        writeback: fix use-after-free in finish_writeback_work()
        mm/memremap: drop unused SECTION_SIZE and SECTION_MASK
        panic: ensure preemption is disabled during panic()
        fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc()
        fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock()
        fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()
        ocfs2: clear zero in unaligned direct IO
    • mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two) · 59bb4798
      Vlastimil Babka authored
      In most configurations, kmalloc() happens to return naturally aligned
      (i.e.  aligned to the block size itself) blocks for power of two sizes.
      
      That means some kmalloc() users might unknowingly rely on that
      alignment, until stuff breaks when the kernel is built with e.g.
      CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned.  Then
      developers have to devise workarounds such as their own kmem caches with
      specified alignment [1], which is not always practical, as recently
      evidenced in [2].
      
      The topic has been discussed at LSF/MM 2019 [3].  Adding a
      'kmalloc_aligned()' variant would not help with code unknowingly relying
      on the implicit alignment.  For slab implementations it would either
      require creating more kmalloc caches, or allocate a larger size and only
      give back part of it.  That would be wasteful, especially with a generic
      alignment parameter (in contrast with a fixed alignment to size).
      
      Ideally we should provide to mm users what they need without difficult
      workarounds or own reimplementations, so let's make the kmalloc()
      alignment to size explicitly guaranteed for power-of-two sizes under all
      configurations.  What does this mean for the three available allocators?
      
      * SLAB object layout happens to be mostly unchanged by the patch.  The
        implicitly provided alignment could be compromised with
        CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
        caches with alignment larger than unsigned long long.  Practically on at
        least x86 this includes kmalloc caches as they use cache line alignment,
        which is larger than that.  Still, this patch ensures alignment on all
        arches and cache sizes.
      
      * SLUB layout is also unchanged unless redzoning is enabled through
        CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
        With this patch, explicit alignment is guaranteed with redzoning as
        well.  This will result in more memory being wasted, but that should be
        acceptable in a debugging scenario.
      
      * SLOB has no implicit alignment so this patch adds it explicitly for
        kmalloc().  The potential downside is increased fragmentation.  While
        pathological allocation scenarios are certainly possible, in my testing,
        after booting an x86_64 kernel+userspace with virtme, around 16MB memory
        was consumed by slab pages both before and after the patch, with
        difference in the noise.
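
      As a rough illustration of what "natural alignment" means for callers,
      the check below tests whether a block of a power-of-two size starts on a
      multiple of that size. It is a user-space sketch: both helper names are
      hypothetical and nothing here is part of the slab allocators.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True for 1, 2, 4, 8, ... and false for 0 and every other value. */
static bool is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/*
 * True if ptr is aligned to the block size itself. Only meaningful for
 * power-of-two sizes, which is the case the guarantee covers.
 */
static bool is_naturally_aligned(const void *ptr, size_t size)
{
    assert(is_power_of_two(size));
    return ((uintptr_t)ptr & (size - 1)) == 0;
}
```

      For example, a 4096-byte block at address 0x1000 passes the check,
      while the same block at 0x1008 does not, although 0x1008 would still be
      naturally aligned for an 8-byte block.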
      
      [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
      [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
      [3] https://lwn.net/Articles/787740/
      
      [akpm@linux-foundation.org: documentation fixlet, per Matthew]
      Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, sl[ou]b: improve memory accounting · 6a486c0a
      Vlastimil Babka authored
      Patch series "guarantee natural alignment for kmalloc()", v2.
      
      This patch (of 2):
      
      SLOB currently doesn't account its pages at all, so in /proc/meminfo the
      Slab field shows zero.  Modifying a counter on page allocation and
      freeing should be acceptable even for the small system scenarios SLOB is
      intended for.  Since reclaimable caches are not separated in SLOB,
      account everything as unreclaimable.
      
      SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
      larger than order-1 page, that are passed directly to the page
      allocator.  As they also don't appear in /proc/slabinfo, it might look
      like a memory leak.  For consistency, account them as well.  (SLAB
      doesn't actually use page allocator directly, so no change there).
      
      Ideally SLOB and SLUB would be handled in separate patches, but due to
      the shared kmalloc_order() function and different kfree()
      implementations, it's easier to patch both at once to prevent
      inconsistencies.
      
      Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: make scan aggression always exclude protection · 1bc63fb1
      Chris Down authored
      This patch is an incremental improvement on the existing
      memory.{low,min} relative reclaim work to base its scan pressure
      calculations on how much protection is available compared to the current
      usage, rather than how much the current usage is over some protection
      threshold.
      
      This change doesn't change the experience for the user in the normal
      case too much.  One benefit is that it replaces the (somewhat arbitrary)
      100% cutoff with an indefinite slope, which makes it easier to ballpark
      a memory.low value.
      
      As well as this, the old methodology doesn't quite apply generically to
      machines with varying amounts of physical memory.  Let's say we have a
      top level cgroup, workload.slice, and another top level cgroup,
      system-management.slice.  We want to roughly give 12G to
      system-management.slice, so on a 32GB machine we set memory.low to 20GB
      in workload.slice, and on a 64GB machine we set memory.low to 52GB.
      However, because these are relative amounts to the total machine size,
      while the amount of memory we want to generally be willing to yield to
      system-management.slice is absolute (12G), we end up putting more pressure on
      system-management.slice just because we have a larger machine and a larger workload
      to fill it, which seems fairly unintuitive.  With this new behaviour, we
      don't end up with this unintended side effect.
      
      Previously, memory.low protection worked such that if you were
      50% over a certain baseline, you got 50% of your normal scan pressure.
      This is certainly better than the previous cliff-edge behaviour, but it
      can be improved even further by always considering memory under the
      currently enforced protection threshold to be out of bounds.  This means
      that we can set relatively low memory.low thresholds for variable or
      bursty workloads while still getting a reasonable level of protection,
      whereas with the previous version we may still trivially hit the 100%
      clamp.  The previous 100% clamp is also somewhat arbitrary, whereas this
      one is more concretely based on the currently enforced protection
      threshold, which is likely easier to reason about.
      
      There is also a subtle issue with the way that proportional reclaim
      worked previously -- it promotes having no memory.low, since it makes
      pressure higher during low reclaim.  This happens because we base our
      scan pressure modulation on how far memory.current is between memory.min
      and memory.low, but if memory.low is unset, we only use the overage
      method.  In most cromulent configurations, this then means that we end
      up with *more* pressure than with no memory.low at all when we're in low
      reclaim, which is not really very usable or expected.
      
      With this patch, memory.low and memory.min affect reclaim pressure in a
      more understandable and composable way.  For example, from a user
      standpoint, "protected" memory now remains untouchable from a reclaim
      aggression standpoint, and users can also have more confidence that
      bursty workloads will still receive some amount of guaranteed
      protection.
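
      One way to read "memory under the currently enforced protection
      threshold is out of bounds" is that only the unprotected share of the
      cgroup's usage contributes to scan pressure. The sketch below expresses
      that reading in user-space C. The function name and parameters are
      hypothetical, and any minimum scan batch a real implementation might
      enforce is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale the scan target of one LRU vector by the unprotected fraction of
 * the cgroup's usage: scan = lruvec_size * (usage - protection) / usage.
 * All arguments are in pages.
 */
static uint64_t scan_excluding_protection(uint64_t lruvec_size, uint64_t usage,
                                          uint64_t protection)
{
    /* No protection configured: everything is eligible. */
    if (protection == 0)
        return lruvec_size;

    /* Usage at or below the protection means nothing is eligible. */
    if (usage < protection)
        usage = protection;

    /* Guard the multiplication below against overflow. */
    assert(lruvec_size == 0 || protection <= UINT64_MAX / lruvec_size);

    return lruvec_size - lruvec_size * protection / usage;
}
```

      With a protection of 1000 pages and an LRU vector of 1000 pages, usage
      of 2000 pages gives a target of 500 and usage of 4000 pages gives 750.
      The pressure keeps rising with usage instead of hitting a fixed 100%
      clamp.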
      
      Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: make memory.emin the baseline for utilisation determination · 9de7ca46
      Chris Down authored
      Roman points out that when we do the low reclaim pass, we scale the
      reclaim pressure relative to position between 0 and the maximum
      protection threshold.
      
      However, if the maximum protection is based on memory.elow, and
      memory.emin is above zero, this means we still may get binary behaviour
      on second-pass low reclaim.  This is because we scale starting at 0, not
      starting at memory.emin, and since we don't scan at all below emin, we
      end up with cliff behaviour.
      
      This should be a fairly uncommon case since usually we don't go into the
      second pass, but it makes sense to scale our low reclaim pressure
      starting at emin.
      
      You can test this by catting two large sparse files, one in a cgroup
      with emin set to some moderate size compared to physical RAM, and
      another cgroup without any emin.  In both cgroups, set an elow larger
      than 50% of physical RAM.  The one with emin will have less page
      scanning, as reclaim pressure is lower.
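
      A minimal sketch of the idea, assuming the fix amounts to choosing
      which threshold the pressure calculation starts from: during the
      second (low reclaim) pass only emin is treated as protected, otherwise
      the larger of emin and elow. The function and parameter names are
      hypothetical and the kernel's actual helper may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pick the protection (in pages) that scan pressure is scaled against. */
static uint64_t effective_protection(uint64_t emin, uint64_t elow,
                                     bool in_low_reclaim)
{
    /* Second pass: elow is being overridden, so emin is the baseline. */
    if (in_low_reclaim)
        return emin;

    /* First pass: honour whichever threshold protects more memory. */
    return emin > elow ? emin : elow;
}

/* Small self-check showing the intended behaviour. */
static void effective_protection_examples(void)
{
    assert(effective_protection(100, 500, false) == 500);
    assert(effective_protection(100, 500, true) == 100);
    assert(effective_protection(0, 500, true) == 0);
}
```

      In the test setup described above, the cgroup with a non-zero emin
      keeps that amount out of bounds during low reclaim, which is consistent
      with it seeing less page scanning than the cgroup without emin.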
      
      Rebase on top of and apply the same idea as what was applied to handle
      cgroup_disable=memory properly for the original proportional patch
      http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm,
      memcg: Handle cgroup_disable=memory when getting memcg protection").
      
      Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Suggested-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: proportional memory.{low,min} reclaim · 9783aa99
      Chris Down authored
      cgroup v2 introduces two memory protection thresholds: memory.low
      (best-effort) and memory.min (hard protection).  While they generally do
      what they say on the tin, there is a limitation in their implementation
      that makes them difficult to use effectively: that cliff behaviour often
      manifests when they become eligible for reclaim.  This patch implements
      more intuitive and usable behaviour, where we gradually mount more
      reclaim pressure as cgroups further and further exceed their protection
      thresholds.
      
      This cliff edge behaviour happens because we only choose whether or not
      to reclaim based on whether the memcg is within its protection limits
      (see the use of mem_cgroup_protected in shrink_node), but we don't vary
      our reclaim behaviour based on this information.  Imagine the following
      timeline, with the numbers being the lruvec size in this zone:
      
      1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
      2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
      3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
         scanned. (?!)
      
      * Of course, we won't usually scan all available pages in the zone even
        without this patch because of scan control priority, over-reclaim
        protection, etc.  However, as shown by the tests at the end, these
        techniques don't sufficiently throttle such an extreme change in input,
        so cliff-like behaviour isn't really averted by their existence alone.
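
      The timeline can be reproduced with a small user-space model. The
      first function mirrors the binary behaviour shown above. The second is
      only one possible proportional ramp, in which pressure grows linearly
      with the overage and reaches 100% once usage is twice memory.low. Both
      function names are hypothetical and neither is the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

/* Binary behaviour: nothing is eligible until usage exceeds memory.low. */
static uint64_t binary_scan_eligible(uint64_t low, uint64_t usage)
{
    return usage > low ? usage : 0;
}

/* Proportional sketch: eligible pages scale with the overage above low. */
static uint64_t proportional_scan_eligible(uint64_t low, uint64_t usage)
{
    uint64_t overage;

    if (usage <= low)
        return 0;

    overage = usage - low;
    if (overage >= low)          /* at or beyond 2x low: full pressure */
        return usage;

    /* Guard the multiplication below against overflow. */
    assert(usage <= UINT64_MAX / overage);

    return usage * overage / low;
}
```

      For memory.low=1000000 and memory.current=1000001, the binary model
      makes 1000001 pages eligible while the proportional model makes 1 page
      eligible, which is the difference between a cliff and a ramp.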
      
      Here's an example of how this plays out in practice.  At Facebook, we are
      trying to protect various workloads from "system" software, like
      configuration management tools, metric collectors, etc (see this[0] case
      study).  In order to find a suitable memory.low value, we start by
      determining the expected memory range within which the workload will be
      comfortable operating.  This isn't an exact science -- memory usage deemed
      "comfortable" will vary over time due to user behaviour, differences in
      composition of work, etc, etc.  As such we need to ballpark memory.low,
      but doing this is currently problematic:
      
      1. If we end up setting it too low for the workload, it won't have
         *any* effect (see discussion above).  The group will receive the full
         weight of reclaim and won't have any priority while competing with the
         less important system software, as if we had no memory.low configured
         at all.
      
      2. Because of this behaviour, we end up erring on the side of setting
         it too high, such that the comfort range is reliably covered.  However,
         protected memory is completely unavailable to the rest of the system,
         so we might cause undue memory and IO pressure there when we *know* we
         have some elasticity in the workload.
      
      3. Even if we get the value totally right, smack in the middle of the
         comfort zone, we get extreme jumps between no pressure and full
         pressure that cause unpredictable pressure spikes in the workload due
         to the current binary reclaim behaviour.
      
      With this patch, we can set it to our ballpark estimation without too much
      worry.  Any undesirable behaviour, such as too much or too little reclaim
      pressure on the workload or system will be proportional to how far our
      estimation is off.  This means we can set memory.low much more
      conservatively and thus waste fewer resources *without* the risk of the
      workload falling off a cliff if we overshoot.
      
      As a more abstract technical description, this unintuitive behaviour
      results in having to give high-priority workloads a large protection
      buffer on top of their expected usage to function reliably, as otherwise
      we have abrupt periods of dramatically increased memory pressure which
      hamper performance.  Having to set these thresholds so high wastes
      resources and generally works against the principle of work conservation.
      In addition, having proportional memory reclaim behaviour has other
      benefits.  Most notably, before this patch it's basically mandatory to set
      memory.low to a higher than desirable value because otherwise as soon as
      you exceed memory.low, all protection is lost, and all pages are eligible
      to scan again.  By contrast, having a gradual ramp in reclaim pressure
      means that you now still get some protection when thresholds are exceeded,
      which means that one can now be more comfortable setting memory.low to
      lower values without worrying that all protection will be lost.  This is
      important because workingset size is really hard to know exactly,
      especially with variable workloads, so at least getting *some* protection
      if your workingset size grows larger than you expect increases user
      confidence in setting memory.low without a huge buffer on top being
      needed.
      
      Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
      assistance in thinking about how to make this work better.
      
      In testing these changes, I intended to verify that:
      
      1. Changes in page scanning become gradual and proportional instead of
         binary.
      
         To test this, I experimented stepping further and further down
         memory.low protection on a workload that floats around 19G workingset
         when under memory.low protection, watching page scan rates for the
         workload cgroup:
      
         +------------+-----------------+--------------------+--------------+
         | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
         +------------+-----------------+--------------------+--------------+
         |        21G |               0 |                  0 | N/A          |
         |        17G |             867 |               3799 | 23%          |
         |        12G |            1203 |               3543 | 34%          |
         |         8G |            2534 |               3979 | 64%          |
         |         4G |            3980 |               4147 | 96%          |
         |          0 |            3799 |               3980 | 95%          |
         +------------+-----------------+--------------------+--------------+
      
         As you can see, the test kernel (containing this patch) ramps up page
         scanning significantly more gradually than the control kernel (without
         this patch).
      
      2. More gradual ramp up in reclaim aggression doesn't result in
         premature OOMs.
      
         To test this, I wrote a script that slowly increments the number of
         pages held by stress(1)'s --vm-keep mode until a production system
         entered severe overall memory contention.  This script runs in a highly
         protected slice taking up the majority of available system memory.
         Watching vmstat revealed that page scanning continued essentially
         nominally between test and control, without causing forward reclaim
         progress to become arrested.
      
      [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
      
      [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
      [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
        Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
      Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmpressure.c: fix a signedness bug in vmpressure_register_event() · 518a8671
      Dan Carpenter authored
      The "mode" and "level" variables are enums and in this context GCC will
      treat them as unsigned ints so the error handling is never triggered.
      
      I also removed the bogus initializer because it isn't required any more
      and it's sort of confusing.
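
      The class of bug can be illustrated in user-space C. If the result of
      a lookup that returns a negative errno is stored directly in an
      enum-typed variable, a later "< 0" test on that variable may never be
      true when the compiler picks an unsigned underlying type. The sketch
      below shows the safe pattern of keeping the result in a signed int
      until it has been checked. The enum, the table and both function names
      are hypothetical, and lookup_level() is only a stand-in for a
      match_string()-style helper.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum pressure_level { LEVEL_LOW, LEVEL_MEDIUM, LEVEL_CRITICAL, LEVEL_COUNT };

static const char *const level_names[LEVEL_COUNT] = {
    "low", "medium", "critical"
};

/* Return the index of name in level_names, or -EINVAL if it is unknown. */
static int lookup_level(const char *name)
{
    for (size_t i = 0; i < LEVEL_COUNT; i++)
        if (strcmp(level_names[i], name) == 0)
            return (int)i;
    return -EINVAL;
}

/* Parse a level name; on success store it in *out and return 0. */
static int parse_level(const char *name, enum pressure_level *out)
{
    int ret = lookup_level(name);   /* signed, so the error check works */

    if (ret < 0)
        return ret;

    *out = (enum pressure_level)ret;
    return 0;
}
```

      Calling parse_level("bogus", &level) returns -EINVAL and leaves level
      untouched, whereas assigning the lookup result straight to level would
      have hidden the error.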
      
      [akpm@linux-foundation.org: reduce implicit and explicit typecasting]
      [akpm@linux-foundation.org: fix return value, add comment, per Matthew]
      Link: http://lkml.kernel.org/r/20190925110449.GO3264@mwanda
      Fixes: 3cadfa2b ("mm/vmpressure.c: convert to use match_string() helper")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Matthew Wilcox <willy@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Enrico Weigelt <info@metux.net>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: fix a crash in free_pages_prepare() · 234fdce8
      Qian Cai authored
      On architectures like s390, arch_free_page() could mark the page unused
      (set_page_unused()) and any access later would trigger a kernel panic.
      Fix it by moving arch_free_page() after all possible accessing calls.
      
       Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
       Krnl PSW : 0404e00180000000 0000000026c2b96e (__free_pages_ok+0x34e/0x5d8)
                  R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
       Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c 000000000000000f
                  000003d080012100 000003d080013fc0 0000000000000000 0000000000100000
                  00000000275cca48 0000000000000100 0000000000000008 000003d080010000
                  00000000000001d0 000003d000000000 0000000026c2b78a 000000002717fdb0
       Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
                  0000000026c2b962: e32014000036 pfd 2,1024(%r1)
                 #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
                 >0000000026c2b96e: 41101100  la %r1,256(%r1)
                  0000000026c2b972: a737fff8  brctg %r3,26c2b962
                  0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
                  0000000026c2b97c: e31003400004 lg %r1,832
                  0000000026c2b982: ebff1430016a asi 5168(%r1),-1
       Call Trace:
       __free_pages_ok+0x16a/0x5d8
       memblock_free_all+0x206/0x290
       mem_init+0x58/0x120
       start_kernel+0x2b0/0x570
       startup_continue+0x6a/0xc0
       INFO: lockdep is turned off.
       Last Breaking-Event-Address:
       __free_pages_ok+0x372/0x5d8
       Kernel panic - not syncing: Fatal exception: panic_on_oops
       00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 26A2379C
      
      In the past, only kernel_poison_pages() would trigger this but it needs
      "page_poison=on" kernel cmdline, and I suspect nobody tested that on
      s390.  Recently, kernel_init_free_pages() (commit 6471384a ("mm:
      security: introduce init_on_alloc=1 and init_on_free=1 boot options"))
      was added and could trigger this as well.
      
      [akpm@linux-foundation.org: add comment]
      Link: http://lkml.kernel.org/r/1569613623-16820-1-git-send-email-cai@lca.pw
      Fixes: 8823b1db ("mm/page_poison.c: enable PAGE_POISONING as a separate option")
      Fixes: 6471384a ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.3+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      234fdce8
    • Vitaly Wool's avatar
      mm/z3fold.c: claim page in the beginning of free · 5b6807de
      Vitaly Wool authored
      There's a really hard to reproduce race in z3fold between z3fold_free()
      and z3fold_reclaim_page().  z3fold_reclaim_page() can claim the page
      after z3fold_free() has checked if the page was claimed and
      z3fold_free() will then schedule this page for compaction which may in
      turn lead to random page faults (since that page would have been
      reclaimed by then).
      
      Fix that by claiming the page at the beginning of z3fold_free() and not
      forgetting to clear the claim at the end.
      
      [vitalywool@gmail.com: v2]
        Link: http://lkml.kernel.org/r/20190928113456.152742cf@bigdell
      Link: http://lkml.kernel.org/r/20190926104844.4f0c6efa1366b8f5741eaba9@gmail.com
      Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
      Reported-by: Markus Linnala <markus.linnala@gmail.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Henry Burns <henrywolfeburns@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Markus Linnala <markus.linnala@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b6807de
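      The "claim first, clear the claim at the end" idea from the z3fold
      commit above can be sketched in user space with a C11 atomic flag.
      This is an assumption-laden model, not the z3fold code: struct
      demo_page and both functions are hypothetical, and only the claim
      flag and the compaction decision are modeled.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page; only the claim flag is modeled. */
struct demo_page {
	atomic_flag claimed;          /* set while free or reclaim owns the page */
	bool queued_for_compaction;
};

/* Reclaim side: returns true if it managed to claim the page. */
static bool demo_reclaim_try_claim(struct demo_page *p)
{
	return !atomic_flag_test_and_set(&p->claimed);
}

/*
 * Free side: take the claim before doing anything else, so reclaim cannot
 * slip in between a "was it claimed?" check and the compaction request.
 * Returns true if compaction was scheduled.
 */
static bool demo_free(struct demo_page *p)
{
	if (atomic_flag_test_and_set(&p->claimed))
		return false;         /* reclaim owns the page: do not queue it */

	p->queued_for_compaction = true;
	atomic_flag_clear(&p->claimed);   /* do not forget to drop the claim */
	return true;
}
```

      Because the test and the set happen in one atomic step, there is no
      window in which reclaim can claim the page after the free path has
      already decided that it is unclaimed.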
    • Michal Hocko's avatar
      kernel/sysctl.c: do not override max_threads provided by userspace · b0f53dbc
      Michal Hocko authored
      Partially revert 16db3d3f ("kernel/sysctl.c: threads-max observe
      limits") because the patch is causing a regression to any workload which
      needs to override the auto-tuning of the limit provided by kernel.
      
      set_max_threads is implementing a boot time guesstimate to provide a
      sensible limit of the concurrently running threads so that runaways will
      not deplete all the memory.  This is a good thing in general but there
      are workloads which might need to increase this limit for an application
      to run (reportedly WebSphere MQ is affected) and that is simply not
      possible after the mentioned change.  It is also very dubious to
      override an admin decision by an estimation that doesn't have any direct
      relation to correctness of the kernel operation.
      
      Fix this by dropping set_max_threads from sysctl_max_threads so any
      value is accepted as long as it fits into MAX_THREADS which is important
      to check because allowing more threads could break internal robust futex
      restriction.  While at it, do not use MIN_THREADS as the lower boundary
      because it is also only a heuristic for automatic estimation, and an
      admin might have a good reason to stop new threads from being created
      even when below this limit.
      
      This became more severe when we switched x86 from 4k to 8k kernel
      stacks.  Since 6538b8ea ("x86_64: expand kernel stack to
      16K") (3.16) we use THREAD_SIZE_ORDER = 2 and that halved the auto-tuned
      value.
      
      In the particular case
      
        3.12
        kernel.threads-max = 515561
      
        4.4
        kernel.threads-max = 200000
      
      Neither of the two values is really insane on a 32GB machine.
      
      I am not sure we want/need to tune the max_thread value further.  If
      anything the tuning should be removed altogether if proven not useful in
      general.  But we definitely need a way to override this auto-tuning.
      
      Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz
      Fixes: 16db3d3f ("kernel/sysctl.c: threads-max observe limits")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0f53dbc
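      The acceptance rule described in the threads-max commit above (take
      whatever the admin writes, as long as it fits into MAX_THREADS) can be
      sketched as a plain range check.  This is a user-space illustration
      only: the names are hypothetical, the upper bound of 0x3fffffff is an
      assumption standing in for the kernel's futex-derived MAX_THREADS, and
      the lower bound of 1 is likewise an assumption.

```c
#include <errno.h>

/* Assumption: stands in for the kernel's MAX_THREADS (futex TID mask). */
#define DEMO_MAX_THREADS 0x3fffffff

/* Starts at the auto-tuned 4.4 value quoted in the commit message. */
static int demo_max_threads = 200000;

/*
 * Accept any admin-provided value inside [1, DEMO_MAX_THREADS] without
 * re-running a memory-based estimate on it. Out-of-range writes are
 * rejected and leave the current limit unchanged.
 */
static int demo_set_max_threads(long requested)
{
	if (requested < 1 || requested > DEMO_MAX_THREADS)
		return -EINVAL;
	demo_max_threads = (int)requested;
	return 0;
}
```

      With this rule the 3.12-era value of 515561 from the example can be
      restored by the admin even though the auto-tuning picked 200000.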
    • Baoquan He's avatar
      memcg: only record foreign writebacks with dirty pages when memcg is not disabled · 08d1d0e6
      Baoquan He authored
      In the kdump kernel, memcg is usually disabled with
      'cgroup_disable=memory' to save memory.  The kdump kernel now always
      panics when dumping the vmcore to a local disk:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000ab8
        Oops: 0000 [#1] SMP NOPTI
        CPU: 0 PID: 598 Comm: makedumpfile Not tainted 5.3.0+ #26
        Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 10/02/2018
        RIP: 0010:mem_cgroup_track_foreign_dirty_slowpath+0x38/0x140
        Call Trace:
         __set_page_dirty+0x52/0xc0
         iomap_set_page_dirty+0x50/0x90
         iomap_write_end+0x6e/0x270
         iomap_write_actor+0xce/0x170
         iomap_apply+0xba/0x11e
         iomap_file_buffered_write+0x62/0x90
         xfs_file_buffered_aio_write+0xca/0x320 [xfs]
         new_sync_write+0x12d/0x1d0
         vfs_write+0xa5/0x1a0
         ksys_write+0x59/0xd0
         do_syscall_64+0x59/0x1e0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      And this will corrupt the 1st kernel too with 'cgroup_disable=memory'.
      
      Via the trace and with debugging, it is pointing to commit 97b27821
      ("writeback, memcg: Implement foreign dirty flushing") which introduced
      this regression.  Disabling memcg causes the null pointer dereference at
      uninitialized data in function mem_cgroup_track_foreign_dirty_slowpath().
      
      Fix it by returning directly if memcg is disabled, instead of trying to
      record the foreign writebacks with dirty pages.
      
      Link: http://lkml.kernel.org/r/20190924141928.GD31919@MiWiFi-R3L-srv
      Fixes: 97b27821 ("writeback, memcg: Implement foreign dirty flushing")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08d1d0e6
    • Yi Wang's avatar
      mm: fix -Wmissing-prototypes warnings · 758b8db4
      Yi Wang authored
      We get two warnings when building the kernel with W=1:
      
        mm/shuffle.c:36:12: warning: no previous prototype for `shuffle_show' [-Wmissing-prototypes]
        mm/sparse.c:220:6: warning: no previous prototype for `subsection_mask_set' [-Wmissing-prototypes]
      
      Make the functions static to fix this.
      
      Link: http://lkml.kernel.org/r/1566978161-7293-1-git-send-email-wang.yi59@zte.com.cn
      Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      758b8db4
    • Tejun Heo's avatar
      writeback: fix use-after-free in finish_writeback_work() · 8e00c4e9
      Tejun Heo authored
      finish_writeback_work() reads @done->waitq after decrementing
      @done->cnt.  However, once @done->cnt reaches zero, @done may be freed
      (from stack) at any moment and @done->waitq can contain something
      unrelated by the time finish_writeback_work() tries to read it.  This
      led to the following crash.
      
        "BUG: kernel NULL pointer dereference, address: 0000000000000002"
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        PGD 0 P4D 0
        Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
        CPU: 40 PID: 555153 Comm: kworker/u98:50 Kdump: loaded Not tainted
        ...
        Workqueue: writeback wb_workfn (flush-btrfs-1)
        RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30
        Code: 48 89 d8 5b c3 e8 50 db 6b ff eb f4 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 9c 5b fa 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 05 48 89 d8 5b c3 89 c6 e8 fe ca 6b ff eb f2 66 90
        RSP: 0018:ffffc90049b27d98 EFLAGS: 00010046
        RAX: 0000000000000000 RBX: 0000000000000246 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000003 RDI: 0000000000000002
        RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff889fff407600 R11: ffff88ba9395d740 R12: 000000000000e300
        R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff88bfdfa00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000002 CR3: 0000000002409005 CR4: 00000000001606e0
        Call Trace:
         __wake_up_common_lock+0x63/0xc0
         wb_workfn+0xd2/0x3e0
         process_one_work+0x1f5/0x3f0
         worker_thread+0x2d/0x3d0
         kthread+0x111/0x130
         ret_from_fork+0x1f/0x30
      
      Fix it by reading and caching @done->waitq before decrementing
      @done->cnt.
      
      Link: http://lkml.kernel.org/r/20190924010631.GH2233839@devbig004.ftw2.facebook.com
      Fixes: 5b9cce4c ("writeback: Generalize and expose wb_completion")
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Debugged-by: Chris Mason <clm@fb.com>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e00c4e9
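      The ordering fix described in the writeback commit above (read and
      cache the wait queue pointer before dropping the count) can be
      sketched with C11 atomics.  The types and function names below are
      hypothetical user-space stand-ins, not the kernel's wb_completion
      API.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical wait queue: just counts how often it was woken. */
struct demo_waitq {
	int wakeups;
};

/* Hypothetical completion, typically living on the waiter's stack. */
struct demo_completion {
	atomic_int cnt;
	struct demo_waitq *waitq;
};

static void demo_wake_up_all(struct demo_waitq *wq)
{
	wq->wakeups++;
}

static void demo_finish_work(struct demo_completion *done)
{
	/* Cache the pointer while done is still guaranteed to be alive. */
	struct demo_waitq *waitq = done->waitq;

	/*
	 * Once the count reaches zero the waiter may return and reuse the
	 * stack slot, so done must not be dereferenced after this point.
	 */
	if (atomic_fetch_sub(&done->cnt, 1) == 1)
		demo_wake_up_all(waitq);
}
```

      The wake-up uses only the cached local, so nothing is read from the
      completion after the decrement that may release it.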
    • Anshuman Khandual's avatar
      mm/memremap: drop unused SECTION_SIZE and SECTION_MASK · 6d0e9849
      Anshuman Khandual authored
      The SECTION_SIZE and SECTION_MASK macros are not used anymore, but
      they conflict with existing definitions on the arm64 platform, causing
      the following warnings during build.  Let's drop these unused macros.
      
        mm/memremap.c:16: warning: "SECTION_MASK" redefined
         #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
        arch/arm64/include/asm/pgtable-hwdef.h:79: note: this is the location of the previous definition
         #define SECTION_MASK  (~(SECTION_SIZE-1))
      
        mm/memremap.c:17: warning: "SECTION_SIZE" redefined
         #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
        arch/arm64/include/asm/pgtable-hwdef.h:78: note: this is the location of the previous definition
         #define SECTION_SIZE  (_AC(1, UL) << SECTION_SHIFT)
      
      Link: http://lkml.kernel.org/r/1569312010-31313-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reported-by: kbuild test robot <lkp@intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d0e9849
    • Will Deacon's avatar
      panic: ensure preemption is disabled during panic() · 20bb759a
      Will Deacon authored
      Calling 'panic()' on a kernel with CONFIG_PREEMPT=y can leave the
      calling CPU in an infinite loop, but with interrupts and preemption
      enabled.  From this state, userspace can continue to be scheduled,
      despite the system being "dead" as far as the kernel is concerned.
      
      This is easily reproducible on arm64 when booting with "nosmp" on the
      command line; a couple of shell scripts print out a periodic "Ping"
      message whilst another triggers a crash by writing to
      /proc/sysrq-trigger:
      
        | sysrq: Trigger a crash
        | Kernel panic - not syncing: sysrq triggered crash
        | CPU: 0 PID: 1 Comm: init Not tainted 5.2.15 #1
        | Hardware name: linux,dummy-virt (DT)
        | Call trace:
        |  dump_backtrace+0x0/0x148
        |  show_stack+0x14/0x20
        |  dump_stack+0xa0/0xc4
        |  panic+0x140/0x32c
        |  sysrq_handle_reboot+0x0/0x20
        |  __handle_sysrq+0x124/0x190
        |  write_sysrq_trigger+0x64/0x88
        |  proc_reg_write+0x60/0xa8
        |  __vfs_write+0x18/0x40
        |  vfs_write+0xa4/0x1b8
        |  ksys_write+0x64/0xf0
        |  __arm64_sys_write+0x14/0x20
        |  el0_svc_common.constprop.0+0xb0/0x168
        |  el0_svc_handler+0x28/0x78
        |  el0_svc+0x8/0xc
        | Kernel Offset: disabled
        | CPU features: 0x0002,24002004
        | Memory Limit: none
        | ---[ end Kernel panic - not syncing: sysrq triggered crash ]---
        |  Ping 2!
        |  Ping 1!
        |  Ping 1!
        |  Ping 2!
      
      The issue can also be triggered on x86 kernels if CONFIG_SMP=n,
      otherwise local interrupts are disabled in 'smp_send_stop()'.
      
      Disable preemption in 'panic()' before re-enabling interrupts.
      
      Link: http://lkml.kernel.org/r/20191002123538.22609-1-will@kernel.org
      Link: https://lore.kernel.org/r/BX1W47JXPMR8.58IYW53H6M5N@dragonstone
      Signed-off-by: Will Deacon <will@kernel.org>
      Reported-by: Xogium <contact@xogium.me>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20bb759a
    • Jia-Ju Bai's avatar
      fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc() · 2abb7d3b
      Jia-Ju Bai authored
      In ocfs2_info_scan_inode_alloc(), there is an if statement on line 283
      to check whether inode_alloc is NULL:
      
          if (inode_alloc)
      
      When inode_alloc is NULL, it is used on line 287:
      
          ocfs2_inode_lock(inode_alloc, &bh, 0);
              ocfs2_inode_lock_full_nested(inode, ...)
                  struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
      
      Thus, a possible null-pointer dereference may occur.
      
      To fix this bug, inode_alloc is checked on line 286.
      
      This bug is found by a static analysis tool STCheck written by us.
      
      Link: http://lkml.kernel.org/r/20190726033717.32359-1-baijiaju1990@gmail.com
      Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2abb7d3b
    • Jia-Ju Bai's avatar
      fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock() · 583fee3e
      Jia-Ju Bai authored
      In ocfs2_write_end_nolock(), there are if statements on lines 1976,
      2047 and 2058 that check whether handle is NULL:
      
          if (handle)
      
      When handle is NULL, it is used on line 2045:
      
      	ocfs2_update_inode_fsync_trans(handle, inode, 1);
              oi->i_sync_tid = handle->h_transaction->t_tid;
      
      Thus, a possible null-pointer dereference may occur.
      
      To fix this bug, handle is checked before calling
      ocfs2_update_inode_fsync_trans().
      
      This bug is found by a static analysis tool STCheck written by us.
      
      Link: http://lkml.kernel.org/r/20190726033705.32307-1-baijiaju1990@gmail.com
      Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      583fee3e
    • Jia-Ju Bai's avatar
      fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry() · 56e94ea1
      Jia-Ju Bai authored
      In ocfs2_xa_prepare_entry(), there is an if statement on line 2136 to
      check whether loc->xl_entry is NULL:
      
          if (loc->xl_entry)
      
      When loc->xl_entry is NULL, it is used on line 2158:
      
          ocfs2_xa_add_entry(loc, name_hash);
              loc->xl_entry->xe_name_hash = cpu_to_le32(name_hash);
              loc->xl_entry->xe_name_offset = cpu_to_le16(loc->xl_size);
      
      and line 2164:
      
          ocfs2_xa_add_namevalue(loc, xi);
              loc->xl_entry->xe_value_size = cpu_to_le64(xi->xi_value_len);
              loc->xl_entry->xe_name_len = xi->xi_name_len;
      
      Thus, possible null-pointer dereferences may occur.
      
      To fix these bugs, if loc->xl_entry is NULL, ocfs2_xa_prepare_entry()
      returns early with -EINVAL.
      
      These bugs are found by a static analysis tool STCheck written by us.
      
      [akpm@linux-foundation.org: remove now-unused ocfs2_xa_add_entry()]
      Link: http://lkml.kernel.org/r/20190726101447.9153-1-baijiaju1990@gmail.com
      Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56e94ea1
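      The guard pattern used by the three ocfs2 null-pointer fixes above
      can be shown in a few lines.  The structures and the function are
      hypothetical and heavily simplified; the sketch only demonstrates
      returning -EINVAL before a pointer that may be NULL is dereferenced.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the xattr entry and location. */
struct demo_entry {
	uint32_t name_hash;
};

struct demo_loc {
	struct demo_entry *entry;   /* may legitimately be NULL */
};

/* Bail out with -EINVAL instead of writing through a NULL entry. */
static int demo_prepare_entry(struct demo_loc *loc, uint32_t name_hash)
{
	if (!loc->entry)
		return -EINVAL;

	loc->entry->name_hash = name_hash;
	return 0;
}
```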
    • Jia Guo's avatar
      ocfs2: clear zero in unaligned direct IO · 7a243c82
      Jia Guo authored
      The unused portion of a partially written fs-block-sized block is not
      zeroed in an unaligned append direct write.  This can lead to serious
      data inconsistencies.
      
      Ocfs2 manages the disk in clusters (for example, 1M).  A partial write
      within a cluster changes the cluster state from UN-WRITTEN to WRITTEN,
      and VFS (function dio_zero_block) doesn't do the zeroing because bh's
      state is not set to NEW in function ocfs2_dio_wr_get_block when we
      write a WRITTEN cluster.  For example, if the cluster size is 1M, the
      file size is 8k and we direct write from 14k to 15k, then 12k~14k and
      15k~16k will contain dirty data.
      
      We have to deal with two cases:
       1. The starting position of the direct write is outside the file.
       2. The starting position of the direct write is located in the file.
      
      We need to set bh's state to NEW in the first case.  In the second
      case, we need to map twice because bh's state for the area outside the
      file should be set to NEW while the area inside the file should not.
      
      [akpm@linux-foundation.org: coding style fixes]
      Link: http://lkml.kernel.org/r/5292e287-8f1a-fd4a-1a14-661e555e0bed@huawei.com
      Signed-off-by: Jia Guo <guojia12@huawei.com>
      Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a243c82
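      The numbers in the ocfs2 direct IO example above (write 14k to 15k,
      stale data in 12k~14k and 15k~16k) follow from rounding the write to
      fs-block boundaries.  The sketch below reproduces that arithmetic.
      It assumes a 4k fs block, which the commit message implies but does
      not state, and covers only the first case, where the write starts
      outside the file.  The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Half-open byte range [start, end). */
struct demo_range {
	uint64_t start;
	uint64_t end;
};

/*
 * For a write of len bytes at pos, compute the parts of the surrounding
 * fs blocks that the write itself does not cover and that therefore must
 * be zeroed: head lies before the write, tail lies after it.
 */
static void demo_zero_ranges(uint64_t pos, uint64_t len, uint64_t blocksize,
			     struct demo_range *head, struct demo_range *tail)
{
	uint64_t end = pos + len;
	uint64_t block_start = pos - pos % blocksize;
	uint64_t block_end = (end + blocksize - 1) / blocksize * blocksize;

	head->start = block_start;
	head->end = pos;
	tail->start = end;
	tail->end = block_end;
}
```

      For pos = 14k, len = 1k and a 4k block this yields a head of 12k~14k
      and a tail of 15k~16k, matching the ranges named in the commit
      message.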
    • Linus Torvalds's avatar
      uaccess: implement a proper unsafe_copy_to_user() and switch filldir over to it · c512c691
      Linus Torvalds authored
      In commit 9f79b78e ("Convert filldir[64]() from __put_user() to
      unsafe_put_user()") I made filldir() use unsafe_put_user(), which
      improves code generation on x86 enormously.
      
      But because we didn't have a "unsafe_copy_to_user()", the dirent name
      copy was also done by hand with unsafe_put_user() in a loop, and it
      turns out that a lot of other architectures didn't like that, because
      unlike x86, they have various alignment issues.
      
      Most non-x86 architectures trap and fix it up, and some (like xtensa)
      will just fail unaligned put_user() accesses unconditionally.  Which
      makes that "copy using put_user() in a loop" not work for them at all.
      
      I could make that code do explicit alignment etc, but the architectures
      that don't like unaligned accesses also don't really use the fancy
      "user_access_begin/end()" model, so they might just use the regular old
      __copy_to_user() interface.
      
      So this commit takes that looping implementation, turns it into the x86
      version of "unsafe_copy_to_user()", and makes other architectures
      implement the unsafe copy version as __copy_to_user() (the same way they
      do for the other unsafe_xyz() accessor functions).
      
      Note that it only does this for the copying _to_ user space, and we
      still don't have a unsafe version of copy_from_user().
      
      That's partly because we have no current users of it, but also partly
      because the copy_from_user() case is slightly different and cannot
      efficiently be implemented in terms of a unsafe_get_user() loop (because
      gcc can't do asm goto with outputs).
      
      It would be trivial to do this using "rep movsb", which would work
      really nicely on newer x86 cores, but really badly on some older ones.
      
      Al Viro is looking at cleaning up all our user copy routines to make
      this all a non-issue, but for now we have this simple-but-stupid version
      for x86 that works fine for the dirent name copy case because those
      names are short strings and we simply don't need anything fancier.
      
      Fixes: 9f79b78e ("Convert filldir[64]() from __put_user() to unsafe_put_user()")
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Reported-and-tested-by: Tony Luck <tony.luck@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c512c691
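      The "simple-but-stupid" looping copy that the unsafe_copy_to_user()
      commit above describes can be approximated in user space as a copy
      that moves the widest chunks first and then falls back to narrower
      ones.  This is only a sketch of the looping idea: the macro and
      function names are hypothetical, there is no user-access window or
      fault handling, and memcpy() on a temporary stands in for the
      per-chunk store so that the example has no alignment requirements.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy as many whole chunks of the given type as still fit into len. */
#define DEMO_COPY_LOOP(dst, src, len, type)             \
	while ((len) >= sizeof(type)) {                 \
		type tmp_;                              \
		memcpy(&tmp_, (src), sizeof(type));     \
		memcpy((dst), &tmp_, sizeof(type));     \
		(dst) += sizeof(type);                  \
		(src) += sizeof(type);                  \
		(len) -= sizeof(type);                  \
	}

/* Copy len bytes using 8-, 4-, 2- and finally 1-byte steps. */
static void demo_copy(void *to, const void *from, size_t len)
{
	unsigned char *dst = to;
	const unsigned char *src = from;

	DEMO_COPY_LOOP(dst, src, len, uint64_t)
	DEMO_COPY_LOOP(dst, src, len, uint32_t)
	DEMO_COPY_LOOP(dst, src, len, uint16_t)
	DEMO_COPY_LOOP(dst, src, len, uint8_t)
}
```

      A short name needs only a handful of iterations (a 14-byte copy is
      one 8-byte, one 4-byte and one 2-byte step), which is why nothing
      fancier is needed for the dirent name case.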
  7. 06 Oct, 2019 4 commits
    • Linus Torvalds's avatar
      Linux 5.4-rc2 · da0c9ea1
      Linus Torvalds authored
      da0c9ea1
    • Linus Torvalds's avatar
      elf: don't use MAP_FIXED_NOREPLACE for elf executable mappings · b212921b
      Linus Torvalds authored
      In commit 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map") we
      changed elf to use MAP_FIXED_NOREPLACE instead of MAP_FIXED for the
      executable mappings.
      
      Then, people reported that it broke some binaries that had overlapping
      segments from the same file, and commit ad55eac7 ("elf: enforce
      MAP_FIXED on overlaying elf segments") re-instated MAP_FIXED for some
      overlaying elf segment cases.  But only some - despite the summary line
      of that commit, it only did it when it also does a temporary brk vma for
      one obvious overlapping case.
      
      Now Russell King reports another overlapping case with old 32-bit x86
      binaries, which doesn't trigger that limited case.  End result: we had
      better just drop MAP_FIXED_NOREPLACE entirely, and go back to MAP_FIXED.
      
      Yes, it's a sign of old binaries generated with old tool-chains, but we
      do pride ourselves on not breaking existing setups.
      
      This still leaves MAP_FIXED_NOREPLACE in place for the load_elf_interp()
      and the old load_elf_library() use-cases, because nobody has reported
      breakage for those. Yet.
      
      Note that in all the cases seen so far, the overlapping elf sections
      seem to be just re-mapping of the same executable with different section
      attributes.  We could possibly introduce a new MAP_FIXED_NOFILECHANGE
      flag or similar, which acts like NOREPLACE, but allows just remapping
      the same executable file using different protection flags.
      
      It's not clear that would make a huge difference to anything, but if
      people really hate that "elf remaps over previous maps" behavior, maybe
      at least a more limited form of remapping would alleviate some concerns.
      
      Alternatively, we should take a look at our elf_map() logic to see if we
      end up not mapping things properly the first time.
      
      In the meantime, this is the minimal "don't do that then" patch while
      people hopefully think about it more.
      Reported-by: Russell King <linux@armlinux.org.uk>
      Fixes: 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map")
      Fixes: ad55eac7 ("elf: enforce  MAP_FIXED on overlaying elf segments")
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b212921b
    • Linus Torvalds's avatar
      Merge tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping · 7cdb85df
      Linus Torvalds authored
      Pull dma-mapping regression fix from Christoph Hellwig:
       "Revert an incorrect hunk from a patch that caused problems on various
        arm boards (Andrey Smirnov)"
      
      * tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping:
        dma-mapping: fix false positive warnings in dma_common_free_remap()
      7cdb85df
    • Linus Torvalds's avatar
      Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · 43b815c6
      Linus Torvalds authored
      Pull ARM SoC fixes from Olof Johansson:
       "A few fixes this time around:
      
         - Fixup of some clock specifications for DRA7 (device-tree fix)
      
         - Removal of some dead/legacy CPU OPP/PM code for OMAP that throws
           warnings at boot
      
         - A few more minor fixups for OMAPs, most around display
      
         - Enable STM32 QSPI as =y since their rootfs sometimes comes from
           there
      
         - Switch CONFIG_REMOTEPROC to =y since it went from tristate to bool
      
         - Fix of thermal zone definition for ux500 (5.4 regression)"
      
      * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
        ARM: multi_v7_defconfig: Fix SPI_STM32_QSPI support
        ARM: dts: ux500: Fix up the CPU thermal zone
        arm64/ARM: configs: Change CONFIG_REMOTEPROC from m to y
        ARM: dts: am4372: Set memory bandwidth limit for DISPC
        ARM: OMAP2+: Fix warnings with broken omap2_set_init_voltage()
        ARM: OMAP2+: Add missing LCDC midlemode for am335x
        ARM: OMAP2+: Fix missing reset done flag for am3 and am43
        ARM: dts: Fix gpio0 flags for am335x-icev2
        ARM: omap2plus_defconfig: Enable more droid4 devices as loadable modules
        ARM: omap2plus_defconfig: Enable DRM_TI_TFP410
        DTS: ARM: gta04: introduce legacy spi-cs-high to make display work again
        ARM: dts: Fix wrong clocks for dra7 mcasp
        clk: ti: dra7: Fix mcasp8 clock bits
      43b815c6
  8. 05 Oct, 2019 4 commits
    • Linus Torvalds's avatar
      Merge tag 'kbuild-fixes-v5.4' of... · 2d00aee2
      Linus Torvalds authored
      Merge tag 'kbuild-fixes-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
      
      Pull Kbuild fixes from Masahiro Yamada:
      
       - remove unneeded ar-option and KBUILD_ARFLAGS
      
       - remove long-deprecated SUBDIRS
      
       - fix modpost to suppress false-positive warnings for UML builds
      
       - fix namespace.pl to handle relative paths to ${objtree}, ${srctree}
      
       - make setlocalversion work for /bin/sh
      
       - make header archive reproducible
      
       - fix some Makefiles and documents
      
      * tag 'kbuild-fixes-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        kheaders: make headers archive reproducible
        kbuild: update compile-test header list for v5.4-rc2
        kbuild: two minor updates for Documentation/kbuild/modules.rst
        scripts/setlocalversion: clear local variable to make it work for sh
        namespace: fix namespace.pl script to support relative paths
        video/logo: do not generate unneeded logo C files
        video/logo: remove unneeded *.o pattern from clean-files
        integrity: remove pointless subdir-$(CONFIG_...)
        integrity: remove unneeded, broken attempt to add -fshort-wchar
        modpost: fix static EXPORT_SYMBOL warnings for UML build
        kbuild: correct formatting of header in kbuild module docs
        kbuild: remove SUBDIRS support
        kbuild: remove ar-option and KBUILD_ARFLAGS
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 126195c9
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Twelve patches mostly small but obvious fixes or cosmetic but small
        updates"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: qla2xxx: Fix Nport ID display value
        scsi: qla2xxx: Fix N2N link up fail
        scsi: qla2xxx: Fix N2N link reset
        scsi: qla2xxx: Optimize NPIV tear down process
        scsi: qla2xxx: Fix stale mem access on driver unload
        scsi: qla2xxx: Fix unbound sleep in fcport delete path.
        scsi: qla2xxx: Silence fwdump template message
        scsi: hisi_sas: Make three functions static
        scsi: megaraid: disable device when probe failed after enabled device
        scsi: storvsc: setup 1:1 mapping between hardware queue and CPU queue
        scsi: qedf: Remove always false 'tmp_prio < 0' statement
        scsi: ufs: skip shutdown if hba is not powered
        scsi: bnx2fc: Handle scope bits when array returns BUSY or TSF
    • Merge branch 'readdir' (readdir speedup and sanity checking) · 4f11918a
      Linus Torvalds authored
      This makes getdents() and getdents64() do sanity checking on the
      pathname that they give to user space.  To mitigate the performance
      impact of that, the branch first cleans up the way the user copying is
      done, so that the code avoids doing the SMAP/PAN updates between each
      part of the dirent structure write.
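
      As a rough user-space illustration of why batching the writes helps:
      the sketch below counts how often a mock "user access window" is
      opened and closed when each dirent field is written separately versus
      when all fields are written inside one window. Everything here
      (mock_access_begin(), mock_access_end(), struct mock_dirent and the
      fill helpers) is hypothetical and only models the cost pattern; it is
      not the kernel's filldir() code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the SMAP/PAN toggles: they only count calls. */
static unsigned int access_toggles;
static void mock_access_begin(void) { access_toggles++; }
static void mock_access_end(void)   { access_toggles++; }

/* Hypothetical, simplified directory entry used only for this sketch. */
struct mock_dirent {
	unsigned long  d_ino;
	unsigned long  d_off;
	unsigned short d_reclen;
	char           d_name[32];
};

/* Copy a name into the fixed-size field, always NUL-terminating it. */
static void copy_name(struct mock_dirent *d, const char *name)
{
	strncpy(d->d_name, name, sizeof(d->d_name) - 1);
	d->d_name[sizeof(d->d_name) - 1] = '\0';
}

/* Per-field style: every write opens and closes the window by itself. */
static unsigned int fill_per_field(struct mock_dirent *d, unsigned long ino,
				   const char *name)
{
	access_toggles = 0;
	mock_access_begin(); d->d_ino = ino;                          mock_access_end();
	mock_access_begin(); d->d_off = 0;                            mock_access_end();
	mock_access_begin(); d->d_reclen = (unsigned short)sizeof(*d); mock_access_end();
	mock_access_begin(); copy_name(d, name);                      mock_access_end();
	return access_toggles;
}

/* Batched style: one window around the whole dirent write. */
static unsigned int fill_batched(struct mock_dirent *d, unsigned long ino,
				 const char *name)
{
	access_toggles = 0;
	mock_access_begin();
	d->d_ino = ino;
	d->d_off = 0;
	d->d_reclen = (unsigned short)sizeof(*d);
	copy_name(d, name);
	mock_access_end();
	return access_toggles;
}
```

      In this model the per-field variant toggles the window eight times
      per entry and the batched variant twice, which is the kind of saving
      the conversion is after.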
      
      I really wanted to do this during the merge window, but didn't have
      time.  The conversion of filldir to unsafe_put_user() is something I've
      had around for years now in a private branch, but the extra pathname
      checking finally made me clean it up to the point where it is mergeable.
      
      It's worth noting that the filename validity checking really should be a
      bit smarter: it would be much better to delay the error reporting until
      the end of the readdir, so that non-corrupted filenames are still
      returned.  But that involves bigger changes, so let's see if anybody
      actually hits the corrupt directory entry case before worrying about it
      further.
      
      * branch 'readdir':
        Make filldir[64]() verify the directory entry filename is valid
        Convert filldir[64]() from __put_user() to unsafe_put_user()
    • Make filldir[64]() verify the directory entry filename is valid · 8a23eb80
      Linus Torvalds authored
      This has been discussed several times, and now filesystem people are
      talking about doing it individually at the filesystem layer, so head
      that off at the pass and just do it in getdents{64}().
      
      This is partially based on a patch by Jann Horn, but checks for NUL
      bytes as well, and somewhat simplified.
      
      There's also commentary about how it might be better if invalid names
      due to filesystem corruption don't cause an immediate failure, but only
      an error at the end of the readdir(), so that people can still see the
      filenames that are ok.
      
      There's also been discussion about just how much POSIX strictly speaking
      requires this since it's about filesystem corruption.  It's really more
      "protect user space from bad behavior" as pointed out by Jann.  But
      since Eric Biederman looked up the POSIX wording, here it is for context:
      
       "From readdir:
      
         The readdir() function shall return a pointer to a structure
         representing the directory entry at the current position in the
         directory stream specified by the argument dirp, and position the
         directory stream at the next entry. It shall return a null pointer
         upon reaching the end of the directory stream. The structure dirent
         defined in the <dirent.h> header describes a directory entry.
      
        From definitions:
      
         3.129 Directory Entry (or Link)
      
         An object that associates a filename with a file. Several directory
         entries can associate names with the same file.
      
        ...
      
         3.169 Filename
      
         A name consisting of 1 to {NAME_MAX} bytes used to name a file. The
         characters composing the name may be selected from the set of all
         character values excluding the slash character and the null byte. The
         filenames dot and dot-dot have special meaning. A filename is
         sometimes referred to as a 'pathname component'."
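
      The POSIX filename definition quoted above can be turned into a small
      check. The following is a minimal user-space sketch, not the kernel's
      implementation: the function name filename_is_valid() is made up for
      illustration, the fallback value of 255 for NAME_MAX is an assumption
      for systems whose <limits.h> does not define it, and the sketch
      rejects both '/' and NUL as the definition requires.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#ifndef NAME_MAX
#define NAME_MAX 255	/* assumed fallback; Linux defines NAME_MAX as 255 */
#endif

/*
 * Hypothetical helper: returns 1 if the len bytes at name form a valid
 * filename per the POSIX definition (1 to NAME_MAX bytes, no '/' and no
 * NUL byte), 0 otherwise.
 */
static int filename_is_valid(const char *name, size_t len)
{
	if (len == 0 || len > NAME_MAX)
		return 0;
	if (memchr(name, '/', len) != NULL)
		return 0;
	if (memchr(name, '\0', len) != NULL)
		return 0;
	return 1;
}
```

      Note that this makes two memchr() passes over the name, one per
      forbidden byte.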
      
      Note that I didn't bother adding the checks to any legacy interfaces
      that nobody uses.
      
      Also note that if this ends up being noticeable as a performance
      regression, we can fix that to do a much more optimized model that
      checks for both NUL and '/' at the same time one word at a time.
      
      We haven't really tended to optimize 'memchr()', and it only checks for
      one pattern at a time anyway, and we really _should_ check for NUL too
      (but see the comment about "soft errors" in the code for why it
      currently only checks for '/').
      
      See the CONFIG_DCACHE_WORD_ACCESS case of hash_name() for how the name
      lookup code looks for pathname terminating characters in parallel.
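
      To illustrate what checking for both NUL and '/' "one word at a time"
      could look like, here is a minimal user-space sketch. It uses the
      well-known "has a zero byte" bit trick on 64-bit words and is not
      taken from hash_name(); the names ONES, HIGHS, has_zero_byte() and
      name_has_nul_or_slash() are chosen for this example only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Nonzero if any of the eight bytes in x is zero. */
static int has_zero_byte(uint64_t x)
{
	return ((x - ONES) & ~x & HIGHS) != 0;
}

/*
 * Hypothetical helper: returns 1 if the len bytes at name contain a NUL
 * byte or a '/', 0 otherwise. Full 8-byte words are tested in one step;
 * the remaining tail is checked byte by byte.
 */
static int name_has_nul_or_slash(const char *name, size_t len)
{
	const uint64_t slashes = ONES * (uint64_t)'/';	/* '/' in every byte */

	while (len >= sizeof(uint64_t)) {
		uint64_t w;

		memcpy(&w, name, sizeof(w));	/* avoids unaligned access */
		/* XOR turns every '/' byte into a zero byte. */
		if (has_zero_byte(w) || has_zero_byte(w ^ slashes))
			return 1;
		name += sizeof(w);
		len -= sizeof(w);
	}
	while (len--) {
		char c = *name++;

		if (c == '\0' || c == '/')
			return 1;
	}
	return 0;
}
```

      The sketch only reports whether a forbidden byte exists, not where it
      is, which is all a validity check needs.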
      
      Link: https://lore.kernel.org/lkml/20190118161440.220134-2-jannh@google.com/
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jann Horn <jannh@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>