1. 31 Aug, 2017 4 commits
    • Mel Gorman's avatar
      mm, madvise: ensure poisoned pages are removed from per-cpu lists · c461ad6a
      Mel Gorman authored
      Wendy Wang reported off-list that a RAS HWPOISON-SOFT test case failed
      and bisected it to the commit 479f854a ("mm, page_alloc: defer
      debugging checks of pages allocated from the PCP").
      
      The problem is that a page that was poisoned with madvise() is reused.
      The commit removed a check that would trigger if DEBUG_VM was enabled
      but re-enabling the check only fixes the problem as a side-effect by
      printing a bad_page warning and recovering.
      
      The root of the problem is that an madvise() can leave a poisoned page
      on the per-cpu list.  This patch drains all per-cpu lists after pages
      are poisoned so that they will not be reused.  Wendy reports that the
      test case in question passes with this patch applied.  While this could
      be done in a targeted fashion, it is over-complicated for such a rare
      operation.
      
      Link: http://lkml.kernel.org/r/20170828133414.7qro57jbepdcyz5x@techsingularity.net
      Fixes: 479f854a ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Reported-by: default avatarWang, Wendy <wendy.wang@intel.com>
      Tested-by: default avatarWang, Wendy <wendy.wang@intel.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: "Hansen, Dave" <dave.hansen@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c461ad6a
    • Eric Biggers's avatar
      mm, uprobes: fix multiple free of ->uprobes_state.xol_area · 355627f5
      Eric Biggers authored
      Commit 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for
      write killable") made it possible to kill a forking task while it is
      waiting to acquire its ->mmap_sem for write, in dup_mmap().
      
      However, it was overlooked that this introduced an new error path before
      the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
      being copied from the old mm_struct by the memcpy in dup_mm().  For a
      task that has previously hit a uprobe tracepoint, this resulted in the
      'struct xol_area' being freed multiple times if the task was killed at
      just the right time while forking.
      
      Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
      than in uprobe_dup_mmap().
      
      With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
      program given by commit 2b7e8665 ("fork: fix incorrect fput of
      ->exe_file causing use-after-free"), provided that a uprobe tracepoint
      has been set on the fork_thread() function.  For example:
      
          $ gcc reproducer.c -o reproducer -lpthread
          $ nm reproducer | grep fork_thread
          0000000000400719 t fork_thread
          $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
          $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
          $ ./reproducer
      
      Here is the use-after-free reported by KASAN:
      
          BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
          Read of size 8 at addr ffff8800320a8b88 by task reproducer/198
      
          CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f #255
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
          Call Trace:
           dump_stack+0xdb/0x185
           print_address_description+0x7e/0x290
           kasan_report+0x23b/0x350
           __asan_report_load8_noabort+0x19/0x20
           uprobe_clear_state+0x1c4/0x200
           mmput+0xd6/0x360
           do_exit+0x740/0x1670
           do_group_exit+0x13f/0x380
           get_signal+0x597/0x17d0
           do_signal+0x99/0x1df0
           exit_to_usermode_loop+0x166/0x1e0
           syscall_return_slowpath+0x258/0x2c0
           entry_SYSCALL_64_fastpath+0xbc/0xbe
      
          ...
      
          Allocated by task 199:
           save_stack_trace+0x1b/0x20
           kasan_kmalloc+0xfc/0x180
           kmem_cache_alloc_trace+0xf3/0x330
           __create_xol_area+0x10f/0x780
           uprobe_notify_resume+0x1674/0x2210
           exit_to_usermode_loop+0x150/0x1e0
           prepare_exit_to_usermode+0x14b/0x180
           retint_user+0x8/0x20
      
          Freed by task 199:
           save_stack_trace+0x1b/0x20
           kasan_slab_free+0xa8/0x1a0
           kfree+0xba/0x210
           uprobe_clear_state+0x151/0x200
           mmput+0xd6/0x360
           copy_process.part.8+0x605f/0x65d0
           _do_fork+0x1a5/0xbd0
           SyS_clone+0x19/0x20
           do_syscall_64+0x22f/0x660
           return_from_SYSCALL_64+0x0/0x7a
      
      Note: without KASAN, you may instead see a "Bad page state" message, or
      simply a general protection fault.
      
      Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com
      Fixes: 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Reported-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>    [4.7+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      355627f5
    • Shaohua Li's avatar
      kernel/kthread.c: kthread_worker: don't hog the cpu · 22cf8bc6
      Shaohua Li authored
      If the worker thread continues getting work, it will hog the cpu and rcu
      stall complains.  Make it a good citizen.  This is triggered in a loop
      block device test.
      
      Link: http://lkml.kernel.org/r/5de0a179b3184e1a2183fc503448b0269f24d75b.1503697127.git.shli@fb.comSigned-off-by: default avatarShaohua Li <shli@fb.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      22cf8bc6
    • Tetsuo Handa's avatar
      mm,page_alloc: don't call __node_reclaim() with oom_lock held. · e746bf73
      Tetsuo Handa authored
      We are doing a last second memory allocation attempt before calling
      out_of_memory().  But since slab shrinker functions might indirectly
      wait for other thread's __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
      allocations via sleeping locks, calling slab shrinker functions from
      node_reclaim() from get_page_from_freelist() with oom_lock held has
      possibility of deadlock.  Therefore, make sure that last second memory
      allocation attempt does not call slab shrinker functions.
      
      Link: http://lkml.kernel.org/r/1503577106-9196-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jpSigned-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e746bf73
  2. 30 Aug, 2017 6 commits
  3. 29 Aug, 2017 15 commits
  4. 28 Aug, 2017 12 commits
    • Linus Torvalds's avatar
      page waitqueue: always add new entries at the end · 9c3a815f
      Linus Torvalds authored
      Commit 3510ca20 ("Minor page waitqueue cleanups") made the page
      queue code always add new waiters to the back of the queue, which helps
      upcoming patches to batch the wakeups for some horrid loads where the
      wait queues grow to thousands of entries.
      
      However, I forgot about the nasrt add_page_wait_queue() special case
      code that is only used by the cachefiles code.  That one still continued
      to add the new wait queue entries at the beginning of the list.
      
      Fix it, because any sane batched wakeup will require that we don't
      suddenly start getting new entries at the beginning of the list that we
      already handled in a previous batch.
      
      [ The current code always does the whole list while holding the lock, so
        wait queue ordering doesn't matter for correctness, but even then it's
        better to add later entries at the end from a fairness standpoint ]
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9c3a815f
    • Tejun Heo's avatar
      cpumask: fix spurious cpumask_of_node() on non-NUMA multi-node configs · b339752d
      Tejun Heo authored
      When !NUMA, cpumask_of_node(@node) equals cpu_online_mask regardless of
      @node.  The assumption seems that if !NUMA, there shouldn't be more than
      one node and thus reporting cpu_online_mask regardless of @node is
      correct.  However, that assumption was broken years ago to support
      DISCONTIGMEM and whether a system has multiple nodes or not is
      separately controlled by NEED_MULTIPLE_NODES.
      
      This means that, on a system with !NUMA && NEED_MULTIPLE_NODES,
      cpumask_of_node() will report cpu_online_mask for all possible nodes,
      indicating that the CPUs are associated with multiple nodes which is an
      impossible configuration.
      
      This bug has been around forever but doesn't look like it has caused any
      noticeable symptoms.  However, it triggers a WARN recently added to
      workqueue to verify NUMA affinity configuration.
      
      Fix it by reporting empty cpumask on non-zero nodes if !NUMA.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b339752d
    • Alexey Brodkin's avatar
      ARCv2: SMP: Mask only private-per-core IRQ lines on boot at core intc · e8206d2b
      Alexey Brodkin authored
      Recent commit a8ec3ee8 "arc: Mask individual IRQ lines during core
      INTC init" breaks interrupt handling on ARCv2 SMP systems.
      
      That commit masked all interrupts at onset, as some controllers on some
      boards (customer as well as internal), would assert interrutps early
      before any handlers were installed.  For SMP systems, the masking was
      done at each cpu's core-intc.  Later, when the IRQ was actually
      requested, it was unmasked, but only on the requesting cpu.
      
      For "common" interrupts, which were wired up from the 2nd level IDU
      intc, this was as issue as they needed to be enabled on ALL the cpus
      (given that IDU IRQs are by default served Round Robin across cpus)
      
      So fix that by NOT masking "common" interrupts at core-intc, but instead
      at the 2nd level IDU intc (latter already being done in idu_of_init())
      
      Fixes: a8ec3ee8 ("arc: Mask individual IRQ lines during core INTC init")
      Signed-off-by: default avatarAlexey Brodkin <abrodkin@synopsys.com>
      [vgupta: reworked changelog, removed the extraneous idu_irq_mask_raw()]
      Signed-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e8206d2b
    • Helge Deller's avatar
      fs/select: Fix memory corruption in compat_get_fd_set() · 79de3cbe
      Helge Deller authored
      Commit 464d6242 ("select: switch compat_{get,put}_fd_set() to
      compat_{get,put}_bitmap()") changed the calculation on how many bytes
      need to be zeroed when userspace handed over a NULL pointer for a fdset
      array in the select syscall.
      
      The calculation was changed in compat_get_fd_set() wrongly from
      	memset(fdset, 0, ((nr + 1) & ~1)*sizeof(compat_ulong_t));
      to
      	memset(fdset, 0, ALIGN(nr, BITS_PER_LONG));
      
      The ALIGN(nr, BITS_PER_LONG) calculates the number of _bits_ which need
      to be zeroed in the target fdset array (rounded up to the next full bits
      for an unsigned long).
      
      But the memset() call expects the number of _bytes_ to be zeroed.
      
      This leads to clearing more memory than wanted (on the stack area or
      even at kmalloc()ed memory areas) and to random kernel crashes as we
      have seen them on the parisc platform.
      
      The correct change should have been
      
      	memset(fdset, 0, (ALIGN(nr, BITS_PER_LONG) / BITS_PER_LONG) * BYTES_PER_LONG);
      
      which is the same as can be archieved with a call to
      
      	zero_fd_set(nr, fdset).
      
      Fixes: 464d6242 ("select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()"
      Acked-by: default avatar: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      79de3cbe
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming · 702e9762
      Linus Torvalds authored
      Pull c6x tweaks from Mark Salter.
      
      * tag 'for-linus' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming:
        c6x: Convert to using %pOF instead of full_name
        c6x: defconfig: Cleanup from old Kconfig options
      702e9762
    • Christoph Hellwig's avatar
      libata: quirk read log on no-name M.2 SSD · 35f0b6a7
      Christoph Hellwig authored
      Ido reported that reading the log page on his systems fails,
      so quirk it as it won't support ZBC or security protocols.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reported-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Tested-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      35f0b6a7
    • Dan Williams's avatar
      libnvdimm: clean up command definitions · 7a14724f
      Dan Williams authored
      Remove the command payloads that do not have an associated libnvdimm
      ioctl. I.e. remove the payloads that would only ever be carried in the
      ND_CMD_CALL envelope. This prevents userspace from growing unnecessary
      dependencies on this kernel header when userspace already has everything
      it needs to craft and send these commands.
      
      Cc: Jerry Hoemann <jerry.hoemann@hpe.com>
      Reported-by: default avatarYasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      7a14724f
    • Linus Torvalds's avatar
      Linux 4.13-rc7 · cc4a41fe
      Linus Torvalds authored
      cc4a41fe
    • Linus Torvalds's avatar
      Merge tag 'iommu-fixes-v4.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 2c25833c
      Linus Torvalds authored
      Pull IOMMU fix from Joerg Roedel:
       "Another fix, this time in common IOMMU sysfs code.
      
        In the conversion from the old iommu sysfs-code to the
        iommu_device_register interface, I missed to update the release path
        for the struct device associated with an IOMMU. It freed the 'struct
        device', which was a pointer before, but is now embedded in another
        struct.
      
        Freeing from the middle of allocated memory had all kinds of nasty
        side effects when an IOMMU was unplugged. Unfortunatly nobody
        unplugged and IOMMU until now, so this was not discovered earlier. The
        fix is to make the 'struct device' a pointer again"
      
      * tag 'iommu-fixes-v4.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu: Fix wrong freeing of iommu_device->dev
      2c25833c
    • Linus Torvalds's avatar
      Merge tag 'char-misc-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 80f73b2d
      Linus Torvalds authored
      Pull char/misc fix from Greg KH:
       "Here is a single misc driver fix for 4.13-rc7. It resolves a reported
        problem in the Android binder driver due to previous patches in
        4.13-rc.
      
        It's been in linux-next with no reported issues"
      
      * tag 'char-misc-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        ANDROID: binder: fix proc->tsk check.
      80f73b2d
    • Linus Torvalds's avatar
      Merge tag 'staging-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · c3c16263
      Linus Torvalds authored
      Pull staging/iio fixes from Greg KH:
       "Here are few small staging driver fixes, and some more IIO driver
        fixes for 4.13-rc7. Nothing major, just resolutions for some reported
        problems.
      
        All of these have been in linux-next with no reported problems"
      
      * tag 'staging-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        iio: magnetometer: st_magn: remove ihl property for LSM303AGR
        iio: magnetometer: st_magn: fix status register address for LSM303AGR
        iio: hid-sensor-trigger: Fix the race with user space powering up sensors
        iio: trigger: stm32-timer: fix get trigger mode
        iio: imu: adis16480: Fix acceleration scale factor for adis16480
        PATCH] iio: Fix some documentation warnings
        staging: rtl8188eu: add RNX-N150NUB support
        Revert "staging: fsl-mc: be consistent when checking strcmp() return"
        iio: adc: stm32: fix common clock rate
        iio: adc: ina219: Avoid underflow for sleeping time
        iio: trigger: stm32-timer: add enable attribute
        iio: trigger: stm32-timer: fix get/set down count direction
        iio: trigger: stm32-timer: fix write_raw return value
        iio: trigger: stm32-timer: fix quadrature mode get routine
        iio: bmp280: properly initialize device for humidity reading
      c3c16263
    • Linus Torvalds's avatar
      Merge tag 'ntb-4.13-bugfixes' of git://github.com/jonmason/ntb · fff4e7a0
      Linus Torvalds authored
      Pull NTB fixes from Jon Mason:
       "NTB bug fixes to address an incorrect ntb_mw_count reference in the
        NTB transport, improperly bringing down the link if SPADs are
        corrupted, and an out-of-order issue regarding link negotiation and
        data passing"
      
      * tag 'ntb-4.13-bugfixes' of git://github.com/jonmason/ntb:
        ntb: ntb_test: ensure the link is up before trying to configure the mws
        ntb: transport shouldn't disable link due to bogus values in SPADs
        ntb: use correct mw_count function in ntb_tool and ntb_transport
      fff4e7a0
  5. 27 Aug, 2017 3 commits
    • Linus Torvalds's avatar
      Avoid page waitqueue race leaving possible page locker waiting · a8b169af
      Linus Torvalds authored
      The "lock_page_killable()" function waits for exclusive access to the
      page lock bit using the WQ_FLAG_EXCLUSIVE bit in the waitqueue entry
      set.
      
      That means that if it gets woken up, other waiters may have been
      skipped.
      
      That, in turn, means that if it sees the page being unlocked, it *must*
      take that lock and return success, even if a lethal signal is also
      pending.
      
      So instead of checking for lethal signals first, we need to check for
      them after we've checked the actual bit that we were waiting for.  Even
      if that might then delay the killing of the process.
      
      This matches the order of the old "wait_on_bit_lock()" infrastructure
      that the page locking used to use (and is still used in a few other
      areas).
      
      Note that if we still return an error after having unsuccessfully tried
      to acquire the page lock, that is ok: that means that some other thread
      was able to get ahead of us and lock the page, and when that other
      thread then unlocks the page, the wakeup event will be repeated.  So any
      other pending waiters will now get properly woken up.
      
      Fixes: 62906027 ("mm: add PageWaiters indicating tasks are waiting for a page bit")
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a8b169af
    • Linus Torvalds's avatar
      Minor page waitqueue cleanups · 3510ca20
      Linus Torvalds authored
      Tim Chen and Kan Liang have been battling a customer load that shows
      extremely long page wakeup lists.  The cause seems to be constant NUMA
      migration of a hot page that is shared across a lot of threads, but the
      actual root cause for the exact behavior has not been found.
      
      Tim has a patch that batches the wait list traversal at wakeup time, so
      that we at least don't get long uninterruptible cases where we traverse
      and wake up thousands of processes and get nasty latency spikes.  That
      is likely 4.14 material, but we're still discussing the page waitqueue
      specific parts of it.
      
      In the meantime, I've tried to look at making the page wait queues less
      expensive, and failing miserably.  If you have thousands of threads
      waiting for the same page, it will be painful.  We'll need to try to
      figure out the NUMA balancing issue some day, in addition to avoiding
      the excessive spinlock hold times.
      
      That said, having tried to rewrite the page wait queues, I can at least
      fix up some of the braindamage in the current situation. In particular:
      
       (a) we don't want to continue walking the page wait list if the bit
           we're waiting for already got set again (which seems to be one of
           the patterns of the bad load).  That makes no progress and just
           causes pointless cache pollution chasing the pointers.
      
       (b) we don't want to put the non-locking waiters always on the front of
           the queue, and the locking waiters always on the back.  Not only is
           that unfair, it means that we wake up thousands of reading threads
           that will just end up being blocked by the writer later anyway.
      
      Also add a comment about the layout of 'struct wait_page_key' - there is
      an external user of it in the cachefiles code that means that it has to
      match the layout of 'struct wait_bit_key' in the two first members.  It
      so happens to match, because 'struct page *' and 'unsigned long *' end
      up having the same values simply because the page flags are the first
      member in struct page.
      
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3510ca20
    • Linus Torvalds's avatar
      Clarify (and fix) MAX_LFS_FILESIZE macros · 0cc3b0ec
      Linus Torvalds authored
      We have a MAX_LFS_FILESIZE macro that is meant to be filled in by
      filesystems (and other IO targets) that know they are 64-bit clean and
      don't have any 32-bit limits in their IO path.
      
      It turns out that our 32-bit value for that limit was bogus.  On 32-bit,
      the VM layer is limited by the page cache to only 32-bit index values,
      but our logic for that was confusing and actually wrong.  We used to
      define that value to
      
      	(((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
      
      which is actually odd in several ways: it limits the index to 31 bits,
      and then it limits files so that they can't have data in that last byte
      of a page that has the highest 31-bit index (ie page index 0x7fffffff).
      
      Neither of those limitations make sense.  The index is actually the full
      32 bit unsigned value, and we can use that whole full page.  So the
      maximum size of the file would logically be "PAGE_SIZE << BITS_PER_LONG".
      
      However, we do wan tto avoid the maximum index, because we have code
      that iterates over the page indexes, and we don't want that code to
      overflow.  So the maximum size of a file on a 32-bit host should
      actually be one page less than the full 32-bit index.
      
      So the actual limit is ULONG_MAX << PAGE_SHIFT.  That means that we will
      not actually be using the page of that last index (ULONG_MAX), but we
      can grow a file up to that limit.
      
      The wrong value of MAX_LFS_FILESIZE actually caused problems for Doug
      Nazar, who was still using a 32-bit host, but with a 9.7TB 2 x RAID5
      volume.  It turns out that our old MAX_LFS_FILESIZE was 8TiB (well, one
      byte less), but the actual true VM limit is one page less than 16TiB.
      
      This was invisible until commit c2a9737f ("vfs,mm: fix a dead loop
      in truncate_inode_pages_range()"), which started applying that
      MAX_LFS_FILESIZE limit to block devices too.
      
      NOTE! On 64-bit, the page index isn't a limiter at all, and the limit is
      actually just the offset type itself (loff_t), which is signed.  But for
      clarity, on 64-bit, just use the maximum signed value, and don't make
      people have to count the number of 'f' characters in the hex constant.
      
      So just use LLONG_MAX for the 64-bit case.  That was what the value had
      been before too, just written out as a hex constant.
      
      Fixes: c2a9737f ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
      Reported-and-tested-by: default avatarDoug Nazar <nazard@nazar.ca>
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0cc3b0ec