1. 29 Aug, 2017 5 commits
    • Paul Burton's avatar
      MIPS: Remove unused R6000 support · 3b2db173
      Paul Burton authored
      The kernel contains a small amount of incomplete code aimed at
      supporting old R6000 CPUs. This is:
      
        - Unused, as no machine selects CONFIG_SYS_HAS_CPU_R6000.
      
        - Broken, since there are glaring errors such as r6000_fpu.S moving
          the FCSR register to t1, then ignoring it & instead saving t0 into
          struct sigcontext...
      
        - A maintenance headache, since it's code that nobody can test which
          nevertheless imposes constraints on code which it shares with other
          machines.
      
      Remove this incomplete & broken R6000 CPU support in order to clean up
      and in preparation for changes which will no longer need to consider
      dragging the pretense of R6000 support along with them.
      Signed-off-by: default avatarPaul Burton <paul.burton@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/16236/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      3b2db173
    • Matt Redfearn's avatar
      MIPS: R6: Constify r2_decoder_tables · 114c3708
      Matt Redfearn authored
      The r2_decoder_tables are never modified. They are arrays of constant
      values and as such should be declared const.
      
      This change saves 256 bytes of kernel text, and 128 bytes of kernel data
      (384 bytes total) on a 32r6el_defconfig (with SMP disabled)
      Before:
         text	   data	    bss	    dec	    hex	filename
      5576221	1080804	 267040	6924065	 69a721	vmlinux
      After:
         text	   data	    bss	    dec	    hex	filename
      5575965	1080676	 267040	6923681	 69a5a1	vmlinux
      Signed-off-by: default avatarMatt Redfearn <matt.redfearn@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/15289/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      114c3708
    • Matt Redfearn's avatar
      MIPS: SMP: Constify smp ops · ff2c8252
      Matt Redfearn authored
      smp_ops providers do not modify their ops structures, so they should be
      made const for robustness. Since currently the MIPS kernel is not mapped
      with memory protection, this does not in itself provide any security
      benefit, but it still makes sense to make this change.
      
      There are also slight code size efficincies from the structure being
      made read-only, saving 128 bytes of kernel text on a
      pistachio_defconfig.
      Before:
         text	   data	    bss	    dec	    hex	filename
      7187239	1772752	 470224	9430215	 8fe4c7	vmlinux
      After:
         text	   data	    bss	    dec	    hex	filename
      7187111	1772752	 470224	9430087	 8fe447	vmlinux
      Signed-off-by: default avatarMatt Redfearn <matt.redfearn@imgtec.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
      Cc: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Huacai Chen <chenhc@lemote.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Kevin Cernekee <cernekee@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Steven J. Hill <steven.hill@cavium.com>
      Cc: linux-mips@linux-mips.org
      Cc: linux-kernel@vger.kernel.org
      Patchwork: https://patchwork.linux-mips.org/patch/16784/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      ff2c8252
    • Krzysztof Kozlowski's avatar
      MIPS: defconfig: Cleanup from non-existing options · b879c801
      Krzysztof Kozlowski authored
      Remove options which do not exist anymore:
       - CPU_FREQ_DEBUG is gone since commit 2d06d8c4  ("[CPUFREQ] use
         dynamic debug instead of custom infrastructure").
      
       - ECONET is gone since commit 349f29d8 ("econet: remove ancient bug
         ridden protocol");
      
       - IPDDP_DECAP is gone since commit 9b5645b5 ("appletalk: remove
         "config IPDDP_DECAP"");
      Signed-off-by: default avatarKrzysztof Kozlowski <krzk@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-mips@linux-mips.org
      Cc: linux-kernel@vger.kernel.org
      Patchwork: https://patchwork.linux-mips.org/patch/16770/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      b879c801
    • Rob Herring's avatar
      MIPS: Convert to using %pOF instead of full_name · 7f27b5b8
      Rob Herring authored
      Now that we have a custom printf format specifier, convert users of
      full_name to use %pOF instead. This is preparation to remove storing
      of the full path string for each node.
      Signed-off-by: default avatarRob Herring <robh@kernel.org>
      Cc: linux-mips@linux-mips.org
      Cc: linux-kernel@vger.kernel.org
      Cc: devicetree@vger.kernel.org
      Patchwork: https://patchwork.linux-mips.org/patch/16783/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      7f27b5b8
  2. 28 Aug, 2017 5 commits
    • Linus Torvalds's avatar
      Linux 4.13-rc7 · cc4a41fe
      Linus Torvalds authored
      cc4a41fe
    • Linus Torvalds's avatar
      Merge tag 'iommu-fixes-v4.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 2c25833c
      Linus Torvalds authored
      Pull IOMMU fix from Joerg Roedel:
       "Another fix, this time in common IOMMU sysfs code.
      
        In the conversion from the old iommu sysfs-code to the
        iommu_device_register interface, I missed to update the release path
        for the struct device associated with an IOMMU. It freed the 'struct
        device', which was a pointer before, but is now embedded in another
        struct.
      
        Freeing from the middle of allocated memory had all kinds of nasty
        side effects when an IOMMU was unplugged. Unfortunatly nobody
        unplugged and IOMMU until now, so this was not discovered earlier. The
        fix is to make the 'struct device' a pointer again"
      
      * tag 'iommu-fixes-v4.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu: Fix wrong freeing of iommu_device->dev
      2c25833c
    • Linus Torvalds's avatar
      Merge tag 'char-misc-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 80f73b2d
      Linus Torvalds authored
      Pull char/misc fix from Greg KH:
       "Here is a single misc driver fix for 4.13-rc7. It resolves a reported
        problem in the Android binder driver due to previous patches in
        4.13-rc.
      
        It's been in linux-next with no reported issues"
      
      * tag 'char-misc-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        ANDROID: binder: fix proc->tsk check.
      80f73b2d
    • Linus Torvalds's avatar
      Merge tag 'staging-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · c3c16263
      Linus Torvalds authored
      Pull staging/iio fixes from Greg KH:
       "Here are few small staging driver fixes, and some more IIO driver
        fixes for 4.13-rc7. Nothing major, just resolutions for some reported
        problems.
      
        All of these have been in linux-next with no reported problems"
      
      * tag 'staging-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        iio: magnetometer: st_magn: remove ihl property for LSM303AGR
        iio: magnetometer: st_magn: fix status register address for LSM303AGR
        iio: hid-sensor-trigger: Fix the race with user space powering up sensors
        iio: trigger: stm32-timer: fix get trigger mode
        iio: imu: adis16480: Fix acceleration scale factor for adis16480
        PATCH] iio: Fix some documentation warnings
        staging: rtl8188eu: add RNX-N150NUB support
        Revert "staging: fsl-mc: be consistent when checking strcmp() return"
        iio: adc: stm32: fix common clock rate
        iio: adc: ina219: Avoid underflow for sleeping time
        iio: trigger: stm32-timer: add enable attribute
        iio: trigger: stm32-timer: fix get/set down count direction
        iio: trigger: stm32-timer: fix write_raw return value
        iio: trigger: stm32-timer: fix quadrature mode get routine
        iio: bmp280: properly initialize device for humidity reading
      c3c16263
    • Linus Torvalds's avatar
      Merge tag 'ntb-4.13-bugfixes' of git://github.com/jonmason/ntb · fff4e7a0
      Linus Torvalds authored
      Pull NTB fixes from Jon Mason:
       "NTB bug fixes to address an incorrect ntb_mw_count reference in the
        NTB transport, improperly bringing down the link if SPADs are
        corrupted, and an out-of-order issue regarding link negotiation and
        data passing"
      
      * tag 'ntb-4.13-bugfixes' of git://github.com/jonmason/ntb:
        ntb: ntb_test: ensure the link is up before trying to configure the mws
        ntb: transport shouldn't disable link due to bogus values in SPADs
        ntb: use correct mw_count function in ntb_tool and ntb_transport
      fff4e7a0
  3. 27 Aug, 2017 3 commits
    • Linus Torvalds's avatar
      Avoid page waitqueue race leaving possible page locker waiting · a8b169af
      Linus Torvalds authored
      The "lock_page_killable()" function waits for exclusive access to the
      page lock bit using the WQ_FLAG_EXCLUSIVE bit in the waitqueue entry
      set.
      
      That means that if it gets woken up, other waiters may have been
      skipped.
      
      That, in turn, means that if it sees the page being unlocked, it *must*
      take that lock and return success, even if a lethal signal is also
      pending.
      
      So instead of checking for lethal signals first, we need to check for
      them after we've checked the actual bit that we were waiting for.  Even
      if that might then delay the killing of the process.
      
      This matches the order of the old "wait_on_bit_lock()" infrastructure
      that the page locking used to use (and is still used in a few other
      areas).
      
      Note that if we still return an error after having unsuccessfully tried
      to acquire the page lock, that is ok: that means that some other thread
      was able to get ahead of us and lock the page, and when that other
      thread then unlocks the page, the wakeup event will be repeated.  So any
      other pending waiters will now get properly woken up.
      
      Fixes: 62906027 ("mm: add PageWaiters indicating tasks are waiting for a page bit")
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a8b169af
    • Linus Torvalds's avatar
      Minor page waitqueue cleanups · 3510ca20
      Linus Torvalds authored
      Tim Chen and Kan Liang have been battling a customer load that shows
      extremely long page wakeup lists.  The cause seems to be constant NUMA
      migration of a hot page that is shared across a lot of threads, but the
      actual root cause for the exact behavior has not been found.
      
      Tim has a patch that batches the wait list traversal at wakeup time, so
      that we at least don't get long uninterruptible cases where we traverse
      and wake up thousands of processes and get nasty latency spikes.  That
      is likely 4.14 material, but we're still discussing the page waitqueue
      specific parts of it.
      
      In the meantime, I've tried to look at making the page wait queues less
      expensive, and failing miserably.  If you have thousands of threads
      waiting for the same page, it will be painful.  We'll need to try to
      figure out the NUMA balancing issue some day, in addition to avoiding
      the excessive spinlock hold times.
      
      That said, having tried to rewrite the page wait queues, I can at least
      fix up some of the braindamage in the current situation. In particular:
      
       (a) we don't want to continue walking the page wait list if the bit
           we're waiting for already got set again (which seems to be one of
           the patterns of the bad load).  That makes no progress and just
           causes pointless cache pollution chasing the pointers.
      
       (b) we don't want to put the non-locking waiters always on the front of
           the queue, and the locking waiters always on the back.  Not only is
           that unfair, it means that we wake up thousands of reading threads
           that will just end up being blocked by the writer later anyway.
      
      Also add a comment about the layout of 'struct wait_page_key' - there is
      an external user of it in the cachefiles code that means that it has to
      match the layout of 'struct wait_bit_key' in the two first members.  It
      so happens to match, because 'struct page *' and 'unsigned long *' end
      up having the same values simply because the page flags are the first
      member in struct page.
      
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3510ca20
    • Linus Torvalds's avatar
      Clarify (and fix) MAX_LFS_FILESIZE macros · 0cc3b0ec
      Linus Torvalds authored
      We have a MAX_LFS_FILESIZE macro that is meant to be filled in by
      filesystems (and other IO targets) that know they are 64-bit clean and
      don't have any 32-bit limits in their IO path.
      
      It turns out that our 32-bit value for that limit was bogus.  On 32-bit,
      the VM layer is limited by the page cache to only 32-bit index values,
      but our logic for that was confusing and actually wrong.  We used to
      define that value to
      
      	(((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
      
      which is actually odd in several ways: it limits the index to 31 bits,
      and then it limits files so that they can't have data in that last byte
      of a page that has the highest 31-bit index (ie page index 0x7fffffff).
      
      Neither of those limitations make sense.  The index is actually the full
      32 bit unsigned value, and we can use that whole full page.  So the
      maximum size of the file would logically be "PAGE_SIZE << BITS_PER_LONG".
      
      However, we do wan tto avoid the maximum index, because we have code
      that iterates over the page indexes, and we don't want that code to
      overflow.  So the maximum size of a file on a 32-bit host should
      actually be one page less than the full 32-bit index.
      
      So the actual limit is ULONG_MAX << PAGE_SHIFT.  That means that we will
      not actually be using the page of that last index (ULONG_MAX), but we
      can grow a file up to that limit.
      
      The wrong value of MAX_LFS_FILESIZE actually caused problems for Doug
      Nazar, who was still using a 32-bit host, but with a 9.7TB 2 x RAID5
      volume.  It turns out that our old MAX_LFS_FILESIZE was 8TiB (well, one
      byte less), but the actual true VM limit is one page less than 16TiB.
      
      This was invisible until commit c2a9737f ("vfs,mm: fix a dead loop
      in truncate_inode_pages_range()"), which started applying that
      MAX_LFS_FILESIZE limit to block devices too.
      
      NOTE! On 64-bit, the page index isn't a limiter at all, and the limit is
      actually just the offset type itself (loff_t), which is signed.  But for
      clarity, on 64-bit, just use the maximum signed value, and don't make
      people have to count the number of 'f' characters in the hex constant.
      
      So just use LLONG_MAX for the 64-bit case.  That was what the value had
      been before too, just written out as a hex constant.
      
      Fixes: c2a9737f ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
      Reported-and-tested-by: default avatarDoug Nazar <nazard@nazar.ca>
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0cc3b0ec
  4. 26 Aug, 2017 13 commits
  5. 25 Aug, 2017 14 commits
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 1f5de42d
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "I2C has some bugfixes for you: mainly Jarkko fixed up a few things in
        the designware driver regarding the new slave mode. But Ulf also fixed
        a long-standing and now agreed suspend problem. Plus, some simple
        stuff which nonetheless needs fixing"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: designware: Fix runtime PM for I2C slave mode
        i2c: designware: Remove needless pm_runtime_put_noidle() call
        i2c: aspeed: fixed potential null pointer dereference
        i2c: simtec: use release_mem_region instead of release_resource
        i2c: core: Make comment about I2C table requirement to reflect the code
        i2c: designware: Fix standard mode speed when configuring the slave mode
        i2c: designware: Fix oops from i2c_dw_irq_handler_slave
        i2c: designware: Fix system suspend
      1f5de42d
    • Christoph Hellwig's avatar
      PCI/MSI: Don't warn when irq_create_affinity_masks() returns NULL · 8e1101d2
      Christoph Hellwig authored
      irq_create_affinity_masks() can return NULL on non-SMP systems, when there
      are not enough "free" vectors available to spread, or if memory allocation
      for the CPU masks fails.  Only the allocation failure is of interest, and
      even then the system will work just fine except for non-optimally spread
      vectors.  Thus remove the warnings.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      8e1101d2
    • Linus Torvalds's avatar
      Merge tag 'mmc-v4.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · 299c4608
      Linus Torvalds authored
      Pull MMC fix from Ulf Hansson:
       "MMC core: don't return error code R1_OUT_OF_RANGE for open-ending mode"
      
      * tag 'mmc-v4.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
        mmc: block: prevent propagating R1_OUT_OF_RANGE for open-ending mode
      299c4608
    • Linus Torvalds's avatar
      Merge tag 'sound-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 8efeb350
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "We're keeping in a good shape, this batch contains just a few small
        fixes (a regression fix for ASoC rt5677 codec, NULL dereference and
        error-path fixes in firewire, and a corner-case ioctl error fix for
        user TLV), as well as usual quirks for USB-audio and HD-audio"
      
      * tag 'sound-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ASoC: rt5677: Reintroduce I2C device IDs
        ALSA: hda - Add stereo mic quirk for Lenovo G50-70 (17aa:3978)
        ALSA: core: Fix unexpected error at replacing user TLV
        ALSA: usb-audio: Add delay quirk for H650e/Jabra 550a USB headsets
        ALSA: firewire-motu: destroy stream data surely at failure of card initialization
        ALSA: firewire: fix NULL pointer dereference when releasing uninitialized data of iso-resource
      8efeb350
    • Linus Torvalds's avatar
      Merge tag 'dmaengine-fix-4.13-rc7' of git://git.infradead.org/users/vkoul/slave-dma · 985e7755
      Linus Torvalds authored
      Pull dmaengine fix from Vinod Koul:
       "A single fix for tegra210-adma driver to check of_irq_get() error"
      
      * tag 'dmaengine-fix-4.13-rc7' of git://git.infradead.org/users/vkoul/slave-dma:
        dmaengine: tegra210-adma: fix of_irq_get() error check
      985e7755
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-for-v4.13-rc7' of git://people.freedesktop.org/~airlied/linux · 9e154001
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Fixes for rc7, nothing too crazy, some core, i915, and sunxi fixes,
        Intel CI has been responsible for some of these fixes being required"
      
      * tag 'drm-fixes-for-v4.13-rc7' of git://people.freedesktop.org/~airlied/linux:
        drm/i915/gvt: Fix the kernel null pointer error
        drm: Release driver tracking before making the object available again
        drm/i915: Clear lost context-switch interrupts across reset
        drm/i915/bxt: use NULL for GPIO connection ID
        drm/i915/cnl: Fix LSPCON support.
        drm/i915/vbt: ignore extraneous child devices for a port
        drm/i915: Initialize 'data' in intel_dsi_dcs_backlight.c
        drm/atomic: If the atomic check fails, return its value first
        drm/atomic: Handle -EDEADLK with out-fences correctly
        drm: Fix framebuffer leak
        drm/imx: ipuv3-plane: fix YUV framebuffer scanout on the base plane
        gpu: ipu-v3: add DRM dependency
        drm/rockchip: Fix suspend crash when drm is not bound
        drm/sun4i: Implement drm_driver lastclose to restore fbdev console
      9e154001
    • Pavel Tatashin's avatar
      mm/memblock.c: reversed logic in memblock_discard() · 91b540f9
      Pavel Tatashin authored
      In recently introduced memblock_discard() there is a reversed logic bug.
      Memory is freed of static array instead of dynamically allocated one.
      
      Link: http://lkml.kernel.org/r/1503511441-95478-2-git-send-email-pasha.tatashin@oracle.com
      Fixes: 3010f876 ("mm: discard memblock data later")
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reported-by: default avatarWoody Suwalski <terraluna977@gmail.com>
      Tested-by: default avatarWoody Suwalski <terraluna977@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      91b540f9
    • Eric Biggers's avatar
      fork: fix incorrect fput of ->exe_file causing use-after-free · 2b7e8665
      Eric Biggers authored
      Commit 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for
      write killable") made it possible to kill a forking task while it is
      waiting to acquire its ->mmap_sem for write, in dup_mmap().
      
      However, it was overlooked that this introduced an new error path before
      a reference is taken on the mm_struct's ->exe_file.  Since the
      ->exe_file of the new mm_struct was already set to the old ->exe_file by
      the memcpy() in dup_mm(), it was possible for the mmput() in the error
      path of dup_mm() to drop a reference to ->exe_file which was never
      taken.
      
      This caused the struct file to later be freed prematurely.
      
      Fix it by updating mm_init() to NULL out the ->exe_file, in the same
      place it clears other things like the list of mmaps.
      
      This bug was found by syzkaller.  It can be reproduced using the
      following C program:
      
          #define _GNU_SOURCE
          #include <pthread.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/wait.h>
          #include <unistd.h>
      
          static void *mmap_thread(void *_arg)
          {
              for (;;) {
                  mmap(NULL, 0x1000000, PROT_READ,
                       MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
              }
          }
      
          static void *fork_thread(void *_arg)
          {
              usleep(rand() % 10000);
              fork();
          }
      
          int main(void)
          {
              fork();
              fork();
              fork();
              for (;;) {
                  if (fork() == 0) {
                      pthread_t t;
      
                      pthread_create(&t, NULL, mmap_thread, NULL);
                      pthread_create(&t, NULL, fork_thread, NULL);
                      usleep(rand() % 10000);
                      syscall(__NR_exit_group, 0);
                  }
                  wait(NULL);
              }
          }
      
      No special kernel config options are needed.  It usually causes a NULL
      pointer dereference in __remove_shared_vm_struct() during exit, or in
      dup_mmap() (which is usually inlined into copy_process()) during fork.
      Both are due to a vm_area_struct's ->vm_file being used after it's
      already been freed.
      
      Google Bug Id: 64772007
      
      Link: http://lkml.kernel.org/r/20170823211408.31198-1-ebiggers3@gmail.com
      Fixes: 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Tested-by: default avatarMark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[v4.7+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2b7e8665
    • Eric Biggers's avatar
      mm/madvise.c: fix freeing of locked page with MADV_FREE · 263630e8
      Eric Biggers authored
      If madvise(..., MADV_FREE) split a transparent hugepage, it called
      put_page() before unlock_page().
      
      This was wrong because put_page() can free the page, e.g. if a
      concurrent madvise(..., MADV_DONTNEED) has removed it from the memory
      mapping. put_page() then rightfully complained about freeing a locked
      page.
      
      Fix this by moving the unlock_page() before put_page().
      
      This bug was found by syzkaller, which encountered the following splat:
      
          BUG: Bad page state in process syzkaller412798  pfn:1bd800
          page:ffffea0006f60000 count:0 mapcount:0 mapping:          (null) index:0x20a00
          flags: 0x200000000040019(locked|uptodate|dirty|swapbacked)
          raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff
          raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000
          page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
          bad because of flags: 0x1(locked)
          Modules linked in:
          CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35
          Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
          Call Trace:
           __dump_stack lib/dump_stack.c:16 [inline]
           dump_stack+0x194/0x257 lib/dump_stack.c:52
           bad_page+0x230/0x2b0 mm/page_alloc.c:565
           free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943
           free_pages_check mm/page_alloc.c:952 [inline]
           free_pages_prepare mm/page_alloc.c:1043 [inline]
           free_pcp_prepare mm/page_alloc.c:1068 [inline]
           free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584
           __put_single_page mm/swap.c:79 [inline]
           __put_page+0xfb/0x160 mm/swap.c:113
           put_page include/linux/mm.h:814 [inline]
           madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371
           walk_pmd_range mm/pagewalk.c:50 [inline]
           walk_pud_range mm/pagewalk.c:108 [inline]
           walk_p4d_range mm/pagewalk.c:134 [inline]
           walk_pgd_range mm/pagewalk.c:160 [inline]
           __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249
           walk_page_range+0x200/0x470 mm/pagewalk.c:326
           madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444
           madvise_free_single_vma+0x353/0x580 mm/madvise.c:471
           madvise_dontneed_free mm/madvise.c:555 [inline]
           madvise_vma mm/madvise.c:664 [inline]
           SYSC_madvise mm/madvise.c:832 [inline]
           SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760
           entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Here is a C reproducer:
      
          #define _GNU_SOURCE
          #include <pthread.h>
          #include <sys/mman.h>
          #include <unistd.h>
      
          #define MADV_FREE	8
          #define PAGE_SIZE	4096
      
          static void *mapping;
          static const size_t mapping_size = 0x1000000;
      
          static void *madvise_thrproc(void *arg)
          {
              madvise(mapping, mapping_size, (long)arg);
          }
      
          int main(void)
          {
              pthread_t t[2];
      
              for (;;) {
                  mapping = mmap(NULL, mapping_size, PROT_WRITE,
                                 MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      
                  munmap(mapping + mapping_size / 2, PAGE_SIZE);
      
                  pthread_create(&t[0], 0, madvise_thrproc, (void*)MADV_DONTNEED);
                  pthread_create(&t[1], 0, madvise_thrproc, (void*)MADV_FREE);
                  pthread_join(t[0], NULL);
                  pthread_join(t[1], NULL);
                  munmap(mapping, mapping_size);
              }
          }
      
      Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and
      CONFIG_DEBUG_VM=y are needed.
      
      Google Bug Id: 64696096
      
      Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com
      Fixes: 854e9ed0 ("mm: support madvise(MADV_FREE)")
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>	[v4.5+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      263630e8
    • Ross Zwisler's avatar
      dax: fix deadlock due to misaligned PMD faults · fffa281b
      Ross Zwisler authored
      In DAX there are two separate places where the 2MiB range of a PMD is
      defined.
      
      The first is in the page tables, where a PMD mapping inserted for a
      given address spans from (vmf->address & PMD_MASK) to ((vmf->address &
      PMD_MASK) + PMD_SIZE - 1).  That is, from the 2MiB boundary below the
      address to the 2MiB boundary above the address.
      
      So, for example, a fault at address 3MiB (0x30 0000) falls within the
      PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).
      
      The second PMD range is in the mapping->page_tree, where a given file
      offset is covered by a radix tree entry that spans from one 2MiB aligned
      file offset to another 2MiB aligned file offset.
      
      So, for example, the file offset for 3MiB (pgoff 768) falls within the
      PMD range for the order 9 radix tree entry that ranges from 2MiB (pgoff
      512) to 4MiB (pgoff 1024).
      
      This system works so long as the addresses and file offsets for a given
      mapping both have the same offsets relative to the start of each PMD.
      
      Consider the case where the starting address for a given file isn't 2MiB
      aligned - say our faulting address is 3 MiB (0x30 0000), but that
      corresponds to the beginning of our file (pgoff 0).  Now all the PMDs in
      the mapping are misaligned so that the 2MiB range defined in the page
      tables never matches up with the 2MiB range defined in the radix tree.
      
      The current code notices this case for DAX faults to storage with the
      following test in dax_pmd_insert_mapping():
      
      	if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
      		goto unlock_fallback;
      
      This test makes sure that the pfn we get from the driver is 2MiB
      aligned, and relies on the assumption that the 2MiB alignment of the pfn
      we get back from the driver matches the 2MiB alignment of the faulting
      address.
      
      However, faults to holes were not checked and we could hit the problem
      described above.
      
      This was reported in response to the NVML nvml/src/test/pmempool_sync
      TEST5:
      
      	$ cd nvml/src/test/pmempool_sync
      	$ make TEST5
      
      You can grab NVML here:
      
      	https://github.com/pmem/nvml/
      
      The dmesg warning you see when you hit this error is:
      
        WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310
      
      Where we notice in dax_insert_mapping_entry() that the radix tree entry
      we are about to replace doesn't match the locked entry that we had
      previously inserted into the tree.  This happens because the initial
      insertion was done in grab_mapping_entry() using a pgoff calculated from
      the faulting address (vmf->address), and the replacement in
      dax_pmd_load_hole() => dax_insert_mapping_entry() is done using
      vmf->pgoff.
      
      In our failure case those two page offsets (one calculated from
      vmf->address, one using vmf->pgoff) point to different order 9 radix
      tree entries.
      
      This failure case can result in a deadlock because the radix tree unlock
      also happens on the pgoff calculated from vmf->address.  This means that
      the locked radix tree entry that we swapped in to the tree in
      dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
      future faults to that 2MiB range will block forever.
      
      Fix this by validating that the faulting address's PMD offset matches
      the PMD offset from the start of the file.  This check is done at the
      very beginning of the fault and covers faults that would have mapped to
      storage as well as faults to holes.  I left the COLOUR check in
      dax_pmd_insert_mapping() in place in case we ever hit the insanity
      condition where the alignment of the pfn we get from the driver doesn't
      match the alignment of the userspace address.
      
      Link: http://lkml.kernel.org/r/20170822222436.18926-1-ross.zwisler@linux.intel.comSigned-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: default avatar"Slusarz, Marcin" <marcin.slusarz@intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fffa281b
    • Kirill A. Shutemov's avatar
      mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled · 435c0b87
      Kirill A. Shutemov authored
      /sys/kernel/mm/transparent_hugepage/shmem_enabled controls if we want
      to allocate huge pages when allocate pages for private in-kernel shmem
      mount.
      
      Unfortunately, as Dan noticed, I've screwed it up and the only way to
      make kernel allocate huge page for the mount is to use "force" there.
      All other values will be effectively ignored.
      
      Link: http://lkml.kernel.org/r/20170822144254.66431-1-kirill.shutemov@linux.intel.com
      Fixes: 5a6e75f8 ("shmem: prepare huge= mount option and sysfs knob")
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Cc: stable <stable@vger.kernel.org> [4.8+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      435c0b87
    • Chen Yu's avatar
      PM/hibernate: touch NMI watchdog when creating snapshot · 556b969a
      Chen Yu authored
      There is a problem that when counting the pages for creating the
      hibernation snapshot will take significant amount of time, especially on
      system with large memory.  Since the counting job is performed with irq
      disabled, this might lead to NMI lockup.  The following warning were
      found on a system with 1.5TB DRAM:
      
        Freezing user space processes ... (elapsed 0.002 seconds) done.
        OOM killer disabled.
        PM: Preallocating image memory...
        NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
        CPU: 27 PID: 3128 Comm: systemd-sleep Not tainted 4.13.0-0.rc2.git0.1.fc27.x86_64 #1
        task: ffff9f01971ac000 task.stack: ffffb1a3f325c000
        RIP: 0010:memory_bm_find_bit+0xf4/0x100
        Call Trace:
         swsusp_set_page_free+0x2b/0x30
         mark_free_pages+0x147/0x1c0
         count_data_pages+0x41/0xa0
         hibernate_preallocate_memory+0x80/0x450
         hibernation_snapshot+0x58/0x410
         hibernate+0x17c/0x310
         state_store+0xdf/0xf0
         kobj_attr_store+0xf/0x20
         sysfs_kf_write+0x37/0x40
         kernfs_fop_write+0x11c/0x1a0
         __vfs_write+0x37/0x170
         vfs_write+0xb1/0x1a0
         SyS_write+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xa5
        ...
        done (allocated 6590003 pages)
        PM: Allocated 26360012 kbytes in 19.89 seconds (1325.28 MB/s)
      
      It has taken nearly 20 seconds(2.10GHz CPU) thus the NMI lockup was
      triggered.  In case the timeout of the NMI watch dog has been set to 1
      second, a safe interval should be 6590003/20 = 320k pages in theory.
      However there might also be some platforms running at a lower frequency,
      so feed the watchdog every 100k pages.
      
      [yu.c.chen@intel.com: simplification]
        Link: http://lkml.kernel.org/r/1503460079-29721-1-git-send-email-yu.c.chen@intel.com
      [yu.c.chen@intel.com: use interval of 128k instead of 100k to avoid modulus]
      Link: http://lkml.kernel.org/r/1503328098-5120-1-git-send-email-yu.c.chen@intel.comSigned-off-by: default avatarChen Yu <yu.c.chen@intel.com>
      Reported-by: default avatarJan Filipcewicz <jan.filipcewicz@intel.com>
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      556b969a
    • Christoph Hellwig's avatar
      virtio_pci: fix cpu affinity support · ba74b6f7
      Christoph Hellwig authored
      Commit 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for
      virtqueues"") removed the adjustment of the pre_vectors for the virtio
      MSI-X vector allocation which was added in commit fb5e31d9 ("virtio:
      allow drivers to request IRQ affinity when creating VQs"). This will
      lead to an incorrect assignment of MSI-X vectors, and potential
      deadlocks when offlining cpus.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Fixes: 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for virtqueues")
      Reported-by: default avatarYASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      ba74b6f7
    • Stefan Hajnoczi's avatar
      virtio_blk: fix incorrect message when disk is resized · 1046d304
      Stefan Hajnoczi authored
      The message printed on disk resize is incorrect.  The following is
      printed when resizing to 2 GiB:
      
        $ truncate -s 1G test.img
        $ qemu -device virtio-blk-pci,logical_block_size=4096,...
        (qemu) block_resize drive1 2G
      
        virtio_blk virtio0: new size: 4194304 4096-byte logical blocks (17.2 GB/16.0 GiB)
      
      The virtio_blk capacity config field is in 512-byte sector units
      regardless of logical_block_size as per the VIRTIO specification.
      Therefore the message should read:
      
        virtio_blk virtio0: new size: 524288 4096-byte logical blocks (2.15 GB/2.0 GiB)
      
      Note that this only affects the printed message.  Thankfully the actual
      block device has the correct size because the block layer expects
      capacity in sectors.
      Signed-off-by: default avatarStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      1046d304