  1. 27 Feb, 2018 2 commits
    • bcache: fix kcrashes with fio in RAID5 backend dev · 60eb34ec
      Tang Junhui authored
      The kernel crashed when running fio on a bcache device with a RAID5
      backend; the call trace is below:
      [  440.012034] kernel BUG at block/blk-ioc.c:146!
      [  440.012696] invalid opcode: 0000 [#1] SMP NOPTI
      [  440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8
      [  440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16/2015
      [  440.028615] RIP: 0010:put_io_context+0x8b/0x90
      [  440.029246] RSP: 0018:ffffa8c882b43af8 EFLAGS: 00010246
      [  440.029990] RAX: 0000000000000000 RBX: ffffa8c88294fca0 RCX: 00000000000f4240
      [  440.031006] RDX: 0000000000000004 RSI: 0000000000000286 RDI: ffffa8c88294fca0
      [  440.032030] RBP: ffffa8c882b43b10 R08: 0000000000000003 R09: ffff949cb80c1700
      [  440.033206] R10: 0000000000000104 R11: 000000000000b71c R12: 0000000000001000
      [  440.034222] R13: 0000000000000000 R14: ffff949cad84db70 R15: ffff949cb11bd1e0
      [  440.035239] FS:  0000000000000000(0000) GS:ffff949cba280000(0000) knlGS:0000000000000000
      [  440.060190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  440.084967] CR2: 00007ff0493ef000 CR3: 00000002f1e0a002 CR4: 00000000001606e0
      [  440.110498] Call Trace:
      [  440.135443]  bio_disassociate_task+0x1b/0x60
      [  440.160355]  bio_free+0x1b/0x60
      [  440.184666]  bio_put+0x23/0x30
      [  440.208272]  search_free+0x23/0x40 [bcache]
      [  440.231448]  cached_dev_write_complete+0x31/0x70 [bcache]
      [  440.254468]  closure_put+0xb6/0xd0 [bcache]
      [  440.277087]  request_endio+0x30/0x40 [bcache]
      [  440.298703]  bio_endio+0xa1/0x120
      [  440.319644]  handle_stripe+0x418/0x2270 [raid456]
      [  440.340614]  ? load_balance+0x17b/0x9c0
      [  440.360506]  handle_active_stripes.isra.58+0x387/0x5a0 [raid456]
      [  440.380675]  ? __release_stripe+0x15/0x20 [raid456]
      [  440.400132]  raid5d+0x3ed/0x5d0 [raid456]
      [  440.419193]  ? schedule+0x36/0x80
      [  440.437932]  ? schedule_timeout+0x1d2/0x2f0
      [  440.456136]  md_thread+0x122/0x150
      [  440.473687]  ? wait_woken+0x80/0x80
      [  440.491411]  kthread+0x102/0x140
      [  440.508636]  ? find_pers+0x70/0x70
      [  440.524927]  ? kthread_associate_blkcg+0xa0/0xa0
      [  440.541791]  ret_from_fork+0x35/0x40
      [  440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb c2 48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 <0f> 0b 0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41
      [  440.610020] RIP: put_io_context+0x8b/0x90 RSP: ffffa8c882b43af8
      [  440.628575] ---[ end trace a1fd79d85643a73e ]---
      
      All of the crashes happened while handling a bypass IO. In that scenario
      s->iop.bio points to s->orig_bio. search_free() finishes s->orig_bio by
      calling bio_complete(), after which s->iop.bio is no longer valid, so the
      kernel crashes when bio_put() is then called. The upper layer may be at
      fault, since a bio should not be freed before we call bio_put(), but it
      is safer to call bio_put() first and only then call bio_complete() to
      notify the upper layer that the bio has ended.

      This patch moves bio_complete() after bio_put() to avoid the kernel crash.
      
      [mlyle: fixed commit subject for character limits]
      Reported-by: Matthias Ferdinand <bcache@mfedv.net>
      Tested-by: Matthias Ferdinand <bcache@mfedv.net>
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: correct flash only vols (check all uuids) · 02aa8a8b
      Coly Li authored
      Commit 2831231d ("bcache: reduce cache_set devices iteration by
      devices_max_used") adds c->devices_max_used to reduce iteration over
      c->uuids elements. This value is updated in bcache_device_attach().

      But for a flash only volume, when flash_devs_run() is called,
      bcache_device_attach() has not been called yet and c->devices_max_used
      has not been updated. As a result, the flash only volume is never run
      by flash_devs_run().

      This patch fixes the issue by iterating over all c->uuids elements in
      flash_devs_run(). c->devices_max_used will be updated properly when
      bcache_device_attach() gets called.
      
      [mlyle: commit subject edited for character limit]
      
      Fixes: 2831231d ("bcache: reduce cache_set devices iteration by devices_max_used")
      Reported-by: Tang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
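      To illustrate why a scan bounded by c->devices_max_used misses flash
      only volumes, here is a minimal userspace sketch in C. It is not the
      bcache code: struct uuid_entry_model, struct cache_set_model,
      scan_bounded() and scan_all() are hypothetical names. Only the idea of
      a bound that is still zero before any bcache_device_attach() call is
      taken from the commit message.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical, simplified stand-ins for the bcache structures. */
      struct uuid_entry_model {
          bool in_use;      /* slot holds a device */
          bool flash_only;  /* device is a flash only volume */
      };

      struct cache_set_model {
          struct uuid_entry_model *uuids;
          size_t nr_uuids;          /* total number of uuid slots */
          size_t devices_max_used;  /* only updated when a device attaches */
      };

      /* Count flash only volumes found in the first `limit` slots. */
      static size_t count_flash_vols(const struct cache_set_model *c, size_t limit)
      {
          size_t found = 0;

          for (size_t i = 0; i < limit && i < c->nr_uuids; i++)
              if (c->uuids[i].in_use && c->uuids[i].flash_only)
                  found++;
          return found;
      }

      /* Scan bounded by devices_max_used: finds nothing before any attach. */
      static size_t scan_bounded(const struct cache_set_model *c)
      {
          return count_flash_vols(c, c->devices_max_used);
      }

      /* Scan every slot, as the fix describes: independent of attach order. */
      static size_t scan_all(const struct cache_set_model *c)
      {
          return count_flash_vols(c, c->nr_uuids);
      }
      ```

      With devices_max_used still 0, scan_bounded() visits no slots at all,
      while scan_all() sees every flash only volume in the table.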
  2. 26 Feb, 2018 8 commits
    • blktrace_api.h: fix comment for struct blk_user_trace_setup · 9c722588
      Eric Biggers authored
      'struct blk_user_trace_setup' is passed to BLKTRACESETUP, not
      BLKTRACESTART.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blockdev: Avoid two active bdev inodes for one device · 560e7cb2
      Jan Kara authored
      When blkdev_open() races with device removal and creation, an unhashed
      bdev inode can get associated with a newly created gendisk, like this:
      
      CPU0					CPU1
      blkdev_open()
        bdev = bd_acquire()
      					del_gendisk()
      					  bdev_unhash_inode(bdev);
      					remove device
      					create new device with the same number
        __blkdev_get()
          disk = get_gendisk()
            - gets reference to gendisk of the new device
      
      Now another blkdev_open() will not find original 'bdev' as it got
      unhashed, create a new one and associate it with the same 'disk' at
      which point problems start as we have two independent page caches for
      one device.
      
      Fix the problem by verifying that the bdev inode didn't get unhashed
      before we acquired gendisk reference. That way we make sure gendisk can
      get associated only with visible bdev inodes.
      Tested-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • genhd: Fix BUG in blkdev_open() · 56c0908c
      Jan Kara authored
      When two blkdev_open() calls for a partition race with device removal
      and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
      blkdev_open(). The race can happen as follows:
      
      CPU0				CPU1			CPU2
      							del_gendisk()
      							  bdev_unhash_inode(part1);
      
      blkdev_open(part1, O_EXCL)	blkdev_open(part1, O_EXCL)
        bdev = bd_acquire()		  bdev = bd_acquire()
        blkdev_get(bdev)
          bd_start_claiming(bdev)
            - finds old inode 'whole'
            bd_prepare_to_claim() -> 0
      							  bdev_unhash_inode(whole);
      							<device removed>
      							<new device under same
      							 number created>
      				  blkdev_get(bdev);
      				    bd_start_claiming(bdev)
      				      - finds new inode 'whole'
      				      bd_prepare_to_claim()
      					- this also succeeds as we have
      					  different 'whole' here...
      					- bad things happen now as we
      					  have two exclusive openers of
      					  the same bdev
      
      The problem here is that block device opens can see various intermediate
      states while gendisk is shutting down and then being recreated.
      
      We fix the problem by introducing new lookup_sem in gendisk that
      synchronizes gendisk deletion with get_gendisk() and furthermore by
      making sure that get_gendisk() does not return gendisk that is being (or
      has been) deleted. This makes sure that once we ever manage to look up
      newly created bdev inode, we are also guaranteed that following
      get_gendisk() will either return failure (and we fail open) or it
      returns gendisk for the new device and following bdget_disk() will
      return new bdev inode (i.e., blkdev_open() follows the path as if it is
      completely run after new device is created).
      Reported-and-analyzed-by: Hou Tao <houtao1@huawei.com>
      Tested-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • genhd: Fix use after free in __blkdev_get() · 89736653
      Jan Kara authored
      When two blkdev_open() calls race with device removal and recreation,
      __blkdev_get() can use looked up gendisk after it is freed:
      
      CPU0				CPU1			CPU2
      							del_gendisk(disk);
      							  bdev_unhash_inode(inode);
      blkdev_open()			blkdev_open()
        bdev = bd_acquire(inode);
          - creates and returns new inode
      				  bdev = bd_acquire(inode);
      				    - returns the same inode
        __blkdev_get(devt)		  __blkdev_get(devt)
          disk = get_gendisk(devt);
            - got structure of device going away
      							<finish device removal>
      							<new device gets
      							 created under the same
      							 device number>
      				  disk = get_gendisk(devt);
      				    - got new device structure
      				  if (!bdev->bd_openers) {
      				    does the first open
      				  }
          if (!bdev->bd_openers)
            - false
          } else {
            put_disk_and_module(disk)
              - remember this was old device - this was last ref and disk is
                now freed
          }
          disk_unblock_events(disk); -> oops
      
      Fix the problem by making sure we drop reference to disk in
      __blkdev_get() only after we are really done with it.
      Reported-by: Hou Tao <houtao1@huawei.com>
      Tested-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
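      The ordering rule behind this fix (finish every use of the disk, then
      drop the reference) can be sketched with a toy reference count in
      userspace C. This is an illustration under stated assumptions, not the
      genhd code: struct disk_model, disk_put() and the two open_*() helpers
      are hypothetical, and "freed" is a flag rather than a real release so
      the bad case can be observed safely.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Hypothetical refcounted object standing in for a gendisk. */
      struct disk_model {
          int refs;
          bool freed;
          int events_unblocked;
      };

      static void disk_put(struct disk_model *d)
      {
          assert(d->refs > 0);
          if (--d->refs == 0)
              d->freed = true;    /* real code would release the memory here */
      }

      /* Returns false if called on an object that is already freed. */
      static bool disk_unblock_events_model(struct disk_model *d)
      {
          if (d->freed)
              return false;       /* this would be the use after free */
          d->events_unblocked++;
          return true;
      }

      /* Buggy order: drop the reference, then keep using the object. */
      static bool open_put_then_use(struct disk_model *d)
      {
          disk_put(d);
          return disk_unblock_events_model(d);
      }

      /* Fixed order: finish every use, then drop the reference. */
      static bool open_use_then_put(struct disk_model *d)
      {
          bool ok = disk_unblock_events_model(d);

          disk_put(d);
          return ok;
      }
      ```

      When the caller holds the last reference, open_put_then_use() touches
      an object that is already gone, while open_use_then_put() does not.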
    • genhd: Add helper put_disk_and_module() · 9df6c299
      Jan Kara authored
      Add a proper counterpart to get_disk_and_module() -
      put_disk_and_module(). Currently it is opencoded in several places.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • genhd: Rename get_disk() to get_disk_and_module() · 3079c22e
      Jan Kara authored
      Rename get_disk() to get_disk_and_module() to make clear what the
      function does. It's not a great name but at least it is now clear that
      put_disk() is not its counterpart.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • genhd: Fix leaked module reference for NVME devices · d52987b5
      Jan Kara authored
      Commit 8ddcd653 "block: introduce GENHD_FL_HIDDEN" added handling of
      hidden devices to get_gendisk() but forgot to drop module reference
      which is also acquired by get_disk(). Drop the reference as necessary.
      
      Arguably the function naming here is misleading as put_disk() is *not*
      the counterpart of get_disk() but let's fix that in the follow up
      commit since that will be more intrusive.
      
      Fixes: 8ddcd653
      CC: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • direct-io: Fix sleep in atomic due to sync AIO · d9c10e5b
      Jan Kara authored
      Commit e864f395 "fs: add RWF_DSYNC aand RWF_SYNC" added an additional
      way for direct IO to become synchronous and thus trigger fsync from the
      IO completion handler. Then commit 9830f4be "fs: Use RWF_* flags for
      AIO operations" allowed these flags to be set for AIO as well. However,
      that commit forgot to update the condition checking whether the IO
      completion handling should be deferred to a workqueue, and thus AIO DIO
      with RWF_[D]SYNC set will call fsync() from IRQ context, resulting in
      sleep in atomic.
      
      Fix the problem by checking directly iocb flags (the same way as it is
      done in dio_complete()) instead of checking all conditions that could
      lead to IO being synchronous.
      
      CC: Christoph Hellwig <hch@lst.de>
      CC: Goldwyn Rodrigues <rgoldwyn@suse.com>
      CC: stable@vger.kernel.org
      Reported-by: Mark Rutland <mark.rutland@arm.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Fixes: 9830f4be
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
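      The decision this commit describes, checking the iocb flags directly
      to see whether completion must be deferred, might look roughly like the
      following userspace sketch. Everything here is hypothetical: struct
      iocb_model, the MODEL_IOCB_* bit values and must_defer_completion() are
      illustration names, not the kernel's definitions. The sketch also
      assumes that a full sync request sets the DSYNC bit as well, so testing
      one bit covers both RWF_DSYNC and RWF_SYNC.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Hypothetical flag bits; the kernel's IOCB_* values differ. */
      #define MODEL_IOCB_DSYNC (1u << 0)
      #define MODEL_IOCB_SYNC  (1u << 1)

      struct iocb_model {
          unsigned int flags;
          bool is_async;
          bool is_write;
      };

      /*
       * Decide whether completion must be deferred to a workqueue.
       * An async write that needs a sync on completion cannot finish in
       * IRQ context, because syncing may sleep.
       */
      static bool must_defer_completion(const struct iocb_model *iocb)
      {
          if (!iocb->is_async || !iocb->is_write)
              return false;
          /* Assumption: SYNC requests carry the DSYNC bit too. */
          return (iocb->flags & MODEL_IOCB_DSYNC) != 0;
      }
      ```

      Testing the flag itself, rather than enumerating every path that can
      make the IO synchronous, means a new way of setting the flag cannot be
      missed by the check.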
  3. 24 Feb, 2018 2 commits
  4. 23 Feb, 2018 4 commits
  5. 22 Feb, 2018 10 commits
    • fs/signalfd: fix build error for BUS_MCEERR_AR · 9026e820
      Randy Dunlap authored
      Fix build error in fs/signalfd.c by using same method that is used in
      kernel/signal.c: separate blocks for different signal si_code values.
      
      ./fs/signalfd.c: error: 'BUS_MCEERR_AR' undeclared (first use in this function)
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • Merge tag 'usb-4.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · a638af00
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are a number of USB fixes for 4.16-rc3
      
        Nothing major, but a number of different fixes all over the place in
        the USB stack for reported issues. Mostly gadget driver fixes,
        although the typical set of xhci bugfixes are there, along with some
        new quirks additions as well.
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'usb-4.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (39 commits)
        Revert "usb: musb: host: don't start next rx urb if current one failed"
        usb: musb: fix enumeration after resume
        usb: cdc_acm: prevent race at write to acm while system resumes
        Add delay-init quirk for Corsair K70 RGB keyboards
        usb: ohci: Proper handling of ed_rm_list to handle race condition between usb_kill_urb() and finish_unlinks()
        usb: host: ehci: always enable interrupt for qtd completion at test mode
        usb: ldusb: add PIDs for new CASSY devices supported by this driver
        usb: renesas_usbhs: missed the "running" flag in usb_dmac with rx path
        usb: host: ehci: use correct device pointer for dma ops
        usbip: keep usbip_device sockfd state in sync with tcp_socket
        ohci-hcd: Fix race condition caused by ohci_urb_enqueue() and io_watchdog_func()
        USB: serial: option: Add support for Quectel EP06
        xhci: fix xhci debugfs errors in xhci_stop
        xhci: xhci debugfs device nodes weren't removed after device plugged out
        xhci: Fix xhci debugfs devices node disappearance after hibernation
        xhci: Fix NULL pointer in xhci debugfs
        xhci: Don't print a warning when setting link state for disabled ports
        xhci: workaround for AMD Promontory disabled ports wakeup
        usb: dwc3: core: Fix ULPI PHYs and prevent phy_get/ulpi_init during suspend/resume
        USB: gadget: udc: Add missing platform_device_put() on error in bdc_pci_probe()
        ...
    • Merge tag 'staging-4.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 77f892eb
      Linus Torvalds authored
      Pull staging/IIO fixes from Greg KH:
       "Here are a small number of staging and iio driver fixes for 4.16-rc2.
      
        The IIO fixes are all for reported things, and the android driver
        fixes also resolve some reported problems. The remaining fsl-mc
        Kconfig change resolves a build testing error that Arnd reported.
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'staging-4.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        iio: buffer: check if a buffer has been set up when poll is called
        iio: adis_lib: Initialize trigger before requesting interrupt
        staging: android: ion: Zero CMA allocated memory
        staging: android: ashmem: Fix a race condition in pin ioctls
        staging: fsl-mc: fix build testing on x86
        iio: srf08: fix link error "devm_iio_triggered_buffer_setup" undefined
        staging: iio: ad5933: switch buffer mode to software
        iio: adc: stm32: fix stm32h7_adc_enable error handling
        staging: iio: adc: ad7192: fix external frequency setting
        iio: adc: aspeed: Fix error handling path
    • Merge tag 'char-misc-4.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · bb17186a
      Linus Torvalds authored
      Pull char/misc driver fixes from Greg KH:
       "Here are a handful of char/misc driver fixes for 4.16-rc3.
      
        There are some binder driver fixes to resolve reported issues in
        stress testing the recent binder changes, some extcon driver fixes,
        and a few mei driver fixes and new device ids.
      
        All of these, with the exception of the mei driver id additions, have
        been in linux-next for a while. I forgot to push out the mei driver id
        additions to kernel.org until today, but all build tests pass with
        them enabled"
      
      * tag 'char-misc-4.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        mei: me: add cannon point device ids for 4th device
        mei: me: add cannon point device ids
        mei: set device client to the disconnected state upon suspend.
        ANDROID: binder: synchronize_rcu() when using POLLFREE.
        binder: replace "%p" with "%pK"
        ANDROID: binder: remove WARN() for redundant txn error
        binder: check for binder_thread allocation failure in binder_poll()
        extcon: int3496: process id-pin first so that we start with the right status
        Revert "extcon: axp288: Redo charger type detection a couple of seconds after probe()"
        extcon: axp288: Constify the axp288_pwr_up_down_info array
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 004e390d
      Linus Torvalds authored
      Pull rdma fixes from Doug Ledford:
       "Nothing in this is overly interesting, it's mostly your garden variety
        fixes.
      
        There was some work in this merge cycle around the new ioctl kABI, so
        there are fixes in here related to that (probably with more to come).
      
        We've also recently added new netlink support with a goal of moving
        the primary means of configuring the entire subsystem to netlink
        (eventually, this is a long term project), so there are fixes for
        that.
      
        Then a few bnxt_re driver fixes, and a few minor WARN_ON removals, and
        that covers this pull request. There are already a few more fixes on
        the list as of this morning, so there will certainly be more to come
        in this rc cycle ;-)
      
        Summary:
      
         - Lots of fixes for the new IOCTL interface and general uverbs flow.
           Found through testing and syzkaller
      
         - Bugfixes for the new resource track netlink reporting
      
         - Remove some unneeded WARN_ONs that were triggering for some users
           in IPoIB
      
         - Various fixes for the bnxt_re driver"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (27 commits)
        RDMA/uverbs: Fix kernel panic while using XRC_TGT QP type
        RDMA/bnxt_re: Avoid system hang during device un-reg
        RDMA/bnxt_re: Fix system crash during load/unload
        RDMA/bnxt_re: Synchronize destroy_qp with poll_cq
        RDMA/bnxt_re: Unpin SQ and RQ memory if QP create fails
        RDMA/bnxt_re: Disable atomic capability on bnxt_re adapters
        RDMA/restrack: don't use uaccess_kernel()
        RDMA/verbs: Check existence of function prior to accessing it
        RDMA/vmw_pvrdma: Fix usage of user response structures in ABI file
        RDMA/uverbs: Sanitize user entered port numbers prior to access it
        RDMA/uverbs: Fix circular locking dependency
        RDMA/uverbs: Fix bad unlock balance in ib_uverbs_close_xrcd
        RDMA/restrack: Increment CQ restrack object before committing
        RDMA/uverbs: Protect from command mask overflow
        IB/uverbs: Fix unbalanced unlock on error path for rdma_explicit_destroy
        IB/uverbs: Improve lockdep_check
        RDMA/uverbs: Protect from races between lookup and destroy of uobjects
        IB/uverbs: Hold the uobj write lock after allocate
        IB/uverbs: Fix possible oops with duplicate ioctl attributes
        IB/uverbs: Add ioctl support for 32bit processes
        ...
    • Merge tag 'riscv-for-linus-4.16-rc3-riscv_cleanups' of... · 24180a60
      Linus Torvalds authored
      Merge tag 'riscv-for-linus-4.16-rc3-riscv_cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux
      
      Pull RISC-V cleanups from Palmer Dabbelt:
       "This contains a handful of small cleanups.
      
        The only functional change is that IRQs are now enabled during
        exception handling, which was found when some warnings triggered with
        `CONFIG_DEBUG_ATOMIC_SLEEP=y`.
      
        The remaining fixes should have no functional change: `sbi_save()` has
        been renamed to `parse_dtb()` to reflect what it actually does, and a
        handful of unused Kconfig entries have been removed"
      
      * tag 'riscv-for-linus-4.16-rc3-riscv_cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux:
        Rename sbi_save to parse_dtb to improve code readability
        RISC-V: Enable IRQ during exception handling
        riscv: Remove ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE select
        riscv: kconfig: Remove RISCV_IRQ_INTC select
        riscv: Remove ARCH_WANT_OPTIONAL_GPIOLIB select
    • Merge branch 'akpm' (patches from Andrew) · 238ca357
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "16 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm: don't defer struct page initialization for Xen pv guests
        lib/Kconfig.debug: enable RUNTIME_TESTING_MENU
        vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems
        selftests/memfd: add run_fuse_test.sh to TEST_FILES
        bug.h: work around GCC PR82365 in BUG()
        mm/swap.c: make functions and their kernel-doc agree (again)
        mm/zpool.c: zpool_evictable: fix mismatch in parameter name and kernel-doc
        ida: do zeroing in ida_pre_get()
        mm, swap, frontswap: fix THP swap if frontswap enabled
        certs/blacklist_nohashes.c: fix const confusion in certs blacklist
        kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE
        mm, mlock, vmscan: no more skipping pagevecs
        mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats
        Kbuild: always define endianess in kconfig.h
        include/linux/sched/mm.h: re-inline mmdrop()
        tools: fix cross-compile var clobbering
    • efivarfs: Limit the rate for non-root to read files · bef3efbe
      Luck, Tony authored
      Each read from a file in efivarfs results in two calls to EFI
      (one to get the file size, another to get the actual data).
      
      On X86 these EFI calls result in broadcast system management
      interrupts (SMI) which affect performance of the whole system.
      A malicious user can loop performing reads from efivarfs bringing
      the system to its knees.
      
      Linus suggested per-user rate limit to solve this.
      
      So we add a ratelimit structure to "user_struct" and initialize
      it for the root user for no limit. When allocating user_struct for
      other users we set the limit to 100 per second. This could be used
      for other places that want to limit the rate of some detrimental
      user action.
      
      In efivarfs if the limit is exceeded when reading, we take an
      interruptible nap for 50ms and check the rate limit again.
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
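      As a rough illustration of the policy described in this commit (100
      reads per second for ordinary users, no limit for root, and a 50 ms nap
      before rechecking), here is a small fixed-window limiter in userspace C.
      It is a sketch under stated assumptions and not the kernel's ratelimit
      code: struct rate_limit_model, rate_limit_allow() and
      wait_for_read_slot() are hypothetical, time is passed in explicitly as
      milliseconds, and a burst of 0 is this model's own way of saying
      "unlimited".

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>

      /* Hypothetical fixed-window limiter; not the kernel's ratelimit_state. */
      struct rate_limit_model {
          uint64_t interval_ms;   /* window length */
          unsigned int burst;     /* events allowed per window; 0 = unlimited */
          uint64_t window_start;  /* start of the current window, in ms */
          unsigned int used;      /* events counted in the current window */
      };

      /* Return true if an event at time now_ms is within the limit. */
      static bool rate_limit_allow(struct rate_limit_model *rl, uint64_t now_ms)
      {
          if (rl->burst == 0)
              return true;                    /* e.g. the root user */

          if (now_ms - rl->window_start >= rl->interval_ms) {
              rl->window_start = now_ms;      /* open a new window */
              rl->used = 0;
          }
          if (rl->used < rl->burst) {
              rl->used++;
              return true;
          }
          return false;
      }

      /*
       * Model of the read path: while over the limit, "nap" 50 ms and retry.
       * Returns the simulated time at which the read may proceed.
       */
      static uint64_t wait_for_read_slot(struct rate_limit_model *rl,
                                         uint64_t now_ms)
      {
          while (!rate_limit_allow(rl, now_ms))
              now_ms += 50;                   /* stands in for the 50 ms sleep */
          return now_ms;
      }
      ```

      A user initialised with interval_ms = 1000 and burst = 100 gets 100
      reads through in a window; the next read keeps napping in 50 ms steps
      until a new window opens.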
    • kconfig.h: Include compiler types to avoid missed struct attributes · 28128c61
      Kees Cook authored
      The header files for some structures could get included in such a way
      that struct attributes (specifically __randomize_layout from path.h) would
      be parsed as variable names instead of attributes. This could lead to
      some instances of a structure being unrandomized, causing nasty GPFs, etc.
      
      This patch makes sure the compiler_types.h header is included in
      kconfig.h so that we've always got types and struct attributes defined,
      since kconfig.h is included from the compiler command line.
      Reported-by: Patrick McLean <chutzpah@gentoo.org>
      Root-caused-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Tested-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
      Fixes: 3859a271 ("randstruct: Mark various structs for randomization")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: Treat R_X86_64_PLT32 as R_X86_64_PC32 · b21ebf2f
      H.J. Lu authored
      On i386, there are 2 types of PLTs, PIC and non-PIC.  PIE and shared
      objects must use PIC PLT.  To use PIC PLT, you need to load
      _GLOBAL_OFFSET_TABLE_ into EBX first.  There is no need for that on
      x86-64 since x86-64 uses PC-relative PLT.
      
      On x86-64, for 32-bit PC-relative branches, we can generate PLT32
      relocation, instead of PC32 relocation, which can also be used as
      a marker for 32-bit PC-relative branches.  Linker can always reduce
      PLT32 relocation to PC32 if function is defined locally.   Local
      functions should use PC32 relocation.  As far as Linux kernel is
      concerned, R_X86_64_PLT32 can be treated the same as R_X86_64_PC32
      since Linux kernel doesn't use PLT.
      
      R_X86_64_PLT32 for 32-bit PC-relative branches has been enabled in
      binutils master branch which will become binutils 2.31.
      
      [ hjl is working on having better documentation on this all, but a few
        more notes from him:
      
         "PLT32 relocation is used as marker for PC-relative branches. Because
          of EBX, it looks odd to generate PLT32 relocation on i386 when EBX
          doesn't have GOT.
      
          As for symbol resolution, PLT32 and PC32 relocations are almost
          interchangeable. But when linker sees PLT32 relocation against a
          protected symbol, it can be resolved locally at link-time since it is
          used on a branch instruction. Linker can't do that for PC32
          relocation"
      
        but for the kernel use, the two are basically the same, and this
        commit gets things building and working with the current binutils
        master   - Linus ]
      Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
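      "Treating PLT32 the same as PC32" amounts to letting both relocation
      types share one case when a relocation is applied. The sketch below
      shows that idea in userspace C. It is not the kernel's relocation code:
      apply_pcrel_reloc() and the MODEL_ prefixed names are hypothetical, and
      only the type numbers (2 for R_X86_64_PC32, 4 for R_X86_64_PLT32) and
      the S + A - P calculation come from the x86-64 psABI.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* ELF x86-64 relocation type numbers from the psABI. */
      #define MODEL_R_X86_64_PC32  2
      #define MODEL_R_X86_64_PLT32 4

      /*
       * Compute the 32-bit value stored for a PC-relative relocation.
       * sym = symbol address (S), addend = A, place = address patched (P).
       * Returns 0 on success, -1 for a type this sketch does not handle
       * or when the result does not fit in 32 bits.
       */
      static int apply_pcrel_reloc(unsigned int type, uint64_t sym,
                                   int64_t addend, uint64_t place, int32_t *out)
      {
          int64_t val;

          switch (type) {
          case MODEL_R_X86_64_PC32:
          case MODEL_R_X86_64_PLT32:  /* no PLT in use: handled like PC32 */
              val = (int64_t)(sym + (uint64_t)addend - place);
              if (val != (int64_t)(int32_t)val)
                  return -1;          /* does not fit in 32 bits */
              *out = (int32_t)val;
              return 0;
          default:
              return -1;
          }
      }
      ```

      Both types produce the same stored value for the same symbol, addend
      and place, which is why a toolchain that starts emitting PLT32 for
      branches needs no other change in code that never uses a PLT.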
  6. 21 Feb, 2018 14 commits
    • mm: don't defer struct page initialization for Xen pv guests · 895f7b8e
      Juergen Gross authored
      Commit f7f99100 ("mm: stop zeroing memory during allocation in
      vmemmap") broke Xen pv domains in some configurations, as the "Pinned"
      information in struct page of early page tables could get lost.
      
      This will lead to the kernel trying to write directly into the page
      tables instead of asking the hypervisor to do so.  The result is a crash
      like the following:
      
        BUG: unable to handle kernel paging request at ffff8801ead19008
        IP: xen_set_pud+0x4e/0xd0
        PGD 1c0a067 P4D 1c0a067 PUD 23a0067 PMD 1e9de0067 PTE 80100001ead19065
        Oops: 0003 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-default+ #271
        Hardware name: Dell Inc. Latitude E6440/0159N7, BIOS A07 06/26/2014
        task: ffffffff81c10480 task.stack: ffffffff81c00000
        RIP: e030:xen_set_pud+0x4e/0xd0
        Call Trace:
         __pmd_alloc+0x128/0x140
         ioremap_page_range+0x3f4/0x410
         __ioremap_caller+0x1c3/0x2e0
         acpi_os_map_iomem+0x175/0x1b0
         acpi_tb_acquire_table+0x39/0x66
         acpi_tb_validate_table+0x44/0x7c
         acpi_tb_verify_temp_table+0x45/0x304
         acpi_reallocate_root_table+0x12d/0x141
         acpi_early_init+0x4d/0x10a
         start_kernel+0x3eb/0x4a1
         xen_start_kernel+0x528/0x532
        Code: 48 01 e8 48 0f 42 15 a2 fd be 00 48 01 d0 48 ba 00 00 00 00 00 ea ff ff 48 c1 e8 0c 48 c1 e0 06 48 01 d0 48 8b 00 f6 c4 02 75 5d <4c> 89 65 00 5b 5d 41 5c c3 65 8b 05 52 9f fe 7e 89 c0 48 0f a3
        RIP: xen_set_pud+0x4e/0xd0 RSP: ffffffff81c03cd8
        CR2: ffff8801ead19008
        ---[ end trace 38eca2e56f1b642e ]---
      
      Avoid this problem by not deferring struct page initialization when
      running as Xen pv guest.
      
      Pavel said:
      
      : This is unique to Xen, so this particular issue won't affect other
      : configurations.  I am going to investigate if there is a way to
      : re-enable deferred page initialization on Xen guests.
      
      [akpm@linux-foundation.org: explicitly include xen.h]
      Link: http://lkml.kernel.org/r/20180216154101.22865-1-jgross@suse.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: <stable@vger.kernel.org>	[4.15.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/Kconfig.debug: enable RUNTIME_TESTING_MENU · 908009e8
      Anders Roxell authored
      Commit d3deafaa ("lib/: make RUNTIME_TESTS a menuconfig to ease
      disabling it all") causes a regression when using runtime tests because
      it defaults RUNTIME_TESTING_MENU to not set.
      
      Link: http://lkml.kernel.org/r/20180214133015.10090-1-anders.roxell@linaro.org
      Fixes: d3deafaa ("lib/: make RUNTIME_TESTS a menuconfig to ease disabling it all")
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Cc: Vincent Legoll <vincent.legoll@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      908009e8
    • Michal Hocko's avatar
      vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems · 698d0831
      Michal Hocko authored
      Kai Heng Feng has noticed that BUG_ON(PageHighMem(pg)) triggers in
      drivers/media/common/saa7146/saa7146_core.c since 19809c2d ("mm,
      vmalloc: use __GFP_HIGHMEM implicitly").
      
      saa7146_vmalloc_build_pgtable uses vmalloc_32 and it is reasonable to
      expect that the resulting page is not in highmem.  The above commit
      aimed to add __GFP_HIGHMEM only for those requests which do not specify
      any zone modifier gfp flag.  vmalloc_32 relies on GFP_VMALLOC32 which
      should do the right thing.  However, it was missed that GFP_VMALLOC32
      is an alias for GFP_KERNEL on 32b architectures.  Thanks to Matthew for
      noticing this.
      
      Fix the problem by unconditionally setting GFP_DMA32 in GFP_VMALLOC32
      for !64b arches (as a bailout).  This should do the right thing and use
      ZONE_NORMAL which should be always below 4G on 32b systems.
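
      The selection logic described above can be sketched as a stand-alone C
      fragment.  This is an illustration, not the kernel header: the DEMO_*
      flag values and DEMO_CONFIG_* switches are hypothetical stand-ins, and
      the two 64b branches are an assumption about the surrounding code.  Only
      the final branch (always carry a zone modifier on !64b) comes from the
      text above.

```c
/* Hypothetical flag bits; the kernel's real gfp values differ. */
#define DEMO_GFP_KERNEL 0x01u
#define DEMO_GFP_DMA    0x02u
#define DEMO_GFP_DMA32  0x04u

/*
 * Define DEMO_CONFIG_64BIT together with DEMO_CONFIG_ZONE_DMA32 or
 * DEMO_CONFIG_ZONE_DMA at build time to model a 64b configuration.
 */
#if defined(DEMO_CONFIG_64BIT) && defined(DEMO_CONFIG_ZONE_DMA32)
#define DEMO_GFP_VMALLOC32 (DEMO_GFP_DMA32 | DEMO_GFP_KERNEL)
#elif defined(DEMO_CONFIG_64BIT) && defined(DEMO_CONFIG_ZONE_DMA)
#define DEMO_GFP_VMALLOC32 (DEMO_GFP_DMA | DEMO_GFP_KERNEL)
#else
/*
 * !64b: set the DMA32 modifier unconditionally, so the request always
 * names a zone and no highmem flag is added implicitly.
 */
#define DEMO_GFP_VMALLOC32 (DEMO_GFP_DMA32 | DEMO_GFP_KERNEL)
#endif

/* Returns nonzero when the request names a zone explicitly. */
static int demo_has_zone_modifier(unsigned int gfp)
{
    return (gfp & (DEMO_GFP_DMA | DEMO_GFP_DMA32)) != 0;
}
```

      With no DEMO_CONFIG_* switch defined, demo_has_zone_modifier() is true
      for DEMO_GFP_VMALLOC32 and false for plain DEMO_GFP_KERNEL, which is the
      distinction the implicit-highmem logic depends on.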
      
      Debugged by Matthew Wilcox.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20180212095019.GX21609@dhcp22.suse.cz
      Fixes: 19809c2d ("mm, vmalloc: use __GFP_HIGHMEM implicitly")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Kai Heng Feng <kai.heng.feng@canonical.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      698d0831
    • Anders Roxell's avatar
      selftests/memfd: add run_fuse_test.sh to TEST_FILES · bdefe01a
      Anders Roxell authored
      While testing memfd tests, there is a missing script, as reported by
      kselftest:
      
        ./run_tests.sh: line 7: ./run_fuse_test.sh: No such file or directory
      
      Link: http://lkml.kernel.org/r/1517955779-11386-1-git-send-email-daniel.diaz@linaro.org
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Signed-off-by: Daniel Díaz <daniel.diaz@linaro.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bdefe01a
    • Arnd Bergmann's avatar
      bug.h: work around GCC PR82365 in BUG() · 173a3efd
      Arnd Bergmann authored
      Looking at functions with large stack frames across all architectures
      led me to discover that BUG() suffers from the same problem as
      fortify_panic(), which I've added a workaround for already.
      
      In short, variables that go out of scope by calling a noreturn function
      or __builtin_unreachable() keep using stack space in functions
      afterwards.
      
      A workaround that was identified is to insert an empty assembler
      statement just before calling the function that doesn't return.  I'm
      adding a macro "barrier_before_unreachable()" to document this, and
      insert calls to that in all instances of BUG() that currently suffer
      from this problem.
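
      A minimal user-space sketch of the workaround, assuming GCC or Clang
      inline-asm syntax.  DEMO_BUG() and demo_checked_div() are hypothetical
      names used only for illustration, and abort() stands in for whatever
      noreturn path an architecture's BUG() takes.

```c
#include <stdlib.h>

/* An empty assembler statement, placed just before the noreturn call. */
#define demo_barrier_before_unreachable() __asm__ volatile("")

/* Hypothetical BUG()-like macro: control never continues past it. */
#define DEMO_BUG() do {                         \
        demo_barrier_before_unreachable();      \
        abort();                                \
    } while (0)

/* Example caller: the error path ends in the noreturn macro. */
static int demo_checked_div(int a, int b)
{
    if (b == 0)
        DEMO_BUG();
    return a / b;
}
```

      Whether the empty asm actually reduces the frame size depends on the
      compiler version; the sketch only shows where the statement goes.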
      
      The files that saw the largest change from this had these frame sizes
      before, and much less with my patch:
      
        fs/ext4/inode.c:82:1: warning: the frame size of 1672 bytes is larger than 800 bytes [-Wframe-larger-than=]
        fs/ext4/namei.c:434:1: warning: the frame size of 904 bytes is larger than 800 bytes [-Wframe-larger-than=]
        fs/ext4/super.c:2279:1: warning: the frame size of 1160 bytes is larger than 800 bytes [-Wframe-larger-than=]
        fs/ext4/xattr.c:146:1: warning: the frame size of 1168 bytes is larger than 800 bytes [-Wframe-larger-than=]
        fs/f2fs/inode.c:152:1: warning: the frame size of 1424 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_core.c:1195:1: warning: the frame size of 1068 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_core.c:395:1: warning: the frame size of 1084 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_ftp.c:298:1: warning: the frame size of 928 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_ftp.c:418:1: warning: the frame size of 908 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_lblcr.c:718:1: warning: the frame size of 960 bytes is larger than 800 bytes [-Wframe-larger-than=]
        drivers/net/xen-netback/netback.c:1500:1: warning: the frame size of 1088 bytes is larger than 800 bytes [-Wframe-larger-than=]
      
      In case of ARC and CRIS, it turns out that the BUG() implementation
      actually does return (or at least the compiler thinks it does),
      resulting in lots of warnings about uninitialized variable use and
      leaving noreturn functions, such as:
      
        block/cfq-iosched.c: In function 'cfq_async_queue_prio':
        block/cfq-iosched.c:3804:1: error: control reaches end of non-void function [-Werror=return-type]
        include/linux/dmaengine.h: In function 'dma_maxpq':
        include/linux/dmaengine.h:1123:1: error: control reaches end of non-void function [-Werror=return-type]
      
      This makes them call __builtin_trap() instead, which should normally
      dump the stack and kill the current process, like some of the other
      architectures already do.
      
      I tried adding barrier_before_unreachable() to panic() and
      fortify_panic() as well, but that had very little effect, so I'm not
      submitting that patch.
      
      Vineet said:
      
      : For ARC, it is double win.
      :
      : 1. Fixes 3 -Wreturn-type warnings
      :
      : | ../net/core/ethtool.c:311:1: warning: control reaches end of non-void function
      : [-Wreturn-type]
      : | ../kernel/sched/core.c:3246:1: warning: control reaches end of non-void function
      : [-Wreturn-type]
      : | ../include/linux/sunrpc/svc_xprt.h:180:1: warning: control reaches end of
      : non-void function [-Wreturn-type]
      :
      : 2.  bloat-o-meter reports code size improvements as gcc elides the
      :    generated code for stack return.
      
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82365
      Link: http://lkml.kernel.org/r/20171219114112.939391-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>	[arch/arc]
      Tested-by: Vineet Gupta <vgupta@synopsys.com>	[arch/arc]
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Christopher Li <sparse@chrisli.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      173a3efd
    • Mike Rapoport's avatar
      mm/swap.c: make functions and their kernel-doc agree (again) · cb6f0f34
      Mike Rapoport authored
      There was a conflict between the commit e02a9f04 ("mm/swap.c: make
      functions and their kernel-doc agree") and the commit f144c390 ("mm:
      docs: fix parameter names mismatch") that both tried to fix the mismatch
      between pagevec_lookup_entries() parameter names and their description.
      
      Since nr_entries is a better name for the parameter, fix the description
      again.
      
      Link: http://lkml.kernel.org/r/1518116946-20947-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb6f0f34
    • Mike Rapoport's avatar
      14fec9eb
    • Rasmus Villemoes's avatar
      ida: do zeroing in ida_pre_get() · b1a8a7a7
      Rasmus Villemoes authored
      As far as I can tell, the only place the per-cpu ida_bitmap is populated
      is in ida_pre_get.  The pre-allocated element is stolen in two places in
      ida_get_new_above, in both cases immediately followed by a memset(0).
      
      Since ida_get_new_above is called with locks held, do the zeroing in
      ida_pre_get, or rather let kmalloc() do it.  Also, apparently gcc
      generates ~44 bytes of code to do a memset(, 0, 128):
      
        $ scripts/bloat-o-meter vmlinux.{0,1}
        add/remove: 0/0 grow/shrink: 2/1 up/down: 5/-88 (-83)
        Function                                     old     new   delta
        ida_pre_get                                  115     119      +4
        vermagic                                      27      28      +1
        ida_get_new_above                            715     627     -88
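
      The idea of moving the zeroing into the pre-allocation step can be
      sketched in user space as follows.  This is an analogy, not the ida
      code: struct demo_ida and both functions are hypothetical, calloc()
      stands in for a zeroing kmalloc(), and a single spare pointer stands in
      for the per-cpu bitmap.

```c
#include <stdlib.h>

#define DEMO_BITMAP_BYTES 128

/* Hypothetical container holding one pre-allocated spare bitmap. */
struct demo_ida {
    unsigned char *prealloc;    /* already zeroed, or NULL */
};

/*
 * Runs without any lock held: allocate the spare bitmap already zeroed.
 * Returns 1 when a spare is available afterwards, 0 on allocation failure.
 */
static int demo_pre_get(struct demo_ida *ida)
{
    if (!ida->prealloc) {
        ida->prealloc = calloc(1, DEMO_BITMAP_BYTES);
        if (!ida->prealloc)
            return 0;
    }
    return 1;
}

/*
 * The caller is assumed to hold its lock here: take the spare bitmap
 * as-is, with no memset() needed on this path.
 */
static unsigned char *demo_take_bitmap(struct demo_ida *ida)
{
    unsigned char *bitmap = ida->prealloc;

    ida->prealloc = NULL;
    return bitmap;
}
```

      The consumer path shrinks to a pointer swap, which is the kind of
      saving the bloat-o-meter output above reports for ida_get_new_above.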
      
      Link: http://lkml.kernel.org/r/20180108225634.15340-1-linux@rasmusvillemoes.dk
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1a8a7a7
    • Huang Ying's avatar
      mm, swap, frontswap: fix THP swap if frontswap enabled · 7ba71669
      Huang Ying authored
      It was reported by Sergey Senozhatsky that if THP (Transparent Huge
      Page) and frontswap (via zswap) are both enabled, then when memory runs
      low and swap is triggered, segfaults and memory corruption occur in
      random user space applications, as follows:
      
      kernel: urxvt[338]: segfault at 20 ip 00007fc08889ae0d sp 00007ffc73a7fc40 error 6 in libc-2.26.so[7fc08881a000+1ae000]
       #0  0x00007fc08889ae0d _int_malloc (libc.so.6)
       #1  0x00007fc08889c2f3 malloc (libc.so.6)
       #2  0x0000560e6004bff7 _Z14rxvt_wcstoutf8PKwi (urxvt)
       #3  0x0000560e6005e75c n/a (urxvt)
       #4  0x0000560e6007d9f1 _ZN16rxvt_perl_interp6invokeEP9rxvt_term9hook_typez (urxvt)
       #5  0x0000560e6003d988 _ZN9rxvt_term9cmd_parseEv (urxvt)
       #6  0x0000560e60042804 _ZN9rxvt_term6pty_cbERN2ev2ioEi (urxvt)
       #7  0x0000560e6005c10f _Z17ev_invoke_pendingv (urxvt)
       #8  0x0000560e6005cb55 ev_run (urxvt)
       #9  0x0000560e6003b9b9 main (urxvt)
       #10 0x00007fc08883af4a __libc_start_main (libc.so.6)
       #11 0x0000560e6003f9da _start (urxvt)
      
      After bisection, it was found the first bad commit is bd4c82c2 ("mm,
      THP, swap: delay splitting THP after swapped out").
      
      The root cause is as follows:
      
      When pages are written to the swap device during swap-out in
      swap_writepage(), zswap (frontswap) is tried first, compressing the
      pages to improve performance.  But zswap (frontswap) treats a THP as a
      normal page, so only the head page is saved.  After swap-in, the tail
      pages are not restored to their original contents, causing memory
      corruption in the applications.
      
      This is fixed by refusing to save the page in the frontswap store
      functions if the page is a THP, so that the THP is swapped out to the
      swap device instead.
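
      A minimal sketch of such a refusal, in user-space C.  struct demo_page
      and demo_frontswap_store() are hypothetical models, and returning
      -EINVAL as the refusal code is an assumption of this sketch; the point
      is only that a non-zero return makes the caller fall back to the swap
      device.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical page model: only the property the check needs. */
struct demo_page {
    bool is_thp;    /* true for a transparent huge page */
};

/*
 * Backend store hook.  Returns 0 when the page was taken by the backend,
 * or a negative errno when it was refused and must go to the swap device.
 */
static int demo_frontswap_store(const struct demo_page *page)
{
    /* Only the head page could be saved, so refuse a THP outright. */
    if (page->is_thp)
        return -EINVAL;

    /* A real backend would compress and store the page here. */
    return 0;
}
```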
      
      Another choice is to split the THP if frontswap is enabled.  But it was
      found that enabling frontswap isn't flexible.  For example, if
      CONFIG_ZSWAP=y (cannot be module), frontswap will be enabled even if
      zswap itself isn't enabled.
      
      Frontswap has multiple backends.  To make it easy for one backend to
      enable THP support, the THP check is put in the backend frontswap store
      functions instead of the general interfaces.
      
      Link: http://lkml.kernel.org/r/20180209084947.22749-1-ying.huang@intel.com
      Fixes: bd4c82c2 ("mm, THP, swap: delay splitting THP after swapped out")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>	[put THP checking in backend]
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: <stable@vger.kernel.org>	[4.14]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ba71669
    • Andi Kleen's avatar
    • David Rientjes's avatar
      kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE · 88913bd8
      David Rientjes authored
      chan->n_subbufs is set by the user and relay_create_buf() does a kmalloc()
      of chan->n_subbufs * sizeof(size_t *).
      
      kmalloc_slab() will generate a warning when this fails if
      chan->n_subbufs * sizeof(size_t *) > KMALLOC_MAX_SIZE.
      
      Limit chan->n_subbufs to the maximum allowed kmalloc() size.
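
      A user-space sketch of such a limit check.  DEMO_KMALLOC_MAX_SIZE is a
      hypothetical value (the real constant depends on the kernel
      configuration), malloc() stands in for kmalloc(), and
      demo_alloc_padding() is an illustrative name.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical limit; the kernel's KMALLOC_MAX_SIZE is config dependent. */
#define DEMO_KMALLOC_MAX_SIZE ((size_t)1 << 22)

/*
 * Allocate one size_t pointer slot per sub-buffer.  The count is compared
 * against the limit by division, so the check itself cannot overflow.
 * Returns NULL when the request would exceed the limit.
 */
static size_t **demo_alloc_padding(size_t n_subbufs)
{
    if (n_subbufs > DEMO_KMALLOC_MAX_SIZE / sizeof(size_t *))
        return NULL;

    return malloc(n_subbufs * sizeof(size_t *));
}
```

      Rejecting the count before multiplying means an oversized request
      fails quietly instead of reaching the allocator at all.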
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802061216100.122576@chino.kir.corp.google.com
      Fixes: f6302f1b ("relay: prevent integer overflow in relay_open()")
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88913bd8
    • Shakeel Butt's avatar
      mm, mlock, vmscan: no more skipping pagevecs · 9c4e6b1a
      Shakeel Butt authored
      When a thread mlocks an address space backed either by file pages which
      are currently not present in memory or swapped out anon pages (not in
      swapcache), a new page is allocated and added to the local pagevec
      (lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
      On I/O completion, the thread can wake on a different CPU.  The mlock
      syscall then sets the PageMlocked() bit of the page but is not able to
      put that page on the unevictable LRU, as the page is on the pagevec
      of a different CPU.  Even on drain, that page will go to evictable LRU
      because the PageMlocked() bit is not checked on pagevec drain.
      
      The page will eventually go to right LRU on reclaim but the LRU stats
      will remain skewed for a long time.
      
      This patch puts all the pages, even unevictable, to the pagevecs and on
      the drain, the pages will be added on their LRUs correctly by checking
      their evictability.  This resolves the mlocked pages on pagevec of other
      CPUs issue because when those pagevecs will be drained, the mlocked file
      pages will go to unevictable LRU.  Also this makes the race with munlock
      easier to resolve because the pagevec drains happen in LRU lock.
      
      However, there is still one place which makes a page evictable and does
      a PageLRU check on that page without the LRU lock, and it needs special
      attention: TestClearPageMlocked() and isolate_lru_page() in
      clear_page_mlock().
      
      	#0: __pagevec_lru_add_fn	#1: clear_page_mlock
      
      	SetPageLRU()			if (!TestClearPageMlocked())
      					  return
      	smp_mb() // <--required
      					// inside does PageLRU
      	if (!PageMlocked())		if (isolate_lru_page())
      	  move to evictable LRU		  putback_lru_page()
      	else
      	  move to unevictable LRU
      
      In '#1', TestClearPageMlocked() provides full memory barrier semantics
      and thus the PageLRU check (inside isolate_lru_page) can not be
      reordered before it.
      
      In '#0', without explicit memory barrier, the PageMlocked() check can be
      reordered before SetPageLRU().  If that happens, '#0' can put a page in
      unevictable LRU and '#1' might have just cleared the Mlocked bit of that
      page but fails to isolate as PageLRU fails as '#0' still hasn't set
      PageLRU bit of that page.  That page will be stranded on the unevictable
      LRU.
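
      The ordering requirement above can be modelled with C11 atomics.  This
      is a user-space analogy only: struct demo_page and both functions are
      hypothetical, a sequentially consistent fence stands in for smp_mb(),
      and a sequentially consistent exchange stands in for
      TestClearPageMlocked().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical page model with the two flags from the diagram. */
struct demo_page {
    atomic_bool lru;        /* models PageLRU */
    atomic_bool mlocked;    /* models PageMlocked */
    bool on_unevictable;    /* which LRU side #0 picked */
};

/* Side #0: set the LRU flag, full barrier, then check the mlocked flag. */
static void demo_lru_add(struct demo_page *p)
{
    atomic_store_explicit(&p->lru, true, memory_order_relaxed);
    /* Without this fence the load below could be ordered before the store. */
    atomic_thread_fence(memory_order_seq_cst);
    p->on_unevictable =
        atomic_load_explicit(&p->mlocked, memory_order_relaxed);
}

/*
 * Side #1: clear the mlocked flag, then check the LRU flag.
 * Returns true when the page was mlocked and is visible on an LRU,
 * i.e. when this side could isolate it and put it back.
 */
static bool demo_clear_mlock(struct demo_page *p)
{
    if (!atomic_exchange(&p->mlocked, false))
        return false;
    return atomic_load(&p->lru);
}
```

      With both sides ordered this way, either side #0 sees the cleared
      mlocked flag and picks the evictable LRU, or side #1 sees the LRU flag
      and can move the page itself; the stranded case needs both checks to
      miss.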
      
      There is one (good) side effect though.  Without this patch, the pages
      allocated for System V shared memory segment are added to evictable LRUs
      even after shmctl(SHM_LOCK) on that segment.  This patch will correctly
      put such pages to unevictable LRU.
      
      Link: http://lkml.kernel.org/r/20171121211241.18877-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c4e6b1a
    • Johannes Weiner's avatar
      mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats · c3cc3911
      Johannes Weiner authored
      After commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting"), we observed slowly upward creeping NR_WRITEBACK
      counts over the course of several days, both the per-memcg stats as well
      as the system counter in e.g.  /proc/meminfo.
      
      The conversion from full per-cpu stat counts to per-cpu cached atomic
      stat counts introduced an irq-unsafe RMW operation into the updates.
      
      Most stat updates come from process context, but one notable exception
      is the NR_WRITEBACK counter.  While writebacks are issued from process
      context, they are retired from (soft)irq context.
      
      When writeback completions interrupt the RMW counter updates of new
      writebacks being issued, the decs from the completions are lost.
      
      Since the global updates are routed through the joint lruvec API, both
      the memcg counters as well as the system counters are affected.
      
      This patch makes the joint stat and event API irq safe.
      
      Link: http://lkml.kernel.org/r/20180203082353.17284-1-hannes@cmpxchg.org
      Fixes: a983b5eb ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Debugged-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3cc3911
    • Arnd Bergmann's avatar
      Kbuild: always define endianess in kconfig.h · 101110f6
      Arnd Bergmann authored
      Build testing with LTO found a couple of files that get compiled
      differently depending on whether asm/byteorder.h gets included early
      enough or not.  In particular, include/asm-generic/qrwlock_types.h is
      affected by this, but there are probably others as well.
      
      The symptom is a series of LTO link time warnings, including these:
      
          net/netlabel/netlabel_unlabeled.h:223: error: type of 'netlbl_unlhsh_add' does not match original declaration [-Werror=lto-type-mismatch]
           int netlbl_unlhsh_add(struct net *net,
          net/netlabel/netlabel_unlabeled.c:377: note: 'netlbl_unlhsh_add' was previously declared here
      
          include/net/ipv6.h:360: error: type of 'ipv6_renew_options_kern' does not match original declaration [-Werror=lto-type-mismatch]
           ipv6_renew_options_kern(struct sock *sk,
          net/ipv6/exthdrs.c:1162: note: 'ipv6_renew_options_kern' was previously declared here
      
          net/core/dev.c:761: note: 'dev_get_by_name_rcu' was previously declared here
           struct net_device *dev_get_by_name_rcu(struct net *net, const char *name)
          net/core/dev.c:761: note: code may be misoptimized unless -fno-strict-aliasing is used
      
          drivers/gpu/drm/i915/i915_drv.h:3377: error: type of 'i915_gem_object_set_to_wc_domain' does not match original declaration [-Werror=lto-type-mismatch]
           i915_gem_object_set_to_wc_domain(struct drm_i915_gem_object *obj, bool write);
          drivers/gpu/drm/i915/i915_gem.c:3639: note: 'i915_gem_object_set_to_wc_domain' was previously declared here
      
          include/linux/debugfs.h:92:9: error: type of 'debugfs_attr_read' does not match original declaration [-Werror=lto-type-mismatch]
           ssize_t debugfs_attr_read(struct file *file, char __user *buf,
          fs/debugfs/file.c:318: note: 'debugfs_attr_read' was previously declared here
      
          include/linux/rwlock_api_smp.h:30: error: type of '_raw_read_unlock' does not match original declaration [-Werror=lto-type-mismatch]
           void __lockfunc _raw_read_unlock(rwlock_t *lock) __releases(lock);
          kernel/locking/spinlock.c:246:26: note: '_raw_read_unlock' was previously declared here
      
          include/linux/fs.h:3308:5: error: type of 'simple_attr_open' does not match original declaration [-Werror=lto-type-mismatch]
           int simple_attr_open(struct inode *inode, struct file *file,
          fs/libfs.c:795: note: 'simple_attr_open' was previously declared here
      
      All of the above are caused by include/asm-generic/qrwlock_types.h
      failing to include asm/byteorder.h after commit e0d02285
      ("locking/qrwlock: Use 'struct qrwlock' instead of 'struct __qrwlock'")
      in linux-4.15.
      
      Similar bugs may or may not exist in older kernels as well, but there is
      no easy way to test those with link-time optimizations, and kernels
      before 4.14 are harder to fix because they don't have Babu's patch
      series.
      
      We had similar issues with CONFIG_ symbols in the past and ended up
      always including the configuration headers through linux/kconfig.h.  This
      works around the issue through that same file, defining either
      __BIG_ENDIAN or __LITTLE_ENDIAN depending on CONFIG_CPU_BIG_ENDIAN,
      which is now always set on all architectures since commit 4c97a0c8
      ("arch: define CPU_BIG_ENDIAN for all fixed big endian archs").
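
      The shape of that definition can be sketched as below.  The DEMO_*
      names are hypothetical stand-ins chosen so the fragment does not clash
      with a C library's own endianness macros; DEMO_CONFIG_CPU_BIG_ENDIAN
      plays the role of CONFIG_CPU_BIG_ENDIAN and would be supplied at build
      time.

```c
/*
 * Exactly one of the two macros is defined, keyed off a single
 * configuration symbol, so every file that sees this header agrees.
 */
#ifdef DEMO_CONFIG_CPU_BIG_ENDIAN
#define DEMO_BIG_ENDIAN 4321
#else
#define DEMO_LITTLE_ENDIAN 1234
#endif

/* Example consumer: compiled identically wherever the header is included. */
static int demo_is_big_endian(void)
{
#ifdef DEMO_BIG_ENDIAN
    return 1;
#else
    return 0;
#endif
}
```

      Because the macro comes from a header that is always included, code
      such as demo_is_big_endian() no longer depends on whether some other
      header happened to be pulled in first.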
      
      Link: http://lkml.kernel.org/r/20180202154104.1522809-2-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Babu Moger <babu.moger@amd.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Nicolas Pitre <nico@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      101110f6