1. 25 Jun, 2015 40 commits
    • Merge tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm · 6597ac8a
      Linus Torvalds authored
      Pull device mapper updates from Mike Snitzer:
      
       - DM core cleanups:
      
           * blk-mq request-based DM no longer uses any mempools now that
             partial completions are no longer handled as part of cloned
             requests
      
       - DM raid cleanups and support for MD raid0
      
       - DM cache core advances and a new stochastic-multi-queue (smq) cache
         replacement policy
      
           * smq is the new default dm-cache policy
      
       - DM thinp cleanups and much more efficient large discard support
      
       - DM statistics support for request-based DM and nanosecond resolution
         timestamps
      
       - Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt
      
      * tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (39 commits)
        dm stats: add support for request-based DM devices
        dm stats: collect and report histogram of IO latencies
        dm stats: support precise timestamps
        dm stats: fix divide by zero if 'number_of_areas' arg is zero
        dm cache: switch the "default" cache replacement policy from mq to smq
        dm space map metadata: fix occasional leak of a metadata block on resize
        dm thin metadata: fix a race when entering fail mode
        dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages
        dm thin: range discard support
        dm thin metadata: add dm_thin_remove_range()
        dm thin metadata: add dm_thin_find_mapped_range()
        dm btree: add dm_btree_remove_leaves()
        dm stats: Use kvfree() in dm_kvfree()
        dm cache: age and write back cache entries even without active IO
        dm cache: prefix all DMERR and DMINFO messages with cache device name
        dm cache: add fail io mode and needs_check flag
        dm cache: wake the worker thread every time we free a migration object
        dm cache: add stochastic-multi-queue (smq) policy
        dm cache: boost promotion of blocks that will be overwritten
        dm cache: defer whole cells
        ...
      6597ac8a
    • Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block · e4bc13ad
      Linus Torvalds authored
      Pull cgroup writeback support from Jens Axboe:
       "This is the big pull request for adding cgroup writeback support.
      
        This code has been in development for a long time, and it has been
        simmering in for-next for a good chunk of this cycle too.  This is one
        of those problems that has been talked about for at least half a
        decade, finally there's a solution and code to go with it.
      
        Also see last week's writeup on LWN:
      
              http://lwn.net/Articles/648292/"
      
      * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
        writeback, blkio: add documentation for cgroup writeback support
        vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
        writeback: do foreign inode detection iff cgroup writeback is enabled
        v9fs: fix error handling in v9fs_session_init()
        bdi: fix wrong error return value in cgwb_create()
        buffer: remove unusued 'ret' variable
        writeback: disassociate inodes from dying bdi_writebacks
        writeback: implement foreign cgroup inode bdi_writeback switching
        writeback: add lockdep annotation to inode_to_wb()
        writeback: use unlocked_inode_to_wb transaction in inode_congested()
        writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
        writeback: implement [locked_]inode_to_wb_and_lock_list()
        writeback: implement foreign cgroup inode detection
        writeback: make writeback_control track the inode being written back
        writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
        mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
        writeback: implement memcg writeback domain based throttling
        writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
        writeback: implement memcg wb_domain
        writeback: update wb_over_bg_thresh() to use wb_domain aware operations
        ...
      e4bc13ad
    • Merge branch 'for-4.2/sg' of git://git.kernel.dk/linux-block · ad90fb97
      Linus Torvalds authored
      Pull asm/scatterlist.h removal from Jens Axboe:
       "We don't have any specific arch scatterlist anymore, since parisc
        finally switched over.  Kill the include"
      
      * 'for-4.2/sg' of git://git.kernel.dk/linux-block:
        remove scatterlist.h generation from arch Kbuild files
        remove <asm/scatterlist.h>
      ad90fb97
    • Merge branch 'for-4.2/drivers' of git://git.kernel.dk/linux-block · 6a398a3e
      Linus Torvalds authored
      Pull block driver updates from Jens Axboe:
       "This contains:
      
         - a few race fixes for null_blk, from Akinobu Mita.
      
         - a series of fixes for mtip32xx, from Asai Thambi and Selvan Mani at
           Micron.
      
         - NVMe:
              * Fix for missing error return on allocation failure, from Axel
                Lin.
      
              * Code consolidation and cleanups from Christoph.
      
              * Memory barrier addition, syncing queue count and queue
                pointers. From Jon Derrick.
      
              * Various fixes from Keith, an addition to support user
                issue reset from sysfs or ioctl, and automatic namespace
                rescan.
      
              * Fix from Matias, avoiding losing some request flags when
                marking the request failfast.
      
         - small cleanups and sparse fixups for ps3vram.  From Geert
           Uytterhoeven and Geoff Levand.
      
         - s390/dasd dead code removal, from Jarod Wilson.
      
         - a set of fixes and optimizations for loop, from Ming Lei.
      
         - conversion to blkdev_reread_part() of loop, dasd, nbd.  From Ming
           Lei.
      
         - updates to cciss.  From Tomas Henzl"
      
      * 'for-4.2/drivers' of git://git.kernel.dk/linux-block: (44 commits)
        mtip32xx: Fix accessing freed memory
        block: nvme-scsi: Catch kcalloc failure
        NVMe: Fix IO for extended metadata formats
        nvme: don't overwrite req->cmd_flags on sync cmd
        mtip32xx: increase wait time for hba reset
        mtip32xx: fix minor number
        mtip32xx: remove unnecessary sleep in mtip_ftl_rebuild_poll()
        mtip32xx: fix crash on surprise removal of the drive
        mtip32xx: Abort I/O during secure erase operation
        mtip32xx: fix incorrectly setting MTIP_DDF_SEC_LOCK_BIT
        mtip32xx: remove unused variable 'port->allocated'
        mtip32xx: fix rmmod issue
        MAINTAINERS: Update ps3vram block driver
        block/ps3vram: Remove obsolete reference to MTD
        block/ps3vram: Fix sparse warnings
        NVMe: Automatic namespace rescan
        NVMe: Memory barrier before queue_count is incremented
        NVMe: add sysfs and ioctl controller reset
        null_blk: restart request processing on completion handler
        null_blk: prevent timer handler running on a different CPU where started
        ...
      6a398a3e
    • Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-block · bfffa1cc
      Linus Torvalds authored
      Pull core block IO update from Jens Axboe:
       "Nothing really major in here, mostly a collection of smaller
        optimizations and cleanups, mixed with various fixes.  In more detail,
        this contains:
      
         - Addition of policy specific data to blkcg for block cgroups.  From
           Arianna Avanzini.
      
         - Various cleanups around command types from Christoph.
      
         - Cleanup of the suspend block I/O path from Christoph.
      
         - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.
      
         - Eliminating atomic inc/dec of both remaining IO count and reference
           count in a bio.  From me.
      
         - Fixes for SG gap and chunk size support for data-less (discards)
           IO, so we can merge these better.  From me.
      
         - Small restructuring of blk-mq shared tag support, freeing drivers
           from iterating hardware queues.  From Keith Busch.
      
         - A few cfq-iosched tweaks, from Tahsin Erdogan and me.  Makes the
           IOPS mode the default for non-rotational storage"
      
      * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
        cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
        cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
        cfq-iosched: move group scheduling functions under ifdef
        cfq-iosched: fix the setting of IOPS mode on SSDs
        blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
        block, cgroup: implement policy-specific per-blkcg data
        block: Make CFQ default to IOPS mode on SSDs
        block: add blk_set_queue_dying() to blkdev.h
        blk-mq: Shared tag enhancements
        block: don't honor chunk sizes for data-less IO
        block: only honor SG gap prevention for merges that contain data
        block: fix returnvar.cocci warnings
        block, dm: don't copy bios for request clones
        block: remove management of bi_remaining when restoring original bi_end_io
        block: replace trylock with mutex_lock in blkdev_reread_part()
        block: export blkdev_reread_part() and __blkdev_reread_part()
        suspend: simplify block I/O handling
        block: collapse bio bit space
        block: remove unused BIO_RW_BLOCK and BIO_EOF flags
        block: remove BIO_EOPNOTSUPP
        ...
      bfffa1cc
    • Merge tag 'upstream-4.2-rc1' of git://git.infradead.org/linux-ubifs · cc8a0a94
      Linus Torvalds authored
      Pull UBI/UBIFS updates from Richard Weinberger:
       "Minor fixes for UBI and UBIFS"
      
      * tag 'upstream-4.2-rc1' of git://git.infradead.org/linux-ubifs:
        UBI: Remove unnecessary `\'
        UBI: Use static class and attribute groups
        UBI: add a helper function for updatting on-flash layout volumes
        UBI: Fastmap: Do not add vol if it already exists
        UBI: Init vol->reserved_pebs by assignment
        UBI: Fastmap: Rename variables to make them meaningful
        UBI: Fastmap: Remove unnecessary `\'
        UBI: Fastmap: Use max() to get the larger value
        ubifs: fix to check error code of register_shrinker
        UBI: block: Dynamically allocate minor numbers
      cc8a0a94
    • Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · d857da7b
      Linus Torvalds authored
      Pull ext4 updates from Ted Ts'o:
       "A very large number of cleanups and bug fixes --- in particular for
        the ext4 encryption patches, which is a new feature added in the last
        merge window.  Also fix a number of long-standing xfstest failures.
        (Quota writes failing due to ENOSPC, a race between truncate and
        writepage in data=journalled mode that was causing generic/068 to
        fail, and other corner cases.)
      
        Also add support for FALLOC_FL_INSERT_RANGE, and improve jbd2
        performance eliminating locking when a buffer is modified more than
        once during a transaction (which is very common for allocation
        bitmaps, for example), in which case the state of the journalled
        buffer head doesn't need to change"
      
      [ I renamed "ext4_follow_link()" to "ext4_encrypted_follow_link()" in
        the merge resolution, to make it clear that that function is _only_
        used for encrypted symlinks.  The function doesn't actually work for
        non-encrypted symlinks at all, and they use the generic helpers
                                               - Linus ]
      
      * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (52 commits)
        ext4: set lazytime on remount if MS_LAZYTIME is set by mount
        ext4: only call ext4_truncate when size <= isize
        ext4: make online defrag error reporting consistent
        ext4: minor cleanup of ext4_da_reserve_space()
        ext4: don't retry file block mapping on bigalloc fs with non-extent file
        ext4: prevent ext4_quota_write() from failing due to ENOSPC
        ext4: call sync_blockdev() before invalidate_bdev() in put_super()
        jbd2: speedup jbd2_journal_dirty_metadata()
        jbd2: get rid of open coded allocation retry loop
        ext4: improve warning directory handling messages
        jbd2: fix ocfs2 corrupt when updating journal superblock fails
        ext4: mballoc: avoid 20-argument function call
        ext4: wait for existing dio workers in ext4_alloc_file_blocks()
        ext4: recalculate journal credits as inode depth changes
        jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail()
        ext4: use swap() in mext_page_double_lock()
        ext4: use swap() in memswap()
        ext4: fix race between truncate and __ext4_journalled_writepage()
        ext4 crypto: fail the mount if blocksize != pagesize
        ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate
        ...
      d857da7b
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · 77d43164
      Linus Torvalds authored
      Pull sparc fixes from David Miller:
       "Sparc perf stack traversal fixes from David Ahern"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
        sparc64: perf: Use UREG_FP rather than UREG_I6
        sparc64: perf: Add sanity checking on addresses in user stack
        sparc64: Convert BUG_ON to warning
        sparc: perf: Disable pagefaults while walking userspace stacks
      77d43164
    • Merge tag 'for-4.2' of git://git.sourceforge.jp/gitroot/uclinux-h8/linux · 55a7d4b8
      Linus Torvalds authored
      Pull Renesas H8/300 architecture re-introduction from Yoshinori Sato.
      
      We dropped arch/h8300 two years ago as stale and old, this is a new and
      more modern rewritten arch support for the same architecture.
      
      * tag 'for-4.2' of git://git.sourceforge.jp/gitroot/uclinux-h8/linux: (27 commits)
        h8300: fix typo.
        h8300: Always build dtb
        h8300: Remove ARCH_WANT_IPC_PARSE_VERSION
        sh-sci: Get register size from platform device
        clk: h8300: fix error handling in h8s2678_pll_clk_setup()
        h8300: Symbol name fix
        h8300: devicetree source
        h8300: configs
        h8300: IRQ chip driver
        h8300: clocksource
        h8300: clock driver
        h8300: Build scripts
        h8300: library functions
        h8300: Memory management
        h8300: miscellaneous functions
        h8300: process helpers
        h8300: compressed image support
        h8300: Low level entry
        h8300: kernel startup
        h8300: Interrupt and exceptions
        ...
      55a7d4b8
    • Merge branch 'sparc-perf-stack' · f01cae4e
      David S. Miller authored
      David Ahern says:
      
      ====================
      sparc64: perf fixes for userspace stacks
      
      Coming back to the perf userspace callchain problem. As a reminder there are
      a series of problems trying to use perf to collect callchains with scheduling
      tracepoints, e.g., perf sched record -g -- <cmd>.
      
      The first patch disables pagefaults while walking the user stack. As discussed
      a couple of months ago this is the right fix, but I was puzzled as to why
      processes were terminating with sigbus (and sometimes sigsegv). I believe the
      root of this problem is bad addresses trying to walk the frames using frame
      pointers. The bad addresses lead to faults that get handled by do_sparc64_fault
      and it aborts the task though I am still puzzled as to why it gets past this
      check in do_sparc64_fault:
      
              if (in_atomic() || !mm)
                      goto intr_or_no_mm;
      
      pagefault_disable bumps the preempt_count which should make in_atomic return != 0
      (building kernels with preemption set to voluntary, CONFIG_PREEMPT_VOLUNTARY=y).
      
      While this set does not fully solve the problem, it does remove a number of
      pain points with the current code, most notably the ability to lock up the system.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f01cae4e
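The reasoning in the cover letter above (pagefault_disable raises the preemption count, so the in_atomic() test at the top of the fault handler should take the early exit) can be illustrated with a toy userspace model. This is only a sketch under stated assumptions: every toy_* name is hypothetical, the counter is a plain int rather than the kernel's per-task state, and nothing here is kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the per-task preemption counter (hypothetical). */
static int toy_preempt_count;

/* Model of "disable page faults": just bump the counter. */
static void toy_pagefault_disable(void) { toy_preempt_count++; }

/* Model of "enable page faults": drop the counter again. */
static void toy_pagefault_enable(void) { toy_preempt_count--; }

/* Model of in_atomic(): true whenever the counter is non-zero. */
static bool toy_in_atomic(void) { return toy_preempt_count != 0; }

enum toy_fault_path { TOY_PATH_NORMAL, TOY_PATH_INTR_OR_NO_MM };

/*
 * Same shape as the check quoted in the cover letter:
 *     if (in_atomic() || !mm) goto intr_or_no_mm;
 */
static enum toy_fault_path toy_fault_entry(bool has_mm)
{
    if (toy_in_atomic() || !has_mm)
        return TOY_PATH_INTR_OR_NO_MM;
    return TOY_PATH_NORMAL;
}

/* Walk-through of the expected behaviour in this model. */
static void toy_fault_examples(void)
{
    assert(toy_fault_entry(true) == TOY_PATH_NORMAL);
    toy_pagefault_disable();
    assert(toy_fault_entry(true) == TOY_PATH_INTR_OR_NO_MM);
    toy_pagefault_enable();
    assert(toy_fault_entry(true) == TOY_PATH_NORMAL);
}
```

In this model a fault taken between the disable and enable calls always reaches the early-exit path, which is the behaviour the author expected and the reason the observed task aborts were puzzling.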
    • sparc64: perf: Use UREG_FP rather than UREG_I6 · 2d89cd86
      David Ahern authored
      perf walks userspace callchains by following frame pointers. Use the
      UREG_FP macro to make it clearer that the %fp is being used.
      Signed-off-by: David Ahern <david.ahern@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2d89cd86
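Following frame pointers, as the commit message above describes, amounts to a linked-list walk over saved frames. The sketch below shows that idea on a simulated stack. It is an illustration only: the two-word frame layout, the use of array indexes as addresses, and every name in it are assumptions made for the example, and the real sparc64 frame layout is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simulated stack: a frame is two consecutive words,
 *   stack[fp]     = saved frame pointer of the caller (0 ends the chain)
 *   stack[fp + 1] = return address
 * "Addresses" are array indexes. This layout is hypothetical.
 */
static size_t walk_frames(const uint64_t *stack, size_t nwords, uint64_t fp,
                          uint64_t *pcs, size_t max_depth)
{
    size_t depth = 0;

    if (nwords < 2)
        return 0;

    while (depth < max_depth) {
        uint64_t next;

        /* Stop on the terminator or on a frame that does not fit. */
        if (fp == 0 || fp > nwords - 2)
            break;

        pcs[depth++] = stack[fp + 1];

        /* Callers sit at higher indexes; refuse loops and backward links. */
        next = stack[fp];
        if (next <= fp)
            break;
        fp = next;
    }
    return depth;
}

/* Three well-formed frames, then the same stack with a corrupted link. */
static void walk_frames_examples(void)
{
    uint64_t stack[16] = {0};
    uint64_t pcs[8];

    stack[2] = 6;   stack[3] = 0x1000;
    stack[6] = 10;  stack[7] = 0x2000;
    stack[10] = 0;  stack[11] = 0x3000;
    assert(walk_frames(stack, 16, 2, pcs, 8) == 3);
    assert(pcs[0] == 0x1000 && pcs[2] == 0x3000);

    stack[6] = 2;   /* corrupted link pointing back to the first frame */
    assert(walk_frames(stack, 16, 2, pcs, 8) == 2);
}
```

The depth limit and the "next must be higher" rule keep the walk bounded even when the chain is corrupt, which is the kind of input the rest of this series is concerned with.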
    • sparc64: perf: Add sanity checking on addresses in user stack · b69fb769
      David Ahern authored
      Processes are getting killed (sigbus or segv) while walking userspace
      callchains when using perf. In some instances I have seen ufp = 0x7ff
      which does not seem like a proper stack address.
      
      This patch adds a function to run validity checks against the address
      before attempting the copy_from_user. The checks are copied from the
      x86 version as a start point with the addition of a 4-byte alignment
      check.
      Signed-off-by: David Ahern <david.ahern@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b69fb769
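The kind of check described in the commit message above (a range test plus a 4-byte alignment test, run before any copy from the user stack) might look roughly like the following userspace sketch. It is not the patch itself: the function name, the SKETCH_USER_LIMIT constant, and the use of plain integers instead of user pointers are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound of the user address range (illustration only). */
#define SKETCH_USER_LIMIT UINT64_C(0x0000800000000000)

/*
 * Return true if [addr, addr + size) looks like a usable frame address:
 * 4-byte aligned and entirely below the assumed user limit.
 */
static bool frame_addr_plausible(uint64_t addr, uint64_t size)
{
    if (addr & 3)                         /* not 4-byte aligned */
        return false;
    if (size > SKETCH_USER_LIMIT)         /* size alone exceeds the range */
        return false;
    if (addr > SKETCH_USER_LIMIT - size)  /* end would pass the limit; no overflow */
        return false;
    return true;
}

/* A few cases, including the 0x7ff value mentioned in the commit message. */
static void frame_addr_examples(void)
{
    assert(!frame_addr_plausible(0x7ff, 16));   /* low bits are 0b11 */
    assert(frame_addr_plausible(0x7ff0, 16));   /* aligned and in range */
    assert(!frame_addr_plausible(SKETCH_USER_LIMIT - 8, 16));
}
```

With this sketch the value ufp = 0x7ff reported in the commit message is rejected by the alignment test alone, before any range comparison is made.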
    • sparc64: Convert BUG_ON to warning · 2bf7c3ef
      David Ahern authored
      Pagefault handling has a BUG_ON path that panics the system. Convert it to
      a warning instead. There is no need to bring down the system for this kind
      of failure.
      
      The following was hit while running:
          perf sched record -g -- make -j 16
      
      [3609412.782801] kernel BUG at /opt/dahern/linux.git/arch/sparc/mm/fault_64.c:416!
      [3609412.782833]               \|/ ____ \|/
      [3609412.782833]               "@'/ .. \`@"
      [3609412.782833]               /_| \__/ |_\
      [3609412.782833]                  \__U_/
      [3609412.782870] cat(4516): Kernel bad sw trap 5 [#1]
      [3609412.782889] CPU: 0 PID: 4516 Comm: cat Tainted: G            E   4.1.0-rc8+ #6
      [3609412.782909] task: fff8000126e31f80 ti: fff8000110d90000 task.ti: fff8000110d90000
      [3609412.782931] TSTATE: 0000004411001603 TPC: 000000000096b164 TNPC: 000000000096b168 Y: 0000004e    Tainted: G            E
      [3609412.782964] TPC: <do_sparc64_fault+0x5e4/0x6a0>
      [3609412.782979] g0: 000000000096abe0 g1: 0000000000d314c4 g2: 0000000000000000 g3: 0000000000000001
      [3609412.783009] g4: fff8000126e31f80 g5: fff80001302d2000 g6: fff8000110d90000 g7: 00000000000000ff
      [3609412.783045] o0: 0000000000aff6a8 o1: 00000000000001a0 o2: 0000000000000001 o3: 0000000000000054
      [3609412.783080] o4: fff8000100026820 o5: 0000000000000001 sp: fff8000110d935f1 ret_pc: 000000000096b15c
      [3609412.783117] RPC: <do_sparc64_fault+0x5dc/0x6a0>
      [3609412.783137] l0: 000007feff996000 l1: 0000000000030001 l2: 0000000000000004 l3: fff8000127bd0120
      [3609412.783174] l4: 0000000000000054 l5: fff8000127bd0188 l6: 0000000000000000 l7: fff8000110d9dba8
      [3609412.783210] i0: fff8000110d93f60 i1: fff8000110ca5530 i2: 000000000000003f i3: 0000000000000054
      [3609412.783244] i4: fff800010000081a i5: fff8000100000398 i6: fff8000110d936a1 i7: 0000000000407c6c
      [3609412.783286] I7: <sparc64_realfault_common+0x10/0x20>
      [3609412.783308] Call Trace:
      [3609412.783329]  [0000000000407c6c] sparc64_realfault_common+0x10/0x20
      [3609412.783353] Disabling lock debugging due to kernel taint
      [3609412.783379] Caller[0000000000407c6c]: sparc64_realfault_common+0x10/0x20
      [3609412.783449] Caller[fff80001002283e4]: 0xfff80001002283e4
      [3609412.783471] Instruction DUMP: 921021a0  7feaff91  901222a8 <91d02005> 82086100  02f87f7b  808a2873  81cfe008  01000000
      [3609412.783542] Kernel panic - not syncing: Fatal exception
      [3609412.784605] Press Stop-A (L1-A) to return to the boot prom
      [3609412.784615] ---[ end Kernel panic - not syncing: Fatal exception
      
      With this patch rather than a panic I occasionally get something like this:
          perf sched record -g -m 1024  -- make -j N
      
      where N is based on number of cpus (128 to 1024 for a T7-4 and 8 for an 8 cpu
      VM on a T5-2).
      
      WARNING: CPU: 211 PID: 52565 at /opt/dahern/linux.git/arch/sparc/mm/fault_64.c:417 do_sparc64_fault+0x340/0x70c()
      address (7feffcd6000) != regs->tpc (fff80001004873c0)
      Modules linked in: ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 cdc_ether usbnet mii ixgbe mdio igb i2c_algo_bit i2c_core ptp crc32c_sparc64 camellia_sparc64 des_sparc64 des_generic md5_sparc64 sha512_sparc64 sha1_sparc64 uio_pdrv_genirq uio usb_storage mpt3sas scsi_transport_sas raid_class aes_sparc64 sunvnet sunvdc sha256_sparc64(E) sha256_generic(E)
      CPU: 211 PID: 52565 Comm: ld Tainted: G        W   E   4.1.0-rc8+ #19
      Call Trace:
       [000000000045ce30] warn_slowpath_common+0x7c/0xa0
       [000000000045ceec] warn_slowpath_fmt+0x30/0x40
       [000000000098ad64] do_sparc64_fault+0x340/0x70c
       [0000000000407c2c] sparc64_realfault_common+0x10/0x20
      ---[ end trace 62ee02065a01a049 ]---
      ld[52565]: segfault at fff80001004873c0 ip fff80001004873c0 (rpc fff8000100158868) sp 000007feffcd70e1 error 30002 in libc-2.12.so[fff8000100410000+184000]
      
      The segfault is horrible, but better than a system panic.
      
      An 8-cpu VM on a T5-2 also showed the above traces from time to time,
      so it is a general problem and not specific to the T7 or baremetal.
      Signed-off-by: David Ahern <david.ahern@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2bf7c3ef
    • sparc: perf: Disable pagefaults while walking userspace stacks · c17af4dd
      David Ahern authored
      Page faults generated walking userspace stacks can call schedule to switch
      out the task. When collecting callchains for scheduler tracepoints this
      causes a deadlock as the tracepoints can be hit with the runqueue lock held:
      
      [ 8138.159054] WARNING: CPU: 758 PID: 12488 at /opt/dahern/linux.git/arch/sparc/kernel/nmi.c:80 perfctr_irq+0x1f8/0x2b4()
      
      [ 8138.203152] Watchdog detected hard LOCKUP on cpu 758
      
      [ 8138.410969] CPU: 758 PID: 12488 Comm: perf Not tainted 4.0.0-rc6+ #6
      [ 8138.437146] Call Trace:
      [ 8138.447193]  [000000000045cdd4] warn_slowpath_common+0x7c/0xa0
      [ 8138.471238]  [000000000045ce90] warn_slowpath_fmt+0x30/0x40
      [ 8138.494189]  [0000000000983e38] perfctr_irq+0x1f8/0x2b4
      [ 8138.515716]  [00000000004209f4] tl0_irq15+0x14/0x20
      [ 8138.535791]  [00000000009839ec] _raw_spin_trylock_bh+0x68/0x108
      [ 8138.560180]  [0000000000980018] __schedule+0xcc/0x710
      [ 8138.580981]  [00000000009806dc] preempt_schedule_common+0x10/0x3c
      [ 8138.606082]  [000000000098077c] _cond_resched+0x34/0x44
      [ 8138.627603]  [0000000000565990] kmem_cache_alloc_node+0x24/0x1a0
      [ 8138.652345]  [0000000000450b60] tsb_grow+0xac/0x488
      [ 8138.672429]  [0000000000985040] do_sparc64_fault+0x4dc/0x6e4
      [ 8138.695736]  [0000000000407c2c] sparc64_realfault_common+0x10/0x20
      [ 8138.721202]  [00000000006f2e24] NG4copy_from_user+0xa4/0x3c0
      [ 8138.744510]  [000000000044f900] perf_callchain_user+0x5c/0x6c
      [ 8138.768182]  [0000000000517b5c] perf_callchain+0x16c/0x19c
      [ 8138.790774]  [0000000000515f84] perf_prepare_sample+0x68/0x218
      [ 8138.814801] ---[ end trace 42ca6294b1ff7573 ]---
      
      As with PowerPC (b59a1bfc, "powerpc/perf: Disable pagefaults during
      callchain stack read") disable pagefaults while walking userspace stacks.
      Signed-off-by: David Ahern <david.ahern@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c17af4dd
    • Merge branch 'akpm' (patches from Andrew) · aefbef10
      Linus Torvalds authored
      Merge first patchbomb from Andrew Morton:
      
       - a few misc things
      
       - ocfs2 updates
      
       - kernel/watchdog.c feature work (took ages to get right)
      
       - most of MM.  A few tricky bits are held up and probably won't make 4.2.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (91 commits)
        mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()
        mm, thp: respect MPOL_PREFERRED policy with non-local node
        tmpfs: truncate prealloc blocks past i_size
        mm/memory hotplug: print the last vmemmap region at the end of hot add memory
        mm/mmap.c: optimization of do_mmap_pgoff function
        mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan
        mm: kmemleak: avoid deadlock on the kmemleak object insertion error path
        mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup()
        mm: kmemleak: fix delete_object_*() race when called on the same memory block
        mm: kmemleak: allow safe memory scanning during kmemleak disabling
        memcg: convert mem_cgroup->under_oom from atomic_t to int
        memcg: remove unused mem_cgroup->oom_wakeups
        frontswap: allow multiple backends
        x86, mirror: x86 enabling - find mirrored memory ranges
        mm/memblock: allocate boot time data structures from mirrored memory
        mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute
        mm: do not ignore mapping_gfp_mask in page cache allocation paths
        mm/cma.c: fix typos in comments
        mm/oom_kill.c: print points as unsigned int
        mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
        ...
      aefbef10
    • Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux · 266da6f1
      Linus Torvalds authored
      Pull pstore updates from Tony Luck:
       "Miscellaneous pstore improvements"
      
      * tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
        ramoops: make it possible to change mem_type param.
        pstore/ram: verify ramoops header before saving record
        fs/pstore: Optimization function ramoops_init_przs
        fs/pstore: update the backend parameter in pstore module
        pstore: do not use message compression without lock
      266da6f1
    • Merge tag 'for-f2fs-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs · cfcc0ad4
      Linus Torvalds authored
      Pull f2fs updates from Jaegeuk Kim:
       "New features:
         - per-file encryption (e.g., ext4)
         - FALLOC_FL_ZERO_RANGE
         - FALLOC_FL_COLLAPSE_RANGE
         - RENAME_WHITEOUT
      
        Major enhancement/fixes:
         - recovery broken superblocks
         - enhance f2fs_trim_fs with a discard_map
         - fix a race condition on dentry block allocation
         - fix a deadlock during summary operation
         - fix a missing fiemap result
      
        .. and many minor bug fixes and clean-ups were done"
      
      * tag 'for-f2fs-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (83 commits)
        f2fs: do not trim preallocated blocks when truncating after i_size
        f2fs crypto: add alloc_bounce_page
        f2fs crypto: fix to handle errors likewise ext4
        f2fs: drop the volatile_write flag only
        f2fs: skip committing valid superblock
        f2fs: setting discard option in parse_options()
        f2fs: fix to return exact trimmed size
        f2fs: support FALLOC_FL_INSERT_RANGE
        f2fs: hide common code in f2fs_replace_block
        f2fs: disable the discard option when device doesn't support
        f2fs crypto: remove alloc_page for bounce_page
        f2fs: fix a deadlock for summary page lock vs. sentry_lock
        f2fs crypto: clean up error handling in f2fs_fname_setup_filename
        f2fs crypto: avoid f2fs_inherit_context for symlink
        f2fs crypto: do not set encryption policy for non-directory by ioctl
        f2fs crypto: allow setting encryption policy once
        f2fs crypto: check context consistent for rename2
        f2fs: avoid duplicated code by reusing f2fs_read_end_io
        f2fs crypto: use per-inode tfm structure
        f2fs: recovering broken superblock during mount
        ...
      cfcc0ad4
    • Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · a7296b49
      Linus Torvalds authored
      Pull UDF fixes and cleanups from Jan Kara:
       "This contains some small fixes and improvements in error handling for
        UDF.
      
        Bundled is also one ext3 coding style fix and a fix in quota
        documentation"
      
      * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        udf: fix udf_load_pvoldesc()
        udf: remove double err declaration in udf_file_write_iter()
        UDF: support NFSv2 export
        fs: ext3: super: fixed a space coding style issue
        quota: Update documentation
        udf: Return error from udf_find_entry()
        udf: Make udf_get_filename() return error instead of 0 length file name
        udf: bug on exotic flag in udf_get_filename()
        udf: improve error management in udf_CS0toNLS()
        udf: improve error management in udf_CS0toUTF8()
        udf: unicode: update function name in comments
        udf: remove unnecessary test in udf_build_ustr_exact()
        udf: Return -ENOMEM when allocation fails in udf_get_filename()
      a7296b49
    • Merge tag 'docs-for-linus' of git://git.lwn.net/linux-2.6 · 1e467e68
      Linus Torvalds authored
      Pull documentation updates from Jonathan Corbet:
       "The main thing here is Ingo's big subdirectory documenting feature
        support for each architecture.  Beyond that, it's the usual pile of
        fixes, tweaks, and small additions"
      
      * tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (79 commits)
        doc:md: fix typo in md.txt.
        Documentation/mic/mpssd: don't build x86 userspace when cross compiling
        Documentation/prctl: don't build tsc tests when cross compiling
        Documentation/vDSO: don't build tests when cross compiling
        Doc:ABI/testing: Fix typo in sysfs-bus-fcoe
        Doc: Docbook: Change wikipedia's URL from http to https in scsi.tmpl
        Doc: Change wikipedia's URL from http to https
        Documentation/kernel-parameters: add missing pciserial to the earlyprintk
        Doc:pps: Fix typo in pps.txt
        kbuild : Fix documentation of INSTALL_HDR_PATH
        Documentation: filesystems: updated struct file_operations documentation in vfs.txt
        kbuild: edit explanation of clean-files variable
        Doc: ja_JP: Fix typo in HOWTO
        Move freefall program from Documentation/ to tools/
        Documentation: ARM: EXYNOS: Describe boot loaders interface
        Doc:nfc: Fix typo in nfc-hci.txt
        vfs: Minor documentation fix
        Doc: networking: txtimestamp: fix printf format warning
        Documentation, intel_pstate: Improve legacy mode internal governors description
        Documentation: extend use case for EXPORT_SYMBOL_GPL()
        ...
      1e467e68
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 14738e03
      Linus Torvalds authored
      Pull input subsystem updates from Dmitry Torokhov:
       "Thanks to Samuel Thibault, input device (keyboard) LEDs are no longer
        hardwired within the input core but use the LED subsystem and so allow
        use of different triggers; Hans de Goede did a large update for the ALPS
        touchpad driver; we have a new TI drv2665 haptics driver and DA9063
        OnKey driver, and a host of other drivers got various fixes"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (55 commits)
        Input: pixcir_i2c_ts - fix receive error
        MAINTAINERS: remove non existent input mt git tree
        Input: improve usage of gpiod API
        tty/vt/keyboard: define LED triggers for VT keyboard lock states
        tty/vt/keyboard: define LED triggers for VT LED states
        Input: export LEDs as class devices in sysfs
        Input: cyttsp4 - use swap() in cyttsp4_get_touch()
        Input: goodix - do not explicitly set evbits in input device
        Input: goodix - export id and version read from device
        Input: goodix - fix variable length array warning
        Input: goodix - fix alignment issues
        Input: add OnKey driver for DA9063 MFD part
        Input: elan_i2c - add product IDs FW names
        Input: elan_i2c - add support for multi IC type and iap format
        Input: focaltech - report finger width to userspace
        tty: remove platform_sysrq_reset_seq
        Input: synaptics_i2c - use proper boolean values
        Input: psmouse - use true instead of 1 for boolean values
        Input: cyapa - fix a few typos in comments
        Input: stmpe-ts - enforce device tree only mode
        ...
      14738e03
    • Merge tag 'edac_for_4.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp · 45471cd9
      Linus Torvalds authored
      Pull EDAC updates from Borislav Petkov:
      
       - New APM X-Gene SoC EDAC driver (Loc Ho)
      
       - AMD error injection module improvements (Aravind Gopalakrishnan)
      
       - Altera Arria 10 support (Thor Thayer)
      
       - misc fixes and cleanups all over the place
      
      * tag 'edac_for_4.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (28 commits)
        EDAC: Update Documentation/edac.txt
        EDAC: Fix typos in Documentation/edac.txt
        EDAC, mce_amd_inj: Set MISCV on injection
        EDAC, mce_amd_inj: Move bit preparations before the injection
        EDAC, mce_amd_inj: Cleanup and simplify README
        EDAC, altera: Do not allow suspend when EDAC is enabled
        EDAC, mce_amd_inj: Make inj_type static
        arm: socfpga: dts: Add Arria10 SDRAM EDAC DTS support
        EDAC, altera: Add Arria10 EDAC support
        EDAC, altera: Refactor for Altera CycloneV SoC
        EDAC, altera: Generalize driver to use DT Memory size
        EDAC, mce_amd_inj: Add README file
        EDAC, mce_amd_inj: Add individual permissions field to dfs_node
        EDAC, mce_amd_inj: Modify flags attribute to use string arguments
        EDAC, mce_amd_inj: Read out number of MCE banks from the hardware
        EDAC, mce_amd_inj: Use MCE_INJECT_GET macro for bank node too
        EDAC, xgene: Fix cpuid abuse
        EDAC, mpc85xx: Extend error address to 64 bit
        EDAC, mpc8xxx: Adapt for FSL SoC
        EDAC, edac_stub: Drop arch-specific include
        ...
      45471cd9
    • Merge tag 'pinctrl-v4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 93a4b1b9
      Linus Torvalds authored
      Pull pin control updates from Linus Walleij:
       "Here is the bulk of pin control changes for the v4.2 series: Quite a
        lot of new SoC subdrivers and two new main drivers this time, apart
        from that business as usual.
      
        Details:
      
        Core functionality:
         - Enable exclusive pin ownership: it is possible to flag a pin
           controller so that GPIO and other functions cannot use a single pin
           simultaneously.
      
        New drivers:
         - NXP LPC18xx System Control Unit pin controller
         - Imagination Pistachio SoC pin controller
      
        New subdrivers:
         - Freescale i.MX7d SoC
         - Intel Sunrisepoint-H PCH
         - Renesas PFC R8A7793
         - Renesas PFC R8A7794
         - Mediatek MT6397, MT8127
         - SiRF Atlas 7
         - Allwinner A33
         - Qualcomm MSM8660
         - Marvell Armada 395
         - Rockchip RK3368
      
        Cleanups:
         - A big cleanup of the Marvell MVEBU driver rectifying it to
           correspond to reality
         - Drop platform device probing from the SH PFC driver, we are now a
           DT only shop for SuperH
         - Drop obsolete multi-platform check for SH PFC
         - Various janitorial: constification, grammar etc
      
        Improvements:
         - The AT91 GPIO portion now supports the set_multiple() feature
         - Split out SPI pins on the Xilinx Zynq
         - Support DTs without specific function nodes in the i.MX driver"
      
      * tag 'pinctrl-v4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (99 commits)
        pinctrl: rockchip: add support for the rk3368
        pinctrl: rockchip: generalize perpin driver-strength setting
        pinctrl: sh-pfc: r8a7794: add SDHI pin groups
        pinctrl: sh-pfc: r8a7794: add MMCIF pin groups
        pinctrl: sh-pfc: add R8A7794 PFC support
        pinctrl: make pinctrl_register() return proper error code
        pinctrl: mvebu: armada-39x: add support for Armada 395 variant
        pinctrl: mvebu: armada-39x: add missing SATA functions
        pinctrl: mvebu: armada-39x: add missing PCIe functions
        pinctrl: mvebu: armada-38x: add ptp functions
        pinctrl: mvebu: armada-38x: add ua1 functions
        pinctrl: mvebu: armada-38x: add nand functions
        pinctrl: mvebu: armada-38x: add sata functions
        pinctrl: mvebu: armada-xp: add dram functions
        pinctrl: mvebu: armada-xp: add nand rb function
        pinctrl: mvebu: armada-xp: add spi1 function
        pinctrl: mvebu: armada-39x: normalize ref clock naming
        pinctrl: mvebu: armada-xp: rename spi to spi0
        pinctrl: mvebu: armada-370: align spi1 clock pin naming
        pinctrl: mvebu: armada-370: align VDD cpu-pd pin naming with datasheet
        ...
      93a4b1b9
    • Merge tag 'backlight-for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight · d59b92f9
      Linus Torvalds authored
      Pull backlight updates from Lee Jones:
       "Changes to existing drivers:
      
         - supply MODULE_DEVICE_TABLE() to ensure probing
         - constify struct; da9052_bl
         - enable compile test; lcd_l4f00242t03, lcd_lms283fg05, backlight_gpio
         - suspend/resume bugfix; lp855x_bl
         - devm_gpiod_get_optional() API fixup; pwm_bl
         - error handling fixup; backlight"
      
      * tag 'backlight-for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
        backlight: Change the return type of backlight_update_status() to int
        backlight: pwm_bl: Simplify usage of devm_gpiod_get_optional
        backlight: lp855x: Don't clear level on suspend/blank
        backlight: Allow compile test of GPIO consumers if !GPIOLIB
        video: backlight: da9052: Constify platform_device_id
        gpio-backlight: Discover driver during boot time
      d59b92f9
    • mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc() · 8a8c35fa
      Larry Finger authored
      Beginning at commit d52d3997 ("ipv6: Create percpu rt6_info"), the
      following INFO splat is logged:
      
        ===============================
        [ INFO: suspicious RCU usage. ]
        4.1.0-rc7-next-20150612 #1 Not tainted
        -------------------------------
        kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section!
        other info that might help us debug this:
        rcu_scheduler_active = 1, debug_locks = 0
         3 locks held by systemd/1:
         #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff815f0c8f>] rtnetlink_rcv+0x1f/0x40
         #1:  (rcu_read_lock_bh){......}, at: [<ffffffff816a34e2>] ipv6_add_addr+0x62/0x540
         #2:  (addrconf_hash_lock){+...+.}, at: [<ffffffff816a3604>] ipv6_add_addr+0x184/0x540
        stack backtrace:
        CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1
        Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20   04/17/2014
        Call Trace:
          dump_stack+0x4c/0x6e
          lockdep_rcu_suspicious+0xe7/0x120
          ___might_sleep+0x1d5/0x1f0
          __might_sleep+0x4d/0x90
          kmem_cache_alloc+0x47/0x250
          create_object+0x39/0x2e0
          kmemleak_alloc_percpu+0x61/0xe0
          pcpu_alloc+0x370/0x630
      
      Additional backtrace lines are truncated.  In addition, the above splat
      is followed by several "BUG: sleeping function called from invalid
      context at mm/slub.c:1268" outputs.  As suggested by Martin KaFai Lau,
      these are the clue to the fix.  Routine kmemleak_alloc_percpu() always
      uses GFP_KERNEL for its allocations, whereas it should follow the gfp
      from its callers.
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: <stable@vger.kernel.org>	[3.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a8c35fa
    • mm, thp: respect MPOL_PREFERRED policy with non-local node · 0867a57c
      Vlastimil Babka authored
      Since commit 077fcf11 ("mm/thp: allocate transparent hugepages on
      local node"), we handle THP allocations on page fault in a special way -
      for non-interleave memory policies, the allocation is only attempted on
      the node local to the current CPU, if the policy's nodemask allows the
      node.
      
      This is motivated by the assumption that THP benefits cannot offset the
      cost of remote accesses, so it's better to fall back to base pages on the
      local node (which might still be available, while huge pages are not due
      to fragmentation) than to allocate huge pages on a remote node.
      
      The nodemask check prevents us from violating e.g.  MPOL_BIND policies
      where the local node is not among the allowed nodes.  However, the
      current implementation can still give surprising results for the
      MPOL_PREFERRED policy when the preferred node is different than the
      current CPU's local node.
      
      In such case we should honor the preferred node and not use the local
      node, which is what this patch does.  If hugepage allocation on the
      preferred node fails, we fall back to base pages and don't try other
      nodes, with the same motivation as is done for the local node hugepage
      allocations.  The patch also moves the MPOL_INTERLEAVE check around to
      simplify the hugepage specific test.
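
      The node choice described above can be modeled in a few lines.  This is
      only a sketch in Python, not the kernel code: the names Policy and
      thp_fault_node() are made up for illustration, and a return value of
      None simply means "the single-node hugepage special case does not
      apply", which the sketch does not model any further.

```python
from enum import Enum


class Policy(Enum):
    """Illustrative stand-ins for the memory policy modes named in the text."""
    DEFAULT = "default"
    PREFERRED = "preferred"
    BIND = "bind"
    INTERLEAVE = "interleave"


def thp_fault_node(policy, local_node, preferred_node=None, nodemask=None):
    """Return the one node on which a huge page is attempted, or None.

    Hypothetical helper: it only mirrors the behaviour described in the
    commit message after the patch is applied.
    """
    # Interleave policies are excluded from the special handling.
    if policy is Policy.INTERLEAVE:
        return None
    # After the patch: an explicit preferred node wins over the CPU-local node.
    if policy is Policy.PREFERRED and preferred_node is not None:
        return preferred_node
    # Otherwise try only the CPU-local node, if the nodemask allows it.
    if nodemask is None or local_node in nodemask:
        return local_node
    return None
```

      With local_node=1 and preferred_node=0 (the "numactl -p0 -C12" case) the
      sketch picks node 0, while a plain default policy still picks the
      CPU-local node.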
      
      The difference can be demonstrated using the in-tree transhuge-stress test
      on the following 2-node machine, where half of the memory on one node was
      occupied to show the difference.
      
      > numactl --hardware
      available: 2 nodes (0-1)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
      node 0 size: 7878 MB
      node 0 free: 3623 MB
      node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
      node 1 size: 8045 MB
      node 1 free: 7818 MB
      node distances:
      node   0   1
        0:  10  21
        1:  21  10
      
      Before the patch:
      > numactl -p0 -C0 ./transhuge-stress
      transhuge-stress: 2.197 s/loop, 0.276 ms/page,   7249.168 MiB/s 7962 succeed,    0 failed, 1786 different pages
      
      > numactl -p0 -C12 ./transhuge-stress
      transhuge-stress: 2.962 s/loop, 0.372 ms/page,   5376.172 MiB/s 7962 succeed,    0 failed, 3873 different pages
      
      Number of successful THP allocations corresponds to free memory on node 0 in
      the first case and node 1 in the second case, i.e. -p parameter is ignored and
      cpu binding "wins".
      
      After the patch:
      > numactl -p0 -C0 ./transhuge-stress
      transhuge-stress: 2.183 s/loop, 0.274 ms/page,   7295.516 MiB/s 7962 succeed,    0 failed, 1760 different pages
      
      > numactl -p0 -C12 ./transhuge-stress
      transhuge-stress: 2.878 s/loop, 0.361 ms/page,   5533.638 MiB/s 7962 succeed,    0 failed, 1750 different pages
      
      > numactl -p1 -C0 ./transhuge-stress
      transhuge-stress: 4.628 s/loop, 0.581 ms/page,   3440.893 MiB/s 7962 succeed,    0 failed, 3918 different pages
      
      The -p parameter is respected regardless of cpu binding.
      
      > numactl -C0 ./transhuge-stress
      transhuge-stress: 2.202 s/loop, 0.277 ms/page,   7230.003 MiB/s 7962 succeed,    0 failed, 1750 different pages
      
      > numactl -C12 ./transhuge-stress
      transhuge-stress: 3.020 s/loop, 0.379 ms/page,   5273.324 MiB/s 7962 succeed,    0 failed, 3916 different pages
      
      Without -p parameter, hugepage restriction to CPU-local node works as before.
      
      Fixes: 077fcf11 ("mm/thp: allocate transparent hugepages on local node")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0867a57c
    • tmpfs: truncate prealloc blocks past i_size · afa2db2f
      Josef Bacik authored
      One of the rocksdb people noticed that when you do something like this
      
          fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 10M)
          pwrite(fd, buf, 5M, 0)
          ftruncate(fd, 5M)
      
      on tmpfs, the file would still take up 10M, which led to super fun
      issues because we were getting ENOSPC before we thought we should be
      getting ENOSPC.  This patch fixes the problem, and mirrors what all the
      other fs'es do (and was agreed to be the correct behaviour at LSF).
      
      I tested it locally to make sure it worked properly with the following
      
          xfs_io -f -c "falloc -k 0 10M" -c "pwrite 0 5M" -c "truncate 5M" file
      
      Without the patch we have "Blocks: 20480", with the patch we have the
      correct value of "Blocks: 10240".
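
      The same sequence can be tried from userspace, and the two "Blocks"
      values follow from st_blocks being counted in 512-byte units on Linux.
      The following is a rough sketch, assuming 64-bit Linux with glibc;
      reproduce() and expected_blocks() are names invented here, and
      fallocate() is reached through ctypes because the Python standard
      library has no wrapper that takes FALLOC_FL_KEEP_SIZE.

```python
import ctypes
import os

FALLOC_FL_KEEP_SIZE = 0x01  # value from <linux/falloc.h>
MiB = 1024 * 1024


def expected_blocks(nbytes):
    """Number of 512-byte units that stat reports as 'Blocks' for nbytes."""
    return nbytes // 512


def reproduce(path):
    """Run fallocate(KEEP_SIZE, 10M), pwrite(5M), ftruncate(5M) on path.

    Returns st_blocks afterwards, or None if fallocate() is unavailable or
    refused by the filesystem.  Assumes a 64-bit off_t.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    fallocate = getattr(libc, "fallocate", None)
    if fallocate is None:
        return None
    fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                          ctypes.c_longlong, ctypes.c_longlong]
    fallocate.restype = ctypes.c_int

    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        if fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 10 * MiB) != 0:
            return None
        os.pwrite(fd, b"\0" * (5 * MiB), 0)
        os.ftruncate(fd, 5 * MiB)
        return os.fstat(fd).st_blocks
    finally:
        os.close(fd)
```

      expected_blocks(10 * MiB) gives 20480 and expected_blocks(5 * MiB) gives
      10240, matching the values quoted above; calling reproduce() on a file
      under a tmpfs mount shows which of the two a given kernel reports.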
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      afa2db2f
    • mm/memory hotplug: print the last vmemmap region at the end of hot add memory · c435a390
      Zhu Guihua authored
      When hot adding two nodes in succession, we found that the vmemmap region
      info is a bit messed up.  The last region of node 2 is printed when node 3
      is hot added, like the following:
      
        Initmem setup node 2 [mem 0x0000000000000000-0xffffffffffffffff]
         On node 2 totalpages: 0
         Built 2 zonelists in Node order, mobility grouping on.  Total pages: 16090539
         Policy zone: Normal
         init_memory_mapping: [mem 0x40000000000-0x407ffffffff]
          [mem 0x40000000000-0x407ffffffff] page 1G
          [ffffea1000000000-ffffea10001fffff] PMD -> [ffff8a077d800000-ffff8a077d9fffff] on node 2
          [ffffea1000200000-ffffea10003fffff] PMD -> [ffff8a077de00000-ffff8a077dffffff] on node 2
        ...
          [ffffea101f600000-ffffea101f9fffff] PMD -> [ffff8a074ac00000-ffff8a074affffff] on node 2
          [ffffea101fa00000-ffffea101fdfffff] PMD -> [ffff8a074a800000-ffff8a074abfffff] on node 2
        Initmem setup node 3 [mem 0x0000000000000000-0xffffffffffffffff]
         On node 3 totalpages: 0
         Built 3 zonelists in Node order, mobility grouping on.  Total pages: 16090539
         Policy zone: Normal
         init_memory_mapping: [mem 0x60000000000-0x607ffffffff]
          [mem 0x60000000000-0x607ffffffff] page 1G
          [ffffea101fe00000-ffffea101fffffff] PMD -> [ffff8a074a400000-ffff8a074a5fffff] on node 2 <=== node 2 ???
          [ffffea1800000000-ffffea18001fffff] PMD -> [ffff8a074a600000-ffff8a074a7fffff] on node 3
          [ffffea1800200000-ffffea18005fffff] PMD -> [ffff8a074a000000-ffff8a074a3fffff] on node 3
          [ffffea1800600000-ffffea18009fffff] PMD -> [ffff8a0749c00000-ffff8a0749ffffff] on node 3
        ...
      
      The cause is that the last region was missed at the end of hot add memory,
      and p_start, p_end, node_start were not reset, so when memory is hot added
      to a new node, the code considers the blocks not contiguous and prints
      the previous one.  So we print the last vmemmap region at the end of hot
      add memory to avoid the confusion.
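
      The pattern behind this bug is a run-coalescing loop that only emits a
      region when the next block breaks contiguity, so the final region needs
      an explicit flush.  A simplified sketch follows; coalesce_regions() is a
      hypothetical name and contiguity is reduced to "same node and adjacent
      addresses", which is an assumption rather than the kernel's exact test.

```python
def coalesce_regions(blocks):
    """Merge adjacent (start, end, node) blocks and return the merged regions.

    'end' is exclusive.  One call models one hot-add operation.
    """
    regions = []
    current = None
    for start, end, node in blocks:
        if current is not None and current[1] == start and current[2] == node:
            # Still contiguous: extend the pending region.
            current = (current[0], end, node)
            continue
        if current is not None:
            # Contiguity broken: emit the previous region.
            regions.append(current)
        current = (start, end, node)
    # Final flush.  Without it, the last region stays pending and would be
    # emitted during the next hot-add, labelled with the old node.
    if current is not None:
        regions.append(current)
    return regions
```

      Because each call ends with the flush and starts with no pending region,
      output for one node can no longer leak into the listing of the next.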
      Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c435a390
    • mm/mmap.c: optimization of do_mmap_pgoff function · e37609bb
      Piotr Kwapulinski authored
      The simple check for zero length memory mapping may be performed
      earlier, so that in case of a zero length memory mapping some unnecessary
      code is not executed at all.  It does not make the code less readable
      and saves some CPU cycles.
      Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e37609bb
    • mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan · 93ada579
      Catalin Marinas authored
      The kmemleak memory scanning uses finer grained object->lock spinlocks
      primarily to avoid races with the memory block freeing.  However, the
      pointer lookup in the rb tree requires the kmemleak_lock to be held.
      This is currently done in the find_and_get_object() function for each
      pointer-like location read during scanning.  While this allows a low
      latency on kmemleak_*() callbacks on other CPUs, the memory scanning is
      slower.
      
      This patch moves the kmemleak_lock outside the scan_block() loop,
      acquiring/releasing it only once per scanned memory block.  The
      allow_resched logic is moved outside scan_block() and a new
      scan_large_block() function is implemented which splits large blocks in
      MAX_SCAN_SIZE chunks with cond_resched() calls in-between.  A redundant
      (object->flags & OBJECT_NO_SCAN) check is also removed from
      scan_object().
      
      With this patch, the kmemleak scanning performance is significantly
      improved: at least 50% with lock debugging disabled and over an order of
      magnitude with lock proving enabled (on an arm64 system).
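
      The locking change can be pictured with a small model: the lock is taken
      once per block instead of once per pointer, and large blocks are split
      so that rescheduling can happen between chunks with no lock held.  This
      is a Python sketch only; the chunk size of 4 words and the use of a set
      for the tracked objects are assumptions made for illustration.

```python
import threading

MAX_SCAN_SIZE = 4  # words per chunk in this model; illustrative value only
kmemleak_lock = threading.Lock()


def scan_block(words, tracked):
    """Scan one block, holding the lock for the whole block."""
    found = []
    with kmemleak_lock:  # acquired once, not once per word
        for word in words:
            if word in tracked:  # stands in for the rb tree lookup
                found.append(word)
    return found


def scan_large_block(words, tracked, chunk=MAX_SCAN_SIZE):
    """Split a large block into chunks and scan each one separately."""
    found = []
    for offset in range(0, len(words), chunk):
        found.extend(scan_block(words[offset:offset + chunk], tracked))
        # The kernel calls cond_resched() at this point, between chunks,
        # while the lock is not held.
    return found
```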
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93ada579
    • mm: kmemleak: avoid deadlock on the kmemleak object insertion error path · 9d5a4c73
      Catalin Marinas authored
      While very unlikely (usually kmemleak or sl*b bug), the create_object()
      function in mm/kmemleak.c may fail to insert a newly allocated object into
      the rb tree.  When this happens, kmemleak disables itself and prints
      additional information about the object already found in the rb tree.
      Such printing is done with the parent->lock acquired, however the
      kmemleak_lock is already held.  This is a potential race with the scanning
      thread which acquires object->lock and kmemleak_lock in a
      
      This patch removes the locking around the 'parent' object information
      printing.  Such object cannot be freed or removed from object_tree_root
      and object_list since kmemleak_lock is already held.  There is a very
      small risk that some of the object data is being modified on another CPU
      but the only downside is inconsistent information printing.
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9d5a4c73
    • mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup() · 5f369f37
      Catalin Marinas authored
      The kmemleak_do_cleanup() work thread already waits for the kmemleak_scan
      thread to finish via kthread_stop().  Waiting in kthread_stop() while
      scan_mutex is held may lead to deadlock if kmemleak_scan_thread() also
      waits to acquire scan_mutex.
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f369f37
    • mm: kmemleak: fix delete_object_*() race when called on the same memory block · e781a9ab
      Catalin Marinas authored
      Calling delete_object_*() on the same pointer is not a standard use case
      (unless there is a bug in the code calling kmemleak_free()).  However,
      during kmemleak disabling (error or user triggered via /sys), there is a
      potential race between kmemleak_free() calls on a CPU and
      __kmemleak_do_cleanup() on a different CPU.
      
      The current delete_object_*() implementation first performs a look-up
      holding kmemleak_lock, increments the object->use_count and then
      re-acquires kmemleak_lock to remove the object from object_tree_root and
      object_list.
      
      This patch simplifies the delete_object_*() mechanism to both look up
      and remove an object from the object_tree_root and object_list
      atomically (guarded by kmemleak_lock).  This allows safe concurrent
      calls to delete_object_*() on the same pointer without additional
      locking for synchronising the kmemleak_free_enabled flag.
      
      A side effect is a slight improvement in the delete_object_*() performance
      by avoiding acquiring kmemleak_lock twice and incrementing/decrementing
      object->use_count.
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e781a9ab
    • mm: kmemleak: allow safe memory scanning during kmemleak disabling · c5f3b1a5
      Catalin Marinas authored
      The kmemleak scanning thread can run for minutes.  Callbacks like
      kmemleak_free() are allowed during this time, the race being taken care
      of by the object->lock spinlock.  Such lock also prevents a memory block
      from being freed or unmapped while it is being scanned by blocking the
      kmemleak_free() -> ...  -> __delete_object() function until the lock is
      released in scan_object().
      
      When a kmemleak error occurs (e.g.  it fails to allocate its metadata),
      kmemleak_enabled is cleared and __delete_object() is no longer called on
      freed objects.  If kmemleak_scan is running at the same time,
      kmemleak_free() no longer waits for the object scanning to complete,
      allowing the corresponding memory block to be freed or unmapped (in the
      case of vfree()).  This leads to kmemleak_scan potentially triggering a
      page fault.
      
      This patch separates the kmemleak_free() enabling/disabling from the
      overall kmemleak_enabled knob so that we can defer the disabling of the
      object freeing tracking until the scanning thread has completed.  The
      kmemleak_free_part() is deliberately ignored by this patch since this is
      only called during boot before the scanning thread started.
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Reported-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
      Tested-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5f3b1a5
    • memcg: convert mem_cgroup->under_oom from atomic_t to int · c2b42d3c
      Tejun Heo authored
      memcg->under_oom tracks whether the memcg is under OOM conditions and is
      an atomic_t counter managed with mem_cgroup_[un]mark_under_oom().  While
      atomic_t appears to be simple synchronization-wise, when used as a
      synchronization construct like here, it's trickier and more error-prone
      due to weak memory ordering rules, especially around atomic_read(), and
      false sense of security.
      
      For example, both non-trivial read sites of memcg->under_oom are a bit
      problematic although not being actually broken.
      
      * mem_cgroup_oom_register_event()
      
        It isn't explicit what guarantees the memory ordering between event
        addition and memcg->under_oom check.  This isn't broken only because
        memcg_oom_lock is used for both event list and memcg->oom_lock.
      
      * memcg_oom_recover()
      
        The lockless test doesn't have any explanation why this would be
        safe.
      
      mem_cgroup_[un]mark_under_oom() are very cold paths and there's no point
      in avoiding locking memcg_oom_lock there.  This patch converts
      memcg->under_oom from atomic_t to int, puts their modifications under
      memcg_oom_lock and documents why the lockless test in
      memcg_oom_recover() is safe.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2b42d3c
    • memcg: remove unused mem_cgroup->oom_wakeups · f4b90b70
      Tejun Heo authored
      Since commit 49426420 ("mm: memcg: handle non-error OOM situations
      more gracefully"), nobody uses mem_cgroup->oom_wakeups.  Remove it.
      
      While at it, also fold memcg_wakeup_oom() into memcg_oom_recover() which
      is its only user.  This cleanup was suggested by Michal.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4b90b70
    • frontswap: allow multiple backends · d1dc6f1b
      Dan Streetman authored
      Change frontswap single pointer to a singly linked list of frontswap
      implementations.  Update Xen tmem implementation as register no longer
      returns anything.
      
      Frontswap only keeps track of a single implementation; any
      implementation that registers second (or later) will replace the
      previously registered implementation, and gets a pointer to the previous
      implementation that the new implementation is expected to pass all
      frontswap functions to if it can't handle the function itself.  However
      that method doesn't really make much sense, as passing that work on to
      every implementation adds unnecessary work to implementations; instead,
      frontswap should simply keep a list of all registered implementations
      and try each implementation for any function.  Most importantly, neither
      of the two currently existing frontswap implementations in the kernel
      actually does anything with any previous frontswap implementation that
      it replaces when registering.
      
      This allows frontswap to successfully manage multiple implementations by
      keeping a list of them all.
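
      The idea of keeping a list and trying each implementation in turn can be
      sketched as follows.  This is a Python model, not the kernel interface:
      the Backend and Frontswap classes, their method names and the
      capacity-based refusal are all invented for illustration.

```python
class Backend:
    """Toy backend that accepts pages until a fixed capacity is reached."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.pages = {}

    def store(self, key, page):
        """Return True if the page was accepted, False if refused."""
        if key not in self.pages and len(self.pages) >= self.capacity:
            return False
        self.pages[key] = page
        return True

    def load(self, key):
        """Return the stored page, or None if this backend does not have it."""
        return self.pages.get(key)


class Frontswap:
    """Keeps every registered backend instead of a single pointer."""

    def __init__(self):
        self.backends = []

    def register(self, backend):
        # Returns nothing: no "previous implementation" is handed back.
        self.backends.append(backend)

    def store(self, key, page):
        # Try each backend in registration order until one accepts the page.
        return any(b.store(key, page) for b in self.backends)

    def load(self, key):
        for b in self.backends:
            page = b.load(key)
            if page is not None:
                return page
        return None
```

      With two single-page backends registered, a second store falls through
      to the second backend instead of being lost, and no backend has to
      forward calls to another one.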
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d1dc6f1b
    • x86, mirror: x86 enabling - find mirrored memory ranges · b05b9f5f
      Tony Luck authored
      UEFI GetMemoryMap() uses a new attribute bit to mark mirrored memory
      address ranges.  See UEFI 2.5 spec pages 157-158:
      
        http://www.uefi.org/sites/default/files/resources/UEFI%202_5.pdf
      
      On EFI enabled systems scan the memory map and tell memblock about any
      mirrored ranges.
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b05b9f5f
    • mm/memblock: allocate boot time data structures from mirrored memory · a3f5bafc
      Tony Luck authored
      Try to allocate all boot time kernel data structures from mirrored
      memory.
      
      If we run out of mirrored memory, print warnings but fall back to using
      non-mirrored memory to make sure that we still boot.
      
      By number of bytes, most of what we allocate at boot time is the page
      structures.  64 bytes per 4K page on x86_64 ...  or about 1.5% of total
      system memory.  For workloads where the bulk of memory is allocated to
      applications this may represent a useful improvement to system
      availability since 1.5% of total memory might be a third of the memory
      allocated to the kernel.
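
A small sketch of the two ideas above, under stated assumptions: `page_struct_bytes()` reproduces the 64-bytes-per-4K-page arithmetic, and `alloc_boot()` shows the "prefer mirrored, warn, then fall back" policy on two hypothetical byte-counting pools. Neither function exists in the kernel; the names and the pool model are illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bytes of page structures needed to describe total_bytes of RAM. */
uint64_t page_struct_bytes(uint64_t total_bytes, uint64_t page_size,
                           uint64_t struct_size)
{
    assert(page_size != 0);
    return (total_bytes / page_size) * struct_size;
}

/* Hypothetical pool that only tracks how many bytes are still free. */
struct pool {
    uint64_t free_bytes;
};

enum source { FROM_MIRROR, FROM_NORMAL, FAILED };

/* Prefer the mirrored pool; warn and fall back to normal memory if it is full. */
enum source alloc_boot(struct pool *mirror, struct pool *normal, uint64_t size)
{
    if (mirror->free_bytes >= size) {
        mirror->free_bytes -= size;
        return FROM_MIRROR;
    }
    fprintf(stderr, "warning: mirrored memory exhausted, falling back\n");
    if (normal->free_bytes >= size) {
        normal->free_bytes -= size;
        return FROM_NORMAL;
    }
    return FAILED;
}
```

For 1 TiB of RAM, `page_struct_bytes(1ULL << 40, 4096, 64)` gives 16 GiB, that is 64/4096, or roughly 1.5% of memory.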
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3f5bafc
    • Tony Luck's avatar
      mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute · fc6daaf9
      Tony Luck authored
      Some high end Intel Xeon systems report uncorrectable memory errors as a
      recoverable machine check.  Linux has included code for some time to
      process these and just signal the affected processes (or even recover
      completely if the error was in a read only page that can be replaced by
      reading from disk).
      
      But we have no recovery path for errors encountered during kernel code
      execution.  Except for some very specific cases, we are unlikely ever to
      be able to recover.
      
      Enter memory mirroring.  Actually, the 3rd generation of memory mirroring.
      
      Gen1: All memory is mirrored
      	Pro: No s/w enabling - h/w just gets good data from other side of the
      	     mirror
      	Con: Halves effective memory capacity available to OS/applications
      
      Gen2: Partial memory mirror - just mirror memory behind some memory controllers
      	Pro: Keep more of the capacity
      	Con: Nightmare to enable. Have to choose between allocating from
      	     mirrored memory for safety vs. NUMA local memory for performance
      
      Gen3: Address range partial memory mirror - some mirror on each memory
            controller
      	Pro: Can tune the amount of mirror and keep NUMA performance
      	Con: I have to write memory management code to implement
      
      The current plan is just to use mirrored memory for kernel allocations.
      This has been broken into two phases:
      
      1) This patch series - find the mirrored memory, use it for boot time
         allocations
      
      2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
         unused mirrored memory from mm/memblock.c and only give it out to
         select kernel allocations (this is still being scoped because
         page_alloc.c is scary).
      
      This patch (of 3):
      
      Add extra "flags" to memblock to allow selection of memory based on
      attribute.  No functional changes.
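
Selecting memory by attribute can be sketched as a filter over a region table. This is a hedged illustration: `struct region`, `enum region_flags` and `next_region()` are hypothetical, and the flag values are chosen only for the example rather than taken from the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute flags; values are for illustration only. */
enum region_flags {
    REGION_NONE   = 0x0, /* no special attribute requested */
    REGION_MIRROR = 0x2  /* region is backed by mirrored memory */
};

struct region {
    uint64_t base;
    uint64_t size;
    unsigned long flags;
};

/*
 * Return the index of the first region at or after 'start' whose flags
 * contain every bit in 'want', or -1 if there is none.  Passing
 * REGION_NONE matches every region, so existing callers see no change.
 */
int next_region(const struct region *r, size_t n, size_t start,
                unsigned long want)
{
    assert(r != NULL || n == 0);
    for (size_t i = start; i < n; i++)
        if ((r[i].flags & want) == want)
            return (int)i;
    return -1;
}
```

Because a request for `REGION_NONE` matches everything, adding the flags argument by itself does not change which region a caller gets, in line with the "no functional changes" statement.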
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc6daaf9
    • Michal Hocko's avatar
      mm: do not ignore mapping_gfp_mask in page cache allocation paths · 6afdb859
      Michal Hocko authored
      page_cache_read, do_generic_file_read, __generic_file_splice_read and
      __ntfs_grab_cache_pages currently ignore mapping_gfp_mask when calling
      add_to_page_cache_lru, which might cause recursion into the fs down in the
      direct reclaim path if the mapping really relies on GFP_NOFS semantics.
      
      This doesn't seem to be the case now because page_cache_read (page fault
      path) doesn't seem to suffer from the reclaim recursion issues and
      do_generic_file_read and __generic_file_splice_read also shouldn't be
      called under fs locks which would deadlock in the reclaim path.  Anyway, it
      is better to obey the mapping gfp mask and prevent later breakage.
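
The idea of obeying the mapping's mask can be sketched as a bitwise intersection. The bit values and the `X_`-prefixed names below are illustrative assumptions, not the kernel's real GFP definitions, and `constrain_gfp()` is a hypothetical helper.

```c
#include <assert.h>

/* Illustrative bit values only; the kernel's real GFP bits differ. */
#define X_GFP_WAIT   0x1u
#define X_GFP_IO     0x2u
#define X_GFP_FS     0x4u
#define X_GFP_KERNEL (X_GFP_WAIT | X_GFP_IO | X_GFP_FS)
#define X_GFP_NOFS   (X_GFP_WAIT | X_GFP_IO) /* may not recurse into the fs */

/* Keep only the allocation behaviours that the mapping also permits. */
unsigned int constrain_gfp(unsigned int requested, unsigned int mapping_mask)
{
    return requested & mapping_mask;
}
```

A caller that would otherwise pass the full kernel mask gets the no-fs variant for a mapping that requires it, since `constrain_gfp(X_GFP_KERNEL, X_GFP_NOFS)` drops the fs bit, while a mapping that allows everything leaves the request unchanged.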
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6afdb859