1. 05 Aug, 2011 10 commits
    • powerpc: Lack of ibm,io-events not that important! · 53876e38
      Anton Blanchard authored
      The ibm,io-events code is a bit verbose with its error messages.
      Reverse the reporting so we only print when we successfully enable
      I/O event interrupts.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      53876e38
    • powerpc: Move kdump default base address to half RMO size on 64bit · 8aa6d359
      Anton Blanchard authored
      We are seeing boot failures on some very large boxes even with
      commit b5416ca9 (powerpc: Move kdump default base address to
      64MB on 64bit).
      
      This patch halves the RMO so both kernels get about the same
      amount of RMO memory. On large machines this region will be
      at least 256MB, so each kernel will get 128MB.
      
      We cap it at 256MB (small SLB size) since some early allocations need
      to be in the bolted SLB region. We could relax this on machines with
      1TB SLBs in a future patch.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      8aa6d359
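
The arithmetic in the kdump commit above can be sketched as follows. This is only an illustration of the message's reasoning, not the kernel's code: the helper name `kdump_default_base` and the `MB` macro are hypothetical, and the sketch reads the 256MB cap as applying to the RMO size before it is halved.

```c
#include <stdint.h>

#define MB(x) ((uint64_t)(x) << 20)

/* Hypothetical helper: pick a default crash kernel base from the RMO size. */
static uint64_t kdump_default_base(uint64_t rmo_size)
{
	/* Only consider the first 256MB (small SLB segment size) of the RMO. */
	uint64_t usable = rmo_size < MB(256) ? rmo_size : MB(256);

	/* Split that region evenly between the first kernel and the kdump kernel. */
	return usable / 2;
}
```

Under these assumptions a machine with a 256MB or larger RMO places the kdump kernel at 128MB, which matches the "each kernel will get 128MB" figure in the message.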
    • powerpc/perf: Disable pagefaults during callchain stack read · b59a1bfc
      David Ahern authored
      Panic observed on an older kernel when collecting call chains for
      the context-switch software event:
      
       [<b0180e00>]rb_erase+0x1b4/0x3e8
       [<b00430f4>]__dequeue_entity+0x50/0xe8
       [<b0043304>]set_next_entity+0x178/0x1bc
       [<b0043440>]pick_next_task_fair+0xb0/0x118
       [<b02ada80>]schedule+0x500/0x614
       [<b02afaa8>]rwsem_down_failed_common+0xf0/0x264
       [<b02afca0>]rwsem_down_read_failed+0x34/0x54
       [<b02aed4c>]down_read+0x3c/0x54
       [<b0023b58>]do_page_fault+0x114/0x5e8
       [<b001e350>]handle_page_fault+0xc/0x80
       [<b0022dec>]perf_callchain+0x224/0x31c
       [<b009ba70>]perf_prepare_sample+0x240/0x2fc
       [<b009d760>]__perf_event_overflow+0x280/0x398
       [<b009d914>]perf_swevent_overflow+0x9c/0x10c
       [<b009db54>]perf_swevent_ctx_event+0x1d0/0x230
       [<b009dc38>]do_perf_sw_event+0x84/0xe4
       [<b009dde8>]perf_sw_event_context_switch+0x150/0x1b4
       [<b009de90>]perf_event_task_sched_out+0x44/0x2d4
       [<b02ad840>]schedule+0x2c0/0x614
       [<b0047dc0>]__cond_resched+0x34/0x90
       [<b02adcc8>]_cond_resched+0x4c/0x68
       [<b00bccf8>]move_page_tables+0xb0/0x418
       [<b00d7ee0>]setup_arg_pages+0x184/0x2a0
       [<b0110914>]load_elf_binary+0x394/0x1208
       [<b00d6e28>]search_binary_handler+0xe0/0x2c4
       [<b00d834c>]do_execve+0x1bc/0x268
       [<b0015394>]sys_execve+0x84/0xc8
       [<b001df10>]ret_from_syscall+0x0/0x3c
      
      A page fault occurred walking the callchain while creating a perf
      sample for the context-switch event. To handle the page fault the
      mmap_sem is needed, but it is currently held by setup_arg_pages.
      (setup_arg_pages calls shift_arg_pages with the mmap_sem held.
      shift_arg_pages then calls move_page_tables which has a cond_resched
      at the top of its for loop - hitting that cond_resched is what caused
      the context switch.)
      
      This is an extension of Anton's proposed patch:
      https://lkml.org/lkml/2011/7/24/151
      adding a case for 32-bit ppc.
      
      Tested on the system that first generated the panic and then again
      with the latest kernel using a PPC VM. I am not able to test the 64-bit
      path - I do not have H/W for it and 64-bit PPC VMs (qemu on Intel)
      are horribly slow.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      b59a1bfc
    • ppc: Remove duplicate definition of PV_POWER7 · 501d2386
      Peter Zijlstra authored
      One definition of PV_POWER7 seems enough to me.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      501d2386
    • powerpc: pseries: Fix kexec on machines with more than 4TB of RAM · bed9a315
      Anton Blanchard authored
      On a box with 8TB of RAM the MMU hashtable is 64GB in size. That
      means we have 4G PTEs. pSeries_lpar_hptab_clear was using a signed
      int to store the index which will overflow at 2G.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Cc: <stable@kernel.org>
      Acked-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      bed9a315
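
The overflow described in the kexec commit above follows from simple arithmetic. The sketch below reproduces it; the 16-byte entry size is inferred from the message's own figures (64GB of table, 4G PTEs), and the helper names are hypothetical.

```c
#include <limits.h>
#include <stdint.h>

#define HPTE_SIZE 16ULL  /* assumed bytes per hash PTE: 64GB / 4G entries */

/* Number of PTEs in a hash table of the given size in bytes. */
static uint64_t hpte_count(uint64_t hash_table_bytes)
{
	return hash_table_bytes / HPTE_SIZE;
}

/* Nonzero if every index below count can be held in a signed int. */
static int fits_in_signed_int(uint64_t count)
{
	return count <= (uint64_t)INT_MAX;
}
```

With a 64GB table the count is 2^32, well past the 2^31 - 1 limit of a 32-bit signed int, so the loop index needs a wider unsigned type such as `unsigned long`.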
    • powerpc: Jump label misalignment causes oops at boot · c113a3ae
      Anton Blanchard authored
      I hit an oops at boot on the first instruction of timer_cpu_notify:
      
      NIP [c000000000722f88] .timer_cpu_notify+0x0/0x388
      
      The code should look like:
      
      c000000000722f78:       eb e9 00 30     ld      r31,48(r9)
      c000000000722f7c:       2f bf 00 00     cmpdi   cr7,r31,0
      c000000000722f80:       40 9e ff 44     bne+    cr7,c000000000722ec4
      c000000000722f84:       4b ff ff 74     b       c000000000722ef8
      
      c000000000722f88 <.timer_cpu_notify>:
      c000000000722f88:       7c 08 02 a6     mflr    r0
      c000000000722f8c:       2f a4 00 07     cmpdi   cr7,r4,7
      c000000000722f90:       fb c1 ff f0     std     r30,-16(r1)
      c000000000722f94:       fb 61 ff d8     std     r27,-40(r1)
      
      But the oops output shows:
      
      eb61ffd8 eb81ffe0 eba1ffe8 ebc1fff0 7c0803a6 ebe1fff8 4e800020
      00000000 ebe90030 c0000000 00ad0a28 00000000 2fa40007 fbc1fff0 fb61ffd8
      
      So we scribbled over our instructions with c000000000ad0a28, which
      is an address inside the jump_table ELF section.
      
      It turns out the jump_table section is only aligned to 8 bytes but
      we are aligning our entries within the section to 16 bytes. This
      means our entries are offset from the table:
      
      c000000000acd4a8 <__start___jump_table>:
              ...
      c000000000ad0a10:       c0 00 00 00     lfs     f0,0(0)
      c000000000ad0a14:       00 70 cd 5c     .long 0x70cd5c
      c000000000ad0a18:       c0 00 00 00     lfs     f0,0(0)
      c000000000ad0a1c:       00 70 cd 90     .long 0x70cd90
      c000000000ad0a20:       c0 00 00 00     lfs     f0,0(0)
      c000000000ad0a24:       00 ac a4 20     .long 0xaca420
      
      And the jump table sort code gets very confused and writes into the
      wrong spot. Remove the alignment, and also remove the padding since
      it saves some space and we shouldn't need it.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      c113a3ae
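
The mismatch in the jump label commit above can be checked with a little address arithmetic. This is a sketch, not kernel code: `align_up` and `first_entry_gap` are hypothetical helpers, and the start address is the `__start___jump_table` value from the dump in the message.

```c
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

/* Bytes between the section start symbol and the first aligned entry. */
static uint64_t first_entry_gap(uint64_t section_start, uint64_t entry_align)
{
	return align_up(section_start, entry_align) - section_start;
}
```

For a start of 0xc000000000acd4a8 and 16-byte entry alignment the gap is 8 bytes, so code that treats the start symbol as the address of entry 0 reads and writes every entry at the wrong offset. With 8-byte alignment the gap is zero.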
    • powerpc: Clean up some panic messages in prom_init · fbafd728
      Anton Blanchard authored
      Add a newline to the panic messages in make_room. Also fix a
      comment that suggested our chunk size is 4Mb. It's 1MB.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      fbafd728
    • powerpc: Fix device tree claim code · 966728dd
      Anton Blanchard authored
      I have a box that fails in OF during boot with:
      
      DEFAULT CATCH!, exception-handler=fff00400
      at   %SRR0: 49424d2c4c6f6768   %SRR1: 800000004000b002
      
      ie "IBM,Logh". OF got corrupted with a device tree string.
      
      Looking at make_room and alloc_up, we claim the first chunk (1 MB)
      but we never claim any more. mem_end is always set to alloc_top
      which is the top of our available address space, guaranteeing we will
      never call alloc_up and claim more memory.
      
      Also alloc_up wasn't setting alloc_bottom to the bottom of the
      available address space.
      
      This doesn't help the box to boot, but we at least fail with
      an obvious error. We could relocate the device tree in a future
      patch.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      966728dd
    • powerpc: Return the_cpu_spec from identify_cpu · 26ee9767
      Scott Wood authored
      Commit af9eef3c caused cpu_setup to see
      the_cpu_spec, rather than the source struct.  However, on 32-bit, the
      return value of identify_cpu was being used for feature fixups, and
      identify_cpu was returning the source struct.  So if cpu_setup patches
      the feature bits, the update won't affect the fixups.
      Signed-off-by: Scott Wood <scottwood@freescale.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      26ee9767
    • powerpc: mtspr/mtmsr should take an unsigned long · 326ed6a9
      Scott Wood authored
      Add a cast in case the caller passes in a different type, as it would
      if mtspr/mtmsr were functions.
      
      Previously, if a 64-bit type was passed in on 32-bit, GCC would bind the
      constraint to a pair of registers, and would substitute the first register
      in the pair in the asm code.  This corresponds to the upper half of the
      64-bit register, which is generally not the desired behavior.
      Signed-off-by: Scott Wood <scottwood@freescale.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      326ed6a9
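
The effect described in the mtspr/mtmsr commit above can be modelled in portable C. This is a simulation only: it uses no inline asm, `ulong32` is a hypothetical stand-in for a 32-bit `unsigned long`, and both helper names are invented for illustration.

```c
#include <stdint.h>

/* Stand-in for unsigned long on a 32-bit target (assumption for illustration). */
typedef uint32_t ulong32;

/* Models using the first register of the pair: the upper half of the value. */
static ulong32 first_of_pair(uint64_t v)
{
	return (ulong32)(v >> 32);
}

/* Models casting to unsigned long first: truncation keeps the lower half. */
static ulong32 with_cast(uint64_t v)
{
	return (ulong32)v;
}
```

For a value such as 0x1234567800000042, the uncast path would hand the register 0x12345678, while the cast path yields 0x42, which is what a function taking `unsigned long` would have received.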
  2. 04 Aug, 2011 30 commits
    • Merge branch 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6 · 53d1e658
      Linus Torvalds authored
      * 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6:
        Revert "dt: add of_alias_scan and of_alias_get_id"
        dt: remove of_alias_get_id() reference
      53d1e658
    • Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/parisc-2.6 · 455ce9d8
      Linus Torvalds authored
      * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/parisc-2.6:
        [PARISC] wire up sendmmsg syscall
        [PARISC] fix return type of __atomic64_add_return
        [PARISC] Fix futex support
      455ce9d8
    • Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6 · 447e1363
      Linus Torvalds authored
      * 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
        [S390] signal: use set_restore_sigmask() helper
        [S390] smp: remove pointless comments in startup_secondary()
        [S390] qdio: Use kstrtoul_from_user
        [S390] sclp_async: Use kstrtoul_from_user
        [S390] exec: remove redundant set_fs(USER_DS)
        [S390] cpu hotplug: on cpu start wait until being marked active
        [S390] signal: convert to use set_current_blocked()
        [S390] asm offsets: fix coding style
        [S390] Add support for IBM zEnterprise 114
        [S390] dasd: check if raw track access is supported
        [S390] Use diagnose 308 for system reset
        [S390] Export store_status() function
        [S390] dasd: use vmalloc for statistics input buffer
        [S390] Add PSW restart shutdown trigger
        [S390] missing return in page_table_alloc_pgste
        [S390] qdio: 2nd stage retry on SIGA-W busy conditions
      447e1363
    • eisa/pci_eisa.c: fix BUG introduced by 005bdad7 · 82de9a0c
      Arnaud Lacombe authored
      While `pci_eisa_driver' still refers to `pci_eisa_init', the .probe() function
      should not be called after init memory release, as pointed out by commit
      74b9a297. The structure is still referenced in the drivers subsystem, and can
      be accessed through sysfs, so the modpost warning is a false positive. Mark
      it as such.
      
      At the same time, the warning referenced in 005bdad7 only mentioned
      `pci_eisa_driver', not `pci_eisa_pci_tbl', so remove its marking.
      
      Broken-by: Arnaud Lacombe <lacombar@gmail.com> (in 005bdad7)
      Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Arnaud Lacombe <lacombar@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82de9a0c
    • Revert "dt: add of_alias_scan and of_alias_get_id" · fe55c184
      Grant Likely authored
      This reverts commit 750f463a.
      
      of_alias_* still needs work to be generalized for 'promtree' dt
      platforms, and to not implicitly create entries for available ids.
      Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
      fe55c184
    • dt: remove of_alias_get_id() reference · 9e191b22
      Grant Likely authored
      of_alias_get_id() is broken and being reverted.  Remove the reference
      to it and replace with a single incrementing id number.
      
      There is no risk of regression here on the imx driver since the imx
      change to use of_alias_get_id() is commit 22698aa2, "serial/imx: add
      device tree probe support" which is new for v3.1, and it won't get
      used unless CONFIG_OF is enabled and the board is booted using a
      device tree.  A single incrementing integer is sufficient for now.
      Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
      Acked-by: Shawn Guo <shawn.guo@linaro.org>
      9e191b22
    • Boot up with usermodehelper disabled · 288d5abe
      Linus Torvalds authored
      The core device layer sends tons of uevent notifications for each device
      it finds, and if the kernel has been built with a non-empty
      CONFIG_UEVENT_HELPER_PATH that will make us try to execute the usermode
      helper binary for all these events very early in the boot.
      
      Not only won't the root filesystem even be mounted at that point, we
      literally won't have necessarily even initialized all the process
      handling data structures at that point, which causes no end of silly
      problems even when the usermode helper doesn't actually succeed in
      executing.
      
      So just use our existing infrastructure to disable the usermodehelpers
      to make the kernel start out with them disabled.  We enable them when
      we've at least initialized stuff a bit.
      
      Problems related to an uninitialized
      
      	init_ipc_ns.ids[IPC_SHM_IDS].rw_mutex
      
      reported by various people.
      Reported-by: Manuel Lauss <manuel.lauss@googlemail.com>
      Reported-by: Richard Weinberger <richard@nod.at>
      Reported-by: Marc Zyngier <maz@misterjones.org>
      Acked-by: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      288d5abe
    • x86: don't include xen/xen.h in <asm/io.h> unless XEN is enabled · 33f35f2a
      Linus Torvalds authored
      Dmitry Kasatkin reports:
        "kernel-devel package with kernel headers have no <include/xen>
         directory if XEN is disabled.  Modules which include asm/io.h won't
         compile.
      
         XEN related content is behind the CONFIG_XEN flag in the io.h.  And
         <xen/xen.h> should be also behind CONFIG_XEN flag."
      
      So move the include of <xen/xen.h> down into the section that is
      conditional on CONFIG_XEN.
      Reported-by: Dmitry Kasatkin <dmitry.kasatkin@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33f35f2a
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 0ea64844
      Linus Torvalds authored
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: ad7879 - fix deficient device disable
        Input: gpio_keys - fix two typos in devicetree documentation
        Input: mma8450 - add device tree probe support
        Input: gpio_keys - return proper error code if memory allocation fails
        Input: lm8323 - add missing device_remove_file for dev_attr_time
        Input: tegra-kbc - fix computation of polling time
        Input: kxtj9 - explicitly include module.h
        Input: psmouse - hgpk.c needs module.h
      0ea64844
    • Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6 · 35e51fe8
      Linus Torvalds authored
      * 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
        cpuidle: stop depending on pm_idle
        x86 idle: move mwait_idle_with_hints() to where it is used
        cpuidle: replace xen access to x86 pm_idle and default_idle
        cpuidle: create bootparam "cpuidle.off=1"
        mrst_pmu: driver for Intel Moorestown Power Management Unit
      35e51fe8
    • Merge branch 'apei-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 · c0c770e6
      Linus Torvalds authored
      * 'apei-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
        ACPI, APEI, EINJ Param support is disabled by default
        APEI GHES: 32-bit buildfix
        ACPI: APEI build fix
        ACPI, APEI, GHES: Add hardware memory error recovery support
        HWPoison: add memory_failure_queue()
        ACPI, APEI, GHES, Error records content based throttle
        ACPI, APEI, GHES, printk support for recoverable error via NMI
        lib, Make gen_pool memory allocator lockless
        lib, Add lock-less NULL terminated single list
        Add Kconfig option ARCH_HAVE_NMI_SAFE_CMPXCHG
        ACPI, APEI, Add WHEA _OSC support
        ACPI, APEI, Add APEI bit support in generic _OSC call
        ACPI, APEI, GHES, Support disable GHES at boot time
        ACPI, APEI, GHES, Prevent GHES to be built as module
        ACPI, APEI, Use apei_exec_run_optional in APEI EINJ and ERST
        ACPI, APEI, Add apei_exec_run_optional
        ACPI, APEI, GHES, Do not ratelimit fatal error printk before panic
        ACPI, APEI, ERST, Fix erst-dbg long record reading issue
        ACPI, APEI, ERST, Prevent erst_dbg from loading if ERST is disabled
      c0c770e6
    • Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · a9e4e6e1
      Linus Torvalds authored
      * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
        tcm_fc: Handle DDP/SW fc_frame_payload_get failures in ft_recv_write_data
        target: Fix bug for transport_generic_wait_for_tasks with direct operation
        target: iscsi_target depends on NET
        target: Fix WRITE_SAME_16 lba assignment breakage
        MAINTAINERS: Add target-devel list for drivers/target/
        iscsi-target: Fix CONFIG_SMP=n and CONFIG_MODULES=n build failure
        iscsi-target: Fix snprintf usage with MAX_PORTAL_LEN
        iscsi-target: Fix uninitialized usage of cmd->pad_bytes
        iscsi-target: strlen() doesn't count the terminator
        iscsi-target: Fix NULL dereference on allocation failure
      a9e4e6e1
    • Merge branch 'devicetree/next' of git://git.secretlab.ca/git/linux-2.6 · 27665ffa
      Linus Torvalds authored
      * 'devicetree/next' of git://git.secretlab.ca/git/linux-2.6:
        dt: add of_alias_scan and of_alias_get_id
      27665ffa
    • shm: optimize exit_shm() · 298507d4
      Vasiliy Kulikov authored
      We may optimistically check .in_use == 0 without holding the rw_mutex:
      it's the common case, and if it's zero, there certainly won't be any
      segments associated with us.
      
      After taking the lock, the idr_for_each() will do the right thing, so we
      could now drop the re-check inside the lock without any real cost.  But
      it won't hurt.
      Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      298507d4
    • shm: fix wrong tests · 33a30ed4
      Vasiliy Kulikov authored
      Commit 4c677e2e ("shm: optimize locking and ipc_namespace getting")
      introduced a copy-paste bug.  Due to the bug, cycle optimizations were
      disabled.
      Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33a30ed4
    • tmpfs: expand "help" to explain value of TMPFS_POSIX_ACL · 206506cc
      Robert P. J. Day authored
      Expand the fs/Kconfig "help" info to clarify why it's a bad idea to
      deselect the TMPFS_POSIX_ACL config variable.
      Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
      Acked-by: Randy Dunlap <rdunlap@xenotime.net>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      206506cc
    • mm: clarify the radix_tree exceptional cases · 8079b1c8
      Hugh Dickins authored
      Make the radix_tree exceptional cases, mostly in filemap.c, clearer.
      
      It's hard to devise a suitable snappy name that illuminates the use by
      shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree
      generality.  And akpm points out that /* radix_tree_deref_retry(page) */
      comments look like calls that have been commented out for unknown
      reason.
      
      Skirt the naming difficulty by rearranging these blocks to handle the
      transient radix_tree_deref_retry(page) case first; then just explain the
      remaining shmem/tmpfs swap case in a comment.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8079b1c8
    • tmpfs radix_tree: locate_item to speed up swapoff · e504f3fd
      Hugh Dickins authored
      We have already acknowledged that swapoff of a tmpfs file is slower than
      it was before conversion to the generic radix_tree: a little slower
      there will be acceptable, if the hotter paths are faster.
      
      But it was a shock to find swapoff of a 500MB file 20 times slower on my
      laptop, taking 10 minutes; and at that rate it significantly slows down
      my testing.
      
      Now, most of that turned out to be overhead from PROVE_LOCKING and
      PROVE_RCU: without those it was only 4 times slower than before; and
      more realistic tests on other machines don't fare as badly.
      
      I've tried a number of things to improve it, including tagging the swap
      entries, then doing lookup by tag: I'd expected that to halve the time,
      but in practice it's erratic, and often counter-productive.
      
      The only change I've so far found to make a consistent improvement, is
      to short-circuit the way we go back and forth, gang lookup packing
      entries into the array supplied, then shmem scanning that array for the
      target entry.  Scanning in place doubles the speed, so it's now only
      twice as slow as before (or three times slower when the PROVEs are on).
      
      So, add radix_tree_locate_item() as an expedient, once-off,
      single-caller hack to do the lookup directly in place.  #ifdef it on
      CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
      applicability as save space in other configurations.  And, sadly,
      #include sched.h for cond_resched().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e504f3fd
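
The difference between the two lookup styles in the locate_item commit above can be sketched on a flat array. This is a simplification and not the kernel's radix tree: the array stands in for the tree's slots, `BATCH` is an arbitrary batch size, and both function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define BATCH 16  /* arbitrary batch size for the copy-based lookup */

/* Gang-lookup style: copy a batch of slots out, then scan the copy. */
static long find_via_copy(void *const *slots, size_t n, const void *item)
{
	void *batch[BATCH];

	for (size_t base = 0; base < n; base += BATCH) {
		size_t got = n - base < BATCH ? n - base : BATCH;

		memcpy(batch, slots + base, got * sizeof(batch[0]));
		for (size_t i = 0; i < got; i++)
			if (batch[i] == item)
				return (long)(base + i);
	}
	return -1;
}

/* In-place style: compare each slot directly, with no intermediate array. */
static long find_in_place(void *const *slots, size_t n, const void *item)
{
	for (size_t i = 0; i < n; i++)
		if (slots[i] == item)
			return (long)i;
	return -1;
}
```

Both return the same index; the in-place version simply avoids packing entries into a caller-supplied array before scanning them, which is the saving the message describes.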
    • mm: a few small updates for radix-swap · 31475dd6
      Hugh Dickins authored
      Remove PageSwapBacked (!page_is_file_cache) cases from
      add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
      go through shmem_add_to_page_cache().
      
      Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
      and add a comment on swap entries to invalidate_mapping_pages().
      
      And mincore_page() uses find_get_page() on what might be shmem or a
      tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
      find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31475dd6
    • tmpfs: use kmemdup for short symlinks · 69f07ec9
      Hugh Dickins authored
      But we've not yet removed the old swp_entry_t i_direct[16] from
      shmem_inode_info.  That's because it was still being shared with the
      inline symlink.  Remove it now (saving 64 or 128 bytes from shmem inode
      size), and use kmemdup() for short symlinks, say, those up to 128 bytes.
      
      I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
      rather than shmem_evict_inode(), where we usually do such freeing? I
      guess it doesn't matter, and I'm not into NUMA mpol testing right now.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69f07ec9
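
A userspace analogue of the short-symlink handling in the commit above could look like this. It is a sketch, not the shmem code: `memdup_sketch` mimics what kmemdup() does using malloc(), the 128-byte threshold is taken from the message's "say, those up to 128 bytes", and the names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define SHORT_SYMLINK_LEN 128  /* assumed threshold, from the commit message */

/* Userspace stand-in for kmemdup(): allocate and copy len bytes. */
static void *memdup_sketch(const void *src, size_t len)
{
	void *p = malloc(len);

	if (p)
		memcpy(p, src, len);
	return p;
}

/*
 * Returns a heap copy of the target (including its NUL terminator) when it
 * is short enough, or NULL if it is too long or the allocation fails.
 */
static char *dup_short_symlink(const char *target)
{
	size_t len = strlen(target) + 1;

	if (len > SHORT_SYMLINK_LEN)
		return NULL;
	return memdup_sketch(target, len);
}
```

A caller would fall back to a page-backed symlink when this returns NULL for a long target, and free() the copy when the inode goes away.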
    • tmpfs: convert shmem_writepage and enable swap · 6922c0c7
      Hugh Dickins authored
      Convert shmem_writepage() to use shmem_delete_from_page_cache(), which
      uses shmem_radix_tree_replace() to substitute swap entry for page pointer
      atomically in the radix tree.
      
      As with shmem_add_to_page_cache(), it's not entirely satisfactory to be
      copying such code from delete_from_swap_cache, but again judged easier
      to sell than making its other callers go through the extras.
      
      Remove the toy implementation's shmem_put_swap() and shmem_get_swap(),
      now unreferenced, and the hack to disable swap: it's now good to go.
      
      The way things have worked out, info->lock no longer helps to guard the
      shmem_swaplist: we increment swapped under shmem_swaplist_mutex only.
      That global mutex exclusion between shmem_writepage() and shmem_unuse()
      is not pretty, and we ought to find another way; but it's been forced on
      us by recent race discoveries, not a consequence of this patchset.
      
      And what has become of the WARN_ON_ONCE(1) free_swap_and_cache() if a
      swap entry was found already present? That's no longer possible, the
      (unknown) one inserting this page into filecache would hit the swap
      entry occupying that slot.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6922c0c7
    • tmpfs: convert mem_cgroup shmem to radix-swap · aa3b1895
      Hugh Dickins authored
      Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
      had to move swappage to filecache with GFP_NOWAIT.
      
      Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
      moving its call out from shmem_add_to_page_cache() to two of its three
      callers.  But leave it doing mem_cgroup_uncharge_cache_page() on error:
      although asymmetrical, it's easier for all 3 callers to handle.
      
      These two changes would also be appropriate if anyone were to start
      using shmem_read_mapping_page_gfp() with GFP_NOWAIT.
      
      Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
      radix_tree_exceptional_entry() to get what it needs for itself.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa3b1895
    • tmpfs: convert shmem_getpage_gfp to radix-swap · 54af6042
      Hugh Dickins authored
      Convert shmem_getpage_gfp(), the engine-room of shmem, to expect page or
      swap entry returned from radix tree by find_lock_page().
      
      Whereas the repetitive old method proceeded mainly under info->lock,
      dropping and repeating whenever one of the conditions needed was not
      met, now we can proceed without it, leaving shmem_add_to_page_cache() to
      check for a race.
      
      This way there is no need to preallocate a page, no need for an early
      radix_tree_preload(), no need for mem_cgroup_shmem_charge_fallback().
      
      Move the error unwinding down to the bottom instead of repeating it
      throughout.  ENOSPC handling is a little different from before: there is
      no longer any race between find_lock_page() and finding swap, but we can
      arrive at ENOSPC before calling shmem_recalc_inode(), which might
      occasionally discover freed space.
      
      Be stricter to check i_size before returning.  info->lock is used for
      little but alloced, swapped, i_blocks updates.  Move i_blocks updates
      out from under the max_blocks check, so even an unlimited size=0 mount
      can show accurate du.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      54af6042
    • tmpfs: convert shmem_unuse_inode to radix-swap · 46f65ec1
      Hugh Dickins authored
      Convert shmem_unuse_inode() to use a lockless gang lookup of the radix
      tree, searching for matching swap.
      
      This is somewhat slower than the old method: because of repeated radix
      tree descents, because of copying entries up, but probably most because
      the old method noted and skipped once a vector page was cleared of swap.
      Perhaps we can devise a use of radix tree tagging to achieve that later.
      
      shmem_add_to_page_cache() uses shmem_radix_tree_replace() to compensate
      for the lockless lookup by checking that the expected entry is in place,
      under lock.  It is not very satisfactory to be copying this much from
      add_to_page_cache_locked(), but I think easier to sell than insisting
      that every caller of add_to_page_cache*() go through the extras.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: convert shmem_truncate_range to radix-swap · 7a5d0fbb
      Hugh Dickins authored
      Disable the toy swapping implementation in shmem_writepage() - it's hard
      to support two schemes at once - and convert shmem_truncate_range() to a
      lockless gang lookup of swap entries along with pages, freeing both.
      
      Since the second loop tightens its noose until all entries of either
      kind have been squeezed out (and we shall make sure that there's not an
      instant when neither is visible), there is no longer a need for yet
      another pass below.
      
      shmem_radix_tree_replace() compensates for the lockless lookup by
      checking that the expected entry is in place, under lock, before
      replacing it.  Here it just deletes, but will be used in later patches
      to substitute swap entry for page or page for swap entry.
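
The check-then-replace idea described above can be sketched in plain C. This is only an illustration, not the kernel's code: `struct toy_tree`, `TOY_SLOTS` and `toy_replace()` are hypothetical names, a fixed array stands in for the radix tree, and the caller is assumed to already hold whatever lock protects it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_SLOTS 8  /* hypothetical fixed-size stand-in for a radix tree */

struct toy_tree {
    void *slots[TOY_SLOTS];
};

/*
 * Replace slots[index] only if it still holds `expected`.
 * A NULL replacement deletes the entry.
 * Assumption: the caller holds the lock that protects the tree, so the
 * comparison and the store cannot be separated by another writer.
 */
static int toy_replace(struct toy_tree *tree, size_t index,
                       void *expected, void *replacement)
{
    assert(tree != NULL);
    if (index >= TOY_SLOTS)
        return -EINVAL;
    if (tree->slots[index] != expected)
        return -ENOENT;  /* the earlier lockless lookup is now stale */
    tree->slots[index] = replacement;
    return 0;
}
```

A caller that found an entry without taking the lock would pass that entry as `expected`; a negative return means the slot changed in the meantime and the lookup should be retried.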
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: copy truncate_inode_pages_range · bda97eab
      Hugh Dickins authored
      Bring truncate.c's code for truncate_inode_pages_range() inline into
      shmem_truncate_range(), replacing its first call (there's a followup
      call below, but leave that one, it will disappear next).
      
      Don't play with it yet, apart from leaving out the cleancache flush, and
      (importantly) the nrpages == 0 skip, and moving shmem_setattr()'s
      partial page preparation into its partial page handling.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: miscellaneous trivial cleanups · 41ffe5d5
      Hugh Dickins authored
      While it's at its least, make a number of boring nitpicky cleanups to
      shmem.c, mostly for consistency of variable naming.  Things like "swap"
      instead of "entry", "pgoff_t index" instead of "unsigned long idx".
      
      And since everything else here is prefixed "shmem_", better change
      init_tmpfs() to shmem_init().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: demolish old swap vector support · 285b2c4f
      Hugh Dickins authored
      The maximum size of a shmem/tmpfs file has been limited by the maximum
      size of its triple-indirect swap vector.  With 4kB page size, maximum
      filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
      that on a 64-bit kernel.  (With 8kB page size, maximum filesize was just
      over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
      MAX_LFS_FILESIZE being then more restrictive than swap vector layout.)
      
      It's a shame that tmpfs should be more restrictive than ramfs, and this
      limitation has now been noticed.  Add another level to the swap vector?
      No, it became obscure and hard to maintain, once I complicated it to
      make use of highmem pages nine years ago: better choose another way.
      
      Surely, if 2.4 had had the radix tree pagecache introduced in 2.5, then
      tmpfs would never have invented its own peculiar radix tree: we would
      have fitted swap entries into the common radix tree instead, in much the
      same way as we fit swap entries into page tables.
      
      And why should each file have a separate radix tree for its pages and
      for its swap entries? The swap entries are required precisely where and
      when the pages are not.  We want to put them together in a single radix
      tree: which can then avoid much of the locking which was needed to
      prevent them from being exchanged underneath us.
      
      This also avoids the waste of memory devoted to swap vectors, first in
      the shmem_inode itself, then at least two more pages once a file grew
      beyond 16 data pages (pages accounted by df and du, but not by memcg).
      Allocated upfront, to avoid allocation when under swapping pressure, but
      pure waste when CONFIG_SWAP is not set - I have never spattered around
      the ifdefs to prevent that, preferring this move to sharing the common
      radix tree instead.
      
      There are three downsides to sharing the radix tree.  One, that it binds
      tmpfs more tightly to the rest of mm, either requiring knowledge of swap
      entries in radix tree there, or duplication of its code here in shmem.c.
      I believe that the simplifications and memory savings (and probable higher
      performance, not yet measured) justify that.
      
      Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
      nodes that cannot be freed under memory pressure - whereas before it was
      the less precious highmem swap vector pages that could not be freed.
      I'm hoping that 64-bit has now been accessible for long enough, that the
      highmem argument has grown much less persuasive.
      
      Three, that swapoff is slower than it used to be on tmpfs files, since
      it's using a simple generic mechanism not tailored to it: I find this
      noticeable, and shall want to improve, but maybe nobody else will
      notice.
      
      So...  now remove most of the old swap vector code from shmem.c.  But,
      for the moment, keep the simple i_direct vector of 16 pages, with simple
      accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
      to help mark where swap needs to be handled in subsequent patches.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: let swap use exceptional entries · a2c16d6c
      Hugh Dickins authored
      If swap entries are to be stored along with struct page pointers in a
      radix tree, they need to be distinguished as exceptional entries.
      
      Most of the handling of swap entries in radix tree will be contained in
      shmem.c, but a few functions in filemap.c's common code need to check
      for their appearance: find_get_page(), find_lock_page(),
      find_get_pages() and find_get_pages_contig().
      
      So as not to slow their fast paths, tuck those checks inside the
      existing checks for unlikely radix_tree_deref_slot(); except for
      find_lock_page(), where it is an added test.  And make it a BUG in
      find_get_pages_tag(), which is not applied to tmpfs files.
      
      A part of the reason for eliminating shmem_readpage() earlier, was to
      minimize the places where common code would need to allow for swap
      entries.
      
      The swp_entry_t known to swapfile.c must be massaged into a slightly
      different form when stored in the radix tree, just as it gets massaged
      into a pte_t when stored in page tables.
      
      In an i386 kernel this limits its information (type and page offset) to
      30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
      swapfile size of 128GB.  Which is less than the 512GB we previously
      allowed with X86_PAE (where the swap entry can occupy the entire upper
      32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
      there's not a new limitation on 64-bit (where swap filesize is already
      limited to 16TB by a 32-bit page offset).  Thirty areas of 128GB is
      probably still enough swap for a 64GB 32-bit machine.
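
The size limits quoted above follow from simple bit counting, which the helper below reproduces. It is a sketch for checking the arithmetic only: `max_swapfile_bytes()` is a hypothetical name, and it assumes that 32 swapfile "types" take 5 bits and that all remaining bits of the entry hold the page offset.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest swapfile, in bytes, when a swap entry has `entry_bits` of room,
 * `type_bits` of which select the swapfile and the rest hold the page
 * offset. `page_shift` is log2 of the page size (12 for 4kB pages).
 */
static uint64_t max_swapfile_bytes(unsigned entry_bits, unsigned type_bits,
                                   unsigned page_shift)
{
    assert(entry_bits > type_bits);
    assert(entry_bits - type_bits + page_shift < 64);
    return (uint64_t)1 << (entry_bits - type_bits + page_shift);
}
```

With 30 bits of entry, 5 type bits and 4kB pages this gives 2^37 bytes, the 128GB figure; with the 32 bits available under X86_PAE it gives 2^39 bytes, the earlier 512GB.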
      
      Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
      enforce filesize limit in read_swap_header(), just as for ptes.
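
One way to picture such a conversion is to shift the swap value left and set a low tag bit that an aligned pointer never has. The sketch below is a userspace illustration under that assumption; the `toy_` names, the two-bit shift and the choice of tag bit are hypothetical and are not taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout: the low two bits of a slot value are reserved, and
 * bit 1 set means "this is not a pointer". Pointers to objects aligned
 * to 4 bytes or more always have both bits clear.
 */
#define TOY_EXCEPTIONAL_BIT   ((uintptr_t)2)
#define TOY_EXCEPTIONAL_SHIFT 2

/* Pack a swap value into a tagged slot value. */
static uintptr_t toy_swp_to_slot(uintptr_t swp_val)
{
    /* the value must survive the shift without losing its high bits */
    assert(((swp_val << TOY_EXCEPTIONAL_SHIFT) >> TOY_EXCEPTIONAL_SHIFT)
           == swp_val);
    return (swp_val << TOY_EXCEPTIONAL_SHIFT) | TOY_EXCEPTIONAL_BIT;
}

/* Recover the swap value from a tagged slot value. */
static uintptr_t toy_slot_to_swp(uintptr_t slot)
{
    return slot >> TOY_EXCEPTIONAL_SHIFT;
}

/* Tell a tagged swap value apart from an ordinary aligned pointer. */
static int toy_slot_is_exceptional(uintptr_t slot)
{
    return (slot & TOY_EXCEPTIONAL_BIT) != 0;
}
```

Reserving two bits is what leaves 30 bits for the swap value on a 32-bit kernel, which is why the size limit has to be enforced when the swap header is read.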
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>