1. 12 May, 2011 8 commits
    • tmpfs: fix race between umount and swapoff · 778dd893
      Hugh Dickins authored
      The use of igrab() in swapoff's shmem_unuse_inode() is just as vulnerable
      to umount as that in shmem_writepage().
      
      Fix this instance by extending the protection of shmem_swaplist_mutex
      right across shmem_unuse_inode(): while it's on the list, the inode cannot
      be evicted (and the filesystem cannot be unmounted) without
      shmem_evict_inode() taking that mutex to remove it from the list.
      
      But since shmem_writepage() might take that mutex, we should avoid making
      memory allocations or memcg charges while holding it: prepare them at the
      outer level in shmem_unuse().  When mem_cgroup_cache_charge() was
      originally placed, we didn't know until that point that the page from swap
      was actually a shmem page; but nowadays it's noted in the swap_map, so
      we're safe to charge upfront.  For the radix_tree, do as is done in
      shmem_getpage(): preload upfront, but don't pin to the cpu; so we make a
      habit of refreshing the node pool, but might dip into GFP_NOWAIT reserves
      on occasion if subsequently preempted.
      
      With the allocation and charge moved out from shmem_unuse_inode(),
      we can also hold index map and info->lock over from finding the entry.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
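The "prepare upfront, then take the mutex" idea from the commit above can be illustrated with a small userspace analogy. This is only a sketch, not the kernel code: `insert_unique()`, `struct entry` and `list_mutex` are hypothetical names, and a pthread mutex with `malloc()` stands in for shmem_swaplist_mutex and the radix-tree/memcg preparation.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical list entry; stands in for whatever must be allocated. */
struct entry {
	int key;
	struct entry *next;
};

static pthread_mutex_t list_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct entry *list_head;

/*
 * Insert key unless it is already present.
 * Returns 1 if inserted, 0 if already present, -1 if allocation failed.
 */
int insert_unique(int key)
{
	/* Allocate before locking, so no allocation happens under the mutex. */
	struct entry *spare = malloc(sizeof(*spare));
	struct entry *e;
	int inserted = 0;

	if (!spare)
		return -1;
	spare->key = key;

	pthread_mutex_lock(&list_mutex);
	for (e = list_head; e; e = e->next)
		if (e->key == key)
			break;
	if (!e) {
		/* Consume the prepared entry while holding the lock. */
		spare->next = list_head;
		list_head = spare;
		spare = NULL;
		inserted = 1;
	}
	pthread_mutex_unlock(&list_mutex);

	/* Discard the preparation if it was not needed; free(NULL) is a no-op. */
	free(spare);
	return inserted;
}
```

The cost of this pattern is an occasional wasted allocation when the entry turns out not to be needed; the benefit is that nothing that can block on memory runs while the lock is held.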
    • tmpfs: fix race between umount and writepage · b1dea800
      Hugh Dickins authored
      Konstantin Khlebnikov reports that a dangerous race between umount and
      shmem_writepage can be reproduced by this script:
      
        for i in {1..300} ; do
      	mkdir $i
      	while true ; do
      		mount -t tmpfs none $i
      		dd if=/dev/zero of=$i/test bs=1M count=$(($RANDOM % 100))
      		umount $i
      	done &
        done
      
      on a 6xCPU node with 8Gb RAM: kernel very unstable after this accident. =)
      
      Kernel log:
      
        VFS: Busy inodes after unmount of tmpfs.
                       Self-destruct in 5 seconds.  Have a nice day...
      
        WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
        list_del corruption. prev->next should be ffff880222fdaac8, but was (null)
        Pid: 11222, comm: mount.tmpfs Not tainted 2.6.39-rc2+ #4
        Call Trace:
         warn_slowpath_common+0x80/0x98
         warn_slowpath_fmt+0x41/0x43
         __list_del_entry+0x8d/0x98
         evict+0x50/0x113
         iput+0x138/0x141
        ...
        BUG: unable to handle kernel paging request at ffffffffffffffff
        IP: shmem_free_blocks+0x18/0x4c
        Pid: 10422, comm: dd Tainted: G        W   2.6.39-rc2+ #4
        Call Trace:
         shmem_recalc_inode+0x61/0x66
         shmem_writepage+0xba/0x1dc
         pageout+0x13c/0x24c
         shrink_page_list+0x28e/0x4be
         shrink_inactive_list+0x21f/0x382
        ...
      
      shmem_writepage() calls igrab() on the inode for the page which came from
      page reclaim, to add it later into shmem_swaplist for swapoff operation.
      
      This igrab() can race with super-block deactivating process:
      
        shrink_inactive_list()          deactivate_super()
        pageout()                       tmpfs_fs_type->kill_sb()
        shmem_writepage()               kill_litter_super()
                                        generic_shutdown_super()
                                         evict_inodes()
         igrab()
                                          atomic_read(&inode->i_count)
                                           skip-inode
         iput()
                                         if (!list_empty(&sb->s_inodes))
                                                printk("VFS: Busy inodes after...
      
      This igrab-iput pair was added in commit 1b1b32f2 "tmpfs: fix
      shmem_swaplist races" based on incorrect assumptions: igrab() protects the
      inode from concurrent eviction by deletion, but it does nothing to protect
      it from concurrent unmounting, which goes ahead despite the raised
      i_count.
      
      So this use of igrab() was wrong all along, but the race was made much
      worse in 2.6.37 when commit 63997e98 "split invalidate_inodes()"
      replaced two attempts at invalidate_inodes() by a single evict_inodes().
      
      Konstantin posted a plausible patch, raising sb->s_active too: I'm unsure
      whether it was correct or not; but burnt once by igrab(), I am sure that
      we don't want to rely more deeply upon externals here.
      
      Fix it by adding the inode to shmem_swaplist earlier, while the page lock
      on page in page cache still secures the inode against eviction, without
      artificially raising i_count.  It was originally added later because
      shmem_unuse_inode() is liable to remove an inode from the list while it's
      unswapped; but we can guard against that by taking spinlock before
      dropping mutex.
      Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Tested-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: allocate memory cgroup structures in local nodes · 21a3c964
      Andi Kleen authored
      Commit dde79e00 ("page_cgroup: reduce allocation overhead for
      page_cgroup array for CONFIG_SPARSEMEM") added a regression that the
      memory cgroup data structures all end up in node 0 because the first
      attempt at allocating them would not pass in a node hint.  Since the
      initialization runs on CPU #0 it would all end up node 0.  This is a
      problem on large memory systems, where node 0 would lose a lot of
      memory.
      
      Change the alloc_pages_exact() to alloc_pages_exact_nid().  This will
      still fall back to other nodes if not enough memory is available.
      
       [ RED-PEN: right now it would fall back first before trying
         vmalloc_node.  Probably not the best strategy ...  But I left it like
         that for now. ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Reported-by: Doug Nelson
      Cc: David Rientjes <rientjes@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
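The effect of passing a node hint, with fallback when the preferred node is short of memory, can be modelled in a few lines of userspace C. This is a toy model under stated assumptions, not the kernel allocator: `alloc_prefer_node()`, `free_pages[]` and `NR_NODES` are hypothetical, and the round-robin fallback order is chosen only for illustration.

```c
#include <stddef.h>

#define NR_NODES 4

/* Toy per-node free page counters (assumed starting values). */
size_t free_pages[NR_NODES] = { 8, 8, 8, 8 };

/*
 * Try to take `pages` pages from node `nid`; if that node cannot
 * satisfy the request, fall back to the other nodes in turn.
 * Returns the node that served the request, or -1 on failure.
 */
int alloc_prefer_node(int nid, size_t pages)
{
	int i;

	if (nid < 0 || nid >= NR_NODES)
		return -1;

	for (i = 0; i < NR_NODES; i++) {
		int node = (nid + i) % NR_NODES;

		if (free_pages[node] >= pages) {
			free_pages[node] -= pages;
			return node;
		}
	}
	return -1;
}
```

In this model, a caller that always passes node 0 drains node 0 first, which mirrors the regression described above; passing the node that the data describes spreads the load.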
    • mm: add alloc_pages_exact_nid() · ee85c2e1
      Andi Kleen authored
      Add an alloc_pages_exact_nid() that allocates on a specific node.
      
      The naming is quite broken, but fixing that would need a larger renaming
      action.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • MAINTAINERS: fix sorting · 71a6d0af
      Harry Wei authored
      Sort the entries in the MAINTAINERS file alphabetically.
      Signed-off-by: Harry Wei <harryxiyou@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use alloc_bootmem_node_nopanic() on really needed path · 8f389a99
      Yinghai Lu authored
      Stefan found nobootmem does not work on his system that has only 8M of
      RAM.  This causes an early panic:
      
        BIOS-provided physical RAM map:
         BIOS-88: 0000000000000000 - 000000000009f000 (usable)
         BIOS-88: 0000000000100000 - 0000000000840000 (usable)
        bootconsole [earlyser0] enabled
        Notice: NX (Execute Disable) protection missing in CPU or disabled in BIOS!
        DMI not present or invalid.
        last_pfn = 0x840 max_arch_pfn = 0x100000
        init_memory_mapping: 0000000000000000-0000000000840000
        8MB LOWMEM available.
          mapped low ram: 0 - 00840000
          low ram: 0 - 00840000
        Zone PFN ranges:
          DMA      0x00000001 -> 0x00001000
          Normal   empty
        Movable zone start PFN for each node
        early_node_map[2] active PFN ranges
            0: 0x00000001 -> 0x0000009f
            0: 0x00000100 -> 0x00000840
        BUG: Int 6: CR2 (null)
             EDI c034663c  ESI (null)  EBP c0329f38  ESP c0329ef4
             EBX c0346380  EDX 00000006  ECX ffffffff  EAX fffffff4
             err (null)  EIP c0353191   CS c0320060  flg 00010082
        Stack: (null) c030c533 000007cd (null) c030c533 00000001 (null) (null)
               00000003 0000083f 00000018 00000002 00000002 c0329f6c c03534d6 (null)
               (null) 00000100 00000840 (null) c0329f64 00000001 00001000 (null)
        Pid: 0, comm: swapper Not tainted 2.6.36 #5
        Call Trace:
         [<c02e3707>] ? 0xc02e3707
         [<c035e6e5>] 0xc035e6e5
         [<c0353191>] ? 0xc0353191
         [<c03534d6>] 0xc03534d6
         [<c034f1cd>] 0xc034f1cd
         [<c034a824>] 0xc034a824
         [<c03513cb>] ? 0xc03513cb
         [<c0349432>] 0xc0349432
         [<c0349066>] 0xc0349066
      
      It turns out that we should ignore the low limit of 16M.
      
      Use alloc_bootmem_node_nopanic() in this case.
      
      [akpm@linux-foundation.org: less mess]
      Signed-off-by: Yinghai LU <yinghai@kernel.org>
      Reported-by: Stefan Hellermann <stefan@the2masters.de>
      Tested-by: Stefan Hellermann <stefan@the2masters.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@kernel.org>		[2.6.34+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: check PageUnevictable in lru_deactivate_fn() · bad49d9c
      Minchan Kim authored
      lru_deactivate_fn() should not move a page that is on the unevictable
      LRU onto the inactive list.  Otherwise, we can hit a BUG when we use
      isolate_lru_pages(), as __isolate_lru_page() could return -EINVAL.
      Reported-by: Ying Han <yinghan@google.com>
      Tested-by: Ying Han <yinghan@google.com>
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
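The guard described in the commit above amounts to an early return before the page is moved. The following is a simplified userspace model, not the kernel function: `struct toy_page`, `enum toy_lru` and `toy_deactivate()` are hypothetical stand-ins for the page flags and LRU lists.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's LRU lists. */
enum toy_lru { TOY_LRU_INACTIVE, TOY_LRU_ACTIVE, TOY_LRU_UNEVICTABLE };

struct toy_page {
	enum toy_lru lru;   /* which list the page currently sits on */
	bool on_lru;        /* false if the page is not on any list */
};

/*
 * Move a page to the inactive list, unless it is off the LRU or
 * unevictable. Returns true if the page was moved (or already inactive).
 */
bool toy_deactivate(struct toy_page *page)
{
	if (!page->on_lru)
		return false;

	/* The check this commit adds: leave unevictable pages alone. */
	if (page->lru == TOY_LRU_UNEVICTABLE)
		return false;

	page->lru = TOY_LRU_INACTIVE;
	return true;
}
```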
    • drivers/rtc/rtc-s3c.c: fixup wake support for rtc · 52cd4e5c
      Ben Dooks authored
      The driver is not balancing set_irq and disable_irq_wake() calls, so
      ensure that it keeps track of whether the wake is enabled.
      
      This fixes the following error on S3C6410 devices:
      
        WARNING: at kernel/irq/manage.c:382 set_irq_wake+0x84/0xec()
        Unbalanced IRQ 92 wake disable
      Signed-off-by: Ben Dooks <ben-linux@fluff.org>
      Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
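Keeping track of whether wake is enabled, as the commit above describes, can be sketched as a driver-side flag that gates each enable and disable call. This is a userspace model with hypothetical names (`toy_suspend()`, `toy_resume()`, `wake_en`, `wake_depth`); the two `toy_*_irq_wake()` helpers only imitate the IRQ core's balance check and are not the kernel API.

```c
#include <stdbool.h>

int wake_depth;            /* models the IRQ core's wake reference count */
int unbalanced_warnings;   /* counts "Unbalanced IRQ wake disable" events */
bool wake_en;              /* driver-side record: did we enable wake? */

void toy_enable_irq_wake(void)
{
	wake_depth++;
}

void toy_disable_irq_wake(void)
{
	if (wake_depth == 0) {
		/* A disable without a matching enable triggers the warning. */
		unbalanced_warnings++;
		return;
	}
	wake_depth--;
}

/* Enable wake at most once, and remember that we did. */
void toy_suspend(bool may_wakeup)
{
	if (may_wakeup && !wake_en) {
		toy_enable_irq_wake();
		wake_en = true;
	}
}

/* Disable wake only if this driver enabled it earlier. */
void toy_resume(void)
{
	if (wake_en) {
		toy_disable_irq_wake();
		wake_en = false;
	}
}
```

With the flag in place, a resume that was never preceded by a wake-enabling suspend becomes a no-op rather than an unbalanced disable.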
  2. 11 May, 2011 2 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 · 9f381a61
      Linus Torvalds authored
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
        slcan: fix ldisc->open retval
        net/usb: mark LG VL600 LTE modem ethernet interface as WWAN
        xfrm: Don't allow esn with disabled anti replay detection
        xfrm: Assign the inner mode output function to the dst entry
        net: dev_close() should check IFF_UP
        vlan: fix GVRP at dismantle time
        netfilter: revert a2361c87
        netfilter: IPv6: fix DSCP mangle code
        netfilter: IPv6: initialize TOS field in REJECT target module
        IPVS: init and cleanup restructuring
        IPVS: Change of socket usage to enable name space exit.
        netfilter: ebtables: only call xt_compat_add_offset once per rule
        netfilter: fix ebtables compat support
        netfilter: ctnetlink: fix timestamp support for new conntracks
        pch_gbe: support ML7223 IOH
        PCH_GbE : Fixed the issue of checksum judgment
        PCH_GbE : Fixed the issue of collision detection
        NET: slip, fix ldisc->open retval
        be2net: Fixed bugs related to PVID.
        ehea: fix wrongly reported speed and port
        ...
    • slub: Revert "[PARISC] slub: fix panic with DISCONTIGMEM" · 21a43e39
      David Rientjes authored
      This reverts commit 4a5fa359, which prevented SLUB from being used on
      architectures that use DISCONTIGMEM without NUMA support compiled in,
      unless CONFIG_BROKEN was also set.
      
      The slub panic that it was intended to prevent is addressed by
      d9b41e0b ("[PARISC] set memory ranges in N_NORMAL_MEMORY when
      onlined") on parisc, so there are no further slub issues with such a
      configuration.
      
      The revert now allows SLUB to be used on such architectures since
      there haven't been any reports of additional errors.
      
      Cc: James Bottomley <James.Bottomley@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 10 May, 2011 30 commits