1. 21 Nov, 2012 3 commits
  2. 20 Nov, 2012 2 commits
  3. 19 Nov, 2012 5 commits
  4. 18 Nov, 2012 7 commits
  5. 17 Nov, 2012 11 commits
    • gpio-mcp23s08: Build I2C support even when CONFIG_I2C=m · cbf24fad
      Daniel M. Weeks authored
      The driver has both SPI and I2C pieces. The appropriate pieces are built based
      on whether SPI and/or I2C is/are enabled. However, it was only checking if I2C
      was built-in, never if it was built as a module. This patch checks for either
      since building both this driver and I2C as modules is possible.
      Signed-off-by: Daniel M. Weeks <dan@danweeks.net>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
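The gpio-mcp23s08 change above turns on how Kconfig exposes tristate options to C: a built-in option defines `CONFIG_FOO`, while a modular one defines `CONFIG_FOO_MODULE` instead. The following is a minimal userspace sketch of that difference, not the driver's code. The `=m` case is simulated by hand, and the helper names `i2c_seen_by_builtin_check()` and `i2c_seen_by_either_check()` are hypothetical.

```c
/* Simulate what kbuild generates for CONFIG_I2C=m: only the _MODULE
 * symbol is defined, CONFIG_I2C itself is not. (Assumption: set by hand
 * here; in a real build this comes from the generated autoconf header.) */
#define CONFIG_I2C_MODULE 1

/* Built-in-only test: misses the modular case. */
#if defined(CONFIG_I2C)
#define I2C_BUILTIN_ONLY 1
#else
#define I2C_BUILTIN_ONLY 0
#endif

/* Test for either built-in or modular. */
#if defined(CONFIG_I2C) || defined(CONFIG_I2C_MODULE)
#define I2C_EITHER 1
#else
#define I2C_EITHER 0
#endif

/* Hypothetical helpers so the two results can be inspected at run time. */
int i2c_seen_by_builtin_check(void)
{
	return I2C_BUILTIN_ONLY;
}

int i2c_seen_by_either_check(void)
{
	return I2C_EITHER;
}
```

In kernel sources the combined test is usually spelled `IS_ENABLED(CONFIG_I2C)`, which is true for both `y` and `m`.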
    • gpio: adnp: Depend on OF_GPIO instead of OF · cb144fe8
      Thierry Reding authored
      The driver accesses the of_node field of struct gpio_chip, which is only
      available if OF_GPIO is selected. This solves a build issue on SPARC
      which conflicts with OF_GPIO and therefore does not provide this field.
      Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
    • mvebu-gpio: Disable blinking when enabling a GPIO for output · e9133760
      Jamie Lentin authored
      The plat-orion GPIO driver would disable any pin blinking whenever
      using a pin for output. Do the same here, as a blinking LED will
      continue to blink regardless of what the GPIO pin level is.
      Signed-off-by: Jamie Lentin <jm@lentin.co.uk>
      Acked-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
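The idea in the mvebu-gpio change above, clearing a pin's blink enable before driving it as an output, can be sketched as below. This is an illustration under stated assumptions: `struct fake_gpio_regs` and its field names are hypothetical and do not mirror the real mvebu register layout.

```c
#include <stdint.h>

/* Hypothetical register bank for one GPIO controller. */
struct fake_gpio_regs {
	uint32_t out_enable_n;	/* bit set = pin is an input */
	uint32_t blink_enable;	/* bit set = hardware toggles the pin itself */
	uint32_t data_out;	/* requested output level per pin */
};

/* Configure a pin as an output at the given level. Blink is cleared
 * first so the requested level is what actually appears on the pin. */
void fake_gpio_direction_output(struct fake_gpio_regs *regs,
				unsigned int pin, int value)
{
	uint32_t mask = UINT32_C(1) << pin;

	regs->blink_enable &= ~mask;
	if (value)
		regs->data_out |= mask;
	else
		regs->data_out &= ~mask;
	regs->out_enable_n &= ~mask;
}
```

Only the addressed pin's blink bit is touched, so other pins that are meant to blink keep doing so.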
    • xfs: drop buffer io reference when a bad bio is built · d69043c4
      Dave Chinner authored
      Error handling in xfs_buf_ioapply_map() does not handle IO reference
      counts correctly. We increment the b_io_remaining count before
      building the bio, but then fail to decrement it in the failure case.
      This leads to the buffer never running IO completion and releasing
      the reference that the IO holds, so at unmount we can leak the
      buffer. This leak is captured by this assert failure during unmount:
      
      XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273
      
      This is not a new bug - the b_io_remaining accounting has had this
      problem for a long, long time - it's just very hard to get a
      zero length bio being built by this code...
      
      Further, the buffer IO error can be overwritten on a multi-segment
      buffer by subsequent bio completions for partial sections of the
      buffer. Hence we should only set the buffer error status if the
      buffer is not already carrying an error status. This ensures that a
      partial IO error on a multi-segment buffer will not be lost. This
      part of the problem is a regression, however.
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
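The two rules described in the commit above (drop the I/O reference when request building fails, and never overwrite an error the buffer already carries) can be sketched in userspace as follows. This is a simplified model, not the XFS code: `struct fake_buf` and both functions are hypothetical, and a page count of zero stands in for the zero-length bio.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a buffer with an I/O reference count. */
struct fake_buf {
	atomic_int io_remaining;	/* outstanding I/O references */
	int error;			/* first error seen, 0 if none */
};

/* Record an error only if the buffer is not already carrying one, so a
 * later completion cannot hide an earlier partial failure. */
void fake_buf_ioerror_once(struct fake_buf *bp, int error)
{
	if (!bp->error)
		bp->error = error;
}

/* Take an I/O reference, then try to build a request for nr_pages.
 * On failure the reference taken up front is dropped again; otherwise
 * completion would never run and the buffer would leak. */
int fake_buf_ioapply(struct fake_buf *bp, int nr_pages)
{
	atomic_fetch_add(&bp->io_remaining, 1);
	if (nr_pages == 0) {
		atomic_fetch_sub(&bp->io_remaining, 1);
		fake_buf_ioerror_once(bp, EIO);
		return -1;
	}
	return 0;	/* reference is dropped later by I/O completion */
}
```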
    • xfs: fix broken error handling in xfs_vm_writepage · 3daed8bc
      Dave Chinner authored
      When we shut down the filesystem, it might first be detected in
      writeback when we are allocating an inode size transaction. This
      happens after we have moved all the pages into the writeback state
      and unlocked them. Unfortunately, if we fail to set up the
      transaction we then abort writeback and try to invalidate the
      current page. This then triggers a BUG() in block_invalidatepage()
      because we are trying to invalidate an unlocked page.
      
      Fixing this is a bit of a chicken and egg problem - we can't
      allocate the transaction until we've clustered all the pages into
      the IO and we know the size of it (i.e. whether the last block of
      the IO is beyond the current EOF or not). However, we don't want to
      hold pages locked for long periods of time, especially while we lock
      other pages to cluster them into the write.
      
      To fix this, we need to make a clear delineation in writeback where
      errors can only be handled by IO completion processing. That is,
      once we have marked a page for writeback and unlocked it, we have to
      report errors via IO completion because we've already started the
      IO. We may not have submitted any IO, but we've changed the page
      state to indicate that it is under IO so we must now use the IO
      completion path to report errors.
      
      To do this, add an error field to xfs_submit_ioend() to pass it the
      error that occurred during the building of the ioend chain. When
      this is non-zero, mark each ioend with the error and call
      xfs_finish_ioend() directly rather than building bios. This will
      immediately push the ioends through completion processing with the
      error that has occurred.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
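The last paragraph of the commit above describes passing a build-time error into the submit path so that every ioend is completed with it rather than turned into bios. Below is a rough sketch of that control flow. It is not the XFS implementation: `struct fake_ioend`, `fake_finish_ioend()` and `fake_submit_ioend()` are hypothetical stand-ins.

```c
#include <stddef.h>

/* Hypothetical stand-in for one I/O completion record in a chain. */
struct fake_ioend {
	struct fake_ioend *next;
	int error;	/* error to report at completion, 0 if none */
	int submitted;	/* set when I/O would have been issued */
	int completed;	/* set when completion processing has run */
};

/* Stand-in for completion processing. */
static void fake_finish_ioend(struct fake_ioend *ioend)
{
	ioend->completed = 1;
}

/* Walk the chain. If building the chain failed (fail != 0), stamp each
 * ioend with the error and complete it directly instead of issuing I/O. */
void fake_submit_ioend(struct fake_ioend *head, int fail)
{
	struct fake_ioend *ioend, *next;

	for (ioend = head; ioend; ioend = next) {
		next = ioend->next;
		if (fail) {
			ioend->error = fail;
			fake_finish_ioend(ioend);
			continue;
		}
		ioend->submitted = 1;
	}
}
```

The point of the design is that pages already marked as under writeback are only ever released through the completion path, whether or not any I/O was issued.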
    • xfs: fix attr tree double split corruption · 42e2976f
      Dave Chinner authored
      In certain circumstances, a double split of an attribute tree is
      needed to insert or replace an attribute. In rare situations, this
      can go wrong, leaving the attribute tree corrupted. In this case,
      the attr being replaced is the last attr in a leaf node, and the
      replacement is larger so doesn't fit in the same leaf node.
      Assume the initial condition of a node format attribute btree
      with two leaves at index 1 and 2; call them L1 and L2.  The
      leaf L1 is completely full, there is not a single byte of free space
      in it. L2 is mostly empty.  The attribute being replaced - call it X
      - is the last attribute in L1.
      
      The way an attribute replace is executed is that the replacement
      attribute - call it Y - is first inserted into the tree, but has an
      INCOMPLETE flag set on it so that list traversals ignore it. Once
      this transaction is committed, a second transaction is run to
      atomically mark Y as COMPLETE and X as INCOMPLETE, so that a
      traversal will now find Y and skip X. Once that transaction is
      committed, attribute X is then removed.
      
      So, the initial condition is:
      
           +--------+     +--------+
           |   L1   |     |   L2   |
           | fwd: 2 |---->| fwd: 0 |
           | bwd: 0 |<----| bwd: 1 |
           | fsp: 0 |     | fsp: N |
           |--------|     |--------|
           | attr A |     | attr 1 |
           |--------|     |--------|
           | attr B |     | attr 2 |
           |--------|     |--------|
           ..........     ..........
           |--------|     |--------|
           | attr X |     | attr n |
           +--------+     +--------+
      
      So now we go to replace X, and see that L1:fsp = 0 - it is full so
      we can't insert Y in the same leaf. So we record the location of
      attribute X so we can track it for later use, then we split L1 into
      L1 and L3 and rebalance across the two leaves. We end with:
      
           +--------+     +--------+     +--------+
           |   L1   |     |   L3   |     |   L2   |
           | fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
           | bwd: 0 |<----| bwd: 1 |<----| bwd: 3 |
           | fsp: M |     | fsp: J |     | fsp: N |
           |--------|     |--------|     |--------|
           | attr A |     | attr X |     | attr 1 |
           |--------|     +--------+     |--------|
           | attr B |                    | attr 2 |
           |--------|                    |--------|
           ..........                    ..........
           |--------|                    |--------|
           | attr W |                    | attr n |
           +--------+                    +--------+
      
      And we track that the original attribute is now at L3:0.
      
      We then try to insert Y into L1 again, and find that there isn't
      enough room because the new attribute is larger than the old one.
      Hence we have to split again to make room for Y. We end up with
      this:
      
           +--------+     +--------+     +--------+     +--------+
           |   L1   |     |   L4   |     |   L3   |     |   L2   |
           | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
           | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
           | fsp: M |     | fsp: J |     | fsp: J |     | fsp: N |
           |--------|     |--------|     |--------|     |--------|
           | attr A |     | attr Y |     | attr X |     | attr 1 |
           |--------|     + INCOMP +     +--------+     |--------|
           | attr B |     +--------+                    | attr 2 |
           |--------|                                   |--------|
           ..........                                   ..........
           |--------|                                   |--------|
           | attr W |                                   | attr n |
           +--------+                                   +--------+
      
      And now we have the new (incomplete) attribute @ L4:0, and the
      original attribute at L3:0. At this point, the first transaction is
      committed, and we move to the flipping of the flags.
      
      This is where we are supposed to end up with this:
      
           +--------+     +--------+     +--------+     +--------+
           |   L1   |     |   L4   |     |   L3   |     |   L2   |
           | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
           | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
           | fsp: M |     | fsp: J |     | fsp: J |     | fsp: N |
           |--------|     |--------|     |--------|     |--------|
           | attr A |     | attr Y |     | attr X |     | attr 1 |
           |--------|     +--------+     + INCOMP +     |--------|
           | attr B |                    +--------+     | attr 2 |
           |--------|                                   |--------|
           ..........                                   ..........
           |--------|                                   |--------|
           | attr W |                                   | attr n |
           +--------+                                   +--------+
      
      But that doesn't happen properly - the attribute tracking indexes
      are not pointing to the right locations. What we end up with is both
      the old attribute to be removed pointing at L4:0 and the new
      attribute at L4:1.  On a debug kernel, this assert fails like so:
      
      XFS: Assertion failed: args->index2 < be16_to_cpu(leaf2->hdr.count), file: fs/xfs/xfs_attr_leaf.c, line: 2725
      
      because the new attribute location does not exist. On a production
      kernel, this goes unnoticed and the code proceeds ahead merrily and
      removes L4 because it thinks that is the block that is no longer
      needed. This leaves the hash index node pointing to entries
      L1, L4 and L2, but only blocks L1, L3 and L2 exist. Further, the
      leaf level sibling list is L1 <-> L4 <-> L2, but L4 is now free
      space, and so everything is busted. This corruption is caused by the
      removal of the old attribute triggering a join - it joins everything
      correctly but then frees the wrong block.
      
      xfs_repair will report something like:
      
      bad sibling back pointer for block 4 in attribute fork for inode 131
      problem with attribute contents in inode 131
      would clear attr fork
      bad nblocks 8 for inode 131, would reset to 3
      bad anextents 4 for inode 131, would reset to 0
      
      The problem lies in the assignment of the old/new blocks for
      tracking purposes when the double leaf split occurs. The first split
      tries to place the new attribute inside the current leaf (i.e.
      "inleaf == true") and moves the old attribute (X) to the new block.
      This sets up the old block/index to L1:X, and newly allocated
      block to L3:0. It then moves attr X to the new block and tries to
      insert attr Y at the old index. That fails, so it splits again.
      
      With the second split, the rebalance ends up placing the new attr in
      the second new block - L4:0 - and this is where the code goes wrong.
      What it does is set both the new and old block index to the
      second new block. Hence it inserts attr Y at the right place (L4:0)
      but overwrites the current location of the attr to replace that is
      held in the new block index (currently L3:0). It overwrites it with
      L4:1 - the index we later assert fail on.
      
      Hopefully this table will show this in a format that is a bit easier
      to understand:
      
      Split           old attr index          new attr index
                      vanilla  patched        vanilla  patched
      before 1st      L1:26    L1:26          N/A      N/A
      after 1st       L3:0     L3:0           L1:26    L1:26
      after 2nd       L4:0     L3:0           L4:1     L4:0
                      ^^^^                    ^^^^
                      wrong                   wrong
      
      The fix is surprisingly simple, for all this analysis - just stop
      the rebalance on the out-of-leaf case from overwriting the new attr
      index - it's already correct for the double split case.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
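The replace protocol described in the commit above (insert the new attribute as INCOMPLETE, then flip the flags using the tracked old and new locations) can be sketched with a flat array in place of the leaf blocks. This is a toy model under loud assumptions: `struct fake_attr`, `FAKE_ATTR_INCOMPLETE` and both functions are hypothetical, and a single array index stands in for a block:index pair such as L3:0.

```c
#include <stddef.h>
#include <string.h>

#define FAKE_ATTR_INCOMPLETE 0x1

/* Hypothetical stand-in for one attribute entry. */
struct fake_attr {
	const char *name;
	const char *value;
	unsigned int flags;
};

/* Return the visible value for name, skipping INCOMPLETE entries the
 * way list traversals are described as doing. */
const char *fake_attr_lookup(const struct fake_attr *attrs, size_t n,
			     const char *name)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (attrs[i].flags & FAKE_ATTR_INCOMPLETE)
			continue;
		if (strcmp(attrs[i].name, name) == 0)
			return attrs[i].value;
	}
	return NULL;
}

/* Second step of a replace: expose the new entry and hide the old one.
 * old_idx and new_idx play the role of the tracked old/new locations;
 * if either is stale, the wrong entries get flipped. */
void fake_attr_flipflags(struct fake_attr *attrs, size_t old_idx,
			 size_t new_idx)
{
	attrs[new_idx].flags &= ~FAKE_ATTR_INCOMPLETE;
	attrs[old_idx].flags |= FAKE_ATTR_INCOMPLETE;
}
```

The sketch shows why the tracked indexes matter: the flip acts blindly on whatever locations it is handed, which is how stale tracking leads to the wrong block being treated as removable.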
    • intel-iommu: Fix lookup in add device · 3da4af0a
      Alex Williamson authored
      We can't assume this device exists; fall back to the bridge itself.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Tested-by: Matthew Thode <prometheanfire@gentoo.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Joerg Roedel <joro@8bytes.org>
    • iommu/tegra-smmu.c: fix dentry reference leak in smmu_debugfs_stats_show(). · b334b648
      Cyril Roelandt authored
      Call to d_find_alias() needs a corresponding dput().
      Signed-off-by: Cyril Roelandt <tipecaml@gmail.com>
      Signed-off-by: Joerg Roedel <joro@8bytes.org>
    • iommu/amd: Update MAINTAINERS entry · e4110568
      Joerg Roedel authored
      I have no access to my AMD email address anymore. Update
      entry in MAINTAINERS to the new address.
      Signed-off-by: Joerg Roedel <joro@8bytes.org>
    • Linux 3.7-rc6 · f4a75d2e
      Linus Torvalds authored
    • Merge git://git.kernel.org/pub/scm/virt/kvm/kvm · 51844b0f
      Linus Torvalds authored
      Pull KVM fix from Marcelo Tosatti:
       "A correction for oops on module init with older Intel hosts."
      
      * git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: Fix invalid secondary exec controls in vmx_cpuid_update()
  6. 16 Nov, 2012 12 commits
    • Merge branch 'akpm' (Fixes from Andrew) · 0cad3ff4
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (12 patches)
        revert "mm: fix-up zone present pages"
        tmpfs: change final i_blocks BUG to WARNING
        tmpfs: fix shmem_getpage_gfp() VM_BUG_ON
        mm: highmem: don't treat PKMAP_ADDR(LAST_PKMAP) as a highmem address
        mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
        rapidio: fix kernel-doc warnings
        swapfile: fix name leak in swapoff
        memcg: fix hotplugged memory zone oops
        mips, arc: fix build failure
        memcg: oom: fix totalpages calculation for memory.swappiness==0
        mm: fix build warning for uninitialized value
        mm: add anon_vma_lock to validate_mm()
    • revert "mm: fix-up zone present pages" · 5576646f
      Andrew Morton authored
      Revert commit 7f1290f2 ("mm: fix-up zone present pages")
      
      That patch tried to fix an issue when calculating zone->present_pages,
      but it caused a regression on 32bit systems with HIGHMEM.  With that
      change, reset_zone_present_pages() resets all zone->present_pages to
      zero, and fixup_zone_present_pages() is called to recalculate
      zone->present_pages when the boot allocator frees core memory pages into
      the buddy allocator.  Because highmem pages are not freed by the bootmem
      allocator, all highmem zones' present_pages becomes zero.
      
      Various options for improving the situation are being discussed but for
      now, let's return to the 3.6 code.
      
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Tested-by: Chris Clayton <chris2553@googlemail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
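The regression described in the revert above can be modelled in a few lines: every zone's count is reset to zero, and only pages released by the boot allocator are credited back. This is a toy model, not the mm code. `struct fake_zone` and both functions are hypothetical, and the `is_highmem` flag stands in for "not freed through the bootmem path".

```c
#include <stddef.h>

/* Hypothetical stand-in for a memory zone. */
struct fake_zone {
	const char *name;
	unsigned long present_pages;
	unsigned long bootmem_pages;	/* pages the boot allocator will free */
	int is_highmem;			/* highmem is released by another path */
};

/* Step 1 of the reverted scheme: forget every zone's count. */
void fake_reset_present_pages(struct fake_zone *zones, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		zones[i].present_pages = 0;
}

/* Step 2: credit pages back as the boot allocator frees them. Highmem
 * zones are skipped because their pages never pass through here. */
void fake_free_all_bootmem(struct fake_zone *zones, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (zones[i].is_highmem)
			continue;
		zones[i].present_pages += zones[i].bootmem_pages;
	}
}
```

After both steps the lowmem zone has its count back while the highmem zone is left at zero, which is the symptom reported on 32-bit HIGHMEM systems.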
    • tmpfs: change final i_blocks BUG to WARNING · 0f3c42f5
      Hugh Dickins authored
      Under a particular load on one machine, I have hit shmem_evict_inode()'s
      BUG_ON(inode->i_blocks), enough times to narrow it down to a particular
      race between swapout and eviction.
      
      It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(),
      and the lack of coherent locking between mapping's nrpages and shmem's
      swapped count.  There's a window in shmem_writepage(), between lowering
      nrpages in shmem_delete_from_page_cache() and then raising swapped
      count, when the freed count appears to be +1 when it should be 0, and
      then the asymmetry stops it from being corrected with -1 before hitting
      the BUG.
      
      One answer is coherent locking: using tree_lock throughout, without
      info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on
      used_blocks makes that messier than expected.  Another answer may be a
      further effort to eliminate the weird shmem_recalc_inode() altogether,
      but previous attempts at that failed.
      
      So far undecided, but for now change the BUG_ON to WARN_ON: in usual
      circumstances it remains a useful consistency check.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
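The "if (freed > 0)" asymmetry mentioned in the commit above can be illustrated with a small counter model. This is a sketch under assumptions, not the shmem code: `struct fake_shmem` and `fake_recalc()` are hypothetical, and the assumed formula `freed = alloced - swapped - nrpages` is inferred from the description of the three counters.

```c
/* Hypothetical stand-in for the per-inode accounting. */
struct fake_shmem {
	long alloced;	/* pages charged to the inode */
	long swapped;	/* pages currently out on swap */
	long nrpages;	/* pages currently in the page cache */
};

/* One-sided correction: a positive "freed" is applied, a negative one
 * is silently ignored. Returns the computed value for inspection. */
long fake_recalc(struct fake_shmem *info)
{
	long freed = info->alloced - info->swapped - info->nrpages;

	if (freed > 0)
		info->alloced -= freed;
	return freed;
}
```

If a recalculation runs in the window after `nrpages` is lowered but before `swapped` is raised, it sees +1 and lowers `alloced`. Once `swapped` is raised, the next recalculation sees -1 and does nothing, so `alloced` stays one page short.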
    • tmpfs: fix shmem_getpage_gfp() VM_BUG_ON · 215c02bc
      Hugh Dickins authored
      Fuzzing with trinity hit the "impossible" VM_BUG_ON(error) (which Fedora
      has converted to WARNING) in shmem_getpage_gfp():
      
        WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
        Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
        Call Trace:
          warn_slowpath_common+0x7f/0xc0
          warn_slowpath_null+0x1a/0x20
          shmem_getpage_gfp+0xa5c/0xa70
          shmem_fault+0x4f/0xa0
          __do_fault+0x71/0x5c0
          handle_pte_fault+0x97/0xae0
          handle_mm_fault+0x289/0x350
          __do_page_fault+0x18e/0x530
          do_page_fault+0x2b/0x50
          page_fault+0x28/0x30
          tracesys+0xe1/0xe6
      
      Thanks to Johannes for pointing to truncation: free_swap_and_cache()
      only does a trylock on the page, so the page lock we've held since
      before confirming swap is not enough to protect against truncation.
      
      What cleanup is needed in this case? Just delete_from_swap_cache(),
      which takes care of the memcg uncharge.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Dave Jones <davej@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: highmem: don't treat PKMAP_ADDR(LAST_PKMAP) as a highmem address · 498c2280
      Will Deacon authored
      kmap_to_page returns the corresponding struct page for a virtual address
      of an arbitrary mapping.  This works by checking whether the address
      falls in the pkmap region and using the pkmap page tables instead of the
      linear mapping if appropriate.
      
      Unfortunately, the bounds checking means that PKMAP_ADDR(LAST_PKMAP) is
      incorrectly treated as a highmem address and we can end up walking off
      the end of pkmap_page_table and subsequently passing junk to pte_page.
      
      This patch fixes the bound check to stay within the pkmap tables.
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
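The off-by-one in the kmap_to_page commit above comes down to whether the upper bound of the pkmap window is inclusive or exclusive. Below is a minimal sketch of the two checks. It assumes the faulty check used an inclusive bound, and the `FAKE_` constants are made-up values for illustration rather than any architecture's real layout.

```c
/* Hypothetical layout constants, chosen only for illustration. */
#define FAKE_PAGE_SHIFT 12
#define FAKE_PKMAP_BASE 0xff800000UL
#define FAKE_LAST_PKMAP 512
#define FAKE_PKMAP_ADDR(nr) \
	(FAKE_PKMAP_BASE + ((unsigned long)(nr) << FAKE_PAGE_SHIFT))

/* Inclusive upper bound: also accepts the first address past the
 * window, which has no page table entry behind it. */
int fake_in_pkmap_inclusive(unsigned long addr)
{
	return addr >= FAKE_PKMAP_ADDR(0) &&
	       addr <= FAKE_PKMAP_ADDR(FAKE_LAST_PKMAP);
}

/* Exclusive upper bound: covers slots 0 .. LAST_PKMAP - 1 only. */
int fake_in_pkmap_exclusive(unsigned long addr)
{
	return addr >= FAKE_PKMAP_ADDR(0) &&
	       addr < FAKE_PKMAP_ADDR(FAKE_LAST_PKMAP);
}
```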
    • mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" · 96710098
      Mel Gorman authored
      Jiri Slaby reported the following:
      
      	(It's an effective revert of "mm: vmscan: scale number of pages
      	reclaimed by reclaim/compaction based on failures".) Given kswapd
      	had hours of runtime in ps/top output yesterday in the morning
      	and after the revert it's now 2 minutes in sum for the last 24h,
      	I would say, it's gone.
      
      The intention of the patch in question was to compensate for the loss of
      lumpy reclaim.  Part of the reason lumpy reclaim worked is because it
      aggressively reclaimed pages and this patch was meant to be a sane
      compromise.
      
      When compaction fails, it gets deferred and both compaction and
      reclaim/compaction are deferred to avoid excessive reclaim.  However, since
      commit c6543459 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken up
      each time and continues reclaiming which was not taken into account when
      the patch was developed.
      
      Attempts to address the problem ended up just changing the shape of the
      problem instead of fixing it.  The release window gets closer and while
      a THP allocation failing is not a major problem, kswapd chewing up a lot
      of CPU is.
      
      This patch reverts commit 83fde0f2 ("mm: vmscan: scale number of
      pages reclaimed by reclaim/compaction based on failures") and will be
      revisited in the future.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Zdenek Kabelac <zkabelac@redhat.com>
      Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rapidio: fix kernel-doc warnings · 2ca3cb50
      Randy Dunlap authored
      Fix rapidio kernel-doc warnings:
      
        Warning(drivers/rapidio/rio.c:415): No description found for parameter 'local'
        Warning(drivers/rapidio/rio.c:415): Excess function parameter 'lstart' description in 'rio_map_inb_region'
        Warning(include/linux/rio.h:290): No description found for parameter 'switches'
        Warning(include/linux/rio.h:290): No description found for parameter 'destid_table'
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Acked-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapfile: fix name leak in swapoff · f58b59c1
      Xiaotian Feng authored
      There's a name leak introduced by commit 91a27b2a ("vfs: define
      struct filename and have getname() return it").  Add the missing
      putname.
      
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
      Reviewed-by: Jeff Layton <jlayton@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix hotplugged memory zone oops · bea8c150
      Hugh Dickins authored
      When MEMCG is configured on (even when it's disabled by boot option),
      when adding or removing a page to/from its lru list, the zone pointer
      used for stats updates is nowadays taken from the struct lruvec.  (On
      many configurations, calculating zone from page is slower.)
      
      But we have no code to update all the lruvecs (per zone, per memcg) when
      a memory node is hotadded.  Here's an extract from the oops which
      results when running numactl to bind a program to a newly onlined node:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
        IP:  __mod_zone_page_state+0x9/0x60
        Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
        Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
        Call Trace:
          __pagevec_lru_add_fn+0xdf/0x140
          pagevec_lru_move_fn+0xb1/0x100
          __pagevec_lru_add+0x1c/0x30
          lru_add_drain_cpu+0xa3/0x130
          lru_add_drain+0x2f/0x40
         ...
      
      The natural solution might be to use a memcg callback whenever memory is
      hotadded; but that solution has not been scoped out, and it happens that
      we do have an easy location at which to update lruvec->zone.  The lruvec
      pointer is discovered either by mem_cgroup_zone_lruvec() or by
      mem_cgroup_page_lruvec(), and both of those do know the right zone.
      
      So check and set lruvec->zone in those; and remove the inadequate
      attempt to set lruvec->zone from lruvec_init(), which is called before
      NODE_DATA(node) has been allocated in such cases.
      
      Ah, there was one exception.  For no particularly good reason,
      mem_cgroup_force_empty_list() has its own code for deciding lruvec.
      Change it to use the standard mem_cgroup_zone_lruvec() and
      mem_cgroup_get_lru_size() too.  In fact it was already safe against such
      an oops (the lru lists in danger could only be empty), but we're better
      proofed against future changes this way.
      
      I've marked this for stable (3.6) since we introduced the problem in 3.5
      (now closed to stable); but I have no idea if this is the only fix
      needed to get memory hotadd working with memcg in 3.6, and received no
      answer when I enquired twice before.
      Reported-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
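The "check and set lruvec->zone" approach in the memcg commit above amounts to repairing a back-pointer lazily at lookup time, where the correct zone is already known. The sketch below shows the pattern only. `struct fake_zone`, `struct fake_lruvec` and `fake_zone_lruvec()` are hypothetical and much simpler than the real memcg lookup functions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct fake_zone {
	int id;
};

struct fake_lruvec {
	struct fake_zone *zone;	/* may be NULL or stale after hot-add */
};

/* Return the lruvec for a zone and fix its back-pointer on the way out.
 * The caller already knows the right zone, so no hotplug callback is
 * needed to keep the pointer current. */
struct fake_lruvec *fake_zone_lruvec(struct fake_zone *zone,
				     struct fake_lruvec *lruvec)
{
	if (lruvec->zone != zone)
		lruvec->zone = zone;
	return lruvec;
}
```

The comparison before the store keeps the common path read-only, so an already correct pointer is not rewritten on every lookup.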
    • mips, arc: fix build failure · 18f69427
      David Rientjes authored
      Using a cross-compiler to fix another issue, the following build error
      occurred for mips defconfig:
      
        arch/mips/fw/arc/misc.c: In function 'ArcHalt':
        arch/mips/fw/arc/misc.c:25:2: error: implicit declaration of function 'local_irq_disable'
      
      Fix it up by including irqflags.h.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: oom: fix totalpages calculation for memory.swappiness==0 · 9a5a8f19
      Michal Hocko authored
      oom_badness() takes a totalpages argument which says how many pages are
      available and it uses it as a base for the score calculation.  The value
      is calculated by mem_cgroup_get_limit which considers both limit and
      total_swap_pages (resp.  memsw portion of it).
      
      This is usually correct but since fe35004f ("mm: avoid swapping out
      with swappiness==0") we do not swap when swappiness is 0 which means
      that we cannot really use up all the totalpages pages.  This in turn
      confuses oom score calculation if the memcg limit is much smaller than
      the available swap because the used memory (capped by the limit) is
      negligible compared to totalpages so the resulting score is too small
      if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
      A wrong process might be selected as a result.
      
      The problem can be worked around by checking mem_cgroup_swappiness==0
      and not considering swap at all in such a case.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
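The workaround in the memcg OOM commit above, leaving swap out of the total when swappiness is zero, can be sketched as a pure function. This is an illustration, not the kernel's `mem_cgroup_get_limit()`: `fake_memcg_limit()` and its parameters are hypothetical, all values share one unit, and capping by the memsw limit is an assumption based on the "memsw portion" remark.

```c
#include <stdint.h>

/* Hypothetical helper: how much memory an OOM score should treat as
 * available to the group. */
uint64_t fake_memcg_limit(uint64_t mem_limit, uint64_t memsw_limit,
			  uint64_t total_swap, int swappiness)
{
	uint64_t limit = mem_limit;

	/* With swappiness == 0 nothing is swapped out, so swap space
	 * cannot actually be used and must not inflate the total. */
	if (swappiness) {
		limit += total_swap;
		if (memsw_limit < limit)
			limit = memsw_limit;
	}
	return limit;
}
```

For example, with a limit of 100, swap of 5000 and no effective memsw cap, the function returns 100 at swappiness 0 and 5100 at swappiness 60.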
    • mm: fix build warning for uninitialized value · 1756954c
      David Rientjes authored
      do_wp_page() sets mmun_called if mmun_start and mmun_end were
      initialized and, if so, may call mmu_notifier_invalidate_range_end()
      with these values.  This doesn't prevent gcc from emitting a build
      warning though:
      
        mm/memory.c: In function `do_wp_page':
        mm/memory.c:2530: warning: `mmun_start' may be used uninitialized in this function
        mm/memory.c:2531: warning: `mmun_end' may be used uninitialized in this function
      
      It's much easier to initialize the variables to impossible values and do
      a simple comparison to determine if they were initialized to remove the
      bool entirely.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
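The technique in the do_wp_page() commit above, replacing a separate "was it initialized" flag with sentinel values and a comparison, can be sketched as follows. This is a simplified model, not the mm code. `fake_wp_page()` and `fake_invalidate_range_end()` are hypothetical, and the empty range (start equal to end) is assumed to be the "impossible" value.

```c
/* Counts how often the stand-in notifier ran, for inspection. */
static unsigned long fake_invalidate_end_calls;

/* Hypothetical stand-in for the range-end notifier call. */
static void fake_invalidate_range_end(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	fake_invalidate_end_calls++;
}

/* Returns the running count of notifier calls. */
unsigned long fake_wp_page(int need_notifier, unsigned long addr,
			   unsigned long len)
{
	unsigned long mmun_start = 0;	/* empty range: start == end */
	unsigned long mmun_end = 0;

	if (need_notifier) {
		mmun_start = addr;
		mmun_end = addr + len;
	}

	/* ... the actual work would happen here ... */

	/* A non-empty range can only come from the branch above, so the
	 * comparison replaces the separate bool. */
	if (mmun_end > mmun_start)
		fake_invalidate_range_end(mmun_start, mmun_end);
	return fake_invalidate_end_calls;
}
```

Because both variables are assigned at declaration, the compiler has nothing left to warn about, and the comparison carries the same information the flag did.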