1. 24 Feb, 2013 40 commits
    • msi-wmi: Introduced quirk_last_pressed · fedda8e7
      Maxim Mikityanskiy authored
      Introduced quirk_last_pressed variable that would indicate if
      last_pressed is used or not. Also converted last_pressed to simple
      variable in order to allow keymap to be non-contiguous.
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
    • msi-wmi: Make keys and backlight independent · da850628
      Maxim Mikityanskiy authored
      Introduced function msi_wmi_backlight_setup() that initializes backlight
      device. Made driver load and work if only one WMI (only for hotkeys or
      only for backlight) is present.
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
    • msi-wmi: Use enums for scancodes · b0d3bb53
      Maxim Mikityanskiy authored
      Use enums for consecutive scancodes, rename key names from MSI_WMI_* to
      MSI_KEY_* and use tabs for whitespace in msi_wmi_keymap.
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Acked-by: Lee, Chun-Yi <jlee@suse.com>
      Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
    • msi-wmi: Avoid repeating constants · dd2b0251
      Maxim Mikityanskiy authored
      Use UUID defines in MODULE_ALIAS strings to avoid repeating strings.
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Acked-by: Lee, Chun-Yi <jlee@suse.com>
      Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
    • msi-wmi: Fix memory leak · 51c94491
      Maxim Mikityanskiy authored
      Fix memory leak - don't forget to kfree ACPI object when returning from
      msi_wmi_notify() after suppressing key event.
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Acked-by: Anisse Astier <anisse@astier.eu>
      Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
      Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
    • msi-laptop: Disable brightness control for new EC · 03696e51
      Maxim Mikityanskiy authored
      It seems that the existing brightness control works only for old EC
      models. On newer ones, auto_brightness access always times out and
      lcd_level always shows 0. So disable brightness control for new EC
      models. It works fine with the ACPI video driver anyway.
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
      Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
    • msi-laptop: Add missing ABI documentation · cdeaf386
      Maxim Mikityanskiy authored
      Add ABI documentation for all sysfs files exposed by msi-laptop driver.
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
      Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
    • msi-laptop: Add MSI Wind U90/U100 support · 0de6575a
      Maxim Mikityanskiy authored
      Add MSI Wind U90/U100 to DMI table and add some missing EC features
      support such as basic fan control, turbo and ECO modes and touchpad
      state. Tested on MSI Wind U100.
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
      Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
    • msi-laptop: merge quirk tables to one · 0816392b
      Lee, Chun-Yi authored
      This patch introduces a quirk_entry struct and merges all quirk
      tables into msi_dmi_table, which makes it easier to set different
      quirk attributes for different machines.
      Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
      
      Changed this patch so that it could be applied before MSI Wind U100
      support patch. Changed rfkill logic for ec_read_only quirk support.
      Removed delays if ec_delay = false.
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Acked-by: Lee, Chun-Yi <jlee@suse.com>
      Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
    • msi-laptop: Work around gcc warning · 1b6517a0
      Maxim Mikityanskiy authored
      Assign initial value to variable in order to prevent gcc warning about
      uninitialized variable.
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
      Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
    • msi-laptop: Use proper return codes instead of -1 · 27eb9e7f
      Maxim Mikityanskiy authored
      Use proper function return codes instead of -1
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
      Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
    • asus-wmi: update wlan LED through rfkill led trigger · 6cae06e6
      AceLan Kao authored
      For machines with wapf=4, the BIOS won't update the wireless LED,
      since wapf=4 means a user application will take charge of the wifi and bt.
      So, we have to update the wlan LED status explicitly.
      
      But I found another wireless LED bug on Launchpad that is not
      covered by the wapf=4 quirk.
      So, it might be better to set the wireless LED status explicitly for all
      machines.
      
      BugLink: https://launchpad.net/bugs/901105
      Signed-off-by: AceLan Kao <acelan.kao@canonical.com>
      Signed-off-by: Matthew Garrett <mjg@redhat.com>
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal · 9e2d59ad
      Linus Torvalds authored
      Pull signal handling cleanups from Al Viro:
       "This is the first pile; another one will come a bit later and will
        contain SYSCALL_DEFINE-related patches.
      
         - a bunch of signal-related syscalls (both native and compat)
           unified.
      
         - a bunch of compat syscalls switched to COMPAT_SYSCALL_DEFINE
           (fixing several potential problems with missing argument
           validation, while we are at it)
      
         - a lot of now-pointless wrappers killed
      
         - a couple of architectures (cris and hexagon) forgot to save
           altstack settings into sigframe, even though they used the
           (uninitialized) values in sigreturn; fixed.
      
         - microblaze fixes for delivery of multiple signals arriving at once
      
         - saner set of helpers for signal delivery introduced, several
           architectures switched to using those."
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (143 commits)
        x86: convert to ksignal
        sparc: convert to ksignal
        arm: switch to struct ksignal * passing
        alpha: pass k_sigaction and siginfo_t using ksignal pointer
        burying unused conditionals
        make do_sigaltstack() static
        arm64: switch to generic old sigaction() (compat-only)
        arm64: switch to generic compat rt_sigaction()
        arm64: switch compat to generic old sigsuspend
        arm64: switch to generic compat rt_sigqueueinfo()
        arm64: switch to generic compat rt_sigpending()
        arm64: switch to generic compat rt_sigprocmask()
        arm64: switch to generic sigaltstack
        sparc: switch to generic old sigsuspend
        sparc: COMPAT_SYSCALL_DEFINE does all sign-extension as well as SYSCALL_DEFINE
        sparc: kill sign-extending wrappers for native syscalls
        kill sparc32_open()
        sparc: switch to use of generic old sigaction
        sparc: switch sys_compat_rt_sigaction() to COMPAT_SYSCALL_DEFINE
        mips: switch to generic sys_fork() and sys_clone()
        ...
    • Merge branch 'akpm' (more incoming from Andrew) · 5ce1a70e
      Linus Torvalds authored
      Merge second patch-bomb from Andrew Morton:
      
       - A little DM fix
      
       - the MM queue
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (154 commits)
        ksm: allocate roots when needed
        mm: cleanup "swapcache" in do_swap_page
        mm,ksm: swapoff might need to copy
        mm,ksm: FOLL_MIGRATION do migration_entry_wait
        ksm: shrink 32-bit rmap_item back to 32 bytes
        ksm: treat unstable nid like in stable tree
        ksm: add some comments
        tmpfs: fix mempolicy object leaks
        tmpfs: fix use-after-free of mempolicy object
        mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
        mm: export mmu notifier invalidates
        mm: accelerate mm_populate() treatment of THP pages
        mm: use long type for page counts in mm_populate() and get_user_pages()
        mm: accurately document nr_free_*_pages functions with code comments
        HWPOISON: change order of error_states[]'s elements
        HWPOISON: fix misjudgement of page_action() for errors on mlocked pages
        memcg: stop warning on memcg_propagate_kmem
        net: change type of virtio_chan->p9_max_pages
        vmscan: change type of vm_total_pages to unsigned long
        fs/nfsd: change type of max_delegations, nfsd_drc_max_mem and nfsd_drc_mem_used
        ...
    • ksm: allocate roots when needed · ef53d16c
      Hugh Dickins authored
      It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically
      allocated, particularly when very few users will ever actually tune
      merge_across_nodes 0 to use more than 1+1 of those trees.  Not a big
      deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity.
      
      Start off with 1+1 statically allocated, then if merge_across_nodes is
      ever tuned, allocate for nr_node_ids+nr_node_ids.  Do not attempt to
      free up the extra if it's tuned back, that would be a waste of effort.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cleanup "swapcache" in do_swap_page · 56f31801
      Hugh Dickins authored
      I dislike the way in which "swapcache" gets used in do_swap_page():
      there is always a page from swapcache there (even if maybe uncached by
      the time we lock it), but tests are made according to "swapcache".
      Rework that with "page != swapcache", as has been done in unuse_pte().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,ksm: swapoff might need to copy · 9e16b7fb
      Hugh Dickins authored
      Before establishing that KSM page migration was the cause of my
      WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the
      lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which in
      many respects is equivalent to faulting in a page.
      
      In fact I've never caught that as the cause: but in theory it does at
      least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(), to
      avoid bringing a KSM page back in when it's not supposed to be.
      
      I intended to copy how it's done in do_swap_page(), but have a strong
      aversion to how "swapcache" ends up being used there: rework it with
      "page != swapcache".
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,ksm: FOLL_MIGRATION do migration_entry_wait · 5117b3b8
      Hugh Dickins authored
      In "ksm: remove old stable nodes more thoroughly" I said that I'd never
      seen its WARN_ON_ONCE(page_mapped(page)).  True at the time of writing,
      but it soon appeared once I tried fuller tests on the whole series.
      
      It turned out to be due to the KSM page migration itself:
      unmerge_and_remove_all_rmap_items() failed to locate and replace all the KSM pages,
      because of that hiatus in page migration when old pte has been replaced
      by migration entry, but not yet by new pte.  follow_page() finds no page
      at that instant, but a KSM page reappears shortly after, without a
      fault.
      
      Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
      for KSM's break_cow().  I'd have preferred to avoid another flag, and do
      it every time, in case someone else makes the same easy mistake; but did
      not find another transgressor (the common get_user_pages() is of course
      safe), and cannot be sure that every follow_page() caller is prepared to
      sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
      already sleep there, since anon_vma locking was changed to mutex, but
      maybe that's somehow excluded.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ksm: shrink 32-bit rmap_item back to 32 bytes · bc56620b
      Hugh Dickins authored
      Think of struct rmap_item as an extension of struct page (restricted to
      MADV_MERGEABLE areas): there may be a lot of them, we need to keep them
      small, especially on 32-bit architectures of limited lowmem.
      
      Siting "int nid" after "unsigned int checksum" works nicely on 64-bit,
      making no change to its 64-byte struct rmap_item; but bloats the 32-bit
      struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which
      rounds up to 40 bytes once allocated from slab.  We'd better avoid that.
      
      Hey, I only just remembered that the anon_vma pointer in struct
      rmap_item has no purpose until the rmap_item is hung from a stable tree
      node (which has its own nid field); and rmap_item's nid field has no
      purpose other than to say which tree root to tell rb_erase() when unlinking from an
      unstable tree.
      
      Double them up in a union.  There's just one place where we set anon_vma
      early (when we already hold mmap_sem): now we must remove tree_rmap_item
      from its unstable tree there, before overwriting nid.  No need to
      spatter BUG()s around: we'd be seeing oopses if this were wrong.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ksm: treat unstable nid like in stable tree · b599cbdf
      Hugh Dickins authored
      An inconsistency emerged in reviewing the NUMA node changes to KSM: when
      meeting a page from the wrong NUMA node in a stable tree, we say that
      it's okay for comparisons, but not as a leaf for merging; whereas when
      meeting a page from the wrong NUMA node in an unstable tree, we bail out
      immediately.
      
      Now, it might be that a wrong NUMA node in an unstable tree is more
      likely to correlate with instability (different content, with rbnode
      now misplaced) than page migration; but even so, we are accustomed to
      instability in the unstable tree.
      
      Without strong evidence for which strategy is generally better, I'd
      rather be consistent with what's done in the stable tree: accept a page
      from the wrong NUMA node for comparison, but not as a leaf for merging.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ksm: add some comments · 8fdb3dbf
      Hugh Dickins authored
      Added slightly more detail to the Documentation of merge_across_nodes, a
      few comments in areas indicated by review, and renamed get_ksm_page()'s
      argument from "locked" to "lock_it".  No functional change.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: fix mempolicy object leaks · 49cd0a5c
      Greg Thelen authored
      Fix several mempolicy leaks in the tmpfs mount logic.  These leaks are
      slow - on the order of one object leaked per mount attempt.
      
      Leak 1 (umount doesn't free mpol allocated in mount):
          while true; do
              mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt
              umount /mnt
          done
      
      Leak 2 (errors parsing remount options will leak mpol):
          mount -t tmpfs -o size=100M nodev /mnt
          while true; do
              mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null
          done
          umount /mnt
      
      Leak 3 (multiple mpol per mount leak mpol):
          while true; do
              mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt
              umount /mnt
          done
      
      This patch fixes all of the above.  I could have broken the patch into
      three pieces but it seemed easier to review as one.
      
      [akpm@linux-foundation.org: fix handling of mpol_parse_str() errors, per Hugh]
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: fix use-after-free of mempolicy object · 5f00110f
      Greg Thelen authored
      The tmpfs remount logic preserves filesystem mempolicy if the mpol=M
      option is not specified in the remount request.  A new policy can be
      specified if mpol=M is given.
      
      Before this patch remounting an mpol bound tmpfs without specifying
      mpol= mount option in the remount request would set the filesystem's
      mempolicy object to a freed mempolicy object.
      
      To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run:
          # mkdir /tmp/x
      
          # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x
      
          # grep /tmp/x /proc/mounts
          nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0
      
          # mount -o remount,size=200M nodev /tmp/x
      
          # grep /tmp/x /proc/mounts
          nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0
              # note ? garbage in mpol=... output above
      
          # dd if=/dev/zero of=/tmp/x/f count=1
              # panic here
      
      Panic:
          BUG: unable to handle kernel NULL pointer dereference at           (null)
          IP: [<          (null)>]           (null)
          [...]
          Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
          Call Trace:
            mpol_shared_policy_init+0xa5/0x160
            shmem_get_inode+0x209/0x270
            shmem_mknod+0x3e/0xf0
            shmem_create+0x18/0x20
            vfs_create+0xb5/0x130
            do_last+0x9a1/0xea0
            path_openat+0xb3/0x4d0
            do_filp_open+0x42/0xa0
            do_sys_open+0xfe/0x1e0
            compat_sys_open+0x1b/0x20
            cstar_dispatch+0x7/0x1f
      
      Non-debug kernels will not crash immediately because referencing the
      dangling mpol will not cause a fault.  Instead the filesystem will
      reference a freed mempolicy object, which will cause unpredictable
      behavior.
      
      The problem boils down to a dropped mpol reference below if
      shmem_parse_options() does not allocate a new mpol:
      
          config = *sbinfo
          shmem_parse_options(data, &config, true)
          mpol_put(sbinfo->mpol)
          sbinfo->mpol = config.mpol  /* BUG: saves unreferenced mpol */
      
      This patch avoids the crash by not releasing the mempolicy if
      shmem_parse_options() doesn't create a new mpol.
      
      How far back does this issue go? I see it in both 2.6.36 and 3.3.  I did
      not look back further.
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages · 67d46b29
      Mel Gorman authored
      Rob van der Heij reported the following (paraphrased) on private mail.
      
      	The scenario is that I want to avoid backups to fill up the page
      	cache and purge stuff that is more likely to be used again (this is
      	with s390x Linux on z/VM, so I don't give it as much memory that
      	we don't care anymore). So I have something with LD_PRELOAD that
      	intercepts the close() call (from tar, in this case) and issues
      	a posix_fadvise() just before closing the file.
      
      	This mostly works, except for small files (less than 14 pages)
      	that remain in page cache after the fact.
      
      Unfortunately Rob has not had a chance to test this exact patch but the
      test program below should be reproducing the problem he described.
      
      The issue is the per-cpu pagevecs for LRU additions.  If the pages are
      added by one CPU but fadvise() is called on another then the pages
      remain resident as the invalidate_mapping_pages() only drains the local
      pagevecs via its call to pagevec_release().  The user-visible effect is
      that a program that uses fadvise() properly is not obeyed.
      
      A possible fix for this is to put the necessary smarts into
      invalidate_mapping_pages() to globally drain the LRU pagevecs if a
      pagevec page could not be discarded.  The downside with this is that an
      inode cache shrink would send a global IPI and memory pressure
      potentially causing global IPI storms is very undesirable.
      
      Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
      check if invalidate_mapping_pages() discarded all the requested pages.
      If a subset of pages are discarded it drains the LRU pagevecs and tries
      again.  If the second attempt fails, it assumes it is due to the pages
      being mapped, locked or dirty and does not care.  With this patch, an
      application using fadvise() correctly will be obeyed but there is a
      downside that a malicious application can force the kernel to send
      global IPIs and increase overhead.
      
      If accepted, I would like this to be considered as a -stable candidate.
      It's not an urgent issue but it's a system call that is not working as
      advertised which is weak.
      
      The following test program demonstrates the problem.  It should never
      report that pages are still resident but will without this patch.  It
      assumes that CPU 0 and 1 exist.
      
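      #define _GNU_SOURCE	/* for cpu_set_t, CPU_ZERO(), CPU_SET(), sched_setaffinity() */
      #include <fcntl.h>
      #include <sched.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/mman.h>
      
      /* Assumed value: none is given here; the report mentions files under 14 pages */
      #define FILESIZE_PAGES 10
      
      int main() {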
      	int fd;
      	int pagesize = getpagesize();
      	ssize_t written = 0, expected;
      	char *buf;
      	unsigned char *vec;
      	int resident, i;
      	cpu_set_t set;
      
      	/* Prepare a buffer for writing */
      	expected = FILESIZE_PAGES * pagesize;
      	buf = malloc(expected + 1);
      	if (buf == NULL) {
      		printf("ENOMEM\n");
      		exit(EXIT_FAILURE);
      	}
      	buf[expected] = 0;
      	memset(buf, 'a', expected);
      
      	/* Prepare the mincore vec */
      	vec = malloc(FILESIZE_PAGES);
      	if (vec == NULL) {
      		printf("ENOMEM\n");
      		exit(EXIT_FAILURE);
      	}
      
      	/* Bind ourselves to CPU 0 */
      	CPU_ZERO(&set);
      	CPU_SET(0, &set);
      	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
      		perror("sched_setaffinity");
      		exit(EXIT_FAILURE);
      	}
      
      	/* open file, unlink and write buffer */
      	fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR, 0600);
      	if (fd == -1) {
      		perror("open");
      		exit(EXIT_FAILURE);
      	}
      	unlink("fadvise-test-file");
      	while (written < expected) {
      		ssize_t this_write;
      		this_write = write(fd, buf + written, expected - written);
      
      		if (this_write == -1) {
      			perror("write");
      			exit(EXIT_FAILURE);
      		}
      
      		written += this_write;
      	}
      	free(buf);
      
      	/*
      	 * Force ourselves to another CPU. If fadvise only flushes the local
      	 * CPUs pagevecs then the fadvise will fail to discard all file pages
      	 */
      	CPU_ZERO(&set);
      	CPU_SET(1, &set);
      	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
      		perror("sched_setaffinity");
      		exit(EXIT_FAILURE);
      	}
      
      	/* sync and fadvise to discard the page cache */
      	fsync(fd);
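      	/* posix_fadvise() returns an error number rather than setting errno */
      	i = posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED);
      	if (i != 0) {
      		fprintf(stderr, "posix_fadvise: %s\n", strerror(i));
      		exit(EXIT_FAILURE);
      	}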
      
      	/* map the file and use mincore to see which parts of it are resident */
      	buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
      	if (buf == MAP_FAILED) {
      		perror("mmap");
      		exit(EXIT_FAILURE);
      	}
      	if (mincore(buf, expected, vec) == -1) {
      		perror("mincore");
      		exit(EXIT_FAILURE);
      	}
      
      	/* Check residency */
      	for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
      		if (vec[i])
      			resident++;
      	}
      	if (resident != 0) {
      		printf("Nr unexpected pages resident: %d\n", resident);
      		exit(EXIT_FAILURE);
      	}
      
      	munmap(buf, expected);
      	close(fd);
      	free(vec);
      	exit(EXIT_SUCCESS);
      }
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Rob van der Heij <rvdheij@gmail.com>
      Tested-by: Rob van der Heij <rvdheij@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: export mmu notifier invalidates · fa794199
      Cliff Wickman authored
      We at SGI have a need to address some very high physical address ranges
      with our GRU (global reference unit), sometimes across partitioned
      machine boundaries and sometimes with larger addresses than the cpu
      supports.  We do this with the aid of our own 'extended vma' module
      which mimics the vma.  When something (either unmap or exit) frees an
      'extended vma' we use the mmu notifiers to clean them up.
      
      We had been able to mimic the functions
      __mmu_notifier_invalidate_range_start() and
      __mmu_notifier_invalidate_range_end() by locking the per-mm lock and
      walking the per-mm notifier list.  But with the change to a global srcu
      lock (static in mmu_notifier.c) we can no longer do that.  Our module has
      no access to that lock.
      
      So we request that these two functions be exported.
      Signed-off-by: Cliff Wickman <cpw@sgi.com>
      Acked-by: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: accelerate mm_populate() treatment of THP pages · 240aadee
      Michel Lespinasse authored
      This change adds a follow_page_mask function which is equivalent to
      follow_page, but with an extra page_mask argument.
      
      follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters
      a THP page, and to 0 in other cases.
      
      __get_user_pages() makes use of this in order to accelerate populating
      THP ranges - that is, when both the pages and vmas arrays are NULL, we
      don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and
      we also avoid taking mm->page_table_lock that many times).
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      240aadee
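The stride logic described in the follow_page_mask commit above can be modelled in userspace. The following is a minimal sketch, not the kernel code itself: the helper names `page_step` and `count_iterations` are hypothetical, and the constants assume 4 KiB base pages with 2 MiB THP pages (HPAGE_PMD_NR = 512, as on x86-64). It shows how a page_mask of HPAGE_PMD_NR - 1 lets a walk jump to the end of the current THP page instead of stepping one base page at a time.

```c
#include <stddef.h>

#define PAGE_SHIFT   12      /* assumption: 4 KiB base pages */
#define HPAGE_PMD_NR 512UL   /* assumption: 2 MiB THP pages */

/*
 * Number of base pages to advance from addr, given the page_mask that
 * was reported for it: HPAGE_PMD_NR - 1 for a THP page, 0 otherwise.
 * For a THP page this is the distance to the end of that huge page.
 */
static unsigned long page_step(unsigned long addr, unsigned long page_mask)
{
	return 1 + (~(addr >> PAGE_SHIFT) & page_mask);
}

/*
 * Count the loop iterations needed to cover nr_pages base pages starting
 * at addr, assuming every page in the range reports the same page_mask.
 */
static unsigned long count_iterations(unsigned long addr,
				      unsigned long nr_pages,
				      unsigned long page_mask)
{
	unsigned long iters = 0;

	while (nr_pages) {
		unsigned long step = page_step(addr, page_mask);

		if (step > nr_pages)
			step = nr_pages;	/* do not run past the range */
		addr += step << PAGE_SHIFT;
		nr_pages -= step;
		iters++;
	}
	return iters;
}
```

Under these assumptions, covering two aligned THP pages (1024 base pages) takes two iterations with page_mask = HPAGE_PMD_NR - 1, against 1024 iterations with page_mask = 0.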
    • Michel Lespinasse's avatar
      mm: use long type for page counts in mm_populate() and get_user_pages() · 28a35716
      Michel Lespinasse authored
      Use long type for page counts in mm_populate() so as to avoid integer
      overflow when running the following test code:
      
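      #include <stdio.h>
      #include <sys/mman.h>

      int main(void) {
        /* Reserve a 16 TiB (0x100000000000 byte) anonymous mapping. */
        void *p = mmap(NULL, 0x100000000000, PROT_READ,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        printf("p: %p\n", p);
        /* Lock all current mappings, which populates the range above. */
        mlockall(MCL_CURRENT);
        printf("done\n");
        return 0;
      }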
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28a35716
    • Zhang Yanfei's avatar
      mm: accurately document nr_free_*_pages functions with code comments · e0fb5815
      Zhang Yanfei authored
      nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages()
      are badly named, so document them accurately with code comments to guard
      against misuse.
      
      [akpm@linux-foundation.org: tweak comments]
      Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0fb5815
    • Naoya Horiguchi's avatar
      HWPOISON: change order of error_states[]'s elements · 5f4b9fc5
      Naoya Horiguchi authored
      error_states[] has two separate states, "unevictable LRU page" and
      "mlocked LRU page", and the former currently has the higher priority.
      As a result the latter is rarely chosen, because pages with PageMlocked
      set very likely also have PG_unevictable set.  On the other hand,
      PG_unevictable without PageMlocked is common for ramfs or SHM_LOCKed
      shared memory, so reversing the priority of these two states helps us
      distinguish them clearly.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f4b9fc5
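To illustrate why the order of the entries matters in the error_states[] commit above, here is a simplified model of a first-match state table. It is a sketch under stated assumptions: the flag bits, the `page_kind` enum, the `old_order`/`new_order` tables and the `classify` helper are all hypothetical stand-ins, not the kernel's definitions. The only property it relies on is the one the commit message describes: the table is scanned in order, and the first matching entry wins.

```c
/* Hypothetical flag bits standing in for PG_lru, PG_unevictable, PG_mlocked. */
#define FLAG_LRU          (1UL << 0)
#define FLAG_UNEVICTABLE  (1UL << 1)
#define FLAG_MLOCKED      (1UL << 2)

enum page_kind { KIND_UNKNOWN, KIND_UNEVICTABLE_LRU, KIND_MLOCKED_LRU };

struct state {
	unsigned long mask;	/* bits to look at */
	unsigned long res;	/* required value of those bits */
	enum page_kind kind;
};

/* Old order: "unevictable LRU page" is tested first. */
static const struct state old_order[] = {
	{ FLAG_LRU | FLAG_UNEVICTABLE, FLAG_LRU | FLAG_UNEVICTABLE, KIND_UNEVICTABLE_LRU },
	{ FLAG_LRU | FLAG_MLOCKED,     FLAG_LRU | FLAG_MLOCKED,     KIND_MLOCKED_LRU },
	{ 0, 0, KIND_UNKNOWN },	/* catch-all, always matches */
};

/* New order: "mlocked LRU page" is tested first. */
static const struct state new_order[] = {
	{ FLAG_LRU | FLAG_MLOCKED,     FLAG_LRU | FLAG_MLOCKED,     KIND_MLOCKED_LRU },
	{ FLAG_LRU | FLAG_UNEVICTABLE, FLAG_LRU | FLAG_UNEVICTABLE, KIND_UNEVICTABLE_LRU },
	{ 0, 0, KIND_UNKNOWN },	/* catch-all, always matches */
};

/* Return the kind of the first table entry whose mask/res pair matches. */
static enum page_kind classify(const struct state *table, unsigned long flags)
{
	const struct state *ps;

	for (ps = table; ; ps++)
		if ((flags & ps->mask) == ps->res)
			return ps->kind;
}
```

In this model a page with all three flags set is reported as unevictable under `old_order` and as mlocked under `new_order`, while a page that is unevictable but not mlocked is reported as unevictable in both.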
    • Naoya Horiguchi's avatar
      HWPOISON: fix misjudgement of page_action() for errors on mlocked pages · 524fca1e
      Naoya Horiguchi authored
      memory_failure() can't handle memory errors on mlocked pages correctly,
      because page_action() judges such errors as ones on "unknown pages"
      instead of ones on "unevictable LRU page" or "mlocked LRU page".  To
      determine the page state, page_action() checks the page flags at the
      time of the judgement, but those flags are not the same as the ones
      seen just after memory_failure() is called, because memory_failure()
      unmaps the error page before calling page_action().  This unmapping
      changes the page state; in particular, page_remove_rmap() (called from
      try_to_unmap_one()) clears PG_mlocked, so page_action() can't catch
      mlocked pages after that.
      
      With this patch, we store the page flags of the error page before the
      unmap, and (only) if the first check, using the page flags at that
      time, decides that the error page is unknown, we do a second check with
      the stored page flags.  This implementation doesn't change error
      handling for the page types for which the first check can determine the
      page state correctly.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      524fca1e
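The two-step check described in the page_action() commit above can be sketched as follows. This is a simplified userspace model, not the kernel implementation: the flag bits, the `page_kind` enum, the `states` table and the helpers `match` and `classify_two_pass` are hypothetical names chosen for illustration. The model only captures the control flow from the commit message: classify with the current flags first, and fall back to the flags saved before the unmap only when the first pass yields "unknown".

```c
/* Hypothetical flag bits standing in for PG_lru, PG_unevictable, PG_mlocked. */
#define FLAG_LRU          (1UL << 0)
#define FLAG_UNEVICTABLE  (1UL << 1)
#define FLAG_MLOCKED      (1UL << 2)

enum page_kind { KIND_UNKNOWN, KIND_UNEVICTABLE_LRU, KIND_MLOCKED_LRU };

struct state {
	unsigned long mask;	/* bits to look at */
	unsigned long res;	/* required value of those bits */
	enum page_kind kind;
};

static const struct state states[] = {
	{ FLAG_LRU | FLAG_MLOCKED,     FLAG_LRU | FLAG_MLOCKED,     KIND_MLOCKED_LRU },
	{ FLAG_LRU | FLAG_UNEVICTABLE, FLAG_LRU | FLAG_UNEVICTABLE, KIND_UNEVICTABLE_LRU },
	{ 0, 0, KIND_UNKNOWN },	/* catch-all, always matches */
};

/* First-match lookup over the table. */
static enum page_kind match(unsigned long flags)
{
	const struct state *ps;

	for (ps = states; ; ps++)
		if ((flags & ps->mask) == ps->res)
			return ps->kind;
}

/*
 * First pass: flags as they are at judgement time (after the unmap).
 * Second pass: flags saved before the unmap, consulted only when the
 * first pass could not determine the page state.
 */
static enum page_kind classify_two_pass(unsigned long current_flags,
					unsigned long saved_flags)
{
	enum page_kind kind = match(current_flags);

	if (kind == KIND_UNKNOWN)
		kind = match(saved_flags);
	return kind;
}
```

Because the saved flags are used only as a fallback, any page the first pass already classifies keeps its classification, which matches the commit's statement that handling is unchanged for those page types.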
    • Hugh Dickins's avatar
      memcg: stop warning on memcg_propagate_kmem · 6d043990
      Hugh Dickins authored
      Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
      I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
      "mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
      used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d043990
    • Zhang Yanfei's avatar
      net: change type of virtio_chan->p9_max_pages · 7293bfba
      Zhang Yanfei authored
      This member of struct virtio_chan is calculated from nr_free_buffer_pages,
      so change its type to unsigned long to avoid overflow.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7293bfba
    • Zhang Yanfei's avatar
      vmscan: change type of vm_total_pages to unsigned long · b21e0b90
      Zhang Yanfei authored
      This variable is calculated from nr_free_pagecache_pages so
      change its type to unsigned long.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b21e0b90
    • Zhang Yanfei's avatar
      fs/nfsd: change type of max_delegations, nfsd_drc_max_mem and nfsd_drc_mem_used · 697ce9be
      Zhang Yanfei authored
      The three variables are calculated from nr_free_buffer_pages, so change
      their types to unsigned long to avoid overflow.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      697ce9be
    • Zhang Yanfei's avatar
      fs/buffer.c: change type of max_buffer_heads to unsigned long · 43be594a
      Zhang Yanfei authored
      max_buffer_heads is calculated from nr_free_buffer_pages(), so change
      its type to unsigned long to avoid overflow.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43be594a
    • Zhang Yanfei's avatar
      ia64: use %ld to print pages calculated in nr_free_buffer_pages · 6434b94a
      Zhang Yanfei authored
      Now the function nr_free_buffer_pages returns unsigned long, so use %ld
      to print its return value.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6434b94a
    • Zhang Yanfei's avatar
      mm: fix return type for functions nr_free_*_pages · ebec3862
      Zhang Yanfei authored
      Currently, the nr_free_*_pages functions return the number of pages as
      unsigned int.  On machines with big memory (exceeding 16TB) the value
      may be incorrect because of overflow, so change the return type to
      unsigned long.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebec3862
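The 16TB threshold in the nr_free_*_pages commit above follows from simple arithmetic, which this small sketch works through. It assumes 4 KiB pages and a 32-bit unsigned int (modelled here with uint32_t); the helper names `pages_for_bytes` and `pages_as_u32` are hypothetical. With those assumptions, 16 TiB is 2^44 bytes, or 2^32 pages, which is one more than a 32-bit counter can hold.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ULL	/* assumption: 4 KiB pages */

/* Page count held in a 64-bit type, as with unsigned long on 64-bit Linux. */
static uint64_t pages_for_bytes(uint64_t bytes)
{
	return bytes / PAGE_SIZE_BYTES;
}

/* The same page count squeezed into 32 bits, modelling unsigned int. */
static uint32_t pages_as_u32(uint64_t bytes)
{
	return (uint32_t)pages_for_bytes(bytes);	/* high bits are dropped */
}
```

For 16 TiB of memory (1ULL << 44 bytes), `pages_for_bytes` yields 2^32 while `pages_as_u32` wraps to 0; one page less than 16 TiB still fits and gives 0xFFFFFFFF.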
    • Michal Hocko's avatar
      memcg: cleanup mem_cgroup_init comment · 1081312f
      Michal Hocko authored
      We should encourage all memcg controller initialization that is
      independent of a specific mem_cgroup to be done here, rather than
      exploiting the css_alloc callback and assuming that nothing happens
      before the root cgroup is created.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1081312f
    • Michal Hocko's avatar
      memcg: move memcg_stock initialization to mem_cgroup_init · e4777496
      Michal Hocko authored
      memcg_stock are currently initialized during the root cgroup allocation
      which is OK but it pointlessly pollutes memcg allocation code with
      something that can be called when the memcg subsystem is initialized by
      mem_cgroup_init along with other controller specific parts.
      
      This patch wraps the current memcg_stock initialization code into a
      helper and calls it from the controller subsystem initialization code.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4777496
    • Michal Hocko's avatar
      memcg: move mem_cgroup_soft_limit_tree_init to mem_cgroup_init · 8787a1df
      Michal Hocko authored
      Per-node-zone soft limit tree is currently initialized when the root
      cgroup is created which is OK but it pointlessly pollutes memcg
      allocation code with something that can be called when the memcg
      subsystem is initialized by mem_cgroup_init along with other controller
      specific parts.
      
      While we are at it, let's make mem_cgroup_soft_limit_tree_init void.
      It doesn't make much sense to report the allocation failure, because if
      we fail to allocate memory that early during boot then we are screwed
      anyway (this saves some code).
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8787a1df