1. 27 Jan, 2011 8 commits
  2. 26 Jan, 2011 32 commits
    • Michael Chan's avatar
      bnx2: Eliminate AER error messages on systems not supporting it · 4bb9ebc7
      Michael Chan authored
      On PPC for example, AER is not supported and we see unnecessary AER
      error message without this patch:
      
      bnx2 0003:01:00.1: pci_cleanup_aer_uncorrect_error_status failed 0xfffffffb
      Reported-by: default avatarBreno Leitao <leitao@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Chan <mchan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4bb9ebc7
    • Michael Chan's avatar
      cnic: Fix big endian bug · 5138826b
      Michael Chan authored
      The chip's page tables did not set up properly on big endian machines,
      causing EEH errors on PPC machines.
      Reported-by: default avatarBreno Leitao <leitao@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Chan <mchan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5138826b
    • David S. Miller's avatar
      xfrm6: Don't forget to propagate peer into ipsec route. · 7cc2edb8
      David S. Miller authored
      Like ipv4, we have to propagate the ipv6 route peer into
      the ipsec top-level route during instantiation.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7cc2edb8
    • Matt Carlson's avatar
      tg3: Use new VLAN code · 34c92049
      Matt Carlson authored
      This patch pivots the tg3 driver to the new VLAN infrastructure.
      All references to vlgrp have been removed.  The driver still attempts to
      disable VLAN tag stripping if CONFIG_VLAN_8021Q or
      CONFIG_VLAN_8021Q_MODULE is not defined.
      Signed-off-by: default avatarMatt Carlson <mcarlson@broadcom.com>
      Reviewed-by: default avatarMichael Chan <mchan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      34c92049
    • David S. Miller's avatar
    • Hans-Christian Egtvedt's avatar
      avr32: add missing include causing undefined pgtable_page_* references · 6cb8e872
      Hans-Christian Egtvedt authored
      This patch adds the linux/mm.h header file to the AVR32 arch pgalloc.c
      implementation to fix the undefined reference to pgtable_page_ctor() and
      pgtable_page_dtor().
      Signed-off-by: default avatarHans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
      6cb8e872
    • Paul Turner's avatar
      sched: Use rq->clock_task instead of rq->clock for correctly maintaining load averages · 05ca62c6
      Paul Turner authored
      The delta in clock_task is a more fair attribution of how much time a tg has
      been contributing load to the current cpu.
      
      While not really important it also means we're more in sync (by magnitude)
      with respect to periodic updates (since __update_curr deltas are clock_task
      based).
      Signed-off-by: default avatarPaul Turner <pjt@google.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110122044852.007092349@google.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      05ca62c6
    • Paul Turner's avatar
      sched: Fix/remove redundant cfs_rq checks · b815f196
      Paul Turner authored
      Since updates are against an entity's queuing cfs_rq it's not possible to
      enter update_cfs_{shares,load} with a NULL cfs_rq.  (Indeed, update_cfs_load
      would crash prior to the check if we did anyway since we load is examined
      during the initializers).
      
      Also, in the update_cfs_load case there's no point
      in maintaining averages for rq->cfs_rq since we don't perform shares
      distribution at that level -- NULL check is replaced accordingly.
      
      Thanks to Dan Carpenter for pointing out the deference before NULL check.
      Signed-off-by: default avatarPaul Turner <pjt@google.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110122044851.825284940@google.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      b815f196
    • Paul Turner's avatar
      sched: Fix sign under-flows in wake_affine · e37b6a7b
      Paul Turner authored
      While care is taken around the zero-point in effective_load to not exceed
      the instantaneous rq->weight, it's still possible (e.g. using wake_idx != 0)
      for (load + effective_load) to underflow.
      
      In this case the comparing the unsigned values can result in incorrect balanced
      decisions.
      Signed-off-by: default avatarPaul Turner <pjt@google.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110122044851.734245014@google.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      e37b6a7b
    • Eric Dumazet's avatar
      percpu, x86: Fix percpu_xchg_op() · 889a7a6a
      Eric Dumazet authored
      These recent percpu commits:
      
        2485b646: x86,percpu: Move out of place 64 bit ops into X86_64 section
        8270137a: cpuops: Use cmpxchg for xchg to avoid lock semantics
      
      Caused this 'perf top' crash:
      
       Kernel panic - not syncing: Fatal exception in interrupt
       Pid: 0, comm: swapper Tainted: G     D
       2.6.38-rc2-00181-gef71723 #413 Call Trace: <IRQ> [<ffffffff810465b5>]
          ? panic
          ? kmsg_dump
          ? kmsg_dump
          ? oops_end
          ? no_context
          ? __bad_area_nosemaphore
          ? perf_output_begin
          ? bad_area_nosemaphore
          ? do_page_fault
          ? __task_pid_nr_ns
          ? perf_event_tid
          ? __perf_event_header__init_id
          ? validate_chain
          ? perf_output_sample
          ? trace_hardirqs_off
          ? page_fault
          ? irq_work_run
          ? update_process_times
          ? tick_sched_timer
          ? tick_sched_timer
          ? __run_hrtimer
          ? hrtimer_interrupt
          ? account_system_vtime
          ? smp_apic_timer_interrupt
          ? apic_timer_interrupt
       ...
      
      Looking at assembly code, I found:
      
      list = this_cpu_xchg(irq_work_list, NULL);
      
      gives this wrong code : (gcc-4.1.2 cross compiler)
      
      ffffffff810bc45e:
      	mov    %gs:0xead0,%rax
      	cmpxchg %rax,%gs:0xead0
      	jne    ffffffff810bc45e <irq_work_run+0x3e>
      	test   %rax,%rax
      	je     ffffffff810bc4aa <irq_work_run+0x8a>
      
      Tell gcc we dirty eax/rax register in percpu_xchg_op()
      
      Compiler must use another register to store pxo_new__
      
      We also dont need to reload percpu value after a jump,
      since a 'failed' cmpxchg already updated eax/rax
      
      Wrong generated code was :
      	xor     %rax,%rax   /* load 0 into %rax */
      1:	mov     %gs:0xead0,%rax
      	cmpxchg %rax,%gs:0xead0
      	jne     1b
      	test    %rax,%rax
      
      After patch :
      
      	xor     %rdx,%rdx   /* load 0 into %rdx */
      	mov     %gs:0xead0,%rax
      1:	cmpxchg %rdx,%gs:0xead0
      	jne     1b:
      	test    %rax,%rax
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      LKML-Reference: <1295973114.3588.312.camel@edumazet-laptop>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      889a7a6a
    • Yinghai Lu's avatar
      x86: Remove left over system_64.h · 9a57c3e4
      Yinghai Lu authored
      Left-over from the x86 merge ...
      Signed-off-by: default avatarYinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4D3E23D1.7010405@kernel.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      9a57c3e4
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 6fb1b304
      Linus Torvalds authored
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: wacom - pass touch resolution to clients through input_absinfo
        Input: wacom - add 2 Bamboo Pen and touch models
        Input: sysrq - ensure sysrq_enabled and __sysrq_enabled are consistent
        Input: sparse-keymap - fix KEY_VSW handling in sparse_keymap_setup
        Input: tegra-kbc - add tegra keyboard driver
        Input: gpio_keys - switch to using request_any_context_irq
        Input: serio - allow registered drivers to get status flag
        Input: ct82710c - return proper error code for ct82c710_open
        Input: bu21013_ts - added regulator support
        Input: bu21013_ts - remove duplicate resolution parameters
        Input: tnetv107x-ts - don't treat NULL clk as an error
        Input: tnetv107x-keypad - don't treat NULL clk as an error
      
      Fix up trivial conflicts in drivers/input/keyboard/Makefile due to
      additions of tc3589x/Tegra drivers
      6fb1b304
    • Sonic Zhang's avatar
      mmc: bfin_sdh: fix alloc size for private data · a34650f0
      Sonic Zhang authored
      The bfin_sdh driver allocates the wrong size for the private data
      in the mmc_host.  The first parameter of mmc_alloc_host should be
      the size of the local driver struct rather than the common mmc_host.
      Signed-off-by: default avatarSonic Zhang <sonic.zhang@analog.com>
      Signed-off-by: default avatarMike Frysinger <vapier@gentoo.org>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarChris Ball <cjb@laptop.org>
      a34650f0
    • Jaehoon Chung's avatar
      mmc: sdhci-s3c: add platform_8bit_width() hook · 548f07d2
      Jaehoon Chung authored
      We have 8-bit width support but is not a v3 controller.
      So we need platform_8bit_width() to support 8-bit buswidth.
      Also we need MMC_CAP_8_BIT_DATA, so we add it in platdata.
      
      This gets 8-bit support working again on s3c, after we previously
      disabled 8-bit by default on non-v3 controllers.
      Signed-off-by: default avatarJaehoon Chung <jh80.chung@samsung.com>
      Signed-off-by: default avatarKyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: default avatarChris Ball <cjb@laptop.org>
      548f07d2
    • Jamie Iles's avatar
      mmc: jz4740: don't treat NULL clk as an error · 3119cbda
      Jamie Iles authored
      clk_get() returns a struct clk cookie to the driver and some platforms
      may return NULL if they only support a single clock.  clk_get() has only
      failed if it returns a ERR_PTR() encoded pointer.
      Signed-off-by: default avatarJamie Iles <jamie@jamieiles.com>
      Signed-off-by: default avatarChris Ball <cjb@laptop.org>
      3119cbda
    • Russell King - ARM Linux's avatar
      mmc: mmci: don't read command response when invalid · 9047b435
      Russell King - ARM Linux authored
      Don't read the command response from the registers when either the
      command timed out (because there was no response from the card) or
      the checksum on the response was invalid.
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: default avatarChris Ball <cjb@laptop.org>
      9047b435
    • Jesper Juhl's avatar
      mmc: ushc: Remove duplicate include of usb.h · 021cb59a
      Jesper Juhl authored
      Including usb.h once is enough in drivers/mmc/host/ushc.c
      This removes the duplicate.
      Signed-off-by: default avatarJesper Juhl <jj@chaosbits.net>
      Signed-off-by: default avatarChris Ball <cjb@laptop.org>
      021cb59a
    • Ping Cheng's avatar
      Input: wacom - pass touch resolution to clients through input_absinfo · 409550f2
      Ping Cheng authored
      Also remove fake ABS_RX/ABS_RY "axes" that were used to report physical
      dimensions now that we have better way.
      Signed-off-by: default avatarPing Cheng <pingc@wacom.com>
      Reviewed-by: default avatarHenrik Rydberg <rydberg@euromail.se>
      Signed-off-by: default avatarDmitry Torokhov <dtor@mail.ru>
      409550f2
    • Torben Hohn's avatar
      console: rename acquire/release_console_sem() to console_lock/unlock() · ac751efa
      Torben Hohn authored
      The -rt patches change the console_semaphore to console_mutex.  As a
      result, a quite large chunk of the patches changes all
      acquire/release_console_sem() to acquire/release_console_mutex()
      
      This commit makes things use more neutral function names which dont make
      implications about the underlying lock.
      
      The only real change is the return value of console_trylock which is
      inverted from try_acquire_console_sem()
      
      This patch also paves the way to switching console_sem from a semaphore to
      a mutex.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: make console_trylock return 1 on success, per Geert]
      Signed-off-by: default avatarTorben Hohn <torbenh@gmx.de>
      Cc: Thomas Gleixner <tglx@tglx.de>
      Cc: Greg KH <gregkh@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ac751efa
    • Phillip Lougher's avatar
      squashfs: fix use of uninitialised variable in zlib & xz decompressors · 3689456b
      Phillip Lougher authored
      Fix potential use of uninitialised variable caused by recent
      decompressor code optimisations.
      
      In zlib_uncompress (zlib_wrapper.c) we have
      
      	int zlib_err, zlib_init = 0;
      	...
      	do {
      		...
      			if (avail == 0) {
      				offset = 0;
      				put_bh(bh[k++]);
      				continue;
      			}
      		...
      		zlib_err = zlib_inflate(stream, Z_SYNC_FLUSH);
      		...
      	} while (zlib_err == Z_OK);
      
      If continue is executed (avail == 0) then the while condition will be
      evaluated testing zlib_err, which is uninitialised first time around the
      loop.
      
      Fix this by getting rid of the 'if (avail == 0)' condition test, this
      edge condition should not be being handled in the decompressor code, and
      instead handle it generically in the caller code.
      
      Similarly for xz_wrapper.c.
      
      Incidentally, on most architectures (bar Mips and Parisc), no
      uninitialised variable warning is generated by gcc, this is because the
      while condition test on continue is optimised out and not performed
      (when executing continue zlib_err has not been changed since entering
      the loop, and logically if the while condition was true previously, then
      it's still true).
      Signed-off-by: default avatarPhillip Lougher <phillip@lougher.demon.co.uk>
      Reported-by: default avatarJesper Juhl <jj@chaosbits.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3689456b
    • Toshiyuki Okajima's avatar
      radix_tree: radix_tree_gang_lookup_tag_slot() may never return · ac15ee69
      Toshiyuki Okajima authored
      Executed command: fsstress -d /mnt -n 600 -p 850
      
        crash> bt
        PID: 7947   TASK: ffff880160546a70  CPU: 0   COMMAND: "fsstress"
         #0 [ffff8800dfc07d00] machine_kexec at ffffffff81030db9
         #1 [ffff8800dfc07d70] crash_kexec at ffffffff810a7952
         #2 [ffff8800dfc07e40] oops_end at ffffffff814aa7c8
         #3 [ffff8800dfc07e70] die_nmi at ffffffff814aa969
         #4 [ffff8800dfc07ea0] do_nmi_callback at ffffffff8102b07b
         #5 [ffff8800dfc07f10] do_nmi at ffffffff814aa514
         #6 [ffff8800dfc07f50] nmi at ffffffff814a9d60
            [exception RIP: __lookup_tag+100]
            RIP: ffffffff812274b4  RSP: ffff88016056b998  RFLAGS: 00000287
            RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000006
            RDX: 000000000000001d  RSI: ffff88016056bb18  RDI: ffff8800c85366e0
            RBP: ffff88016056b9c8   R8: ffff88016056b9e8   R9: 0000000000000000
            R10: 000000000000000e  R11: ffff8800c8536908  R12: 0000000000000010
            R13: 0000000000000040  R14: ffffffffffffffc0  R15: ffff8800c85366e0
            ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
        <NMI exception stack>
         #7 [ffff88016056b998] __lookup_tag at ffffffff812274b4
         #8 [ffff88016056b9d0] radix_tree_gang_lookup_tag_slot at ffffffff81227605
         #9 [ffff88016056ba20] find_get_pages_tag at ffffffff810fc110
        #10 [ffff88016056ba80] pagevec_lookup_tag at ffffffff81105e85
        #11 [ffff88016056baa0] write_cache_pages at ffffffff81104c47
        #12 [ffff88016056bbd0] generic_writepages at ffffffff81105014
        #13 [ffff88016056bbe0] do_writepages at ffffffff81105055
        #14 [ffff88016056bbf0] __filemap_fdatawrite_range at ffffffff810fb2cb
        #15 [ffff88016056bc40] filemap_write_and_wait_range at ffffffff810fb32a
        #16 [ffff88016056bc70] generic_file_direct_write at ffffffff810fb3dc
        #17 [ffff88016056bce0] __generic_file_aio_write at ffffffff810fcee5
        #18 [ffff88016056bda0] generic_file_aio_write at ffffffff810fd085
        #19 [ffff88016056bdf0] do_sync_write at ffffffff8114f9ea
        #20 [ffff88016056bf00] vfs_write at ffffffff8114fcf8
        #21 [ffff88016056bf30] sys_write at ffffffff81150691
        #22 [ffff88016056bf80] system_call_fastpath at ffffffff8100c0b2
      
      I think this root cause is the following:
      
       radix_tree_range_tag_if_tagged() always tags the root tag with settag
       if the root tag is set with iftag even if there are no iftag tags
       in the specified range (Of course, there are some iftag tags
       outside the specified range).
      
      ===============================================================================
      [[[Detailed description]]]
      
      (1) Why cannot radix_tree_gang_lookup_tag_slot() return forever?
      
      __lookup_tag():
       - Return with 0.
       - Return with the index which is not bigger than the old one as the
         input parameter.
      
      Therefore the following "while" repeats forever because the above
      conditions cause "ret" not to be updated and the cur_index cannot be
      changed into the bigger one.
      
      (So, radix_tree_gang_lookup_tag_slot() cannot return forever.)
      
      radix_tree_gang_lookup_tag_slot():
      1178         while (ret < max_items) {
      1179                 unsigned int slots_found;
      1180                 unsigned long next_index;       /* Index of next search */
      1181
      1182                 if (cur_index > max_index)
      1183                         break;
      1184                 slots_found = __lookup_tag(node, results + ret,
      1185                                 cur_index, max_items - ret, &next_index,
      tag);
      1186                 ret += slots_found;
      			// cannot update ret because slots_found == 0.
      			// so, this while loops forever.
      1187                 if (next_index == 0)
      1188                         break;
      1189                 cur_index = next_index;
      1190         }
      
      (2) Why does __lookup_tag() return with 0 and doesn't update the index?
      
      Assuming the following:
        - the one of the slot in radix_tree_node is NULL.
        - the one of the tag which corresponds to the slot sets with
          PAGECACHE_TAG_TOWRITE or other.
        - In a certain height(!=0), the corresponding index is 0.
      
      a) __lookup_tag() notices that the tag is set.
      
      1005 static unsigned int
      1006 __lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index,
      1007         unsigned int max_items, unsigned long *next_index, unsigned int tag)
      1008 {
      1009         unsigned int nr_found = 0;
      1010         unsigned int shift, height;
      1011
      1012         height = slot->height;
      1013         if (height == 0)
      1014                 goto out;
      1015         shift = (height-1) * RADIX_TREE_MAP_SHIFT;
      1016
      1017         while (height > 0) {
      1018                 unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK ;
      1019
      1020                 for (;;) {
      1021                         if (tag_get(slot, tag, i))
      1022                                 break;
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      * the index is not updated yet.
      
      b) __lookup_tag() notices that the slot is NULL.
      
      1023                         index &= ~((1UL << shift) - 1);
      1024                         index += 1UL << shift;
      1025                         if (index == 0)
      1026                                 goto out;       /* 32-bit wraparound */
      1027                         i++;
      1028                         if (i == RADIX_TREE_MAP_SIZE)
      1029                                 goto out;
      1030                 }
      1031                 height--;
      1032                 if (height == 0) {      /* Bottom level: grab some items */
      ...
      1055                 }
      1056                 shift -= RADIX_TREE_MAP_SHIFT;
      1057                 slot = rcu_dereference_raw(slot->slots[i]);
      1058                 if (slot == NULL)
      1059                         break;
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      
      c) __lookup_tag() doesn't update the index and return with 0.
      
      1060         }
      1061 out:
      1062         *next_index = index;
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      1063         return nr_found;
      1064 }
      
      (3) Why is the slot NULL even if the tag is set?
      
      Because radix_tree_range_tag_if_tagged() always sets the root tag with
      PAGECACHE_TAG_TOWRITE if the root tag is set with PAGECACHE_TAG_DIRTY,
      even if there is no tag which can be set with PAGECACHE_TAG_TOWRITE
      in the specified range (from *first_indexp to last_index). Of course,
      some PAGECACHE_TAG_DIRTY nodes must exist outside the specified range.
      (radix_tree_range_tag_if_tagged() is called only from tag_pages_for_writeback())
      
       640 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root
      *root,
       641                 unsigned long *first_indexp, unsigned long last_index,
       642                 unsigned long nr_to_tag,
       643                 unsigned int iftag, unsigned int settag)
       644 {
       645         unsigned int height = root->height;
       646         struct radix_tree_path path[height];
       647         struct radix_tree_path *pathp = path;
       648         struct radix_tree_node *slot;
       649         unsigned int shift;
       650         unsigned long tagged = 0;
       651         unsigned long index = *first_indexp;
       652
       653         last_index = min(last_index, radix_tree_maxindex(height));
       654         if (index > last_index)
       655                 return 0;
       656         if (!nr_to_tag)
       657                 return 0;
       658         if (!root_tag_get(root, iftag)) {
       659                 *first_indexp = last_index + 1;
       660                 return 0;
       661         }
       662         if (height == 0) {
       663                 *first_indexp = last_index + 1;
       664                 root_tag_set(root, settag);
       665                 return 1;
       666         }
      ...
       733         root_tag_set(root, settag);
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       734         *first_indexp = index;
       735
       736         return tagged;
       737 }
      
      As the result, there is no radix_tree_node which is set with
      PAGECACHE_TAG_TOWRITE but the root tag(radix_tree_root) is set with
      PAGECACHE_TAG_TOWRITE.
      
      [figure: inside radix_tree]
      (Please see the figure with typewriter font)
      ===========================================
                [roottag = DIRTY]
                       |             tag=0:NOTHING
               tag[0 0 0 1]              1:DIRTY
                  [x x x +]              2:WRITEBACK
                         |               3:DIRTY,WRITEBACK
                         p               4:TOWRITE
                   <--->                 5:DIRTY,TOWRITE ...
           specified range (index: 0 to 2)
      
      * There is no DIRTY tag within the specified range.
       (But there is a DIRTY tag outside that range.)
      
                  | | | | | | | | |
          after calling tag_pages_for_writeback()
                  | | | | | | | | |
                  v v v v v v v v v
      
                [roottag = DIRTY,TOWRITE]
                       |                 p is "page".
               tag[0 0 0 1]              x is NULL.
                  [x x x +]              +- is a pointer to "page".
                         |
                         p
      
      * But TOWRITE tag is set on the root tag.
      ============================================
      
      After that, radix_tree_extend() via radix_tree_insert() is called
      when the page is added.
      This function sets the new radix_tree_node with PAGECACHE_TAG_TOWRITE
      to succeed the status of the root tag.
      
       246 static int radix_tree_extend(struct radix_tree_root *root, unsigned long
      index)
       247 {
       248         struct radix_tree_node *node;
       249         unsigned int height;
       250         int tag;
       251
       252         /* Figure out what the height should be.  */
       253         height = root->height + 1;
       254         while (index > radix_tree_maxindex(height))
       255                 height++;
       256
       257         if (root->rnode == NULL) {
       258                 root->height = height;
       259                 goto out;
       260         }
       261
       262         do {
       263                 unsigned int newheight;
       264                 if (!(node = radix_tree_node_alloc(root)))
       265                         return -ENOMEM;
       266
       267                 /* Increase the height.  */
       268                 node->slots[0] = radix_tree_indirect_to_ptr(root->rnode);
       269
       270                 /* Propagate the aggregated tag info into the new root */
       271                 for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
       272                         if (root_tag_get(root, tag))
       273                                 tag_set(node, tag, 0);
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       274                 }
      
      ===========================================
                [roottag = DIRTY,TOWRITE]
                       |     :
               tag[0 0 0 1] [0 0 0 0]
                  [x x x +] [+ x x x]
                         |   |
                         p   p (new page)
      
                  | | | | | | | | |
          after calling radix_tree_insert
                  | | | | | | | | |
                  v v v v v v v v v
      
                [roottag = DIRTY,TOWRITE]
                       |
               tag [5 0 0 0]    *  DIRTY and TOWRITE tags are
                   [+ + x x]       succeeded to the new node.
                    | |
        tag [0 0 0 1] [0 0 0 0]
            [x x x +] [+ x x x]
                   |   |
                   p   p
      ============================================
      
      After that, the index 3 page is released by remove_from_page_cache().
      Then we can make the situation that the tag is set with PAGECACHE_TAG_TOWRITE
      and that the slot which corresponds to the tag is NULL.
      ===========================================
                [roottag = DIRTY,TOWRITE]
                       |
               tag [5 0 0 0]
                   [+ + x x]
                    | |
        tag [0 0 0 1] [0 0 0 0]
            [x x x +] [+ x x x]
                   |   |
                   p   p
               (remove)
      
                  | | | | | | | | |
          after calling remove_page_cache
                  | | | | | | | | |
                  v v v v v v v v v
      
                [roottag = DIRTY,TOWRITE]
                       |
               tag [4 0 0 0]      * Only DIRTY tag is cleared
                   [x + x x]        because no TOWRITE tag is existed
                      |             in the bottom node.
                      [0 0 0 0]
                      [+ x x x]
                       |
                       p
      ============================================
      
      To solve this problem
      
      Change to that radix_tree_tag_if_tagged() doesn't tag the root tag
      if it doesn't set any tags within the specified range.
      
      Like this.
      ============================================
       640 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root
      *root,
       641                 unsigned long *first_indexp, unsigned long last_index,
       642                 unsigned long nr_to_tag,
       643                 unsigned int iftag, unsigned int settag)
       644 {
       650         unsigned long tagged = 0;
      ...
       733 	     if (tagged)
      ^^^^^^^^^^^^^^^^^^^^^^^^
       734            root_tag_set(root, settag);
       735         *first_indexp = index;
       736
       737         return tagged;
       738 }
      
      ============================================
      Signed-off-by: default avatarToshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ac15ee69
    • Voss, Nikolaus's avatar
      drivers/clocksource/tcb_clksrc.c: fix init sequence · 1817dc03
      Voss, Nikolaus authored
      setup_irq() was called before clockevents_register_device() which is
      needed by the irq handler.  Bug was reproducible by restarting the
      kernel using kexec (reliable crash).
      Signed-off-by: default avatarNikolaus Voss <n.voss@weinmann.de>
      Cc: David Brownell <dbrownell@users.sourceforge.net>
      Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1817dc03
    • KAMEZAWA Hiroyuki's avatar
      memcg: fix race at move_parent around compound_order() · 52dbb905
      KAMEZAWA Hiroyuki authored
      A fix up mem_cgroup_move_parent() which use compound_order() in
      asynchronous manner.  This compound_order() may return unknown value
      because we don't take lock.  Use PageTransHuge() and HPAGE_SIZE instead
      of it.
      
      Also clean up for mem_cgroup_move_parent().
       - remove unnecessary initialization of local variable.
       - rename charge_size -> page_size
       - remove unnecessary (wrong) comment.
       - added a comment about THP.
      
      Note:
       Current design take compound_page_lock() in caller of move_account().
       This should be revisited when we implement direct move_task of hugepage
       without splitting.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      52dbb905
    • KAMEZAWA Hiroyuki's avatar
      memcg: bugfix check mem_cgroup_disabled() at split fixup · 3d37c4a9
      KAMEZAWA Hiroyuki authored
      mem_cgroup_disabled() should be checked at splitting.  If disabled, no
      heavy work is necesary.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3d37c4a9
    • KAMEZAWA Hiroyuki's avatar
      memcg: fix account leak at failure of memsw acconting · 01c88e2d
      KAMEZAWA Hiroyuki authored
      Commit 4b534334 ("memcg: clean up try_charge main loop") removes a
      cancel of charge at case: memory charge-> success.  mem+swap charge->
      failure.
      
      This leaks usage of memory.  Fix it.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: <stable@kernel.org>	[2.6.36+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      01c88e2d
    • Minchan Kim's avatar
      mm: migration: clarify migrate_pages() comment · 28bd6578
      Minchan Kim authored
      Callers of migrate_pages should putback_lru_pages to return pages
      isolated to LRU or free list.  Now comment is rather confusing.  It says
      caller always have to call it.
      
      It is more clear to point out that the caller has to call it if
      migrate_pages's return value isn't zero.
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      28bd6578
    • Andrea Arcangeli's avatar
      mm: compaction: don't depend on HUGETLB_PAGE · 33a93877
      Andrea Arcangeli authored
      Commit 5d689240 ("thp: select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE
      enabled") causes this warning during the configuration process:
      
        warning: (TRANSPARENT_HUGEPAGE) selects COMPACTION which has unmet
        direct dependencies (EXPERIMENTAL && HUGETLB_PAGE && MMU)
      
      COMPACTION doesn't depend on HUGETLB_PAGE, it doesn't depend on THP
      either, it is also useful for regular alloc_pages(order > 0) including
      the very kernel stack during fork (THREAD_ORDER = 1).  It's always
      better to enable COMPACTION.
      
      The warning should be an error because we would end up with MIGRATION
      not selected, and COMPACTION wouldn't work without migration (despite it
      seems to build with an inline migrate_pages returning -ENOSYS).
      
      I'd also like to remove EXPERIMENTAL: compaction has been in the kernel
      for some releases (for full safety the default remains disabled which I
      think is enough).
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatarLuca Tettamanti <kronos.it@gmail.com>
      Tested-by: default avatarLuca Tettamanti <kronos.it@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      33a93877
    • Jesper Juhl's avatar
      mm/memcontrol.c: fix uninitialized variable use in mem_cgroup_move_parent() · 8dba474f
      Jesper Juhl authored
      In mm/memcontrol.c::mem_cgroup_move_parent() there's a path that jumps
      to the 'put_back' label
      
        	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, charge);
        	if (ret || !parent)
        		goto put_back;
      
      where we'll
      
        	if (charge > PAGE_SIZE)
        		compound_unlock_irqrestore(page, flags);
      
      but, we have not assigned anything to 'flags' at this point, nor have we
      called 'compound_lock_irqsave()' (which is what sets 'flags').  The
      'put_back' label should be moved below the call to
      compound_unlock_irqrestore() as per this patch.
      Signed-off-by: default avatarJesper Juhl <jj@chaosbits.net>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8dba474f
    • David Rientjes's avatar
      mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator · 2ff754fa
      David Rientjes authored
      Commit 0e093d99 ("writeback: do not sleep on the congestion queue if
      there are no congested BDIs or if significant congestion is not being
      encountered in the current zone") uncovered a livelock in the page
      allocator that resulted in tasks infinitely looping trying to find
      memory and kswapd running at 100% cpu.
      
      The issue occurs because drain_all_pages() is called immediately
      following direct reclaim when no memory is freed and try_to_free_pages()
      returns non-zero because all zones in the zonelist do not have their
      all_unreclaimable flag set.
      
      When draining the per-cpu pagesets back to the buddy allocator for each
      zone, the zone->pages_scanned counter is cleared to avoid erroneously
      setting zone->all_unreclaimable later.  The problem is that no pages may
      actually be drained and, thus, the unreclaimable logic never fails
      direct reclaim so the oom killer may be invoked.
      
      This apparently only manifested after wait_iff_congested() was
      introduced and the zone was full of anonymous memory that would not
      congest the backing store.  The page allocator would infinitely loop if
      there were no other tasks waiting to be scheduled and clear
      zone->pages_scanned because of drain_all_pages() as the result of this
      change before kswapd could scan enough pages to trigger the reclaim
      logic.  Additionally, with every loop of the page allocator and in the
      reclaim path, kswapd would be kicked and would end up running at 100%
      cpu.  In this scenario, current and kswapd are all running continuously
      with kswapd incrementing zone->pages_scanned and current clearing it.
      
      The problem is even more pronounced when current swaps some of its
      memory to swap cache and the reclaimable logic then considers all active
      anonymous memory in the all_unreclaimable logic, which requires a much
      higher zone->pages_scanned value for try_to_free_pages() to return zero
      that is never attainable in this scenario.
      
      Before wait_iff_congested(), the page allocator would incur an
      unconditional timeout and allow kswapd to elevate zone->pages_scanned to
      a level that the oom killer would be called the next time it loops.
      
      The fix is to only attempt to drain pcp pages if there is actually a
      quantity to be drained.  The unconditional clearing of
      zone->pages_scanned in free_pcppages_bulk() need not be changed since
      other callers already ensure that draining will occur.  This patch
      ensures that free_pcppages_bulk() will actually free memory before
      calling into it from drain_all_pages() so zone->pages_scanned is only
      cleared if appropriate.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2ff754fa
    • David Rientjes's avatar
      mm: fix deferred congestion timeout if preferred zone is not allowed · f33261d7
      David Rientjes authored
      Before 0e093d99 ("writeback: do not sleep on the congestion queue if
      there are no congested BDIs or if significant congestion is not being
      encountered in the current zone"), preferred_zone was only used for NUMA
      statistics, to determine the zoneidx from which to allocate from given
      the type requested, and whether to utilize memory compaction.
      
      wait_iff_congested(), though, uses preferred_zone to determine if the
      congestion wait should be deferred because its dirty pages are backed by
      a congested bdi.  This incorrectly defers the timeout and busy loops in
      the page allocator with various cond_resched() calls if preferred_zone
      is not allowed in the current context, usually consuming 100% of a cpu.
      
      This patch ensures preferred_zone is an allowed zone in the fastpath
      depending on whether current is constrained by its cpuset or nodes in
      its mempolicy (when the nodemask passed is non-NULL).  This is correct
      since the fastpath allocation always passes ALLOC_CPUSET when trying to
      allocate memory.  In the slowpath, this patch resets preferred_zone to
      the first zone of the allowed type when the allocation is not
      constrained by current's cpuset, i.e.  it does not pass ALLOC_CPUSET.
      
      This patch also ensures preferred_zone is from the set of allowed nodes
      when called from within direct reclaim since allocations are always
      constrained by cpusets in this context (it is blockable).
      
      Both of these uses of cpuset_current_mems_allowed are protected by
      get_mems_allowed().
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f33261d7
    • Alexander Gordeev's avatar
      pps: claim parallel port exclusively · 4f542e3d
      Alexander Gordeev authored
      Both pps_parport and pps_gen_parport are written in a way that they
      can't share a port with any other driver.  This can result in locking up
      the process that loads modules or even the whole kernel if the modules
      are compiled in.  Use PARPORT_FLAG_EXCL to indicate this.
      Signed-off-by: default avatarAlexander Gordeev <lasaine@lvk.cs.msu.su>
      Cc: Alexander Gordeev <lasaine@lvk.cs.msu.su>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f542e3d
    • Rodolfo Giometti's avatar
      pps ktimer: remove noisy message · a783ac44
      Rodolfo Giometti authored
      Signed-off-by: default avatarRodolfo Giometti <giometti@linux.it>
      Reported-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Alexander Gordeev <lasaine@lvk.cs.msu.su>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a783ac44