1. 24 Apr, 2015 4 commits
    • Gu Zheng's avatar
      mm/memory hotplug: postpone the reset of obsolete pgdat · ec4db266
      Gu Zheng authored
      [ Upstream commit b0dc3a34 ]
      
      Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
      stress condition:
      
        BUG: unable to handle kernel paging request at 0000000000025f60
        IP: next_online_pgdat+0x1/0x50
        PGD 0
        Oops: 0000 [#1] SMP
        ACPI: Device does not support D3cold
        Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
        CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G           O 3.10.15-5885-euler0302 #1
        Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
        Workqueue: events vmstat_update
        task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
        RIP: 0010: next_online_pgdat+0x1/0x50
        RSP: 0018:ffffa800d32afce8  EFLAGS: 00010286
        RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
        RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
        RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
        R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
        R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
        FS:  0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
          refresh_cpu_vm_stats+0xd0/0x140
          vmstat_update+0x11/0x50
          process_one_work+0x194/0x3d0
          worker_thread+0x12b/0x410
          kthread+0xc6/0xd0
          ret_from_fork+0x7c/0xb0
      
      The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
      try_offline_node, which will reset all the content of pgdat to 0, as the
      pgdat is accessed lock-free, so that the users still using the pgdat
      will panic, such as the vmstat_update routine.
      
      process A:				offline node XX:
      
      vmstat_updat()
         refresh_cpu_vm_stats()
           for_each_populated_zone()
             find online node XX
           cond_resched()
      					offline cpu and memory, then try_offline_node()
      					node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
             zone = next_zone(zone)
               pg_data_t *pgdat = zone->zone_pgdat;  // here pgdat is NULL now
                 next_online_pgdat(pgdat)
                   next_online_node(pgdat->node_id);  // NULL pointer access
      
      So the solution here is postponing the reset of obsolete pgdat from
      try_offline_node() to hotadd_new_pgdat(), and just resetting
      pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset
      0 to avoid breaking pointer information in pgdat.
      Signed-off-by: default avatarGu Zheng <guz.fnst@cn.fujitsu.com>
      Reported-by: default avatarXishi Qiu <qiuxishi@huawei.com>
      Suggested-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      ec4db266
    • Leon Yu's avatar
      mm: fix anon_vma->degree underflow in anon_vma endless growing prevention · 5bc650f0
      Leon Yu authored
      [ Upstream commit 3fe89b3e ]
      
      I have constantly stumbled upon "kernel BUG at mm/rmap.c:399!" after
      upgrading to 3.19 and had no luck with 4.0-rc1 neither.
      
      So, after looking into new logic introduced by commit 7a3ef208 ("mm:
      prevent endless growth of anon_vma hierarchy"), I found chances are that
      unlink_anon_vmas() is called without incrementing dst->anon_vma->degree
      in anon_vma_clone() due to allocation failure.  If dst->anon_vma is not
      NULL in error path, its degree will be incorrectly decremented in
      unlink_anon_vmas() and eventually underflow when exiting as a result of
      another call to unlink_anon_vmas().  That's how "kernel BUG at
      mm/rmap.c:399!" is triggered for me.
      
      This patch fixes the underflow by dropping dst->anon_vma when allocation
      fails.  It's safe to do so regardless of original value of dst->anon_vma
      because dst->anon_vma doesn't have valid meaning if anon_vma_clone()
      fails.  Besides, callers don't care dst->anon_vma in such case neither.
      
      Also suggested by Michal Hocko, we can clean up vma_adjust() a bit as
      anon_vma_clone() now does the work.
      
      [akpm@linux-foundation.org: tweak comment]
      Fixes: 7a3ef208 ("mm: prevent endless growth of anon_vma hierarchy")
      Signed-off-by: default avatarLeon Yu <chianglungyu@gmail.com>
      Signed-off-by: default avatarKonstantin Khlebnikov <koct9i@gmail.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      5bc650f0
    • Konstantin Khlebnikov's avatar
      mm: fix corner case in anon_vma endless growing prevention · f94faaaa
      Konstantin Khlebnikov authored
      [ Upstream commit b800c91a ]
      
      Fix for BUG_ON(anon_vma->degree) splashes in unlink_anon_vmas() ("kernel
      BUG at mm/rmap.c:399!") caused by commit 7a3ef208 ("mm: prevent
      endless growth of anon_vma hierarchy")
      
      Anon_vma_clone() is usually called for a copy of source vma in
      destination argument.  If source vma has anon_vma it should be already
      in dst->anon_vma.  NULL in dst->anon_vma is used as a sign that it's
      called from anon_vma_fork().  In this case anon_vma_clone() finds
      anon_vma for reusing.
      
      Vma_adjust() calls it differently and this breaks anon_vma reusing
      logic: anon_vma_clone() links vma to old anon_vma and updates degree
      counters but vma_adjust() overrides vma->anon_vma right after that.  As
      a result final unlink_anon_vmas() decrements degree for wrong anon_vma.
      
      This patch assigns ->anon_vma before calling anon_vma_clone().
      Signed-off-by: default avatarKonstantin Khlebnikov <koct9i@gmail.com>
      Reported-and-tested-by: default avatarChris Clayton <chris2553@googlemail.com>
      Reported-and-tested-by: default avatarOded Gabbay <oded.gabbay@amd.com>
      Reported-and-tested-by: default avatarChih-Wei Huang <cwhuang@android-x86.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Daniel Forrest <dan.forrest@ssec.wisc.edu>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: stable@vger.kernel.org  # to match back-porting of 7a3ef208Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      f94faaaa
    • Konstantin Khlebnikov's avatar
      mm: prevent endless growth of anon_vma hierarchy · 3f161800
      Konstantin Khlebnikov authored
      [ Upstream commit 7a3ef208 ]
      
      Constantly forking task causes unlimited grow of anon_vma chain.  Each
      next child allocates new level of anon_vmas and links vma to all
      previous levels because pages might be inherited from any level.
      
      This patch adds heuristic which decides to reuse existing anon_vma
      instead of forking new one.  It adds counter anon_vma->degree which
      counts linked vmas and directly descending anon_vmas and reuses anon_vma
      if counter is lower than two.  As a result each anon_vma has either vma
      or at least two descending anon_vmas.  In such trees half of nodes are
      leafs with alive vmas, thus count of anon_vmas is no more than two times
      bigger than count of vmas.
      
      This heuristic reuses anon_vmas as few as possible because each reuse
      adds false aliasing among vmas and rmap walker ought to scan more ptes
      when it searches where page is might be mapped.
      
      Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu
      Fixes: 5beb4930 ("mm: change anon_vma linking to fix multi-process server scalability issue")
      [akpm@linux-foundation.org: fix typo, per Rik]
      Signed-off-by: default avatarKonstantin Khlebnikov <koct9i@gmail.com>
      Reported-by: default avatarDaniel Forrest <dan.forrest@ssec.wisc.edu>
      Tested-by: default avatarMichal Hocko <mhocko@suse.cz>
      Tested-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>	[2.6.34+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      3f161800
  2. 23 Apr, 2015 36 commits