1. 30 May, 2018 40 commits
    • Anders Roxell's avatar
      selftests: memfd: add config fragment for fuse · b611d454
      Anders Roxell authored
      [ Upstream commit 9a606f8d ]
      
      The memfd test requires to insert the fuse module (CONFIG_FUSE_FS).
      Signed-off-by: default avatarAnders Roxell <anders.roxell@linaro.org>
      Signed-off-by: default avatarDaniel Díaz <daniel.diaz@linaro.org>
      Signed-off-by: default avatarShuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b611d454
    • Naresh Kamboju's avatar
      selftests: pstore: Adding config fragment CONFIG_PSTORE_RAM=m · 3f3beab9
      Naresh Kamboju authored
      [ Upstream commit 9a379e77 ]
      
      pstore_tests and pstore_post_reboot_tests need CONFIG_PSTORE_RAM=m
      Signed-off-by: default avatarNaresh Kamboju <naresh.kamboju@linaro.org>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarShuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3f3beab9
    • Dong Bo's avatar
      libata: Fix compile warning with ATA_DEBUG enabled · 3a6ebe27
      Dong Bo authored
      [ Upstream commit 0d3e45bc ]
      
      This fixs the following comile warnings with ATA_DEBUG enabled,
      which detected by Linaro GCC 5.2-2015.11:
      
        drivers/ata/libata-scsi.c: In function 'ata_scsi_dump_cdb':
        ./include/linux/kern_levels.h:5:18: warning: format '%d' expects
        argument of type 'int', but argument 6 has type 'u64 {aka long
         long unsigned int}' [-Wformat=]
      
      tj: Patch hand-applied and description trimmed.
      Signed-off-by: default avatarDong Bo <dongbo4@huawei.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a6ebe27
    • Jason Wang's avatar
      ptr_ring: prevent integer overflow when calculating size · 2e857aaf
      Jason Wang authored
      [ Upstream commit 54e02162 ]
      
      Switch to use dividing to prevent integer overflow when size is too
      big to calculate allocation size properly.
      Reported-by: default avatarEric Biggers <ebiggers3@gmail.com>
      Fixes: 6e6e41c3 ("ptr_ring: fail early if queue occupies more than KMALLOC_MAX_SIZE")
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2e857aaf
    • Ulf Magnusson's avatar
      ARC: Fix malformed ARC_EMUL_UNALIGNED default · a5338dbd
      Ulf Magnusson authored
      [ Upstream commit 827cc2fa ]
      
      'default N' should be 'default n', though they happen to have the same
      effect here, due to undefined symbols (N in this case) evaluating to n
      in a tristate sense.
      
      Remove the default from ARC_EMUL_UNALIGNED instead of changing it. bool
      and tristate symbols implicitly default to n.
      
      Discovered with the
      https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ulfalizer_Kconfiglib_blob_master_examples_list-5Fundefined.py&d=DwIBAg&c=DPL6_X_6JkXFx7AXWqB0tg&r=c14YS-cH-kdhTOW89KozFhBtBJgs1zXscZojEZQ0THs&m=WxxD8ozR7QQUVzNCBksiznaisBGO_crN7PBOvAoju8s&s=1LmxsNqxwT-7wcInVpZ6Z1J27duZKSoyKxHIJclXU_M&e=
      script.
      Signed-off-by: default avatarUlf Magnusson <ulfalizer@gmail.com>
      Signed-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a5338dbd
    • Mark Salter's avatar
      irqchip/gic-v3: Change pr_debug message to pr_devel · 5fa8ed82
      Mark Salter authored
      [ Upstream commit b6dd4d83 ]
      
      The pr_debug() in gic-v3 gic_send_sgi() can trigger a circular locking
      warning:
      
       GICv3: CPU10: ICC_SGI1R_EL1 5000400
       ======================================================
       WARNING: possible circular locking dependency detected
       4.15.0+ #1 Tainted: G        W
       ------------------------------------------------------
       dynamic_debug01/1873 is trying to acquire lock:
        ((console_sem).lock){-...}, at: [<0000000099c891ec>] down_trylock+0x20/0x4c
      
       but task is already holding lock:
        (&rq->lock){-.-.}, at: [<00000000842e1587>] __task_rq_lock+0x54/0xdc
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #2 (&rq->lock){-.-.}:
              __lock_acquire+0x3b4/0x6e0
              lock_acquire+0xf4/0x2a8
              _raw_spin_lock+0x4c/0x60
              task_fork_fair+0x3c/0x148
              sched_fork+0x10c/0x214
              copy_process.isra.32.part.33+0x4e8/0x14f0
              _do_fork+0xe8/0x78c
              kernel_thread+0x48/0x54
              rest_init+0x34/0x2a4
              start_kernel+0x45c/0x488
      
       -> #1 (&p->pi_lock){-.-.}:
              __lock_acquire+0x3b4/0x6e0
              lock_acquire+0xf4/0x2a8
              _raw_spin_lock_irqsave+0x58/0x70
              try_to_wake_up+0x48/0x600
              wake_up_process+0x28/0x34
              __up.isra.0+0x60/0x6c
              up+0x60/0x68
              __up_console_sem+0x4c/0x7c
              console_unlock+0x328/0x634
              vprintk_emit+0x25c/0x390
              dev_vprintk_emit+0xc4/0x1fc
              dev_printk_emit+0x88/0xa8
              __dev_printk+0x58/0x9c
              _dev_info+0x84/0xa8
              usb_new_device+0x100/0x474
              hub_port_connect+0x280/0x92c
              hub_event+0x740/0xa84
              process_one_work+0x240/0x70c
              worker_thread+0x60/0x400
              kthread+0x110/0x13c
              ret_from_fork+0x10/0x18
      
       -> #0 ((console_sem).lock){-...}:
              validate_chain.isra.34+0x6e4/0xa20
              __lock_acquire+0x3b4/0x6e0
              lock_acquire+0xf4/0x2a8
              _raw_spin_lock_irqsave+0x58/0x70
              down_trylock+0x20/0x4c
              __down_trylock_console_sem+0x3c/0x9c
              console_trylock+0x20/0xb0
              vprintk_emit+0x254/0x390
              vprintk_default+0x58/0x90
              vprintk_func+0xbc/0x164
              printk+0x80/0xa0
              __dynamic_pr_debug+0x84/0xac
              gic_raise_softirq+0x184/0x18c
              smp_cross_call+0xac/0x218
              smp_send_reschedule+0x3c/0x48
              resched_curr+0x60/0x9c
              check_preempt_curr+0x70/0xdc
              wake_up_new_task+0x310/0x470
              _do_fork+0x188/0x78c
              SyS_clone+0x44/0x50
              __sys_trace_return+0x0/0x4
      
       other info that might help us debug this:
      
       Chain exists of:
         (console_sem).lock --> &p->pi_lock --> &rq->lock
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(&rq->lock);
                                      lock(&p->pi_lock);
                                      lock(&rq->lock);
         lock((console_sem).lock);
      
        *** DEADLOCK ***
      
       2 locks held by dynamic_debug01/1873:
        #0:  (&p->pi_lock){-.-.}, at: [<000000001366df53>] wake_up_new_task+0x40/0x470
        #1:  (&rq->lock){-.-.}, at: [<00000000842e1587>] __task_rq_lock+0x54/0xdc
      
       stack backtrace:
       CPU: 10 PID: 1873 Comm: dynamic_debug01 Tainted: G        W        4.15.0+ #1
       Hardware name: GIGABYTE R120-T34-00/MT30-GS2-00, BIOS T48 10/02/2017
       Call trace:
        dump_backtrace+0x0/0x188
        show_stack+0x24/0x2c
        dump_stack+0xa4/0xe0
        print_circular_bug.isra.31+0x29c/0x2b8
        check_prev_add.constprop.39+0x6c8/0x6dc
        validate_chain.isra.34+0x6e4/0xa20
        __lock_acquire+0x3b4/0x6e0
        lock_acquire+0xf4/0x2a8
        _raw_spin_lock_irqsave+0x58/0x70
        down_trylock+0x20/0x4c
        __down_trylock_console_sem+0x3c/0x9c
        console_trylock+0x20/0xb0
        vprintk_emit+0x254/0x390
        vprintk_default+0x58/0x90
        vprintk_func+0xbc/0x164
        printk+0x80/0xa0
        __dynamic_pr_debug+0x84/0xac
        gic_raise_softirq+0x184/0x18c
        smp_cross_call+0xac/0x218
        smp_send_reschedule+0x3c/0x48
        resched_curr+0x60/0x9c
        check_preempt_curr+0x70/0xdc
        wake_up_new_task+0x310/0x470
        _do_fork+0x188/0x78c
        SyS_clone+0x44/0x50
        __sys_trace_return+0x0/0x4
       GICv3: CPU0: ICC_SGI1R_EL1 12000
      
      This could be fixed with printk_deferred() but that might lessen its
      usefulness for debugging. So change it to pr_devel to keep it out of
      production kernels. Developers working on gic-v3 can enable it as
      needed in their kernels.
      Signed-off-by: default avatarMark Salter <msalter@redhat.com>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5fa8ed82
    • Michael Kelley's avatar
      cpumask: Make for_each_cpu_wrap() available on UP as well · 31710e63
      Michael Kelley authored
      [ Upstream commit d207af2e ]
      
      for_each_cpu_wrap() was originally added in the #else half of a
      large "#if NR_CPUS == 1" statement, but was omitted in the #if
      half.  This patch adds the missing #if half to prevent compile
      errors when NR_CPUS is 1.
      Reported-by: default avatarkbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: default avatarMichael Kelley <mhkelley@outlook.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kys@microsoft.com
      Cc: martin.petersen@oracle.com
      Cc: mikelley@microsoft.com
      Fixes: c743f0a5 ("sched/fair, cpumask: Export for_each_cpu_wrap()")
      Link: http://lkml.kernel.org/r/SN6PR1901MB2045F087F59450507D4FCC17CBF50@SN6PR1901MB2045.namprd19.prod.outlook.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      31710e63
    • Stephen Boyd's avatar
      irqchip/gic-v3: Ignore disabled ITS nodes · 7f409f15
      Stephen Boyd authored
      [ Upstream commit 95a25625 ]
      
      On some platforms there's an ITS available but it's not enabled
      because reading or writing the registers is denied by the
      firmware. In fact, reading or writing them will cause the system
      to reset. We could remove the node from DT in such a case, but
      it's better to skip nodes that are marked as "disabled" in DT so
      that we can describe the hardware that exists and use the status
      property to indicate how the firmware has configured things.
      
      Cc: Stuart Yoder <stuyoder@gmail.com>
      Cc: Laurentiu Tudor <laurentiu.tudor@nxp.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Rajendra Nayak <rnayak@codeaurora.org>
      Signed-off-by: default avatarStephen Boyd <sboyd@codeaurora.org>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7f409f15
    • Will Deacon's avatar
      locking/qspinlock: Ensure node->count is updated before initialising node · c8723cee
      Will Deacon authored
      [ Upstream commit 11dc1322 ]
      
      When queuing on the qspinlock, the count field for the current CPU's head
      node is incremented. This needn't be atomic because locking in e.g. IRQ
      context is balanced and so an IRQ will return with node->count as it
      found it.
      
      However, the compiler could in theory reorder the initialisation of
      node[idx] before the increment of the head node->count, causing an
      IRQ to overwrite the initialised node and potentially corrupt the lock
      state.
      
      Avoid the potential for this harmful compiler reordering by placing a
      barrier() between the increment of the head node->count and the subsequent
      node initialisation.
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1518528177-19169-3-git-send-email-will.deacon@arm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c8723cee
    • Jia Zhang's avatar
      vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page · 059befd4
      Jia Zhang authored
      [ Upstream commit 595dd46e ]
      
      Commit:
      
        df04abfd ("fs/proc/kcore.c: Add bounce buffer for ktext data")
      
      ... introduced a bounce buffer to work around CONFIG_HARDENED_USERCOPY=y.
      However, accessing the vsyscall user page will cause an SMAP fault.
      
      Replace memcpy() with copy_from_user() to fix this bug works, but adding
      a common way to handle this sort of user page may be useful for future.
      
      Currently, only vsyscall page requires KCORE_USER.
      Signed-off-by: default avatarJia Zhang <zhang.jia@linux.alibaba.com>
      Reviewed-by: default avatarJiri Olsa <jolsa@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jolsa@redhat.com
      Link: http://lkml.kernel.org/r/1518446694-21124-2-git-send-email-zhang.jia@linux.alibaba.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      059befd4
    • Daniel Borkmann's avatar
      bpf: fix rlimit in reuseport net selftest · 517fbc77
      Daniel Borkmann authored
      [ Upstream commit 941ff6f1 ]
      
      Fix two issues in the reuseport_bpf selftests that were
      reported by Linaro CI:
      
        [...]
        + ./reuseport_bpf
        ---- IPv4 UDP ----
        Testing EBPF mod 10...
        Reprograming, testing mod 5...
        ./reuseport_bpf: ebpf error. log:
        0: (bf) r6 = r1
        1: (20) r0 = *(u32 *)skb[0]
        2: (97) r0 %= 10
        3: (95) exit
        processed 4 insns
        : Operation not permitted
        + echo FAIL
        [...]
        ---- IPv4 TCP ----
        Testing EBPF mod 10...
        ./reuseport_bpf: failed to bind send socket: Address already in use
        + echo FAIL
        [...]
      
      For the former adjust rlimit since this was the cause of
      failure for loading the BPF prog, and for the latter add
      SO_REUSEADDR.
      Reported-by: default avatarNaresh Kamboju <naresh.kamboju@linaro.org>
      Link: https://bugs.linaro.org/show_bug.cgi?id=3502Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      517fbc77
    • Jesper Dangaard Brouer's avatar
      tools/libbpf: handle issues with bpf ELF objects containing .eh_frames · a7f9a7eb
      Jesper Dangaard Brouer authored
      [ Upstream commit e3d91b0c ]
      
      V3: More generic skipping of relo-section (suggested by Daniel)
      
      If clang >= 4.0.1 is missing the option '-target bpf', it will cause
      llc/llvm to create two ELF sections for "Exception Frames", with
      section names '.eh_frame' and '.rel.eh_frame'.
      
      The BPF ELF loader library libbpf fails when loading files with these
      sections.  The other in-kernel BPF ELF loader in samples/bpf/bpf_load.c,
      handle this gracefully. And iproute2 loader also seems to work with these
      "eh" sections.
      
      The issue in libbpf is caused by bpf_object__elf_collect() skipping
      some sections, and later when performing relocation it will be
      pointing to a skipped section, as these sections cannot be found by
      bpf_object__find_prog_by_idx() in bpf_object__collect_reloc().
      
      This is a general issue that also occurs for other sections, like
      debug sections which are also skipped and can have relo section.
      
      As suggested by Daniel.  To avoid keeping state about all skipped
      sections, instead perform a direct qlookup in the ELF object.  Lookup
      the section that the relo-section points to and check if it contains
      executable machine instructions (denoted by the sh_flags
      SHF_EXECINSTR).  Use this check to also skip irrelevant relo-sections.
      
      Note, for samples/bpf/ the '-target bpf' parameter to clang cannot be used
      due to incompatibility with asm embedded headers, that some of the samples
      include. This is explained in more details by Yonghong Song in bpf_devel_QA.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a7f9a7eb
    • Tang Junhui's avatar
      bcache: return attach error when no cache set exist · d4008f81
      Tang Junhui authored
      [ Upstream commit 7f4fc93d ]
      
      I attach a back-end device to a cache set, and the cache set is not
      registered yet, this back-end device did not attach successfully, and no
      error returned:
      [root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach
      [root]#
      
      In sysfs_attach(), the return value "v" is initialized to "size" in
      the beginning, and if no cache set exist in bch_cache_sets, the "v" value
      would not change any more, and return to sysfs, sysfs regard it as success
      since the "size" is a positive number.
      
      This patch fixes this issue by assigning "v" with "-ENOENT" in the
      initialization.
      Signed-off-by: default avatarTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: default avatarMichael Lyle <mlyle@lyle.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4008f81
    • Tang Junhui's avatar
      bcache: fix for data collapse after re-attaching an attached device · 0d5da312
      Tang Junhui authored
      [ Upstream commit 73ac105b ]
      
      back-end device sdm has already attached a cache_set with ID
      f67ebe1f-f8bc-4d73-bfe5-9dc88607f119, then try to attach with
      another cache set, and it returns with an error:
      [root]# cd /sys/block/sdm/bcache
      [root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach
      -bash: echo: write error: Invalid argument
      
      After that, execute a command to modify the label of bcache
      device:
      [root]# echo data_disk1 > label
      
      Then we reboot the system, when the system power on, the back-end
      device can not attach to cache_set, a messages show in the log:
      Feb  5 12:05:52 ceph152 kernel: [922385.508498] bcache:
      bch_cached_dev_attach() couldn't find uuid for sdm in set
      
      In sysfs_attach(), dc->sb.set_uuid was assigned to the value
      which input through sysfs, no matter whether it is success
      or not in bch_cached_dev_attach(). For example, If the back-end
      device has already attached to an cache set, bch_cached_dev_attach()
      would fail, but dc->sb.set_uuid was changed. Then modify the
      label of bcache device, it will call bch_write_bdev_super(),
      which would write the dc->sb.set_uuid to the super block, so we
      record a wrong cache set ID in the super block, after the system
      reboot, the cache set couldn't find the uuid of the back-end
      device, so the bcache device couldn't exist and use any more.
      
      In this patch, we don't assigned cache set ID to dc->sb.set_uuid
      in sysfs_attach() directly, but input it into bch_cached_dev_attach(),
      and assigned dc->sb.set_uuid to the cache set ID after the back-end
      device attached to the cache set successful.
      Signed-off-by: default avatarTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: default avatarMichael Lyle <mlyle@lyle.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0d5da312
    • Tang Junhui's avatar
      bcache: fix for allocator and register thread race · d26dcc05
      Tang Junhui authored
      [ Upstream commit 682811b3 ]
      
      After long time running of random small IO writing,
      I reboot the machine, and after the machine power on,
      I found bcache got stuck, the stack is:
      [root@ceph153 ~]# cat /proc/2510/task/*/stack
      [<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache]
      [<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache]
      [<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache]
      [<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache]
      [<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
      [<ffffffff810a631f>] kthread+0xcf/0xe0
      [<ffffffff8164c318>] ret_from_fork+0x58/0x90
      [<ffffffffffffffff>] 0xffffffffffffffff
      [root@ceph153 ~]# cat /proc/2038/task/*/stack
      [<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
      [<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache]
      [<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache]
      [<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache]
      [<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache]
      [<ffffffff812f702f>] kobj_attr_store+0xf/0x20
      [<ffffffff8125b216>] sysfs_write_file+0xc6/0x140
      [<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0
      [<ffffffff811e069f>] SyS_write+0x7f/0xe0
      [<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1
      The stack shows the register thread and allocator thread
      were getting stuck when registering cache device.
      
      I reboot the machine several times, the issue always
      exsit in this machine.
      
      I debug the code, and found the call trace as bellow:
      register_bcache()
         ==>run_cache_set()
            ==>bch_journal_replay()
               ==>bch_btree_insert()
                  ==>__bch_btree_map_nodes()
                     ==>btree_insert_fn()
                        ==>btree_split() //node need split
                           ==>btree_check_reserve()
      In btree_check_reserve(), It will check if there is enough buckets
      of RESERVE_BTREE type, since allocator thread did not work yet, so
      no buckets of RESERVE_BTREE type allocated, so the register thread
      waits on c->btree_cache_wait, and goes to sleep.
      
      Then the allocator thread initialized, the call trace is bellow:
      bch_allocator_thread()
      ==>bch_prio_write()
         ==>bch_journal_meta()
            ==>bch_journal()
               ==>journal_wait_for_write()
      In journal_wait_for_write(), It will check if journal is full by
      journal_full(), but the long time random small IO writing
      causes the exhaustion of journal buckets(journal.blocks_free=0),
      In order to release the journal buckets,
      the allocator calls btree_flush_write() to flush keys to
      btree nodes, and waits on c->journal.wait until btree nodes writing
      over or there has already some journal buckets space, then the
      allocator thread goes to sleep. but in btree_flush_write(), since
      bch_journal_replay() is not finished, so no btree nodes have journal
      (condition "if (btree_current_write(b)->journal)" never satisfied),
      so we got no btree node to flush, no journal bucket released,
      and allocator sleep all the times.
      
      Through the above analysis, we can see that:
      1) Register thread wait for allocator thread to allocate buckets of
         RESERVE_BTREE type;
      2) Alloctor thread wait for register thread to replay journal, so it
         can flush btree nodes and get journal bucket.
         then they are all got stuck by waiting for each other.
      
      Hua Rui provided a patch for me, by allocating some buckets of
      RESERVE_BTREE type in advance, so the register thread can get bucket
      when btree node splitting and no need to waiting for the allocator
      thread. I tested it, it has effect, and register thread run a step
      forward, but finally are still got stuck, the reason is only 8 bucket
      of RESERVE_BTREE type were allocated, and in bch_journal_replay(),
      after 2 btree nodes splitting, only 4 bucket of RESERVE_BTREE type left,
      then btree_check_reserve() is not satisfied anymore, so it goes to sleep
      again, and in the same time, alloctor thread did not flush enough btree
      nodes to release a journal bucket, so they all got stuck again.
      
      So we need to allocate more buckets of RESERVE_BTREE type in advance,
      but how much is enough?  By experience and test, I think it should be
      as much as journal buckets. Then I modify the code as this patch,
      and test in the machine, and it works.
      
      This patch modified base on Hua Rui’s patch, and allocate more buckets
      of RESERVE_BTREE type in advance to avoid register thread and allocate
      thread going to wait for each other.
      
      [patch v2] ca->sb.njournal_buckets would be 0 in the first time after
      cache creation, and no journal exists, so just 8 btree buckets is OK.
      Signed-off-by: default avatarHua Rui <huarui.dev@gmail.com>
      Signed-off-by: default avatarTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: default avatarMichael Lyle <mlyle@lyle.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d26dcc05
    • Coly Li's avatar
      bcache: properly set task state in bch_writeback_thread() · ee6fcd83
      Coly Li authored
      [ Upstream commit 99361bbf ]
      
      Kernel thread routine bch_writeback_thread() has the following code block,
      
      447         down_write(&dc->writeback_lock);
      448~450     if (check conditions) {
      451                 up_write(&dc->writeback_lock);
      452                 set_current_state(TASK_INTERRUPTIBLE);
      453
      454                 if (kthread_should_stop())
      455                         return 0;
      456
      457                 schedule();
      458                 continue;
      459         }
      
      If condition check is true, its task state is set to TASK_INTERRUPTIBLE
      and call schedule() to wait for others to wake up it.
      
      There are 2 issues in current code,
      1, Task state is set to TASK_INTERRUPTIBLE after the condition checks, if
         another process changes the condition and call wake_up_process(dc->
         writeback_thread), then at line 452 task state is set back to
         TASK_INTERRUPTIBLE, the writeback kernel thread will lose a chance to be
         waken up.
      2, At line 454 if kthread_should_stop() is true, writeback kernel thread
         will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and
         call do_exit(). It is not good to enter do_exit() with task state
         TASK_INTERRUPTIBLE, in following code path might_sleep() is called and a
         warning message is reported by __might_sleep(): "WARNING: do not call
         blocking ops when !TASK_RUNNING; state=1 set at [xxxx]".
      
      For the first issue, task state should be set before condition checks.
      Ineed because dc->writeback_lock is required when modifying all the
      conditions, calling set_current_state() inside code block where dc->
      writeback_lock is hold is safe. But this is quite implicit, so I still move
      set_current_state() before all the condition checks.
      
      For the second issue, frankley speaking it does not hurt when kernel thread
      exits with TASK_INTERRUPTIBLE state, but this warning message scares users,
      makes them feel there might be something risky with bcache and hurt their
      data.  Setting task state to TASK_RUNNING before returning fixes this
      problem.
      
      In alloc.c:allocator_wait(), there is also a similar issue, and is also
      fixed in this patch.
      
      Changelog:
      v3: merge two similar fixes into one patch
      v2: fix the race issue in v1 patch.
      v1: initial buggy fix.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      Reviewed-by: default avatarMichael Lyle <mlyle@lyle.org>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ee6fcd83
    • Arnd Bergmann's avatar
      cifs: silence compiler warnings showing up with gcc-8.0.0 · 4bf53b51
      Arnd Bergmann authored
      [ Upstream commit ade7db99 ]
      
      This bug was fixed before, but came up again with the latest
      compiler in another function:
      
      fs/cifs/cifssmb.c: In function 'CIFSSMBSetEA':
      fs/cifs/cifssmb.c:6362:3: error: 'strncpy' offset 8 is out of the bounds [0, 4] [-Werror=array-bounds]
         strncpy(parm_data->list[0].name, ea_name, name_len);
      
      Let's apply the same fix that was used for the other instances.
      
      Fixes: b2a3ad9c ("cifs: silence compiler warnings showing up with gcc-4.7.0")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4bf53b51
    • Alexey Dobriyan's avatar
      proc: fix /proc/*/map_files lookup · e0a1a017
      Alexey Dobriyan authored
      [ Upstream commit ac7f1061 ]
      
      Current code does:
      
      	if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2)
      
      However sscanf() is broken garbage.
      
      It silently accepts whitespace between format specifiers
      (did you know that?).
      
      It silently accepts valid strings which result in integer overflow.
      
      Do not use sscanf() for any even remotely reliable parsing code.
      
      	OK
      	# readlink '/proc/1/map_files/55a23af39000-55a23b05b000'
      	/lib/systemd/systemd
      
      	broken
      	# readlink '/proc/1/map_files/               55a23af39000-55a23b05b000'
      	/lib/systemd/systemd
      
      	broken
      	# readlink '/proc/1/map_files/55a23af39000-55a23b05b000    '
      	/lib/systemd/systemd
      
      	very broken
      	# readlink '/proc/1/map_files/1000000000000000055a23af39000-55a23b05b000'
      	/lib/systemd/systemd
      
      Andrei said:
      
      : This patch breaks criu.  It was a bug in criu.  And this bug is on a minor
      : path, which works when memfd_create() isn't available.  It is a reason why
      : I ask to not backport this patch to stable kernels.
      :
      : In CRIU this bug can be triggered, only if this patch will be backported
      : to a kernel which version is lower than v3.16.
      
      Link: http://lkml.kernel.org/r/20171120212706.GA14325@avx2Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e0a1a017
    • Will Deacon's avatar
      arm64: spinlock: Fix theoretical trylock() A-B-A with LSE atomics · 0675ec13
      Will Deacon authored
      [ Upstream commit 202fb4ef ]
      
      If the spinlock "next" ticket wraps around between the initial LDR
      and the cmpxchg in the LSE version of spin_trylock, then we can erroneously
      think that we have successfuly acquired the lock because we only check
      whether the next ticket return by the cmpxchg is equal to the owner ticket
      in our updated lock word.
      
      This patch fixes the issue by performing a full 32-bit check of the lock
      word when trying to determine whether or not the CASA instruction updated
      memory.
      Reported-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0675ec13
    • Guanglei Li's avatar
      RDS: IB: Fix null pointer issue · a0138dc3
      Guanglei Li authored
      [ Upstream commit 2c0aa086 ]
      
      Scenario:
      1. Port down and do fail over
      2. Ap do rds_bind syscall
      
      PID: 47039  TASK: ffff89887e2fe640  CPU: 47  COMMAND: "kworker/u:6"
       #0 [ffff898e35f159f0] machine_kexec at ffffffff8103abf9
       #1 [ffff898e35f15a60] crash_kexec at ffffffff810b96e3
       #2 [ffff898e35f15b30] oops_end at ffffffff8150f518
       #3 [ffff898e35f15b60] no_context at ffffffff8104854c
       #4 [ffff898e35f15ba0] __bad_area_nosemaphore at ffffffff81048675
       #5 [ffff898e35f15bf0] bad_area_nosemaphore at ffffffff810487d3
       #6 [ffff898e35f15c00] do_page_fault at ffffffff815120b8
       #7 [ffff898e35f15d10] page_fault at ffffffff8150ea95
          [exception RIP: unknown or invalid address]
          RIP: 0000000000000000  RSP: ffff898e35f15dc8  RFLAGS: 00010282
          RAX: 00000000fffffffe  RBX: ffff889b77f6fc00  RCX:ffffffff81c99d88
          RDX: 0000000000000000  RSI: ffff896019ee08e8  RDI:ffff889b77f6fc00
          RBP: ffff898e35f15df0   R8: ffff896019ee08c8  R9:0000000000000000
          R10: 0000000000000400  R11: 0000000000000000  R12:ffff896019ee08c0
          R13: ffff889b77f6fe68  R14: ffffffff81c99d80  R15: ffffffffa022a1e0
          ORIG_RAX: ffffffffffffffff  CS: 0010 SS: 0018
       #8 [ffff898e35f15dc8] cma_ndev_work_handler at ffffffffa022a228 [rdma_cm]
       #9 [ffff898e35f15df8] process_one_work at ffffffff8108a7c6
       #10 [ffff898e35f15e58] worker_thread at ffffffff8108bda0
       #11 [ffff898e35f15ee8] kthread at ffffffff81090fe6
      
      PID: 45659  TASK: ffff880d313d2500  CPU: 31  COMMAND: "oracle_45659_ap"
       #0 [ffff881024ccfc98] __schedule at ffffffff8150bac4
       #1 [ffff881024ccfd40] schedule at ffffffff8150c2cf
       #2 [ffff881024ccfd50] __mutex_lock_slowpath at ffffffff8150cee7
       #3 [ffff881024ccfdc0] mutex_lock at ffffffff8150cdeb
       #4 [ffff881024ccfde0] rdma_destroy_id at ffffffffa022a027 [rdma_cm]
       #5 [ffff881024ccfe10] rds_ib_laddr_check at ffffffffa0357857 [rds_rdma]
       #6 [ffff881024ccfe50] rds_trans_get_preferred at ffffffffa0324c2a [rds]
       #7 [ffff881024ccfe80] rds_bind at ffffffffa031d690 [rds]
       #8 [ffff881024ccfeb0] sys_bind at ffffffff8142a670
      
      PID: 45659                          PID: 47039
      rds_ib_laddr_check
        /* create id_priv with a null event_handler */
        rdma_create_id
        rdma_bind_addr
          cma_acquire_dev
            /* add id_priv to cma_dev->id_list */
            cma_attach_to_dev
                                          cma_ndev_work_handler
                                            /* event_hanlder is null */
                                            id_priv->id.event_handler
      Signed-off-by: default avatarGuanglei Li <guanglei.li@oracle.com>
      Signed-off-by: default avatarHonglei Wang <honglei.wang@oracle.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: default avatarYanjun Zhu <yanjun.zhu@oracle.com>
      Reviewed-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Acked-by: default avatarSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Acked-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a0138dc3
    • Ross Lagerwall's avatar
      xen/grant-table: Use put_page instead of free_page · 240ef711
      Ross Lagerwall authored
      [ Upstream commit 3ac7292a ]
      
      The page given to gnttab_end_foreign_access() to free could be a
      compound page so use put_page() instead of free_page() since it can
      handle both compound and single pages correctly.
      
      This bug was discovered when migrating a Xen VM with several VIFs and
      CONFIG_DEBUG_VM enabled. It hits a BUG usually after fewer than 10
      iterations. All netfront devices disconnect from the backend during a
      suspend/resume and this will call gnttab_end_foreign_access() if a
      netfront queue has an outstanding skb. The mismatch between calling
      get_page() and free_page() on a compound page causes a reference
      counting error which is detected when DEBUG_VM is enabled.
      Signed-off-by: default avatarRoss Lagerwall <ross.lagerwall@citrix.com>
      Reviewed-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: default avatarJuergen Gross <jgross@suse.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      240ef711
    • Ross Lagerwall's avatar
      xen-netfront: Fix race between device setup and open · ca3108cd
      Ross Lagerwall authored
      [ Upstream commit f599c64f ]
      
      When a netfront device is set up it registers a netdev fairly early on,
      before it has set up the queues and is actually usable. A userspace tool
      like NetworkManager will immediately try to open it and access its state
      as soon as it appears. The bug can be reproduced by hotplugging VIFs
      until the VM runs out of grant refs. It registers the netdev but fails
      to set up any queues (since there are no more grant refs). In the
      meantime, NetworkManager opens the device and the kernel crashes trying
      to access the queues (of which there are none).
      
      Fix this in two ways:
      * For initial setup, register the netdev much later, after the queues
      are setup. This avoids the race entirely.
      * During a suspend/resume cycle, the frontend reconnects to the backend
      and the queues are recreated. It is possible (though highly unlikely) to
      race with something opening the device and accessing the queues after
      they have been destroyed but before they have been recreated. Extend the
      region covered by the rtnl semaphore to protect against this race. There
      is a possibility that we fail to recreate the queues so check for this
      in the open function.
      Signed-off-by: default avatarRoss Lagerwall <ross.lagerwall@citrix.com>
      Reviewed-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: default avatarJuergen Gross <jgross@suse.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ca3108cd
    • Matt Redfearn's avatar
      MIPS: TXx9: use IS_BUILTIN() for CONFIG_LEDS_CLASS · d6a4ef16
      Matt Redfearn authored
      [ Upstream commit 0cde5b44 ]
      
      When commit b27311e1 ("MIPS: TXx9: Add RBTX4939 board support")
      added board support for the RBTX4939, it added a call to
      led_classdev_register even if the LED class is built as a module.
      Built-in arch code cannot call module code directly like this. Commit
      b33b4407 ("MIPS: TXX9: use IS_ENABLED() macro") subsequently
      changed the inclusion of this code to a single check that
      CONFIG_LEDS_CLASS is either builtin or a module, but the same issue
      remains.
      
      This leads to MIPS allmodconfig builds failing when CONFIG_MACH_TX49XX=y
      is set:
      
      arch/mips/txx9/rbtx4939/setup.o: In function `rbtx4939_led_probe':
      setup.c:(.init.text+0xc0): undefined reference to `of_led_classdev_register'
      make: *** [Makefile:999: vmlinux] Error 1
      
      Fix this by using the IS_BUILTIN() macro instead.
      
      Fixes: b27311e1 ("MIPS: TXx9: Add RBTX4939 board support")
      Signed-off-by: default avatarMatt Redfearn <matt.redfearn@mips.com>
      Reviewed-by: default avatarJames Hogan <jhogan@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/18544/Signed-off-by: default avatarJames Hogan <jhogan@kernel.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d6a4ef16
    • James Hogan's avatar
      MIPS: generic: Fix machine compatible matching · 51b896a8
      James Hogan authored
      [ Upstream commit 9a9ab307 ]
      
      We now have a platform (Ranchu) in the "generic" platform which matches
      based on the FDT compatible string using mips_machine_is_compatible(),
      however that function doesn't stop at a blank struct
      of_device_id::compatible as that is an array in the struct, not a
      pointer to a string.
      
      Fix the loop completion to check the first byte of the compatible array
      rather than the address of the compatible array in the struct.
      
      Fixes: eed0eabd ("MIPS: generic: Introduce generic DT-based board support")
      Signed-off-by: default avatarJames Hogan <jhogan@kernel.org>
      Reviewed-by: default avatarPaul Burton <paul.burton@mips.com>
      Reviewed-by: default avatarMatt Redfearn <matt.redfearn@mips.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/18580/Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      51b896a8
    • Yonghong Song's avatar
      bpf: fix selftests/bpf test_kmod.sh failure when CONFIG_BPF_JIT_ALWAYS_ON=y · ee4bba56
      Yonghong Song authored
      [ Upstream commit 09584b40 ]
      
      With CONFIG_BPF_JIT_ALWAYS_ON is defined in the config file,
      tools/testing/selftests/bpf/test_kmod.sh failed like below:
        [root@localhost bpf]# ./test_kmod.sh
        sysctl: setting key "net.core.bpf_jit_enable": Invalid argument
        [ JIT enabled:0 hardened:0 ]
        [  132.175681] test_bpf: #297 BPF_MAXINSNS: Jump, gap, jump, ... FAIL to prog_create err=-524 len=4096
        [  132.458834] test_bpf: Summary: 348 PASSED, 1 FAILED, [340/340 JIT'ed]
        [ JIT enabled:1 hardened:0 ]
        [  133.456025] test_bpf: #297 BPF_MAXINSNS: Jump, gap, jump, ... FAIL to prog_create err=-524 len=4096
        [  133.730935] test_bpf: Summary: 348 PASSED, 1 FAILED, [340/340 JIT'ed]
        [ JIT enabled:1 hardened:1 ]
        [  134.769730] test_bpf: #297 BPF_MAXINSNS: Jump, gap, jump, ... FAIL to prog_create err=-524 len=4096
        [  135.050864] test_bpf: Summary: 348 PASSED, 1 FAILED, [340/340 JIT'ed]
        [ JIT enabled:1 hardened:2 ]
        [  136.442882] test_bpf: #297 BPF_MAXINSNS: Jump, gap, jump, ... FAIL to prog_create err=-524 len=4096
        [  136.821810] test_bpf: Summary: 348 PASSED, 1 FAILED, [340/340 JIT'ed]
        [root@localhost bpf]#
      
      The test_kmod.sh load/remove test_bpf.ko multiple times with different
      settings for sysctl net.core.bpf_jit_{enable,harden}. The failed test #297
      of test_bpf.ko is designed such that JIT always fails.
      
      Commit 290af866 (bpf: introduce BPF_JIT_ALWAYS_ON config)
      introduced the following tightening logic:
          ...
              if (!bpf_prog_is_dev_bound(fp->aux)) {
                      fp = bpf_int_jit_compile(fp);
          #ifdef CONFIG_BPF_JIT_ALWAYS_ON
                      if (!fp->jited) {
                              *err = -ENOTSUPP;
                              return fp;
                      }
          #endif
          ...
      With this logic, Test #297 always gets return value -ENOTSUPP
      when CONFIG_BPF_JIT_ALWAYS_ON is defined, causing the test failure.
      
      This patch fixed the failure by marking Test #297 as expected failure
      when CONFIG_BPF_JIT_ALWAYS_ON is defined.
      
      Fixes: 290af866 (bpf: introduce BPF_JIT_ALWAYS_ON config)
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ee4bba56
    • Hans de Goede's avatar
      ACPI / scan: Use acpi_bus_get_status() to initialize ACPI_TYPE_DEVICE devs · cbaf06cc
      Hans de Goede authored
      [ Upstream commit 63347db0 ]
      
      The acpi_get_bus_status wrapper for acpi_bus_get_status_handle has some
      code to handle certain device quirks, in some cases we also need this
      quirk handling for the initial _STA call.
      
      Specifically on some devices calling _STA before all _DEP dependencies
      are met results in errors like these:
      
      [    0.123579] ACPI Error: No handler for Region [ECRM] (00000000ba9edc4c)
                     [GenericSerialBus] (20170831/evregion-166)
      [    0.123601] ACPI Error: Region GenericSerialBus (ID=9) has no handler
                     (20170831/exfldio-299)
      [    0.123618] ACPI Error: Method parse/execution failed
                     \_SB.I2C1.BAT1._STA, AE_NOT_EXIST (20170831/psparse-550)
      
      acpi_get_bus_status already has code to avoid this, so by using it we
      also silence these errors from the initial _STA call.
      
      Note that in order for the acpi_get_bus_status handling for this to work,
      we initialize dep_unmet to 1 until acpi_device_dep_initialize gets called,
      this means that battery devices will be instantiated with an initial
      status of 0. This is not a problem, acpi_bus_attach will get called soon
      after the instantiation anyways and it will update the status as first
      point of order.
      Signed-off-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cbaf06cc
    • Chen Yu's avatar
      ACPI: processor_perflib: Do not send _PPC change notification if not ready · 9a18bac1
      Chen Yu authored
      [ Upstream commit ba1edb9a ]
      
      The following warning was triggered after resumed from S3 -
      if all the nonboot CPUs were put offline before suspend:
      
      [ 1840.329515] unchecked MSR access error: RDMSR from 0x771 at rIP: 0xffffffff86061e3a (native_read_msr+0xa/0x30)
      [ 1840.329516] Call Trace:
      [ 1840.329521]  __rdmsr_on_cpu+0x33/0x50
      [ 1840.329525]  generic_exec_single+0x81/0xb0
      [ 1840.329527]  smp_call_function_single+0xd2/0x100
      [ 1840.329530]  ? acpi_ds_result_pop+0xdd/0xf2
      [ 1840.329532]  ? acpi_ds_create_operand+0x215/0x23c
      [ 1840.329534]  rdmsrl_on_cpu+0x57/0x80
      [ 1840.329536]  ? cpumask_next+0x1b/0x20
      [ 1840.329538]  ? rdmsrl_on_cpu+0x57/0x80
      [ 1840.329541]  intel_pstate_update_perf_limits+0xf3/0x220
      [ 1840.329544]  ? notifier_call_chain+0x4a/0x70
      [ 1840.329546]  intel_pstate_set_policy+0x4e/0x150
      [ 1840.329548]  cpufreq_set_policy+0xcd/0x2f0
      [ 1840.329550]  cpufreq_update_policy+0xb2/0x130
      [ 1840.329552]  ? cpufreq_update_policy+0x130/0x130
      [ 1840.329556]  acpi_processor_ppc_has_changed+0x65/0x80
      [ 1840.329558]  acpi_processor_notify+0x80/0x100
      [ 1840.329561]  acpi_ev_notify_dispatch+0x44/0x5c
      [ 1840.329563]  acpi_os_execute_deferred+0x14/0x20
      [ 1840.329565]  process_one_work+0x193/0x3c0
      [ 1840.329567]  worker_thread+0x35/0x3b0
      [ 1840.329569]  kthread+0x125/0x140
      [ 1840.329571]  ? process_one_work+0x3c0/0x3c0
      [ 1840.329572]  ? kthread_park+0x60/0x60
      [ 1840.329575]  ? do_syscall_64+0x67/0x180
      [ 1840.329577]  ret_from_fork+0x25/0x30
      [ 1840.329585] unchecked MSR access error: WRMSR to 0x774 (tried to write 0x0000000000000000) at rIP: 0xffffffff86061f78 (native_write_msr+0x8/0x30)
      [ 1840.329586] Call Trace:
      [ 1840.329587]  __wrmsr_on_cpu+0x37/0x40
      [ 1840.329589]  generic_exec_single+0x81/0xb0
      [ 1840.329592]  smp_call_function_single+0xd2/0x100
      [ 1840.329594]  ? acpi_ds_create_operand+0x215/0x23c
      [ 1840.329595]  ? cpumask_next+0x1b/0x20
      [ 1840.329597]  wrmsrl_on_cpu+0x57/0x70
      [ 1840.329598]  ? rdmsrl_on_cpu+0x57/0x80
      [ 1840.329599]  ? wrmsrl_on_cpu+0x57/0x70
      [ 1840.329602]  intel_pstate_hwp_set+0xd3/0x150
      [ 1840.329604]  intel_pstate_set_policy+0x119/0x150
      [ 1840.329606]  cpufreq_set_policy+0xcd/0x2f0
      [ 1840.329607]  cpufreq_update_policy+0xb2/0x130
      [ 1840.329610]  ? cpufreq_update_policy+0x130/0x130
      [ 1840.329613]  acpi_processor_ppc_has_changed+0x65/0x80
      [ 1840.329615]  acpi_processor_notify+0x80/0x100
      [ 1840.329617]  acpi_ev_notify_dispatch+0x44/0x5c
      [ 1840.329619]  acpi_os_execute_deferred+0x14/0x20
      [ 1840.329620]  process_one_work+0x193/0x3c0
      [ 1840.329622]  worker_thread+0x35/0x3b0
      [ 1840.329624]  kthread+0x125/0x140
      [ 1840.329625]  ? process_one_work+0x3c0/0x3c0
      [ 1840.329626]  ? kthread_park+0x60/0x60
      [ 1840.329628]  ? do_syscall_64+0x67/0x180
      [ 1840.329631]  ret_from_fork+0x25/0x30
      
      This is because if there's only one online CPU, the MSR_PM_ENABLE
      (package wide)can not be enabled after resumed, due to
      intel_pstate_hwp_enable() will only be invoked on AP's online
      process after resumed - if there's no AP online, the HWP remains
      disabled after resumed (BIOS has disabled it in S3). Then if
      there comes a _PPC change notification which touches HWP register
      during this stage, the warning is triggered.
      
      Since we don't call acpi_processor_register_performance() when
      HWP is enabled, the pr->performance will be NULL. When this is
      NULL we don't need to do _PPC change notification.
      Reported-by: default avatarDoug Smythies <dsmythies@telus.net>
      Suggested-by: default avatarSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: default avatarYu Chen <yu.c.chen@intel.com>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9a18bac1
    • Jean Delvare's avatar
      firmware: dmi_scan: Fix handling of empty DMI strings · 6fdca0dc
      Jean Delvare authored
      [ Upstream commit a7770ae1 ]
      
      The handling of empty DMI strings looks quite broken to me:
      * Strings from 1 to 7 spaces are not considered empty.
      * True empty DMI strings (string index set to 0) are not considered
        empty, and result in allocating a 0-char string.
      * Strings with invalid index also result in allocating a 0-char
        string.
      * Strings starting with 8 spaces are all considered empty, even if
        non-space characters follow (sounds like a weird thing to do, but
        I have actually seen occurrences of this in DMI tables before.)
      * Strings which are considered empty are reported as 8 spaces,
        instead of being actually empty.
      
      Some of these issues are the result of an off-by-one error in memcmp,
      the rest is incorrect by design.
      
      So let's get it square: missing strings and strings made of only
      spaces, regardless of their length, should be treated as empty and
      no memory should be allocated for them. All other strings are
      non-empty and should be allocated.
      Signed-off-by: default avatarJean Delvare <jdelvare@suse.de>
      Fixes: 79da4721 ("x86: fix DMI out of memory problems")
      Cc: Parag Warudkar <parag.warudkar@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6fdca0dc
    • Arnd Bergmann's avatar
      x86/power: Fix swsusp_arch_resume prototype · b2e949bf
      Arnd Bergmann authored
      [ Upstream commit 328008a7 ]
      
      The declaration for swsusp_arch_resume marks it as 'asmlinkage', but the
      definition in x86-32 does not, and it fails to include the header with the
      declaration. This leads to a warning when building with
      link-time-optimizations:
      
      kernel/power/power.h:108:23: error: type of 'swsusp_arch_resume' does not match original declaration [-Werror=lto-type-mismatch]
       extern asmlinkage int swsusp_arch_resume(void);
                             ^
      arch/x86/power/hibernate_32.c:148:0: note: 'swsusp_arch_resume' was previously declared here
       int swsusp_arch_resume(void)
      
      This moves the declaration into a globally visible header file and fixes up
      both x86 definitions to match it.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Nicolas Pitre <nico@linaro.org>
      Cc: linux-pm@vger.kernel.org
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Link: https://lkml.kernel.org/r/20180202145634.200291-2-arnd@arndb.deSigned-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2e949bf
    • Subash Abhinov Kasiviswanathan's avatar
      netfilter: ipv6: nf_defrag: Kill frag queue on RFC2460 failure · dd5968e8
      Subash Abhinov Kasiviswanathan authored
      [ Upstream commit ea23d5e3 ]
      
      Failures were seen in ICMPv6 fragmentation timeout tests if they were
      run after the RFC2460 failure tests. Kernel was not sending out the
      ICMPv6 fragment reassembly time exceeded packet after the fragmentation
      reassembly timeout of 1 minute had elapsed.
      
      This happened because the frag queue was not released if an error in
      IPv6 fragmentation header was detected by RFC2460.
      
      Fixes: 83f1999c ("netfilter: ipv6: nf_defrag: Pass on packets to stack per RFC2460")
      Signed-off-by: default avatarSubash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dd5968e8
    • Karol Herbst's avatar
      drm/nouveau/pmu/fuc: don't use movw directly anymore · e7bce211
      Karol Herbst authored
      [ Upstream commit fe9748b7 ]
      
      Fixes failure to compile with recent envyas as a result of the 'movw'
      alias being removed for v5.
      
      A bit of history:
      
      v3 only has a 16-bit sign-extended immediate mov op. In order to set
      the high bits, there's a separate 'sethi' op. envyas validates that
      the value passed to mov(imm) is between -0x8000 and 0x7fff. In order
      to simplify macros that load both the low and high word, a 'movw'
      alias was added which takes an unsigned 16-bit immediate. However the
      actual hardware op still sign extends.
      
      v5 has a full 32-bit immediate mov op. The v3 16-bit immediate mov op
      is gone (loads 0 into the dst reg). However due to a bug in envyas,
      the movw alias still existed, and selected the no-longer-present v3
      16-bit immediate mov op. As a result usage of movw on v5 is the same
      as mov with a 0x0 argument.
      
      The proper fix throughout is to only ever use the 'movw' alias in
      combination with 'sethi'. Anything else should get the sign-extended
      validation to ensure that the intended value ends up in the
      destination register.
      
      Changes in fuc3 binaries is the result of a different encoding being
      selected for a mov with an 8-bit value.
      
      v2: added commit message written by Ilia, thanks for that!
      v3: messed up rebasing, now it should apply
      Signed-off-by: default avatarKarol Herbst <kherbst@redhat.com>
      Signed-off-by: default avatarBen Skeggs <bskeggs@redhat.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e7bce211
    • Alex Estrin's avatar
      IB/ipoib: Fix for potential no-carrier state · e405d2eb
      Alex Estrin authored
      [ Upstream commit 10293610 ]
      
      On reboot SM can program port pkey table before ipoib registered its
      event handler, which could result in missing pkey event and leave root
      interface with initial pkey value from index 0.
      
      Since OPA port starts with invalid pkey in index 0, root interface will
      fail to initialize and stay down with no-carrier flag.
      
      For IB ipoib interface may end up with pkey different from value
      opensm put in pkey table idx 0, resulting in connectivity issues
      (different mcast groups, for example).
      
      Close the window by calling event handler after registration
      to make sure ipoib pkey is in sync with port pkey table.
      Reviewed-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Signed-off-by: default avatarAlex Estrin <alex.estrin@intel.com>
      Signed-off-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e405d2eb
    • Ed Swierk's avatar
      openvswitch: Remove padding from packet before L3+ conntrack processing · bfd188fb
      Ed Swierk authored
      [ Upstream commit 9382fe71 ]
      
      IPv4 and IPv6 packets may arrive with lower-layer padding that is not
      included in the L3 length. For example, a short IPv4 packet may have
      up to 6 bytes of padding following the IP payload when received on an
      Ethernet device with a minimum packet length of 64 bytes.
      
      Higher-layer processing functions in netfilter (e.g. nf_ip_checksum(),
      and help() in nf_conntrack_ftp) assume skb->len reflects the length of
      the L3 header and payload, rather than referring back to
      ip_hdr->tot_len or ipv6_hdr->payload_len, and get confused by
      lower-layer padding.
      
      In the normal IPv4 receive path, ip_rcv() trims the packet to
      ip_hdr->tot_len before invoking netfilter hooks. In the IPv6 receive
      path, ip6_rcv() does the same using ipv6_hdr->payload_len. Similarly
      in the br_netfilter receive path, br_validate_ipv4() and
      br_validate_ipv6() trim the packet to the L3 length before invoking
      netfilter hooks.
      
      Currently in the OVS conntrack receive path, ovs_ct_execute() pulls
      the skb to the L3 header but does not trim it to the L3 length before
      calling nf_conntrack_in(NF_INET_PRE_ROUTING). When
      nf_conntrack_proto_tcp encounters a packet with lower-layer padding,
      nf_ip_checksum() fails causing a "nf_ct_tcp: bad TCP checksum" log
      message. While extra zero bytes don't affect the checksum, the length
      in the IP pseudoheader does. That length is based on skb->len, and
      without trimming, it doesn't match the length the sender used when
      computing the checksum.
      
      In ovs_ct_execute(), trim the skb to the L3 length before higher-layer
      processing.
      Signed-off-by: default avatarEd Swierk <eswierk@skyportsystems.com>
      Acked-by: default avatarPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bfd188fb
    • shidao.ytt's avatar
      mm/fadvise: discard partial page if endbyte is also EOF · a8b21508
      shidao.ytt authored
      [ Upstream commit a7ab400d ]
      
      During our recent testing with fadvise(FADV_DONTNEED), we find that if
      given offset/length is not page-aligned, the last page will not be
      discarded.  The tool we use is vmtouch (https://hoytech.com/vmtouch/),
      we map a 10KB-sized file into memory and then try to run this tool to
      evict the whole file mapping, but the last single page always remains
      staying in the memory:
      
      $./vmtouch -e test_10K
                 Files: 1
           Directories: 0
         Evicted Pages: 3 (12K)
               Elapsed: 2.1e-05 seconds
      
      $./vmtouch test_10K
                 Files: 1
           Directories: 0
        Resident Pages: 1/3  4K/12K  33.3%
               Elapsed: 5.5e-05 seconds
      
      However when we test with an older kernel, say 3.10, this problem is
      gone.  So we wonder if this is a regression:
      
      $./vmtouch -e test_10K
                 Files: 1
           Directories: 0
         Evicted Pages: 3 (12K)
               Elapsed: 8.2e-05 seconds
      
      $./vmtouch test_10K
                 Files: 1
           Directories: 0
        Resident Pages: 0/3  0/12K  0%  <-- partial page also discarded
               Elapsed: 5e-05 seconds
      
      After digging a little bit into this problem, we find it seems not a
      regression.  Not discarding partial page is likely to be on purpose
      according to commit 441c228f ("mm: fadvise: document the
      fadvise(FADV_DONTNEED) behaviour for partial pages") written by Mel
      Gorman.  He explained why partial pages should be preserved instead of
      being discarded when using fadvise(FADV_DONTNEED).
      
      However, the interesting part is that the actual code did NOT work as
      the same as it was described, the partial page was still discarded
      anyway, due to a calculation mistake of `end_index' passed to
      invalidate_mapping_pages().  This mistake has not been fixed until
      recently, that's why we fail to reproduce our problem in old kernels.
      The fix is done in commit 18aba41c ("mm/fadvise.c: do not discard
      partial pages with POSIX_FADV_DONTNEED") by Oleg Drokin.
      
      Back to the original testing, our problem becomes that there is a
      special case that, if the page-unaligned `endbyte' is also the end of
      file, it is not necessary at all to preserve the last partial page, as
      we all know no one else will use the rest of it.  It should be safe
      enough if we just discard the whole page.  So we add an EOF check in
      this patch.
      
      We also find a poosbile real world issue in mainline kernel.  Assume
      such scenario: A userspace backup application want to backup a huge
      amount of small files (<4k) at once, the developer might (I guess) want
      to use fadvise(FADV_DONTNEED) to save memory.  However, FADV_DONTNEED
      won't really happen since the only page mapped is a partial page, and
      kernel will preserve it.  Our patch also fixes this problem, since we
      know the endbyte is EOF, so we discard it.
      
      Here is a simple reproducer to reproduce and verify each scenario we
      described above:
      
        test_fadvise.c
        ==============================
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <fcntl.h>
        #include <stdlib.h>
        #include <string.h>
        #include <stdio.h>
        #include <unistd.h>
      
        int main(int argc, char **argv)
        {
        	int i, fd, ret, len;
        	struct stat buf;
        	void *addr;
        	unsigned char *vec;
        	char *strbuf;
        	ssize_t pagesize = getpagesize();
        	ssize_t filesize;
      
        	fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
        	if (fd < 0)
        		return -1;
        	filesize = strtoul(argv[2], NULL, 10);
      
        	strbuf = malloc(filesize);
        	memset(strbuf, 42, filesize);
        	write(fd, strbuf, filesize);
        	free(strbuf);
        	fsync(fd);
      
        	len = (filesize + pagesize - 1) / pagesize;
        	printf("length of pages: %d\n", len);
      
        	addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
        	if (addr == MAP_FAILED)
        		return -1;
      
        	ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
        	if (ret < 0)
        		return -1;
      
        	vec = malloc(len);
        	ret = mincore(addr, filesize, (void *)vec);
        	if (ret < 0)
        		return -1;
      
        	for (i = 0; i < len; i++)
        		printf("pages[%d]: %x\n", i, vec[i] & 0x1);
      
        	free(vec);
        	close(fd);
      
        	return 0;
        }
        ==============================
      
      Test 1: running on kernel with commit 18aba41c reverted:
      
        [root@caspar ~]# uname -r
        4.15.0-rc6.revert+
        [root@caspar ~]# ./test_fadvise file1 1024
        length of pages: 1
        pages[0]: 0    # <-- partial page discarded
        [root@caspar ~]# ./test_fadvise file2 8192
        length of pages: 2
        pages[0]: 0
        pages[1]: 0
        [root@caspar ~]# ./test_fadvise file3 10240
        length of pages: 3
        pages[0]: 0
        pages[1]: 0
        pages[2]: 0    # <-- partial page discarded
      
      Test 2: running on mainline kernel:
      
        [root@caspar ~]# uname -r
        4.15.0-rc6+
        [root@caspar ~]# ./test_fadvise test1 1024
        length of pages: 1
        pages[0]: 1    # <-- partial and the only page not discarded
        [root@caspar ~]# ./test_fadvise test2 8192
        length of pages: 2
        pages[0]: 0
        pages[1]: 0
        [root@caspar ~]# ./test_fadvise test3 10240
        length of pages: 3
        pages[0]: 0
        pages[1]: 0
        pages[2]: 1    # <-- partial page not discarded
      
      Test 3: running on kernel with this patch:
      
        [root@caspar ~]# uname -r
        4.15.0-rc6.patched+
        [root@caspar ~]# ./test_fadvise test1 1024
        length of pages: 1
        pages[0]: 0    # <-- partial page and EOF, discarded
        [root@caspar ~]# ./test_fadvise test2 8192
        length of pages: 2
        pages[0]: 0
        pages[1]: 0
        [root@caspar ~]# ./test_fadvise test3 10240
        length of pages: 3
        pages[0]: 0
        pages[1]: 0
        pages[2]: 0    # <-- partial page and EOF, discarded
      
      [akpm@linux-foundation.org: tweak code comment]
      Link: http://lkml.kernel.org/r/5222da9ee20e1695eaabb69f631f200d6e6b8876.1515132470.git.jinli.zjl@alibaba-inc.comSigned-off-by: default avatarshidao.ytt <shidao.ytt@alibaba-inc.com>
      Signed-off-by: default avatarCaspar Zhang <jinli.zjl@alibaba-inc.com>
      Reviewed-by: default avatarOliver Yang <zhiche.yy@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a8b21508
    • Mel Gorman's avatar
      mm: pin address_space before dereferencing it while isolating an LRU page · ab88b8a2
      Mel Gorman authored
      [ Upstream commit 69d763fc ]
      
      Minchan Kim asked the following question -- what locks protects
      address_space destroying when race happens between inode trauncation and
      __isolate_lru_page? Jan Kara clarified by describing the race as follows
      
      CPU1                                            CPU2
      
      truncate(inode)                                 __isolate_lru_page()
        ...
        truncate_inode_page(mapping, page);
          delete_from_page_cache(page)
            spin_lock_irqsave(&mapping->tree_lock, flags);
              __delete_from_page_cache(page, NULL)
                page_cache_tree_delete(..)
                  ...                                   mapping = page_mapping(page);
                  page->mapping = NULL;
                  ...
            spin_unlock_irqrestore(&mapping->tree_lock, flags);
            page_cache_free_page(mapping, page)
              put_page(page)
                if (put_page_testzero(page)) -> false
      - inode now has no pages and can be freed including embedded address_space
      
                                                        if (mapping && !mapping->a_ops->migratepage)
      - we've dereferenced mapping which is potentially already free.
      
      The race is theoretically possible but unlikely.  Before the
      delete_from_page_cache, truncate_cleanup_page is called so the page is
      likely to be !PageDirty or PageWriteback which gets skipped by the only
      caller that checks the mappping in __isolate_lru_page.  Even if the race
      occurs, a substantial amount of work has to happen during a tiny window
      with no preemption but it could potentially be done using a virtual
      machine to artifically slow one CPU or halt it during the critical
      window.
      
      This patch should eliminate the race with truncation by try-locking the
      page before derefencing mapping and aborting if the lock was not
      acquired.  There was a suggestion from Huang Ying to use RCU as a
      side-effect to prevent mapping being freed.  However, I do not like the
      solution as it's an unconventional means of preserving a mapping and
      it's not a context where rcu_read_lock is obviously protecting rcu data.
      
      Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
      Fixes: c8244935 ("mm: compaction: make isolate_lru_page() filter-aware again")
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ab88b8a2
    • Yang Shi's avatar
      mm: thp: use down_read_trylock() in khugepaged to avoid long block · e56d3700
      Yang Shi authored
      [ Upstream commit 3b454ad3 ]
      
      In the current design, khugepaged needs to acquire mmap_sem before
      scanning an mm.  But in some corner cases, khugepaged may scan a process
      which is modifying its memory mapping, so khugepaged blocks in
      uninterruptible state.  But the process might hold the mmap_sem for a
      long time when modifying a huge memory space and it may trigger the
      below khugepaged hung issue:
      
        INFO: task khugepaged:270 blocked for more than 120 seconds.
        Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        khugepaged D 0 270 2 0x00000000 
        ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
        ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
        0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
        Call Trace:
          schedule+0x36/0x80
          rwsem_down_read_failed+0xf0/0x150
          call_rwsem_down_read_failed+0x18/0x30
          down_read+0x20/0x40
          khugepaged+0x476/0x11d0
          kthread+0xe6/0x100
          ret_from_fork+0x25/0x30
      
      So it sounds pointless to just block khugepaged waiting for the
      semaphore so replace down_read() with down_read_trylock() to move to
      scan the next mm quickly instead of just blocking on the semaphore so
      that other processes can get more chances to install THP.  Then
      khugepaged can come back to scan the skipped mm when it has finished the
      current round full_scan.
      
      And it appears that the change can improve khugepaged efficiency a
      little bit.
      
      Below is the test result when running LTP on a 24 cores 4GB memory 2
      nodes NUMA VM:
      
                                          pristine          w/ trylock
        full_scan                         197               187
        pages_collapsed                   21                26
        thp_fault_alloc                   40818             44466
        thp_fault_fallback                18413             16679
        thp_collapse_alloc                21                150
        thp_collapse_alloc_failed         14                16
        thp_file_alloc                    369               369
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: tweak comment]
      [arnd@arndb.de: avoid uninitialized variable use]
        Link: http://lkml.kernel.org/r/20171215125129.2948634-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1513281203-54878-1-git-send-email-yang.s@alibaba-inc.comSigned-off-by: default avatarYang Shi <yang.s@alibaba-inc.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e56d3700
    • Nitin Gupta's avatar
      sparc64: update pmdp_invalidate() to return old pmd value · 9da97a95
      Nitin Gupta authored
      [ Upstream commit a8e654f0 ]
      
      It's required to avoid losing dirty and accessed bits.
      
      [akpm@linux-foundation.org: add a `do' to the do-while loop]
      Link: http://lkml.kernel.org/r/20171213105756.69879-9-kirill.shutemov@linux.intel.comSigned-off-by: default avatarNitin Gupta <nitin.m.gupta@oracle.com>
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9da97a95
    • Kirill A. Shutemov's avatar
      asm-generic: provide generic_pmdp_establish() · 038ab51e
      Kirill A. Shutemov authored
      [ Upstream commit c58f0bb7 ]
      
      Patch series "Do not lose dirty bit on THP pages", v4.
      
      Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
      dirty and access bits if CPU sets them after pmdp dereference, but
      before set_pmd_at().
      
      The bug can lead to data loss, but the race window is tiny and I haven't
      seen any reports that suggested that it happens in reality.  So I don't
      think it worth sending it to stable.
      
      Unfortunately, there's no way to address the issue in a generic way.  We
      need to fix all architectures that support THP one-by-one.
      
      All architectures that have THP supported have to provide atomic
      pmdp_invalidate() that returns previous value.
      
      If generic implementation of pmdp_invalidate() is used, architecture
      needs to provide atomic pmdp_estabish().
      
      pmdp_estabish() is not used out-side generic implementation of
      pmdp_invalidate() so far, but I think this can change in the future.
      
      This patch (of 12):
      
      This is an implementation of pmdp_establish() that is only suitable for
      an architecture that doesn't have hardware dirty/accessed bits.  In this
      case we can't race with CPU which sets these bits and non-atomic
      approach is fine.
      
      Link: http://lkml.kernel.org/r/20171213105756.69879-2-kirill.shutemov@linux.intel.comSigned-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Nitin Gupta <nitin.m.gupta@oracle.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      038ab51e
    • Yisheng Xie's avatar
      mm/mempolicy: add nodes_empty check in SYSC_migrate_pages · 4cf2463f
      Yisheng Xie authored
      [ Upstream commit 0486a38b ]
      
      As in manpage of migrate_pages, the errno should be set to EINVAL when
      none of the node IDs specified by new_nodes are on-line and allowed by
      the process's current cpuset context, or none of the specified nodes
      contain memory.  However, when test by following case:
      
      	new_nodes = 0;
      	old_nodes = 0xf;
      	ret = migrate_pages(pid, old_nodes, new_nodes, MAX);
      
      The ret will be 0 and no errno is set.  As the new_nodes is empty, we
      should expect EINVAL as documented.
      
      To fix the case like above, this patch check whether target nodes AND
      current task_nodes is empty, and then check whether AND
      node_states[N_MEMORY] is empty.
      
      Link: http://lkml.kernel.org/r/1510882624-44342-4-git-send-email-xieyisheng1@huawei.comSigned-off-by: default avatarYisheng Xie <xieyisheng1@huawei.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Chris Salls <salls@cs.ucsb.edu>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tan Xiaojun <tanxiaojun@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4cf2463f
    • Yisheng Xie's avatar
      mm/mempolicy: fix the check of nodemask from user · 2851e3bd
      Yisheng Xie authored
      [ Upstream commit 56521e7a ]
      
      As Xiaojun reported the ltp of migrate_pages01 will fail on arm64 system
      which has 4 nodes[0...3], all have memory and CONFIG_NODES_SHIFT=2:
      
        migrate_pages01    0  TINFO  :  test_invalid_nodes
        migrate_pages01   14  TFAIL  :  migrate_pages_common.c:45: unexpected failure - returned value = 0, expected: -1
        migrate_pages01   15  TFAIL  :  migrate_pages_common.c:55: call succeeded unexpectedly
      
      In this case the test_invalid_nodes of migrate_pages01 will call:
      SYSC_migrate_pages as:
      
        migrate_pages(0, , {0x0000000000000001}, 64, , {0x0000000000000010}, 64) = 0
      
      The new nodes specifies one or more node IDs that are greater than the
      maximum supported node ID, however, the errno is not set to EINVAL as
      expected.
      
      As man pages of set_mempolicy[1], mbind[2], and migrate_pages[3]
      mentioned, when nodemask specifies one or more node IDs that are greater
      than the maximum supported node ID, the errno should set to EINVAL.
      However, get_nodes only check whether the part of bits
      [BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES), maxnode) is zero or not, and
      remain [MAX_NUMNODES, BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES)
      unchecked.
      
      This patch is to check the bits of [MAX_NUMNODES, maxnode) in get_nodes
      to let migrate_pages set the errno to EINVAL when nodemask specifies one
      or more node IDs that are greater than the maximum supported node ID,
      which follows the manpage's guide.
      
      [1] http://man7.org/linux/man-pages/man2/set_mempolicy.2.html
      [2] http://man7.org/linux/man-pages/man2/mbind.2.html
      [3] http://man7.org/linux/man-pages/man2/migrate_pages.2.html
      
      Link: http://lkml.kernel.org/r/1510882624-44342-3-git-send-email-xieyisheng1@huawei.comSigned-off-by: default avatarYisheng Xie <xieyisheng1@huawei.com>
      Reported-by: default avatarTan Xiaojun <tanxiaojun@huawei.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Chris Salls <salls@cs.ucsb.edu>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2851e3bd