1. 21 Aug, 2020 40 commits
    • watchdog: f71808e_wdt: indicate WDIOF_CARDRESET support in watchdog_info.options · 203dbe7c
      Ahmad Fatoum authored
      commit e871e93f upstream.
      
      The driver supports populating bootstatus with WDIOF_CARDRESET, but so
      far userspace couldn't portably determine whether absence of this flag
      meant no watchdog reset or no driver support. Or-in the bit to fix this.
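The ambiguity described above can be sketched from the userspace side. This is a minimal illustration, not code from the driver: `classify_reset`, `enum reset_cause` and `CARDRESET_FLAG` are hypothetical names, and the two arguments are assumed to come from the `WDIOC_GETSUPPORT` and `WDIOC_GETBOOTSTATUS` ioctls.

```c
#include <stdint.h>

/* Same value as WDIOF_CARDRESET in <linux/watchdog.h>; redefined here
 * (an assumption of this sketch) so it builds without kernel headers. */
#define CARDRESET_FLAG 0x0020u

enum reset_cause {
	RESET_UNKNOWN,      /* flag neither reported nor advertised */
	RESET_NOT_WATCHDOG, /* flag advertised, but clear in bootstatus */
	RESET_BY_WATCHDOG,  /* flag set in bootstatus */
};

/*
 * options:    watchdog_info.options as returned by WDIOC_GETSUPPORT
 * bootstatus: value returned by WDIOC_GETBOOTSTATUS
 */
enum reset_cause classify_reset(uint32_t options, uint32_t bootstatus)
{
	if (bootstatus & CARDRESET_FLAG)
		return RESET_BY_WATCHDOG;
	/* A clear bit only means "no watchdog reset" if the driver says
	 * it can report the bit at all. */
	if (options & CARDRESET_FLAG)
		return RESET_NOT_WATCHDOG;
	return RESET_UNKNOWN;
}
```

Without the bit in `options`, a clear `bootstatus` stays in the `RESET_UNKNOWN` case, which is the situation the commit removes for this driver.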
      
      Fixes: b97cb21a ("watchdog: f71808e_wdt: Fix WDTMOUT_STS register read")
      Cc: stable@vger.kernel.org
      Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
      Reviewed-by: Guenter Roeck <linux@roeck-us.net>
      Link: https://lore.kernel.org/r/20200611191750.28096-3-a.fatoum@pengutronix.de
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tracing: Use trace_sched_process_free() instead of exit() for pid tracing · 2c98c4a0
      Steven Rostedt (VMware) authored
      commit afcab636 upstream.
      
      On exit, if a process is preempted after the trace_sched_process_exit()
      tracepoint but before the process is done exiting, then when it gets
      scheduled in, the function tracers will not filter it properly against the
      function tracing pid filters.
      
      That is because the function tracing pid filters hook into the
      sched_process_exit() tracepoint to remove the exiting task's pid from the
      filter list. Because the filtering happens at the sched_switch tracepoint,
      when the exiting task schedules back in to finish up the exit, it will no
      longer be in the function pid filtering tables.
      
      This was noticeable in the notrace self tests on a preemptible kernel.
      The test task could be preempted while exiting, after it had been taken
      off the notrace filter table. On scheduling back in it was no longer in
      the notrace list, so the remainder of the exit function was traced. The
      test detected this and failed.
      
      Cc: stable@vger.kernel.org
      Cc: Namhyung Kim <namhyung@kernel.org>
      Fixes: 1e10486f ("ftrace: Add 'function-fork' trace option")
      Fixes: c37775d5 ("tracing: Add infrastructure to allow set_event_pid to follow children")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tracing/hwlat: Honor the tracing_cpumask · b3b77736
      Kevin Hao authored
      commit 96b4833b upstream.
      
      When calculating the cpu mask for the hwlat kernel thread, the wrong
      cpu mask is used instead of the tracing_cpumask. This makes
      tracing/tracing_cpumask useless for the hwlat tracer. Fix it.
      
      Link: https://lkml.kernel.org/r/20200730082318.42584-2-haokexin@gmail.com
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 0330f7aa ("tracing: Have hwlat trace migrate across tracing_cpumask CPUs")
      Signed-off-by: Kevin Hao <haokexin@gmail.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler · 46c9d392
      Muchun Song authored
      commit 0cb2f137 upstream.
      
      We found a kernel panic on our server. The stack trace is as
      follows (some irrelevant information omitted):
      
        BUG: kernel NULL pointer dereference, address: 0000000000000080
        RIP: 0010:kprobe_ftrace_handler+0x5e/0xe0
        RSP: 0018:ffffb512c6550998 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: ffff8e9d16eea018 RCX: 0000000000000000
        RDX: ffffffffbe1179c0 RSI: ffffffffc0535564 RDI: ffffffffc0534ec0
        RBP: ffffffffc0534ec1 R08: ffff8e9d1bbb0f00 R09: 0000000000000004
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
        R13: ffff8e9d1f797060 R14: 000000000000bacc R15: ffff8e9ce13eca00
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000080 CR3: 00000008453d0005 CR4: 00000000003606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         ftrace_ops_assist_func+0x56/0xe0
         ftrace_call+0x5/0x34
         tcpa_statistic_send+0x5/0x130 [ttcp_engine]
      
      tcpa_statistic_send is the function being kprobed. After analysis,
      the root cause is that the fourth parameter, regs, of kprobe_ftrace_handler
      is NULL. Why is regs NULL? We used the crash tool to analyze the kdump.
      
        crash> dis tcpa_statistic_send -r
               <tcpa_statistic_send>: callq 0xffffffffbd8018c0 <ftrace_caller>
      
      The tcpa_statistic_send calls ftrace_caller instead of ftrace_regs_caller.
      So it is reasonable that the fourth parameter regs of kprobe_ftrace_handler
      is NULL. In theory, we should call the ftrace_regs_caller instead of the
      ftrace_caller. After in-depth analysis, we found a reproducible path.
      
        Write a simple kernel module which starts a periodic timer. The
        timer's handler is named 'kprobe_test_timer_handler'. The module
        name is kprobe_test.ko.
      
        1) insmod kprobe_test.ko
        2) bpftrace -e 'kretprobe:kprobe_test_timer_handler {}'
        3) echo 0 > /proc/sys/kernel/ftrace_enabled
        4) rmmod kprobe_test
        5) stop the kprobe started in step 2)
        6) insmod kprobe_test.ko
        7) bpftrace -e 'kretprobe:kprobe_test_timer_handler {}'
      
      In step 4) we mark the kprobe as GONE but do not disarm it. Step 5)
      does not disarm the kprobe either when unregistering it, so the ip is
      not removed from the filter. When the module is loaded again in step
      6), the code is replaced with a call to ftrace_caller via
      ftrace_module_enable(). When we register the kprobe again, ftrace_caller
      is not replaced with ftrace_regs_caller because ftrace was disabled in
      step 3). So step 7) triggers the kernel panic. Fix this problem by
      disarming the kprobe when the module is going away.
      
      Link: https://lkml.kernel.org/r/20200728064536.24405-1-songmuchun@bytedance.com
      
      Cc: stable@vger.kernel.org
      Fixes: ae6aa16f ("kprobes: introduce ftrace based optimization")
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ftrace: Setup correct FTRACE_FL_REGS flags for module · 892fd363
      Chengming Zhou authored
      commit 8a224ffb upstream.
      
      When a module is loaded and enabled, we use __ftrace_replace_code
      for the module if any ftrace_ops referencing it is found. But we get
      the wrong ftrace_addr for the module rec in ftrace_get_addr_new, because
      rec->flags has not been set up correctly. This can cause the callback
      function of an ftrace_ops that has FTRACE_OPS_FL_SAVE_REGS set to be
      called with pt_regs set to NULL.
      So set up the correct FTRACE_FL_REGS flags for rec when we call
      referenced_filters to find the ftrace_ops that reference it.
      
      Link: https://lkml.kernel.org/r/20200728180554.65203-1-zhouchengming@bytedance.com
      
      Cc: stable@vger.kernel.org
      Fixes: 8c4f3c3f ("ftrace: Check module functions being traced on reload")
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/page_counter.c: fix protection usage propagation · e88a72e8
      Michal Koutný authored
      commit a6f23d14 upstream.
      
      When a workload runs in cgroups that aren't directly below the root
      cgroup and their parent specifies reclaim protection, the protection may
      end up ineffective.
      
      The reason is that propagate_protected_usage() is not called all the way
      up the hierarchy.  All the protected usage is incorrectly accumulated in
      the workload's parent.  This means that siblings_low_usage is
      overestimated and effective protection underestimated.  Even though this
      is a transient phenomenon (the uncharge path does correct propagation and
      fixes the wrong children_low_usage), it can undermine the intended
      protection unexpectedly.
      
      We have noticed this problem while seeing a swap out in a descendant of a
      protected memcg (intermediate node) while the parent was conveniently
      under its protection limit and the memory pressure was external to that
      hierarchy.  Michal has pinpointed this down to the wrong
      siblings_low_usage which led to the unwanted reclaim.
      
      The fix is simply updating children_low_usage in respective ancestors also
      in the charging path.
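The propagation problem can be reproduced with a toy model. This is a simplified sketch, not the kernel code: `struct counter` is a hypothetical stand-in for `struct page_counter` without atomics or the `min` protection, the clamping rule `min(usage, low)` is an assumption of the sketch, and the `fixed` parameter only exists to switch between the two behaviours.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page_counter: no atomics, memory.low only. */
struct counter {
	unsigned long usage;
	unsigned long low;                /* configured protection */
	unsigned long low_usage;          /* protected usage last reported upward */
	unsigned long children_low_usage; /* sum of the children's low_usage */
	struct counter *parent;
};

/* Report the protected part of c's usage to c's parent. */
static void propagate_protected_usage(struct counter *c, unsigned long usage)
{
	unsigned long protected_usage;

	if (!c->parent)
		return;
	protected_usage = usage < c->low ? usage : c->low;
	/* Replace the previously reported value with the new one. */
	c->parent->children_low_usage -= c->low_usage;
	c->parent->children_low_usage += protected_usage;
	c->low_usage = protected_usage;
}

/* Charge nr_pages to counter and all of its ancestors. */
void charge(struct counter *counter, unsigned long nr_pages, bool fixed)
{
	for (struct counter *c = counter; c; c = c->parent) {
		c->usage += nr_pages;
		/* Old behaviour: only the leaf reports upward, on every level.
		 * Fixed behaviour: each level reports to its own parent. */
		propagate_protected_usage(fixed ? c : counter, c->usage);
	}
}
```

With a three-level chain (root, a protected intermediate node, a leaf), charging the leaf in the old mode updates only the intermediate node's `children_low_usage`; the root never learns about the protected usage below it.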
      
      Fixes: 23067153 ("mm: memory.low hierarchical behavior")
      Signed-off-by: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org> [4.18+]
      Link: http://lkml.kernel.org/r/20200803153231.15477-1-mhocko@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ocfs2: change slot number type s16 to u16 · 73cbb8af
      Junxiao Bi authored
      commit 38d51b2d upstream.
      
      Dan Carpenter reported the following static checker warning.
      
      	fs/ocfs2/super.c:1269 ocfs2_parse_options() warn: '(-1)' 65535 can't fit into 32767 'mopt->slot'
      	fs/ocfs2/suballoc.c:859 ocfs2_init_inode_steal_slot() warn: '(-1)' 65535 can't fit into 32767 'osb->s_inode_steal_slot'
      	fs/ocfs2/suballoc.c:867 ocfs2_init_meta_steal_slot() warn: '(-1)' 65535 can't fit into 32767 'osb->s_meta_steal_slot'
      
      That's because OCFS2_INVALID_SLOT is (u16)-1. A slot number in ocfs2 can
      never be negative, so change s16 to u16.
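One way such a type mismatch can bite is sketched below. This is a standalone illustration, not ocfs2 code: `INVALID_SLOT`, `store_signed` and the two predicates are hypothetical names, and the narrowing conversion to `int16_t` is implementation-defined (it yields -1 on the usual two's-complement compilers).

```c
#include <stdbool.h>
#include <stdint.h>

/* Same shape as the definition quoted above: (u16)-1, i.e. 65535. */
#define INVALID_SLOT ((uint16_t)-1)

/* Old field type: storing 65535 in a signed 16-bit field.
 * The result is implementation-defined; typically -1. */
int16_t store_signed(uint16_t slot)
{
	return (int16_t)slot;
}

/* Both operands are promoted to int before the comparison, so a stored
 * -1 is compared against 65535 and the check does not match. */
bool is_invalid_signed(int16_t slot)
{
	return slot == INVALID_SLOT;
}

/* New field type: the value round-trips and the check matches. */
bool is_invalid_unsigned(uint16_t slot)
{
	return slot == INVALID_SLOT;
}
```

The static checker warning quoted in the commit message points at exactly this kind of store: 65535 does not fit into a field whose maximum is 32767.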
      
      Fixes: 9277f833 ("ocfs2: fix value of OCFS2_INVALID_SLOT")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Gang He <ghe@suse.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200627001259.19757-1-junxiao.bi@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext2: fix missing percpu_counter_inc · 41d71ef2
      Mikulas Patocka authored
      commit bc2fbaa4 upstream.
      
      sbi->s_freeinodes_counter is only decreased by the ext2 code; it is never
      increased. This patch fixes it.
      
      Note that sbi->s_freeinodes_counter is only used in the algorithm that
      tries to find the group for new allocations, so this bug is not easily
      visible (the only visible effect is that the group-finding algorithm
      selects a suboptimal result).
      
      Link: https://lore.kernel.org/r/alpine.LRH.2.02.2004201538300.19436@file01.intranet.prod.int.rdu2.redhat.com
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • MIPS: CPU#0 is not hotpluggable · baa5bd36
      Huacai Chen authored
      commit 9cce844a upstream.
      
      CPU#0 is currently not hotpluggable on MIPS, so avoid creating
      /sys/devices/system/cpu/cpu0/online, which confuses some user-space tools.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Huacai Chen <chenhc@lemote.com>
      Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • driver core: Avoid binding drivers to dead devices · 706695d4
      Lukas Wunner authored
      commit 65488832 upstream.
      
      Commit 3451a495 ("driver core: Establish order of operations for
      device_add and device_del via bitflag") sought to prevent asynchronous
      driver binding to a device which is being removed.  It added a
      per-device "dead" flag which is checked in the following code paths:
      
      * asynchronous binding in __driver_attach_async_helper()
      *  synchronous binding in device_driver_attach()
      * asynchronous binding in __device_attach_async_helper()
      
      It did *not* check the flag upon:
      
      *  synchronous binding in __device_attach()
      
      However __device_attach() may also be called asynchronously from:
      
      deferred_probe_work_func()
        bus_probe_device()
          device_initial_probe()
            __device_attach()
      
      So if the commit's intention was to check the "dead" flag in all
      asynchronous code paths, then a check is also necessary in
      __device_attach().  Add the missing check.
      
      Fixes: 3451a495 ("driver core: Establish order of operations for device_add and device_del via bitflag")
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: stable@vger.kernel.org # v5.1+
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Link: https://lore.kernel.org/r/de88a23a6fe0ef70f7cfd13c8aea9ab51b4edab6.1594214103.git.lukas@wunner.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mac80211: fix misplaced while instead of if · 4cf1d191
      Johannes Berg authored
      commit 5981fe5b upstream.
      
      This never was intended to be a 'while' loop, it should've
      just been an 'if' instead of 'while'. Fix this.
      
      I noticed this while applying another patch from Ben that
      intended to fix a busy loop at this spot.
      
      Cc: stable@vger.kernel.org
      Fixes: b16798f5 ("mac80211: mark station unauthorized before key removal")
      Reported-by: Ben Greear <greearb@candelatech.com>
      Link: https://lore.kernel.org/r/20200803110209.253009ae41ff.I3522aad099392b31d5cf2dcca34cbac7e5832dde@changeid
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bcache: fix overflow in offset_to_stripe() · 2a72c283
      Coly Li authored
      commit 7a148126 upstream.
      
      offset_to_stripe() returns the stripe number (in type unsigned int) from
      an offset (in type uint64_t) by the following calculation,
      	do_div(offset, d->stripe_size);
      For large capacity backing device (e.g. 18TB) with small stripe size
      (e.g. 4KB), the result is 4831838208 and exceeds UINT_MAX. The actual
      returned value which caller receives is 536870912, due to the overflow.
      
      Indeed in bcache_device_init(), bcache_device->nr_stripes is limited in
      range [1, INT_MAX]. Therefore all valid stripe numbers in bcache are
      in range [0, bcache_dev->nr_stripes - 1].
      
      This patch adds an upper limit check in offset_to_stripe(): the max
      valid stripe number should be less than bcache_device->nr_stripes. If
      the calculated stripe number from do_div() is equal to or larger than
      bcache_device->nr_stripes, -EINVAL will be returned. (Normally nr_stripes
      is less than INT_MAX, so exceeding the upper limit doesn't mean overflow;
      therefore -EOVERFLOW is not used as the error code.)
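The truncation and the added bound check can be sketched in userspace. This is an illustration under stated assumptions, not the bcache implementation: plain 64-bit division stands in for do_div(), offsets and stripe sizes are given in bytes, `unsigned int` is assumed to be 32 bits wide, and both function names are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Pre-fix shape: the 64-bit quotient is narrowed to unsigned int,
 * silently dropping the upper bits. stripe_size must be non-zero. */
unsigned int stripe_truncated(uint64_t offset, uint64_t stripe_size)
{
	return (unsigned int)(offset / stripe_size);
}

/* Post-fix shape: the quotient is checked against nr_stripes (assumed
 * to be in [1, INT_MAX]) before it is narrowed to int. */
int stripe_checked(uint64_t offset, uint64_t stripe_size, int nr_stripes)
{
	uint64_t stripe = offset / stripe_size;

	if (stripe >= (uint64_t)nr_stripes)
		return -EINVAL;
	return (int)stripe;
}
```

For the numbers in the commit message, 18TB divided by 4KB is 18 * 2^28 = 4831838208; keeping only the low 32 bits leaves 2 * 2^28 = 536870912, which is the value the caller used to receive.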
      
      This patch also changes nr_stripes' type of struct bcache_device from
      'unsigned int' to 'int', and return value type of offset_to_stripe()
      from 'unsigned int' to 'int', to match their exact data ranges.
      
      All locations where bcache_device->nr_stripes and offset_to_stripe() are
      referenced also get updated for the above type change.
      
      Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bcache: allocate meta data pages as compound pages · d6e2394c
      Coly Li authored
      commit 5fe48867 upstream.
      
      Some bcache meta data is allocated as multiple pages, and these pages
      are used as bio bv_page for I/Os to the cache device, for example
      cache_set->uuids, cache->disk_buckets, journal_write->data and
      bset_tree->data.
      
      For such meta data memory, all the allocated pages should be treated
      as a single memory block. Then the memory management and underlying I/O
      code can treat them more clearly.
      
      This patch adds the __GFP_COMP flag to all the locations that allocate
      >0 order pages for the above mentioned meta data, so that their pages
      are treated as compound pages.
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • md/raid5: Fix Force reconstruct-write io stuck in degraded raid5 · 566cba3c
      ChangSyun Peng authored
      commit a1c6ae3d upstream.
      
      In degraded raid5, we need to read parity to do reconstruct-write when
      data disks fail. However, we can not read parity from
      handle_stripe_dirtying() in force reconstruct-write mode.
      
      Reproducible Steps:
      
      1. Create degraded raid5
      mdadm -C /dev/md2 --assume-clean -l5 -n3 /dev/sda2 /dev/sdb2 missing
      2. Set rmw_level to 0
      echo 0 > /sys/block/md2/md/rmw_level
      3. IO to raid5
      
      Now some io may be stuck in raid5. We can use handle_stripe_fill() to read
      the parity in this situation.
      
      Cc: <stable@vger.kernel.org> # v4.4+
      Reviewed-by: Alex Wu <alexwu@synology.com>
      Reviewed-by: BingJing Chang <bingjingc@synology.com>
      Reviewed-by: Danny Shih <dannyshih@synology.com>
      Signed-off-by: ChangSyun Peng <allenpeng@synology.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/compat: Add missing sock updates for SCM_RIGHTS · f90339a4
      Kees Cook authored
      commit d9539752 upstream.
      
      Add missed sock updates to compat path via a new helper, which will be
      used more in coming patches. (The net/core/scm.c code is left as-is here
      to assist with -stable backports for the compat path.)
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: 48a87cc2 ("net: netprio: fd passed in SCM_RIGHTS datagram not set correctly")
      Fixes: d8429506 ("net: net_cls: fd passed in SCM_RIGHTS datagram not set correctly")
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: stmmac: dwmac1000: provide multicast filter fallback · c334db67
      Jonathan McDowell authored
      commit 592d751c upstream.
      
      If we don't have a hardware multicast filter available then instead of
      silently failing to listen for the requested ethernet broadcast
      addresses fall back to receiving all multicast packets, in a similar
      fashion to other drivers with no multicast filter.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Jonathan McDowell <noodles@earth.li>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: ethernet: stmmac: Disable hardware multicast filter · 26f0092f
      Jonathan McDowell authored
      commit df43dd52 upstream.
      
      The IPQ806x does not appear to have a functional multicast ethernet
      address filter. This was observed as a failure to correctly receive IPv6
      packets on a LAN to the all stations address. Checking the vendor driver
      shows that it does not attempt to enable the multicast filter and
      instead falls back to receiving all multicast packets, internally
      setting ALLMULTI.
      
      Use the new fallback support in the dwmac1000 driver to correctly
      achieve the same with the mainline IPQ806x driver. Confirmed to fix IPv6
      functionality on an RB3011 router.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Jonathan McDowell <noodles@earth.li>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • media: vsp1: dl: Fix NULL pointer dereference on unbind · 62f8d714
      Eugeniu Rosca authored
      commit c92d30e4 upstream.
      
      In commit f3b98e3c ("media: vsp1: Provide support for extended
      command pools"), the vsp pointer used for referencing the VSP1 device
      structure from a command pool during vsp1_dl_ext_cmd_pool_destroy() was
      not populated.
      
      Correctly assign the pointer to prevent the following
      null-pointer-dereference when removing the device:
      
      [*] h3ulcb-kf #>
      echo fea28000.vsp > /sys/bus/platform/devices/fea28000.vsp/driver/unbind
       Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
       Mem abort info:
         ESR = 0x96000006
         EC = 0x25: DABT (current EL), IL = 32 bits
         SET = 0, FnV = 0
         EA = 0, S1PTW = 0
       Data abort info:
         ISV = 0, ISS = 0x00000006
         CM = 0, WnR = 0
       user pgtable: 4k pages, 48-bit VAs, pgdp=00000007318be000
       [0000000000000028] pgd=00000007333a1003, pud=00000007333a6003, pmd=0000000000000000
       Internal error: Oops: 96000006 [#1] PREEMPT SMP
       Modules linked in:
       CPU: 1 PID: 486 Comm: sh Not tainted 5.7.0-rc6-arm64-renesas-00118-ge644645a #185
       Hardware name: Renesas H3ULCB Kingfisher board based on r8a77951 (DT)
       pstate: 40000005 (nZcv daif -PAN -UAO)
       pc : vsp1_dlm_destroy+0xe4/0x11c
       lr : vsp1_dlm_destroy+0xc8/0x11c
       sp : ffff800012963b60
       x29: ffff800012963b60 x28: ffff0006f83fc440
       x27: 0000000000000000 x26: ffff0006f5e13e80
       x25: ffff0006f5e13ed0 x24: ffff0006f5e13ed0
       x23: ffff0006f5e13ed0 x22: dead000000000122
       x21: ffff0006f5e3a080 x20: ffff0006f5df2938
       x19: ffff0006f5df2980 x18: 0000000000000003
       x17: 0000000000000000 x16: 0000000000000016
       x15: 0000000000000003 x14: 00000000000393c0
       x13: ffff800011a5ec18 x12: ffff800011d8d000
       x11: ffff0006f83fcc68 x10: ffff800011a53d70
       x9 : ffff8000111f3000 x8 : 0000000000000000
       x7 : 0000000000210d00 x6 : 0000000000000000
       x5 : ffff800010872e60 x4 : 0000000000000004
       x3 : 0000000078068000 x2 : ffff800012781000
       x1 : 0000000000002c00 x0 : 0000000000000000
       Call trace:
        vsp1_dlm_destroy+0xe4/0x11c
        vsp1_wpf_destroy+0x10/0x20
        vsp1_entity_destroy+0x24/0x4c
        vsp1_destroy_entities+0x54/0x130
        vsp1_remove+0x1c/0x40
        platform_drv_remove+0x28/0x50
        __device_release_driver+0x178/0x220
        device_driver_detach+0x44/0xc0
        unbind_store+0xe0/0x104
        drv_attr_store+0x20/0x30
        sysfs_kf_write+0x48/0x70
        kernfs_fop_write+0x148/0x230
        __vfs_write+0x18/0x40
        vfs_write+0xdc/0x1c4
        ksys_write+0x68/0xf0
        __arm64_sys_write+0x18/0x20
        el0_svc_common.constprop.0+0x70/0x170
        do_el0_svc+0x20/0x80
        el0_sync_handler+0x134/0x1b0
        el0_sync+0x140/0x180
       Code: b40000c2 f9403a60 d2800084 a9400663 (f9401400)
       ---[ end trace 3875369841fb288a ]---
      
      Fixes: f3b98e3c ("media: vsp1: Provide support for extended command pools")
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: Eugeniu Rosca <erosca@de.adit-jv.com>
      Reviewed-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com>
      Tested-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com>
      Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
      Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • powerpc: Fix circular dependency between percpu.h and mmu.h · e83f99c4
      Michael Ellerman authored
      commit 0c83b277 upstream.
      
      Recently random.h started including percpu.h (see commit
      f227e3ec ("random32: update the net random state on interrupt and
      activity")), which broke corenet64_smp_defconfig:
      
        In file included from /linux/arch/powerpc/include/asm/paca.h:18,
                         from /linux/arch/powerpc/include/asm/percpu.h:13,
                         from /linux/include/linux/random.h:14,
                         from /linux/lib/uuid.c:14:
        /linux/arch/powerpc/include/asm/mmu.h:139:22: error: unknown type name 'next_tlbcam_idx'
          139 | DECLARE_PER_CPU(int, next_tlbcam_idx);
      
      This is due to a circular header dependency:
        asm/mmu.h includes asm/percpu.h, which includes asm/paca.h, which
        includes asm/mmu.h
      
      Which means DECLARE_PER_CPU() isn't defined when mmu.h needs it.
      
      We can fix it by moving the include of paca.h below the include of
      asm-generic/percpu.h.
      
      This moves the include of paca.h out of the #ifdef __powerpc64__, but
      that is OK because paca.h is almost entirely inside #ifdef
      CONFIG_PPC64 anyway.
      
      It also moves the include of paca.h out of the #ifdef CONFIG_SMP,
      which could possibly break something, but seems to have no ill
      effects.
      
      Fixes: f227e3ec ("random32: update the net random state on interrupt and activity")
      Cc: stable@vger.kernel.org # v5.8
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200804130558.292328-1-mpe@ellerman.id.au
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • powerpc: Allow 4224 bytes of stack expansion for the signal frame · b11ac832
      Michael Ellerman authored
      commit 63dee5df upstream.
      
      We have powerpc specific logic in our page fault handling to decide if
      an access to an unmapped address below the stack pointer should expand
      the stack VMA.
      
      The code was originally added in 2004 "ported from 2.4". The rough
      logic is that the stack is allowed to grow to 1MB with no extra
      checking. Over 1MB the access must be within 2048 bytes of the stack
      pointer, or be from a user instruction that updates the stack pointer.
      
      The 2048 byte allowance below the stack pointer is there to cover the
      288 byte "red zone" as well as the "about 1.5kB" needed by the signal
      delivery code.
      
      Unfortunately since then the signal frame has expanded, and is now
      4224 bytes on 64-bit kernels with transactional memory enabled. This
      means if a process has consumed more than 1MB of stack, and its stack
      pointer lies less than 4224 bytes from the next page boundary, signal
      delivery will fault when trying to expand the stack and the process
      will see a SEGV.
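The rule described so far can be written as a small predicate. This is a rough model of the logic as the text states it, not the kernel's fault handler: `may_expand_stack` and its parameters are hypothetical, `stack_top` stands for the highest address of the stack VMA, and the allowance is passed in so the old and new values can be compared.

```c
#include <stdbool.h>
#include <stdint.h>

/* 1MB of stack growth is allowed with no extra checking (from the text). */
#define UNCHECKED_STACK_LIMIT UINT64_C(0x100000)

/*
 * stack_top:       highest address of the stack VMA
 * sp:              user stack pointer at the time of the fault
 * addr:            faulting address below the mapped stack
 * allowance:       bytes below sp that may be touched (2048 or 4224)
 * insn_updates_sp: the faulting instruction itself updates sp
 */
bool may_expand_stack(uint64_t stack_top, uint64_t sp, uint64_t addr,
		      uint64_t allowance, bool insn_updates_sp)
{
	/* Within the first 1MB below the top: always allowed. */
	if (addr + UNCHECKED_STACK_LIMIT >= stack_top)
		return true;
	/* Beyond 1MB: allow instructions that move sp themselves ... */
	if (insn_updates_sp)
		return true;
	/* ... or accesses within the allowance below sp. */
	return addr + allowance >= sp;
}
```

With `allowance` set to 2048, a write 4224 bytes below `sp` on a stack that has already grown past 1MB is rejected, which corresponds to the SEGV during signal delivery described above; with 4224 the same access is accepted.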
      
      The total size of the signal frame is the size of struct rt_sigframe
      (which includes the red zone) plus __SIGNAL_FRAMESIZE (128 bytes on
      64-bit).
      
      The 2048 byte allowance was correct until 2008 as the signal frame
      was:
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1440 */
              /* --- cacheline 11 boundary (1408 bytes) was 32 bytes ago --- */
              long unsigned int          _unused[2];           /*  1440    16 */
              unsigned int               tramp[6];             /*  1456    24 */
              struct siginfo *           pinfo;                /*  1480     8 */
              void *                     puc;                  /*  1488     8 */
              struct siginfo     info;                         /*  1496   128 */
              /* --- cacheline 12 boundary (1536 bytes) was 88 bytes ago --- */
              char                       abigap[288];          /*  1624   288 */
      
              /* size: 1920, cachelines: 15, members: 7 */
              /* padding: 8 */
      };
      
      1920 + 128 = 2048
      
      Then in commit ce48b210 ("powerpc: Add VSX context save/restore,
      ptrace and signal support") (Jul 2008) the signal frame expanded to
      2304 bytes:
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1696 */	<--
              /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
              long unsigned int          _unused[2];           /*  1696    16 */
              unsigned int               tramp[6];             /*  1712    24 */
              struct siginfo *           pinfo;                /*  1736     8 */
              void *                     puc;                  /*  1744     8 */
              struct siginfo     info;                         /*  1752   128 */
              /* --- cacheline 14 boundary (1792 bytes) was 88 bytes ago --- */
              char                       abigap[288];          /*  1880   288 */
      
              /* size: 2176, cachelines: 17, members: 7 */
              /* padding: 8 */
      };
      
      2176 + 128 = 2304
      
      At this point we should have been exposed to the bug, though as far as
      I know it was never reported. I no longer have a system old enough to
      easily test on.
      
      Then in 2010 commit 320b2b8d ("mm: keep a guard page below a
      grow-down stack segment") caused our stack expansion code to never
      trigger, as there was always a VMA found for a write up to PAGE_SIZE
      below r1.
      
      That meant the bug was hidden as we continued to expand the signal
      frame in commit 2b0a576d ("powerpc: Add new transactional memory
      state to the signal context") (Feb 2013):
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1696 */
              /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
              struct ucontext    uc_transact;                  /*  1696  1696 */	<--
              /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
              long unsigned int          _unused[2];           /*  3392    16 */
              unsigned int               tramp[6];             /*  3408    24 */
              struct siginfo *           pinfo;                /*  3432     8 */
              void *                     puc;                  /*  3440     8 */
              struct siginfo     info;                         /*  3448   128 */
              /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
              char                       abigap[288];          /*  3576   288 */
      
              /* size: 3872, cachelines: 31, members: 8 */
              /* padding: 8 */
              /* last cacheline: 32 bytes */
      };
      
      3872 + 128 = 4000
      
      And commit 573ebfa6 ("powerpc: Increase stack redzone for 64-bit
      userspace to 512 bytes") (Feb 2014):
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1696 */
              /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
              struct ucontext    uc_transact;                  /*  1696  1696 */
              /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
              long unsigned int          _unused[2];           /*  3392    16 */
              unsigned int               tramp[6];             /*  3408    24 */
              struct siginfo *           pinfo;                /*  3432     8 */
              void *                     puc;                  /*  3440     8 */
              struct siginfo     info;                         /*  3448   128 */
              /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
              char                       abigap[512];          /*  3576   512 */	<--
      
              /* size: 4096, cachelines: 32, members: 8 */
              /* padding: 8 */
      };
      
      4096 + 128 = 4224
      
      Then finally in 2017, commit 1be7107f ("mm: larger stack guard
      gap, between vmas") exposed us to the existing bug, because it changed
      the stack VMA to be the correct/real size, meaning our stack expansion
      code is now triggered.
      
      Fix it by increasing the allowance to 4224 bytes.
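
The arithmetic in this message can be checked with a small sketch. The sizes are the pahole figures quoted above; `frame_total()` and `fits_allowance()` are illustrative names, not kernel identifiers, and this is not the kernel's stack expansion check.

```c
#include <stdbool.h>
#include <stddef.h>

/* __SIGNAL_FRAMESIZE on 64-bit, as quoted in the commit message. */
#define SIGNAL_FRAMESIZE 128

/* Total bytes written below the stack pointer on signal delivery. */
static size_t frame_total(size_t rt_sigframe_size)
{
	return rt_sigframe_size + SIGNAL_FRAMESIZE;
}

/* True if a frame of this size stays within the given allowance. */
static bool fits_allowance(size_t rt_sigframe_size, size_t allowance)
{
	return frame_total(rt_sigframe_size) <= allowance;
}
```

With the 2008 layout, `fits_allowance(2176, 2048)` is false, which is the point at which the old 2048 byte allowance stopped covering the frame.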
      
      Hard-coding 4224 is obviously unsafe against future expansions of the
      signal frame in the same way as the existing code. We can't easily use
      sizeof() because the signal frame structure is not in a header. We
      will either fix that, or rip out all the custom stack expansion
      checking logic entirely.
      
      Fixes: ce48b210 ("powerpc: Add VSX context save/restore, ptrace and signal support")
      Cc: stable@vger.kernel.org # v2.6.27+
      Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
      Tested-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200724092528.1578671-2-mpe@ellerman.id.au
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b11ac832
    • Paul Aurich's avatar
      cifs: Fix leak when handling lease break for cached root fid · d9710cc6
      Paul Aurich authored
      commit baf57b56 upstream.
      
      Handling a lease break for the cached root didn't free the
      smb2_lease_break_work allocation, resulting in a leak:
      
          unreferenced object 0xffff98383a5af480 (size 128):
            comm "cifsd", pid 684, jiffies 4294936606 (age 534.868s)
            hex dump (first 32 bytes):
              c0 ff ff ff 1f 00 00 00 88 f4 5a 3a 38 98 ff ff  ..........Z:8...
              88 f4 5a 3a 38 98 ff ff 80 88 d6 8a ff ff ff ff  ..Z:8...........
            backtrace:
              [<0000000068957336>] smb2_is_valid_oplock_break+0x1fa/0x8c0
              [<0000000073b70b9e>] cifs_demultiplex_thread+0x73d/0xcc0
              [<00000000905fa372>] kthread+0x11c/0x150
              [<0000000079378e4e>] ret_from_fork+0x22/0x30
      
      Avoid this leak by only allocating when necessary.
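
A minimal userspace sketch of "allocate only when necessary" follows. It assumes the cached-root path does not need a separately allocated work item; `struct break_work`, `handle_lease_break()` and the allocation counter are hypothetical stand-ins, not the cifs code.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the deferred work item; not the cifs type. */
struct break_work {
	int lease_id;
};

static int live_allocations; /* outstanding work items, to expose leaks */

static struct break_work *work_alloc(int lease_id)
{
	struct break_work *w = malloc(sizeof(*w));

	if (w) {
		w->lease_id = lease_id;
		live_allocations++;
	}
	return w;
}

static void work_free(struct break_work *w)
{
	if (w) {
		live_allocations--;
		free(w);
	}
}

/*
 * Handle a lease break. The cached-root path is assumed not to need a
 * separately allocated item, so nothing is allocated for it. Only the
 * path that queues work allocates, and the caller owns the result.
 */
static struct break_work *handle_lease_break(int lease_id, bool is_cached_root)
{
	if (is_cached_root)
		return NULL; /* no allocation, so nothing to leak */

	return work_alloc(lease_id);
}
```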
      
      Fixes: a93864d9 ("cifs: add lease tracking to the cached root fid")
      Signed-off-by: Paul Aurich <paul@darkrain42.org>
      CC: Stable <stable@vger.kernel.org> # v4.18+
      Reviewed-by: Aurelien Aptel <aaptel@suse.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d9710cc6
    • Max Filippov's avatar
      xtensa: fix xtensa_pmu_setup prototype · 6ffc89ca
      Max Filippov authored
      commit 6d65d376 upstream.
      
      Fix the following build error in configurations with
      CONFIG_XTENSA_VARIANT_HAVE_PERF_EVENTS=y:
      
        arch/xtensa/kernel/perf_event.c:420:29: error: passing argument 3 of
        ‘cpuhp_setup_state’ from incompatible pointer type
      
      Cc: stable@vger.kernel.org
      Fixes: 25a77b55 ("xtensa/perf: Convert the hotplug notifier to state machine callbacks")
      Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6ffc89ca
    • Alexandru Ardelean's avatar
      iio: dac: ad5592r: fix unbalanced mutex unlocks in ad5592r_read_raw() · b86f06e1
      Alexandru Ardelean authored
      commit 65afb093 upstream.
      
      There are 2 exit paths where the lock isn't held, but which still try to
      unlock the mutex when exiting. In these places we should just return from
      the function.
      
      A neater approach would be to clean up ad5592r_read_raw(), but that
      would make this patch more difficult to backport to stable versions.
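
The shape of the fix can be sketched with a toy lock that records its depth, so an unbalanced unlock would show up as a negative value. All names here (`toy_mutex`, `toy_read_raw()`, the request enum) are hypothetical; this is not the ad5592r driver code.

```c
#include <errno.h>

/* Toy lock that records its depth so an unbalanced unlock is visible. */
struct toy_mutex {
	int depth;
};

static void toy_lock(struct toy_mutex *m)   { m->depth++; }
static void toy_unlock(struct toy_mutex *m) { m->depth--; }

enum toy_request { REQ_RAW, REQ_SCALE_FIXED, REQ_INVALID };

/*
 * Paths that never took the lock return directly. Only the path that
 * locked goes through the unlock before returning.
 */
static int toy_read_raw(struct toy_mutex *m, enum toy_request req, int *val)
{
	int ret;

	switch (req) {
	case REQ_RAW:
		toy_lock(m);
		*val = 42;	/* placeholder for a device read under the lock */
		ret = 0;
		toy_unlock(m);
		return ret;
	case REQ_SCALE_FIXED:
		*val = 1;	/* constant answer, lock never taken */
		return 0;	/* return directly, no jump to an unlock label */
	default:
		return -EINVAL;	/* lock never taken here either */
	}
}
```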
      
      Fixes: 56ca9db8 ("iio: dac: Add support for the AD5592R/AD5593R ADCs/DACs")
      Reported-by: Charles Stanhope <charles.stanhope@gmail.com>
      Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
      Cc: <Stable@vger.kernel.org>
      Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b86f06e1
    • Christian Eggers's avatar
      dt-bindings: iio: io-channel-mux: Fix compatible string in example code · 0d4abc35
      Christian Eggers authored
      commit add48ba4 upstream.
      
      The correct compatible string is "gpio-mux" (see
      bindings/mux/gpio-mux.txt).
      
      Cc: stable@vger.kernel.org # v4.13+
      Reviewed-by: Peter Rosin <peda@axentia.se>
      Signed-off-by: Christian Eggers <ceggers@arri.de>
      Link: https://lore.kernel.org/r/20200727101605.24384-1-ceggers@arri.de
      Signed-off-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0d4abc35
    • Pavel Machek's avatar
      btrfs: fix return value mixup in btrfs_get_extent · a34b58b5
      Pavel Machek authored
      commit 881a3a11 upstream.
      
      btrfs_get_extent() sets variable ret, but the out: error path expects the
      error to be in variable err, so the error code is lost.
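
The mixup can be illustrated with two small functions. This is a sketch of the pattern only: `lookup_buggy()` and `lookup_fixed()` are hypothetical, and `-EINVAL` is a placeholder error code rather than the one the real function uses.

```c
#include <errno.h>

/*
 * Buggy shape: the failure is stored in 'ret', but the exit path only
 * looks at 'err', so the caller sees success.
 */
static int lookup_buggy(int mode_is_valid)
{
	int ret = 0;
	int err = 0;

	if (!mode_is_valid) {
		ret = -EINVAL;
		goto out;
	}
out:
	(void)ret;	/* the error code is dropped here */
	return err;
}

/* Fixed shape: the failure goes into the variable the exit path returns. */
static int lookup_fixed(int mode_is_valid)
{
	int err = 0;

	if (!mode_is_valid) {
		err = -EINVAL;
		goto out;
	}
out:
	return err;
}
```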
      
      Fixes: 6bf9e4bd ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Pavel Machek (CIP) <pavel@denx.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a34b58b5
    • Filipe Manana's avatar
      btrfs: fix memory leaks after failure to lookup checksums during inode logging · 183af2d2
      Filipe Manana authored
      commit 4f26433e upstream.
      
      While logging an inode, at copy_items(), if we fail to lookup the checksums
      for an extent we release the destination path, free the ins_data array and
      then return immediately. However a previous iteration of the for loop may
      have added checksums to the ordered_sums list, in which case we leak the
      memory used by them.
      
      So fix this by making sure we iterate the ordered_sums list and free all
      its checksums before returning.
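
A minimal userspace sketch of the cleanup follows, assuming a singly linked list as a stand-in for ordered_sums. `struct sum_node`, `copy_items_sketch()` and the `live_sums` counter are hypothetical and exist only to make the leak observable.

```c
#include <stdlib.h>

/* Hypothetical singly linked stand-in for the ordered_sums list. */
struct sum_node {
	struct sum_node *next;
};

static int live_sums; /* outstanding nodes, to make a leak observable */

static int sums_add(struct sum_node **head)
{
	struct sum_node *n = malloc(sizeof(*n));

	if (!n)
		return -1;
	n->next = *head;
	*head = n;
	live_sums++;
	return 0;
}

static void sums_free_all(struct sum_node **head)
{
	while (*head) {
		struct sum_node *n = *head;

		*head = n->next;
		free(n);
		live_sums--;
	}
}

/*
 * Process 'count' items; the lookup for item 'fail_at' fails (pass -1
 * for no failure). On failure, the nodes queued by earlier iterations
 * are freed before returning instead of returning immediately.
 */
static int copy_items_sketch(int count, int fail_at)
{
	struct sum_node *sums = NULL;
	int ret = 0;

	for (int i = 0; i < count; i++) {
		if (i == fail_at || sums_add(&sums)) {
			ret = -1;
			break;
		}
	}

	/* In this sketch the list is drained on success and failure alike. */
	sums_free_all(&sums);
	return ret;
}
```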
      
      Fixes: 3650860b ("Btrfs: remove almost all of the BUG()'s from tree-log.c")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      183af2d2
    • Josef Bacik's avatar
      btrfs: only search for left_info if there is no right_info in try_merge_free_space · 627fa9d8
      Josef Bacik authored
      commit bf53d468 upstream.
      
      In try_merge_free_space() we attempt to find entries to the left and
      right of the entry we are adding to see if they can be merged.  We
      search for an entry past our current info (saved into right_info), and
      then if right_info exists and it has a rb_prev() we save the rb_prev()
      into left_info.
      
      However there's a slight problem in the case that we have a right_info,
      but no entry previous to that entry.  At that point we will search for
      an entry just before the info we're attempting to insert.  This will
      simply find right_info again, and assign it to left_info, making them
      both the same pointer.
      
      Now if right_info _can_ be merged with the range we're inserting, we'll
      add it to the info and free right_info.  However further down we'll
      access left_info, which was right_info, and thus get a use-after-free.
      
      Fix this by only searching for the left entry if we don't find a right
      entry at all.
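
The fixed neighbour selection can be sketched over a sorted array instead of the rb tree. This is an illustration under that simplifying assumption: `struct extent_sketch`, `find_starting_at()`, `find_containing()` and `pick_neighbours()` are hypothetical helpers, and the caller is still expected to check that a returned neighbour is actually adjacent.

```c
struct extent_sketch {
	unsigned long offset;
	unsigned long bytes;
};

/* Index of the entry that starts exactly at 'start', or -1 if none. */
static int find_starting_at(const struct extent_sketch *e, int n,
			    unsigned long start)
{
	for (int i = 0; i < n; i++)
		if (e[i].offset == start)
			return i;
	return -1;
}

/* Index of the entry whose range contains 'pos', or -1 if none. */
static int find_containing(const struct extent_sketch *e, int n,
			   unsigned long pos)
{
	for (int i = 0; i < n; i++)
		if (pos >= e[i].offset && pos - e[i].offset < e[i].bytes)
			return i;
	return -1;
}

/*
 * Pick candidate neighbours for the range [offset, offset + bytes).
 * The separate left search runs only when no right entry was found,
 * so left and right can never refer to the same entry.
 */
static void pick_neighbours(const struct extent_sketch *e, int n,
			    unsigned long offset, unsigned long bytes,
			    int *left, int *right)
{
	*left = -1;
	*right = find_starting_at(e, n, offset + bytes);

	if (*right > 0)
		*left = *right - 1;	/* predecessor of the right entry */
	else if (*right < 0 && offset > 0)
		*left = find_containing(e, n, offset - 1);
	/* *right == 0: right is the first entry, so there is no left one */
}
```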
      
      The CVE referenced had a specially crafted file system that could
      trigger this use-after-free. However with the tree checker improvements
      we no longer trigger the conditions for the UAF.  But the original
      conditions still apply, hence this fix.
      
      Reference: CVE-2019-19448
      Fixes: 96303081 ("Btrfs: use hybrid extents+bitmap rb tree for free space")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      627fa9d8
    • David Sterba's avatar
      btrfs: fix messages after changing compression level by remount · 7c1ddfc9
      David Sterba authored
      commit 27942c99 upstream.
      
      Reported by Forza on IRC that remounting with compression options does
      not reflect the change in level, or at least it does not appear to do so
      according to the messages:
      
        mount -o compress=zstd:1 /dev/sda /mnt
        mount -o remount,compress=zstd:15 /mnt
      
      does not print the change to the level to syslog:
      
        [   41.366060] BTRFS info (device vda): use zstd compression, level 1
        [   41.368254] BTRFS info (device vda): disk space caching is enabled
        [   41.390429] BTRFS info (device vda): disk space caching is enabled
      
      What really happens is that the message is lost but the level is actually
      changed.
      
      There's another weird output, if compression is reset to 'no':
      
        [   45.413776] BTRFS info (device vda): use no compression, level 4
      
      To fix that, save the previous compression level and print the message
      in that case too, and use a separate message for 'no' compression.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7c1ddfc9
    • Josef Bacik's avatar
      btrfs: open device without device_list_mutex · 35b4a280
      Josef Bacik authored
      commit 18c850fd upstream.
      
      A lockdep splat has long existed because we open our bdevs under
      the ->device_list_mutex at mount time, which acquires the bd_mutex.
      Usually this goes unnoticed, but if you do loopback devices at all
      suddenly the bd_mutex comes with a whole host of other dependencies,
      which results in the splat when you mount a btrfs file system.
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
      ------------------------------------------------------
      systemd-journal/509 is trying to acquire lock:
      ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
      
      but task is already holding lock:
      ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
       -> #6 (sb_pagefaults){.+.+}-{0:0}:
             __sb_start_write+0x13e/0x220
             btrfs_page_mkwrite+0x59/0x560 [btrfs]
             do_page_mkwrite+0x4f/0x130
             do_wp_page+0x3b0/0x4f0
             handle_mm_fault+0xf47/0x1850
             do_user_addr_fault+0x1fc/0x4b0
             exc_page_fault+0x88/0x300
             asm_exc_page_fault+0x1e/0x30
      
       -> #5 (&mm->mmap_lock#2){++++}-{3:3}:
             __might_fault+0x60/0x80
             _copy_from_user+0x20/0xb0
             get_sg_io_hdr+0x9a/0xb0
             scsi_cmd_ioctl+0x1ea/0x2f0
             cdrom_ioctl+0x3c/0x12b4
             sr_block_ioctl+0xa4/0xd0
             block_ioctl+0x3f/0x50
             ksys_ioctl+0x82/0xc0
             __x64_sys_ioctl+0x16/0x20
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #4 (&cd->lock){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             sr_block_open+0xa2/0x180
             __blkdev_get+0xdd/0x550
             blkdev_get+0x38/0x150
             do_dentry_open+0x16b/0x3e0
             path_openat+0x3c9/0xa00
             do_filp_open+0x75/0x100
             do_sys_openat2+0x8a/0x140
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             __blkdev_get+0x6a/0x550
             blkdev_get+0x85/0x150
             blkdev_get_by_path+0x2c/0x70
             btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
             open_fs_devices+0x88/0x240 [btrfs]
             btrfs_open_devices+0x92/0xa0 [btrfs]
             btrfs_mount_root+0x250/0x490 [btrfs]
             legacy_get_tree+0x30/0x50
             vfs_get_tree+0x28/0xc0
             vfs_kern_mount.part.0+0x71/0xb0
             btrfs_mount+0x119/0x380 [btrfs]
             legacy_get_tree+0x30/0x50
             vfs_get_tree+0x28/0xc0
             do_mount+0x8c6/0xca0
             __x64_sys_mount+0x8e/0xd0
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             btrfs_run_dev_stats+0x36/0x420 [btrfs]
             commit_cowonly_roots+0x91/0x2d0 [btrfs]
             btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
             btrfs_sync_file+0x38a/0x480 [btrfs]
             __x64_sys_fdatasync+0x47/0x80
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
             btrfs_sync_file+0x38a/0x480 [btrfs]
             __x64_sys_fdatasync+0x47/0x80
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
             __lock_acquire+0x1241/0x20c0
             lock_acquire+0xb0/0x400
             __mutex_lock+0x7b/0x820
             btrfs_record_root_in_trans+0x44/0x70 [btrfs]
             start_transaction+0xd2/0x500 [btrfs]
             btrfs_dirty_inode+0x44/0xd0 [btrfs]
             file_update_time+0xc6/0x120
             btrfs_page_mkwrite+0xda/0x560 [btrfs]
             do_page_mkwrite+0x4f/0x130
             do_wp_page+0x3b0/0x4f0
             handle_mm_fault+0xf47/0x1850
             do_user_addr_fault+0x1fc/0x4b0
             exc_page_fault+0x88/0x300
             asm_exc_page_fault+0x1e/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
      
      Possible unsafe locking scenario:
      
           CPU0                    CPU1
           ----                    ----
       lock(sb_pagefaults);
                                   lock(&mm->mmap_lock#2);
                                   lock(sb_pagefaults);
       lock(&fs_info->reloc_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by systemd-journal/509:
       #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
       #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
       #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
      
      stack backtrace:
      CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      Call Trace:
       dump_stack+0x92/0xc8
       check_noncircular+0x134/0x150
       __lock_acquire+0x1241/0x20c0
       lock_acquire+0xb0/0x400
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       ? lock_acquire+0xb0/0x400
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       __mutex_lock+0x7b/0x820
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       ? kvm_sched_clock_read+0x14/0x30
       ? sched_clock+0x5/0x10
       ? sched_clock_cpu+0xc/0xb0
       btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       start_transaction+0xd2/0x500 [btrfs]
       btrfs_dirty_inode+0x44/0xd0 [btrfs]
       file_update_time+0xc6/0x120
       btrfs_page_mkwrite+0xda/0x560 [btrfs]
       ? sched_clock+0x5/0x10
       do_page_mkwrite+0x4f/0x130
       do_wp_page+0x3b0/0x4f0
       handle_mm_fault+0xf47/0x1850
       do_user_addr_fault+0x1fc/0x4b0
       exc_page_fault+0x88/0x300
       ? asm_exc_page_fault+0x8/0x30
       asm_exc_page_fault+0x1e/0x30
      RIP: 0033:0x7fa3972fdbfe
      Code: Bad RIP value.
      
      Fix this by not holding the ->device_list_mutex at this point.  The
      device_list_mutex exists to protect us from modifying the device list
      while the file system is running.
      
      However it can also be modified by doing a scan on a device.  But this
      action is specifically protected by the uuid_mutex, which we are holding
      here.  We cannot race with opening at this point because we have the
      ->s_mount lock held during the mount.  Not having the
      ->device_list_mutex here is perfectly safe as we're not going to change
      the devices at this point.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add some comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      35b4a280
    • Anand Jain's avatar
      btrfs: don't traverse into the seed devices in show_devname · fa511954
      Anand Jain authored
      commit 4faf55b0 upstream.
      
      ->show_devname currently shows the lowest devid in the list. As the seed
      devices have the lowest devid in the sprouted filesystem, userland tools
      such as findmnt end up seeing the seed device instead of the device from
      the read-writable sprouted filesystem, as shown below.
      
       mount /dev/sda /btrfs
       mount: /btrfs: WARNING: device write-protected, mounted read-only.
      
       findmnt --output SOURCE,TARGET,UUID /btrfs
       SOURCE   TARGET UUID
       /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
      
       btrfs dev add -f /dev/sdb /btrfs
      
       umount /btrfs
       mount /dev/sdb /btrfs
      
       findmnt --output SOURCE,TARGET,UUID /btrfs
       SOURCE   TARGET UUID
       /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
      
      All sprouts from a single seed will show the same seed device and the
      same fsid. That's confusing.
      This is causing problems in our prototype as there isn't any reference
      to the sprout file-system(s) which are being used for actual reads and
      writes.
      
      This was added in the patch that implemented show_devname in btrfs,
      commit 9c5085c1 ("Btrfs: implement ->show_devname").
      I looked for any particular reason why we need to show the seed
      device, and there isn't any.
      
      So instead, do not traverse through the seed devices, just show the
      lowest devid in the sprouted fsid.
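
The difference between the two behaviours can be sketched with a simplified model in which each filesystem has its own device list and may chain to a seed filesystem. The struct and both functions are hypothetical, and a devid of 0 is used here only to mean "no device found".

```c
#include <stddef.h>

/* Hypothetical model: own devices plus an optional seed filesystem. */
struct fs_devices_sketch {
	const unsigned long *devids;
	size_t ndevs;
	const struct fs_devices_sketch *seed;
};

/* New behaviour: lowest devid among the filesystem's own devices. */
static unsigned long lowest_own_devid(const struct fs_devices_sketch *fs)
{
	unsigned long best = 0;

	for (size_t i = 0; i < fs->ndevs; i++)
		if (!best || fs->devids[i] < best)
			best = fs->devids[i];
	return best;
}

/* Old behaviour as described above: the seed chain is searched too. */
static unsigned long lowest_devid_with_seeds(const struct fs_devices_sketch *fs)
{
	unsigned long best = 0;

	for (; fs; fs = fs->seed) {
		unsigned long d = lowest_own_devid(fs);

		if (d && (!best || d < best))
			best = d;
	}
	return best;
}
```

For a sprout with devid 2 on top of a seed with devid 1, the old walk reports the seed's device while the new one reports the sprout's own device.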
      
      After the patch:
      
       mount /dev/sda /btrfs
       mount: /btrfs: WARNING: device write-protected, mounted read-only.
      
       findmnt --output SOURCE,TARGET,UUID /btrfs
       SOURCE   TARGET UUID
       /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
      
       btrfs dev add -f /dev/sdb /btrfs
       mount -o rw,remount /dev/sdb /btrfs
      
       findmnt --output SOURCE,TARGET,UUID /btrfs
       SOURCE   TARGET UUID
       /dev/sdb /btrfs 595ca0e6-b82e-46b5-b9e2-c72a6928be48
      
       mount /dev/sda /btrfs1
       mount: /btrfs1: WARNING: device write-protected, mounted read-only.
      
       btrfs dev add -f /dev/sdc /btrfs1
      
       findmnt --output SOURCE,TARGET,UUID /btrfs1
       SOURCE   TARGET  UUID
       /dev/sdc /btrfs1 ca1dbb7a-8446-4f95-853c-a20f3f82bdbb
      
       cat /proc/self/mounts | grep btrfs
       /dev/sdb /btrfs btrfs rw,relatime,noacl,space_cache,subvolid=5,subvol=/ 0 0
       /dev/sdc /btrfs1 btrfs ro,relatime,noacl,space_cache,subvolid=5,subvol=/ 0 0
      
      Reported-by: Martin K. Petersen <martin.petersen@oracle.com>
      CC: stable@vger.kernel.org # 4.19+
      Tested-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fa511954
    • Tom Rix's avatar
      btrfs: ref-verify: fix memory leak in add_block_entry · 6bf983c8
      Tom Rix authored
      commit d60ba8de upstream.
      
      clang static analysis flags this error:
      
      fs/btrfs/ref-verify.c:290:3: warning: Potential leak of memory pointed to by 're' [unix.Malloc]
                      kfree(be);
                      ^~~~~
      
      The problem is in this block of code:
      
      	if (root_objectid) {
      		struct root_entry *exist_re;
      
      		exist_re = insert_root_entry(&exist->roots, re);
      		if (exist_re)
      			kfree(re);
      	}
      
      There is no 'else' block freeing when root_objectid is 0. Add the
      missing kfree to the else branch.
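
A userspace sketch of the branch with the added else follows. It uses `free()` in place of kfree() and a flag in place of insert_root_entry(); `struct root_entry_sketch`, `merge_into_existing()` and the `freed` counter are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the root entry allocated by the caller. */
struct root_entry_sketch {
	unsigned long long root_objectid;
};

static int freed; /* number of entries released by this code */

static void sketch_free(struct root_entry_sketch *re)
{
	free(re);
	freed++;
}

/*
 * Returns 1 if the tree took ownership of 're', 0 if 're' was freed.
 * 'already_present' stands in for insert_root_entry() reporting that
 * an equal entry already exists.
 */
static int merge_into_existing(unsigned long long root_objectid,
			       struct root_entry_sketch *re,
			       int already_present)
{
	if (root_objectid) {
		if (already_present) {
			sketch_free(re);	/* duplicate, not linked in */
			return 0;
		}
		return 1;			/* now owned by the tree */
	} else {
		sketch_free(re);		/* the branch that used to leak */
		return 0;
	}
}
```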
      
      Fixes: fd708b81 ("Btrfs: add a extent ref verify tool")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Tom Rix <trix@redhat.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6bf983c8
    • Qu Wenruo's avatar
      btrfs: don't allocate anonymous block device for user invisible roots · 8eadf67b
      Qu Wenruo authored
      commit 851fd730 upstream.
      
      [BUG]
      When a lot of subvolumes are created, there is a user report about
      transaction aborted:
      
        BTRFS: Transaction aborted (error -24)
        WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
        RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
        Call Trace:
         create_pending_snapshots+0x82/0xa0 [btrfs]
         btrfs_commit_transaction+0x275/0x8c0 [btrfs]
         btrfs_mksubvol+0x4b9/0x500 [btrfs]
         btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
         btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
         btrfs_ioctl+0x11a4/0x2da0 [btrfs]
         do_vfs_ioctl+0xa9/0x640
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x1a/0x20
         do_syscall_64+0x5a/0x110
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace 33f2f83f3d5250e9 ]---
        BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
        BTRFS info (device sda1): forced readonly
        BTRFS warning (device sda1): Skipping commit of aborted transaction.
        BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown
      
      [CAUSE]
      The error is EMFILE (Too many open files) and comes from the anonymous
      block device allocation. The ids are in a shared pool of size 1<<20.
      
      The ids are assigned to live subvolumes, ie. the root structure exists
      in memory (eg. after creation or after the root appears in some path).
      The pool could be exhausted if the numbers are not reclaimed fast
      enough after subvolume deletion, or if other system components use the
      anon block devices.
      
      [WORKAROUND]
      Since it's not possible to completely solve the problem, we can only
      minimize the time the id is allocated to a subvolume root.
      
      Firstly, we can reduce the use of anon_dev by trees that are not
      subvolume roots, like data reloc tree.
      
      This patch adds an extra check on the root objectid to skip roots that
      don't need an anon_dev.  Currently these are only the data reloc tree and
      orphan roots.
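
The extra check amounts to a predicate on the root objectid. The sketch below assumes placeholder values for the two special objectids; the real constants are defined in the btrfs headers, and `root_needs_anon_dev()` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values for illustration; see the btrfs headers for the real ones. */
#define SKETCH_ORPHAN_OBJECTID      ((uint64_t)-5)
#define SKETCH_DATA_RELOC_OBJECTID  ((uint64_t)-9)

/* True if a root with this objectid should get an anonymous block device. */
static bool root_needs_anon_dev(uint64_t objectid)
{
	return objectid != SKETCH_ORPHAN_OBJECTID &&
	       objectid != SKETCH_DATA_RELOC_OBJECTID;
}
```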
      
      Reported-by: Greed Rong <greedrong@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8eadf67b
    • Qu Wenruo's avatar
      btrfs: free anon block device right after subvolume deletion · 3b5318a9
      Qu Wenruo authored
      commit 082b6c97 upstream.
      
      [BUG]
      When a lot of subvolumes are created, there is a user report about
      transaction aborted caused by slow anonymous block device reclaim:
      
        BTRFS: Transaction aborted (error -24)
        WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
        RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
        Call Trace:
         create_pending_snapshots+0x82/0xa0 [btrfs]
         btrfs_commit_transaction+0x275/0x8c0 [btrfs]
         btrfs_mksubvol+0x4b9/0x500 [btrfs]
         btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
         btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
         btrfs_ioctl+0x11a4/0x2da0 [btrfs]
         do_vfs_ioctl+0xa9/0x640
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x1a/0x20
         do_syscall_64+0x5a/0x110
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace 33f2f83f3d5250e9 ]---
        BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
        BTRFS info (device sda1): forced readonly
        BTRFS warning (device sda1): Skipping commit of aborted transaction.
        BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown
      
      [CAUSE]
      The anonymous device pool is shared and its size is 1M. It's possible to
      hit that limit if the subvolume deletion is not fast enough and the
      subvolumes to be cleaned keep the ids allocated.
      
      [WORKAROUND]
      We can't avoid the anon device pool exhaustion but we can shorten the
      time the id is attached to the subvolume root once the subvolume becomes
      invisible to the user.
      
      Reported-by: Greed Rong <greedrong@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3b5318a9
    • Bjorn Helgaas's avatar
      PCI: Probe bridge window attributes once at enumeration-time · 54a7a9d7
      Bjorn Helgaas authored
      commit 51c48b31 upstream.
      
      pci_bridge_check_ranges() determines whether a bridge supports the optional
      I/O and prefetchable memory windows and sets the flag bits in the bridge
      resources.  This *could* be done once during enumeration except that the
      resource allocation code completely clears the flag bits, e.g., in the
      pci_assign_unassigned_bridge_resources() path.
      
      The problem with pci_bridge_check_ranges() in the resource allocation path
      is that we may allocate resources after devices have been claimed by
      drivers, and pci_bridge_check_ranges() *changes* the window registers to
      determine whether they're writable.  This may break concurrent accesses to
      devices behind the bridge.
      
      Add a new pci_read_bridge_windows() to determine whether a bridge supports
      the optional windows, call it once during enumeration, remember the
      results, and change pci_bridge_check_ranges() so it doesn't touch the
      bridge windows but sets the flag bits based on those remembered results.
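      
      The probe-once idea can be sketched in plain C as below. This is a toy
      model under stated assumptions, not the kernel code: struct toy_bridge,
      the toy_* functions and TOY_IORESOURCE_IO are hypothetical names, and
      the register is simulated by a struct field. It only shows that the
      write-based probe runs once and that the later check reads the
      remembered result without touching the register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_IORESOURCE_IO 0x100UL   /* hypothetical flag bit */

/* Hypothetical model of a bridge; not the kernel's struct pci_dev. */
struct toy_bridge {
    uint16_t io_base_reg;     /* simulated I/O window register */
    bool io_reg_writable;     /* simulated hardware: window implemented? */
    unsigned int reg_writes;  /* counts every write to the register */
    bool io_window;           /* remembered probe result */
    unsigned long io_flags;   /* resource flags, may be cleared later */
};

static void toy_write_io_base(struct toy_bridge *b, uint16_t val)
{
    b->reg_writes++;
    if (b->io_reg_writable)
        b->io_base_reg = val;
}

/* Run once at enumeration time, before any driver has claimed a device. */
void toy_read_bridge_windows(struct toy_bridge *b)
{
    assert(b != NULL);
    uint16_t saved = b->io_base_reg;

    if (saved != 0) {
        b->io_window = true;          /* already programmed, so it exists */
        return;
    }
    toy_write_io_base(b, 0xe0f0);     /* probe: does a write stick? */
    b->io_window = (b->io_base_reg != 0);
    toy_write_io_base(b, saved);      /* restore the original value */
}

/* Run from the allocation path: consults only the remembered result. */
void toy_bridge_check_ranges(struct toy_bridge *b)
{
    assert(b != NULL);
    if (b->io_window)
        b->io_flags |= TOY_IORESOURCE_IO;
}
```

      Because toy_bridge_check_ranges() never writes the simulated register,
      it can be called again after the flags have been cleared without
      disturbing accesses to devices behind the bridge.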
      
      Link: https://lore.kernel.org/linux-pci/1506151482-113560-1-git-send-email-wangzhou1@hisilicon.com
      Link: https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg02082.html
      Reported-by: Yandong Xu <xuyandong2@huawei.com>
      Tested-by: Yandong Xu <xuyandong2@huawei.com>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Ofer Hayut <ofer@lightbitslabs.com>
      Cc: Roy Shterman <roys@lightbitslabs.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Zhou Wang <wangzhou1@hisilicon.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208371
      Signed-off-by: Dima Stepanov <dimastep@yandex-team.ru>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      54a7a9d7
    • Ansuel Smith's avatar
      PCI: qcom: Add support for tx term offset for rev 2.1.0 · dd6dc2fd
      Ansuel Smith authored
      commit de3c4bf6 upstream.
      
      Add tx term offset support to the qcom PCIe driver, which is needed on
      some revisions of the ipq806x SoC. The ipq8064 needs the tx term offset
      set to 7.
      
      Link: https://lore.kernel.org/r/20200615210608.21469-9-ansuelsmth@gmail.com
      Fixes: 82a82383 ("PCI: qcom: Add Qualcomm PCIe controller driver")
      Signed-off-by: Sham Muthayyan <smuthayy@codeaurora.org>
      Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Acked-by: Stanimir Varbanov <svarbanov@mm-sol.com>
      Cc: stable@vger.kernel.org # v4.5+
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dd6dc2fd
    • Ansuel Smith's avatar
      PCI: qcom: Define some PARF params needed for ipq8064 SoC · 56e2a445
      Ansuel Smith authored
      commit 5149901e upstream.
      
      Set specific values for Tx De-Emphasis, Tx Swing and Rx equalization
      needed on some ipq8064-based devices (the Netgear R7800, for example).
      Without this, the system locks up on kernel load.
      
      Link: https://lore.kernel.org/r/20200615210608.21469-8-ansuelsmth@gmail.com
      Fixes: 82a82383 ("PCI: qcom: Add Qualcomm PCIe controller driver")
      Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Reviewed-by: Rob Herring <robh@kernel.org>
      Acked-by: Stanimir Varbanov <svarbanov@mm-sol.com>
      Cc: stable@vger.kernel.org # v4.5+
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      56e2a445
    • Rajat Jain's avatar
      PCI: Add device even if driver attach failed · ae33b1eb
      Rajat Jain authored
      commit 2194bc7c upstream.
      
      device_attach() returning failure indicates a driver error while trying to
      probe the device. In such a scenario, the PCI device should still be added
      in the system and be visible to the user.
      
      When device_attach() fails, merely warn about it and keep the PCI device in
      the system.
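      
      A minimal sketch of that behaviour, assuming a simplified device model:
      struct toy_dev, toy_device_attach() and toy_bus_add_device() are
      hypothetical stand-ins, not the kernel functions. The point shown is
      only that a failed attach is reported and the device is still marked
      as added.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical device model; not the kernel's struct pci_dev. */
struct toy_dev {
    int attach_result;  /* what the simulated driver probe returns */
    bool is_added;      /* whether the device is visible in the system */
};

/* Stand-in for the driver core attach step; negative means probe error. */
static int toy_device_attach(struct toy_dev *dev)
{
    return dev->attach_result;
}

void toy_bus_add_device(struct toy_dev *dev)
{
    assert(dev != NULL);
    int ret = toy_device_attach(dev);

    /* A driver error is only worth a warning; it is not fatal here. */
    if (ret < 0)
        fprintf(stderr, "device attach failed (%d)\n", ret);

    /* The device is kept regardless of the attach outcome. */
    dev->is_added = true;
}
```

      With this shape a buggy driver cannot hide the device from the user,
      who can still inspect it or bind another driver.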
      
      This partially reverts ab1a187b ("PCI: Check device_attach() return
      value always").
      
      Link: https://lore.kernel.org/r/20200706233240.3245512-1-rajatja@google.com
      Signed-off-by: Rajat Jain <rajatja@google.com>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: stable@vger.kernel.org # v4.6+
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ae33b1eb
    • Kai-Heng Feng's avatar
      PCI: Mark AMD Navi10 GPU rev 0x00 ATS as broken · 71c6716c
      Kai-Heng Feng authored
      commit 45beb31d upstream.
      
      The AMD Radeon Pro W5700 doesn't work when the IOMMU is enabled:
      
        iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=63:00.0 address=0x42b5b01a0]
        iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=63:00.0 address=0x42b5b01c0]
      
      The error also makes the graphics driver fail to probe the device.
      
      It appears to be the same issue that commit 5e89cd30 ("PCI: Mark AMD
      Navi14 GPU rev 0xc5 ATS as broken") addresses, and indeed the same ATS
      quirk works around the issue.
      
      See-also: 5e89cd30 ("PCI: Mark AMD Navi14 GPU rev 0xc5 ATS as broken")
      See-also: d28ca864 ("PCI: Mark AMD Stoney Radeon R7 GPU ATS as broken")
      See-also: 9b44b0b0 ("PCI: Mark AMD Stoney GPU ATS as broken")
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208725
      Link: https://lore.kernel.org/r/20200728104554.28927-1-kai.heng.feng@canonical.com
      Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Acked-by: Alex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      71c6716c
    • Rafael J. Wysocki's avatar
      PCI: hotplug: ACPI: Fix context refcounting in acpiphp_grab_context() · c59ea9bd
      Rafael J. Wysocki authored
      commit dae68d7f upstream.
      
      If context is not NULL in acpiphp_grab_context(), but the
      is_going_away flag is set for the device's parent, the reference
      counter of the context needs to be decremented before returning
      NULL or the context will never be freed, so make that happen.
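      
      The leak and its fix can be illustrated with a toy reference counter.
      This is a sketch under assumptions: struct toy_context and the toy_*
      helpers are hypothetical, and the lookup is modelled as taking a
      reference before the going-away check, which is what makes the extra
      put necessary on the early-return path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical context; not the kernel's struct acpiphp_context. */
struct toy_context {
    int refcount;
    bool parent_going_away;
};

static void toy_get_context(struct toy_context *ctx)
{
    ctx->refcount++;
}

static void toy_put_context(struct toy_context *ctx)
{
    assert(ctx->refcount > 0);  /* a put must match an earlier get */
    ctx->refcount--;
}

/* Returns the context with a reference held, or NULL with none held. */
struct toy_context *toy_grab_context(struct toy_context *ctx)
{
    if (ctx == NULL)
        return NULL;

    toy_get_context(ctx);  /* the lookup takes a reference */

    if (ctx->parent_going_away) {
        /* The fix: drop the reference taken above before bailing out. */
        toy_put_context(ctx);
        return NULL;
    }
    return ctx;
}
```

      Without the toy_put_context() call on the NULL path, every failed grab
      would leave the count one too high and the context could never be
      freed.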
      
      Fixes: edf5bf34 ("ACPI / dock: Use callback pointers from devices' ACPI hotplug contexts")
      Reported-by: Vasily Averin <vvs@virtuozzo.com>
      Cc: 3.15+ <stable@vger.kernel.org> # 3.15+
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c59ea9bd
    • Thomas Gleixner's avatar
      genirq/affinity: Make affinity setting if activated opt-in · 5c4d9eef
      Thomas Gleixner authored
      commit f0c7baca upstream.
      
      John reported that on a RK3288 system the perf per CPU interrupts are all
      affine to CPU0 and provided the analysis:
      
       "It looks like what happens is that because the interrupts are not per-CPU
        in the hardware, armpmu_request_irq() calls irq_force_affinity() while
        the interrupt is deactivated and then request_irq() with IRQF_PERCPU |
        IRQF_NOBALANCING.
      
        Now when irq_startup() runs with IRQ_STARTUP_NORMAL, it calls
        irq_setup_affinity() which returns early because IRQF_PERCPU and
        IRQF_NOBALANCING are set, leaving the interrupt on its original CPU."
      
      This was broken by the recent commit which blocked interrupt affinity
      setting in hardware before activation of the interrupt. While this works in
      general, it does not work for this particular case. Contrary to the
      initial analysis, not all interrupt chip drivers implement an activate
      callback, so the safe cure is to make the deferred interrupt affinity
      setting at activation time opt-in.
      
      Implement the necessary core logic and make the two irqchip implementations
      for which this is required opt-in. In hindsight this would have been the
      right thing to do, but ...
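      
      The opt-in can be pictured with the toy model below. It is a sketch
      under assumptions, not the genirq code: struct toy_irq, its fields and
      the toy_* functions are hypothetical. It shows that an affinity
      request on an inactive interrupt is deferred to activation only when
      the irqchip has opted in, and is otherwise applied immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interrupt state; not the kernel's struct irq_data. */
struct toy_irq {
    bool activated;
    bool affinity_on_activate;  /* set by irqchips that opt in */
    unsigned int cpu_pending;   /* last requested target CPU */
    unsigned int cpu_hw;        /* CPU currently programmed in "hardware" */
};

/* Returns true if the request was deferred until activation. */
bool toy_irq_set_affinity(struct toy_irq *irq, unsigned int cpu)
{
    assert(irq != NULL);
    irq->cpu_pending = cpu;

    /* Defer only for inactive interrupts of chips that asked for it. */
    if (!irq->activated && irq->affinity_on_activate)
        return true;

    irq->cpu_hw = cpu;  /* everyone else is programmed right away */
    return false;
}

void toy_irq_activate(struct toy_irq *irq)
{
    assert(irq != NULL);
    irq->activated = true;
    if (irq->affinity_on_activate)
        irq->cpu_hw = irq->cpu_pending;  /* apply the deferred setting */
}
```

      In this model a chip without an activation hook never has a request
      parked where nothing would later apply it, which is the failure the
      report describes.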
      
      Fixes: baedb87d ("genirq/affinity: Handle affinity setting on inactive interrupts correctly")
      Reported-by: John Keeping <john@metanate.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Acked-by: Marc Zyngier <maz@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/87blk4tzgm.fsf@nanos.tec.linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5c4d9eef